How a nuclear engineering Ph.D. candidate solves a problem like McFadden

The following post comes courtesy of RTT community member hooper. Do not mess with a guy who's this close to getting a Ph.D. in nuclear engineering. I'm bumping his conclusion to the top because frankly, it's the only part I understand:

If the current information provides any prediction about this game, it’s that Tennessee is predicted to beat Arkansas based on the average rush defense predictor. [The "rush defense predictor" is presumably somewhere below. Kudos to you if you can find it. -- ed.] If the Kentucky game is treated as an anomaly, UT is predicted to have about a 50-50 shot at beating Arkansas based on their average rush defense. That’s not exactly a revelation, but it does verify the stuff you wrote and gives some nice pretty numbers and pictures to play with.

Translation: I was right. Na-na-na-na-na-na. Oh, and Tennessee wins! Woo! Maybe. So whoa on woo.

Anyway, the meat is after the jump, but be warned, wicked math, charts, and graphs ahead. Make sure the safety goggles are snug before proceeding.

For the data given in "How do you solve a problem like McFadden, part II", the best fit of points is by yards:


Linear Fit
Points = -21.45121 + 0.3460472 Yards

Summary of Fit
RSquare                       0.567489
RSquare Adj                   0.495404
Root Mean Square Error        12.87058
Mean of Response              42
Observations (or Sum Wgts)    8

Analysis of Variance
Source      DF   Sum of Squares   Mean Square   F Ratio
Model        1        1304.0891       1304.09    7.8725
Error        6         993.9109        165.65
C. Total     7        2298.0000
Prob > F: 0.0309

Parameter Estimates
Term         Estimate     Std Error   t Ratio   Prob>|t|
Intercept    -21.45121    23.06764      -0.93     0.3883
Yards        0.3460472    0.123333       2.81     0.0309

Just look at the big red text and ignore the rest (the software package [I'm using] gives a lot more, but it’s easier to highlight than to edit further).  Interpretation:

  • R²:  Rush Yardage explains a little over half the variance in Arkansas’s points.
  • Prob > F:  A number below 0.05 is generally considered a sign that the model is statistically significant.  In other words, the model is useful.
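For anyone who wants to check where those headline numbers come from, here is a minimal sketch that rebuilds RSquare, RSquare Adj, the root mean square error, and the F Ratio from nothing but the sums of squares and degrees of freedom printed in the Analysis of Variance table. The variable names are mine, not the software's, and the formulas are the standard ones for a one-predictor linear fit.

```python
import math

# Values copied from the "Analysis of Variance" table for the 8-game fit.
ss_model = 1304.0891   # sum of squares explained by Yards
ss_error = 993.9109    # sum of squares left over
df_model = 1
df_error = 6

ss_total = ss_model + ss_error
n_obs = df_model + df_error + 1  # 8 games

# RSquare: share of the variance in points explained by the model.
r_square = ss_model / ss_total

# RSquare Adj: the same idea, penalized for having so few observations.
r_square_adj = 1 - (1 - r_square) * (n_obs - 1) / df_error

# F Ratio: explained variance per degree of freedom over unexplained.
mean_square_error = ss_error / df_error
f_ratio = (ss_model / df_model) / mean_square_error

# Root Mean Square Error: typical miss of the fit, in points.
rmse = math.sqrt(mean_square_error)

print(f"RSquare     = {r_square:.6f}")
print(f"RSquare Adj = {r_square_adj:.6f}")
print(f"RMSE        = {rmse:.5f}")
print(f"F Ratio     = {f_ratio:.4f}")
# With a single predictor, the t Ratio for Yards is the square root of F.
print(f"sqrt(F)     = {math.sqrt(f_ratio):.2f}")
```

The last line is a handy cross-check: with only one predictor, the square root of the F Ratio should match the t Ratio reported for Yards, which is why both rows carry the same probability of 0.0309.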

However, notice that the two leftmost points really stand apart.  Without them, the remaining points appear to trend very nicely.  Treating them as outliers, I’ll remove them:


Linear Fit
Points = -280.3811 + 1.6089548 Yards

Summary of Fit
RSquare                       0.585571
RSquare Adj                   0.481964
Root Mean Square Error        9.124053
Mean of Response              48.5
Observations (or Sum Wgts)    6

Analysis of Variance
Source      DF   Sum of Squares   Mean Square   F Ratio
Model        1        470.50665       470.507    5.6518
Error        4        332.99335        83.248
C. Total     5        803.50000
Prob > F: 0.0762

Parameter Estimates
Term         Estimate     Std Error   t Ratio   Prob>|t|
Intercept    -280.3811    138.3889      -2.03     0.1127
Yards        1.6089548    0.676782       2.38     0.0762

Removing the Alabama and Auburn results, the model is really no better at explaining things (look at RSquare: it gained almost nothing, and RSquare Adj actually dropped).  Not only that, the statistical significance is lower (Prob > F is higher, which is bad).  Besides, do you really believe a model that predicts 25 points if Arkansas plays a team that averages 190 rushing yards allowed on defense, but predicts 57 points if Arkansas plays a team that averages 210?  Me neither.
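To see that 25-versus-57 swing for yourself, here is a small sketch that plugs the same two rush-defense averages into both fitted lines. The function and dictionary names are just for illustration; the coefficients are copied from the two Linear Fit equations.

```python
def predict_points(intercept, slope, yards):
    """Predicted Arkansas points from a fitted line: Points = intercept + slope * Yards."""
    return intercept + slope * yards

# (intercept, slope) pairs copied from the two "Linear Fit" equations.
FITS = {
    "all 8 games": (-21.45121, 0.3460472),
    "6 games (two leftmost points removed)": (-280.3811, 1.6089548),
}

for label, (intercept, slope) in FITS.items():
    low = predict_points(intercept, slope, 190)
    high = predict_points(intercept, slope, 210)
    print(f"{label}: {low:.1f} points at 190 yards, "
          f"{high:.1f} at 210 (swing of {high - low:.1f})")
```

The 8-game line moves by about 7 points over that 20-yard stretch, while the 6-game line moves by about 32, which is the implausible jump described above.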

Now, for some real fun (well, a statistician would think so).

Whole Model Test
Model         -LogLikelihood   DF   ChiSquare   Prob>ChiSq
Difference         5.2925058    1    10.58501       0.0011
Full              5.82257e-8
Reduced            5.2925059

RSquare (U)                   1.0000
Observations (or Sum Wgts)    8

Converged by Objective

Parameter Estimates (for log odds of L/W)
Term                    Estimate      Std Error   ChiSquare   Prob>ChiSq
Intercept   Unstable    603.457952    149969.72        0.00       0.9968
Yards       Unstable    -3.0404375    749.93609        0.00       0.9968

Ignore the data junk.

This is a logistic fit where wins and losses are compared against yardage gained.  (Ignore the vertical stuff and read the left-right of the graph to simplify things.)  The nearly vertical blue line effectively says that if Arkansas plays a team that gives up an average of 200 or more rushing yards, they win; otherwise they lose.  Nifty, huh?  If you read the "Parameter Estimates" piece, you see the word "unstable" twice.  This unfortunately tells you that the model is not reliable.  So, while it sounds good, [it's] really not useful.
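If you want to see why that line is nearly vertical, here is a rough sketch that turns the printed Parameter Estimates into a probability of an Arkansas loss at a few yardage values. It assumes the usual logistic form, where the fitted line gives the log odds of a loss versus a win (the output says "For log odds of L/W"); the function name and the sample yardages are mine.

```python
import math

def loss_probability(intercept, slope, yards):
    """P(loss) when the fit reports log odds of L/W = intercept + slope * yards."""
    log_odds = intercept + slope * yards
    # Numerically stable logistic function (avoids overflow for large |log_odds|).
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)

# Estimates copied from the 8-game "Parameter Estimates" table.
INTERCEPT = 603.457952
SLOPE = -3.0404375

# Yardage where the log odds are zero, i.e. a 50-50 game.
crossover = -INTERCEPT / SLOPE
print(f"50-50 point: about {crossover:.1f} yards")

for yards in (190, 195, 198, 199, 200, 205):
    p_loss = loss_probability(INTERCEPT, SLOPE, yards)
    print(f"{yards} yards allowed: P(loss) = {p_loss:.4f}")
```

With a slope this steep, the probability flips from essentially certain loss to essentially certain win over a span of only a few yards, which is the near-vertical line on the graph. Given the "Unstable" flags, treat the exact crossover as a rough location rather than a measurement.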

The problem is the Kentucky game, where a really bad defense produced the same result as a really good defense.  Since there are so few data points, it’s enough to throw the whole thing off.  Removing Kentucky:

Whole Model Test
Model         -LogLikelihood   DF   ChiSquare   Prob>ChiSq
Difference         4.1878871    1    8.375774       0.0038
Full              3.56579e-8
Reduced            4.1878871

RSquare (U)                   1.0000
Observations (or Sum Wgts)    7

Converged by Objective

Parameter Estimates (for log odds of L/W)
Term                    Estimate      Std Error   ChiSquare   Prob>ChiSq
Intercept   Unstable    77.8017059    26623.368        0.00       0.9977
Yards       Unstable    -0.4704935    144.43734        0.00       0.9974

Again, just ignore the data junk and watch the pretty blue line. Without Kentucky, the break point lies at about 167.  If that number is eerily frightening, it should be; your estimate of Tennessee’s run defense came out to basically exactly this result.  Again, the model has stability problems due to a lack of data points (it’s "underpowered" in stats lingo).  Still, it’s as useful as anything else will be for predicting this game.
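The stability problem shows up directly in the numbers. The sketch below assumes the ChiSquare column in Parameter Estimates is a Wald statistic, (Estimate / Std Error) squared, compared against a chi-square distribution with one degree of freedom; that is my reading of the output rather than something it states, although it does reproduce the printed probabilities. The names are hypothetical.

```python
import math

def wald_test(estimate, std_error):
    """Wald chi-square and its p-value (chi-square distribution, 1 degree of freedom)."""
    z = estimate / std_error
    chi_square = z * z
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value

# (estimate, std error) pairs copied from the 7-game "Parameter Estimates" table.
FIT_WITHOUT_KENTUCKY = {
    "Intercept": (77.8017059, 26623.368),
    "Yards": (-0.4704935, 144.43734),
}

for term, (estimate, std_error) in FIT_WITHOUT_KENTUCKY.items():
    chi_square, p_value = wald_test(estimate, std_error)
    print(f"{term}: ChiSquare = {chi_square:.2f}, Prob>ChiSq = {p_value:.4f}")

# Yardage where the fitted log odds of L/W cross zero (a 50-50 game).
intercept = FIT_WITHOUT_KENTUCKY["Intercept"][0]
slope = FIT_WITHOUT_KENTUCKY["Yards"][0]
crossover = -intercept / slope
print(f"50-50 point from the printed coefficients: about {crossover:.1f} yards")
```

The standard errors are hundreds of times larger than the estimates, so each coefficient is statistically indistinguishable from zero even though the fit separates wins from losses cleanly. The printed coefficients put the 50-50 point in the mid-160s, the same neighborhood as the value read off the graph, and with estimates this shaky a couple of yards either way is not a meaningful difference.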

Summary? Summary:

If the current information provides any prediction about this game, it’s that Tennessee is predicted to beat Arkansas based on the average rush defense predictor. If the Kentucky game is treated as an anomaly, UT is predicted to have about a 50-50 shot at beating Arkansas based on their average rush defense. That’s not exactly a revelation, but it does verify the stuff you wrote and gives some nice pretty numbers and pictures to play with.

Poll

Thoughts?

This poll is closed

  • 20%
    Exactly!
    (1 vote)
  • 60%
    You should have used a spline.
    (3 votes)
  • 20%
    Huh?
    (1 vote)
5 votes total