# How a nuclear engineering Ph.D. candidate solves a problem like McFadden

The following post comes courtesy of RTT community member hooper. Do not mess with a guy who's this close to getting a Ph.D. in nuclear engineering. I'm bumping his conclusion to the top because frankly, it's the only part I understand:

If the current information provides any prediction about this game, it’s that Tennessee is predicted to beat Arkansas based on the average rush defense predictor. [The "rush defense predictor" is presumably somewhere below. Kudos to you if you can find it. -- ed.] If the Kentucky game is treated as an anomaly, UT is predicted to have about a 50-50 shot at beating Arkansas based on their average rush defense. That’s not exactly a revelation, but it does verify the stuff you wrote and gives some nice pretty numbers and pictures to play with.

Translation: I was right. Na-na-na-na-na-na. Oh, and Tennessee wins! Woo! Maybe. So whoa on woo.

Anyway, the meat is after the jump, but be warned, wicked math, charts, and graphs ahead. Make sure the safety goggles are snug before proceeding.

For the data given in "How do you solve a problem like McFadden, part II", the best fit of points is by yards:

Linear Fit
Points = -21.45121 + 0.3460472 Yards

Summary of Fit

| Statistic | Value |
|---|---|
| RSquare | 0.567489 |
| RSquare Adj | 0.495404 |
| Root Mean Square Error | 12.87058 |
| Mean of Response | 42 |
| Observations (or Sum Wgts) | 8 |

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Ratio | Prob > F |
|---|---|---|---|---|---|
| Model | 1 | 1304.0891 | 1304.09 | 7.8725 | 0.0309 |
| Error | 6 | 993.9109 | 165.65 | | |
| C. Total | 7 | 2298.0000 | | | |

Parameter Estimates

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | -21.45121 | 23.06764 | -0.93 | 0.3883 |
| Yards | 0.3460472 | 0.123333 | 2.81 | 0.0309 |

Just look at the big red text and ignore the rest (the software package [I'm using] gives a lot more, but it’s easier to highlight than to edit further).  Interpretation:

• R²:  Rush Yardage explains a little over half the variance in Arkansas’s points.
• Prob > F:  A number below 0.05 is generally considered a sign that the model is statistically significant.  In other words, the model is useful.
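For anyone who wants to see where those two headline numbers come from, here is a minimal sketch that rebuilds RSquare, RSquare Adj, and the F Ratio from the sums of squares in the Analysis of Variance table. The variable names are mine, not the software's, and the inputs are simply the reported values typed in by hand.

```python
# Sums of squares and degrees of freedom copied from the first
# Analysis of Variance table (all eight games).
ss_model, df_model = 1304.0891, 1
ss_error, df_error = 993.9109, 6

ss_total = ss_model + ss_error      # corrected total sum of squares
n_obs = df_model + df_error + 1     # eight games

# Share of the variance in points that the straight-line fit explains.
r_square = ss_model / ss_total

# Same idea, penalized for the number of predictors in the model.
r_square_adj = 1 - (1 - r_square) * (n_obs - 1) / (n_obs - df_model - 1)

# Explained variance per model DF over unexplained variance per error DF.
f_ratio = (ss_model / df_model) / (ss_error / df_error)

print(f"RSquare     = {r_square:.6f}")
print(f"RSquare Adj = {r_square_adj:.6f}")
print(f"F Ratio     = {f_ratio:.4f}")
```

The three printed values should agree with the reported 0.567489, 0.495404, and 7.8725 to rounding.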

However, notice that the two leftmost points really stand apart.  Without them, the remaining points appear to trend very nicely.  Treating them as outliers, I’ll remove them:

Linear Fit
Points = -280.3811 + 1.6089548 Yards

Summary of Fit

| Statistic | Value |
|---|---|
| RSquare | 0.585571 |
| RSquare Adj | 0.481964 |
| Root Mean Square Error | 9.124053 |
| Mean of Response | 48.5 |
| Observations (or Sum Wgts) | 6 |

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Ratio | Prob > F |
|---|---|---|---|---|---|
| Model | 1 | 470.50665 | 470.507 | 5.6518 | 0.0762 |
| Error | 4 | 332.99335 | 83.248 | | |
| C. Total | 5 | 803.50000 | | | |

Parameter Estimates

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | -280.3811 | 138.3889 | -2.03 | 0.1127 |
| Yards | 1.6089548 | 0.676782 | 2.38 | 0.0762 |

With the Alabama and Auburn results removed, the model is really no better at explaining things (look at RSquare: it gained almost nothing, and RSquare Adj actually dropped).  Not only that, the statistical significance is lower (Prob > F is higher, which is bad).  Besides, do you really believe a model that predicts 25 points if Arkansas plays a team that gives up an average of 190 rushing yards, but predicts 57 points if Arkansas plays a team that gives up an average of 210?  Me neither.
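The 25-versus-57 comparison is just the fitted line evaluated at two yardage values. A quick sketch, using the intercepts and slopes from the two Linear Fit equations (the function and constant names are made up for illustration):

```python
def predict_points(intercept, slope, rush_yards_allowed):
    """Evaluate a straight-line fit: Points = intercept + slope * Yards."""
    return intercept + slope * rush_yards_allowed


# (intercept, slope) pairs copied from the two Linear Fit equations.
ALL_GAMES = (-21.45121, 0.3460472)   # all eight games
TRIMMED = (-280.3811, 1.6089548)     # Alabama and Auburn removed

for yards in (190, 210):
    full = predict_points(*ALL_GAMES, yards)
    trimmed = predict_points(*TRIMMED, yards)
    print(f"{yards} yds: all games {full:5.1f} pts, trimmed {trimmed:5.1f} pts")
```

Over the same 20-yard gap, the eight-game fit moves by about 7 points while the trimmed fit moves by about 32, which is the implausible swing being pointed at.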

Now, for some real fun (well, a statistician would think so).

Whole Model Test

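| Model | -LogLikelihood | DF | ChiSquare | Prob>ChiSq |
|---|---|---|---|---|
| Difference | 5.2925058 | 1 | 10.58501 | 0.0011 |
| Full | 5.82257e-8 | | | |
| Reduced | 5.2925059 | | | |

| Statistic | Value |
|---|---|
| RSquare (U) | 1.0000 |
| Observations (or Sum Wgts) | 8 |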

Converged by Objective

Parameter Estimates

| Term | Estimate | Std Error | ChiSquare | Prob>ChiSq |
|---|---|---|---|---|
| Intercept (Unstable) | 603.457952 | 149969.72 | 0.00 | 0.9968 |
| Yards (Unstable) | -3.0404375 | 749.93609 | 0.00 | 0.9968 |

For log odds of L/W

Ignore the data junk.

This is a logistic regression, where wins and losses are compared against yardage gained.  (Ignore the vertical stuff and read the left-right of the graph to simplify things.)  The nearly vertical blue line effectively says that if Arkansas plays a team that gives up an average of 200 or more rushing yards, they win; otherwise they lose.  Nifty, huh?  If you read the "Parameter Estimates" piece, you see the word "unstable" twice.  This unfortunately tells you that the model is not reliable.  So, while it sounds good, [it's] really not useful.
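To see why the blue line is nearly vertical, here is a rough sketch that plugs the reported estimates into the logistic curve. It assumes the "log odds of L/W" label means the fitted quantity is ln(P(loss)/P(win)) for Arkansas, and it treats the estimates as given even though the output flags both as unstable, so read it as illustration only. The names are hypothetical.

```python
import math

# Estimates copied from the Parameter Estimates table (both flagged Unstable).
INTERCEPT = 603.457952
SLOPE = -3.0404375


def prob_loss(rush_yards_allowed):
    """P(Arkansas loss), assuming log odds of L/W = INTERCEPT + SLOPE * yards."""
    z = INTERCEPT + SLOPE * rush_yards_allowed
    # Numerically stable logistic function for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# The curve crosses 50-50 where the log odds equal zero.
crossover = -INTERCEPT / SLOPE
print(f"50-50 point: {crossover:.1f} rushing yards allowed")

for yards in (190, 195, 200, 205, 210):
    print(f"{yards} yds -> P(loss) = {prob_loss(yards):.4f}")
```

With a slope this steep, the probability flips from essentially 1 to essentially 0 within a few yards of the crossover, which is what a near-vertical line on the plot looks like in numbers.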

The problem is the Kentucky game, where a really bad defense produced the same result as a really good defense.  Since there are so few data points, it’s enough to throw the whole thing off.  Removing Kentucky:

Whole Model Test

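| Model | -LogLikelihood | DF | ChiSquare | Prob>ChiSq |
|---|---|---|---|---|
| Difference | 4.1878871 | 1 | 8.375774 | 0.0038 |
| Full | 3.56579e-8 | | | |
| Reduced | 4.1878871 | | | |

| Statistic | Value |
|---|---|
| RSquare (U) | 1.0000 |
| Observations (or Sum Wgts) | 7 |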

Converged by Objective

Parameter Estimates

| Term | Estimate | Std Error | ChiSquare | Prob>ChiSq |
|---|---|---|---|---|
| Intercept (Unstable) | 77.8017059 | 26623.368 | 0.00 | 0.9977 |
| Yards (Unstable) | -0.4704935 | 144.43734 | 0.00 | 0.9974 |

For log odds of L/W

Again, just ignore the data junk and watch the pretty blue line. Without Kentucky, the breakwater sits at about 167.  If that number is eerily frightening, it should be: your estimate of Tennessee’s run defense came out to almost exactly this figure.  Again, the model has stability problems due to a lack of data points (it’s "underpowered" in stats lingo).  Still, it’s as useful as anything else will be for predicting this game.
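For the curious, the Whole Model Test rows in the two logistic outputs can be cross-checked by hand. The sketch below assumes the usual likelihood-ratio construction: ChiSquare is twice the gap between the Reduced and Full negative log-likelihoods, compared against a chi-square distribution with 1 degree of freedom. The helper name is mine.

```python
import math


def likelihood_ratio_test(neg_loglik_reduced, neg_loglik_full):
    """Return (ChiSquare, Prob>ChiSq) for a 1-DF likelihood-ratio test."""
    chi_square = 2.0 * (neg_loglik_reduced - neg_loglik_full)
    # With 1 DF, the chi-square upper tail is P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value


# -LogLikelihood values copied from the two Whole Model Test tables.
with_kentucky = likelihood_ratio_test(5.2925059, 5.82257e-8)
without_kentucky = likelihood_ratio_test(4.1878871, 3.56579e-8)

print("With Kentucky:    ChiSquare = %.5f, Prob>ChiSq = %.4f" % with_kentucky)
print("Without Kentucky: ChiSquare = %.5f, Prob>ChiSq = %.4f" % without_kentucky)
```

A small Prob>ChiSq here only says that yardage separates wins from losses better than a model with no yardage in it. It says nothing about whether the individual estimates can be trusted, which is why the "unstable" flags still matter.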

Summary? Summary:

If the current information provides any prediction about this game, it’s that Tennessee is predicted to beat Arkansas based on the average rush defense predictor. If the Kentucky game is treated as an anomaly, UT is predicted to have about a 50-50 shot at beating Arkansas based on their average rush defense. That’s not exactly a revelation, but it does verify the stuff you wrote and gives some nice pretty numbers and pictures to play with.
