The Monday Mathematical (Tuesday Edition): Determining Points via Linear Regression

Or, how many points you score

NCAA Football: Tennessee at Texas A&M
A running back who will still be scoring points for the Vols
Jerome Miron-USA TODAY Sports

In the midst of every single thing that has happened in the last 3 days, at least we still have numbers. Who’s with me?! (Note: In this analogy, South Carolina is the Germans and our SEC East title hopes are Pearl Harbor.)

Last week, we took a look at a type of model - a logistic regression - that can help determine binary outcomes like whether or not a drive ends in points. The main takeaway? More or less the same one we’ve had every week: explosiveness matters.

But what if we took a slightly different tack, using a much simpler model to explain the number of points scored?

Back to Middle School

If you’re like me, 8th grade algebra is getting fuzzier and fuzzier given the ever-growing stretch of time that has passed since then. However...if you can shake the rust off those gears, you might remember linear regressions, they of the “Y = MX + B” form.

At its most basic level, a linear regression gives you an idea of how the dependent variable (points scored in our case) changes with the independent variable(s). As the independent variable, X, increases by 1 unit, the “expected” number of points scored, Y, increases by M units. If X is equal to 0, then you would expect a drive to result in B points.

Let’s start again with the simple example that we had last week where we use the presence of an explosive play on a drive, this time to predict the number of points scored on the drive.

Simple linear regression using presence of explosive plays to predict the points scored

I’ve again highlighted the relevant parts of the Stata output. The presence of an explosive play is worth, on average, 2.94 points (the M from the general formula). The second box, in the P>|t| column, shows the statistical significance of the explosive play variable. The generally accepted cutoff is .05, so a value that rounds to .00 is strong evidence that explosive plays go hand in hand with more points on a drive.

The other row is the constant (the B in the general formula). If a drive does not have an explosive play, you would still expect to score .805 points on average. That’s...not great. It works out to 1 touchdown every 9 drives, so roughly equivalent to playing Alabama all the time. HAHAHA (it’s important to be able to laugh at ourselves lest we end up crying).

This may seem abstract, but it’s not! The results can easily be compared to what actually happened. Remember from last week that we had 91 drives with an explosive play and 72 drives without. The model would say, then, that we should have scored...

  • Non-explosive drives: Y = MX + B -> 72*(2.94 * 0 + .805) = 58 points
  • Explosive drives: Y = MX + B -> 91*(2.94 * 1 + .805) = 341 points
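For anyone who wants to check the arithmetic in those two bullets, here is a minimal Python sketch. The function name `expected_points` is made up for illustration; the coefficients and drive counts are the ones quoted above.

```python
# Coefficients quoted from the Stata output above.
M = 2.94   # points added by an explosive play (the slope)
B = 0.805  # expected points with no explosive play (the constant)

def expected_points(explosive, m=M, b=B):
    """Y = M*X + B, where X is 1 if the drive had an explosive play, else 0."""
    return m * explosive + b

# Drive counts from last week: 72 without an explosive play, 91 with one.
non_explosive_total = 72 * expected_points(0)
explosive_total = 91 * expected_points(1)

print(round(non_explosive_total))  # 58
print(round(explosive_total))      # 341
```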

Guess what? That’s exactly what happened because that’s how linear regressions work!

Or aliens.
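The reason the totals line up is that a least-squares line fit to a single yes/no variable just reproduces the two group averages: the constant is the average without an explosive play, and the constant plus the slope is the average with one. The sketch below shows this on a tiny made-up set of drives (hypothetical numbers, not the Vols' actual data), with a hand-rolled fit standing in for what Stata does.

```python
# Hypothetical toy data, NOT real Tennessee drives:
# x = 1 if the drive had an explosive play, else 0; y = points on the drive.
x = [0, 0, 0, 0, 1, 1, 1, 1, 1]
y = [0, 0, 3, 0, 7, 3, 7, 0, 7]

def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (c - mean_y) for a, c in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

m, b = fit_line(x, y)

# Plain group averages for comparison.
without_play = [pts for flag, pts in zip(x, y) if flag == 0]
with_play = [pts for flag, pts in zip(x, y) if flag == 1]
mean_without = sum(without_play) / len(without_play)
mean_with = sum(with_play) / len(with_play)

# Summing the fitted values over every drive gives back the actual total.
predicted_total = sum(m * flag + b for flag in x)

print(b, mean_without)          # constant vs. average without an explosive play
print(m + b, mean_with)         # constant + slope vs. average with one
print(predicted_total, sum(y))  # model total vs. actual total
```

Each printed pair matches, up to floating-point rounding.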

Let’s Get Weird Again

Given that a linear regression is a bit more intuitive, it allows us to throw a bunch of stuff against the wall to see what sticks. If I were so inclined (and I am, obviously), I could attempt to model the number of points scored as a function of all of the variables I used last week plus a few more things.

  • numrun: The number of running plays on a drive
  • numpass: The number of passing plays on a drive
  • numcomplete: The number of completed passes on a drive
  • numsack: The number of sacks given up on a drive

We are looking for variables that have a statistically meaningful impact on the number of points scored. Stata has a lovely function where it will crank through multiple iterations of a model to weed out the variables that probably don’t matter. Here’s a first pass at that model using a P-value cutoff of .2, which is, admittedly, totally arbitrary.

First pass at a more complex linear model

The first box shows the variables that were not deemed significant. Whether or not we are playing a Competent Opponent does not have a statistically meaningful impact on how many points an individual drive produces. This is likely a function of the fact that we are predicting on the drive level; within the 15 drives against Bowling Green, this variable never changed even though some drives ended with points and others didn’t. If we were looking at points scored by game, this variable might matter. Similarly, the number of sacks and number of passes get thrown out.

At first glance, the output is solid. Every variable is significant and the model explains about 50% of the variance in our data. That’s great!

Or is it? Notice the second box, where I’ve highlighted the impact of number of plays versus number of runs. The model suggests that having more plays is bad for the number of points, but having more runs is good. That doesn’t make a lot of sense. WHY HAVE YOU FAILED US, MATH?!

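The issue is correlation between the variables we are feeding in. Every run is also a play, so the number of plays and the number of runs rise and fall together. When variables move in the same direction at the same time, the model can’t really tell what is what. It basically says, “Did the number of plays matter? Or was it the number of runs? ¯\_(ツ)_/¯ ”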
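One quick way to spot this kind of overlap is to check how strongly two candidate variables are correlated before putting both in the model. The sketch below computes a Pearson correlation by hand on made-up per-drive counts. The numbers and the names `num_plays` and `num_runs` are hypothetical, and this is a generic diagnostic, not a step shown in the Stata output.

```python
from math import sqrt

# Hypothetical per-drive counts, for illustration only.
num_plays = [3, 4, 5, 6, 8, 10]
num_runs = [1, 2, 2, 4, 5, 6]

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (c - mean_y) for a, c in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((c - mean_y) ** 2 for c in ys)
    return sxy / sqrt(sxx * syy)

r = pearson_r(num_plays, num_runs)
print(r)  # close to 1 means the two variables carry nearly the same information
```

When the value is that close to 1, the individual coefficients on the two variables become hard to trust, even if the model as a whole still fits well.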

Correlation’s like...bad, man

So if we take out all of the “number of” variables and Competent Opponent...well, then we have a very similar model to what we had last week. Negative plays, explosive plays, and field position might not be the most exciting things, but they have predictive power.

Nailed it

GUESS WHAT STILL MATTERS IMMENSELY?!

It’s explosive plays! Having one on a drive was worth an extra 2.76 points last year. Avoiding negative plays? Worth .89 points. An extra 10 yards of field position? Worth .46 points.
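To put those three numbers together, here is a rough Python sketch comparing two hypothetical drives. The text above doesn't give the model's constant, so this only computes the difference between drives, not absolute point totals. It also assumes that a negative play enters the model as minus .89 points; the function name `point_swing` is made up.

```python
# Per-drive effects quoted above.
EXPLOSIVE_PLAY = 2.76   # drive has at least one explosive play
NEGATIVE_PLAY = -0.89   # assumed sign: a negative play costs .89 points
PER_10_YARDS = 0.46     # each extra 10 yards of starting field position

def point_swing(explosive, negative, extra_yards):
    """Expected points relative to a baseline drive (the constant is left out)."""
    return (EXPLOSIVE_PLAY * explosive
            + NEGATIVE_PLAY * negative
            + PER_10_YARDS * extra_yards / 10)

# A drive with an explosive play, no negative plays, and 20 extra yards of
# field position, versus a drive with a negative play and nothing else.
good = point_swing(explosive=1, negative=0, extra_yards=20)
bad = point_swing(explosive=0, negative=1, extra_yards=0)

print(round(good - bad, 2))  # 4.57
```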

We saw firsthand on Saturday how hard it can be to score when an offense is pinned deep and stuck behind the chains. Sprinkle in a lack of explosiveness and you lose to Will Muschamp.

These things matter for every team, but they appear to be particularly true for the Butch Jones/Josh Dobbs Vols.