The Monday Mathematical: Building an Offensive Regression

Or, how to score points

NCAA Football: Alabama at Tennessee
Smokey is always statistically significant
Randy Sartin-USA TODAY Sports

After a bye week and an injury (illness), Volundore, like the Vols, is happy to be back on the field this week (yes, I know the Vols played last weekend; yes, I was there for the 11th straight year; no, for the 10th straight time I don’t want to talk about it).

When last we spoke, we looked at the impact that explosive plays and negative plays have on the offense’s success. What if we zoomed out a bit further to look at a variety of factors that could potentially determine whether the team scored or not?

I’d like that. The question is...would you?

Logistic regression: What does it mean?

If you’re looking to estimate a binary outcome - in our case, did the offense score or not? - one type of model you can build is a logistic regression. The interpretation can get a bit messy (we’ll get into that in a minute), but the “coefficient” on a variable tells you how the odds of scoring go up or down as that variable changes.

Let’s start by explaining odds - I know that sounds basic, but bear with me. Let’s imagine a hypothetical world where the 2016 season was played 11 times. If Tennessee came up as the champion one time in that simulation, we would interpret that as 1 “success” (Woo! Go Vols!) and 10 “failures.” What we commonly see communicated as odds (10:1) is really a ratio of two probabilities: it is 10 times more likely that UT will not win the title (10/11) than win the title (1/11).
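If it helps to see the arithmetic, here is a minimal sketch in Python of going back and forth between a probability and odds, using the made-up 1-title-in-11-seasons example from above. The function names are my own, not from any library.

```python
from fractions import Fraction


def odds_in_favor(p):
    """Odds in favor of an event with probability p: p / (1 - p)."""
    return p / (1 - p)


def probability_from_odds(odds):
    """Invert odds in favor back into a probability."""
    return odds / (1 + odds)


# Hypothetical from the text: 1 title in 11 simulated seasons.
p_title = Fraction(1, 11)

in_favor = odds_in_favor(p_title)  # 1 success for every 10 failures
against = 1 / in_favor             # the familiar "10:1 against"

print("odds in favor:", in_favor)
print("odds against: ", against)
print("back to a probability:", probability_from_odds(in_favor))
```

Exact fractions are used only so the 10:1 comes out clean; plain floats work the same way.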

With that background, let’s look at a simple crosstabulation of UT’s 2015 drives across two dimensions: did they score (yes/no) and did the drive have an explosive play (yes/no).

A simple crosstab of whether a drive ended in points and whether it had an explosive play

UT had 91 drives last year that included an explosive play (run of 12 or more yards, pass of 16 or more yards) as shown in the 2nd outlined column above. They scored on 59 of those drives and failed on 32, for odds of scoring of 59/32 = 1.84. Conversely, on the 72 drives that did not include an explosive play (the first outlined column), they only scored 10 times and failed 62 times, for odds of scoring of 10/62 = .161.
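For anyone following along at home, the crosstab arithmetic can be sketched in a few lines of Python. The counts are the ones quoted above; the dictionary layout and function names are just my own illustration.

```python
# Drive counts quoted in the text (UT, 2015 season).
crosstab = {
    "explosive": {"scored": 59, "failed": 32},
    "no_explosive": {"scored": 10, "failed": 62},
}


def odds_of_scoring(cell):
    """Odds = drives that scored divided by drives that did not."""
    return cell["scored"] / cell["failed"]


def scoring_rate(cell):
    """Plain scoring percentage, for comparison with the odds."""
    return cell["scored"] / (cell["scored"] + cell["failed"])


odds_explosive = odds_of_scoring(crosstab["explosive"])
odds_no_explosive = odds_of_scoring(crosstab["no_explosive"])

for label, cell in crosstab.items():
    print(f"{label}: odds = {odds_of_scoring(cell):.3f}, "
          f"scoring rate = {scoring_rate(cell):.1%}")
```

Note the difference between the two numbers: odds compare scores to failures, while the scoring rate compares scores to all drives.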

I provide all of that context to help us understand what follows. Using Stata, my default stats program (because that’s what my grad school used and I haven’t learned R yet), below is a sample output for a logistic regression. Let’s create a one-variable model where we’re trying to predict if a drive ended with points based on whether or not the drive had an explosive play.

The output for a logistic regression using explosive plays to predict whether a drive ended with a score or not

I have highlighted the most relevant piece of the regression. Each variable in our model will get its own row and then a reported odds ratio. How do you interpret the odds ratio? As our variable for explosive plays increases by 1 unit (from 0 to 1), the odds of scoring get 11.43 times bigger. It becomes much more likely that you will score (this actually maps back to the odds calculated in the crosstab in that 1.84/.161 = 11.43!).

Boom goes the dynamite.

A simpler way to interpret that: an odds ratio greater than 1 means that scoring is more likely as that variable increases while an odds ratio less than 1 means that scoring is less likely as that variable increases. Got it? Good!
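To convince yourself that the regression and the crosstab really are telling the same story, here is a rough Python sketch that rebuilds the 163 drives from the crosstab counts and fits the one-variable logistic regression by hand. To be clear, this is not the Stata run; it is my own bare-bones Newton-Raphson fit using only the standard library, and the function and variable names are made up for illustration.

```python
import math

# Rebuild drive-level rows from the crosstab counts: (explosive, scored).
drives = [(1, 1)] * 59 + [(1, 0)] * 32 + [(0, 1)] * 10 + [(0, 0)] * 62


def fit_logit(data, iterations=25):
    """Fit scored ~ intercept + slope * explosive by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0          # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0  # entries of the 2x2 information matrix
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * step = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


intercept, slope = fit_logit(drives)
odds_ratio = math.exp(slope)          # odds ratio for an explosive play
baseline_odds = math.exp(intercept)   # odds of scoring with no explosive play

print(f"odds ratio for explosive play: {odds_ratio:.2f}")
print(f"baseline odds (no explosive play): {baseline_odds:.3f}")
```

With a single yes/no variable, the fitted odds ratio is exactly the ratio of the two crosstab odds, (59/32) / (10/62), which is why the regression output and the back-of-the-envelope division agree.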

Building a more complex model

There are numerous ways for us to go about this, and a lot of it obviously depends on what variables are in your dataset.

Here are the variables that I included in my model, along with their explanations.

  • numplays: The number of offensive snaps on a drive
  • expplayyn: Binary indicator of whether or not drive included an explosive play
  • negplayyn: Binary indicator of whether or not drive included a negative yardage play
  • strtfield: 1 to 100 representation of starting field position (higher numbers closer to the opponent’s goal line)
  • CompetentOpponent: Binary indicator of whether or not opponent was one of Tennessee’s tougher games in 2015 (1 = Oklahoma, Florida, Arkansas, Georgia, Alabama, Northwestern)

If we use those variables to attempt to explain whether or not the Vols scored on an offensive drive, we get the following results.

More in-depth logistic regression attempting to explain whether or not a UT drive ended in points

There’s a slight nuance in interpreting these results compared to the simple model. With multiple variables included, each odds ratio now represents how the odds of scoring change when that variable increases by 1 unit, holding all other variables constant. From the above, we can’t answer what happens if you have explosive plays and a long drive together; we can only interpret what it means to have a longer drive relative to a shorter drive with everything else held equal.

Keeping that in mind, all of these results are relatively intuitive. If you go 3 and out, it’s basically impossible to score, so it makes sense that as you increase the number of plays, you increase the odds that a drive ends in a score. As we just saw above (and looked at in previous weeks), having an explosive play significantly increases your odds of scoring while having a negative play (probably) decreases your odds. Each additional yard of field position isn’t that meaningful on its own (each additional yard multiplies your odds of scoring by about 1.05, or 5% higher odds), but they add up (2 yards is 1.05 * 1.05, 3 yards is 1.05 * 1.05 * 1.05, etc.).
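The “they add up” part is just compounding, and a quick Python sketch makes it concrete. The 1.05 is the rounded per-yard figure quoted above; the function name and the yardages in the loop are my own picks for illustration.

```python
# Rounded per-yard odds ratio for starting field position, as quoted in the text.
ODDS_RATIO_PER_YARD = 1.05


def field_position_multiplier(extra_yards, per_yard=ODDS_RATIO_PER_YARD):
    """Multiplier on the odds of scoring for starting extra_yards closer,
    holding the other variables in the model constant."""
    return per_yard ** extra_yards


for yards in (1, 2, 3, 10, 20):
    print(f"{yards:>2} extra yards -> odds multiplied by "
          f"{field_position_multiplier(yards):.2f}")
```

Because the per-yard figure is rounded, treat the multipliers for bigger yardages as ballpark numbers rather than precise estimates.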

The most interesting variable to interpret is CompetentOpponent. Take a look at the last two columns: the 95% confidence interval runs from .26 (meaning playing a tough team makes you much less likely to score) to 1.3 (meaning playing a tough team makes you somewhat more likely to score). Without getting too in the weeds, this is what math geeks are referring to when they talk about “statistically significant”: an odds ratio of exactly 1 means no effect, and this interval includes 1, so the result is not statistically significant. As modeled, we cannot definitively say whether playing Alabama makes it harder to score than playing Kentucky because the confidence interval encompasses both “less likely” and “more likely” values. It’s PROBABLY true...but it might not be. And that’s the fun of mathematical analyses!
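The rule of thumb here is simple enough to write down as a one-line check. This is only a sketch: the function name is mine, the .26 and 1.3 are the interval endpoints quoted above, and the second interval is a made-up example included purely for contrast.

```python
def interval_includes_no_effect(lower, upper, no_effect=1.0):
    """True when an odds-ratio confidence interval contains 1 (no effect),
    i.e. the direction of the effect cannot be pinned down at that level."""
    return lower <= no_effect <= upper


# 95% interval for the tough-opponent indicator, as quoted in the text.
competent_opponent_ci = (0.26, 1.3)

# Hypothetical interval, NOT from the regression output, shown for contrast.
hypothetical_ci = (1.2, 3.0)

print("tough opponent:", interval_includes_no_effect(*competent_opponent_ci))
print("hypothetical:  ", interval_includes_no_effect(*hypothetical_ci))
```

For odds ratios the “no effect” value is 1, not 0, because an odds ratio multiplies the odds rather than adding to them.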

Conclusion

Once again, I caveat all of this in that it doesn’t provide a lot of insight into actual football strategy. Better field position, explosive plays, and bad opponents are better than the alternative. I know...I’m as shocked as you.

That said, it’s fun to poke around and see if the numbers confirm our generally accepted theories about the game. It would be interesting to expand this beyond just the Vols and see if these theories hold true. Maybe next week on the Monday Mathematical!