Welcome to Regression Alert, your weekly guide to using regression to predict the future with uncanny accuracy.
For those who are new to the feature, here's the deal: every week, I dive into the topic of regression to the mean. Sometimes I'll explain what it really is, why you hear so much about it, and how you can harness its power for yourself. Sometimes I'll give some practical examples of regression at work.
In weeks where I'm giving practical examples, I will select a metric to focus on. I'll rank all players in the league according to that metric, and separate the top players into Group A and the bottom players into Group B. I will verify that the players in Group A have outscored the players in Group B to that point in the season. And then I will predict that, by the magic of regression, Group B will outscore Group A going forward.
Crucially, I don't get to pick my samples (other than choosing which metric to focus on). If the metric I'm focusing on is touchdown rate, and Christian McCaffrey is one of the high outliers in touchdown rate, then Christian McCaffrey goes into Group A and may the fantasy gods show mercy on my predictions.
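For readers who think better in code, the selection procedure described above can be sketched roughly as follows. This is a minimal illustration, not the column's actual workflow: the player records, the field names (`td_rate`, `points_per_game`), and the group size are all made up for the example.

```python
from statistics import mean

def split_groups(players, metric, group_size):
    """Rank players by `metric` (highest first); the top go to Group A, the bottom to Group B."""
    ranked = sorted(players, key=lambda p: p[metric], reverse=True)
    return ranked[:group_size], ranked[-group_size:]

def group_average(group, stat):
    """Average a stat across every player in a group."""
    return mean(p[stat] for p in group)

# Hypothetical players, invented purely for illustration.
players = [
    {"name": "Player 1", "td_rate": 0.09, "points_per_game": 18.0},
    {"name": "Player 2", "td_rate": 0.07, "points_per_game": 16.0},
    {"name": "Player 3", "td_rate": 0.03, "points_per_game": 13.0},
    {"name": "Player 4", "td_rate": 0.02, "points_per_game": 12.0},
]

group_a, group_b = split_groups(players, "td_rate", group_size=2)

# Step two of the procedure: confirm Group A really has outscored Group B so far.
a_leads_so_far = (
    group_average(group_a, "points_per_game")
    > group_average(group_b, "points_per_game")
)
```

The point of doing it mechanically is that nobody gets to hand-pick the sample: whoever the ranking puts in Group A stays in Group A.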
Most importantly, because predictions mean nothing without accountability, I track the results of my predictions over the course of the season and highlight when they prove correct and also when they prove incorrect. Here's a list of my predictions from 2020 and their final results. Here's the same list from 2019 and their final results, here's the list from 2018, and here's the list from 2017. Over four seasons, I have made 30 specific predictions and 24 of them have proven correct, a hit rate of 80%.
The Scorecard
In Week 2, I broke down what regression to the mean really is, what causes it, how we can benefit from it, and what the guiding philosophy of this column would be. No specific prediction was made.
In Week 3, I dove into the reasons why yards per carry is almost entirely noise, shared some research to that effect, and predicted that the sample of backs with lots of carries but a poor per-carry average would outrush the sample with fewer carries but more yards per carry.
In Week 4, I talked about yard-to-touchdown ratios and why they were the most powerful regression target in football that absolutely no one talks about, then predicted that touchdowns were going to follow yards going forward (but the yards wouldn't follow back).
In Week 5, we looked at ten years' worth of data to see whether early-season results better predicted rest-of-year performance than preseason ADP and we found that, while the exact details fluctuated from year to year, overall they did not. No specific prediction was made.
In Week 6, I taught a quick trick to tell how well a new statistic actually measures what you think it measures. No specific prediction was made.
Statistic for regression | Performance before prediction | Performance since prediction | Weeks remaining |
---|---|---|---|
Yards per Carry | Group A had 10% more rushing yards per game | Group B has 4% more rushing yards per game | None (Win!) |
Yards per Touchdown | Group A scored 9% more fantasy points per game | Group B has scored 8% more fantasy points per game | 1 |
I'll be perfectly honest: I was fully prepared this week to write up a postmortem on exactly what went wrong with our yards per carry prediction before doubling down and making another. I was going to go player by player and show that the issue wasn't the yards per carry prediction itself. It was that a bunch of low-workload backs in Group A suddenly started getting big workloads (and Derrick Henry went from a big workload to a monstrous, "Atlas shouldering the entire planet" level workload). I was going to argue that this was weird, unpredictable, and unlikely to continue, and that it shouldn't dissuade you from the idea that yards per carry really is pseudoscience.
But, well... it was so unlikely to continue that it didn't continue, and Group B managed to rally for an inspiring come-from-behind victory, preserving our perfect record on this particular prediction. In the final week, Group B averaged 16.7 carries per game, the biggest workload by either group in any week, while Group A averaged just 9.3 carries per game, the lowest workload from either group in any week (and this despite Derrick Henry getting yet another 20-carry game). When the dust settled, Group B averaged slightly more carries per game than Group A over the last four weeks (13.3 to 13.1). And they also averaged slightly more yards per carry than Group A (4.94 to 4.80). Which obviously meant they averaged slightly more rushing yards per game.
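The arithmetic behind that last step is just workload times efficiency. Here is a small sketch using the four-week figures quoted above. It assumes each group's yards per carry is total yards divided by total carries, and because the inputs are already rounded, the result is only approximate.

```python
def rushing_yards_per_game(carries_per_game, yards_per_carry):
    """Rushing yards per game = carries per game * yards per carry."""
    return carries_per_game * yards_per_carry

# Four-week figures quoted in the column (already rounded).
group_a_yards = rushing_yards_per_game(13.1, 4.80)
group_b_yards = rushing_yards_per_game(13.3, 4.94)

# Group B's edge as a fraction of Group A's output.
# With these rounded inputs it comes out a little over 0.04.
group_b_edge = group_b_yards / group_a_yards - 1
```

Small edges in both carries and yards per carry multiply into a small but real edge in rushing yards per game.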
It was hardly an individual effort; 80% of backs in Group B topped 4.4 yards per carry, compared to just 55% of backs in Group A. But if there was one standout it was likely Jonathan Taylor, who went from the second-lowest ypc average at the time of the prediction (3.34) to the highest ypc average in the four weeks since (6.64). Which is not all that surprising, because yards per carry in one sample does very little to predict yards per carry in another.
Meanwhile, our second prediction (which looked like it was coasting to victory) has suddenly gotten a lot tighter after a strong touchdown showing from Group A. But Group B is still leading in yards per game (64.6 to 58.8) and touchdowns per game (0.42 to 0.41), and both groups are averaging a near-identical yard-to-touchdown ratio (144 yards per touchdown for Group A, 153 for Group B).
Does "Offensive Identity" Regress?
This week I wanted to try a new prediction. And because one of the aims of this column is to equip you with the tools to make accurate predictions with minimal effort, I wanted to walk you through the entire process from start to finish.
While watching Monday Night Football, someone mentioned that Tennessee's 4 rushing touchdowns gave them 12 for the season, compared to just 6 touchdowns passing. It makes sense that Tennessee would have more rushing touchdowns than most other teams because Tennessee has more Derrick Henry than most other teams. But I wondered first how unusual this split was, and second how unsustainable this split was.
So I headed over to https://www.pro-football-reference.com/, the best repository of NFL statistics on the planet, and I went to the 2021 season summary. Scrolling down, I found that there have been 321 passing touchdowns against 180 rushing touchdowns so far this season, meaning 64.1% of those touchdowns have come through the air. To get some context for that number, I checked the 2020 season and found there were 871 passing touchdowns compared to 532 rushing, a passing share of 62.1%. It looks, then, like this league-wide share is fairly stable from year to year.
My second step was to copy the 2020 and 2021 offensive statistics tables into a spreadsheet. From there, I calculated what percentage of each team's offensive touchdowns came via the pass in each of the last two seasons. Last year, 30 out of 32 teams scored between 45% and 75% of their offensive touchdowns via the pass. The two exceptions were the Patriots (who scored 37.5% of their touchdowns through the air thanks to some strong goal-line running by quarterback Cam Newton) and the Texans (who scored 76.7% of their touchdowns through the air, just a hair over our range). It certainly looks like 45-75% represents the "sustainable range".
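If you want to reproduce those league-wide percentages yourself, the calculation is passing touchdowns divided by passing plus rushing touchdowns. A quick sketch follows; the function name is just an illustrative choice.

```python
def passing_td_share(passing_tds, rushing_tds):
    """Fraction of passing-plus-rushing touchdowns that came through the air."""
    return passing_tds / (passing_tds + rushing_tds)

# League-wide totals quoted in the column.
share_2021 = passing_td_share(321, 180)  # about 0.641
share_2020 = passing_td_share(871, 532)  # about 0.621
```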
It's also worth observing that the Titans had Derrick Henry last year and yet they still scored 55.9% of their touchdowns through the air. There's no reason to believe that their "team identity" is so rush-heavy that they'll continue to score two-thirds of their touchdowns on the ground going forward.
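The spreadsheet step boils down to comparing each team's passing share against the 45-75% band. Here is a minimal sketch using only the three 2020 figures quoted above; the band is this column's rough estimate rather than a formal statistical threshold, and the function and variable names are illustrative.

```python
SUSTAINABLE_LOW = 0.45   # low end of the estimated sustainable range
SUSTAINABLE_HIGH = 0.75  # high end of the estimated sustainable range

def classify_share(share, low=SUSTAINABLE_LOW, high=SUSTAINABLE_HIGH):
    """Label a passing-touchdown share as below, within, or above the range."""
    if share < low:
        return "below"
    if share > high:
        return "above"
    return "within"

# 2020 passing-touchdown shares quoted in the column.
shares_2020 = {"Patriots": 0.375, "Texans": 0.767, "Titans": 0.559}

labels_2020 = {team: classify_share(share) for team, share in shares_2020.items()}
```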
Then I looked at the teams from 2021. Since fewer games have been played, I expected a bigger spread in what percent of touchdowns were coming via the pass. Remember, "statistics regress more over larger samples" is one of the biggest rules of this column and why we make predictions for 4 weeks instead of just 1 or 2. (Look at how crucial the extra time was for our yards per carry prediction to pay off.)
Sure enough, 9 teams fall outside of that 45-75% passing touchdown rate that we estimate is sustainable. In addition to the Titans, the Cleveland Browns are scoring just 33% of their touchdowns through the air, and the Chicago Bears are scoring just 30%. Meanwhile, the Minnesota Vikings have 86.7% of their touchdowns from the pass, the Cincinnati Bengals and Atlanta Falcons have 83.3%, the Kansas City Chiefs are at 78.3%, the Denver Broncos are at 76.9%, and the Los Angeles Rams are at 76.2%.
With the outliers identified, my goal is to come up with a prediction that these outliers will regress in a meaningful way. There's a bit of a conflict because I want to focus on the biggest outliers (because they'll regress the most), but I also want to include as many teams as possible to reduce the role of random chance. The other challenge with a prediction like this is making it easy to state and easy to track. "The Chicago Bears will see a higher percentage of their touchdowns coming through the air while the Cincinnati Bengals see a lower percentage" is true, but it's not concrete or trackable.
Because I was initially intrigued by Derrick Henry and the Tennessee Titans, I chose to focus on the low end of the range. (Though the Vikings, Bengals, Falcons, Chiefs, Broncos, and Rams will certainly start running for more touchdowns going forward, too.) Only three teams are below the low end of the sustainable range, but two more (the Giants at 45.5% and the Panthers at 46.7%) are still scoring fewer than half of their touchdowns via the pass. And two more (the Ravens and the Jaguars) have scored exactly half of their touchdowns passing and half rushing.
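One rough way to see which of the nine outliers are the "biggest" is to rank them by how far each sits from the league-wide passing share. The sketch below uses the 2021 percentages quoted above, with the Titans' share computed from their 6 passing and 12 rushing touchdowns. Measuring distance from the league share is an assumption made for illustration, not necessarily how the teams were picked.

```python
# League-wide 2021 passing share from the totals quoted earlier (321 passing, 180 rushing).
LEAGUE_SHARE_2021 = 321 / (321 + 180)

# 2021 passing-touchdown shares for the nine teams outside the 45-75% range.
shares_2021 = {
    "Titans": 6 / (6 + 12),
    "Browns": 0.33,
    "Bears": 0.30,
    "Vikings": 0.867,
    "Bengals": 0.833,
    "Falcons": 0.833,
    "Chiefs": 0.783,
    "Broncos": 0.769,
    "Rams": 0.762,
}

def rank_by_gap(shares, league_share):
    """Order teams from furthest to closest to the league-wide share."""
    return sorted(
        shares,
        key=lambda team: abs(shares[team] - league_share),
        reverse=True,
    )

ranked_outliers = rank_by_gap(shares_2021, LEAGUE_SHARE_2021)
```

By this measure the three run-heavy teams (Bears, Browns, Titans) sit furthest from the league share, so the low end of the range is where the largest regression would be expected.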
Those four extra teams might fall at the very edge of the "sustainable range", but remember, the league average is around 63%, so I'd still expect them to regress towards the pass. Just not as strongly.
Now that I have my seven teams (the Bears, the Titans, the Browns, the Giants, the Panthers, the Ravens, and the Jaguars) I can make my prediction. Collectively, those seven teams have 61 rushing touchdowns against just 43 passing, meaning they've rushed for 42% more touchdowns than they've passed for. Over the next four weeks, through the magic of regression, I predict they'll have more passing touchdowns than rushing touchdowns.
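For anyone tracking along at home, the starting point and the win condition can be written down in a few lines. This is a sketch with illustrative function names; the only real numbers in it are the 61 rushing and 43 passing touchdowns quoted above, and no future results are assumed.

```python
def rushing_surplus(rushing_tds, passing_tds):
    """How many more rushing than passing touchdowns, as a fraction of passing."""
    return rushing_tds / passing_tds - 1

def prediction_wins(passing_tds, rushing_tds):
    """The prediction is correct only if passing touchdowns strictly exceed rushing."""
    return passing_tds > rushing_tds

# Combined totals for the seven teams at the time of the prediction.
surplus_so_far = rushing_surplus(rushing_tds=61, passing_tds=43)  # about 0.42
```

Once four more weeks of games are in the books, feeding the seven teams' combined passing and rushing totals over that span into `prediction_wins` gives the verdict.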
One might wonder why I care about a statistic as esoteric as "percentage of touchdowns coming through the air". The simple truth is that how many touchdowns a team scores is a function of how good its offense is, but how it scores those touchdowns is much more random, especially over small samples.
Knowing that Tennessee is scoring "too many" rushing touchdowns has big implications when it comes to predicting the next four weeks for Derrick Henry, Ryan Tannehill, A.J. Brown, Julio Jones, and the rest of the team. Similarly, when setting expectations for Joe Mixon and Ja'Marr Chase, it's worth knowing that the Bengals have an unsustainable rate of passing touchdowns. And so on. The key to harnessing regression to the mean is focusing on things we don't really care about, like yards per carry or touchdown mix, but which impact things we do care about, like rushing yards or rushing touchdowns.