
The results of the midterm elections are in, leaving Republicans with at least 52 Senate seats, along with the likely addition of Alaska, where Dan Sullivan holds a four-point lead. In Louisiana, Mary Landrieu and Bill Cassidy will face off in a runoff election on Dec. 6. And with control of the Senate no longer at stake, Ms. Landrieu can no longer hold out hope that a flood of Democratic support will come to her aid.

By the end of the elections, the G.O.P. will most likely end up with 53 or 54 seats, in what is almost universally being described as a Republican wave.

The results were not unexpected. The set of Senate seats up for election in this cycle favored Republicans. Polling averages gave the G.O.P. nominal leads in enough races to leave them with 53 seats. The Democrats' hopes of retaining control of the Senate had always rested on the possibility that pollsters as a whole were systematically misjudging the electorate in some fundamental way. And, it turns out, they were, just in the opposite direction.

In the aggregate, pollsters were off — way off. Arkansas was expected to be a Republican win, but not an 18-point blowout for Tom Cotton. Virginia was expected to be a safe Democratic victory, not a nail-biter that, as of Wednesday afternoon, remained too close to call. In races around the country, polls skewed toward the Democrats, and Republicans outperformed their polling averages by substantial margins.


Senate Polls Overstate Democratic Support Across the Board

In almost every race, the polling average skewed Democratic, sometimes by a substantial margin.

[Chart: Upshot polling average versus the actual margin in Ky., Ark., Alaska, Ga., Colo., Iowa, N.C., N.H., Va., N.M., Minn., Mich., Ill. and Del., plotted on a scale from D +15 to R +15.]

Perhaps just as notable as where polling averages missed is where they did not. In Alaska, often trotted out as the prototype of a state that’s difficult to poll, the polling average fell within a single point of the current vote margin. Meanwhile, in Virginia, henceforth to be known as the Alaska of the East, polls underestimated support for the Republican, Ed Gillespie, by about 9 points. Errors in Arkansas and Kentucky were similar.

The numerous Senate forecasting models that sprang up this cycle — including our own, Leo — generally did well. Most will have “picked” the outcome correctly (in the sense of assigning more than 50 percent probability to the winner) in 34 of the 35 races, not including Louisiana. The one surprise was in North Carolina, where the Republican, Thom Tillis, won by 1.7 points in a race the models generally saw as leaning toward the incumbent Democratic senator, Kay Hagan.
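
This hit-counting measure is easy to compute. Here is a minimal sketch, with made-up win probabilities and outcomes purely for illustration:

```python
# Illustrative only: each entry is (probability the model gave the Democrat, actual winner).
races = {
    "Colo.": (0.25, "R"),
    "Iowa":  (0.30, "R"),
    "N.C.":  (0.65, "R"),  # a miss: the favored Democrat lost
}

def picked_correctly(p_dem, winner):
    # A race counts as "picked" if the eventual winner was given more than 50 percent probability.
    pick = "D" if p_dem > 0.5 else "R"
    return pick == winner

hits = sum(picked_correctly(p, w) for p, w in races.values())
print(f"{hits} of {len(races)} races picked correctly")
```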

But, as we’ve said before, counting up hits and misses in this fashion isn’t the best way to assess a forecast. So what’s a better way?

There are several quantitative methods for scoring forecasts, one of the most common being the Brier score, a metric proposed in 1950 by the meteorologist Glenn Brier. I prefer the logarithmic score for its somewhat firmer grounding in statistical theory, but the two methods give broadly similar results. Both score a model based on how much confidence it placed in its final prediction. Assigning a very high probability to an event that doesn't happen loses you points, as does assigning a very low probability to an event that does happen.
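
As a rough guide to how these scoring rules work, here is a minimal sketch in Python; the probabilities and outcomes are invented for illustration and are not tied to any of the models discussed here:

```python
import math

def brier_score(prob, outcome):
    # Squared error between the forecast probability and the 0/1 outcome (lower is better).
    return (prob - outcome) ** 2

def log_score(prob, outcome):
    # Log of the probability assigned to what actually happened (closer to zero is better).
    p = prob if outcome == 1 else 1 - prob
    return math.log(p) if p > 0 else float("-inf")

# Illustrative forecasts: (probability the Republican wins, 1 if the Republican won).
forecasts = [(0.95, 1), (0.60, 1), (0.30, 1)]

print(sum(brier_score(p, o) for p, o in forecasts) / len(forecasts))  # mean Brier score
print(sum(log_score(p, o) for p, o in forecasts))                     # total logarithmic score
```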

By this metric, the various models all performed about the same by Election Day — with models from The Washington Post and Daily Kos on top, and Leo in the middle of the pack. By the time Election Day rolls around, making a good forecast really isn’t all that hard, and most of the models ended up in roughly similar places.


How the Senate Forecast Models Did

As measured by the logarithmic score, one commonly used technique for scoring probabilistic forecasts, on Election Day most of the forecasts performed roughly the same, with the models from the Washington Post and Daily Kos scoring the highest.

[Chart: logarithmic score for each forecast (Upshot, WaPo, HuffPo, PredictWise, 538, Daily Kos, PEC) from June through November, on a scale from -8 to -2.]

Go back to June and August, though, and the models' scores start to spread out, with Leo at or near the top of the pack. Leo lost points over the last few weeks for scoring the Georgia race as closer to a tossup than some of the other models did; it used more conservative assumptions about what would happen in a runoff between David Perdue and Michelle Nunn. If you rank the forecasts by their daily logarithmic scores, and then average those rankings for every day since Aug. 19 (the earliest date for which we have forecasts from six of the seven models), Leo does fairly well.
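
That ranking procedure is straightforward: for each day, order the models by that day's logarithmic score (rank 1 is best), then average each model's rank across days. A minimal sketch, with hypothetical scores standing in for the real daily data:

```python
import statistics

# Hypothetical daily logarithmic scores (closer to zero is better); the real
# calculation covers every day from Aug. 19 through Election Day.
daily_scores = [
    {"Upshot": -2.1, "538": -2.4, "Daily Kos": -2.0},  # day 1
    {"Upshot": -2.3, "538": -2.2, "Daily Kos": -2.5},  # day 2
    {"Upshot": -2.0, "538": -2.6, "Daily Kos": -2.1},  # day 3
]

ranks = {model: [] for model in daily_scores[0]}
for day in daily_scores:
    # Rank 1 goes to the model with the best (highest) score that day.
    for rank, model in enumerate(sorted(day, key=day.get, reverse=True), start=1):
        ranks[model].append(rank)

for model, r in sorted(ranks.items(), key=lambda kv: statistics.mean(kv[1])):
    print(model, round(statistics.mean(r), 1))
```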


Ranking the Forecasts: How Have They Done Since August?

Average daily rank for the Senate forecast models, from Aug. 19 through Election Day. A lower value denotes greater relative accuracy.

Daily Kos: 2.1
Upshot: 2.6
PredictWise: 3.5
538: 4.0
Wash. Post: 4.3
P.E.C.: 4.9
Huff. Post: 5.8

But all of this ignores the fact that measures like Brier scores or logarithmic scores, while useful in the long run, contain an irreducible element of chance. In general, putting too much weight on these post hoc evaluations could leave you vulnerable to outcome bias.

Suppose we decide to flip a coin — you say the probability that it lands heads is 50 percent; I say it’s 100 percent. I flip the coin and it lands heads. Whether you use Brier scores or logarithmic scores, my forecast will be judged better than yours. In fact, my forecast will be judged better by almost any reasonable metric. It’s only in the long run, over many repeated flips, that the flaws in my forecast would be revealed, as (eventually) tails would come up, and my logarithmic score would drop to negative infinity.
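
The coin-flip comparison is easy to make concrete. Using the same toy scoring functions sketched earlier (again, purely illustrative):

```python
import math

def brier_score(prob, outcome):        # lower is better
    return (prob - outcome) ** 2

def log_score(prob, outcome):          # higher (closer to zero) is better
    p = prob if outcome == 1 else 1 - prob
    return math.log(p) if p > 0 else float("-inf")

# A single flip that lands heads (outcome = 1):
for label, p in [("honest 50% forecast", 0.5), ("overconfident 100% forecast", 1.0)]:
    print(label, brier_score(p, 1), log_score(p, 1))
# The 100% forecast scores better on both metrics for this one flip.

# But the first time tails comes up, the overconfident forecast collapses:
print(log_score(1.0, 0))  # negative infinity
```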

But we can’t repeat the 2014 election multiple times. The fact that there were 36 Senate races helps somewhat, but the majority of them were not competitive enough to provide useful data, and this still won’t get you past the inherent randomness. It’s only in judging outcomes over a series of elections that these methods are useful. And it’s worth noting that Drew Linzer, the creator of the Daily Kos forecast, did quite well in 2012 as well.

There are other more intangible measures that are also worthwhile — stability and transparency among them. You don't want a forecast that gets distracted by shiny objects, so to speak, chasing after every new poll that comes in. But neither do you want a forecast so stubborn that it fails to capture meaningful shifts when they do occur. Finding this balance between stability and sensitivity — being able to distinguish signal from noise, which we explored in our comparison of Senate models over time — is part of what makes a good forecast.

In general, the more transparent a forecast is about its methods, the better it serves this goal. We've tried to be as up front as we could about the choices and assumptions underlying Leo, including by posting our data and code on GitHub.

But perhaps the truest measure of a forecast is whether it makes you more knowledgeable about the world around you. I hope that Leo achieved this.