Models on voter files can be useful, but they are all wrong
Professor George Box, called “one of the great statistical minds of the 20th century,” put it this way: “All models are wrong. Some are useful.”
As we increasingly rely on models to both execute our campaigns and interpret election results, we would do well to remember both halves of Box’s admonition.
Models allow us to target specific appeals to those on the voter file who are likely to be Black, Latino, or AAPI; those who are college graduates and those who aren’t; environmental voters, pro-choice voters, and many more.
The pitfalls of exit polls in analyzing election results have been well known, at least to exit pollsters, for decades. As their drawbacks have become more widely understood, some have argued for throwing them out of the interpretive toolbox altogether, replacing them with models of both demography and vote choice.
As Box suggested, many of these models are very useful. But they are all wrong. Just how wrong? It varies.
Determining the “ground truth” in these situations is often difficult. But we’ll start from the premise that if you ask someone whether they are “Hispanic or Latino” and they say they aren’t, then, in fact, they are not Latino. The same goes for being Black or Asian.
In other words, if you say you are x, and the model says you’re not x, the model is faulty. If you say you aren’t y, and the model says you are y, the model is wrong. If you and the model agree, the model is accurate.
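To make that bookkeeping concrete, here is a minimal sketch, in Python, of how one might score a voter-file model against survey self-reports. The function name, field layout, and toy records are all hypothetical illustrations, not drawn from any actual file or from the data discussed below.

```python
# Minimal sketch: scoring a voter-file race/ethnicity model against
# survey self-reports. All names and records here are hypothetical.

def score_model(records, group):
    """Compare model flags with self-reported identity for one group.

    records: list of (model_label, self_reported_label) pairs.
    Returns the share of model-flagged people who agree with the flag,
    and the share of self-identified members the model actually found.
    """
    flagged = [r for r in records if r[0] == group]   # model says "group"
    self_id = [r for r in records if r[1] == group]   # respondent says "group"
    hits = sum(1 for m, s in records if m == group and s == group)

    agree_rate = hits / len(flagged) if flagged else 0.0  # accuracy among the flagged
    coverage = hits / len(self_id) if self_id else 0.0    # share of the group reached

    return agree_rate, coverage

# Hypothetical toy data: (model's label, respondent's self-report)
sample = [
    ("Black", "Black"), ("Black", "White"), ("Black", "Latino"),
    ("White", "Black"), ("White", "White"), ("Latino", "Black"),
]

agree, cover = score_model(sample, "Black")
print(f"Agreement among those flagged: {agree:.0%}")        # 33% in this toy sample
print(f"Coverage of self-identified members: {cover:.0%}")  # 33% here
```

By this accounting, the first state described below scores about 46% on agreement and roughly a quarter on coverage, and the same two questions can be asked of any demographic or issue model.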
Let’s start with African Americans. In one large state with a substantial Black population, only 46% of those flagged as Black by the model identify themselves as Black. The majority identify themselves as something else.
Looked at differently, nearly three-quarters of those who identify as Black in this state are not modeled as such.
So, if you send mail tailored to the African American community there based on the model, about half the people you mail won’t consider themselves Black, and you’ll only be reaching about a quarter of the state’s African American community.
Similarly, if you use this data to assess how Black voters voted, you’ll be missing a lot.
In a large city in another state, with a very large Black population and a different modeler, the results are better, but still far from perfect. There, a quarter of those modeled as Black say they are not, and nearly one-in-five people who say they are Black are missed by the model.
There is error in locating Latinos as well. In the large state with a large Latino population, 20% of those modeled as Latino say they are not, while 20% of those not modeled as Latino say they are.
In the city with a substantial Latino electorate, 26% of those whom the model predicts are Latino say, “No. Not me,” though only about 16% of those not modeled as Latino say that is in fact who they are.
With Asian Americans the models perform even less well: about a third of those flagged as Asian American say they are not, and roughly 40% of those who identify as Asian American are missed.
One (but only one) reason for these errors is our failure to recognize that neither ethnicity nor opinions are always clear or enduring. In panel surveys, which interview the same people at different points in time, 11-20% of those who initially identify as non-white change their self-classification.
So it goes. Useful but wrong, whether estimating demography or voters’ issue positions.
For example, in a state where abortion rights were heavily contested, nearly four-in-ten of those modeled as anti-choice responded to our polls saying abortion should be at least mostly legal. About one-in-six of those modeled as pro-choice said abortion should be mostly illegal. That’s better than just guessing at voters’ views, but it’s far from perfect.
Models are by definition simplifications of reality, which is why Box’s aphorism is so apt. Models can be enormously helpful in executing and understanding campaigns, but we need to recognize their limitations.