The Hill and I are taking an end-of-year break. While we’re gone, you’ll no doubt be devouring endless analyses of the 2020 election.
Much of the data being analyzed, and many of the tools being employed, are laden with problems and pitfalls.
Even at their best, exit polls are infected by a rarely discussed cluster bias. What makes a poll valid is random sampling, which means every voter has an equal probability of being part of the sample.
But once interviewers are assigned to precinct A and not B, voters in B suddenly have zero probability of being polled.
It’s part of what makes exit polls terrible for estimating turnout, and even worse for projecting the vote among smaller, geographically concentrated subgroups.
This year we have two different “exit polls” employing different methodologies, enabling us to compare and contrast.
On many points they agree, which should increase confidence in those results, but on some matters they lead to quite different conclusions.
They also differ on how key segments voted. One shows whites under 30 gave President-elect Joe Biden a 6-point lead while the other indicates it was President Trump who won these same voters, by 9 points.
One survey says Biden won those whose highest degree is a college diploma by 4 points; the other says it was 14.
Did Biden win those who make $100,000 a year or more by 15 points, or lose them by 2? Did voters approve of Trump’s performance by 1 point, or disapprove by 6? Did voters judge Biden and Trump equally able to handle the economy, or give Biden a 7-point edge? The exits let you take your pick. (The last almost certainly results from different question wording, a fact lost in most analyses.)
Twitter is now littered with another data source: county returns regressed on, or graphed against, demographic data.
The problem here is so significant, and so common, that statisticians have a name for it: the ecological fallacy. As you can tell from the label’s second word, it doesn’t work so well.
Simply put, it is almost impossible to determine individual-level behavior from group-level data.
One early illustration used the 1930 census to examine the relationship between literacy rates and the percent foreign-born in each state. Analysis revealed a rather strong positive correlation, suggesting foreign-born residents were more likely to be literate in English than the native-born.
At the individual level, however, the relationship was the opposite. Those born outside the U.S. were then significantly less likely to be literate in English. The ecological correlation yields the wrong inference, because the foreign-born tended to live in states where the native-born were, relatively, more literate.
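The reversal is easier to see with numbers. Below is a minimal simulation sketching how it arises; the parameters are invented for illustration, not the actual 1930 census figures. Within every simulated state, foreign-born residents are less likely to be literate, yet because they cluster in states with higher baseline literacy, the state-level correlation between percent foreign-born and literacy comes out positive.

```python
import random

random.seed(0)

# Illustrative, invented parameters -- not the 1930 census data.
# Foreign-born residents cluster in high-literacy states, but within
# every state they are LESS likely to be literate than the native-born.
states = []   # (pct foreign-born, state literacy rate)
people = []   # (foreign_born, literate) at the individual level
for i in range(20):
    base = 0.5 + i * 0.02        # state baseline literacy, 0.50 .. 0.88
    pct_foreign = i * 0.02       # foreign-born concentrate in literate states
    n = 2000
    lit_count = 0
    for _ in range(n):
        foreign = random.random() < pct_foreign
        # foreign-born run 15 points less literate within every state
        p = base - 0.15 if foreign else base
        lit = random.random() < p
        people.append((foreign, lit))
        lit_count += lit
    states.append((pct_foreign, lit_count / n))

def corr(xs, ys):
    """Pearson correlation, hand-rolled to stay dependency-free."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Ecological (state-level) correlation: positive, despite the
# individual-level relationship running the other way.
eco = corr([s[0] for s in states], [s[1] for s in states])
fb = [lit for foreign, lit in people if foreign]
nb = [lit for foreign, lit in people if not foreign]
print(f"state-level correlation:  {eco:+.2f}")
print(f"foreign-born literacy:    {sum(fb) / len(fb):.2f}")
print(f"native-born literacy:     {sum(nb) / len(nb):.2f}")
```

Regressing the state-level rows would lead an analyst to conclude the foreign-born were more literate; the individual rows show the opposite. That is the ecological fallacy in miniature.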
The fact that Trump won the country’s 10 poorest states and Biden won nine of the 10 richest does not mean the poor supported Trump and the rich voted for Biden.
Then you’ll read about the models.
Data scientists model individuals’ likelihood of being Black or Jewish or a Biden supporter or whatever. Models like this can be useful, but are often wrong.
They seem like hard science until you learn that only 14 percent of those modeled as Jewish in Florida are, and 25 percent of those modeled as Black across the country say they are not, while 40 percent of those who say they are Black don’t model that way.
The lessons: Approach all the analyses with a critical eye. Don’t rely on any one question or any one poll. Look for different streams of data, gathered with different methods, converging on the same answer.
Eventually we may figure out what happened.
Mellman is president of The Mellman Group and has helped elect 30 U.S. senators, 12 governors and dozens of House members. Mellman served as pollster to Senate Democratic leaders for over 20 years, as president of the American Association of Political Consultants, and is president of Democratic Majority for Israel.