One of the most fascinating aspects of the 2016 election, from the standpoint of data literacy, is how it has broadened general interest in public opinion polls. While opinion polls are a fixture of life in the political and communications communities, it is every four years, during a presidential election, that the typical American suddenly becomes fascinated by and fixated on polls for a few brief months. While polls always differ a bit, there is typically some general alignment among them, which is what makes 2016 so fascinating: depending on which poll you look at, the candidates are neck and neck, Clinton is heading for a wipeout, Trump is heading for a wipeout, or one candidate will win by a comfortable but not massive margin. How can polls, all of which are supposedly surveying the same American public, capture such wildly different understandings?
The root of this lies in the lack of data literacy among the American public. Several weeks ago I sat in the audience of an extremely high-profile keynote speech, given by a member of the Washington elite, that presented various public opinion polls in the US and Europe on key societal topics. Despite many of the polls being conducted by hyper-partisan organizations, the speaker presented them as simple fact that could not be argued with. In fact, what struck me was this line by the speaker: “Now, I realize these results might surprise many of you. But you can’t argue with these findings; they are based on hard data, just like science is. Numbers don’t lie.”
That statement stayed with me, because it succinctly summarized a false assumption I hear all too often here in Washington: that data is the same thing as truth, or at least fact. More remarkably, the rest of the audience wholeheartedly agreed with the speaker, finding no flaw in the notion that because opinion polls generate data, they necessarily represent truth.
The issue is that every dataset in existence reflects the biases of its human creators. Even scientific datasets reflect the biases of experimental design and the limitations of the equipment and sensors used, and those biases can change over time: consider a weather station located in a remote cornfield that is later paved over with asphalt to make a parking lot, massively raising the temperature recorded by the sensor.
In the case of opinion polls, a myriad of factors influence the results. Perhaps the biggest is the decision of whom to survey and the limitations on reaching that sample. For example, always-on-the-go, mobile-only millennials can prove more challenging to reach for lengthy hour-long phone polls than their fixed-landline predecessors. Respondents who hold non-conforming views may be reluctant to express them to a stranger, especially in countries with repressive governments that are known to conduct societal climate surveys to ferret out unrest.
When it comes to political polls, choosing whom to survey means estimating who is likely to show up at the polls on election day. Some polls punt and simply survey a sample of the American public at large, but that means many of the people they talk with won’t or can’t vote and thus won’t have an impact on who actually gets elected. Others survey “registered voters,” meaning anyone who is legally eligible to show up at a polling station on election day and cast a vote; yet a large fraction of eligible voters in the United States do not actually vote. Thus, the final and most common option is to build a statistical model of who is most likely to vote on election day and poll a sample of just those people.
The problem is that predicting who will actually go out and vote on election day is itself a forecast, and like any forecasting algorithm, this is where things break down. One of the biggest factors accounting for the wild differences among the presidential polls is who the pollsters think will turn out: some believe more Democratic-leaning voters will show up, while others believe more Republican-leaning voters will. These assumptions bias every conclusion that can be drawn from the poll. While most polls publish footnotes recording the number of people surveyed and the breakdown of their political leanings, the average American rarely reads those footnotes, and by the time the poll numbers are blared in newspaper headlines or social media posts, they have lost all connection to these caveats.
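To make the turnout-assumption point concrete, here is a minimal Python sketch. Every number in it is invented for illustration; it simply shows how two pollsters holding identical survey responses can publish very different toplines just by assuming different partisan turnout mixes:

```python
# Hypothetical illustration (all figures invented): the same underlying
# survey responses yield different "topline" margins depending on the
# pollster's assumed partisan turnout mix.

# Assumed share of respondents in each group supporting Candidate A.
support_for_a = {"Democrat": 0.90, "Republican": 0.07, "Independent": 0.48}

def topline_margin(turnout_mix):
    """Return A's lead over B in percentage points, weighting each group's
    support by its assumed share of the electorate (undecideds ignored)."""
    support_a = sum(turnout_mix[g] * support_for_a[g] for g in turnout_mix)
    support_b = sum(turnout_mix[g] * (1 - support_for_a[g]) for g in turnout_mix)
    return (support_a - support_b) * 100

# Two pollsters, two turnout models over the same respondents:
dem_leaning_mix = {"Democrat": 0.40, "Republican": 0.30, "Independent": 0.30}
rep_leaning_mix = {"Democrat": 0.30, "Republican": 0.40, "Independent": 0.30}

print(f"Dem-leaning turnout model: A by {topline_margin(dem_leaning_mix):+.1f} points")
print(f"Rep-leaning turnout model: A by {topline_margin(rep_leaning_mix):+.1f} points")
```

With these made-up numbers, the Democratic-leaning turnout model shows Candidate A comfortably ahead while the Republican-leaning model shows A well behind, even though the support rates within each group never change.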
This is especially problematic on social media, where the conclusion of the poll matters more than its caveats. A quick glance at the Facebook conversations about the election in the news feeds of several colleagues over the past month showed copious mentions of polling numbers, fewer mentions of the actual polls those numbers came from, and not a single mention of the caveats associated with them. Thus, a typical social media post might be “Trump is up +3 points in the polls since yesterday!” or perhaps “New CNN poll shows Clinton up +2 points!” but never “XYZ poll of 200 people from 11/2/2016 to 11/6/2016, 75% of whom were Democrats, 5% Independents and 20% Republicans, shows Candidate X up +5 points.”
This points to a broader issue with data in modern society. We are quick to reshare conclusions of interest to us, but rarely do we point people to the underlying data, to how those conclusions were derived from it, to how the data itself was collected, or to the myriad assumptions baked into that entire workflow. Even more rarely do we take the time to walk through those assumptions and ask how they might affect the conclusions we draw from the data, or whether they might invalidate them altogether.

For example, last month I saw a major US newspaper cite an opinion poll showing one presidential candidate with a huge lead over the other and offer this as proof that voters from the opposing party had crossed party lines to support that candidate. Yet when I went to the polling company’s website and pulled up the footnotes for that particular poll, it turned out the sample had a massive skew towards registered voters of the candidate’s own party, and the candidate’s lead was roughly equal to the party breakdown of the survey sample. In short, instead of showing party crossover, the poll simply reflected the partisan breakdown of whom the pollster chose to sample. The news outlet didn’t spend the time to read the footnotes and incorrectly cited the poll as evidence for a conclusion contrary to what the poll actually suggested.
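A back-of-the-envelope Python sketch, again with invented numbers rather than the actual poll's, shows how that kind of misreading happens: when almost everyone votes their own party, a sample's partisan skew flows straight through to the headline "lead":

```python
# Hypothetical numbers loosely echoing the anecdote above: a poll whose
# sample skews heavily toward one party can show a "lead" that is really
# just the sample's partisan composition.

# Partisan makeup of the people the pollster happened to sample.
sample = {"Democrat": 0.55, "Republican": 0.35, "Independent": 0.10}

# Suppose nearly everyone backs their own party's candidate and
# independents split evenly (no crossover at all).
support_for_dem = {"Democrat": 0.95, "Republican": 0.05, "Independent": 0.50}

dem_share = sum(sample[g] * support_for_dem[g] for g in sample)
rep_share = 1 - dem_share
lead = (dem_share - rep_share) * 100            # the headline number
partisan_gap = (sample["Democrat"] - sample["Republican"]) * 100

print(f"Headline: Democrat leads by {lead:+.1f} points")
print(f"Footnote: the sample itself was D+{partisan_gap:.0f}")
```

With these assumptions the 18-point "lead" is entirely an artifact of the D+20 sample; with an evenly balanced sample the same support rates produce a tie, which is exactly the pattern the footnotes revealed in the anecdote above.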
On the one hand, presidential election seasons, with their attendant public fixation on opinion polls, offer an incredible opportunity every four years here in the US to teach Americans about data literacy and how to think critically about the information available to them. On the other hand, the hyper-partisan environment surrounding politics unfortunately means people are unable to look past their political leanings and actually dive deeply into the biases of the data they rely upon.
In the end, this simply teaches us once again that as society has more and more data at its fingertips, we actually understand less and less about the world unless we have the data literacy to grasp what that data is telling us.
This article was written by Kalev Leetaru from Forbes and was legally licensed through the NewsCred publisher network.