One of the most striking things about today’s “big data” world is its focus on mining data over answering questions. Too many projects begin by asking what data is easily obtainable and which tools are best suited to working with it, rather than what question the analysis is trying to answer. The result is a landscape littered with data-intensive projects that prioritize technical prowess of execution over the robustness of the findings derived from the analysis. What does this mean for the future evolution of the field?
Perhaps the biggest challenge of the big data revolution is that it is being driven largely by computer science rather than by the disciplinary fields whose questions it is attempting to answer. This creates a world in which, for example, the majority of sentiment mining tools come from computer science rather than psychology. The focus falls on algorithms and data rather than on questions, with the end result that many tools today rely on the same approaches developed half a century ago for punch card computers.
A typical sentiment analysis dictionary today might be induced from vast volumes of social media data collected over a brief period of time, learning that the words “dentist” and “economist” tend to be associated with highly negative emotions, or that President Obama is associated with either very positive or very negative emotions, depending on when the training data was compiled. Tools often build upon previous dictionaries, many of which draw on the emotional connotations of the 1980s and 1990s, leading to poor fits for social media content that emphasizes abbreviations like “lol,” emoticons and emojis.
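The naive induction scheme described above can be sketched in a few lines. Everything below is illustrative: the tiny corpus and its polarity labels are invented, standing in for the millions of labeled posts a real dictionary would be induced from.

```python
from collections import defaultdict

# Hypothetical miniature corpus of (text, polarity) pairs -- invented data
# standing in for millions of labeled social media posts.
posts = [
    ("my dentist appointment was awful", -1),
    ("dread seeing the dentist again", -1),
    ("great news from my economist friend", 1),
    ("the economist report was gloomy", -1),
]

def induce_dictionary(labeled_posts):
    """Score each word by the average polarity of the posts it appears in --
    the guilt-by-association induction described above."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for text, polarity in labeled_posts:
        for word in set(text.split()):
            totals[word] += polarity
            counts[word] += 1
    return {w: totals[w] / counts[w] for w in totals}

lexicon = induce_dictionary(posts)
print(lexicon["dentist"])    # -1.0: "dentist" inherits negativity by association
print(lexicon["economist"])  # 0.0: averaged across mixed contexts
```

Note that the word “dentist” ends up maximally negative not because of anything intrinsic to dentistry, but purely because of the contexts it happened to co-occur with in the training window.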
Alternatively, a researcher might use Amazon Mechanical Turk to ask users to score a set of documents by the level of “anxiety” they provoke, averaging hundreds of thousands or millions of ratings together to build a new “anxiety” sentiment analysis tool. Yet, without a psychological basis or understanding of how “anxiety” is defined or conceptualized and without taking into account socio-cultural differences in the kinds of language and concepts that might trigger anxiety in populations across the world compared with the particular demographics of the raters used in the particular project, it becomes difficult to understand how to utilize the resulting dictionary and what its biases and limitations may be, compared with a dictionary developed from the top down, starting with a precise definition of what particular kind of “anxiety” should be measured by the tool.
Analyses today tend to start with a dataset and emphasize its scale and the complexity or novelty of the algorithms used – the larger the dataset and the more complex the method, the more likely an analysis will be published in an academic journal or achieve viral status online. This in turn has created an arms race in which the numbers reported are often not the numbers actually used in the analysis. A few months ago I saw an analysis that claimed to have performed pattern mining on a ten petabyte dataset. Impressed, I asked what tools the researchers had used to tractably perform complex pattern extraction on a dataset of that size. The answer was that they had performed a simple numeric range search to extract a one gigabyte subset on which they actually performed their analysis. Though the work was reported as reflecting the underlying trends of ten petabytes of data, its conclusions were in fact drawn from a one gigabyte subset carefully constructed to yield highly mediagenic results.
This illustrates one of the fascinating paradoxes of the emerging “big data” world: despite having exponentially more data available at our fingertips, the amount of data we actually incorporate into our analyses has not increased at the same rate. In the past a researcher might have incorporated one gigabyte of a ten gigabyte dataset into an analysis, while today that researcher might extract one gigabyte from a ten petabyte dataset. In short, the “big data” world has resulted in the accumulation of unimaginable volumes of data, but much of the analysis being done is still locked in the world of “small data.” In some ways, the representativeness of our analyses may actually be decreasing in the big data era as we look at smaller and smaller subsets of larger and larger datasets.
At the same time, as datasets have grown larger and more complex than any human can reasonably manually inspect and new classes of highly uneven data like social media have come into being, our understanding of the nuances and biases of the data we use has decreased. An Excel spreadsheet of a few hundred rows can easily be manually reviewed by a human prior to analysis to fix typographical errors, address missing or corrupted values and look for other outliers. On the other hand, a multi-petabyte database of trillions of rows can only be examined through automated filtering tools.
Spending weeks or months of time carefully examining a dataset for errors and potential bias is a difficult sell in a world where few academic journals publish characterization analyses and researchers in the commercial world find it difficult to spend weeks or months of their time on bias studies that have only long-term payoffs. The result is that datasets become gold standards influencing the findings and theories of countless fields of study with little understanding of their geographic, socio-cultural and other biases and limitations.
Even the definitions of concepts like “influencers” and “active users” have become blurred in the online world. I recently saw a social media analysis that presented Justin Bieber as the most influential person worldwide with respect to the Syrian civil war. While it may be the case that his social media posts garner considerable online visibility and discussion, it is highly unlikely that a tweet by Mr. Bieber with his plan for peace in the Middle East would dramatically alter the landscape of the current conflict.
Driving this is the computer science world’s traditional focus on correctness of output and execution over correctness of fit. If an algorithm compiles and executes on a given dataset without error, there is a tendency to trust the results without asking whether they logically make sense. As an example, I once had a doctoral student on loan to me from one of the top data mining faculty and asked him to write an algorithm that could extract a wide variety of date/time information from a diverse collection of text that included high amounts of OCR error. A few weeks later the student proudly presented his final results, having extracted 40 million dates from a test corpus of just 1 million words. The student argued that because his code executed without error it must be correct and it took almost an hour of discussion for him to finally recognize the logical impossibility of finding 40 date references for every word in the collection. Of course, on the other hand, there are myriad counter-examples of human analysts discarding legitimate machine findings when they disagree with intuition.
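The student’s bug could have been caught by a single sanity check on fit rather than execution: a date reference occupies at least one token, so an extractor can never legitimately find more dates than there are words. A minimal sketch, with a toy regex standing in for the real extraction logic (the pattern below is illustrative, not the student’s actual code):

```python
import re

# Toy date pattern: numeric dates, "Month D, YYYY", and bare years.
DATE_PATTERN = re.compile(
    r"\b(?:\d{1,2}[/-]\d{1,2}[/-]\d{2,4}"
    r"|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.? \d{1,2},? \d{4}"
    r"|\d{4})\b"
)

def extract_dates(text):
    words = text.split()
    dates = DATE_PATTERN.findall(text)
    # The logical-impossibility check the student skipped: executing
    # without error is not the same as producing a plausible result.
    assert len(dates) <= len(words), (
        f"{len(dates)} dates from {len(words)} words -- extractor is broken"
    )
    return dates

print(extract_dates("The treaty of March 3, 1918 was revisited in 1935."))
# → ['March 3, 1918', '1935']
```

Had the student’s pipeline asserted this invariant, 40 million dates from a 1 million word corpus would have failed loudly instead of being reported as a result.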
Perhaps the greatest challenge facing the big data world is the recognition that data analysis is not the same thing as question answering. In the political science world, human analysts have long been employed to read news articles and compile quantitative catalogs of the global activity described within them. On the surface, it would seem that using vast teams of humans would yield nearly perfect results. Yet in practice humans are actually quite poor at such quantitative tasks, with intercoder reliability (whether two different people reading the same news article will catalog it the same way) and intracoder reliability (whether the same person will catalog an article the same way when seeing it again a few days later) presenting huge challenges to robust results. It is also difficult to assemble teams with enough language expertise to catalog material across tens or even hundreds of languages on an ongoing basis.
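Intercoder reliability is typically quantified with a chance-corrected agreement statistic such as Cohen’s kappa, sketched below. The two coders’ labels are invented for illustration; real event-coding studies would compute this over hundreds of articles.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa: observed agreement corrected for the agreement
    two coders would reach by chance given their label frequencies."""
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two hypothetical coders labeling the same six articles.
a = ["protest", "protest", "crime", "military", "crime", "protest"]
b = ["protest", "crime", "crime", "military", "protest", "protest"]
print(round(cohens_kappa(a, b), 3))  # → 0.455
```

A kappa of 0.455 here, despite the coders agreeing on four of six articles, illustrates why raw percent agreement overstates reliability: much of that agreement would occur by chance alone.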
Moreover, humans tend to resolve task ambiguity in highly distinct ways that draw heavily on their individual backgrounds and experiences. Asking a team of humans whether an article about Mexican drug cartel violence should be cataloged under “military violence” due to the military nature of the equipment they use and their frequent clashes with government troops or “common crime” due to the violence being committed by criminal actors will typically yield a wide range of responses.
Similarly, computerized sentiment analysis is often criticized for failing to recognize sarcasm or humor. Yet, recognizing sarcasm requires having sufficient background knowledge to understand that a given statement is in fact false, while a comment one person finds hilarious might be deeply offensive to another. In a past project I was involved in, we had a team of university students score a set of historical newspaper editorials as positive, negative or neutral in tone. The students failed to recognize many of the known highly sarcastic editorials from previous decades because they lacked the historical knowledge to understand that the statements being made were obviously false, and so they simply took the articles at the positive face value of their wording, much as a machine would.
Yet, let us assume for a moment that with a large enough team and a sufficient reconciliation workflow, one could achieve 100% accuracy at codifying every activity mentioned in the New York Times, which has been a favored source for political event coding over the decades. Despite superb coverage of global events, the Times simply does not have the reporting staff or editorial bandwidth to report on every micro-level event that occurs across the entire planet each day. Researchers interested in small-scale protests and routine day-to-day activities in the UK, for example, would likely turn to a British news outlet like the BBC. Visualizing this, the map above compares all of the locations mentioned in New York Times articles (green) versus BBC articles (yellow/orange, including all BBC output across all of its local language editions) monitored by the GDELT Project during March 2015, in 15 minute increments. While both sources catalog major events across the world, neither perfectly covers the entire planet at micro resolution.
Looking at the map above, it is clear that even a 100% accurate codification of the New York Times will not yield a 100% accurate codification of global society. Therein lies one of the promises of big data: that by blending together the collective output of hundreds of thousands of worldwide news outlets, including local outlets in their local languages, one can achieve a collective view of the world far more representative and holistic than any single outlet can offer by itself. While messier and noisier than small-volume human-coded data, big data offers the ability to build composite views of complex environments, looking across massive numbers of sources and languages and triangulating across their often disparate views.
In contrast, much of the investment in recent years has focused on high quality processing of a relatively small number of outlets using either humans or manually tuned automated coding systems custom built for primarily Western and English language outlets. Again, the computer science mindset has favored software and algorithmic development over investment in data collection and in improved understanding of how to access local events and perspectives from the non-Western (and often non-Internet-connected) world.
There has also been an overemphasis on surface over deep analytic techniques. The digital humanists I speak with frequently lament the movement towards surface techniques like word counting and ngrams that have come to dominate areas of the digital humanities landscape. To a literary scholar whose focus is on the thematic undertones of a work or a historian tracing the evolution of a particular worldview over time, counting words and phrases is starkly at odds with the analytic resolution they require. As just one example, to disambiguate a word like “Paris” to determine whether it refers to Paris, Illinois, Paris, France, the socialite Paris Hilton or even the Paris Hilton hotel requires being able to access the surrounding context stretching sometimes for pages around the reference. In this way, ngrams are woefully mismatched for the kind of context-rich analyses that define the humanities.
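The mismatch is easy to see in code: a unigram counter folds every sense of “Paris” into a single tally, discarding exactly the context a literary scholar or historian would need. A minimal sketch with an invented sentence:

```python
from collections import Counter
import re

def ngrams(text, n=1):
    """Count n-grams after crude tokenization -- the surface technique
    that dominates much of the digital humanities landscape."""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Four distinct referents of "Paris" in one invented sentence.
text = ("Paris, Illinois held its county fair while Paris, France "
        "hosted a summit and Paris Hilton opened a Paris Hilton hotel.")
counts = ngrams(text)
print(counts[("paris",)])  # → 4: four different referents, one undifferentiated count
```

The town, the capital, the socialite and the hotel all collapse into a single number; recovering which is which requires the surrounding context, sometimes stretching for pages, that the ngram representation has already thrown away.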
Putting all of this together, we see a “big data” world today being driven by datasets and algorithms and drawing its inspirations and mindsets from computer science, frequently emphasizing technical prowess over goodness of fit. Even as we have access to more data than ever before, the actual analyses we perform with all of that data tend to access only minute subsets, meaning that as a percentage of available data our analyses are actually becoming less representative. Most importantly, as the field of big data evolves and matures, it must recognize that accuracy of analysis is very different from accuracy of result – a 100% accurate codification of a single dataset may yield far less useful insight into the question of global stability trends than a noisier and more error-prone analysis that incorporates hundreds of thousands of local sources in local languages to build a composite view of global society. This is the promise of big data – the ability to rise above the single-source analyses of the past towards composite understandings of society. But to do so we must take the time to understand the nuances of our new datasets and recognize that answering a question of interest may take more than plugging a dataset into an algorithm.
This article was written by Kalev Leetaru from Forbes and was legally licensed through the NewsCred publisher network.