One of the greatest challenges facing the emerging world of “big data” analysis is the degree to which the results we see depend on the data we look at. Put another way, does big data analysis reveal “universal truths” that capture fundamentally new knowledge about society, or are all of our findings merely artifacts of the data we happen to examine? Such questions are not limited to the world of “big data”: journalistic artifacts, limited geographic reach, and language barriers have all shaped even small-scale, human-driven understanding of the world. Yet these questions have become ever more pertinent in an era when our analyses increasingly rely on datasets too large for humans to fully assess and understand.
Certainly there is ample evidence that the findings of the digital era are heavily influenced by the data we have access to. Digital data reflects only that which has been captured into the digital world, either by being created as a digital record in the first place or by being digitized from the physical world. Thus, a young, educated, middle-class American living in an urban area is far more likely to have his or her life well represented in the digital traces of a myriad of platforms like Facebook, Twitter, Google, Uber, or Foursquare, while a farmer living in poverty in rural Ethiopia, a country with just 2% internet penetration, is unlikely to have any measurable footprint in the digital world. More than 80% of the content on the Internet at large is in one of just 10 languages, out of a universe of more than 6,000 spoken today. Changing behavioral trends, politically driven laws and influence, and even philosophical beliefs all quietly shape the data landscape available to us, and how faithfully it reflects society, in ways we are only beginning to understand.
Despite their critical importance, relatively few studies have attempted to quantify the impact of such phenomena at scale, especially on our understanding of macro-level patterns. Within the digital humanities this has led to growing resistance in some quarters to the use of large digitized historical book archives, due to unanswered questions about how representative those datasets are of the overall universe of published books. While it is difficult to assess the degree to which current digitized book archives represent every book ever published in the history of humanity, one question that can be examined is the degree to which results vary across the collections that exist today.
Towards that end, I recently compared three of the largest digitized book collections: the Internet Archive’s English-language American libraries collection, the HathiTrust English-language research collection, and the Google Books Ngrams collection. All English-language works published from 1800 to the present were examined to determine whether they would yield the same result when tracing a given topic over time, and what the causes of any differences might be.
The timeline below shows the percent of all books in the Google Books (green), HathiTrust (blue), and Internet Archive (red) collections that mention “Abraham Lincoln” at least once in the text, from 1800 to the present. Other than a substantially higher peak in Internet Archive books in 1865, searching any of the three collections would yield nearly identical results on the popularity of Lincoln as a topic from 1800 until 1939. Beginning around 1939, however, the HathiTrust and Internet Archive collections show a rapid inverse bell curve of decreasing interest, while Google Books shows a much more gradual and far less substantive decrease.
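The metric underlying these timelines is straightforward: for each publication year, count the fraction of books whose full text contains the phrase at least once. A minimal sketch of that tally, assuming a hypothetical collection represented as (year, full_text) pairs (the actual collections are accessed through their own interfaces):

```python
from collections import defaultdict

def mention_share_by_year(records, phrase):
    """Fraction of books per year whose text mentions `phrase` at least once.

    `records` is an iterable of (year, full_text) pairs -- a hypothetical
    stand-in for whichever digitized collection is being analyzed.
    """
    totals = defaultdict(int)    # books published per year
    mentions = defaultdict(int)  # books per year containing the phrase
    needle = phrase.lower()
    for year, text in records:
        totals[year] += 1
        if needle in text.lower():
            mentions[year] += 1
    return {year: mentions[year] / totals[year] for year in sorted(totals)}

# Toy data purely for illustration.
books = [
    (1864, "A speech by Abraham Lincoln ..."),
    (1864, "A treatise on agriculture ..."),
    (1865, "Abraham Lincoln, a memorial volume ..."),
]
share = mention_share_by_year(books, "Abraham Lincoln")
# share[1864] == 0.5, share[1865] == 1.0
```

Note that the denominator is all books in that collection for that year, which is exactly why the composition of each archive matters: the same numerator divided by a differently composed denominator yields a different curve.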
Repeating this process, the timeline below shows the percent of books in each of the three collections mentioning “Charles Darwin” at least once in the text. Once again all three collections offer nearly identical results through 1922, but diverge sharply thereafter. An analysis based on Google Books would show interest ramping up steadily through the present day, falling off only in the last few years. Using the Internet Archive’s collection would show a sharp drop-off, a leveling off, and then a sharp increase in interest over the last 20 years, while using HathiTrust’s archive would show a steady linear decrease through 1965, then a bottoming out of interest through the present.
Taken together, these two graphs show that tracing the popularity of a topic in books over time will yield nearly identical results regardless of which digitized book archive you analyze, as long as you only examine the period from 1800 to 1922. Analyses of the period from 1923 to the present, on the other hand, will yield sharply divergent results depending on which of the three datasets you use. Why might this be? Books published in 1923 and later largely remain protected by copyright in the United States, meaning that libraries must obtain permission from publishers to digitize them, while copyright protections on books published between 1800 and 1922 have largely expired, allowing those works to be freely redistributed in the public domain.
It appears that the three book archives have adopted markedly different acquisition strategies in the copyright era, with Google Books having no observable difference in composition between the public domain and copyright eras, while the Internet Archive has emphasized digitizing college and university publications like student newspapers and yearbooks, and HathiTrust has focused on US Government publications like legal and budgetary collections.
The complexity and nuance of the composition of the Internet Archive and HathiTrust book archives can be seen in the timeline above, which plots the percent of books by year in the two collections that are US Government publications. Here, the percentage jumps from around 7% in 1922 to almost 30% in 1924, rising rapidly to 50% by 1944. At that point, however, the two collections deviate sharply: the Internet Archive rapidly shifts away from Government publications to focus on higher education, while HathiTrust jumps to 90% of its materials coming from the Government in 1964 and now exceeds 99%.
Of course, one might reasonably ask why it matters whether different digitized book archives yield different results, or who might care that in the copyright era one collection decided to focus on higher education publications while the other emphasized US Government materials. The answer is that these three collections offer a view in miniature of the broader state of big data analysis. So much of today’s “big data” analysis involves simply grabbing the most accessible dataset and using it to derive conclusions about the world, without taking the time to understand the complex nuances and composition of the underlying data and how those might affect the conclusions. Nor do many analyses attempt to triangulate their results across multiple datasets – the trend is to use the dataset with the easiest API and publish the results it yields as universal truth.
As the charts above demonstrate, this is a dangerous proposition: our findings may indeed appear to replicate across datasets when we examine certain time periods, while analyses of the same datasets over different time periods may yield wildly different results that are entirely reflective of the biases of the underlying data. Taken together, this suggests that for the field of “big data” to mature to the point where it can derive robust, actionable findings about the world, we must devote far more effort to understanding the data we are analyzing.
I would like to thank Google, Clemson University, the Internet Archive, HathiTrust, and OCLC for their assistance and support in this analysis. The full analysis and preliminary report is available online.
This article was written by Kalev Leetaru from Forbes and was legally licensed through the NewsCred publisher network.