Why It’s So Important To Understand What’s In Our Web Archives


Kalev Leetaru, Contributor

November 28, 2015

Last week I explored what precisely makes up the 20-year archive of the web held in the Internet Archive’s Wayback Machine. Several of those findings have spawned considerable discussion over the past week within the library and web archival communities about what it means to archive the web, how much documentation and metadata is enough, the tradeoffs between completeness and reach, and how to better engage with the myriad constituencies served by web archives.

Why is it so important to understand what’s in our web archives? Perhaps the most important reason is that because the web is an infinite and ever-changing landscape, it is simply impossible to archive the “entire internet” and perfectly preserve every change to every page in existence. Web archives are by their very nature an imperfect record of the web and constructing them is an exercise in countless tradeoffs of how to preserve an infinite stream with finite resources.

At the most basic level there is the question of how to seed the crawlers of an archive and what kinds of websites they should prioritize. Should an archive prioritize government content as a lens onto the information a government provides its citizens, educational websites as the output of the nation’s centers of learning, commercial websites as an indicator of commercial use of the web, or personal websites as a window onto civil society? Should an archive focus its efforts on preserving at least one copy of every page in existence to capture the breadth of the web, or should it focus on regular continued snapshots of a smaller set of pages over time to capture the evolution of the web?

To what degree should an archive attempt to preserve the content and experience of dynamic websites such as databases and search engines or interactive and personalized websites? When examining change over time, should only changes to the text of a page be archived or should any change to the template of a page or the selection of advertisements displayed on it count?

There is no single “right” answer to any of these questions. Each of the many constituencies of web archives has its own needs, and those needs can be at odds with one another. The newspaper division of a national library might be interested in preserving at least one copy of every article published by an online news website in the country. A political communications scholar, on the other hand, might want to track how government press releases are being modified over time or the evolution of a major political blog over many years. The former means devoting all crawling activity to finding new links, while the latter requires precise continuous high-density snapshots over decades.

In terms of interface and metadata, an ordinary citizen user might simply want to look up the last available version of a page that is no longer accessible. A scholar, on the other hand, might want to understand why a particular site was crawled more frequently during a particular period and why some highly-linked inner pages of the site are absent from the archive. A lawyer might need to authenticate the precise moment that a page was captured and from where.

The difficulty of understanding the decisions made by an archive’s crawlers is perhaps the most important obstacle to large-scale scholarly use of web archives. A researcher examining the evolution of digital humanities on the web, for instance, needs to understand whether the archive being examined had collection policies or crawler heuristics that might bias it away from crawling such websites, considerably skewing the underlying sample. Alternatively, a scholar studying online news needs to understand how the Archive handles cookie-mediated metered access and news paywalls, how it traverses news sites, and whether those characteristics might skew it towards certain sections of the outlet.

Few web archives today provide such transparency into the operation of their collection policies and technologies. This is problematic in that many studies and publications using collections like the Wayback Machine make assumptions about characteristics like inclusion criteria and recrawl rates. In my own interactions with academic researchers, I have heard it frequently asserted that the Wayback Machine’s recrawl rate can be used as a direct measure of the update speed of a website. However, the Archive has never provided guidance on how it determines its recrawl rate and my findings of last week suggest the rate at which the Archive recrawls pages is not highly correlated with a page’s expected rate of content updates.

At the same time, few scholars have the ability or expertise to study web archives at these scales. As with other kinds of “big data” such as social media, most researchers tend to extract small samples of data from web archives and derive their conclusions from those. Such samples are often too small to reveal the kinds of macro-level biases that can only be observed when looking at the dataset as a whole. Without knowing what’s in our archives, we are simply stumbling through the dark.

Given that no web archive will ever be perfect, what is the purpose of studying the limitations and biases in today’s archives? Within the academic community there has been a growing discourse and unease in certain quarters about the lack of visibility into how the datasets we use have been constructed and how those decisions might bias the results we draw from them. For example, a recent paper questioned whether findings derived from the Google Ngrams collection may be heavily skewed towards scientific and medical literature, potentially biasing or invalidating certain results derived from the collection. Without spending the time to understand what makes up the web archives we use, we are doomed to repeat this same exercise with web research.

The web archives of today were never designed for high-accuracy stable-snapshot research on the evolution of the web, offering all the more reason to bolster our understanding of what is in them. Indeed, the “big data” era as a whole has come to be defined by the use of data in novel ways it was never designed for. However, doing so often entails breaking critical assumptions the builders of the datasets made about how they might be used, about their limitations, and about the impact of any potential biases. Moreover, certain kinds of biases won’t manifest themselves until datasets are used in certain novel ways, meaning that locating and addressing such biases is an ongoing process.

That is not to say that web archives should support large-scale web research using their data only after their nuances are known. Rather, it suggests that archives should make such analyses a priority, partnering with scholars who specialize in at-scale data characterization, as few archives have in-house staff with the kind of exceptionally specialized skillsets and experience needed to tease out the subtle nuances of multi-petabyte datasets. It also means that they should similarly prioritize publishing available documentation on their collection policies and algorithms and emphasize to scholars the limitations of their tools and interfaces. This could include adjustments to their public interfaces to guide researchers in proper use and assumptions.

For example, the Google Ngrams viewer does not provide an option to view the raw number of books mentioning a keyword by year – instead it reports only the percentage of digitized books from that year containing the keyword. This is to ensure that researchers unfamiliar with normalization do not draw false conclusions from the data. In other words, they designed their interface to ensure that researchers were forced to properly normalize their findings against variations in the universe of books digitized per year. Expert users can still access the raw data files, but with sufficient computing requirements that they are far more likely to understand how to properly normalize their findings.
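The normalization described above can be illustrated with a minimal sketch. The function name `normalize_by_year` and all of the counts below are hypothetical, invented purely for illustration; they are not drawn from the Google Ngrams data files or from any archive. The point is only that a raw count can rise year over year while the share of the collection mentioning the keyword falls.

```python
from typing import Dict


def normalize_by_year(matches: Dict[int, int], totals: Dict[int, int]) -> Dict[int, float]:
    """Return, for each year, the fraction of digitized books mentioning a keyword."""
    shares: Dict[int, float] = {}
    for year, total in totals.items():
        if total <= 0:
            # No books digitized for this year, so a share is undefined; skip it.
            continue
        shares[year] = matches.get(year, 0) / total
    return shares


# Hypothetical counts, made up for this example only.
raw_matches = {1900: 50, 1950: 100, 2000: 400}
books_digitized = {1900: 1_000, 1950: 4_000, 2000: 40_000}

shares = normalize_by_year(raw_matches, books_digitized)
for year in sorted(shares):
    # Print the raw count next to the normalized share for comparison.
    print(year, raw_matches[year], f"{shares[year]:.2%}")
```

In this invented example the raw number of matching books grows eightfold between 1900 and 2000, yet the normalized share drops, because the number of digitized books grew far faster. A researcher looking only at raw counts would draw the opposite conclusion.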

Offering greater documentation is not an either/or proposition. Archives needn’t halt research using their collections until they have produced exhaustive documentation outlining every nuance of their systems. Just releasing basic statistics on their collections and macro-level collection policies would go a long way towards starting a process of greater transparency. Every archive will have errors and missing data and biases in the snapshot it offers of the web. This is not to say they cannot be used, but rather that those nuances must be better understood so they can be accommodated and worked around.

Today, however, many archives are opaque black boxes that offer researchers little understanding of their inner workings. Essentially they are giant libraries with no index – you can request a book and if it exists you get it, but you can’t browse or search to know what’s in it and if something is missing you don’t know whether it was a simple technical glitch or whether the archive purposely minimizes its collection of that content.

In an ideal world, archives would provide metadata and internal replay logs documenting the complete operations of their crawler, ingest, processing, and storage infrastructures. For some websites it is crucial to know the IP address of the crawler requesting it and the “micro session” the request occurred within due to IP localization and cookie-mediated personalization or metering. Understanding how the processing layers transform a given page into outbound links for crawling can help pinpoint bugs, while flow data can help diagnose why a given site is being crawled more or less often.
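As a rough sketch of what one entry in such a crawl log might hold, the record below captures the crawler’s IP address, a session identifier, and the page that led to the request. Every name here (`CrawlRecord`, `to_log_line`, and each field) is a hypothetical choice made for illustration, as are the example values; this is not the format used by the Internet Archive or any other archive.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class CrawlRecord:
    """One hypothetical log entry describing a single crawler request."""
    url: str                 # page that was requested
    captured_at: str         # ISO 8601 UTC timestamp of the request
    crawler_ip: str          # address the request came from (matters for IP localization)
    session_id: str          # "micro session" sharing cookies (matters for metering)
    referrer: Optional[str]  # page whose outbound link led to this request, if any
    http_status: int         # status code returned by the site


def to_log_line(record: CrawlRecord) -> str:
    """Serialize a record as one line of JSON, suitable for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


# Example values are invented; the IP is from a range reserved for documentation.
record = CrawlRecord(
    url="https://example.com/news/story.html",
    captured_at="2015-11-28T12:00:00Z",
    crawler_ip="192.0.2.10",
    session_id="session-0001",
    referrer="https://example.com/news/",
    http_status=200,
)
line = to_log_line(record)
print(line)
```

Keeping each request as a single self-describing line is one simple design among many. It would let a researcher filter by session or by crawler address without needing access to the archive’s internal systems.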

Having personally written web crawling systems for nearly 20 years and overseen web-scale crawling infrastructures for just over 15 years, I can attest that there are many techniques and approaches that can be used to capture and store such massive-scale instrumentation data extremely efficiently. Collecting such data has considerable operational benefits for the archive itself, helping it rapidly pinpoint evolving issues and better understand and address bugs so that they do not fester over years.

Yet, better transparency doesn’t necessarily entail exhaustive metadata and log files documenting every operation across the entire archive. It can be as simple as open sourcing the source code to the crawlers and orchestrating infrastructure. Open sourcing their crawlers and orchestration tools has the side benefit of allowing web archives to leverage the vast global community of developers and web technology experts who specialize in dealing with the myriad idiosyncrasies and nuances of working with the open web. Instead of its own staff having to add every feature and find every bug, an archive that open sources its tools can leverage the latest algorithms and approaches to dealing with dynamic websites, client-side rendering, page extraction, ill-formed HTML and character encoding, crawling strategies, and the like.

The Internet Archive has been a shining example in this regard, making large portions of its underlying infrastructure available via GitHub for others to build upon and improve. Not all of its tools have been publicly released and there is still scant documentation on how these tools are blended together within the Archive itself, the ingest streams that populate them, and the specific configuration tweaks used by the Archive that would help explain some of the nuances of their holdings. But, by releasing its source code publicly, the Archive has built an open infrastructure that allows others to build upon its work and contribute their own expertise.

Perhaps moving forward, through stronger outreach with the developer and computer science communities, and partnerships with web conferences and developer competitions, the Archive may be able to build an even greater network of forward-looking developers to help it constantly evolve its tools to the ever-changing web landscape.

Indeed, in my presentation at the 2010 Library of Congress summit “Citizen Journalists and Community News: Archiving for Today and Tomorrow” I outlined a vision of web archives working more closely with their communities, with scholars, and with content platforms like WordPress to create strategic data feeds for archival, while also pointing out many of the limitations of current archival practices with respect to research access. In my opening keynote address to the 2012 International Internet Preservation Consortium at the Library of Congress, I again outlined my own experiences and perspectives as a researcher making use of web archives over many years and the kinds of insight and indicators needed for robust research use. A number of these suggestions were subsequently adopted by the Archive in the form of interface and other changes, offering a case example of what happens when researchers and web archives come together.

In the end, the Internet Archive and its brethren are all we have standing between us and the total and complete loss of our online heritage. They are the only open archives of the dawn of the internet era and as imperfect as they are, they are preserving our collective global heritage in a way that no other organization does, and doing it as a public good without profit or other motivation. There will never be perfect data, certainly not when it comes to archiving the infinite ever-changing landscape of the web, but that doesn’t mean that we cannot come together as a community to help fix the rough edges and try to better understand what the nuances and biases of our collections are so that we can address them. It also doesn’t mean that we can’t make more collaborative decisions that bring together archives and the communities they serve to think about the myriad decisions and tradeoffs that define and shape our archives.

The web is disappearing page by page, character by character, image by image, before our very eyes, even as you read the words on this page. Only by coming together as a community can we ensure that our digital history is preserved and remains accessible to future generations.

This article was written by Kalev Leetaru from Forbes and was legally licensed through the NewsCred publisher network.
