Science Magazine’s first issue of 2016 includes a discussion chronicling how the National Institutes of Health (NIH) is re-exploring how it manages funding for the many biomedical database products it supports. In particular, the NIH National Human Genome Research Institute (NHGRI) is expected to close out its funding of the Online Mendelian Inheritance in Man (OMIM) database, one of the oldest genomic databases that has run continuously for 50 years. What does this mean for the future of scientific big data hosting?
Today the NIH spends more than $110 million a year on its largest 50 databases, excluding those hosted by the National Library of Medicine (NLM). OMIM, supported by NHGRI, costs $2.1 million a year and draws more than 300,000 unique users a month and 23 million page views a year, while the Gene Ontology Consortium draws 36,000 users a month at a cost of $3.7 million a year. Databases like OMIM in particular have become critical standard reference databases used in both research and clinical diagnosis, leaving key questions about how to support such heavily used resources. One recommendation has been to convert them into paid subscription services, the model adopted by The Arabidopsis Information Resource (TAIR) after NSF ended its funding.
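To put those figures side by side, here is a rough back-of-the-envelope sketch in Python that divides each database's annual cost by its audience. The dollar and traffic numbers come from the paragraph above; the metric itself (annual cost per monthly unique user) is only an illustrative assumption, not a measure used by NIH, and because the OMIM figures are quoted as "more than," its results are best read as upper bounds.

```python
# Figures quoted in the article; the per-user metric is an illustrative
# assumption, not an official NIH measure.
databases = {
    "OMIM": {"annual_cost_usd": 2_100_000, "monthly_unique_users": 300_000},
    "Gene Ontology": {"annual_cost_usd": 3_700_000, "monthly_unique_users": 36_000},
}
OMIM_ANNUAL_PAGE_VIEWS = 23_000_000


def annual_cost_per_monthly_user(annual_cost_usd, monthly_unique_users):
    """Annual funding divided by the number of unique users in a typical month."""
    return annual_cost_usd / monthly_unique_users


cost_per_user = {
    name: annual_cost_per_monthly_user(**figures)
    for name, figures in databases.items()
}
omim_cost_per_page_view = (
    databases["OMIM"]["annual_cost_usd"] / OMIM_ANNUAL_PAGE_VIEWS
)

for name, cost in cost_per_user.items():
    print(f"{name}: ${cost:,.2f} per monthly unique user per year")
print(f"OMIM: ${omim_cost_per_page_view:.3f} per page view")
```

By this crude measure OMIM comes to about $7 per monthly user per year and roughly nine cents per page view, while the Gene Ontology Consortium comes to a little over $100 per monthly user. The two resources serve different purposes, so the comparison says nothing about their relative value, only about how differently the cost is spread.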
Much has been written about the big data explosion in the biomedical world, especially genomics, which is expected to yield as much as 40 exabytes of data by 2025, outpacing even YouTube’s storage requirements. Why, one might ask, does it really matter then whether a small handful of databases have to switch from subsidized free access to a cost-recovery model, in a world where biomedical data is on a path to consume what some estimate may be 20 times the needs of YouTube in just the next decade?
The answer lies in the question of how we manage the firehose of data emerging from academic research more broadly. Beginning in 2011, the National Science Foundation has required all grantees to “share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants” and to “encourage and facilitate such sharing.” Awards from some NSF directorates require data to be preserved and made widely available for a minimum of three years after publication and in some cases for substantially longer. This raises the question of who pays for all of this data archival and sharing.
While an increasing number of academic institutions offer centralized institutional repository systems, few are designed for or capable of handling the kinds of massive multi-terabyte datasets being generated by the new era of “big data” research. In fact, in my own personal experience of using NSF-supported supercomputing resources over more than a decade, storage was the single most difficult resource to secure. Thousands or tens of thousands of processors could be secured readily and with minimal delay, but requesting the equivalent of just a few terabytes of disk was an enormous undertaking, and making such content globally available to other researchers over high-speed networks was extraordinarily difficult.
Much of today’s academic High Performance Computing (HPC) infrastructure was built for the era of computation-intensive scientific simulation and modeling, which placed an emphasis on computational capacity, rather than storage, IO capabilities, and commodity network access. While this is slowly changing, storage is still one of the most precious resources in the academic environment and the hardware and software environments are rarely optimized for the kinds of “big data” needs of the new era of research.
On the other hand, commercial cloud platforms like Google Cloud Platform, Amazon Web Services, and their numerous brethren are custom designed for the “big data” era. Single datasets ranging into the multiple petabytes or with tens of trillions of records can be analyzed in near-realtime with systems specially built for this class of research.
Economies of scale mean the companies are able to offer environments and pricing difficult to match in the academic environment, with datasets mirrored across the world and with direct connections to internet backbones. The latest Google Cloud Storage pricing lists a cost of just $26 for one terabyte per month, while Amazon’s S3 platform is around $30/month, both of which include the full costs of RAID redundancy, power, cooling, facilities space, backups, hardware maintenance, and 24/7 system administration by a dedicated team of engineers.
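As a rough illustration of what those list prices imply, the Python sketch below estimates the cost of keeping a dataset online for the three-year minimum that some NSF directorates require. The per-terabyte prices are the ones quoted above; the 10-terabyte dataset size is a hypothetical example, and the flat-rate calculation is a simplifying assumption that ignores data transfer charges, request fees, tiered pricing, and price changes over time.

```python
# Per-terabyte monthly prices as quoted in the article (approximate).
PRICE_PER_TB_MONTH_USD = {
    "Google Cloud Storage": 26,
    "Amazon S3": 30,
}

DATASET_TB = 10        # hypothetical dataset size, chosen for illustration
RETENTION_MONTHS = 36  # three years, the minimum retention cited for some NSF awards


def storage_cost_usd(size_tb, months, price_per_tb_month):
    """Flat-rate storage cost: size x duration x monthly price per terabyte."""
    return size_tb * months * price_per_tb_month


costs = {
    provider: storage_cost_usd(DATASET_TB, RETENTION_MONTHS, price)
    for provider, price in PRICE_PER_TB_MONTH_USD.items()
}

for provider, cost in costs.items():
    print(f"{provider}: ${cost:,} to store {DATASET_TB} TB for {RETENTION_MONTHS} months")
```

Under these assumptions, three years of hosting for a 10-terabyte dataset comes to roughly $9,400 to $10,800 in storage charges alone.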
When it comes to analyzing all that data, researchers can instantly spin up dedicated clusters tailored and purpose-built for a specific task, using them only for the exact time needed, and performing analyses on-demand, rather than waiting in a traditional batch queue for hours or even days at a time. Datasets can be shared with the world using the same internet backbone connections that power Google and Amazon, allowing datasets to be streamed in realtime at nearly linear scaling, anywhere in the world.
Moreover, both Google and Amazon offer specialized services for genomics research, with a complete human genome costing just $3-5 a month to host. In fact, in a nod to the tremendous potential of the commercial cloud for biomedical research, NIH recently collaborated with both Google and Amazon to house copies of the 1000 Genomes Project in their respective clouds free of charge. Commercial clouds are vastly more secure than most university computing networks, meet the requirements of regulations like HIPAA, and offer nearly limitless scaling. As the Alzheimer’s Disease Sequencing Project leader put it, “On the local university server it might take months to run a computationally-intense [analysis] … On Amazon it’s, ‘how fast do you need it done?’, and they do it.”
When it comes to long-term preservation and ensuring that these datasets remain available years and decades into the future, one could imagine a role for the Internet Archive, which today preserves more than 20 petabytes of the web, television, books, music, imagery, and software, having archived and preserved for posterity the open web for almost two decades.
In the end, as the academic enterprise moves towards a future ever more entwined with the world of big data, it faces new challenges in supporting the contemporary needs and long-term preservation of data-intensive research, and in doing so offers a powerful new application area for the cloud.
This article was written by Kalev Leetaru from Forbes and was legally licensed through the NewsCred publisher network.