Ten years ago one of the most disruptive modern threats to the corporate data center formally debuted: Amazon Web Services officially launched, first offering S3 and then launching EC2 in August 2006. While far from the first large-scale “cloud” offering, Amazon’s sheer size and scale forever recalibrated the vision of what the cloud could be and the scale it could offer to business. No longer was the cloud just about hosting a few odd web servers – it could be a wholesale replacement for a company’s entire infrastructure. Over the following decade the commercial cloud has transformed the computing landscape, with at least one survey suggesting that as of this year more than 89% of all businesses use the public cloud in some fashion and even the US Government is getting into the cloud spirit. Even Uber is stepping back into the public cloud as it grows internationally.
The size of the commercial cloud is staggering. In 2013 Microsoft announced that it had more than one million servers, while a year later, Google was spending more than $5 billion a quarter on its data centers (half on construction of new centers and half on operating its existing centers). Few companies in the world can afford to spend $20 billion a year on their computing infrastructure, while even the entire US Government combined spends just $81 billion a year across all IT expenditures, with “much of this amount reportedly for operating and maintaining existing (legacy) IT systems” including mission critical computing systems some of which are more than half a century old.
Even the National Science Foundation-funded supercomputing network that powers American academic research struggles to compete with this scale of investment. The National Center for Supercomputing Applications (NCSA) today calls its nearly $200 million NSF-funded Blue Waters supercomputer “one of the most powerful supercomputers in the world” and emphasizes that it is “the fastest supercomputer on a university campus.” When the machine was delivered in 2012 it had 25 petabytes of disk storage, offering one of the largest storage arrays available for academic research. Yet, the same year that Blue Waters came online, a team of Google MapReduce engineers borrowed one of Google’s new clusters that had just been delivered and sorted a 50PB dataset just to see how long it would take.
In short, the same year the most powerful university-housed supercomputer came online, a couple of Google engineers borrowed just one of Google’s myriad clusters for a few days to run a quick benchmark, and the machine they borrowed had more than twice the total storage of the most powerful supercomputer available for academic research. To make things even more interesting, 2012 was also the year that Google publicly launched its commercial cloud computing infrastructure, debuting with more than 770,000 cores, of which a single project used more than 600,000 for one hero run. The platform’s core count was nearly double the number in Blue Waters, and it was equipped with more than 2.8 petabytes of RAM, also double that of Blue Waters. Granted, this comparison isn’t quite fair, as Blue Waters offers direct rather than virtual access to cores, couples those cores with GPU accelerators, and provides distributed shared memory and other hardware layers that make for a more cohesive environment than the fully decentralized cluster model of Google’s offering. Yet, for the kinds of workloads increasingly powering the “big data” era, the comparison is an apt one.
Indeed, at the IEEE eScience conference in Fall 2012 I argued that the NSF computing model needed to fundamentally change in the face of the exponentially growing capabilities of the commercial cloud, especially as they relate to the unique needs of “big data” research.
In addition, unlike monolithic HPC systems, in which the failure of any critical component will often cause the entire system to go down, the cloud is built on the core concept of abstraction from hardware, meaning virtual computers transparently migrate to new hardware as failures occur. In the production big data world, stability and robustness to failure are critical factors in large system design.
Of course, keep in mind that those 770,000 cores and 2.8 petabytes of RAM only counted the processing power Google made available through its commercial offering, which represented only a minuscule fraction of its sum total computing infrastructure. It is hard for any company to match the offerings of an enterprise that spends $20 billion a year on its data centers.
Yet, perhaps what makes the offerings of this new generation of the commercial cloud so different from the past is that historically, hosted computing environments tended to specialize exclusively in offering computers for rent. The companies renting the computers simply bought machines and doled them out to customers. In contrast, the Googles, Amazons and Microsofts of the world today are leveraging the very infrastructures that run their own global operations. Instead of buying off-the-shelf machines, stuffing them into racks and renting logins to them, these companies are building some of the world’s most sophisticated data centers, custom designing their own computers and even microchips to power them. Instead of writing software for others to use, the companies are opening access to the tools they themselves use internally. The hardware and software layers that power these data centers are the same lifeblood that runs the companies’ mission critical business. This means that the companies employ massive armies of developers who are constantly improving this infrastructure, making it faster and more powerful by the day, and each of these improvements makes its way in near-realtime to end users.
Few companies have the engineering expertise or money to build their own custom microchips and deploy them in their data centers. Yet, that’s exactly what Google did for machine learning, constructing its own custom ASIC called a Tensor Processing Unit (TPU) to power its cloud machine learning offerings. Now, thanks to the cloud, anyone can leverage this custom hardware to accelerate deep learning algorithms to unimaginable speeds.
Indeed, this is perhaps the other most critical advance of the modern cloud: the standardization of software-as-a-service. The delivery of software services over the web is as old as the Internet itself, and companies like Salesforce, which launched in 1999, helped dramatically popularize it. Yet, in today’s cloud world companies are increasingly identifying key algorithms and applications they’ve developed for in-house use and making those available to customers via cloud-hosted APIs.
In Google’s case, since the launch of its first cloud offerings nearly a decade ago, it has gradually spun off an ever-growing fraction of its internal tools to Google Cloud Platform, making everything from system management to databases to machine learning available via standard APIs. Most recently, as Google has infused deep learning throughout everything the company does, it has made those same algorithms available as cloud offerings. Building accurate deep learning tools requires data: lots and lots of it. Few companies on this earth have access to the kinds of data available to the Googles and Amazons and Microsofts of the world. As an example, in building a photographic geolocation deep learning algorithm, Google drew upon nearly half a billion images culled from the entire web itself.
What really sets apart these new offerings is that cloud companies are increasingly providing external access to the very tools they use internally, rather than building separate tools for external use. The use of internal tools means that these are battle-hardened systems developed over years of real-world application and designed to handle the myriad of special cases you only encounter at scale. Moreover, the cloud nature of these tools means they can be constantly improved effectively in realtime, with new updates pushed multiple times a day if needed. It is hard for any company to match the engineering prowess of the big cloud offerings and the rise of algorithmic APIs means one can even create an entire startup simply by plugging different APIs into each other.
For example, I personally use Google’s BigQuery platform heavily in my work analyzing global human society. BigQuery is essentially a public access interface to Google’s Dremel software, which is “widely used at Google – from search to ads, from YouTube to Gmail – so there’s great emphasis on continuously making Dremel better. BigQuery users get the benefit of continuous improvements in performance, durability, efficiency and scalability, without downtime and upgrades associated with traditional technologies.” In short, when you’re using BigQuery, you’re using the same technology that powers Google itself, meaning Google’s developers are constantly making it faster and more powerful each day. From the standpoint of a user, the system just gets faster and faster without any of the downtime of premises-installed systems like planned upgrades, hardware changeouts or failures, OS upgrades breaking a critical library, system crashes causing cascading failures, etc.
However, perhaps the most powerful aspect of cloud tools like BigQuery is their ability to tap into the effectively infinite computing resources of their owners. In BigQuery’s case, this means that it resides on top of Google’s worldwide infrastructure, offering nearly limitless scaling.
As a simple example, take a regular expression run across a 7TB, 100-billion-row table in just 24 seconds. Assuming a perfectly efficient architecture, this would require at minimum 3,300 cores, 330 100MB/s hard drives and a 330 Gigabit network fabric. Of course, in real life, the actual required hardware numbers are far higher due to inefficiencies. Yet, with a single mouse click, this much hardware comes to bear for just 24 seconds to execute that query. Under the hood, BigQuery is “powered by multiple data centers, each with hundreds of thousands of cores, dozens of petabytes in storage capacity, and terabytes in networking bandwidth.” In short, BigQuery is Google and Google has effectively unlimited computing resources.
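To get a feel for those hardware figures, here is a rough back-of-the-envelope sketch in Python. Every input is taken from the paragraph above; the function names, the decimal units (1 TB = 10^12 bytes) and the even split of work across cores are assumptions made purely for illustration. It estimates how much time each core gets per row and how much data the quoted drives and network could move in 24 seconds.

```python
import math

# Figures quoted in the text (decimal units are an assumption).
ROWS = 100e9                 # 100 billion rows
TABLE_BYTES = 7e12           # 7 TB table
SECONDS = 24                 # query duration
CORES = 3_300
DRIVES = 330
DRIVE_BYTES_PER_S = 100e6    # 100 MB/s per drive
NETWORK_BITS_PER_S = 330e9   # 330 Gigabit fabric


def per_core_budget_us(rows, cores, seconds):
    """Microseconds each core can spend per row, assuming an even split."""
    rows_per_core = rows / cores
    return seconds / rows_per_core * 1e6


def bytes_read(drives, bytes_per_s, seconds):
    """Total bytes the drives can stream in the given time."""
    return drives * bytes_per_s * seconds


def bytes_shuffled(bits_per_s, seconds):
    """Total bytes the network fabric can move in the given time."""
    return bits_per_s / 8 * seconds


def drives_needed(total_bytes, seconds, bytes_per_s):
    """Drives required to stream total_bytes uncompressed in the given time."""
    return math.ceil(total_bytes / seconds / bytes_per_s)


if __name__ == "__main__":
    print(f"per-row budget per core: {per_core_budget_us(ROWS, CORES, SECONDS):.3f} us")
    print(f"drives can read:  {bytes_read(DRIVES, DRIVE_BYTES_PER_S, SECONDS) / 1e9:.0f} GB")
    print(f"network can move: {bytes_shuffled(NETWORK_BITS_PER_S, SECONDS) / 1e9:.0f} GB")
    print(f"drives to read all 7 TB raw: {drives_needed(TABLE_BYTES, SECONDS, DRIVE_BYTES_PER_S)}")
```

Under these assumptions each core has a little under one microsecond per row, which is a plausible budget for a simple regular expression. The quoted drives, however, can stream only about 792 GB in 24 seconds, well short of the full 7 TB. One possible reading, which is an assumption here rather than something the text states, is that the drive count reflects compressed columnar storage in which only the queried column is read.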
[Embedded video: Jordan Tigani’s GCP NEXT 2016 talk on BigQuery]
In the video above from this past March, Google’s Jordan Tigani showcases just how far BigQuery can scale. In his GCP NEXT 2016 talk, Tigani demonstrates live a search of a 1 trillion row dataset totaling just over 1 petabyte in 245 seconds (just over 4 minutes), working out to around 4 terabytes per second. Looking at a day’s worth of the data took just 3.2 seconds with automatic partitioning.
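The throughput figure is easy to check. The short Python sketch below redoes the arithmetic using only the numbers quoted above; the function names and the rounding of the dataset to exactly 1 petabyte (10^15 bytes, decimal units) are assumptions for illustration, so the true rate would be slightly higher.

```python
PETABYTE = 10**15   # bytes, decimal units (assumption)
TERABYTE = 10**12


def scan_rate_tb_per_s(total_bytes, seconds):
    """Average scan rate in terabytes per second."""
    return total_bytes / seconds / TERABYTE


def rows_per_second(rows, seconds):
    """Average number of rows examined per second."""
    return rows / seconds


if __name__ == "__main__":
    # 1 PB and 1 trillion rows scanned in 245 seconds, per the talk as described.
    print(f"{scan_rate_tb_per_s(PETABYTE, 245):.2f} TB/s")
    print(f"{rows_per_second(1e12, 245) / 1e9:.2f} billion rows/s")
```

That works out to roughly 4.08 terabytes and about 4.08 billion rows every second, in line with the "around 4 terabytes per second" cited from the talk.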
Just a single line of SQL code to search a trillion rows and a petabyte of data. While the video doesn’t specify how many cores or hard drives ultimately were required to power that query, that’s actually the point of the cloud: at the end of the day you don’t care if it took 1 machine or 1 million machines – all you want to do is get back an answer, not worry about managing hardware.
Moreover, one can do everything from construct ngrams across tens of billions of words in minutes, perform sentiment analysis at 341 million words per second, build network diagrams from trillions of connections, perform terascale mapping in under a minute, take the first steps towards Psychohistory by modeling the underlying patterns of global society, compute the mathematical formula of compassion fatigue, map 212 years of books, or just perform routine queries at near-realtime speed.
Few companies can afford to spend $20 billion a year on global data centers, or have the cachet to employ some of the best and brightest engineers and thought leaders of the computing world, the ability to build and deploy their own custom microchips, access to training datasets comprising the web itself, or the hardware and personnel capacity to iterate their infrastructure in realtime. Whether you’re talking about Google, Amazon, Microsoft or any of the other cloud vendors, this is the power of the commercial cloud today – trading hardware management for solving problems in realtime at nearly limitless scale.
Disclosure: I am a Google Cloud Expert and make heavy use of Google Cloud products in my work.
This article was written by Kalev Leetaru from Forbes and was legally licensed through the NewsCred publisher network.