A giant outage of Amazon Web Services and other recent accidents offer an opportunity to reflect on the perilous architecture of the web.
Outages at critical services across the net have caused massive disruption in recent weeks: Whether it was Amazon’s S3 service failure, which took down thousands of sites, Cloudflare’s “Cloudbleed” security issue, which forced many sites to ask users to reset their passwords, or Google Wifi’s accidental reset, which wiped out customers’ internet profiles, the infrastructure behind the internet has looked substantially more unstable.
The packetized technology that underlies most of the internet was created by Paul Baran as part of an effort to protect communications by moving from a centralized model of communication to a distributed one. While the Internet Society questions whether the creation of the internet was in direct response to concerns about nuclear threat, it clearly agrees that “later work on Internetting did emphasize robustness and survivability, including the capability to withstand losses of large portions of the underlying networks.”
From there, the foundation was laid for an internet that treated the distributed model as a key component to ensuring reliability. Almost 50 years later, consolidation around hosting and mobile and the development of the cloud have created a model that increases concentration among a few key players: Amazon, Microsoft, and Google now host a large number of sites across the web. Many of those companies’ customers have opted to host their infrastructure in a single set of data centers, potentially increasing the frailty of the web by re-centralizing large portions of the net.
That’s what happened when Amazon’s S3 service, essentially a large hard drive used by companies like Spotify, Pinterest, Dropbox, Trello, Quora, and many others, lost one of its data centers on Tuesday morning. The problem began around 9:37 a.m. Pacific, the company later explained, after an employee tried to fix a problem with S3’s billing system: “an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers… Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”
Companies that had content stored in those sets of servers, located in Northern Virginia, essentially stopped functioning properly, prompting experts to recommend that companies look at storing data across multiple data centers to increase reliability. The failure rippled across Amazon’s other services, many of which depend upon S3, leading to “increased error rates” for sites that rely on AWS, and making engineers’ efforts at recovery that much more difficult. Even the webpage Amazon uses to alert customers to outages was affected.
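What storing data across multiple data centers means in practice can be shown with a minimal sketch. It is an illustration under stated assumptions, not how S3 or any real provider works: the `Region` and `ReplicatedStore` classes are hypothetical in-memory stand-ins, and the region names are used only as labels.

```python
class RegionUnavailable(Exception):
    """Raised when a (simulated) data center cannot serve a request."""


class Region:
    """Hypothetical in-memory stand-in for one data center's object store."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self._objects = {}

    def put(self, key, value):
        if not self.online:
            raise RegionUnavailable(self.name)
        self._objects[key] = value

    def get(self, key):
        if not self.online:
            raise RegionUnavailable(self.name)
        return self._objects[key]  # raises KeyError if this copy is missing


class ReplicatedStore:
    """Writes every object to all reachable regions; reads from the first that answers."""

    def __init__(self, regions):
        self.regions = list(regions)

    def put(self, key, value):
        written = 0
        for region in self.regions:
            try:
                region.put(key, value)
                written += 1
            except RegionUnavailable:
                continue  # skip the failed data center, keep writing elsewhere
        if written == 0:
            raise RegionUnavailable("no region accepted the write")
        return written

    def get(self, key):
        for region in self.regions:
            try:
                return region.get(key)
            except (RegionUnavailable, KeyError):
                continue  # fall through to the next copy
        raise RegionUnavailable(f"no reachable region holds {key!r}")


# Illustrative usage: the labels below are assumptions, not a real deployment.
virginia = Region("us-east-1")
oregon = Region("us-west-2")
store = ReplicatedStore([virginia, oregon])
store.put("cover.jpg", b"image bytes")

virginia.online = False          # simulate losing one data center
print(store.get("cover.jpg"))    # still served from the surviving copy
```

The trade-off is the one the article describes: every object is stored and paid for more than once, in exchange for reads that survive the loss of any single location.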
On a different end of the spectrum, other services intended to provide reliability in the event of an outage or an attack have been experiencing their own issues. Cloudflare, which provides security and hosting services for thousands of websites, revealed last week that it had discovered a security bug that could leak passwords from the sites of its customers, including companies like Betterment, Medium, Uber, and OkCupid. Thousands of companies were forced to ask their customers to change their passwords and make an assessment as to the potential security impact this would have on their overall infrastructure.
While those issues may only be fixed by the owners of the respective sites, the problem of centralization is slowly creeping into the realm of the millions of people who rely upon these services. People using Google Wifi and Google Chromecast found themselves forced to reinstall their systems last week as a bug wiped out centralized configuration files for many of those devices, forcing them offline for a period of time.
As more people and more devices get connected to the internet, the lure of centralizing control—which makes it easier for companies to manage them—is bumping its head against the initial design of the internet: to drive reliability and scalability. With every new largely centralized system that comes online, the internet becomes more brittle, as centralization creates an increased number of single points of failure. In a world where hackers are looking for new ways to take down infrastructures, those centralized services must double down on increasing security and reliability if we want the internet to survive.
Startups relying on standardized infrastructures can go to market faster and more cheaply, but complete reliance on a single set of servers is akin to building a castle on a swamp. While companies like Amazon, Microsoft, Google, and others have a responsibility to ensure the infrastructures they provide remain stable, it is important for any company to consider how to best balance their offerings across different data centers and how to adapt in case of failures.
Unfortunately, that is not what the large cloud providers want you to do. While Adrian Cockcroft, vice president for cloud architecture strategy at Amazon Web Services, acknowledged that many big corporate customers like to split their business among multiple cloud providers as a risk mitigation strategy, he encouraged them to steer most of their business to a single favorite (such as AWS), in order to obtain the best discounts and minimize the need for duplicate training of their own information-technology employees. In a world where Amazon is increasingly becoming a core part of the internet’s infrastructure, it makes sense for the company to push for centralization on its own servers, but such an effort could lead to further problems.
Amazon pledged more fragmentation and decentralization, in order to keep future failures from spreading too fast. “As S3 has scaled, the team has done considerable work to refactor parts of the service into smaller cells to reduce blast radius and improve recovery,” the company wrote in its explanation of this week’s failure. “The S3 team had planned further partitioning of the index subsystem later this year. We are reprioritizing that work to begin immediately.”
The challenges presented in these recent outages are nothing new to the internet, and many of the smarter companies have taken lessons from history and built their offering in a way that ensures reliability and stability. For example, while many companies were flailing because of this week’s S3 failure, Netflix, one of the poster boys for Amazon services, was fine. In 2012, the company suffered from a major outage and learned its lesson. It built a set of tools to ensure that content keeps streaming even if the underlying data centers go dark and created a bunch of programs called “the Simian Army” to disrupt its own services.
Having successfully proven them to work, the company has open-sourced that software so anyone can use and improve it. Even companies whose websites didn’t fail on Tuesday would be wise to take advantage of it and similar ideas if they hope to avoid the next cloud catastrophe.
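The idea of deliberately disrupting your own services can be sketched in a few lines. This is not Netflix’s software, only a toy model of the principle: every class and function name here is hypothetical, and the “instances” are plain objects rather than real servers.

```python
import random


class Instance:
    """Hypothetical stand-in for one server behind a service."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class Service:
    """Routes each request to the first instance that is still up."""

    def __init__(self, instances):
        self.instances = list(instances)

    def handle(self, request):
        for instance in self.instances:
            try:
                return instance.handle(request)
            except ConnectionError:
                continue  # try the next instance
        raise ConnectionError("no healthy instance left")


def disrupt(service, rng):
    """Take down one randomly chosen live instance; return it (or None)."""
    alive = [i for i in service.instances if i.alive]
    if not alive:
        return None
    victim = rng.choice(alive)
    victim.alive = False
    return victim


def survives_single_failure(service, rng, request="GET /"):
    """Knock out one instance, check the service still answers, then restore it."""
    victim = disrupt(service, rng)
    try:
        service.handle(request)
        return True
    except ConnectionError:
        return False
    finally:
        if victim is not None:
            victim.alive = True


rng = random.Random(2017)  # fixed seed so the experiment is repeatable
redundant = Service([Instance("a"), Instance("b"), Instance("c")])
fragile = Service([Instance("only")])
print(survives_single_failure(redundant, rng))
print(survives_single_failure(fragile, rng))
```

A service with spare instances passes the check no matter which one is knocked out, while a service with a single instance fails it every time, which is the kind of weakness such experiments are meant to expose before a real outage does.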