Last week Amazon had a major outage in an East Coast data center that took down many businesses and some federal offices. The outage was apparently caused by an employee who, while debugging a billing system, accidentally took more servers offline than intended.
Still more servers went offline in a chain reaction, and all of them had to be restarted. During this time S3, Amazon's cloud storage service, was unable to service customer requests, so data stored there was inaccessible. Following are some of my observations, as well as those of some other industry observers.
Customers of cloud services such as Amazon S3 can reduce their exposure to an outage in one regional data center by running additional instances of their application in other regional data centers of the same cloud or, perhaps even better, in an unrelated cloud (for instance, in a Google cloud located on the US West Coast as well as an Amazon cloud located on the US East Coast).
As Chuck Dubuque, VP of Product and Solution Marketing at Tintri, said, “In the financial markets, investors protect themselves from volatility by diversifying. The same might hold true for companies and organizations that rely on the cloud.”
This can be part of an organization’s disaster recovery plan. Keeping data and services as diverse as possible, in both data center location and management, can protect against this sort of failure, but it generally increases the total cost of cloud services, since you are effectively multiplying the number of cloud services that you keep available.
Paul Zeiter, President of Zerto, said, “Business and IT leaders are getting ahead of the curve by carefully crafting their hybrid cloud strategies – one that gives them multiple layers of infrastructure redundancy protection – to achieve IT resilience that keeps critical business operations seamlessly moving forward. This is possible using a combination of multiple cloud types for recovery including public and private clouds, and managed to ensure any disruption is quickly remediated in a manner that is imperceptible to customers.”
Keeping a storage capability on in-house hardware is another approach, but this too comes at additional cost. Many smaller organizations cannot easily afford such additional services or their own hardware. The biggest obstacle to recovering quickly from an outage like this one is access to stored data.
Geoff Barrall, COO of Nexsan, commented that “The turmoil caused by the AWS S3 outage shows just how vital reliable data access is. With so many businesses utilizing a connected workforce, constant access to data is necessary to keep operating. Any amount of downtime costs businesses time and money and can be more easily managed if data is kept within an organization’s own IT infrastructure. With sophisticated file, sync and share capabilities, private cloud solutions can offer the flexibility that a connected workforce needs, with the security and control of on premises data storage.”
Perhaps the easiest and least expensive way to speed recovery is to keep data stored in more than one place, with instances of the required apps available on each service (although not activated at the backup location).
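To make the idea of keeping data in more than one place concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of how any particular cloud service works: `InMemoryStore`, `ReplicatedStore`, and `StoreUnavailable` are hypothetical names invented for this example, and the in-memory store simply stands in for a real storage service such as a cloud region, a second cloud, or in-house hardware.

```python
class StoreUnavailable(Exception):
    """Raised when a store (or every store) cannot serve a request."""


class InMemoryStore:
    """Hypothetical stand-in for one storage service.

    In practice this could be a cloud region, an unrelated cloud,
    or in-house hardware; here it is just a dictionary.
    """

    def __init__(self, name):
        self.name = name
        self.available = True  # flip to False to simulate an outage
        self._objects = {}

    def put(self, key, data):
        if not self.available:
            raise StoreUnavailable(self.name)
        self._objects[key] = data

    def get(self, key):
        if not self.available:
            raise StoreUnavailable(self.name)
        return self._objects[key]


class ReplicatedStore:
    """Writes each object to every store and reads from the first one that responds."""

    def __init__(self, stores):
        if not stores:
            raise ValueError("at least one store is required")
        self.stores = list(stores)

    def put(self, key, data):
        """Return the names of the stores that accepted the write."""
        written = []
        for store in self.stores:
            try:
                store.put(key, data)
                written.append(store.name)
            except StoreUnavailable:
                # Assumption: a real system would queue this write for later repair.
                pass
        if not written:
            raise StoreUnavailable("no store accepted the write")
        return written

    def get(self, key):
        """Return (data, name of the store that served it)."""
        for store in self.stores:
            try:
                return store.get(key), store.name
            except StoreUnavailable:
                continue  # fall through to the next copy
        raise StoreUnavailable("all stores are unavailable")


if __name__ == "__main__":
    primary = InMemoryStore("east-primary")
    backup = InMemoryStore("west-backup")
    replicated = ReplicatedStore([primary, backup])

    replicated.put("invoice-001", b"example bytes")
    primary.available = False  # simulate the primary location going down
    data, served_by = replicated.get("invoice-001")
    print(served_by)
```

The design choice worth noting is that reads fall through to the next copy only when a store is unavailable, so the backup location stays idle until it is needed, much like the standby apps described above.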
Businesses rely on stored data to operate. The Amazon outage, and similar outages at other cloud services, show that these services aren’t perfect, but then none are. Companies must balance the cost of private or public cloud-based redundancy against the cost of lost business when their main storage is out of order.
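That balance can be roughed out with simple arithmetic. The Python sketch below is only an illustration: the function names are hypothetical, the model assumes lost business scales linearly with hours of downtime, and the figures in the usage lines are made-up placeholders, not data from the Amazon outage or from any real company.

```python
def expected_outage_cost(outage_hours_per_year, loss_per_hour):
    """Expected yearly cost of lost business, assuming loss scales linearly with downtime."""
    return outage_hours_per_year * loss_per_hour


def break_even_outage_hours(redundancy_cost_per_year, loss_per_hour):
    """Hours of avoided downtime per year at which redundancy pays for itself."""
    if loss_per_hour <= 0:
        raise ValueError("loss_per_hour must be positive")
    return redundancy_cost_per_year / loss_per_hour


def redundancy_pays_off(redundancy_cost_per_year, outage_hours_per_year, loss_per_hour):
    """True if the lost business avoided exceeds the yearly cost of redundancy.

    Assumption: the redundant setup avoids all of the expected downtime.
    """
    avoided = expected_outage_cost(outage_hours_per_year, loss_per_hour)
    return avoided > redundancy_cost_per_year


if __name__ == "__main__":
    # Placeholder figures chosen only to show the calculation.
    print(break_even_outage_hours(20_000, 10_000))   # 2.0
    print(redundancy_pays_off(20_000, 4, 10_000))    # True
    print(redundancy_pays_off(20_000, 1, 10_000))    # False
```

A smaller organization with a low hourly loss will see a high break-even figure, which is one way to read the earlier point that not everyone can justify a second cloud or in-house hardware.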