Amazon Web Services (AWS) delivers a set of services that together form a reliable, scalable, and inexpensive computing platform 'in the cloud'. These pay-as-you-go cloud computing services include Amazon S3, Amazon EC2, Amazon SimpleDB, Amazon SQS, Amazon FPS, and others.
The promise of network block storage is wonderful: take a familiar abstraction (the disk), sprinkle on some magic cloud pixie dust so that it’s completely reliable, make it available over the same cheap network you’re using for app traffic, map it to any instance in a datacenter regardless of network topology, make it so cheap it’s practically free, and voila, we can have our cake and eat it too! It’s the holy grail that many a storage vendor, most with decades of experience in storage systems and engineering teams thousands strong, has chased for a long, long time. The disk that never dies. The disk that’s not a disk.
The reality, however, is that the disk has never been a great abstraction, and the long history of crappy implementations has meant that many behavioral workarounds have found their way far up the stack. The best-case scenario is that a disk device breaks and it’s immediately catastrophic, taking your entire operating system down with it. Failure modes go downhill from there. Networks have their own set of special failure modes too. When you combine the two, so that the disk you depend on is sitting on the far side of the network from your operating system, you get a combinatorial explosion of complexity.
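One concrete piece of that explosion: a local disk tends to fail fast with an I/O error, while a network-backed disk can simply hang while the link is congested or partitioned. Here’s a minimal sketch of the defensive pattern this forces on callers, with read_block() as a hypothetical stand-in for whatever actually talks to the device:

```python
import concurrent.futures

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def read_block(device: str, offset: int, length: int) -> bytes:
    # Hypothetical stand-in: issue the actual read against the device.
    with open(device, "rb") as dev:
        dev.seek(offset)
        return dev.read(length)

def read_block_with_deadline(device, offset, length, timeout_s=5.0):
    future = _pool.submit(read_block, device, offset, length)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # A local disk rarely trips this; a network disk behind a congested
        # or partitioned link can hang here forever. The caller at least
        # gets an error it can act on instead of a stalled system.
        raise IOError(f"read of {device} exceeded {timeout_s}s deadline") from None
```

Note that the worker thread behind a timed-out read is still stuck on the device; papering over that is exactly the kind of behavioral workaround that creeps far up the stack.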
Fascinating piece on the perils of the disk abstraction. It raises a very good question: why do we worry about disks at all in the cloud? I wonder how many folks would be tossing data into the cloud at all without the comfy metaphor of disk and machine to lean on?
Let’s think of a failure mode here: network congestion starts making your block storage environment think that it has lost mirrors, resilvering kicks off, file systems that don’t even know what they’re actually sitting on start to groan in pain, and your systems start thinking that you’ve lost drives, so remediation ramps up at every level, from the infrastructure service all the way to the “automated provisioning, burning-in, tossing-out” scripts. Programs start rebooting instances to fix the “problems”, but those instances boot off of the same block storage environment.
You have a run on the bank. You have panic. Of kernels. Of language VMs. You have a loss of trust, so you check and check and check and check, but the checking causes more problems.
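The standard damper for this run-on-the-bank dynamic is to make every retry and health-check path back off exponentially with jitter, so thousands of panicked checkers don’t hammer the storage tier in lockstep. A minimal sketch, assuming a caller-supplied check callable (volume_is_healthy below is hypothetical):

```python
import random
import time

def check_with_backoff(check, max_attempts=8, base_s=0.5, cap_s=60.0):
    """Retry a health check with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        if check():
            return True
        # Full jitter: sleep a random amount up to the capped exponential
        # bound, so a fleet of checkers desynchronizes instead of stampeding.
        time.sleep(random.uniform(0, min(cap_s, base_s * 2 ** attempt)))
    return False

# e.g. check_with_backoff(lambda: volume_is_healthy(vol))  # hypothetical probe
```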
Closing in on 36 hours since this meltdown began, Amazon has still not been able to restore all of the EC2 instances and EBS volumes that were knocked offline in the #SkynetMassacre. This article is the best explanation of what most likely happened. And the scary part is that it will happen again. And again.
Sadly, there is not a lot to do but try to build enough redundancy into your systems to survive this sort of thing. But it is likely that building that redundancy is going to bring about another meltdown at some point. Guess I’ll just need to keep thinking about how to deal with this sort of thing.
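For what it’s worth, one redundancy pattern that survives a whole storage tier misbehaving is to write through to several independent backends (ideally in different failure domains) and succeed on a quorum rather than on all of them. A minimal sketch, where stores is a hypothetical list of objects exposing put(key, value):

```python
def quorum_write(stores, key, value, quorum=None):
    """Write to independent backends; succeed once a quorum has acked."""
    if quorum is None:
        quorum = len(stores) // 2 + 1
    acks, errors = 0, []
    for store in stores:
        try:
            store.put(key, value)
            acks += 1
        except Exception as exc:
            # One backend failing is expected here, not fatal.
            errors.append(exc)
        if acks >= quorum:
            return True
    raise IOError(f"only {acks}/{len(stores)} stores acked (need {quorum}): {errors}")
```

The catch, as noted above, is that the replication machinery itself (re-mirroring, resilvering, retry storms) can become the next meltdown, so whatever repairs the redundancy needs to be rate-limited too.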