Amazon’s AWS servers went down yesterday, taking down much of the internet with them. Amazon is the biggest provider of cloud services in the world, and the outage meant that several prominent websites were unavailable for hours. It led to many millions of dollars in losses, and much panic among the programming community.
Turns out it was all because of a typo.
Amazon has said that at 9:37 AM PST, an authorized team member executed a standard command to remove a small number of servers for one of its subsystems that was used for the billing process. “Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended,” says its official statement.
The wrongly executed command took down two major S3 subsystems, which power a large number of sites across the internet. And this employee couldn’t have had worse luck – the domino effect caused by taking down these two subsystems also meant that Amazon’s own status page wasn’t getting updated, so Amazon kept telling its users its services were up when they clearly were not.
The outage apparently took so long to be fixed because these servers had not been completely restarted for many years, and the restart process took “longer than expected”.
This doesn’t look good for the company, which claims to provide 99.999% availability as a minimum standard on its cloud services. But Amazon says that it’s making amends – it’s making changes to S3 so that its systems can recover more quickly. It’s also changing the dependencies of its health dashboard, saying that the dashboard is now spread across multiple geographies. And most importantly, it’s declaring war on typos, by setting up an automated process to check if excess capacity has been suddenly removed. “This will prevent an incorrect input from triggering a similar event in the future,” the company said.
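For the curious, here is a rough idea of what such a check might look like. This is only a sketch and not Amazon’s actual tooling: the names `Subsystem`, `check_removal` and `CapacityGuardError`, the `min_required` floor and the 5% per-command limit are all assumptions made up for illustration.

```python
from dataclasses import dataclass


class CapacityGuardError(Exception):
    """Raised when a removal request would cut too deep into a subsystem."""


@dataclass
class Subsystem:
    name: str
    active_servers: int
    min_required: int  # hypothetical floor needed to keep serving traffic


def check_removal(subsystem: Subsystem, requested: int,
                  max_fraction: float = 0.05) -> int:
    """Return how many servers would remain, or raise if the request is unsafe.

    Both limits are illustrative assumptions, not Amazon's real thresholds.
    """
    if requested <= 0:
        raise ValueError("requested must be a positive number of servers")

    remaining = subsystem.active_servers - requested

    # Guard 1: never drop below the subsystem's minimum required capacity.
    if remaining < subsystem.min_required:
        raise CapacityGuardError(
            f"{subsystem.name}: removing {requested} servers would leave "
            f"{remaining}, below the minimum of {subsystem.min_required}"
        )

    # Guard 2: refuse to remove a large share of capacity in one command.
    if requested > subsystem.active_servers * max_fraction:
        raise CapacityGuardError(
            f"{subsystem.name}: {requested} servers exceeds the per-command "
            f"limit of {max_fraction:.0%} of active capacity"
        )

    return remaining


if __name__ == "__main__":
    billing = Subsystem(name="billing", active_servers=200, min_required=150)
    print(check_removal(billing, 5))   # a small, intended removal
    try:
        check_removal(billing, 80)     # the "typo" case
    except CapacityGuardError as err:
        print(f"refused: {err}")
```

With a guard along these lines in front of the command, a mistyped number would produce an error message rather than an outage.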
But that’ll come as little solace to the Amazon employee who’s presumably having the worst week of their life. It’s not every day you alone manage to take down half the internet.