When AWS has an outage, people really notice. Obviously the tech community explodes: every company starts tweeting out its downtime status while swarms of developers and ops folks at all of these companies gang-tackle the problem to get their sites back up. Increasingly, though, even non-tech folks notice and are frustrated. During the great Christmas Eve outage of 2012, which took down consumer behemoths such as Netflix, social media was awash in frustrated users. But even the relatively minor outage on Friday caused enough waves for my completely non-technical wife to notice, as some of her favorite websites were down.
While AWS is an incredible service, it has its issues from time to time, so it's best to be prepared. In recent years, Mindflash has made significant strides in preparing for these sorts of events, and during the Friday the 13th outage this paid off.
On Friday, Amazon had a partial outage in one of its availability zones (read: one of its data centers) in Virginia. The machines in that zone could still connect to each other just fine, but the outside world was unable to reach them. At Mindflash we run all of our services in Virginia and had a substantial number of our servers in the affected availability zone, but we managed to get away relatively unscathed. This was largely due to our efforts over the past year to run with redundancy across multiple availability zones.
To give you a bit of background, Mindflash runs about 5 different types of servers:
With a couple of caveats, we run at least two of each type of server. Each type is either load balanced behind ELB, so that traffic is split among the machines of that type, or set up to pull work from a queuing system, so that if one machine goes down it simply stops picking up jobs. We do this both to maintain great performance under heavy traffic and to ensure the site's uptime.
When we set up the machines for each type of server, we make sure they are spread across availability zones. That way, if the machines in one availability zone go offline, we still have at least one working in another. Servers that run off a queue stop taking jobs when they go offline because they can't reach the queue. For load-balanced servers, Amazon's load balancing service, ELB, can tell when a server is offline and automatically stops sending traffic to it. Since only one availability zone went out on Friday, in theory we should have been 100% good to go.
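To make the "spread across availability zones" rule concrete, here is a minimal sketch of a check for it. This is not our actual tooling: the inventory, role names, instance IDs, and zone names are all hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical inventory of (role, instance_id, availability_zone).
# Role names, instance IDs, and zones are made up for this example.
INVENTORY = [
    ("web", "i-web-1", "us-east-1a"),
    ("web", "i-web-2", "us-east-1b"),
    ("worker", "i-worker-1", "us-east-1a"),
    ("worker", "i-worker-2", "us-east-1a"),
    ("cache", "i-cache-1", "us-east-1b"),
]


def zones_by_role(inventory):
    """Map each server role to the set of availability zones it runs in."""
    zones = defaultdict(set)
    for role, _instance_id, zone in inventory:
        zones[role].add(zone)
    return dict(zones)


def roles_at_risk(inventory, min_zones=2):
    """Return the roles that run in fewer than min_zones availability zones."""
    return sorted(
        role
        for role, zones in zones_by_role(inventory).items()
        if len(zones) < min_zones
    )


if __name__ == "__main__":
    print(roles_at_risk(INVENTORY))
```

In this made-up inventory, "worker" has two machines but both sit in the same zone, so it is flagged alongside the single "cache" machine. Running two of something only helps if the two are in different zones.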
Well, we weren't quite. Remember that the outage only affected traffic from outside of Amazon's network. Thus, the load balancers didn't consider the servers in that availability zone offline and kept sending traffic to them. This made for a shaky user experience. If you were lucky, you got one of the servers that was still responding. If you weren't, you saw errors or the page just wouldn't load. Once our monitoring system alerted us to the issue, the fix was easy: we manually removed the machines in the bad availability zone from all of our load balancers.
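The failure mode is easier to see in a toy model. The sketch below is a simplified simulation, not ELB's real behavior or API. It assumes a load balancer that judges health only by what it can see from inside the network, and the `Server` type, its fields, and the instance and zone names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    instance_id: str
    zone: str
    internal_ok: bool  # what a check from inside the network sees
    external_ok: bool  # what a user outside the network sees


def in_rotation(servers):
    """Servers the load balancer keeps routing to (internal check only)."""
    return [s for s in servers if s.internal_ok]


def failing_for_users(servers):
    """Servers still in rotation that outside users cannot reach."""
    return [s for s in in_rotation(servers) if not s.external_ok]


def remove_zone(servers, bad_zone):
    """Manually take every server in bad_zone out of the pool."""
    return [s for s in servers if s.zone != bad_zone]


# Hypothetical pool: zone "us-east-1a" is unreachable from the outside only.
SERVERS = [
    Server("i-web-1", "us-east-1a", internal_ok=True, external_ok=False),
    Server("i-web-2", "us-east-1b", internal_ok=True, external_ok=True),
    Server("i-web-3", "us-east-1a", internal_ok=True, external_ok=False),
    Server("i-web-4", "us-east-1c", internal_ok=True, external_ok=True),
]

if __name__ == "__main__":
    print([s.instance_id for s in failing_for_users(SERVERS)])
    print([s.instance_id for s in in_rotation(remove_zone(SERVERS, "us-east-1a"))])
```

In this model, half of the pool looks healthy to the load balancer while being useless to users. Removing the bad zone by hand leaves only servers that pass both checks.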
There was a positive to this, though. We're not 100% there with our redundancy strategy yet, so there are still a few types of servers for which we don't run two or more. Even though some of those were in the availability zone with the outage, none of them is hit directly by machines outside of Amazon's network. Everything therefore continued to work, since the connections between Amazon's servers were still intact.
What we've learned from this is that our strategy of spreading redundancy across multiple availability zones paid off wherever we've implemented it. There's always more work to do at Mindflash, and we'll continue to improve our site reliability and preparedness for these rare events. However, even in the rare event that Mindflash does experience an outage, you can rest assured that none of your data is lost.