S3stat November/December Outage Postmortem

S3stat had an issue that stopped us from being able to run reports during the last 10 days of November. We successfully backfilled all missing data, and were then hit with a related issue that stopped reports for 3 days at the end of December. While we're backfilling that second hole, I'd like to take a moment to explain what happened, what we did about it, and why it shouldn't happen again (for a while at least).

What Happened

We run our Nightly Reporting job using Amazon's EC2 service. We actually use about twelve different AWS services to do our thing, which shouldn't be a surprise considering that we're in the business of processing AWS data. Specifically, we use EC2 Spot Instances for most of our workload. Spot Instances are Amazon's way of selling their unused computing power on the cheap to customers like us who need "lots of computing, but not necessarily a fixed amount" for short periods of time.

We need about 200 hours of computing to run a day's worth of reports, so each morning we ask Amazon to rent us up to 50 of their spare machines for a couple of hours. AWS is big, so they usually do have 50 extra machines lying around for us to grab, but we make a point of starting up some of their standard, full-price machines as well. Just in case.
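Here's the rough arithmetic behind those numbers, as a sketch only. It assumes the work splits evenly across machines (real report jobs won't be that tidy), and the function name is made up for illustration.

```python
def hours_per_machine(total_machine_hours: float, machines: int) -> float:
    """Wall-clock hours needed if the work splits evenly across machines."""
    if machines <= 0:
        raise ValueError("need at least one machine")
    return total_machine_hours / machines


DAILY_WORK_HOURS = 200  # machine-hours for one day of reports (figure from this post)

# Compare a full fleet against progressively smaller ones.
for fleet in (50, 25, 6):
    hours = hours_per_machine(DAILY_WORK_HOURS, fleet)
    print(f"{fleet:>2} machines -> {hours:.1f} hours")
```

Under that even-split assumption, 50 machines get through the day's work in about 4 hours, while a fleet of 6 would need more than 33 hours. Anything smaller than 9 machines can't finish 200 hours of work inside a 24-hour day.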

For the most part, Amazon's cloud stuff is pretty solid. When they deprecate something, they give everybody lots of heads up (which explains why every single one of our customers got scary mails from them in September warning that they were going to deprecate an API endpoint next August, and that we (S3stat) needed to spend 5 minutes sometime in the next year bumping a version of a thing).

So it was a bit of a surprise to discover that they had suddenly run out of Spot Instances to sell us. Our job started taking longer and longer to run, then eventually started failing entirely. The reason was that Amazon would turn on fewer and fewer of those 50 machines each day, and at first we didn't notice. Then it started turning on zero extra machines each day, and we did notice. Because suddenly we had a small handful of machines working their little hearts out all day long to churn through as many reports as they could before the next day's job kicked off and added even more work.

For added fun, because they never finished the previous day, they would never give the "done" signal to turn themselves off, so they'd keep running the next day. Each day we would see another six machines fire up to help, and after enough days there would be enough of them to actually finish a whole day's worth of work. So they'd breathe an exhausted sigh and turn themselves off. Then the whole thing would start again the next day.
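That pile-up-then-catch-up cycle is easy to model. The sketch below is a toy simulation, not our actual scheduler: it takes the 200 hours of daily work and the six new machines per day from this post, and assumes (generously) that every machine does a full 24 hours of useful work per day and that we start with no backlog.

```python
def simulate_backlog(daily_work=200.0, machines_per_day=6,
                     hours_per_day=24.0, days=10):
    """Yield (day, machines_running, backlog_hours_left) for each day."""
    backlog = 0.0
    running = 0
    for day in range(1, days + 1):
        backlog += daily_work        # a new day's reports join the queue
        running += machines_per_day  # a fresh batch of machines starts up
        capacity = running * hours_per_day
        backlog = max(0.0, backlog - capacity)
        yield day, running, backlog
        if backlog == 0.0:
            running = 0              # work finished, so every machine shuts down


for day, running, backlog in simulate_backlog(days=4):
    print(f"day {day}: {running} machines, {backlog:.0f} hours left over")
```

With these inputs the fleet falls 56 hours behind on the first day, catches up on the second with 12 machines running, then shuts down and starts the cycle over. The real cycle was messier, since it depended on how much backlog had already piled up and how much useful work each machine actually got through.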

Anyway, when we figured this out, the quick solution was to stop asking for these (now) flaky Spot Instances and instead just buy full-priced standard machines for the whole workload each day. This got the job running again. We fired up an extra couple hundred of those good machines and pointed them at the backlog of missed work from late November until that was all back up to date. Life was good. Or at least good enough to go on holiday at the end of the year.

The Server Knows When You're On Vacation

One thing you may have noticed about S3stat is that we use the Royal "we" a lot. Because there's not a lot of actual We here. Mostly it's just me, so when We go on holiday, so does the whole company. Normally that's not a big deal because we built this thing to run on its own and not bother we when we's on vacation (which we is a lot). But still, when the thing does break, it does have an uncanny ability to do it when I'm sleeping in a tent in the Sahara desert. Which is exactly what happened this time.

You see, there was a reason that Amazon ran out of Spot Instances of the particular machine type we're using. And if I had been smart enough, I would have caught it a month ago. Amazon appears to have decided not to buy or maintain any more "m1.medium" machines. They've moved on to newer hardware, and are currently provisioning a fleet of "m6" generation machines. The old ones don't get used much anymore, and they're just sort of letting them die organically.

My Spot Instance requests were going unanswered because, most of the time, nearly all the machines of that type were already in use. And only a month later, it turned out that you couldn't even get 50 Standard machines of type "m1.medium". We'd ask for 50 machines, and AWS would respond with "ERROR: We don't have 50 machines to sell you right now. Hold tight while we go buy some more." You could ask for 1, and it'd all go well, but ask for 20 and it'd fail in a way that left you with zero running machines.
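One general way to cope with an all-or-nothing request like that is to retry with smaller batches until something sticks. Here's a minimal sketch of the idea. It isn't taken from S3stat's code, and `launch`, `InsufficientCapacity`, and the fake launcher are hypothetical stand-ins rather than real AWS API calls.

```python
from typing import Callable


class InsufficientCapacity(Exception):
    """Hypothetical error: the provider could not fill the whole request."""


def launch_in_batches(launch: Callable[[int], int], wanted: int) -> int:
    """Try to start `wanted` machines, halving the batch size after each failure.

    `launch(n)` stands in for the real provisioning call. It either starts
    exactly n machines and returns n, or raises InsufficientCapacity and
    starts none.
    """
    started = 0
    batch = wanted
    while started < wanted and batch >= 1:
        size = min(batch, wanted - started)
        try:
            started += launch(size)
        except InsufficientCapacity:
            batch = size // 2  # all-or-nothing failure: ask for fewer next time
    return started


def make_fake_launcher(available: int) -> Callable[[int], int]:
    """Simulated provider with a fixed pool of spare machines (illustration only)."""
    pool = {"free": available}

    def launch(n: int) -> int:
        if n > pool["free"]:
            raise InsufficientCapacity(f"cannot start {n} machines")
        pool["free"] -= n
        return n

    return launch


# Ask for 50 machines when only 13 are actually available.
print(launch_in_batches(make_fake_launcher(available=13), wanted=50))
```

In this simulated run the request for 50 ends up with all 13 available machines instead of none. That beats zero, but it still leaves the job short of machines, so it only softens the underlying problem.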

So the job stopped working again. For essentially the same reason as before. Because Amazon ran out of computers and wasn't going to do anything about it.

Moving Forward

That's a shame and all, but it's also good news. It means that we know what happened. And it suggests an easy fix: Ask for newer machines. That's what we did, and it worked nicely. There's still a bit of work to do in the future to get all the way to the latest and greatest hardware that AWS has to rent, but we're on "m3" now, so it'll take a while before those run out. We should have plenty of time to get upgraded the rest of the way before our plug gets pulled again.

by Jason Kester