Service Outage Remedies and Post-Mortem

We’re planning a short downtime tonight (4/30/2013 at 10pm PT) and a longer maintenance window over the weekend (5/4 from 5-8pm PT) to address some of the root causes of our recent downtime.

We’ll update this blog post with a much more detailed post-mortem, but here is the six-word summary: an AWS instance went to lunch. That left a process in our monitoring system stuck waiting on a kernel lock that it couldn’t time out of.

During the longer maintenance window over the weekend, we’ll upgrade our monitoring system and roll out a much-needed upgrade to our app-server architecture so that we can handle these failures much more gracefully.

2 Comments

  • timdorr May 13, 2013 at 8:54 pm

    Any progress on the more detailed update?

    • Leo Cheng May 18, 2013 at 2:25 am

      Hi Tim, we’ve just posted an update today. Please feel free to take a look; it has a lot more detail on what happened and what we’re doing about it.
