On April 30th this year, we experienced a service outage for about an hour in the afternoon. We restored our service around 5:20pm Eastern Time.
The cause of this outage was an AWS instance that stopped running, which normally shouldn’t lead to a long outage. Unfortunately, the failure left a process in our monitoring system stuck on a kernel lock that it couldn’t time out of, and as a result no alerts were delivered to our staff.
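To make that failure mode concrete, here is a minimal Python sketch of a monitoring loop that cannot be silenced by a single stuck check. It is an illustration of the general idea, not the actual Tddium monitoring code: the names `run_check_with_deadline` and `monitor_once`, the `alert` callback, and the deadline values are all assumptions made for the example. Each check runs in a daemon thread, and the monitor stops waiting after a deadline and treats the missed deadline as an alert in its own right.

```python
import threading


def run_check_with_deadline(check, deadline_seconds):
    """Run check() in a daemon thread and return 'ok', 'failed', or 'timeout'.

    The caller never waits longer than deadline_seconds, even if check()
    blocks forever. The stuck thread is abandoned, not killed.
    """
    result = {}

    def worker():
        try:
            result["status"] = "ok" if check() else "failed"
        except Exception:
            # A check that raises counts as a failed check.
            result["status"] = "failed"

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(deadline_seconds)
    if thread.is_alive():
        return "timeout"
    return result["status"]


def monitor_once(checks, alert, deadline_seconds=5.0):
    """Run every check; call alert(name, status) for each one that is not 'ok'."""
    statuses = {}
    for name, check in checks.items():
        status = run_check_with_deadline(check, deadline_seconds)
        statuses[name] = status
        if status != "ok":
            alert(name, status)
    return statuses
```

The design choice that matters is that a hung check becomes a `timeout` status that still triggers an alert, instead of blocking the alerting path. A thread stuck in the kernel can’t be cleaned up from user space, so a real deployment would probably also want an external dead-man’s switch watching the monitor itself.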
The instance was in a “zombie” state: it displayed as running in the AWS management console, but consistently failed its AWS health checks, even across reboots. The only way we’ve found to recover an instance in this state is to “stop” it and “start” it again. On boot, there was nothing significant in the instance’s system logs to indicate an obvious problem. One moment it was running and the next it wasn’t. Our theory, and unfortunately it’s still only a theory, is that some bit of AWS low-level networking fabric goes AWOL. We see this DOA situation on a regular basis with our build worker pool, where we start and stop instances much more frequently, and we’ve come to accept it as a cost of doing business in AWS. Nodes fail, and the system must survive. Still, we’re all ears for a more informed diagnosis.
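For readers who hit the same “zombie” state, the stop-and-start recovery could be scripted roughly as follows. This is a sketch under assumptions, not the tooling used during the outage: it assumes an EC2 client object shaped like the one returned by `boto3.client("ec2")`, and the helper names `is_zombie` and `stop_and_start` are hypothetical. It only applies to instances that support stop/start.

```python
def is_zombie(ec2, instance_id):
    """Return True if the instance reports 'running' but a status check is impaired."""
    response = ec2.describe_instance_status(
        InstanceIds=[instance_id],
        IncludeAllInstances=True,
    )
    for status in response["InstanceStatuses"]:
        running = status["InstanceState"]["Name"] == "running"
        impaired = (
            status["InstanceStatus"]["Status"] == "impaired"
            or status["SystemStatus"]["Status"] == "impaired"
        )
        if running and impaired:
            return True
    return False


def stop_and_start(ec2, instance_id):
    """Fully stop the instance, then start it again (a reboot is not enough here)."""
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```

With boto3 installed and credentials configured, usage would look something like `ec2 = boto3.client("ec2")` followed by `if is_zombie(ec2, instance_id): stop_and_start(ec2, instance_id)`, where `instance_id` is the ID of the affected instance.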
Over the weekend following the outage, we started rolling out more changes to our system architecture so that it handles node failures according to best practices. We’ve been running stably on a multi-zone replicated database infrastructure for months, and it has served us well and brought us peace of mind. However, we have not yet eliminated all of the single points of failure in our main app and build execution infrastructure.
Now on to the upgrades, which started on 5/4 and are continuing through May and June, including the scheduled downtime this weekend (5/19).
First, we’ve introduced a new load balancer to pave the way for truly HA web and app serving. Second, we’ve upgraded our monitoring tools to properly survive hard networking failures. Next, we’re about to release a new, automatically updating status page based on the awesome stashboard project.
There are still a few single points of failure in our system, and we’re busy eliminating them in changes targeted for backend releases toward the end of May.
Since the system changes, you may have noticed that the Tddium app is much snappier (it’s got a lot more horsepower on tap now), but you may also have seen a few builds terminate or found the app unresponsive for a few minutes here and there. A few parameters left over from our old configuration, which size our web serving pool against a backend processing pool, led to an interprocess deadlock that took out web server processes one by one. We’re fixing those parameters and rolling out software changes to avoid the situation entirely. The silver lining is that our new monitoring systems have been responsive and timely.
Over the next few weeks, if you have any feedback about a difference in your experience, good or bad, please let us know via our support page at http://support.tddium.com/