On Wednesday 2012 Dec 19, the main Tddium web service experienced an outage for about 4 hours when our primary database crashed. We had been preparing a warm-standby DB replica for production deployment in January. We were able to use it to recover completely. We are now live with a high-availability architecture in our primary datacenter.
The outage involved two distinct service interruptions. The first lasted around 9 minutes, from 0028 UTC to 0037 UTC on 2012 Dec 20 (starting at 7:28pm ET on Dec 19). It was triggered by a runaway query on our DB master node. Our staff received an IO utilization alert, killed the query and its controlling process, and restored operations. However, the IO utilization alert masked the alert for a second problem: the DB archive volume on our master node was full.
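One way to keep a disk-space alert from being hidden behind another alert is to check every volume on its own and report each one separately. The following is only a minimal sketch of that idea, not our production monitoring: the function name `volume_alerts`, the 90% threshold, and the example mount paths are all hypothetical.

```python
import shutil

def volume_alerts(paths, threshold=0.90, usage_fn=shutil.disk_usage):
    """Return one alert string per volume whose used fraction meets the threshold.

    `paths` and `threshold` are illustrative assumptions; `usage_fn` defaults
    to shutil.disk_usage and can be swapped out for testing.
    """
    alerts = []
    for path in paths:
        usage = usage_fn(path)  # named tuple with total, used, free (bytes)
        fraction = usage.used / usage.total
        if fraction >= threshold:
            # Each volume gets its own message, so one alert cannot mask another.
            alerts.append(f"{path} is {fraction:.0%} full")
    return alerts
```

Calling `volume_alerts(["/db/primary", "/db/archive"])` (hypothetical paths) would return a separate message for each volume at or above the threshold, independent of any IO alert that fires at the same time.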
The second interruption began at 0113 UTC, when both the primary and archive volumes on our DB master node filled up and caused the DB master to fail. We initiated a snapshot restore procedure, and data movement completed at around 0337 UTC. At that point, we found that the incremental backups following the snapshot were corrupt. At 0417 UTC, we prepared our DB standby node for promotion to master. We executed the promotion at around 0435 UTC and restored full service at 0445 UTC.
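For readers who want to check the arithmetic, the timestamps above can be turned into durations with a short script. This is an illustrative sketch only: the helper `minutes_between` is hypothetical, and it assumes all four timestamps fall on 2012 Dec 20 UTC, which the text states explicitly only for the first interruption.

```python
from datetime import datetime

def minutes_between(start, end, fmt="%Y-%m-%d %H%M"):
    """Whole minutes elapsed between two UTC timestamps in the given format."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

# First interruption: 0028 UTC to 0037 UTC.
first = minutes_between("2012-12-20 0028", "2012-12-20 0037")

# Second interruption: DB master failure at 0113 UTC to full service at 0445 UTC
# (date assumed to be Dec 20).
second = minutes_between("2012-12-20 0113", "2012-12-20 0445")

total = first + second
print(first, second, total)  # prints "9 212 221"
```

The two interruptions add up to 221 minutes, or roughly 3 hours 41 minutes, which is consistent with the "about 4 hours" figure given at the top of this post.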
We can now return to full service within 30 minutes of a single-server catastrophic DB failure. Our goal is to survive major datacenter outages with no more than 10 minutes of downtime. We’ll keep you posted as we build to that target.
As always, don’t hesitate to reach out to firstname.lastname@example.org if you have any questions.
The Solano Labs Team