Every organization needs disaster recovery plans: disks fail, computers crash, buildings burn, and users delete files inadvertently. In the bad old days, backups meant tapes and, if you had the money, a tape robot. If you were serious about backups you rotated a set of tapes off site. A remote office, if you had one, might suffice, but either way it meant FedExing tapes or contracting with someone like Iron Mountain. System administrators gave a lot of thought to the backup schedule, too. How often did you do a full backup? How about incrementals? Often the total primary storage you allocated to a user had more to do with the backup window than with the cost of primary storage. Full backups made satisfying restore requests easy, but incrementals were faster to take. Either way, backup was a major headache and a significant cost center. Wikipedia actually has a pretty good overview of the problem and the various approaches people employ.
So suppose you decided it was too much hassle to schedule, manage, and occasionally retrieve backups from tapes. A large RAID 5 or RAID 6 array might do the trick. It had the advantage of being online and readily accessible, and it was, in theory at least, resilient to a single disk failure (RAID 5) or even a double failure (RAID 6). Disks don’t fail too often, so maybe that would be good enough. Not so fast — at one place where I worked in the late ’90s one of the department heads thought he’d be clever and do backups for his group on a DEC Alpha with a fancy RAID 5 array. A combination of poor power quality and correlated failures led to a double disk failure and a major data loss. The problem is that the typical naive analysis tends to assume that disk failures are uncorrelated, but here in Reality Land, failures are correlated. There was a must-read paper from CMU at FAST in 2007 on the difference between MTTF and reality (here’s the tech report version). The short version is that the bathtub failure curve we all know and love is not a good model of reality, that the replacement rate doesn’t differ much in practice between SATA, SCSI, and FC disks, and that failure is not memoryless — instead it exhibits significant correlation and even autocorrelation. Google also published a paper on the topic at FAST that is well worth the time to read. Thanks to its scale, Google has the raw data to draw some interesting conclusions across a large population of disks.
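To see why the naive analysis is so seductive, here is a sketch of the back-of-the-envelope calculation it rests on. The annual failure rate and rebuild window below are illustrative numbers, not figures from the cited papers, and the independence assumption is exactly the flaw being discussed:

```ruby
# Naive RAID 5 risk estimate that (wrongly) treats disk failures as independent.
# AFR and rebuild window are made-up illustrative numbers.
def naive_double_failure_prob(disks:, afr:, rebuild_hours:)
  # Probability that any one of the remaining disks fails during the rebuild,
  # scaling the annual failure rate down to the rebuild window.
  p_during_rebuild = afr * (rebuild_hours / (24.0 * 365))
  1.0 - (1.0 - p_during_rebuild)**(disks - 1)
end

p_naive = naive_double_failure_prob(disks: 8, afr: 0.03, rebuild_hours: 24)
puts format('naive second-failure probability during rebuild: %.6f', p_naive)
# A shared power event, a bad batch, or common vibration makes the real
# number far higher than this, which is what the CMU and Google data show.
```

The reassuringly tiny result is precisely why a double failure, like the one on that DEC Alpha, comes as such a surprise.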
If you found the management headaches, cost, and failure statistics depressing, consider the fact that bit-rot is real. Bits can — and do — silently go bad. Some modern file systems, for instance ZFS, have addressed this problem head-on. If you’re stuck using ext3, ext4, or even FFS, rest assured bit-rot can happen to you. Modern variants of these filesystems have implemented Write-Ahead Logging or its cousin Soft Updates to handle crash recovery, but these don’t protect you from bit rot. If you aren’t constantly verifying your archival data you may find that the backups you thought you had are useless. The answer to this problem is so-called continuous data protection: a background process to scrub storage volumes. When I worked at Data Domain in the early days in San Mateo, bit rot was a serious concern so the appliances incorporated background scrubbing from the beginning.
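At its simplest, a scrubber is just a checksum manifest plus periodic re-verification. A minimal Ruby sketch (the function names and manifest shape are invented for illustration, not how Data Domain or ZFS implement it):

```ruby
require 'digest'
require 'find'

# Record a SHA-256 checksum for every file under a directory.
def build_manifest(dir)
  manifest = {}
  Find.find(dir) do |path|
    next unless File.file?(path)
    manifest[path] = Digest::SHA256.file(path).hexdigest
  end
  manifest
end

# Scrub: re-hash every file and return the paths that no longer match.
# Silent corruption, truncation, and deletion all show up here.
def scrub(manifest)
  manifest.reject do |path, digest|
    File.file?(path) && Digest::SHA256.file(path).hexdigest == digest
  end.keys
end
```

Build the manifest at backup time, persist it somewhere safe, and have a background job run `scrub` on a schedule — the point being that verification happens continuously, not only when you need a restore.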
So, what is a small organization that is serious about backups to do? Buying and deploying enough disks to deal with correlated failures, managing them, deploying storage systems that handle multi-disk failures, and scrubbing continuously requires either a significant engineering investment or a rock-solid product. Startups probably aren’t in the market for a couple of Data Domain restorers or a disk farm. The obvious answer is cloud storage. But cloud storage is not a panacea. Correlation is the bugbear of cloud storage, too. Another paper from Google, this time at OSDI last year, tells the tale. There are two important metrics for cloud storage: durability and availability. The former tells us how likely we are to lose our data permanently; the latter how likely we are to be able to retrieve our data now. The Google paper shows that while disk failures can lead to permanent loss, temporary node failures account for most periods of unavailability. Node availability is correlated within data centers, so replication across data centers is desirable. Again, this kind of wide-scale distribution is not something small companies are equipped to provide but it is part of the promise of the cloud.
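A back-of-the-envelope calculation shows why cross-data-center replication pays off. The numbers are illustrative, not from the paper, and treating data centers as independent is itself an assumption — though a far better one than assuming independence between nodes in the same building:

```ruby
# If a replica is reachable with probability a, and copies live in k data
# centers whose outages are (assumed) independent, the data is unavailable
# only when every data center is down at once.
def availability(a, k)
  1.0 - (1.0 - a)**k
end

puts availability(0.999, 1) # one data center: three nines
puts availability(0.999, 3) # three data centers: dramatically more nines
# Within a single data center, extra replicas help much less, because node
# failures there are correlated: shared power, network, and software pushes.
```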
So, what does all of this mean for backups here at Solano Labs? In practice, it means we’re sending snapshots of our source tree to Amazon’s S3. S3 scrubs, checksums both on disk and in the transport layer, and replicates across data centers. It strives for many 9s of durability and four 9s of availability. If you have large datasets you can even ship disks to Amazon rather than waiting for the wide area network. Of course, while S3 provides a simple low-level API and redundant storage, it doesn’t provide the mechanism for getting backups from git. To do that, we have continued to apply the devops approach from last time and automated the process. For the time being, this means some Ruby scripts run by cron, but we’re working on a more reliable scheduled job system, too. A YAML configuration tells the Ruby script what to back up and Fog then gets in the driver’s seat and pushes a compressed archive to S3. The entire process can be checked end-to-end by comparing cryptographic hashes. If you’re interested in automating cloud interactions and using Ruby, you definitely want to check out Fog. In a later post, we’ll take a deeper dive into the mechanics of pushing to S3 and show some usable Ruby code.
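To make the shape of that flow concrete, here is a rough sketch. The YAML keys, bucket name, and helper names are invented for illustration — this is an outline, not our production script — and the Fog upload is guarded so the sketch runs without AWS credentials:

```ruby
require 'digest'
require 'yaml'

# Hypothetical config, normally loaded from a YAML file on disk:
CONFIG = YAML.load(<<~YAML)
  bucket: my-backup-bucket
  sources:
    - /srv/git/repo.git
YAML

# Tar and gzip a source tree, returning the archive path and its SHA-256
# so the upload can be verified end-to-end later.
def archive(src, dest)
  system('tar', '-czf', dest, src) or raise "tar failed for #{src}"
  [dest, Digest::SHA256.file(dest).hexdigest]
end

# Push the archive to S3 with Fog, then fetch it back and compare hashes.
def push_to_s3(bucket, path, digest)
  require 'fog'
  storage = Fog::Storage.new(
    provider: 'AWS',
    aws_access_key_id: ENV.fetch('AWS_ACCESS_KEY_ID'),
    aws_secret_access_key: ENV.fetch('AWS_SECRET_ACCESS_KEY')
  )
  dir = storage.directories.get(bucket)
  dir.files.create(key: File.basename(path), body: File.open(path))
  fetched = dir.files.get(File.basename(path))
  raise 'checksum mismatch' unless Digest::SHA256.hexdigest(fetched.body) == digest
end
```

Cron drives the loop: for each entry in `sources`, call `archive`, then `push_to_s3`, and alert on any checksum mismatch.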