Backups in the Cloud

Every organization needs a disaster recovery plan: disks fail, computers crash, buildings burn, and users delete files inadvertently.  In the bad old days, backups meant tapes and, if you had the money, a tape robot.  If you were serious about backups you rotated a set of tapes off site.  A remote office, if you had one, might suffice, but either way it meant FedExing tapes or contracting with someone like Iron Mountain.  System administrators gave a lot of thought to the backup schedule, too.  How often did you do a full backup?  How about incrementals?  Often the total primary storage you allocated to a user had more to do with the backup window than with the cost of primary storage.  Full backups made satisfying restore requests easy, but incrementals were faster to take.  Either way, backup was a major headache and a significant cost center.  Wikipedia actually has a pretty good overview of the problem and the various approaches people employ.

So suppose you decided it was too much hassle to schedule, manage, and occasionally retrieve backups from tapes. A large RAID 5 or RAID 6 array might do the trick.  It had the advantage of being online and readily accessible, and it was, in theory at least, able to survive a single (RAID 5) or even a double (RAID 6) disk failure.  Disks don’t fail too often, so maybe that would be good enough.  Not so fast — at one place where I worked in the late ’90s one of the department heads thought he’d be clever and do backups for his group on a DEC Alpha with a fancy RAID 5 array. A combination of poor power quality and correlated failures led to a double disk failure and a major data loss.  The problem is that the typical naive analysis assumes that disk failures are uncorrelated, but here in Reality Land, failures are correlated.  There was a must-read paper from CMU at FAST in 2007 on the difference between MTTF and reality (here’s the tech report version).  The short version is that the bathtub failure curve we all know and love is not a good model of reality, that the replacement rate doesn’t differ much in practice between SATA, SCSI, and FC disks, and that failure is not memoryless — instead it exhibits significant correlation and even autocorrelation.  Google also published a paper on the topic at FAST that is well worth the time to read.  Google’s scale gives them raw data on a large population of disks, from which they draw some interesting conclusions.
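To see why the independence assumption matters, here is a back-of-the-envelope sketch of the risk that a RAID 5 array loses data because a second disk fails during a rebuild. All of the numbers below are illustrative assumptions, not measurements from the papers:

```ruby
# Naive model: after one disk fails, what is the chance a second disk in
# the array fails before the rebuild completes?  Every constant here is
# an assumption chosen for illustration.

ANNUAL_FAILURE_RATE = 0.03   # 3% per disk per year (assumed)
REBUILD_HOURS       = 24.0   # rebuild window onto a spare (assumed)
DISKS               = 8      # disks in the array (assumed)

HOURS_PER_YEAR = 24.0 * 365

# Probability one given disk fails inside the rebuild window, assuming
# failures are uniform in time and independent -- the very assumption
# the CMU field data undermines.
p_one_disk_in_window = ANNUAL_FAILURE_RATE * REBUILD_HOURS / HOURS_PER_YEAR

# Probability that any of the remaining disks fails during the rebuild.
p_second_failure = 1 - (1 - p_one_disk_in_window)**(DISKS - 1)

# Correlated failures (shared power, shared batch, shared enclosure) can
# raise the real risk substantially; model that crudely with a multiplier.
CORRELATION_FACTOR = 10.0
p_correlated = [p_second_failure * CORRELATION_FACTOR, 1.0].min

puts format("independent: %.6f  correlated: %.6f",
            p_second_failure, p_correlated)
```

Even this crude model shows the point: the "independent" number looks comfortably small, and a modest correlation factor wipes out most of that comfort.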

If you found the management headaches, cost, and failure statistics depressing, consider the fact that bit-rot is real.  Bits can — and do — silently go bad.  Some modern file systems, for instance ZFS, have addressed this problem head-on.  If you’re stuck using ext3, ext4, or even FFS, rest assured bit-rot can happen to you.  Modern variants of these filesystems have implemented write-ahead logging or its cousin soft updates to handle crash recovery, but neither protects you from bit-rot. If you aren’t constantly verifying your archival data you may find that the backups you thought you had are useless.  The answer to this problem is so-called continuous data protection: a background process that scrubs storage volumes.  When I worked at Data Domain in the early days in San Mateo, bit-rot was a serious concern, so the appliances incorporated background scrubbing from the beginning.
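The core of scrubbing is simple: remember a strong checksum for every file, then periodically re-hash and compare. A minimal sketch in Ruby (the manifest format and function names are mine, not any particular product's):

```ruby
require 'digest'

# Record a SHA-256 for each file in a manifest...
def build_manifest(paths)
  paths.each_with_object({}) do |path, manifest|
    manifest[path] = Digest::SHA256.file(path).hexdigest
  end
end

# ...then re-hash on a schedule and report anything that silently changed.
def scrub(manifest)
  manifest.reject { |path, sum| Digest::SHA256.file(path).hexdigest == sum }
          .keys  # paths whose current contents no longer match the manifest
end
```

A real scrubber would also rate-limit its I/O and repair from a replica when it finds a mismatch, but detection is the part most home-grown backup schemes skip.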

So, what is a small organization that is serious about backups to do?  Buying and deploying enough disks to tolerate correlated failures, managing them, deploying storage systems that survive multi-disk failures, and scrubbing continuously requires either a significant engineering investment or buying a rock-solid product.  Startups probably aren’t in the market for a couple of Data Domain restorers or a disk farm.  The obvious answer is cloud storage.  But cloud storage is not a panacea.  Correlation is the bugbear of cloud storage, too.  Another paper from Google, this time at OSDI last year, tells the tale.  There are two important metrics for cloud storage: durability and availability.  The former tells us how likely we are to lose our data permanently; the latter how likely we are to be able to retrieve our data right now.  The Google paper shows that while disk failures can lead to permanent loss, temporary node failures account for most periods of unavailability.  Node availability is correlated within a data center, so replication across data centers is desirable.  Again, this kind of wide-scale distribution is not something small companies are equipped to provide, but it is part of the promise of the cloud.

So, what does all of this mean for backups here at Solano Labs?  In practice, it means we’re sending snapshots of our source tree to Amazon’s S3.  S3 scrubs, checksums both on disk and in the transport layer, and replicates across data centers.  It strives for many 9s of durability and four 9s of availability.  If you have large datasets you can even ship Amazon disks rather than waiting for the wide-area network.  Of course, while S3 provides a simple low-level API and redundant storage, it doesn’t provide the mechanism for getting backups out of git.  To do that, we have continued to apply the devops approach from last time and automated the process.  For the time being, this means some Ruby scripts run by cron, but we’re working on a more reliable scheduled-job system, too.  A YAML configuration tells the Ruby script what to back up, and Fog then gets in the driver’s seat and pushes a compressed archive to S3.  The entire process can be checked end-to-end by comparing cryptographic hashes.  If you’re interested in automating cloud interactions from Ruby, you definitely want to check out Fog.  In a later post, we’ll take a deeper dive into the mechanics of pushing to S3 and show some usable Ruby code.
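The shape of that cron job can be sketched with nothing but the Ruby standard library: read a YAML config, build a gzipped tarball, and record its SHA-256 for end-to-end verification. The config keys, paths, and bucket name below are invented for illustration, and the Fog upload itself is shown only in outline:

```ruby
require 'yaml'
require 'zlib'
require 'digest'
require 'rubygems/package'  # Gem::Package::TarWriter

# Hypothetical config: which directories to archive and where to put them.
CONFIG = YAML.safe_load(<<~YAML)
  bucket: example-backups
  sources:
    - /var/repos/project.git
YAML

# Tar and gzip the given source trees, returning the archive's SHA-256
# so the upload can be verified end-to-end.
def make_archive(sources, out_path)
  Zlib::GzipWriter.open(out_path) do |gz|
    Gem::Package::TarWriter.new(gz) do |tar|
      sources.each do |src|
        Dir.glob(File.join(src, '**', '*')).select { |f| File.file?(f) }.each do |f|
          tar.add_file_simple(f.sub(%r{\A/}, ''), 0o644, File.size(f)) do |io|
            io.write(File.binread(f))
          end
        end
      end
    end
  end
  Digest::SHA256.file(out_path).hexdigest
end

# The push itself uses Fog, roughly along these lines (not run here):
#   storage = Fog::Storage.new(provider: 'AWS', aws_access_key_id: '...',
#                              aws_secret_access_key: '...')
#   storage.directories.get(CONFIG['bucket'])
#          .files.create(key: 'backup.tar.gz', body: File.open(archive_path))
# Afterward, compare the local SHA-256 against a hash of what S3 stored.
```

This is a sketch, not our production script, but it captures the pipeline: config in YAML, archive plus checksum in stdlib Ruby, Fog for the S3 leg.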


  • Daniel Weinreb May 22, 2011 at 8:17 pm

    For backup in the cloud, I’d be interested to know what you think about Carbonite and Mozy. These are more oriented toward end-users, but I guess I’m asking not just from the point of view of your own needs, but also about the whole concept in general. Thanks.

    • wjosephson May 22, 2011 at 9:30 pm

      I don’t have a lot of insight into the consumer cloud backup space as I don’t use it myself. I do know that most of these services — Mozy included — do aggressive dedupe on the cloud side. That is, they use a content-addressed store that saves only one copy of common files or file segments. Existing techniques work well for many data types (e-mail, word-processing documents, and, with a little analysis, many popular music files), but they don’t work well with images and movies. What these services offer is a turn-key solution to a painful problem and a relatively simple user interface. The two pain points that remain are security and bandwidth consumption. Since they rely on dedupe, they can’t offer end-to-end encryption: they can encrypt blocks while they sit on a remote server, but they have to manage the keys for you and therefore can reveal or lose control of those keys. For instance, there’s been some controversy in the case of Dropbox recently. The second is that, to the best of my knowledge, most of these services don’t provide enterprise-grade WAN optimization a la Riverbed. For my personal data, both of these matter: I want good key management and high-quality encryption, and I have a bandwidth problem at home: I don’t have FiOS, so pushing many gigabytes of data, even if only the first time, is a major pain point. For enterprise applications, the security problem still needs to be solved, but if, as in the case of Solano Labs, you are already in the cloud, then bandwidth to S3 is less problematic. Moreover, you can ship a disk to Amazon via FedEx and they will upload the contents to an S3 bucket for you.

      So I guess the answer is, as it almost always is with systems problems, “it depends”. If you have a relatively small amount of sensitive data or perhaps if you have good connectivity and your data doesn’t turn over too often (e.g. you have a lot of music from iTunes and aren’t a professional photographer), then these services are a great idea. If you need an enterprise SLA and bandwidth is a problem, then you may need to roll your own or look elsewhere. I’m hoping someone will offer a good WAN optimization client for pushing files to the cloud. You could sell it as an SDK for consumer applications and do OEM deals with a hardware appliance for private clouds that need remote backup.
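The content-addressed dedupe idea in the reply above can be shown with a toy store: segments are keyed by their SHA-256, so identical content is stored once no matter how many users upload it. The fixed-size segmentation and the class itself are simplifications of mine; real services use content-defined chunking:

```ruby
require 'digest'

# Toy content-addressed store: one physical copy per unique segment.
class DedupeStore
  SEGMENT = 4096  # bytes per segment (arbitrary choice)

  def initialize
    @segments = {}  # sha256 hex => segment bytes, stored once
    @files    = {}  # file name  => ordered list of segment keys
  end

  def put(name, data)
    @files[name] = data.bytes.each_slice(SEGMENT).map do |slice|
      bytes = slice.pack('C*')
      key = Digest::SHA256.hexdigest(bytes)
      @segments[key] ||= bytes  # skip the write if we've seen this segment
      key
    end
  end

  def get(name)
    @files[name].map { |key| @segments[key] }.join
  end

  def stored_bytes
    @segments.values.sum(&:bytesize)
  end
end
```

Two users uploading the same album would cost the service one copy's worth of storage; that economics is exactly why these providers keep the keys, and why end-to-end encryption breaks their model.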
