Back at the end of March we talked about the importance of having good backups in the cloud. Today, I’ll describe the system we use here at Solano Labs in a little more detail. For applications hosted in Amazon’s cloud, we use EBS volume snapshots stored in S3. The general approach is to force the system into a suitably quiescent state and then kick off a snapshot of the underlying EBS volume. The snapshot is tagged with enough extra metadata to make selecting the right snapshot for a restore straightforward.
Of course, if we just snapshot the volume without any preparation, the resulting backup image is likely to be unusable. Instead, it is necessary to force applications to checkpoint any critical state and to force the file system into a consistent state. In our environment, the easiest way to get usable file system snapshots is to use the XFS file system and its xfs_freeze utility, which flushes pending writes and suspends new ones until the file system is unfrozen. Other operating system and file system combinations support similar functionality. ZFS is a particularly interesting example, but an in-kernel implementation is sadly not available for Linux.
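To make the point about snapshot metadata concrete, here is a minimal sketch of how tags can drive the choice of snapshot at restore time. The Snapshot struct, the tag names ('app', 'role'), the snapshot ids, and the dates are all made up for illustration; they are not the tags we actually use, and a real script would fetch the records from the EC2 API rather than build them by hand.

```ruby
# Hypothetical snapshot record: an id, a start time, and a hash of tags.
Snapshot = Struct.new(:id, :started_at, :tags)

# Return the most recent snapshot whose tags include every wanted
# key/value pair, or nil if nothing matches.
def latest_snapshot(snapshots, wanted_tags)
  snapshots
    .select { |s| wanted_tags.all? { |key, value| s.tags[key] == value } }
    .max_by(&:started_at)
end

# Illustrative data only.
snapshots = [
  Snapshot.new('snap-a', Time.utc(2000, 1, 1), { 'app' => 'web', 'role' => 'db' }),
  Snapshot.new('snap-b', Time.utc(2000, 1, 2), { 'app' => 'web', 'role' => 'db' }),
  Snapshot.new('snap-c', Time.utc(2000, 1, 3), { 'app' => 'web', 'role' => 'cache' })
]

puts latest_snapshot(snapshots, { 'app' => 'web', 'role' => 'db' }).id
```

The newest snapshot overall is snap-c, but it carries a different role tag, so the database restore picks snap-b instead.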
At the application layer, the most important resource for us at the moment is the database. We are using Postgres, which conveniently has a good mechanism for ensuring usable file-system-based backups: pg_start_backup. The pg_start_backup function takes two arguments: an arbitrary label and an optional boolean. If the second argument is true, Postgres will try to checkpoint the database as quickly as possible at the risk of some performance degradation for running queries. One might implement a Postgres provider with freeze and unfreeze methods like so:
require 'pg'
class Postgres
  def initialize
    @conn = PG.connect(user: 'postgres', dbname: 'postgres')
  end
  def freeze(fs, dev, mnt)
    @conn.exec("SELECT pg_start_backup('fs-snap');")
  end
  def unfreeze(fs, dev, mnt)
    @conn.exec("SELECT pg_stop_backup();")
  end
end
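A file system provider can follow the same freeze/unfreeze interface. The sketch below is an assumption about how one might look, not our production code: the class name XfsVolume and the injectable runner are invented for illustration. The only external piece is the xfs_freeze command itself, where -f freezes and -u unfreezes the file system at a mount point.

```ruby
# Sketch of a file system provider. The class name and the injectable
# runner are assumptions made for illustration.
class XfsVolume
  # runner: a callable that executes a command given as an array of strings.
  # The default shells out and raises if the command fails.
  def initialize(runner = nil)
    @runner = runner || lambda do |cmd|
      system(*cmd) or raise "command failed: #{cmd.join(' ')}"
    end
  end

  # Flush dirty data and block new writes to the file system mounted at mnt.
  def freeze(fs, dev, mnt)
    @runner.call(['xfs_freeze', '-f', mnt])
  end

  # Resume writes to the file system mounted at mnt.
  def unfreeze(fs, dev, mnt)
    @runner.call(['xfs_freeze', '-u', mnt])
  end
end
```

Passing in a runner makes it possible to exercise the class without root privileges or a real XFS mount. With the default runner, XfsVolume.new.freeze('xfs', '/dev/xvdf', '/data') would run xfs_freeze -f /data, where the device and mount point are placeholders.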
Given resource provider implementations like the Postgres class above, the Ruby script responsible for periodic backups is actually pretty simple once the system has been set up: checkpoint Postgres, freeze the underlying XFS volume, kick off an EBS snapshot, then unfreeze everything. Depending upon your environment, you may need to tell Postgres to be quick about its checkpoint by passing true as the second argument to pg_start_backup. Otherwise, it spreads the checkpoint's writes out over time so that it does not disrupt running transactions by flooding the I/O subsystem with requests, and the backup takes longer to start.
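The sequence described above can be sketched as a small helper that freezes each provider in order, runs a block, and then unfreezes in reverse order. This is an illustration under assumptions rather than our actual script: with_quiesced and RecordingProvider are hypothetical names, the device and mount point are placeholders, and the stand-in providers only record calls. In a real script the block is where the EBS snapshot would be started.

```ruby
# Freeze each provider in order, run the block, then unfreeze in reverse
# order. The ensure clause runs even if a freeze or the block raises, so
# providers that were already frozen are not left frozen.
def with_quiesced(providers, fs, dev, mnt)
  frozen = []
  begin
    providers.each do |provider|
      provider.freeze(fs, dev, mnt)
      frozen << provider
    end
    yield
  ensure
    frozen.reverse_each { |provider| provider.unfreeze(fs, dev, mnt) }
  end
end

# Stand-in provider that records calls instead of touching a real system.
class RecordingProvider
  def initialize(name, log)
    @name = name
    @log = log
  end

  def freeze(fs, dev, mnt)
    @log << "freeze #{@name}"
  end

  def unfreeze(fs, dev, mnt)
    @log << "unfreeze #{@name}"
  end
end

log = []
providers = [RecordingProvider.new('postgres', log), RecordingProvider.new('xfs', log)]
with_quiesced(providers, 'xfs', '/dev/xvdf', '/data') do
  # A real script would start the EBS snapshot here.
  log << 'snapshot'
end
puts log
```

Unfreezing in reverse matters: the file system should be writable again before Postgres is told the backup is over, since ending the backup involves writes of its own. Note that this sketch stops unfreezing at the first unfreeze that raises; a production script would probably want to rescue and continue.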
So far, this approach has treated us well and is another win for the devops approach. If you decide to implement it, be sure that you have actually tried restoring from a backup! You may also want to consider some redundancy in your backups so that you can, for instance, bring your system up in another availability zone or region if a zone or an entire region in Amazon fails.