Do not store database backups in git

by ewout

Some people use revision control systems (RCS) like git or svn for storing and managing backups. The idea sounds appealing: consecutive backups differ only slightly, and a revision control system can optimize the space required to store them. The following paragraphs explain why this is just plain wrong.

Data retention

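An RCS is built for infinite data retention, because writing source code is expensive and storage is cheap. Backups lose their value with age: nobody cares about the daily backups from five years ago. Getting rid of old revisions in an RCS means rewriting history, which these tools deliberately make hard.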

Performance

An RCS is built for tracking a large number of small, interdependent files. Backup files are large, and backups from multiple applications are unrelated. Git does not handle large files well, and repositories can become unusably slow or use an insane amount of memory when pushing, pulling or checking out.

Corruption

When an RCS repository gets corrupted, chances are that all backups stored inside become inaccessible. Svn stores revisions as deltas relative to the previous revision. When one delta file becomes unreadable, the revisions that build on it are affected as well. Git does not store only deltas and is more defensive against corruption thanks to its built-in SHA-1 hashing. Furthermore, git repositories can easily be replicated with full history, so the chances of losing everything to corruption are slim. But even with the best RCS tools, there is an extra non-trivial layer between the filesystem and the data, and this layer is a liability.

Yet another tool

Every machine that needs access to the backups requires the RCS tool to be installed. This is only a minor inconvenience, but why use yet another tool when the standard Unix tools work just fine?

What to use then?

My advice for storing database backups is simple: create timestamped SQL dumps periodically and compress them with bzip2. Keep them around as long as your data retention policy requires or until you run out of space, which will be a long, long time in an era where hard disk space is measured in terabytes.
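
The advice above could be scripted roughly as follows. This is only a minimal sketch, not a finished backup tool: the function names (create_dump, prune_dumps), the "db" file prefix and the file naming scheme are made up for illustration, and the dump command itself is passed in by the caller because the right one depends on your database.

```python
import bz2
import subprocess
from datetime import datetime, timedelta
from pathlib import Path

SUFFIX = ".sql.bz2"
STAMP = "%Y%m%d-%H%M%S"


def create_dump(dump_cmd, backup_dir, prefix="db", now=None):
    """Run dump_cmd and store its stdout as <prefix>-<timestamp>.sql.bz2."""
    now = now or datetime.now()
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / f"{prefix}-{now.strftime(STAMP)}{SUFFIX}"
    tmp = target.with_name(target.name + ".part")

    # Stream the dump through bz2 in 1 MiB chunks so it never has to fit in memory.
    proc = subprocess.Popen(dump_cmd, stdout=subprocess.PIPE)
    with bz2.open(tmp, "wb") as out:
        for chunk in iter(lambda: proc.stdout.read(1 << 20), b""):
            out.write(chunk)
    proc.stdout.close()

    if proc.wait() != 0:
        tmp.unlink()  # do not keep a partial dump around
        raise RuntimeError(f"dump command exited with code {proc.returncode}")

    tmp.rename(target)  # only complete dumps get the final name
    return target


def prune_dumps(backup_dir, max_age_days, prefix="db", now=None):
    """Delete dumps whose timestamp is older than max_age_days; return them."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=max_age_days)
    removed = []
    for path in sorted(Path(backup_dir).glob(f"{prefix}-*{SUFFIX}")):
        stamp = path.name[len(prefix) + 1 : -len(SUFFIX)]
        try:
            taken = datetime.strptime(stamp, STAMP)
        except ValueError:
            continue  # not one of our files, leave it alone
        if taken < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

Run periodically, a call such as create_dump(["pg_dump", "mydb"], "/var/backups/db") followed by prune_dumps("/var/backups/db", max_age_days=90) would cover both halves of the advice. The database name, the directory and the 90 days are placeholders, and mysqldump or any other command that writes SQL to standard output would do just as well. The result is a directory of ordinary files that bzcat can read on any machine.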
