Menu
Index

Contact
LinkedIn
GitHub
Atom Feed
Comments Atom Feed



Tweet

Similar Articles

15/09/2009 23:24
Too hot to handle

Recent Articles

23/04/2017 14:21
Raspberry Pi SD Card Test
07/04/2017 10:54
DNS Firewall (blackhole malicious, like Pi-hole) with bind9
28/03/2017 13:07
Kubernetes to learn Part 4
23/03/2017 16:09
Kubernetes to learn Part 3
21/03/2017 15:18
Kubernetes to learn Part 2

Glen Pitt-Pladdy :: Blog

MySQL Backups... not as simple as it looks

Database backups are often a can of worms and working with databases daily, I frequently encounter situations where popular commercial backup tools can raise questions over backup consistency in particular scenarios while often appearing to be working in simplistic tests.

Previously I've administrated e-commerce sites running on LAMP (among others) and backups again were a tricky business that requires a lot of care to get right.

This article has really sprung out of revisiting my own backups for the MySQL instance behind this blog as well as a load of other important things.

Warning

This is some generic lessons learned and applies to my particular scenario. Some or all of this may not be valid in your scenario. Some of it may even be outright wrong. There is no substitute for having your own in-depth understanding and comprehensive testing of your own scenario.

Setting valid goals

The #1 place I see backups failing is that the focus is put on making a backup, where actually the end game is reliable restore. Perhaps we should actually change the language we use to reflect this since talking about backups means that everyone focuses on that and not the restore / recovery.

More important though is the overall business continuity which seems least well considered of all. These days many businesses are likely to face ruin if they suffered a large unrecoverable data loss, yet a thorough plan and regular testing of recovery seems broadly neglected.

A while back I saw a backup tools vendor using scare-advertising saying that over 40% of backups can't be restored... therefore you should buy their tool. I'm not sure their source of statistics or how valid it was, but the thing is that in my experience they may well be right, but the reason they may be right probably applies to their tool equally as well as everyone else.

Over the years I've discovered many problems the moment I start considering backups, or more precisely restores:

  • Selective backups... and then someone adds more data and forgets to add it to the backup - Oops!
  • Lost (d)encryption keys... the backup works, to bad you're not going to be able to restore it if the one machine with they keys fails
  • Disabled backups... at some point an admin had to do some work and disabled (possibly only partially) backups for the duration, but then forgot to re-enable them after and that's how they stayed
  • Bugs... an update of a tool used as part of the backup process introduced a new bug and under certain usage scenarios was happily running and reporting no errors, only problem was the output it was producing couldn't be decoded to restore
  • Expired licenses... every day they change the tapes and every day the backups quietly do nothing because the backup software license had silently expired
  • Complex restore process so when the phone rings at 3am on a Sunday morning and the admin crawls out of bed all groggy with a hangover from the night before while all their management are panicking and asking them "when will it be up", it's a sure bet this will not turn out well...
  • Lack of a plan... it may seem simple but when things go pear-shaped a lot of people will not be thinking clearly. Thinking through (and drilling) a recovery plan with the options, decision making processes, peer review and everything else that is needed could save a whole lot of problems. A complete loss is an easy decision, but what of a partial loss? Do you restore? Do you try recover? In a disaster situation the fog of unknowns is down and having thought through and recorded things to check and all the detail to consider puts you in a much safer position.
  • Rebuild instead of Restore: It's a surprisingly common practice that many people think it's a good idea not to take complete backups where the entire system can be restored in one-shot. While selective restore capability is often also vital, if a major incident occurs chances are it's going to be entire systems lost and not just a database.... "don't worry we've got the data and the rest can be installed from scratch"... only problem is that took 3 weeks for a team of long-gone contractors in the first place and if anyone had backed up the most recent config it would still take a week but all we have is a database backup.... Another common one is to rely on automated system builds, but then again, changes have been made during the course of running the system, build tools have been updated and can't quite do the same things they originally did and many more factors that can undermine the fast rebuild of systems.

For the most part problems go undiscovered until it's time to do it for real. Regularly testing that end-to-end (in a disaster like scenario - eg. bare metal) would not only test the process, but also keep skills sharp and provide opportunity for improving process, decision making and configuration to make things run smoother. It's doing this exercise periodically that flushes out all the above problems.

VM Snapshots (WARNING!)

The vital bit of information here is that not all VM snapshots are created equal. On training for a popular virtualisation technology I quizzed the trainer about backups after I spotting some vague / risky areas. As it turns out almost universal assumptions about the validity of VM shapshots for backups probably a big mistake in many scenarios, especially when databases are involved.

The problem comes in that in many cases backup tools do not capture the memory in the VM snapshots taken for backup purposes (they normally do for general-purpose VM snapshots). That's probably not a big deal with a file server where quiescing filesystems is sufficient, but where consistency of databases has to be maintained it could be a really big deal. Not all database technologies have files on disk that can be guaranteed to be recoverable, and even if a database has a crash recovery mechanism, it's better if your backups don't depend on how well the crash recovery works.... well, that's the ideal.

There's an easy test for memory in VM shapshot backups: test restore a VM snapshot backup. If the image boots from cold (ie. the OS boots) after restore then you have NOT got the memory. If the restored image resumes to a running state after restore (as opposed to booting) and continues from where it left off then chances are you do have the memory and your backup is likely to be safe.

Even so, you should check very carefully. There could also be other things that risk problems like time steps/shifts on a system that thought it was running all that time and many other factors. In many cases a test restore may work... after all it's done on a quiet system so chances are that most or all data has been committed to disk... then later when the backup is done live on a busy system it all turns out badly when you come to restore.

Database Backup Mechanisms

There are generally three main themes here, and the same holds true (sort-of) for MySQL:

  • Cold backup - stop the database, backup (or snapshot), start it up again
    • Requires downtime, but combined with snapshots this could be minimised even for large databases
    • Generally very safe - database is not running so data should be completely consistent
    • Often used in conjunction with replication: live database is left running, slave/shadow/replicated database is stopped and backed up. This does introduce the risk that the backup is open-loop and inconsistencies or replication problems could go undetected for long periods as the database is not in active use and that's what you are backing up....
  • Dump the data to a flat file, backup the flat file, restore the flat file, re-play the flat file into the database
    • Not always practical for very large databases... time to dump, storage space and more could make this prohibitive
    • File is normally very large, however often compresses very well if the dump tool allows for this or can be piped into a compression tool
    • Generally a very well proven backup/restore mechanism
  • Capture the underlying database files (Shapshot)
    • Efficient with large data sets assuming VM, SAN, LVM, Filesystem mechanism used doesn't suffer serious impact
    • Often requires some form of interaction with the database to get it into a safe state for snapshot which may also impact it in other ways or introduce risks
    • May depend on design and layout of storage which may place constraints on how storage can be optimized for performance

MySQL choices

Dump (mysqldump)

This choice is easy if you have a relatively small data set. It's a very well proven mechanism and does not require downtime. The output is actually SQL statements (text) so if you really have to you can edit it to fix a problem it is possible.

It also comes out of STDOUT which means you can pipe it into your favourite compression tool and being text high compression ratios are the norm (I get around 6:1). A favourite trick for remote backups which use rsync is to use "gzip --rsyncable" and withe mysqldump the option "--order-by-primary" which keeps the output in a form that rsync has the best chance at only transferring minimal (changed) data.

An example (taking advantage of Debian config) command would be something like:

mysqldump --defaults-file=/etc/mysql/debian.cnf --opt --order-by-primary --routines \
                --flush-logs --flush-privileges --all-databases --events \
        | gzip --rsyncable >$DESTINATION-tmp$$.gz
mv $DESTINATION-tmp$$.gz $DESTINATION.gz

Restoring is simply a case of piping the decompressed data into mysql.

Snapshot

This is clearly a higher risk backup approach but becomes more of a necessity as data becomes too large for dumping it to be practical.

In the case of MyISAMM databases it's necessary to execute "FLUSH TABLES WITH READ LOCK" in order to get the data in a consistent state for backup. This of course would be very bad for any length of time, but likely acceptable for the few seconds it takes to generate the snapshot after which we need to execute "UNLOCK TABLES". It's very important to ensure that "UNLOCK TABLES" gets executed, even if the snapshot fails. Leaving the database with locks will likely impact service. In my case I have a script that launches a background process on a timeout that will execute "UNLOCK TABLES" after a short timeout even if the parent script fails / terminates.

IMPORTANT: The client that "FLUSH TABLES WITH READ LOCK" is executed in must be kept running until the snapshot is complete and "UNLOCK TABLES" is run. Effectively the lock will be lost (like executing "UNLOCK TABLES") as soon as the client exits. This means you can't do each of these commands in different (pre/post) scripts as that would imply different clients.

With InnoDB databases the above flush / lock mechanism doesn't do anything. It's assumed that the crash recovery mechanism will be used on restore. This means that you have to ensure that the crash recovery mechanism will work on restore.

The basic idea of the InnoDB crash recovery mechanism is that change in the buffer pool are written to the REDO Logs (typically 2 or more files located in the MySQL data directory with names like ib_logfile*) which are ring-buffers (ie. the files get reused). Updates to tables are then done in batches. On recovery completed transactions in the REDO Logs are applied to table files.

While it's often the case that the whole of the MySQL data directory containing both tables and REDO Logs will be in the same filesystem / volume, it's not unusual for people to separate them for performance (storage optimized for sequential write and random IO). It's also not unusual that some databases (in sub-directories) could be placed on different storage volumes. In these cases snapshots (depending on the mechanism) may not be atomic which means that the time tables are captured and the time the REDO Logs are captured are not the same. While it may still work out (especially on low-traffic databases), this introduces the element of chance into backups that the recovery mechanism may not work on restore and depends on the sequence snapshots occurred, how busy the server was (eg. if long enough REDO Logs could have completely cycled through and have no overlap with transactions at the time tables were captured), and how well the recovery mechanism can cope with this.

The lowest risk scenario is that all the data (REDO Logs and tables) must be on the same volume to avoid these risks, so that they are captured atomically with the snapshot.

Cold Backups

Unless downtime is acceptable (eg. test systems), the only practical way of using this for a live system is with replication.

The big risk introduced then is how well the data is replicated. As the replicated system is not actively used there is a high risk of any data anomalies going undetected. These risks then need to be mitigated with regular comprehensive testing of the data (restores) to ensure that data is both consistent and matches the live (master) database.

Binary Logs

MySQL provides logs / journals of all the changes (UPDATE/INSERT) and these can also be backed up periodically (eg. hourly through the day) and replayed after the main backup has been restored to minimize the window of loss.

What is worth considering is how many logs you keep as if they cycle off before your next backup you will not be able to use those. Basically, you want enough logs to see you through the busiest times between backups, plus a good margin (possibly multiple days which would also provide some cover for a failed backup).

Replaying binary logs is done with "mysqlbinlog" which can dump the binary logs as text or replay them. Options are available for selecting what transactions to replay. See the man page for details.

 

Comments:




Are you human? (reduces spam)
Note: Identity details will be stored in a cookie. Posts may not appear immediately