Glen Pitt-Pladdy :: Blog
MySQL Backups... not as simple as it looks
Database backups are often a can of worms. Working with databases daily, I frequently encounter situations where popular commercial backup tools raise questions over backup consistency in particular scenarios, while often appearing to work in simplistic tests. Previously I've administered e-commerce sites running on LAMP (among others), and backups there were again a tricky business that required a lot of care to get right. This article has really sprung out of revisiting my own backups for the MySQL instance behind this blog, as well as a load of other important things.

Warning

These are some generic lessons learned, and they apply to my particular scenario. Some or all of this may not be valid in your scenario. Some of it may even be outright wrong. There is no substitute for having your own in-depth understanding and comprehensive testing of your own scenario.

Setting valid goals

The #1 place I see backups failing is that the focus is put on making a backup, where actually the end game is a reliable restore. Perhaps we should change the language we use to reflect this, since talking about backups means that everyone focuses on that and not on the restore / recovery. More important still is overall business continuity, which seems the least well considered of all. These days many businesses would likely face ruin if they suffered a large unrecoverable data loss, yet a thorough plan and regular testing of recovery seem broadly neglected.

A while back I saw a backup tool vendor using scare-advertising saying that over 40% of backups can't be restored... therefore you should buy their tool. I'm not sure of their source of statistics or how valid it was, but in my experience they may well be right; the catch is that the reason they may be right probably applies to their tool just as much as to everyone else's. Over the years I've discovered many problems the moment I start considering backups, or more precisely restores:
For the most part, problems go undiscovered until it's time to do it for real. Regularly testing end-to-end (in a disaster-like scenario - eg. bare metal) would not only test the process, but also keep skills sharp and provide opportunities for improving process, decision making and configuration to make things run smoother. It's doing this exercise periodically that flushes out all the above problems.

VM Snapshots (WARNING!)

The vital bit of information here is that not all VM snapshots are created equal. On training for a popular virtualisation technology I quizzed the trainer about backups after spotting some vague / risky areas. As it turns out, the almost universal assumption about the validity of VM snapshots for backups is probably a big mistake in many scenarios, especially when databases are involved.

The problem is that in many cases backup tools do not capture the memory in the VM snapshots taken for backup purposes (they normally do for general-purpose VM snapshots). That's probably not a big deal with a file server where quiescing filesystems is sufficient, but where consistency of databases has to be maintained it could be a really big deal. Not all database technologies have files on disk that can be guaranteed to be recoverable, and even if a database has a crash recovery mechanism, it's better if your backups don't depend on how well the crash recovery works... well, that's the ideal.

There's an easy test for memory in VM snapshot backups: test restore a VM snapshot backup. If the image boots from cold (ie. the OS boots) after restore then you have NOT got the memory. If the restored image resumes to a running state after restore (as opposed to booting) and continues from where it left off, then chances are you do have the memory and your backup is likely to be safe. Even so, you should check very carefully.
There could also be other things that risk problems, like time steps / shifts on a system that thought it was running all that time, and many other factors. In many cases a test restore may work... after all, it's done on a quiet system so chances are that most or all data has been committed to disk... then later, when the backup is done live on a busy system, it all turns out badly when you come to restore.

Database Backup Mechanisms

There are generally three main themes here, and the same holds true (sort-of) for MySQL: dumps, snapshots and cold backups.
MySQL choices

Dump (mysqldump)

This choice is easy if you have a relatively small data set. It's a very well proven mechanism and does not require downtime. The output is actually SQL statements (text), so if you really have to, it's possible to edit it to fix a problem. It also comes out on STDOUT, which means you can pipe it into your favourite compression tool, and being text, high compression ratios are the norm (I get around 6:1). A favourite trick for remote backups which use rsync is to use "gzip --rsyncable", and with mysqldump the option "--order-by-primary", which keeps the output in a form that gives rsync the best chance of transferring only minimal (changed) data. An example (taking advantage of the Debian config) command would be something like:
mysqldump --defaults-file=/etc/mysql/debian.cnf --opt --order-by-primary --routines \
    --all-databases | gzip --rsyncable > backup.sql.gz

Restoring is simply a case of piping the decompressed data into mysql.

Snapshot

This is clearly a higher risk backup approach, but it becomes more of a necessity as the data becomes too large for dumping to be practical.

In the case of MyISAM databases it's necessary to execute "FLUSH TABLES WITH READ LOCK" in order to get the data into a consistent state for backup. This would of course be very bad for any length of time, but is likely acceptable for the few seconds it takes to generate the snapshot, after which we need to execute "UNLOCK TABLES". It's very important to ensure that "UNLOCK TABLES" gets executed even if the snapshot fails; leaving the database locked will likely impact service. In my case I have a script that launches a background process that will execute "UNLOCK TABLES" after a short timeout, even if the parent script fails / terminates.

IMPORTANT: The client that "FLUSH TABLES WITH READ LOCK" is executed in must be kept running until the snapshot is complete and "UNLOCK TABLES" is run. Effectively the lock will be released (as if executing "UNLOCK TABLES") as soon as the client exits. This means you can't run these commands in different (pre/post) scripts, as that would imply different clients.

With InnoDB databases the above flush / lock mechanism doesn't do anything. It's assumed that the crash recovery mechanism will be used on restore, which means you have to ensure that the crash recovery mechanism will work on restore. The basic idea of the InnoDB crash recovery mechanism is that changes in the buffer pool are written to the REDO Logs (typically 2 or more files located in the MySQL data directory with names like ib_logfile*), which are ring-buffers (ie. the files get reused). Updates to the table files are then done in batches. On recovery, completed transactions in the REDO Logs are applied to the table files.
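The lock-snapshot-unlock dance above can be sketched in shell. This is a sketch only, not my actual script: the LVM volume names, sizes and timeouts are illustrative assumptions, and a production version should positively confirm the lock is held before snapshotting. The key trick is keeping a single mysql client session alive (here via a FIFO) so the lock persists between commands:

```shell
#!/bin/sh
# Sketch: hold "FLUSH TABLES WITH READ LOCK" in ONE client session for the
# duration of the snapshot. Paths, LVM names and timeouts are assumptions.

FIFO="/tmp/mysql-lock.$$"
mkfifo "$FIFO" || exit 1

# One long-running mysql client; everything written to the FIFO runs in the
# SAME session, so the lock survives between commands.
mysql --defaults-file=/etc/mysql/debian.cnf < "$FIFO" &
CLIENT=$!
exec 3> "$FIFO"     # hold the write end of the FIFO open

echo "FLUSH TABLES WITH READ LOCK;" >&3
sleep 2             # crude: a real script should verify the lock is held

# Safety net: if we are still holding the lock in 30s, kill the client,
# which releases the lock exactly as "UNLOCK TABLES" would.
( sleep 30 && kill "$CLIENT" 2>/dev/null ) &
WATCHDOG=$!

# Take the snapshot while the lock is held (LVM used as an example).
lvcreate --snapshot --size 1G --name mysql-snap /dev/vg0/mysql

echo "UNLOCK TABLES;" >&3
exec 3>&-           # closing the session also drops the lock
kill "$WATCHDOG" 2>/dev/null
wait "$CLIENT" 2>/dev/null
rm -f "$FIFO"
```

The watchdog subshell is the background timeout process described above: killing the client is equivalent to "UNLOCK TABLES", so even if the parent script dies, the lock cannot outlive the timeout.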
While it's often the case that the whole of the MySQL data directory, containing both tables and REDO Logs, will be in the same filesystem / volume, it's not unusual for people to separate them for performance (storage optimized for sequential write versus random IO). It's also not unusual for some databases (in sub-directories) to be placed on different storage volumes. In these cases snapshots (depending on the mechanism) may not be atomic, which means that the time the tables are captured and the time the REDO Logs are captured are not the same. While it may still work out (especially on low-traffic databases), this introduces an element of chance: whether the recovery mechanism works on restore depends on the sequence in which the snapshots occurred, how busy the server was (eg. given long enough, the REDO Logs could have completely cycled through and have no overlap with the transactions in flight when the tables were captured), and how well the recovery mechanism copes with this. The lowest risk approach is to keep all the data (REDO Logs and tables) on the same volume, so that it is all captured atomically by the snapshot.

Cold Backups

Unless downtime is acceptable (eg. test systems), the only practical way of using this on a live system is with replication. The big risk introduced then is how well the data is replicated. As the replicated system is not actively used, there is a high risk of any data anomalies going undetected. These risks need to be mitigated with regular comprehensive testing of the data (restores) to ensure that the data is both consistent and matches the live (master) database.

Binary Logs

MySQL provides logs / journals of all the changes (UPDATE/INSERT), and these can also be backed up periodically (eg. hourly through the day) and replayed after the main backup has been restored to minimize the window of loss.
It is worth considering how many logs you keep: if they cycle off before your next backup, you will not be able to use them. Basically, you want enough logs to see you through the busiest period between backups, plus a good margin (possibly multiple days, which would also provide some cover for a failed backup). Replaying binary logs is done with "mysqlbinlog", which can dump the binary logs as text or replay them. Options are available for selecting which transactions to replay. See the man page for details.
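As a sketch of what replay looks like after the main backup has been restored (the log file names and the cut-off time here are illustrative assumptions, not values from any real setup):

```shell
# Dump the binary logs as text first to confirm what would be replayed.
mysqlbinlog mysql-bin.000042 mysql-bin.000043 > replay.sql
less replay.sql

# Replay up to a point in time by piping straight back into mysql.
# --stop-datetime bounds the replay, eg. to just before a bad transaction.
mysqlbinlog --stop-datetime="2023-06-01 04:00:00" \
    mysql-bin.000042 mysql-bin.000043 \
    | mysql --defaults-file=/etc/mysql/debian.cnf
```

Reviewing the text dump before replaying is cheap insurance: it shows exactly which statements will be applied and lets you pick the stop point deliberately rather than replaying blind.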
Disclaimer: This is a load of random thoughts, ideas and other nonsense and is not intended to be taken seriously. I have no idea what I am doing with most of this so if you are stupid and naive enough to believe any of it, it is your own fault and you can live with the consequences. More importantly this blog may contain substances such as humor which have not yet been approved for human (or machine) consumption and could seriously damage your health if taken seriously. If you still feel the need to litigate (or whatever other legal nonsense people have dreamed up now), then please address all complaints and other stupidity to yourself as you clearly "don't get it".
Copyright Glen Pitt-Pladdy 2008-2023