Home Lab Project: Galera Clustering with Percona XtraDB
These days even non-critical (eg. social/entertainment) sites are starting to expect always-up functionality. Traditionally, critical workloads have had enormous amounts of money spent on multiply redundant hardware and HA software solutions, not to mention DR for when the worst happens. While this can reduce downtime when systems fail, HA normally requires some downtime while services restart. Only a true multi-master cluster can achieve near-seamless operation even when whole datacenters are lost. That's not to say it's a panacea: good design (both infrastructure and application) with quality parts and attention to detail is still necessary for this to work well. Galera Clustering provides this for MySQL and related databases (MariaDB, Percona XtraDB) and is becoming a popular technology used by some of the world's largest tech companies.

Quorum

A fundamental concept in clustering is Quorum. Basically this is a voting system between multiple nodes (normally a minimum of 3) combined with some sort of safety mechanism. For example, if we have 3 nodes and all of them can see that the others are healthy, we have a 100% healthy cluster. If one node fails, the remaining 2 nodes detect this and decide to exclude the failed node. If however you get a general networking failure, all nodes become isolated from each other but not necessarily from application nodes (eg. in a WAN failure in a geographically dispersed infrastructure, local application nodes can still communicate with the local database node). Each node detects that it can only see itself as healthy, but a minimum of 2 nodes (typically the rule is to require more than half the cluster nodes) is required to run the cluster, so each node switches to a safe state (eg. refuses queries) to avoid split-brain.

In reality not all the nodes need to participate in data storage, but an odd number of nodes (so we have a distinct more-than/less-than-half condition for quorum) need to participate in determining the health of other nodes.

Another factor is how you dice up nodes or groups of nodes, because it isn't just nodes that can fail, but also their supporting infrastructure. Say for example you have 3 nodes: 2 in one datacenter, 1 in another datacenter. The first datacenter fails, leaving 1 functioning node in the second datacenter. Because less than half the cluster is operational, the remaining node will protect against split-brain and the cluster will be down from a single datacenter failure. This means that we really need not just 3 cluster nodes minimum, but actually 3 of whatever supporting infrastructure is used around each node, or group of nodes. If that's networking then we need networks (eg. switches) to not be shared between the 3 nodes. If it's datacenters, then we need datacenters to not be shared.

Weighting can often also be applied to nodes. This means that you can create many more combinations of nodes and groups of nodes, so long as there is a clear distinction in the number of nodes (with weighting) required. Personally I think this complicates things excessively and introduces the risk that a human error in the design generates an unsafe condition. Safest is to go with an odd number of infrastructure components and require more than half to be operational.
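For reference, in Galera this sort of weighting is done through a provider option rather than anything structural. A sketch of the line that would go in the MySQL config of a node you wanted to count double (pc.weight defaults to 1), though personally I'd keep every node at the default and stick to an odd number of failure domains:

wsrep_provider_options = "pc.weight=2"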
Unsafe Clusters

Believe it or not, I've come across clusters sold for Enterprise use which do not use Quorum. They instead add additional checks. For example you can have another heartbeat by direct cable connection between two nodes ... and so it goes, adding more checks.

The fundamental problem here is that the purpose of the cluster is to protect against fallible infrastructure, and ALL infrastructure is fallible. It will all fail at some point, and there is a practically unlimited number of failure scenarios that can occur in a complex infrastructure. The only safe situation is if there is sound logic (the less complex the better) in place that mitigates the failure (and most importantly, avoids split-brain). Simply throwing more safety checks at it is just playing with the odds, not actually providing the sound logic to avoid the inevitable failure. Worse yet, often the additional checks are things like additional network channels. The problem here is that there are still multiple single points of failure: NIC, drivers, network stack, network configuration. Storage-based heartbeating helps mitigate that, but again it's still just playing with the odds. Add in the additional complexity and the extra failure scenarios it generates, and you quickly realise that this isn't solving the problem and might be creating more. Any amount of additional safety checks is no substitute for sound logic to protect data integrity.

What Galera Does

Galera uses Quorum and requires more than half the nodes to be present. That means the logic is sound. If you design the infrastructure to work in this way then everything should work as intended. Galera replicates data across multiple nodes, but also provides the ability to have Arbiter nodes (which take part in Quorum and pass data through, but don't store it). This gives you the flexibility to have a multiply redundant, geographically dispersed database infrastructure without having to make every node a big data storage node. Data replication has to be synchronous (ie. all nodes update together), however this only has to occur on Commit. That means that read queries (SELECT) can be done against any node without impact (in fact this effectively gives you scale-out functionality), but write queries (INSERT/UPDATE) will only see latency on the Commit, which is generally going to be low impact on most well designed applications. In failure scenarios where there are insufficient nodes to run the cluster (eg. isolation), the nodes error on all commands with "not yet prepared node for application use" or "unknown command" errors.

Building one

If you just want to get a cluster up and manage it easily rather than mess with all the internal detail, then take a look at Several Nines. Their Cluster Control product is hard to beat, with automated configuration of nodes and Web GUI management. Read no further. In my case, for my Lab experiments, I want to be able to work at a lower level of abstraction with more direct control, so that is what all this is about. I have kicked off 3x CentOS 7 VMs and am ready to start. You may also like to create a separate volume for /var/lib/mysql, since this is where the main activity is going to happen and it could grow significantly with use.

Adding Repos and Installing

There are plenty of docs on installation available, and in my case I'm going with Percona. Because of my yum proxy arrangement (using apt-cacher-ng) I have to manually download and install the package that provides the Percona repo:

# wget http://www.percona.com/downloads/percona-release/redhat/0.1-3/percona-release-0.1-3.noarch.rpm

Then install with:

# yum install Percona-XtraDB-Cluster-56

At this point you will also be prompted to accept the GPG key. For the purposes of Home Lab experiments I'm not too fussed, but validating origin, fingerprints etc. is a must for any environment where security could be a risk.
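If you do want to check what has been trusted, the keys yum has imported can be listed and inspected with rpm afterwards. A quick sketch, where the key ID is a placeholder for whatever shows up on your system, and the details can be checked against what Percona publish:

# rpm -q gpg-pubkey
# rpm -qi gpg-pubkey-<keyid>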
Once this is done on all 3 nodes, we are ready to move on to configuration.

Configuration

Out the box, Percona XtraDB is pretty much ready to go, and most of the config matches the basic required config in the Configuration docs. You probably still want to do some tuning (eg. Buffer Pool), but the fundamentals are there. To prepare the cluster we need to set a few things which already exist in the config but have generic values, starting with the cluster name:

wsrep_cluster_name = LabCluster
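The rest of the cluster settings follow the same pattern as the Percona configuration docs. A sketch of the sort of thing that ends up in the [mysqld] section of my.cnf, where the node names and addresses are made-up examples you would adjust per node, and the provider path is the usual CentOS 7 location (check where your packages put libgalera_smm.so):

wsrep_provider = /usr/lib64/libgalera_smm.so
wsrep_cluster_address = gcomm://192.168.1.11,192.168.1.12,192.168.1.13
wsrep_node_address = 192.168.1.11
wsrep_node_name = labdb1
wsrep_sst_method = xtrabackup-v2
# Galera needs row based replication and InnoDB tables
binlog_format = ROW
default_storage_engine = InnoDB
innodb_autoinc_lock_mode = 2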
Some extra stuff was discovered that is necessary to make Percona innobackupex work:

wsrep_sst_auth = sstuser:somesecretstuff
datadir = /var/lib/mysql

This specifies a user and password to be used for the state transfer backup (update to match your requirements), and explicitly states the datadir. Then, on the "Primary Component" only (as it gets replicated to the others), you will need to add the user:

# mysql
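The statements will be something along these lines, a sketch using the sstuser / somesecretstuff values from wsrep_sst_auth above (check the Percona docs for the exact privileges the xtrabackup SST needs on your version):

mysql> CREATE USER 'sstuser'@'localhost' IDENTIFIED BY 'somesecretstuff';
mysql> GRANT RELOAD, LOCK TABLES, REPLICATION CLIENT ON *.* TO 'sstuser'@'localhost';
mysql> FLUSH PRIVILEGES;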
Before we start

Docs say to do a load of things that weaken security, including disabling firewalls. This might be something you want to do if all your nodes are within the same network and it's adequately protected behind firewalls and other security measures. Where you do need to lock down better, Several Nines has a page of key ports used by Galera which you may want to look at carefully. Likewise with things like SELinux, but for the purposes of my Home Lab experiments, I am fully disabling firewalls:

# systemctl stop firewalld.service

We can play with this again later for testing isolating nodes. For SELinux, if you have moved /var/lib/mysql to a separate volume, chances are you've lost the SELinux contexts, so restore them with:

# restorecon -Rv /var/lib/mysql

... however that assumes you have an appropriate SELinux setup to start with. Out the box this still isn't going to work with Galera, so for the purposes of the Home Lab experiments where security is not important, I'm disabling it in /etc/selinux/config by setting:

SELINUX=disabled

After that you will need to reboot to fully disable SELinux.

Initialising the Cluster

The first time the cluster is started we need one node to be the "Primary Component" from which the others initialise as they join. For this we need to start our first node with the --wsrep-new-cluster option, however we hit a problem. Despite what lots of docs around say, it doesn't actually work to simply add this to the systemctl command line. After reading some unit files, the way to start up with this option is to use the mysql@* unit to load the corresponding /etc/sysconfig/mysql.* file of environment variables. In our case mysql@bootstrap provides the --wsrep-new-cluster option, so to initialise the Primary Component use:

# systemctl start mysql@bootstrap

Other nodes can then be initialised just by starting them up:

# systemctl start mysql

You can verify that the cluster has 2+ nodes by connecting to one and:

# mysql
....

OK so that sounds simple, but there is a lot that can break.

When stuff breaks

By default XtraDB uses the Percona backup tool, innobackupex, which won't actually work out the box. All the problems I found are hopefully documented above, but there's always a chance you will meet a new one. I found that in all cases the problems were solved by looking in /var/log/messages on BOTH the donor and joiner nodes.

Another problem is starting up servers when the whole cluster has been shut down (this should not happen in normal operation, but this is a Lab system after all!). What is necessary is to check /var/lib/mysql/grastate.dat and start the most advanced node (by seqno) first using the bootstrap:

# systemctl start mysql@bootstrap

Then start the others up as normal:

# systemctl start mysql

For this reason you probably shouldn't enable the nodes of a Galera cluster to start at boot.

What now?

So we have a working cluster now, but that's just the first bit. In order to take advantage of the cluster properly, the application has to be able to cope with node failure scenarios in an intelligent manner. The exact approach depends on the nature of the application, so this is really just some ideas.

If we have a geographically dispersed cluster (ie. each node in a separate Datacenter) then it probably makes some sense to split up the application the same way. No point in the cluster surviving a lost Datacenter but the application not! That allows us to point the local application at the local cluster node. We would need the load balancing for clients to use suitable healthchecks to take the application out of the pool if the cluster node it uses fails, but otherwise it's simple enough.

Another approach, probably more appropriate when all the nodes are located in the same Datacenter as the application (but it also works for the geographically dispersed scenario), would be to use load-balancing between the cluster nodes and the application as well as between the clients and the application nodes. This would ensure that any combination of database nodes and application nodes can fail and the overall application still remains resilient, utilising all working resources as effectively as possible.

There are obviously a lot of different ways you can play around with this and come up with all sorts of different approaches for slicing and dicing databases and applications depending on the usage scenario.

A basic test client

Now that we have a working cluster, we'll build a basic HAProxy client to go with it. This means that clients can go via HAProxy to load balance between live Galera nodes in the cluster.

HAProxy

For this we are using basic TCP load balancing to MySQL (port 3306), however this isn't really enough. With Galera, the MySQL instance may still be running and listening on the port, but due to loss of Quorum or other problems it may have stopped processing queries. This means that basic TCP load balancing will still think the node is running even though it is unable to service requests. Fortunately everything you need is already shipped with the Percona packages. This requires a GRANT (see the /usr/bin/clustercheck script) and xinetd installing, enabling and starting on each node:

# yum install xinetd

After that you should be able to telnet to port 9200 on each of the nodes and get a response:

# telnet ip.of.cluster.node 9200

This enables HAProxy to check this port. Then you will need HAProxy and the Percona (or other) client, which you can get with:

# yum install haproxy Percona-Server-client-56

... assuming you put the Percona repo on your client machine as above. Then you will need some config for HAProxy. In /etc/haproxy/haproxy.cfg remove the frontend and backend sections and replace them with a listen section:

listen galera *:3306
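The body of that listen section will be something along these lines, a sketch using the port 9200 clustercheck health check set up above, where the server names and addresses are made-up examples you would adjust to your nodes:

    mode tcp
    balance leastconn
    option httpchk
    server labdb1 192.168.1.11:3306 check port 9200 inter 5000 rise 3 fall 3
    server labdb2 192.168.1.12:3306 check port 9200 inter 5000 rise 3 fall 3
    server labdb3 192.168.1.13:3306 check port 9200 inter 5000 rise 3 fall 3

Here leastconn spreads connections across all live nodes; if you would rather send traffic to one node at a time, the other servers can be marked as backup instead.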
Since out the box SELinux doesn't like HAProxy listening on this port or talking to other servers, you will need to deal with that (or for the sake of my Home Lab, just disable SELinux), then get HAProxy up:

# systemctl enable haproxy

Once that's running, and assuming you have a valid GRANT in place for your user:

GRANT ALL ON mytestdb.* TO 'root'@'ip.of.client.machine' IDENTIFIED BY 'rootpassword';

Then you should be able to access the cluster via HAProxy:

# mysql --protocol=TCP --user=root --password=rootpassword --database=mytestdb

There are also a lot of tweaks you can do to the HAProxy config to set timeouts relating to how it will behave when things go wrong, as well as lots of other performance related things. That's probably best done in conjunction with your application, since if reliability is important then you will probably want to make simulation of these scenarios part of your release testing and have the whole system proven to be resilient end-to-end.

Doing real stuff

So far we have done some basic testing, however in the real world we need to be able to run real applications, and that could result in behaviour different from what we expect if something fails. Specifically, HAProxy will be polling for the healthcheck on port 9200, but if the node fails between polls (eg. loses connectivity to its peers) then your application may well get errors like "not yet prepared node for application use" or "unknown command" which you will need to handle in your application. HAProxy will find you a connection to a node, but good application programming and error handling is something that you will have to take care of up until the transaction is committed. This is true irrespective of whether you are using Galera with or without HAProxy, standalone MySQL, or any other database - they can all fail part way through an operation.
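As a crude Lab check of how this behaves (not a substitute for proper testing with your application), I stop MySQL on one cluster node and keep running a status query through HAProxy; the cluster size should drop by one while queries continue to be answered by the surviving nodes. On one of the cluster nodes:

# systemctl stop mysql

Then repeatedly from the client, via HAProxy:

# mysql --protocol=TCP --user=root --password=rootpassword --execute="SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size'"

The same wsrep_cluster_size status variable is also what I check directly on a node when verifying that members have joined the cluster.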