
Home Lab Project: Galera Clustering with Percona XtraDB

These days even non-critical (eg. Social/Entertainment) sites are starting to expect always-up functionality. Traditionally, critical workloads have had enormous amounts of money spent on multiply redundant hardware and HA software solutions, not to mention DR for when the worst happens. While this can reduce downtime when systems fail, HA normally still requires some downtime while services restart.

Only a true multi-master cluster can achieve near-seamless operation even when whole datacenters are lost. That's not to say it's a panacea, though: good design (both infrastructure and application) with quality parts and attention to detail is still necessary for this to work well.

Galera Clustering provides this for MySQL and related databases (MariaDB, Percona XtraDB) and is becoming a popular technology used by some of the world's largest tech companies.

Quorum

A fundamental concept in clustering is Quorum. Basically this is a voting system between multiple nodes (normally a minimum of 3) combined with some sort of safety mechanism.

For example, if we have 3 nodes and all of them can see that the others are healthy, we have a 100% healthy cluster. If one node fails, the remaining 2 nodes detect this and decide to exclude the failed node. If however you get a general networking failure, all nodes become isolated from each other but not necessarily from application nodes (eg. in a WAN failure in a geographically dispersed infrastructure, local application nodes can still communicate with the local database node). Each node then detects that it can only see itself as healthy, but a minimum of 2 nodes (typically the rule is to require more than half the cluster nodes) is required to run the cluster, so each node switches to a safe state (eg. refuses queries) to avoid split-brain.

In reality not all the nodes need to participate in data storage, but an odd number of nodes (so we have a distinct more-than/less-than-half condition for quorum) needs to participate in determining the health of the other nodes.

Another factor is how you dice up nodes or groups of nodes, because it isn't just nodes that can fail, but also their supporting infrastructure. Say for example you have 3 nodes: 2 in one datacenter and 1 in another. The first datacenter fails, leaving 1 functioning node in the second datacenter. Because less than half the cluster is operational, the remaining node will protect against split-brain and the cluster will be down from a single datacenter failure.

This means that we really need not just 3 cluster nodes minimum, but actually 3 of whatever supporting infrastructure is used around each node or group of nodes. If that's networking, then the networks (eg. switches) must not be shared between the 3 nodes. If it's datacenters, then the datacenters must not be shared.

Weighting can often also be applied to nodes. This means you can create many more combinations of nodes and groups of nodes, so long as there is a clear distinction in the number of (weighted) votes required. Personally I think this complicates things excessively and introduces the risk of a human error in the design creating an unsafe condition. Safest is to go with an odd number of infrastructure components and require more than half to be operational.

Unsafe Clusters

Believe it or not, I've come across clusters sold for Enterprise use which do not use Quorum. Instead they add additional checks, for example another heartbeat over a direct cable connection between two nodes... and so it goes, adding more and more checks.

The fundamental problem here is that the purpose of the cluster is to protect against fallible infrastructure, and ALL infrastructure is fallible. It will all fail at some point and there are a practically unlimited number of failure scenarios that can occur in a complex infrastructure.

The only safe situation is if there is sound logic (the less complex the better) in place that mitigates the failure (and most important, avoids split-brain). Simply throwing more safety checks at it is just playing with the odds, not actually providing the sound logic to avoid the inevitable failure.

Worse yet, often the additional checks are things like additional network channels. The problem here is that there are still multiple single points of failure: NIC, drivers, network stack, network configuration. Storage-based heartbeating helps mitigate that, but again it's still just playing with the odds. Add into the mix the additional complexity and the extra failure scenarios it generates, and you quickly realise that it's not solving the problem and might be creating more.

Any amount of additional safety checks is no substitute for sound logic to protect data integrity.

What Galera Does

Galera uses Quorum and requires more than half the nodes to be present. That means the logic is sound. If you design the infrastructure to work in this way then everything should work as intended.

Galera replicates data across multiple nodes, but also provides the ability to have Arbiter nodes (these take part in Quorum and pass data through, but don't store it). This means you have the flexibility to build a multiply redundant, geographically dispersed database infrastructure without every node having to be a big data storage node.
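
If you want to try an Arbiter node in the Lab, it runs as a separate lightweight daemon (garbd) rather than a full MySQL instance. A minimal sketch of running one, using the cluster name and node address placeholders from the config later in this article:

# garbd --group=LabCluster --address="gcomm://ip.of.node.number1,ip.of.node.number2" --daemon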

Data replication has to be synchronous (ie. all nodes update together), however this only has to occur on Commit. That means read queries (SELECT) can be done against any node without impact (in fact this effectively gives you scale-out functionality), but write queries (INSERT/UPDATE) only see latency on the Commit, which is generally going to be low impact on most well designed applications.

In failure scenarios where there are insufficient nodes to run the cluster (eg. isolation), the nodes error on all commands with "not yet prepared node for application use" or "unknown command" errors.
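
For example, on a node that has lost Quorum even a trivial query gets rejected with something along these lines (the exact error code may vary by version):

mysql> SELECT 1;
ERROR 1047 (08S01): WSREP has not yet prepared node for application use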

Building one

If you just want to get a cluster up and manage it easily rather than mess with all the internal detail then take a look at Several Nines. Their Cluster Control product is hard to beat with automated configuration of nodes and Web GUI management. Read no further.

In my case, for my Lab experiments, I want to work at a lower level of abstraction with more direct control, so that is what all this is about.

I have kicked off 3x CentOS 7 VMs and am ready to start. You may also like to create a separate volume for /var/lib/mysql, since this is where the main activity is going to happen and it could grow significantly with use.

Adding Repos and Installing

There are loads of docs available on installation, and in my case I'm going with Percona. Because of my yum proxy arrangement (using apt-cacher-ng), I have to manually download and install the package that provides the Percona repo:

# wget http://www.percona.com/downloads/percona-release/redhat/0.1-3/percona-release-0.1-3.noarch.rpm
.....
# yum localinstall percona-release-0.1-3.noarch.rpm

Then install with:

# yum install Percona-XtraDB-Cluster-56
Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
.....

At this point you will also be prompted to accept the GPG key. For the purposes of Home Lab experiments I'm not too fussed, but validating origin, fingerprints etc. is a must for any environment where security is a concern.

Once this is done on all 3 nodes, we are ready to move onto configuration.

Configuration

Out of the box, Percona XtraDB is pretty much ready to go: most of the shipped config matches the basic required config in the Configuration docs. You probably still want to do some tuning (eg. Buffer Pool), but the fundamentals are there.
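
For reference, the basic required settings amount to something along these lines (the provider path and SST method can differ between versions, so check what your packages shipped rather than copying this blindly):

wsrep_provider                 = /usr/lib64/galera3/libgalera_smm.so
binlog_format                  = ROW
default_storage_engine         = InnoDB
innodb_autoinc_lock_mode       = 2
wsrep_sst_method               = xtrabackup-v2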

To prepare the cluster we need to set a few things which already exist in the config, but with generic values:

wsrep_cluster_name             = LabCluster
wsrep_node_name                = Node2
wsrep_cluster_address          = gcomm://ip.of.node.number1,ip.of.node.number2,ip.of.node.number3

These are:

  • Give your cluster a name that will be valid (ie. uniquely describes it in your infrastructure)
  • Give each node a unique name in the cluster
  • List the addresses of all the nodes participating in the cluster

I also discovered some extra settings that are necessary to make Percona innobackupex work:

wsrep_sst_auth                 = sstuser:somesecretstuff
datadir                        = /var/lib/mysql/

This specifies a user and password to be used for the state transfer backup (update to match your requirements), and explicitly states the datadir.

Then, on the "Primary Component" only (as it gets replicated to the others), you will need to add the user:

# mysql
.....
mysql> GRANT RELOAD, LOCK TABLES, REPLICATION CLIENT ON *.* TO 'sstuser'@'localhost' IDENTIFIED BY 'somesecretstuff';

Before we start

Docs say to do a load of things that weaken security including disabling firewalls. This might be something you want to do if all your nodes are within the same network and it's adequately protected behind firewalls and other security measures.

Where you do need to lock things down better, Several Nines has a page of the key ports used by Galera which you may want to look at carefully. Likewise with things like SELinux. For the purposes of my Home Lab experiments, I am fully disabling firewalls:

# systemctl stop firewalld.service
# systemctl disable firewalld.service

We can play with this again later for testing isolating nodes.
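
If you would rather keep firewalld running, opening the usual Galera ports on each node should be enough: 3306 for MySQL clients, 4567 for group communication, 4568 for IST and 4444 for SST. Something along these lines:

# firewall-cmd --permanent --add-port=3306/tcp
# firewall-cmd --permanent --add-port=4444/tcp
# firewall-cmd --permanent --add-port=4567/tcp
# firewall-cmd --permanent --add-port=4568/tcp
# firewall-cmd --reload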

For SELinux, if you have moved /var/lib/mysql to a separate volume, chances are you've lost the SELinux contexts so restore them with:

# restorecon -Rv /var/lib/mysql

... however that assumes you have an appropriate SELinux setup to start with. Out of the box this still isn't going to work with Galera, so for the purposes of the Home Lab experiments where security is not important, I'm disabling it in /etc/selinux/config by setting:

SELINUX=disabled

After that you will need to reboot to fully disable SELinux.
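
Once rebooted you can confirm it has taken effect with:

# getenforce
Disabled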

Initialising the Cluster

The first time the cluster is started we need one node to be the "Primary Component" which the others initialise from as they join. For this we need to start our first node with the --wsrep-new-cluster option; however we hit a problem: contrary to lots of docs around, simply adding this option to the systemctl command line doesn't actually work. After reading some unit files, the way to start up with this option is to use the mysql@* unit, which loads the corresponding /etc/sysconfig/mysql.* file of environment variables. In our case mysql@bootstrap provides the --wsrep-new-cluster option, so to initialise the Primary Component use:

# systemctl start mysql@bootstrap
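
The mysql@bootstrap instance simply pulls in the corresponding environment file; it should contain something along the lines of the following (the exact variable name may differ between package versions, so check the unit file and sysconfig files on your own nodes rather than relying on this):

# cat /etc/sysconfig/mysql.bootstrap
EXTRA_ARGS=" --wsrep-new-cluster "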

Other nodes can then be initialised just by starting them up:

# systemctl start mysql

You can verify that the cluster has 2+ nodes by connecting to one of them and checking:

# mysql
.....
mysql> show STATUS like 'wsrep_cluster_size';
+--------------------+-------+
| Variable_name      | Value |
+--------------------+-------+
| wsrep_cluster_size | 2     |
+--------------------+-------+
1 row in set (0.00 sec)
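
A couple of other status variables are worth a look at the same time: on a healthy node wsrep_cluster_status should be Primary and wsrep_local_state_comment should be Synced:

mysql> show STATUS like 'wsrep_cluster_status';
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| wsrep_cluster_status | Primary |
+----------------------+---------+
1 row in set (0.00 sec)

mysql> show STATUS like 'wsrep_local_state_comment';
+---------------------------+--------+
| Variable_name             | Value  |
+---------------------------+--------+
| wsrep_local_state_comment | Synced |
+---------------------------+--------+
1 row in set (0.00 sec)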

.... OK, so that sounds simple, but there is a lot that can break.

When stuff breaks

By default XtraDB uses the Percona backup tool, innobackupex, for state transfer, which won't actually work out of the box. All the problems I found are hopefully documented above, but there's always a chance you will meet a new one.

I found that:

  • By default there is no backup user, so everything seems to go OK until innobackupex tries to "SHOW SLAVE STATUS" on the donor, and then fails with a permissions problem. The solution was found documented by Percona, but not in the Galera docs.
  • It seems that on restore, the joining node hits problems with not knowing where the datadir is... well, that was simple enough to fix.

In all cases the problems were solved by looking in /var/log/messages on BOTH the donor and joiner nodes.

Another problem is starting up servers when the whole cluster has been shut down (this should not happen in normal operation, but this is a Lab system after all!). What is necessary is to check /var/lib/mysql/grastate.dat and start the most advanced node (by seqno) first using the bootstrap:

# systemctl start mysql@bootstrap

Then start the others up as normal:

# systemctl start mysql
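
To see which node is the most advanced, compare the seqno values in grastate.dat across the nodes. The file looks roughly like this (the exact fields vary between Galera versions, and the uuid and seqno here are just placeholders):

# cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid:    <cluster state UUID>
seqno:   1234
cert_index: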

For this reason (the care needed when restarting a fully stopped cluster), you probably shouldn't enable nodes of a Galera cluster to start at boot.

What now?

So we have a working cluster now, but that's just the first bit. In order to take advantage of the cluster properly the application has to be able to cope with node failure scenarios in an intelligent manner. The exact approach depends on the nature of the application, so this is really just some ideas.

If we have a geographically dispersed cluster (ie. each node in a separate Datacenter) then it probably makes sense to split up the application the same way. There is no point in the cluster surviving a lost Datacenter if the application doesn't! That allows us to point the local application at the local cluster node. We would need the load balancing for clients to use suitable healthchecks to take an application node out of the pool if the cluster node it uses fails, but otherwise it's simple enough.

Another approach, probably more appropriate when all the nodes are located in the same Datacenter as the application (though it also works for the geographically dispersed scenario), would be to use load-balancing between the cluster nodes and the application, as well as between the clients and the application nodes. This ensures that any combination of database nodes and application nodes can fail and the overall application still remains resilient, utilising all working resources as effectively as possible.

There are obviously a lot of different ways you can play around with this and come up with all sorts of different approaches for slicing and dicing databases and application depending on the usage scenario.

A basic test client

Now that we have a working cluster, we'll build a basic HAProxy client to go with it. This means clients can go via HAProxy to load-balance between the live Galera nodes in the cluster.

HAProxy

For this we are using basic TCP load balancing to MySQL (port 3306), however this isn't really enough. With Galera, the MySQL instance may still be running and listening on the port, but due to loss of Quorum or other problems it has stopped processing queries. This means that basic TCP load balancing will still think the node is running even though it is unable to service requests.

Fortunately everything you need is already shipped with the Percona packages. This requires a GRANT (see the /usr/bin/clustercheck script) and xinetd being installed, enabled and started on each node:

# yum install xinetd
.....
# systemctl enable xinetd
# systemctl start xinetd
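
The GRANT needs to match the credentials the clustercheck script uses, so check /usr/bin/clustercheck for the user and password it expects; the commonly shipped defaults are along these lines:

# mysql
.....
mysql> GRANT PROCESS ON *.* TO 'clustercheckuser'@'localhost' IDENTIFIED BY 'clustercheckpassword!';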

After that you should be able to telnet to port 9200 on each of the nodes and get a response:

# telnet ip.of.cluster.node 9200
Trying 10.146.39.230...
Connected to 10.146.39.230.
Escape character is '^]'.
HTTP/1.1 200 OK
Content-Type: text/plain
Connection: close
Content-Length: 40

Percona XtraDB Cluster Node is synced.
Connection closed by foreign host.

This enables HAProxy to check this port.

Then you will need HAProxy and the Percona (or other) client which you can get with:

# yum install haproxy Percona-Server-client-56

... assuming you put the Percona repo on your client machine as above.

Then you will need some config for HAProxy. In /etc/haproxy/haproxy.cfg remove the frontend and backend sections and replace them with a listen section:

listen galera *:3306
    mode tcp
    option httpchk
    balance leastconn
    server  node1 ip.of.node.number1:3306 check port 9200
    server  node2 ip.of.node.number2:3306 check port 9200
    server  node3 ip.of.node.number3:3306 check port 9200
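
Before starting HAProxy you can sanity-check the edited config with:

# haproxy -c -f /etc/haproxy/haproxy.cfg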

Since out of the box SELinux doesn't like HAProxy listening on this port or talking to other servers, you will need to deal with that (or, for the sake of my Home Lab, just disable SELinux), then get HAProxy up:

# systemctl enable haproxy
# systemctl start haproxy

Now, if you have all that in place, and assuming you have a valid GRANT for your user:

GRANT ALL ON mytestdb.* TO 'root'@'ip.of.client.machine' IDENTIFIED BY 'rootpassword';

Then you should be able to access the cluster via HAProxy:

# mysql --protocol=TCP --user=root --password=rootpassword --database=mytestdb
.....
mysql>  SELECT * FROM test;
+----+--------+
| id | info   |
+----+--------+
|  3 | test00 |
+----+--------+
1 row in set (0.01 sec)

There are also a lot of tweaks you can do to the HAProxy config to set timeouts relating to how it behaves when things go wrong, as well as lots of other performance related things. That's probably best done in conjunction with your application: if reliability is important then you will probably want to make simulating these scenarios part of your release testing and have the whole system proven to be resilient end-to-end.
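
As a rough starting point only (the right values depend entirely on how your application holds and reuses connections), the sort of thing you would be tuning lives in the defaults section, eg.:

defaults
    mode    tcp
    timeout connect 5s
    timeout client  30m
    timeout server  30m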

Doing real stuff

So far we have done some basic testing, however in the real world we need to run real applications, and that could result in different behaviour from what we expect when something fails. Specifically, HAProxy will be polling the healthcheck on port 9200, but if the node fails between polls (eg. loses connectivity to its peers) then your application may well get errors like "not yet prepared node for application use" or "unknown command" which you will need to handle in your application.

HAProxy will find you a connection to a node, but good application programming and error handling is something you will have to take care of right up until the transaction is committed. This is true irrespective of whether you are using Galera with or without HAProxy, standalone MySQL, or any other database - they can all fail part way through an operation.