antirez 1278 days ago. 119203 views.
This blog post describes the new algorithm used in Redis Cluster in order to propagate and update metadata, that is hopefully significantly safer than the previous algorithm used. The Redis Cluster specification was not yet updated, as I'm rewriting it from scratch, so this blog post serves as a first way to share the algorithm with the community.

Let's start with the problem to solve. Redis Cluster uses a master - slave design in order to recover from nodes failures. The key space is partitioned across the different masters in the cluster, using a concept that we call "hash slots". Basically every key is hashed into a number between 0 and 16383. If a given key hashes to 15, it means it is in the hash slot number 15. These 16k hash slots are split among the different masters.

At every single time only one master should serve a given hash slot. Slaves just replicate the master dataset so that it is possible to fail over a master and put the cluster again into an usable state where all the hash slots are served by one node.

Redis Cluster is client assisted and nodes are not capable to forward queries to other nodes. However nodes are able to redirect a client to the right node every time a client tries to access a key that is served by a different node. This means that every node in the cluster should know the map between the hash slots and the nodes serving them.

The problem I was trying to solve is, how to take this map in sync between nodes in a safe way? A safe way means that even in the event of net splits, eventually all the nodes will agree about the hash slots configuration.

Another problem to solve was the slave promotion. A master can have multiple slaves, how to detect, and how to act, when a master is failing and a slave should be promoted to replace it?

Metadata is not data

In the case of Redis Cluster handling of metadata is significantly different than the way the user data itself is handled. The focus of Redis Cluster is:

1) Speed.
2) No need for merge operations, so that it is semantically simple to handle the very large values typical of Redis.
3) The ability to retain most writes originating from clients connected to the majority of masters.

Given the priorities, Redis Cluster, like the vanilla single node version of Redis, uses asynchronous replication where changes to the data set are streamed to slave nodes with an asynchronous acknowledgement from slaves. In other words when a node receives a write, the client most of the times directly talk with the node in charge for the key hash slot, and the node has no other chatting to do with other nodes.

However this means that Redis Cluster is not a true CP system: there is a window where writes can be lost. The trivial case to lose a write is to write to a master that stops working after accepting our write and replying to the client, but before being able to propagate the write to its slaves.

This window is very small, in the sub-millisecond range. However when a client is partitioned away from the majority of the master nodes there is a bigger window to lose data, as a partitioned master will continue to accept writes for some time, while on the majority side the same master may be failed over by a slave. So when the partition will be fixed, the master will be reconfigured as a slave and writes will be lost.

Apart from the replicas, a key is stored into a single master node, so there is no need to agree or merge its value. Given the design, there is no need to use an agreement protocol in order to write or read data to/from the cluster. However metadata is a different story, we want that every node has a coherent vision of the cluster setup, and that the right configuration is eventually propagated to all the nodes even in case of failures and network partitions.

Using an agreement algorithm

The simplest way to solve such a problem is to use a consensus algorithm such as Paxos or Raft, and this was the direction I was going to take. However implementing consensus algorithms is hard. Actually it is so hard that often years are needed for implementations to stabilize.

At some point I noticed something that makes the work of Redis Cluster nodes simpler, that is, information about hash slots is always idempotent. If hash slot 5 is served by A, and later because the configuration changes hash slot 5 is served by B, nodes don't need to know what happened, everything they need is that the configuration for an hash slot was updated.

This changes everything basically: agreement protocols are able to take a state machine synchronized by running the same sequence of operations in every node. If the state machine is deterministic, then the internal state will be the same in all the nodes eventually. However all the state Redis Cluster needs, for a given slot, can be embedded into a single packet.

Because of this we don't need a log of operations stored on disk, nor a way to make sure to fetch all the operations still not fetched, or to figure out what should be applied and what not, all the state can be copied in a single message. In short Redis Cluster does not require a full agreement protocol so I stolen just what I needed from Raft, and tried to figure out the rest.

Failure detection

In order to see if a node has issues, Redis Cluster still uses the old algorithm that is based on gossip. Nodes inform other nodes about the state of a few random nodes using ping / pong packets. These ping / pong packets are in turn used to check if a node is reachable from the point of view of the sender of the ping. If the (informal) majority of masters agree that a given node is not reachable, then the node is flagged as failing. I said "informal" as there is no true agreement here, but simply:

1) Every node flags other nodes are failing if the majority of master nodes signaled the node as down in a given time range.
2) Every node removes the flag if the node is back reachable and is a salve, or a master that after some time is still serving slots from our point of view (was not failed over).

The failure detection is completely informal and has the only property that eventually all the nodes will agree on the failure: either the majority of nodes will mark it as failing resulting into a chain effect that will force all the other nodes to mark the node as failing, OR there is no majority and if the node is reachable again everybody will clear the flag.

The point here is that the failure detection does not require any safety, it is only useful in order to trigger the safe part of the algorithm, that is, replacing the master with a slave and update the cluster configuration.

Slave promotion

Promoting a slave must be a safe operation, and one that should ensure that the configuration change is propagated across the cluster as soon as possible.

Slave promotion is up to slaves and uses a mechanism very similar to the Raft algorithm leader election. This is what happens:

1) A slave detects its master is failing.
2) The slave will try to win the election in order to promote itself to master.
3) If it is successful, it will change its state and will advertise the new configuration.
4) If it is unsuccessful it will try again to win the election after some time.

Every slave will try to start the election at a slightly different time in order to avoid a split brain condition that will require a new election. Redis Cluster uses a random delay that is driven by the number of seconds a slave was disconnected from the master, in order to favor slaves that were able to talk with the master more recently (slaves with too old data don't try at all to get elected).

Every cluster node has the concept of currentTerm as in Raft, that is called currentEpoch in Redis Cluster. Every node tries to have a currentEpoch that is the highest found among other nodes, so this information is always added in ping /pong packets headers. Every time a node sees a currentEpoch of another node that is greater than its epoch, it updates its currentEpoch.

The election is like Raft: a slave that tries to get elected increments its currentEpoch and sends a failover-auth-request to every master hoping to get its vote. Masters refuse to vote if the master instance of the slave is not failing from the point of view of a given master, or if the currentTerm advertised by the slave is smaller than the currentTerm of the receiving master.

Also masters vote a single time for every epoch: this ensures that for every epoch we can have just a single winner, this is central both in the Raft algorithm and in the Redis Cluster slave election.

Basically, if a slave wins the election, it uses the epoch at which the election was won as the version of its configuration, and newer configurations always win over older configurations.

The configEpoch

In order to make more clear how it works, we need to add some more information. Basically every ping / pong packet does not just publish the currentEpoch, but also the configEpoch, that is, the epoch at which the master started to serve its set of hash slots. Initially when a cluster is created every master uses a configEpoch of zero. As failover events happen, there will be nodes with greater configEpoch values.

As in the old algorithm, the ping and pong packets also carry a bitmap with the hash slots served by a given node. Since every node knows the last observed configEpoch of every other node, it can detect configuration changes to incorporate.

For instance if node B claims to serve hash slot 5 that was previously served by node A, but the configEpoch of node B is greater than the configEpoch we have for A, then we accept the new configuration.

The same mechanism is also used in order to reconfigure a reappearing master as a slave, or to reconfigure all the other slaves after a failover. The old master's served slots count will drop to zero, and the nodes will switch as replicas of the node that is serving the slots now.

The real algorithm used has more details that don't change the semantics, but make everything more fluid in common cases. For example after a slave wins an election it broadcasts a PONG to every node in order to make the configuration change faster, and to prevent other slaves from initiating a new useless election.

Similarly a master that was partitioned out from the majority for enough time (the same time needed to flag it as failing) stop accepting writes, and will not accept writes for a few more seconds even after the majority of masters is reachable again, in order to give some time to the other nodes to inform it of configuration changes. This makes less likely that a client with an old routing table will try and succeed to write to the returning master that is now failed over.

From the point of view of the code, the implementation is requiring a minor amount of code, as everything was already implemented in the old algorithm even if in a broken way, it was unsafe but the basic abstractions and message formats were ok.

All in all I'm failing in love with distributed programming and I hope to learn more in the next weeks...
blog comments powered by Disqus