Reply to Aphyr's attack on Sentinel

antirez 4463 days ago. 289170 views.

In a great series of articles Kyle Kingsbury, aka @aphyr on Twitter, attacked a number of data stores:

[1] http://aphyr.com/tags/jepsen

Postgres, Redis Sentinel, MongoDB, and Riak are audited to find out what happens during network partitions and whether these systems actually provide the guarantees they claim.

Redis is attacked here: http://aphyr.com/posts/283-call-me-maybe-redis

I deliberately said that Kyle "attacked" the systems, as I see a parallel with the world of computer security here. It is really a good idea to bring this paradigm to the database world, to show the failure modes of systems against the claims of vendors. As in the security world, the vendor may take the right steps to fix the system when possible, or at least the user base will be able to recognize that under certain circumstances something bad is going to happen to their data.

Another awesome thing in Kyle's series is the educational tone: almost nothing is taken for granted, and the articles can be read by anyone from people who never cared about databases to distributed systems experts. Well done!

In this blog post I'll try to address the points Kyle made about Redis Sentinel, which is the system he tested.

Sentinel goals
===

In the post Kyle writes "What are the consistency and availability properties of Sentinel?".
Probably this is the only flaw I saw in this article.

Redis Sentinel is a distributed *monitoring* system, with support for automatic failover.
It is in no way a shell that wraps N Redis instances into a distributed data store. So if you consider the properties of "Redis data store + Sentinel", what you'll find are the properties of any Redis master-slave system where an external component can promote a slave to master under certain circumstances, but has limited ability to change the fundamental fact that Redis, without Redis Cluster, is not a distributed data store.

However it is also true that Redis Sentinel acts as a configuration device, even relying on the help of clients, so as a whole it is a complex system with a given behavior that's worth analyzing.

What I'm saying here is just that the goal of the system is:

1) To promote a slave into a master if the master fails.
2) To do so in a reliable way.

All the stress here is on point "2": because Sentinels can be placed outside the master-slave system, the user can choose a more objective point of view from which to declare the master as failing.

Another property is that Sentinel is distributed enough that single Sentinels can fail at any time, including during the failover process, and the process will still continue unaffected as long as it is still possible to reach the majority.

I think that the goal of Redis Sentinel is pretty clear, so I'm surprised (not in a negative way) that it was tested by creating a partition where the old master is in the minority together with a client, and then showing that the client was still able to write to the old master. I honestly don't think any user expects something different from Redis Sentinel. That said, I'll ignore this fact from now on and reply to the different parts of the article, as there is important information anyway IMHO, especially since, after all, Redis Sentinel + N Redis instances + M Clients is "A System", so Kyle's analysis makes sense even under my above assumptions.

Partitioning the cluster
===

OK, I just made it clear that Sentinel has no goal of turning N Redis instances into a distributed store, so basically what happens is that:

1) Clients on the majority side will be able to continue to write once the failover is complete.
2) Clients on the minority side may still write to the old master, and when the network is OK again, the old master will be turned into a slave of the new master, so all the writes made on the minority side are lost forever.
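The two cases above can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions: `Node`, `write`, and `become_slave_of` are hypothetical names, replication is modelled as a full copy of the master's data set, and nothing here talks to a real Redis instance.

```python
class Node:
    """Toy stand-in for a Redis instance (hypothetical, not a real API)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value):
        self.data[key] = value

    def become_slave_of(self, master):
        # Assumption: on resync the slave discards its own data set
        # and takes a full copy of the master's.
        self.data = dict(master.data)


old_master = Node("old-master")
slave = Node("slave")

# Before the partition both nodes hold the same data.
old_master.write("a", 1)
slave.become_slave_of(old_master)

# Partition: the slave is promoted on the majority side.
new_master = slave
new_master.write("b", 2)   # client on the majority side
old_master.write("c", 3)   # client stuck on the minority side

# Partition heals: the old master is reconfigured as a slave.
old_master.become_slave_of(new_master)

print(sorted(old_master.data))  # ['a', 'b'], the write to "c" is gone
```

The write to `"c"` was acknowledged by the old master, yet it does not survive the reconfiguration, which is exactly the loss described in point 2.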

So you can say, OK, Sentinel has a limited scope, but could you add a feature so that when the master feels it is in the minority it no longer accepts writes? I don't think it's a good idea. What does it mean to be in the minority for a Redis master monitored by Sentinels (especially given that Redis and Sentinel are completely separate systems)?

Do you want your Redis master to stop accepting writes when it is no longer able to replicate to its slaves? Or do you want that when enough Sentinels are down? My guess is that, given the goals of the system, instead of going down the road of stopping the master for possibly harmless conditions (or conditions not as bad as a stopped master), you should just use the fact that Sentinel is very configurable: place your Sentinels and set your quorum so that you are defensive enough against partitions. This way the system will activate only when the issue is really the master node being down, not a network problem. Fear data loss and partitions? Have 10 Linux boxes? Put a Sentinel on every box and set the quorum to 8.
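To see how the quorum choice plays out, here is a small sketch. The helper `can_flag_master_down` is a hypothetical name, and the model is deliberately simplified: it only counts how many Sentinels agree that the master is unreachable, and ignores everything else a real failover needs.

```python
def can_flag_master_down(sentinels_agreeing, quorum):
    """Simplified model: enough Sentinels must agree the master is down."""
    return sentinels_agreeing >= quorum


TOTAL_SENTINELS = 10
QUORUM = 8

# A network split that cuts 7 Sentinels off from the master is not enough.
print(can_flag_master_down(7, QUORUM))                # False

# If the master process really dies, all 10 Sentinels lose it.
print(can_flag_master_down(TOTAL_SENTINELS, QUORUM))  # True

# Numbers of agreeing Sentinels that can trigger the failover.
triggering = [n for n in range(TOTAL_SENTINELS + 1)
              if can_flag_master_down(n, QUORUM)]
print(triggering)                                     # [8, 9, 10]
```

In a Sentinel configuration file the quorum is the last argument of the `sentinel monitor` directive, so the value 8 used here would be set there.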

Just to be clear, the criticism is a good one, and it shows that Sentinel is not good at handling complex net splits with minimal data loss. It is just that this was never the goal, and what users were doing with their home-made scripts to handle failover was in 99% of cases much worse than what Sentinel achieves in terms of failure detection and handling of the failover process.

Redis consensus protocol
===

Another criticism is that the Sentinel protocol is complex to analyze, and even requires some help from the client.

It is true that it is a complex protocol, because while the agreement is vaguely Byzantine-looking, it is actually a dynamic process without an ordered number of steps to reach an agreement. The state about different things, like whether a node is failing or not and who should perform the promotion, is simply broadcast continuously among Sentinels.

A majority is basically reached when the state of N nodes (with N >= quorum), no older than a given number of seconds, agrees about something.
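As a rough illustration of that rule, the sketch below counts only the reports that are both recent and in agreement. The function `agreement_reached`, the shape of `reports`, and the claim strings are assumptions made for this example, not the actual Sentinel data structures.

```python
import time


def agreement_reached(reports, claim, quorum, max_age_seconds, now=None):
    """Return True if at least `quorum` fresh reports match `claim`.

    `reports` maps a sentinel id to a (claim, timestamp) pair.
    """
    if now is None:
        now = time.time()
    fresh_and_agreeing = sum(
        1
        for reported_claim, timestamp in reports.values()
        if reported_claim == claim and now - timestamp <= max_age_seconds
    )
    return fresh_and_agreeing >= quorum


now = 1000.0
reports = {
    "sentinel-1": ("master-down", 998.0),
    "sentinel-2": ("master-down", 999.5),
    "sentinel-3": ("master-down", 900.0),  # too old to count
    "sentinel-4": ("master-up", 999.0),    # disagrees
}

print(agreement_reached(reports, "master-down", quorum=2,
                        max_age_seconds=5, now=now))  # True
print(agreement_reached(reports, "master-down", quorum=3,
                        max_age_seconds=5, now=now))  # False
```

Note that the stale report from `sentinel-3` is ignored even though it agrees, which is what makes the agreement a moving target rather than a fixed sequence of steps.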

Both failure detection and the election of the Sentinel doing the failover are reasonable candidates for this informal protocol, since the information every Sentinel has about the availability of a given instance, or of another Sentinel, is itself a moving target. Also the rest of the system is designed to be resistant against errors in the agreement protocol (the first Sentinel recognizing a failure will force all the others to recognize it, and the failover process is auto-detected by the other instances, which can monitor the elected slave. Care is also taken to avoid a protocol that is fragile against multiple Sentinels doing the failover at the same time, if this should ever happen).

Kyle notes that there is the concept of TILT, so Sentinel is sensitive to clock skew and desynchronization. Actually there is no explicit use of absolute time in the protocol, nor are Sentinels required to have synchronized clocks at all.

Just to clarify, TILT is a special mode that is used when Sentinel detects that its internal state may be corrupted in one of two ways: either the system clock jumped into the past, so a Sentinel can no longer trust its *internal* state, or the clock appears to have jumped into the future, which means the Sentinel process was for some reason blocked for a long time. In both cases such a Sentinel will enter TILT mode, so it will stop acting for some time, until its state is believed to be reliable again. TILT is basically not part of the Sentinel protocol, but just a programming trick to make the system more reliable in the presence of strange behaviors from the operating system.
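The idea can be sketched as a check on the time elapsed between two consecutive timer ticks. Everything below is an illustrative assumption: the class `TiltDetector`, its method names, and the two threshold values are chosen for the example and are not taken from the Sentinel source.

```python
TILT_TRIGGER_MS = 2000   # assumed threshold, for illustration only
TILT_PERIOD_MS = 30000   # assumed quiet period, for illustration only


class TiltDetector:
    """Detects suspicious time jumps between periodic timer ticks."""

    def __init__(self, now_ms):
        self.previous_ms = now_ms
        self.tilt_started_ms = None

    def tick(self, now_ms):
        """Call on every timer tick; returns True while in TILT mode."""
        delta = now_ms - self.previous_ms
        self.previous_ms = now_ms
        if delta < 0 or delta > TILT_TRIGGER_MS:
            # Clock went backwards, or the process was blocked too long.
            self.tilt_started_ms = now_ms
        elif (self.tilt_started_ms is not None
              and now_ms - self.tilt_started_ms > TILT_PERIOD_MS):
            # Time has behaved normally for long enough: leave TILT.
            self.tilt_started_ms = None
        return self.tilt_started_ms is not None


detector = TiltDetector(now_ms=0)
print(detector.tick(100))    # False: a normal tick
print(detector.tick(5000))   # True: 4.9 seconds passed between ticks
print(detector.tick(5100))   # True: still inside the quiet period
```

Only differences between local readings are used, so the check does not depend on the clocks of different Sentinels being synchronized.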

Involvement of the clients
===

In Sentinel, client involvement is not mandatory, since you may instead run a script during a failover so that the configuration changes in some permanent way.

However the suggested mode of operation is to use clients that refresh the information when a reconnection is needed (actually we are moving in the direction of forcing a periodic state refresh, and when Sentinel demotes a reappearing old master we'll send a command to the old master that forces all the connections to be dropped, which improves the reliability of the system considerably).
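A minimal sketch of such a client follows. `RefreshingClient`, `discover_master`, and the `ask_sentinel` callable are hypothetical names for this example; `ask_sentinel` stands in for whatever sends the `SENTINEL get-master-addr-by-name` command to one Sentinel, and is replaced here by a fake so the example is self-contained.

```python
def discover_master(sentinels, ask_sentinel):
    """Return the master address from the first Sentinel that answers.

    `ask_sentinel(sentinel)` is a hypothetical callable: it returns a
    (host, port) tuple, or raises ConnectionError if unreachable.
    """
    for sentinel in sentinels:
        try:
            return ask_sentinel(sentinel)
        except ConnectionError:
            continue
    raise ConnectionError("no Sentinel reachable")


class RefreshingClient:
    def __init__(self, sentinels, ask_sentinel):
        self.sentinels = sentinels
        self.ask_sentinel = ask_sentinel
        self.master = None

    def reconnect(self):
        # Never reuse the cached address: ask the Sentinels again.
        self.master = discover_master(self.sentinels, self.ask_sentinel)
        return self.master


# Fake Sentinel replies, standing in for real network calls.
current_master = {"addr": ("10.0.0.1", 6379)}


def fake_ask(sentinel):
    if sentinel == "s1":
        raise ConnectionError("s1 unreachable")
    return current_master["addr"]


client = RefreshingClient(["s1", "s2"], fake_ask)
print(client.reconnect())                    # ('10.0.0.1', 6379)
current_master["addr"] = ("10.0.0.2", 6379)  # a failover happened
print(client.reconnect())                    # ('10.0.0.2', 6379)
```

The design choice illustrated here is that the master address is treated as a hint that expires on every reconnection, so a client that gets disconnected from a demoted master finds the new one.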

So in the article I can read:

* Sentinels could promote a node no clients can see
* Sentinels could demote the only node clients can actually reach
* …

And so forth. Again, the point here is that Sentinel is designed exactly to let you pick your tradeoffs from that point of view, and the documentation suggests that your Sentinels run on the same machines where you run your clients, web servers, and so forth, not on the Redis server nodes.

Because indeed, the point of view from which you want to say that something is "down" is almost always the point of view of the client.

Broken things Kyle did not investigate
===

Kyle did great work showing you what you should *not* expect from Sentinel.

There is much more we have to fix, because HA is a complex problem in master -> slave systems. For instance the current version of Sentinel does not handle well enough reconfigured instances that reboot with an old config: sometimes you may just lose a slave, which ends up ignored by the Sentinels.

This and other problems are still a work in progress, and what I'm trying to provide with Redis Sentinel is a monitoring and failover solution that does not suck so much, as in, you can select the point of view of what "down" means, both topologically and as a quorum, and you can be sure that a few Sentinels going away will not break your failover process.

Redis Cluster
===

Redis Cluster is a system much more similar to what Kyle had in mind when testing Sentinel. For instance, after a split the side with the minority of slaves will stop accepting writes, so while there is always a window for data loss, in the big picture there is always only a single part of the network that accepts writes.

I invite you to read the other articles in Kyle's series; they are very informative.

Thank you Kyle, please keep attacking databases.