Reply to Aphyr, part 2

antirez 4544 days ago. 10988 views.

Thanks to Aphyr that replied to my reply here: http://aphyr.com/posts/287-asynchronous-replication-with-failover

I'll try to continue the good exchange. First of all I would propose the notation Aphyr used in his blog post as the standard for Redis network representation.
As you can see the rules are simple and nice, and very expressive:

1) R is Redis, S is Sentinel, C is Client.
2) An R inside a square is a master.
3) There is a segment between every instance that can replicate with another one.
4) Letters not separated by spaces, like CS, means machines are inside the same virtual or physical computer.
5) An arrow shows promotion, like |R| -> R

In order to specify Sentinel quorum I add Q<quorum>. Example Q5.

A matter of point of view
===

<aphyr> If you use any kind of failover, your Redis system is a distributed store. Heck, reading from secondaries makes Redis a distributed store.

I think we agree about that, you can surely consider the system as a whole as a distributed system and this is why I recognized the analysis as valid. I just meant that while the system has distributed system properties Sentinel itself is focused into having a specific role, and with this limited role as Aphyr shows Sentinel can't change the fact that Redis is a master / slave system using asynchronous replication.

<antirez> Do you want your Redis master stopping to accept writes when it is no longer able to replicate to its slaves?

<aphyr> Yes. This is required for a CP system with failover.

Here there is simply a different point of view: Apyhr is reasoning in terms of formal correctness, while I'm reasoning in terms of real-world production environments. Note that the two things have big margins of overlaps. For example AP systems are important in the real world, and similarly even the most theoretician can't deny that splits does not happen at random in most setups, but there is a bigger probability of a split happening along defined lines (like a switch, a wan connection, or along the connection of a single host).

So basically what Aphyr suggests is that the following partition should stop the master from working (I'll change his notation into ASCII art, pardon):

R R / |R|
\
/ CS CS CS
\
/ Q2

Even if clients are perfectly able to write to the master and the partition would resolve without issues.

Who is right here? It is simply a matter of assumptions. Redis Sentinel assumes that in your environment there are the following two characteristics:

1) You most of the times see single hosts going down.
2) When a partition happen, it has a very large probability to happen in ways you can predict.

If "1" and "2" are true, you can place Sentinels in a way that will continue your operations even if theoretically you should sacrifice availability.

When instead this is not true and a completely random (or attacker chosen) partition happens, you'll need to consider the system as an eventual consistent system where the merge step is just destroying the data of the minority side (!).

More about this later, now following the order of Aphyr's article, I want to open a short parenthesis.

Redis as an AP system
===

<aphyr> You could do this as an application developer by setting every Redis node to be a primary, and writing a proxy layer which uses, say, consistent hashing and active anti-entropy to replicate writes between nodes. Take a look at Antirez's own experiments in this direction (http://antirez.com/news/36).

About that, probably it is the good time to generalize a bit the ideas expressed in that blog post Aphyr mentioned.

When Aphyr attacks Riak in his series, you see that basically with Dynamo-alike systems you either use application-assisted merges of values or you are going to run into data loss. Yet most users, also encouraged by the fact that for instance Riak uses last-write-wins as a default setting, think at AP systems as magical devices that will preserve you data.

However once you want to follow the right rules, merging stuff with the help of the application, you are d00med again, as it's a PAIN IN THE ASS. It is just for that that Riak switched to last-write-win as a default: systems need to be practical and obey to the user base needs, regardless of the fact that this means data loss, just people should be aware of that (and Aphyr contributed to the good cause). What another parallel with IT Security? Database behavior must be psychologically acceptable for the programmer, like security.

Now what I think is that you can have the cake and eat it if you mix Redis data types with the Dynamo model, that is, use Redis data structures and characterize every different data structure with a "type specific merge strategy".

Sets: take the union of the two sets. Possible use: shopping carts.
Lists: augment every element using clocks. On merge add elements not common to both the lists preserving an approximated ordering using clocks.
Sorted sets: take the union of all the elements, score is last-write-wins.
Hashes: take the union of all the fields of the represented object, on value conflict, last write wins.
Strings: last-write-wins

By moving specific semantics into specific types, the user may choose the right data type depending on the expected behavior on merge. When the need is to retain most of the information even if duplicates will appear, use Sets. When you want to save order, like in time series, use lists, and so forth.

There is a lot of work to do into understanding what are the most sensible merging strategies in order to model many problems with minimal efforts, the above is just an example. But long story short, this way you have an AP store that does not present you duplicated values asking for a merge, but at the same time as long as you use the right data type, is able to guarantee out of the box some defined merge behavior.

p.s. I would add "counters" as a native data type for such a store since they require special handling to obtain a good result.

About topology
===

<antirez> … place your Sentinels and set your quorum so that you are defensive enough against partitions. This way the system will activate only when the issue is really the master node down, not a network problem. Fear data loss and partitions? Have 10 Linux boxes? Put a Sentinel in every box and set quorum to 8.

<aphyr> I… can't parse this statement in a way that makes sense. Adding more boxes to a distributed system doesn't reduce the probability of partitions–and more to the point, trying to determine the state of a distributed system from outside the system itself is fundamentally flawed.

my point is:

1) If you place sentinels with clients, you can use the quorum to dictate how big a minority writing to the old master can be (the part of writes you'll lose)
2) If you have ways to predict different network splits probabilities, you can use this superpower to place C, S, and R so that the probability of a "bad" partition (one leading to data loss) is small.

Also the examples Apyhr is doing there is about sentinels running alone in their instances. This is fair if you want to add to your analysis the further rule that even processes running in the same physical computer have different logical connections to the network, but I think this is a very strict assumption to do. I would assume that if clients and sentinels run in the same host there is a very small probability of partitions like in the examples.

I think that within the scope of real world networks there is a value in placing the "observer" in different places since you can model the way it fails. Especially when you have a few Redis nodes, and a big number of clients, to dictate the availability from the point of view of Redis nodes may more likely result into unwanted / not necessary fail overs, like in the following example:

C C C C C C C C C C
|RS|
- - - - - - - - - - - -
RS RS Q2

Here you would see a promotion in the side without clients, and all the queries destroyed on rejoin. Or following Apyhr advice to have a pure CP system, this would result into sacrificing availability: all the clients unable to write to the master that is otherwise working.

I would sleep better with:

CS CS CS CS CS CS
CS CS CS CS Q8
|R| R R

In that setup I know two things:

1) the system will try hard to go forward as long as there is a DB visible by the majority of clients.
2) if bad partitions happen, for losing data I need to have at max 2 clients isolated with the old master, otherwise only the other side will be able to write and no data loss will result after rejoin.

So my point is that if you augment CAP with probability of failure and partitions of a given type, and add to the mix a master -> slave setup with asynchronous replication, the result is that to design practical systems apparently violating good theoretical sometimes is a good idea.

Not just Sentinel
===

Aphyr than continues saying that all the master -> slave systems with asynchronous replication are like that.

I don't agree in general, what I mean is that in systems like Redis Cluster, where monitoring is performed among nodes and only the side with the majority of master nodes continues to operate, only provide a *window* of data loss. After a few seconds (depending on the configuration) the minority side stops accepting writes: It is a master -> slave system in its essence but with time-bound data loss on splits.

The net result is that, yes, still you may lose data, but the fact the data loss is limited in time completely changes the kind of applications you may use a given system for.

Thanks again to Aphyr for the good exchange.