Redis as AP system, reloaded

antirez 4612 days ago. 273706 views.

So finally something really good happened from the Redis criticism thread.

At the end of the work day I was reading about Redis as AP and merge operations on Twitter. At the same time I was having a private email exchange with Alexis Richardson (from RabbitMQ, and, my boss). Alexis at some point proposed that perhaps a way to improve safety was to asynchronously ACK the client about what commands actually were not received so that the client could retry. This seemed a lot of efforts in the client side, but somewhat totally opened my view on the matter.

So the idea is, we can't go for synchronous replication, but indeed we get ACKs from the replicas, asynchronous ACKS, specifically.
What about retaining all the writes not acknowledged into a buffer, and "re-play" them to the current master when the partition heals?
The window we need to require to take the log is very small if the ACKs are frequent enough (currently the frequency is 1 per second, but this could be more easily).
If we give up Availability after N times the window we can say, ok, no more room, we now start to reply with errors to queries.

The HUGE difference with this approach is that this works regardless of the size of values. There are also semantical differences since the stream of operations is preserved instead of the value itself, so there is more context. Think for example about INCR.

Of course this would not work for anything, but one could mark in the command table what command to reply and what to discard. SADD is an example of perfect command since the order of operations does not matter. DEL is likely something to avoid replying. And so forth. In turn if we reply against the wrong (stale) master, it will accumulate the commands and so forth. Details may vary, but this is the first thing that really makes a difference.

Probably many of you that are into eventually consistent databases know about the log VS merge strategies already, but I had to re-invent the wheel as I was not aware. This is the kind of feedback I expected in the Redis thread that I did not received.

Another cool thing about this approach is that it's pretty opt-in, it can be just a state in the connection. Send a command and the connection is of "safe" type, so all the commands sent will be retained and replayed if not acknowledged, and so forth.

This is not going to be in the first version of Redis Cluster as I'm more happy to ship ASAP the current design, but it is a solid incremental idea that could be applied later, so a little actual result into the evolution of the design.