PSYNC - <antirez>

antirez 4909 days ago. 275265 views.

Dear Redis users, in the final part of 2012 I repeated many times that the focus for 2013 is all about Redis Cluster and Redis Sentinel.

This is exactly what I'm going to do from the point of view of the big picture. However, there are many smaller features that make a big difference in the day-to-day operations of Redis users, and such features can't be ignored either. They are less shiny in a feature list, they are not good at generating buzz and interest among new users, and they are sometimes boring to code, but they are very important from a practical point of view.

So I ended the year, and I'm starting the new one, spending considerable time on a feature long awaited by many users with production instances crunching data every day: the ability for a slave to partially resynchronize with the master, without requiring a full resynchronization every time.

The good news is that today I finally have an implementation that works well in my tests. This means that this feature will be present in Redis 2.8, so it is the right time to start making users aware of it, and to describe how it works.

Some background
---

Redis replication is a pretty brutal piece of code in many ways:

1) It works by re-playing on slaves every command received by the Redis master that actually produced a change in the data set.
2) From the point of view of slaves, masters are just slightly special clients: they are almost like normal clients sending commands. No special replication protocol or data format is used for replication.
3) It *used to force* a full resynchronization every time a slave connects to a master. This means that, at every connection, the slave will receive a copy of the master data set, in the form of an RDB file, and load it.

Because of these characteristics Redis replication has been very reliable from the point of view of corruption. If you always full-resync, there is little chance of inconsistency. It was also architecturally trivial, because masters are like clients, no special protocol is used, and so forth.

Simple and reliable, what can go wrong? Well, what goes wrong is that even when simplicity is very important, doing O(N) work when zero work is needed is not a good idea. I'm looking at you, point three of my list.

Consider the following scenario:

* Slave connects to master and performs a full resync.
* Master and slave chat for one hour.
* Slave disconnects from Master because of some silly network issue for 2 seconds.

A full resynchronization is required to reconnect. It was a design sacrifice because after all we are dealing with RAM-sized data sets. It can't be so hard. But actually, as RAM gets cheaper and big users become more interested in Redis, we have many production instances with big data sets that need to full resync at every network issue.

Also resynchronization involves unpleasant things:

1) The disk is involved, since the slave saving the RDB file needs to write that file somewhere.
2) The master is forced to create an RDB file. Not a big deal as this master is supposed to save or write the AOF anyway, but still, more I/O without a good reason.
3) The slave needs to block at some point after reconnection in order to load the RDB file into memory.

This time it was worth introducing complexity in order to make things better.

# So now Redis sucks as well?

MySQL is surely one of the first databases I was exposed to, a few decades ago, and the first time I had to set up replication I was shocked about how much it sucked. Are you serious that I need to enable binary logs and deal with offsets?

Redis replication, which everyone agrees is dead-simple to set up, is more or less a response to how much I dislike MySQL replication from the point of view of the "user interface".

Even if we needed partial resynchronization, I didn't want Redis to be like that.
However, to perform partial resynchronization you need, in some way or the other, something like this:

<slave> Hi Master! How are you? Remember that I used to be connected with you and we were such goooood friends?
<master> Hey Slave! You still here piece of bastard...
<slave> Well, shut up and give me data starting from offset 282233943 before I signal you to the Authority Of The Lazy Databases.
<master> Fu@**#$(*@($! ... 1010110010101001010100101 ...

So the obvious solution is to have a file on the master side with all the data, so that when a slave wants to resync, we can provide any offset without problems just by reading it from the file. Except that this sucks a lot: we become append-to-disk-bound even if AOF is disabled, and we need to deal with a file system that can get full or be slow (Hey EC2!), and with files to rotate one way or the other. Horrid.

So the following is a description of how the Redis partial resynchronization implementation again accepts sacrifices in order to avoid sucking like that.

# Redis PSYNC

Redis partial resynchronization makes two design sacrifices.
It accepts that the slave will be able to partially resynchronize only if:

1) It reconnects in a reasonable amount of time.
2) The master was not restarted.

Because of these two relaxed requirements, instead of using a file, we can use a simple buffer inside our... Memory! Don't worry, a very acceptable amount of memory.

So a Redis master is modified in order to:

* Unify the data that is sent to the slaves, so that every slave receives exactly the same things. We were almost there already, but SELECT and PING commands were sent in a slave-specific fashion. Now instead the replication output to slaves is unified.
* Keep a backlog of what we send to slaves. So for instance we keep 10 MB of past data.
* Keep a global replication offset, that the user never needs to deal with. We simply provide this offset to the slave, which increments it every time it receives data. This way the slave is able to ask for a partial resynchronization indicating the offset to start from.
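To make the backlog and the global offset more concrete, here is a minimal sketch in Python. It is not the Redis implementation (which is written in C): the class name `ReplicationBacklog`, its methods, and the convention that a slave asks for "everything from offset X on" are assumptions made for illustration only.

```python
class ReplicationBacklog:
    """Hypothetical fixed-size record of the latest bytes sent to slaves."""

    def __init__(self, size):
        self.size = size            # maximum number of bytes retained
        self.buf = bytearray()      # the trailing bytes of the replication stream
        self.master_offset = 0      # global offset: total bytes ever sent

    def feed(self, data):
        """Record bytes that were just sent, identically, to every slave."""
        self.buf += data
        self.master_offset += len(data)
        excess = len(self.buf) - self.size
        if excess > 0:
            del self.buf[:excess]   # forget the oldest bytes

    def first_offset(self):
        """Offset of the oldest byte still held in the backlog."""
        return self.master_offset - len(self.buf)

    def read_from(self, offset):
        """Return the bytes from `offset` on, or None if they are gone."""
        if offset < self.first_offset() or offset > self.master_offset:
            return None
        return bytes(self.buf[offset - self.first_offset():])


backlog = ReplicationBacklog(size=16)
backlog.feed(b"SET a 1\r\n")        # 9 bytes
backlog.feed(b"SET b 2\r\n")        # 9 more bytes: the oldest 2 are dropped
print(backlog.read_from(9))         # the slave missed only the second command
print(backlog.read_from(0))         # too old: a full resync would be needed
```

A real implementation would more likely use a preallocated circular buffer instead of trimming a growing byte array, but the bookkeeping around the offsets is the same idea.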

Oh also, we don't want the system to be fragile, so we use the master "run id", a concept that was introduced in Redis in the past as a unique identifier of a given instance execution. When the slave synchronizes with the master, it also gets the master run id, so that a later partial resynchronization attempt will be made only if the master is the same, as in, the exact same execution of Redis.

Also, the PSYNC command was introduced as a variant of SYNC for instances capable of partial resynchronization.
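The checks described so far can be summarized with a small Python sketch of the decision the master has to take when a resynchronization request arrives. The function name `psync_decision`, its parameters, and the returned strings are hypothetical, and the exact boundary conditions are an assumption, not a description of the actual Redis code.

```python
def psync_decision(master_run_id, backlog_first_offset, master_offset,
                   requested_run_id, requested_offset):
    """Return "partial" or "full" for a slave resynchronization request.

    backlog_first_offset: offset of the oldest byte still in the backlog.
    master_offset: global replication offset of the master right now.
    """
    # A different run id means a different execution of Redis: the offsets
    # of the old execution mean nothing to this one.
    if requested_run_id != master_run_id:
        return "full"
    # The requested data must still be inside the in-memory backlog.
    if not (backlog_first_offset <= requested_offset <= master_offset):
        return "full"
    return "partial"


run_id = "6f0d582d3a23b65515644d7c61a10bf9b28094ca"
print(psync_decision(run_id, 0, 30, run_id, 30))        # partial
print(psync_decision(run_id, 0, 30, "some-other-id", 30))  # full
print(psync_decision(run_id, 500, 900, run_id, 30))     # full
```

The two "full" cases map directly to the two design sacrifices: a restarted master has a new run id, and a slave that stays away too long asks for an offset the backlog no longer contains.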

# How does all this work in practice?

When the slave gets disconnected, you'll see something like this:

[60051] 17 Jan 16:52:54.979 * Caching the disconnected master state.
[60051] 17 Jan 16:52:55.405 * Connecting to MASTER...
[60051] 17 Jan 16:52:55.405 * MASTER <-> SLAVE sync started
[60051] 17 Jan 16:52:55.405 * Non blocking connect for SYNC fired the event.
[60051] 17 Jan 16:52:55.405 * Master replied to PING, replication can continue...
[60051] 17 Jan 16:52:55.405 * Trying a partial resynchronization (request 6f0d582d3a23b65515644d7c61a10bf9b28094ca:30).
[60051] 17 Jan 16:52:55.406 * Successful partial resynchronization with master.
[60051] 17 Jan 16:52:55.406 * MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.

See the first line: the slave *caches* the master client structure, so that all the buffers are saved and can be reused if we are able to resynchronize.
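The slave side of this can be sketched in Python as follows. This is only an illustration of the idea of caching the master state instead of throwing it away: the class `SlaveLink` and all its methods are hypothetical names, and the real structure holds more than a run id and an offset.

```python
class SlaveLink:
    """Hypothetical slave-side view of the link with its master."""

    def __init__(self):
        self.master = None          # state of the live connection, if any
        self.cached_master = None   # state saved across a disconnection

    def full_sync_done(self, run_id, offset):
        """Called after a full resynchronization completed."""
        self.master = {"run_id": run_id, "offset": offset}
        self.cached_master = None

    def receive(self, data):
        """Every byte received from the master advances the offset."""
        self.master["offset"] += len(data)

    def disconnected(self):
        """Keep the master state around instead of discarding it."""
        self.cached_master, self.master = self.master, None

    def resync_request(self):
        """Return (run_id, offset) to ask for, or None if nothing is cached."""
        if self.cached_master is None:
            return None
        return (self.cached_master["run_id"], self.cached_master["offset"])

    def partial_resync_accepted(self):
        """Promote the cached state back to a live connection."""
        self.master, self.cached_master = self.cached_master, None


link = SlaveLink()
link.full_sync_done("6f0d582d3a23b65515644d7c61a10bf9b28094ca", 0)
link.receive(b"x" * 30)             # 30 bytes of replication stream
link.disconnected()
print(link.resync_request())        # run id plus offset 30, as in the log above
```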

In the master side we'll see instead:

[59968] 17 Jan 16:52:55.406 * Slave asks for synchronization
[59968] 17 Jan 16:52:55.406 * Partial resynchronization request accepted. Sending 0 bytes of backlog starting from offset 30.

So basically, as long as the data is still available, that is, as long as the master has sent no more than N bytes worth of Redis protocol (of write commands) to its slaves since the disconnection, where N is the size of the backlog, the slave will still be able to reconnect with a partial resynchronization. Otherwise a full resynchronization will be performed. How much backlog to allocate is up to the user.
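As a rough way to reason about how much backlog to allocate, here is a back-of-the-envelope sketch in Python. The function name is hypothetical and the write traffic figure is made up for the example; only the 10 MB backlog size comes from the text above.

```python
def max_disconnection_seconds(backlog_bytes, write_bytes_per_second):
    """Longest disconnection a backlog can absorb at a steady write rate."""
    if write_bytes_per_second <= 0:
        # With no write traffic nothing is ever pushed out of the backlog.
        return float("inf")
    return backlog_bytes / write_bytes_per_second


backlog_bytes = 10 * 1024 * 1024    # the 10 MB backlog used as an example above
write_rate = 100 * 1024             # assumed: 100 KB/s of write commands
print(max_disconnection_seconds(backlog_bytes, write_rate))
```

With these assumed numbers the slave can stay disconnected for a bit more than 100 seconds and still resynchronize partially. Write traffic is rarely steady, so it makes sense to size the backlog against peak traffic rather than the average.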

# Current status

The 'psync' branch on my private repository now seems to work very well, however this is Very Important Code and must be tested a lot. This is what I'm doing right now. When I'm reasonably sure the code is solid, I'll merge it into the unstable and 2.8 branches. When it has run for weeks without issues and starts to be used by the brave early adopters, we'll release 2.8 with this and other new features.