<antirez>

antirez 4256 days ago. 267314 views.
Redis uses streamed asynchronous replication, one of the simplest forms of replication you can imagine: a continuous stream of writes is sent to the slaves, and the master replies to the client without waiting for the slaves to process the writes in any way.

I always took that almost for granted, as I assumed Redis was not a good match for synchronous replication, which has a higher latency. However, I recently tried to fix another issue with Redis replication: timeout detection is entirely up to the slave.

This is how it used to work:

1) Master sends data to slaves. However sometimes there is no data to send (no write traffic). We still need to send something to slaves so that they do not detect a timeout.
2) So a master periodically sends PINGs to slaves as well, every 10 seconds by default.
3) Detection of a broken replication link is up to the slaves that will close the connection when a timeout is detected.
4) Masters are able to detect errors in the replication link only when reported by the operating system as a socket error.

So the ability of masters to detect errors in the replication link was pretty limited in Redis 2.6, and this is BAD. There are many kinds of broken links that raise no error on the socket, yet we end up accumulating writes for a slave that is down. The only defense against this was the ability of Redis 2.6 to detect when the output buffer was too big, and to close the connection before all the available memory was used for slave output buffers.

Pinging back
===

In order to fix this issue, the most natural thing to do is to also ping from slave to master, so that the master is aware of its slaves. Otherwise there is no slave -> master communication at all, since, to save bandwidth, slaves don't reply in any way to the write commands sent by a master.

However I was not very happy with sending just PING, since it was possible to send something way more useful, that is, the current *replication offset*. The replication offset is a new concept we have in 2.8 with PSYNC. Basically every master has a 64 bit global counter of how much replication stream it has produced. Moreover the replication stream is identical for all the slaves, so every slave shares the same global replication offset with the master.

The replication offset is primarily used by PSYNC, so that slaves can request a partial resynchronization asking the master to send data starting from a given offset, that is, the last offset that the slave received.
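
To make the idea concrete, here is a minimal sketch of a backlog indexed by a global offset. It is a toy model, not how Redis implements it: the `ReplicationBacklog` class and its method names are made up for illustration, and an offset here simply means "number of bytes of the stream received so far".

```python
class ReplicationBacklog:
    """Toy fixed-size backlog addressed by a global replication offset."""

    def __init__(self, size):
        self.size = size            # max bytes of history we keep
        self.buf = bytearray()      # most recent bytes of the stream
        self.master_offset = 0      # total bytes produced so far

    def feed(self, data):
        """Append new replication stream data, dropping the oldest bytes."""
        self.buf += data
        self.master_offset += len(data)
        if len(self.buf) > self.size:
            del self.buf[:len(self.buf) - self.size]

    def first_offset(self):
        """Offset of the oldest byte still available in the backlog."""
        return self.master_offset - len(self.buf)

    def read_from(self, offset):
        """Return the stream after `offset`, or None if a full resync is needed."""
        if offset < self.first_offset() or offset > self.master_offset:
            return None
        return bytes(self.buf[offset - self.first_offset():])


backlog = ReplicationBacklog(size=8)
backlog.feed(b"abcdef")
print(backlog.read_from(4))   # the slave already has 4 bytes, gets the rest
```

A slave asking for an offset that is older than what the backlog still holds gets `None` in this sketch, which stands for "partial resynchronization is not possible".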

So instead of sending PINGs I made slaves ping their master with a new command:


    REPLCONF ACK <replication-offset>

This way the master is aware of the amount of replication stream processed so far, and as a result it knows the "lag" of the slave. This is what it looks like when we ask a master for "INFO replication":

    $ redis-cli info replication
    # Replication
    role:master
    connected_slaves:1
    slave0:127.0.0.1,6380,online,121483
    master_repl_offset:121483
    repl_backlog_active:1
    repl_backlog_size:1048576
    repl_backlog_first_byte_offset:2
    repl_backlog_histlen:121482

As you can see the offset (last element) of slave0 is the same as master_repl_offset. So the slave is perfectly aligned.
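
As a rough illustration, the lag can be computed from that output as the difference between the two offsets. The sketch below assumes exactly the line format shown above (`slaveN:ip,port,state,offset`); other Redis versions may print these fields differently, and the function name `replication_lag` is just a name picked for this example.

```python
def replication_lag(info_text):
    """Return {slave name: bytes behind the master} from INFO replication text."""
    master_offset = None
    slave_offsets = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blanks and section headers
        key, _, value = line.partition(":")
        if key == "master_repl_offset":
            master_offset = int(value)
        elif key.startswith("slave") and key[5:].isdigit():
            # Assumed format: ip,port,state,offset (offset is the last element)
            slave_offsets[key] = int(value.split(",")[-1])
    if master_offset is None:
        raise ValueError("master_repl_offset not found")
    return {name: master_offset - off for name, off in slave_offsets.items()}


sample = """# Replication
role:master
connected_slaves:1
slave0:127.0.0.1,6380,online,121483
master_repl_offset:121483
"""
print(replication_lag(sample))
```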

Great, so far so good, but wait, isn't this half of what you need to implement synchronous replication?

Synchronous replication the easy way
===

So if we know the offset a slave processed so far, we could implement a new feature in Redis transactions, like this:


    MULTI
    MINREPLICAS 3 60
    SET foo bar
    EXEC

Here MINREPLICAS would tell Redis: make the command return only when my write has reached the master and at least two slaves.
The first argument is the number of replicas (the master counts as one), the second is a timeout, as we can't wait forever if there are not enough slaves accepting the write.

Implementing this is simple:

1) After the master processes the command, we save the current replication offset.
2) We also send REPLCONF GETACK to every slave in order to receive an ACK ASAP (otherwise ACKs are sent every second).
3) We block the client, similarly to what happens when BLPOP is called.
4) As soon as we receive enough ACKs from slaves, so that N replicas have an offset >= the one we saved, we unblock the client.
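
The steps above can be sketched in a few lines. This is only a model of the bookkeeping, under a few assumptions: the `Master` class and its methods are hypothetical, the GETACK round trip and the timeout are left out, and the requested replica count includes the master, as in the MINREPLICAS example.

```python
class Master:
    """Toy model of a master that blocks clients until enough slaves ACK."""

    def __init__(self):
        self.offset = 0          # global replication offset
        self.slave_acks = {}     # slave id -> last acknowledged offset
        self.blocked = []        # (client, target offset, slaves needed)
        self.unblocked = []      # clients whose write reached enough replicas

    def write(self, client, payload, replicas):
        """Apply a write, save the offset, and block the client (steps 1 and 3)."""
        self.offset += len(payload)
        # The master itself counts as one replica.
        self.blocked.append((client, self.offset, replicas - 1))
        self._check_blocked()

    def on_ack(self, slave, offset):
        """Handle REPLCONF ACK <offset> from a slave (step 4)."""
        self.slave_acks[slave] = max(offset, self.slave_acks.get(slave, 0))
        self._check_blocked()

    def _check_blocked(self):
        still_blocked = []
        for client, target, needed in self.blocked:
            reached = sum(1 for o in self.slave_acks.values() if o >= target)
            if reached >= needed:
                self.unblocked.append(client)
            else:
                still_blocked.append((client, target, needed))
        self.blocked = still_blocked


m = Master()
m.write("client-1", b"SET foo bar", replicas=3)
m.on_ack("slave-a", m.offset)
m.on_ack("slave-b", m.offset)
print(m.unblocked)
```

The client is released only when the second slave acknowledges an offset at or past the one saved at write time, which is all the "synchronous" part amounts to.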

Cool, right? Synchronous replication almost for free, not affecting the other commands at all, and so forth.

No rollbacks, no fun?
===

There is a problem, however: what happens if the timeout is reached and we still have not reached N replicas?

In Redis we don't have rollbacks, and I don't want to add this feature as rollbacks with complex big values are hard to implement, very costly, and will make everything too complex for my current tastes.

So, the write will *anyway* reach the master and a number of slaves < N-1 even if the transaction was not able to honor the requested MINREPLICAS count. However we can notify the user about the number of replicas reached as the first element of the MULTI/EXEC reply. This way the user may roll back manually if he wishes, or he may retry, assuming the write is idempotent…
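
A sketch of what this could look like, purely as an illustration of the idea: `exec_reply` builds a reply whose first element is the number of replicas reached (master included), and `write_with_retry` is a hypothetical client-side helper that retries an idempotent write. Neither is an existing Redis or client library API.

```python
def exec_reply(target_offset, slave_offsets, command_replies):
    """Build the reply: replicas reached first, then the usual command replies."""
    reached = 1 + sum(1 for o in slave_offsets if o >= target_offset)
    return [reached] + list(command_replies)


def write_with_retry(send, requested, attempts=3):
    """Retry an idempotent write until `requested` replicas are reached.

    `send` is any callable that performs the transaction and returns its reply.
    Returns the first satisfying reply, or None if every attempt fell short.
    """
    for _ in range(attempts):
        reply = send()
        if reply[0] >= requested:
            return reply
    return None


# Timeout expired with only one of two slaves caught up to offset 100.
print(exec_reply(100, [100, 90], ["OK"]))
```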

I wonder if the feature is still useful without rollbacks.

Alternatives
===

There is an alternative: now that we are able to sense slaves, we may implement a much weaker form of check that could still be very useful in practical systems, that is:

    MINREPLICAS <count> <min-idle>

Where I ask Redis to *start* the transaction only if there are at least <count> slaves connected, with an idle time in the ACK that is less than the specified <min-idle>. This does not guarantee that the write will be propagated to N replicas as there is an obvious window, but we'll be sure that if slaves get disconnected or blocked in some way, after some time (chosen by the user) the writes will no longer be accepted.
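
The check itself is tiny. Here is a minimal sketch, assuming the master records the time of the last ACK received from each slave; the function name and the `last_ack_times` mapping are invented for this example.

```python
import time


def can_start_transaction(last_ack_times, count, min_idle, now=None):
    """True if at least `count` slaves sent an ACK less than `min_idle` seconds ago.

    `last_ack_times` maps a slave id to the time of its last ACK, in seconds.
    """
    if now is None:
        now = time.time()
    fresh = sum(1 for t in last_ack_times.values() if now - t < min_idle)
    return fresh >= count


acks = {"slave-a": 100.0, "slave-b": 95.0}
# At time 101, slave-a has been idle for 1 second and slave-b for 6 seconds.
print(can_start_transaction(acks, count=2, min_idle=3, now=101.0))
print(can_start_transaction(acks, count=1, min_idle=3, now=101.0))
```

Nothing is checked after the transaction starts, which is the window mentioned above.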

What we have now
===

Slaves sending ACKs back to their master went into the 2.8 branch (as it was a bug fix, basically), so the different possibilities are open for the future, but currently I don't feel it is the right time to implement synchronous replication in Redis without thinking more about the behavior of the feature. However the fact that the underlying mechanism is so simple is tempting...