Yesterday Amplitude published an article about scaling analytics, in the context of using the Set data type. The blog post is here: https://amplitude.com/blog/2015/08/25/scaling-analytics-at-amplitude/

On Hacker News people asked why they were not using Redis instead: https://news.ycombinator.com/item?id=10118413

Amplitude's developers have their set of reasons for not using Redis, and in general, if you have a very specific problem and want to scale it in the best possible way, it makes sense to implement your own vertical solution. I’m not averse to reinventing the wheel: sometimes you want your very specific wheel, one that a general purpose system may not be able to provide. Moreover, creating your own solution gives you control over what you built, boosts your creativity and your confidence in what you, as a developer, can do, and makes you able to debug whatever bug may arise in the future without external help.

On the other hand, of course, creating system software from scratch is a very complex matter: it requires constant development if there is a wish to actively evolve it, or it means having a stalled, non-evolving piece of code if there is no team dedicated to it. If it is very vertical and specialized, the new system is likely capable of handling only a slice of the whole application’s problems, and yet you have to manage it as an additional component. Moreover, if it was created mostly by one or a few programmers who later leave the company, then fixing and evolving it becomes a very big problem: there is no sizable external community, nor are the original developers around.

Basically, writing things in house is not good or bad per se: it depends. Of course it takes sensibility to understand when it’s worth implementing something from scratch and when it is not. Good developers know.

From my point of view, regardless of what the Amplitude developers’ final solution was, it is interesting to read about the process and why they are not using Redis. One of the concerns they raised is the memory overhead of the Set data type in Redis. I believe they are right to have such a concern: Redis Sets could be a lot more memory efficient, and weeks before reading the Amplitude article I had already started to explore ways to improve the memory efficiency of Sets. Today I want to share those plans with you.

Dual representation of data types
===

In the beginning there were plain data structures, implemented more or less as an algorithms textbook suggests: each node of the data structure is allocated dynamically. Allocation overhead, fat pointers and poor cache locality are the big limits of this basic approach.

Pieter Noordhuis and I later implemented specialized versions of the Redis abstract data types, designed to be very memory efficient, using a single allocation to hold tens or a few hundreds of elements, sometimes with ad-hoc encodings to make better use of the space. Those versions of the data structures have O(N) time complexity for certain operations, or are sometimes limited to elements with a specific format (numbers) or size.

So, for example, when you create a Hash, it starts out represented in a memory efficient way, which is good for a small number of elements. Later it gets converted to a real hash table if the number of elements reaches a given threshold. This means that the memory efficiency of a Redis data type depends a lot on the number of elements it stores.
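
For example, on a Redis of this era, a small Hash is encoded as a "ziplist" and converted once it crosses the hash-max-ziplist-entries / hash-max-ziplist-value thresholds (here the entries threshold is lowered just to make the conversion easy to observe):

127.0.0.1:6379> config set hash-max-ziplist-entries 2
OK
127.0.0.1:6379> hset myhash f1 v1
(integer) 1
127.0.0.1:6379> object encoding myhash
"ziplist"
127.0.0.1:6379> hset myhash f2 v2
(integer) 1
127.0.0.1:6379> hset myhash f3 v3
(integer) 1
127.0.0.1:6379> object encoding myhash
"hashtable"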

The next step: Redis lists
===

At some point Twitter developers realized that there was no reason to go from an array of elements in a single allocation, representing the items of a List, to an actual linked list, which is a lot less memory efficient. There is something in the middle: a linked list of small arrays, each holding a few items.
Their implementation does not handle defragmentation when you remove something in the middle. Pieter and I had tried in the past to understand if this was worth it or not, but we had the feeling that the defragmentation effort might not be compensated by the space savings, while a non-defragmenting implementation of this idea was too fragile as a general purpose implementation of Redis lists: remove a few elements in the middle and your memory usage changes dramatically.

Fortunately Matt Stancliff implemented the idea, including the defragmentation part, in an excellent way, and after some experimentation he showed that the new implementation was at least as good as the existing Redis implementation from the point of view of performance, and much better from the point of view of memory usage. Moreover the memory efficiency of lists was no longer a function of the size of the list, and there was a single representation to deal with.
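
To make the idea concrete, here is a minimal sketch of the data layout, with hypothetical names and fields, not the actual code that was merged:

/* A doubly linked list of nodes, where each node packs many elements
   into a single allocation (sketch of the idea, not the real code). */
#define NODE_MAX_ITEMS 128   /* a node is split when it grows past this */

typedef struct packedNode {
    struct packedNode *prev, *next;
    unsigned char *entries;  /* length-prefixed elements, one allocation */
    unsigned int count;      /* elements stored in 'entries' */
    unsigned int bytes;      /* bytes used by 'entries' */
} packedNode;

typedef struct packedList {
    packedNode *head, *tail;
    unsigned long len;       /* total number of elements */
} packedList;

Deleting from the middle of a node just shrinks that node, and the defragmentation part would amount to merging a node with a neighbor when both become small enough, so that memory usage does not degrade after many deletions in the middle.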

Lists are kind of special, since a linked list of small arrays is really an optimal representation for them, and it may not map easily to other data types. Is it possible to do something like that for Sets and other data types?

Redis Sets
===

Sets’ memory usage is a bit special. Unlike all the other Redis data types, Sets composed of strings have no specialized small representation, so even a very small Set is going to consume a lot of memory. A special representation actually exists, and it is excellent, but it only works if the Set is small and composed of just numbers: in such a case, we represent the Set with a special encoding called “intset”. It is an ordered linear array of integers, so that we can use binary search for testing the existence of members. The width of each element automatically changes depending on the greatest element in the set, so representing a set containing the strings 1, 20, 30, 15 is going to take just two bytes per element plus some overhead, because the strings can be represented as numbers, and they fit in the smallest (16 bit) encoding. However just add “a” to the set, and it will be converted into a full hash table:

127.0.0.1:6379> sadd myset 1 2 3 4 5
(integer) 5
127.0.0.1:6379> object encoding myset
"intset"
127.0.0.1:6379> sadd myset a
(integer) 1
127.0.0.1:6379> object encoding myset
"hashtable"

Sets of integers are a very commonly used data type in Redis, so it is actually very useful to have that. But why don’t we have a special representation for small sets composed of non-numerical strings, like we have for everything else? Well, the idea was that having a data type with *three* representations was not going to be a good thing from the point of view of Redis internals. If you check t_zset.c or t_set.c you’ll see it already requires some care to deal with multiple representations. The more you try to abstract away dealing with N representations, the more you lose access to certain optimizations. Moreover the List story showed that it was possible to have a single representation with all the benefits: what you lose in terms of scanning the small aggregates containing N elements, you win back because of better cache locality, so it is possible to experiment with things that look like a tragic time/space tradeoff, but in reality are not.

Specializing Redis hash tables
===

Big hashes, non-numerical (or big) sets, and big sorted sets are currently represented by hash tables. The implementation is the one inside the dict.c file. It is a hash table implemented in a pretty straightforward way, using chaining to resolve collisions. The special things in this hash table implementation are just two. The first is that it never blocks in order to rehash: the rehashing process is handled incrementally. I did this in the first months of my VMware sponsorship, and it was a big win in terms of latency, of course. The second is a special primitive called “scanning”, invented by Pieter Noordhuis, which is a cursor-based iterator without overhead nor state, but with reasonable guarantees. Apart from that, the Redis hash table expects keys and values to be pointers to something, plus methods to compare and release keys, and to release values.
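
For readers not familiar with dict.c, the incremental rehashing idea is roughly the following: during a resize the dictionary keeps two tables, and every operation migrates a bounded amount of work from the old table to the new one (a simplified sketch, not the actual dict.c code; hashKey is an assumed helper):

typedef struct entrySketch {
    void *key, *val;
    struct entrySketch *next;          /* chaining for collisions */
} entrySketch;

typedef struct {
    entrySketch **buckets;
    unsigned long size;                /* number of buckets, power of two */
} tableSketch;

typedef struct {
    tableSketch ht[2];                 /* ht[1] is used only while rehashing */
    long rehashidx;                    /* -1 if not rehashing */
} dictSketch;

extern unsigned long hashKey(const void *key);  /* assumed provided elsewhere */

static void rehashOneBucket(dictSketch *d) {
    if (d->rehashidx == -1) return;    /* nothing to do */
    entrySketch *e = d->ht[0].buckets[d->rehashidx];
    while (e) {                        /* move the whole chain to ht[1] */
        entrySketch *next = e->next;
        unsigned long idx = hashKey(e->key) & (d->ht[1].size - 1);
        e->next = d->ht[1].buckets[idx];
        d->ht[1].buckets[idx] = e;
        e = next;
    }
    d->ht[0].buckets[d->rehashidx++] = NULL;
    if ((unsigned long)d->rehashidx == d->ht[0].size)
        d->rehashidx = -1;             /* done: ht[1] becomes the main table */
}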

This is how you want to design a general purpose hash table: pointers and methods (function pointers) to deal with values everywhere. However Redis data structures have an interesting property: every element of a complex data structure is always, semantically, a string. Hashes are maps between string fields and string values, Sets are unordered sets of strings, and so forth.

What happens if we implement a hash table designed to store just string keys and string values? Well… it looks like there is a simple way to make such a hash table very memory efficient. We could set the load factor to something greater than 1, for example 10, so that if there are 5 buckets in the hash table, each bucket will contain 10 elements on average.

So each bucket will be something like a linear array of key-value items, with prefixed lengths, in a very similar fashion to the encodings we use for small data types currently. Something like:

0: <3>foo<3>bar<5>hello<6>world!<0>
1: <8>user:103<3>811 … <0>
2: … <0>

And so forth. The encoding could be specialized or just something existing like MessagePack. So here the extra work you do in each bucket is hopefully compensated by the better locality you get.
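
As a rough illustration of the lookup path, here is a minimal sketch of scanning a single packed bucket, assuming the simplistic encoding shown above, with one-byte length prefixes and a zero key length as terminator (hypothetical code, not an actual implementation; a real encoding would need variable-length length prefixes):

#include <stddef.h>
#include <string.h>

/* A bucket is one allocation of alternating <klen>key<vlen>value items,
   terminated by a zero key length. Returns a pointer to the value (and
   sets *vlen) if the key is found, NULL otherwise. */
static const unsigned char *bucketFind(const unsigned char *bucket,
                                       const void *key, size_t klen,
                                       size_t *vlen)
{
    const unsigned char *p = bucket;
    while (*p != 0) {
        size_t cklen = *p++;                    /* key length */
        const unsigned char *ckey = p; p += cklen;
        size_t cvlen = *p++;                    /* value length */
        const unsigned char *cval = p; p += cvlen;
        if (cklen == klen && memcmp(ckey, key, klen) == 0) {
            *vlen = cvlen;
            return cval;
        }
    }
    return NULL;
}

The full lookup would just hash the key, pick bucket = hash % numbuckets, and scan that single bucket: with a load factor of 10 this means comparing against ten keys on average, but reading them from a single contiguous allocation.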

Implementing scanning and incremental rehashing on top of this data structure is viable as well: I did an initial analysis, and while it is not possible to just copy the implementation in dict.c, it is possible to find other ways to obtain the same effects.

Note that, technically speaking, it is possible to store pointers in such a hash table: they will just be strings from the point of view of the hash table implementation, and it is possible to signal, in the hash table type, that those are pointers needing special care (for example a free-value function pointer or the like). However only testing can tell if it’s worth it or not.

However there are problems that must be solved in order to use this for more than sets, or at least in order to use *only* such a representation, killing the small representations we have currently. For example, the current small representations have a very interesting property: they are already a serialization of themselves, without extra work required. We use this to store data into RDB files, to transfer data between nodes in Redis Cluster, and so forth. The specialized hash table should hopefully have the same property, or at least each single bucket should already be in a serialized format without any post-processing work required. If this is not the case, we could use these new dictionaries only in place of the general hash tables after the expansion, which is already a big win.

Conclusions
===

This is an initial idea that requires some time for the design to be improved, later validated by an implementation, and load tested in depth to guarantee there are no huge regressions in certain legitimate workloads. If everything goes well, we may end up with a Redis server which is a lot more memory efficient than in the past. Such a hash table may also be used to store the main Redis dictionary, in order to make the per-key overhead much smaller.