<antirez>

antirez 4072 days ago. 280486 views.
Redis API for data access is usually limited, but very direct and straightforward.

It is limited because it only allows to access data in a natural way, that is, in a data structure obvious way. Sorted sets are easy to access by score ranges, while hashes by field name, and so forth.
This API “way” has profound effects on what Redis is and how users organize data into it, because an API that is data-obvious means fast operations, less code and less bugs in the implementation, but especially forcing the application layer to make meaningful choices: the database as a system in which you are responsible of organizing data in a way that makes sense in your application, versus a database as a magical object where you put data inside, and then it will be able to fetch and organize data for you in any format.

However most Redis data types, including the outer key-value shell if we want to consider it a data type for a moment (it is a dictionary after all), are collections of elements. The key-value space is a collection of keys mapped to values, as Sets are collections of unordered elements, and so forth.

In most application logic using Redis the idea is that you know what member is inside a collection, or at least you know what member you should test for existence. But life is not always that easy, and sometimes you need something more, that is, to scan the collection in order to retrieve all the elements inside. And yes, since this is a common need, we have commands like SMEMBERS or HGETALL, or even KEYS, in order to retrieve everything there is inside a collection, but those commands are always a last-resort kind of deal, because they are O(N) operations.

Your collection is very small? Fine, use SMEMBERS and you get Redis-alike performances anyway. Your collection is big? Don’t use O(N) commands if not for “debugging” purposes. A popular example is the misused KEYS command, source of troubles for non-experts, and top hit among the Redis slow log entries.

Black holes
===

The problem is that because of O(N) operations, Redis collections, (excluding Sorted Sets that can be accessed by rank, in ranges, and in many other different ways), tend to be black holes where you put things inside, and you can hardly explore them again.

And there are plenty of reasons to explore what is inside. For example garbage collection tasks, schema migration, or even fixing what is inside keys after an application bug corrupted some data.

What we really needed was an iterator. Pieter Noordhuis and I were very aware of this problem since the early days of Redis, but it was a major design challenge because traditionally the deal is, you want a data structure to be iterable? Well, this is going to be a sorted tree-like data structure, with the concept of previous and next element.

Instead most Redis data types are based on hash tables, and Redis hash tables are even of the most particular kind, as them automatically and incrementally resize, so sometime you even have two tables at the same time slowly exchanging elements from one to the other.

Hash tables are cool because they have a good memory efficiency, and the constant-time access property. What we use is power-of-two sized hash tables, with chaining for collision handling, and this has worked pretty well so far. An indirect indicator is that sorted sets, the only data structure we have that is based on a tree-alike structure (skip lists) is measurably slower than others once elements start to pile up. While Log(N) is small, with million of elements it starts to be a big enough number that cache misses summed together make a difference.

There was no easy way to say goodbye to hash tables.

However eventually Pieter applied one of the most powerful weapons the designer has in its hands: sacrifice, and send me a pull request for a new SCAN command.

The command was able to walk the key space incrementally, using a cursor that is returned back to the user after every call to SCAN, so it is a completely stateless affair. It is something like that:

redis 127.0.0.1:6379> flushall
OK
redis 127.0.0.1:6379> debug populate 33
OK
redis 127.0.0.1:6379> scan 0
1) "52"
2)  1) "key:29"
    2) "key:13"
    3) "key:9"
    4) "key:12"
    5) "key:28"
    6) "key:30"
    7) "key:26"
    8) "key:14"
    9) "key:21"
   10) "key:20"
redis 127.0.0.1:6379> scan 52
1) "9"
2)  1) "key:16"
    2) "key:31"
    3) "key:3"
    4) "key:0"
    5) "key:32"
    6) "key:17"
    7) "key:24"
    8) "key:8"
    9) "key:15"
   10) "key:11"

… and so forth until the returned cursor is 0 again …

This is possible because SCAN does not make big promises, hence the sacrifice: it guarantees to return all the elements that are in the collection from the start to the end of the iteration, at least one time.

This means, for example, that:

1) Elements may be returned multiple times.
2) Elements added during the iteration may be returned, or not, at random.

It turns out that this is a perfectly valid compromise, and that at application level you can almost always do what is needed to play well with this properties. Sometimes the operation you are doing on every element are simply safe to re-apply, some other times you can just put a flag in your (for example) Hash in order to mark it as processed, or a timestamp perhaps, and so forth.

Eventually merged
===

Pieter did an excellent work, but the pull request remained pending forever (more than one year), because it relied on a complex to understand implementation. Basically in order to guarantee the previous properties with tables that can change from one SCAN call to the other, Pieter used a cursor that is incremented inverting the bits, in order to count starting from the most significant bits first. This is why in the example you see the returned cursor jumping forward and backward.

This has different advantages, including the fact that it returns a small number of duplicates when possible (by checking less slots than a more naive implementation). Eventually I studied his code and tried to find a simpler algorithm with the same properties without success, so what I did is to document the algorithm. You can read the description here in the
dict.c file: https://github.com/antirez/redis/blob/unstable/src/dict.c#L663

After you do some whiteboard reasoning, it is not too hard to see how it works, but it is neither trivial, however it works definitely very well.

So with the code merged into unstable, I tried to generalize the implementation in order to work with all the other types in Redis that can be iterated this way, that are Hashes, Sets and Sorted Sets, in the form of additional commands named HSCAN, SSCAN, ZSCAN.

You may wonder why to have a ZSCAN. The reason is that while sorted sets are iterable in other ways, the specific properties of the SCAN iterator are not trivial to simulate by scanning elements by rank or score. Moreover sorted sets are internally implemented by a skiplist and an hash table, so we already had the hash table and to extend SCAN to sorted sets was trivial.

Patterns too!
===

SCAN and its variants can be used with the MATCH option in order to only return elements matching a pattern. I’m also implementing the TYPE option in order to only return keys of a specific type.

This is almost for free, since SCAN does not guarantees to return elements at all, so what happens is that it scans something like 10 buckets of the hash table per call (by default, you can change this) and later filters the output. Even if the return value contains no elements, you keep iterating as soon as the returned cursor is non-zero.

As you can see this is something you could do client-side as well, by matching the returned elements with a pattern, but it is much faster to do it server side given that is a very common pattern, and one that users are already used to because of the KEYS command. And it requires less I/O of course, if the pattern only matches a small number of elements.

This is an example:

edis 127.0.0.1:6379> scan 0 MATCH key:1*
1) "52"
2) 1) "key:13"
   2) "key:12"
   3) "key:14"
redis 127.0.0.1:6379> scan 52 MATCH key:1*
1) "9"
2) 1) "key:16"
   2) "key:17"
   3) "key:15"
   4) "key:11"
redis 127.0.0.1:6379> scan 9 MATCH key:1*
1) "59"
2) 1) "key:10"
   2) "key:1"
   3) "key:18"
redis 127.0.0.1:6379> scan 59 MATCH key:1*
1) "0"
2) 1) "key:19"

In the last call the returned cursor is 0, so we ended the iteration.

Excited about it? I’ve good news, this is going to be back ported into 2.8 since it is completely self contained code so if it is broken, it does not affect other stuff. Well not just that, there are very big companies that are using SCAN for some time now, so I’m confident it’s pretty stable.

Enjoy iterating!

Discuss this blog post on Hacker News: https://news.ycombinator.com/item?id=6633091
blog comments powered by Disqus
: