Random notes on improving the Redis LRU algorithm

antirez 3419 days ago. 285623 views.

Redis is often used for caching, in a setup where a fixed maximum memory to use is specified. When new data arrives, we need to make space by removing old data. The efficiency of Redis as a cache is related to how good decisions it makes about what data to evict: deleting data that is going to be needed soon is a poor strategy, while deleting data that is unlikely to be requested again is a good one.

In other terms every cache has an hits/misses ratio, which is, in qualitative terms, just the percentage of read queries that the cache is able to serve. Accesses to the keys of a cache are not distributed evenly among the data set in most workloads. Often a small percentage of keys get a very large percentage of all the accesses. Moreover the access pattern often changes over time, which means that as time passes certain keys that were very requested may no longer be accessed often, and conversely, keys that once were not popular may turn into the most accessed keys.

So in general what a cache should try to do is to retain the keys that have the highest probability of being accessed in the future. From the point of view of an eviction policy (the policy used to make space to allow new data to enter) this translates into the contrary: the key with the least probability of being accessed in the future should be removed from the data set. There is only one problem: Redis and other caches are not able to predict the future.

The LRU algorithm
===

While caches can’t predict the future, they can reason in the following way: keys that are likely to be requested again are keys that were recently requested often. Since usually access patterns don’t change very suddenly, this is an effective strategy. However the notion of “recently requested often” is more insidious that it may look at a first glance (we’ll return shortly on this). So this concept is simplified into an algorithm that is called LRU, which instead just tracks the *last time* a key was requested. Keys that are accessed with an higher frequency have a greater probability of being idle (not accessed) for a shorter time compared to keys that are rarely accessed.

For instance this is a representation of four different keys accesses over time. Each “~” character is one second, while the “|” line at the end is the current instant.

~~~~~A~~~~~A~~~~~A~~~~A~~~~~A~~~~~A~~|
~~B~~B~~B~~B~~B~~B~~B~~B~~B~~B~~B~~B~|
~~~~~~~~~~C~~~~~~~~~C~~~~~~~~~C~~~~~~|
~~~~~D~~~~~~~~~~D~~~~~~~~~D~~~~~~~~~D|

Key A is accessed one time every 5 seconds, key B once every 2 seconds
and key C and D are both accessed every 10 seconds.

Given the high frequency of accesses of key B, it has among the lowest idle
times, which means its last access time is the second most recent among all the
four keys.

Similarly A and C idle time of 2 and 6 seconds well reflect the access
frequency of both those keys. However as you can see this trick does not
always work: key D is accessed every 10 seconds, however it has the most
recent access time of all the keys.

Still, in the long run, this algorithm works well enough. Usually keys
with a greater access frequency have a smaller idle time. The LRU
algorithm evicts the Least Recently Used key, which means the one with
the greatest idle time. It is simple to implement because all we need to
do is to track the last time a given key was accessed, or sometimes
this is not even needed: we may just have all the objects we want to
evict linked in a linked list. When an object is accessed we move it
to the top of the list. When we want to evict objects, we evict from
the tail of the list. Tada! Win.

LRU in Redis: the genesis
===

Initially Redis had no support for LRU eviction. It was added later, when memory efficiency was a big concern. By modifying a bit the Redis Object structure I was able to make 24 bits of space. There was no room for linking the objects in a linked list (fat pointers!), moreover the implementation needed to be efficient, since the server performance should not drop too much because of the selection of the key to evict.

The 24 bits in the object are enough to store the least significant
bits of the current unix time in seconds. This representation, called
“LRU clock” inside the source code of Redis, takes 194 days to overflow. Keys metadata are updated much often, so this was good enough.

However there was another more complex problem to solve, how to select
the key with the greatest idle time in order to evict it? The Redis
key space is represented via a flat hash table. To add another data
structure to take this metadata was not an option, however since
LRU is itself an approximation of what we want to achieve, what
about approximating LRU itself?

The initial Redis algorithm was as simple as that: when there is to evict
a key, select 3 random keys, and evict the one with the highest
idle time. Basically we do random sampling over the key space and evict
the key that happens to be the better. Later this “3 random keys”
was turned into a configurable “N random keys” and the algorithm
speed was improved so that the default was raised to 5 keys sampling
without losing performances. Considering how naive it was, it worked
well, very well actually. If you think at it, you always never do
the best decision with this algorithm, but is very unlikely to do
a very bad decision too. If there is a subset of very frequently accessed
keys in the data set, out of 5 keys it is hard to be so unlucky to
only sample keys with a very short idle time.

However if you think at this algorithm *across* its executions, you
can see how we are trashing a lot of interesting data. Maybe when
sampling the N keys, we encounter a lot of good candidates, but
we then just evict the best, and start from scratch again the next
cycle.

First rule of Fight Club is: observe your algorithms with naked eyes
===

At some point I was in the middle of working at the upcoming Redis
3.0 release. Redis 2.8 was actively used as an LRU cache in multiple
environments, and people didn’t complained too much about the
precision of the eviction in Redis, but it was clear that it could
be improved even without using a noticeable amount of additional CPU
time, and not a single bit of additional space.

However in order to improve something, you have to look at it. There
are different ways to look at LRU algorithms. You can write, for example,
tools that simulate different workloads, and check the hit/miss ratio
at the end. This is what I did, however the hit/miss ratio depends
a lot on the access pattern, so additionally to this information I
wrote an utility that actually displayed the algorithm quality in a
visual way.

The program was very simple: it added a given number of keys, then
accessed the keys sequentially so that each had a decreasing
idle time. Finally 50% more keys were added (the green ones in the
picture), so that half of the old keys needed to be evicted.

In a perfect LRU implementation no key from the new added keys are evicted, and the initial 50% of the old dataset is evicted.

This is the representation produced by the program for different
versions of Redis and different settings:

http://redis.io/images/redisdoc/lru_comparison.png

When looking at the graph remember that the implementation we
discussed so far is the one of Redis 2.8. The improvement you
see in Redis 3.0 is explained in the next section.

LRU V2: don’t trash away important information
===

With the new visual tool, I was able to try new approaches and
test them in a matter of minutes. The most obvious way to improve
the vanilla algorithm used by Redis was to accumulate the otherwise
trashed information in a “pool” of good candidates for eviction.

Basically when the N keys sampling was performed, it was used to
populate a larger pool of keys (just 16 keys by default).
This pool has the keys sorted by idle time, so new keys only enter
the pool when they have an idle time greater than one key in the
pool or when there is empty space in the pool.

This small change improved the performances of the algorithm
dramatically as you can see in the image I linked above and
the implementation was not so complex. A couple memmove() here
and there and a few profiling efforts, but I don’t remember
major bugs in this area.

At the same time, a new redis-cli mode to test the LRU precision
was added (see the —lru-test option), so I had another way to
check the performances of the LRU code with a power-law access
pattern. This tool was used to validate with a different test that
the new algorithm worked better with a more real-world-ish workload.
It also uses pipelining and displays the accesses per second, so
can be used in order to benchmark different implementations, at least
to check obvious speed regressions.

Least Frequently Used
===

The reason I’m writing this blog post right now is because a couple
of days ago I worked at a partial reimplementation and to different
improvements to the Redis cache eviction code.

Everything started from an open issue: when you have multiple databases
with Redis 3.2, the algorithm evicts making local choices. So
if for example you have all keys with a small idle time in DB number 0,
and all keys with large idle time in DB number 1, Redis will evict
one key from each DB. A more rational choice is of course to start
evicting keys from DB number 1, and only later to evict the other keys.

This is usually not a big deal, when Redis is used as a cache it is
rarely used with different DBs, however this is how I started to work
at the eviction code again. Eventually I was able to modify the pool
to include the database ID, and to use a single pool for all the DBs
instead of using multiple pools. It was slower, but by profiling and
tuning the code, eventually it was faster than the original
implementation by around 20%.

However my curiosity for this subsystem of Redis was stimulated again
at that point, and I wanted to improve it. I spent a couple of days
trying to improve the LRU implementation: use a bigger pool maybe?
Account for the time that passes when selecting the best key?

After some time, and after refining my tools, I understood that the
LRU algorithm was limited by the amount of data sampled in the database
and was otherwise very good and hard to improve. This is, actually,
kinda evident from the image showing the different algorithms:
sampling 10 keys per cycle the algorithm was almost as accurate as
theoretical LRU.

Since the original algorithm was hard to improve, I started to test
new algorithms. If we rewind a bit to the start of the blog post, we
said that LRU is actually kinda a trick. What we really want is to
retain keys that have the maximum probability of being accessed in the
future, that are the keys *most frequently accessed*, not the ones with
the latest access.

The algorithm evicting the keys with the least number of accesses
is called LFU. It means Least Frequently Used, which is the feature of
the keys that it attempts to kill to make space for new keys.

In theory LFU is as simple as associating a counter to each key. At
every access the counter gets incremented, so that we know that a given
key is accessed more frequently than another key.

Well, there are at least a few more problems, not specific to Redis,
general issues of LFU implementations:

1. With LFU you cannot use the “move to head” linked list trick used for LRU in order to take elements sorted for eviction in a simple way, since keys must be ordered by number of accesses in “perfect LFU”. Moving the accessed key to the right place can be problematic because there could be many keys with the same score, so the operation can be O(N) in the worst case, even if the key frequency counter changed just a little. Also as we’ll see in point “2” the accesses counter does not always change just a little, there are also sudden large changes.

2. LFU can’t really be as trivial as, just increment the access counter
on each access. As we said, access patterns change over time, so a key
with an high score needs to see its score reduced over time if nobody
keeps accessing it. Our algorithm must be albe to adapt over time.

In Redis the first problems is not a problem: we can just use the trick
used for LRU: random sampling with the pool of candidates. The second
problem remains. So normally LFU implementations have some way in order
to decrement, or halve the access counter from time to time.

Implementing LFU in 24 bits of space
===

LFU has its implementation peculiarities itself, however in Redis all
we can use is our 24 bit LRU field in order to model LFU. To implement
LFU in just 24 bits per objects is a bit more tricky.

What we need to do in 24 bits is:

1. Some kind of access frequency counter.
2. Enough information to decide when to halve the counter.

My solution was to split the 24 bits into two fields:

16 bits 8 bits
+----------------+--------+
+ Last decr time | LOG_C |
+----------------+--------+

The 16 bit field is the last decrement time, so that Redis knows the
last time the counter was decremented, while the 8 bit field is the
actual access counter.

You are thinking that it’s pretty fast to overflow an 8 bit counter,
right? Well, the trick is, instead of using just a counter, I used
a logarithmic counter. This is the function that increments the
counter during accesses to the keys:

uint8_t LFULogIncr(uint8_t counter) {
if (counter == 255) return 255;
double r = (double)rand()/RAND_MAX;
double baseval = counter - LFU_INIT_VAL;
if (baseval < 0) baseval = 0;
double p = 1.0/(baseval*server.lfu_log_factor+1);
if (r < p) counter++;
return counter;
}

Basically the greater is the value of the counter, the less probable
is that the counter will really be incremented: the code above computes
a number ‘p’ between 0 and 1 which is smaller and smaller as the counter
increases. Then it extracts a random number ‘r’ between 0 and 1 and only
increments the counter if ‘r < p’ is true.

You can configure how much aggressively the counter is implemented
via redis.conf parameters, but for instance, with the default
settings, this is what happens:

After 100 hits the value of the counter is 10
After 1000 is 18
After 100k is 142
After 1 million hits it reaches the 255 limit and no longer increments

Now let’s see how this counter is decremented. The 16 bits are used in
order to store the less significant bits of the UNIX time converted
to minutes. As Redis performs random sampling scanning the key space
in search of keys to populate the pool, all keys that are encountered
are checked for decrement. If the last decrement was performed more than
N minutes ago (with N configurable), the value of the counter is halved
if it is an high value, or just decremented if it is a lower value
(in the hope that we can better discriminate among keys with few
accesses, given that our counter resolution is very small).

There is yet another problem, new keys need a chance to survive after
all. In vanilla LFU a just added key has an access score of 0, so it
is a very good candidate for eviction. In Redis new keys start with
an LFU value of 5. This initial value is accounted in the increment
and halving algorithms. Simulations show that with this change keys have
some time in order to accumulate accesses: keys with a score less than
5 will be preferred (non active keys for a long time).

Code and performances
===

The implementation described above can be found in the “unstable” branch
of Redis. My initial tests show that it outperforms LRU in power-law
access patterns, while using the same amount of memory per key, however
real world access patterns may be different: time and space locality
of accesses may change in very different ways, so I’ll be very happy
to learn from real world use cases how LFU is performing, and how the
two parameters that you can tune in the Redis LFU implementation change
the performances for different workloads.

Also an OBJECT FREQ subcommand was added in order to report the
frequency counter for a given key, this can be both useful in order
to observe an application access pattern, and in order to debug the
LFU implementation.

Note that switching at runtime between LRU and LFU policies will have
the effect to start with almost random eviction, since the metadata
accumulated in the 24 bits counter does not match the meaning of the
newly selected policy. However over time it adapts again.

There are probably many improvements possible.

Ben Manes pointed me to this interesting paper, describing an algorithm
called TinyLRU (http://arxiv.org/pdf/1512.00727.pdf).

The paper contains a very neat idea: instead of remembering the
access frequency of the current objects, let’s (probabilistically)
remember the access frequency of all the objects seen so far, this
way we can even refuse new keys if, from the name, we believe they
are likely to get little accesses, so that no eviction is needed at all,
if evicting a key means to lower the hits/misses ratio.

My feeling is that this technique, while very interesting for plain
GET/SET LFU caches, is not applicable to the data structure server
nature of Redis: users expect the key to exist after being created
at least for a few milliseconds. Refusing the key creation at all
seems semantically wrong for Redis.

However Redis maintains LFU informations when a key is overwritten, so
for example after a:

SET oldkey some_new_value

The 24 bit LFU counter is copied to the new object associated to the
old key.

The new eviction code of Redis unstable contains other good news:

1. Policies are now “cross DB”. In the past Redis made local choices as explained at the start of this blog post. Now this is fixed for all the policies, not just LRU.

2. The volatile-ttl eviction policy, which is the one that evicts based on the remaining time to live of keys with an expire set, now uses the pool like the other policies.

3. Performances are better by reusing SDS objects in the pool of keys.

This post ended a lot longer than I expected it to be, but I hope it offered a few insights on the new stuff and the improvements to the old things we already had. Redis, more than a “solution” to solve a specific problem, is a generic tool. It’s up to the sensible developer to apply it in the right way. Many people use Redis as a caching solution, so improvements in this area are always investigated from time to time.

Hacker News comments: https://news.ycombinator.com/item?id=12185534