Redis latency spikes and the 99th percentile

antirez 4276 days ago. 296559 views.

One interesting thing about the Stripe blog post about Redis is that they included latency graphs obtained during their tests. In order to persist on disk Redis requires to call the fork() system call. Usually forking using physical servers, and most hypervisors, is fast even with big processes. However Xen is slow to fork, so with certain EC2 instance types (and other virtual servers providers as well), it is possible to have serious latency spikes every time the parent process forks in order to persist on disk. The Stripe graph is pretty clear in this regard.

As you can guess, if you perform a latency test during the fork, all the requests crossing the moment the parent process forks will be delayed up to one second (taking as example the graph above, not sure about what was the process size nor the EC2 instance). This will produce a number of samples with high latency, and will affect the 99th percentile result.

To change instance type, configuration, setup, or whatever in order to improve this behavior is a good idea, and there are use cases where even a single request having a too high latency is unacceptable. However apparently it is not obvious how latency spikes of 1 second every 30 minutes (or more, if you use AOF with the right rewrite triggers) is very different from latency spikes which are evenly distributed in the set of requests.

With evenly distributed spikes, if the generation of a page needs to perform a number of requests to a Redis server in order to create the output, it is very likely that a page view will incur in the latency penalty: this impacts the quality of service in a great way potentially, check this link: http://latencytipoftheday.blogspot.it/2014/06/latencytipoftheday-most-page-loads.html.

However 1 second of latency every 30 minutes run is a completely different thing. For once, the percentile with good latency gets better *as the number of requests increase*, since the more the requests are, the more this second of latency will be unlikely to get over-represented in the samples (if you have just 1 request per minute, and one of those requests happen to hit the high latency, it will affect the 99.99th percentile much more than what happens with 100 requests per second).

Second: most page views will be unaffected. The only users that will see the 1 second delay are the ones that make a request crossing the fork call. All the other requests will experience an extremely low probability of hitting a request that has a latency which is significantly worse than the average latency. Also note that a page view crossing the fork time, even when composed of 100 requests, can’t be delayed for more than a second, since the requests are completed as soon as the fork() call terminates.

The bottom line here is that, if there are hard latency requirements for each single request, it is clear that a setup where a request can be delayed 1 second from time to time is a big problem. However when the goal is to provide a good quality of service, the distribution of the latency spikes have a huge effect on the outcome. Redis latency spikes due to fork on Xen are isolated points in the line of time, so they affect a percentage of page views, even when the page views are composed of a big number of Redis requests, which is proportional to the latency spike total time percentage, which is, 1 second every 1800 seconds in this case, so only the 0.05% of the page views will be affected.

Latency characteristics are hard to capture with a single metric: the full percentile curve and the distribution of the spikes, can provide a better picture. In general good rule of thumbs are a good way to start a research, and in general it is true that the average latency is a poor metric. However to promote a rule of thumb into an absolute truth has also its disadvantages, since many complex things remain complex, and in need for close inspection, regardless of our willingness to over simplify them.

At the same time, fork delays in EC2 instances are one of the worst experiences for Redis users in one of the most popular runtime environments today, so I’m starting to test Redis with EC2 regularly now: we’ll soon have EC2 specific optimization pages on the Redis official documentation, and a way to operate master-slaves replicas with persistence disabled in a safer way.

If you need EC2 + Redis master with persistence disabled now, the simplest to deploy “quick fix” is to disable automatic restarts of Redis instances, and use Sentinel for failover, so that crashed masters will not automatically return available, and will be failed over by Sentinel. The system administrator can restart the master manually after checking that the failover was successful and there is a new active master.

EDIT: Make sure to check the Hacker News thread that contains interesting information about EC2, Xen and fork time: https://news.ycombinator.com/item?id=8532851. Also not all the EC2 instances are the same, and certain types provide great fork time on pair with bare metal systems: https://redislabs.com/blog/testing-fork-time-on-awsxen-infrastructure#.VFJQ-JPF8yF