<antirez>

antirez 1060 days ago. 168620 views.
Yesterday Stripe engineers wrote a detailed report of why they had an issue with Redis. This is very appreciated. In the Hacker News thread I explained that because now we have diskless replication (http://antirez.com/news/81) now persistence is no longer mandatory for people having a master-slaves replicas set. This changes the design constraints: now that we can have diskless replicas synchronization, it is worth it to better support the Stripe (ex?) use case of replicas set with persistence turned down, in a more safe way. This is a work in progress effort.

In the same post Stripe engineers said that they are going to switch to PostgreSQL for the use case where they have issues with Redis, which is a great database indeed, and many times if you can go with the SQL data model and an on-disk database, it is better to use that instead of Redis which is designed for when you really want to scale to a lot of complex operations per second. Stripe engineers also said that they measured the 99th percentile and it was better with PostgreSQL compared to Redis, so in a tweet @aphyr wrote:

“Note that *synchronous* Postgres replication *between AZs* delivers lower 99th latencies than asynchronous Redis”

And I replied:

“It could be useful to look at average latency to better understand what is going on, since I believe the 99% percentile is very affected by the latency spikes that Redis can have running on EC2.”

Which means, if you have also the average, you can tell if the 99th percentile is ruined (or not) by latency spikes, that many times can be solved. Usually it is as simple as that: if you have a very low average, but the 99th percentile is bad, likely it is not that Redis is running slow because, for example, operations performed are very time consuming or blocking, but instead a subset of queries are served slow because of the usual issues in EC2: fork time in certain instances, remote disks I/O, and so forth. Stuff that you can likely address, since for example, there are instance types without the fork latency issue.

For half the Twitter IT community, my statement was to promote the average latency as the right metric over 99th percentiles:

"averages are the worst possible metric for latency. No latency I've ever seen falls on a bell curve. Averages give nonsense."

"You have clearly not understood how the math works or why tail latencies matter in dist sys. I think we're done here."

“indeed; the problem is that averages are not robust in the presence of outliers”

Ehm, who said that average is a good metric? I proposed it to *detect* if there are or not big outliers. So during what was supposed to be a normal exchange, I find after 10 minutes my Twitter completely full of people that tell me that I’m an idiot to endorse averages as The New Metric For Latency in the world. Once you get the first retweets, you got more and more. Even a notable builder of other NoSQL database finds the time to lecture me a few things via Twitter: I reply saying that clearly what I wrote was that if you have 99th + avg you have a better picture of the curve and can understand if the problem is the Redis spikes on EC2, but magically the original tweet gets removed, so my tweets are now more out of context. My three tweets:

1. “may point was, even if in the internet noise I'm not sure if it is still useful, that avg helps to understand why (…)”
2. “the 99% percentile is bad. If avg is very good but 99% percentile is bad, you can suspect a few very bad samples”
3. “this is useful with Redis, since with proper config sometimes you can improve the bad latency samples a lot.”

Guess what? There is even somebody that isolated tweet #2 that was the continuation of “to understand why the 99% percentile is bad” (bad as in, is not providing good figures), and just read it out of context: “the 99% percentile is bad”.

Once upon a time, people used to argue for days on usenet, but at least there was, most of the times, an argument against a new argument and so forth, with enough text and context to have a normal condition. This instead is just amplification of hate and engineering rules 101 together. 99th latency is the right metric and average is a poor one? Make sure to don’t talk about averages even in a context where it makes sense otherwise you get 10000 shitty replies.

What to do with that? Now a good thing about me is that I’m not much affected by all this personally, but it is also clear that because I use Twitter for a matter of work, in order to inform people of what is happening with Redis, this is not a viable working environment. For example, latency: I care a lot about latency, so many efforts were done during the years in order to improve it (including diskless replication). We have monitoring as well in order to understand if and why there are latency spikes, Redis can provide you an human readable report of what is happening inside of it by monitoring different execution paths. After all this work, what you get instead is the wrong message retweeted one million times, which does not help. Most people will not follow the tweets to make an idea themselves, the reality is, at this point, rewritten: I said that average percentile is good and I don’t realize that you should look at the long tail. Next time I’ll talk about latency, for many people, I’ll be the one that has a few non clear ideas about it, so who knows what I’m talking about or what I’m doing?

At the same time Twitter is RSS for humans, it is extremely useful to keep many people updated about what I love to do, which is, to work to my open source project that so far I tried to develop with care. So I’m trying to think about what a viable setup can be. Maybe I can just blog more, and use the Redis mailing list more, and use Twitter just to link stuff so that interested people can read, and interested people can argue and have real and useful discussions.

I’ve a lot of things to do about Redis, for the users that have a good time with it, and a lot of things to do for the users that are experiencing problems. I feel like my time is best spent hacking instead of having non-conversations on Twitter. I love to argue, but this is just a futile exercise.
blog comments powered by Disqus
: