Generally speaking, I love randomized algorithms, but there is one I love particularly since even after you understand how it works, it still remains magical from a programmer point of view. It accomplishes something that is almost illogical given how little it asks for in terms of time or space. This algorithm is called HyperLogLog, and today it is introduced as a new data structure for Redis. Counting unique things === Usually counting unique things, for example the number of unique IPs that connected today to your web site, or the number of unique searches that your users performed, requires to remember all the unique elements encountered so far, in order to match the next element with the set of already seen elements, and increment a counter only if the new element was never seen before.
News posted by antirez
Yesterday and today I managed to spend some time with linenoise (http://github.com/antirez/linenoise), a minimal line-editing library designed to be a simple and small replacement for readline. I was trying to merge a few pull requests, to fix issues, and doing some refactoring at the same time. It was some kind of nirvana I was feeling: a complete control of small, self-contained, and useful code. There is something special in simple code. Here I’m not referring to simplicity to fight complexity or over engineering, but to simplicity per se, auto referential, without goals if not beauty, understandability and elegance.
The title of this blog post is an apparently trivial to answer question, however it is worth to consider a bit better what performance really means: it is easy to get confused between scalability and performance, and to decompose performance, in the specific case of database systems, in its different main components, may not be trivial. In this short blog post I’ll try to write down my current idea of what performance is in the context of database systems. A good starting point is probably the first slide I use lately in my talks about Redis. This first slide is indeed about performance, and says that performance is mainly three different things.
Today Redis is 5 years old, at least if we count starting from the initial HN announcement , that’s actually a good starting point. After all an open source project really exists as soon as it is public. I’m a bit shocked I worked for five years straight to the same thing. The opportunities for learning new things I had because of the directions where Redis pushed me, and the opportunities to learn new things that I missed because I had almost consistently no time for random hacking, are huge.
In this blog post I’m going to describe a very simple distributed algorithm that is useful in different programming scenarios. The algorithm is useful when you need to take some kind of information synchronized among a number of processes. The information can be everything as long as it is composed of a small number of bytes, and as long as it is idempotent, that is, the current value of the information does not depend on the previous value, and we can just replace an old value, with the new one.
Redis Cluster is finally on its road to reach the first stable release in a short timeframe as already discussed in the Redis google group . However despite a design never proposed for the implementation of Redis Cluster was analyzed and discussed at long in the past weeks (unfortunately creating some confusion: many people, including notable personalities of the NoSQL movement, confused the analyzed proposal with Redis Cluster implementation), no attempt was made to analyze or categorize Redis Cluster itself.
One of the steps to reach the goal of providing a "testable" Redis Cluster experience to users within a few weeks, is some serious testing that goes over the usual "I'm running 3 nodes in my macbook, it works". Finally this is possible, since Redis Cluster entered into the "refinements" stage, and most of the system design and implementation is in its final form already. In order to perform some testing I assembled an environment like that: * Hardware: 6 real computers: 2 macbook pro, 2 macbook air, 1 Linux desktop, 1 Linux tiny laptop called EEEpc running with a single core at 800Mhz.
So finally something really good happened from the Redis criticism thread. At the end of the work day I was reading about Redis as AP and merge operations on Twitter. At the same time I was having a private email exchange with Alexis Richardson (from RabbitMQ, and, my boss). Alexis at some point proposed that perhaps a way to improve safety was to asynchronously ACK the client about what commands actually were not received so that the client could retry. This seemed a lot of efforts in the client side, but somewhat totally opened my view on the matter.
A few days ago I tried to do an experiment by running some kind of “call for critiques” in the Redis mailing list: https://groups.google.com/forum/#!topic/redis-db/Oazt2k7Lzz4 The thread has reached 89 posts so far, probably one of the biggest threads in the history of the Redis google group. The main idea was that critiques are a mix of pointless attacks, and truth, so to extract the truth from critiques can be a good exercise, it means to have some seed idea for future improvements from the part of the population that is not using or is not happy with your system.
Redis unstable has a new command called "WAIT". Such a simple name, is indeed the incarnation of a simple feature consisting of less than 200 lines of code, but providing an interesting way to change the default behavior of Redis replication. The feature was extremely easy to implement because of previous work made. WAIT was basically a direct consequence of the new Redis replication design (that started with Redis 2.8). The feature itself is in a form that respects the design of Redis, so it is relatively different from other implementations of synchronous replication, both at API level, and from the point of view of the degree of consistency it is able to ensure.