<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title>
Antirez weblog
</title>
 <link>
http://127.0.0.1:4567
</link>
 <description>Description pending</description> <item><title>
Availability on planet Terah
</title>
 <guid>http://antirez.com/news/57</guid> <link>
http://antirez.com/news/57
</link>
 <description><![CDATA[Terah is a planet far away, where networks never split. They have a single issue with their computer networks: from time to time, single hosts break in one way or another. Sometimes it is a broken power supply, other times a crashed disk, or a software issue completely blocking the system.
<br>
<br>The inhabitants of this strange planet use two database systems. One is imported from planet Earth via the Galactic Exchange Program, and is called EDB. The other is produced by engineers from Terah, and is called TDB. The databases are functionally equivalent, but they have different semantics when a network partition happens. While the database from Earth stops accepting writes as long as it is not connected with the majority of the other database nodes, the database from Terah works as long as the majority of the clients can reach at least one database node (incidentally, the author of this story released a similar software project called Sentinel, but this is just a coincidence).
<br>
<br>Terah users have setups like the following, with three database nodes and three web servers running clients that query the database nodes ("D" denotes a database node, "C" a client).
<br>
<br>              D  D  D
<br>
<br>              C  C  C
<br>
<br>EDB is designed to avoid problems on partitions like this:
<br>
<br>              D1 \ D  D
<br>                 /
<br>              C1 \ C  C
<br>
<br>C1 writing to D1 may result in lost writes if D1 happened to be the master.
<br>
<br>However on Terah net splits are not an issue: they invented a solution for all network partitions back in Galactic Year 712! Still, their technology is currently not able to prevent single hosts from failing.
<br>
<br>There is a sysop from Terah, Daniel Fucbitz, who always complains about EDB. He does not understand why on Earth… oops, on Terah I mean, his company keeps importing EDB, which causes a lot of trouble. He reasons like this: "If a single node of my network fails, I'm safe with both EDB and TDB, but what happens if one night I'm not lucky and two hosts fail at the same time?".
<br>
<br>Actually with EDB, if two of the six nodes fail during the same night, and these nodes happen to be two "D" nodes, the system will stop working. The probability for this to happen is (3/6)*(2/5), that is... 20%!
<br>
<br>On the other hand TDB will never stop working as long as only two nodes fail.
<br>
<br>And what about three nodes failing at the same time? With EDB this will bring the system down with a probability of 50% (at least two "D" nodes down) + 5% (all clients down), for a total probability of 55%.
<br>
<br>TDB, instead, would stop working with a probability of just 5% (all three DB nodes down), plus 15% (master plus two clients down, no promotion possible), plus 5% (all clients down), for a total of 25%.
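<br>
<br>The figures above can be double checked by brute force. What follows is a minimal sketch, not part of the original story: it assumes that every set of failed hosts of a given size is equally likely, that D0 is the master, that EDB stops when it loses the majority of "D" nodes or all the clients, and that TDB needs at least two live clients to promote a replacement when the master is down. The function names are made up for the example.
<br>
```python
from fractions import Fraction
from itertools import combinations

# Three database nodes (D0 is assumed to be the master) and three clients.
NODES = ["D0", "D1", "D2", "C0", "C1", "C2"]

def count(failed, kind):
    """Number of failed hosts whose name starts with the given letter."""
    return sum(1 for name in failed if name.startswith(kind))

def edb_down(failed):
    """EDB stops without a majority of D nodes, or when no client is left."""
    return count(failed, "D") >= 2 or count(failed, "C") == 3

def tdb_down(failed):
    """TDB stops when no D node or no client is left, or when the master is
    down and fewer than two clients survive to agree on a promotion."""
    no_promotion = "D0" in failed and count(failed, "C") >= 2
    return count(failed, "D") == 3 or count(failed, "C") == 3 or no_promotion

def outage_probability(is_down, failures):
    """Share of the equally likely sets of failed hosts that cause an outage."""
    cases = list(combinations(NODES, failures))
    return Fraction(sum(1 for failed in cases if is_down(failed)), len(cases))

for failures in (2, 3):
    print(failures, "failures:",
          "EDB", outage_probability(edb_down, failures),
          "TDB", outage_probability(tdb_down, failures))
```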
<br>
<br>Daniel Fucbitz sometimes looks out of his office window, waiting for the third sun to rise, thinking that, yes, on planet Earth it is nice to resist partitions, but it really is not free at all.
<a href="http://antirez.com/news/57">Comments</a>]]></description> <comments>http://antirez.com/news/57</comments></item>
<item><title>
Reply to Aphyr, part 2
</title>
 <guid>http://antirez.com/news/56</guid> <link>
http://antirez.com/news/56
</link>
 <description><![CDATA[Thanks to Aphyr, who replied to my reply here: http://aphyr.com/posts/287-asynchronous-replication-with-failover
<br>
<br>I'll try to continue the good exchange. First of all I would propose the notation Aphyr used in his blog post as the standard for Redis network representation.
<br>As you can see the rules are simple and nice, and very expressive:
<br>
<br>1) R is Redis, S is Sentinel, C is Client.
<br>2) An R inside a square is a master.
<br>3) There is a segment between every instance that can replicate with another one.
<br>4) Letters not separated by spaces, like CS, mean the instances are inside the same virtual or physical computer.
<br>5) An arrow shows promotion, like |R| -> R
<br>
<br>In order to specify Sentinel quorum I add Q<quorum>. Example Q5.
<br>
<br>A matter of point of view
<br>===
<br>
<br><aphyr> If you use any kind of failover, your Redis system is a distributed store. Heck, reading from secondaries makes Redis a distributed store.
<br>
<br>I think we agree about that: you can surely consider the system as a whole as a distributed system, and this is why I recognized the analysis as valid. I just meant that while the system has distributed system properties, Sentinel itself is focused on a specific role, and with this limited role, as Aphyr shows, Sentinel can't change the fact that Redis is a master / slave system using asynchronous replication.
<br>
<br><antirez> Do you want your Redis master stopping to accept writes when it is no longer able to replicate to its slaves?
<br>
<br><aphyr> Yes. This is required for a CP system with failover.
<br>
<br>Here there is simply a different point of view: Aphyr is reasoning in terms of formal correctness, while I'm reasoning in terms of real-world production environments. Note that the two things overlap a lot. For example AP systems are important in the real world, and similarly even the most committed theoretician can't deny that in most setups splits do not happen at random: there is a bigger probability of a split happening along defined lines (like a switch, a WAN connection, or the connection of a single host).
<br>
<br>So basically what Aphyr suggests is that the following partition should stop the master from working (I'll change his notation into ASCII art, pardon):
<br>
<br>             R  R      /  |R|
<br>                       \
<br>                       /   CS   CS   CS
<br>                       \
<br>                       /  Q2
<br>
<br>Even if clients are perfectly able to write to the master and the partition would resolve without issues.
<br>
<br>Who is right here? It is simply a matter of assumptions. Redis Sentinel assumes that in your environment there are the following two characteristics:
<br>
<br>1) Most of the time you see single hosts going down.
<br>2) When a partition happens, it is very likely to happen in ways you can predict.
<br>
<br>If "1" and "2" are true, you can place Sentinels in a way that keeps your operations going even in cases where theoretically you should sacrifice availability.
<br>
<br>When instead this is not true and a completely random (or attacker chosen) partition happens, you'll need to consider the system as an eventually consistent system where the merge step is just destroying the data of the minority side (!).
<br>
<br>More about this later, now following the order of Aphyr's article, I want to open a short parenthesis.
<br>
<br>Redis as an AP system
<br>===
<br>
<br><aphyr> You could do this as an application developer by setting every Redis node to be a primary, and writing a proxy layer which uses, say, consistent hashing and active anti-entropy to replicate writes between nodes. Take a look at Antirez's own experiments in this direction (http://antirez.com/news/36).
<br>
<br>About that, this is probably a good time to generalize a bit the ideas expressed in the blog post Aphyr mentioned.
<br>
<br>When Aphyr attacks Riak in his series, you see that basically with Dynamo-like systems you either use application-assisted merges of values or you are going to run into data loss. Yet most users, also encouraged by the fact that for instance Riak uses last-write-wins as a default setting, think of AP systems as magical devices that will preserve your data.
<br>
<br>However once you want to follow the right rules, merging stuff with the help of the application, you are d00med again, as it's a PAIN IN THE ASS. This is exactly why Riak switched to last-write-wins as a default: systems need to be practical and obey the needs of the user base, even if this means data loss; people should just be aware of that (and Aphyr contributed to the good cause). Another parallel with IT security? Database behavior must be psychologically acceptable for the programmer, like security.
<br>
<br>Now what I think is that you can have the cake and eat it if you mix Redis data types with the Dynamo model, that is, use Redis data structures and characterize every different data structure with a "type specific merge strategy".
<br>
<br>Sets: take the union of the two sets. Possible use: shopping carts.
<br>Lists: augment every element using clocks. On merge, add the elements not common to both lists, preserving an approximate ordering using the clocks.
<br>Sorted sets: take the union of all the elements, score is last-write-wins.
<br>Hashes: take the union of all the fields of the represented object, on value conflict, last write wins.
<br>Strings: last-write-wins
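<br>
<br>A minimal sketch of what such type specific merges could look like. It is only an illustration: the function names are made up, every value that needs last-write-wins is assumed to carry a (timestamp, data) pair coming from some clock, and lists are left out because their clock based ordering needs more machinery than fits here.
<br>
```python
def merge_set(a, b):
    """Sets: take the union of the two sets."""
    return a | b

def merge_string(a, b):
    """Strings: last-write-wins. Each value is a (timestamp, data) pair."""
    return max(a, b, key=lambda pair: pair[0])

def merge_hash(a, b):
    """Hashes: union of the fields; on value conflict, last write wins.
    Each field maps to a (timestamp, data) pair."""
    merged = dict(a)
    for field, pair in b.items():
        merged[field] = merge_string(merged[field], pair) if field in merged else pair
    return merged

def merge_sorted_set(a, b):
    """Sorted sets: union of the elements; the score is last-write-wins.
    Each member maps to a (timestamp, score) pair, so the hash rule applies."""
    return merge_hash(a, b)

# Two replicas that diverged during a partition.
cart = merge_set({"book", "pen"}, {"pen", "lamp"})
user = merge_hash({"name": (1, "Ann"), "city": (5, "Rome")},
                  {"name": (3, "Anna"), "age": (2, "30")})
print(sorted(cart), user)
```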
<br>
<br>By moving specific semantics into specific types, the user may choose the right data type depending on the expected behavior on merge. When the need is to retain most of the information even if duplicates will appear, use Sets. When you want to save order, like in time series, use lists, and so forth.
<br>
<br>There is a lot of work to do to understand which merging strategies are the most sensible in order to model many problems with minimal effort; the above is just an example. But long story short, this way you have an AP store that does not present you with duplicated values asking for a merge, but that at the same time, as long as you use the right data type, is able to guarantee some defined merge behavior out of the box.
<br>
<br>p.s. I would add "counters" as a native data type for such a store since they require special handling to obtain a good result.
<br>
<br>About topology
<br>===
<br>
<br><antirez> … place your Sentinels and set your quorum so that you are defensive enough against partitions. This way the system will activate only when the issue is really the master node down, not a network problem. Fear data loss and partitions? Have 10 Linux boxes? Put a Sentinel in every box and set quorum to 8.
<br>
<br><aphyr> I… can't parse this statement in a way that makes sense. Adding more boxes to a distributed system doesn't reduce the probability of partitions–and more to the point, trying to determine the state of a distributed system from outside the system itself is fundamentally flawed.
<br>
<br>My point is:
<br>
<br>1) If you place sentinels with clients, you can use the quorum to dictate how big a minority writing to the old master can be (the part of writes you'll lose)
<br>2) If you have ways to predict different network splits probabilities, you can use this superpower to place C, S, and R so that the probability of a "bad" partition (one leading to data loss) is small.
<br>
<br>Also the examples Aphyr makes there are about sentinels running alone in their own instances. This is fair if you want to add to your analysis the further rule that even processes running in the same physical computer have different logical connections to the network, but I think this is a very strict assumption to make. I would assume that if clients and sentinels run in the same host there is a very small probability of partitions like the ones in the examples.
<br>
<br>I think that within the scope of real world networks there is value in placing the "observer" in different places, since you can model the way it fails. Especially when you have a few Redis nodes and a big number of clients, dictating the availability from the point of view of the Redis nodes is more likely to result in unwanted / unnecessary failovers, like in the following example:
<br>
<br>                    C C C C C C C C C C
<br>                         |RS|
<br>                    - - - - - - - - - - - -          
<br>                        RS RS  Q2
<br>
<br>Here you would see a promotion in the side without clients, and all the writes destroyed on rejoin. Or, following Aphyr's advice to have a pure CP system, this would result in sacrificing availability: all the clients unable to write to a master that is otherwise working.
<br>
<br>I would sleep better with:
<br>
<br>                    CS CS CS CS CS CS
<br>                      CS CS CS CS           Q8
<br>                          |R| R R
<br>
<br>In that setup I know two things:
<br>
<br>1) the system will try hard to go forward as long as there is a DB visible by the majority of clients.
<br>2) if bad partitions happen, to lose data I need to have at most 2 clients isolated with the old master, otherwise only the other side will be able to write and no data loss will result after rejoin.
<br>
<br>So my point is that if you augment CAP with the probability of failures and partitions of a given type, and add to the mix a master -> slave setup with asynchronous replication, the result is that designing practical systems that apparently violate good theory is sometimes a good idea.
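<br>
<br>The "at most 2 clients" figure is just the number of Sentinels minus the quorum. Here is a tiny sketch of that arithmetic, under the simplified assumptions that every client host runs exactly one Sentinel and that a promotion needs a quorum of Sentinels on the side that cannot see the master. The function names are hypothetical and this is not how Sentinel itself is implemented.
<br>
```python
def failover_possible(total_sentinels, quorum, isolated_with_master):
    """True if the Sentinels cut off from the master can still reach the quorum."""
    return total_sentinels - isolated_with_master >= quorum

def max_lossy_minority(total_sentinels, quorum):
    """Largest group of client+Sentinel hosts that can be isolated with the
    old master while the other side is still able to promote a slave."""
    return max(total_sentinels - quorum, 0)

# The "CS x 10, Q8" setup from the diagram above.
print(max_lossy_minority(10, 8))
print(failover_possible(10, 8, 2), failover_possible(10, 8, 3))
```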
<br>
<br>Not just Sentinel
<br>===
<br>
<br>Aphyr then continues saying that all the master -> slave systems with asynchronous replication are like that.
<br>
<br>I don't agree in general. What I mean is that systems like Redis Cluster, where monitoring is performed among nodes and only the side with the majority of master nodes continues to operate, only provide a *window* of data loss. After a few seconds (depending on the configuration) the minority side stops accepting writes: it is a master -> slave system in its essence, but with time-bound data loss on splits.
<br>
<br>The net result is that, yes, you may still lose data, but the fact that the data loss is limited in time completely changes the kind of applications you may use a given system for.
<br>
<br>Thanks again to Aphyr for the good exchange.
<a href="http://antirez.com/news/56">Comments</a>]]></description> <comments>http://antirez.com/news/56</comments></item>
<item><title>
Reply to Aphyr attack to Sentinel
</title>
 <guid>http://antirez.com/news/55</guid> <link>
http://antirez.com/news/55
</link>
 <description><![CDATA[In a great series of articles Kyle Kingsbury, aka @aphyr on Twitter, attacked a number of data stores:
<br>
<br>[1] http://aphyr.com/tags/jepsen
<br>
<br>Postgres, Redis Sentinel, MongoDB, and Riak are audited to find what happens during network partitions and how these systems can provide the claimed guarantees.
<br>
<br>Redis is attacked here: http://aphyr.com/posts/283-call-me-maybe-redis
<br>
<br>I said that Kyle "attacked" the systems on purpose, as I see a parallel with the world of computer security here: it is really a good idea to move this paradigm to the database world, to show failure modes of systems against the claims of vendors. Similarly to what happens in the security world, the vendor may take the right steps to fix the system when possible, or the user base will simply be able to recognize that under certain circumstances something bad is going to happen to their data.
<br>
<br>Another awesome thing in Kyle's series is the educational tone: almost nothing is taken for granted, and the articles can be read by anyone, from people who never cared about databases to distributed systems experts. Well done!
<br>
<br>In this blog post I'll try to address the points Kyle made about Redis Sentinel, which is the system he tested.
<br>
<br>Sentinel goals
<br>===
<br>
<br>In the post Kyle writes "What are the consistency and availability properties of Sentinel?".
<br>Probably this is the only flaw I saw in this article.
<br>
<br>Redis Sentinel is a distributed *monitoring* system, with support for automatic failover.
<br>It is in no way a shell that wraps N Redis instances into a distributed data store. So if you consider the properties of the "Redis data store + Sentinel", what you'll find is the properties of any Redis master-slave system where there is an external component that can promote a slave into a master under certain circumstances, but has limited abilities to change the fundamental fact that Redis, without Redis Cluster, is not a distributed data store.
<br>
<br>However it is also true that Redis Sentinel acts as a configuration device as well, even with the help of clients, so as a whole it is a complex system with a given behavior that's worth analyzing.
<br>
<br>What I'm saying here is just that the goal of the system is:
<br>
<br>1) To promote a slave into a master if the master fails.
<br>2) To do so in a reliable way.
<br>
<br>All the stress here is on point "2", that is, the fact that sentinels can be placed outside the master-slaves system lets the user pick a more objective point of view from which to declare the master as failing.
<br>
<br>And another property is that Sentinel is distributed enough so that single sentinels can fail at any time, including during the failover process, and the process will still continue unaffected as long as it is still possible to reach the majority.
<br>
<br>I think that the goal of Redis Sentinel is pretty clear, so I'm surprised (not in a negative way) that it was tested by creating a partition where the old master is in the minority together with a client, and then showing that the client was still able to write to the old master. I honestly don't think any user expects something different from Redis Sentinel. That said, I'll ignore this fact from now on and reply to the different parts of the article, as there is important information anyway IMHO, especially since, after all, Redis Sentinel + N Redis instances + M Clients is "A System", so Kyle's analysis makes sense even under my above assumptions.
<br>
<br>Partitioning the cluster
<br>===
<br>
<br>Ok I just made clear enough that there is no such goal in Sentinel to turn N Redis instances into a distributed store, so basically what happens is that:
<br>
<br>1) Clients in the majority side will be able to continue to write once the failover is complete.
<br>2) Clients in the minority side may possibly write to the old master, and when the network is ok again, the master will be turned into a slave of the new master, so all the writes in the minority side are lost forever.
<br>
<br>So you can say, ok, Sentinel has a limited scope, but could you add a feature so that when the master feels it is in the minority it no longer accepts writes? I don't think it's a good idea. What does it mean to be in the minority for a Redis master monitored by Sentinels (especially given that Redis and Sentinel are completely separated systems)?
<br>
<br>Do you want your Redis master stopping to accept writes when it is no longer able to replicate to its slaves? Or do you want it when enough Sentinels are down? My guess is that given the goals of the system, instead of going down the road of stopping the master for possibly harmless conditions (or not as bad as a stopped master), just use the fact that Sentinel is very configurable: place your Sentinels and set your quorum so that you are defensive enough against partitions. This way the system will activate only when the issue is really the master node down, not a network problem. Fear data loss and partitions? Have 10 Linux boxes? Put a Sentinel in every box and set quorum to 8.
<br>
<br>Just to be clear, the criticism is a good one, and it shows how Sentinel is not good at handling complex net splits with minimal data loss. This was just never the goal, and what users were doing with their home-made scripts to handle failover was in 99% of cases much worse than what Sentinel achieves in failure detection and handling of the failover process.
<br>
<br>Redis consensus protocol
<br>===
<br>
<br>Another criticism is that the Sentinel protocol is complex to analyze, and even requires some help from the client.
<br>
<br>It is true that it is a complex protocol, because while the agreement is vaguely Byzantine looking, it is actually a dynamic process without an ordered number of steps to reach an agreement. Simply, the state about different things, like whether a node is failing or not and who should perform the promotion, is broadcast continuously among sentinels.
<br>
<br>A majority is basically reached when the state of N nodes (with N >= quorum), no older than a given number of seconds, agrees about something.
<br>
<br>Both failure detection and the election of the sentinel doing the failover are reasonable candidates for this informal protocol, since the information every sentinel has about the availability of a given instance, or of another sentinel, is itself a moving target. Also the rest of the system is designed to be resistant against errors in the agreement protocol (the first sentinel recognizing a failure will force all the others to recognize it, and the failover process is auto-detected by the other instances that can monitor the elected slave. Also care is taken to avoid a protocol that is fragile against multiple sentinels doing the failover at the same time, if this may ever happen).
<br>
<br>Kyle notes that there is the concept of TILT, so that Sentinel is sensitive to clock skew and desynchronization. Actually there is no explicit use of absolute time in the protocol, nor are Sentinels required to have synchronized clocks at all.
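<br>
<br>As a rough illustration of this rule only (this is not the actual Sentinel code, and the function name, the shape of the reports and the timestamps are all assumptions made for the example), the check could be sketched like this:
<br>
```python
import time

def quorum_agrees(reports, quorum, max_age, now=None):
    """reports maps a sentinel id to (opinion, time_of_last_update).
    Returns an opinion shared by at least `quorum` fresh reports, else None."""
    now = time.time() if now is None else now
    counts = {}
    for opinion, updated_at in reports.values():
        if now - updated_at > max_age:
            continue  # state that is too old does not count
        counts[opinion] = counts.get(opinion, 0) + 1
    for opinion, votes in counts.items():
        if votes >= quorum:
            return opinion
    return None

# Four sentinels report about the master; the report of s4 is stale.
reports = {"s1": ("down", 100), "s2": ("down", 99),
           "s3": ("up", 100), "s4": ("down", 80)}
print(quorum_agrees(reports, quorum=2, max_age=5, now=101))
print(quorum_agrees(reports, quorum=3, max_age=5, now=101))
```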
<br>
<br>Just to clarify, TILT is a special mode that is used when Sentinel detects that its internal state is corrupted in one of two ways: either the system clock jumped into the past, so a Sentinel can no longer trust its *internal* state, or the clock appears to have jumped into the future, which means the sentinel process was blocked for a long time for some reason. In both cases such a sentinel will enter TILT mode, so it will stop acting for some time, until the state is believed to be reliable again. TILT is basically not part of the Sentinel protocol, but just a programming trick to make the system more reliable in the presence of strange behaviors from the operating system.
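<br>
<br>The idea can be sketched in a few lines. This is a simplified illustration and not the Sentinel source code: the class name and the two constants are assumptions picked for the example. The detector compares the clock between two consecutive periodic calls, and stays silent for a while if the difference is negative or unexpectedly large.
<br>
```python
TILT_TRIGGER = 2.0   # seconds; hypothetical threshold for "the clock jumped"
TILT_PERIOD = 30.0   # seconds; hypothetical time to stay inactive

class TiltDetector:
    def __init__(self, now):
        self.previous_tick = now
        self.tilt_started = None

    def tick(self, now):
        """Call periodically with the current clock; returns True while in TILT."""
        delta = now - self.previous_tick
        self.previous_tick = now
        if delta < 0 or delta > TILT_TRIGGER:
            # Clock moved backwards, or the process was blocked for too long.
            self.tilt_started = now
        elif self.tilt_started is not None and now - self.tilt_started > TILT_PERIOD:
            # Enough regular ticks have passed: trust the state again.
            self.tilt_started = None
        return self.tilt_started is not None

detector = TiltDetector(now=0.0)
print(detector.tick(0.1))    # a regular tick
print(detector.tick(10.0))   # the process was blocked for almost 10 seconds
```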
<br>
<br>Involvement of the clients
<br>===
<br>
<br>In Sentinel the involvement of clients is not mandatory, since you may want to run a script during a failover so that the configuration changes in some permanent way.
<br>
<br>However the suggested mode of operation is to use clients that refresh the information when a reconnection is needed (actually we are going in the direction of forcing a periodic state refresh, and when Sentinel demotes a reappearing old master we'll send a command to the old master that forces all the connections to be dropped; this improves the reliability of the system considerably).
<br>
<br>So in the article I can read:
<br>
<br>* Sentinels could promote a node no clients can see
<br>* Sentinels could demote the only node clients can actually reach
<br>* …
<br>
<br>And so forth. Again, the point here is that Sentinel is designed exactly to let you pick your tradeoffs from that point of view, and the documentation suggests that your Sentinels run on the same machines where you run your clients, web servers, and so forth, not on the Redis server nodes.
<br>
<br>Because indeed the point of view from which you want to say something is "down" is almost always the point of view of the client.
<br>
<br>Broken things Kyle did not investigate
<br>===
<br>
<br>Kyle did great work showing what you should *not* expect from Sentinel.
<br>
<br>There is much more we have to fix, because HA is a complex problem in master -> slave systems. For instance the current version of Sentinel does not handle well enough reconfigured instances that reboot with an old config: sometimes you may just lose a slave that is ignored by Sentinels.
<br>
<br>This and other problems are still a work in progress, and what I'm trying to provide with Redis Sentinel is a monitoring and failover solution that does not suck so much, as in, you can select the point of view of what "down" means, both topologically and as a quorum, and you can be sure that a few sentinels going away will not break your failover process.
<br>
<br>Redis Cluster
<br>===
<br>
<br>Redis Cluster is a system much more similar to what Kyle had in mind when testing Sentinel. For instance after a split the side with the minority of masters will stop accepting writes, so while there is always a window for data loss, in the big picture there is always only a single part of the network that accepts writes.
<br>
<br>I invite you to read the other articles in Kyle's series, they are very informative.
<br>
<br>Thank you Kyle, please keep attacking databases.
<a href="http://antirez.com/news/55">Comments</a>]]></description> <comments>http://antirez.com/news/55</comments></item>
<item><title>
Redis configuration rewriting
</title>
 <guid>http://antirez.com/news/54</guid> <link>
http://antirez.com/news/54
</link>
 <description><![CDATA[Lately I've been trying to push Redis 2.8 forward enough to reach the feature freeze and release it as a stable release as soon as possible.
<br>Redis 2.8 will not contain Redis Cluster, and its implementation of Redis Sentinel is the same as in the 2.6 and unstable branches (Sentinel is kept mostly in sync across all the branches, being fundamentally a different project that uses Redis just as a framework).
<br>
<br>However there are many interesting new features in Redis 2.8 that are back ported from the unstable branch. Basically 2.8 is our usual "in the middle" release, like 2.4 was: while waiting for Redis 3.0, which will feature Redis Cluster (we are making great progress on it! See https://vimeo.com/63672368), we'll have a 2.8 release with everything in the unstable branch that is ready to be released. The goal is of course to put more things in the hands of users ASAP.
<br>
<br>The big three new entries in Redis 2.8 are replication partial resynchronizations (already covered in this blog), keyspace events notifications via Pub/Sub, and finally CONFIG REWRITE, a feature I just finished implementing (you can find it in the config-rewrite branch at Github). This post explains what CONFIG REWRITE is.
<br>
<br>An inverted workflow
<br>===
<br>
<br>Almost every unix daemon works like this:
<br>
<br>1) You have a configuration file.
<br>2) When you need to hack the configuration, you modify it and either restart the daemon, or send it a signal to reload the config.
<br>
<br>It's been this way forever, but with Redis I took a different path from the start: as soon as I understood that people created "farms" of Redis servers, either to provision them on demand or for internal usage where a big number of Redis servers are used, I really wanted to provide a different paradigm that was more "network oriented".
<br>
<br>This is why I introduced the CONFIG command, with its sub commands GET and SET. At the start the abilities of CONFIG were pretty basic, but now you can reconfigure almost every aspect of Redis on the fly, just sending commands to the server. This is extreme enough that you can change the persistence type while an instance is running. For example just typing:
<br>
<br>    CONFIG SET appendonly yes
<br>
<br>will switch on the Append Only File, start a rewrite process to create the first AOF, and so forth. Similarly it is possible to alter everything from replication to memory limits and policy while the server is running, just interacting with the server with normal commands, without any "hook" inside the operating system running Redis.
<br>
<br>The symmetrical command CONFIG GET is used to query the configuration. Some users are more conservative in how they handle their servers and may want to always touch the configurations manually, but my idea was that the two commands provided quite a powerful system to handle a large number of instances in a scriptable way without the use of additional software layers, and avoiding restarts of the server that are costly, especially in the case of an in memory database.
<br>
<br>However there was a major issue: after you modified an important configuration parameter with CONFIG SET, you later had to copy the change into the redis.conf file manually, so that after a restart Redis would use the new config. As you can guess this was a huge limitation: basically the CONFIG API was only useful to hack the config live and avoid a reboot, but manual intervention, or some other software layer to handle the configuration of your servers, was needed anyway.
<br>
<br>So the idea to solve this issue was to add as soon as possible a new command, CONFIG REWRITE, that would rewrite the Redis configuration file to reflect the changes made in memory.
<br>So the new workflow would be like this:
<br>
<br>    CONFIG SET appendonly yes
<br>    CONFIG REWRITE
<br>
<br>However I was trying to do a complete refactoring of the config.c file in order to implement this feature easily, and this was the best recipe to delay the feature forever… Finally I decided to implement it before the 2.8 release, without waiting for a refactoring, but in a way that is refactor-friendly. So basically, we finally have it!
<br>
<br>I believe that now that CONFIG REWRITE completes the triad of the configuration API, users will greatly benefit from it: small users will be able to make configuration changes from redis-cli in a very reliable way, without a restart and without the possibility of configuration errors in redis.conf, and big users will of course find it very useful for scripting a large farm of Redis instances.
<br>
<br>Before continuing: if you want to play with config rewrite, clone the config-rewrite branch at Github and play with it (the feature will be merged into 2.8 and unstable soon).
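<br>
<br>For the scripting use case, the CONFIG SET + CONFIG REWRITE pair could be wrapped like in the following sketch. It assumes a client object exposing config_set(name, value) and config_rewrite() methods, which is the shape of the redis-py client; other client libraries may use different names. A small stand-in client is used here so that the example runs without a server.
<br>
```python
def set_and_persist(client, settings):
    """Apply each setting with CONFIG SET, then persist them with CONFIG REWRITE.

    `client` is any object exposing config_set(name, value) and config_rewrite()
    (an assumption modeled on redis-py; adapt it to your own client library).
    """
    for name, value in settings.items():
        client.config_set(name, value)
    client.config_rewrite()

class RecordingClient:
    """Stand-in that only records the commands it would send to the server."""
    def __init__(self):
        self.commands = []

    def config_set(self, name, value):
        self.commands.append(("CONFIG SET", name, value))

    def config_rewrite(self):
        self.commands.append(("CONFIG REWRITE",))

fake = RecordingClient()
set_and_persist(fake, {"appendonly": "yes"})
print(fake.commands)
```
<br>
<br>With a farm of instances the same function would simply be called once per connection.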
<br>
<br>A gentle rewrite
<br>===
<br>
<br>Rewriting a configuration file is harder than it seems at first. Actually doing a brutal rewrite is trivial: you just write every configuration parameter with its current value in the new file, and you are done, but this has a number of side effects:
<br>
<br>1) User comments and overall redis.conf structure go away, lost forever.
<br>2) You get a number of things set explicitly to their default values.
<br>3) After a server upgrade, because of "2", you may end up running with an old default value that has since changed.
<br>
<br>So CONFIG REWRITE tries to follow a set of rules to make the rewriting more gentle, touching only the minimum possible, and preserving everything else.
<br>
<br>This is how it works:
<br>
<br>1) Comments are always preserved.
<br>2) If an option was already present in the old redis.conf, the same line is used for the same option in the new file.
<br>3) If an option was not present and is set at its default value, it is not added.
<br>4) If an option was not present, but the new value is no longer its default, the option is appended at the end of the file.
<br>5) All the no longer useful lines in the old configuration file are blanked (for example if there were three "save" options, but only two are used in the new config).
<br>
<br>However if the configuration file for some reason no longer exists, CONFIG REWRITE will create it from scratch. The rules followed are the above anyway, just assuming an empty old configuration file, so the effect is to just produce a configuration file with every option not set to the default value.
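<br>
<br>To make the rules more concrete, here is a simplified Python sketch of the same idea. It is not the code in config.c: the function name and data layout are made up for the example, every option is modeled as a list of value strings (one per line, with lowercase option names), and details like quoting and path normalization are ignored. It also squeezes runs of blank lines into a single one.
<br>
```python
def gentle_rewrite(old_lines, current, defaults):
    """Return the lines of the new file.

    current / defaults map an option name to the list of its values, one per
    config line (so "save" may have several entries, or none at all).
    """
    pending = {opt: list(values) for opt, values in current.items()}
    present = set()
    out = []
    for line in old_lines:
        text = line.strip()
        if not text or text.startswith("#"):
            out.append(line)                  # rule 1: comments are preserved
            continue
        opt = text.split()[0].lower()
        present.add(opt)
        if pending.get(opt):
            out.append(opt + " " + pending[opt].pop(0))   # rule 2: reuse the line
        else:
            out.append("")                    # rule 5: blank the useless line
    appended = []
    for opt, values in pending.items():
        if opt in present or current[opt] != defaults.get(opt):
            appended.extend(opt + " " + value for value in values)   # rule 4
        # rule 3: not in the old file and still at its default, so not added
    if appended:
        out.extend(["", "# Generated by CONFIG REWRITE"] + appended)
    squeezed = []
    for line in out:
        if not line.strip() and squeezed and not squeezed[-1].strip():
            continue                          # squeeze runs of blank lines
        squeezed.append(line)
    return squeezed

old = ["# This is a comment", "save 3600 10", "save 60 10000", "", "dir /tmp"]
defaults = {"save": ["3600 1", "300 100"], "dir": ["."], "appendonly": ["no"]}
current = {"save": ["3600 10", "60 10000"], "dir": ["/tmp"], "appendonly": ["yes"]}
print("\n".join(gentle_rewrite(old, current, defaults)))
```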
<br>
<br>An example
<br>===
<br>
<br>Just to make everything a bit more real, here is an example.
<br>
<br>I start Redis with the following configuration file:
<br>
<br>---
<br># This is a comment
<br>save 3600 10
<br>save 60 10000
<br>
<br># Hello world
<br>dir .
<br>---
<br>
<br>After a CONFIG REWRITE without changing any parameter what I see is:
<br>
<br>---
<br># This is a comment
<br>save 3600 10
<br>save 60 10000
<br>
<br># Hello world
<br>dir "/Users/antirez/hack/redis/src"
<br>---
<br>
<br>As you can see the only difference is that "dir" was turned into an absolute path, only because it was not one already. The path is also quoted inside "" as certain options are rewritten in order to support special characters.
<br>
<br>At this point I use the following commands:
<br>
<br>redis 127.0.0.1:6379> config set appendonly yes
<br>OK
<br>redis 127.0.0.1:6379> config set maxmemory 10000000
<br>OK
<br>redis 127.0.0.1:6379> config rewrite
<br>OK
<br>
<br>Now the configuration file looks like this:
<br>
<br>---
<br># This is a comment
<br>save 3600 10
<br>save 60 10000
<br>
<br># Hello world
<br>dir "/Users/antirez/hack/redis/src"
<br>
<br># Generated by CONFIG REWRITE
<br>maxmemory 10000000
<br>appendonly yes
<br>---
<br>
<br>As you can see new configurations are appended at the end.
<br>
<br>Finally I make a change that requires deleting some previous line:
<br>
<br>redis 127.0.0.1:6379> config set save ""
<br>OK
<br>redis 127.0.0.1:6379> config rewrite
<br>OK
<br>
<br>The new config file is the following:
<br>
<br>---
<br># This is a comment
<br>
<br># Hello world
<br>dir "/Users/antirez/hack/redis/src"
<br>
<br># Generated by CONFIG REWRITE
<br>maxmemory 10000000
<br>appendonly yes
<br>---
<br>
<br>Comments are preserved but multiple blank lines are squeezed to a single one.
<br>
<br>Thanks for reading!
<a href="http://antirez.com/news/54">Comments</a>]]></description> <comments>http://antirez.com/news/54</comments></item>
<item><title>
Hacking Italia
</title>
 <guid>http://antirez.com/news/53</guid> <link>
http://antirez.com/news/53
</link>
 <description><![CDATA[This post introduces, to the Italian community interested in programming and startups, a project born over a couple of beers: "Hacking Italia", which you can find at http://hackingitalia.com
<br>
<br>Hacking Italia is a "social news" site, very similar to Hacker News, the well-known news aggregator for hackers run by YCombinator. What is the point of an Italian site, in Italian, when there is already much more and much better on the international scene? To bring together a critical mass of the "right" people in Italy.
<br>
<br>Bringing people together means a lot, especially in a narrow country 1500 kilometers long, where the occasions for programmers and startuppers to meet are few, and the investors are hidden in who knows what buildings, inaccessible to most. I have been doing this job for 15 years and I know very few people in Italy, and a lot of people in the rest of the world... and yet this is certainly not a country lacking passion for code and for innovation, as history reminds us.
<br>
<br>So getting together means, to begin with, already having a small audience of people to present your idea to. It also means discussing together topics that are of no interest to those who don't operate in our country, such as company structures and the thousand bureaucratic problems we have to face. Moreover, while creating a clone of globally established services, like Youtube or Gmail, for the Italian market is probably an operation without any merit, this does not mean that there are no startups that could be very successful while targeting Italy: news, restaurants, business to business, medicine... there are endless areas that can be addressed by leveraging the fact that economies of scale allow those who operate in Italy to do better for Italians.
<br>
<br>So if these topics are important for you too, spread the word, sign up, and contribute.
<br>
<br>A bit of background to finish. The project was born thanks to the fact that for a few weeks, here in Catania, we programmers have started to meet. We have a beer and talk about hacking, and more. To tell the truth I would never have thought this could happen: talking about really technical (and interesting) things a few kilometers from my home. And while talking this idea was born... so thanks to Angelo, Fabio, Geert, Giuseppe, Marcello, and see you both at the pub and on the new site!
<a href="http://antirez.com/news/53">Comments</a>]]></description> <comments>http://antirez.com/news/53</comments></item>
<item><title>
Redis with an SSD swap, not what you want
</title>
 <guid>http://antirez.com/news/52</guid> <link>
http://antirez.com/news/52
</link>
 <description><![CDATA[Hello! As promised today I did some SSD testing.
<br>
<br>The setup: a Linux box with 24 GB of RAM, with two disks.
<br>
<br>A) A spinning disk.
<br>B) An SSD (Intel 320 series).
<br>
<br>The idea is, what happens if I set the SSD disk partition as a swap partition and fill Redis with a dataset larger than RAM?
<br>I have wanted to run this test for a long time, especially now that the focus of Redis is only on RAM and I abandoned the idea of targeting disk for a number of reasons.
<br>
<br>I already guessed that the SSD swap setup would perform badly, but I was not expecting it to be *so bad*.
<br>
<br>Before testing this setup, let's start by testing Redis in memory on the same box with a 10 GB data set.
<br>
<br>IN MEMORY TEST
<br>===
<br>
<br>To start I filled the instance with:
<br>
<br>./redis-benchmark -r 1000000000 -n 1000000000 -P 32 set key:rand:000000000000 foo
<br>
<br>Write load in this way is very high, more than half a million SET commands processed per second using a single core:
<br>
<br>instantaneous_ops_per_sec:629782
<br>
<br>This is possible because we are using a pipeline of 32 commands at a time (see -P 32), so it is possible to limit the number of sys calls involved in the processing of commands, and the network latency component as well.
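<br>
<br>To make the pipelining idea concrete, here is a small Python sketch that encodes commands in the Redis request protocol and groups them into payloads of 32, so that each payload could be written to the socket with a single call. The helper names encode_command and pipeline_batches are made up for this example, and this is not how redis-benchmark itself is implemented (it is written in C).
<br>

```python
def encode_command(*args):
    """Encode one command as a Redis protocol multi bulk request."""
    parts = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = arg if isinstance(arg, bytes) else str(arg).encode()
        parts.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(parts)


def pipeline_batches(commands, depth=32):
    """Yield one bytes payload for every group of `depth` commands."""
    for start in range(0, len(commands), depth):
        group = commands[start:start + depth]
        yield b"".join(encode_command(*command) for command in group)


# 100 SET commands become 4 payloads (32 + 32 + 32 + 4 commands),
# that is 4 writes to the socket instead of 100.
commands = [("SET", "key:%012d" % i, "foo") for i in range(100)]
payloads = list(pipeline_batches(commands))
print(len(payloads))
```
<br>
<br>With a connected socket, each payload would be sent with something like sock.sendall(payload), and the replies read back in bulk in the same way.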
<br>
<br>After a few minutes I reached 10 GB of memory used by Redis, so I tried to save the DB while still sending the same write load to the server to see what the additional memory usage due to copy on write would be under such stress conditions:
<br>
<br>[31930] 07 Mar 12:06:48.682 * RDB: 6991 MB of memory used by copy-on-write
<br>
<br>Almost 7 GB of additional memory used, that is 70% more memory.
<br>Note that this is an interesting value since it is exactly the worst case scenario you can get with Redis:
<br>
<br>1) Peak load of more than 0.6 million writes per second.
<br>2) Writes are completely distributed across the data set, there is no working set in this case, all the DB is the working set.
<br>
<br>But given the enormous pressure on copy on write exercised by this workload, what is the write performance in this case while the system is saving? To find the value I started a BGSAVE and at the same time started the benchmark again:
<br>
<br>$ redis-cli bgsave; ./redis-benchmark -r 1000000000 -n 1000000000 -P 32 set key:rand:000000000000 foo
<br>Background saving started
<br>^Ct key:rand:000000000000 foo: 251470.34
<br>
<br>250k ops/sec was the lowest number I was able to get: once copy on write starts, more and more pages have already been copied, so there is less and less copy on write happening every second, and the benchmark soon returns to 0.6 million ops per second.
<br>The number of keys was in the order of 100 million here.
<br>
<br>Basically the result of this test is, with real hardware and persisting to a normal spinning disk, Redis performs very well as long as you have enough RAM for your data, and for the additional memory used while saving. No big news so far.
<br>
<br>SSD SWAP TEST
<br>===
<br>
<br>For the SSD test we still use the spinning disk attached to the system in order to persist, so that the SSD is just working as a swap partition.
<br>
<br>To fill the instance even more I just started again redis-benchmark with the same command line, since with the specific parameters, if running forever, it would set 1 billion keys, that's enough :-)
<br>
<br>Since the instance has 24 GB of physical RAM, for the test to be meaningful I wanted to add enough data to reach 50 GB of used memory. In order to speed up the process of filling the instance I disabled persistence for some time using:
<br>
<br>CONFIG SET SAVE ""
<br>
<br>While filling the instance, at some point I started a BGSAVE to force some more swapping.
<br>Then when the BGSAVE finished, I started the benchmark again:
<br>
<br>$ ./redis-benchmark -r 1000000000 -n 1000000000 -P 32 set key:rand:000000000000 foo
<br>^Ct key:rand:000000000000 foo: 1034.16
<br>
<br>As you can see the results were very bad initially, probably because the main hash table ended up swapped. After some time it started to perform in a decent way again:
<br>
<br>$ ./redis-benchmark -r 1000000000 -n 1000000000 -P 32 set key:rand:000000000000 foo
<br>^Ct key:rand:000000000000 foo: 116057.11
<br>
<br>I was able to stop and restart the benchmark multiple times and still get decent performance on restarts, as long as I was not saving at the same time. However performance continued to be very erratic, jumping from 200k to 50k sets per second.
<br>
<br>…. and after 10 minutes …
<br>
<br>It only went from 23 GB of memory used to 24 GB, with 2 GB of data set swapped on disk.
<br>
<br>As soon as it started to have a few GB swapped, performance became simply too poor to be acceptable.
<br>
<br>I then tried with reads:
<br>
<br>$ ./redis-benchmark -r 1000000000 -n 1000000000 -P 32 get key:rand:000000000000
<br>^Ct key:rand:000000000000 foo: 28934.12
<br>
<br>Same issue, 30k ops per second both for GET and SET, and *a lot* of swap activity at the same time.
<br>What's worse is that the system was pretty unresponsive as a whole at this point.
<br>
<br>At this point I stopped the test: the system was slow enough that filling it even more would require a lot of time, and as more data was swapped performance kept getting worse.
<br>
<br>WHAT HAPPENS?
<br>===
<br>
<br>What happens is simple, Redis is designed to work in an environment where random access of memory is very fast.
<br>Hash tables, and the way Redis objects are allocated, are all based on this concept.
<br>
<br>Now let's take a look at the SSD 320 disk specifications:
<br>
<br>Random write (100% Span) -> 400 IOPS
<br>Random write (8GB Span) -> 23000 IOPS
<br>
<br>Basically what happens is that at some point Redis starts to force the OS to move memory pages between RAM and swap at *every* operation performed, since we are accessing keys at random, and there are no more spare pages.
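<br>
<br>A very crude back-of-the-envelope model of this, in Python: if a given fraction of the commands has to wait for one random I/O on the swap device, the command rate can't be higher than the device IOPS divided by that fraction. The function name and the "one random I/O per page fault" simplification are assumptions of this sketch, and the model ignores random reads (not quoted above), pipelining and OS caching, so it is only a rough sanity check of the order of magnitude.
<br>

```python
def swap_bound_ops_per_sec(random_iops, fault_fraction):
    """Upper bound on commands/sec when `fault_fraction` of the commands
    must wait for one random I/O each on the swap device."""
    if not 0 < fault_fraction <= 1:
        raise ValueError("fault_fraction must be in (0, 1]")
    return random_iops / fault_fraction


# The two random write figures quoted above for the SSD 320,
# with every command touching a swapped page (fault_fraction = 1).
for label, iops in (("100% span", 400), ("8GB span", 23000)):
    print(label, swap_bound_ops_per_sec(iops, 1.0))
```
<br>
<br>Both bounds are far below the roughly 0.6 million SETs per second measured with the data set entirely in RAM.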
<br>
<br>CONCLUSION
<br>===
<br>
<br>Redis is completely useless in this way. Systems designed to work in this kind of setup, like Twitter fatcache or the recently announced Facebook McDipper, need to be SSD-aware, and can probably work reasonably only when a simple GET/SET/DEL model is used.
<br>
<br>I also expect that the pathological case for these systems, that is evenly distributed writes with big span, is not going to be excellent because of current SSD disk limits, but that's exactly the case Redis is trying to solve for most users.
<br>
<br>The freedom Redis gets from the use of memory allows us to serve much more complex tasks at very good peak performance and with minimal system complexity and underlying assumptions.
<br>
<br>TL;DR: the outcome of this test was expected and Redis is an in-memory system :-)
<a href="http://antirez.com/news/52">Comments</a>]]></description> <comments>http://antirez.com/news/52</comments></item>
<item><title>
Log driven programming is a real productivity booster.
</title>
 <guid>http://antirez.com/news/51</guid> <link>
http://antirez.com/news/51
</link>
 <description><![CDATA[One thing, more than everything else, keeps me focused while programming: never interrupt the flow.
<br>
<br>If you ever wrote some complex piece of code you know what happens after some time: your mental model of the software starts to be very complex with different ideas nested inside other ideas, like the structure of your program is, after all.
<br>
<br>So while you are writing this piece of code, you realize that because of it you need to do that other change. Something like "I'm freeing this object here, but it's connected to these two other objects and I need to do this and that in order to ensure consistent state".
<br>
<br>The worst thing you can do is to interrupt what you are currently doing in order to fix the new problem. Instead just write freeMyObject() and don't care, but at the same time, open a different editor, and write:
<br>
<br>* freeMyObject() should make sure to remove references from XYZ bla bla bla.
<br>
<br>When you have finished with the current function / task / whatever, re-read your notes and implement what is possible to implement. You'll get new ideas or new things to fix, use the same trick again, and log your ideas without interrupting the flow.
<br>
<br>In this way parts of the program make sense, individually. You can address the other parts later. This is 100 times better than nested-thinking, where you need to stop, do another task, and return back. Humans don't have stack frames.
<br>
<br>For my log I use Evernote because the log needs to have one characteristic: No save, No filenames, Nothing more than typing something. Evernote will save it for you, and is a different physical interface compared to your code editor. In my experience this makes the 2 second switch you need in order to log easier.
<br>
<br>After some time your log starts to be long. When you realize most of it feels old as your code base and your idea of the system evolved, trace a line like this:
<br>
<br>-------------------- OLD STUFF ---------------------
<br>
<br>And continue logging again. From time to time however, re-read your old logs. You may find some gems.
<a href="http://antirez.com/news/51">Comments</a>]]></description> <comments>http://antirez.com/news/51</comments></item>
<item><title>
An idea for Twitter
</title>
 <guid>http://antirez.com/news/50</guid> <link>
http://antirez.com/news/50
</link>
 <description><![CDATA[After the "sexism gate" I started to use my Twitter account only for private stuff in order to protect the image of Redis and/from my freedom to say whatever I want. It did not work actually since the reality is that people continue to address you with at-messages about Redis stuff.
<br>
<br>But the good outcome is that now I created a @redisfeed account that I use in order to provide a stream of information to Redis users that are not interested in my personal tweets not related to Redis. Anyway when I say some important thing regarding Redis with my personal account, I just retweet it from the other account, so this is a good setup.
<br>
<br>However... I wonder if Twitter is missing an opportunity for providing a better service here, that is, the concept of "channels".
<br>
<br>Basically I'm a single person, but I've multiple logical streams of information:
<br>
<br>1) I tweet about Redis.
<br>2) I tweet about other technological stuff.
<br>3) I say things related to my personal life.
<br>4) Sometimes I tweet things in Italian language.
<br>
<br>Maybe there are followers interested in just one or a few of these logical channels, so it would be cool for Twitter users to be able to follow only a subset of the channels of another twitter user.
<br>
<br>Probably this breaks the idea of simplicity of Twitter, but I'm pretty sure there are ways to present such a feature in an interesting way: by default all users have a single channel and following them in general means to follow all the channels, it is only as a refinement and only if the user created multiple channels that you can fine-tune what you follow and what not, so basically the added complexity would be minimal.
<br>
<br>I'm pretty sure that now that Twitter is designed with the average user in mind such a feature will never be implemented actually, not to mention that this may add some serious technological complexity to their infrastructure, but maybe in the long run such a feature may be more vital than we believe now because it is pretty related to the "information diet" concept.
<a href="http://antirez.com/news/50">Comments</a>]]></description> <comments>http://antirez.com/news/50</comments></item>
<item><title>
News about Redis: 2.8 is shaping, I'm back on Cluster.
</title>
 <guid>http://antirez.com/news/49</guid> <link>
http://antirez.com/news/49
</link>
 <description><![CDATA[This is a very busy moment for Redis because the new year started in a very interesting way:
<br>
<br>1) I finished the Partial Resynchronization patch (aka PSYNC) and merged it into the unstable and 2.8 branches. You can read more about it here: http://antirez.com/news/47
<br>2) We finally have keyspace changes notifications: http://redis.io/topics/notifications
<br>
<br>Everything is already merged into our development branches, so the deal is closed, and Redis 2.8 will include both features.
<br>
<br>I'm especially super excited about PSYNC, as this is a not-feature, simply you don't have to deal with it, the only change is that slaves work A LOT better. I love adding stuff that is transparent for users, just making the system better and more robust.
<br>
<br>What I'm even more excited about is the fact that now that PSYNC and notifications are into 2.8, I'll mostly freeze it and can finally focus on Redis Cluster.
<br>
<br>I have been waiting a long time to finish Redis Cluster, and now it's the right time because Redis 2.6 is out and seems very stable, people are just starting to really discover it and the ways it is possible to use Lua scripting and the advanced bit operations to do more. Redis 2.8 is already consolidated as a release, but requires a long beta stage because we touched the inner workings of replication. So I can pause other incremental improvements for a bit to focus on Redis Cluster. Basically my plan is to work mostly on Cluster until it reaches beta quality, and by beta quality I mean something that brave users may put into production.
<br>
<br>Today I already started to commit new stuff to Cluster code. Hash slots are now 16384 instead of 4096, this means that we are now targeting clusters of ~1000 nodes. This decision was taken because there are big Redis users with already a few hundred nodes running.
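<br>
<br>To give an idea of what hash slots mean in practice, here is a minimal Python sketch of a key to slot mapping. The post above only states the number of slots: the use of CRC16 (XMODEM variant) modulo 16384 is what the Redis Cluster specification describes, so take it as a detail coming from outside this post, and note that the sketch ignores hash tags. The function names are made up for the example.
<br>

```python
HASH_SLOTS = 16384


def crc16_xmodem(data):
    """CRC16 with polynomial 0x1021 and initial value 0 (XMODEM)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key):
    """Map a key (bytes) to one of the 16384 hash slots."""
    return crc16_xmodem(key) % HASH_SLOTS


print(key_slot(b"123456789"))
```
<br>
<br>Every node serves a subset of the slots, so a client that knows the slot to node map can send each single key operation directly to the right node.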
<br>
<br>Another change is that probably, in order to ship Cluster ASAP, in the first stage I plan to use Redis Sentinel in order to failover master nodes (but Sentinel will be able to accept as configuration a list of addresses of cluster nodes and will fetch all the other nodes using CLUSTER NODES).
<br>
<br>So basically the first version of Redis Cluster to hit a stable release will have the following features:
<br>
<br>1) Automatic partition of key space.
<br>2) Hot resharding.
<br>3) Only single key operations supported.
<br>
<br>The above is already implemented but there is more work to do in order to move all this from alpha to beta quality. There is also a significant amount of work to do in library clients, and I'll try to provide an initial reference implementation based on redis-rb (actually I hope to just write a wrapper library).
<br>
<br>Note that "3" is here to stay, there are currently no plans to extend Cluster to anything requiring keys to be moved back and forth magically. But MIGRATE COPY will provide a way for users to move keys into spare instances to perform computations with multiple keys.
<br>
<br>Of course all this modulo critical bugs. If there is something odd with stable releases I'll stop everything and fix it as usual :-)
<a href="http://antirez.com/news/49">Comments</a>]]></description> <comments>http://antirez.com/news/49</comments></item>
<item><title>
A few thoughts about Open Source Software
</title>
 <guid>http://antirez.com/news/48</guid> <link>
http://antirez.com/news/48
</link>
 <description><![CDATA[For a decade and a half I contributed to open source regularly, and still it is relatively rare that I stop to think a bit more about what this means for me. Probably it is just because I like to write code, so this is how I use my time: writing code instead of thinking about what this means… however lately I'm starting to have a few recurring ideas about open source, its relationship with the IT industry, and my interpretation of what OSS is, for me, as a developer.
<br>
<br>First of all, open source for me is not a way to contribute to the free software movement, but to contribute to humanity. This means a lot of things, for instance I don't care about what people do with my code, nor if they'll release back their modifications. I simply want people to use my code in one way or the other.
<br>
<br>Especially I want people to have fun, learn new stuff, and *make money* with my code. For me other people making money out of something I wrote is not something that I lost, it is something that I gained.
<br>
<br>1) I'm having a bigger effect in the world if somebody can pay the bills using my code.
<br>2) If there are N subjects making money with my code, maybe they will be happy to share some of this money with me, or will be more willing to hire me.
<br>3) I can be myself one of the subjects making money with my code, and with other open source software code.
<br>
<br>For all these reasons my license of choice is the BSD license, which is the perfect incarnation of "do whatever you want" as a license.
<br>
<br>However clearly not everybody thinks alike, and many programmers contributing to open source don't like the idea that other people can take the source code and create business out of it as a commercial product that is not released under the same license.
<br>To me instead many of the rules that you need to follow to use the GPL license are a practical barrier reducing the actual freedom of what people can do with the source code. Also I have the feeling that receiving contributions back is not much related to the license: if something is useful people will contribute back in some way, because maintaining forks is not great. The real gold is where development happens. Unfixed, not evolved code bases are worth zero. If you as an open source developer can provide value, other parties will be more stimulated to get their changes incorporated.
<br>
<br>Anyway, I'm much happier with fewer patches merged and more freedom from the point of view of the user, than the reverse, so there is not much to argue for me.
<br>
<br>In my opinion instead what the open source does not get back in a fair amount is money, not patches. The new startups movement, and the low costs of operations of many large IT companies, are based on the existence of so much open source code working well. Businesses should try to share a small fraction of the money they earn with the people that wrote the open source software that is a key factor for their success, and I think that a sane way to redistribute part of the money is by hiring those people to just write open source software (like VMware did with me), or to provide donations.
<br>
<br>Many developers do a lot of work in their free time for passion, and only a small percentage happens to be paid for their contribution to open source. Some redistribution may allow more people to focus on the code they write for passion and that possibly has a much more *important effect* on the economy compared to what they do at work to get the salary every month. And unfortunately it is not possible to pay bills with pull requests, so while providing help to the project with source contributions is a good and sane thing to do, it is not enough in my opinion.
<br>
<br>You can see all this from a different point of view, but what I see is that a lot of value in the current IT industry is provided by open source software, often written in the spare time, or with important efforts filling the time gaps between one thing and another thing you do in your work time, if your employer is kind enough to allow you to do so.
<br>
<br>What I think is that this is economically suboptimal: a lot of smart coders could provide an economic boost if they were more free to write what they love and what a lot of people are probably already using to make money.
<a href="http://antirez.com/news/48">Comments</a>]]></description> <comments>http://antirez.com/news/48</comments></item>
<item><title>
PSYNC
</title>
 <guid>http://antirez.com/news/47</guid> <link>
http://antirez.com/news/47
</link>
 <description><![CDATA[Dear Redis users, in the final part of 2012 I repeated many times that the focus, for 2013, is all about Redis Cluster and Redis Sentinel.
<br>
<br>This is exactly what I'm going to do from the point of view of the big picture, however there are many smaller features that make a big difference from the point of view of the Redis user day to day operations. Such features can't be ignored as well. They are less shiny in a feature list, and they are not good to generate buzz and interest in new users, and sometimes boring to code, but they are very important from a practical point of view.
<br>
<br>So I ended the year and I'm starting the new one spending considerable time on a feature that was long awaited by many users having production instances crunching data every day, that is, the ability for a slave to partially resynchronize with the master without requiring a full resynchronization every time.
<br>
<br>The good news is that finally today I've an implementation that works well in my tests. This means that this feature will be present in Redis 2.8, so it is the right time to start making users aware of it, and to describe how it works.
<br>
<br>Some background
<br>---
<br>
<br>Redis replication is a pretty brutal piece of code in many ways:
<br>
<br>1) It works by re-playing on slaves every command that was received in the Redis master that actually produced a change in the data set.
<br>2) From the point of view of slaves, masters are just a bit special clients, but they are almost like normal clients sending commands. No special replication protocol or data format used for replication.
<br>3) It *used to force* a full resynchronization every time a slave connects to a master. This means, at every connection, the slave will receive a copy of the master data set, in form of an RDB file, and load it.
<br>
<br>Because of these characteristics Redis replication has been very reliable from the point of view of corruption. If you always full-resync, there are little chances for inconsistency. Also it was architecturally trivial, because masters are like clients, no special protocol is used and so forth.
<br>
<br>Simple and reliable, what can go wrong? Well, what goes wrong is that sometimes even when simplicity is very important, to do an O(N) work when zero work is needed is not a good idea. I'm looking at you, point three of my list.
<br>
<br>Consider the following scenario:
<br>
<br>* Slave connects to master, and performs a full resync.
<br>* Master and slave chat for one hour.
<br>* Slave disconnects from Master because of some silly network issue for 2 seconds.
<br>
<br>A full resynchronization to reconnect is required. It was a design sacrifice because after all we are dealing with RAM-sized data sets. It can't be so hard. But actually as RAM gets cheaper, and big users more interested in Redis, we have many production instances with big data sets that need to full resync at every network issue.
<br>
<br>Also resynchronization involves unpleasant things:
<br>
<br>1) The disk is involved, since the slave saving the RDB file needs to write that file somewhere.
<br>2) The master is forced to create an RDB file. Not a big deal as this master is supposed to save or write the AOF anyway, but still, more I/O without a good reason.
<br>3) The slave needs to block at some point after reconnection in order to load the RDB file into memory.
<br>
<br>This time it was worth introducing complexity in order to make things better.
<br>
<br># So now Redis sucks as well?
<br>
<br>MySQL is for sure one of the first databases I got exposed to, a few decades ago, and the first time I had to set up replication I was shocked by how much it sucked. Are you serious that I need to enable binary logs and deal with offsets?
<br>
<br>Redis replication, that everyone agrees is dead-simple to set up, is more or less a response to how much I dislike MySQL replication from the point of view of the "user interface".
<br>
<br>Even if we needed partial replication, I didn't want Redis to be like that.
<br>However to perform partial resynchronization you in some way or the other need something like this:
<br>
<br><slave> Hi Master! How are you? Remember that I used to be connected with you and we were such goooood friends?
<br><master> Hey Slave! You still here piece of bastard...
<br><slave> Well, shut up and give me data starting from offset 282233943 before I signal you to the Authority Of The Lazy Databases.
<br><master> Fu@**#$(*@($! ... 1010110010101001010100101 ...
<br>
<br>So the obvious solution is to have a file in the master side with all the data so that when a slave wants to resync, we can provide any offset without problems just reading it from the file. Except that this sucks a lot: we become append-to-disk-bound even if AOF is disabled, and we need to deal with a file system that can get full or slow (Hey EC2!), and with files to rotate one way or the other. Horrid.
<br>
<br>So the following is a description of how the Redis partial resynchronization implementation again accepts sacrifices to avoid sucking like that.
<br>
<br># Redis PSYNC
<br>
<br>Redis partial resynchronization makes two design sacrifices.
<br>It accepts that the slave will be able to resynchronize only if:
<br>
<br>1) It reconnects in a reasonable amount of time.
<br>2) The master was not restarted.
<br>
<br>Because of these two relaxed requirements, instead of using a file, we can use a simple buffer inside our... Memory! Don't worry, a very acceptable amount of memory.
<br>
<br>So a Redis master is modified in order to:
<br>
<br>* Unify the data that is sent to the slave, so that every slave receives exactly the same things. We were about here already, but SELECT and PING commands were sent in a slave-specific fashion. Now instead the replication output to slaves is unified.
<br>* Keep a backlog of what we send to slaves. So for instance we keep 10 MB of past data.
<br>* Keep a global replication offset, that the user never needs to deal with. We simply provide this offset to the slave, that will increment it every time it receives data. This way the slave is able to ask for partial resynchronization indicating the offset to start with.
<br>
<br>Oh also, we don't want the system to be fragile, so we use the master "run id", that is a concept that was introduced in Redis in the past as a unique identifier of a given instance execution. When the slave synchronizes with the master, it also gets the master run id, so that a next partial resynchronization attempt will be made only if the master is the same, as in, the exact same execution of Redis.
<br>
<br>Also the PSYNC command was introduced as a variant of SYNC for partially resync capable instances.
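<br>
<br>The master side of this design can be sketched in Python as follows. This is a toy model and not the Redis source code: the class and method names are invented, a plain bytearray trimmed from the left stands in for a real circular buffer, and the exact offset convention (the slave asks for the number of bytes it already received) is an assumption of this sketch.
<br>

```python
class ReplicationBacklog:
    """Toy master-side backlog: run id, global offset, last N bytes."""

    def __init__(self, run_id, size=10 * 1024 * 1024):
        self.run_id = run_id
        self.size = size        # how much past data to keep
        self.buf = bytearray()  # the most recent bytes sent to slaves
        self.offset = 0         # total bytes ever sent to slaves

    def feed(self, data):
        """Record data that is being sent, unified, to all the slaves."""
        self.buf += data
        self.offset += len(data)
        if len(self.buf) > self.size:
            # Drop the oldest bytes, keeping only the last `size` bytes.
            del self.buf[:len(self.buf) - self.size]

    def psync(self, run_id, offset):
        """Handle a resync request for (run_id, offset).

        Returns ("partial", missing_bytes) when the request can be served
        from the backlog, otherwise ("full", None).
        """
        oldest = self.offset - len(self.buf)
        if run_id != self.run_id:
            return ("full", None)  # different master, or master restarted
        if not oldest <= offset <= self.offset:
            return ("full", None)  # data no longer (or not yet) available
        return ("partial", bytes(self.buf[offset - oldest:]))
```
<br>
<br>A slave that reconnects quickly asks for an offset still covered by the buffer and receives only the missing bytes. A slave that stayed away for too long, or that finds a different run id, falls back to a full resynchronization.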
<br>
<br># How all this works in practice?
<br>
<br>When the slave gets disconnected, you'll see something like this:
<br>
<br>[60051] 17 Jan 16:52:54.979 * Caching the disconnected master state.
<br>[60051] 17 Jan 16:52:55.405 * Connecting to MASTER...
<br>[60051] 17 Jan 16:52:55.405 * MASTER <-> SLAVE sync started
<br>[60051] 17 Jan 16:52:55.405 * Non blocking connect for SYNC fired the event.
<br>[60051] 17 Jan 16:52:55.405 * Master replied to PING, replication can continue...
<br>[60051] 17 Jan 16:52:55.405 * Trying a partial resynchronization (request 6f0d582d3a23b65515644d7c61a10bf9b28094ca:30).
<br>[60051] 17 Jan 16:52:55.406 * Successful partial resynchronization with master.
<br>[60051] 17 Jan 16:52:55.406 * MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.
<br>
<br>See the first line, the slave *caches* the master client structure, so that all the buffers are saved to be reused if we'll be able to resynchronize.
<br>
<br>In the master side we'll see instead:
<br>
<br>[59968] 17 Jan 16:52:55.406 * Slave asks for synchronization
<br>[59968] 17 Jan 16:52:55.406 * Partial resynchronization request accepted. Sending 0 bytes of backlog starting from offset 30.
<br>
<br>So basically, as long as the data is still available, that is, as long as no more than N bytes worth of Redis protocol (of write commands) was sent to the master while the slave was disconnected, the slave will still be able to reconnect. Otherwise a full resynchronization will be performed. How much backlog to allocate is up to the user.
<br>
<br># Current status
<br>
<br>The 'psync' branch on my private repository now seems to work very well, however this is Very Important Code and must be tested a lot. This is what I'm doing right now. When I'm reasonably sure the code is solid, I'll merge it into the unstable and 2.8 branches. When it has run for weeks without issues and starts to be adopted by the brave early adopters, we'll release 2.8 with this and other new features.
<a href="http://antirez.com/news/47">Comments</a>]]></description> <comments>http://antirez.com/news/47</comments></item>
<item><title>
ADS-B wine cork antenna
</title>
 <guid>http://antirez.com/news/46</guid> <link>
http://antirez.com/news/46
</link>
 <description><![CDATA[# Software defined radio is cool
<br>
<br>About one week ago I received my RTLSDR dongle, joining the already large crew of software defined radio enthusiasts.
<br>
<br>It's really a lot of fun. For instance from my home, which is about 10 km from the Catania Airport, I can listen to the tower talking with the aircraft on the 118.700 MHz frequency with AM modulation. However because of lack of time I was not able to explore this further until last Sunday.
<br>
<br>My Sunday goal was to use the RTLSDR to see if I was able to capture some ADS-B messages from the aircraft landing at or leaving the airport. Basically ADS-B is a safety device installed in most aircraft that is used for collision avoidance and other stuff like this. Every aircraft broadcasts information about heading, speed, altitude and so forth.
<br>
<br>With software defined radio there are a lot of programs that can demodulate this information (which is encoded in a fairly simple to decode format). I use "modes_rx", which is free software, just Google for it.
<br>
<br>However these transmissions happen on the 1090 MHz frequency. Neither my toy antenna nor the other rabbit ears antennas I had at home worked at all for this frequency, so I read a few things on Google and tried a simple design that actually works well and takes 10 minutes to build using just a bit of wire and a wine cork.
<br>
<br># Wine cork dipole antenna
<br>
<br>Sorry but I know almost nothing about antennas. However I'll try to give you the information I have about the theoretical aspects of this antenna.
<br>
<br>Technically speaking this antenna is a Half Wavelength Dipole. In practical terms it is two aligned pieces of wire with a small space between them, with a total length that is half the wavelength of the frequency I want to listen to.
<br>
<br>Speed of light = 300 000 000 meters per second
<br>Frequency I want to listen to = 1090 MHz, that is, 1 090 000 000 Hertz
<br>Wavelength at frequency = 300 000 000 / 1 090 000 000 = 0.275 meters, that is, 275 millimeters.
<br>
<br>The half of 275 millimeters is 137 millimeters more or less, so this is the length of our antenna:
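<br>
<br>If you want to redo the arithmetic for another frequency, the same computation fits in a few lines of Python. The function name is made up for this example, and the speed of light is rounded exactly as above.
<br>

```python
SPEED_OF_LIGHT = 300_000_000  # meters per second, rounded as in the text


def half_wave_dipole_mm(frequency_hz):
    """Total length of a half wavelength dipole, in millimeters."""
    wavelength_m = SPEED_OF_LIGHT / frequency_hz
    return wavelength_m * 1000 / 2


adsb_length = half_wave_dipole_mm(1_090_000_000)
print(round(adsb_length, 1))  # a bit more than 137 millimeters
```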
<br>
<br>img://antirez.com/misc/1090_antenna_1.jpg
<br>
<br>Now you can ask, why half the wavelength? But even for a n00b like me this actually makes a lot of sense, look at this:
<br>
<br>img://antirez.com/misc/1090_antenna_4.jpg
<br>
<br>Basically if you imagine a sinusoidal wave, when one of the two pieces of the antenna is hit by the *high* part of the wave, the other is in the *low* part, and I guess that this creates the current by induction.
<br>
<br>Ok, end of broscience for today.
<br>
<br># How to build it
<br>
<br>Simply take two pieces of wire of 10 centimeters each, and insert them into the cork. Then bend the two wires to emulate the design in the picture, trying to make the space between the two wires small enough. Finally cut the two wires so that they are more or less the same length, for a total of 137 millimeters.
<br>
<br>On the other side of the cork I connected my two wires that go to the RTLSDR. I'm so lazy that I did not even solder the wires... You probably should!
<br>
<br>img://antirez.com/misc/1090_antenna_2.jpg
<br>
<br>Finally the two connected wires go to a PAL-style connector like this:
<br>
<br>
<br>img://antirez.com/misc/1090_antenna_3.jpg
<br>
<br>That in turn is connected with an adapter for the much smaller connector in the RTLSDR USB dongle. Even with all these interruptions along the path I can receive many aircraft like a boss, even from indoors, like this:
<br>
<br>(-51 0.0000000000) Type 17 BDS0,9-1 (track report) from 3c6313 with velocity 299kt heading 129 VS -320
<br>(-49 0.0000000000) Type 17 BDS0,5 (position report) from 3c6313 at (37.500388, 15.005891) at 9300ft
<br>(-51 0.0000000000) Type 11 (all call reply) from 3c6313 in reply to interrogator 0 with capability level 6
<br>(-51 0.0000000000) Type 17 BDS0,9-1 (track report) from 3c6313 with velocity 298kt heading 129 VS -320
<br>(-51 0.0000000000) Type 17 BDS0,5 (position report) from 3c6313 at (37.499863, 15.006681) at 9300ft
<br>
<br>... And so forth.
<br>
<br>Have fun! And if you have tricks to make the antenna better while retaining the simplicity, please let me know.
<a href="http://antirez.com/news/46">Comments</a>]]></description> <comments>http://antirez.com/news/46</comments></item>
<item><title>
Partial resyncs and synchronous replication.
</title>
 <guid>http://antirez.com/news/45</guid> <link>
http://antirez.com/news/45
</link>
 <description><![CDATA[Currently I'm working on Redis partial resynchronization of slaves as I wrote in previous blog posts.
<br>
<br>The idea is that we have a backlog of the replication stream, up to the specified amount of bytes (this will be in the order of a few megabytes by default).
<br>
<br>If a slave loses the connection, it connects again, checks whether the master RUNID is the same, and asks to continue from a given offset. If this is possible, we continue, nothing is lost, and a full resynchronization is not needed. Otherwise, if the offset is about data we no longer have in the backlog, we do a full resync.
<br>
<br>Now what's interesting about this is that, in order to make this possible, both the slave and the master know about a global offset, the replication offset, counted from the moment the master was started.
<br>
<br>Now, if we provide a command that returns this offset, it is simple for a client to simulate synchronous replication in Redis by just sending the query, asking for the offset (think about MULTI/EXEC to do that) and then asking the slave for the same. Because Redis replication is very low latency, the client can simply do an optimistic "write, read-offset, read-offset-on-slave" and likely the offset we read on the slave will already be ok to continue (or, we can read it again after some pause).
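<br>
<br>A minimal sketch of that optimistic loop, in Python. Since the command that returns the offset does not exist yet, the nodes are simulated in memory: FakeNode, its write() method and wait_for_replication() are all hypothetical names invented for this example, not a Redis API.
<br>

```python
import time


class FakeNode:
    """In-memory stand-in for a Redis instance exposing a replication offset."""

    def __init__(self):
        self.offset = 0

    def write(self, payload):
        """Apply a write and return the offset after it (the MULTI/EXEC step)."""
        self.offset += len(payload)
        return self.offset


def wait_for_replication(target_offset, read_slave_offset, retries=5, pause=0.01):
    """Poll the slave until it reports at least `target_offset`."""
    for _ in range(retries):
        if read_slave_offset() >= target_offset:
            return True
        time.sleep(pause)           # read it again after some pause
    return False


master, slave = FakeNode(), FakeNode()
target = master.write(b"SET key value\r\n")

# The slave did not receive the write yet, so the check gives up.
lagging = wait_for_replication(target, lambda: slave.offset, retries=2, pause=0)

# Simulate the replication stream reaching the slave, then check again.
slave.offset = master.offset
in_sync = wait_for_replication(target, lambda: slave.offset)
print(lagging, in_sync)
```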
<br>
<br>This is already something that could be useful, but I wonder if we could build something even better starting from that, that is, a way to send Redis a command that blocks until the current replication offset has been acknowledged by at least N connected slaves, and returns +OK when this happens.
<br>
<br>I'm not promising that this will be available, as we need to understand how useful this is and how complex it is, but from an initial analysis this could be trivial to implement in a fast and reliable way... and sounds pretty good.
<br>
<br>More news ASAP.
<a href="http://antirez.com/news/45">Comments</a>]]></description> <comments>http://antirez.com/news/45</comments></item>
<item><title>
Twemproxy, a Redis proxy from Twitter
</title>
 <guid>http://antirez.com/news/44</guid> <link>
http://antirez.com/news/44
</link>
 <description><![CDATA[While a big number of users use large farms of Redis nodes, from the point of view of the project itself currently Redis is a mostly single-instance business.
<br>
<br>I've big plans about going distributed with the project, to the extent that I'm no longer evaluating any threaded version of Redis: for me from the point of view of Redis a core is like a computer, so that scaling multi core or on a cluster of computers is the same conceptually. Multiple instances is a share-nothing architecture. Everything makes sense AS LONG AS we have a *credible way* to shard :-)
<br>
<br>This is why Redis Cluster will be the main focus of 2013 for Redis, and finally, now that Redis 2.6 is out and is proving to be pretty stable and mature, it is the right moment to focus on Redis Cluster, Redis Sentinel, and other long awaited improvements in the area of replication (partial resynchronization).
<br>
<br>However the reality is that Redis Cluster is not yet production ready and requires months of work. Still our users already need to shard data on multiple instances in order to distribute the load, and especially in order to use many computers to get a big amount of RAM ready for data.
<br>
<br>The sole option so far was client side sharding. Client side sharding has advantages as there are no intermediate layers between clients and nodes, nor routing of requests, so it is a very scalable setup (linearly scalable, basically). However implementing it reliably requires some tuning, a way to keep the clients configuration in sync, and the availability of a solid client with consistent hashing support or some other partitioning algorithm.
<br>
<br>Apparently there is big news in the landscape, and it has something to do with Twitter, where one of the biggest Redis farms deployed happens to serve timelines to users. So it comes as no surprise that the project I'm talking about in this blog post comes from the Twitter Open Source division.
<br>
<br>Twemproxy
<br>---
<br>
<br>Twemproxy is a fast single-threaded proxy supporting the Memcached ASCII protocol and more recently the Redis protocol:
<br>
<br>https://github.com/twitter/twemproxy
<br>
<br>It is written entirely in C and is licensed under the Apache 2.0 License.
<br>The project works on Linux and AFAIK can't be compiled on OSX because it relies on the epoll API.
<br>
<br>I did my tests using my Ubuntu 12.04 desktop.
<br>
<br>But well, I'm still not saying anything useful. What does twemproxy actually do? (Note: I'll focus on the Redis part, but the project is able to do the same things for memcached as well).
<br>
<br>1) It works as a proxy between your clients and many Redis instances.
<br>2) It is able to automatically shard data among the configured Redis instances.
<br>3) It supports consistent hashing with different strategies and hashing functions.
<br>
<br>What's awesome about Twemproxy is that it can be configured both to disable nodes on failure, and retry after some time, or to stick to the specified keys -> servers map. This means that it is suitable both for sharding a Redis data set when Redis is used as a data store (disabling the node ejection), and when Redis is used as a cache, enabling node-ejection for cheap (as in simple, not as in bad quality) high availability.
<br>
<br>The bottom line here is: if you enable node-ejection your data may end up on other nodes when a node fails, so there is no guarantee about consistency. On the other hand, if you disable node-ejection you need to have a per-instance high availability setup, for example using automatic failover via Redis Sentinel.
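<br>
<br>To see what node ejection does to the key space, here is a small consistent hashing sketch in Python. It is a generic hash ring written for this example, not twemproxy's ketama implementation: the class name, the use of MD5, and the number of points per server are all assumptions.
<br>

```python
import bisect
import hashlib


class HashRing:
    """Generic consistent hash ring with optional ejection of failed servers."""

    def __init__(self, servers, points_per_server=100):
        self.ring = []
        for server in servers:
            for i in range(points_per_server):
                self.ring.append((self._hash(f"{server}-{i}"), server))
        self.ring.sort()

    @staticmethod
    def _hash(value):
        # First 8 bytes of MD5 as an integer: an arbitrary choice for the sketch.
        return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

    def lookup(self, key, ejected=()):
        """Walk the ring clockwise from the key, skipping ejected servers."""
        start = bisect.bisect(self.ring, (self._hash(key), ""))
        for step in range(len(self.ring)):
            _, server = self.ring[(start + step) % len(self.ring)]
            if server not in ejected:
                return server
        raise RuntimeError("all servers are ejected")


servers = ["127.0.0.1:6379", "127.0.0.1:6380", "127.0.0.1:6381", "127.0.0.1:6382"]
ring = HashRing(servers)
keys = [f"key:{i}" for i in range(1000)]

before = {k: ring.lookup(k) for k in keys}
after = {k: ring.lookup(k, ejected={"127.0.0.1:6380"}) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(len(moved), "keys now live on a different node")
```

<br>
<br>Only the keys that belonged to the failed node move, which is fine for a cache. For a data store it means that writes land on a node that did not hold the key before, and this is why ejection should be disabled in that case.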
<br>
<br>Installation
<br>---
<br>
<br>Before diving more into the project features, I've good news, it is trivial to build on Linux. Well, not as trivial as Redis, but… you just need to follow these simple steps:
<br>
<br>apt-get install automake
<br>apt-get install libtool
<br>git clone git://github.com/twitter/twemproxy.git
<br>cd twemproxy
<br>autoreconf -fvi
<br>./configure --enable-debug=log
<br>make
<br>src/nutcracker -h
<br>
<br>It is pretty trivial to configure as well, and there is sufficient documentation in the project github page to have a smooth first experience. For instance I used the following configuration:
<br>
<br>redis1:
<br>  listen: 0.0.0.0:9999
<br>  redis: true
<br>  hash: fnv1a_64
<br>  distribution: ketama
<br>  auto_eject_hosts: true
<br>  timeout: 400
<br>  server_retry_timeout: 2000
<br>  server_failure_limit: 1
<br>  servers:
<br>   - 127.0.0.1:6379:1
<br>   - 127.0.0.1:6380:1
<br>   - 127.0.0.1:6381:1
<br>   - 127.0.0.1:6382:1
<br>
<br>redis2:
<br>  listen: 0.0.0.0:10000
<br>  redis: true
<br>  hash: fnv1a_64
<br>  distribution: ketama
<br>  auto_eject_hosts: false
<br>  timeout: 400
<br>  servers:
<br>   - 127.0.0.1:6379:1
<br>   - 127.0.0.1:6380:1
<br>   - 127.0.0.1:6381:1
<br>   - 127.0.0.1:6382:1
<br>
<br>Basically the first cluster is configured with node ejection, and the second as a static map among the configured instances.
<br>
<br>What is great is that you can have multiple setups at the same time, possibly involving the same hosts. However for production I find it more appropriate to run multiple instances in order to use multiple cores.
<br>
<br>Single point of failure?
<br>---
<br>
<br>Another very interesting thing is that, actually, using this setup does not mean you have a single point of failure, since you can run multiple instances of twemproxy and let your client connect to the first available one.
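<br>
<br>On the client side, "connect to the first available" can be as simple as the following Python sketch. The function name and the 0.4 seconds timeout are assumptions made for the example; the connect argument exists only so that the logic can be exercised without real sockets.
<br>

```python
import socket


def connect_first_available(proxies, connect=socket.create_connection, timeout=0.4):
    """Try each (host, port) in order and return the first working connection."""
    last_error = None
    for host, port in proxies:
        try:
            return connect((host, port), timeout=timeout)
        except OSError as err:      # refused, unreachable, timed out...
            last_error = err
    raise ConnectionError(f"no proxy reachable, last error: {last_error}")
```

<br>
<br>Called with a list like [("10.0.0.1", 9999), ("10.0.0.2", 9999)] (made up addresses), it returns a socket connected to the first proxy that answers.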
<br>
<br>Basically what you are doing with twemproxy is to separate the sharding logic from your client. At this point a basic client will do the trick, sharding will be handled by the proxy.
<br>
<br>It is a straightforward but safe approach to partitioning IMHO.
<br>
<br>Now that Redis Cluster is not yet available, I would say, it is the way to go for most users that want a cluster of Redis instances today. But read about the limitations before getting too excited ;)
<br>
<br>Limitations
<br>---
<br>
<br>I think that twemproxy does it right, not supporting multiple keys commands nor transactions. Currently it is, AFAIK, even more strict than Redis Cluster, which instead allows MULTI/EXEC blocks if all the commands are about the same key.
<br>
<br>But IMHO it's the way to go: distribute the subset you can distribute efficiently, and pose this as a design challenge early to the user, instead of investing a big amount of resources into "just works" implementations that try to aggregate data from multiple instances, but that will hardly be fast enough once you start to have serious loads, because of too big constant times to move data around.
<br>
<br>However there is some support for commands with multiple keys. MGET and DEL are handled correctly. Interestingly MGET will split the request among different servers and will return the reply as a single entity. This is pretty cool even if I don't get the right performance numbers with this feature (see later).
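<br>
<br>The split and reassemble step of a distributed MGET can be sketched like this in Python. This is only an illustration of the idea and not how twemproxy is implemented: proxy_mget, server_for and fetch are hypothetical names, and the servers are contacted one after the other instead of in parallel.
<br>

```python
def proxy_mget(keys, server_for, fetch):
    """Split an MGET among servers and rebuild a single ordered reply.

    server_for(key) returns the server owning the key.
    fetch(server, keys) returns the values for those keys, in order.
    """
    per_server = {}
    for position, key in enumerate(keys):
        per_server.setdefault(server_for(key), []).append((position, key))

    reply = [None] * len(keys)
    for server, items in per_server.items():
        values = fetch(server, [key for _, key in items])
        for (position, _), value in zip(items, values):
            reply[position] = value     # put each value back in request order
    return reply


# Two fake servers: keys with an even last digit live on "s0", the rest on "s1".
data = {"s0": {"k0": "a", "k2": "c"}, "s1": {"k1": "b"}}
owner = lambda key: "s0" if int(key[-1]) % 2 == 0 else "s1"
get_many = lambda server, ks: [data[server].get(k) for k in ks]

print(proxy_mget(["k1", "k0", "k3", "k2"], owner, get_many))
```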
<br>
<br>Anyway the fact that multi-key commands and transactions are not supported means that twemproxy is not for everybody, exactly like Redis Cluster itself. Especially since apparently EVAL is not supported (I think they should support it! It's trivial, EVAL is designed to work in a proxy like that because key names are explicit).
<br>
<br>Things that could be improved
<br>---
<br>
<br>Error reporting is not always stellar. Sending an unsupported command closes the connection. Similarly sending just a "GET" from redis-cli does not report any error about the wrong number of arguments, but hangs the connection forever.
<br>
<br>However other errors from the server are passed to the client correctly:
<br>
<br>redis metal:10000> get list
<br>(error) WRONGTYPE Operation against a key holding the wrong kind of value
<br>
<br>Another thing that I would love to see is support for automatic failover. There are many alternatives:
<br>
<br>1) twemproxy is already able to monitor instance errors, count the number of errors, and eject the node when enough errors are detected. Well, it is a shame it is not able to take slave nodes as alternatives and, instead of ejecting nodes, use the alternate nodes right after sending them a SLAVEOF NO ONE command. This would turn it into an HA solution as well.
<br>
<br>2) Or alternatively, I would love it if it were able to work in tandem with Redis Sentinel, checking the Sentinel configuration regularly to update the servers table if a failover happened.
<br>
<br>3) Another alternative is to provide a way to hot-configure twemproxy so that on failovers Sentinel could switch the configuration of the proxy ASAP.
<br>
<br>There are many alternatives, but basically, some support for HA could be great.
<br>
<br>Performance
<br>---
<br>
<br>This Thing Is Fast. Really fast, it is almost as fast as talking directly with Redis. I would say you lose 20% of performance at worst.
<br>
<br>My only issue with performance is that IMHO MGET could use some improvement when the command is distributed among instances.
<br>
<br>After all if the proxy has similar latency between it and all the Redis instances (very likely), if the MGETs are sent at the same time, likely the replies will reach the proxy about at the same time. So I expected to see almost the same numbers with an MGET as I see when I run the MGET against a single instance, but I get only 50% of the operations per second. Maybe it's the time to reconstruct the reply, I'm not sure.
<br>
<br>Conclusions
<br>---
<br>
<br>It is a great project, and since Redis Cluster is not here yet, I strongly suggest that Redis users give it a try.
<br>
<br>Personally I'm going to link it in some visible place in the Redis project site. I think the Twitter guys here provided some real value to Redis itself with their project, so…
<br>
<br>Kudos!
<a href="http://antirez.com/news/44">Comments</a>]]></description> <comments>http://antirez.com/news/44</comments></item>
<item><title>
Redis Crashes
</title>
 <guid>http://antirez.com/news/43</guid> <link>
http://antirez.com/news/43
</link>
 <description><![CDATA[Premise: a small rant about software reliability.
<br>===
<br>
<br>I'm very serious about software reliability, and this is not just a good thing.
<br>It is good in a sense, as I tend to work to ensure that the software I release is solid. At the same time I think I take this issue a bit too personally: I get upset if I receive a crash report that I can't investigate further for some reason, or that looks almost impossible to me, or that comes with an unhelpful stack trace.
<br>
<br>Guess what? This is a bad attitude because delivering bug-free software is simply impossible. We are used to thinking in terms of labels: "stable", "production ready", "beta quality". I think that these labels are actually pretty misleading if not put in the right perspective.
<br>
<br>Software reliability is an incredibly complex mix of ingredients.
<br>
<br>1) Precision of the specification or documentation itself. If you don't know how the software is supposed to behave, you have things that may be bugs or features. It depends on the point of view.
<br>2) Amount of things not working according to the specification, or causing a software crash. Let's call these just software "errors".
<br>3) Percentage of errors that happen to be in the subset of the software that is actually used and stressed by most users.
<br>4) Probability that the conditions needed for a given error to happen are met.
<br>
<br>So what happens is that you code something, and this initial version will be as good as you are good and meticulous as a programmer, or as good as your team and organization are if it is a larger software project involving many developers. But this initial version contains a number of errors anyway.
<br>
<br>Then you and your users test, or simply use, this new code. All the errors that are likely to happen and to be recognized, because they live in the subset of the code that users hit the most, start to be discovered. Initially you discover a lot of issues in a short time, then every new bug takes more time to be found, probabilistically speaking, as it starts to be in the part of the code that is less stressed, or the conditions to make the error evident are unlikely to be met.
<br>
<br>Basically your code is never free from errors. "alpha", "beta", "production ready", and "rock solid" are just names we assign to the probability we believe there is for a serious error to be discovered in a reasonable time frame.
<br>
<br>Redis crashes
<br>===
<br>
<br>Redis users are not likely to see Redis crashing without a good reason (Redis will abort on purpose in a few exceptional conditions). However from time to time there are bugs in Redis that will cause a server crash under certain conditions.
<br>
<br>What makes crashes different from other errors is that in complex systems crashes are the perfect kind of "likely hard to reproduce error". Bug fixing is all about creating a mental model of how the software works to understand why it is not behaving as expected. Every time you can trigger the bug again, you add information to your model: eventually you have enough information to understand and fix the bug.
<br>
<br>Crashes on Redis are crashes happening on a long running program that interacts with many clients sending commands at different times, in a way that the sequence of commands and events is very hard to reproduce.
<br>
<br>Crashes stop the server, provide few clues, and are usually very hard to reproduce. This means little information, a poor model of what is happening, and ultimately, very hard to fix bugs.
<br>
<br>If this is not enough, crashes are not only bad for developers, they are also very bad for users, as the software stops working after a crash, which is one of the biggest misbehaviors you can expect from software.
<br>
<br>This is why I hate Redis (and other software) crashes.
<br>
<br>Stack traces
<br>===
<br>
<br>Are there ways to make crashes more *friendly*?
<br>
<br>If a crash is reproducible, and if the user has time to spend with you, possibly from a far away time zone, maybe you can make a crash more friendly with a debugger. If you are lucky enough that data is not important for the user, you may also get a core dump (that contains a copy of the data set in the case of Redis).
<br>
<br>However most of the crashes will simply not happen again in a practical amount of time, or the user can't help debugging, or data is too important to send you a core. So one of the first things I did to improve Redis crashes was to make it print a stack trace when it crashes.
<br>
<br>This is an example of what you get if you send DEBUG SEGFAULT to the server:
<br>
<br>=== REDIS BUG REPORT START: Cut & paste starting from here ===
<br>[32827] 26 Nov 15:19:14.158 #     Redis 2.6.4 crashed by signal: 11
<br>[32827] 26 Nov 15:19:14.158 #     Failed assertion: <no assertion failed> (<no file>:0)
<br>[32827] 26 Nov 15:19:14.158 # --- STACK TRACE
<br>0   redis-server                        0x0000000103a15208 logStackTrace + 88
<br>1   redis-server                        0x0000000103a16544 debugCommand + 68
<br>2   libsystem_c.dylib                   0x00007fff8c5698ea _sigtramp + 26
<br>3   ???                                 0x0000000000000000 0x0 + 0
<br>4   redis-server                        0x00000001039ec145 call + 165
<br>5   redis-server                        0x00000001039ec77f processCommand + 895
<br>6   redis-server                        0x00000001039f7fd0 processInputBuffer + 160
<br>7   redis-server                        0x00000001039f6dfc readQueryFromClient + 396
<br>8   redis-server                        0x00000001039e82d3 aeProcessEvents + 643
<br>9   redis-server                        0x00000001039e851b aeMain + 59
<br>10  redis-server                        0x00000001039ef33e main + 1006
<br>11  libdyld.dylib                       0x00007fff91ef97e1 start + 0
<br>12  ???                                 0x0000000000000001 0x0 + 1
<br>
<br>... more stuff ...
<br>
<br>[32827] 26 Nov 15:19:14.163 # --- CURRENT CLIENT INFO
<br>[32827] 26 Nov 15:19:14.163 # client: addr=127.0.0.1:56888 fd=5 age=0 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=0 omem=0 events=r cmd=debug
<br>[32827] 26 Nov 15:19:14.163 # argv[0]: 'debug'
<br>[32827] 26 Nov 15:19:14.163 # argv[1]: 'segfault'
<br>[32827] 26 Nov 15:19:14.163 # --- REGISTERS
<br>[32827] 26 Nov 15:19:14.163 # 
<br>RAX:0000000000000000 RBX:00007fb8e4800000
<br>RCX:00000000000003e8 RDX:00007fff79804788
<br>RDI:0000000000000000 RSI:0000000103a4e2bf
<br>RBP:00007fff5c2194c0 RSP:00007fff5c2193e0
<br>R8 :0000000050b37a62 R9 :0000000000000019
<br>R10:0000000000000001 R11:0000000002000000
<br>R12:0000000000026c0f R13:0000000103a6bc80
<br>R14:00007fb8e3600000 R15:00007fb8e3600178
<br>RIP:0000000103a16544 EFL:0000000000010246
<br>CS :000000000000002b FS:0000000000000000  GS:0000000000000000
<br>[32827] 26 Nov 15:19:14.164 # (00007fff5c219458) -> 00007fb8e3600000
<br>[32827] 26 Nov 15:19:14.164 # (00007fff5c219450) -> 0000000000000001
<br>[32827] 26 Nov 15:19:14.164 # (00007fff5c219448) -> 0000000000000000
<br>[32827] 26 Nov 15:19:14.164 # (00007fff5c219440) -> 00007fff5c219470
<br>
<br>... more stuff ...
<br>
<br>And a lot more, including the INFO output and the full list of connected clients and their state.
<br>
<br>Printing stack traces on crash is already a huge improvement over "segmentation fault, core dumped". A lot of times I can fix the issue just from the output I see, without asking the user anything. Add to this that we have the arguments of the currently executed command.
<br>
<br>I'm crazy enough that Redis logs registers and part of the stack as well. If you have this trace, and a copy of the redis-server executable that caused it (and users will send it to you easily), you can inspect a lot of state just from that.
<br>
<br>EVERY MISSION CRITICAL SOFTWARE should do this.
<br>
<br>Broken memory
<br>===
<br>
<br>So far so good. That kind of reporting on crashes, plus a gentle and helpful community of users, can help you improve the quality of the software you ship, a lot. But what happens if the bug is a hard one? A memory violation in a random place like a dictionary lookup in the context of a simple GET operation? Well, you are in big trouble. You can't ignore a bug like that, but inspecting it takes a lot of time.
<br>
<br>And guess what? Many times you do a lot of work for nothing, as Redis crashes only because the user was using broken memory.
<br>
<br>Think along these lines: Redis uses memory a lot, and memory errors will easily amplify when you use a lot of data structures with pointers. If there is a broken bit it will easily end up crashing the server in a short time.
<br>
<br>Now let's scale this to the sum of all the memory that every single Redis instance out there is using right now. What you get is that Redis is like a parallel memory test working on many computers, waiting to spot some bit error. And as Redis gets more known and deployed? More memory tested every second.
<br>
<br>The problem is that this giant memory test will not write "Hey, you got a problem in your memory!". The output will be a new issue in the Redis GitHub issue system, and me investigating nonexistent issues for hours.
<br>
<br>First countermeasure
<br>===
<br>
<br>At some point, after the first endless investigations, I started to be smarter: when a bug looked suspicious I started the investigation with "Please, can you run a memory test to verify the computer memory and CPU are likely ok?".
<br>
<br>However this requires the user to reboot the machine and run memtest86. Or at least to install some user space memory testing program like memtester available in most Linux distributions. Many times the user has no physical access at all to the box, or there is no "box" at all, the user is using a virtual machine somewhere.
<br>
<br>After some time I realized that if Redis were able to test the computer memory easily without requiring additional software, I would be able to receive more memory test reports from users.
<br>
<br>This is why recent Redis versions can perform a memory test if you write something like this:
<br>
<br>./redis-server --test-memory 4096
<br>
<br>Where "4096" is the number of megabytes of RAM to test.
<br>
<br>The test is not perfect but it proved to be pretty effective at detecting normal memory errors. However there are more subtle errors, like retention errors that happen if you set some memory cell to a given value, wait some time, and then try to read it again, that are not detected by this test. But well, apparently most memory errors are simpler than that and can be more easily detected.
<br>
<br>The quality of the detection also depends on the amount of time the user will let the memory test run. There are errors that just happen rarely.
<br>
<br>Anyway this test improved my experience with Redis crashes a lot. Sometimes I see a bug report, I just ask the user to test the RAM, and I get a message where the user acknowledges that there was a problem with a memory module.
<br>
<br>However users rarely run a more accurate memory test like memtest86 for days, like they should after a crash. Many times it is simply not practical at all. So even when I get a "memory is fine" report, I simply accept as a matter of fact that memory is ok and I investigate the issue closely. But if I have never seen a crash like this reported at least one more time in recent times with different hardware, and there is no way to reproduce such a crash, after some effort I'll simply stop investigating, keeping the issue open for some more time, and finally closing it if no other similar crashes are reported.
<br>
<br>Testing more memory
<br>===
<br>
<br>Even not considering that there are false negatives, with the current redis-server --test-memory approach there are two bigger problems.
<br>
<br>1) Not every user will actually run the test after a crash.
<br>2) Users running Redis using virtualization, possibly using a virtual machine provider like EC2, will never know if they are testing the same physical memory pages when they run redis-server --test-memory.
<br>
<br>The first problem could be addressed simply using a policy: a crash without a memory test is not investigated if it looks suspicious. However this is not good for the Redis project nor for the users: sometimes a user will file a report and will simply have no more time to help us. Other times the user can't stop the server. And we developers may miss the opportunity to track a bug just because of a bad attitude and the possibility of a memory error.
<br>
<br>The second problem, related to people using virtualization, is even worse. Even when users want to help you, they can't in a reliable way.
<br>
<br>So what about testing memory just when a crash happens? This would allow Redis to test the memory of every single computer where a crash has happened. The bug report will be annotated with the result of the test, providing an interesting hint about the state of the memory without further help from the user.
<br>
<br>And even better, the test will test exactly the memory as allocated at the moment of the crash! It will test exactly the physical memory pages that Redis is using, that is just perfect for environments like EC2.
<br>
<br>The problem is, how to do it?
<br>
<br>How to test on crashes
<br>===
<br>
<br>My first idea was to test memory incrementally inside the allocator.
<br>Like, from time to time, if you allocate some memory, run a fast memory test on it and log it on the Redis log and in the INFO output if a problem was detected.
<br>
<br>In theory it is nice, and I even implemented the idea. The problem is, it is completely unreliable. If the broken memory is allocated for something that is never deallocated later, it will never be tested again. Worse than that, it takes a lot of time to test the whole memory incrementally small piece after small piece, and what about testing every single location? The allocator itself uses "internal" memory that is never exposed to the user, and we are missing all these pages.
<br>
<br>Bad idea... and... the implementation I wrote was not working at all as the CPU cache made it completely useless, as testing small pieces of memory incrementally results in a constant cache hit.
<br>
<br>The second try was definitely better, and was simply to test the whole space of allocated memory, but only when a crash happens.
<br>
<br>At first this looks pretty hard: at least you need to get a lot more help from the memory allocator you are using. I don't think jemalloc has an out of the box way to report the memory regions allocated so far. Also if we are crashing, I'm not sure how reliable asking the allocator to report memory regions could be.
<br>As a result of a single bit error, it is very easy to see the error propagating at random locations.
<br>
<br>There are other problems. After a crash we want to be able to dump a core that is meaningful. If during the memory test we fill our memory with patterns, the core file will be completely useless. This means that the memory test needed to be conceived so that at the end of the test the memory was left untouched.
<br>
<br>The proc filesystem /proc/<pid>/maps
<br>===
<br>
<br>The Linux implementation of the proc filesystem makes Linux an exceptionally introspective operating system. A developer needs minimal effort to access information that is usually not exposed to user space. Actually, the effort required is as small as parsing a text file.
<br>
<br>The "maps" file of the proc filesystem shows line after line all the memory mapped regions for a given process and their permissions (read, write, execute).
<br>The sum of all the reported regions is the whole address space that can be accessed in some way by the specified process.
<br>
<br>Some of these maps are "special" maps created by the kernel itself, like the process stack. Other maps are memory mapped files like dynamic libraries, and others are simply regions of memory allocated by malloc(), either using sbrk() or an anonymous mmap().
<br>
<br>The following two lines are examples of maps
<br>(obtained with "cat /proc/self/maps"):
<br>
<br>7fb1b699b000-7fb1b6b4e000 r-xp 00000000 08:05 15735004                   /lib/x86_64-linux-gnu/libc-2.15.so
<br>7fb1b6f5f000-7fb1b6f62000 rw-p 00000000 00:00 0
<br>
<br>The first part of each line is the address range of the memory mapped area, followed by the permissions "rwxp" (read, write, execute, private). Next comes the offset, used in the case of a memory mapped file, then the device id, which is 00:00 for anonymous maps, and finally the inode and file name for memory mapped files.
<br>
<br>We are interested in checking all the heap allocated memory that is readable and writable, so a simple grep will do the trick:
<br>
<br>$ cat /proc/self/maps | grep 00:00 | grep rw
<br>
<br>01cbb000-01cdc000 rw-p 00000000 00:00 0                                  [heap]
<br>7f46859f9000-7f46859fe000 rw-p 00000000 00:00 0 
<br>7f4685c05000-7f4685c08000 rw-p 00000000 00:00 0 
<br>7f4685c1e000-7f4685c20000 rw-p 00000000 00:00 0 
<br>7fffe7048000-7fffe7069000 rw-p 00000000 00:00 0                          [stack]
<br>
<br>Here we can find both the heap allocated using sbrk(), and the heap allocated using anonymous maps. However there is one thing that we don't want to scan, that is the stack, otherwise the function that is testing the memory itself would easily crash.
<br>
<br>So thanks to the Linux proc filesystem the first problem is no longer a big issue, and we use some code like this:
<br>
<br>int memtest_test_linux_anonymous_maps(void) {
<br>    FILE *fp = fopen("/proc/self/maps","r");
<br>
<br>    ... some more var declaration ...
<br>
<br>    while(fgets(line,sizeof(line),fp) != NULL) {
<br>        char *start, *end, *p = line;
<br>
<br>        start = p;
<br>        p = strchr(p,'-');
<br>        if (!p) continue;
<br>        *p++ = '\0';
<br>        end = p;
<br>        p = strchr(p,' ');
<br>        if (!p) continue;
<br>        *p++ = '\0';
<br>        if (strstr(p,"stack") ||
<br>            strstr(p,"vdso") ||
<br>            strstr(p,"vsyscall")) continue;
<br>        if (!strstr(p,"00:00")) continue;
<br>        if (!strstr(p,"rw")) continue;
<br>
<br>        start_addr = strtoul(start,NULL,16);
<br>        end_addr = strtoul(end,NULL,16);
<br>        size = end_addr-start_addr;
<br>
<br>        start_vect[regions] = start_addr;
<br>        size_vect[regions] = size;
<br>        printf("Testing %lx %lu\n", start_vect[regions], size_vect[regions]);
<br>        regions++;
<br>    }
<br>
<br>    ... code to actually test the found memory regions ...
<br>    /* NOTE: It is very important to close the file descriptor only now
<br>     * because closing it before may result into unmapping of some memory
<br>     * region that we are testing. */
<br>    fclose(fp);
<br>}
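<br>
<br>Since the C excerpt above elides its declarations, the same parsing and filtering can be tried in isolation with a minimal Python sketch. This is not the Redis code: the MapEntry, parse_maps_line() and is_testable() names are made up for illustration, and the filter simply mirrors the checks of the C loop (device 00:00, "rw" permissions, no stack, vdso or vsyscall).
<br>
```python
from typing import NamedTuple


class MapEntry(NamedTuple):
    start: int
    end: int
    perms: str
    offset: int
    device: str
    inode: int
    path: str


def parse_maps_line(line: str) -> MapEntry:
    """Split one /proc/<pid>/maps line into its fields."""
    # The trailing path is optional, so split on whitespace at most 5 times.
    fields = line.split(None, 5)
    addr_range, perms, offset, device, inode = fields[:5]
    path = fields[5].strip() if len(fields) > 5 else ""
    start, end = (int(x, 16) for x in addr_range.split("-"))
    return MapEntry(start, end, perms, int(offset, 16), device, int(inode), path)


def is_testable(entry: MapEntry) -> bool:
    """Anonymous, readable and writable, and not a special kernel map."""
    special = ("stack", "vdso", "vsyscall")
    return (entry.device == "00:00"
            and entry.perms.startswith("rw")
            and not any(tag in entry.path for tag in special))
```
<br>
<br>On a Linux system, feeding every line of /proc/self/maps to parse_maps_line() and keeping the entries for which is_testable() is true gives the list of (start, end) ranges that a test like the one above would scan.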
<br>
<br>CPU cache
<br>===
<br>
<br>The other problem we have is the CPU cache. If we try to write something to a given memory address, and read it back to check if it is ok, we are actually only stressing the CPU cache and never hitting the memory that is supposed to be tested.
<br>
<br>Actually writing a memory test that bypasses the CPU cache, without the need to resort to CPU specific tricks (like memory type range registers), is easy:
<br>
<br>1) Fill all the addressable memory from the first to the last with the pattern you are testing.
<br>2) Check all the addressable memory, from the first to the last, to see if the pattern can be read back as expected.
<br>
<br>Because we do it in two passes, as long as the size of the memory we are testing is larger than the CPU cache, we should be able to test the memory in a reliable way.
<br>
<br>But there is a problem: this test destroys the content of the memory, which is not acceptable in our case. Remember that we want to be able to provide a meaningful core dump, if needed, after the crash.
<br>
<br>On the other hand, writing a memory test that does not destroy the memory content, but that is not able to bypass the cache, is also easy: for each location save the value of the location on the stack, test the location by writing patterns and reading them back, and finally set the correct value back to the tested location. However this test is completely useless as long as we are not able to disable the CPU cache.
<br>
<br>How to write a memory test that:
<br>
<br>A) Is able to bypass the cache.
<br>B) Does not destroy the memory content.
<br>C) Is able, at least, to test every memory bit in the two possible states.
<br>
<br>Well that's what I asked myself during the past weekend, and I found a simple solution that works as expected (as tested in a computer with broken memory, thanks Kosma! See credits at the end of the post).
<br>
<br>This is the algorithm:
<br>
<br>1) Take a CRC64 checksum of the whole memory.
<br>2) Invert the content of every location from the first to the last (With "invert" I mean, bitwise complement, so that every 1 is turned into 0, and every 0 turned into 1).
<br>3) Swap every adjacent location content. So I swap the content at addresses 0 and 1, 2 and 3, ... and so forth.
<br>4) Swap again (step 3).
<br>5) Invert again (step 2).
<br>6) Take a CRC64 checksum of the whole memory.
<br>7) Swap again.
<br>8) Swap again.
<br>9) Take a CRC64 checksum of the whole memory again.
<br>
<br>If the CRC64 checksums obtained at steps 1, 6, and 9 are not all the same, there is some memory error.
<br>
<br>Now let's check why this is supposed to work: It is trivial to see how if memory is working as expected, after the steps I get back the original memory content, since I swap four times, and invert two times. However what happens if there are memory errors?
<br>
<br>Let's do a test considering memory locations of just two bits for simplicity. So I've something like:
<br>
<br>01|11|11|00
<br>      |
<br>      +----- this bit is broken and is always set to 1.
<br>
<br>(step 1: CRC64)
<br>After step 2: 10|00|10|11 (note that the broken bit is still 1 instead of 0)
<br>After step 3: 00|10|11|10 (adjacent locations swapped)
<br>After step 4: 10|00|10|11 (swapped again)
<br>After step 5: 01|11|11|00 (inverted again, so far, no errors detected)
<br>(step 6: CRC64)
<br>After step 7: 11|01|10|11
<br>After step 8: 01|11|11|10 (error!)
<br>(step 9: CRC64)
<br>
<br>The CRC64 obtained at step 9 will differ.
<br>Now let's check the case of a bit always set to 0.
<br>
<br>01|11|01|00
<br>      |
<br>      +----- this bit is broken and is always set to 0.
<br>
<br>(step 1: CRC64)
<br>After step 2: 10|00|00|11 (invert)
<br>After step 3: 00|10|01|00 (swap)
<br>After step 4: 10|00|00|01 (swap)
<br>After step 5: 01|11|01|10 (invert)
<br>(step 6: CRC64)
<br>After step 7: 11|01|00|01 (swap)
<br>After step 8: 01|11|01|00 (swap)
<br>(step 9: CRC64)
<br>
<br>This time it is the CRC64 obtained at step 6 that will differ.
<br>You can check what happens if you flip bits in the adjacent location, but either at step 6 or 9 you should always be able to see a different checksum.
<br>
<br>So basically this test does two things: first it amplifies the error using an adjacent location as helper, then it uses checksums to detect the error. The steps are always performed as "read + write" operations acting sequentially from the first to the last memory location, in order to defeat the CPU cache as much as possible.
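<br>
<br>The two worked examples above can be replayed with a small simulation. This is a sketch and not the code in debug.c: memory is modelled as a Python bytearray with an optional stuck bit, the Memory class and the function names are hypothetical, and zlib.crc32 stands in for CRC64, since any checksum that notices a changed bit is enough to show the idea.
<br>
```python
import zlib


class Memory:
    """Simulated byte-addressable memory, optionally with one stuck bit."""

    def __init__(self, data, stuck_addr=None, stuck_mask=0, stuck_at_one=False):
        self.cells = bytearray(len(data))
        self.stuck_addr = stuck_addr
        self.stuck_mask = stuck_mask
        self.stuck_at_one = stuck_at_one
        for addr, value in enumerate(data):
            self.write(addr, value)

    def write(self, addr, value):
        # A write to the broken location cannot change the stuck bit.
        if addr == self.stuck_addr:
            if self.stuck_at_one:
                value |= self.stuck_mask
            else:
                value &= ~self.stuck_mask & 0xFF
        self.cells[addr] = value

    def read(self, addr):
        return self.cells[addr]


def checksum(mem):
    # Assumption: CRC32 used in place of CRC64 for this illustration.
    return zlib.crc32(bytes(mem.cells))


def invert(mem):
    for addr in range(len(mem.cells)):
        mem.write(addr, ~mem.read(addr) & 0xFF)


def swap_adjacent(mem):
    for addr in range(0, len(mem.cells) - 1, 2):
        a, b = mem.read(addr), mem.read(addr + 1)
        mem.write(addr, b)
        mem.write(addr + 1, a)


def preserving_test(mem):
    """Run the nine steps; return True if a memory error was detected."""
    c1 = checksum(mem)   # step 1
    invert(mem)          # step 2
    swap_adjacent(mem)   # step 3
    swap_adjacent(mem)   # step 4
    invert(mem)          # step 5
    c6 = checksum(mem)   # step 6
    swap_adjacent(mem)   # step 7
    swap_adjacent(mem)   # step 8
    c9 = checksum(mem)   # step 9
    return not (c1 == c6 == c9)
```
<br>
<br>For instance preserving_test(Memory([0b01, 0b11, 0b11, 0b00], stuck_addr=2, stuck_mask=0b10, stuck_at_one=True)) reproduces the first example, while the same call on a Memory without a stuck bit reports no error and leaves the content untouched.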
<br>
<br>You can find the code implementing this in the "memtest-on-crash" branch on github, "debug.c" file:
<br>
<br>https://github.com/antirez/redis/blob/memtest-on-crash/src/debug.c#L669
<br>
<br>The kernel could do it better
<br>===
<br>
<br>After dealing with many crash reports that are actually due to memory errors, I'm starting to think that kernels are missing an incredible opportunity to make computers more reliable.
<br>
<br>What Redis is doing could be done incrementally, a few pages per second, by the kernel with no impact on actual performance. And the kernel is in a particularly good position:
<br>
<br>1) It could detect the error easily bypassing the cache.
<br>2) It could perform more interesting value retaining error tests writing patterns to pages that will be reused much later in time, and checking if the pattern matches before the page is reused.
<br>3) The error could be logged in the system logs, making the user aware before a problem happens.
<br>4) It could exclude the broken page from being used again, resulting in safer computing.
<br>
<br>I hope to see something like that in the Linux kernel in the future.
<br>
<br>The life of a lonely cosmic ray
<br>===
<br>
<br>A bit flipping at random is not a problem solely related to broken memory. Perfectly healthy memory is also subject, with a small probability, to bit flipping because of cosmic rays.
<br>
<br>We are talking, of course, of non error correcting memory. The more costly ECC memory can correct a single bit error and can detect two-bit errors, halting the system. However many (most?) servers are currently using non ECC memory, and it is not clear whether Amazon EC2 and other cloud providers are using ECC memory or not (in my opinion it is not ok that this information is not clearly available, given the cost of services like EC2 and the possible implications of not using ECC memory).
<br>
<br>According to a few sources, including IBM, Intel and Corsair, a computer with a few GB of non-ECC memory is likely to incur *several* memory errors every year.
<br>Of course you can't detect errors with a memory test if the bit flipping was caused by a cosmic ray hitting your memory, so to cut a long story short:
<br>
<br>Users reporting isolated, impossible to understand and reproduce crashes, not using ECC memory, can't be taken as a proof that there is a bug even if the fast on-crash memory test passes, if the more accurate redis --test-memory passes, and even if they run memtest86 for several days after the event.
<br>
<br>Anyway not every bit flipped is going to trigger a crash after all, because as somebody on stack overflow said:
<br>
<br>    "Most of the memory contains data, where the flip won't be that visiblp"
<br>
<br>The bytes representing 'p' and 'e' are not just 1 bit away, but I find the sentence to be fun anyway.
<br>
<br>Credits
<br>===
<br>
<br>A big Thank You to Kosma Moczek, who reported how my initial approach was not working due to the CPU cache, and donated ssh access to a fast computer with a single bit memory error.
<a href="http://antirez.com/news/43">Comments</a>]]></description> <comments>http://antirez.com/news/43</comments></item>
<item><title>
Redis children can now report amount of copy-on-write
</title>
 <guid>http://antirez.com/news/42</guid> <link>
http://antirez.com/news/42
</link>
 <description><![CDATA[This is one of those trivial changes in Redis that can make a big difference for users. Basically in the unstable branch I added some code that has the following effect, when running Redis on Linux systems:
<br>
<br>[32741] 19 Nov 12:00:55.019 * Background saving started by pid 391
<br>[391] 19 Nov 12:01:00.663 * DB saved on disk
<br>[391] 19 Nov 12:01:00.673 * RDB: 462 MB of memory used by copy-on-write
<br>
<br>As you can see now the amount of additional memory used by the saving child is reported (it is also reported for AOF rewrite operations).
<br>
<br>I think this is big news for users as, instead of seeing us developers and other Redis experts handwaving about the amount of copy-on-write being proportional to the number of write ops per second and the time used to produce the RDB or AOF file, now they get a number :-)
<br>
<br># How is it obtained?
<br>
<br>We use /proc/<pid>/smaps, so yes, this is Linux only.
<br>Basically it is the sum of all the Private_Dirty entries in this file for the child process (actually you could measure it on the parent side and it is the same).
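<br>
<br>The same sum can be sketched in a few lines of Python. This is only an illustration of the idea and not the code in the commits: the function names are hypothetical, and the parsing assumes the usual smaps line format "Private_Dirty:        12 kB".
<br>
```python
def private_dirty_bytes(smaps_text: str) -> int:
    """Sum all the Private_Dirty entries of an smaps dump, in bytes."""
    total_kb = 0
    for line in smaps_text.splitlines():
        if line.startswith("Private_Dirty:"):
            # Second field is the size, expressed in kB.
            total_kb += int(line.split()[1])
    return total_kb * 1024


def copy_on_write_bytes(pid="self") -> int:
    """Read /proc/<pid>/smaps (Linux only) and return the dirty private memory."""
    with open(f"/proc/{pid}/smaps") as fp:
        return private_dirty_bytes(fp.read())
```
<br>
<br>Calling copy_on_write_bytes() from inside a forked child gives the amount of memory that was privately modified in that process, which after a fork() is dominated by copy-on-write pages.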
<br>
<br>I verified that the number we obtain actually corresponds very well with the physical amount of memory consumed during a save, in different conditions, so I'm very confident we provide accurate information.
<br>
<br># Why a number in the log file instead of an entry in the INFO output?
<br>
<br>Because even before calling wait3() from the parent, as soon as the child exits we no longer have this information. So to display this information in INFO we need some inter process communication to move this info from the child to the parent. Not rocket science but for now I avoided adding extra complexity. The current patch is trivial enough that we could backport it into 2.6 for the joy of many users:
<br>
<br>https://github.com/antirez/redis/commit/3bfeb9c1a7044cd96c1bd77677dfe8b575c73c5f
<br>https://github.com/antirez/redis/commit/49b645235100fc214468b608c1ba6cdbc320fa88
<br>
<br>The log is produced at NOTICE level (so it is displayed by default).
<a href="http://antirez.com/news/42">Comments</a>]]></description> <comments>http://antirez.com/news/42</comments></item>
<item><title>
Memory errors and DNS
</title>
 <guid>http://antirez.com/news/41</guid> <link>
http://antirez.com/news/41
</link>
 <description><![CDATA[Memory errors in computers are so common that you can register domain names similar to very famous domain names, but altered in one bit, and get a number of requests per hour:
<br>
<br>http://dinaburg.org/bitsquatting.html
<a href="http://antirez.com/news/41">Comments</a>]]></description> <comments>http://antirez.com/news/41</comments></item>
<item><title>
On Twitter, at Twitter
</title>
 <guid>http://antirez.com/news/40</guid> <link>
http://antirez.com/news/40
</link>
 <description><![CDATA[On Twitter:
<br>
<br>@War3zRub1 "Hahaha it's silly how people use Redis when they need a reverse proxy"
<br>@C4ntc0de "ZOMG! Use a real message queue, Redis is not a queue!"
<br>@L4m3tr00l "My face when Redis BLABLABLA..."
<br>
<br>Meanwhile *at* Twitter:
<br>
<br>OP1: "Hey guys, there is a spike in the number of lame messages today, load is increasing..."
<br>OP2: "Yep noticed, it's the usual troll fiesta throwing shit at Redis, 59482 messages per second right now."
<br>OP1: "Ok, no prob, let's spawn two additional Redis nodes to serve their timelines as smoothly as usual".
<br>
<br>TL;DR: http://www.infoq.com/presentations/Real-Time-Delivery-Twitter
<a href="http://antirez.com/news/40">Comments</a>]]></description> <comments>http://antirez.com/news/40</comments></item>
<item><title>
Eventual consistency: when, and how?
</title>
 <guid>http://antirez.com/news/39</guid> <link>
http://antirez.com/news/39
</link>
 <description><![CDATA[This post by Peter Bailis is a must read. "Safety and Liveness: Eventual consistency is not safe" [1].
<br>
<br>[1] http://www.bailis.org/blog/safety-and-liveness-eventual-consistency-is-not-safe/
<br>
<br>An extreme TL;DR of this is:
<br>
<br>1) In an eventually consistent system, when will all the nodes agree again after a partition?
<br>2) In an eventually consistent system, HOW will the nodes agree about inconsistencies?
<br>3) In immediately consistent systems, when am I no longer able to write? When am I no longer able to read?
<br>
<br>Basically:
<br>
<br>"1" is time (or more practically, conditions needed to merge).
<br>"2" is safety (or more practically, merge strategy).
<br>"3" is availability (or more practically, how much of the system can be down, for me to be still able to write and read).
<a href="http://antirez.com/news/39">Comments</a>]]></description> <comments>http://antirez.com/news/39</comments></item>
<item><title>
Optimizing the TCP/IP checksum calculation. Interesting low level journey.
</title>
 <guid>http://locklessinc.com/articles/tcp_checksum/</guid> <link>
http://locklessinc.com/articles/tcp_checksum/
</link>
 <description><![CDATA[<a href="http://antirez.com/news/38">Comments</a>]]></description> <comments>http://antirez.com/news/38</comments></item>
<item><title>
Welcome to RethinkDB
</title>
 <guid>http://antirez.com/news/37</guid> <link>
http://antirez.com/news/37
</link>
 <description><![CDATA[There is a new DB option out there, I know it took a long time to be developed. While I don't know very well how it works I hope it will be an interesting player in the database landscape.
<br>
<br>My initial feeling is that it will compete closely with Riak and MongoDB (the system seems more similar to MongoDB itself, but if it can scale well across multiple nodes, people that don't need high write availability may pick an immediately consistent database such as RethinkDB instead of Riak for certain applications).
<br>
<br>Welcome to RethinkDB :-)
<br>
<br>http://www.rethinkdb.com/
<a href="http://antirez.com/news/37">Comments</a>]]></description> <comments>http://antirez.com/news/37</comments></item>
<item><title>
Redis data model and eventual consistency
</title>
 <guid>http://antirez.com/news/36</guid> <link>
http://antirez.com/news/36
</link>
 <description><![CDATA[While I consider the Amazon Dynamo design, and its original paper, one of the most interesting things produced in the field of databases in recent times, in the Redis world eventual consistency was never particularly discussed.
<br>
<br>Redis Cluster for instance is a system biased towards consistency rather than availability. Redis Sentinel itself is an HA solution with the dogma of consistency and master slave setups.
<br>
<br>This bias for consistent systems over more available but eventually consistent systems has some good reasons indeed, that I can probably reduce to three main points:
<br>
<br>1) The Dynamo design partially relies on the idea that writes don't modify values, but rewrite an entirely new value. In the Redis data model instead most operations modify existing values.
<br>2) The other assumption Dynamo makes about values is about their size, that should be limited to 1 MB or so. Redis keys can hold multi million elements aggregate data types.
<br>3) There were a good number of eventually consistent databases in development or already production ready when Redis and the Cluster specification were created. My guess was that contributing a distributed system with different trade offs could benefit the user base more, providing a less popular option.
<br>
<br>However every time I introduce a new version of Redis I take some time to experiment with new ideas, or ideas that I never considered to be viable. Also the introduction of MIGRATE, DUMP and RESTORE commands, the availability of Lua scripting, and the ability to set millisecond expires, make it possible for the first time to easily create client orchestrated clusters with Redis.
<br>
<br>So I spent a couple of hours playing with the idea of the Redis data model and an eventually consistent cluster that could be developed as a library, without any change to the Redis server itself. It was pretty fun, and in this blog post I share a few of my findings and ideas about the topic.
<br>
<br>Partitioning
<br>===
<br>
<br>In this design data is partitioned using consistent hashing as in Dynamo.
<br>On writes data is transferred to N nodes in the preference list. If there are unavailable nodes, the next nodes are used.
<br>Reads use N+M nodes in order to reach nodes outside the preference list and account for changes in the hash ring (for instance the addition of a new node).
<br>
<br>In my tests I used the consistent hashing implementation inside the redis-rb client distribution, slightly modified for my needs.
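<br>
<br>To make the node selection concrete, here is a minimal Python sketch of a hash ring with a preference list. It is not the redis-rb implementation used in the tests: the HashRing class, the use of SHA-256, and the 64 virtual points per node are all assumptions picked for illustration.
<br>
```python
import bisect
import hashlib


class HashRing:
    def __init__(self, nodes, replicas=64):
        # Each node is placed on the ring at `replicas` pseudo-random points.
        self.ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self.points = [point for point, _ in self.ring]

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(key.encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def walk(self, key):
        """Yield distinct nodes clockwise, starting from the key's position."""
        start = bisect.bisect(self.points, self._hash(key))
        seen = set()
        for i in range(len(self.ring)):
            node = self.ring[(start + i) % len(self.ring)][1]
            if node not in seen:
                seen.add(node)
                yield node

    def write_nodes(self, key, n, unavailable=()):
        """First n available nodes: down nodes are replaced by the next ones."""
        return [x for x in self.walk(key) if x not in unavailable][:n]

    def read_nodes(self, key, n, m):
        """The n nodes of the preference list plus m additional successors."""
        return list(self.walk(key))[:n + m]
```
<br>
<br>With ring = HashRing(["A", "B", "C", "D", "E"]), ring.write_nodes("user:1", 3) returns the three nodes receiving a write, and ring.read_nodes("user:1", 3, 1) returns the same three plus one extra node to query on reads.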
<br>
<br>Writes
<br>===
<br>
<br>Writes are performed sending the same command to every node in the preference list one after the other, skipping unavailable nodes.
<br>
<br>No cross-node locking is performed while writing, so reads may find an update only in a subset of nodes. This problem is handled in reads. In general in this client orchestrated design, most of the complexity is in the reading side.
<br>
<br>While performing writes or reads, unavailable nodes (not responding within a user configured timeout) are temporarily suspended from subsequent requests for a configurable amount of time (for instance one minute) in order to avoid increasing latency for every request. In my tests I used a simple error counter that suspended the node after N errors in a row.
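<br>
<br>A counter like the one just described could look like the following Python sketch. The NodeHealth class and its parameter names are hypothetical, and the clock is injectable only to make the behavior easy to try without waiting.
<br>
```python
import time


class NodeHealth:
    """Suspend a node for a while after too many consecutive errors."""

    def __init__(self, max_errors=3, suspend_seconds=60.0, clock=time.monotonic):
        self.max_errors = max_errors
        self.suspend_seconds = suspend_seconds
        self.clock = clock
        self.errors = {}
        self.suspended_until = {}

    def record_success(self, node):
        # Any successful reply resets the streak of errors.
        self.errors[node] = 0

    def record_error(self, node):
        self.errors[node] = self.errors.get(node, 0) + 1
        if self.errors[node] >= self.max_errors:
            self.suspended_until[node] = self.clock() + self.suspend_seconds
            self.errors[node] = 0

    def is_available(self, node):
        return self.clock() >= self.suspended_until.get(node, float("-inf"))
```
<br>
<br>The client would call is_available() before contacting a node, and record_error() or record_success() depending on whether the node replied within the timeout.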
<br>
<br>Reads
<br>===
<br>
<br>Reads are performed sending the same command to every node in the preference list, and an additional number of successive nodes.
<br>
<br>Read operations are of two kinds:
<br>
<br>1) Active read operations. In this kind of reads the results from the different nodes are compared in order to detect a possible inconsistency, that will in turn trigger a merge operation if needed.
<br>2) Passive read operations, that are operations where the different replies are simply filtered in order to return the most suitable result, without triggering a merge.
<br>
<br>For instance GET is an active read operation that can trigger a merge operation if an inconsistency is found in the result.
<br>
<br>ZRANK instead is a passive read operation. Passive read operations use a command-dependent winner result selection. For instance in the specific case of ZRANK the most common reply among nodes is returned. In the case every node returns a different rank for the specified element, the smallest rank is returned (the one with the smallest integer value).
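<br>
<br>The ZRANK selection rule is small enough to be written down directly. In this Python sketch the function name is hypothetical, nil replies are ignored, and a tie between equally common ranks is resolved by picking the smallest one, which is an assumption since the rule above only covers the case where all the replies differ.
<br>
```python
from collections import Counter


def zrank_winner(replies):
    """Pick the rank to return given one ZRANK reply per contacted node."""
    ranks = [r for r in replies if r is not None]
    if not ranks:
        return None
    counts = Counter(ranks)
    best = max(counts.values())
    # If every reply is different, best is 1 and this is just the smallest rank.
    return min(rank for rank, n in counts.items() if n == best)
```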
<br>
<br>Inconsistency detection
<br>===
<br>
<br>Inconsistencies are detected while performing reads, in a type-dependent and operation-dependent way.
<br>
<br>For example the GET command detects inconsistencies for the string type, checking if at least one value among the results returned by the contacted nodes is different from others.
<br>
<br>However it is easy to see how in presence of high write load it is likely for a GET to see a write only partially propagated to a subset of nodes. In this case an inconsistency should not be detected to avoid useless merge operations.
<br>
<br>The solution to this problem is to ignore differences if the key was updated very recently (in less than a few milliseconds). I implemented this system using the PSETEX command from Redis 2.6 to create short-lived keys with a time to live in the order of milliseconds.
<br>
<br>So the actual inconsistency detection for strings is performed with the following two tests:
<br>
<br>1) One or more values are different (including non existing or having a wrong type).
<br>2) If the first condition is true, the same nodes are contacted using an EXISTS operation against a key that signals a recent change on the key. Only if all the nodes return false is the inconsistency considered valid and the merge operation triggered.
<br>
<br>For this system to work, the SET command in the library implementing the system is supposed to use the PSETEX command before sending the actual SET command.
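<br>
<br>The two tests, and the write side that makes them possible, can be sketched in Python as follows. The function names, the "recent:" prefix of the marker key and the 50 milliseconds time to live are made up for illustration. The commands are returned as tuples instead of being sent to a server, so that the sketch does not depend on any client library.
<br>
```python
def set_commands(key, value, ttl_ms=50):
    """Commands a library-level SET would send to each node, in order."""
    marker = "recent:" + key  # hypothetical name for the short-lived key
    return [("PSETEX", marker, ttl_ms, "1"), ("SET", key, value)]


def string_is_inconsistent(values, recently_changed):
    """Decide if a merge is needed for a string key.

    values: one GET reply per node (None for missing or wrong type).
    recently_changed: one boolean per node, the reply of the existence
    check against the marker key.
    """
    if len(set(values)) <= 1:
        return False  # test 1 fails: all the replies agree
    # Test 2: only a difference that no node saw changing recently counts.
    return not any(recently_changed)
```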
<br>
<br>Inconsistency in aggregate data types
<br>===
<br>
<br>More complex data types use more complex inconsistency detection algorithms, and value-level short-lived keys to signal recent changes.
<br>
<br>For instance for the Set type, SISMEMBER is an active read operation that can detect an inconsistency if the following two conditions are true.
<br>
<br>1) At least one node returned a different result for SISMEMBER for a particular value inside a set.
<br>2) The value was not added or removed from the set very recently in any of the involved nodes.
<br>
<br>Merge operation
<br>===
<br>
<br>The merge stage is in my opinion the most interesting part of the experiment, because the Redis data model is different and more complex compared to the Dynamo data model.
<br>
<br>I used a type-dependent merging strategy that the database user can use to pick different trade offs for the different needs of an application that requires very high database write availability.
<br>
<br>* Strings are merged using last-write-wins.
<br>* Sets are merged performing the set union of all the conflicting versions.
<br>* Lists are merged inserting missing values in the head and tail side of the list trying to retain the insertion order, up to the first common value on both sides.
<br>* Hashes are merged adding common and uncommon fields, using the most recent update in case of different values.
<br>* Sorted sets are merged with a union similar to the one used for Sets. I did not experiment a lot with this, so it's a work in progress.
<br>
<br>For instance the specific example in the Amazon Dynamo paper of the shopping cart would be easily modeled using the Set type, so that old items may resurrect but items don't get lost.
<br>
<br>On the other hand, when approximate ordering of events is important a list could be more suitable.
<br>
<br>In my tests, Strings, Hash values, and List elements were prefixed with a binary 8 byte microsecond resolution time stamp, so the scheme was sensitive to client clock skew.
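<br>
<br>As an illustration of the two simplest strategies, here is a Python sketch of last-write-wins for strings carrying the 8 byte time stamp prefix, and of the union merge for sets. The function names are hypothetical, and the big endian byte order of the prefix is an assumption, since only its size and resolution are stated above.
<br>
```python
import struct
import time


def stamp(value: bytes, now_us=None) -> bytes:
    """Prefix a value with an 8 byte microsecond time stamp."""
    if now_us is None:
        now_us = time.time_ns() // 1000
    return struct.pack(">Q", now_us) + value


def unstamp(stored: bytes):
    """Return (timestamp_in_microseconds, original_value)."""
    (ts,) = struct.unpack(">Q", stored[:8])
    return ts, stored[8:]


def merge_strings(versions):
    """Last-write-wins: keep the version with the most recent time stamp."""
    return max(versions, key=lambda v: unstamp(v)[0])


def merge_sets(versions):
    """Union of all the conflicting versions: nothing added is ever lost."""
    merged = set()
    for version in versions:
        merged |= set(version)
    return merged
```
<br>
<br>Note that merge_strings() trusts the clocks of the writers: a client with a clock in the future wins every conflict, which is exactly the clock skew sensitivity mentioned above.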
<br>
<br>Lua scripting is very suitable when performing a client orchestrated merge operation. For instance when transmitting the winner value to old nodes I used the following Lua script:
<br>
<br>   if (redis.call("type",KEYS[1]) ~= "string" or
<br>        redis.call("get",KEYS[1]) ~= ARGV[1])
<br>    then
<br>        redis.call("set",KEYS[1],ARGV[2])
<br>    end
<br>
<br>So that values are replaced only if they are still found to be the old invalid version.
<br>
<br>Merging large values
<br>===
<br>
<br>One problem with merging large values in a client orchestrated cluster is that, for instance, performing a client driven union of two big sets could require a significant amount of time using the Redis vanilla API.
<br>
<br>Fortunately a big help comes from the DUMP, RESTORE and MIGRATE commands. The MIGRATE command in the unstable branch of Redis now has support for a COPY and REPLACE option that makes it much more useful in this scenario.
<br>
<br>For instance in order to perform the union of two sets in two nodes A and B, it is possible to just MIGRATE COPY (COPY means, do not remove the source key) the set from one node to the other node using a temporary key, and then call SUNIONSTORE to finish the work.
<br>
<br>MIGRATE is pretty efficient and can transfer large keys in short amounts of time, however as in Redis Cluster itself, the design described in this blog post is not suitable for applications with a big number of very large keys.
<br>
<br>Conclusions
<br>===
<br>
<br>There are many open problems and implementation details that I omitted in this blog post, but I hope I provided some background for further experimentation.
<br>
<br>I hope to continue this work in the next weeks or months in a best effort way, but at this point it is not clear if this will ever reach the form of a usable library. However I would love to see something like that in a production ready form.
<br>
<br>If I have news about this work I'll write a new blog post for sure.
<a href="http://antirez.com/news/36">Comments</a>]]></description> <comments>http://antirez.com/news/36</comments></item>
<item><title>
Slave partial synchronization work in progress
</title>
 <guid>http://antirez.com/news/35</guid> <link>
http://antirez.com/news/35
</link>
 <description><![CDATA[You can follow the commits in the next days in the "psync" branch at github:
<br>
<br>https://github.com/antirez/redis/commits/psync
<a href="http://antirez.com/news/35">Comments</a>]]></description> <comments>http://antirez.com/news/35</comments></item>
<item><title>
On the importance of testing your failover solutions.
</title>
 <guid>http://antirez.com/news/34</guid> <link>
http://antirez.com/news/34
</link>
 <description><![CDATA[From an HN comment[1]
<br>
<br>"(Geek note: In the late nineties I worked briefly with a D&D fanatic ops team lead. He threw a D100 when he came in every morning. Anything >90 he picked a random machine to failover 'politely'. If he threw a 100 he went to the machine room and switched something off or unplugged something. A human chaos monkey)."
<br>
<br>[1] http://news.ycombinator.com/item?id=4736220
<a href="http://antirez.com/news/34">Comments</a>]]></description> <comments>http://antirez.com/news/34</comments></item>
<item><title>
Client side highly available Redis Cluster, Dynamo-style.
</title>
 <guid>http://antirez.com/news/33</guid> <link>
http://antirez.com/news/33
</link>
 <description><![CDATA[I'm pretty surprised no one tried to write a wrapper for redis-rb or other clients implementing a Dynamo-style system on top of Redis primitives.
<br>
<br>Basically something like this:
<br>
<br>1) You have a list of N Redis nodes.
<br>2) On write, use consistent hashing and write the same thing to M nodes (M configurable).
<br>3) On reads, read from M nodes and pick the most common reply to return to the client. For all the non-matching replies, use DUMP / RESTORE in Redis 2.6 to update the value of nodes that are in the minority.
<br>4) To avoid problems with ordering and complex values, optionally implement some way to lock a key when it's the target of a non plain SET/GET/DEL ... operation. This does not need to be free of race conditions, it is just a good idea in order to avoid ending up with keys in desync.
<br>
<br>OK the fourth point needs some explanation.
<br>
<br>Redis is a bit harder to distribute in this way compared to other plain key-value systems because there are operations that modify the value instead of completely rewriting it. For instance LPUSH is such a command, while SET instead rewrites the value every time.
<br>
<br>When a command completely rebuilds a value, out of order operations are not a huge issue. Because of latency you can still have a scenario like this:
<br>
<br>CLIENT A> "SET key1 value1" in the first node.
<br>CLIENT B> "SET key1 value2" in the first node.
<br>CLIENT B> "SET key1 value2" in the second node.
<br>CLIENT A> "SET key1 value1" in the second node.
<br>
<br>So you end up with the same key with two different values (you can use vector clocks, or ask the application about what is the correct value).
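<br>
<br>The interleaving is easy to replay with plain dictionaries standing in for the two nodes. This Python sketch uses a hypothetical replay() helper and no Redis at all: it only shows that applying the four writes in the order listed leaves the nodes in disagreement.
<br>
```python
def replay(operations, nodes=("first", "second")):
    """Apply (client, node, key, value) writes in order, one dict per node."""
    state = {node: {} for node in nodes}
    for _client, node, key, value in operations:
        state[node][key] = value
    return state


# The scenario above: A and B race, and their writes cross on the wire.
operations = [
    ("A", "first", "key1", "value1"),
    ("B", "first", "key1", "value2"),
    ("B", "second", "key1", "value2"),
    ("A", "second", "key1", "value1"),
]
state = replay(operations)
diverged = state["first"]["key1"] != state["second"]["key1"]
```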
<br>
<br>However fixing a problem like this only involves a fast write.
<br>
<br>Instead if the same happens during an LPUSH against lists with a lot of values, a simple last value desync may force the update of the whole list, which could be slower (even if DUMP / RESTORE are pretty impressive performance wise IMHO).
<br>
<br>So you could use the first node in the hash ring and the Redis primitives to perform a simple locking operation in order to make sure that operations such as LPUSH are serialized across nodes.
<br>
<br>But to cut a long story short, this would be an interesting weekend project to do possibly with useful consequences, as Redis 2.6 now allows you to use DUMP/RESTORE to synchronize a value much faster and atomically.
<a href="http://antirez.com/news/33">Comments</a>]]></description> <comments>http://antirez.com/news/33</comments></item>
<item><title>
Why it is Awesome to be a Girl in Tech
</title>
 <guid>http://antirez.com/news/32</guid> <link>
http://antirez.com/news/32
</link>
 <description><![CDATA[I'm proud to be mentioned in this well-thought-out and non-bigoted post: http://www.nerdess.net/waffling/why-it-awesome-be-girl-tech/
<br>
<br>Also, in perfect accordance with the hacking culture, the post is sort of a HOWTO for girls who want to be involved in IT.
<a href="http://antirez.com/news/32">Comments</a>]]></description> <comments>http://antirez.com/news/32</comments></item>
<item><title>
Designing Redis replication partial resync
</title>
 <guid>http://antirez.com/news/31</guid> <link>
http://antirez.com/news/31
</link>
 <description><![CDATA[In these busy days I had the idea to focus for some time on a single, non-huge, self-contained project that would let me work focused for as many hours as possible, and that could provide a significant advantage to the Redis community.
<br>
<br>It turns out the best bet was partial replication resync: a long-wanted feature that consists in the ability of a slave to resynchronize with a master without the need of a full resync (and RDB dump creation on the master side) if the replication link was interrupted for a short time because of a timeout, a network issue, or a similar transient problem.
<br>
<br>The design is different compared to the one in the Github feature request that I filed myself some time ago: now we have the REPLCONF command, which is just perfect for exchanging replication information between master and slave.
<br>
<br>Btw these are the main ideas of the design I'm refining and implementing.
<br>
<br>1) The master has a global (for all the slaves) backlog of user configurable size. For instance you can allocate 100 MB of memory for this. As slaves are served in the replicationFeedSlaves() function this backlog is updated to contain the latest 100 MB of data (or whatever is the configured size of the backlog).
<br>2) There is also a global counter, that simply starts at zero when a Redis instance is started, and is updated inside replicationFeedSlaves(), that is a global replication offset. This offset can identify a particular part of the outgoing stream from the master to the slaves at any given time.
<br>3) If we have had no slaves at all for too long, the backlog buffer is destroyed and no longer updated, so we don't pay any performance/memory penalty if there are no slaves. Of course the buffer also starts unused when a new instance is started and is initialized only when we get the first slave.
<br>
<br>Ok now, when a slave connects to a master, it uses the command:
<br>
<br>    REPLCONF get-stream-info
<br>
<br>It gets two pieces of information this way:
<br>
<br>1) The master run id.
<br>2) The master replication offset of the first byte this slave is going to receive in the replication stream.
<br>
<br>The slave will make sure to update this offset as it receives data, so every slave knows the offset of the data it is consuming, from the point of view of the master global offset.
<br>
<br>The master backlog is implemented using a circular buffer, so no memory move or reallocation operations are needed, it's just a copy of bytes.
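<br>
<br>To make the backlog plus global offset idea concrete, here is a minimal sketch in Python. It is not the Redis implementation (which lives in C inside replicationFeedSlaves()); the class name ReplBacklog, its fields, and the convention that offsets count bytes starting from zero are all assumptions made for the example.
<br>

```python
class ReplBacklog:
    """Fixed-size circular buffer holding the latest bytes of the
    replication stream, addressed by a global byte offset."""

    def __init__(self, size):
        self.buf = bytearray(size)
        self.size = size
        self.idx = 0       # next write position inside buf
        self.histlen = 0   # how many bytes in buf are valid
        self.offset = 0    # global offset of the next byte to be fed

    def feed(self, data):
        # Plain byte copy with wrap-around: no memmove, no realloc.
        for byte in data:
            self.buf[self.idx] = byte
            self.idx = (self.idx + 1) % self.size
        self.offset += len(data)
        self.histlen = min(self.size, self.histlen + len(data))

    def read_from(self, offset):
        """Return the stream from global `offset` up to now, or None
        if that part of the stream is not in the backlog."""
        oldest = self.offset - self.histlen
        if offset < oldest or offset > self.offset:
            return None
        count = self.offset - offset
        start = (self.idx - count) % self.size
        return bytes(self.buf[(start + i) % self.size] for i in range(count))


backlog = ReplBacklog(8)
backlog.feed(b"abcdef")
backlog.feed(b"ghij")          # wraps around and overwrites "ab"
tail = backlog.read_from(2)    # everything still available
too_old = backlog.read_from(1) # byte 1 was overwritten
```

<br>With an 8 byte buffer and 10 bytes fed, only offsets 2 to 9 are still available, so a request starting at offset 1 cannot be served while one starting at offset 2 can.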
<br>
<br>Ok, this is the setup.
<br>
<br>Now what happens after a short disconnection?
<br>
<br>1) The slave gets disconnected, it needs to reconnect to the master, but the client structure of the latest master connection is not discarded when a disconnection happens. It is saved to see if it's possible to reuse it later.
<br>2) The slave reconnects and uses REPLCONF get-stream-info to get the master id and replication offset.
<br>3) If the master run id matches, we can try a partial resync.
<br>4) We use REPLCONF set-partial-resync-offset <offset>, where <offset> is the offset of the latest byte we received with the previous replication link, plus one.
<br>5) Now the master can reply with an error if there is not enough backlog to start the replication with the specified offset, or can reply +OK if it's possible.
<br>6) If it is OK a partial resync is initiated and the master will simply provide the slave with all the backlog the slave asked for, plus all the new updates.
<br>7) The slave on the other side will reuse the saved master client structure, and will simply update the socket. The replication state is also marked as CONNECTED.
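<br>
<br>The decision made in steps 3 to 6 can be condensed into one function. This is a sketch, not the protocol itself: plan_resync and its parameters are hypothetical names, it folds the slave-side run id check and the master-side backlog check into a single place for readability, and falling back to a full resync when the partial one is not possible is my reading of the design rather than something spelled out above.
<br>

```python
def plan_resync(saved_run_id, last_byte_offset, master_run_id,
                backlog_first, backlog_next):
    """Decide between a partial and a full resync.

    saved_run_id / last_byte_offset: what the slave remembers from the
        previous link (run id, offset of the latest byte received).
    master_run_id: run id reported by the master on reconnection.
    backlog_first / backlog_next: global offsets of the oldest byte in
        the master backlog and of the next byte it will produce.
    Returns ("partial", offset_to_ask_for) or ("full", None).
    """
    # Step 3: a different run id means a different (or restarted) master.
    if saved_run_id is None or saved_run_id != master_run_id:
        return ("full", None)
    # Step 4: ask for the byte right after the latest one we received.
    wanted = last_byte_offset + 1
    # Step 5: the master can serve us only if that byte is still there.
    if backlog_first <= wanted <= backlog_next:
        return ("partial", wanted)
    return ("full", None)


same_master = plan_resync("abc", 99, "abc", 50, 150)
other_master = plan_resync("abc", 99, "xyz", 50, 150)
backlog_too_short = plan_resync("abc", 99, "abc", 120, 150)
```

<br>Only the first call ends in a partial resync; in the real exchange the last check is the master's job, and the slave just sees an error or +OK.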
<br>
<br>Everything fine!
<br>
<br>Ok, it's not *that* trivial, but you get the idea: we have a backlog, and every slave has information about what it's consuming.
<br>
<br>This is going to enter Redis 2.8 of course.
<a href="http://antirez.com/news/31">Comments</a>]]></description> <comments>http://antirez.com/news/31</comments></item>
<item><title>
Why Github pull requests lack support for labels?
</title>
 <guid>http://antirez.com/news/30</guid> <link>
http://antirez.com/news/30
</link>
 <description><![CDATA[I love Github issues, it is one of the awesome things at Github IMHO: as simple as possible but actually under the hood pretty full featured.
<br>
<br>However one of the things I love most is labels. They are a truly powerful way to organize issues in a project-specific way. Unfortunately if an issue is a pull request, no labels can be attached. I wonder why.
<br>
<br>Also I would love the ability to merge against multiple branches instead of just the target one, directly from the web UI.
<a href="http://antirez.com/news/30">Comments</a>]]></description> <comments>http://antirez.com/news/30</comments></item>
<item><title>
If you trust simplicity, this could be a good argument
</title>
 <guid>http://antirez.com/news/29</guid> <link>
http://antirez.com/news/29
</link>
 <description><![CDATA[I assume you already read the AWS report[1] about recent troubles. I think it is a very good argument you could use at work against design complexity and in favor of designing things at a complexity level where analysis and prevention of failure modes are actually possible.
<br>
<br>[1] https://aws.amazon.com/message/680342/
<a href="http://antirez.com/news/29">Comments</a>]]></description> <comments>http://antirez.com/news/29</comments></item>
<item><title>
On complexity and failure
</title>
 <guid>http://antirez.com/news/28</guid> <link>
http://antirez.com/news/28
</link>
 <description><![CDATA[From a comment on Hacker News:
<br>
<br>(link: http://news.ycombinator.com/item?id=4705387)
<br>
<br>--- quoted comment ---
<br>Full disclosure: I work for an AWS competitor.
<br>While none of the specific AWS systemic failures may themselves be foreseeable, it is not true that issues of this nature cannot be anticipated: the architecture of their system (and in particular, their insistence on network storage for local data) allows for cascading failure modes in which single failures blossom to systemic ones. AWS is not the only entity to have made this mistake with respect to network storage in the cloud; I, too, was fooled.[1]
<br>We have learned this lesson the hard way, many times over: local storage should be local, even in a distributed system. So while we cannot predict the specifics of the next EBS failure, we can say with absolute certainty that there will be a next failure -- and that it will be one in which the magnitude of the system failure is far greater than the initial failing component or subsystem. With respect to network storage in the cloud, the only way to win is not to play.
<br>
<br>[1] http://joyent.com/blog/network-storage-in-the-cloud-delicious-but-deadly
<a href="http://antirez.com/news/28">Comments</a>]]></description> <comments>http://antirez.com/news/28</comments></item>
<item><title>
Redis 2.6.1 is out
</title>
 <guid>http://antirez.com/news/27</guid> <link>
http://antirez.com/news/27
</link>
 <description><![CDATA[Achievement unlocked: releasing a Redis version the same day your daughter was born ;-)
<br>
<br>But that was a bad issue, as there was a bug preventing compilation on pretty old Linux systems that are still pretty widespread (RHEL5 & similar).
<br>
<br>Redis 2.6.1 fixes just that issue and is available as usual at http://redis.io as a tar.gz or at github/antirez/redis as a "2.6.1" tag.
<a href="http://antirez.com/news/27">Comments</a>]]></description> <comments>http://antirez.com/news/27</comments></item>
<item><title>
Redis Bit Operations Use Case at CopperEgg
</title>
 <guid>http://antirez.com/news/26</guid> <link>
http://antirez.com/news/26
</link>
 <description><![CDATA[I really trust both in the usefulness of Redis bit operations and the fact that our community in the future should have documentation about Redis Patterns. So an article from CopperEgg where a bit operations pattern is described is good for sure :)
<br>
<br>http://copperegg.com/redis-bit-operations-use-case-at-copperegg/
<a href="http://antirez.com/news/26">Comments</a>]]></description> <comments>http://antirez.com/news/26</comments></item>
<item><title>
Greta was born a few hours ago
</title>
 <guid>http://antirez.com/news/25</guid> <link>
http://antirez.com/news/25
</link>
 <description><![CDATA[25 October 2012 01:06, she is 3350 grams of a funny little thing :-)
<a href="http://antirez.com/news/25">Comments</a>]]></description> <comments>http://antirez.com/news/25</comments></item>
<item><title>
Back to technology
</title>
 <guid>http://antirez.com/news/24</guid> <link>
http://antirez.com/news/24
</link>
 <description><![CDATA[It's a quieter time now. Redis 2.6 released, the sexism issue almost forgotten. Time to relax, be wise, and focus on work. Right, but, that's not me. I've a few more things to say about what happened, and to reply to the many people that asked me why I felt "obligated" to stop using my Twitter account as before, with a mix of work, thoughts on technology, and personal stuff.
<br>
<br>I can change my mind easily when that is warranted, but this time it was not. As much as the people that criticised me for my blog post may think that I've a problem, I also think they have huge limits. Oh well, different opinions, I don't like you, you don't like me, I don't freaking care after all. I don't think along the same lines as most people alive, if that's the matter.
<br>
<br>So, is a bad reaction to a blog post, on a topic I usually don't write about, enough to change my social media usage?
<br>
<br>Well, it is not. What shocked me was the *source* of many of the extremely poor replies. In the next hours I started to think more and more about the problem. Wait, I said to myself, that's exactly what happened with tech conversations on Twitter in the previous months, multiple times: sarcasm, insults, poor arguments.
<br>
<br>Or even more subtle than that: a few months ago there was an episode about somebody in a company competing with Redis making jokes about Redis durability. Again, an odd source for such a joke, but I did not reply at all; after all it is a joke. You don't understand jokes otherwise, never mind if this is actually a way to get a zillion retweets and spread a bad, untrue message about a competing product.
<br>
<br>But well, that's the issue: conversations on Twitter are not arguments, they are mostly exchanges of opinions, or jokes.
<br>
<br>Now let's travel in space and time. Go back 20 years, and land in a great place that was called Usenet. The first day I joined the internet in 1995 I was so deeply impressed by Usenet that I could not sleep: the knowledge was there. There were the hobbyist and the emeritus professor talking together about the same topic. Every possible topic was covered, there were years of archives. What an incredible cave of gold…
<br>
<br>Now think about the archive of Twitter messages 20 years from now. Not a good feeling, eh?
<br>
<br>However the counter argument could be that Twitter does not need to match Usenet to be worthwhile after all: it can be a good medium even if its archives become obsolete after a week or a day. That's fine, but let's analyse the problem a bit more closely.
<br>
<br>On Twitter, even if conversations are shorter, limited to 140 chars, they are still conversations among the same individuals, conceptually, that 20 years ago were writing messages on Usenet and into technical mailing lists such as BUGTRAQ. What's the difference?
<br>
<br>The huge difference is that Twitter technical messages are mostly made of personal opinions. Usenet was all about information and arguments. That's the real problem.
<br>
<br>If you say on Twitter "Perl looks just like line noise" you likely get a zillion retweets. If you wrote this on Usenet you would, most of the time, just be moderated or ignored. Or you could argue it, and get arguments in reply: "You may not like Perl syntax, but Perl is an advanced very high level programming language that probably supports all the major features that you love in your favourite very high level programming language."
<br>
<br>Possibly 30 messages later someone was accusing somebody else of being Hitler, but well, 30 messages after. 29 messages were mostly arguments, and information.
<br>
<br>Twitter is a good broadcasting medium if you treat it mostly one-way, and it was able to keep the Redis community informed about developments over the course of three years, but good ideas start on the Redis mailing list, not on Twitter. This is how the first scripting implementation was triggered, for instance:
<br>
<br>http://files.catwell.info/presentations/2011-osdcfr-redis-iidx/img03.png
<br>
<br>But the worst thing about Twitter used for technical "conversations" was that I was part of the problem as well. Looking at my tweets history, I mostly wrote about Redis, but many times I expressed my opinions without arguments. Sometimes my freaking stupid, misinformed, misleading fucking opinion.
<br>
<br>Now I'm seeing a lot of shifts lately in our industry. A few things sincerely are odd, it seems like the environment is deteriorating, or maybe it's just me, I'm getting old perhaps. But well, I can write that code is like a poem in the Redis Manifesto, but after all, we are supposed to be engineers talking about technology.
<br>
<br>Technology is about code, information, and arguments. That's what I'll be focused on in the future.
<br>
<br>The @redisfeed account on Twitter is a good way to exploit what Twitter is good for: broadcasting. Think of it as RSS for humans. My @antirez account is not going to be closed: I want to use it for personal things, and also because Twitter is a good way to reference a single person on the internet. For instance, if I say @dhh, this is probably the best short identifier for David. But I have stopped playing the game of useless opinions, finally.
<br>
<br>I don't hate the players, I hate the game.
<a href="http://antirez.com/news/24">Comments</a>]]></description> <comments>http://antirez.com/news/24</comments></item>
<item><title>
HN comment about Linus
</title>
 <guid>http://antirez.com/news/23</guid> <link>
http://antirez.com/news/23
</link>
 <description><![CDATA[h2s writes about Linus:
<br>
<br>"I love this guy's balanced approach to steering the kernel. Somebody asked whether a bunch of security-related patches would be getting into Linus' tree, and his response was great.
<br>Basically, he spent a few minutes explaining how security people tend to think that problems are either security problems or not worth thinking about. They see things in black and white and only care about increasing security at any cost. He said performance fanatics can be the same in their approach to improving performance, and he tries not to treat security or performance patches as being too massively different from any other types of patches such as ones for correctness.
<br>Also, a big fuck-you to this trend for shoehorning mindless Reddit memes into everything. Who the fuck wastes a question to Linus Torvalds on "Do you like cats?".
<br>
<br>Very good points, IMHO.
<br>
<br>Yesterday during the Redis conf there was, as usual, a complaint about Redis not binding only to 127.0.0.1 by default. Guess what? Redis is a networked server and in many setups clients are not in the same box. In most setups the server is not exposed to the internet at all. So why on earth should I ruin the experience of all the other guys just to save people that expose Redis servers on the internet?
<br>
<br>Btw the original link of the comment is this: http://news.ycombinator.com/item?id=4687624
<a href="http://antirez.com/news/23">Comments</a>]]></description> <comments>http://antirez.com/news/23</comments></item>
<item><title>
About the recent EC2 issues.
</title>
 <guid>http://antirez.com/news/22</guid> <link>
http://antirez.com/news/22
</link>
 <description><![CDATA[I don't like people that are using recent EC2 problems to get an easy advantage / marketing. Stuff goes down and cloud services are not magical; it is better to adjust expectations.
<br>
<br>But there are other reasons why people IMHO should consider going bare metal.
<br>
<br>* EC2 (and similar services) are extremely costly. With 100 euros per month you can rent a beast of a dedicated server with 64 GB of RAM and fast RAID disks.
<br>
<br>* As you can see you are not down-time safe, and to be down together with a zillion of other sites may be a good excuse with your boss maybe, but does not change your uptime percentage, so it's a poor shield.
<br>
<br>* The few problems you avoid on the sysop side turn into issues with the software you run (especially DBs) because of poor disk performance and, in general, poor predictability of behaviour.
<br>
<br>* It's not bad to understand operations from the start. It is an effort, but not a wasted one at all: it is also a good gym for getting a deeper understanding of your production stack.
<br>
<br>That said, with the money you save, you are likely to be able to duplicate your entire stack in two bare metal providers easily. This means that you have a disaster-recovery ready architecture that you can switch using DNS if things go as bad as yesterday with EC2.
<a href="http://antirez.com/news/22">Comments</a>]]></description> <comments>http://antirez.com/news/22</comments></item>
<item><title>
Redis 2.6 is out!
</title>
 <guid>http://antirez.com/news/21</guid> <link>
http://antirez.com/news/21
</link>
 <description><![CDATA[Redis 2.6 is finally out and I think that now that we reached this point we'll start to see how the advantages of a release that was already exploited in production by few, will turn into a big advantage for the rest of the community.
<br>
<br>Scripting, bitops, and all the big features are good additions but my feeling is that Redis 2.6 is especially significant as a step forward in the maturity of the Redis implementation. This does not mean that it's bug free: it's new code and we'll likely discover bugs, especially in the early days, as with every new release that starts to be adopted more and more.
<br>
<br>What I'm talking about when I say Redis 2.6 is more mature is that it is in general a more "safe" system to run in production: latency spikes due to mass-expire events or slow disks are handled much better, the system is more observable with the software watchdog if something goes wrong, MONITOR itself shows commands before their execution and as they were sent by the client, slaves are read-only by default, if RDB persistence fails Redis stops accepting new writes by default, and I can continue with a list of features and fixes that are the result of experience with bad behaviour in production of the 2.4 release.
<br>
<br>Now the following are the tasks at hand:
<br>
<br>* Redis Sentinel
<br>* Redis Cluster
<br>* Redis 2.8
<br>* All the small advantages that will make 2.8 a safer release compared to 2.6
<br>
<br>I don't want to comment too much but let me say this: in the next months you'll see Redis Cluster become a reality.
<br>
<br>For now we can enjoy Redis 2.6 and see how our wonderful community will take advantage of it :-)
<a href="http://antirez.com/news/21">Comments</a>]]></description> <comments>http://antirez.com/news/21</comments></item>
<item><title>
Redis Conf live streaming
</title>
 <guid>http://antirez.com/news/20</guid> <link>
http://antirez.com/news/20
</link>
 <description><![CDATA[And there is Pieter Noordhuis on stage right now!
<br>
<br>http://redisconf.com/video/
<a href="http://antirez.com/news/20">Comments</a>]]></description> <comments>http://antirez.com/news/20</comments></item>
<item><title>
Github: where you see how cool humanity can be.
</title>
 <guid>http://antirez.com/news/19</guid> <link>
http://antirez.com/news/19
</link>
 <description><![CDATA[You are there in the morning with your coffee in front of you, scanning pull requests and bug reports, then you see a conversation around a commit among a few guys that modified the code to make it better, then there is another one suggesting to improve it in another way. You click on the account names and you see these people with their transparent eyes and your trust in humanity is restored.
<a href="http://antirez.com/news/19">Comments</a>]]></description> <comments>http://antirez.com/news/19</comments></item>
<item><title>
Mission accomplished: video talks for Redis Conf...
</title>
 <guid>http://antirez.com/news/18</guid> <link>
http://antirez.com/news/18
</link>
 <description><![CDATA[Takeaways:
<br>
<br>1) Making videos is in some way harder than doing a talk live.
<br>2) Screen Flow is awesome, but could be improved with more video editing capabilities, apparently you can't "cut" the video.
<br>3) The problem is to upload files when they are big and you have normal ADSL connection :-)
<br>
<br>But it feels good to be able to send the video talks a few days in advance, so the conf organizers will be able to perform editing, filter audio if needed, whatever.
<a href="http://antirez.com/news/18">Comments</a>]]></description> <comments>http://antirez.com/news/18</comments></item>
<item><title>
Today is the day...
</title>
 <guid>http://antirez.com/news/17</guid> <link>
http://antirez.com/news/17
</link>
 <description><![CDATA[of the final recording of the videos I'll send to the Redis Conf. That was hard! The timing of the conf was not great for me to attend, and producing the video was also less trivial than I thought, but finally I have the slides, an idea about what to say, and the ScreenFlow skills ;) Maybe after this experience I'll produce some video tutorials of new Redis features as I introduce them, in order to accelerate the adoption of new things in our community.
<br>
<br>Now back to work...
<a href="http://antirez.com/news/17">Comments</a>]]></description> <comments>http://antirez.com/news/17</comments></item>
<item><title>
Estimating Redis memory usage
</title>
 <guid>http://antirez.com/news/16</guid> <link>
http://antirez.com/news/16
</link>
 <description><![CDATA[Good article by Josiah Carlson on Redis memory usage estimation, including copy-on-write worst case during BGSAVE:
<br>
<br>https://groups.google.com/d/msg/redis-db/02oq_DNZA3s/l_uEwDT3d4sJ
<a href="http://antirez.com/news/16">Comments</a>]]></description> <comments>http://antirez.com/news/16</comments></item>
<item><title>
Me on twitter in the latest days
</title>
 <guid>http://sphotos-f.ak.fbcdn.net/hphotos-ak-prn1/68001_426768714052028_655703565_n.jpg</guid> <link>
http://sphotos-f.ak.fbcdn.net/hphotos-ak-prn1/68001_426768714052028_655703565_n.jpg
</link>
 <description><![CDATA[<a href="http://antirez.com/news/15">Comments</a>]]></description> <comments>http://antirez.com/news/15</comments></item>
<item><title>
Almost 1000 followers for @redisfeed in a couple of days
</title>
 <guid>http://antirez.com/news/14</guid> <link>
http://antirez.com/news/14
</link>
 <description><![CDATA[On twitter I read a few concerns about inability to read what I think about tech non-redis topics. First of all, thanks to everybody interested in my thoughts :) Second, this blog is exactly the place where I'll post everything like that.
<br>
<br>So:
<br>
<br>@redisfeed -> Redis news, mostly low traffic, high signal.
<br>@antirez -> Will be converted into my personal account, mostly italian language, non work related.
<br>@zeritna -> Will be simply dismissed.
<br>This blog -> Everything about day by day Redis development, personal opinion about sexism, sky driving, shit eating and japanese food.
<br>@antirezdotcom -> A Twitter account that publishes everything posted at antirez.com (rss feed to twitter service)
<br>
<br>This is a better approach to keep people informed. For one, the Redis Twitter account will also be shared with Pieter Noordhuis, which was not possible before as it was my personal account.
<br>
<br>Second, I had to avoid telling too many things about me in my @antirez account. Now I'll convert it into a personal account that you may want to follow only if you are interested in me as a person (family, friends, ..., mostly).
<br>
<br>Third, this blog is much better than tweets to express tech opinions. The reality is, 140 chars are too little for a lot of things, at least in my opinion, and a full blog post is too time consuming.
<br>
<br>Let's see how it goes :-) I'm sure that in the long run everything will be better than before about Redis and my private "social existence".
<a href="http://antirez.com/news/14">Comments</a>]]></description> <comments>http://antirez.com/news/14</comments></item>
<item><title>
Planet Hunters discovered a confirmed planet, using Redis!
</title>
 <guid>https://twitter.com/planethunters/status/258317519390138368</guid> <link>
https://twitter.com/planethunters/status/258317519390138368
</link>
 <description><![CDATA[<a href="http://antirez.com/news/13">Comments</a>]]></description> <comments>http://antirez.com/news/13</comments></item>
<item><title>
High Scalability: Playtomic's Move From .NET To Node, using Redis. Case study.
</title>
 <guid>http://highscalability.com/blog/2012/10/15/simpler-cheaper-faster-playtomics-move-from-net-to-node-and.html</guid> <link>
http://highscalability.com/blog/2012/10/15/simpler-cheaper-faster-playtomics-move-from-net-to-node-and.html
</link>
 <description><![CDATA[<a href="http://antirez.com/news/12">Comments</a>]]></description> <comments>http://antirez.com/news/12</comments></item>
<item><title>
Next days...
</title>
 <guid>http://antirez.com/news/11</guid> <link>
http://antirez.com/news/11
</link>
 <description><![CDATA[I'm going to be a bit away from Redis code and this blog as I need to freaking focus on finishing the video talks for the Redis Conference that is very near at this point...
<br>
<br>I'm also tuning and testing the final bits into 2.6 to make sure to release it ASAP :-)
<br>
<br>See you in a couple of days.
<br>
<br>p.s. also my wife has been feeling contractions for a few hours, so maybe Greta is going to be born in a few... (!!!)
<a href="http://antirez.com/news/11">Comments</a>]]></description> <comments>http://antirez.com/news/11</comments></item>
<item><title>
New 2.6 MONITOR behaviour with transactions
</title>
 <guid>http://antirez.com/news/10</guid> <link>
http://antirez.com/news/10
</link>
 <description><![CDATA[The commit message says it all:
<br>
<br>    Fix MULTI / EXEC rendering in MONITOR output.
<br>    
<br>    Before of this commit it used to be like this:
<br>    
<br>    MULTI
<br>    EXEC
<br>    ... actual commands of the transaction ...
<br>    
<br>    Because after all that is the natural order of things. Transaction
<br>    commands are queued and executed *only after* EXEC is called.
<br>    
<br>    However this makes debugging with MONITOR a mess, so the code was
<br>    modified to provide a coherent output.
<br>    
<br>    What happens is that MULTI is rendered in the MONITOR output as far as
<br>    possible, instead EXEC is propagated only after the transaction is
<br>    executed, or even in the case it fails because of WATCH, so in this case
<br>    you'll simply see:
<br>    
<br>    MULTI
<br>    EXEC
<br>    
<br>    An empty transaction.
<a href="http://antirez.com/news/10">Comments</a>]]></description> <comments>http://antirez.com/news/10</comments></item>
<item><title>
Random victim on Twitter
</title>
 <guid>http://antirez.com/news/9</guid> <link>
http://antirez.com/news/9
</link>
 <description><![CDATA[These are the real victims of flamewars:
<br>
<br>"After the recent affair with @antirez's blogpost, I'm seriously considering forgoing further usage of #Redis in all my projects." (@nathell)
<br>
<br>Obvious considerations:
<br>
<br>1) Somebody will be happy about this as there was definitely a force trying to resonate as much as possible what was happening in the worst way.
<br>2) The guy should read my blog post seriously.
<br>
<br>But this is how it goes, you write a blog post that is a point of view about how to handle sexism (focusing on its effects, not the cause that anyway at the most subtle level, for instance a wrong promotion, can simply be denied by the author of the sexism), and the net result thanks to a number of people in part evil, in part simply bigot as hell, is that I'm sexist.
<br>
<br>Let's explain how it works using a parallel example:
<br>
<br>The Bay Area is in a part of the world where the death penalty is still in force, in 2012, and a lot of software is produced there. I guess there are likely people in favour of the death penalty, right now, writing free software. The death penalty is for me the most unacceptable shit, but I don't have a taboo about discussing it with somebody who favours it.
<br>
<br>But... as for those of you who are not using Redis because of my article, which suggests moving the focus to the effects of sexism instead of the cause itself (which risks creating a difference that can itself be discriminatory), I suggest you email all the people that wrote the software you are using to make sure they are against the death penalty. At least this would be logical (and incredibly silly).
<br>
<br>Manipulating masses for fun & profit is easy apparently.
<a href="http://antirez.com/news/9">Comments</a>]]></description> <comments>http://antirez.com/news/9</comments></item>
<item><title>
Working on issue #713
</title>
 <guid>https://github.com/antirez/redis/issues/713</guid> <link>
https://github.com/antirez/redis/issues/713
</link>
 <description><![CDATA[<a href="http://antirez.com/news/8">Comments</a>]]></description> <comments>http://antirez.com/news/8</comments></item>
<item><title>
Ask your questions on Stack Overflow if you wish
</title>
 <guid>http://antirez.com/news/7</guid> <link>
http://antirez.com/news/7
</link>
 <description><![CDATA[Hey, since I'm no longer active on Twitter, that was a channel where from time to time I replied to requests from users, I'm now making sure I reply to a few questions on Stack Overflow every day, lurking on the "Redis" tag:
<br>
<br>http://stackoverflow.com/questions/tagged/redis
<br>
<br>I can't guarantee that I'll reply to your question, but I can assure you I'll be there often to reply to a few questions a day, and together we can create a Redis ecosystem in one of the best places on the internet to give or receive help, that is Stack Overflow.
<br>
<br>p.s. Stack Overflow is also implemented using Redis for many interesting things!
<a href="http://antirez.com/news/7">Comments</a>]]></description> <comments>http://antirez.com/news/7</comments></item>
<item><title>
Snappy Dashboards with Redis
</title>
 <guid>http://blog.togo.io/how-to/snappy-dashboards-with-redis/</guid> <link>
http://blog.togo.io/how-to/snappy-dashboards-with-redis/
</link>
 <description><![CDATA[<a href="http://antirez.com/news/6">Comments</a>]]></description> <comments>http://antirez.com/news/6</comments></item>
<item><title>
New site look improved a bit.
</title>
 <guid>http://antirez.com/news/5</guid> <link>
http://antirez.com/news/5
</link>
 <description><![CDATA[Now it's like a mix between a twitter timeline and a blog. I fixed the RSS feed, but it could still be generated better than that. Well, the point is that as long as I have the minimum needed, improving it in the future will be easy and fun.
<br>
<br>p.s. yes, the layout and fonts are a bit of a mess, but they're not going to be too hard to fix. For now I focused more on what it should display.
<br>
<br>Warning: I forgot to increment the JS version counter, so it takes a few hard reloads of the page to get the right CSS.
<a href="http://antirez.com/news/5">Comments</a>]]></description> <comments>http://antirez.com/news/5</comments></item>
<item><title>
Set Sketch implementation for Redis
</title>
 <guid>https://groups.google.com/forum/#!msg/redis-db/8Zt_6hJo09k/FJCnrr9OSikJ</guid> <link>
https://groups.google.com/forum/#!msg/redis-db/8Zt_6hJo09k/FJCnrr9OSikJ
</link>
 <description><![CDATA[<a href="http://antirez.com/news/4">Comments</a>]]></description> <comments>http://antirez.com/news/4</comments></item>
<item><title>
New Twitter account for Redis news
</title>
 <guid>http://antirez.com/news/3</guid> <link>
http://antirez.com/news/3
</link>
 <description><![CDATA[The new Twitter account @redisfeed will be used to provide information about new Redis releases, critical bugs, and everything else that is important for people who are using or plan to use Redis. Please follow us! http://twitter.com/redisfeed
<a href="http://antirez.com/news/3">Comments</a>]]></description> <comments>http://antirez.com/news/3</comments></item>
<item><title>
Exotic Data Structures
</title>
 <guid>http://concatenative.org/wiki/view/Exotic%20Data%20Structures</guid> <link>
http://concatenative.org/wiki/view/Exotic%20Data%20Structures
</link>
 <description><![CDATA[<a href="http://antirez.com/news/2">Comments</a>]]></description> <comments>http://antirez.com/news/2</comments></item>
<item><title>
Welcome to the new site!
</title>
 <guid>http://antirez.com/news/1</guid> <link>
http://antirez.com/news/1
</link>
 <description><![CDATA[Hi visitor! This blog was conceived for low traffic blogging. Now that I plan not to use my Twitter accounts, the old blog engine was not good enough.
<br>
<br>The simplest thing to do was to take Lamer News and create a quickly modified version that could be used as a blog engine... this is the result, for now. I hope to evolve it, but the point here is that I can write both long posts and very small ones that can be read directly from the home page.
<br>
<br>It is also possible to write just short titles linking to external sites, which is a feature I plan to use as well, to link to interesting Google Groups comments and stuff like that.
<br>
<br>Finally I feel like I have a voice again!
<br>
<br>Edit: note that you can find all the old posts in the "old site" link (check the footer).
<a href="http://antirez.com/news/1">Comments</a>]]></description> <comments>http://antirez.com/news/1</comments></item>
</channel></rss>