Yesterday I lost all my blog data in a rather funny way. When I installed this new blog engine, which is basically Lamer News slightly modified to serve as a blog, I spun up a Redis instance manually with persistence *disabled* just to see if it was working and to test it a bit. I just started a screen session and ran something like ./redis-server --port 10000. Since this is equivalent to an empty config file with just "port 10000" inside, I was running with no disk persistence at all. Since Redis very rarely crashes, guess what, after more than one year it was still running inside the screen session, and I totally forgot that it was running like that, happily writing controversial posts in my blog. Yesterday my server was under attack. This caused a higher than normal load, and Linode rebooted the instance. As a result my blog was gone.
News posted by antirez
Today Joyent wrote a blog post in the company blog about an issue that started with this pull request in the libuv project: https://github.com/joyent/libuv/pull/1015#issuecomment-29538615 Basically the developer Ben Noordhuis rejected a pull request involving a change in the documentation to use gender-neutral form instead of “him”. Joyent replied with this incredible post: http://www.joyent.com/blog/the-power-of-a-pronoun. In the blog post you can read: “But while Isaac is a Joyent employee, Ben is not—and if he had been, he wouldn't be as of this morning: to reject a pull request that eliminates a gendered pronoun on the principle that pronouns should in fact be gendered would constitute a fireable offense for me and for Joyent.”
The Redis API for data access is usually limited, but very direct and straightforward. It is limited because it only allows accessing data in a natural way, that is, in the way that is obvious for each data structure. Sorted sets are easy to access by score ranges, hashes by field name, and so forth. This API “way” has profound effects on what Redis is and how users organize data in it, because an API that is data-obvious means fast operations, less code, and fewer bugs in the implementation. Above all, it forces the application layer to make meaningful choices: the database as a system in which you are responsible for organizing data in a way that makes sense in your application, versus a database as a magical object where you put data inside, and then it will be able to fetch and organize data for you in any format.
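To make the "data-obvious" idea concrete, here is a minimal sketch in plain Python. It does not use Redis or any Redis client: `SortedSet`, `add`, and `range_by_score` are hypothetical names chosen for illustration, and a plain dict stands in for a hash. The point is only that each structure offers the one access path that is cheap for it.

```python
import bisect


class SortedSet:
    """Toy stand-in for a sorted set: members ordered by score."""

    def __init__(self):
        self._scores = {}    # member -> score
        self._ordered = []   # sorted list of (score, member) tuples

    def add(self, member, score):
        # If the member already exists, remove its old (score, member) entry.
        if member in self._scores:
            old = (self._scores[member], member)
            self._ordered.pop(bisect.bisect_left(self._ordered, old))
        self._scores[member] = score
        bisect.insort(self._ordered, (score, member))

    def range_by_score(self, lo, hi):
        # Binary search for the first entry with score >= lo, then walk
        # forward while the score stays within the inclusive upper bound.
        i = bisect.bisect_left(self._ordered, (lo,))
        members = []
        while i < len(self._ordered) and self._ordered[i][0] <= hi:
            members.append(self._ordered[i][1])
            i += 1
        return members


# A hash is accessed by field name; a dict models that directly.
user = {"name": "alice", "karma": "10"}

ranking = SortedSet()
ranking.add("a", 1)
ranking.add("b", 5)
ranking.add("c", 10)
in_range = ranking.range_by_score(2, 10)
```

Anything beyond these natural access paths, such as looking up a hash by value, has to be organized explicitly by the application, which is exactly the trade-off described above.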
This blog post describes the new algorithm used in Redis Cluster in order to propagate and update metadata, which is hopefully significantly safer than the previous algorithm. The Redis Cluster specification has not been updated yet, as I'm rewriting it from scratch, so this blog post serves as a first way to share the algorithm with the community. Let's start with the problem to solve. Redis Cluster uses a master-slave design in order to recover from node failures. The key space is partitioned across the different masters in the cluster, using a concept that we call "hash slots". Basically every key is hashed into a number between 0 and 16383. If a given key hashes to 15, it means it is in hash slot number 15. These 16k hash slots are split among the different masters.
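The key-to-slot mapping can be sketched in a few lines of Python. The excerpt does not name the hash function; the sketch assumes the CRC16 (XMODEM variant) modulo 16384 that the Redis Cluster documentation describes, and it ignores hash tags. The function names and the `split_slots` policy (even, contiguous ranges) are illustrative assumptions, not how a real cluster must be configured.

```python
HASH_SLOTS = 16384  # slots are numbered 0..16383


def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 and initial value 0 (XMODEM variant)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_hash_slot(key: bytes) -> int:
    """Map a key to its hash slot (assumed: CRC16 mod 16384, no hash tags)."""
    return crc16_xmodem(key) % HASH_SLOTS


def split_slots(masters):
    """Assign contiguous slot ranges to masters as evenly as possible.

    Illustrative policy only: any assignment covering all slots would do.
    """
    per_master, extra = divmod(HASH_SLOTS, len(masters))
    ranges = {}
    start = 0
    for i, name in enumerate(masters):
        count = per_master + (1 if i < extra else 0)
        ranges[name] = range(start, start + count)
        start += count
    return ranges


def master_for_key(key: bytes, ranges) -> str:
    """Find which master serves the slot a key hashes to."""
    slot = key_hash_slot(key)
    for name, slots in ranges.items():
        if slot in slots:
            return name
    raise LookupError(f"slot {slot} is not assigned to any master")


ranges = split_slots(["A", "B", "C"])
owner = master_for_key(b"123456789", ranges)
```

With three masters this gives A the slots 0 to 5461, B the slots 5462 to 10922, and C the slots 10923 to 16383. The metadata the algorithm has to propagate is, in essence, this slot-to-master table.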
Paul Graham managed to put a very important question, that of the English language as a requirement for IT workers, in the attention zone of news sites and software developers. It was a controversial matter, as he referred to "foreign accents" and the internet is full of people who are just waiting to overreact, but this is the least interesting part of the question, so I'll skip it. The important part is that no one usually talks about the "English problem", and I always felt a bit alone on that side, as if it were a problem affecting only me, so in this blog post I want to share my experience with English.
Twilio just released a post mortem about an incident that caused issues with the billing system: http://www.twilio.com/blog/2013/07/billing-incident-post-mortem.html The problem involved a Redis server, since Twilio is using Redis to store the in-flight account balances, in a master-slaves setup, with multiple slaves in different data centers for obvious availability and data safety concerns. This is a short analysis of the incident, and of what Twilio and Redis can each do to avoid this kind of issue.
Last night I returned home after a short trip to San Francisco. Before memory fades out and while my feelings are crisp enough, I'm writing a short report of the trip. The point of view is that of a south European programmer exposed for a few days to what is probably the most active information technology ecosystem and economy in the world.

Reaching San Francisco
===

If you want to reach San Francisco from Sicily, there are no direct flights to help you. My flight was a Lufthansa flight from Catania to Munich, and finally from Munich to San Francisco. This is a total of 15 hours of flight, plus the stop in Munich waiting for the second flight.
Redis uses streamed asynchronous replication, which is one of the simplest forms of replication you can imagine: a continuous stream of writes is sent to the slaves, without waiting for the slaves to process the writes in any way before replying to the client. I always took that almost for granted, as I always assumed Redis was not a good match for synchronous replication, which has a higher latency. However, recently I tried to fix another issue with Redis replication, that is, timeouts are all up to the slave.
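The behavior described above can be illustrated with a small in-process simulation. This is only a sketch under stated assumptions: `Master`, `Replica`, `set`, and `flush` are hypothetical names, the per-replica queues stand in for network streams, and nothing here reflects the actual Redis replication protocol.

```python
from collections import deque


class Replica:
    """Applies writes whenever they arrive from the master's stream."""

    def __init__(self):
        self.data = {}

    def apply(self, command):
        op, key, value = command
        if op == "SET":
            self.data[key] = value


class Master:
    """Replies to the client before any replica has processed the write."""

    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas
        # One pending-write queue per replica, standing in for its stream.
        self.streams = [deque() for _ in replicas]

    def set(self, key, value):
        self.data[key] = value
        for stream in self.streams:
            stream.append(("SET", key, value))
        # Asynchronous replication: acknowledge immediately, no waiting.
        return "OK"

    def flush(self):
        # Simulates the stream being delivered some time later.
        for replica, stream in zip(self.replicas, self.streams):
            while stream:
                replica.apply(stream.popleft())


replica = Replica()
master = Master([replica])
reply = master.set("balance", "100")
seen_before_flush = dict(replica.data)  # replica has not seen the write yet
master.flush()
seen_after_flush = dict(replica.data)
```

The window between the reply and the flush is where an acknowledged write can be lost if the master fails. A synchronous design would close that window by waiting for replicas inside `set`, at the cost of the higher latency mentioned above.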
Terah is a planet far away, where networks never split. They have a single issue with their computer networks: from time to time, single hosts break in one way or another. Sometimes it is a broken power supply, other times a crashed disk, or a software issue completely blocking the system. The inhabitants of this strange planet use two database systems. One is imported from planet Earth via the Galactic Exchange Program, and is called EDB. The other is produced by engineers from Terah, and is called TDB. The databases are functionally equivalent, but they have different semantics when a network partition happens. While the database from Earth stops accepting writes as long as it is not connected with the majority of the other database nodes, the database from Terah works as long as the majority of the clients can reach at least one database node (incidentally, the author of this story released a similar software project called Sentinel, but this is just a coincidence).
In a great series of articles Kyle Kingsbury, aka @aphyr on Twitter, attacked a number of data stores: http://aphyr.com/tags/jepsen Postgres, Redis Sentinel, MongoDB, and Riak are audited to find out what happens during network partitions and whether these systems can provide the claimed guarantees. Redis is attacked here: http://aphyr.com/posts/283-call-me-maybe-redis I said that Kyle "attacked" the systems on purpose, as I see a parallel with the world of computer security here. It is really a good idea to move this paradigm to the database world, to show the failure modes of systems against the claims of vendors. Similarly to what happens in the security world, the vendor may take the right steps to fix the system when possible, or at least the user base will be able to recognize that under certain circumstances something bad is going to happen to their data.