Premise: a small rant about software reliability
===

I'm very serious about software reliability, and this is not just a good thing. It is good in the sense that I work hard to ensure that the software I release is solid. At the same time I think I take this issue a bit too personally: I get upset if I receive a crash report that I can't investigate further for some reason, or that looks almost impossible to me, or that comes with an unhelpful stack trace.

Guess what? This is a bad attitude, because delivering bug-free software is simply impossible. We are used to thinking in terms of labels: "stable", "production ready", "beta quality". I think these labels are actually pretty misleading if not put in the right perspective. Software reliability is an incredibly complex mix of ingredients:

1) The precision of the specification or documentation itself. If you don't know how the software is supposed to behave, you have things that may be bugs or features, depending on the point of view.

2) The number of things not working according to the specification, or causing a software crash. Let's call these simply software "errors".

3) The percentage of errors that happen to be in the subset of the software that is actually used and stressed by most users.

4) The probability that the conditions needed for a given error to happen are met.

So what happens is that you write some code, and this initial version will be as good as you are meticulous as a programmer, or as good as your team and organization are if it is a larger software project involving many developers. But this initial version contains a number of errors anyway. Then you and your users test, or simply use, this new code. All the errors that are likely to happen and to be recognized, because they live in the subset of the code that users hit the most, start to be discovered. Initially you discover a lot of issues in a short time, then every new bug takes longer to be found, probabilistically speaking, because the remaining ones live in the parts of the code that are less stressed, or the conditions that make the error evident are unlikely to be met.

Basically your code is never free from errors. "Alpha", "beta", "production ready" and "rock solid" are just names we assign to the probability we believe there is for a serious error to be discovered in a reasonable time frame.

Redis crashes
===

Redis users are not likely to see Redis crashing without a good reason (Redis will abort on purpose in a few exceptional conditions). However from time to time there are bugs in Redis that will cause a server crash under certain conditions.

What makes crashes different from other errors is that in complex systems crashes are the perfect example of the "likely hard to reproduce" error. Bug fixing is all about building a mental model of how the software works to understand why it is not behaving as expected. Every time you can trigger the bug again, you add information to your model: eventually you have enough information to understand and fix the bug.

Redis crashes happen in a long running program that interacts with many clients sending commands at different times, so the exact sequence of commands and events is very hard to reproduce. Crashes stop the server, provide few clues, and are usually very hard to reproduce. This means little information, a poor model of what is happening, and ultimately, bugs that are very hard to fix.
If this is not enough, crashes are not only bad for developers, they are also very bad for users, as the software stops working after a crash, which is one of the worst misbehaviors you can expect from a piece of software. This is why I hate Redis (and other software) crashes.

Stack traces
===

Are there ways to make crashes more *friendly*? If a crash is reproducible, and if the user has time to spend with you, possibly from a far away time zone, maybe you can make a crash more friendly with a debugger. If you are lucky enough that the data is not important for the user, you may also get a core dump (which contains a copy of the data set in the case of Redis). However most crashes will simply not happen again in a practical amount of time, or the user can't help with the debugging, or the data is too important to send you a core.

So one of the first things I did to improve Redis crashes was to make Redis print a stack trace when it crashes. This is an example of what you get if you send DEBUG SEGFAULT to the server:

=== REDIS BUG REPORT START: Cut & paste starting from here ===
[32827] 26 Nov 15:19:14.158 # Redis 2.6.4 crashed by signal: 11
[32827] 26 Nov 15:19:14.158 # Failed assertion: <no assertion failed> (<no file>:0)
[32827] 26 Nov 15:19:14.158 # --- STACK TRACE
0   redis-server        0x0000000103a15208 logStackTrace + 88
1   redis-server        0x0000000103a16544 debugCommand + 68
2   libsystem_c.dylib   0x00007fff8c5698ea _sigtramp + 26
3   ???                 0x0000000000000000 0x0 + 0
4   redis-server        0x00000001039ec145 call + 165
5   redis-server        0x00000001039ec77f processCommand + 895
6   redis-server        0x00000001039f7fd0 processInputBuffer + 160
7   redis-server        0x00000001039f6dfc readQueryFromClient + 396
8   redis-server        0x00000001039e82d3 aeProcessEvents + 643
9   redis-server        0x00000001039e851b aeMain + 59
10  redis-server        0x00000001039ef33e main + 1006
11  libdyld.dylib       0x00007fff91ef97e1 start + 0
12  ???                 0x0000000000000001 0x0 + 1

... more stuff ...

[32827] 26 Nov 15:19:14.163 # --- CURRENT CLIENT INFO
[32827] 26 Nov 15:19:14.163 # client: addr=127.0.0.1:56888 fd=5 age=0 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=0 omem=0 events=r cmd=debug
[32827] 26 Nov 15:19:14.163 # argv[0]: 'debug'
[32827] 26 Nov 15:19:14.163 # argv[1]: 'segfault'
[32827] 26 Nov 15:19:14.163 # --- REGISTERS
[32827] 26 Nov 15:19:14.163 #
RAX:0000000000000000 RBX:00007fb8e4800000
RCX:00000000000003e8 RDX:00007fff79804788
RDI:0000000000000000 RSI:0000000103a4e2bf
RBP:00007fff5c2194c0 RSP:00007fff5c2193e0
R8 :0000000050b37a62 R9 :0000000000000019
R10:0000000000000001 R11:0000000002000000
R12:0000000000026c0f R13:0000000103a6bc80
R14:00007fb8e3600000 R15:00007fb8e3600178
RIP:0000000103a16544 EFL:0000000000010246
CS :000000000000002b FS:0000000000000000 GS:0000000000000000
[32827] 26 Nov 15:19:14.164 # (00007fff5c219458) -> 00007fb8e3600000
[32827] 26 Nov 15:19:14.164 # (00007fff5c219450) -> 0000000000000001
[32827] 26 Nov 15:19:14.164 # (00007fff5c219448) -> 0000000000000000
[32827] 26 Nov 15:19:14.164 # (00007fff5c219440) -> 00007fff5c219470

... more stuff ...

And a lot more, including the INFO output and the full list of connected clients and their state.

Printing stack traces on crash is already a huge improvement over "segmentation fault, core dumped". A lot of times I can fix the issue just from the output I see, without asking the user anything. Add to this that we also get the arguments of the command that was being executed. I'm crazy enough that Redis logs the registers and part of the stack as well.
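Getting a basic version of this kind of reporting into your own programs takes surprisingly little code. The following is only a minimal sketch using the backtrace() API available on Linux (glibc) and OS X, not the actual Redis implementation (see logStackTrace() in debug.c, which also dumps the registers, the current client and the INFO output). Calling non async-signal-safe functions from a signal handler is technically undefined behavior, but this is the usual trade-off crash reporters accept:

/* Minimal sketch of a crash handler that logs a stack trace.
 * Not the actual Redis code: just the core idea. */
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <execinfo.h>

static void crash_handler(int sig) {
    void *trace[100];
    int n = backtrace(trace, 100);

    fprintf(stderr, "Crashed by signal: %d\n", sig);
    backtrace_symbols_fd(trace, n, 2);  /* write the symbol names to stderr */

    /* Restore the default handler and re-raise the signal so that the
     * process still dies and dumps core as usual. */
    signal(sig, SIG_DFL);
    raise(sig);
}

int main(void) {
    signal(SIGSEGV, crash_handler);
    *(volatile int*)0 = 0;  /* force a segfault to see the handler at work */
    return 0;
}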
If you have a trace like the one above, and a copy of the redis-server executable that generated it (and users will easily send it to you), you can inspect a lot of state just from that. EVERY MISSION CRITICAL SOFTWARE should do this.

Broken memory
===

So far so good. That kind of reporting on crashes, plus a gentle and helpful community of users, can help you improve the quality of the software you ship, a lot. But what happens if the bug is a hard one? A memory violation in a random place, like a dictionary lookup in the context of a simple GET operation? Well, you are in big trouble. You can't ignore a bug like that, but investigating it takes a lot of time. And guess what? Many times you do a lot of work for nothing, because Redis crashed only because the user was running it on broken memory.

Think along these lines: Redis uses memory a lot, and memory errors easily get amplified when you use many data structures full of pointers. If there is a broken bit, it will easily end up crashing the server in a short time. Now let's scale this to the sum of all the memory that every single Redis instance out there is using right now. What you get is that Redis is like a parallel memory test running on many computers, waiting to spot some bit error. And as Redis gets more known and deployed? More memory tested every second.

The problem is that this giant memory test will not write "Hey, you got a problem in your memory!". The output will be a new issue in the Redis GitHub issue tracker, and me investigating non-existent bugs for hours.

First countermeasure
===

At some point, after the first endless investigations, I started to be smarter: when a bug looked suspicious I started the investigation with "Please, can you run a memory test to verify that the computer memory and CPU are likely OK?". However this requires the user to reboot the machine and run memtest86, or at least to install some user space memory testing program like memtester, available in most Linux distributions. Many times the user has no physical access at all to the box, or there is no "box" at all, because the user is running Redis on a virtual machine somewhere.

After some time I realized that if Redis were able to test the computer memory easily, without requiring additional software, I would receive many more memory test reports from users. This is why recent Redis instances can perform a memory test if you start the server like this:

./redis-server --test-memory 4096

Where "4096" is the number of megabytes of RAM to test. The test is not perfect, but it proved to be pretty effective at detecting common memory errors. However there are more subtle errors, like retention errors (you set a memory cell to a given value, wait some time, and then read back a different value), that are not detected by this test. But apparently most memory errors are simpler than that and can be detected more easily. The quality of the detection also depends on how long the user lets the memory test run: there are errors that only happen rarely.

Anyway this test improved my experience with Redis crashes a lot. Sometimes I see a bug report, I just ask the user to test the RAM, and I get a message back where the user acknowledges that there was a problem with a memory module. However users rarely run a more accurate memory test like memtest86 for days, like they should after a crash. Many times it is simply not practical at all.
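By the way, the essence of a user space test like this is small: fill a large buffer (much larger than the CPU cache) with a pattern in one pass, verify it in a second pass, and repeat with a few different patterns. This is only a toy sketch of that idea, under those assumptions, not the actual memtest.c code behind --test-memory:

/* Toy sketch of a destructive user space memory test: not the actual
 * memtest.c implementation used by --test-memory. The buffer must be
 * much larger than the CPU cache for the test to be meaningful.
 * Returns the number of mismatching words found, or -1 on allocation error. */
#include <stdlib.h>

long simple_memtest_pass(size_t megabytes, unsigned long pattern) {
    size_t words = megabytes * 1024 * 1024 / sizeof(unsigned long);
    unsigned long *m = malloc(words * sizeof(unsigned long));
    size_t i;
    long errors = 0;

    if (m == NULL) return -1;
    for (i = 0; i < words; i++) m[i] = pattern;   /* pass 1: fill */
    for (i = 0; i < words; i++)                   /* pass 2: verify */
        if (m[i] != pattern) errors++;
    free(m);
    return errors;
}

You would call it several times with patterns like all zeroes, all ones and alternating bits, for as long as the user is willing to wait.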
So even when I get a "memory is fine" report, I simply take it as a matter of fact that the memory is OK and investigate the issue closely. But if no similar crash was reported recently on different hardware, and there is no way to reproduce the crash, after some effort I'll simply stop investigating, keeping the issue open for some more time, and finally closing it if no other similar crashes are reported.

Testing more memory
===

Even without considering the false negatives, the current redis-server --test-memory approach has two bigger problems:

1) Not every user will actually run the test after a crash.

2) Users running Redis under virtualization, possibly on a virtual machine provider like EC2, will never know whether they are testing the same physical memory pages when they run redis-server --test-memory.

The first problem could be addressed simply with a policy: a crash without a memory test is not investigated if it looks suspicious. However this is good neither for the Redis project nor for the users: sometimes a user will file a report and simply have no more time to help us. Other times the user can't stop the server. And we developers may miss the opportunity to track down a real bug just because of a bad attitude and the mere possibility of a memory error.

The second problem, related to people using virtualization, is even worse: even when users want to help you, they can't do it in a reliable way.

So what about testing the memory just when a crash happens? This would allow Redis to test the memory of every single computer where a crash has happened. The bug report will be annotated with the result of the test, providing an interesting hint about the state of the memory without further help from the user. And even better, the test will check exactly the memory as allocated at the moment of the crash! It will test exactly the physical memory pages that Redis is using, which is just perfect for environments like EC2.

The problem is, how to do it?

How to test on crashes
===

My first idea was to test the memory incrementally inside the allocator: from time to time, when some memory is allocated, run a fast memory test on it, and log it in the Redis log and in the INFO output if a problem is detected. In theory it is nice, and I even implemented the idea. The problem is, it is completely unreliable. If the broken memory is allocated for something that is never deallocated later, it will never be tested again. Worse than that, it takes a lot of time to test the whole memory incrementally, small piece after small piece, and what about testing every single location? The allocator itself uses "internal" memory that is never exposed to the user, and we would miss all those pages. Bad idea... and... the implementation I wrote was not working at all, as the CPU cache made it completely useless: testing small pieces of memory incrementally results in constant cache hits.

The second try was definitely better, and was simply to test the whole space of allocated memory, but only when a crash happens. At first this looks pretty hard: at the very least you need a lot more help from the memory allocator you are using. I don't think jemalloc has an out of the box way to report the memory regions allocated so far. Also, if we are crashing, I'm not sure how reliable asking the allocator to report memory regions would be: as a result of a single bit error, it is very easy for the error to propagate to random locations. There are other problems too.
After a crash we want to be able to dump a core that is meaningful. If during the memory test we fill our memory with patterns, the core file will be completely useless. This means that the memory test has to be conceived so that at the end of the test the memory is left untouched.

The proc filesystem /proc/<pid>/maps
===

The Linux implementation of the proc filesystem makes Linux an exceptionally introspective operating system. A developer needs minimal effort to access information that is usually not exposed to user space. Actually the effort required can be as small as parsing a text file.

The "maps" file of the proc filesystem lists, line after line, all the memory mapped regions of a given process and their permissions (read, write, execute). The sum of all the reported regions is the whole address space that can be accessed in some way by the specified process. Some of these maps are "special" maps created by the kernel itself, like the process stack. Other maps are memory mapped files like dynamic libraries, and others are simply regions of memory allocated by malloc(), either via the sbrk() syscall or via an anonymous mmap().

The following two lines are examples of maps (obtained with "cat /proc/self/maps"):

7fb1b699b000-7fb1b6b4e000 r-xp 00000000 08:05 15735004 /lib/x86_64-linux-gnu/libc-2.15.so
7fb1b6f5f000-7fb1b6f62000 rw-p 00000000 00:00 0

The first part of each line is the address range of the memory mapped area, followed by the permissions "rwxp" (read, write, execute, private), then the offset in the case of a memory mapped file, the device id (which is 00:00 for anonymous maps), and finally the inode and file name for memory mapped files.

We are interested in checking all the heap allocated memory that is readable and writable, so a simple grep will do the trick:

$ cat /proc/self/maps | grep 00:00 | grep rw
01cbb000-01cdc000 rw-p 00000000 00:00 0                  [heap]
7f46859f9000-7f46859fe000 rw-p 00000000 00:00 0
7f4685c05000-7f4685c08000 rw-p 00000000 00:00 0
7f4685c1e000-7f4685c20000 rw-p 00000000 00:00 0
7fffe7048000-7fffe7069000 rw-p 00000000 00:00 0          [stack]

Here we find both the heap allocated using the sbrk() system call and the heap allocated using anonymous maps. However there is one thing that we don't want to scan: the stack, otherwise the function that is testing the memory would easily crash itself.

So thanks to the Linux proc filesystem the first problem is no longer a big issue, and we use some code like this:

int memtest_test_linux_anonymous_maps(void) {
    FILE *fp = fopen("/proc/self/maps","r");
    /* ... some more var declarations ... */

    while(fgets(line,sizeof(line),fp) != NULL) {
        char *start, *end, *p = line;

        start = p;
        p = strchr(p,'-');
        if (!p) continue;
        *p++ = '\0';
        end = p;
        p = strchr(p,' ');
        if (!p) continue;
        *p++ = '\0';
        if (strstr(p,"stack") ||
            strstr(p,"vdso") ||
            strstr(p,"vsyscall")) continue;
        if (!strstr(p,"00:00")) continue;
        if (!strstr(p,"rw")) continue;

        start_addr = strtoul(start,NULL,16);
        end_addr = strtoul(end,NULL,16);
        size = end_addr-start_addr;

        start_vect[regions] = start_addr;
        size_vect[regions] = size;
        printf("Testing %lx %lu\n", start_vect[regions], size_vect[regions]);
        regions++;
    }

    /* ... code to actually test the found memory regions ... */

    /* NOTE: It is very important to close the file descriptor only now
     * because closing it before may result into unmapping of some memory
     * region that we are testing. */
    fclose(fp);
}

CPU cache
===

The other problem we have is the CPU cache.
If we try to write something to a given memory address, and then read it back to check if it is OK, we are actually only stressing the CPU cache and never hitting the memory that is supposed to be tested.

Actually, writing a memory test that bypasses the CPU cache, without resorting to CPU specific tricks (like memory type range registers), is easy:

1) Fill all the addressable memory, from the first location to the last, with the pattern you are testing.

2) Check all the addressable memory, from the first location to the last, to see if the pattern can be read back as expected.

Because we do it in two passes, as long as the size of the memory we are testing is larger than the CPU cache, we should be able to test the memory in a reliable way.

But there is a problem: this test destroys the content of the memory, which is not acceptable in our case. Remember that we want to be able to provide a meaningful core dump after the crash, if needed.

On the other hand, writing a memory test that does not destroy the memory content, but that is not able to bypass the cache, is also easy: for each location, save its value on the stack, test the location by writing patterns and reading them back, and finally restore the original value. However this test is completely useless as long as we are not able to disable the CPU cache.

So, how to write a memory test that:

A) Is able to bypass the cache.

B) Does not destroy the memory content.

C) Is able, at least, to test every memory bit in the two possible states.

Well, that's what I asked myself during the past weekend, and I found a simple solution that works as expected (tested on a computer with broken memory, thanks Kosma! See the credits at the end of the post). This is the algorithm:

1) Take a CRC64 checksum of the whole memory.
2) Invert the content of every location, from the first to the last (by "invert" I mean bitwise complement, so that every 1 is turned into 0 and every 0 into 1).
3) Swap the content of every pair of adjacent locations: swap the content at addresses 0 and 1, 2 and 3, and so forth.
4) Swap again (as in step 3).
5) Invert again (as in step 2).
6) Take a CRC64 checksum of the whole memory.
7) Swap again.
8) Swap again.
9) Take a CRC64 checksum of the whole memory again.

If the CRC64 checksums obtained at steps 1, 6 and 9 are not all the same, there is some memory error.

Now let's check why this is supposed to work. It is trivial to see that, if the memory is working as expected, after these steps I get back the original memory content, since I swap four times and invert twice. But what happens if there are memory errors? Let's do a test considering memory locations of just two bits each, for simplicity. So I have something like:

01|11|11|00
      |
      +----- this bit is broken and is always set to 1.

(step 1: CRC64)
After step 2: 10|00|10|11 (note that the broken bit is still 1 instead of 0)
After step 3: 00|10|11|10 (adjacent locations swapped)
After step 4: 10|00|10|11 (swapped again)
After step 5: 01|11|11|00 (inverted again, so far, no errors detected)
(step 6: CRC64)
After step 7: 11|01|10|11
After step 8: 01|11|11|10 (error!)
(step 9: CRC64)

The CRC64 obtained at step 9 will differ. Now let's check the case of a bit always set to 0:

01|11|01|00
      |
      +----- this bit is broken and is always set to 0.

(step 1: CRC64)
After step 2: 10|00|00|11 (invert)
After step 3: 00|10|01|00 (swap)
After step 4: 10|00|00|01 (swap)
After step 5: 01|11|01|10 (invert)
(step 6: CRC64)
After step 7: 11|01|00|01 (swap)
After step 8: 01|11|01|00 (swap)
(step 9: CRC64)

This time it is the CRC64 obtained at step 6 that will differ. You can check what happens if you flip bits in the adjacent location, but either at step 6 or at step 9 you should always end up with a different checksum.

So basically this test does two things: first it amplifies the error, using an adjacent location as a helper, then it uses checksums to detect it. The steps are always performed as "read + write" operations acting sequentially from the first to the last memory location, to make the CPU cache as ineffective as possible.

You can find the code implementing this in the "memtest-on-crash" branch on GitHub, in the "debug.c" file:

https://github.com/antirez/redis/blob/memtest-on-crash/src/debug.c#L669
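If you prefer code to prose, this is a toy sketch of the same invert/swap/checksum idea. It is a simplified illustration, not the actual implementation linked above: here crc64() is assumed to be any 64-bit checksum function (Redis ships one in crc64.c), and the real code deals with details this sketch ignores.

/* Toy sketch of the error-amplifying, content-preserving memory test.
 * Not the actual debug.c implementation. If the RAM works, the memory
 * content is exactly the same at the end of the test. */
#include <stdint.h>
#include <stddef.h>

/* Assumed to be provided elsewhere (Redis ships an implementation in crc64.c). */
uint64_t crc64(uint64_t crc, const unsigned char *s, uint64_t l);

static void invert(volatile unsigned long *m, size_t words) {
    size_t i;
    for (i = 0; i < words; i++) m[i] = ~m[i];
}

static void swap_adjacent(volatile unsigned long *m, size_t words) {
    size_t i;
    for (i = 0; i+1 < words; i += 2) {
        unsigned long t = m[i];
        m[i] = m[i+1];
        m[i+1] = t;
    }
}

/* Returns 0 if the three checksums match (memory looks fine), 1 otherwise. */
int memtest_preserving(unsigned long *m, size_t bytes) {
    size_t words = bytes / sizeof(unsigned long);
    uint64_t crc1, crc2, crc3;

    crc1 = crc64(0, (unsigned char*)m, bytes);  /* step 1 */
    invert(m, words);                           /* step 2 */
    swap_adjacent(m, words);                    /* step 3 */
    swap_adjacent(m, words);                    /* step 4 */
    invert(m, words);                           /* step 5 */
    crc2 = crc64(0, (unsigned char*)m, bytes);  /* step 6 */
    swap_adjacent(m, words);                    /* step 7 */
    swap_adjacent(m, words);                    /* step 8 */
    crc3 = crc64(0, (unsigned char*)m, bytes);  /* step 9 */

    return !(crc1 == crc2 && crc2 == crc3);
}

In the on-crash scenario you would call something like this on every region found while parsing /proc/self/maps, as shown in the previous section.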
The kernel could do it better
===

After dealing with many crash reports that were actually due to memory errors, I'm starting to think that kernels are missing an incredible opportunity to make computers more reliable. What Redis is doing could be done incrementally, a few pages per second, by the kernel, with no actual impact on performance. And the kernel is in a particularly good position:

1) It could detect the error easily, bypassing the cache.

2) It could perform more interesting value retention error tests, writing patterns to pages that will be reused much later in time, and checking that the pattern still matches before the page is reused.

3) The error could be logged in the system logs, making the user aware before a problem happens.

4) It could exclude the broken page from being used again, resulting in safer computing.

I hope to see something like this in the Linux kernel in the future.

The life of a lonely cosmic ray
===

A bit flipping at random is not a problem related solely to broken memory. Perfectly healthy memory is also subject, with a small probability, to bit flips caused by cosmic rays. We are talking, of course, about non error correcting memory. The more costly ECC memory can correct single bit errors and can detect two bit errors, halting the system. However many (most?) servers currently use non-ECC memory, and it is not clear whether Amazon EC2 and other cloud providers use ECC memory or not (in my opinion it is not OK that this information is not clearly available, given the cost of services like EC2 and the possible implications of not using ECC memory).

According to a few sources, including IBM, Intel and Corsair, a computer with a few GB of non-ECC memory is likely to incur *several* memory errors every year. Of course you can't detect the error with a memory test if the bit flip was caused by a cosmic ray hitting your memory, so to cut a long story short: isolated, impossible to understand and reproduce crashes, reported by users not using ECC memory, can't be taken as proof that there is a bug, even if the fast on-crash memory test passes, even if the more accurate redis-server --test-memory passes, and even if the user runs memtest86 for several days after the event.

Anyway, not every flipped bit is going to trigger a crash after all, because as somebody on Stack Overflow said: "Most of the memory contains data, where the flip won't be that visiblp". The bytes representing 'p' and 'e' are not just one bit away, but I find the sentence fun anyway.
Credits
===

A big Thank You to Kosma Moczek, who reported how my initial approach was not working due to the CPU cache, and donated SSH access to a fast computer with a single bit memory error.