there is something between append only and reusing blocks

Monday, 07 February 11
One of the most interesting properties of an append only btree is that it's impossible to corrupt. There are other interesting properties as well like concurrent accesses are trivial, since whatever is the root and all the nodes your reader is going to access, they are all valid, just they may be an old representation of the btree.

But there is a price to pay that I'm not happy with, that is, the append only btree requires a compaction process, otherwise it will get bigger and bigger. If your use case is mostly-reads this may be ok, but if you plan to use a lot of writes there is some math to do.

Think at this: if you have many writes that are near to the max disk I/O that your server is able to deliver, your append only btree file size is going to get big of course, and you need compaction. But you are already near to the I/O limit, what happens once you start writing a new file to rewrite the btree? The additional I/O can affect badly the performance of the btree. Can you slow down writes in order to reduce the impact? Unfortunately not, otherwise the rewrite will never be fast enough to compact the file.

I'm all for performance characteristics that are as predictable as possible, so I don't like this design. It is for sure a sensible one for many applications, but I can live better with the idea of a single file that is never rewritten (if not for crash recovery or alike) and that is able to reuse the no longer used blocks.

But guess what? such a btree is much simpler to corrupt, especially if you don't use fsync() in order to implement write barriers. Disks and operating systems can reorder writes, so even if you do something like:
  • Create a new node
  • Link the new node to the existing tree
If there is not an fsync() between the two, it is well possible that writes are reordered, so that the link is created before the node is written on disk. If a crash happens between this two operations you end with a corrupted btree. But are the append only strategy and update in place (and reuse blocks via free lists) so incompatible? Possibly not.

Experimenting with my btree implementation I discovered what is probably the discovery of hot water, that is, likely common practice, but the kind of common practice that you'll never find in an algorithms book. That is, what about if my free list entries have a timeout so that they can't be used before 60 seconds?

This way the btree implementation can always rewrite the modified nodes instead of updating them in place. Even if there is a read currently in progress, we are sure that even if our writes will reuse nodes, no recently freed node will be used, so unless the read will take 60 seconds to complete, everything will go well.

Unfortunately we still need a fixed offset for our root node pointer, so an fsync is still needed on every write in order to make sure that reordering can't corrupt our btree on crash.

With my implementation I'm not still at this stage as I'm starting with the most basic things that can work well, but that's the idea for the next releases.
35769 views*
Posted at 06:02:28 | permalink | 7 comments | print
Do you like this article?
Subscribe to the RSS feed of this blog or use the newsletter service in order to receive a notification every time there is something of new to read here.

Note: you'll not see this box again if you are a usual reader.

Comments

Didier writes:
07 Feb 11, 06:52:57
Hi - maybe random/sequential I/Os should be considered as well. If you isolate an append-only file on a disk, you can guarantee only sequential I/Os will be done. On the other hand, if you rewrite blocks in place, you will likely perform random I/Os, which are much slower on non-SSD hardware. Regards, Didier.
Pierre Habouzit writes:
07 Feb 11, 08:04:45
60 seconds may not be enough if you're really stuck.
What you want is RCU.
Chris writes:
07 Feb 11, 09:16:17
In CouchDB we have a theoretical problem with compaction not keeping up with new writes and therefor never finishing. We've never seen it in real life. The real issue tends to be when people don't have enough free disk space to compact. For steady state performance and to avoid surprises, I know some users who are always compacting.
Anonymous writes:
07 Feb 11, 13:03:38
Seconding the recommendation for RCU. That allows readers to mark their presence in an incredibly lightweight way, and have writers put off reclamation until all readers have finished with the data.
Didier writes:
07 Feb 11, 13:37:14
Hi again - RCU is to optimize locking in memory. In a disk-based database, people will rather call it MVCC (even if they are slight different - see this post from Paul E. McKenney http://www.mail-archive.com/kernelnewbies@nl.linux.org/msg06575.html ). But IMO, this is off-topic here. Redis is mostly single-threaded. Even if the I/Os will be done in a separate thread (or maybe several of them), I doubt concurrency in memory will be a bottleneck.
forchenyun writes:
04 Mar 11, 00:27:21
I think you could consider the implement of Berkeley DB Java Edition.
The compaction would never be the bottleneck because data files are separated every 10M.
Daniel Waterworth writes:
05 Mar 11, 02:14:21
Hi, I wondered if you'd be interested in aodbm ( sf.net/projects/aodbm/ ). aodbm is an append only database manager that uses an immutable B+Tree internally.
comments closed