After Hours Academic

A brief introduction to LSM trees

Log-structured merge (LSM) trees are widely used in key-value stores and as the storage layer of database systems. There is a rich tradeoff space in LSM tree designs that I have been reading about recently. Before discussing any particular design tradeoffs (in the coming posts), I am starting with a brief introduction to the LSM tree design.

Structure

An LSM tree stores key-value pairs. The storage is composed of multiple levels, each with exponentially increasing capacity. The smallest level is stored in-memory and every other level is stored in persistent storage (either on-machine HDD/SSD or on a distributed file system). As an example, consider an LSM tree with a size ratio of 4. The zeroth (smallest, in-memory) level could be 1MB in size, the first level (on disk) 4MB, the second level (on disk) 16MB, and so on.
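The exponential sizing can be made concrete with a tiny sketch (the function name and defaults are mine, not from any particular implementation; they match the 1MB/size-ratio-4 example above):

```python
# Sketch: per-level capacities of an LSM tree with exponentially
# growing levels. base_mb is the in-memory (zeroth) level's capacity.
def level_capacities_mb(base_mb=1, size_ratio=4, num_levels=4):
    # Level i holds size_ratio times the capacity of level i-1.
    return [base_mb * size_ratio ** i for i in range(num_levels)]

print(level_capacities_mb())  # [1, 4, 16, 64]
```

With a size ratio of 4, a handful of levels is enough to reach very large capacities, which is why LSM trees in practice have few levels.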

Life of a write

A new write starts at the zeroth level. This in-memory level is typically implemented as a skip list. If the write is for a new key, it increases the space used up by the in-memory level. If it is an update to the value of an existing key (and assuming for simplicity that the new and old values are of the same size), the update is performed in place and there is no increase in the size of the level. Eventually, with enough writes, the zeroth level reaches its capacity. Once that happens, the data in this level is serialized, i.e., converted to a storage-friendly format, and written out to the next level. There are different ways to serialize the in-memory level, but for simplicity, we can assume that all the key-value pairs are written out consecutively without any additional indexing structure on top of them. The serialized data is referred to as an SSTable (Sorted String Table). An SSTable stores the key-value pairs sorted by key. After the serialization to an SSTable and the write to the first level, the zeroth level starts anew from size zero and grows until it reaches its capacity again, at which point it is written out as an SSTable to the first level once more. After enough of these cycles, the first level (which has been growing as it absorbs the SSTable writes from the zeroth level) reaches its capacity. When that happens, its data is compacted and written out to the second level.
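The zeroth level and its flush can be sketched as follows. This is a simplification of my own: a plain dict stands in for the skip list, and capacity is counted in number of keys rather than bytes.

```python
# Sketch of the in-memory zeroth level (a "memtable"). A real
# implementation uses a skip list; a dict stands in for it here.
class MemTable:
    def __init__(self, capacity):
        self.capacity = capacity  # max number of keys, for simplicity
        self.data = {}            # key -> value

    def put(self, key, value):
        # Updates to existing keys happen in place and do not grow
        # the level; new keys increase its size.
        self.data[key] = value

    def is_full(self):
        return len(self.data) >= self.capacity

    def flush(self):
        """Serialize to an SSTable: key-value pairs sorted by key,
        with no additional indexing structure on top."""
        sstable = sorted(self.data.items())
        self.data = {}  # the zeroth level starts anew from size zero
        return sstable
```

For example, writing `b`, `a`, and then `b` again leaves only two keys in the level, and a flush emits them sorted as `[("a", ...), ("b", ...)]`.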

Compaction is the process whereby multiple SSTables are merged to form a single SSTable. Consider the first level, which has accumulated multiple SSTables as a result of the zeroth level repeatedly reaching its capacity. Any key appears exactly once within a single SSTable because keys were updated in place in the zeroth level. However, a single key can belong to multiple SSTables in the first level if it was added or updated between consecutive zeroth-to-first-level SSTable writes. Compaction de-duplicates the repeated keys across the SSTables, keeping only the most recent value for each key. The SSTable that results from compacting all the SSTables in the first level thus contains any key exactly once. This new SSTable is written to the second level.
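The de-duplication at the heart of compaction can be sketched like this, assuming each SSTable is a list of (key, value) pairs and the SSTables are passed oldest first:

```python
# Sketch of compaction: merge SSTables (ordered oldest to newest)
# so that the most recent value for each key wins, and emit a single
# SSTable sorted by key.
def compact(sstables_oldest_first):
    merged = {}
    for sstable in sstables_oldest_first:
        for key, value in sstable:
            merged[key] = value  # a later (newer) table overwrites
    return sorted(merged.items())
```

A production implementation would instead do a streaming k-way merge over the already-sorted tables, but the outcome is the same: each key appears exactly once, with its newest value.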

Over time, the whole process repeats: writes go to the zeroth level (with in-place updates for repeated keys), an SSTable is written out to the first level once the zeroth level fills up, and the SSTables in the first level are compacted and written to the second level. And again. And again. Eventually, the second level also fills up. And guess what? SSTables in the second level are then compacted and written out to the third level.

This is the life of writes in LSM trees: new writes go to the in-memory level, and as levels fill up, their data is written out to the next larger level. Obviously, this can't continue perpetually because eventually the durable storage runs out of space. This limitation is no different from any other storage engine's limitation -- any storage engine is limited by the underlying storage's capacity. An LSM tree that hits its storage capacity would simply stop accepting new keys, just as a database stops accepting new data once it hits its storage capacity.

There are a handful of design choices for compaction (with a not-so-handful number of options for each), the most influential being when to compact -- when a new SSTable arrives at a level, or when the level fills up and needs to be written to the next level? I will discuss these design choices in later posts.

Deletes need special handling in LSM trees because the key (with old values) may be present at different levels. And it might not even be present in the zeroth level. To handle a delete, the LSM tree adds/updates the key in the zeroth level with a special tombstone value. The tombstone represents that the key-value pair has been deleted. Subsequent compactions treat tombstones just like any other value: if the tombstone is the latest value, it is retained; otherwise it is discarded in favour of a more recent value (e.g., if the key was added again after being deleted). Eventually, the tombstoned key-value pair can be discarded when it is involved in a compaction at the largest level (which by definition contains the oldest values). If the tombstone is the most recent value for a key in a compaction at the largest level, there are no older values of the key left anywhere that the tombstone would need to delete, so the tombstone itself can be dropped.
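Extending the compaction sketch with tombstones looks like this (self-contained; the sentinel and the largest-level flag are my own simplifications):

```python
TOMBSTONE = object()  # sentinel value marking a deleted key

# Sketch of compaction with tombstones: a tombstone wins like any
# other newer value, but can be dropped entirely when compacting at
# the largest level, where no older values can exist below it.
def compact_with_tombstones(sstables_oldest_first, is_largest_level):
    merged = {}
    for sstable in sstables_oldest_first:
        for key, value in sstable:
            merged[key] = value
    if is_largest_level:
        # No older value of any key survives below this level, so the
        # tombstones have done their job and need not be stored.
        merged = {k: v for k, v in merged.items() if v is not TOMBSTONE}
    return sorted(merged.items(), key=lambda kv: kv[0])
```

At intermediate levels the tombstone must be kept, because an older value of the key may still be sitting in a larger level waiting to be shadowed.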

The durability of the zeroth level is ensured using a write-ahead log (WAL). All the in-memory changes are also written to the WAL and the WAL is flushed to durable storage (either periodically or for each write).
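A minimal WAL sketch, under assumptions of my own (one tab-separated record per line, flushed on each write; a real WAL also handles record framing, checksums, and replay on recovery):

```python
import io
import os

# Sketch of a write-ahead log: every in-memory change is appended to
# the log before being applied, so the zeroth level can be rebuilt
# after a crash.
class WriteAheadLog:
    def __init__(self, f):
        self.f = f  # any writable text file object

    def append(self, key, value):
        self.f.write(f"{key}\t{value}\n")
        self.f.flush()  # push from the program's buffer to the OS
        try:
            os.fsync(self.f.fileno())  # flush to durable storage per write
        except (OSError, io.UnsupportedOperation):
            pass  # in-memory streams (e.g. io.StringIO) have no fd
```

Syncing on every write maximizes durability at the cost of latency; syncing periodically is the cheaper tradeoff the post mentions.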

Life of a read

Now let's consider reads. There are two categories of reads: point reads and range reads. A point read is when you want to get the value corresponding to a single key. For a point read, the LSM tree is searched from top to bottom (smallest to largest level) to find the first level where the key is present. That level holds the most recent value of the key (or the special tombstone value, indicating that the key has been deleted). A range read is when you want to get all the key-value pairs between two keys. For range reads, the LSM tree needs to merge key-value pairs from all the levels because a key belonging to the requested range could be in any of them.
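Both read paths can be sketched over a simplified representation where each level is a dict and levels are ordered smallest (newest) to largest (oldest); tombstones are ignored here for brevity:

```python
# Point read: scan levels top to bottom and return the first
# (i.e., most recent) value found for the key.
def point_read(levels, key):
    for level in levels:
        if key in level:
            return level[key]
    return None  # key not present in any level

# Range read: merge the requested key range across all levels,
# letting values from smaller (newer) levels win.
def range_read(levels, lo, hi):
    merged = {}
    for level in reversed(levels):  # oldest first, newest overwrites
        for key, value in level.items():
            if lo <= key <= hi:
                merged[key] = value
    return sorted(merged.items())
```

Note the asymmetry: a point read can stop at the first hit, while a range read must touch every level.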

It is easy to see that LSM trees are optimized for writes. Most writes complete with just an in-memory update, with the SSTable writes and compactions at larger levels performed in the background. Reads, on the other hand, need to handle the complexity that comes with inter-level (and intra-level) key overlap. There are some optimizations that LSM trees use to help reads too. The most common (almost ubiquitous) is the use of a bloom filter for each level. A bloom filter answers may-exist checks: it can tell with absolute certainty that a key does not exist in a level, but it can also say that a key may exist when it doesn't, with a configurable false positive rate (i.e., the fraction of queries for absent keys that it answers with may-exist). Bloom filters can help point queries skip levels entirely. I will discuss bloom filters further in coming posts as well.
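A toy bloom filter conveys the idea (the sizing, the hashing scheme of slicing one SHA-256 digest into several indices, and all names here are illustrative choices of mine, not how any particular store implements it):

```python
import hashlib

# Toy Bloom filter: "not present" answers are always correct;
# "may exist" answers can be false positives.
class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array packed into one big int

    def _positions(self, key):
        # Derive num_hashes bit positions from one SHA-256 digest.
        digest = hashlib.sha256(key.encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[4 * i: 4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def may_exist(self, key):
        # All positions set -> maybe present; any unset -> definitely not.
        return all(self.bits & (1 << pos) for pos in self._positions(key))
```

Before searching a level for a point read, the store consults that level's filter; a "definitely not" answer skips the level's on-disk lookup entirely. The false positive rate is tuned by choosing the number of bits and hashes.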

References

There are a bunch of papers on LSM trees (too many to list, let alone read). The first paper that introduced LSM trees is this, but I haven't actually read that one. The BigTable paper describes an LSM-tree-based storage engine, but there are a bunch of other details unrelated to LSM trees in the paper (not surprising, because it describes the entire BigTable system). LevelDB and RocksDB are the two most popular LSM tree implementations. I found the papers (and particularly their background sections) from the lineage of Stratos Idreos's group, such as Monkey, Dostoevsky, and LSM Compactionary, to be particularly helpful in understanding LSM trees.

#computer-science #dbms #lsm-trees