WiscKey & SSDs: Strengthening LSM Tree Reads

Introduction

In the last article on LSM trees, we discussed how great they are for write-heavy storage engines, but we also got a glimpse of why this speed doesn't come free of cost. Over time, the underlying hardware has progressed (from HDD to SSD), and improvements are being made to the original LSM tree design to take full advantage of the new hardware. In this article, we explore WiscKey, which is derived from one of the most popular LSM tree implementations (LevelDB) and enhances the LSM tree to exploit its full potential on SSDs.

LevelDB Internals

LevelDB is a key-value store that takes its inspiration from BigTable. In addition to the basic key-value operations, it provides other widely used features such as range queries and snapshots. We will walk through the basic architecture of LevelDB so that we can see where improvements can be made when this storage engine is deployed on an SSD.

LevelDB consists mainly of two in-memory Memtables and seven levels (L0 to L6) of on-disk Sorted String Tables (SSTables). For each update operation, LevelDB stores the key-value pair in the in-memory Memtable. It also appends the key-value pair to a log file for durability guarantees. Once the Memtable reaches its capacity, LevelDB allocates a new Memtable and starts processing update operations on it. The older Memtable is converted into an immutable data structure in the background and is then flushed to disk at level L0, becoming a new SSTable.
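
To make this write path concrete, here is a minimal sketch of the flow in Go; the names (memTable, db, flushToL0) and the flat log format are illustrative stand-ins for this article, not LevelDB's actual C++ internals:

```go
package main

import (
	"fmt"
	"os"
	"sort"
)

// memTable is a stand-in for LevelDB's skiplist-backed Memtable.
type memTable struct {
	data map[string]string
	size int
}

type db struct {
	wal      *os.File  // append-only log that makes writes durable
	active   *memTable // mutable Memtable receiving new writes
	memLimit int       // capacity that triggers a flush
}

// put appends the pair to the log first, then updates the Memtable.
func (d *db) put(key, value string) error {
	if _, err := d.wal.WriteString(key + "\t" + value + "\n"); err != nil {
		return err
	}
	d.active.data[key] = value
	d.active.size += len(key) + len(value)

	// Once the Memtable is full, freeze it, start a fresh one and flush the
	// frozen table to level L0 (LevelDB does the flush in the background).
	if d.active.size >= d.memLimit {
		frozen := d.active
		d.active = &memTable{data: map[string]string{}}
		d.flushToL0(frozen)
	}
	return nil
}

// flushToL0 sorts the frozen Memtable by key; a real engine would persist the
// sorted entries as an SSTable file at level L0.
func (d *db) flushToL0(m *memTable) {
	keys := make([]string, 0, len(m.data))
	for k := range m.data {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	fmt.Printf("flushing %d sorted keys as a new L0 SSTable\n", len(keys))
}

func main() {
	wal, _ := os.CreateTemp("", "wal")
	defer os.Remove(wal.Name())

	d := &db{wal: wal, active: &memTable{data: map[string]string{}}, memLimit: 64}
	for i := 0; i < 10; i++ {
		_ = d.put(fmt.Sprintf("key-%02d", i), "some value")
	}
}
```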

The total size of each level is limited and increases by a factor of ten with each level. Whenever a level exceeds its limit, the compaction process is triggered, merging data from that level into the next one down. Compaction works recursively until every level is within its storage limit. Another goal of compaction is to ensure that the files within a particular level don't have overlapping key ranges. This restriction is not applied to L0, as Memtables are flushed directly to this level. However, if the number of files at L0 exceeds eight, LevelDB slows down write traffic until compaction has moved some files down to L1.
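
At the heart of compaction is a sorted merge in which newer entries win over older ones for the same key. Below is a minimal sketch of that merge step (an illustration, not LevelDB's actual code):

```go
package main

import "fmt"

// entry is one key-value pair inside a sorted run (an SSTable, conceptually).
type entry struct {
	key, value string
}

// mergeRuns merges two key-sorted runs into one sorted run. When both runs
// contain the same key, the entry from `newer` wins, which is how compaction
// discards overwritten values.
func mergeRuns(newer, older []entry) []entry {
	out := make([]entry, 0, len(newer)+len(older))
	i, j := 0, 0
	for i < len(newer) && j < len(older) {
		switch {
		case newer[i].key < older[j].key:
			out = append(out, newer[i])
			i++
		case newer[i].key > older[j].key:
			out = append(out, older[j])
			j++
		default: // same key: keep the newer value, drop the older one
			out = append(out, newer[i])
			i++
			j++
		}
	}
	out = append(out, newer[i:]...)
	out = append(out, older[j:]...)
	return out
}

func main() {
	l0 := []entry{{"a", "2"}, {"c", "9"}}
	l1 := []entry{{"a", "1"}, {"b", "5"}}
	fmt.Println(mergeRuns(l0, l1)) // [{a 2} {b 5} {c 9}]
}
```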

LevelDB Issues

As we can see, LevelDB primarily suffers from high write and read amplification.

Note: Write/read amplification is the ratio between the amount of data written to (or read from) the disk and the amount of data requested by the client.

Due to the compaction process, LevelDB ends up writing far more data than the logical size of the store. As the storage limit grows by a factor of ten between levels, moving data from one level to the next can incur a write amplification of up to ten, and this cost can be paid at every level boundary the data crosses. Read amplification is also a bottleneck: in the worst case, LevelDB may need to look through every level for a key-value pair, checking up to eight files at L0 (the file limit for that level) and one file at each of the remaining levels (since keys within those levels don't overlap), i.e. fourteen files in total.
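
As a back-of-the-envelope check on those numbers, following the level layout described above (an illustration of the reasoning, not a measurement):

```go
package main

import "fmt"

func main() {
	// Worst-case number of files consulted for a single lookup.
	l0Files := 8     // L0 files can overlap, so all of them may need checking
	otherLevels := 6 // L1-L6 each contribute at most one file (no key overlap)
	fmt.Println("worst-case files checked per lookup:", l0Files+otherLevels) // 14

	// Moving data from level Li to Li+1 may rewrite up to ten times the data,
	// because Li+1 can hold ten times more than Li; a record that eventually
	// settles in a deep level can pay this cost at several boundaries.
	perLevel := 10
	boundariesCrossed := 5 // purely illustrative
	fmt.Println("illustrative compounded write amplification:", perLevel*boundariesCrossed) // 50
}
```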

SSD Power

With advancements in hardware technology and the introduction of SSDs, random reads have become far cheaper. Random reads issued concurrently can come close to sequential reads in terms of aggregate throughput, so an LSM tree implementation should leverage this boost: a random I/O is now bearable, in contrast to a random I/O on an HDD. On the other hand, SSDs degrade in terms of device lifetime when subjected to high write amplification, as an SSD's lifespan is bounded by the number of write/erase cycles a block can accept before erroring out.

WiscKey

The design of WiscKey aims to leverage these new hardware developments to improve the performance of LSM trees on SSD devices.

WiscKey’s design has the goal of fulfilling the following requirements:

  • Low write amplification

  • Low read amplification

  • SSD optimized

Key-Value Separation

The compaction process in the LSM tree loads SSTables into memory, merges them and writes the result back to disk. Keeping data sorted helps read performance: a range query (say, get the values for id=1 to 100) results in sequential access when the data is sorted, whereas it leads to random access for each of the 100 ids when it is not. So compaction is more or less a strict requirement for building an efficient storage engine.

Whenever we think about key-value stores, our initial instinct is to picture the keys and values stored together. In real-life scenarios, however, the values usually end up consuming most of the storage. Consider a scenario where we are caching responses for an API: the key will typically be a user_id that can be limited to 16 bytes (if it's a UUID), while the response can be a huge JSON blob. So our key-value storage is consumed primarily by the values, and only a fraction of it is used for the keys.

WiscKey puts forward the idea that compaction only needs the keys for sorting, not the values. The values can be kept in SSD-friendly storage called the value log (vLog for short), while the LSM tree holds the keys along with the locations of their values in the vLog. With this separation, the size of WiscKey's LSM tree shrinks drastically. A smaller LSM tree also means lower write amplification, as far less data is rewritten during compaction compared to a typical LSM tree. And since large values are no longer rewritten and erased across multiple files during compaction, SSD lifetime improves as well.

An insertion of a key-value pair results in the value first being appended to the vLog and the key then being inserted into the LSM tree along with the value's vLog address. Whenever a key is deleted, it is simply removed from the LSM tree and the vLog remains untouched; the value associated with the deleted key is garbage collected later.

For read operations, WiscKey searches for the key in the LSM tree and then reads the value from the vLog address stored with the key. Since an LSM tree of the same size can now hold far more keys than LevelDB's, a read operation touches fewer levels of tables, which improves read amplification.
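
To make the separation concrete, here is a minimal single-threaded sketch that uses an in-memory map in place of the real LSM tree and the (key-size, value-size, key, value) record layout described later in the garbage collection section; all names are illustrative and none of this is WiscKey's actual code:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"os"
)

// valuePointer is what the LSM tree stores next to each key instead of the value.
type valuePointer struct {
	offset int64  // position of the value record inside the vLog file
	size   uint32 // length of the record, so it can be read back in one call
}

type wiscKey struct {
	vlog *os.File                // append-only value log
	lsm  map[string]valuePointer // stand-in for the LSM tree of keys
	head int64                   // current end of the vLog
}

// put appends the value to the vLog first, then records its address in the "LSM tree".
func (w *wiscKey) put(key string, value []byte) error {
	record := make([]byte, 8+len(key)+len(value))
	binary.LittleEndian.PutUint32(record[0:4], uint32(len(key)))
	binary.LittleEndian.PutUint32(record[4:8], uint32(len(value)))
	copy(record[8:], key)
	copy(record[8+len(key):], value)

	if _, err := w.vlog.WriteAt(record, w.head); err != nil {
		return err
	}
	w.lsm[key] = valuePointer{offset: w.head, size: uint32(len(record))}
	w.head += int64(len(record))
	return nil
}

// get looks the key up in the LSM tree and then reads the value from its vLog address.
func (w *wiscKey) get(key string) ([]byte, error) {
	ptr, ok := w.lsm[key]
	if !ok {
		return nil, fmt.Errorf("key not found")
	}
	record := make([]byte, ptr.size)
	if _, err := w.vlog.ReadAt(record, ptr.offset); err != nil {
		return nil, err
	}
	keyLen := binary.LittleEndian.Uint32(record[0:4])
	return record[8+keyLen:], nil
}

// del only removes the key from the LSM tree; the value stays in the vLog
// until the garbage collector reclaims it.
func (w *wiscKey) del(key string) {
	delete(w.lsm, key)
}

func main() {
	f, _ := os.CreateTemp("", "vlog")
	defer os.Remove(f.Name())

	db := &wiscKey{vlog: f, lsm: map[string]valuePointer{}}
	_ = db.put("user:42", []byte(`{"name":"Ada"}`))
	v, _ := db.get("user:42")
	fmt.Println(string(v)) // {"name":"Ada"}
}
```

A real implementation would, of course, keep the key-to-address index in Memtables and SSTables exactly as described in the LevelDB section, so the same compaction machinery applies, just to much smaller entries.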

Challenges & Solutions

  1. Parallel Range Query

    Range queries are exposed through three APIs, i.e. Seek(key), Next() and Prev(). These provide the user with an iterator-based interface for scanning a range of key-value pairs.

    Separating keys from values creates a major problem for range queries. Earlier, when we queried a range of key-value pairs, sorting guaranteed that neighboring keys, and most importantly their values, sat next to each other on disk. In the new design the keys are still in sorted order, but in place of values we only have their vLog addresses, so a range query can translate into many random I/Os, hurting its performance. To deal with this problem, WiscKey makes use of the parallel I/O capabilities of SSD devices. Once the user is handed an iterator via the Seek(key) API, WiscKey starts a prefetching framework in the background. After detecting a sequential access pattern, it reads several keys ahead from the LSM tree and inserts their value addresses into a queue. A pool of threads picks up these addresses and fetches the corresponding values from the vLog concurrently. Prefetching ensures that the user doesn't experience degradation during range queries, since the values they are about to request have already been fetched into memory (a sketch of this flow appears after this list).

  2. Garbage Collection

    Garbage collection in a typical LSM tree reclaims free space (for deleted or overwritten keys) during compaction, whenever a key eligible for garbage collection is encountered. We discussed that a deletion in WiscKey only removes the key from the LSM tree. So during compaction the space used by keys is freed, but the space used by the values is not reclaimed, as they are stored separately.

    To overcome this challenge, WiscKey introduces a lightweight garbage collector (GC) responsible for reclaiming the space used by the values of deleted keys. It also introduces a new data layout for values stored in the vLog: (key-size, value-size, key, value).

    The end goal of this GC is to keep the valid values in a contiguous range of the vLog. It achieves this by keeping track of the head and tail of the vLog: the values lying between the tail and the head form the valid range and are the only ones consulted during lookups. The GC starts by reading a chunk of values from the tail of the vLog. It then queries the LSM tree to check whether the corresponding key is still present, filtering out values whose keys are gone. The surviving values are appended to the head of the vLog, after which the space occupied by the original chunk is freed (a sketch of one such GC pass appears after this list).

    To account for failures, the GC calls fsync() after moving the values to the head and updates the value addresses in the LSM tree before freeing the original chunk. The updated tail address is also stored in the LSM tree for future GC runs to resume from.

  3. Crash Consistency

    LSM trees provide atomicity guarantees for key-value insertions in case of failures. With WiscKey, where keys and values are stored separately, providing this guarantee becomes more complicated: a failure can happen after a value has been appended to the vLog but before the key and value address have been inserted into the LSM tree.

    WiscKey leverages modern file systems such as ext4 for providing atomicity guarantees. The file system guarantees that random or incomplete bytes are not appended to a file during a crash. As WiscKey adds values sequentially to the end of the vLog, if a value V is found to be lost in a failure (i.e. its key is not found in the LSM tree), then all values appended after V are also considered lost, and all of them become eligible for garbage collection.

    For maintaining consistency during lookups, WiscKey first looks up the key in the LSM tree. If the key is not found, it simply returns an error or a default value. If the key is found, WiscKey performs an additional check on the corresponding vLog entry: the value address must fall within the valid range of the vLog, and the key stored in the value record must match the key being queried. If either verification fails, the value is considered lost during a failure and the key is deleted from the LSM tree. Here the new data layout introduced for garbage collection proves useful for validation.
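
Coming back to the range-query challenge above, here is a minimal sketch of how a prefetcher might hand value addresses from the LSM tree to a pool of workers that read the vLog concurrently; the channel-based structure and names are assumptions made for illustration, not WiscKey's implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// valuePointer is the vLog address stored next to a key in the LSM tree.
type valuePointer struct {
	offset int64
	size   uint32
}

// fetched pairs a vLog offset with the value read from it.
type fetched struct {
	off int64
	val []byte
}

// readValue stands in for a positioned read (pread) against the vLog file.
func readValue(p valuePointer) []byte {
	return []byte(fmt.Sprintf("value@%d", p.offset))
}

// prefetch fans a stream of value addresses out to `workers` goroutines so
// that many random reads are in flight on the SSD at the same time.
func prefetch(addrs []valuePointer, workers int) map[int64][]byte {
	queue := make(chan valuePointer)
	results := make(chan fetched)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range queue {
				results <- fetched{p.offset, readValue(p)}
			}
		}()
	}
	go func() {
		for _, p := range addrs {
			queue <- p
		}
		close(queue)
		wg.Wait()
		close(results)
	}()

	// This cache is what the iterator's Next() would consult before falling
	// back to a synchronous vLog read.
	cache := map[int64][]byte{}
	for r := range results {
		cache[r.off] = r.val
	}
	return cache
}

func main() {
	addrs := []valuePointer{{offset: 0, size: 16}, {offset: 16, size: 16}, {offset: 32, size: 16}}
	fmt.Println(len(prefetch(addrs, 2)), "values prefetched")
}
```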
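
And for the garbage-collection challenge, here is a sketch of a single GC pass under the (key-size, value-size, key, value) layout, again with an in-memory map standing in for the LSM tree and with the fsync() and address updates only noted in comments:

```go
package main

import "fmt"

// vlogEntry mirrors the (key-size, value-size, key, value) record layout;
// the sizes are implicit here because the entries are kept as structs.
type vlogEntry struct {
	key   string
	value string
}

// gcPass reads a chunk of entries from the tail of the vLog, drops the ones
// whose keys no longer exist in the LSM tree, appends the survivors to the
// head, and returns the new vLog plus the new tail position.
// A real implementation would fsync() the appended values and update the
// value addresses (and the tail) in the LSM tree before freeing the old tail region.
func gcPass(vlog []vlogEntry, tail, chunk int, lsm map[string]bool) ([]vlogEntry, int) {
	end := tail + chunk
	if end > len(vlog) {
		end = len(vlog)
	}
	for _, e := range vlog[tail:end] {
		if lsm[e.key] { // key still live: move its value to the head
			vlog = append(vlog, e)
		}
	}
	return vlog, end // everything before the new tail is reclaimable
}

func main() {
	vlog := []vlogEntry{{"a", "1"}, {"b", "2"}, {"c", "3"}}
	lsm := map[string]bool{"a": true, "c": true} // "b" was deleted
	vlog, tail := gcPass(vlog, 0, 3, lsm)
	fmt.Println("new tail:", tail, "live entries at head:", vlog[tail:])
}
```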

Optimizations

WiscKey’s core idea of separating keys from values opens up another set of optimizations that can be implemented to enhance the performance of the storage engine.

  1. Value-Log Write Buffer

    As part of a write operation, WiscKey first appends the value to the vLog, which it does using a write() system call. To improve write performance, WiscKey buffers values in a userspace buffer and flushes the buffer to the vLog only once it reaches a size threshold. This reduces the number of system calls and, in turn, improves write performance (see the sketch after this list).

    This optimization also benefits reads: during a lookup, WiscKey first searches for the value in the buffer and only reads from the vLog if it is not found there. The trade-off is that data sitting in the buffer can be lost in a crash.

  2. LSM Tree Log

    An LSM tree appends every key-value pair to the LSM tree log so that, in case of a failure, the log can be used to recover inserted pairs that have not yet been flushed to SSTables. As WiscKey always writes the value to the vLog together with its key (in the updated data layout discussed above), the vLog can serve that purpose itself and the LSM tree log can be removed completely without any impact on the correctness of the system. This optimization avoids an extra log write for every update and in turn improves WiscKey's performance.
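
Coming back to the value-log write buffer described in the first optimization, here is a minimal sketch of threshold-based buffering; the type and method names are illustrative, not WiscKey's real API:

```go
package main

import (
	"bytes"
	"fmt"
	"os"
)

// bufferedVlog accumulates value records in userspace and only issues a
// write() once the buffer crosses a size threshold.
type bufferedVlog struct {
	file      *os.File
	buf       bytes.Buffer
	threshold int
}

// append adds a record to the buffer and flushes when the threshold is hit.
func (b *bufferedVlog) append(record []byte) error {
	b.buf.Write(record)
	if b.buf.Len() >= b.threshold {
		return b.flush()
	}
	return nil
}

// flush performs the actual write() system call for the whole buffer at once.
func (b *bufferedVlog) flush() error {
	if b.buf.Len() == 0 {
		return nil
	}
	_, err := b.file.Write(b.buf.Bytes())
	b.buf.Reset()
	return err
}

func main() {
	f, _ := os.CreateTemp("", "vlog")
	defer os.Remove(f.Name())

	v := &bufferedVlog{file: f, threshold: 4096}
	for i := 0; i < 100; i++ {
		_ = v.append([]byte(fmt.Sprintf("record-%03d;", i)))
	}
	_ = v.flush() // crash before this point and the unflushed tail of the buffer is lost
	fmt.Println("buffered vLog written to", f.Name())
}
```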

WiscKey VS LevelDB

The WiscKey paper describes microbenchmarks comparing LevelDB and WiscKey on an SSD device.

  1. Load performance

    WiscKey tends to be three times faster than LevelDB. The primary reasons are that WiscKey does not write to the LSM tree log and that it buffers vLog writes. Under increasing load, LevelDB's throughput stays low as a major part of the bandwidth is consumed by compaction, which slows down write operations (when the number of SSTable files at L0 exceeds eight). WiscKey's throughput is comparatively higher because its LSM tree is significantly smaller, storing only keys and not values.

  2. Query performance

    Despite checking both the LSM tree and the vLog for a random lookup, WiscKey ends up with better throughput than LevelDB. This is because LevelDB suffers from high read amplification, needing to search every level in the worst case, whereas WiscKey has lower read amplification thanks to its smaller LSM tree. WiscKey's design of leveraging parallel reads on SSDs also contributes to its high throughput for random reads.

  3. Garbage collection

    Garbage collection in WiscKey does not become a bottleneck for write operations the way compaction does in LevelDB, since the GC only reads from the tail of the vLog and writes any valid values to the head. In benchmarks run while garbage collection is active, WiscKey is still at least 70 times faster than LevelDB.

  4. Crash consistency

    For crash-consistency benchmarking, a tool named ALICE is used to simulate a set of crashes and check for inconsistencies. The tool does not report any consistency issues introduced by WiscKey. Comparing recovery times, LevelDB takes 0.7 seconds whereas WiscKey takes 2.6 seconds. WiscKey's slower recovery stems from the fact that recovery depends on the head address persisted on disk, and a crash can happen just before that address is updated. The recovery time can be improved by persisting the head address to disk more frequently.

  5. Space amplification

    Another area the WiscKey paper considers while benchmarking is space amplification, i.e. the ratio of the actual size of the database on disk to its logical size. WiscKey consumes more space than LevelDB in exchange for lower I/O amplification. Most storage solutions have to compromise on one of these amplification factors to improve the others; LevelDB, for instance, trades higher write amplification for lower space amplification.

Conclusion

With increasing scalability requirements, key-value storage engines play a major role in building robust and scalable applications. LSM trees tend to be the primary data structure for write-heavy applications, and the work done to make them more performant keeps opening up interesting ideas, ranging from how the data is structured to how the underlying hardware affects the performance of the storage engine, as we saw in the case of WiscKey.

Also, these ideas are not only theoretical but make their way into actual production systems. The WiscKey paper forms the foundation of a key-value store called Badger, which acts as the database for Dgraph (a distributed graph database).

You can check a simple implementation of the WiscKey approach over LevelDB here!

References