In this article, we will delve into the research paper on Bitcask through a code walkthrough. The paper is succinct, offering a high-level overview of the technology while also providing the essential components needed to develop a functional implementation.
Bitcask originates from Riak, a NoSQL key-value database. Each node in a Riak cluster employs a pluggable local key-value storage engine, designed to achieve the following objectives:
Low latency for read and write operations
Robust crash recovery with minimal latency
Intuitive data format
Simplified data backup processes
Bitcask successfully meets these requirements, resulting in a solution that is both easy to comprehend and implement. Let's explore the core design of Bitcask.
Key-Value Store Operations
From a data format perspective, Bitcask is straightforward to comprehend. Visualize Bitcask as a directory within your file system, where all data is stored and only a single process can write to it at any given time. There is an active file to which data is appended. Handling a write request is as simple as appending a record to this file. Since the file's contents are never updated in place but only appended, the write request is processed with minimal delay.
When the active file reaches a predetermined size, a new active file is created within the same directory, and the previous active file becomes immutable. You can read from this file, but no modifications are permitted. In code, it appears as follows:
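(What follows is a minimal sketch rather than the original implementation; the `activeFile` handle, `directory`, `MAX_FILE_SIZE`, and `newFileId()` are illustrative names.)

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Append a serialized record, rolling over to a fresh active file once the
// current one grows past a size threshold.
public void append(byte[] record) throws IOException {
    if (activeFile.length() + record.length > MAX_FILE_SIZE) {
        activeFile.close(); // the previous active file is now immutable
        activeFile = new RandomAccessFile(
                new File(directory, newFileId() + ".data"), "rw");
    }
    activeFile.seek(activeFile.length()); // always append at the end
    activeFile.write(record);
}
```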
Another scenario that triggers the creation of a new file is when the database server is shut down and subsequently restarted. Specifically, when the server goes offline, the active file is designated as immutable. Upon restarting, the server initiates a new file, marking it as active.
The FileRecord appended to the active file follows this format:
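(A sketch of the record as a Java class; the field names mirror the layout described below, while the Java types are my assumption.)

```java
// On-disk record layout, in order: CRC, timestamp, key size, value size,
// key bytes, value bytes.
public class FileRecord {
    int    crc;       // CRC32 checksum for detecting corruption
    long   timestamp; // creation time of the record
    int    keySize;   // size of the key in bytes
    int    valueSize; // size of the value in bytes
    byte[] key;       // the actual key
    byte[] value;     // the actual value
}
```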
The attributes of a FileRecord are as follows (a concrete encoding sketch follows the list):
CRC (Cyclic Redundancy Check): This is an error-detecting code added to each record. It helps identify any data corruption that may occur during storage.
Timestamp: The time when the record is created.
Key Size: The size of the key in bytes.
Value Size: The size of the value in bytes.
Key: The actual key.
Value: The actual value.
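To make the layout concrete, here is one possible encoding with fixed-width, big-endian fields; the exact widths are my assumption, since the paper does not prescribe them:

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Serialize a record as: crc(4) | timestamp(8) | keySize(4) | valueSize(4)
// | key | value. The CRC covers everything after the CRC field itself.
static byte[] encode(long timestamp, byte[] key, byte[] value) {
    ByteBuffer body = ByteBuffer.allocate(8 + 4 + 4 + key.length + value.length);
    body.putLong(timestamp).putInt(key.length).putInt(value.length)
        .put(key).put(value);
    CRC32 crc = new CRC32();
    crc.update(body.array());
    return ByteBuffer.allocate(4 + body.capacity())
            .putInt((int) crc.getValue()).put(body.array()).array();
}
```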
To delete a key, we generate a new FileRecord with the value set to a tombstone value (a unique value unlikely to be assigned by a user, such as a UUID string). When reading, if a tombstone value is found for a key, it indicates that the key is marked for deletion.
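A deletion can then be sketched as just another append; the concrete `TOMBSTONE` value below is made up for illustration, and `encode` and `append` are the helpers sketched earlier:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// A sentinel value no user is expected to store; the paper suggests
// something UUID-like, and this string is purely illustrative.
static final byte[] TOMBSTONE =
        "bitcask-tombstone".getBytes(StandardCharsets.UTF_8);

public void delete(byte[] key) throws IOException {
    // Deletion is just another append; space is reclaimed at compaction.
    append(encode(System.currentTimeMillis(), key, TOMBSTONE));
}
```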
Having discussed how write requests are handled, let's examine how Bitcask processes read operations. Our main objective is to achieve fast reads with minimal disk access. The Bitcask server maintains an in-memory hash table called KeyDir, where each key is mapped to a structure like the one below:
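(Sketched in Java; the field names mirror the ones described in the next paragraph, and the Java types are my assumption.)

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// One KeyDir entry per key, pointing at the latest record for that key.
public class KeyDirEntry {
    String fileId;    // file_id: file holding the latest record for the key
    int    valueSz;   // value_sz: size of the value in bytes
    long   valuePos;  // value_pos: byte offset of the value within the file
    long   timestamp; // timestamp of the corresponding FileRecord

    KeyDirEntry(String fileId, int valueSz, long valuePos, long timestamp) {
        this.fileId = fileId;
        this.valueSz = valueSz;
        this.valuePos = valuePos;
        this.timestamp = timestamp;
    }
}

// The KeyDir itself: an in-memory hash table from key to entry.
Map<String, KeyDirEntry> keyDir = new ConcurrentHashMap<>();
```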
For each key, we store details about the file where the key was last modified using file_id, the size of the value as value_sz, the position of the value in the file as value_pos, and the timestamp of the FileRecord. To execute a read operation, we move the file pointer to value_pos and read bytes equal to value_sz to obtain the value. This method requires only one disk seek, and the file system's read-ahead cache further minimizes read latency.
To implement the seek functionality, we can utilize the RandomAccessFile API, which offers a seek method. With the necessary data from KeyDir, reading the value from the specified file is straightforward:
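(A minimal read path building on the sketches above; the `.data` naming and the `directory` field carry over from those sketches and are assumptions.)

```java
// Look up the key in KeyDir, seek to value_pos, and read value_sz bytes.
public byte[] get(String key) throws IOException {
    KeyDirEntry entry = keyDir.get(key);
    if (entry == null) return null; // key never written (or compacted away)
    try (RandomAccessFile file = new RandomAccessFile(
            new File(directory, entry.fileId + ".data"), "r")) {
        file.seek(entry.valuePos);          // the single disk seek
        byte[] value = new byte[entry.valueSz];
        file.readFully(value);              // read exactly value_sz bytes
        return value;                       // (tombstone check omitted here)
    }
}
```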
Recovery
In the previous section, we discussed the efficiency of the read function, enabled by an in-memory hash table called KeyDir. However, what occurs if the server shuts down and restarts? The in-memory structure would be lost, leaving us without information about the keys or their associated files. This challenge is addressed through a recovery process, where files are read from the Bitcask directory in the order of their creation, beginning with the most recent.
We initiate the recovery by reading from the newest file to the oldest, reconstructing the in-memory structure. When we encounter a key for the first time, we treat the associated record as the most recent, as file records are appended and never modified in place. The most recent file record reflects the latest update to that key before the server shutdown. The database resumes handling key-value store operations only after the recovery process is fully completed.
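A rough sketch of that scan, where `DataFile`, `LocatedRecord`, `listFilesNewestFirst`, and `readRecordsNewestFirst` are hypothetical helpers (tombstone handling is omitted for brevity):

```java
// Rebuild KeyDir by scanning files newest-to-oldest, and each file's
// records from last-appended to first, so the first record seen for a
// key is its latest update and must not be overwritten.
void recover() throws IOException {
    for (DataFile dataFile : listFilesNewestFirst(directory)) {
        for (LocatedRecord loc : dataFile.readRecordsNewestFirst()) {
            String key = new String(loc.record.key, StandardCharsets.UTF_8);
            keyDir.putIfAbsent(key, new KeyDirEntry(
                    dataFile.id(), loc.record.valueSize,
                    loc.valuePos, loc.record.timestamp));
        }
    }
}
```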
Compaction
A significant challenge that arises when running Bitcask as previously described is the accumulation of numerous files containing updates for the same key. Although only the most recent file with the relevant record is needed, the recovery process must traverse many files, which increases the startup time for database servers as the file count grows.
To address this, we initiate a background compaction process that consolidates all non-active immutable files, producing output files with the latest updates for all keys. This compaction process also updates the in-memory KeyDir with details about the merged files. By performing compaction regularly, we maintain an efficient recovery process, enabling the database server to restart swiftly, even with a large number of non-active files.
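One merge pass might look roughly like the sketch below; `DataFile`, `LocatedRecord`, `isLatest`, and `isTombstone` are hypothetical helpers, and crash safety during the swap is omitted:

```java
import java.util.List;

// Copy only live records (latest version, not deleted) from the immutable
// files into a fresh merged file, then repoint KeyDir and drop the originals.
void compact(List<DataFile> immutableFiles) throws IOException {
    DataFile merged = DataFile.create(directory);
    for (DataFile dataFile : immutableFiles) {
        for (LocatedRecord loc : dataFile.readRecords()) {
            String key = new String(loc.record.key, StandardCharsets.UTF_8);
            if (isLatest(key, loc) && !isTombstone(loc.record)) {
                long valuePos = merged.append(loc.record); // value's new offset
                keyDir.put(key, new KeyDirEntry(merged.id(),
                        loc.record.valueSize, valuePos, loc.record.timestamp));
            }
        }
    }
    for (DataFile dataFile : immutableFiles) {
        dataFile.delete(); // reclaim space from superseded records
    }
}
```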
The original paper augments this process with a hint file written alongside each merged file, which records the position and size of the values for its keys. I have implemented compaction differently, merging multiple files into a single file without using hint files. There are various edge cases to consider with this approach, and you can review the merge method to see how I addressed them.
With simple components like the ones described above, Bitcask ends up being a viable option for a latency-sensitive key-value store. Keep learning! :)