Hadoop Magazine

Cassandra: internal storage

Nowadays there are many NoSQL solutions, each with advantages and disadvantages that affect architectural decisions depending on project needs.

One of Cassandra's biggest advantages is its write speed, which makes it an excellent choice for use cases such as storing huge amounts of logs, transactions, and other data that is written far more often than it is read. Developers frequently ask how Apache Cassandra's storage works internally: which algorithms and data structures make writes (and many other features) so efficient. This article describes exactly that.

Log-structured merge-tree

One of the most popular examples where NoSQL can be the more effective solution is a high-performance transaction system, such as a banking system that stores all users' transactions.

Transaction system applications typically insert rows in a relevant table to provide an activity trace; at the same time the transaction system generates log records for purposes of system recovery. Both types of generated information can benefit from efficient indexing.

Figure 1: Behavior of typical transaction system


Unfortunately, a standard disk-based index structure such as the B-tree (used in MySQL, for example) will effectively double the I/O cost of the transaction to maintain such an index in real time, increasing the total system cost. Clearly a method for maintaining a real-time index at low cost is desirable. The Log-Structured Merge-tree (LSM-tree) is a disk-based data structure designed to provide low-cost indexing for a file experiencing a high rate of record inserts (and deletes) over an extended period. The LSM-tree uses an algorithm that defers and batches index changes, cascading the changes from a memory-based component through one or more disk components in an efficient manner reminiscent of merge sort. During this process all index values are continuously accessible to retrievals (aside from very short locking periods), either through the memory component or one of the disk components.

An LSM-tree is composed of two or more tree-like component data structures. A two component LSM-tree has a smaller component, which is entirely memory resident, known as the C0 tree (or C0 component), and a larger component, which is resident on disk, known as the C1 tree (or C1 component). Although the C1 component is disk resident, frequently referenced page nodes in C1 will remain in memory buffers as usual (buffers not shown), so that popular high level directory nodes of C1 can be counted on to be memory resident.
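
To make this layout concrete, Listing 1 is a minimal sketch in Java (illustrative only; the class and field names are assumptions, not Cassandra's or the paper's actual code) that models C0 as an in-memory sorted map and C1 as a key-ordered file on disk.

Listing 1: Two-component LSM-tree layout (illustrative sketch)

import java.nio.file.Path;
import java.util.TreeMap;

// Minimal sketch of the two-component layout described above
// (illustrative only, not Cassandra's actual classes).
public class TwoComponentLsm {
    // C0: the smaller, entirely memory-resident component, kept sorted by key.
    final TreeMap<String, String> c0 = new TreeMap<>();

    // C1: the larger, disk-resident component; modeled here as a file whose
    // entries are stored in key order, analogous to a sequentially optimized B-tree.
    final Path c1Path = Path.of("/tmp/lsm-c1.dat");

    // When C0 grows past this size, entries are migrated to C1 by the rolling merge.
    static final int C0_FLUSH_THRESHOLD = 10_000;
}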

Inserts in Log-structured merge-tree

As each new row is generated, a log record to recover this insert is first written to the sequential log file in the usual way. The index entry for the row is then inserted into the memory-resident C0 tree, after which it will in time migrate out to the C1 tree on disk; any search for an index entry will look first in C0 and then in C1.

The C1 tree has a comparable directory structure to a B-tree, but is optimized for sequential disk access. Unlike the C1 tree, the C0 tree is not expected to have a B-tree-like structure.

When C0 grows close to its maximum allowed size, entries are migrated from it to C1 by an ongoing process called the rolling merge, which proceeds in a series of merge steps. A read of a multi-page block containing leaf nodes of the C1 tree makes a range of entries in C1 buffer resident. Each merge step then reads a disk-page-sized leaf node of the C1 tree buffered in this block, merges entries from the leaf node with entries taken from the leaf level of the C0 tree, thus decreasing the size of C0, and creates a newly merged leaf node of the C1 tree.
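
Listing 2 sketches the insert path and a single, heavily simplified merge step, assuming both components hold plain key/value strings and that the merge rewrites the whole C1 file; the real algorithm works block by block on C1 leaf nodes and never rewrites the entire component at once.

Listing 2: Insert path and a simplified merge step (illustrative sketch)

import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Illustrative insert path and rolling-merge step for a two-component LSM-tree.
public class LsmInsertSketch {
    private final TreeMap<String, String> c0 = new TreeMap<>();  // memory component
    private final Path c1 = Path.of("/tmp/lsm-c1.dat");          // disk component: one "k=v" line per entry, kept sorted
    private final Path log = Path.of("/tmp/lsm-wal.log");        // sequential recovery log

    public void insert(String key, String value) throws IOException {
        // 1. Append a recovery record to the sequential log file.
        Files.writeString(log, key + "=" + value + System.lineSeparator(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        // 2. Insert the entry into the memory-resident C0 tree.
        c0.put(key, value);
        // 3. Defer the expensive disk write; migrate entries in batches later.
        if (c0.size() >= 10_000) rollingMerge();
    }

    // One simplified merge step: merge all of C0 with the sorted C1 file
    // (the paper merges leaf nodes block by block; merging whole files keeps the sketch short).
    void rollingMerge() throws IOException {
        TreeMap<String, String> merged = new TreeMap<>();
        if (Files.exists(c1)) {
            for (String line : Files.readAllLines(c1)) {
                int i = line.indexOf('=');
                merged.put(line.substring(0, i), line.substring(i + 1));
            }
        }
        merged.putAll(c0);                    // newer C0 entries win
        List<String> out = new ArrayList<>();
        merged.forEach((k, v) -> out.add(k + "=" + v));
        Files.write(c1, out);                 // rewrite C1 sequentially, still in key order
        c0.clear();                           // C0 shrinks as entries migrate out
    }
}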

Finds in Log-structured merge-tree

When an exact-match find or range find requiring immediate response is performed through the LSM-tree index, first the C0 tree and then the C1 tree is searched for the value or values desired. This may imply a slight CPU overhead compared to the B-tree case, since two directories may need to be searched. In LSM-trees with more than two components, there may also be an I/O overhead.
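
Listing 3 sketches the lookup order under the same simplified model: C0 is consulted first, so the freshest value wins, and C1 is read only when the memory component misses (a real implementation would navigate the C1 directory pages rather than scan the whole file).

Listing 3: Exact-match find across C0 and C1 (illustrative sketch)

import java.io.IOException;
import java.nio.file.*;
import java.util.TreeMap;

// Illustrative exact-match find: search the memory component first, then the disk component.
public class LsmFindSketch {
    private final TreeMap<String, String> c0 = new TreeMap<>();
    private final Path c1 = Path.of("/tmp/lsm-c1.dat");

    public String find(String key) throws IOException {
        // 1. The memory-resident C0 tree may already hold the newest value.
        String inMemory = c0.get(key);
        if (inMemory != null) return inMemory;

        // 2. Otherwise fall back to the disk-resident C1 component
        //    (a real implementation walks the C1 directory pages, not the whole file).
        if (Files.exists(c1)) {
            for (String line : Files.readAllLines(c1)) {
                if (line.startsWith(key + "=")) return line.substring(key.length() + 1);
            }
        }
        return null; // not present in either component
    }
}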

Deletes in log-structured merge-tree

When an indexed row is deleted, if a key value entry is not found in the appropriate position in the C0 tree, a delete node entry can be placed in that position, also indexed by the key value, but noting an entry Row ID (RID) to delete. The actual delete can be done at a later time during the rolling merge process, when the actual index entry is encountered.
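
Listing 4 sketches this deferred delete: no disk I/O happens at delete time, only a tombstone entry in C0. The sentinel value used as a tombstone is an assumption for illustration.

Listing 4: Deferred delete via a tombstone entry (illustrative sketch)

import java.util.TreeMap;

// Illustrative deferred delete: a tombstone is recorded in C0 and resolved at merge time.
public class LsmDeleteSketch {
    // A sentinel standing in for a "delete node entry" (assumed marker, not a real format).
    static final String TOMBSTONE = "__TOMBSTONE__";

    private final TreeMap<String, String> c0 = new TreeMap<>();

    public void delete(String key) {
        // No disk I/O here: the actual removal happens later, when the rolling
        // merge encounters the real index entry in C1 and sees the tombstone.
        c0.put(key, TOMBSTONE);
    }

    // During the merge, an entry paired with a tombstone is simply not written out.
    static boolean shouldKeep(String mergedValue) {
        return !TOMBSTONE.equals(mergedValue);
    }
}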

Log-structured merge-tree implementation: Memtables and SSTable

Writes are batched in memory and periodically written to disk to a persistent table structure called an SSTable (sorted string table). Memtables and SSTables are maintained per column family. Memtables are organized in sorted order by row key and flushed to SSTables sequentially (no random seeking as in relational databases).

Figure 2: New Row insertion

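Listing 5 sketches a memtable flush under the assumption that a memtable is simply a sorted map from row key to serialized columns; the point is that the resulting disk write is purely sequential because the data is already in row-key order. The file name merely mimics the naming pattern shown below and is not produced by Cassandra itself.

Listing 5: Flushing a memtable to an SSTable-like file (illustrative sketch)

import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.ConcurrentSkipListMap;

// Illustrative memtable flush: in-memory writes are batched in key order
// and written out sequentially as an immutable SSTable-like file.
public class MemtableFlushSketch {
    // Memtable: organized in sorted order by row key (concurrent, since many writers append).
    private final ConcurrentSkipListMap<String, String> memtable = new ConcurrentSkipListMap<>();
    private int generation = 0;

    public void write(String rowKey, String columns) {
        memtable.put(rowKey, columns);          // memory-only; no random disk seek
    }

    public Path flush() throws IOException {
        // The file name loosely mimics ks-cf-version-generation-Data.db, purely for illustration.
        Path sstable = Path.of("/tmp/ks1-cf1-hc-" + (++generation) + "-Data.db");
        List<String> lines = new ArrayList<>();
        memtable.forEach((k, v) -> lines.add(k + "\t" + v));  // already sorted by row key
        Files.write(sstable, lines);            // one sequential write, then never modified again
        memtable.clear();
        return sstable;
    }
}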

SSTables are immutable (they are not written to again after they have been flushed). This means that a row is typically stored across multiple SSTable files. At read time, a row must be combined from all SSTables on disk (as well as unflushed memtables) to produce the requested data. To optimize this piecing-together process, Cassandra uses an in-memory structure called a bloom filter. Each SSTable has a bloom filter associated with it. The bloom filter is used to check if a requested row key exists in the SSTable before doing any disk seeks.
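
Listing 6 is a toy bloom filter illustrating that check: it can answer "definitely not in this SSTable" without touching disk, at the price of occasional false positives. The bit-array size and hash functions below are arbitrary; Cassandra sizes its filters from the number of rows and a target false-positive rate.

Listing 6: Bloom filter check before a disk seek (illustrative sketch)

import java.util.BitSet;

// Tiny illustrative bloom filter: answers "definitely not present" or "maybe present"
// without touching disk. Constants are arbitrary, not Cassandra's.
public class SSTableBloomFilterSketch {
    private static final int SIZE = 1 << 20;        // ~1M bits
    private static final int NUM_HASHES = 3;
    private final BitSet bits = new BitSet(SIZE);

    private static int hash(String rowKey, int seed) {
        int h = rowKey.hashCode() * 31 + seed * 0x9E3779B9;
        return (h & 0x7FFFFFFF) % SIZE;
    }

    // Called for every row key as the SSTable is written.
    public void add(String rowKey) {
        for (int i = 0; i < NUM_HASHES; i++) bits.set(hash(rowKey, i));
    }

    // Called before any disk seek: a false result means the SSTable can be skipped entirely.
    public boolean mightContain(String rowKey) {
        for (int i = 0; i < NUM_HASHES; i++) {
            if (!bits.get(hash(rowKey, i))) return false;
        }
        return true;
    }
}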

Cassandra now writes column families to disk using this directory and file naming format:

/var/lib/cassandra/data/ks1/cf1/ks1-cf1-hc-1-Data.db

Cassandra creates a subdirectory for each column family, which allows a developer or admin to symlink a column family to a chosen physical drive or data volume. This provides the ability to move very active column families to faster media, such as SSDs, for better performance, and also to divvy up column families across all attached storage devices for better I/O balance at the storage layer.

In the background, Cassandra periodically merges SSTables together into larger SSTables using a process called compaction. Compaction merges row fragments together, removes expired tombstones (deleted columns), and rebuilds primary and secondary indexes. Since the SSTables are sorted by row key, this merge is efficient (no random disk I/O).

Once a newly merged SSTable is complete, the input SSTables are marked as obsolete and eventually deleted by the JVM garbage collection (GC) process. However, during compaction, there is a temporary spike in disk space usage and disk I/O.

Compaction

Compaction impacts read performance in two ways. While a compaction is in progress, it temporarily increases disk I/O and disk utilization which can impact read performance for reads that are not fulfilled by the cache. However, after a compaction has been completed, off-cache read performance improves since there are fewer SSTable files on disk that need to be checked in order to complete a read request.
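
Listing 7 sketches what a compaction pass conceptually does, assuming each SSTable is a file of row-key-sorted, tab-separated lines and that tombstones are a sentinel value: merge the inputs, let the newest fragment win, drop tombstones, and write one new sorted file sequentially. Cassandra's real compaction strategies, index rebuilding, and tombstone expiry handling are considerably richer.

Listing 7: Merging SSTables during compaction (illustrative sketch)

import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Illustrative compaction: merge several row-key-sorted SSTable-like files into one,
// letting the newest fragment win and dropping tombstones.
public class CompactionSketch {
    static final String TOMBSTONE = "__TOMBSTONE__";   // assumed sentinel, not Cassandra's format

    // Inputs are listed oldest first, so later files overwrite earlier fragments.
    public static Path compact(List<Path> inputs, Path output) throws IOException {
        TreeMap<String, String> merged = new TreeMap<>();   // stays sorted by row key
        for (Path sstable : inputs) {
            for (String line : Files.readAllLines(sstable)) {
                int tab = line.indexOf('\t');
                merged.put(line.substring(0, tab), line.substring(tab + 1));
            }
        }
        // Drop tombstones whose grace period has passed (here: drop all of them).
        merged.values().removeIf(TOMBSTONE::equals);

        List<String> lines = new ArrayList<>();
        merged.forEach((k, v) -> lines.add(k + "\t" + v));
        Files.write(output, lines);                          // one sequential write, no random I/O
        // The input files are now obsolete and can be deleted once no reader needs them.
        return output;
    }
}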

Find in SSTable

When a read request for a row comes in to a node, the row must be combined from all SSTables on that node that contain columns from the row in question, as well as from any unflushed memtables, to produce the requested data. To optimize this piecing-together process, Cassandra uses an in-memory structure called a bloom filter: each SSTable has a bloom filter associated with it that is used to check if any data for the requested row exists in the SSTable before doing any disk I/O. As a result, Cassandra is very performant on reads when compared to other storage systems, even for read-heavy workloads.

Figure 3: Find request for Row

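Listing 8 sketches that reconciliation: for each candidate SSTable the bloom filter is consulted first, and column fragments from the surviving SSTables and the memtable are merged with the newest timestamp winning. The Column and SSTableStub types are simplified stand-ins, not Cassandra's internal classes.

Listing 8: Combining row fragments at read time (illustrative sketch)

import java.util.*;
import java.util.function.Predicate;

// Illustrative read path: combine a row's column fragments from the memtable and
// from every SSTable whose bloom filter says the key might be present.
public class ReadPathSketch {
    record Column(String name, String value, long timestamp) {}

    record SSTableStub(Predicate<String> bloomFilter, Map<String, List<Column>> rows) {}

    public static Map<String, Column> read(String rowKey,
                                           Map<String, List<Column>> memtable,
                                           List<SSTableStub> sstables) {
        Map<String, Column> result = new HashMap<>();
        for (SSTableStub sstable : sstables) {
            if (!sstable.bloomFilter().test(rowKey)) continue;   // skip: no disk seek needed
            merge(result, sstable.rows().get(rowKey));
        }
        merge(result, memtable.get(rowKey));                     // unflushed writes checked last
        return result;                                           // the reconciled row
    }

    private static void merge(Map<String, Column> result, List<Column> fragment) {
        if (fragment == null) return;
        for (Column c : fragment) {
            Column existing = result.get(c.name());
            // The fragment with the newest timestamp wins for each column name.
            if (existing == null || existing.timestamp() < c.timestamp()) result.put(c.name(), c);
        }
    }
}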

Deletes in SSTable

1) Deleted data is not immediately removed from disk. Data that is inserted into Cassandra is persisted to SSTables on disk. Once an SSTable is written, it is immutable (the file is not updated by further DML operations). This means that a deleted column is not removed immediately. Instead a marker called a tombstone is written to indicate the new column status. Columns marked with a tombstone exist for a configured time period (defined by the gc_grace_seconds value set on the column family), and then are permanently deleted by the compaction process after that time has expired.

2) A deleted column can reappear if routine node repair is not run. Marking a deleted column with a tombstone ensures that a replica that was down at the time of delete will eventually receive the delete when it comes back up again. However, if a node is down longer than the configured time period for keeping tombstones (defined by the gc_grace_seconds value set on the column family), then the node can possibly miss the delete altogether, and replicate deleted data once it comes back up again. To prevent deleted data from reappearing, administrators must run regular node repair on every node in the cluster (by default, every 10 days).

3) The row key for a deleted row may still appear in range query results. When you delete a row in Cassandra, all columns for that row key are marked with a tombstone. Until those tombstones are cleared by compaction, you have an empty row (a row key with no columns). These deleted keys can show up in the results of get_range_slices() calls. If your client application performs range queries on rows, you may want to have it filter out row keys that return empty column lists, as sketched below.
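
Since get_range_slices() can return keys whose column list is empty, the client can simply drop those entries. Listing 9 sketches such a filter, assuming the result has already been materialized as a map from row key to its columns; the real Thrift and driver result types differ.

Listing 9: Dropping empty rows from a range query result (illustrative sketch)

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative client-side filter: drop "ghost" row keys whose tombstoned columns
// left them with an empty column list in a range query result.
public class RangeSliceFilterSketch {
    public static <K, C> Map<K, List<C>> dropEmptyRows(Map<K, List<C>> rangeResult) {
        return rangeResult.entrySet().stream()
                .filter(e -> e.getValue() != null && !e.getValue().isEmpty())
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }
}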

Conclusion

A B-tree, because it has popular directory nodes buffered in memory, is really a hybrid data structure, which combines the low cost of disk media storage for the majority of the data with the high cost of memory accessibility for the most popular data. The LSM-tree extends this hierarchy to more than one level and incorporates the advantage of merge I/O in performing multi-page disk reads.

The LSM-tree entries could themselves contain records rather than Row IDs pointing to records elsewhere on disk. This means that the records themselves can be clustered by their key value.

On the web

  • http://en.wikipedia.org/wiki/B-tree – wikipedia article about B-tree data structure,
  • http://www.datastax.com/dev/blog/schema-in-cassandra-1-1 – blog entry with example of data schema within Cassandra,
  • http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.44.2782 – Log-Structured Merge-Tree white paper,
  • http://www.datastax.com/docs/1.1/index – Cassandra documentation.

About the author
SERGEY ENIN

Sergey Enin is a software engineer absorbed with BigData problems, mostly working with Cassandra and Ruby. Sergey graduated from Belarus State University with a degree in Computer Science and currently holds the position of Software Engineering Team Leader at EPAM Systems.

May 5, 2014