Hadoop Magazine

COTS to Cassandra

At NASA’s Advanced Supercomputing Center, we have been running a pilot project for about eighteen months using Apache Cassandra to store IT security event data. While it’s not certain that a Cassandra-based solution will eventually go into production, I’d like to share my experiences from the journey.

 

Getting Started

During the summer of 2011, I began looking for a replacement for a commercial event management solution, one that would handle our anticipated growth in data, improve ingestion and retrieval times, and eliminate a single point of failure in the data store. My team and I researched various approaches, including traditional relational database technologies, and concluded that the resources required to reimplement many of Cassandra’s native features (time-to-live, snapshotting, no single point of failure) in another technology would take away from our goal of quickly solving this problem and moving on to the next one.

Throughout that fall, as time allowed, my team and I set up a small three-node Cassandra cluster on a virtualized server and learned by doing. Using Python and Pycassa, I wrote a data parser and several sample query routines against time series data so we could understand how Cassandra behaved. We added and dropped keyspaces, modified column families, let data expire and then reloaded it, and shut down virtual interfaces at critical times in an attempt to see how Cassandra would handle failure. I came away mostly impressed, but cautious about how little we understood data modeling and how much effort would be required to parallelize our code for better performance.
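For the curious, those early experiments looked roughly like the sketch below. It is a minimal reconstruction rather than our actual parser: the keyspace, column family, host names, and 90-day TTL are illustrative (Python 2, as Pycassa targeted at the time).

import json
import time
import uuid

import pycassa

# Connect to the small three-node test cluster (host names are illustrative).
pool = pycassa.ConnectionPool('security_test',
                              ['cass1:9160', 'cass2:9160', 'cass3:9160'])
events = pycassa.ColumnFamily(pool, 'events')

# Insert one parsed event as a JSON blob, keyed by a one-minute time bucket,
# with a TTL so the rolling window expires on its own.
row_key = time.strftime('%Y-%m-%d %H:%M')
events.insert(row_key,
              {str(uuid.uuid4()): json.dumps({'src': 'a.b.c.d', 'msg': 'denied'})},
              ttl=90 * 86400)

# Read back everything recorded in that minute.
for name, blob in events.get(row_key, column_count=10000).items():
    print name, json.loads(blob)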

We took our data points, timings, and overall impressions and made a pitch to NASA that Cassandra mapped well to our use cases and our environment. Some of the highlights were:

  • our environment was tailor-made for Cassandra, being heavily write-centric with far fewer reads

  • replication/availability is a core Cassandra capability

  • data expiration via TTL timestamps on all inserted data would be immensely helpful in keeping an accurate rolling window

  • flexibility in organizing the data

Perhaps the most important thing I tried to convey was that Cassandra was an enabling technology that allowed us to ask increasingly complex questions of our data.

Deployment

As it came time to order the hardware, I decided that we would go with five physical servers divided into various virtual machines.

  • 3 2U servers for Cassandra (48GB RAM, 24 cores, 2 SATA 500GB drives, 6 SAS 600GB drives, RAID card)

  • 1 2U server to handle all the data ingestion

  • 1 2U server to handle our analytics

During the Christmas and New Year’s holidays, a time when most offices slow down, one member of my team and I got busy with performance testing. We ran through several different Cassandra configurations and benchmarks before settling on our final solution of three virtual machines per physical server. We based our benchmark configuration on similar tests performed by the Netflix team, using the stress tool shipped with Cassandra 1.0.6:

$ cassandra-stress -D "nodes" -e ONE -l 3 -t 150 -o INSERT -c 10 -r -i 1 -n 2000000

  • "nodes" was a comma-separated list of all cluster IPs

  • consistency level of one

  • replication factor of three

  • one hundred fifty threads

  • perform a write operation

  • ten columns per key

  • random key generator

  • report progress in one second intervals

  • two million keys

 Our cassandra.yaml file was configured with two changes from the default. Aside from the seed provider, per the documentation we changed:

concurrent_writes: 56

We first tested two virtual machines per physical server with fully virtualized disks (the blue line). We noticed that our operations fluctuated quite dramatically, which we assumed was the disks competing for throughput. Next we tried three virtual machines per physical server, assuming going in that we’d get worse performance since the disks had another image to keep up with (the green line). Finally, we de-virtualized the I/O and bound two disks together in a RAID 0 for each virtual machine (the red line). As you can see, this increased our operations per second and smoothed the overall curve, so I knew we were making progress.

[Figure: operations per second for each configuration: two VMs per server with virtualized disks (blue), three VMs per server with virtualized disks (green), three VMs per server with RAID 0 disks (red)]

Here is the corresponding latency graph for each configuration. Again, the red line has the lowest overall latency while remaining the smoothest.

[Figure: latency for each configuration]

Our final virtual machine configuration is as follows:

  • Xen 4.1.2

  • Gentoo Linux (kernel 2.6.38)

  • 7 cores

  • 15 GB RAM

  • 1.2 TB disk (2 SAS disks per VM in a RAID 0)

Each hypervisor is configured with:

  • Xen 4.1.2

  • Gentoo Linux (kernel 2.6.38)

  • 3 cores

  • 512MB RAM

  • 500 GB disk (2 SATA disks in a RAID 1)

I chose not to run bare metal, as several talks with the Cassandra developers on Freenode’s #cassandra channel made it clear that I’d get better Cassandra performance by virtualizing our larger machines into more numerous smaller ones.

Data Modeling

When you ask people about data modeling in Cassandra, one of the first things you inevitably hear is to organize the column families based on the questions you plan to ask of the data. This seems absurdly obvious until you come to the stark realization that you do not have a good handle on every question across every data type and you need flexible models.

By far the most time was spent experimenting with various ways of organizing the column families and making our Python code multi-process for our heavier data feeds. I was fortunate in that all of our data is well structured and largely time series in nature. This meant that we usually ended up with a column family that looked similar to this:

row key             column_name                            column_value
2012-12-01 00:00    ffd82988-1c01-498b-ac53-14ef9cb9467b   JSON blob
                    fff3ad45-b4b8-42df-a021-6c53ad2095a4   JSON blob
                    fff76205-75ff-4cca-bb9e-2c963be10f54   JSON blob
2012-12-01 00:01    ffea8e76-5c24-426f-9bb9-a40912e7a8f0   JSON blob
                    fff753d5-91f1-4045-86ad-8a20d4fa4fc5   JSON blob

Since we were dealing with time series data and we’d often bound our lookups by a timestamp, we’d partition rows by time, sometimes in one-minute intervals (as above), fifteen-minute intervals, or even one-hour intervals, depending on how large the data was. Each column would then be a unique entry for that piece of data, with the value being a JSON-encoded blob. If a particular piece of data spanned multiple partitions, we’d write it to each row. Pretty straightforward, really.
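A sketch of that write path is below, assuming one-minute partitions and a hypothetical 'events' column family; the keyspace, host names, TTL, and the minute_buckets helper are illustrative, not our production code.

import json
import uuid
from datetime import datetime, timedelta

import pycassa

pool = pycassa.ConnectionPool('security', ['cass1:9160', 'cass2:9160'])
events = pycassa.ColumnFamily(pool, 'events')

def minute_buckets(start, end):
    """Yield one row key for every minute the event spans."""
    t = start.replace(second=0, microsecond=0)
    while t <= end:
        yield t.strftime('%Y-%m-%d %H:%M')
        t += timedelta(minutes=1)

def store_event(event, start, end):
    """Write the event into every partition it touches, with a 90-day TTL."""
    column = {str(uuid.uuid4()): json.dumps(event)}
    for row_key in minute_buckets(start, end):
        events.insert(row_key, column, ttl=90 * 86400)

# An event spanning two one-minute partitions gets written to both rows.
store_event({'src': 'a.b.c.d', 'dst': 'a.b.c.e', 'action': 'denied'},
            datetime(2012, 12, 1, 0, 0, 30), datetime(2012, 12, 1, 0, 1, 15))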

The drawback to this approach is that, without any other column families, we’d be doing a full row scan (sometimes across tens of thousands of columns) for a single entry, which meant decoding each JSON blob and looking for a particular value. In practice, I found this wasn’t much of a problem for small time intervals, as it only took a few seconds per row. However, as I started increasing our query size to hundreds or thousands of rows (1,440 rows for a twenty-four-hour period), I realized we had to spend some time with the Python multiprocessing module so we could fetch rows in parallel if we wanted any sort of response time not measured in minutes. While this allowed us to take full advantage of multiple cores, it left a pretty high barrier to entry in terms of the development skills required by a group largely staffed by operations folks.
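A rough sketch of that parallel fetch, assuming the time-partitioned 'events' column family above; the keyspace, host names, and worker count are illustrative.

import json
from multiprocessing import Pool

import pycassa

def init_worker():
    # Each worker process gets its own connection pool and column family handle.
    global events
    cass = pycassa.ConnectionPool('security', ['cass1:9160', 'cass2:9160'])
    events = pycassa.ColumnFamily(cass, 'events')

def fetch_row(row_key):
    # Pull every column in the row and decode the JSON blobs.
    try:
        columns = events.get(row_key, column_count=100000)
    except pycassa.NotFoundException:
        return []
    return [json.loads(blob) for blob in columns.values()]

if __name__ == '__main__':
    # 1,440 one-minute rows covers a twenty-four-hour window.
    row_keys = ['2012-12-01 %02d:%02d' % (h, m)
                for h in range(24) for m in range(60)]
    workers = Pool(processes=8, initializer=init_worker)
    day = workers.map(fetch_row, row_keys)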

Even without getting into multiprocessing, there were ways to optimize retrieval by creating lookup tables pertaining to values in the JSON blobs. As an example, we could decide to index by IP address:

row key    column_name                            column_value
a.b.c.d    fff3ad45-b4b8-42df-a021-6c53ad2095a4   empty
           fff76205-75ff-4cca-bb9e-2c963be10f54   empty
a.b.c.e    ffea8e76-5c24-426f-9bb9-a40912e7a8f0   empty
           fff753d5-91f1-4045-86ad-8a20d4fa4fc5   empty

With this column family, we simply find the IP address in question, retrieve the column names for every occurrence, and then grab the corresponding values from the first column family. A small optimization (at the cost of data redundancy) would be to write the full JSON blob into the column_value so we don’t have to read two column families when looking up by IP address. If the data values change frequently, ensuring consistency across multiple column families is a solvable problem, but a problem nonetheless. In our environment the data is immutable, so either approach would be fine: it’s a trade-off of speed versus storage.
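The denormalized variant might look like the sketch below; the 'events_by_ip' column family, keyspace, and host names are illustrative, and the blob is whatever was written at ingest time.

import json

import pycassa

pool = pycassa.ConnectionPool('security', ['cass1:9160', 'cass2:9160'])
events_by_ip = pycassa.ColumnFamily(pool, 'events_by_ip')

def index_event(ip, event_id, blob):
    # At ingest time, also write the JSON blob under the IP row,
    # keeping the same 90-day TTL as the time-partitioned rows.
    events_by_ip.insert(ip, {event_id: blob}, ttl=90 * 86400)

def events_for_ip(ip):
    # At query time, a single row read returns every event seen from the address.
    try:
        return [json.loads(blob)
                for blob in events_by_ip.get(ip, column_count=100000).values()]
    except pycassa.NotFoundException:
        return []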

Takeaways

Cassandra can be daunting if your only background is single-instance relational databases. While I ultimately got things working as expected, there was a significant time and skill investment in doing so. Like any complex technology, it’s expert-friendly: the more you understand it, the more you and your organization will get out of it. These days, it’s not uncommon to find someone who can deliver all of these skills alone, but having a well-engineered team of these people can significantly reduce the time it takes to go from nothing to full production:

  • systems administrator to manage a distributed cluster and tune the OS & Cassandra

  • business analyst to fully understand the data and questions being asked of it

  • software developer to facilitate the optimal means of retrieving the data and writing the complex analytics

It’s difficult to overstate how much I’ve learned on this journey about our data, our use cases, and Cassandra. To that end, I’m happy to pass on some tidbits I’ve learned from DataStax engineers, Cassandra developers, and the community at large:

Operations

  • try to keep your disk utilization below 50% so you won’t end up with full disks if a node fails

  • if repairing, use the ‘-pr’ option so you don’t repair the same range twice

  • once token IDs are set, you can remove them from cassandra.yaml, allowing every node in the cluster to use the same file (see the token sketch after this list)

  • using MX4J as an interface to JMX caused random crashes across every node until it was removed

  • have a planned method of managing configurations and upgrades across a distributed cluster
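As an aside, evenly spaced initial tokens for the RandomPartitioner (the default at the time) can be generated with a few lines of Python; the node count below is illustrative.

# Evenly spaced initial_token values for the RandomPartitioner,
# whose token range runs from 0 to 2**127 - 1.
num_nodes = 9  # e.g. three physical servers x three VMs each
for i in range(num_nodes):
    print i, (2 ** 127 / num_nodes) * i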

Tuning

As I loaded more and more data into our cluster, I started seeing errors in the Cassandra logs, so I needed to make the following changes to cassandra.yaml:

flush_largest_memtables_at: 0.50

reduce_cache_sizes_at: 0.60

reduce_cache_capacity_to: 0.4

I’ve not gone back and re-run our benchmarks with these changes and the latest versions of Cassandra, as the cluster is under load now (and we aren’t experiencing any performance issues).

Modeling

  • a good understanding of the data and the required analytics cannot be overemphasized

  • don’t use supercolumns

  • reading successive column families can cause query timeouts — be prepared to refactor your data if this happens

  • don’t be afraid of denormalizing your data, even if it changes; have well-defined ways of updating every occurrence (see the batch sketch after this list)

  • secondary indexes never worked well for me, so I don’t use them

  • materialized views work very well if you have the disk space
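One way to keep every occurrence in step is a batch mutation. A minimal sketch, assuming the two column families from the modeling section; the names and the update_event helper are illustrative.

import pycassa
from pycassa.batch import Mutator

pool = pycassa.ConnectionPool('security', ['cass1:9160', 'cass2:9160'])
events = pycassa.ColumnFamily(pool, 'events')
events_by_ip = pycassa.ColumnFamily(pool, 'events_by_ip')

def update_event(row_key, ip, event_id, new_blob):
    # Queue the new value for both copies and send them together in one batch.
    batch = Mutator(pool)
    batch.insert(events, row_key, {event_id: new_blob})
    batch.insert(events_by_ip, ip, {event_id: new_blob})
    batch.send()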

Overall

  • give yourself time to master any complex technology; Cassandra is no different. You can stand up a cluster in less than a day, but getting the most out of it will take longer

  • Cassandra is a work in progress, so be prepared for bugs: 1.1.3 removed the ability to drop column families (fixed in 1.1.4)

Conclusion 

Usually the first question someone asks is, “Do I need Cassandra, or can I use a relational database?” That can be tough to answer without delving into a myriad of requirements, but one of the best answers I’ve heard is that Cassandra makes sense when your data no longer fits comfortably on one node. As our data set is currently measured in the hundreds of gigabytes, it remains to be seen whether Cassandra will move into production, given the amount of development expertise required to write the analytics. Cassandra has a SQL-like query language called CQL that lets business analysts familiar with SQL avoid writing complex Python, Java, or your high-level language of choice. With proper schemas, it might well be possible to use only CQL; however, we’re not at that point yet.

by CHRISTOPHER KELLER

(@cnkeller), Solutions Architect, CSC

Christopher Keller is a Solutions Architect at CSC where he specializes in big data technologies for the High Performance Computing group. When he’s not grappling with data, he holds a second degree blue belt in Gracie Jiu-Jitsu.

[email protected]

 

September 19, 2014