What is Cassandra?
Apache Cassandra is a massively scalable open-source NoSQL database. It is an Apache Software Foundation top-level project designed to handle very large amounts of data in real time while providing continuous availability, even across multiple datacenters and the cloud.
Cassandra evolved from work at Google, Amazon and Facebook, and is in use at leading companies such as Netflix, Rackspace, and eBay. The Chair of the Apache Cassandra Project is Jonathan Ellis, co-founder and CTO of DataStax.
Although Cassandra is classified as a NoSQL database, it is important to note that NoSQL is a misnomer. A more correct term would be NoRDBMS, as SQL is only a means of accessing the underlying data. In fact, Cassandra comes with an SQL-like interface called CQL. CQL is designed to take away a lot of the complexity of inserting and querying data in Cassandra.
In this article, we will walk through the basics of what Cassandra is, getting up and running with Cassandra, and a simple Node.js application sample to show the ease of use.
The basic Cassandra schema starts off with the keyspace. The keyspace is analogous to a database in an RDBMS. It defines how many times the data is replicated and in which datacenters the data resides.
In every keyspace there is a set of column families. A column family is like a table in an RDBMS, except that there is no fixed schema. While you can specify the value type for a specific column, this can also be done on the fly. This flexibility allows you to create millions of columns in a single row; a great use case for this is time-series data.
Each column family has a set of rows, which consist of columns. Every column is a tuple that contains the column name, value, timestamp, and optionally a time to live (TTL). The column name can be of any supported type, including a composite of several types. Columns with such names are called composite columns.
Cassandra can query data in several different ways. You can select columns directly by their name, as you would in an RDBMS. You can also select slices of columns by their name, part of their name, or their ordinal position. It is important to note that columns are automatically ordered according to their name, not according to when they were added.
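To make the ordering and slicing behavior described above concrete, here is a minimal sketch in plain JavaScript. It is only a toy model of a single row, not Cassandra code: the helper names (makeColumn, insertColumn, sliceByName, sliceByPosition) are hypothetical and exist purely for illustration.

```javascript
// Toy model of one row: an array of column tuples kept sorted by column name.
// All names here are hypothetical; none of this is a Cassandra or driver API.
function makeColumn(name, value, ttl) {
  return {
    name: name,
    value: value,
    timestamp: Date.now(),               // when the column was written
    ttl: ttl === undefined ? null : ttl  // optional time to live
  };
}

function insertColumn(row, column) {
  // Replace any existing column with the same name, then re-sort by name.
  var columns = row.filter(function (c) { return c.name !== column.name; });
  columns.push(column);
  columns.sort(function (a, b) {
    return a.name < b.name ? -1 : a.name > b.name ? 1 : 0;
  });
  return columns;
}

function sliceByName(row, start, end) {
  // Inclusive range over column names.
  return row.filter(function (c) { return c.name >= start && c.name <= end; });
}

function sliceByPosition(row, count) {
  // The first `count` columns in name order.
  return row.slice(0, count);
}

// Time-series style usage: columns are added out of order...
var row = [];
row = insertColumn(row, makeColumn('2013-03-02', 17));
row = insertColumn(row, makeColumn('2013-03-01', 12));
row = insertColumn(row, makeColumn('2013-03-03', 21));

// ...but are read back ordered by name, not by insertion time.
console.log(row.map(function (c) { return c.name; }));
console.log(sliceByName(row, '2013-03-02', '2013-03-03'));
console.log(sliceByPosition(row, 1));
```

Because the column names in this sketch are dates, sorting by name is the same as sorting by time, which is why time-series data fits this model so well.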
Getting started with Cassandra
Getting started with Cassandra is fairly easy. You can download the binaries or source as a tarball directly from planetcassandra.org. If you are running OS X and use Homebrew, you can simply run “brew install cassandra”. There is also a package for the Debian distribution of Linux.
Testing small clusters locally is easiest with Cassandra Cluster Manager (CCM). Let’s walk through using CCM to start up a small 3-node cluster on our local machine.
After the cluster starts, you can verify that your cluster is up and running by using the nodetool command:
The nodetool command is your primary means of managing cluster-wide tasks that are not related to queries. These include adding a new node, decommissioning a node, rebalancing the data in the cluster, checking statistics, and more.
Cassandra Query Language (CQL) is an SQL-like language for querying Cassandra. Although CQL has many similarities to SQL, there are some fundamental differences. For example, because CQL is adapted to the Cassandra data model and architecture, it doesn’t support operations such as JOINs, which make no sense in a non-relational database.
To start using CQL, all you need to do is start up the cqlsh shell. There are two main versions of CQL: CQL2 and CQL3. The default version of CQL for Cassandra 1.1.x is CQL2; however, CQL3 is preferred and should be used if possible.
Let’s start a shell using CQL3
Now that we are in the shell, we can create our keyspace and column families.
To start using the newly created keyspace, the command is just like in SQL.
Now we can create a column family.
Inserting a row is also very similar to SQL.
As is selecting data:
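As a rough sketch of the sequence just described, the statements are collected below as JavaScript strings so that they can be printed and pasted into cqlsh. The keyspace name (node_cassandra), the column list of users, and the sample values are assumptions made for illustration. The replication map in CREATE KEYSPACE is the form used by Cassandra 1.2 and later; Cassandra 1.1 used a different syntax for keyspace options, so adjust it to your version.

```javascript
// Hypothetical schema for illustration: keyspace "node_cassandra" and a
// "users" column family keyed by email. Names and values are assumptions.
var statements = [
  // Keyspace with a replication factor of 3 (Cassandra 1.2+ syntax).
  "CREATE KEYSPACE node_cassandra WITH replication = " +
    "{'class': 'SimpleStrategy', 'replication_factor': 3}",
  // Switch to the new keyspace, just like in SQL.
  "USE node_cassandra",
  // Column family (table) with email as the row key.
  "CREATE TABLE users (email text PRIMARY KEY, name text)",
  // Insert a row.
  "INSERT INTO users (email, name) VALUES ('jane@example.com', 'Jane Doe')",
  // Read it back.
  "SELECT * FROM users LIMIT 10"
];

// Print the statements terminated with semicolons, ready for cqlsh.
var script = statements.map(function (s) { return s + ';'; }).join('\n');
console.log(script);
```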
Creating an application
Creating an application with Cassandra has been made relatively simple by the collective efforts of the community in creating easy-to-use drivers that all have a similar API. No matter what language you prefer, there is development going on for integration with Cassandra. For the purposes of this exercise, we will be using Node.js.
Node.js is a suitable starting point because of its ease of use and out-of-the-box non-blocking I/O capabilities.
In this example we will use the ExpressJS framework, as it is the most widely used web framework for Node.js.
First let’s install express and create our app.
Now open the folder node-cassandra using your favorite editor. In the directory list you will see a file called package.json; this is where the application’s dependencies are maintained. Edit that file and add "helenus": "*" to the dependencies. Helenus is a Node.js driver for Cassandra, and it is how we are going to communicate with Cassandra in this application.
Now go back to the command line and, in the application’s root directory, run “npm install” to install all of the application’s dependencies.
Let’s connect to our test cluster using Helenus. First let’s edit the “app.js” file and modify the top requires to add our driver.
The next step is to create our connection pool. The Helenus connection pool gives you a lot of options when it comes to managing the connections. It has automatic detection of dead nodes and attempts to reconnect to the nodes that are considered dead.
If a node goes down during a request, the request will fail, and it is up to the client to decide whether to retry it. The driver will not throw an exception on a downed node unless all nodes are down. This behavior lets the application survive multiple-node outages that would otherwise cause it to crash.
By default, Helenus will select a connection at random to send the writes to; however, this behavior can be overridden. An optional function can be passed when creating the connection pool, and it will choose a connection based on any logic you like. A good example of this would be to implement round-robin connection selection.
The only required fields in a Helenus connection are the keyspace and host names. You can also specify the CQL version, user, password, default timeout for connections, and the host pool size. The host pool size is the number of connections to make to each node. It currently defaults to 1; however, we recommend a minimum of 3 connections per node.
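Round-robin selection itself can be sketched in a few lines of plain JavaScript. This is a minimal illustration of the selection logic only: the function name makeRoundRobin is hypothetical, and the way Helenus passes connections to a custom selector (option name and arguments) is not shown here, so check the driver's documentation before wiring it in.

```javascript
// Returns a selector that hands out connections in order and wraps around.
// Hypothetical helper; how a driver would call it is an assumption.
function makeRoundRobin() {
  var next = 0; // index of the connection to hand out on the next call

  return function select(connections) {
    if (connections.length === 0) {
      throw new Error('no connections available');
    }
    // The modulo keeps the index valid even if the list shrank since last call.
    var chosen = connections[next % connections.length];
    next = (next + 1) % connections.length;
    return chosen;
  };
}

// Usage with stand-in connection objects (plain strings here).
var pick = makeRoundRobin();
var connections = ['node1', 'node2', 'node3'];
var order = [pick(connections), pick(connections), pick(connections), pick(connections)];
console.log(order);
```

Compared to random selection, round-robin spreads requests evenly over the nodes, at the cost of keeping a small piece of state between calls.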
Let’s create our connection pool:
Now that we have specified our connection pool parameters, we can connect. We don’t want to have the web server start taking requests before we establish our connection, so we will put the server startup in the callback function. The callback function is the function that is invoked once the connection has been established.
As you can see, the function takes an error argument. This parameter will be null unless there was a problem establishing the connection to Cassandra. If there is an error, we do not want to continue any further, so we throw the error, which halts the application.
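The connect-then-start pattern described above can be sketched without any driver at all. In this minimal example, startWhenConnected and fakePool are hypothetical names. The stand-in pool only mimics a connect method that takes an error-first callback, which is the shape assumed here for the real connection pool.

```javascript
// Start the server only after the connection callback reports success.
// `pool` is assumed to expose connect(callback) with an error-first callback.
function startWhenConnected(pool, startServer) {
  pool.connect(function (err) {
    if (err) {
      // Without a database connection there is nothing useful to serve,
      // so rethrow and let the application halt.
      throw err;
    }
    startServer();
  });
}

// Stand-in pool that "connects" immediately and successfully.
var fakePool = {
  connect: function (callback) { callback(null); }
};

var started = false;
startWhenConnected(fakePool, function () {
  started = true; // in the real app this is where the web server starts listening
});
console.log('server started:', started);
```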
Now that we have connected to the database, we need to make the connection available to the rest of the application. We can do this by adding it to the app configuration.
The last thing we will do in this file is create a few routes. These tell our app how we want to handle the data that comes in.
Now we will create the endpoints to the routes we just created. Edit the index.js file in the routes folder and add the index, new, and delete methods.
The functions for these methods take 3 arguments. The first argument is the request, the second is the response and the third is the “next” method, which is used when there is an error that we want to pass to the browser.
First we will create the “index”, or user listing. To access Cassandra we will need to gain access to the application, which is available as a property of the request object.
In the above code, we access the connection pool and call the CQL statement “SELECT * FROM users LIMIT 10”. The callback method takes 2 arguments. The first is the error argument that will be passed to the browser if it exists. The second is the response from Cassandra containing the users returned from the query.
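For reference, a handler of the shape just described might look like the following sketch. Several details are assumptions rather than the article's exact code: the pool is assumed to have been stored on the app under the key 'pool', the cql method is assumed to accept a parameter array before the callback, and the handler name listUsers and the view name 'index' are hypothetical. The stub objects at the bottom stand in for Express and the driver so that the sketch runs on its own.

```javascript
// Route handler: list up to 10 users, or pass any error on via next().
// Assumes the connection pool was stored with app.set('pool', pool).
function listUsers(req, res, next) {
  var pool = req.app.get('pool');

  pool.cql('SELECT * FROM users LIMIT 10', [], function (err, users) {
    if (err) {
      return next(err); // let the framework report the error to the browser
    }
    res.render('index', { users: users });
  });
}

// Stand-ins for Express and the driver, for illustration only.
var stubPool = {
  cql: function (query, params, callback) {
    callback(null, [{ email: 'jane@example.com' }]);
  }
};
var stubReq = { app: { get: function () { return stubPool; } } };
var stubRes = {
  render: function (view, locals) {
    console.log('render', view, locals.users.length, 'user(s)');
  }
};

listUsers(stubReq, stubRes, function (err) { console.error(err); });
```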
When using the “cql” method, you can also use variable-based replacement, which properly escapes the input before sending it to Cassandra for processing. This is essential when creating an application that takes user input.
The routes will redirect the user to the index after the operation has completed. Now that we have our basic routes created, we can create our view. A view can take variables passed to it from the render command. In this example we will use Jade, the default template engine in Express.
A Cassandra column consists not only of the key and value, but also the TTL and timestamp. The response from the SELECT query is therefore an object that allows you to get a column by name and retrieve its TTL, value, and timestamp if needed. In the example above, we can get the email address for the user by calling “user.get('email').value”.
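To see why placeholder replacement matters, here is a simplified sketch of the idea in plain JavaScript. It is an illustration only and not how Helenus implements it: bind and escapeValue are hypothetical helpers, only numbers and strings are handled, and a question mark inside a quoted literal in the query text would also be replaced, which a real driver has to avoid.

```javascript
// Escape a single value for inclusion in a CQL statement.
// Simplified: finite numbers pass through, everything else becomes a string.
function escapeValue(value) {
  if (typeof value === 'number' && isFinite(value)) {
    return String(value);
  }
  // Wrap in single quotes and double any embedded single quote.
  return "'" + String(value).replace(/'/g, "''") + "'";
}

// Replace each "?" in the query with the next escaped parameter.
function bind(query, params) {
  var index = 0;
  return query.replace(/\?/g, function () {
    if (index >= params.length) {
      throw new Error('not enough parameters for query');
    }
    return escapeValue(params[index++]);
  });
}

// A value containing a quote can no longer break out of the string literal.
var bound = bind('SELECT * FROM users WHERE email = ?', ["o'brien@example.com"]);
console.log(bound);
```

In the application itself you would pass the query and the parameter array to the driver and let it do this work, rather than building query strings by hand.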
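The shape of such a row object can be illustrated with a small stand-in. The makeRow helper below is hypothetical and only mimics the get-by-name access pattern described above. The real driver builds these objects for you from the query response, and its column objects may carry more than is shown here.

```javascript
// Build a stand-in row whose columns can be looked up by name.
// Each column carries its value plus the timestamp and TTL metadata.
function makeRow(columns) {
  var byName = {};
  columns.forEach(function (column) {
    byName[column.name] = column;
  });
  return {
    get: function (name) { return byName[name]; }
  };
}

var user = makeRow([
  { name: 'email', value: 'jane@example.com', timestamp: new Date(0), ttl: null },
  { name: 'name', value: 'Jane Doe', timestamp: new Date(0), ttl: 3600 }
]);

console.log(user.get('email').value); // the column's value
console.log(user.get('name').ttl);    // the column's time to live, if any
```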
Now that we have created the view we can start the application.
$ node app.js
You can now point your browser to http://localhost:3000
As you can see, the user we added with the original CQL statements above is already there. You can also add and remove users via the page.
All of the above code and examples are available on github at:
by Russ Bradberry
Russell Bradberry is the Principal Architect at SimpleReach, where he is responsible for architecting and building out highly scalable data solutions. He has put into market a wide range of products, including a real-time bidding ad server, a rich media ad management tool, a content recommendation system, and most recently a real-time social intelligence platform. He is also a DataStax Cassandra MVP and the author of the Node.js Cassandra driver Helenus.