Column-family data stores are a special class of NoSQL systems that can store a wide range of data types, with a design that sits between traditional relational database systems and modern key-value stores. Cassandra is a column-family data store built on a peer-to-peer architecture.
One of the important features of any data store is the capability of its client API for querying the stored data. Cassandra was initially equipped with Thrift, a multi-language remote procedure call (RPC) framework, for querying data. To spare users its complex syntax and offer a familiar SQL-like syntax instead, CQL (Cassandra Query Language) was introduced. Thanks to its similarity to SQL, CQL became popular among new Cassandra users. As a result, the Cassandra community focused on optimizing the CQL client and brought CQL to a more mature state. The Thrift client, the original client API for Cassandra, lost importance, saw little optimization, and is slowly being deprecated by the Cassandra community.
The following are my subjective remarks on the limitations of Cassandra's CQL client. Feedback on these remarks is welcome in the comments.
1. False Conceptions:
It is good to stay simple and keep the query syntax as close to familiar SQL as possible. But I think it is more important to stay true to what the system actually is. To give the illusion of SQL, CQL completely masks what is actually happening inside Cassandra, which leads to a lot of false conceptions. For example, let's look at the following example, taken from a DataStax presentation slide.
- How many rows are there in the query output above (Figure 1)? It would be no surprise if most of you answered 4. But at the storage level, the table output in Figure 1 contains only one row. The first column, 'id', is the partition key (row key), and for a given partition key there can exist only one row in Cassandra. I am sorry to criticize CQL, but this can be really misleading for most people.
- Now, how many columns are there in the row (Figure 1)? Again, it would be no surprise if most of you answered 5. But according to Cassandra, there are only 4 columns. So which one is the extra? The first column, 'id'. Because it is chosen as the row key, it is not stored as a column of the row. The row key serves only to find the replicas that contain the particular row.
- One more source of confusion is the order of the data. Unlike in a traditional RDBMS, data stored in Cassandra is not kept in insertion order. By default, Cassandra sorts the data within a row by the lexicographical order of the composite columns.
- The next probable question is: what if there are no composite columns? In that case, a new entry for a partition key always replaces its old entry (if any).
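The points above can be illustrated with a small sketch. It is not Cassandra code, only a toy in-memory model of the wide-row layout described here, under the assumption that each stored cell is named by the composite column values followed by the regular column name. The function names (`insert`, `cells`) and the columns `seq` and `title` are hypothetical and are not taken from Figure 1.

```python
def insert(table, clustering_columns, partition_key, values):
    """Write one CQL-style row into a toy wide-row store (hypothetical model)."""
    # Every write for the same partition key lands in the same storage row.
    storage_row = table.setdefault(partition_key, {})
    # The composite (clustering) values become a prefix of each cell name.
    prefix = tuple(values[name] for name in clustering_columns)
    for name, value in values.items():
        if name not in clustering_columns:
            storage_row[prefix + (name,)] = value


def cells(table, partition_key):
    """Return the cells of one storage row, sorted by cell name."""
    return sorted(table[partition_key].items())


# With a composite column "seq": two CQL rows, one storage row,
# ordered by "seq" rather than by insertion order.
clustered = {}
insert(clustered, ["seq"], "p1", {"seq": 2, "title": "second"})
insert(clustered, ["seq"], "p1", {"seq": 1, "title": "first"})

# Without composite columns: the second write replaces the first.
unclustered = {}
insert(unclustered, [], "p1", {"title": "old"})
insert(unclustered, [], "p1", {"title": "new"})

print(len(clustered))             # 1
print(cells(clustered, "p1"))     # [((1, 'title'), 'first'), ((2, 'title'), 'second')]
print(cells(unclustered, "p1"))   # [(('title',), 'new')]
```

In this model the partition key never appears among the cells. It is only the dictionary key used to locate the row, which mirrors the point about 'id' above.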
2. Limitations of Dynamic Columns:
One of the main advantages of Cassandra is its schema-less dynamic columns. You do not need to decide on the table structure in advance. You can add any new column you want without defining it beforehand. You can write only the columns for which you have data and simply leave out the others, without having to specify them as 'null'. But the CQL client of Cassandra restricts this liberty. With the CQL client, you cannot add a new column without altering the table structure, just as in SQL. And if you do not have a value for a column, you can skip it only if it is a non-composite column. If you do not have a value for a column that is declared as composite, you cannot insert the entry unless you specify it explicitly as 'null'.
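To make the contrast concrete, here is a minimal sketch of the two write styles. It is a plain Python model, not a Cassandra driver call. `dynamic_insert` and `schema_insert` are hypothetical names, and the checks only mirror the restrictions described above: undeclared columns are rejected, and composite columns cannot be left out.

```python
def dynamic_insert(row, columns):
    """Thrift-style write: any column name is accepted, none is required."""
    row.update(columns)
    return row


def schema_insert(declared, composite, row, columns):
    """CQL-style write: columns must be declared, composite ones must be given."""
    unknown = sorted(set(columns) - set(declared))
    if unknown:
        # In CQL this would require altering the table first.
        raise ValueError(f"undeclared columns: {unknown}")
    missing = [name for name in composite if name not in columns]
    if missing:
        raise ValueError(f"missing composite columns: {missing}")
    row.update(columns)
    return row


# A brand-new column is fine in the schema-less model ...
free_row = dynamic_insert({}, {"title": "first", "mood": "calm"})

# ... but is rejected once a schema is enforced.
try:
    schema_insert(["seq", "title"], ["seq"], {}, {"seq": 1, "mood": "calm"})
    rejected = False
except ValueError:
    rejected = True

# Skipping a non-composite column ("title") is still allowed.
sparse_row = schema_insert(["seq", "title"], ["seq"], {}, {"seq": 1})
```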
A few weeks ago, during a meetup in San Francisco, I had the chance to ask one of the senior developers from DataStax why the enhancement of CQL gets so much attention while Thrift does not.
The answer I got was that the communication protocol used by CQL is better than Thrift's, and that it is always better to have your schema ready before writing the application. If your application has a fixed schema, anybody can understand the application and its data queries without much confusion. Otherwise, only the person who wrote the application knows what is actually stored inside Cassandra. This is one of the reasons why CQL gets more of the enhancement effort. Initially Thrift was faster than CQL, but a lot of optimization has been done on the CQL client to make it more efficient. CQL is now much more mature and is getting faster than Thrift.
My reply was that this reason is not so compelling: writing an application explicitly enough for others to understand it easily is a development concern, not something the data store has to enforce. Cassandra was made to be schema-free and was initially used that way in production. The answer I got to this was, 'It's all part of the tradeoff!'
My point is that Cassandra is awesome, with its peer-to-peer architecture and its rich set of features and tradeoffs. The misconceptions and limitations of the querying API should not affect Cassandra's popularity. The Cassandra community should also focus on enhancing the Thrift API and Thrift-based drivers like Hector. Or, if future versions of CQL address the limitations mentioned above, that would be great.