Wednesday, October 31, 2012

CQL: Seismic shift or simply a SQL veneer for Cassandra?


If you have recently come to Cassandra and you are happily using CQL, this blog post will probably only serve to confuse you.  You may want to stop here.

If, however, you:
  • Are using Cassandra as a BigTable and care about the underlying persistence layer
  • Started using CQL, but wonder why it has crazy limitations that aren't found in SQL
  • Are accustomed to Java APIs and are trying to reconcile that with the advent of CQL
Then, this blog post is for you.  In fact, first you may want to read Jonathan's excellent post to get up to speed.

Now, I don't know the full history, but Cassandra has struggled with client APIs from the beginning.  There are remnants of that struggle littered throughout the code base, and I believe there were a couple of false starts along the way. (cassandra$ grep -r "avro" *)

But since I've been using Cassandra, I've been able to hang my hat on Thrift, which is the underlying RPC layer on which many of the Java APIs are built.  The Thrift layer is tied intimately to the storage engine in Cassandra and although it can be a bit cumbersome, it exposed everything we needed to build out app-layer extensions (a REST interface, indexing, and trigger functionality).

That said, Thrift is an albatross.  It provides no abstraction layer between clients and the server-side implementation.  Concepts that you wish would die are still first-class citizens in the Thrift API.  (cassandra$ grep -r "SuperColumn" * | grep thrift)

An abstraction layer was certainly necessary, and I believe this was the primary purpose of CQL: to shield the client/user from the complexities of the underlying storage mechanism, and from changes within that storage layer.  But because we chose a SQL-like abstraction layer, many of the people who came to Cassandra because it is "NO"-SQL are left wondering why a SQL-like language was introduced, mucking up their world.  Isn't the simple BigTable-based data model sufficient?  Why do we have to complicate things and impose schemas on everyone?

First, let me say that I never considered the underlying storage mechanism of Cassandra complex.  To me, it was a lot easier to reason about than many of the complex relational models I've seen over my career.  At the risk of oversimplifying, in Cassandra you really only have one structure to deal with: HashMap<RK, SortedMap<CN, V>>  (RK = RowKey, CN = ColumnName, V = Value).   That is straightforward enough, and the only "complexity" here is that the outer HashMap is spread across machines.  No big deal.  But I'll accept that many people have trouble grokking the concept compared with the warm, comfy, familiar concepts of an RDBMS.
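To make that structure concrete, here is a minimal single-node sketch in Java.  It is only an illustration of the HashMap<RK, SortedMap<CN, V>> idea: the class and method names (BigTableSketch, put, get, slice) are made up for this post and are not Cassandra or Thrift APIs, and it ignores distribution, timestamps, and persistence entirely.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical, in-memory model of HashMap<RK, SortedMap<CN, V>>.
// Not a Cassandra API; row keys, column names, and values are all Strings here.
public class BigTableSketch {
    // Outer map: row key -> row. Inner map: column name -> value, sorted by column name.
    private final Map<String, SortedMap<String, String>> rows = new HashMap<>();

    // Write one column; the row is created on first use, so no schema is needed up front.
    public void put(String rowKey, String columnName, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(columnName, value);
    }

    // Read one column, or null if the row or column does not exist.
    public String get(String rowKey, String columnName) {
        SortedMap<String, String> columns = rows.get(rowKey);
        return columns == null ? null : columns.get(columnName);
    }

    // Read a contiguous range of columns [start, end) from a single row.
    // This is cheap because the inner map is sorted by column name.
    public SortedMap<String, String> slice(String rowKey, String start, String end) {
        SortedMap<String, String> columns = rows.get(rowKey);
        if (columns == null) {
            return Collections.emptySortedMap();
        }
        return columns.subMap(start, end);
    }

    public static void main(String[] args) {
        BigTableSketch table = new BigTableSketch();
        table.put("user:42", "email", "someone@example.com");
        table.put("user:42", "event:2012-10-30", "login");
        table.put("user:42", "event:2012-10-31", "logout");
        // Only the "event:" columns come back, in sorted order.
        System.out.println(table.slice("user:42", "event:", "event;"));
    }
}
```

Two rows in this model can hold completely different column names, which is the dynamic/flexible schema people came for.  In a real cluster, the outer map is what gets partitioned across machines by row key.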

So, aside from the fact that the native Cassandra data structures might be difficult to grok, why did we add SQL to a NoSQL world?  As I said before in my post on Cassandra terminology, I think a SQL-like language opens up avenues that weren't possible before (application integrations with ETL tools, ORM layers, easing the transition for DBAs, etc.).  I truly appreciate why CQL was defined/selected, and at the same time I'm wary of viewing Cassandra through SQL-colored glasses, and of forcing those glasses on others.

Specifically, if it looks, tastes and smells like SQL, people will expect SQL-like behavior.  People (and systems) expect to be able to construct arbitrary WHERE clauses and JOINs.  With those expectations, and without an understanding of the underlying storage model, features, functions and even performance might not align well with users' expectations. (IMHO, never a good thing)  We may find ourselves explaining BigTable concepts anyway, just to explain why JOINs aren't welcome here.
(or we can just point people to Dean Hiller and playORM, so he can explain why they are. =)
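A rough sketch of why the storage model pushes back on arbitrary WHERE clauses, using the same map-of-sorted-maps picture.  The class and method names here (WhereClauseSketch, byKey, byValue) are hypothetical, and the sketch assumes no index exists on the column being filtered.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical illustration only; not how Cassandra executes queries internally.
public class WhereClauseSketch {

    // "WHERE key = ?": one hash lookup, then one lookup in the sorted columns.
    // In a cluster, the row key also tells you which machine to ask.
    static String byKey(Map<String, SortedMap<String, String>> rows,
                        String rowKey, String column) {
        SortedMap<String, String> columns = rows.get(rowKey);
        return columns == null ? null : columns.get(column);
    }

    // "WHERE column = ?" without a row key: nothing in the structure helps,
    // so every row (on every machine) has to be visited.
    static List<String> byValue(Map<String, SortedMap<String, String>> rows,
                                String column, String value) {
        List<String> matchingKeys = new ArrayList<>();
        for (Map.Entry<String, SortedMap<String, String>> row : rows.entrySet()) {
            if (value.equals(row.getValue().get(column))) {
                matchingKeys.add(row.getKey());
            }
        }
        Collections.sort(matchingKeys); // HashMap iteration order is not defined
        return matchingKeys;
    }

    public static void main(String[] args) {
        Map<String, SortedMap<String, String>> rows = new HashMap<>();
        rows.computeIfAbsent("user:1", k -> new TreeMap<>()).put("city", "Philadelphia");
        rows.computeIfAbsent("user:2", k -> new TreeMap<>()).put("city", "Boston");
        rows.computeIfAbsent("user:3", k -> new TreeMap<>()).put("city", "Philadelphia");

        System.out.println(byKey(rows, "user:2", "city"));        // direct lookup
        System.out.println(byValue(rows, "city", "Philadelphia")); // full scan
    }
}
```

A JOIN is the same problem twice over: it needs lookups by something other than the row key, on two sets of rows that may live on different machines.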

Also, I think we want to be careful not to hide the "simple" structures of the BigTable.  If it becomes cumbersome to interact with the BigTable (the Maps), we'll end up alienating the portion of the community that came to Cassandra for simple dynamic/flexible schemas.  That flexibility and simplicity allowed us to accommodate vast Varieties of data, one of the pillars in the 3 V's of BigData.  We don't want to lose it.

For more discussion on this, follow and chime in on:
https://issues.apache.org/jira/browse/CASSANDRA-4815

With those caveats in mind, I'm on board with CQL.   I intend to embrace it wholeheartedly, especially given the enhancements coming down the pipe.   IMHO, CQL is more than a SQL veneer for Cassandra.  It is the foundation for future feature enhancements.  And although Thrift will be around for a long, long, long, long time, Thrift RPC will begin to fall behind CQL.  There is already evidence of that: CQL is going to provide first-class support for operations on collections in 1.2, with only limited support (via JSON) in Thrift. See:
https://issues.apache.org/jira/browse/CASSANDRA-3647

With that conclusion, we've got our work cut out for us.  All of the enhancements we've developed for Cassandra were built on Thrift (either as AOP against the Thrift API, or directly consuming it to enable embedding).  This includes: cassandra-indexing, cassandra-triggers, and Virgil.  For each, we need to find a path forward that embraces CQL, keeping in mind that CQL is built on an entirely new protocol, listening on an entirely different port.  Additionally,  I'm looking to develop the Spring Data integration layer on CQL.

Anyone looking to get involved, we'd love a hand in migrating Virgil and Cassandra-Triggers forward and creating the initial Spring Data integration!

I'm excited about the future prospects of CQL, but it will take everyone's involvement to ensure that we get the best of all worlds from it, and that we don't lose anything in the transition.
  



