Brian ONeill's Random Thoughts: April 2013

My good friend Vivek Mishra asked me to review his new book, Instant Apache Cassandra for Developers Starter. (http://www.packtpub.com/apache-cassandra-for-developers/book)

Vivek is a rockstar, leading Kundera, where he cranks out code that allows people to access Cassandra via JPA. (See: https://github.com/impetus-opensource/Kundera)

His book is an excellent primer on Cassandra. The initial sections are clear and concise, describing the fundamentals required to get started. IMHO, to be successful with Cassandra, you need to understand the distributed storage model. Vivek does a great job of describing this, and the write path, another critical element.

About halfway through, the book transitions, focusing much more on example code. Vivek's bias creeps in a bit here, focusing heavily on Kundera. I have mixed emotions about accessing Cassandra from JPA. But I think it's absolutely critical if you are attempting to consolidate storage into a single database. If you are, Kundera is perfect. It allows you to use Cassandra as you would any relational store.

If instead, you are taking a polyglot approach, or you are using Cassandra specifically for its "NoSQL-ness", then JPA access might obfuscate the power of the simple/scalable data model at the heart of C*. That however may be changing, given the increased use of CQL, where C* has found a way to expose all the "NoSQL-ness" via a SQL-like interface... provided you understand how to translate the two!
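To make "translating the two" a little more concrete, here is a rough sketch of how I picture it. It assumes a hypothetical `events` table (shown in the comment) and a toy `to_wide_rows` function of my own invention; it is a simplified mental model of how a compound primary key maps onto a wide row, not how the storage engine is actually implemented (it ignores row markers, timestamps, and so on).

```python
from collections import defaultdict

# Hypothetical CQL table, for illustration only:
#   CREATE TABLE events (
#       user_id text, ts int, action text, page text,
#       PRIMARY KEY (user_id, ts)
#   );
# user_id is the partition key (the row key); ts is the clustering column.


def to_wide_rows(cql_rows, partition_key, clustering_key):
    """Fold CQL-style rows into a {row_key: {column_name: value}} layout."""
    wide = defaultdict(dict)
    for row in cql_rows:
        row_key = row[partition_key]
        prefix = row[clustering_key]
        for name, value in row.items():
            if name in (partition_key, clustering_key):
                continue
            # Composite column name: (clustering value, CQL column name)
            wide[row_key][(prefix, name)] = value
    return dict(wide)


cql_rows = [
    {"user_id": "bone", "ts": 1, "action": "view", "page": "/home"},
    {"user_id": "bone", "ts": 2, "action": "click", "page": "/buy"},
    {"user_id": "vivek", "ts": 1, "action": "view", "page": "/docs"},
]

wide = to_wide_rows(cql_rows, "user_id", "ts")
for key, columns in wide.items():
    print(key, sorted(columns.items()))
```

Two CQL rows for the same `user_id` end up as one wide row with four columns, which is the part JPA and SQL habits tend to hide.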

Regardless, Vivek did a great job with the book. You will easily recoup the cost of the book in the time it saves you getting started with C* and JPA.

BUT...

Be sure to read it through (don't stop at the JPA example!). Vivek saved the best for last. I'd say the best nuggets in the book are in the aptly named section: "Top features you'll want to know about" (pg. 29)

Cassandra's blessing, and its curse, is the wide variety of methods that you can use to access it: Hector and Astyanax (for Thrift), Virgil (for REST), CQL (for SQL), and Kundera (for JPA). But you can't fault C* for that; it's a thriving, inventive community applying C* to all sorts of problems. And given its growth, it may only get worse... but in a good way. (I still hope to revive Spring Data for C* =)

As part of our presentation up at NYC* Big Data Tech Day, we noted that Hadoop didn't really work for us. It was great for ingesting flat files from HDFS into Cassandra, but the map/reduce jobs that used Cassandra as input didn't cut it. We found ourselves contorting our solutions to fit within the map/reduce framework, which required developer-level capabilities. We had to add complexity into the system to do batch management/composition, and in the end the map/reduce jobs took too long to complete.

Eventually, we swapped out Hadoop for Storm. That allowed us to do real-time cumulative analytics. And most recently, we converted our topologies to Trident. Handling all CRUD operations through Storm allowed us to perform roll-up metrics by different dimensions using Trident State. (Additionally, we can write to wide-rows for indexing, etc.)

This is working really well, but we are seeing increasing demand from our data scientists and customers to support "ad hoc" dimensional analysis, dashboards, and reporting. Elastic Search keeps us covered on many of the ad hoc queries, but aside from facets, it has little support for real-time dimensional aggregations, and no support for dashboards and reports.

We turned to the industry to find the best of breed. With some help from others who have traveled this road (shout out to @elubow), we settled on Vertica, Infobright and Acunu as contenders. I quickly grabbed VMs from each of them and went to work.

WARNING: What I'm about to say is based on a few days' experimentation, and consists largely of first impressions. It has no basis in real production experience. (yet =)

First up was Acunu. Although each of the VMs functioned as an appliance, when logging into the VM and playing around with things, we were most at home with Acunu. Acunu is backed by Cassandra. Having C* installed and running as the persistence layer was like having an old friend playing wingman on a first date. (they can bail you out if things start going south =)

Acunu had a nice REST API and a simple enough web-based UI to manage schemas and dimensions. Within minutes, I was inserting data from a Ruby script and playing around with dashboards... until something went wrong and the server started throwing OOMs. After a restart, things cleared up, but it left me questioning the stability a bit. (once again, this was a *single* VM running on my laptop, so it wasn't the most robust environment)

Next, I moved on to Vertica. From a features and functions point of view, Vertica looked to be leaps and bounds ahead. It had sophisticated support for R, which would make our data scientists happy. It also has compression capabilities, which will make our IT/Ops guys happy. And it looked to have some sophisticated integration with Hadoop, just in case we ever wanted/needed to support deep analytics that could leverage M/R.

That said, it was far more cumbersome to get up and running, and felt a bit like I went backwards in time. I couldn't find a REST API. (please let me know if someone has one for Vertica) So, I was left to go through the hoop-drill of getting a JDBC client driver, which was not available in public repos, etc. When using the admin tool provided on the appliance, I felt like I was back in middle school (early '90s) installing Linux via an ANSI interface on an Intel 386. In the end however, I grew accustomed to their client (vsql) and was happily hacking away over the JDBC driver, and it felt fairly solid.

Although we are still interested in pursuing both Acunu and Vertica, both experiences left me wanting. What we really want is a fully open-source solution (preferably Apache-licensed) that we are free to enhance, supplement, etc., with optional commercial support.

That got me thinking about Edward Capriolo's presentation on Intravert. If I boil down our needs into "must-haves" and "nice-to-haves", what we really *need* is just an implementation of Rainbird. (http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011)

AS AN ASIDE:
Does anyone know what happened to Rainbird? I've been trying to get the answer, to no avail.
http://www.youtube.com/watch?v=84k7o4GdkQg

Now, time for crazy talk...
Intravert provides a slick REST API for CRUD operations on Cassandra. As I said before, I'm a *huge* REST fan. It provides the loose coupling for everything in our polyglot persistence architecture. Intravert also provides a loosely coupled eventing framework to which I can attach handlers. What if I implemented a handler that took the CRUD events and updated additional column families with the dimensional counts/aggregations??? If I then combine that with a JavaScript framework for charting, how far would that get me? (60-70% solution?)
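To show the shape of that handler idea, here is a back-of-the-napkin sketch. To be clear, this is NOT Intravert's API: `DimensionalCounter`, `on_event` and `count` are names I made up, and an in-memory `Counter` stands in for what would presumably be counter column families in C*. It only illustrates the bookkeeping: every create or delete bumps a count for each combination of the configured dimensions.

```python
from collections import Counter
from itertools import combinations


class DimensionalCounter:
    """Hypothetical handler: keeps a count per combination of dimensions."""

    def __init__(self, dimensions):
        self.dimensions = tuple(dimensions)
        # Stand-in for a counter column family: key -> count
        self.counts = Counter()

    def _keys(self, record):
        # (dimension, value) pairs, always in the configured dimension order
        present = [(d, record[d]) for d in self.dimensions if d in record]
        for size in range(1, len(present) + 1):
            for combo in combinations(present, size):
                yield combo

    def on_event(self, op, record):
        """op is 'create' or 'delete'; an update is delete(old) + create(new)."""
        delta = {"create": 1, "delete": -1}[op]
        for key in self._keys(record):
            self.counts[key] += delta

    def count(self, **where):
        """Look up the count for an exact set of dimension values."""
        key = tuple((d, where[d]) for d in self.dimensions if d in where)
        return self.counts[key]


handler = DimensionalCounter(["country", "device"])
handler.on_event("create", {"id": 1, "country": "US", "device": "ios"})
handler.on_event("create", {"id": 2, "country": "US", "device": "android"})
handler.on_event("create", {"id": 3, "country": "DE", "device": "ios"})
handler.on_event("delete", {"id": 3, "country": "DE", "device": "ios"})

print(handler.count(country="US"))
print(handler.count(country="US", device="ios"))
print(handler.count(device="ios"))
```

The obvious catch is that the number of keys per event grows exponentially with the number of dimensions, so a real version would have to be picky about which combinations it maintains.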

To be clear, I'm not bashing Vertica or Acunu. Both have solid value propositions and they are both contenders in our options analysis. I'm just mourning the fact that there seems to be no good open-source solution in this space as there is in others. (Neo4j/TitanDB for graphs, Elastic Search/SOLR for search, Kafka/Kestrel for queueing, Cassandra for storage, etc.)

We are also considering Druid and Infobright, but I haven't gotten to them yet:
https://github.com/metamx/druid

Please don't bash me for early judgments.
I'm definitely interested in hearing people's thoughts.

Tuesday, April 23, 2013

Book Review: Instant Apache Cassandra for Developers Starter from PACKT

Monday, April 1, 2013

BI/Analytics on Big Data/Cassandra: Vertica, Acunu and Intravert(!?)

Big Data Quadfecta @ Philly Emerging Technologies Event