[A version of this post appears on the O’Reilly Radar.]
The O’Reilly Data Show podcast: Evan Chan on the early days of Spark+Cassandra, FiloDB, and cloud computing.
Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.
In this episode of the O’Reilly Data Show, I sit down with Evan Chan, distinguished engineer at Tuplejump. We talk about the early days of Spark (particularly his contributions to Spark/Cassandra integration), his interesting new open source project (FiloDB), and recent trends in cloud computing.
Bringing Apache Spark & Apache Cassandra together
DataStax credits me with inspiring them to bring Spark into Cassandra … I think they’re very generous about that. I think I was one of the first folks to talk about the possibility of bringing Cassandra and Spark together. The vision that I saw was that Cassandra was really good for real-time updates, but what if we’re able to do more analytical queries on it? Then you could combine, basically, a platform that is really good for real-time updates with analytics.
What is FiloDB?
FiloDB is an analytical database … It is a distributed columnar analytical database. … It’s distributed, meaning that, just like Cassandra, [it] runs on multiple nodes. You can spread out your data very easily and query it as a single entity. It is columnar, meaning that [it] stores your data in a format that makes it very fast for those little queries. What do I mean by that? That means you might want to find out, for example, the top products that are selling in a department for month X. These are queries that typically show up in business reporting … these kinds of queries would benefit greatly from FiloDB.
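To make the "top products in a department for month X" example concrete, here is a minimal sketch in plain Python. It does not use FiloDB or its API. The column names, the sample data, and the `top_products` helper are all hypothetical, chosen only to illustrate why a column-oriented layout suits this kind of reporting query: the query reads just the columns it needs.

```python
from collections import defaultdict

# Hypothetical sales data stored column by column (one list per field)
# instead of as a list of row records. This is NOT FiloDB's storage format.
columns = {
    "department": ["toys", "toys", "books", "toys", "books", "toys"],
    "month": ["2015-11", "2015-11", "2015-11", "2015-12", "2015-11", "2015-11"],
    "product": ["robot", "kite", "atlas", "robot", "novel", "robot"],
    "units": [3, 5, 2, 7, 4, 1],
}


def top_products(cols, department, month, limit=3):
    """Return (product, total units) pairs for one department and month,
    ordered from best-selling to worst."""
    totals = defaultdict(int)
    # Only the four columns named here are scanned; any other columns
    # in a wider table would never be read by this query.
    rows = zip(cols["department"], cols["month"], cols["product"], cols["units"])
    for dept, mon, product, units in rows:
        if dept == department and mon == month:
            totals[product] += units
    # Sort by units descending, then by product name to break ties.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


result = top_products(columns, "toys", "2015-11")
print(result)
```

A real columnar database adds compression, distribution across nodes, and vectorized scans on top of this idea; the sketch only shows the access pattern.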
…We’re seeing there’s really a need for something that allows you to do queries very quickly and interactively but still work with more recent data. Some more recent use cases that really motivated this were around processing, such as geospatial processing … For example, I have a location column. Let’s say that it’s IoT or something else, and I have positions or coordinates. Oftentimes, you need to annotate this data and take the position and same ZIP codes and other kinds of things like that. I saw an opportunity to use columnar storage for it, but nothing that allowed me to take advantage of it very easily.
…One of our core messages is, look, you already have Spark and Cassandra. You ingest real-time data. Now you’re thinking how to add analytics to it. You don’t have to set up a whole complex stack involving Hadoop and a lot of extra stuff. You can simplify your stack a lot and just use what we call a “SMACK stack”: Spark, [Mesos, Akka,] Cassandra, Kafka.
It’s such a [different] landscape now. … You can run everything on Amazon or Google Cloud with Dataflow. Basically, I think the industry is transitioning from: you have to build a lot of things yourself, to more of a pick and choose (like I’m going to go and see what services I can assemble and integrate all of them). … When you think about testing, and especially when you have, say, a dev cluster, a staging cluster, a production cluster, a lot of times you want to spin up things for tests, like performance. With cloud, it becomes much easier. With data centers, you often can’t find a space to do performance tests. With the cloud provider, I can just spin up a cluster.