Spark + Cassandra: Technical Integration Details

I’ll be hosting a Nov 12th webcast on two of the most popular components in the big data ecosystem: Apache Spark and Apache Cassandra. As highlighted in a recent Databricks blog post, improvements to Spark’s shuffle have led to significant speedups (Spark is faster than Hadoop MapReduce, even on disk). While Spark has long worked well with Hadoop (HDFS), it now integrates well with other storage systems such as Amazon S3 and Apache Cassandra. In the webcast, Sameer Farooqui will discuss the state of Spark/Cassandra integration:

This webcast will cover an architecture deep dive around how the Apache Cassandra database integrates with the Apache Spark computation engine.

We will cover:

  • Ideal use cases for Cassandra + Spark
  • Details of how Cassandra’s Murmur3 partitioning maps to a Spark RDD’s internal partitioning
  • Considerations when using caching in Spark against C* tables
  • Specific configuration settings relevant to Cassandra + Spark integration
  • The DataStax open source Spark connector for Cassandra 2.x and how it works
  • Introduction to a free ~100 page ‘DevOps’ lab document (licensed under Creative Commons) that Databricks has released around how the integration works
  • Live demo of a Cassandra + Spark cluster (how to read data from a C* table into a Spark RDD, do some transformations on the RDD, write results back into a Cassandra table)
  • Upcoming features in future versions of the connector and current issues to be aware of

On another front: the joint Databricks/O’Reilly Spark Developer Certification exam will be offered for the first time at Strata Barcelona. Come to Barcelona and become one of the first certified Spark developers!
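To give a feel for the read/transform/write demo outlined in the agenda, here is a minimal sketch in Scala using the DataStax connector’s RDD API (`cassandraTable`, `saveToCassandra`). The keyspace and table names (`test_ks`, `words`, `word_counts`) and the single-node contact point are illustrative assumptions; a real run requires a Spark cluster with the connector on the classpath and a reachable Cassandra node:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

// Point Spark at a Cassandra node (assumed to be local here)
val conf = new SparkConf()
  .setAppName("CassandraSparkDemo")
  .set("spark.cassandra.connection.host", "127.0.0.1")

val sc = new SparkContext(conf)

// Read a C* table into a Spark RDD; each element is a CassandraRow
// (hypothetical keyspace "test_ks" and table "words")
val rows = sc.cassandraTable("test_ks", "words")

// Transform the RDD: count occurrences of each word
val counts = rows
  .map(row => (row.getString("word"), 1))
  .reduceByKey(_ + _)

// Write the results back into another Cassandra table
counts.saveToCassandra("test_ks", "word_counts", SomeColumns("word", "count"))
```

Because `cassandraTable` produces an RDD whose partitions follow Cassandra’s token ranges, the transformation runs in parallel across the cluster without an explicit repartition step.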
