The science of moving dots: the O’Reilly Data Show Podcast

Rajiv Maheswaran talks about the tools and techniques required to analyze new kinds of sports data

[This post originally appeared on the O’Reilly Radar blog.]

Editor’s note: you can subscribe to the O’Reilly Data Show Podcast through iTunes, SoundCloud or through our RSS feed.

Many data scientists are comfortable working with structured operational data and unstructured text. Newer techniques like deep learning have opened up data types like images, video, and audio.

Other common data sources are garnering attention. With the rise of mobile phones equipped with GPS, I’m meeting many more data scientists at start-ups and large companies who specialize in spatio-temporal pattern recognition. Analyzing “moving dots” requires specialized tools and techniques. A few months ago, I sat down with Rajiv Maheswaran, founder and CEO of Second Spectrum, a company that applies analytics to sports tracking data. Maheswaran talked about this new kind of data and the challenge of finding patterns:

“It’s interesting because it’s a new type of data problem. Everybody knows that big data machine learning has done a lot of stuff in structured data, in photos, in translation for language, but moving dots is a very new kind of data where you haven’t figured out the right feature set to be able to find patterns from. There’s no language of moving dots, at least not that computers understand. People understand it very well, but there’s no computational language of moving dots that are interacting. We wanted to build that up, mostly because data about moving dots is very, very new. It’s only in the last five years, between phones and GPS and new tracking technologies, that moving data has actually emerged.”

Spark + Cassandra: Technical Integration Details

I’ll be hosting a Nov 12th webcast on two of the most popular components in the big data ecosystem: Apache Spark and Apache Cassandra. As highlighted in a recent Databricks blog post, improvements to Spark’s shuffle have led to significant speedups (Spark is faster than Hadoop MapReduce, even on disk). While Spark has long worked well with Hadoop (HDFS), it now also integrates well with other storage systems like Amazon S3 and Apache Cassandra. In the webcast, Sameer Farooqui will discuss the state of Spark/Cassandra integration:

This webcast will cover an architecture deep dive around how the Apache Cassandra database integrates with the Apache Spark computation engine.

We will cover:

  • Ideal use cases for Cassandra + Spark
  • Details of how Cassandra’s Murmur3 partitioning maps to a Spark RDD’s internal partitioning
  • Considerations when using caching in Spark against C* tables
  • Specific configuration settings relevant to Cassandra + Spark integration
  • The DataStax open source Spark connector for Cassandra 2.x and how it works
  • Introduction to a free ~100 page ‘DevOps’ lab document (licensed under Creative Commons) that Databricks has released around how the integration works
  • Live demo of a Cassandra + Spark cluster (how to read data from a C* table into a Spark RDD, do some transformations on the RDD, write results back into a Cassandra table)
  • Upcoming features in future versions of the connector and current issues to be aware of

On another front: the joint Databricks/O’Reilly Spark Developer Certification exam will be offered for the first time at Strata Barcelona. Come to Barcelona and become one of the first certified Spark developers!
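
To make the partitioning item in the webcast agenda a little more concrete, here is a minimal Python sketch of the general idea: Cassandra’s Murmur3 partitioner assigns each row a token in a signed 64-bit range, and that range can be cut into contiguous slices that each back one partition of an RDD. This is only an illustration under stated assumptions, not the connector’s actual algorithm (which may also account for replica placement and data size). The function names `split_token_ring` and `split_for_token` are hypothetical.

```python
from bisect import bisect_right

# Documented token range of Cassandra's Murmur3Partitioner: signed 64-bit integers.
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def split_token_ring(num_splits):
    """Cut the full token range into num_splits contiguous (start, end) ranges.

    Both ends are inclusive. Hypothetical helper: each range stands in for
    one partition of an RDD.
    """
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    total = MAX_TOKEN - MIN_TOKEN + 1  # 2**64 possible tokens
    bounds = [MIN_TOKEN + (total * i) // num_splits for i in range(num_splits + 1)]
    return [(bounds[i], bounds[i + 1] - 1) for i in range(num_splits)]


def split_for_token(ranges, token):
    """Return the index of the range that contains the given token."""
    if not MIN_TOKEN <= token <= MAX_TOKEN:
        raise ValueError("token outside the Murmur3 range")
    starts = [start for start, _ in ranges]
    # The owning range is the last one whose start is <= token.
    return bisect_right(starts, token) - 1


ranges = split_token_ring(4)
for index, (start, end) in enumerate(ranges):
    print(f"split {index}: tokens {start} .. {end}")
```

Because the ranges are contiguous and non-overlapping, every row’s token lands in exactly one slice, which is what lets a table scan be divided into independent tasks.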