Large-scale Data Science and Machine Learning with Spark

[Full disclosure: I’m an advisor to Databricks.]

At last year’s Spark Summit in SF, Ali Ghodsi gave the first public demo of Databricks Cloud and Workspace. As I noted at the time, it was a showstopper!

This year Ali gave an update, and while I wasn’t on hand to see it in person, judging from comments I heard afterwards, it was another great demo (you can watch it here). Last year’s demo centered on Spark Streaming; this year the focus was on building and deploying end-to-end machine learning pipelines. The presentation culminated with a sentiment analysis of live tweets posted during the conference.
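To give a sense of what an end-to-end pipeline looks like in Spark’s ML library, here is a minimal sketch (not the demo’s actual code) of a tweet sentiment classifier. The training data, column names (“text”, “label”), and parameter values are all hypothetical; the point is how feature extraction and a classifier chain into a single Pipeline object that can be fit and deployed as one model:

```python
# A minimal sketch of a Spark ML pipeline for tweet sentiment classification.
# The training data and column names ("text", "label") are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("tweet-sentiment").getOrCreate()

# Hypothetical labeled tweets: (tweet text, sentiment label).
training = spark.createDataFrame(
    [("spark summit is great", 1.0),
     ("my flight got delayed again", 0.0)],
    ["text", "label"],
)

# Chain feature extraction and a classifier into a single pipeline.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])

# Fit the whole pipeline once, then apply it to new tweets as one model.
model = pipeline.fit(training)
predictions = model.transform(
    spark.createDataFrame([("loving the keynote",)], ["text"])
)
predictions.select("text", "prediction").show()
```

The appeal of this approach is that the fitted pipeline carries its feature transformations along with the model, which makes moving from exploration to deployment much less error-prone.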

With the introduction of DataFrames, and the maturation of PySpark, SparkR, and Spark SQL, Spark is much more accessible to data scientists. Databricks layers many more features on top of Apache Spark that make large-scale data science much simpler. These include collaboration; notebooks (R, Python, Scala, SQL); pipeline creation, visualization, and management; and model deployment tools. In addition, Databricks Cloud provides DevOps tools that vastly simplify managing data and infrastructure, allowing data science teams to jump right in and do what they do best – explore/analyze data and build/deploy models.
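For readers who haven’t tried the DataFrame API yet, here is a small sketch of what that accessibility looks like in PySpark. The input file and column names (“user”, “duration”) are made up for illustration; the same transformations could also be written in R, Scala, or plain SQL:

```python
# A minimal PySpark DataFrame sketch; the file path and column names
# ("user", "duration") are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-example").getOrCreate()

# Load a (hypothetical) JSON dataset of events into a DataFrame.
events = spark.read.json("events.json")

# Declarative, SQL-like transformations that Spark optimizes before running.
summary = (events
           .filter(F.col("duration") > 0)
           .groupBy("user")
           .agg(F.count("*").alias("n_events"),
                F.avg("duration").alias("avg_duration")))
summary.show()

# The same DataFrame can also be queried with Spark SQL.
events.createOrReplaceTempView("events")
spark.sql("SELECT user, COUNT(*) AS n_events FROM events GROUP BY user").show()
```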
