Apache Spark 1.3, the new DataFrame API, and Spark performance

Over the course of a week, I’ll be hosting two webcasts featuring Spark release manager Patrick Wendell and Spark committer Kay Ousterhout. Register now!

  • Patrick Wendell: Spark 1.3 and Spark’s New DataFrame API (March 25th at 9 a.m. California time)

    In this webcast, Patrick Wendell from Databricks will speak about Spark’s new 1.3 release. Spark 1.3 brings extensions to all of Spark’s major components (SQL, MLlib, Streaming) along with a new cross-cutting DataFrame API. The talk will outline what’s new in Spark 1.3 and provide a deep dive into the DataFrame feature. We’ll leave plenty of time for Q&A about the release or about Spark in general.

  • Kay Ousterhout: Making Sense of Spark Performance (April 1st at 9 a.m. California time)

    There has been significant work dedicated to improving the performance of big-data systems like Spark, but comparatively little effort has been spent systematically analyzing the performance bottlenecks of these systems. In this talk, I’ll take a deep dive into Spark’s performance on two benchmarks (TPC-DS and the Big Data Benchmark from UC Berkeley) and one production workload, and demonstrate that many commonly held beliefs about performance bottlenecks do not hold. In particular, I’ll demonstrate that CPU (and not I/O) is often the bottleneck, that optimizing network performance can improve job completion time by a median of at most 4%, and that the causes of most stragglers can be identified and fixed. I’ll also demo how the open-source tools I developed can be used to understand the performance of other Spark jobs.
