Over the course of a week, I’ll be hosting two webcasts featuring Spark release manager Patrick Wendell and Spark committer Kay Ousterhout. Register now!
- Patrick Wendell: Spark 1.3 and Spark’s New DataFrame API (March 25th at 9 a.m. California time)
- Kay Ousterhout: Making Sense of Spark Performance (April 1st at 9 a.m. California time)
There has been significant work dedicated to improving the performance of big-data systems like Spark, but comparatively little effort has been spent systematically analyzing the performance bottlenecks of these systems. In this talk, I’ll take a deep dive into Spark’s performance on two benchmarks (TPC-DS and the Big Data Benchmark from UC Berkeley) and one production workload and demonstrate that many commonly held beliefs about performance bottlenecks do not hold. In particular, I’ll demonstrate that CPU (and not I/O) is often the bottleneck, that improving network performance can reduce job completion time by a median of at most 4%, and that the causes of most stragglers can be identified and fixed. I’ll also demo how the open-source tools I developed can be used to understand the performance of other Spark jobs.
In this webcast, Patrick Wendell from Databricks will be speaking about Spark’s new 1.3 release. Spark 1.3 brings extensions to all of Spark’s major components (SQL, MLlib, Streaming) along with a new cross-cutting DataFrames API. The talk will outline what’s new in Spark 1.3 and provide a deep dive into the DataFrame feature. We’ll leave plenty of time for Q&A about the release or about Spark in general.