Apache Spark: Powering applications on-premise and in the cloud

[A version of this post appears on the O’Reilly Radar.] As organizations shift their focus toward building analytic applications, many are relying on components from the Apache Spark ecosystem. I began pointing this out in advance of the first Spark Summit in 2013, and since then Spark adoption has exploded. With Spark Summit SF right…

Apache Spark 1.3, the new Dataframe API, and Spark performance

Over the course of a week, I’ll be hosting two good webcasts featuring Spark release manager Patrick Wendell and Spark committer Kay Ousterhout. Register now! Patrick Wendell: Spark 1.3 and Spark’s New Dataframe API (March 25th at 9 a.m. California time). In this webcast, Patrick Wendell from Databricks will be speaking about Spark’s new 1.3…

Apache Spark’s journey from academia to industry

[A version of this post appears on the O’Reilly Radar blog.] Three projects from UC Berkeley’s AMPLab have been keenly adopted by industry: Apache Mesos, Apache Spark, and Tachyon. As an early user, it’s been fun to watch Spark go from an academic lab to the most active open source project in big data. In…

Spark + Cassandra: Technical Integration Details

I’ll be hosting a Nov 12th webcast on two of the most popular components in the big data ecosystem: Apache Spark and Apache Cassandra. As highlighted in a recent Databricks blog post, improvements to Spark’s shuffle have led to significant speedups (Spark is faster than Hadoop MapReduce, even on disk). While Spark has long…

Unboxing Apache Spark 1.1

Apache Spark version 1.1 shipped a few weeks ago. I’ve been enjoying enhancements to MLlib, Spark SQL, and Spark Streaming. Next week I’ll be hosting a webcast with Spark’s release manager – and Databricks co-founder – Patrick Wendell. (Full disclosure: I’m an advisor to Databricks.) In this webcast, Patrick Wendell from Databricks will be speaking…

Bits from the Data Store

Semi-regular field notes from the world of data: Apache Spark development community: Josh Rosen of Databricks recently built a tool for browsing pull requests. I like that it lets you scan each of the major components (Spark SQL, Streaming, MLlib, etc.). Now that Spark has become one of the most active open source projects in…

Bits from the Data Store

Semi-regular field notes from the world of data: Alibaba ♥ Spark: Next time someone asks you if Apache Spark scales, point them to this recent post by Chinese e-commerce juggernaut Alibaba. What particularly caught my eye is the company’s heavy usage of GraphX, Spark’s library for graph analytics. [Full disclosure: I’m an advisor to Databricks,…