I had the pleasure of interviewing Ben Horowitz on the main stage at the recent Spark Summit in San Francisco. Ben is co-founder of a16z, one of the leading tech venture capital firms, and author of one of my favorite books about entrepreneurship (“The Hard Thing About Hard Things”).
The Spark Summit had a packed lineup, so I tried to cover a wide variety of topics in the 15 minutes allotted, including cloud computing, open source, untapped opportunities in big data, and “tech bubbles”:
[Full disclosure: I’m an advisor to Databricks.]
At last year’s Spark Summit in SF, Ali Ghodsi gave the first public demo of Databricks Cloud and Workspace. As I noted at the time, it was a showstopper!
This year Ali gave an update, and while I wasn’t on hand to see it in person, judging from comments I heard afterwards, it was another great demo (you can watch it here). Last year’s demo centered on Spark Streaming; this year the focus was on building and deploying end-to-end machine learning pipelines. The presentation culminated with a sentiment analysis of live tweets posted during the conference.
With the introduction of DataFrames, and the maturation of PySpark, SparkR, and Spark SQL, Spark is much more accessible to data scientists. Databricks layers many more features (on top of Apache Spark) that make large-scale data science much simpler. These include collaboration; notebooks (R, Python, Scala, SQL); pipeline creation, visualization, and management; and model deployment tools. In addition, Databricks Cloud provides (DevOps) tools that vastly simplify managing data and infrastructure, allowing data science teams to jump right in and do what they do best – explore/analyze data and build/deploy models.
[A version of this post appears on the O’Reilly Radar.]
The O’Reilly Data Show Podcast: Phil Liu on the evolution of metric monitoring tools and cloud computing.
One of the main sources of real-time data processing tools is IT operations. In fact, a previous post I wrote on the re-emergence of real-time was to a large extent prompted by my discussions with engineers and entrepreneurs building monitoring tools for IT operations. In many ways, data centers are perfect laboratories in that they are controlled environments managed by teams willing to instrument devices and software, and to monitor fine-grained metrics.
During a recent episode of the O’Reilly Data Show Podcast, I caught up with Phil Liu, co-founder and CTO of SignalFx, a SF Bay Area startup focused on building self-service monitoring tools for time series. We discussed hiring and building teams in the age of cloud computing, building tools for monitoring large numbers of time series, and lessons he’s learned from managing teams at leading technology companies.
Evolution of monitoring tools
Having worked at LoudCloud, Opsware, and Facebook, Liu has seen firsthand the evolution of real-time monitoring tools and platforms. He described how he has watched the number of metrics grow to volumes that require large compute clusters:
One of the first services I worked on at LoudCloud was a service called MyLoudCloud. Essentially that was a monitoring portal for all LoudCloud customers. At the time, [the way] we thought about monitoring was still in a per-instance-oriented monitoring system. [Later], I was one of the first engineers on the operational side of Facebook and eventually became part of the infrastructure team at Facebook. When I joined, Facebook basically was using a collection of open source software for monitoring and configuration, so these are things that everybody knows — Nagios, Ganglia. It started out basically using just per-instance monitoring techniques, basically the same techniques that we used back at LoudCloud, but interestingly and very quickly as Facebook grew, this per-instance-oriented monitoring no longer worked because we went from tens of thousands of servers to hundreds of thousands of servers, from tens of services to hundreds and thousands of services internally.
[A version of this post appears on the O’Reilly Radar.]
As organizations shift their focus toward building analytic applications, many are relying on components from the Apache Spark ecosystem. I began pointing this out in advance of the first Spark Summit in 2013, and since then Spark adoption has exploded.
With Spark Summit SF right around the corner, I recently sat down with Patrick Wendell, release manager of Apache Spark and co-founder of Databricks, for this episode of the O’Reilly Data Show Podcast. (Full disclosure: I’m an advisor to Databricks). We talked about how he came to join the UC Berkeley AMPLab, the current state of Spark ecosystem components, Spark’s future roadmap, and interesting applications built on top of Spark.
User-driven from inception
From the beginning, Spark struck me as different from other academic research projects (many of which “wither away” when grad students leave). The AMPLab team behind Spark spoke at local SF Bay Area meetups, hosted two-day events (AMP Camp), and worked hard to help early users. That mindset continues to this day. Wendell explained:
We were trying to work with the early users of Spark, getting feedback on what issues it had and what types of problems they were trying to solve with Spark, and then use that to influence the roadmap. It was definitely a more informal process, but from the very beginning, we were expressly user-driven in the way we thought about building Spark, which is quite different than a lot of other open source projects. We never really built it for our own use — it was not like we were at a company solving a problem and then we decided, “hey let’s let other people use this code for free”. … From the beginning, we were focused on empowering other people and building platforms for other developers, so I always thought that was quite unique about Spark.