Apache Spark’s journey from academia to industry

[A version of this post appears on the O’Reilly Radar blog.]

Three projects from UC Berkeley’s AMPLab have been widely adopted by industry: Apache Mesos, Apache Spark, and Tachyon. As an early user, it’s been fun to watch Spark go from an academic lab to the most active open source project in big data. In my recent travels, I’ve met Spark users from companies of all sizes and from many industries. I’ve also spoken with companies that came of age before Spark was available or mature enough, and many are replacing homegrown tools with Spark. (Full disclosure: I’m an advisor to Databricks, a start-up commercializing Apache Spark.)

Subscribe to the O’Reilly Data Show Podcast

iTunes, SoundCloud, RSS

A few months ago, I spoke with UC Berkeley Professor and Databricks CEO Ion Stoica about the early days of Spark and the Berkeley Data Analytics Stack. Ion noted that by the time his students began work on Spark and Mesos, his experience at his other start-up Conviva had already informed some of the design choices:

“Actually, this story started back in 2009, and it started with a different project, Mesos. It was a class project in a class I taught in the spring of 2009: to build a cluster management system able to support the multiple cluster computing frameworks of that time, like Hadoop, MPI, and others, so they could share the same cluster and the data in that cluster. Pretty soon after that, we thought about what to build on top of Mesos, and that was Spark. Initially, we wanted to demonstrate that it was actually easier to build a new framework from scratch on top of Mesos, and of course we also wanted it to be special. So, we targeted workloads for which Hadoop at that time was not good enough. Hadoop was targeting batch computation, so we targeted interactive queries and iterative computation, like machine learning.

Continue reading
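To make that contrast concrete, here is a minimal PySpark sketch of the iterative pattern Stoica describes. The data path and the toy least-squares loop are my own placeholders, not from the interview; the point is that the working set is cached in memory once and reused on every pass, instead of being re-read from disk each iteration as in batch MapReduce.

```python
from pyspark import SparkContext

sc = SparkContext(appName="iterative-sketch")

# Each record holds "x y" pairs; "points.txt" is a placeholder path.
points = sc.textFile("points.txt") \
           .map(lambda line: tuple(float(v) for v in line.split())) \
           .cache()  # keep the working set in memory across iterations

w = 0.0  # single weight for a toy least-squares fit, y ~ w * x
for _ in range(10):
    # Each pass reuses the cached RDD rather than re-reading input from disk.
    gradient = points.map(lambda p: (w * p[0] - p[1]) * p[0]).mean()
    w -= 0.1 * gradient

print("fitted weight:", w)
```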

Clustering bitcoin accounts using heuristics

[A version of this post appears on the O’Reilly Radar blog.]

Editor’s note: we’ll explore present and future applications of cryptocurrency and blockchain technologies at our upcoming Radar Summit: Bitcoin & the Blockchain on Jan. 27, 2015, in San Francisco.

A few data scientists are starting to play around with cryptocurrency data, and as bitcoin and related technologies start gaining traction, I expect more to wade in. As the space matures, there will be many interesting applications based on analytics over the transaction data produced by these technologies. The blockchain — the distributed ledger that contains all bitcoin transactions — is publicly available, and the underlying data set is of modest size. Data scientists can work with this data once it’s loaded into familiar data structures, but producing insights requires some domain knowledge and expertise.

Subscribe to the O’Reilly Data Show Podcast

iTunes, SoundCloud, RSS

I recently spoke with Sarah Meiklejohn, a lecturer at UCL, and an expert on computer security and cryptocurrencies. She was part of an academic research team that studied pseudo-anonymity (“pseudonymity”) in bitcoin. In particular, they used transaction data to compare “potential” anonymity to the “actual” anonymity achieved by users. A bitcoin user can use many different public keys, but careful research led to a few heuristics that allowed them to cluster addresses belonging to the same user:

“In theory, a user can go by many different pseudonyms. If that user is careful and keeps the activity of those different pseudonyms separate, completely distinct from one another, then they can really maintain a level of, maybe not anonymity, but again, cryptographically it’s called pseudo-anonymity. So, if they are a legitimate businessman on the one hand, they can use a certain set of pseudonyms for that activity, and then if they are dealing drugs on Silk Road, they might use a completely different set of pseudonyms for that, and you wouldn’t be able to tell that that’s the same user.
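One of the best-known heuristics from this line of research is the multi-input heuristic: addresses that appear together as inputs to a single transaction are assumed to be controlled by the same user. Below is a toy sketch of that idea using union-find; the transaction data is made up, and real analyses layer on further heuristics (such as change-address detection) over far larger graphs.

```python
# Toy illustration of the multi-input clustering heuristic: addresses spent
# together as inputs to one transaction are merged into the same cluster.

def find(parent, a):
    """Follow parent pointers to the cluster root, compressing the path."""
    while parent[a] != a:
        parent[a] = parent[parent[a]]
        a = parent[a]
    return a

def cluster(transactions):
    """transactions: list of input-address lists, one list per transaction."""
    parent = {}
    for inputs in transactions:
        for addr in inputs:
            parent.setdefault(addr, addr)
        # Union every input address with the first one in the transaction.
        for addr in inputs[1:]:
            parent[find(parent, inputs[0])] = find(parent, addr)
    groups = {}
    for addr in parent:
        groups.setdefault(find(parent, addr), set()).add(addr)
    return list(groups.values())

# Hypothetical transactions: tx1 spends addr1+addr2, tx2 spends addr2+addr3.
txs = [["addr1", "addr2"], ["addr2", "addr3"], ["addr4"]]
print(cluster(txs))  # -> [{'addr1', 'addr2', 'addr3'}, {'addr4'}]
```

Because addr2 appears as an input in both of the first two transactions, all three of those addresses collapse into one cluster, even though their owner never linked them explicitly.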

Continue reading

Regulation and decentralization: Defending the blockchain

[A version of this post appears on the O’Reilly Radar blog.]

Editor’s note: our O’Reilly Radar Summit: Bitcoin & the Blockchain will take place on January 27, 2015, at Fort Mason in San Francisco. Andreas Antonopoulos, Vitalik Buterin, Naval Ravikant, and Bill Janeway are but a few of the confirmed speakers for the event. Learn more about the event and reserve your ticket here.

We recently announced a Radar summit on present and future applications of cryptocurrencies and blockchain technologies. In a webcast presentation, one of our program chairs, Kieren James-Lubin, observed that we’re very much in the early days of these technologies. He also noted that the technologies are complex enough that most users will rely on service providers (like wallets) to securely store, transfer, and receive cryptocurrencies.

As some of these service providers reach a certain scale, they will start coming under the scrutiny of regulators. Certain tenets are likely to remain: currencies require continuous liquidity, and large financial institutions need access to the lender of last resort.

There are also cultural norms that take time to change. Take the example of notaries, whose services seem amenable to being replaced by blockchain technologies. Such a wholesale change would entail adjusting rules and norms across localities, which means going up against the lobbying efforts of established incumbents.

One way to sway regulators and skeptics is to point out that the decentralized nature of the (bitcoin) blockchain can unlock innovation in financial services and other industries. Mastering Bitcoin author Andreas Antonopoulos did a masterful job highlighting this in his recent testimony before the Canadian Senate:

“Traditional models for financial payment networks and banking rely on centralized control in order to provide security. The architecture of a traditional financial network is built around a central authority, such as a clearinghouse. As a result, security and authority have to be vested in that central actor. The resulting security model looks like a series of concentric circles with very limited access to the center and increasing access as we move farther away from the center. However, even the outermost circle cannot afford open access.

Continue reading

Bitcoin and the Future of Money

I’ll be hosting a free webcast featuring Andreas Antonopoulos this Wednesday. Author of the new book Mastering Bitcoin, Andreas has emerged as one of the most popular & eloquent proponents of cryptocurrencies and related technologies:

Bitcoin technology is taking the world of finance by storm. Bitcoin and the blockchain technology that is at its core can be used to quickly build secure global financial services on an open and decentralized platform. Join this webcast to learn what bitcoin is, what makes it special, how to get it and how to use it.

For more, come to Bitcoin & the Blockchain: An O’Reilly Radar Summit, January 27, 2015, at Fort Mason in San Francisco.

Hardcore Data Science day: Strata+Hadoop World 2015

My co-organizer Ben Recht and I are proud to announce the return of Hardcore Data Science day to Strata+Hadoop World in California. We have outstanding speakers – 11 talks in total – and I expect the track to sell out (as it has done in the past).

  • Deep Learning enthusiasts will enjoy sessions on its application to speech (Tara Sainath) and vision (Fei-Fei Li).
  • One of the most eminent researchers in machine learning, Michael Jordan, is giving a talk on statistical decision theory & big data. He recently participated in a reddit Ask Me Anything session and was profiled by IEEE Spectrum (his reaction to that piece is here).
  • Machine learning: Maya Gupta of Google is giving a talk on interpretable & robust models, Anima Anandkumar (of UC Irvine) will discuss the use of tensors for ML, and John Canny (of UC Berkeley) will describe the new BIDMach toolkit.
  • Applications: David Andrzejewski (of SumoLogic) will examine the use of Graph Mining techniques for machine data, Eamonn Keogh (of UC Riverside) will survey methods for mining large-scale time-series, and Chris Re (of Stanford) will talk about recent applications of the DeepDive knowledge base framework.
  • John Myles White will explain why data scientists should consider the Julia programming language, and Alyosha Efros will outline recent progress in Visual Data Mining techniques.

Reserve your spot and sign up soon!

Building Apache Kafka from scratch

[A version of this post originally appeared on the O’Reilly Radar blog.]

In this episode of the O’Reilly Data Show Podcast, Jay Kreps talks about data integration, event data, and the Internet of Things.

At the heart of big data platforms are robust data flows that connect diverse data sources. Over the past few years, a new set of (mostly open source) software components has become critical to tackling data integration problems at scale. By now, many people have heard of tools like Hadoop, Spark, and NoSQL databases, but there are a number of lesser-known components that are “hidden” beneath the surface.

In my conversations with data engineers tasked with building data platforms, one tool stands out: Apache Kafka, a distributed messaging system that originated from LinkedIn. It’s used to synchronize data between systems and has emerged as an important component in real-time analytics.
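To give a flavor of how Kafka fits into such pipelines, here is a minimal sketch using the kafka-python client: a producer publishes event records to a topic, and a consumer reads them back. The broker address and topic name are placeholders, and production deployments add partitioning, replication, and consumer groups on top of this.

```python
# Minimal Kafka producer/consumer sketch using the kafka-python client.
# "localhost:9092" and the "events" topic are placeholder values.
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
# Publish a small JSON event; Kafka treats each message as opaque bytes.
producer.send("events", b'{"user": "alice", "action": "click"}')
producer.flush()  # block until the message is actually sent

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # start from the beginning of the topic
)
for message in consumer:
    print(message.value)  # downstream systems would process each event here
    break
```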

Subscribe to the O’Reilly Data Show Podcast

iTunes, SoundCloud, RSS

In my travels over the past year, I’ve met engineers across many industries who use Apache Kafka in production. A few months ago, I sat down with O’Reilly author and Radar contributor Jay Kreps, a highly regarded data engineer, former technical lead for Online Data Infrastructure at LinkedIn, and most recently CEO and co-founder of Confluent.

Continue reading

Decoding bitcoin and the blockchain

[A version of this post originally appeared on the O’Reilly Radar blog.]

When the creators of bitcoin solved the “double spend” problem in a decentralized manner, they introduced techniques that have implications far beyond digital currency. Our newly announced one-day event — Bitcoin & the Blockchain: An O’Reilly Radar Summit — is in line with our tradition of highlighting applications of developments in computer science. Financial services have long relied on centralized solutions, so in many ways, products from this sector have become canonical examples of the developments we plan to cover over the next few months. But many problems that require an intermediary are being reexamined with techniques developed for bitcoin.

How do you get multiple parties in a transaction to trust each other without an intermediary? In the case of a digital currency like bitcoin, decentralization means reaching consensus over an insecure network. As Mastering Bitcoin author Andreas Antonopoulos noted in an earlier post, several innovations lie at the heart of what makes bitcoin disruptive:

“Bitcoin is a combination of several innovations, arranged in a novel way: a peer-to-peer network, a proof-of-work algorithm, a distributed timestamped accounting ledger, and elliptic-curve cryptography and key infrastructure. Each of these parts is novel on its own, but the combination and specific arrangement was revolutionary for its time and is beginning to show up in more innovations outside bitcoin itself.”
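As a rough illustration of the proof-of-work idea Antonopoulos mentions, the toy Python sketch below searches for a nonce that makes a block's hash begin with a required number of zero digits. Real bitcoin mining hashes a binary block header with double SHA-256 against a far harder target; this is only meant to convey the mechanism.

```python
import hashlib

def proof_of_work(block_data, difficulty):
    """Find a nonce whose SHA-256 hash has `difficulty` leading zero hex digits."""
    target = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{block_data}|{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce, digest  # expensive to find, trivial to verify
        nonce += 1

# Hypothetical block contents; difficulty=4 finishes in well under a second.
nonce, digest = proof_of_work("alice pays bob 1 BTC", difficulty=4)
print(nonce, digest)
```

Verifying a proposed answer takes a single hash, and that asymmetry (hard to produce, cheap to check) is what lets an open, insecure network agree on a shared ledger without a central authority.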

Continue reading