Scikit-Learn 0.16

I’ll be hosting a webcast featuring two of the key contributors to what is arguably one of the most popular machine learning tools today – scikit-learn:

News from Scikit-Learn 0.16 and Soon-To-Be Gems for the Next Release
presented by: Olivier Grisel, Andreas Mueller

This webcast will review scikit-learn, a widely used open source machine learning library in Python, and discuss some of the new features of the recent 0.16 release. Highlights include new algorithms such as approximate nearest neighbor search, Birch clustering, and a regularization path algorithm for logistic regression; probability calibration; and improved ease of use and interoperability with the pandas library. We will also highlight some up-and-coming contributions, such as Latent Dirichlet Allocation, supervised neural networks, and a complete revamping of the Gaussian Process module.
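As a rough illustration (mine, not material from the webcast itself), here is a minimal sketch of two of the 0.16 additions, probability calibration via `CalibratedClassifierCV` and `Birch` clustering, run on synthetic data:

```python
# A minimal sketch of two scikit-learn 0.16 additions on synthetic data;
# parameter values are illustrative, not recommendations.
from sklearn.datasets import make_classification, make_blobs
from sklearn.naive_bayes import GaussianNB
from sklearn.calibration import CalibratedClassifierCV   # new in 0.16
from sklearn.cluster import Birch                        # new in 0.16

# Probability calibration: wrap any classifier to get better-calibrated
# predict_proba outputs (Platt-style sigmoid or isotonic regression).
X, y = make_classification(n_samples=1000, random_state=0)
clf = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=3)
clf.fit(X, y)
print(clf.predict_proba(X[:5]))

# Birch: memory-efficient clustering built on a clustering-feature tree,
# suited to large data sets.
Xb, _ = make_blobs(n_samples=1000, centers=3, random_state=0)
labels = Birch(n_clusters=3).fit_predict(Xb)
print(labels[:10])
```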

In addition, Olivier will be leading what promises to be a popular tutorial at Strata+Hadoop World in London in early May.


Redefining power distribution using big data

[A version of this post appears on the O’Reilly Radar blog.]

The O’Reilly Data Show Podcast: Erich Nachbar on testing and deploying open source, distributed computing components.

When I first hear of a new open source project that might help me solve a problem, I start by asking around to see if any of my friends have tested it. Sometimes, however, the early descriptions sound so promising that I just jump right in and try it myself, and in a few cases I adopt it immediately (this was certainly the case for Spark).

I recently had a conversation with Erich Nachbar, founder and CTO of Virtual Power Systems, and one of the earliest adopters of Spark. In the early days of Spark, Nachbar was CTO of Quantifind, a startup often cited by the creators of Spark as one of the first “production deployments.” On the latest episode of the O’Reilly Data Show Podcast, we talk about the ease with which Nachbar integrates new open source components into existing infrastructure, his contributions to Mesos, and his new “software-defined power distribution” startup.

Ecosystem of open source big data technologies

When evaluating a new software component, nothing beats testing it against workloads that mimic your own. Nachbar has had the luxury of working in organizations where introducing new components isn’t subject to multiple levels of decision-making. But, as he notes, everything starts with testing things for yourself:

“I have sort of my mini test suite…If it’s a data store, I would just essentially hook it up to something that’s readily available, some feed like a Twitter fire hose, and then just let it be bombarded with data, and by now, it’s my simple benchmark to know what is acceptable and what isn’t for the machine…I think if more people, instead of reading papers and paying people to tell them how good or bad things are, would actually set aside a day and try it, I think they would learn a lot more about the system than just reading about it and theorizing about the system.” Continue reading

Apache Spark 1.3, the new DataFrame API, and Spark performance

Over the course of a week, I’ll be hosting two webcasts featuring Spark release manager Patrick Wendell and Spark committer Kay Ousterhout. Register now!

  • Patrick Wendell: Spark 1.3 and Spark’s New DataFrame API (March 25th at 9 a.m. California time)

    In this webcast, Patrick Wendell from Databricks will be speaking about Spark’s new 1.3 release. Spark 1.3 brings extensions to all of Spark’s major components (SQL, MLlib, Streaming) along with a new cross-cutting DataFrame API. The talk will outline what’s new in Spark 1.3 and provide a deep dive on the DataFrame feature; for a quick taste, see the PySpark sketch after this list. We’ll leave plenty of time for Q&A about the release or about Spark in general.

  • Kay Ousterhout: Making Sense of Spark Performance (April 1st at 9 a.m. California time)

    There has been significant work dedicated to improving the performance of big-data systems like Spark, but comparatively little effort has been spent systematically analyzing the performance bottlenecks of these systems. In this talk, I’ll take a deep dive into Spark’s performance on two benchmarks (TPC-DS and the Big Data Benchmark from UC Berkeley) and one production workload and demonstrate that many commonly-held beliefs about performance bottlenecks do not hold. In particular, I’ll demonstrate that CPU (and not I/O) is often the bottleneck, that network performance can improve job completion time by a median of at most 4%, and that the causes of most stragglers can be identified and fixed. I’ll also demo how the open-source tools I developed can be used to understand performance of other Spark jobs.
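Ahead of Patrick’s webcast, here is a minimal sketch of the new DataFrame API in Python (PySpark), assuming a local Spark 1.3 installation; the toy data and column names are made up for illustration:

```python
# A minimal Spark 1.3 DataFrame sketch in Python (PySpark), run locally;
# the tiny data set and column names are invented for illustration.
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext("local", "dataframe-sketch")
sqlContext = SQLContext(sc)

# Build a DataFrame from an RDD of Rows.
rows = sc.parallelize([Row(dept="eng", salary=100),
                       Row(dept="eng", salary=120),
                       Row(dept="sales", salary=90)])
df = sqlContext.createDataFrame(rows)

# Declarative, optimizable operations instead of raw RDD transformations.
df.filter(df.salary > 95).select("dept", "salary").show()
df.groupBy("dept").avg("salary").show()
```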

Let’s build open source tensor libraries for data science

[A version of this post appears on the O’Reilly Radar blog.]

Tensor methods for machine learning are fast, accurate, and scalable, but we’ll need well-developed libraries.

Data scientists frequently find themselves dealing with high-dimensional feature spaces. For example, text mining usually involves vocabularies comprising 10,000+ distinct words. Many analytic problems involve linear algebra, particularly 2D matrix factorization techniques, for which several open source implementations are available. Anyone implementing machine learning algorithms ends up needing a good library for matrix analysis and operations.

But why stop at 2D representations? In a recent Strata + Hadoop World San Jose presentation, UC Irvine professor Anima Anandkumar described how techniques developed for higher-dimensional arrays can be applied to machine learning. Tensors are generalizations of matrices that let you look beyond pairwise relationships to higher-dimensional models (a matrix is a second-order tensor). For instance, one can examine patterns between any three (or more) dimensions in data sets. In a text mining application, this leads to models that incorporate the co-occurrence of three or more words, and in social networks, you can use tensors to encode arbitrary degrees of influence (e.g., “friend of friend of friend” of a user).
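As a toy illustration of the text mining case (mine, not an example from Anandkumar’s talk), here is how one might build a third-order word co-occurrence tensor with NumPy; the tiny “corpus” is made up:

```python
# Toy example: a third-order co-occurrence tensor T, where T[i, j, k]
# counts how often words i, j, and k appear together in a document.
# The corpus and vocabulary are invented for illustration.
import numpy as np
from itertools import permutations

docs = [["data", "tensor", "library"],
        ["data", "science", "tensor"],
        ["tensor", "library", "science"]]
vocab = sorted({w for doc in docs for w in doc})
index = {w: i for i, w in enumerate(vocab)}

n = len(vocab)
T = np.zeros((n, n, n))
for doc in docs:
    for i, j, k in permutations([index[w] for w in doc], 3):
        T[i, j, k] += 1

# Collapsing one mode recovers the familiar pairwise (second-order) view:
# a plain co-occurrence matrix.
M = T.sum(axis=2)
```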

Being able to capture higher-order relationships proves to be quite useful. In her talk, Anandkumar described applications to latent variable models — including text mining (topic models), information science (social network analysis), recommender systems, and deep neural networks. A natural entry point for applications is to look at generalizations of matrix (2D) techniques to higher-dimensional arrays. Continue reading
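One concrete sense in which 2D techniques generalize: a tensor can be “unfolded” (matricized) along each mode, after which familiar matrix tools such as the SVD apply. A rough NumPy sketch, using one common unfolding convention (again mine, not code from the talk):

```python
# Rough sketch: unfold (matricize) a third-order tensor along mode 0,
# then apply an ordinary matrix technique (here, the SVD).
import numpy as np

T = np.random.rand(4, 5, 6)        # a 4 x 5 x 6 third-order tensor
T0 = T.reshape(4, -1)              # mode-0 unfolding: a 4 x 30 matrix
U, s, Vt = np.linalg.svd(T0, full_matrices=False)
print(U.shape, s.shape, Vt.shape)  # (4, 4) (4,) (4, 30)
```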

Turning Ph.D.s into industrial data scientists and data engineers

[A version of this post appears on the O’Reilly Radar blog.]

Editor’s note: The ASI will offer a two-day intensive course, Practical Machine Learning, at Strata + Hadoop World in London in May.

Back when I was considering leaving academia, the popular exit route was financial engineering. Many science and engineering Ph.D.s ended up in big Wall Street banks; I chose to become the lead quant at a small hedge fund. It was a natural choice for many of us: financial engineering was topically close to my academic interests, and working with traders meant access to resources and interesting problems.

Today, there are many more options for people with science and engineering doctorates. A few organizations take science and engineering Ph.D.s, and over the course of 8-12 weeks, prepare them to join the ranks of industrial data scientists and data engineers.

I recently sat down with Angie Ma, co-founder and president of ASI, a London startup that runs a carefully structured “finishing school” for science and engineering doctorates. We talked about how Angie and her co-founders (all ex-physicists) arrived at the concept of the ASI, the structure of their training programs, and the data and startup scene in the UK. [Full disclosure: I’m an advisor to the ASI.] Continue reading