A framework for building and evaluating data products

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Pinterest data scientist Grace Huang on lessons learned in the course of machine learning product launches.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data, data science, and AI. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS.

In this episode of the Data Show, I spoke with Grace Huang, data science lead at Pinterest. With its combination of a large social graph, enthusiastic users, and multimedia data, I’ve long regarded Pinterest as a fascinating lab for data science. Huang described the challenge of building a sustainable content ecosystem and shared lessons from the front lines of machine learning product launches. We also discussed recommenders, the emergence of deep learning as a technique used within Pinterest, and the role of data science within the company.

Here are some highlights from our conversation:
Continue reading “A framework for building and evaluating data products”

The evolution of GraphLab

[A version of this post appears on the O’Reilly Radar blog.]

Editor’s note: Carlos Guestrin will be part of the team teaching Large-scale Machine Learning Day at Strata + Hadoop World in San Jose. Visit the Strata + Hadoop World website for more information on the program.

I only really started playing around with GraphLab when the companion project GraphChi came onto the scene. By then I’d heard from many avid users and admired how their user conference instantly became a popular San Francisco Bay Area data science event. For this podcast episode, I sat down with Carlos Guestrin, co-founder/CEO of Dato, a start-up launched by the creators of GraphLab. We talked about the early days of GraphLab, the evolution of GraphLab Create, and what he’s learned from starting a company.

MATLAB for graphs

Guestrin remains a professor of computer science at the University of Washington, and GraphLab originated when he was still a faculty member at Carnegie Mellon. GraphLab was built by avid MATLAB users who needed to do large scale graphical computations to demonstrate their research results. Guestrin shared some of the backstory:

“I was a professor at Carnegie Mellon for about eight years before I moved to Seattle. A couple of my students, Joey Gonzalez and Yucheng Low, were working on large scale distributed machine learning algorithms specially with things called graphical models. We tried to implement them to show off the theorems that we had proven. We tried to run those things on top of Hadoop and it was really slow. We ended up writing those algorithms on top of MPI which is a high performance computing library and it was just a pain. It took a long time and it was hard to reproduce the results and the impact it had on us is that writing papers became a pain. We wanted a system for my lab that allowed us to write more papers more quickly. That was the goal. In other words so they could implement these machine learning algorithms more easily, more quickly specifically on graph data which is what we focused on.”

Continue reading “The evolution of GraphLab”

What’s New in Scikit-learn 0.15

Python has emerged as one of the more popular languages for doing data science. The primary reason is the impressive array of tools (the “PyData” stack) available for addressing many stages of data science pipelines. One of the most popular PyData tools is scikit-learn, an easy-to-use and highly efficient machine learning library.

I’ve written about why I like to recommend scikit-learn so I won’t repeat myself here. Next week I’ll be hosting a FREE webcast featuring one of the most popular teachers and speakers in the PyData community, scikit-learn committer Olivier Grisel:

This webcast will introduce scikit-learn, an Open Source project for Machine Learning in Python and review some new features from the recent 0.15 release such as faster randomized ensemble of decision trees and optimization for the memory usage when working on multiple cores.

We will also review on-going work part of the 2014 edition of the Google Summer of Code: neural networks, extreme learning machines, improvements for linear models, and approximate nearest neighbor search with locality-sensitive hashing.

PredictionIO: an open source machine learning server

PredictionIO, a startup that produces an open source machine learning server, has raised a seed round of $2.5M. The company’s engine allows developers to quickly integrate machine learning into products and services. The server is open source and is available on Amazon Web Services. By releasing it as an open source package, the company hopes to attract developers who are interested in “Machine Learning As A Service” but are wary of proprietary solutions.

Machine learning solution providers have traditionally highlighted their suite of algorithms. As I noted in an earlier post, there are different criteria for choosing machine learning algorithms (simplicity, interpretability, speed, scalability, and accuracy). Recently, some companies have begun to highlight tools for managing the analytic lifecycle (deploy/monitor/maintain models).

PredictionIO joins a group of startups (including Wise.io, BigML, Skytree, GraphLab) that develop tools that make it easier for companies to build and deploy (scalable) analytic models. The company is hoping that an open source server will prove much more attractive to developers and companies. I personally love open source tools, but I think the jury is out on this matter. Particularly for analytics, many large companies are willing to pay for proprietary solutions as long as they meet their needs, and are easy to use and deploy.

Analytics and machine learning are important components of most data applications. But data applications require piecing together many other tools into a coherent pipeline (e.g., visualization & interactive analytics, ML & analytics, data wrangling & (realtime) data processing). The recently announced Databricks Cloud has garnered attention precisely because it pulls together many important components into an accessible and massively scalable (distributed computing) platform.

[Full disclosure: I’m an advisor to Databricks.]

Related content:

  • Gaining access to the best machine-learning methods
  • Data scientists tackle the analytic lifecycle

Instrumenting collaboration tools used in data projects

    Built-in audit trails can be useful for reproducing and debugging complex data analysis projects

    [A version of this post appears on the O’Reilly Data blog.]

    As I noted in a previous post, model building is just one component of the analytic lifecycle. Many analytic projects result in models that get deployed in production environments. Moreover, companies are beginning to treat analytics as mission-critical software and have real-time dashboards to track model performance.
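As a rough illustration of what tracking model performance can involve, here is a minimal sketch of a rolling-accuracy monitor. The class name `RollingAccuracyMonitor`, the window size, and the threshold are all hypothetical choices made for this example; they do not describe any particular company's dashboard.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Track accuracy over the most recent `window` predictions (illustrative only)."""

    def __init__(self, window=100, threshold=0.8):
        # Each outcome is 1 for a correct prediction and 0 otherwise;
        # deque(maxlen=...) silently drops the oldest outcome when full.
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def record(self, predicted, actual):
        self.outcomes.append(1 if predicted == actual else 0)

    def accuracy(self):
        if not self.outcomes:
            return None  # nothing observed yet
        return sum(self.outcomes) / len(self.outcomes)

    def is_underperforming(self):
        acc = self.accuracy()
        return acc is not None and acc < self.threshold


# Feed in (predicted, actual) pairs as ground truth becomes available.
monitor = RollingAccuracyMonitor(window=4, threshold=0.75)
for predicted, actual in [(1, 1), (0, 0), (1, 0), (0, 1), (1, 0)]:
    monitor.record(predicted, actual)

print(monitor.accuracy(), monitor.is_underperforming())
```

A real system would likely track several metrics and feed them to a dashboard, but the idea is the same: compare recent behavior against an agreed threshold and raise a flag when it slips.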

    Once a model is deemed to be underperforming or misbehaving, diagnostic tools are needed to help determine appropriate fixes. It could well be that models need to be revisited and updated, but there are instances when underlying data sources (1) and data pipelines are what need to be fixed. Beyond the formal systems put in place specifically for monitoring analytic products, tools for reproducing data science workflows could come in handy.

    Version control systems are useful, but appeal primarily to developers. The recent wave of data products comes with collaboration features that target a broader user base. Properly instrumented, collaboration tools are also useful for reproducing and debugging complex data analysis projects. As an example, Alpine Data records all the actions made while working on a data project: a screen displays all recent “actions and changes” and team members can choose to leave comments or questions.

    If you’re a tool builder charged with baking in collaboration, consider how best to expose activity logs as well. Properly crafted “audit trails” can be very useful for uncovering and fixing problems that arise once a model gets deployed in production.
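To make the idea of an exposed activity log concrete, here is a minimal sketch of an append-only audit trail. Everything in it (the `ActivityLog` class, its field names, the sample users and targets) is hypothetical and invented for illustration; it is not how Alpine Data or any other product implements this feature.

```python
import json
import time


class ActivityLog:
    """Append-only record of who did what to which project artifact (illustrative only)."""

    def __init__(self):
        self._entries = []

    def record(self, user, action, target, comment=None):
        entry = {
            "seq": len(self._entries) + 1,  # monotonically increasing order of events
            "timestamp": time.time(),
            "user": user,
            "action": action,
            "target": target,
            "comment": comment,
        }
        self._entries.append(entry)
        return entry

    def history(self, target=None):
        # Return copies so callers cannot rewrite past entries.
        return [dict(e) for e in self._entries
                if target is None or e["target"] == target]

    def to_json_lines(self):
        # One JSON object per line, convenient for shipping to other tools.
        return "\n".join(json.dumps(e, sort_keys=True) for e in self._entries)


log = ActivityLog()
log.record("ana", "changed_filter", "churn_model/training_data",
           comment="Dropped accounts younger than 30 days")
log.record("ben", "retrained_model", "churn_model")
log.record("ana", "added_column", "churn_model/training_data")

for entry in log.history("churn_model/training_data"):
    print(entry["seq"], entry["user"], entry["action"])
```

Filtering the history by target is the piece that helps with debugging: when a deployed model misbehaves, the sequence of changes to its inputs can be replayed in order.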

    Alpine Chorus: audit trail

    (1) Models can be on the receiving end of bad data or the victim of attacks from adversaries.

    Business analysts want access to advanced analytics

    [A version of this post appears on the O’Reilly Data blog and Forbes.]

    I talk with many new companies that build tools for business analysts and other non-technical users. These new tools streamline and simplify important data tasks including interactive analysis (e.g., pivot tables and cohort analysis), interactive visual analysis (as popularized by Tableau and QlikView), and more recently data preparation. Some of the newer tools scale to large data sets, while others explicitly target small to medium-sized data.

    As I noted in a recent post, companies are beginning to build data analysis tools that target non-experts. Companies are betting that as business users start interacting with data, they will want to tackle some problems that require advanced analytics. With business analysts far outnumbering data scientists, it makes sense to offload some problems to non-experts.

    Moreover, the data seems to support the notion that business users are interested in more complex problems. I recently looked at data from 11 large Meetups (in NYC and the SF Bay Area) that target business analysts and business intelligence users. Altogether these Meetups had close to 5,000 active members. As you can see in the chart below, business users are interested in topics like machine learning (1 in 5), predictive analytics (1 in 4), and data mining (1 in 4):

    Key topics of interest: Active members of SF & NYC meetups for business analysts

    Continue reading “Business analysts want access to advanced analytics”

    Six reasons why I recommend scikit-learn

    [A version of this post appears on the O’Reilly Data blog.]

    I use a variety of tools for advanced analytics; most recently I’ve been using Spark (and MLlib), R, scikit-learn, and GraphLab. When I need to get something done quickly, I’ve been turning to scikit-learn for my first pass analysis. For access to high-quality, easy-to-use implementations of popular algorithms, scikit-learn is a great place to start. So much so that I often encourage new and seasoned data scientists to try it whenever they’re faced with analytics projects that have short deadlines.

    I recently spent a few hours with one of scikit-learn’s core contributors, Olivier Grisel. We had a free-flowing discussion where we talked about machine-learning, data science, programming languages, big data, Paris, and … scikit-learn! Along the way, I was reminded why I’ve come to use (and admire) the scikit-learn project.

    Commitment to documentation and usability
    One of the reasons I started using scikit-learn was because of its nice documentation (which I hold up as an example for other communities and projects to emulate). Contributions to scikit-learn are required to include narrative examples along with sample scripts that run on small data sets. Besides good documentation there are other core tenets that guide the community’s overall commitment to quality and usability: the global API is safeguarded, all public APIs are well documented, and when appropriate, contributors are encouraged to expand the coverage of unit tests.

    Models are chosen and implemented by a dedicated team of experts
    scikit-learn’s stable of contributors includes experts in machine-learning and software development. A few of them (including Olivier) are able to devote a portion of their professional working hours to the project.

    Covers most machine-learning tasks
    Scan the list of things available in scikit-learn and you quickly realize that it includes tools for many of the standard machine-learning tasks (such as clustering, classification, regression, etc.). And since scikit-learn is developed by a large community of developers and machine-learning experts, promising new techniques tend to be included in fairly short order.
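As a small illustration of that breadth, the sketch below runs a classification task and a clustering task on the same bundled data set using the same fit/predict style of interface. It assumes a recent scikit-learn release (for example, `train_test_split` is imported from `sklearn.model_selection`, which is not where it lived at the time of the 0.15 release), and the choice of models and parameters is arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small data set that ships with scikit-learn: 150 flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Classification: fit on labeled training data, score on held-out data.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)

# Clustering: same style of interface, but no labels are used.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = km.fit_predict(X)

print(f"held-out accuracy: {accuracy:.2f}")
print(f"cluster assignments for first five rows: {cluster_labels[:5]}")
```

Swapping in a different estimator usually means changing only the import and the constructor line, which is a large part of why the library works well for a quick first pass.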

    Because scikit-learn is a curated library, users don’t have to choose from multiple competing implementations of the same algorithm (a problem that R users often face). To assist users who struggle to choose between different models, Andreas Muller created a simple flowchart:

    Continue reading “Six reasons why I recommend scikit-learn”