Scikit-Learn 0.16

I’ll be hosting a webcast featuring two of the key contributors to what is arguably one of the most popular machine learning tools today – scikit-learn:

News from Scikit-Learn 0.16 and Soon-To-Be Gems for the Next Release
presented by: Olivier Grisel, Andreas Mueller

This webcast will review scikit-learn, a widely used open source machine learning library in Python, and discuss some of the new features of the recent 0.16 release. Highlights of the release include new algorithms such as approximate nearest neighbors search, Birch clustering, and a path algorithm for logistic regression; probability calibration; and improved ease of use and interoperability with the pandas library. We will also highlight some up-and-coming contributions, such as Latent Dirichlet Allocation, supervised neural networks, and a complete revamp of the Gaussian Process module.
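Additions like Birch clustering follow scikit-learn's usual estimator API, so they slot into existing code with no new idioms. A minimal sketch (the toy data here is my own, not from the webcast):

```python
import numpy as np
from sklearn.cluster import Birch

# Two well-separated blobs of toy 2-D points
X = np.array([[1.0, 1.0], [1.2, 0.8],
              [8.0, 8.0], [8.2, 7.9]])

# Birch builds a clustering-feature tree, then groups the
# subclusters into the requested number of final clusters
model = Birch(n_clusters=2)
labels = model.fit_predict(X)
# the two points near (1, 1) share a label, as do the two near (8, 8)
```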

In addition, Olivier will be leading what promises to be a popular tutorial at Strata+Hadoop World in London in early May.

scikit-learn webcast and tutorial

Bits from the Data Store

Semi-regular field notes from the world of data:

  • Michael Jordan (“ask me anything”): The distinguished machine learning and Bayesian researcher from UC Berkeley’s AMPLab has an interesting perspective on machine learning and statistics.

    … while I do think of neural networks as one important tool in the toolbox, I find myself surprisingly rarely going to that tool when I’m consulting out in industry. I find that industry people are often looking to solve a range of other problems, often not involving “pattern recognition” problems of the kind I associate with neural networks. E.g., (1) How can I build and serve models within a certain time budget so that I get answers with a desired level of accuracy, no matter how much data I have? (2) How can I get meaningful error bars or other measures of performance on all of the queries to my database? (3) How do I merge statistical thinking with database thinking (e.g., joins) so that I can clean data effectively and merge heterogeneous data sources? (4) How do I visualize data, and in general how do I reduce my data and present my inferences so that humans can understand what’s going on? (5) How can I do diagnostics so that I don’t roll out a system that’s flawed or figure out that an existing system is now broken? (6) How do I deal with non-stationarity? (7) How do I do some targeted experiments, merged with my huge existing datasets, so that I can assert that some variables have a causal effect?

  • knitr: An R package for dynamic report generation. Among other things, it lets you embed R code within Markdown and LaTeX documents.
  • Computer Security and Data Science: A nice curated collection of papers on topics such as intrusion detection, anomaly detection, Internet scale data collection, malware analysis, and intrusion/breach reports.
  • SociaLite: A distributed query language for large-scale graph analysis that targets Python users. Data is stored in (in-memory) tables, and programming logic is expressed in rules that, judging from the Quick Start code, aren’t as easy to grok as SQL.

  • Upcoming Webcasts:


  • Alibaba ♥ Spark: Next time someone asks you if Apache Spark scales, point them to this recent post by Chinese e-commerce juggernaut Alibaba. What particularly caught my eye is the company’s heavy usage of GraphX, Spark’s library for graph analytics.
    [Full disclosure: I’m an advisor to Databricks, a startup commercializing Apache Spark.]
  • Visual Exploration with yt: Having recently featured FilterGraph, I asked physicists and Pydata luminaries Josh Bloom, Fernando Perez, and Brian Granger if they knew of other visualization tools popular among astronomers. They all recommended yt, which is used for visualizations such as a magnetized white dwarf binary merger. It has roots in astronomy, but the gallery of examples indicates that scientists from many other domains use it too.
  • Narrative Recommendations: When NarrativeScience started out, I thought of it primarily as a platform for generating short, factual stories for (hyperlocal) news services (a newer startup, OnlyBoth, seems to be focused on this; their working example uses box scores to cover college teams). More recently, NarrativeScience has aimed its technology at the lucrative Business Intelligence market. Starting from structured data, NarrativeScience extracts and ranks facts, and weaves a narrative arc that analysts consume. The company retains the traditional elements of BI tools (tables, charts, dashboards) and supplements them with narrative summaries and recommendations. I like the concept of adding narrative outputs, and as with all relatively new technologies, the algorithms and accompanying user interfaces are bound to get better over time. The technology is largely “language” agnostic, but to reap maximum benefit it does need to be tuned for the specific domain you want to use it in.

    With spreadsheets, you have to calculate. With visualizations, you have to interpret. With narratives, all you have to do is read.

    “Future” implementation of NarrativeScience
    Source: founder Kris Hammond’s slides at Cognitive Computing Forum 2014

  • Julia 0.3 has shipped: This promising language just keeps improving. A summer that started with JuliaCon and continued with a steady expansion of libraries ends with a major new release.

  • Upcoming Meetups: SF Bay Area residents can look forward to two interesting Spark meetups this coming week.

    Scaling up Data Frames

    New frameworks for interactive business analysis and advanced analytics fuel the rise in tabular data objects

    [A version of this post appears on the O’Reilly Radar blog.]


    Long before the advent of “big data”, analysts were building models using tools like R (and its forerunners S/S-PLUS). Productivity hinged on tools that made data wrangling, data inspection, and data modeling convenient. Among R users this meant proficiency with Data Frames – objects used to store data matrices that can hold both numeric and categorical data. A data.frame is the data structure consumed by most R analytic libraries.

    But not all data scientists use R, nor is R suitable for all data problems. I’ve been watching with interest the growing number of alternative data structures for business analysis and advanced analytics. These new tools are designed to handle much larger data sets and are frequently optimized for specific problems. And they all use idioms that are familiar to data scientists – either SQL-like expressions or syntax similar to that of the R data.frame or pandas.DataFrame.
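That shared idiom is easy to see in pandas: a single table mixing categorical and numeric columns, queried with SQL-style group-and-aggregate operations. A small illustration (the data here is made up):

```python
import pandas as pd

# A data frame can mix categorical and numeric columns,
# just like an R data.frame
df = pd.DataFrame({
    "city": pd.Categorical(["SF", "NYC", "SF"]),
    "sales": [120.5, 98.0, 143.2],
})

# The group-by/aggregate idiom shared by SQL, R, and pandas
totals = df.groupby("city", observed=True)["sales"].sum()
```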

    Continue reading “Scaling up Data Frames”

    What’s New in Scikit-learn 0.15

    Python has emerged as one of the more popular languages for doing data science. The primary reason is the impressive array of tools (the “Pydata” stack) available for addressing many stages of data science pipelines. One of the most popular Pydata tools is scikit-learn, an easy-to-use and highly efficient machine learning library.

    I’ve written about why I like to recommend scikit-learn so I won’t repeat myself here. Next week I’ll be hosting a FREE webcast featuring one of the most popular teachers and speakers in the Pydata community, scikit-learn committer Olivier Grisel:

    This webcast will introduce scikit-learn, an open source project for machine learning in Python, and review some new features from the recent 0.15 release, such as faster randomized ensembles of decision trees and reduced memory usage when training on multiple cores.
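The ensemble speedups arrive through the familiar estimator API, so callers change nothing beyond version numbers. A minimal sketch on a synthetic dataset of my own choosing, with `n_jobs` requesting parallel training across cores:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification problem
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# n_jobs=2 asks scikit-learn to train trees on two cores in parallel
clf = RandomForestClassifier(n_estimators=50, n_jobs=2, random_state=0)
clf.fit(X, y)
score = clf.score(X, y)  # accuracy on the training set
```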

    We will also review ongoing work from the 2014 edition of the Google Summer of Code: neural networks, extreme learning machines, improvements to linear models, and approximate nearest neighbor search with locality-sensitive hashing.

    Interface Languages and Feature Discovery

    It’s easier to “discover” features with tools that have broad coverage of the data science workflow

    [A version of this post appears on the O’Reilly Data blog and Forbes.]

    Here are a few more observations based on conversations I had during the just concluded Strata Santa Clara conference.

    Interface languages: Python, R, SQL (and Scala)
    This is a great time to be a data scientist or data engineer who relies on Python or R. For starters there are developer tools that simplify setup, package installation, and provide user interfaces designed to boost productivity (RStudio, Continuum, Enthought, Sense).

    Increasingly, Python and R users can write the same code and run it against many different execution engines. Over time the interface languages will remain constant while the execution engines evolve or even get replaced. Specifically, there are now many tools that target Python and R users interested in implementations of algorithms that scale to large data sets (e.g., GraphLab, Adatao, H2O, Skytree, Revolution R). Interfaces for popular engines like Hadoop and Apache Spark are also available – PySpark users can access algorithms in MLlib, and SparkR users can use existing R packages.

    In addition, many of these new frameworks go out of their way to ease the transition for Python and R users. “… bindings follow the Scikit-Learn conventions”, and as I noted in a recent post, with SFrames and Notebooks GraphLab, Inc. built components that are easy for Python users to learn.

    Continue reading “Interface Languages and Feature Discovery”

    Extending GraphLab to tables

    The popular graph analytics framework extends its coverage of the data science workflow

    [A version of this post appears on the O’Reilly Data blog and Forbes.]

    GraphLab’s SFrame, an interesting and somewhat under-the-radar tool, was unveiled at Strata Santa Clara. It is a disk-based, flat table representation that extends GraphLab to tabular data. With the addition of SFrame, users can leverage GraphLab’s many algorithms on data stored as either graphs or tables. More importantly, SFrame increases GraphLab’s coverage of the data science workflow: it allows users with terabyte-sized datasets to clean their data and create new features directly within GraphLab (SFrame performance can scale linearly with the number of available cores).

    The beta version of SFrame can read data from local disk, HDFS, S3 or a URL, and save to a human-readable .csv or a more efficient native format. Once an SFrame is created and saved to disk no reprocessing of the data is needed. Below is Python code that illustrates how to read a .csv file into SFrame, create a new data feature and save it to disk on S3:
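A sketch of what such a snippet might look like against the beta SFrame API (the S3 paths and column names are my own invention, and exact method names may differ across GraphLab versions):

```python
import graphlab as gl

# Read a .csv file -- local paths, HDFS, S3, and plain URLs all work
songs = gl.SFrame.read_csv('s3://my-bucket/songs.csv')

# Create a new feature from an existing column
songs['duration_min'] = songs['duration_sec'] / 60.0

# Save in the efficient native format; no reprocessing needed on reload
songs.save('s3://my-bucket/songs_processed')
```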

    Continue reading “Extending GraphLab to tables”