Scalable Data Science on a Laptop

I’ll be hosting a webcast featuring one of Strata’s most popular speakers: machine-learning expert Alice Zheng.

Here is what data science looks like today:

1. Munge some data:

    a. Process raw data. Stuff it into a database.
    b. Query for specific data. Coax results out through a straw.
    c. Munge data into a format required for the next stage.

2. Do some analysis:

    a. Figure out how to use a data analytics library to generate the results you need.
    b. Dump results out to file/database/hand truck.
    c. Parse out the chunk of output you need. Look at it.
    d. Decide something is not right. Repeat all of the above.

3. Oh right, speed!

    a. Repeat all steps in native code to make it fast.

4. Wait, what about scale?

    a. Repeat all steps with five other tools, write more code to scale up.

In this webcast, we’ll demonstrate scalable data science using GraphLab Create, an end-to-end platform for prototyping and deploying data products. You can munge data, query statistics, build sophisticated models, and deploy to the cloud, all from *one* platform—your laptop. With disk-backed data stores, an intuitive Python front-end, and an efficient C++ back-end, GraphLab Create squeezes all the power out of a single machine, which can be orders of magnitude faster than MapReduce.
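
To give a rough flavor of that workflow, here is a minimal sketch of a single-machine GraphLab Create session in Python. The file name and column names are placeholders, not taken from the webcast:

    import graphlab as gl

    # Munge: load raw CSV data into a disk-backed SFrame
    ratings = gl.SFrame.read_csv('ratings.csv')   # placeholder file

    # Query: quick summary statistics without leaving the session
    print(ratings['rating'].mean())

    # Model: build a recommender directly on the SFrame
    model = gl.recommender.create(ratings,
                                  user_id='user_id',
                                  item_id='item_id',
                                  target='rating')

    # Persist the model for later deployment
    model.save('ratings_model')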

Streamlining Feature Engineering

Researchers and startups are building tools that enable feature discovery

[A version of this post appears on the O’Reilly Data blog.]

Why do data scientists spend so much time on data wrangling and data preparation? In many cases it’s because they want access to the best variables with which to build their models. These variables are known as features in machine-learning parlance. For many data applications, feature engineering and feature selection are just as important as (if not more important than) the choice of algorithm:

Good features allow a simple model to beat a complex model.
(to paraphrase Alon Halevy, Peter Norvig, and Fernando Pereira)

The terminology can be a bit confusing, but to put things in context one can simplify the data science pipeline to highlight the importance of features:

Feature engineering and discovery pipeline

Feature Engineering or the Creation of New Features
A simple example to keep in mind is text mining. One starts with raw text (documents), and extracted features could be individual words or phrases. In this setting, a feature could indicate the frequency of a specific word or phrase. Features are then used to classify and cluster documents, or extract topics associated with the raw text. The process usually involves the creation of new features (feature engineering) and identifying the most essential ones (feature selection).
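
As a concrete version of that text-mining example, here is a short sketch using scikit-learn (a library not mentioned in the post; the toy documents are placeholders) that turns raw text into word- and phrase-count features:

    from sklearn.feature_extraction.text import CountVectorizer

    # Toy corpus of raw documents (placeholder text)
    docs = [
        "the cat sat on the mat",
        "the dog chased the cat",
    ]

    # Feature engineering: individual words and two-word phrases become features,
    # and each feature's value is its frequency in a document
    vectorizer = CountVectorizer(ngram_range=(1, 2))
    X = vectorizer.fit_transform(docs)

    print(vectorizer.vocabulary_)  # the engineered features
    print(X.toarray())             # per-document feature frequencies

Feature selection would then prune this (often huge) vocabulary down to the terms that actually help the downstream classification, clustering, or topic-extraction step.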

Continue reading

Bits from the Data Store

Semi-regular field notes from the world of data:

  • I’m always on the lookout for interesting tools and ideas for reproducing and collaborating on long data workflows. Reproducibility and collaboration are topics that we’re following closely at Strata (both topics remain on the radar of many data scientists and data engineers I speak with). At the Hadoop Summit, someone pointed me to Wings – a USC research project that uses techniques from AI to help scientists manage large computational experiments.

    Wings workflow system
    Source: Wings project (A workflow for social network analysis)

  • NPR has a Social Science correspondent! It’s about time media organizations dedicated someone to the Social Science beat. One of the things we’re keen on at Strata is how data geeks are increasingly drawing on techniques, tools, and ideas from Social Science and Design.
  • Having just published a post on applications built on top of Spark, I wasn’t that surprised to hear from companies leveraging components of the Spark ecosystem. At this week’s Hadoop Summit, several companies told me of plans (1) to build or port applications to Spark. These conversations, plus the fact that serious applications are beginning to be built on top of Spark, sure make it appear that my post was perfectly timed. It’s no coincidence that Spark was prominently displayed in the MapR booth.
  • Other chance encounters at the Hadoop Summit prompted me to remind people of Tachyon, the fault-tolerant, distributed, in-memory file system from AMPLab. Tachyon allows sharing of RDDs across frameworks (Spark, PyData, etc.) and data stores (2) (HDFS, Cassandra, MongoDB); see the short Spark sketch after the footnotes below. There are early signs of adoption (3) as well: Tachyon is currently in use in over 10 companies, is part of Fedora, and is commercially supported by Atigeo.

  • Upcoming Webcasts:


(1) Most, if not all, were off the record. I’ve also had emails from companies on this very topic.
(2) Tachyon has a “pluggable underlying file system” and currently supports HDFS, S3, and single-node local file systems.
(3) From a recent presentation by Haoyuan Li.
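
As a rough sketch of what that RDD sharing can look like from Spark (this assumes a Tachyon master running at tachyon-master:19998 and the Tachyon client on Spark’s classpath; the host name and paths are placeholders):

    from pyspark import SparkContext

    sc = SparkContext(appName="tachyon-sharing-sketch")

    # Write an RDD to Tachyon so that another job or framework
    # can later read the same in-memory data
    rdd = sc.parallelize(range(1000))
    rdd.saveAsTextFile("tachyon://tachyon-master:19998/shared/numbers")

    # A separate job (or framework) can then read it back by URI
    shared = sc.textFile("tachyon://tachyon-master:19998/shared/numbers")
    print(shared.count())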

Data Analysis on Streams

If you’re struggling with analyzing streaming data, I have just the event for you. I’ll be hosting a webcast on June 12th, featuring Mikio Braun, co-founder of streamdrill:

Analyzing real-time data poses special kinds of challenges, such as dealing with large event rates, aggregating activities for millions of objects in parallel, and processing queries with subsecond latency. In addition, the set of available tools and approaches for dealing with streaming data is currently highly fragmented.

In this webcast, Mikio Braun will discuss building reliable and efficient solutions for real-time data analysis, including approaches that rely on scaling, both batch-oriented (such as MapReduce) and stream-oriented (such as Apache Storm and Apache Spark). He will also focus on the use of approximative algorithms (used heavily in streamdrill) for counting, trending, and outlier detection.
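
Streamdrill’s internals aren’t covered in this announcement, but to make “approximative” counting concrete, here is a toy count-min sketch in Python: it estimates event counts in fixed memory, trading exactness for bounded over-counting (all names below are illustrative):

    import hashlib

    class CountMinSketch:
        """A tiny count-min sketch: approximate event counts in bounded memory."""

        def __init__(self, width=1000, depth=5):
            self.width = width
            self.depth = depth
            self.table = [[0] * width for _ in range(depth)]

        def _buckets(self, item):
            # One hashed bucket per row, derived from a salted digest
            for row in range(self.depth):
                digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
                yield row, int(digest, 16) % self.width

        def add(self, item, count=1):
            for row, col in self._buckets(item):
                self.table[row][col] += count

        def estimate(self, item):
            # Collisions can only inflate counts, so the minimum is the best estimate
            return min(self.table[row][col] for row, col in self._buckets(item))

    cms = CountMinSketch()
    for event in ["login", "click", "click", "login", "click"]:
        cms.add(event)
    print(cms.estimate("click"))  # approximately 3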

A growing number of applications are being built with Spark

Many more companies are willing to talk about how they’re using Apache Spark in production

[A version of this post appears on the O’Reilly Data blog.]

One of the trends we’re following closely at Strata is the emergence of vertical applications. As components for creating large-scale data infrastructures enter their early stages of maturation, companies are focusing on solving data problems in specific industries rather than building tools from scratch. Virtually all of these components are open source and have contributors across many companies. Organizations are also sharing best practices for building big data applications, through blog posts, white papers, and presentations at conferences like Strata.

These trends are particularly apparent in a set of technologies that originated from UC Berkeley’s AMPLab: the number of companies that are using (or plan to use) Spark in production has exploded over the last year. The surge in popularity of the Apache Spark ecosystem stems from the maturation of its individual open source components and the growing community of users. The tight integration of high-performance tools that address different problems and workloads, coupled with a simple programming interface (in Python, Java, Scala), makes Spark one of the most popular projects in big data. The charts below show the amount of active development in Spark:

Apache Spark contributions
[Data source: Git logs; chart courtesy of Matei Zaharia]
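
To illustrate the simple programming interface mentioned above, here is a minimal PySpark word count; the input path is a placeholder:

    from pyspark import SparkContext

    sc = SparkContext(appName="word-count-sketch")

    # The whole pipeline is a few chained transformations on an RDD
    counts = (sc.textFile("hdfs:///data/docs/*.txt")      # placeholder input path
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))

    print(counts.take(10))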

For the second year in a row, I’ve had the privilege of serving on the program committee for the Spark Summit. I’d like to highlight a few areas where Apache Spark is making inroads. I’ll focus on proposals from companies building applications on top of Spark.

Continue reading