Bridging the divide: Business users and machine learning experts

[A version of this articles appears on the O’Reilly Radar.] Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science. As tools for advanced analytics become more accessible, data scientist’s roles will evolve. Most media stories emphasize a need for expertise in algorithms and quantitative techniquesContinue reading “Bridging the divide: Business users and machine learning experts”

6 reasons why I like KeystoneML

[A version of this article appears on the O’Reilly Radar.] The O’Reilly Data Show Podcast: Ben Recht on optimization, compressed sensing, and large-scale machine learning pipelines. As we put the finishing touches on what promises to be another outstanding Hardcore Data Science Day at Strata + Hadoop World in New York, I sat down withContinue reading “6 reasons why I like KeystoneML”

More tools for managing and reproducing complex data projects

A survey of the landscape shows the types of tools remain the same, but interfaces continue to improve. [A version of this post appears on the O’Reilly Radar.] As data projects become complex and as data teams grow in size, individuals and organizations need tools to efficiently manage data projects. A while back, I wroteContinue reading “More tools for managing and reproducing complex data projects”

Building and deploying large-scale machine learning pipelines

[A version of this post appears on the O’Reilly Radar blog.] There are many algorithms with implementations that scale to large data sets (this list includes matrix factorization, SVM, logistic regression, LASSO, and many others). In fact, machine learning experts are fond of pointing out: if you can pose your problem as a simple optimizationContinue reading “Building and deploying large-scale machine learning pipelines”

Streamlining Feature Engineering

Researchers and startups are building tools that enable feature discovery [A version of this post appears on the O’Reilly Data blog.] Why do data scientists spend so much time on data wrangling and data preparation? In many cases it’s because they want access to the best variables with which to build their models. These variablesContinue reading “Streamlining Feature Engineering”

Bits from the Data Store

Semi-regular field notes from the world of data: I’m always on the lookout for interesting tools and ideas for reproducing and collaborating on long data workflows. Reproducibility and collaboration are topics that we’re following closely at Strata (both topics remain on the radar of many data scientists and data engineers I speak with). At theContinue reading “Bits from the Data Store”

Instrumenting collaboration tools used in data projects

Built-in audit trails can be useful for reproducing and debugging complex data analysis projects [A version of this post appears on the O’Reilly Data blog.] As I noted in a previous post, model building is just one component of the analytic lifecycle. Many analytic projects result in models that get deployed in production environments. Moreover,Continue reading “Instrumenting collaboration tools used in data projects”

Interface Languages and Feature Discovery

It’s easier to “discover” features with tools that have broad coverage of the data science workflow [A version of this post appears on the O’Reilly Data blog and Forbes.] Here are a few more observations based on conversations I had during the just concluded Strata Santa Clara conference. Interface languages: Python, R, SQL (and Scala)Continue reading “Interface Languages and Feature Discovery”

IPython: A unified environment for interactive data analysis

[A version of this post appears on the O’Reilly data blog and Forbes.] As I noted in a recent post on reproducing data projects, notebooks have become popular tools for maintaining, sharing, and replicating long data science workflows. Much of that is due to the popularity of IPython1. In development since 2001, IPython grew outContinue reading “IPython: A unified environment for interactive data analysis”