Building and deploying large-scale machine learning pipelines

[A version of this post appears on the O’Reilly Radar blog.] There are many algorithms with implementations that scale to large data sets (this list includes matrix factorization, SVM, logistic regression, LASSO, and many others). In fact, machine learning experts are fond of pointing out: if you can pose your problem as a simple optimization…

Instrumenting collaboration tools used in data projects

Built-in audit trails can be useful for reproducing and debugging complex data analysis projects. [A version of this post appears on the O’Reilly Data blog.] As I noted in a previous post, model building is just one component of the analytic lifecycle. Many analytic projects result in models that get deployed in production environments. Moreover, …

Gaining access to the best machine-learning methods

[A version of this post appears on the O’Reilly Strata blog and Forbes.] For companies in the early stages of grappling with big data, the analytic lifecycle (model building, deployment, maintenance) can be daunting. In earlier posts I highlighted some new tools that simplify aspects of the analytic lifecycle, including the early phases of model…

Data Analysis: Just one component of the Data Science workflow

[A version of this post appears on the O’Reilly Strata blog.] Judging from articles in the popular press, the term data scientist has increasingly come to refer to someone who specializes in data analysis (statistics, machine learning, etc.). This is unfortunate, since the term originally described someone who could cut across disciplines. Far from being confined…

Running batch and long-running, highly available service jobs on the same cluster

[A version of this post appears on the O’Reilly Strata blog.] As organizations increasingly rely on large computing clusters, tools for leveraging and efficiently managing compute resources become critical. Specifically, tools that allow multiple services and frameworks to run on the same cluster can significantly increase utilization and efficiency. Schedulers take into account policies and workloads…

Data scientists tackle the analytic lifecycle

[A version of this post appears on the O’Reilly Strata blog.] What happens after data scientists build analytic models? Model deployment, monitoring, and maintenance are topics that haven’t received as much attention in the past, but I’ve been hearing more about these subjects from data scientists and software developers. I remember the days when it…