pydata Archives - Gradient Flow

Scikit-Learn 0.16

I’ll be hosting a webcast featuring two of the key contributors to what is arguably one of the most popular machine learning tools today – scikit-learn: News from Scikit-Learn 0.16 and Soon-To-Be Gems for the Next Release presented by: Olivier Grisel, Andreas Mueller This webcast will review Scikit-learn, a widely used open source machine learningContinue reading “Scikit-Learn 0.16”

Bits from the Data Store

Semi-regular field notes from the world of data: Michael Jordan (“ask me anything”): The distinguished machine learning and Bayesian researcher from UC Berkeley’s AMPLab has an interesting perspective on machine learning and statistics. … while I do think of neural networks as one important tool in the toolbox, I find myself surprisingly rarely going toContinue reading “Bits from the Data Store”

Bits from the Data Store

Semi-regular field notes from the world of data: Alibaba ♥ Spark: Next time someone asks you if Apache Spark scales, point them to this recent post by Chinese e-commerce juggernaut Alibaba. What particularly caught my eye is the company’s heavy usage of GraphX, Spark’s library for graph analytics. [Full disclosure: I’m an advisor to Databricks,Continue reading “Bits from the Data Store”

Scaling up Data Frames

New frameworks for interactive business analysis and advanced analytics fuel the rise in tabular data objects [A version of this post appears on the O’Reilly Radar blog.] Long before the advent of “big data”, analysts were building models using tools like R (and its forerunners S/S-PLUS). Productivity hinged on tools that made data wrangling, dataContinue reading “Scaling up Data Frames”

What’s New in Scikit-learn 0.15

Python has emerged as one of the more popular languages for doing data science. The primary reason is the impressive array of tools (the “Pydata” stack) available for addressing many stages of data science pipelines. One of the most popular Pydata tools is scikit-learn, an easy-to-use and highly-efficient machine learning library. I’ve written about whyContinue reading “What’s New in Scikit-learn 0.15”

Interface Languages and Feature Discovery

It’s easier to “discover” features with tools that have broad coverage of the data science workflow [A version of this post appears on the O’Reilly Data blog and Forbes.] Here are a few more observations based on conversations I had during the just concluded Strata Santa Clara conference. Interface languages: Python, R, SQL (and Scala)Continue reading “Interface Languages and Feature Discovery”

Extending GraphLab to tables

The popular graph analytics framework extends its coverage of the data science workflow [A version of this post appears on the O’Reilly Data blog and Forbes.] GraphLab’s SFrame, an interesting and somewhat under-the-radar tool was unveiled1 at Strata Santa Clara. It is a disk-based, flat table representation that extends GraphLab to tabular data. With theContinue reading “Extending GraphLab to tables”

What I use for data visualization

[A version of this post appears on the O’Reilly Data blog.] Depending on the nature of the problem, data size, and deliverable, I still draw upon an array of tools for data visualization. As I survey the Design track at next month’s Strata conference, I see creators and power users of visualization tools that manyContinue reading “What I use for data visualization”

IPython: A unified environment for interactive data analysis

[A version of this post appears on the O’Reilly data blog and Forbes.] As I noted in a recent post on reproducing data projects, notebooks have become popular tools for maintaining, sharing, and replicating long data science workflows. Much of that is due to the popularity of IPython1. In development since 2001, IPython grew outContinue reading “IPython: A unified environment for interactive data analysis”

A compelling family of DSLs for Data Science

[A version of this post appears on the O’Reilly Data blog.] An important reason why pydata tools and Spark appeal to data scientists is that they both cover many data science tasks and workloads (Spark users can move seamlessly between batch and streaming). Being able to use the same programming style and syntax for workflowsContinue reading “A compelling family of DSLs for Data Science”