August 2014 - Gradient Flow

Bits from the Data Store

Semi-regular field notes from the world of data. Alibaba ♥ Spark: Next time someone asks you if Apache Spark scales, point them to this recent post by Chinese e-commerce juggernaut Alibaba. What particularly caught my eye is the company’s heavy usage of GraphX, Spark’s library for graph analytics. [Full disclosure: I’m an advisor to Databricks, … Continue reading “Bits from the Data Store”

Real-world Active Learning

Beyond building training sets for machine learning, crowdsourcing is being used to enhance the results of machine-learning models: in active learning, humans take care of the uncertain cases while models handle the routine ones. Active learning is one of those topics that many data scientists have heard of, few have tried, and a small handful know how to … Continue reading “Real-world Active Learning”

Bits from the Data Store

Semi-regular field notes from the world of data (gathered from Scifoo 2014). Filtergraph and the power of visual exploration: A web-based tool for exploring high-dimensional data sets, Filtergraph came out of the lab of astrophysicist Keivan Stassun. It has helped researchers make several interesting discoveries, including a paper (that appeared in Nature) on a technique … Continue reading “Bits from the Data Store”

Scaling up Data Frames

New frameworks for interactive business analysis and advanced analytics fuel the rise in tabular data objects. [A version of this post appears on the O’Reilly Radar blog.] Long before the advent of “big data”, analysts were building models using tools like R (and its forerunners S/S-PLUS). Productivity hinged on tools that made data wrangling, data … Continue reading “Scaling up Data Frames”

What’s New in Scikit-learn 0.15

Python has emerged as one of the more popular languages for doing data science. The primary reason is the impressive array of tools (the “PyData” stack) available for addressing many stages of data science pipelines. One of the most popular PyData tools is scikit-learn, an easy-to-use and highly efficient machine learning library. I’ve written about why … Continue reading “What’s New in Scikit-learn 0.15”