Scalable Data Science on a Laptop

I’ll be hosting a webcast featuring one of Strata’s most popular speakers: machine-learning expert Alice Zheng.

Here is what data science looks like today:

1. Munge some data:

    a. Process raw data. Stuff it into a database.
    b. Query for specific data. Coax results out through a straw.
    c. Munge data into a format required for the next stage.

2. Do some analysis:

    a. Figure out how to use a data analytics library to generate the results you need.
    b. Dump results out to file/database/hand truck.
    c. Parse out the chunk of output you need. Look at it.
    d. Decide something is not right. Repeat all of the above.

3. Oh right, speed!

    a. Repeat all steps in native code to make it fast.

4. Wait, what about scale?

    a. Repeat all steps with five other tools, write more code to scale up.

In this webcast, we’ll demonstrate scalable data science using GraphLab Create, an end-to-end platform for prototyping and deploying data products. You can munge data, query statistics, build sophisticated models, and deploy to the cloud, all from *one* platform—your laptop. With disk-backed data stores, an intuitive Python front-end, and an efficient C++ back-end, GraphLab Create squeezes all the power out of a single machine, which can be orders of magnitude faster than MapReduce.
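
To give a rough flavor of that workflow, here is a minimal sketch of a single-machine GraphLab Create session in Python. The file name and column names are placeholders, not taken from the webcast:

    import graphlab as gl

    # Munge: load raw CSV data into a disk-backed SFrame
    ratings = gl.SFrame.read_csv('ratings.csv')   # placeholder file

    # Query: quick summary statistics without leaving the session
    print(ratings['rating'].mean())

    # Model: build a recommender directly on the SFrame
    model = gl.recommender.create(ratings,
                                  user_id='user_id',
                                  item_id='item_id',
                                  target='rating')

    # Persist the model for later deployment
    model.save('ratings_model')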

Streamlining Feature Engineering

Researchers and startups are building tools that enable feature discovery

[A version of this post appears on the O’Reilly Data blog.]

Why do data scientists spend so much time on data wrangling and data preparation? In many cases it’s because they want access to the best variables with which to build their models. These variables are known as features in machine-learning parlance. For many data applications, feature engineering and feature selection are just as important as (if not more important than) the choice of algorithm:

Good features allow a simple model to beat a complex model.
(to paraphrase Alon Halevy, Peter Norvig, and Fernando Pereira)

The terminology can be a bit confusing, but to put things in context one can simplify the data science pipeline to highlight the importance of features:

Feature engineering and discovery pipeline

Feature Engineering or the Creation of New Features
A simple example to keep in mind is text mining. One starts with raw text (documents), and extracted features could be individual words or phrases. In this setting, a feature could indicate the frequency of a specific word or phrase. Features are then used to classify and cluster documents, or extract topics associated with the raw text. The process usually involves the creation of new features (feature engineering) and identifying the most essential ones (feature selection).
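
As a concrete version of that text-mining example, here is a short sketch using scikit-learn (a library not mentioned in the post; the toy documents are placeholders) that turns raw text into word- and phrase-count features:

    from sklearn.feature_extraction.text import CountVectorizer

    # Toy corpus of raw documents (placeholder text)
    docs = [
        "the cat sat on the mat",
        "the dog chased the cat",
    ]

    # Feature engineering: individual words and two-word phrases become features,
    # and each feature's value is its frequency in a document
    vectorizer = CountVectorizer(ngram_range=(1, 2))
    X = vectorizer.fit_transform(docs)

    print(vectorizer.vocabulary_)  # the engineered features
    print(X.toarray())             # per-document feature frequencies

Feature selection would then prune this (often huge) vocabulary down to the terms that actually help the downstream classification, clustering, or topic-extraction step.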

Continue reading

Bits from the Data Store

Semi-regular field notes from the world of data:

  • I’m always on the lookout for interesting tools and ideas for reproducing and collaborating on long data workflows. Reproducibility and collaboration are topics that we’re following closely at Strata (both topics remain on the radar of many data scientists and data engineers I speak with). At the Hadoop Summit, someone pointed me to Wings – a USC research project that uses techniques from AI to help scientists manage large computational experiments.

    Wings workflow system
    Source: Wings project (A workflow for social network analysis)

  • NPR has a Social Science correspondent! It’s about time media organizations dedicated someone to the Social Science beat. One of the things we’re keen on at Strata is how data geeks are increasingly drawing on techniques, tools, and ideas from Social Science and Design.
  • Having just published a post on applications built on top of Spark, I wasn’t that surprised to hear from companies leveraging components of the Spark ecosystem. At this week’s Hadoop Summit, several companies told me of plans (1) to build or port applications to Spark. These conversations, plus the fact that serious applications are beginning to be built on top of Spark, sure make it appear that my post was perfectly timed. It’s no coincidence that Spark was prominently displayed in the MapR booth.
  • Other chance encounters at the Hadoop Summit prompted me to remind people of Tachyon, the fault-tolerant, distributed, in-memory file system from AMPLab. Tachyon allows sharing of RDDs across frameworks (Spark, PyData, etc.) and data stores (2) (HDFS, Cassandra, MongoDB); see the short Spark sketch after the footnotes below. There are early signs of adoption (3) as well: Tachyon is currently in use in over 10 companies, is part of Fedora, and is commercially supported by Atigeo.

  • Upcoming Webcasts:


(1) Most, if not all, were off the record. I’ve also had emails from companies on this very topic.
(2) Tachyon has a “pluggable underlying file system” and currently supports HDFS, S3, and single-node local file systems.
(3) From a recent presentation by Haoyuan Li.
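
As a rough sketch of what that RDD sharing can look like from Spark (this assumes a Tachyon master running at tachyon-master:19998 and the Tachyon client on Spark’s classpath; the host name and paths are placeholders):

    from pyspark import SparkContext

    sc = SparkContext(appName="tachyon-sharing-sketch")

    # Write an RDD to Tachyon so that another job or framework
    # can later read the same in-memory data
    rdd = sc.parallelize(range(1000))
    rdd.saveAsTextFile("tachyon://tachyon-master:19998/shared/numbers")

    # A separate job (or framework) can then read it back by URI
    shared = sc.textFile("tachyon://tachyon-master:19998/shared/numbers")
    print(shared.count())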

Data Analysis on Streams

If you’re struggling with analyzing streaming data, I have just the event for you. I’ll be hosting a webcast on June 12th, featuring Mikio Braun, co-founder of streamdrill:

Analyzing real-time data poses special kinds of challenges, such as dealing with large event rates, aggregating activities for millions of objects in parallel, and processing queries with subsecond latency. In addition, the set of available tools and approaches for dealing with streaming data is currently highly fragmented.

In this webcast, Mikio Braun will discuss building reliable and efficient solutions for real-time data analysis, including approaches that rely on scaling, both batch-oriented (such as MapReduce) and stream-oriented (such as Apache Storm and Apache Spark). He will also focus on the use of approximative algorithms (used heavily in streamdrill) for counting, trending, and outlier detection.
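
Streamdrill’s internals aren’t covered in this announcement, but to make “approximative” counting concrete, here is a toy count-min sketch in Python: it estimates event counts in fixed memory, trading exactness for bounded over-counting (all names below are illustrative):

    import hashlib

    class CountMinSketch:
        """A tiny count-min sketch: approximate event counts in bounded memory."""

        def __init__(self, width=1000, depth=5):
            self.width = width
            self.depth = depth
            self.table = [[0] * width for _ in range(depth)]

        def _buckets(self, item):
            # One hashed bucket per row, derived from a salted digest
            for row in range(self.depth):
                digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
                yield row, int(digest, 16) % self.width

        def add(self, item, count=1):
            for row, col in self._buckets(item):
                self.table[row][col] += count

        def estimate(self, item):
            # Collisions can only inflate counts, so the minimum is the best estimate
            return min(self.table[row][col] for row, col in self._buckets(item))

    cms = CountMinSketch()
    for event in ["login", "click", "click", "login", "click"]:
        cms.add(event)
    print(cms.estimate("click"))  # approximately 3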

A growing number of applications are being built with Spark

Many more companies are willing to talk about how they’re using Apache Spark in production

[A version of this post appears on the O’Reilly Data blog.]

One of the trends we’re following closely at Strata is the emergence of vertical applications. As components for creating large-scale data infrastructures enter their early stages of maturation, companies are focusing on solving data problems in specific industries rather than building tools from scratch. Virtually all of these components are open source and have contributors across many companies. Organizations are also sharing best practices for building big data applications, through blog posts, white papers, and presentations at conferences like Strata.

These trends are particularly apparent in a set of technologies that originated from UC Berkeley’s AMPLab: the number of companies that are using (or plan to use) Spark in production has exploded over the last year. The surge in popularity of the Apache Spark ecosystem stems from the maturation of its individual open source components and the growing community of users. The tight integration of high-performance tools that address different problems and workloads, coupled with a simple programming interface (in Python, Java, Scala), makes Spark one of the most popular projects in big data. The charts below show the amount of active development in Spark:

Apache Spark contributions
[Data source: Git logs; chart courtesy of Matei Zaharia]
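
To illustrate the simple programming interface mentioned above, here is a minimal PySpark word count; the input path is a placeholder:

    from pyspark import SparkContext

    sc = SparkContext(appName="word-count-sketch")

    # The whole pipeline is a few chained transformations on an RDD
    counts = (sc.textFile("hdfs:///data/docs/*.txt")      # placeholder input path
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))

    print(counts.take(10))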

For the second year in a row, I’ve had the privilege of serving on the program committee for the Spark Summit. I’d like to highlight a few areas where Apache Spark is making inroads. I’ll focus on proposals from companies building applications on top of Spark.

Continue reading