Expanding options for mining streaming data

[A version of this post appears on the O’Reilly Data blog.] Stream processing was on the minds of a few people that I ran into over the past week. A combination of new systems, deployment tools, and enhancements to existing frameworks is behind the recent chatter. Through a combination of simpler deployment tools, programming interfaces, … Continue reading “Expanding options for mining streaming data”

Reproducing Data Projects

[A version of this post appears on the O’Reilly Strata blog.] As I talk to people and companies building the next generation of tools for data scientists, collaboration and reproducibility keep popping up. Collaboration is baked into many of the newer tools I’ve seen (including ones that have yet to be released). Reproducibility is a … Continue reading “Reproducing Data Projects”

Data Scientists and Data Engineers like Python and Scala

[A version of this post appears on the O’Reilly Strata blog.] In exchange for getting personalized recommendations, many Meetup members declare topics that they’re interested in. I recently looked at the topics listed by members of a few local data Meetups that I’ve frequented. These Meetups vary in size from 600 to 2,000 total (and … Continue reading “Data Scientists and Data Engineers like Python and Scala”

Data Wrangling gets a fresh look

[A version of this post appears on the O’Reilly Strata blog.] Data analysts have long lamented the amount of time they spend on data wrangling. Rightfully so, as some estimates suggest they spend a majority of their time on it. The problem is compounded by the fact that these days, data scientists are encouraged to … Continue reading “Data Wrangling gets a fresh look”

Simplifying interactive, realtime, and advanced analytics

[A version of this post appears on the O’Reilly Strata blog and Forbes.] Here are a few observations based on conversations I had during the just-concluded Strata NYC conference. Interactive query analysis on Hadoop remains a hot area. A recent O’Reilly survey confirmed SQL is an important skill for data scientists. A year after … Continue reading “Simplifying interactive, realtime, and advanced analytics”

Deep Learning oral traditions

[A version of this post appears on the O’Reilly Strata blog.] This past week I had the good fortune of attending two great talks on Deep Learning, given by Googlers Ilya Sutskever and Jeff Dean. Much of the excitement surrounding Deep Learning stems from impressive results in a variety of perception tasks, including speech recognition … Continue reading “Deep Learning oral traditions”

Stream Mining essentials

[A version of this post appears on the O’Reilly Strata blog.] A series of open source, distributed stream processing frameworks have become essential components in many big data technology stacks. Apache Storm remains the most popular, but promising new tools like Spark Streaming and Apache Samza are going to have their share of users. These … Continue reading “Stream Mining essentials”

Semi-automatic method for grading a million homework assignments

[A version of this post appears on the O’Reilly Strata blog.] One of the hardest things about teaching a large class is grading exams and homework assignments. In my teaching days a “large class” was only in the few hundreds (still a challenge for the TAs and instructor). But in the age of MOOCs, classes … Continue reading “Semi-automatic method for grading a million homework assignments”

Gaining access to the best machine-learning methods

[A version of this post appears on the O’Reilly Strata blog and Forbes.] For companies in the early stages of grappling with big data, the analytic lifecycle (model building, deployment, maintenance) can be daunting. In earlier posts I highlighted some new tools that simplify aspects of the analytic lifecycle, including the early phases of model … Continue reading “Gaining access to the best machine-learning methods”

Stream Processing and Mining just got more interesting

[A version of this post appears on the O’Reilly Strata blog.] Largely unknown outside data engineering circles, Apache Kafka is one of the more popular open source, distributed computing projects. Many data engineers I speak with either already use it or are planning to do so. It is a distributed message broker used to store … Continue reading “Stream Processing and Mining just got more interesting”