A compelling family of DSLs for Data Science

[A version of this post appears on the O’Reilly Data blog.] An important reason why PyData tools and Spark appeal to data scientists is that they both cover many data science tasks and workloads (Spark users can move seamlessly between batch and streaming). Being able to use the same programming style and syntax for workflows…

Financial analytics as a service

[A version of this post appears on the O’Reilly Strata blog.] In relatively short order, Amazon’s internal computing services have become the world’s most successful cloud computing platform. Conceived in 2003 and launched in 2006, AWS grew quickly and is now the largest web hosting company in the world. With the recent addition of Kinesis…

Expanding options for mining streaming data

[A version of this post appears on the O’Reilly Data blog.] Stream processing was on the minds of a few people that I ran into over the past week. A combination of new systems, deployment tools, and enhancements to existing frameworks is behind the recent chatter. Through a combination of simpler deployment tools, programming interfaces,…

Reproducing Data Projects

[A version of this post appears on the O’Reilly Strata blog.] As I talk to people and companies building the next generation of tools for data scientists, collaboration and reproducibility keep popping up. Collaboration is baked into many of the newer tools I’ve seen (including ones that have yet to be released). Reproducibility is a…

Data Scientists and Data Engineers like Python and Scala

[A version of this post appears on the O’Reilly Strata blog.] In exchange for getting personalized recommendations, many Meetup members declare topics that they’re interested in. I recently looked at the topics listed by members of a few local data Meetups that I’ve frequented. These Meetups vary in size from 600 to 2,000 total (and…

Data Wrangling gets a fresh look

[A version of this post appears on the O’Reilly Strata blog.] Data analysts have long lamented the amount of time they spend on data wrangling. Rightfully so, as some estimates suggest they spend a majority of their time on it. The problem is compounded by the fact that these days, data scientists are encouraged to…

How companies are using Spark

[A version of this post appears on the O’Reilly Strata blog.] When an interesting piece of big data technology gets introduced, early adopters tend to focus on technical features and capabilities. Applications get built as companies develop confidence that it’s reliable and that it really scales to large data volumes. That seems to be where…

Simplifying interactive, realtime, and advanced analytics

[A version of this post appears on the O’Reilly Strata blog and Forbes.] Here are a few observations based on conversations I had during the just-concluded Strata NYC conference. Interactive query analysis on Hadoop remains a hot area. A recent O’Reilly survey confirmed SQL is an important skill for data scientists. A year after…

Deep Learning oral traditions

[A version of this post appears on the O’Reilly Strata blog.] This past week I had the good fortune of attending two great talks on Deep Learning, given by Googlers Ilya Sutskever and Jeff Dean. Much of the excitement surrounding Deep Learning stems from impressive results in a variety of perception tasks, including speech recognition…

Stream Mining essentials

[A version of this post appears on the O’Reilly Strata blog.] A series of open source, distributed stream processing frameworks have become essential components in many big data technology stacks. Apache Storm remains the most popular, but promising new tools like Spark Streaming and Apache Samza are going to have their share of users. These…