Bridging the gap between research and implementation

[A version of this post appears on the O’Reilly Data blog.] One of the most popular offerings at Strata Santa Clara was Hardcore Data Science day. Over the next few weeks we hope to profile some of the speakers who presented, and make the video of the talks available as a bundle. In the meantimeContinue reading “Bridging the gap between research and implementation”

Time-turner: Strata Santa Clara 2014, day 2

There are so many good talks happening at the same time that it’s impossible to not miss out on good sessions. But imagine I had a time-turner necklace and could actually “attend” 2 (maybe 3) sessions happening at the same time. Taking into account my current personal interests and tastes, here’s how my day wouldContinue reading “Time-turner: Strata Santa Clara 2014, day 2”

Time-turner: Strata Santa Clara 2014, day 1

There are so many good talks happening at the same time that it’s impossible to not miss out on good sessions. But imagine I had a time-turner necklace and could actually “attend” 2 (maybe 3) sessions happening at the same time. Taking into account my current personal interests and tastes, here’s how my day wouldContinue reading “Time-turner: Strata Santa Clara 2014, day 1”

Graphs, Time-series, Dataviz, and Crowdsourcing at Strata Santa Clara 2014

There are many fantastic talks at Strata and it can be overwhelming to navigate the schedule. I plan to list talks I’m hoping to catch in a series of “time-turner” posts (check this blog on Wed/Thu at 10 a.m.). But for now let me highlight talks from a few categories: Graphs and Network Analysis: Large-scaleContinue reading “Graphs, Time-series, Dataviz, and Crowdsourcing at Strata Santa Clara 2014”

Big Data solutions through the combination of tools

[A version of this post appears on the O’Reilly Data blog and Forbes.] As a user who tends to mix-and-match many different tools, not having to deal with configuring and assembling a suite of tools is a big win. So I’m really liking the recent trend towards more integrated and packaged solutions. A recent exampleContinue reading “Big Data solutions through the combination of tools”

How companies are using Spark

[A version of this post appears on the O’Reilly Strata blog.] When an interesting piece of big data technology gets introduced, early1 adopters tend to focus on technical features and capabilities. Applications get built as companies develop confidence that it’s reliable and that it really scales to large data volumes. That seems to be whereContinue reading “How companies are using Spark”

Simplifying interactive, realtime, and advanced analytics

[A version of this post appears on the O’Reilly Strata blog and Forbes.] Here are a few observations based on conversations I had during the just concluded Strata NYC conference. Interactive query analysis on Hadoop remains a hot area A recent O’Reilly survey confirmed SQL is an important skill for data scientists. A year afterContinue reading “Simplifying interactive, realtime, and advanced analytics”

Stream Mining essentials

[A version of this post appears on the O’Reilly Strata blog.] A series of open source, distributed stream processing frameworks have become essential components in many big data technology stacks. Apache Storm remains the most popular, but promising new tools like Spark Streaming and Apache Samza are going to have their share of users. TheseContinue reading “Stream Mining essentials”

Stream Processing and Mining just got more interesting

[A version of this post appears on the O’Reilly Strata blog.] Largely unknown outside data engineering circles, Apache Kafka is one of the more popular open source, distributed computing projects. Many data engineers I speak with either already use it or are planning to do so. It is a distributed message broker used to store1Continue reading “Stream Processing and Mining just got more interesting”

How Twitter monitors millions of time-series

[A version of this post appears on the O’Reilly Strata blog.] One of the keys to Twitter’s ability to process 500 millions tweets daily is a software development process that values monitoring and measurement. A recent post from the company’s Observability team detailed the software stack for monitoring the performance characteristics of software services, andContinue reading “How Twitter monitors millions of time-series”