11 Essential Features that Visual Analysis Tools Should Have

[A version of this post appears on the O’Reilly Strata blog.] After recently playing with SAS Visual Analytics, I’ve been thinking about tools for visual analysis. By visual analysis I mean the type of analysis most recently popularized by Tableau, QlikView, and Spotfire: you encounter a data set for the first time, conduct exploratory data …

Scalable streaming analytics using a single-server

[A version of this post appears on the O’Reilly Strata blog.] For many organizations, real-time analytics entails complex event processing (CEP) systems or newer distributed stream processing frameworks like Storm, S4, or Spark Streaming. The latter have become more popular because they are able to process massive amounts of data, and fit nicely with Hadoop …

Simpler workflow tools enable the rapid deployment of models

[A version of this post appears on the O’Reilly Strata blog.] Data science often depends on data pipelines that involve acquiring, transforming, and loading data. (If you’re fortunate, most of the data you need is already in usable form.) Data needs to be assembled and wrangled before it can be visualized and analyzed. Many companies …

The re-emergence of Time-series

[A version of this post appeared on the O’Reilly Strata and Radar blogs.] My first job after leaving academia was as a quant for a hedge fund, where I performed (what are now referred to as) data science tasks on financial time-series. I primarily used techniques from probability & statistics, econometrics, and optimization, with occasional …

Data Science tools: Are you “all in” or do you “mix and match”?

[A version of this post appears on the O’Reilly Strata blog.] An integrated data stack boosts productivity: As I noted in my previous post, Python programmers willing to go “all in” have Python tools to cover most of data science. Lest I be accused of oversimplification, a Python programmer still needs to commit to learning …

Python data tools just keep getting better

[A version of this post appeared on the O’Reilly Strata blog.] Here are a few observations inspired by conversations I had during the just-concluded PyData conference. The Python data community is well-organized: Besides conferences (PyData, SciPy, EuroSciPy), there is a new non-profit (NumFOCUS) dedicated to supporting scientific computing and data analytics projects. The list …

Data Science Tools: Fast, easy to use, and scalable

[A version of this post appears on the O’Reilly Strata blog.] Here are a few observations based on conversations I had during the just-concluded Strata Santa Clara conference. Spark is attracting attention: I’ve written numerous times about components of the Berkeley Data Analytics Stack (Spark, Shark, MLbase). Two Spark-related sessions at Strata were packed …

MLbase: Scalable Machine-learning made accessible

[Cross-posted on the O’Reilly Strata blog.] In the course of applying machine learning to large data sets, data scientists face a few pain points. They need to tune and compare several suitable algorithms, a process that may involve having to configure a hodgepodge of tools, requiring different input files, programming languages, and interfaces. Some software …

2012 Revenue of some Big Data companies

The chart below is from Wikibon’s estimates of the 2012 revenue of some Big Data companies. Using d3 I drew a chart that shows 2012 revenue in millions, as well as the share of revenue derived from services, for a few select/startup companies. The Big 3 Hadoop Vendors (Cloudera/MapR/Hortonworks): Combined revenue …

Mining Time-series with Trillions of Points: Dynamic Time Warping at scale

Take a similarity measure that’s already well-known to researchers who work with time-series, and devise an algorithm to compute it efficiently at scale. Suddenly intractable problems become tractable, and Big Data mining applications that use the metric are within reach. Classification, clustering, and search over time series have important applications in many domains. In …