How signals, geometry, and topology are influencing data science

[A version of this post appears on the O’Reilly Strata blog.] I’ve been noticing unlikely areas of mathematics pop up in data analysis. While signal processing is a natural fit, topology, differential and algebraic geometry aren’t exactly areas you associate with data science. But upon further reflection, perhaps it shouldn’t be so surprising that areas that … Continue reading “How signals, geometry, and topology are influencing data science”

Improving options for unlocking your graph data

[A version of this post appears on the O’Reilly Strata blog.] The popular open source project GraphLab received a major boost early this week when a new company, composed of its founding developers, raised funding to develop analytic tools for graph data sets. GraphLab Inc. will continue to use the open source GraphLab to “push … Continue reading “Improving options for unlocking your graph data”

11 Essential Features that Visual Analysis Tools Should Have

[A version of this post appears on the O’Reilly Strata blog.] After recently playing with SAS Visual Analytics, I’ve been thinking about tools for visual analysis. By visual analysis I mean the type of analysis most recently popularized by Tableau, QlikView, and Spotfire: you encounter a data set for the first time, conduct exploratory data … Continue reading “11 Essential Features that Visual Analysis Tools Should Have”

Scalable streaming analytics using a single-server

[A version of this post appears on the O’Reilly Strata blog.] For many organizations, real-time analytics entails complex event processing (CEP) systems or newer distributed stream processing frameworks like Storm, S4, or Spark Streaming. The latter have become more popular because they are able to process massive amounts of data, and fit nicely with Hadoop … Continue reading “Scalable streaming analytics using a single-server”

Tachyon: An open source, distributed, fault-tolerant, in-memory file system

[A version of this post appears on the O’Reilly Strata blog.] In earlier posts I’ve written about how Spark and Shark run much faster than Hadoop and Hive by caching data sets in memory. But suppose one wants to share data sets across jobs and frameworks while retaining the speed gains of being in memory? An example would be performing … Continue reading “Tachyon: An open source, distributed, fault-tolerant, in-memory file system”

Simpler workflow tools enable the rapid deployment of models

[A version of this post appears on the O’Reilly Strata blog.] Data science often depends on data pipelines that involve acquiring, transforming, and loading data. (If you’re fortunate, most of the data you need is already in usable form.) Data needs to be assembled and wrangled before it can be visualized and analyzed. Many companies … Continue reading “Simpler workflow tools enable the rapid deployment of models”

Single server systems can tackle Big Data

[A version of this post appears on the O’Reilly Strata blog.] About a year ago, a blog post from SAP posited that when it comes to analytics, most companies are in the multi-terabyte range: data sizes that are well within the scope of distributed in-memory solutions like Spark, SAP HANA, ScaleOut Software, GridGain, and Terracotta.

The re-emergence of Time-series

[A version of this post appeared on the O’Reilly Strata and Radar blogs.] My first job after leaving academia was as a quant for a hedge fund, where I performed (what are now referred to as) data science tasks on financial time-series. I primarily used techniques from probability & statistics, econometrics, and optimization, with occasional … Continue reading “The re-emergence of Time-series”

Data Science tools: Are you “all in” or do you “mix and match”?

[A version of this post appears on the O’Reilly Strata blog.] An integrated data stack boosts productivity. As I noted in my previous post, Python programmers willing to go “all in” have Python tools to cover most of data science. Lest I be accused of oversimplification, a Python programmer still needs to commit to learning … Continue reading “Data Science tools: Are you “all in” or do you “mix and match”?”