[A version of this post appeared on the O’Reilly Strata and Radar blogs.]
My first job after leaving academia was as a quant (1) for a hedge fund, where I performed (what are now referred to as) data science tasks on financial time-series. I primarily used techniques from probability & statistics, econometrics, and optimization, with occasional forays into machine-learning (clustering, classification, anomalies). More recently, I’ve been closely following the emergence of tools that target large time series and decided to highlight a few interesting bits.
Time-series and big data:
Over the last six months I’ve been encountering more data scientists (outside of finance) who work with massive amounts of time-series data. The rise of unstructured data has been widely reported; the growing importance of time-series much less so. Sources include consumer devices (gesture recognition & user interface design), sensors (apps for “self-tracking”), machines (systems in data centers), and health care. In fact, some research hospitals have troves of EEG and ECG readings that translate to time-series data collections with billions (even trillions) of points.
Search and machine-learning at scale:
Before doing anything else, one has to be able to run queries at scale. Last year I wrote about a team of researchers at UC Riverside who took an existing search algorithm (dynamic time-warping (2)) and got it to scale to time-series with trillions of points. There are many potential applications of their research. One I highlighted is from health care:
… a doctor who needs to search through EEG data (with hundreds of billions of points), for a “prototypical epileptic spike”, where the input query is a time-series snippet with thousands of points.
As the size of the data grows, the UCR dynamic time-warping algorithm takes longer to finish (a few hours for time-series with trillions of points). In general, (academic) researchers who’ve spent weeks or months collecting data are fine waiting a few hours for a pattern recognition algorithm to finish. But users who come from different backgrounds (e.g. web companies) may not be as patient. Fortunately “search” is an active research area, and faster (distributed) pattern recognition systems will likely emerge soon.
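For readers who haven't seen dynamic time-warping before, here is a minimal sketch of the textbook algorithm plus a brute-force subsequence search. It is not the UCR team's implementation and has none of the optimizations needed for trillions of points. The function names, the optional `window` parameter, and the toy data are my own assumptions for illustration.

```python
import math


def dtw_distance(a, b, window=None):
    """Textbook DTW between two numeric sequences (squared-error cost)."""
    n, m = len(a), len(b)
    # Optional band that limits how far the alignment may drift from the
    # diagonal; it must be at least |n - m| wide so the end cell is reachable.
    w = max(n, m) if window is None else max(window, abs(n - m))
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0  # aligning two empty prefixes costs nothing
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of: step in a only, step in b only, step in both.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return math.sqrt(prev[m])


def best_match(series, query, window=None):
    """Slide the query over the series; return (start_index, distance)."""
    q = len(query)
    best_start, best_dist = None, math.inf
    for start in range(len(series) - q + 1):
        d = dtw_distance(series[start:start + q], query, window)
        if d < best_dist:
            best_start, best_dist = start, d
    return best_start, best_dist


# Toy example: find a small "spike" snippet inside a longer series.
series = [0, 0, 0, 1, 5, 1, 0, 0]
query = [1, 5, 1]
print(best_match(series, query))  # (3, 0.0)
```

Each comparison costs time proportional to the square of the query length, and the search repeats it at every offset. That is why a naive version like this one becomes impractical long before the data reaches billions of points.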
Once you scale up search, other interesting problems can be tackled. The UCR team is using their dynamic time-warping algorithm in tasks like classification, clustering, and motif (3) discovery. Other teams are investigating techniques from signal-processing, pattern recognition, and trajectory tracking.
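To make "motif discovery" concrete, here is a brute-force sketch that finds the two most similar non-overlapping subsequences of a series. Note that it swaps in plain Euclidean distance on z-normalized windows rather than dynamic time-warping, and it is not the UCR team's method. The names `znorm` and `top_motif` and the sample data are assumptions made for illustration.

```python
import math


def znorm(x):
    """Rescale a window to zero mean and unit variance (flat windows become zeros)."""
    mean = sum(x) / len(x)
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))
    if std == 0:
        return [0.0] * len(x)
    return [(v - mean) / std for v in x]


def top_motif(series, length):
    """Return (distance, i, j) for the closest pair of non-overlapping windows."""
    windows = [znorm(series[i:i + length])
               for i in range(len(series) - length + 1)]
    best = (math.inf, None, None)
    for i in range(len(windows)):
        # Start j at i + length so the two windows never overlap; overlapping
        # windows are trivially similar and would swamp the result.
        for j in range(i + length, len(windows)):
            d = math.dist(windows[i], windows[j])
            if d < best[0]:
                best = (d, i, j)
    return best


# The shape 0, 1, 3, 1, 0 appears at the start and again at the end.
series = [0, 1, 3, 1, 0, 9, 2, 7, 4, 0, 1, 3, 1, 0]
print(top_motif(series, 5))
```

The nested loop compares every pair of windows, so the cost grows quadratically with the length of the series. That is one reason a fast search primitive has to come first.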
Some data management tools that target time-series:
One of the more popular sessions at last year’s HBase Conference was on OpenTSDB, a distributed time series database built on top of HBase. It’s used to store and serve time series metrics, and comes with tools (based on GNUPlot) for charting. Originally named OpenTSDB2, KairosDB was written primarily for Cassandra (but also works with HBase). Whereas OpenTSDB emphasizes tools for readying data for charts (interpolating to fill in missing values), KairosDB distinguishes between data and the presentation of data.
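As a rough illustration of what "interpolating to fill in missing values" means, here is a minimal linear-interpolation sketch. It is not OpenTSDB's code. The function name `fill_missing` and the convention of marking gaps with `None` are assumptions made for this example.

```python
def fill_missing(timestamps, values):
    """Fill None entries by linear interpolation between the nearest known points.

    Gaps before the first or after the last known value are left as None,
    since there is nothing to interpolate between.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for a, b in zip(known, known[1:]):
        t0, t1 = timestamps[a], timestamps[b]
        v0, v1 = values[a], values[b]
        for k in range(a + 1, b):
            # Position of the missing point between its two known neighbours.
            frac = (timestamps[k] - t0) / (t1 - t0)
            filled[k] = v0 + frac * (v1 - v0)
    return filled


# A metric sampled every 10 seconds with one reading missing.
print(fill_missing([0, 10, 20], [1.0, None, 3.0]))  # [1.0, 2.0, 3.0]
```

Filling gaps this way makes for cleaner charts, but the filled points are estimates rather than measurements. That is one argument for keeping stored data separate from its presentation.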
Startup TempoDB offers a reasonably priced, cloud-based service for storing, retrieving, and visualizing time-series data. Still a work in progress, SciDB is an open source database project designed specifically for data-intensive science problems. The designers of the system plan to make time-series analysis easy to express within SciDB.
(1) I worked on trading strategies for derivatives, portfolio & risk management, and option pricing.
(2) From my earlier post: In a recent paper, the UCR team noted that “… after an exhaustive literature search of more than 800 papers, we are not aware of any distance measure that has been shown to outperform DTW by a statistically significant amount on reproducible experiments”.
(3) Motifs are similar subsequences of a long time series; shapelets are time series primitives that can be used to speed up automatic classification (by reducing the number of “features”).