Take a similarity measure that’s already well known to researchers who work with time series, and devise an algorithm to compute it efficiently at scale. Suddenly, intractable problems become tractable, and Big Data mining applications that use the metric are within reach.
Classifying, clustering, and searching time series have important applications in many domains. In medicine, EEG and ECG readings translate into time-series collections with billions (even trillions) of points; many research hospitals hold trillions of points of EEG data. Other domains where large time-series collections are routine include gesture recognition and user interface design, astronomy, robotics, geology, wildlife monitoring, security, and biometrics.
The problem is that existing algorithms don’t scale¹ to sequences with hundreds of billions or trillions of points. Consider a doctor who needs to search EEG data with hundreds of billions of points for a “prototypical epileptic spike”, where the input query is a time-series snippet with thousands of points.
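To make the search problem concrete, here is a minimal brute-force sketch of subsequence search under plain Euclidean distance. It is not the researchers' method: the function names `euclidean` and `best_match` are hypothetical, and the code skips any normalization or pruning that a practical system would likely need. It only illustrates why naive scanning struggles at this scale.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_match(series, query):
    """Return (start_index, distance) of the window of `series` closest to `query`.

    Brute force: every window of len(query) points is compared in full.
    (Hypothetical helper for illustration only.)
    """
    m = len(query)
    if m == 0 or m > len(series):
        raise ValueError("query must be non-empty and no longer than series")
    best_idx, best_dist = -1, math.inf
    for start in range(len(series) - m + 1):
        d = euclidean(series[start:start + m], query)
        if d < best_dist:
            best_idx, best_dist = start, d
    return best_idx, best_dist


# Toy stand-ins for a long recording and a short "spike" query.
recording = [0, 0, 1, 5, 1, 0, 0]
spike = [1, 5, 1]
idx, dist = best_match(recording, spike)
```

The scan does roughly n × m point comparisons for a series of n points and a query of m points. With n around 10^11 and m around 10^3, that is on the order of 10^14 comparisons for a single query, before accounting for the cost of reading the data.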
Recently, a team of researchers led by Eamonn Keogh of UC Riverside introduced a set of tools for mining time series with trillions of points. Their approach allows for ultrafast subsequence search under both Dynamic Time Warping and Euclidean Distance. To put their results in perspective, a time series with one trillion points is “… more than all of the time series data considered in all papers ever published in all data mining conferences combined”.
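For readers unfamiliar with Dynamic Time Warping, the following is a textbook dynamic-programming sketch of the distance itself, not the researchers' optimized tools. The function name `dtw_distance` and its `window` parameter are hypothetical, and the choice of squared point differences with a final square root is one common convention, assumed here so that a window of 0 reduces to Euclidean distance on equal-length inputs.

```python
import math


def dtw_distance(a, b, window=None):
    """Dynamic Time Warping distance between sequences `a` and `b`.

    `window` (assumed parameter) limits how far the alignment may stray
    from the diagonal; None means unconstrained warping.
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("sequences must be non-empty")
    # The band must be at least |n - m| wide or no alignment exists.
    w = max(n, m) if window is None else max(window, abs(n - m))

    # prev[j] holds the best cumulative cost aligning a[:i-1] with b[:j].
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of: step in a, step in b, or step in both.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return math.sqrt(prev[m])


# The second sequence is the first with its opening value held one step longer.
same_shape = dtw_distance([0, 1, 2, 3], [0, 0, 1, 2, 3])
# With window=0 and equal lengths, only the diagonal alignment is allowed.
rigid = dtw_distance([0, 3], [4, 0], window=0)
```

Unlike Euclidean distance, DTW can compare sequences of different lengths and tolerates local stretching in time, but this basic version costs on the order of n × m operations per comparison, which is what makes searching very long series under DTW expensive without further optimization.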