Time series tools are transforming companies. We highlight areas that need to be addressed to enhance their effectiveness.
By Ira Cohen and Ben Lorica.
Time series and temporal data are everywhere. Most of the data that companies collect from users, sensors, and machines come with a date/time stamp. Time series data are used for reports and dashboards, for decision making, and in statistical and machine learning models that power many AI applications. Many organizations use forecasting tools to predict potential issues, make better decisions, and measure their impact. Insight into the future gives businesses an enormous competitive advantage.
Time series have given rise to publicly traded companies, a variety of open source tools, and startups that have collectively raised over a billion dollars. The global market for time series analysis software is expected to grow at a compound annual rate of 11.5% from 2020 to 2027. In spite of their ubiquity and importance, time series data lack the cachet of other data types. The most discussed developments in machine learning and AI in recent years involve text (large language models), visual data (computer vision), audio (speech technologies), or their combinations (DALL·E).
Until recently, time series modeling required immersion and specialization even for data scientists and machine learning engineers. The best performing and most efficient models had roots in statistics and econometrics, not in machine learning. State-of-the-art algorithms and libraries weren’t available in Python, the language used widely in machine learning and data science. The release of Facebook’s Prophet library opened up time series modeling to more people. A variety of time series problems could be addressed using Prophet’s simple, automated approach, and its streamlined API made it accessible to developers and data scientists who aren’t experts in time series analysis. By now Prophet’s shortcomings are well documented; nonetheless, it inspired the creation of several other easy-to-use open source libraries that are faster and more accurate.
This post provides an overview of tools for handling and unlocking time series data, as well as a list of suggestions to enhance the effectiveness of current solutions.
An Overview of Time Series Tasks
Data Management: Time series databases first came to our attention about a decade ago when leading technology companies built bespoke systems mainly for IT observability applications. These days they come in many flavors: open source or proprietary; on-premise or hosted; optimized for OLAP or OLTP (see TSBS). Time series databases have also expanded to use cases beyond IT monitoring, to areas such as IoT and manufacturing, real-time intelligence and analytics, cryptocurrencies and Web3, marketing and advertising, asset tracking, logistics, and more.
Data Implementation: Organizations that collect large amounts of time series data need to be able to mine those collections efficiently. Getting the most out of time series repositories requires efficient representations, data processing, and indexing strategies.
BI and Analytics: As time series databases and scalable tools for data ingestion, stream processing, and analytics have developed, it has become easier to generate reports and dashboards that get updated in real time. Time series databases are extremely efficient for temporal queries and provide easy access to temporal information (usually via SQL).
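To make the kind of temporal SQL these systems expose concrete, here is a toy time-bucketed downsampling query. Python's built-in sqlite3 serves as a stand-in for a dedicated time series database; the schema, timestamps, and 5-minute bucket size are all illustrative:

```python
import sqlite3

# In-memory SQLite as a stand-in for a time series database; dedicated
# systems expose similar time-bucketed aggregations via SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (ts INTEGER, value REAL)")  # ts = Unix seconds

# Two hours of synthetic 30-second samples.
rows = [(1_700_000_400 + i * 30, float(i % 10)) for i in range(240)]
conn.executemany("INSERT INTO metrics VALUES (?, ?)", rows)

# Downsample to 5-minute buckets: average value per 300-second window.
query = """
SELECT (ts / 300) * 300 AS bucket, AVG(value) AS avg_value
FROM metrics
GROUP BY bucket
ORDER BY bucket
"""
buckets = conn.execute(query).fetchall()
print(len(buckets), buckets[0])
```

Dedicated time series databases typically offer first-class date/time bucketing functions for this pattern, rather than the integer-division trick used here.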
Modeling: Building statistical or machine learning models against specific time series is a common task. This includes applications like forecasting and anomaly detection, as well as complex multivariate models such as the trading algorithms used in quantitative finance. A few popular open source libraries for time series modeling are listed in Figure 3. Unlike areas like computer vision, speech, and text where neural models are dominant, statistical and tree-based machine learning models remain prevalent in time series modeling. In fact, the popular libraries in Figure 3 mainly provide models with roots in statistics, econometrics, finance, and classic machine learning.
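As a minimal illustration of the statistical style of model that remains prevalent, here is a rolling z-score anomaly detector in plain Python. The window and threshold values are illustrative choices, not drawn from any particular library:

```python
import statistics

def rolling_zscore_anomalies(series, window=12, threshold=3.0):
    """Flag points whose z-score versus the trailing window exceeds the
    threshold -- a simple statistical detector of the kind many time
    series libraries build on."""
    anomalies = []
    for i in range(window, len(series)):
        past = series[i - window:i]
        mu = statistics.fmean(past)
        sigma = statistics.stdev(past)
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

data = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.3, 10.1, 9.9, 10.0, 10.2, 10.1,
        10.0, 25.0, 10.1]  # one injected spike at index 13
print(rolling_zscore_anomalies(data))
```

Production libraries add trend and seasonality handling on top of this basic idea, but the core statistical reasoning is the same.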
Data Augmentation: In addition to data repositories and data markets that carry time series data, there are now synthetic data generation tools that target time series modeling. We spoke with the founders of synthetic data generation startups, and they highlighted the growing number of time series applications of their tools.
Time Series Data Mining: Once proper indexes, efficient representations, and similarity measures have been created, time series collections can be explored using data mining techniques. Classification, clustering, and searching through a large number of time series have important applications in many domains. Medical EEGs and ECGs result in time-series data collections that collectively have billions of points. Some research hospitals store trillions of points of EEG data. Other domains where large time series data collections are routine include IT and application performance monitoring, gesture recognition & user interface design, astronomy, robotics, geology, wildlife monitoring, security, and biometrics. New tools, such as vector databases and embeddings, will likely be useful for mining large amounts of time series data.
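A minimal sketch of similarity search over a small collection, using z-normalized Euclidean distance (a common time series similarity measure, since z-normalization makes the comparison invariant to scale and offset). The example series are made up:

```python
import math
import statistics

def znorm(ts):
    """Z-normalize a series so comparisons ignore scale and offset."""
    mu, sigma = statistics.fmean(ts), statistics.pstdev(ts)
    return [(x - mu) / sigma for x in ts] if sigma > 0 else [0.0] * len(ts)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor(query, collection):
    """Return the index of the series in `collection` closest to `query`
    under z-normalized Euclidean distance."""
    q = znorm(query)
    return min(range(len(collection)),
               key=lambda i: euclidean(q, znorm(collection[i])))

collection = [
    [1, 2, 3, 4, 5],   # rising
    [5, 4, 3, 2, 1],   # falling
    [2, 4, 2, 4, 2],   # oscillating
]
query = [10, 20, 30, 40, 50]  # rising, but on a different scale
print(nearest_neighbor(query, collection))  # matches the rising series
```

At the scale of billions or trillions of points, the brute-force scan above is replaced by the indexing structures and lower-bounding techniques mentioned earlier, but the distance computation is the same.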
Proposals and Challenges
This is a great time to be working on time series: tools that address all the major tasks continue to improve, and some areas, like data management, boast several well-funded startups. Below are our suggestions for how time series tools can gain more users and increase their impact.
We need more modeling tools for streaming data
While time series data are associated with streaming applications, the modeling tools in Figure 3 are designed for offline use. Some of the most exciting tools and projects in data engineering and data infrastructure target streaming applications, and there should be more time series modeling tools that can handle continuously arriving data. However, time series modeling of streaming data comes with its own set of challenges. For example, in many streaming settings batch learning is not scalable, which calls for sequential learning algorithms (learning on the stream). Another challenge is updates to historical data (often due to late-arriving records), which require efficient model updates that account for changed history.
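One of the simplest examples of learning on the stream is exponential smoothing, which updates its state one observation at a time and never needs a batch refit. A minimal sketch (the class name and smoothing constant are illustrative):

```python
class OnlineEWMA:
    """Exponentially weighted moving average updated one observation at a
    time: a minimal example of sequential learning, where the model state
    is revised as each new point arrives rather than refit in batch."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha  # weight given to the newest observation
        self.level = None

    def update(self, x):
        if self.level is None:
            self.level = x
        else:
            self.level = self.alpha * x + (1 - self.alpha) * self.level
        return self.level  # serves as the one-step-ahead forecast

model = OnlineEWMA(alpha=0.5)
stream = [10.0, 12.0, 11.0, 13.0]
forecasts = [model.update(x) for x in stream]
print(forecasts)  # [10.0, 11.0, 11.0, 12.0]
```

Handling late-arriving corrections to history is much harder than this forward-only update, which is exactly the second challenge noted above.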
Challenges caused by insufficient data require solutions
Lack of data is a surprisingly common problem in time series. For example, we need better tools to address the cold start problem: fitting a model to a very short time series. Lack of data partially explains why simple models remain popular: complex neural models are harder to train when the amount of available data is limited.
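This is one reason simple baselines endure: they work with almost no history. A sketch of two such baselines for a cold-start series with only three observations (the data are made up):

```python
def naive_forecast(series, horizon):
    """Repeat the last observation -- a baseline that works even for a
    series with a single point (the extreme cold-start case)."""
    return [series[-1]] * horizon

def mean_forecast(series, horizon):
    """Forecast the historical mean; slightly smoother once a handful of
    points exist, but still usable on very short series."""
    mean = sum(series) / len(series)
    return [mean] * horizon

short_series = [102.0, 98.0, 101.0]  # only three observations
print(naive_forecast(short_series, 2))  # [101.0, 101.0]
print(mean_forecast(short_series, 2))
```

A complex neural model has nothing to learn from three points; beating these baselines in the cold-start regime usually requires borrowing information from related series, which is where transfer learning comes in.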
We are still in the early stages of understanding how to do transfer learning for time series modeling (or even joint training across variable-length time series). Among the avenues being explored is the role of synthetic data generators. As synthetic data generation tools gain traction in computer vision, startups and researchers are optimistic that they will have an equally significant impact on problems involving structured data (including time series). The challenge with transfer learning and synthetic data in the time series domain is the lack of homogeneous behavior across time series, even within the same domain (see Section 3 for some examples). Non-homogeneity makes it difficult to know which models can be transferred to a new time series, or what data to generate that will be relevant to the problem at hand.
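To make the synthetic-data idea concrete, here is a minimal generator that composes trend, seasonality, and noise. The function name, parameters, and defaults are illustrative and not tied to any particular synthetic-data product:

```python
import math
import random

def synthesize_series(n, trend=0.05, season_period=24, season_amp=2.0,
                      noise_sd=0.5, seed=0):
    """Generate a synthetic series as trend + seasonality + Gaussian noise.
    A fixed seed makes the output reproducible."""
    rng = random.Random(seed)
    return [
        trend * t
        + season_amp * math.sin(2 * math.pi * t / season_period)
        + rng.gauss(0, noise_sd)
        for t in range(n)
    ]

series = synthesize_series(96)  # four synthetic "days" of hourly data
print(len(series))
```

The non-homogeneity problem shows up immediately here: every choice of trend, period, amplitude, and noise level produces a different regime, and generated data only helps if those choices match the target series.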
Better and more efficient tools are needed for multivariate modeling of time series
While univariate models for representing time series are useful in many applications, jointly modeling multiple time series can increase the accuracy of tasks such as forecasting, anomaly detection, and pattern discovery. There are two main hindrances to applying these models more widely. The first is the wide variability in the behaviors of different time series, which makes it harder to capture them in a single model. Solutions may be borrowed from other data domains: in many machine learning tasks, for example, differences in feature scales are addressed with normalization techniques. The second challenge is jointly modeling time series that are measured or reported at different time intervals (e.g., every second, minute, or hour), and sometimes at irregular intervals. This variability in time scales poses a challenge to most types of time series models, which assume that data arrive at regular, known intervals for all series being jointly modeled. Alignment techniques, such as dynamic time warping, can help align the series before they are fed to a model, but they come at a computational cost and a potential loss of accuracy.
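Both remedies mentioned above can be sketched in a few lines: linear interpolation onto a common grid handles differing or irregular sampling intervals, and z-normalization handles differing scales. The data and grid below are made up for illustration:

```python
import statistics

def resample(timestamps, values, grid):
    """Linearly interpolate (t, v) observations onto a regular grid, so
    series reported at different or irregular intervals can be modeled
    jointly. Points past the last observation repeat the final value."""
    out, j = [], 0
    for t in grid:
        while j + 1 < len(timestamps) and timestamps[j + 1] <= t:
            j += 1
        if j + 1 < len(timestamps) and timestamps[j] <= t:
            t0, t1 = timestamps[j], timestamps[j + 1]
            v0, v1 = values[j], values[j + 1]
            out.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
        else:
            out.append(values[min(j, len(values) - 1)])
    return out

def znorm(xs):
    """Z-normalize so series on different scales become comparable."""
    mu, sd = statistics.fmean(xs), statistics.pstdev(xs)
    return [(x - mu) / sd for x in xs] if sd > 0 else [0.0] * len(xs)

grid = [0, 1, 2, 3, 4]
a = resample([0, 2, 4], [10.0, 14.0, 18.0], grid)      # sampled every 2s
b = resample([0, 1, 4], [100.0, 103.0, 112.0], grid)   # irregular sampling
aligned = [znorm(a), znorm(b)]
```

After this preprocessing, both series share a time base and a scale, so they can be fed to a joint model; dynamic time warping is the heavier alternative when the series are shifted or warped in time, not merely sampled differently.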
Inch toward AutoML for time series
While the libraries in Figure 3 have made it easier to build anomaly detection and forecasting models, they still require experimentation and creativity. Experience with time series modeling still makes a difference. These open source packages require experimenting with different models, hyperparameter tuning, as well as the judgment of data scientists who have some familiarity with the domain and the underlying data. As we noted in a recent post on AutoML, there is an enormous pool of potential users (developers and analysts) with limited backgrounds in machine learning and statistics.
The reality is that unlike data and models used in NLP, computer vision, and speech, there are many different types and sources of time series data. As a result, it is much more difficult to develop (AutoML) tools that can handle a wide variety of behaviors and domains. Can we use techniques that have proved effective in modeling sequential data in areas like natural language processing? For example, will transformers be standard tools for time series or will simpler models continue to prevail?
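At its core, AutoML for time series automates a loop that data scientists currently run by hand: fit several candidate models, score them on a holdout window, and keep the best. A toy version of that loop over three classic baselines (the model set and holdout length are illustrative):

```python
def naive(train, h):
    return [train[-1]] * h

def mean_model(train, h):
    return [sum(train) / len(train)] * h

def drift(train, h):
    slope = (train[-1] - train[0]) / (len(train) - 1)
    return [train[-1] + slope * (i + 1) for i in range(h)]

def mae(actual, pred):
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

def select_model(series, holdout=4):
    """Pick the baseline with the lowest holdout error -- the core loop
    that AutoML-for-time-series tools automate at much larger scale."""
    train, test = series[:-holdout], series[-holdout:]
    candidates = {"naive": naive, "mean": mean_model, "drift": drift}
    scores = {name: mae(test, f(train, holdout))
              for name, f in candidates.items()}
    return min(scores, key=scores.get), scores

series = [float(x) for x in range(1, 21)]  # a steadily trending series
best, scores = select_model(series)
print(best)  # the drift model wins on a trending series
```

Real AutoML systems add hyperparameter search, backtesting over multiple windows, and far richer model families, but the selection principle is the same, and the diversity of time series behaviors is what makes choosing the candidate set hard.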
A recent post from Amazon hints at how neural models may lead to more automated tools. They chronicled their decade-long journey towards building a “unified forecasting model that would produce accurate forecasts for multiple scenarios, forecasts, and categories”. Along the same lines, Anodot Autonomous Forecast uses neural models to automatically generate forecasting models. In much the same way that computer vision, text and speech were areas where access to large datasets led to foundation models, having access to large amounts of time series data of many different types allowed Amazon and Anodot to build more automated forecasting models.
Over time, we expect to see open source time series libraries and cloud services with similar automation and AutoML capabilities. Nixtla, for example, continues to develop and refine open source tools for automated time series forecasting.
Invest in scalable and efficient modeling tools to unlock huge amounts of data
For organizations with thousands, millions, or even trillions of time series, the priority is to surface important patterns and to receive the most important alerts and reports in a timely fashion. This involves using machine learning to monitor a large volume of time series data in real time and surface only the most important alerts and visualizations. Tools from the “ops” world (IT Ops, DevOps) have long prioritized reducing false-positive alerts and supplying features that reduce mean time to recovery (via efficient root cause analysis tools). We hope to see continued progress in tools that combine machine learning and visualization technologies to help teams who need to monitor and mine large numbers of time series.
There are also situations where many models need to be built in parallel. For example, the open source project Ray can be used to train individualized time series models for different sensors and devices in parallel. Additionally, Ray can handle settings where nested parallelism is required: for instance, consider applications where for each sensor, different models are trained and an ensemble model or least error model needs to be determined.
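The per-sensor pattern is easy to sketch with Python's standard library; a framework like Ray exposes a similar pattern (remote tasks) but scales it across a cluster and supports the nested parallelism described above. Here the "model" is just a mean, as a stand-in for real model fitting, and the sensor names are made up:

```python
from concurrent.futures import ThreadPoolExecutor

def fit_mean_model(sensor_id, readings):
    """Stand-in for per-sensor model training; a real pipeline would fit
    a forecasting or anomaly detection model here instead of a mean."""
    return sensor_id, sum(readings) / len(readings)

# One short series per sensor.
sensors = {
    "sensor-a": [1.0, 2.0, 3.0],
    "sensor-b": [10.0, 10.0, 10.0],
    "sensor-c": [5.0, 7.0],
}

# Train one model per sensor in parallel.
with ThreadPoolExecutor(max_workers=3) as pool:
    models = dict(pool.map(lambda kv: fit_mean_model(*kv), sensors.items()))
print(models)
```

Swapping the thread pool for distributed workers is exactly the step that motivates tools like Ray once the number of sensors grows past what one machine can handle.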
As time series applications become more prevalent, data infrastructure must include advanced capabilities.
Listed below are a few nice-to-have features:
- Support for near-real-time calculations and user-defined functions on streaming data.
- Built-in subsampling and interpolation functions.
- Outlier exclusion capabilities, so users can easily query without including anomalies.
- Compression that saves space without sacrificing query speed.
- Support for data revisions along with point-in-time views of the data (i.e., what the data looked like before an update).
- Enhanced data retention and partitioning capabilities.
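The point-in-time view in particular maps to a well-known bitemporal pattern: record when each value was written, so queries can reconstruct what the database knew at any moment. A minimal sketch using sqlite3 as a stand-in (schema, column names, and timestamps are illustrative):

```python
import sqlite3

# Each row records when the value was written (recorded_at), so we can
# reconstruct what the data looked like before a late correction.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (ts INTEGER, value REAL, recorded_at INTEGER)")
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", [
    (100, 5.0, 1000),   # original reading
    (100, 7.5, 2000),   # late correction of the same timestamp
])

def as_of(conn, ts, recorded_at):
    """Return the value for `ts` as the database knew it at `recorded_at`."""
    row = conn.execute(
        """SELECT value FROM readings
           WHERE ts = ? AND recorded_at <= ?
           ORDER BY recorded_at DESC LIMIT 1""",
        (ts, recorded_at),
    ).fetchone()
    return row[0] if row else None

print(as_of(conn, 100, 1500))  # view before the correction: 5.0
print(as_of(conn, 100, 2500))  # view after the correction: 7.5
```

A production system would handle this with versioned storage and retention policies rather than a self-managed audit table, but the query semantics are the same.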
Are specialized time series databases worth it in the long run?
As we note in Figure 2, there are many time series databases, and your selection process should take into account the type of workload (OLAP/analytics or OLTP/transactions). What will the future hold for standalone time series databases? When it comes to analytics, will it become more common to use optimized time series indices within established OLAP systems and lakehouses? What role will vector databases and embeddings play in time series? Quite a bit depends on how quickly and effectively these other systems can improve their time series capabilities.
It’s an exciting time to be working on time series applications in all their forms. While there are many useful tools addressing all key areas of time series data management and modeling, there are still many potential use cases and unexplored possibilities. If you are a developer or founder and would like to exchange notes, shoot us an email at email@example.com.
Ben Lorica is principal at Gradient Flow. He is an advisor to Anodot and other startups.
If you enjoyed this post please support our work by encouraging your friends and colleagues to subscribe to our newsletter:
Update (2022-12-08): We discuss topics covered in this post in an episode of The Data Exchange podcast:
- The Data Exchange podcast: time series episodes
- The Vector Database Index
- Here’s what we need to do to fix AutoML
- What is Graph Intelligence?