A scalable time-series database that supports SQL

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Michael Freedman on TimescaleDB and scaling SQL for time-series.


In this episode of the Data Show, I spoke with Michael Freedman, CTO of Timescale and professor of computer science at Princeton University. When I first heard that Freedman and his collaborators were building a time-series database, my immediate reaction was: “Don’t we have enough options already?” The early incarnation of Timescale was a startup focused on IoT, and it was while building tools for the IoT problem space that Freedman and the rest of the Timescale team came to realize that the database they needed wasn’t available (at least not in open source). Specifically, they wanted a database that could easily support complex queries and the sort of real-time applications many have come to associate with streaming platforms. Based on early reactions to TimescaleDB, many users concur.

Here are some highlights from our conversation:

The need for a time-series database

We initially were developing a platform to collect, store, and analyze IoT data, and certainly a lot of IoT data is time-series in nature. We found ourselves struggling. The reason a lot of people adopted NoSQL was that they thought it offered scale in ways that more traditional relational databases did not; yet they often gave up the rich query language, optimized complex queries, joins, and ecosystem that you get with those more traditional relational databases. Customers who were using our platform kept wanting all these ways to query the data, and we couldn’t do it with the existing NoSQL database we were using. It just didn’t support those types of queries.

We ended up building one ourselves, in fact, on top of Postgres. In architecting Postgres in a very particular way for time-series workloads, we came to realize that this is not a problem limited to us. We think there is still an important space in the market where people either use a vanilla relational database that does have scaling problems, or they go to something like NoSQL, because a lot of time-series tooling came from one particular use case: things like server metrics. People’s needs are much broader than just server metrics, so we thought there was an important area somewhat missing from what people had before.
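The “very particular way” of architecting Postgres that Freedman alludes to is TimescaleDB’s partitioning of a large table into time-based chunks. As a rough illustration of why that helps, here is a toy Python sketch (the class, the one-hour chunk width, and all names are illustrative assumptions, not TimescaleDB’s actual implementation): a time-ranged query only has to touch the chunks that overlap the requested window.

```python
from collections import defaultdict

CHUNK_SECONDS = 3600  # hypothetical partition width: one chunk per hour

class HypertableSketch:
    """Toy time-partitioned store: rows are grouped into chunks by timestamp."""

    def __init__(self):
        self.chunks = defaultdict(list)  # chunk start time -> list of (ts, row)

    def insert(self, ts, row):
        # Route each row to the chunk covering its timestamp.
        self.chunks[ts - ts % CHUNK_SECONDS].append((ts, row))

    def query(self, start, end):
        # Touch only the chunks overlapping [start, end), not the whole table.
        first = start - start % CHUNK_SECONDS
        out = []
        for chunk_start in sorted(self.chunks):
            if chunk_start < first or chunk_start >= end:
                continue  # chunk entirely outside the window: skipped
            out.extend(row for ts, row in self.chunks[chunk_start]
                       if start <= ts < end)
        return out

table = HypertableSketch()
table.insert(10, "a")
table.insert(3700, "b")
table.insert(7300, "c")
recent = table.query(0, 3600)  # only the first chunk is inspected
```

The point of the sketch is the pruning step in `query`: as data grows, a query over a narrow time window stays cheap because most chunks are skipped outright.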

… The interesting thing about a time-series database is sometimes that data starts in one part of your organization, and then different parts of your organization quickly find a use for that data.

… In many cases, the people who are asking questions actually know SQL already; some of them may not but are using existing tools that support SQL. So, if you have a database that doesn’t support SQL, then those existing tools often can’t directly work with it. You would have to integrate them. You’d have to build special connectors. That was one of the things we wanted when we set out to build Timescale. We wanted to give the appearance that this looks like Postgres. It just looks like a traditional relational database. If you have any of those existing tools and business applications, you could just speak directly to it as if it’s a traditional database. It just happens to be much more efficient and much more scalable for time-series data.

Column-oriented and row-oriented databases

In the beginning, we weren’t setting out to build our own time-series database. … A lot of the time-series databases on the market now are column-oriented, because that allows you to do very fast aggregations on a single column. TimescaleDB also allows you to define a schema, and different metrics can each be in their own column.

There is a difference between what are known as column-oriented databases and traditional SQL databases, which are row-oriented. This relates to how they store data on disk: are all of the values in a row stored contiguously? In a column-oriented database, even if a bunch of metrics belong to the same row, they’re actually going to be stored almost separately. It’s like every column becomes its own table.
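The two layouts Freedman describes can be made concrete with a toy in-memory example (the data and column names are made up for illustration; real databases do this on disk, not in Python lists):

```python
# Row-oriented: each record's values sit together, one tuple/dict per row.
rows = [
    {"time": 1, "cpu": 0.9, "mem": 512, "temp": 41.0},
    {"time": 2, "cpu": 0.2, "mem": 900, "temp": 39.5},
]

# Column-oriented: each column is stored contiguously,
# almost as if every column were its own table.
cols = {
    "time": [1, 2],
    "cpu":  [0.9, 0.2],
    "mem":  [512, 900],
    "temp": [41.0, 39.5],
}

# Aggregating one column only touches that column in the column layout.
avg_cpu = sum(cols["cpu"]) / len(cols["cpu"])

# A multi-column predicate, by contrast, must line values back up across
# columns; in the row layout it is just one pass over the rows.
hot = [r for r in rows if r["cpu"] > 0.5 or r["mem"] < 600]
```

The asymmetry is the whole trade-off: the column layout wins for single-column rollups, while the row layout keeps a row’s values adjacent for multi-column predicates.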

For example, a column store makes it really easy and fairly efficient to scan a single column. If all you want to do is take the average of the CPU, that’s efficient; but you may want to ask a question with a rich predicate (a predicate is the WHERE clause in SQL). If you want to ask something like, “Tell me the average temperature of all devices where the CPU is above a certain threshold, or the free memory is below some limit,” then internally, with column-oriented databases, each of those WHERE clauses touches a different column, almost a different table, that the database needs to scan and then JOIN. While column-oriented databases might be very efficient for rolling up a single column, anything richer becomes a lot more expensive. Some of these databases don’t have indexes for these WHERE clauses, so any such question takes a full table scan.
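The query Freedman describes is easy to state in SQL. Here it is run against an in-memory SQLite database as a neutral stand-in (the table, columns, and threshold values are invented for illustration; TimescaleDB would express it the same way, since it speaks standard SQL through Postgres):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE metrics (device TEXT, cpu REAL, free_mem REAL, temp REAL)"
)
con.executemany(
    "INSERT INTO metrics VALUES (?, ?, ?, ?)",
    [
        ("a", 0.95, 100.0, 42.0),  # high CPU
        ("b", 0.10, 900.0, 38.0),  # healthy
        ("c", 0.20, 50.0, 45.0),   # low free memory
    ],
)

# "Average temperature of all devices where the CPU is above a threshold
# OR free memory is below a limit": one query, a two-column predicate.
(avg_temp,) = con.execute(
    "SELECT AVG(temp) FROM metrics WHERE cpu > 0.9 OR free_mem < 64"
).fetchone()
```

In a row store, evaluating that WHERE clause is one scan over rows whose values are already adjacent; in a pure column store, `cpu` and `free_mem` live in separate structures that must effectively be joined back together first.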

If only 1% of devices have a high CPU and you say, “Tell me all the statistics where the device has a high CPU,” then in time-series databases that lack indexing on these columns, you end up scanning all of the data, not just the 1%. With something like TimescaleDB, which can build these efficient secondary indexes, you can quickly focus in on the important data, so the only thing you need to touch is that 1% of the data, not all of it.
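The effect of a secondary index on a selective predicate can be observed directly with a query planner. The sketch below again uses SQLite as a stand-in (the schema and data are fabricated, and SQLite is not TimescaleDB, but the scan-versus-index-search distinction it reports is the same one Freedman is describing):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE metrics (device TEXT, cpu REAL)")
# 1 high-CPU device out of 100, mirroring the "only 1%" scenario.
con.executemany(
    "INSERT INTO metrics VALUES (?, ?)",
    [(f"dev{i}", 0.99 if i == 0 else 0.10) for i in range(100)],
)

query = "SELECT * FROM metrics WHERE cpu > 0.9"

# Without an index, the planner has no choice but a full scan of every row.
plan_before = con.execute("EXPLAIN QUERY PLAN " + query).fetchone()[3]

# A secondary index on the predicate column lets it seek to the matches.
con.execute("CREATE INDEX idx_cpu ON metrics (cpu)")
plan_after = con.execute("EXPLAIN QUERY PLAN " + query).fetchone()[3]
```

Before the index, the plan reports a scan of the whole table; afterward, it reports a search using `idx_cpu`, touching only the rows that satisfy the predicate.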

Related resources: