Why Build a Time Series Data Platform?
I am frequently asked: "Why build a database specifically for time series?" The implication is that a general SQL database can act as a TSDB by ordering on some time column, or that you can build on top of a distributed database like Cassandra. While it's possible to use these solutions for time series problems, doing so is time consuming and requires significant development effort. I talked to other engineers to see what they had done and found a common set of tasks that pointed to the need for a common Time Series Platform. Everyone seemed to be reinventing the wheel, so it looked like there was a gap in the market for something built specifically for time series.
Defining The Time Series Problem
- Regular time series are familiar to developers in the DevOps or metrics space. These are simply measurements that are taken at fixed intervals of time, like every 10 seconds. This is also seen quite frequently in sensor data use cases, like taking a reading at regular intervals from a sensor. The important thing about regular time series is that they represent a summarization of some underlying raw event stream or distribution. Summarization is very useful when looking for patterns or visualizing data sets that have more events than you have pixels to draw.
- The second type of time series, irregular, corresponds to discrete events. These could be requests to an API, trades in a stock market, or really any kind of event that you'd want to track in time. It's possible to induce a regular time series from an irregular one. For example, if you want to calculate the average response time from an API in 1 minute intervals, you can aggregate the individual requests to produce the regular time series.
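To make the last point concrete, here is a minimal sketch of inducing a regular series from irregular events by averaging response times per 1-minute interval. The function name and the sample data are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def average_per_interval(events, interval_seconds=60):
    """Turn irregular (timestamp, value) events into a regular series.

    Each event is assigned to the interval it falls in, and the values
    in each interval are averaged.
    """
    buckets = defaultdict(list)
    for timestamp, value in events:
        bucket_start = int(timestamp // interval_seconds) * interval_seconds
        buckets[bucket_start].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]

# Hypothetical API requests: (Unix time in seconds, response time in ms)
requests = [(0, 120.0), (12, 80.0), (59, 100.0), (61, 200.0), (130, 50.0)]
regular_series = average_per_interval(requests)
```

Note that intervals with no events simply do not appear in the output; whether to fill them with a default value is a design choice this sketch leaves open.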
Time Series Applications & Scale
- Time series data needs to focus on fast ingestion. That is, you're always inserting new data. Most often, these are append operations where you're adding only recent time series data—although users do sometimes need historical backfill, and with sensor data use cases, we frequently see lagged data collection. Even with the latter, you're usually appending recent data to each individual series.
- High-precision data is kept for some short period of time, with longer retention periods for summary data at medium or lower precision. For example, you might keep the raw high-precision samples alongside summaries at 5-minute and 1-hour intervals. Operationally, this means that you must be constantly deleting data from the database. The high-precision data is resident for a short window and then should be evicted. This is a very different workload from what a normal database is designed to handle.
- An agent or the database itself must continuously compute summaries from the high-precision data for longer term storage. These could be simple aggregates like first, last, min, max, sum, count or could include more complex computations like percentiles or histograms.
- The query pattern of time series can be quite different from other database workloads. In most cases, a query will pull back data for a requested time range. Databases that compute aggregates and downsamples on the fly will frequently churn through many records to produce the result set for a query. Quickly iterating through many records to compute an aggregate is critical for the time series use case.
- Server and application monitoring
- Real-time analytics
- IoT sensor data monitoring and control
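The retention and summarization workload described above (keep raw samples briefly, keep summaries longer) can be sketched roughly as follows. This is an in-memory illustration under assumed names and window sizes, not how any particular database implements it.

```python
def rollup(samples, interval_seconds):
    """Summarize (timestamp, value) samples into fixed intervals.

    Returns {interval_start: summary}, where each summary holds the
    simple aggregates: first, last, min, max, sum, count.
    """
    summaries = {}
    for timestamp, value in sorted(samples):
        start = timestamp - (timestamp % interval_seconds)
        s = summaries.get(start)
        if s is None:
            summaries[start] = {"first": value, "last": value, "min": value,
                                "max": value, "sum": value, "count": 1}
        else:
            s["last"] = value
            s["min"] = min(s["min"], value)
            s["max"] = max(s["max"], value)
            s["sum"] += value
            s["count"] += 1
    return summaries

def evict(samples, now, retention_seconds):
    """Drop raw samples older than the retention window."""
    cutoff = now - retention_seconds
    return [(t, v) for t, v in samples if t >= cutoff]

# Hypothetical raw samples taken every 10 seconds for 10 minutes.
raw = [(t, float(t % 7)) for t in range(0, 600, 10)]
five_minute = rollup(raw, 300)                    # summaries kept long term
raw = evict(raw, now=600, retention_seconds=120)  # raw data kept for 2 minutes
```

The summaries must be computed before the raw samples are evicted, which is why this work has to run continuously alongside ingestion.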
The Problem of Using a SQL Database for Time Series
- Create a single table to store everything with the series name, the value, and a time.
  - This requires a separate lookup index if we want to search on anything other than the specific name (like server, metric, service, etc.).
  - This naive implementation would have a table that gets 172M new records per day. This would quickly cause a problem because of the sheer size of the table.
  - With time series, it's common to have high-precision data that is kept around only for a short period of time. This means that soon you'll be doing just as many deletes as inserts, which isn't something a traditional DB is designed to handle well.
- Create a separate table per day or some other period of time.
  - This requires the developer to write application code to tie the data from the different tables together.
  - More code must be written to compute summary statistics for lower-precision data and to periodically drop old tables.
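A minimal sketch of the single-table approach, using SQLite only because it ships with Python. The table and column names are hypothetical. The post does not say what workload produces the 172M figure; the arithmetic in the first lines shows one assumed workload (20,000 series sampled every 10 seconds) that would give a rate of that size.

```python
import sqlite3

# Assumed workload, for illustration only: 20,000 series sampled every 10 s.
samples_per_series_per_day = 86_400 // 10
rows_per_day = 20_000 * samples_per_series_per_day  # 172,800,000

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE measurements (
        series_name TEXT    NOT NULL,  -- e.g. "server01.cpu.user"
        time        INTEGER NOT NULL,  -- Unix seconds
        value       REAL    NOT NULL
    )
""")
conn.execute("CREATE INDEX idx_series_time ON measurements (series_name, time)")

# Ingest: append-only inserts of recent data.
rows = [("server01.cpu.user", t, 0.5) for t in range(0, 100, 10)]
conn.executemany("INSERT INTO measurements VALUES (?, ?, ?)", rows)

# Retention: once the window is full, each new insert is eventually
# matched by a delete of an expired row.
retention_cutoff = 50
deleted = conn.execute(
    "DELETE FROM measurements WHERE time < ?", (retention_cutoff,)
).rowcount
remaining = conn.execute("SELECT COUNT(*) FROM measurements").fetchone()[0]
```

Searching by server or metric rather than by the full series name would need extra columns or a separate lookup table, which this sketch omits.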
Then there's the issue of scaling past what a single SQL server can handle. Sharding segments of the time series to different servers is a common technique but requires more application-level code to handle it.
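As a rough illustration of the application-level code that sharding requires, the sketch below routes each series to a server by hashing its name. The server names and the `shard_for` helper are hypothetical.

```python
import hashlib

def shard_for(series_name, shard_count):
    """Map a series name to a shard index using a stable hash.

    hashlib is used instead of the built-in hash(), which is randomized
    per process for strings and so cannot be used for routing.
    """
    digest = hashlib.sha256(series_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count

servers = ["db0.example.internal", "db1.example.internal", "db2.example.internal"]
target = servers[shard_for("server01.cpu.user", len(servers))]
```

With simple modulo routing, changing the number of servers remaps most series, so the application also has to handle data migration. A query that spans many series has to fan out to several servers and merge the results.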
Building on Distributed Databases
After initially working with a more standard relational database, many will look at distributed databases like Cassandra or HBase. As with the SQL variant, building a time series solution on top of Cassandra requires quite a bit of application-level code.
First, you need to decide how to structure the data. Rows in Cassandra get stored to one replication group, which means that you need to think about how to structure your row keys to ensure that the cluster is properly utilized without creating hot spots for writes and reads. Then, once you've decided how to arrange the data, you need to write application logic to do additional query processing for the time series use case. You'll also end up writing downsampling logic to handle creating lower-precision samples that can be used for longer-term visualizations. Finally, once you have the basics wired up, it will be a continual chore to ensure that you get the query performance you need when querying many time series and computing aggregates across different dimensions.
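One common way to structure row keys, sketched here without any Cassandra driver calls, is to combine the series name with a time bucket so that a single series is spread across many rows over time. The one-day bucket size and the function names are assumptions for illustration.

```python
def partition_key(series_name, timestamp, bucket_seconds=86_400):
    """Build a (series, bucket_start) key for one sample."""
    bucket_start = timestamp - (timestamp % bucket_seconds)
    return (series_name, bucket_start)

def keys_for_range(series_name, start, end, bucket_seconds=86_400):
    """List every key a query over [start, end] has to read."""
    first = start - (start % bucket_seconds)
    return [(series_name, b) for b in range(first, end + 1, bucket_seconds)]

# A write lands in exactly one bucket ...
key = partition_key("server01.cpu.user", 86_400 + 5)
# ... but a range query may have to touch several and stitch them together.
keys = keys_for_range("server01.cpu.user", 86_399, 86_401)
```

The bucket size is a trade-off: smaller buckets spread load more evenly but force range queries to read and merge more rows in application code.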
Advantages of Building Specifically for Time Series
So this brings us back around to the point of this post: Why build a Time Series Data Platform?
One of our goals in building a Time Series Platform was to optimize for a user’s or developer’s time to value. That is, the faster they get their problem solved and are up and running, the better the experience will be. That means that if we see users frequently writing code or creating projects to solve the same problems, we’ll try to pull that into our platform or database. The less code a developer has to write to solve their problem, the faster they’ll be done.
Time is Peculiar
Beyond the obvious usability goals, we also saw that we could optimize the database around some of the peculiarities of time series: writes are almost always inserts, data needs to be aggregated and downsampled, and high-precision data needs to be evicted automatically when users want to free up space. We could also build compression that was optimized for time series data, and we organized the data so that tags are indexed for efficient queries. At the database level, there were many optimizations available to us.
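To illustrate why time series data lends itself to specialized compression, here is a sketch of delta-of-delta encoding for timestamps. This is one well-known technique, shown as an assumption; the post does not say which compression scheme the database actually uses.

```python
def delta_of_delta_encode(timestamps):
    """Encode as: first timestamp, then the change in delta at each step.

    For a regular series the deltas are constant, so almost every
    encoded value after the second is 0.
    """
    if not timestamps:
        return []
    encoded = [timestamps[0]]
    prev, prev_delta = timestamps[0], 0
    for t in timestamps[1:]:
        delta = t - prev
        encoded.append(delta - prev_delta)
        prev, prev_delta = t, delta
    return encoded

def delta_of_delta_decode(encoded):
    """Invert delta_of_delta_encode."""
    if not encoded:
        return []
    timestamps = [encoded[0]]
    delta = 0
    for dod in encoded[1:]:
        delta += dod
        timestamps.append(timestamps[-1] + delta)
    return timestamps

# Samples every 10 seconds, with the last one arriving a second late.
original = [1000, 1010, 1020, 1030, 1041]
encoded = delta_of_delta_encode(original)  # [1000, 10, 0, 0, 1]
```

A real implementation would then pack the small integers into a few bits each; this sketch stops at the integer transformation.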
Going Beyond a Database to Make Development Easier
The other advantage in building specifically for time series is that we could go beyond the database. We’ve found that most users run into a common set of problems they need to solve—how to collect the data, how to store it, how to process and monitor it, and how to visualize it.
We’ve also found that having a common API makes it easier for the community to build solutions around our stack. We have the line protocol to represent time series data, our HTTP API for writing and querying, and Kapacitor for processing. This means that over time, we can have pre-built components for the most common use cases.
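As an illustration of what a shared text format makes possible, the sketch below builds a single line of line protocol from a measurement name, tags, fields, and a timestamp. It is a simplified formatter with hypothetical helper names, not the official client; it covers only basic escaping and value types, and the line protocol reference remains the authority on edge cases.

```python
def _escape(text, chars):
    """Backslash-escape each of the given characters in text."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text

def _format_field_value(value):
    if isinstance(value, bool):   # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"        # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'         # strings are double-quoted

def to_line(measurement, tags, fields, timestamp_ns):
    """Format one point as: measurement,tag=value field=value timestamp."""
    parts = [_escape(measurement, ", ")]
    for key in sorted(tags):      # sorted tag keys give a stable output
        parts.append(f"{_escape(key, ',= ')}={_escape(str(tags[key]), ',= ')}")
    field_str = ",".join(
        f"{_escape(k, ',= ')}={_format_field_value(v)}" for k, v in fields.items()
    )
    return f"{','.join(parts)} {field_str} {timestamp_ns}"

line = to_line(
    "cpu",
    {"region": "us-west", "host": "server01"},
    {"usage_user": 0.64, "count": 3},
    1465839830100400200,
)
```

Because the format is plain text, any collector or script that can produce lines like this can feed the same write path.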
We find that we can get better performance than more generalized databases while also reducing the developer effort to get a solution up by at least an order of magnitude. Doing something that might have taken months to get running on Cassandra or MySQL could take as little as an afternoon using our stack. And that’s exactly what we’re trying to achieve.
By focusing on time series, we can solve problems for application developers so that they can focus on the code that creates unique value inside their app.
About the Author:
Paul is the creator of InfluxDB. He has helped build software for startups, large companies and organizations like Microsoft, Google, McAfee, Thomson Reuters, and Air Force Space Command. He is the series editor for Addison Wesley’s Data & Analytics book and video series. In 2010 Paul wrote the book Service Oriented Design with Ruby and Rails for Addison Wesley. In 2009 he started the NYC Machine Learning Meetup, which now has over 10,000 members. Paul holds a degree in computer science from Columbia University.