Optimizing storage for historic time-series data

When dealing with time-series data, performance is crucial. Data that gets ingested should be available quickly for querying. To enable fast decision-making, analytical queries often need to return results in a fraction of a second.

As data ages, it is typically involved less often in real-time feedback loops, making query performance less of a concern. Instead, storage cost becomes more important as data accumulates. CrateDB is already very efficient in terms of storage usage (30 - 40% less than other time-series databases for equivalent schemas). In addition, CrateDB allows leveraging arrays to significantly reduce storage, which is described in this article.

Initial table design

Let’s look at a typical table design where one row is stored per second and sensor_id:

CREATE TABLE sensor_readings (
   time TIMESTAMP WITH TIME ZONE NOT NULL,
   sensor_id TEXT NOT NULL,
   battery_level DOUBLE PRECISION,
   battery_status TEXT,
   battery_temperature DOUBLE PRECISION
);

For 120 million records, the table has a size of 6.3 GB. As CrateDB indexes all columns by default, analytical queries like the one below finish in less than a second:

-- count of occurrences per day/sensor_id on which the battery level was below 10%
SELECT DATE_TRUNC('day', time) AS day,
       sensor_id,
       COUNT(*)
FROM sensor_readings
WHERE battery_level < 10
GROUP BY 1, 2
ORDER BY 1, 2;

Using arrays for time-based bucketing

Now we create a second table for historic data, optimizing storage consumption:

CREATE TABLE sensor_readings_historic (
   time_bucket TIMESTAMP WITH TIME ZONE NOT NOLL,
   time ARRAY(TIMESTAMP WITH TIME ZONE) INDEX OFF,
   sensor_id TEXT NOT NULL,
   battery_level ARRAY(DOUBLE PRECISION) INDEX OFF,
   battery_status ARRAY(TEXT) INDEX OFF,
   battery_temperature ARRAY(DOUBLE PRECISION) INDEX OFF
)
WITH (codec = 'best_compression');

The key changes are:

  • time is now modeled as an array of timestamps. time_bucket is a truncated timestamp on day-level for easier querying. It allows selecting all rows for a particular day without having to inspect all array values.
  • The metrics have also become arrays (battery_* columns)
  • The compression was changed to the more aggressive best_compression at the cost of slightly slower lookups
  • Indexes have been turned off for array columns

To maintain a good balance between storage efficiency and query performance, we limit the array size to 2880 elements when populating the table. If there are more values for a time_bucket, an additional row is inserted.

The size of the table is now 1.1 GB, which is a reduction of more than 80%.

Copying data into the historical table

To copy data from sensor_readings to sensor_readings_historic, you can use a query like this:

-- copying data into historic table
INSERT INTO sensor_readings_historic
SELECT DATE_TRUNC('day', time) AS time_bucket,
       ARRAY_AGG(time),
       sensor_id,
       ARRAY_AGG(battery_level) AS battery_level,
       ARRAY_AGG(battery_status) AS battery_status,
       ARRAY_AGG(battery_temperature) AS battery_temperature
FROM sensor_readings
-- filtering for the previous month
WHERE DATE_TRUNC('month', time) = DATE_TRUNC('month', NOW()) - '1 month'::INTERVAL;

-- deleting data from the previous table
DELETE FROM sensor_readings WHERE DATE_TRUNC('month', time) = DATE_TRUNC('month', NOW()) - '1 month'::INTERVAL;

Please note that DELETE statements in CrateDB should always be based on a partition, so that a complete partition can be dropped rather than a set of rows. For simplicity, we excluded partitioning in this article, please see Sharding and Partitioning Guide for Time Series Data.

Querying

Using the UNNEST table function, we can transform the table layout of sensor_readings_historic to match that of sensor_readings. A UNION ALL combines both tables, saved as a view:

CREATE VIEW sensor_readings_full_history AS
SELECT DATE_TRUNC('day', time) AS time_bucket,
       time,
       sensor_id,
       battery_level,
       battery_status,
       battery_temperature
FROM sensor_readings
UNION ALL
SELECT time_bucket,
       UNNEST(time) AS time,
       sensor_id,
       UNNEST(battery_level) AS battery_level,
       UNNEST(battery_status) AS battery_status,
       UNNEST(battery_temperature) AS battery_temperature
FROM sensor_readings_historic;

It is important to always query sensor_readings_historic with a WHERE condition on time_bucket to allow efficient row filtering. To keep both tables identical, we add a calculated time_bucket column to sensor_readings as well.

Instead of unnesting all arrays first and then applying analytical SQL functions, CrateDB also offers scalar functions directly on arrays. CrateDB 4.6.0 adds ARRAY_SUM, ARRAY_MIN, ARRAY_MAX, and ARRAY_AVG as new functions.

-- count days per sensor_id on which the battery level was below 10%
SELECT time_bucket,
       sensor_id,
       COUNT(*)
FROM sensor_readings_historic
WHERE ARRAY_MIN(battery_level) < 10
GROUP BY 1, 2;

Using user-defined functions

A more flexible alternative to predefined scalar functions is user-defined functions. They allow processing arrays via JavaScript for more sophisticated aggregations.

The example below recreates the first analytical query from above, counting the exact number of occurrences where the battery level was below 10%:

CREATE OR REPLACE FUNCTION count_less_than(ARRAY(DOUBLE), DOUBLE) RETURNS DOUBLE
LANGUAGE JAVASCRIPT AS '
    function count_less_than(dataArray, threshold) {
        return Array.prototype.reduce.call(dataArray, function (counter, currentValue) {
            if (currentValue < threshold) {
                counter++;
            }
            return counter;
        }, 0);
    }';

-- count of occurrences per day/sensor_id on which the battery level was below 10%
SELECT time_bucket,
       sensor_id,
       SUM(count_less_than(battery_level, 10))
FROM sensor_readings_historic
GROUP BY 1, 2
ORDER BY 1, 2;
2 Likes