Optimizing storage for historic time-series data

When dealing with time-series data, performance is crucial. Data that gets ingested should be available quickly for querying. To enable fast decision-making, analytical queries often need to return results in a fraction of a second.

As data ages, it is typically involved less often in real-time feedback loops, making query performance less of a concern. Instead, storage cost becomes more important as data accumulates. CrateDB is already very efficient in terms of storage usage (30–40% less than other time-series databases for equivalent schemas). In addition, CrateDB allows leveraging arrays to reduce storage significantly, as described in this article.

Initial table design

Let’s look at a typical table design where one row is stored per second and per sensor_id:

CREATE TABLE sensor_readings (
   time TIMESTAMP WITH TIME ZONE NOT NULL,
   sensor_id TEXT NOT NULL,
   battery_level DOUBLE PRECISION,
   battery_status TEXT,
   battery_temperature DOUBLE PRECISION
);

For 120 million records, the table has a size of 6.3 GB. As CrateDB indexes all columns by default, analytical queries like the one below finish in less than a second:

-- count of occurrences per day/sensor_id on which the battery level was below 10%
SELECT DATE_TRUNC('day', time) AS day,
       sensor_id,
       COUNT(*)
FROM sensor_readings
WHERE battery_level < 10
GROUP BY 1, 2
ORDER BY 1, 2;

Using arrays for time-based bucketing

Now we create a second table for historic data, optimizing storage consumption:

CREATE TABLE sensor_readings_historic (
   time_bucket TIMESTAMP WITH TIME ZONE NOT NULL,
   time ARRAY(TIMESTAMP WITH TIME ZONE) INDEX OFF,
   sensor_id TEXT NOT NULL,
   battery_level ARRAY(DOUBLE PRECISION) INDEX OFF,
   battery_status ARRAY(TEXT) INDEX OFF,
   battery_temperature ARRAY(DOUBLE PRECISION) INDEX OFF
)
WITH (codec = 'best_compression');

The key changes are:

  • time is now modeled as an array of timestamps. time_bucket is the timestamp truncated to day level for easier querying: it allows selecting all rows for a particular day without having to inspect the array values.
  • The metric columns (battery_*) have also become arrays.
  • The compression codec was changed to the more aggressive best_compression, at the cost of slightly slower lookups.
  • Indexing has been turned off for the array columns.

To maintain a good balance between storage efficiency and query performance, we limit the array size to 2880 elements when populating the table. If there are more values for a time_bucket, an additional row is inserted.

The size of the table is now 1.1 GB, which is a reduction of more than 80%.

Copying data into the historical table

To copy data from sensor_readings to sensor_readings_historic, you can use a query like this:

-- copying data into historic table
INSERT INTO sensor_readings_historic
SELECT DATE_TRUNC('day', time) AS time_bucket,
       ARRAY_AGG(time),
       sensor_id,
       ARRAY_AGG(battery_level) AS battery_level,
       ARRAY_AGG(battery_status) AS battery_status,
       ARRAY_AGG(battery_temperature) AS battery_temperature
FROM sensor_readings
-- filtering for the previous month
WHERE DATE_TRUNC('month', time) = DATE_TRUNC('month', NOW()) - '1 month'::INTERVAL
GROUP BY 1, 3;

-- deleting data from the previous table
DELETE FROM sensor_readings WHERE DATE_TRUNC('month', time) = DATE_TRUNC('month', NOW()) - '1 month'::INTERVAL;
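
The INSERT above aggregates all values of a day into a single array. One possible way to enforce the 2880-element cap mentioned earlier is sketched below (not part of the original article; it assumes window-function support, available since CrateDB 4.0): readings are numbered per day and sensor, and each block of at most 2880 readings is aggregated into its own row.

-- sketch: cap arrays at 2880 elements per row while populating the table
INSERT INTO sensor_readings_historic
SELECT DATE_TRUNC('day', time) AS time_bucket,
       ARRAY_AGG(time),
       sensor_id,
       ARRAY_AGG(battery_level) AS battery_level,
       ARRAY_AGG(battery_status) AS battery_status,
       ARRAY_AGG(battery_temperature) AS battery_temperature
FROM (
    SELECT time,
           sensor_id,
           battery_level,
           battery_status,
           battery_temperature,
           -- position of the reading within its day/sensor_id bucket
           ROW_NUMBER() OVER (PARTITION BY DATE_TRUNC('day', time), sensor_id ORDER BY time) AS rn
    FROM sensor_readings
    WHERE DATE_TRUNC('month', time) = DATE_TRUNC('month', NOW()) - '1 month'::INTERVAL
) readings
-- every block of 2880 readings per day/sensor_id becomes its own row
GROUP BY DATE_TRUNC('day', time), sensor_id, FLOOR((rn - 1) / 2880.0);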

Please note that DELETE statements in CrateDB should always be based on a partition, so that a complete partition can be dropped rather than a set of rows. For simplicity, partitioning has been excluded from this article; please see the Sharding and Partitioning Guide for Time Series Data.
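
To illustrate that note: if the raw table were partitioned by a generated month column (called ts_month here, purely as an assumed example), the monthly cleanup would match complete partitions and CrateDB could drop them as a whole. This is a sketch, not part of the original article:

-- hypothetical partitioned variant of the raw table
CREATE TABLE sensor_readings_partitioned (
   time TIMESTAMP WITH TIME ZONE NOT NULL,
   ts_month TIMESTAMP WITH TIME ZONE GENERATED ALWAYS AS DATE_TRUNC('month', time),
   sensor_id TEXT NOT NULL,
   battery_level DOUBLE PRECISION,
   battery_status TEXT,
   battery_temperature DOUBLE PRECISION
) PARTITIONED BY (ts_month);

-- deleting by the partition column drops whole partitions instead of single rows
DELETE FROM sensor_readings_partitioned
WHERE ts_month = DATE_TRUNC('month', NOW()) - '1 month'::INTERVAL;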

Querying

Using the UNNEST table function, we can transform the table layout of sensor_readings_historic to match that of sensor_readings. A UNION ALL combines both tables, and the result is saved as a view:

CREATE VIEW sensor_readings_full_history AS
SELECT DATE_TRUNC('day', time) AS time_bucket,
       time,
       sensor_id,
       battery_level,
       battery_status,
       battery_temperature
FROM sensor_readings
UNION ALL
SELECT time_bucket,
       UNNEST(time) AS time,
       sensor_id,
       UNNEST(battery_level) AS battery_level,
       UNNEST(battery_status) AS battery_status,
       UNNEST(battery_temperature) AS battery_temperature
FROM sensor_readings_historic;

It is important to always query sensor_readings_historic with a WHERE condition on time_bucket to allow efficient row filtering. To keep both tables identical, we add a calculated time_bucket column to sensor_readings as well.
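
For example, a lookup of a single day might look like the following sketch (the date is an arbitrary placeholder): the condition on time_bucket lets CrateDB narrow down the rows before any array values are unnested.

-- restrict on time_bucket first, then unnest only the matching arrays
SELECT time_bucket,
       sensor_id,
       UNNEST(time) AS time,
       UNNEST(battery_level) AS battery_level
FROM sensor_readings_historic
WHERE time_bucket = '2021-06-15';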

Instead of unnesting all arrays first and then applying analytical SQL functions, CrateDB also offers scalar functions that operate directly on arrays. CrateDB 4.6.0 adds ARRAY_SUM, ARRAY_MIN, ARRAY_MAX, and ARRAY_AVG as new functions.

-- count days per sensor_id on which the battery level was below 10%
SELECT time_bucket,
       sensor_id,
       COUNT(*)
FROM sensor_readings_historic
WHERE ARRAY_MIN(battery_level) < 10
GROUP BY 1, 2;
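
As a further sketch (not from the original article), the other new functions can be combined in the same way. The query below computes the average battery temperature per day and sensor directly on the arrays; dividing the summed values by the summed array lengths keeps the result exact even when a day is split across several rows.

-- average battery temperature per day/sensor_id, computed on the arrays
SELECT time_bucket,
       sensor_id,
       SUM(ARRAY_SUM(battery_temperature)) / SUM(ARRAY_LENGTH(battery_temperature, 1)) AS avg_battery_temperature
FROM sensor_readings_historic
GROUP BY 1, 2
ORDER BY 1, 2;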

Using user-defined functions

A more flexible alternative to predefined scalar functions is user-defined functions. They allow processing arrays via JavaScript for more sophisticated aggregations.

The example below recreates the first analytical query from above, counting the exact number of occurrences where the battery level was below 10%:

CREATE OR REPLACE FUNCTION count_less_than(ARRAY(DOUBLE), DOUBLE) RETURNS DOUBLE
LANGUAGE JAVASCRIPT AS '
    function count_less_than(dataArray, threshold) {
        return Array.prototype.reduce.call(dataArray, function (counter, currentValue) {
            if (currentValue < threshold) {
                counter++;
            }
            return counter;
        }, 0);
    }';

-- count of occurrences per day/sensor_id on which the battery level was below 10%
SELECT time_bucket,
       sensor_id,
       SUM(count_less_than(battery_level, 10))
FROM sensor_readings_historic
GROUP BY 1, 2
ORDER BY 1, 2;

A few questions:

  1. Can codec be set at partition level?
  2. Can codec be changed once my data becomes old? E.g., could I keep my cold-data partitions with codec = 'best_compression' instead of the default to save on disk space?
  3. Is time_bucket in the above case an alternative to creating partitions?

Hi @vinayak.shukre,

You can run ALTER TABLE table PARTITION (partition_key = partition_value) SET (codec = 'best_compression'); on an already existing partition. However, you will need to close the partition before doing that. While a partition is closed, its rows won’t be included in the output of any SELECT queries:

Example:

ALTER TABLE table PARTITION (ts_month = 1648771200000) CLOSE;
ALTER TABLE table PARTITION (ts_month = 1648771200000) SET (codec = 'best_compression');
ALTER TABLE table PARTITION (ts_month = 1648771200000) OPEN;

Whether time_bucket in the examples above qualifies as a replacement for partitioning depends on the data volume. As per Sharding and Partitioning Guide for Time Series Data, if the size of a shard exceeds 70 GB, it is recommended to use partitioning. Therefore, it can still make sense to use partitioning to avoid degrading index performance. For tables where performance plays a less critical role, you might choose to go for a higher shard size, like 100 GB instead of 70 GB.
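
As a rough sketch (the table name is just an example), you can check the current shard sizes via the sys.shards table:

-- list shard sizes in GB, largest first
SELECT table_name,
       id AS shard_id,
       size / 1024.0 / 1024.0 / 1024.0 AS size_gb
FROM sys.shards
WHERE table_name = 'sensor_readings_historic'
ORDER BY size DESC;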

Thanks @hammerhead. I just ran the above commands on one of my sample data partitions. It was 2.2 TiB before. I am waiting for it to come back alive. All shards have been in an unassigned state since the three commands were run.

@hammerhead I tried the above scenario on a partition with 2.2 TiB of data. After the three commands were run with a gap of one minute each, my shards went into an unassigned state and did not recover for a long time. I then restarted the cluster; my table, including this partition, recovered, but its size is still 2.2 TiB.

Then I ran it on a single-node cluster where the partition is just 35 MB and holds a few hundred thousand rows. No error was seen in the log, but the size remained at 35 MB.

I then left it for a few days and tried again today on a different cluster with smaller partitions of around 1 MB; again, there is no error but also no effect.

Not sure what I am missing.

Hi @vinayak.shukre,

I tried the workflow on a bigger partition and also don’t see the expected size reduction.
I opened an issue for our developers to provide feedback on this matter: Changing `codec` on existing partitions appears to have no effect · Issue #12629 · crate/crate · GitHub.

Hi @vinayak.shukre,

The outcome of the above-linked discussion with our developers is that changing codec only takes effect for newly written segments (i.e. newly inserted data). Hence, we didn’t see any change for already existing partitions.
To force a rewrite of existing segments, please run OPTIMIZE TABLE table PARTITION (partition_key = partition_key_value) WITH (max_num_segments = 1);.
The documentation is going to be updated shortly to include this information as well.
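
Putting it together with the example partition from my earlier reply, the full sequence would be:

ALTER TABLE table PARTITION (ts_month = 1648771200000) CLOSE;
ALTER TABLE table PARTITION (ts_month = 1648771200000) SET (codec = 'best_compression');
ALTER TABLE table PARTITION (ts_month = 1648771200000) OPEN;
-- force a rewrite of existing segments so the new codec takes effect
OPTIMIZE TABLE table PARTITION (ts_month = 1648771200000) WITH (max_num_segments = 1);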


Thanks @hammerhead for following up on this for me. Is there any way I can see the codec set for each partition? For the table itself, I can see it in the SHOW CREATE TABLE tablename output.

SELECT table_name, table_schema, "values", partition_ident, settings['codec']
FROM information_schema.table_partitions;

This worked fine on a smaller partition: a 33.9 MiB partition came down to 21 MiB.
The 2 TB partition has not recovered after closing and opening; 15 hours later it is still in a critical state.

Looks like this feature is not very usable on really big data. :pensive:

This might be a different problem after all. Even clusters with 100 TiB should recover in minutes rather than hours, and OPTIMIZE is a task that runs in the background.
