When dealing with time-series data, performance is crucial. Data that gets ingested should be available quickly for querying. To enable fast decision-making, analytical queries often need to return results in a fraction of a second.
As data ages, it is typically involved less often in real-time feedback loops, making query performance less of a concern. Instead, storage cost becomes more important as data accumulates. CrateDB is already very efficient in terms of storage usage (30 - 40% less than other time-series databases for equivalent schemas). In addition, CrateDB allows leveraging arrays to significantly reduce storage, which is described in this article.
Let’s look at a typical table design where one row is stored per second and
CREATE TABLE sensor_readings ( time TIMESTAMP WITH TIME ZONE NOT NULL, sensor_id TEXT NOT NULL, battery_level DOUBLE PRECISION, battery_status TEXT, battery_temperature DOUBLE PRECISION );
For 120 million records, the table has a size of 6.3 GB. As CrateDB indexes all columns by default, analytical queries like the one below finish in less than a second:
-- count of occurrences per day/sensor_id on which the battery level was below 10% SELECT DATE_TRUNC('day', time) AS day, sensor_id, COUNT(*) FROM sensor_readings WHERE battery_level < 10 GROUP BY 1, 2 ORDER BY 1, 2;
Now we create a second table for historic data, optimizing storage consumption:
CREATE TABLE sensor_readings_historic ( time_bucket TIMESTAMP WITH TIME ZONE NOT NOLL, time ARRAY(TIMESTAMP WITH TIME ZONE) INDEX OFF, sensor_id TEXT NOT NULL, battery_level ARRAY(DOUBLE PRECISION) INDEX OFF, battery_status ARRAY(TEXT) INDEX OFF, battery_temperature ARRAY(DOUBLE PRECISION) INDEX OFF ) WITH (codec = 'best_compression');
The key changes are:
timeis now modeled as an array of timestamps.
time_bucketis a truncated timestamp on day-level for easier querying. It allows selecting all rows for a particular day without having to inspect all array values.
- The metrics have also become arrays (
- The compression was changed to the more aggressive
best_compressionat the cost of slightly slower lookups
- Indexes have been turned off for array columns
To maintain a good balance between storage efficiency and query performance, we limit the array size to 2880 elements when populating the table. If there are more values for a
time_bucket, an additional row is inserted.
The size of the table is now 1.1 GB, which is a reduction of more than 80%.
To copy data from
sensor_readings_historic, you can use a query like this:
-- copying data into historic table INSERT INTO sensor_readings_historic SELECT DATE_TRUNC('day', time) AS time_bucket, ARRAY_AGG(time), sensor_id, ARRAY_AGG(battery_level) AS battery_level, ARRAY_AGG(battery_status) AS battery_status, ARRAY_AGG(battery_temperature) AS battery_temperature FROM sensor_readings -- filtering for the previous month WHERE DATE_TRUNC('month', time) = DATE_TRUNC('month', NOW()) - '1 month'::INTERVAL; -- deleting data from the previous table DELETE FROM sensor_readings WHERE DATE_TRUNC('month', time) = DATE_TRUNC('month', NOW()) - '1 month'::INTERVAL;
Please note that
DELETE statements in CrateDB should always be based on a partition, so that a complete partition can be dropped rather than a set of rows. For simplicity, we excluded partitioning in this article, please see Sharding and Partitioning Guide for Time Series Data.
UNNEST table function, we can transform the table layout of
sensor_readings_historic to match that of
UNION ALL combines both tables, saved as a view:
CREATE VIEW sensor_readings_full_history AS SELECT DATE_TRUNC('day', time) AS time_bucket, time, sensor_id, battery_level, battery_status, battery_temperature FROM sensor_readings UNION ALL SELECT time_bucket, UNNEST(time) AS time, sensor_id, UNNEST(battery_level) AS battery_level, UNNEST(battery_status) AS battery_status, UNNEST(battery_temperature) AS battery_temperature FROM sensor_readings_historic;
It is important to always query
sensor_readings_historic with a
WHERE condition on
time_bucket to allow efficient row filtering. To keep both tables identical, we add a calculated
time_bucket column to
sensor_readings as well.
Instead of unnesting all arrays first and then applying analytical SQL functions, CrateDB also offers scalar functions directly on arrays. CrateDB 4.6.0 adds
ARRAY_AVG as new functions.
-- count days per sensor_id on which the battery level was below 10% SELECT time_bucket, sensor_id, COUNT(*) FROM sensor_readings_historic WHERE ARRAY_MIN(battery_level) < 10 GROUP BY 1, 2;
The example below recreates the first analytical query from above, counting the exact number of occurrences where the battery level was below 10%: