Interpolating missing time-series values


This article requires CrateDB >= 4.7.0. For older CrateDB versions, please refer to a previous version of this article.


When dealing with time-series data, you will likely at some point run into the question of how to deal with incomplete data. This could be caused by a sensor outage, network issues, or any other factors resulting in missing data. Two common interpolation techniques are Last Observation Carried Forward (LOCF) and Next Observation Carried Backward (NOCB), which we will discuss in this article.

Data Model

We will use a typical sensor data table, capturing several metrics per timestamp:

CREATE TABLE sensor_readings (
   time TIMESTAMP WITH TIME ZONE NOT NULL,
   battery_level DOUBLE PRECISION,
   battery_status TEXT,
   battery_temperature DOUBLE PRECISION
);

In our example, one record is expected per second and the table contains the following rows:

As you can see, entries are missing for 10:02 - 10:04 as well as 10:06 - 10:09.

The basis of the following interpolation is a query utilizing GENERATE_SERIES to generate a continuous time axis with a sample rate of one minute and a join to include existing values:

SELECT s."time",
       battery_level,
       battery_status,
       battery_temperature
FROM GENERATE_SERIES(
  '2022-02-09 10:00:00',
  '2022-02-09 10:10:00',
  '1 minute'::INTERVAL
) s("time")
LEFT JOIN doc.sensor_readings r USING("time");

As expected, it shows the mentioned gaps:

Now, let’s implement our two mentioned interpolation methods to fill the gaps.

Last Observation Carried Forward

One strategy to fill those two gaps is to take the last observed (non-null) value and carry it forward.
Referencing a value from previous rows can be achieved with the LAG function. However, LAG by default looks back with an offset of one, meaning the directly preceding row. In cases like ours, where there are gaps of multiple rows, but the size of the gap is not exactly known, working with an offset will not work out.

To solve this, CrateDB 4.7 adds the IGNORE NULLS option. If set, LAG will look back (within the result set) until the last non-null value is found, independent of the exact offset. In a simplified version for only one metric, the corresponding query looks like below:

SELECT s."time",
       battery_level,
       LAG(battery_level) IGNORE NULLS OVER w AS previous_battery_level
FROM GENERATE_SERIES(
  '2022-02-09 10:00:00',
  '2022-02-09 10:10:00',
  '1 minute'::INTERVAL
) s("time")
LEFT JOIN doc.sensor_readings r USING("time")
WINDOW w AS (ORDER BY s."time")
ORDER BY 1;

Combining this with a COALESCE to use either the actual metric (if present) or the one carried forward, this is the final query and result set:

SELECT s."time",
       COALESCE(battery_level, LAG(battery_level) IGNORE NULLS OVER w) AS battery_level,
       COALESCE(battery_status, LAG(battery_status) IGNORE NULLS OVER w) AS battery_status,
       COALESCE(battery_temperature, LAG(battery_temperature) IGNORE NULLS OVER w) AS battery_temperature
FROM GENERATE_SERIES(
  '2022-02-09 10:00:00',
  '2022-02-09 10:10:00',
  '1 minute'::INTERVAL
) s("time")
LEFT JOIN doc.sensor_readings r USING("time")
WINDOW w AS (ORDER BY s."time")
ORDER BY 1;

We can see that all gaps have been filled, independent of the size of the gap.

Next Observation Carried Backward

With this strategy, we look forward to the next non-value and carry that one backward to fill gaps. Instead of the LAG window function, we now use LEAD. Otherwise, the query remains identical:

SELECT s."time",
       COALESCE(battery_level, LEAD(battery_level) IGNORE NULLS OVER w) AS battery_level,
       COALESCE(battery_status, LEAD(battery_status) IGNORE NULLS OVER w) AS battery_status,
       COALESCE(battery_temperature, LEAD(battery_temperature) IGNORE NULLS OVER w) AS battery_temperature
FROM GENERATE_SERIES(
  '2022-02-09 10:00:00',
  '2022-02-09 10:10:00',
  '1 minute'::INTERVAL
) s("time")
LEFT JOIN doc.sensor_readings r USING("time")
WINDOW w AS (ORDER BY s."time")
ORDER BY 1;

Conclusion

Starting from CrateDB 4.7.0, the newly added IGNORE NULLS parameter enables the simple implementation of LOCB and NOCF. IGNORE NULLS is also available for the window functions FIRST_VALUE, LAST_VALUE, and NTH_VALUE.

4 Likes