Trimming data from Prometheus

Hi,

We’re using CrateDB as a remote storage location for Prometheus via the CrateDB Prometheus Adapter. This seems to be working well, but I’d like to reduce the amount of data stored so that, for example, any data older than 14 days is reduced to hourly resolution (I’ve already got a cron job in place that deletes data older than 30 days via the partitions). The team who installed our system wrote the following script (it trims anything older than 14 days down to hourly data), but I was wondering if there happened to be a more resource-efficient way of doing this.

# Loop over each day older than the most recent 14; crash prints an ASCII
# table, so keep only the '|' rows and take the second column
for i in `$CRASHBIN -c "select distinct (day__generated) from metrics order by day__generated;" | grep '^|' | awk '{print $2}' | egrep -v 'day__generated' | head -n -14`
do
  for h in {0..23}
  do
    # Collect the distinct timestamps from the first 10 minutes of hour $h
    $CRASHBIN -c "select distinct timestamp from metrics where (timestamp between ($i + ((86400000 / 24)*$h)) and ($i + (((86400000 / 24)*$h)+599999)));" | grep '^|' | awk '{print $2}' | grep -v '^timestamp' | grep -v '^$' | paste -s -d, - >> timestamp_$i
  done
  # timestamp output munging: drop blank lines and trailing commas,
  # join all lines with commas, then single-quote every value
  sed -i '/^$/d' timestamp_$i
  sed -i 's/,$//g' timestamp_$i
  sed -i ':a;N;$!ba;s/\n/,/g' timestamp_$i
  sed -i "s/,/','/g" timestamp_$i
  KEEPSET=`cat timestamp_$i`
  # Delete everything for day $i that is not in the keep set
  echo "delete from metrics where day__generated = $i and timestamp not in ( '$KEEPSET' );" > curate_$i
  $CRASHBIN < curate_$i
done
# Clean up past retention period
$CRASHBIN -c "delete from metrics where day__generated < CURRENT_TIMESTAMP - INTERVAL '$RETAIN' DAY;" && $CRASHBIN -c "OPTIMIZE table metrics;"

We’re running CrateDB version 4.3.4, which I’ll be updating to 4.6.7 once it’s released.

Thanks for your help!


Hi @dithwick,

As an initial step in your position, I would verify that there actually is a performance bottleneck or any other requirement to improve this piece of code. If performance is good enough and the code works as it should, maybe there is no need to change anything :grinning:

Assuming there is a need to improve the performance of this code, I would initially look at the delete from metrics where day__generated = $i and timestamp not in (...) query. Deleting individual records from a table is a rather expensive operation compared to deleting full partitions.

I would first check how many records are affected by the delete operation (i.e. how many records you are trimming).
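For example (a minimal sketch reusing the $i and $KEEPSET values your script already builds), you could run the delete as a count first:

-- Sketch: estimate the trim volume for one day before deleting;
-- $i and $KEEPSET are the values built by the script above
select count(*) from metrics
where day__generated = $i and timestamp not in ( '$KEEPSET' );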

My next step would be to do some benchmarking on the delete from metrics where day__generated = $i and timestamp not in ... query and see if there are better-performing alternatives. Something that comes to mind is to copy the hourly data into a metrics_trimmed table and then drop the whole metrics partition. This could potentially be more performant than trimming the partition in the metrics table; a rough sketch follows.
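As a sketch of that idea, assuming a metrics_trimmed table with the same schema as metrics and using an example day value:

-- Copy the rows to keep for one day into the trimmed table
-- (metrics_trimmed and the day value '2021-01-01' are assumptions):
insert into metrics_trimmed (
  select * from metrics
  where day__generated = '2021-01-01'
    and extract(minute from timestamp) < 10
);
-- A delete matching an entire partition is executed as a cheap
-- partition drop rather than a row-by-row delete:
delete from metrics where day__generated = '2021-01-01';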

Some additional pointers:

  • Is your metrics table partitioned by the day__generated column? This would help the delete query past your retention period perform well, since it can drop whole partitions.
  • Is it intended to only select timestamps from the first 10 minutes of every hour? select date_trunc('day', current_timestamp) + ((86400000 / 24)*1) as start, date_trunc('day', current_timestamp) + (((86400000 / 24)*1)+599999) as stop limit 100; returns start: 01:00:00 and stop: 01:09:59.999
  • I would suggest using $i + '$h hour'::INTERVAL for your between-timestamp calculation. IMO it’s easier to read (e.g. where timestamp >= $i + '1 hour'::INTERVAL and timestamp < $i + '2 hour'::INTERVAL); see the sketch after this list.
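To illustrate the last point, the inner hourly query could look like this (a sketch; $i and $h are the day and hour variables from your script, and $i may need a cast from the raw epoch value):

-- Same first-10-minutes window, expressed with interval arithmetic:
select distinct timestamp from metrics
where timestamp >= $i::timestamp + '$h hours'::interval
  and timestamp < $i::timestamp + '$h hours'::interval + '10 minutes'::interval;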

Hope this helps. Let me know if you have additional questions.


Hi @jayeff,

Thanks for your helpful response! I think you’re right: maybe it is already performant enough, but I can at least use your advice to make it more readable.

To answer some of your questions:

  • Yes, the metrics table is partitioned by the day__generated column. I’ve got a script in place that uses that mechanism to efficiently remove data past the point I want to keep.
  • I think the idea behind selecting the first 10 minutes of every hour was to ensure that enough of the Prometheus-scraped data was retained to keep it useful. I may of course be able to reduce this.

Thanks again for your help.


I think the idea behind selecting the first 10 minutes of every hour was to ensure that enough of the Prometheus-scraped data was retained to keep it useful. I may of course be able to reduce this.

If this is intended, everything is fine. I thought maybe it was unintentional and the script was expected to keep data for the whole hour :grinning:

Our guide on Sharding and Partitioning for Time Series Data could be interesting for you as well, to help find an efficient sharding strategy and optimise your query performance.


@dithwick Just letting you know: CrateDB 4.6.7 was released today :tada:


Thanks @jayeff (and for the pointer to the Sharding and Partitioning guide). I’ll get right on to updating :slight_smile:
