Evaluating disk usage of a specific column

I am looking for a best practice for evaluating CrateDB's disk usage as it pertains to a single column.

Our system receives various messages that are stored in columns of type OBJECT (JSON).
Given the range of test messages we have already received, it makes sense in our context to estimate how we would provision a larger production system by looking at selected columns in selected tables and extrapolating.
An idealized way to get this information would be a query to a system virtual table where I could filter for a specific column and return the amount of bytes consumed.

What I have found so far:
sys.shards – Great information and very close to what I am looking for, but it does not seem to offer granularity down to a single column.
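
For example, a query along these lines (messages is just a placeholder table name) gives me the total bytes per table, but not per column:

SELECT table_name, sum(size) AS size_bytes
FROM sys.shards
WHERE table_name = 'messages'
GROUP BY table_name;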

pg_stats – This exceeds the granularity I need, actually breaking the message column down into the sub-columns produced by each of the JSON fields; however, it only provides averages for each of these columns, not totals.
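
For instance, something like this (again with a placeholder table name, and assuming pg_stats exposes the usual avg_width per attribute) returns per-field averages rather than totals:

SELECT attname, avg_width
FROM pg_catalog.pg_stats
WHERE tablename = 'messages'
ORDER BY attname;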

bit_length – More specifically, casting each object in the target column to a string and summing the bit_length values.
This method has the benefit of targeting exactly what I am looking for, but I fear it is a somewhat naïve approach: the cast introduces inaccuracies, and it measures the character count of the object rather than the size on disk after compression.
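
Roughly what I have in mind (payload and messages are placeholder names; dividing by 8 converts bits to bytes):

SELECT sum(bit_length(payload::string)) / 8 AS approx_bytes
FROM messages;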

My next effort would be to create a new single-column table, migrate all the data from the column in question into it, and evaluate the sum of the sizes of the shards/partitions filtered by this table name.
But before doing this, I felt that perhaps I am missing something much easier.


Hi @Bryan_Ford,

An idealized way to get this information would be a query to a system virtual table where I could filter for a specific column and return the amount of bytes consumed.

I’m sorry, but CrateDB does not provide this option.

You could run the following query to get the size in bytes for a specific object. Note that this method does not take compression into account.

SELECT octet_length(_raw::object['<object_name>']) FROM <table> WHERE <clauses> LIMIT 100;

By applying aggregate functions (avg, max, etc.) you can extract additional information such as the average or maximum object size. Make sure to restrict the number of records evaluated, though, as these queries are rather slow.
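
For example, keeping the placeholders from the query above and limiting the sample to an arbitrary 10000 records:

SELECT avg(octet_length(obj)) AS avg_bytes,
       max(octet_length(obj)) AS max_bytes
FROM (
    SELECT _raw::object['<object_name>'] AS obj
    FROM <table>
    WHERE <clauses>
    LIMIT 10000
) t;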

The alternative would be, as you suggested, to copy the relevant object into a new table and evaluate the size based on that table. I have used exactly this approach myself in the past when tasked with estimating future disk size requirements.
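
As a rough sketch (all table and column names are placeholders):

-- copy only the column of interest into its own table
CREATE TABLE payload_only (payload OBJECT);
INSERT INTO payload_only (payload) SELECT payload FROM messages;
-- make the new data visible, then check its size on disk
REFRESH TABLE payload_only;
SELECT sum(size) AS size_bytes FROM sys.shards WHERE table_name = 'payload_only';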
