Evaluating disk usage of a specific column

I am looking for a best practice for evaluating CrateDB's disk usage as it pertains to a single column.

Our system receives various messages that are stored in columns of type OBJECT (JSON).
Given the range of test messages we have already received, it makes sense in our context to estimate how we would provision a larger production system by looking at selected columns in selected tables and extrapolating.
An idealized way to get this information would be a query to a system virtual table where I could filter for a specific column and return the amount of bytes consumed.

What I have found so far:
sys.shards – Great information and very close to what I am looking for, but it does not seem to offer granularity down to a single column.
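
For example, a query along these lines (messages is just a placeholder table name) gives me the total bytes per table, but not per column:

SELECT table_name, sum(size) AS size_bytes
FROM sys.shards
WHERE table_name = 'messages'
GROUP BY table_name;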

pg_stats – This exceeds the granularity I need, actually breaking the message column down into the sub-columns produced by each of the JSON fields; however, it only provides averages for each of these columns, not totals.
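
For instance, something like this (again with a placeholder table name, and assuming pg_stats exposes the usual avg_width per attribute) returns per-field averages rather than totals:

SELECT attname, avg_width
FROM pg_catalog.pg_stats
WHERE tablename = 'messages'
ORDER BY attname;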

bit_length – More specifically, casting each object in the target column to a string and summing the bit_length values.
This method has the benefit of targeting exactly what I am looking for, but I fear it is a somewhat naïve approach: the cast introduces inaccuracies, and it measures the character count of the object rather than the size on disk after compression.
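
Roughly what I have in mind (payload and messages are placeholder names; dividing by 8 converts bits to bytes):

SELECT sum(bit_length(payload::string)) / 8 AS approx_bytes
FROM messages;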

My next effort would be to create a new single-column table, migrate all the data from the column in question into it, and evaluate the sum of the sizes of the shards/partitions filtered by this table name.
But before doing this, I felt that perhaps I am missing something much easier.


Hi @Bryan_Ford,

An idealized way to get this information would be a query to a system virtual table where I could filter for a specific column and return the amount of bytes consumed.

I’m sorry, but CrateDB does not provide this option.

You could run the following query to get the size in bytes for a specific object. Note that this method does not take compression into account.

SELECT octet_length(_raw::object['<object_name>']) FROM <table> WHERE <clauses> LIMIT 100;

By applying aggregate functions (avg, max, etc.) you can extract additional information such as the average or maximum object size. Make sure to restrict the number of records evaluated, though, as these queries are rather slow.
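
For example, keeping the placeholders from the query above and limiting the sample to an arbitrary 10000 records:

SELECT avg(octet_length(obj)) AS avg_bytes,
       max(octet_length(obj)) AS max_bytes
FROM (
    SELECT _raw::object['<object_name>'] AS obj
    FROM <table>
    WHERE <clauses>
    LIMIT 10000
) t;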

The alternative would be, as you suggested, to copy the relevant object into a new table and evaluate the size based on that table. I have used exactly this approach myself in the past when tasked with estimating future disk size requirements.
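
As a rough sketch (all table and column names are placeholders):

-- copy only the column of interest into its own table
CREATE TABLE payload_only (payload OBJECT);
INSERT INTO payload_only (payload) SELECT payload FROM messages;
-- make the new data visible, then check its size on disk
REFRESH TABLE payload_only;
SELECT sum(size) AS size_bytes FROM sys.shards WHERE table_name = 'payload_only';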
