Another Indexes question

Hello all,

This is a follow-on question to my original question about indexing.

Is it possible to store REALs and DOUBLE PRECISIONs as fixed-length chars, so that the indexes could be removed, and then typecast them back in SQL for any calculations?

I know I could use an OBJECT(IGNORED) to remove indexes for the numeric fields, but it's a bit more work programmatically to change what we already have, if I can just use typecasting instead.

Many thanks
David.

You could store them as TEXT and disable both the index and the column store. This is in many ways very similar to OBJECT(IGNORED):

num_field TEXT INDEX OFF STORAGE WITH (columnstore = false)

In some cases, implicit casts might be enough:

cr> CREATE TABLE ta01 (num_field TEXT INDEX OFF STORAGE WITH (columnstore = false));                    
CREATE OK, 1 row affected  (0.475 sec)

cr> INSERT INTO ta01 VALUES (1.2);                                                                      
INSERT OK, 1 row affected  (0.101 sec)

cr> SELECT num_field + 1.0 imp FROM ta01;                                                                                                            
+-----+
| imp |
+-----+
| 2.2 |
+-----+


However, indexes can also be removed from numeric fields:

double_field DOUBLE INDEX OFF,
int_field INT INDEX OFF
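
For instance, a complete statement might look like this (a sketch; the table name is just for illustration). With only the index disabled, the column store stays in place, so aggregations need no casting:

```sql
-- Hypothetical example: numeric columns with indexing disabled.
CREATE TABLE ta02 (
  double_field DOUBLE INDEX OFF,
  int_field INT INDEX OFF
);

-- Aggregations work directly on the numeric columns:
SELECT avg(double_field) FROM ta02;
```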

I see - I misread the part about the column store and thought it was indexes that could not be removed from numeric fields.
Does having the column store enabled for a field also take up space?
(In which case I might go with the TEXT suggestion above.)

Many thanks.

Hi David!

In addition to Georg’s comment, if you compare the 4 tables below:

CREATE TABLE test_text (
  num_char TEXT INDEX OFF STORAGE WITH (columnstore = FALSE)
);  
CREATE TABLE test_text_col (
  num_char TEXT INDEX OFF
); 
CREATE TABLE test_real_off (
  num_real REAL INDEX OFF
); 
CREATE TABLE test_real (
  num_real REAL
); 

With each table loaded with 1 million records, the storage is as follows:

test_text = 23.6MiB
test_text_col = 25.1MiB
test_real_off = 31.8MiB
test_real = 37.0MiB

It is important to note that in the case of the test_text table, where the field num_char has neither an index nor the column store, you would need to explicitly cast the field if you want to use it in a WHERE clause.
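
For example (a sketch based on the test_text schema above):

```sql
-- With index and column store disabled, an explicit cast is
-- needed to compare the TEXT column as a number in WHERE:
SELECT num_char
FROM test_text
WHERE num_char::REAL > 0.5;
```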


Thanks, that’s very helpful!

Just out of interest, was the same data put into each (or at least data of similar size)? The difference between test_text_col and test_real_off is quite surprising.

I’d not be needing the fields in a WHERE clause, but I’d only be performing AVG and WINDOW functions on them.
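
Assuming the TEXT-based schema from above, those queries would then need an explicit cast as well, e.g. (hypothetical sketch):

```sql
-- AVG over a TEXT-stored numeric column:
SELECT avg(num_char::REAL) FROM test_text;

-- Window function, also with a cast:
SELECT num_char::REAL AS val,
       avg(num_char::REAL) OVER () AS overall_avg
FROM test_text;
```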

That’s actually a great question. To insert the data I used cr8, which provides an insert-fake-data function.
As far as I could tell, for the text type the routine inserts one-word texts, while for the number type it generates more random, larger numbers. That can partially explain the big difference there.


Hi @djbestenergy

A little more context with a few more tests (inserting 1 million records into different table schemas):

General note: for size comparisons it is good practice to limit the table to 1 shard and run an OPTIMIZE that fully merges segments.

create table float(num FLOAT) 
create table float_io(num FLOAT INDEX OFF)
create table double(num DOUBLE)
create table double_io(num DOUBLE INDEX OFF)
create table text_float(num TEXT)
create table text_float_io(num TEXT INDEX OFF)
create table text_float_io_co(num TEXT INDEX OFF STORAGE WITH (columnstore = FALSE))
create table text_double(num TEXT)
create table text_double_io(num TEXT INDEX OFF) 
create table text_double_io_co(num TEXT INDEX OFF STORAGE WITH (columnstore = FALSE)) 
test_size.sql
set search_path to 'size_test';

drop table if exists float;
drop table if exists double;
drop table if exists float_io;
drop table if exists double_io;
drop table if exists text_float;
drop table if exists text_float_io;
drop table if exists text_float_io_co;
drop table if exists text_double;
drop table if exists text_double_io;
drop table if exists text_double_io_co;

create table float(num FLOAT) clustered into 1 shards with ("number_of_replicas" = 0);
create table float_io(num FLOAT INDEX OFF) clustered into 1 shards with ("number_of_replicas" = 0);
create table double(num DOUBLE) clustered into 1 shards with ("number_of_replicas" = 0);
create table double_io(num DOUBLE INDEX OFF) clustered into 1 shards with ("number_of_replicas" = 0);
create table text_float(num TEXT) clustered into 1 shards with ("number_of_replicas" = 0);
create table text_float_io(num TEXT INDEX OFF) clustered into 1 shards with ("number_of_replicas" = 0);
create table text_float_io_co(num TEXT INDEX OFF STORAGE WITH (columnstore = FALSE)) clustered into 1 shards with ("number_of_replicas" = 0);
create table text_double(num TEXT) clustered into 1 shards with ("number_of_replicas" = 0);
create table text_double_io(num TEXT INDEX OFF) clustered into 1 shards with ("number_of_replicas" = 0);
create table text_double_io_co(num TEXT INDEX OFF STORAGE WITH (columnstore = FALSE)) clustered into 1 shards with ("number_of_replicas" = 0);

insert into double
select random() from generate_series(1,1000000,1);
refresh table double;

insert into float select num::FLOAT from double;
insert into float_io select num::FLOAT from double;
insert into double_io select * from double;
insert into text_float select num::FLOAT from double;
insert into text_float_io select num::FLOAT from double;
insert into text_float_io_co select num::FLOAT from double;
insert into text_double select num from double;
insert into text_double_io select num from double;
insert into text_double_io_co select num from double;

refresh table float;
refresh table float_io;
refresh table double;
refresh table double_io;
refresh table text_float;
refresh table text_float_io;
refresh table text_float_io_co;
refresh table text_double;
refresh table text_double_io;
refresh table text_double_io_co;

optimize table float WITH (max_num_segments = 1);
optimize table float_io WITH (max_num_segments = 1);
optimize table double WITH (max_num_segments = 1);
optimize table double_io WITH (max_num_segments = 1);
optimize table text_float WITH (max_num_segments = 1);
optimize table text_float_io WITH (max_num_segments = 1);
optimize table text_float_io_co WITH (max_num_segments = 1);
optimize table text_double WITH (max_num_segments = 1);
optimize table text_double_io WITH (max_num_segments = 1);
optimize table text_double_io_co WITH (max_num_segments = 1);

select table_name, round(size / POWER(1024,2))::TEXT || ' MiB' as size
from sys.segments where primary and table_schema = 'size_test' order by 1;

This can be run locally by putting it in a file and executing it with crash:

crash < test_size.sql

Results:

+-------------------+--------+
| table_name        | size   |
+-------------------+--------+
| double            | 49 MiB |
| double_io         | 37 MiB |
| text_double       | 59 MiB |
| text_double_io    | 46 MiB |
| text_double_io_co | 31 MiB |
+-------------------+--------+
| float             | 31 MiB |
| float_io          | 26 MiB |
| text_float        | 37 MiB |
| text_float_io     | 29 MiB |
| text_float_io_co  | 23 MiB |
+-------------------+--------+

-- io -> index off
-- co -> column store off

i.e. the storage savings from storing floating-point values as text are only ~15%, but come with a significant performance impact, especially for aggregations:

echo "SELECT avg(num), count(num) from size_test.float;" | cr8 timeit --hosts localhost:4200
Runtime (in ms):
    mean:    30.928 ± 0.148
    min/max: 30.006 → 34.941

% echo "SELECT avg(num), count(num) from size_test.float_io;" | cr8 timeit --hosts localhost:4200
Runtime (in ms):
    mean:    31.006 ± 0.150
    min/max: 29.812 → 33.890

% echo "SELECT avg(num::float), count(num) from size_test.text_float;" | cr8 timeit --hosts localhost:4200
Runtime (in ms):
    mean:    526.880 ± 8.488
    min/max: 475.093 → 715.010

% echo "SELECT avg(num::float), count(num) from size_test.text_float_io;" | cr8 timeit --hosts localhost:4200
Runtime (in ms):
    mean:    554.985 ± 12.245
    min/max: 474.457 → 789.776

% echo "SELECT avg(num::float), count(num) from size_test.text_float_io_co;" | cr8 timeit --hosts localhost:4200;
Runtime (in ms):
    mean:    9963.003 ± 101.061
    min/max: 9122.198 → 11976.021

float_io vs text_float_io_co
→ ~321× slower


That being said, this test was done with completely random data. Encoding/compression might work differently with real-world data, which typically shows less randomness.


Thanks @proddata that is really useful and shows how important the type is.

One last addition:

  • changing the table compression to best_compression

i.e.

create table float(num FLOAT)
clustered into 1 shards
with ("number_of_replicas" = 0, "codec" = 'best_compression');
+-------------------+--------+
| table_name        | size   |
+-------------------+--------+
| double            | 40 MiB |
| double_io         | 29 MiB |
| text_double       | 50 MiB |
| text_double_io    | 37 MiB |
| text_double_io_co | 23 MiB |
+-------------------+--------+
| float             | 25 MiB |
| float_io          | 21 MiB |
| text_float        | 31 MiB |
| text_float_io     | 24 MiB |
| text_float_io_co  | 18 MiB |
+-------------------+--------+
-- differences in time to the above results due to different system setups

% echo "SELECT avg(num), count(num) from size_test.float;" | cr8 timeit --hosts localhost:4200
Runtime (in ms):
    mean:    19.131 ± 0.121
    min/max: 18.501 → 23.190

% echo "SELECT avg(num), count(num) from size_test.float_io;" | cr8 timeit --hosts localhost:4200
Runtime (in ms):
    mean:    19.348 ± 0.165
    min/max: 18.544 → 24.255

% echo "SELECT avg(num::float), count(num) from size_test.text_float;" | cr8 timeit --hosts localhost:4200
Runtime (in ms):
    mean:    460.077 ± 1.895
    min/max: 449.610 → 507.642

 % echo "SELECT avg(num::float), count(num) from size_test.text_float_io;" | cr8 timeit --hosts localhost:4200
Runtime (in ms):
    mean:    459.306 ± 1.153
    min/max: 451.709 → 485.768

% echo "SELECT avg(num::float), count(num) from size_test.text_float_io_co;" | cr8 timeit --hosts localhost:4200
Runtime (in ms):
    mean:    32687.997 ± 100.736
    min/max: 31582.404 → 34792.086

Setting the table compression should not have an impact on aggregation performance, but it makes table scans (which is essentially what happens with the column store deactivated) quite a bit more expensive. best_compression also has an impact on ingest speed.