Rows need significantly more storage than others in the same table despite pretty much the same content

I already mentioned this in another topic and then thought it was solved, but the issue is still present, so I have tried to narrow it down further.

I have a table with the following schema:

CREATE TABLE doc.odp_ts (
     entity_id text, -- primary key,
     entity_type text,
     time_index timestamp with time zone,
     source text,
     value text,
     value_num double precision,
     category text,
     location_index text,
     location geo_shape,
     location_centroid geo_point,
     parent text,
     content object(ignored),
     year integer GENERATED ALWAYS AS extract(year from coalesce(time_index,CURRENT_TIMESTAMP))
)
CLUSTERED INTO 10 SHARDS
PARTITIONED BY (year)
;

In location there is a polygon representation of an Uber H3 hexagon, so always only 7 vertices (the closing point repeats the first one).

In content there is a JSON representation of the row's data. It's a little bit more comprehensive, but EVERY (!) row's JSON is nearly the same size; there isn't much variation between them.

In source there is either ‘aaa’ or ‘bbb’… you guessed it: it tells how the rows were imported:

  • The ‘aaa’ rows were imported via COPY FROM..., then cleaned through some ELT queries and INSERTed into this table
  • The ‘bbb’ rows were INSERTed directly via the HTTP endpoint's bulk insert (both paths sketched below)
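Roughly, the two ingestion paths look like this; the staging table name, file URI and column list are placeholders, not the actual ones used:

-- 'aaa' path: bulk load into a staging table, clean it up, then insert into the target table
COPY doc.odp_staging FROM 'file:///imports/aaa/*.json';
-- ...ELT/cleanup queries on doc.odp_staging...
INSERT INTO doc.odp_ts (entity_id, entity_type, time_index, source, value, content, location, location_index)
SELECT entity_id, entity_type, time_index, source, value, content, location, location_index
FROM doc.odp_staging;

-- 'bbb' path: a parameterized INSERT sent through the HTTP endpoint with bulk arguments,
-- e.g. POST /_sql with a body like
-- {"stmt": "INSERT INTO doc.odp_ts (...) VALUES (?, ?, ...)", "bulk_args": [[...], [...]]}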

If I delete all ‘bbb’ rows and keep only the roughly 8.8 million ‘aaa’ rows, the table has a size of 1.5 GB.
If I delete all ‘aaa’ rows and keep only the roughly 9 million ‘bbb’ rows, the table has a size between 30 and 50 GB (I got different outcomes in my experiments).

These sizes have had plenty of time to settle, so I doubt there are any table optimizations, merges and so on left that would bring them further down. The problem is not only size, but also query time whenever the result set touches any of these heavy-weight rows.

I am assuming this has something to do with how the index was built, as the index is the only thing I couldn't compare so far (the rows hold entirely comparable data), and it would also explain why queries are significantly slower on these rows.

Is there any possibility to check/repair/recreate the index? Just copying the rows to a new table hasn't worked so far.

That massive difference seems really strange, as if some segment merge mechanics were not working correctly.

There is not really a differentiation between index and data storage: CrateDB uses Lucene as its storage (and indexing) layer. Reindexing a table can be done using COPY FROM/TO or INSERT INTO ... SELECT.
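For example, a full reindex into a fresh table with the same schema could look roughly like this (the target table name and export path are just examples):

-- variant 1: re-insert into a newly created table with the same schema
INSERT INTO doc.odp_ts_new SELECT * FROM doc.odp_ts;

-- variant 2: export to files and re-import
COPY doc.odp_ts TO DIRECTORY '/tmp/odp_ts_export';
COPY doc.odp_ts_new FROM '/tmp/odp_ts_export/*';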


Could you manually trigger an

OPTIMIZE TABLE doc.odp_ts

or

OPTIMIZE TABLE doc.odp_ts WITH (only_expunge_deletes = true)

after you deleted the respective rows? Be aware that in CrateDB a row is not deleted from a segment, just marked as deleted. During a segment merge, a new segment is created that no longer contains those deleted rows.
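To check whether deleted documents are still sitting around, a query like this against sys.shards gives a first impression of per-shard size and document count (recent versions additionally expose a sys.segments table with per-segment document and deleted-document counts):

SELECT partition_ident, id AS shard_id, num_docs, size
FROM sys.shards
WHERE schema_name = 'doc' AND table_name = 'odp_ts'
ORDER BY size DESC;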


I've inserted those rows into a new table, tried OPTIMIZE (on that particular partition, to be sure) and also with only_expunge_deletes… the Crate console reports “OPTIMIZE ok”, but without any effect. The rows are still humongous and still heavy to query.
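For reference, the partition-scoped OPTIMIZE statements look roughly like this (the year value is just an example); max_num_segments can additionally force merging down to fewer segments per shard:

OPTIMIZE TABLE doc.odp_ts PARTITION (year = 2021);
OPTIMIZE TABLE doc.odp_ts PARTITION (year = 2021) WITH (only_expunge_deletes = true);
OPTIMIZE TABLE doc.odp_ts PARTITION (year = 2021) WITH (max_num_segments = 1);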

Fortunately I have to replace them with another import anyway… I hope the next try via HTTP bulk insert will produce healthy rows.

When you do a

SELECT _raw
  • is there a difference between an ‘aaa’ row and a ‘bbb’ row?
  • how many shards are created for ‘aaa’ and ‘bbb’ entries? Is the data distributed evenly between them? (Queries sketched below.)
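i.e. something along these lines (‘aaa’/‘bbb’ standing in for the actual source values):

-- compare the stored source of one row per import path
SELECT _raw FROM doc.odp_ts WHERE source = 'aaa' LIMIT 1;
SELECT _raw FROM doc.odp_ts WHERE source = 'bbb' LIMIT 1;

-- check how primary shards and documents are spread across the nodes
SELECT node['name'] AS node_name, count(*) AS shards, sum(num_docs) AS docs, sum(size) AS bytes
FROM sys.shards
WHERE schema_name = 'doc' AND table_name = 'odp_ts' AND primary = true
GROUP BY node['name'];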

If you are not directly comparing COPY FROM and bulk inserts, but have a transformation step in between, there should be no difference in the actual storage.

That's the _raw content of a ‘bbb’ row:

{"entity_type":"Device","location_index":"8a1e330060f7fff","time_index":1637703371884,"year":2021,"location":{"coordinates":[[[14.945624835300078,48.647363731740725],[14.945310462375334,48.64672806067648],[14.94602200090132,48.646264939809946],[14.947047914669962,48.64643748649882],[14.947362302843871,48.647073156767455],[14.94665076200018,48.6475362811429],[14.945624835300078,48.647363731740725]]],"type":"Polygon"},"source":"roc/attributeReport","entity_id":"urn:ngsi-ld:Device:roc-smarthome-7894B4F6E302-FLEXP0002-06FC020E8738EC89:photovoltaic-FF88-UNK","value":"59040","content":{"refDeviceModel":{"type":"Text","value":"urn:ngsi-ld:DeviceModel:roc-smarthome-7894B4F6E302-FLEXP0002-06FC020E8738EC89:photovoltaic"},"location_index":{"metadata":{"level":{"type":"Text","value":"10"}},"type":"Text","value":"8a1e330060f7fff"},"controlledProperty":{"type":"Array","value":[""]},"location":{"type":"geo:json","value":{"coordinates":[[[14.945624835300078,48.647363731740725],[14.945310462375334,48.64672806067648],[14.94602200090132,48.646264939809946],[14.947047914669962,48.64643748649882],[14.947362302843871,48.647073156767455],[14.94665076200018,48.6475362811429],[14.945624835300078,48.647363731740725]]],"type":"Polygon"}},"id":"urn:ngsi-ld:Device:roc-smarthome-7894B4F6E302-FLEXP0002-06FC020E8738EC89:photovoltaic-FF88-UNK","source":{"type":"Text","value":"roc/attributeReport"},"category":{"type":"Text","value":"actuator"},"type":"Device","controlledPropertyOgd":{"type":"Text","value":"FF88"},"value":{"metadata":{"timestamp":{"type":"DateTime","value":"2021-11-23T21:36:11.884687Z"}},"type":"Text","value":"59040"}}}

and here's one ‘aaa’ row:

{"entity_type":"Device","location_index":"841e14bffffffff","time_index":1617381900000,"year":2021,"location":{"coordinates":[[[14.415637880797282,48.15307115333914],[14.310556796303484,47.93313302100054],[14.552744019984633,47.773600769774],[14.90031132391351,47.83360089560644],[15.007165667198523,48.053458901041],[14.764687046088037,48.21339918083625],[14.415637880797282,48.15307115333914]]],"type":"Polygon"},"source":"evn/joulie-import","entity_id":"urn:ngsi-ld:Device:evn-joulie:FFFF644214006826:FFFF64002D053109","value":"43.02980000000001","content":{"refDeviceModel":{"type":"Text","value":"urn:ngsi-ld:DeviceModel:evn-joulie:FFFF644214006826:FFFF64002D053109"},"location_index":{"type":"Text","value":"841e14bffffffff","metadata":{"level":{"type":"String","value":"10"}}},"controlledProperty":{"type":"Array","value":["power","energy"]},"location":{"type":"geo:json","value":{"coordinates":[[[14.415637880797282,48.15307115333914],[14.310556796303484,47.93313302100054],[14.552744019984633,47.773600769774],[14.90031132391351,47.83360089560644],[15.007165667198523,48.053458901041],[14.764687046088037,48.21339918083625],[14.415637880797282,48.15307115333914]]],"type":"Polygon"}},"id":"urn:ngsi-ld:Device:evn-joulie:FFFF644214006826:FFFF64002D053109","source":{"type":"Text","value":"evn/joulie-import"},"category":{"type":"Text","value":"meter"},"type":"Device","value":{"type":"Number","value":"43.02980000000001","metadata":{"timestamp":{"type":"DateTime","value":"2021-04-02T16:45:00.000000Z"}}}}}

So, no… there isn't really any difference. The particular table I have them in now has 3 partitions (one for each year 2019, 2020, 2021), each with 10 shards (5 nodes with 2 CPUs each). In the “Shards” view the data looks evenly distributed. Those rows are also located in the same shards (the attribute that distinguishes these rows does not influence partitioning).

Would it maybe be possible to get some example data for this, or a snapshot of a big and a small partition? This sounds really strange.

I've created two new tables, “_bloat” and “_sleek”, with the exact same schema and copied 10,000 of those rows into each of them… the sleek rows were copied in 22 seconds, the bloated ones in 1 min 25 sec (!).
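Roughly, the sampling is just an insert-by-query into two empty tables with the same schema as doc.odp_ts (the table names and source values below are stand-ins for the actual ones):

INSERT INTO doc.odp_ts_sleek SELECT * FROM doc.odp_ts WHERE source = 'aaa' LIMIT 10000;
INSERT INTO doc.odp_ts_bloat SELECT * FROM doc.odp_ts WHERE source = 'bbb' LIMIT 10000;
-- note: if your CrateDB version does not allow LIMIT in insert-by-query,
-- export a sample via COPY TO and re-import it instead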

The tables are also very different in size, as you can see here:

[screenshot: size comparison of the two tables]

I'll try to export those as CSVs and deliver the schema and the data to you. It should also be possible for me to snapshot those two tables to a different storage account and give you the credentials, but that will take me some time, I'm afraid.

Got some news here: the significant column seems to be the location field… if I insert those rows without location (NULL), the resulting table is normal in size and behaviour.
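i.e. copying the same rows but leaving location (and location_centroid) out, roughly like this (the test table name is just an example):

INSERT INTO doc.odp_ts_noloc (entity_id, entity_type, time_index, source, value, value_num, category, location_index, parent, content)
SELECT entity_id, entity_type, time_index, source, value, value_num, category, location_index, parent, content
FROM doc.odp_ts
WHERE source = 'bbb';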

I still don't know what exactly the difference is, as the lighter rows also contain location in the very same manner. Some kind of encoding or spatial-index problem?

I was just about to write something along those lines :slight_smile:

I still don't know what exactly the difference is, as the lighter rows also contain location in the very same manner. Some kind of encoding or spatial-index problem?

Yes, it really looks like this is related to the spatial index.
I see that you used a geohash index. I did some first tests with quadtrees and it seems to perform significantly better. I would also lower the number of shards to 1 per node.

With quadtree and 1 shard, the total size is 14.6 MiB (vs. 57.7 MiB + 2.6 MiB for bloat/sleek with geohash).

CREATE TABLE IF NOT EXISTS "doc"."odp_timeseries_quad" (
   "entity_id" TEXT,
   "entity_type" TEXT,
   "time_index" TIMESTAMP WITH TIME ZONE,
   "source" TEXT,
   "value" TEXT,
   "value_num" DOUBLE PRECISION,
   "category" TEXT,
   "location_index" TEXT,
   "location" GEO_SHAPE INDEX USING quadtree,
   "location_centroid" GEO_POINT,
   "parent" TEXT,
   "content" OBJECT(IGNORED),
   "year" INTEGER GENERATED ALWAYS AS EXTRACT(YEAR FROM coalesce("time_index", current_timestamp))
)
CLUSTERED INTO 1 SHARDS
PARTITIONED BY ("year");

Thanks for the comprehensive analysis; now I also know why the one data source differs from the other. The first data source only stores Uber H3 hexagons of a single level, which seems to align very well with the geohash cells, resulting in a compact index. The other data source stores H3 cells of different levels/sizes and is more heterogeneous in that sense, which seems to overlap the geohash cells in a very disadvantageous manner, resulting in a bloated index that needs a lot of space and performs badly on querying.

We will end up removing this location field completely from our multi-million-row timeseries table, as it doesn't make sense to store all those identical polygons anyway.

Instead we will put them into a simple, small lookup table, where I think I will also make use of the quadtree index as you showed :wink:
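Such a lookup table could look roughly like this (the table name and the precision value are just examples):

CREATE TABLE IF NOT EXISTS "doc"."odp_h3_cells" (
   "location_index" TEXT PRIMARY KEY,
   "location" GEO_SHAPE INDEX USING quadtree WITH (precision = '10m'),
   "location_centroid" GEO_POINT
)
CLUSTERED INTO 1 SHARDS;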

Thank you very much!
