I'm currently ingesting 11 million rows of IoT time-series data through Kafka into CrateDB. I use a single table with a rather flat schema, which currently holds ~70 million rows covering the last 3 years. This large table is partitioned over multiple fields (type of device, year, source), so that the largest partitions have a couple of million rows at worst; most have only a few hundred thousand. So far this has worked acceptably for me, but I am still struggling to tune it correctly.
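For context, here is a minimal sketch of what the table layout looks like (table and column names are placeholders, not my real schema; the DDL is sent through CrateDB's HTTP `_sql` endpoint):

```python
# Sketch only: table/column names and the endpoint URL are placeholders.
import requests

CRATE_URL = "http://localhost:4200/_sql"

ddl = """
CREATE TABLE IF NOT EXISTS iot_readings (
    device_type TEXT,
    source      TEXT,
    year        SMALLINT,
    ts          TIMESTAMP WITH TIME ZONE,
    value       DOUBLE PRECISION
) CLUSTERED INTO 6 SHARDS
  PARTITIONED BY (device_type, year, source)
  WITH (number_of_replicas = 1)
"""

requests.post(CRATE_URL, json={"stmt": ddl}).raise_for_status()
```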
With this specific ingestion process, I have two major issues:
A) Ingestion performance is very poor: only about 100 rows per second at best. Batches of about 1,000 rows are sent to the HTTP endpoint via a bulk insert. The table has only 1 replica (since basically all my data is in there, I don't want to lose anything on node failure). Every partition is distributed over 6 shards. I have 5 nodes running, each with 4 CPUs and 8 GB RAM (which isn't much at all, I know). They run as Kubernetes pods in AKS and persist to Azure Files volumes (Azure Storage Account), which I should probably switch to Premium SSD disks instead. Since this whole thing is just a proof-of-concept system (not a production system at all), my question is simply: am I doing anything fatally wrong on the database side, or is the ingestion performance just this bad because of the hardware it is running on?
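For reference, the batches are sent roughly like this (a sketch with the placeholder names from above; CrateDB's `_sql` endpoint accepts a `bulk_args` array, so the statement is parsed once and executed once per row):

```python
import requests

CRATE_URL = "http://localhost:4200/_sql"  # placeholder endpoint

def bulk_insert(rows):
    # One HTTP round trip per ~1,000-row batch; CrateDB executes the
    # statement once for each entry in bulk_args.
    payload = {
        "stmt": "INSERT INTO iot_readings "
                "(device_type, source, year, ts, value) VALUES (?, ?, ?, ?, ?)",
        # e.g. [["sensor", "plant-a", 2023, 1672531200000, 1.5], ...]
        "bulk_args": rows,
    }
    resp = requests.post(CRATE_URL, json=payload)
    resp.raise_for_status()
    # Each element of "results" carries a "rowcount";
    # a rowcount of -2 means that individual operation failed.
    return resp.json()["results"]
```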
B) The data I am currently ingesting goes into its own partition. The data in the different partitions is completely comparable in structure and size, so it should need roughly the same amount of space. But that isn't the case: this partition needs about 20 times the space for the same number of rows as a comparable partition. I have no clue why this happens; any suggestions?
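This is roughly how I compare the partitions (a sketch; `sys.shards` exposes per-shard `size` in bytes and `num_docs`, and `partition_ident` identifies the partition):

```python
import requests

CRATE_URL = "http://localhost:4200/_sql"  # placeholder endpoint

stmt = """
SELECT table_name,
       partition_ident,
       sum(num_docs)              AS docs,
       sum(size) / 1024.0 / 1024  AS size_mb
FROM sys.shards
WHERE primary = true
GROUP BY table_name, partition_ident
ORDER BY size_mb DESC
"""

rows = requests.post(CRATE_URL, json={"stmt": stmt}).json()["rows"]
for table, partition, docs, size_mb in rows:
    print(f"{table} {partition}: {docs} docs, {size_mb:.1f} MB")
```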