Importance of Importing data in order

djbestenergy · October 6, 2022, 9:03am

Hello all,

This is possibly a self-answering question about the order of data insertion from a migration perspective, and how this could affect performance.

We’ve got hundreds of GB of data to import. It is currently held in individual “day” data files for (n) IoT devices, with each day file holding 1440 records in timestamp order.

These will need importing into CrateDB, and I wanted to know whether I could run several imports at once, where the data will arrive out of order (because different processes insert different data files at different times), or whether a single process reading the data in order (but vastly slower) would work better.

i.e. could query performance be affected by out-of-order data?

I might be barking up the wrong tree, but I don’t know how “flexible” CrateDB is regarding this.

Many thanks in advance,
David.

hernanc · October 6, 2022, 11:01am

Hi David,
Because of the way CrateDB stores data, with partitioning, sharding, and distribution among nodes, it can be a lot faster at ingesting data that arrives “out of order” than other systems, so I think processing the files in parallel would be a good idea. 1440 records is not much, so you should not need additional batching. You should, however, do some testing with different numbers of files imported in parallel: there will be a point where too many parallel requests overwhelm the system and the import throughput actually goes down.
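As a rough illustration of that kind of test, here is a minimal Python sketch that imports a sample of files with a bounded number of parallel workers and reports rows per second for each worker count. The names `import_files`, `compare_worker_counts`, and `insert_file` are hypothetical; `insert_file` stands in for whatever function loads one day file into CrateDB and returns the number of rows inserted, and the candidate worker counts are arbitrary starting points.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def import_files(paths, insert_file, max_workers=4):
    """Run insert_file(path) for every path, at most max_workers at a time.

    insert_file is a caller-supplied (hypothetical) function that loads one
    day file into CrateDB and returns the number of rows it inserted.
    Returns (total_rows, elapsed_seconds).
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in input order, but the inserts themselves
        # complete in whatever order the workers finish (i.e. out of order).
        row_counts = list(pool.map(insert_file, paths))
    return sum(row_counts), time.perf_counter() - start


def compare_worker_counts(paths, insert_file, candidates=(1, 2, 4, 8)):
    """Measure rows/second for each worker count on the same sample of files."""
    throughput = {}
    for workers in candidates:
        rows, seconds = import_files(paths, insert_file, max_workers=workers)
        throughput[workers] = rows / seconds if seconds > 0 else float("inf")
    return throughput
```

Since `compare_worker_counts` loads the same sample once per candidate, it is probably best pointed at a scratch table (or given a different sample per run) so the real table does not end up with duplicates. The worker count after which rows per second stops improving is a reasonable first guess for the full import.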
Depending on the partitioning columns, sharding routing keys, primary keys defined, indexed columns, and how out of order the records arrive with respect to all of these, there may be some fragmentation, leading to additional disk space being consumed and suboptimal query performance. To maximise query performance after the data is loaded, I would suggest running the optimization process described in the CrateDB reference (Optimization).
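For reference, a minimal sketch of triggering that optimization from Python once the import has finished. It assumes the linked reference page describes CrateDB's `OPTIMIZE TABLE` statement and that the cluster's HTTP endpoint is reachable at the default `http://localhost:4200/_sql`. The table name passed in and the helper names `build_optimize_request` and `run_optimize` are hypothetical.

```python
import json
import re
import urllib.request

# Accepts "table" or "schema.table" made of plain identifier characters only.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?")


def build_optimize_request(table, max_num_segments=1):
    """Build the JSON payload for an OPTIMIZE TABLE statement."""
    if not _IDENT.fullmatch(table):
        raise ValueError(f"unexpected table name: {table!r}")
    stmt = (
        f"OPTIMIZE TABLE {table} "
        f"WITH (max_num_segments = {int(max_num_segments)})"
    )
    return {"stmt": stmt}


def run_optimize(table, url="http://localhost:4200/_sql", max_num_segments=1):
    """POST the statement to the HTTP endpoint (URL is an assumed default)."""
    body = json.dumps(build_optimize_request(table, max_num_segments)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `run_optimize("readings")` would send `OPTIMIZE TABLE readings WITH (max_num_segments = 1)`. Merging down to a single segment can be an expensive operation on large tables, so it is probably best run after the bulk load completes rather than during it.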
