Weird behaviour with COPY FROM

I’m encountering weird behaviour with COPY FROM when importing a fairly small CSV file into a 3-node cluster (version 4.1.3). The cluster runs in AWS EKS on three m5.2xlarge instances. Each pod has 7 CPUs and 26 GB of memory, split 50/50 between the CrateDB heap and Lucene.

The file has around 55 million rows; gzipped, it is around 200 MB.

I have created the table like this:

CREATE TABLE result (
    device integer,
    time timestamp with time zone,
    idx smallint,
    count smallint
)
CLUSTERED BY (device) INTO 12 SHARDS
WITH (
    number_of_replicas = 0,
    refresh_interval = 0
);
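
A side note on the settings: refresh_interval = 0 disables the periodic refresh entirely, so even after a successful import the rows only become visible once the table is refreshed. A sketch of the usual follow-up once the import succeeds (1000 ms is just the default interval):

REFRESH TABLE result;                               -- make imported rows visible to queries
ALTER TABLE result SET (refresh_interval = 1000);   -- restore the periodic refresh (ms)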

and I am running the following command to import the file from S3:

copy result from 's3://secret:secret@bucket/file_9_with_header.gz' WITH (format='csv', compression='gzip', shared='true', num_readers='1')

What then happens is that CrateDB starts a new job every minute or so. If I query the running jobs:

select * from sys.jobs order by started asc limit 100;

after a few minutes it looks like this:

id node started stmt username
be842aaf-dd0c-4af2-8cb5-107ec7e3dfbf {"name":"Busazza","id":"l2OWa8OJRxWsiVarjF5Hpw"} 1584101060977 copy single_result from 's3://secret:secret@bucket/file_9_with_header.gz' WITH (format='csv', compression='gzip', shared='true', num_readers='1') crate
2907bcb7-5c6d-45d3-8630-0fe3b7c23112 {"name":"Busazza","id":"l2OWa8OJRxWsiVarjF5Hpw"} 1584101120535 copy single_result from 's3://secret:secret@bucket/file_9_with_header.gz' WITH (format='csv', compression='gzip', shared='true', num_readers='1') crate
6954cd0d-99c0-46ec-a336-86bff85074a1 {"name":"Busazza","id":"l2OWa8OJRxWsiVarjF5Hpw"} 1584101181028 copy single_result from 's3://secret:secret@bucket/file_9_with_header.gz' WITH (format='csv', compression='gzip', shared='true', num_readers='1') crate
6090be76-7293-4f8e-8947-96bcef4e114f {"name":"Busazza","id":"l2OWa8OJRxWsiVarjF5Hpw"} 1584101240535 copy single_result from 's3://secret:secret@bucket/file_9_with_header.gz' WITH (format='csv', compression='gzip', shared='true', num_readers='1') crate
a49f089e-603a-4983-9b4d-e71695f542d3 {"name":"Busazza","id":"l2OWa8OJRxWsiVarjF5Hpw"} 1584101300535 copy single_result from 's3://secret:secret@bucket/file_9_with_header.gz' WITH (format='csv', compression='gzip', shared='true', num_readers='1') crate
98e22fc4-c096-4837-b1a4-01e8c4930667 {"name":"Busazza","id":"l2OWa8OJRxWsiVarjF5Hpw"} 1584101420535 copy single_result from 's3://secret:secret@bucket/file_9_with_header.gz' WITH (format='csv', compression='gzip', shared='true', num_readers='1') crate
7c863dfe-95f1-4543-92e9-3c6615ffe172 {"name":"Busazza","id":"l2OWa8OJRxWsiVarjF5Hpw"} 1584101480535 copy single_result from 's3://secret:secret@bucket/file_9_with_header.gz' WITH (format='csv', compression='gzip', shared='true', num_readers='1') crate
f5f3e260-b721-4837-a46e-e2411be030d5 {"name":"Busazza","id":"l2OWa8OJRxWsiVarjF5Hpw"} 1584101540536 copy single_result from 's3://secret:secret@bucket/file_9_with_header.gz' WITH (format='csv', compression='gzip', shared='true', num_readers='1') crate

The import never completes, and CPU usage keeps climbing as the jobs pile up.
Also, there don’t seem to be any errors in the logs of the cluster nodes.
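
For completeness, sys.jobs_log should show whether any of the earlier jobs actually finished, and with what error if they failed. A quick check along these lines (column names as of 4.x):

SELECT id, started, ended, error, stmt
FROM sys.jobs_log
ORDER BY ended DESC
LIMIT 10;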

Any ideas how to get this working?
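
(For anyone hitting the same thing: the piled-up jobs can be cancelled individually by their sys.jobs id, or all at once with KILL ALL. A sketch using the first id from the output above:

KILL 'be842aaf-dd0c-4af2-8cb5-107ec7e3dfbf';  -- cancel one stuck COPY job
)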

How are you executing the statement? Is there any chance a bug somewhere in your client is invoking it multiple times?

As this could be a bug in CrateDB, could you file a bug report on GitHub?
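
One more thing worth trying: COPY FROM supports a RETURN SUMMARY clause (available in your 4.1.3), which reports per-node counts of succeeded and failed rows plus the errors themselves, so failures can surface even when nothing reaches the logs. Roughly:

copy result from 's3://secret:secret@bucket/file_9_with_header.gz' WITH (format='csv', compression='gzip', shared='true', num_readers='1') RETURN SUMMARY;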

I used the CrateDB Admin UI as the client. I’ll file a bug report on GitHub.