Weird behaviour with COPY FROM

I’m encountering weird behaviour with COPY FROM when importing a fairly small CSV file into a 3-node cluster (version 4.1.3). The cluster runs in AWS EKS on three m5.2xlarge instances. Each pod has 7 CPUs and 26 GB of memory, split 50/50 between the CrateDB heap and Lucene.

The file has around 55 million rows; gzipped, it is around 200 MB.

I have created the table like this:

CREATE TABLE result (
    device integer,
    time timestamp with time zone,
    idx smallint,
    count smallint
)
CLUSTERED BY (device) INTO 12 SHARDS
WITH (
    number_of_replicas = 0,
    refresh_interval = 0
);
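
A side note on the settings: refresh_interval = 0 disables the periodic refresh entirely, so even after a successful import the rows only become visible once the table is refreshed. A sketch of the usual follow-up once the import succeeds (1000 ms is just the default interval):

REFRESH TABLE result;                               -- make imported rows visible to queries
ALTER TABLE result SET (refresh_interval = 1000);   -- restore the periodic refresh (ms)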

and I am running the following command to import the file from S3:

copy result from 's3://secret:secret@bucket/file_9_with_header.gz' WITH (format='csv', compression='gzip', shared='true', num_readers='1')

What then happens is that CrateDB starts a new job every minute or so. If I query the running jobs:

select * from sys.jobs order by started asc limit 100;

after a few minutes it looks like this:

id node started stmt username
be842aaf-dd0c-4af2-8cb5-107ec7e3dfbf {"name":"Busazza","id":"l2OWa8OJRxWsiVarjF5Hpw"} 1584101060977 copy single_result from 's3://secret:secret@bucket/file_9_with_header.gz' WITH (format='csv', compression='gzip', shared='true', num_readers='1') crate
2907bcb7-5c6d-45d3-8630-0fe3b7c23112 {"name":"Busazza","id":"l2OWa8OJRxWsiVarjF5Hpw"} 1584101120535 copy single_result from 's3://secret:secret@bucket/file_9_with_header.gz' WITH (format='csv', compression='gzip', shared='true', num_readers='1') crate
6954cd0d-99c0-46ec-a336-86bff85074a1 {"name":"Busazza","id":"l2OWa8OJRxWsiVarjF5Hpw"} 1584101181028 copy single_result from 's3://secret:secret@bucket/file_9_with_header.gz' WITH (format='csv', compression='gzip', shared='true', num_readers='1') crate
6090be76-7293-4f8e-8947-96bcef4e114f {"name":"Busazza","id":"l2OWa8OJRxWsiVarjF5Hpw"} 1584101240535 copy single_result from 's3://secret:secret@bucket/file_9_with_header.gz' WITH (format='csv', compression='gzip', shared='true', num_readers='1') crate
a49f089e-603a-4983-9b4d-e71695f542d3 {"name":"Busazza","id":"l2OWa8OJRxWsiVarjF5Hpw"} 1584101300535 copy single_result from 's3://secret:secret@bucket/file_9_with_header.gz' WITH (format='csv', compression='gzip', shared='true', num_readers='1') crate
98e22fc4-c096-4837-b1a4-01e8c4930667 {"name":"Busazza","id":"l2OWa8OJRxWsiVarjF5Hpw"} 1584101420535 copy single_result from 's3://secret:secret@bucket/file_9_with_header.gz' WITH (format='csv', compression='gzip', shared='true', num_readers='1') crate
7c863dfe-95f1-4543-92e9-3c6615ffe172 {"name":"Busazza","id":"l2OWa8OJRxWsiVarjF5Hpw"} 1584101480535 copy single_result from 's3://secret:secret@bucket/file_9_with_header.gz' WITH (format='csv', compression='gzip', shared='true', num_readers='1') crate
f5f3e260-b721-4837-a46e-e2411be030d5 {"name":"Busazza","id":"l2OWa8OJRxWsiVarjF5Hpw"} 1584101540536 copy single_result from 's3://secret:secret@bucket/file_9_with_header.gz' WITH (format='csv', compression='gzip', shared='true', num_readers='1') crate

The import never completes, and CPU usage keeps climbing as the jobs pile up.
Also, there don’t seem to be any errors in the logs of the cluster nodes.
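
For completeness, sys.jobs_log should show whether any of the earlier jobs actually finished, and with what error if they failed. A quick check along these lines (column names as of 4.x):

SELECT id, started, ended, error, stmt
FROM sys.jobs_log
ORDER BY ended DESC
LIMIT 10;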

Any ideas how to get this working?
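
(For anyone hitting the same thing: the piled-up jobs can be cancelled individually by their sys.jobs id, or all at once with KILL ALL. A sketch using the first id from the output above:

KILL 'be842aaf-dd0c-4af2-8cb5-107ec7e3dfbf';  -- cancel one stuck COPY job
)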

How are you executing the statement? Is there any chance a bug somewhere in your client is invoking it multiple times?

As this could be a bug in CrateDB, could you file a bug report on GitHub?
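
One more thing worth trying: COPY FROM supports a RETURN SUMMARY clause (available in your 4.1.3), which reports per-node counts of succeeded and failed rows plus the errors themselves, so failures can surface even when nothing reaches the logs. Roughly:

copy result from 's3://secret:secret@bucket/file_9_with_header.gz' WITH (format='csv', compression='gzip', shared='true', num_readers='1') RETURN SUMMARY;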

I used the CrateDB Admin UI as the client. I’ll file a bug report on GitHub.