Why does COPY TO create empty JSON files and how is data distributed to these files?

Hello,

Coming from PostgreSQL background, I’m a bit confused about exporting data out of CrateDB using COPY TO command. I’m following the instructions at https://crate.io/docs/crate/reference/en/4.6/sql/statements/copy-to.html.

I’m running a 3-node CrateDB cluster on a Debian 10 Linux servers. The CrateDB version is 4.6.3.

I have a simple table for this experiment:

cr> SHOW CREATE TABLE emre.test_table_002 ;
+------------------------------------------------------+
| SHOW CREATE TABLE emre.test_table_002                |
+------------------------------------------------------+
| CREATE TABLE IF NOT EXISTS "emre"."test_table_002" ( |
|    "time" TIMESTAMP WITH TIME ZONE,                  |
|    "location_id" TEXT,                               |
|    "device_id" TEXT,                                 |
|    "temperature" DOUBLE PRECISION,                   |
|    "humidity" DOUBLE PRECISION,                      |
|    PRIMARY KEY ("time", "location_id", "device_id")  |
| )                                                    |
| CLUSTERED INTO 6 SHARDS                              |
| WITH (                                               |
|    "allocation.max_retries" = 5,                     |
|    "blocks.metadata" = false,                        |
|    "blocks.read" = false,                            |
|    "blocks.read_only" = false,                       |
|    "blocks.read_only_allow_delete" = false,          |
|    "blocks.write" = false,                           |
|    codec = 'default',                                |
|    column_policy = 'strict',                         |
|    "mapping.total_fields.limit" = 1000,              |
|    max_ngram_diff = 1,                               |
|    max_shingle_diff = 3,                             |
|    number_of_replicas = '0-1',                       |
|    "routing.allocation.enable" = 'all',              |
|    "routing.allocation.total_shards_per_node" = -1,  |
|    "store.type" = 'fs',                              |
|    "translog.durability" = 'REQUEST',                |
|    "translog.flush_threshold_size" = 536870912,      |
|    "translog.sync_interval" = 5000,                  |
|    "unassigned.node_left.delayed_timeout" = 60000,   |
|    "write.wait_for_active_shards" = '1'              |
| )                                                    |
+------------------------------------------------------+
SHOW 1 row in set (0.013 sec)

The table contains the following data:

cr> SELECT * FROM emre.test_table_002 ;
+---------------+-------------+------------+-------------+----------+
|          time | location_id | device_id  | temperature | humidity |
+---------------+-------------+------------+-------------+----------+
| 1633050000000 | Dendermonde | device_001 |        12.1 |     40.3 |
| 1633050000000 | Dendermonde | device_002 |        13.1 |     41.3 |
| 1633050000000 | Wilrijk     | device_002 |        13.1 |     41.3 |
+---------------+-------------+------------+-------------+----------+
SELECT 3 rows in set (0.037 sec)

On another Linux machine (non-CrateDB) using crash utility I’m connected to one of the CrateDB nodes, and I run the COPY TO command:

cr> COPY emre.test_table_002 TO DIRECTORY '/tmp/' ;
COPY OK, 3 rows affected  (0.050 sec)

Then I go and check the /tmp directories of each CrateDB node.

crate-dn-001:~$ ls -lah /tmp/*.json
-rw-r--r-- 1 crate crate 111 Nov 12 12:41 /tmp/test_table_002_1_.json
-rw-r--r-- 1 crate crate   0 Nov 12 12:41 /tmp/test_table_002_2_.json
-rw-r--r-- 1 crate crate   0 Nov 12 12:41 /tmp/test_table_002_4_.json
-rw-r--r-- 1 crate crate   0 Nov 12 12:41 /tmp/test_table_002_5_.json
crate-dn-002:~$ ls -lah /tmp/*.json
ls: cannot access '/tmp/*.json': No such file or directory
crate-dn-003:~$ ls -lah /tmp/*.json
-rw-r--r-- 1 crate crate 107 Nov 12 12:41 /tmp/test_table_002_0_.json
-rw-r--r-- 1 crate crate 111 Nov 12 12:41 /tmp/test_table_002_3_.json

My confusion is related to the following questions:

  • Six JSON files scattered to two nodes: is this because of shards?
  • Why some empty JSON files (0 bytes)? I mean… you can say “empty shards”, but I still find empty JSON files confusing.

And a few questions more in the category of wishlist / roadmap:

  • Do you plan to add support for exporting to non-JSON formats, e.g. CSV?
  • Do you plan to add support for a single export file per node? I mean, is there a reason for not exporting to a single JSON file e.g. on crate-dn-001? What’s the motivation behind mapping number of shards to number of exported files on a given node?
  • Do you plan to add support for custom file names for exported files, at least having the ability to add some prefix / suffix to the name generated by CrateDB?

Hi @Emre_Sevinc,

Six JSON files scattered to two nodes: is this because of shards ?

Correct, one JSON file per primary shard.

You could optionally use a shared data storage volume mounted on all nodes to have all files combined in the same directory. Also when using S3 (or a S3 compatible storage) your files would be combined into a single folder.

Why some empty JSON files (0 bytes)? I mean… you can say “empty shards”, but I still find empty JSON files confusing.

Again correct. The JSON file is empty, because the shard is empty.

This is quite the edge case as typically you have more records than shards in a table. Your example is the exception as it has 6 shards for 3 records :sweat_smile:

It could also be confusing if empty shards are skipped as you would end up with smaller number of json files as expected (“only 3 json files; are the copy to for the other shards stuck?”)

Do you plan to add support for a single export file per node ? I mean, is there a reason for not exporting to a single JSON file e.g. on crate-dn-001 ? What’s the motivation behind mapping number of shards to number of exported files on a given node?

Using multiple files allows CrateDB to parallelize the export and thus makes it faster.

  • Do you plan to add support for exporting to non-JSON formats, e.g. CSV?
  • Do you plan to add support for custom file names for exported files , at least having the ability to add some prefix / suffix to the name generated by CrateDB?

We plan to add some improvements for COPY FROM / TO for CrateDB 4.8. I will log your feature requests.

Also as CrateDB is Open Source there is always the option to add new functionality yourself: https://github.com/crate/crate :wink: Contributions to CrateDB are always appreciated

I hope this helps. Have a great weekend,
Johannes

1 Like