This tutorial is also available in video at Fundamentals: Importing and Exporting Data in CrateDB
This tutorial presents the basics of
COPY FROM and
COPY TO in CrateDB. For in-depth details of CrateDB
COPY FROM, refer to our COPY FROM statement: things you need to know post.
Before anything else, I demonstrate how to link my local file system to the Docker container running CrateDB. This step is only necessary if you are running CrateDB with Docker. Then, I will show you how to import
CSV data to CrateDB using the
COPY FROM statement. Finally, I show you how to export data from CrateDB to a local file system using the
COPY TO statement.
Importing Data in CrateDB is done with the
COPY FROM statement.
Reading the COPY FROM documentation, I see that
COPY FROM accepts both
CSV inputs, and in this tutorial, I will show you how to do it with both, starting with
JSON. You can refer to the documentation at the crate.io website to get more details on
But before getting to the dataset, let me quickly explain how to link local files to CrateDB, in case you are running CrateDB with Docker - as I am.
To put it simply: the Docker container is the running environment for CrateDB. This means that CrateDB does not have direct access to my local filesystem, but rather to the ones given in the container.
So when I want to import files from my local files to CrateDB, I add these files to CrateDB’s Docker container with the
--volumes flag, which has the following structure:
I create a folder in my machine called
my_datasets, where I will store the JSON file and link this folder to a
docker_datasets folder in the Docker container.
To apply these changes to my CrateDB setup, I stop Docker with CrateDB and start it again with the following command:
docker run --publish=4200:4200 --publish=5433:5432 --volume=/Users/rafaelasantana/my_datasets:/docker_datasets crate:latest
From now on, I can add my datasets to my local
my_datasets folder, which will be accessible from CrateDB!
The first dataset I’m using today is a small collection of five quotes formatted in
JSON, each having the
If you want to import your dataset, ensure it’s a single
JSON OBJECT per line and no commas separate different objects. You can find more information on the formatting in our COPY FROM documentation.
I set up a table in CrateDB to store the
single_lined.json data. I take the object’s keys as the table columns, so it looks like this:
CREATE TABLE dataset_quotes ( "Quote" TEXT, "Author" TEXT, "Tags" ARRAY(TEXT), "Popularity" DOUBLE, "Category" TEXT );
And then, I run the following
COPY FROM statement, which imports
single_lined.json into the
dataset_quotes table. It’s worth saying that the folder path is from Docker, for the
docker_datasets folder, and not my local file path.
RETURN SUMMARY to the end of my queries gives me detailed error reporting in case something would not work as expected.
COPY dataset_quotes FROM '/docker_datasets/single_lined.json' RETURN SUMMARY;
Now that I successfully imported
JSON data into CrateDB let’s quickly check out how it works with
I use a dataset from Kaggle with around 500k quote records for the
CSV data. It consists of three columns: the quote, the author of the quote, and the category tags for that quote.
I download the dataset, name it
quote_dataset.csv and save it in that same
my_datasets folder, which is mounted in Docker and accessible from CrateDB.
Then, I open the dataset to check the column names and data types.
I see there are the
category columns, all having
So I copy these headers and create a table in CrateDB with the same columns.
CREATE TABLE csv_quotes ( quote TEXT, author TEXT, category TEXT );
Now, all it is left is to run the
COPY FROM statement to import the data into CrateDB.
COPY csv_quotes FROM '/docker_datasets/quote_dataset.csv' RETURN SUMMARY;
Here, CrateDB reports five errors in this immense import, probably due to formatting issues within the dataset.
Most importantly, I see that CrateDB ingested close to 500k rows in seconds!
I query the
csv_quotes in the Table Browser and get a glimpse of the imported data.
Finally, I can easily export data from CrateDB using the
COPY TO statement.
COPY TO documentation, I read:
COPY TO command exports the contents of a table to one or more files into a given directory with unique filenames. Each node with at least one table shard will export its contents onto its local disk.
The created files are JSON formatted and contain one table row per line and, due to the distributed nature of CrateDB, will remain on the same nodes where the shards are."
I head to the Shards Tab in Admin UI and see that CrateDB distributed my tables into shards 0, 1, 2, and 3.
So, I expect CrateDB to export each of these shards’ data into an individual file.
I copy the
csv_quotes table content to the
docker_dataset folder. Since the Docker folder is linked to my local
my_datasets folder, CrateDB will successfully export the values to
COPY csv_quotes TO DIRECTORY '/docker_datasets/';
Now I see four files (one for each CrateDB shard) in my local
my_datasets folder, containing the exported data in JSON format.
For example, I open
csv_quoes_1_.json and see that each row was formatted as a JSON object with
With that, we come to the end of this tutorial on importing and exporting data in CrateDB.