Importing and exporting data in CrateDB

This tutorial is also available as a video: CrateDB Video | Fundamentals: Importing and Exporting Data in CrateDB.

This tutorial presents the basics of COPY FROM and COPY TO in CrateDB. For in-depth details of CrateDB COPY FROM, refer to our Fundamentals of the COPY FROM statement post.

Before anything else, I demonstrate how to link my local file system to the Docker container running CrateDB; this step is only necessary if you are running CrateDB with Docker. Then, I show you how to import JSON and CSV data into CrateDB using the COPY FROM statement. Finally, I show you how to export data from CrateDB to the local file system using the COPY TO statement.

Importing Data - JSON

Importing data into CrateDB is done with the COPY FROM statement.

Reading the COPY FROM documentation, I see that COPY FROM accepts both JSON and CSV inputs, and in this tutorial, I will show you how to do it with both, starting with JSON. You can refer to the documentation at the crate.io website to get more details on COPY FROM.

But before getting to the dataset, let me quickly explain how to link local files to CrateDB, in case you are running CrateDB with Docker - as I am.

Linking local files to Docker container

To put it simply: the Docker container is the running environment for CrateDB. This means that CrateDB does not have direct access to my local file system, but only to the file system inside the container.

So when I want to import local files into CrateDB, I make them available to CrateDB's Docker container with the --volume flag, which has the following structure:

--volume=/path/on/your/machine/to/file.json:/path/in/docker

I create a folder on my machine called my_datasets, where I will store the JSON file, and link this folder to a docker_datasets folder in the Docker container.

To apply these changes to my CrateDB setup, I stop Docker with CrateDB and start it again with the following command:

docker run --publish=4200:4200 --publish=5433:5432 --volume=/Users/rafaelasantana/my_datasets:/docker_datasets --env CRATE_HEAP_SIZE=1g crate:latest

From now on, I can add my datasets to my local my_datasets folder, which will be accessible from CrateDB!
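To confirm the mount works, I can list the folder from inside the container. This is just a sanity check, not part of the import itself; the docker exec call assumes you know your container's name or ID, which docker ps reveals:

```
# Find the CrateDB container's name or ID
docker ps

# List the mounted folder inside the container
# (replace <container> with the name or ID from docker ps)
docker exec <container> ls /docker_datasets
```

If the files you placed in my_datasets show up here, the volume is linked correctly.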

Dataset Overview

The first dataset I’m using today is a small collection of five quotes formatted in JSON, each having the Quote, Author, Tags, Popularity, and Category keys.

If you want to import your own dataset, make sure it contains one JSON object per line, with no commas separating the objects. You can find more information on the formatting in our COPY FROM documentation.
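If your source file is a regular JSON array instead, a small script can rewrite it into this line-per-object format. This is a minimal sketch; the to_json_lines helper, the sample records, and the output file name are my own illustrations, not part of the tutorial's dataset:

```python
import json

def to_json_lines(records):
    """Serialize a list of dicts as one JSON object per line, with no commas."""
    return "\n".join(json.dumps(r) for r in records) + "\n"

# Hypothetical records matching the quote schema used in this tutorial
quotes = [
    {"Quote": "Stay curious.", "Author": "Unknown",
     "Tags": ["life"], "Popularity": 0.9, "Category": "life"},
    {"Quote": "Less is more.", "Author": "Unknown",
     "Tags": ["design"], "Popularity": 0.7, "Category": "design"},
]

# Write a file in the format COPY FROM can ingest directly
with open("single_lined.json", "w") as f:
    f.write(to_json_lines(quotes))
```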

Creating a table to store data

I set up a table in CrateDB to store the single_lined.json data. I take the object’s keys as the table columns, so it looks like this:

CREATE TABLE dataset_quotes (
  "Quote" TEXT,
  "Author" TEXT, 
  "Tags" ARRAY(TEXT),
  "Popularity" DOUBLE,
  "Category" TEXT
);

Then, I run the following COPY FROM statement, which imports single_lined.json into the dataset_quotes table. Note that the path refers to the docker_datasets folder inside the Docker container, not to my local file path.

Also, adding RETURN SUMMARY to the end of my queries gives me detailed error reporting in case something does not work as expected.

COPY dataset_quotes 
FROM '/docker_datasets/single_lined.json' 
RETURN SUMMARY;

Now that I have successfully imported JSON data into CrateDB, let's quickly check out how it works with CSV.

Importing Data - CSV

For the CSV data, I use a dataset from Kaggle with around 500k quote records. It consists of three columns: the quote, the author of the quote, and the category tags for that quote.

I download the dataset, name it quote_dataset.csv, and save it in that same my_datasets folder, which is mounted in Docker and accessible from CrateDB.

Then, I open the dataset to check the column names and data types.

I see there are the quote, author, and category columns, all containing TEXT data.
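For illustration, the first few lines of such a file could look like this (hypothetical rows; only the header line with the column names matters for the table definition that follows):

```
quote,author,category
"An example quote.",Some Author,"tag-one, tag-two"
"Another example quote.",Another Author,tag-three
```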

So I copy these headers and create a table in CrateDB with the same columns.

CREATE TABLE csv_quotes (
  quote TEXT,
  author TEXT,
  category TEXT
);

Now, all that is left is to run the COPY FROM statement to import the data into CrateDB.

COPY csv_quotes 
FROM '/docker_datasets/quote_dataset.csv' 
RETURN SUMMARY;

Here, CrateDB reports five errors in this large import, probably due to formatting issues within the dataset.

Most importantly, I see that CrateDB ingested close to 500k rows in seconds!

I query the csv_quotes table in the Table Browser and get a glimpse of the imported data.

Exporting Data

Finally, I can easily export data from CrateDB using the COPY TO statement.

In the COPY TO documentation, I read:

"The COPY TO command exports the contents of a table to one or more files into a given directory with unique filenames. Each node with at least one table shard will export its contents onto its local disk.
The created files are JSON formatted and contain one table row per line and, due to the distributed nature of CrateDB, will remain on the same nodes where the shards are."

I head to the Shards Tab in Admin UI and see that CrateDB distributed my tables into shards 0, 1, 2, and 3.

So, I expect CrateDB to export each of these shards’ data into an individual file.

I copy the csv_quotes table content to the docker_datasets folder. Since the Docker folder is linked to my local my_datasets folder, CrateDB will export the values to my_datasets.

COPY csv_quotes TO DIRECTORY '/docker_datasets/';

Now I see four files (one for each CrateDB shard) in my local my_datasets folder, containing the exported data in JSON format.

For example, I open csv_quotes_1_.json and see that each row was formatted as a JSON object with quote, author, and category keys.
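COPY TO also accepts a WITH clause for output options; for instance, the documentation describes a compression setting for writing gzipped files. A sketch, assuming the same mounted directory:

```
-- Export the table again, this time gzip-compressed
COPY csv_quotes
TO DIRECTORY '/docker_datasets/'
WITH (compression = 'gzip');
```

This is handy when exporting large tables, since the compressed files take far less space on disk.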

With that, we come to the end of this tutorial on importing and exporting data in CrateDB.
