COPY FROM statement: things you need to know

COPY FROM is a handy command: it allows you to connect CrateDB with a filesystem and import data from local or remote files to a specified table. The syntax is fairly simple: often, you just need to specify the destination table and URI of the file:

COPY table_name FROM 'path_to_file';

To import data from a file into the table you will need the table with the same number and data type of columns as present in the file. However, importing data is not always straightforward: there are sometimes cases when files are not formatted in a way that CrateDB accepts with default settings. In this situation, you can see the successful execution of the command but an empty table as a result. So, how do we find out what the error is and how to solve it?

In this tutorial, we will illustrate how to effectively use the COPY FROM statement in CrateDB and show you several options you should know in order to avoid common mistakes.

RETURN SUMMARY

To get detailed information about possible failures when executing statements in CrateDB use the RETURN SUMMARY clause. The clause is optional but very helpful: it reports information such as the node that processed the file, the number of rows that were inserted, the number of records that failed, and detailed information about all errors.

Now, let’s imagine having a simple users table and a CSV file containing several records:

CREATE TABLE "doc"."users" (
   "id" INTEGER,
   "name" TEXT,
   "country" TEXT
)

If we try to import the following records:

"id","name","country"
"1", "Ana", "DE"
"2", "Sara", "DE"
"\\", "Peter", "DE"

We will get the output as COPY OK, 2 records affected (1.1 seconds)

But, we actually have three rows in our file. By running the RETURN SUMMARY clause with the COPY FROM statement and inspecting the errors field in the output, we can see that the third row wasn’t imported due to the following error:

Screenshot 2022-07-21 at 12.19.36

As we can see, we also get information about the number of records that failed with the error and the line numbers in the source file where the error occurred. We would strongly advise using this optional clause whenever you run your queries in CrateDB.

More information about the complete output of the RETURN SUMMARY clause can be found in the official documentation.

Importing CSV files

CrateDB accepts files in JSON or CSV formats. Please have in mind that if the format is not specified, the file will proceed as JSON. Furthermore, the format specification can be provided as a format option in the WITH clause.

When importing CSV files, it is important to consider the file formatting. For instance, the default delimiter in CrateDB is , but it may be the case that the columns in the file are separated with a different delimiter. In the following sections, we will show you how you can import files with other, non-default formatting options.

Other delimiters

Although CSV files usually separate values using commas, this is not always the case. Basically, any delimiter can be used to separate columns in CSV files. Some of the most common delimiters include the semicolon (;), a new line(\n), vertical line (|), or tab(\t). A file with a tab-separated value is also known as a TSV file and we will go into more details on how to import TSV files later in this tutorial.

To specify the delimiter character in the COPY FROM statement, you should use the delimiter option with a WITH clause. For example, if the column values are separated by a semicolon, we need this additional definition:

COPY table_name FROM 'file_path/file_name.csv' WITH (delimiter=';', format='csv') 

Does the data have a header

The first line of a CSV file indicates a header with the column name. CrateDB checks if the types from the import file match the data type of the column in the destination table. This casts the types and will always import the data as in the source file. However, if the import file does not contain the header, you can still import data by setting header = false option in WITH statement:

COPY table_name FROM 'file_path/file_name.csv' WITH (header=false, format='csv') 

The default value of the header option is true, and in this case, the first line in the CSV file must contain the column names. Instead of importing all data, you can use the optional column declaration to import a subset of the data:

COPY table_name(column1, column2) FROM 'file_path/file_name.csv' WITH (format='csv')

The example shows the case when you import the data only for the first two columns. The values of other columns in the table will be automatically set to NULL.

If the header option is set to false, the CSV must not contain column names in the first line, and instead, the columns declared in the COPY statement are used (e.g., COPY table_name(column1, column2, ...)). If no columns are declared in the statement, it will default to all columns present in the destination table.

Importing TSV files

A tab-separated values (TSV) file is a variation of the comma-separated values format. It is one of the most common files when transferring data between databases or applications. In TSV files, each row is one line of the text file and each field value is separated from the next by a tab character.

You can import data from TSV files in the same way as CSV files, but you need to provide two more options in WITH clause: delimiter and format, as illustrated below:

COPY table_name FROM 'file_path/file_name.tsv' WITH (delimiter=e'\t', format='csv') 

The command above shows that the file should be processed as a CSV file where the columns in the file are separated with a tab delimiter. For \t sequence to be interpreted as a tab character, it must be specified with an escape token as illustrated.

Compressed files

If you want to import CSV or JSON files that are compressed in gzip file format you should use the compression option with the COPY FROM command:

COPY table_name FROM 'file_path/file_name.csv' WITH (compression='gzip', format='csv') 

The default value for the compression option is null, and it should be set to gzip to read gzipped files. For now, CrateDB allows the import of compressed data only in gzip format.

Conclusion

In this tutorial, we illustrated how to use the COPY FROM command in CrateDB. We started with the RETURN SUMMARY clause that gives you a better understanding of the errors that occurred during the import and discussed a couple of options for more effective use of COPY FROM. A better understanding of these capabilities can help you deal with data imports from files that are formatted using non-standard options.

If you like this tutorial and would like to learn more about CrateDB check our official documentation and other tutorials on the CrateDB community.

2 Likes