Best practices for transferring data from one CrateDB cluster to another

When operating multiple CrateDB clusters, you might need to copy certain data from one cluster to another. There can be several reasons why to run multiple CrateDB clusters, such as having different stages (development, staging, testing, production) or a decentralized setup with clusters deployed to different locations, such as hybrid on-premise/cloud or multi-cloud setups.

Solutions

There are different strategies available on how to copy data from one cluster to another, each suitable for different needs. In this article, we will briefly introduce and discuss each of them.

Logical Replication

If you are looking for an ongoing replication of data from a source cluster to a target cluster, logical replication is the best choice. Based on a publish/subscribe model, certain tables on a source cluster can be made available to subscribe to by a target cluster. Data is replicated instantly as it changes on the source cluster. Replicated tables are read-only on the target cluster, making this a good approach if you do not need to further alter data on the target cluster.

Requirements: The network setup must allow incoming traffic to the source cluster from the target cluster
Tutorial: Logical replication setup between CrateDB clusters — CrateDB: How-Tos

Snapshots

Another build-in mechanism to copy data is snapshots. Snapshots are based on a central repository, such as a cloud bucket (AWS S3, Azure Blob storage, …) or shared network storage.

Snapshots capture the current state of a table or partition when generated using CREATE SNAPSHOT. They are stored in a binary format and can be imported into the target cluster using RESTORE SNAPSHOT.

Requirements: A cloud bucket or shared network storage
Tutorial: Snapshots — CrateDB: Reference

Export/import (COPY TO/COPY FROM)

Similar to snapshots, COPY TO exports a table or partition but as a CSV or JSON file. The target of COPY TO can be an AWS S3 bucket or the local file system. From there, exports can be imported into the target cluster using COPY FROM.

Due to the open file format, this approach is ideal if you plan to read exported files with other tools as well.

Requirements: A cloud bucket or shared network storage
Tutorial: Data manipulation — CrateDB: Reference

Custom solution

All above-mentioned methods use built-in capabilities of CrateDB. You may also take a different approach by utilizing CrateDB’s PostgreSQL-compatible SQL interface in connection with 3rd party tools, such as a data pipeline or ETL tool. This might be a good approach if you wish to further transform or enrich data before ingesting it into the target cluster.

For a (non-exhaustive) list of tutorials for some of such tools, please visit our Overview of CrateDB integration tutorials.

Summary

We showed that there are several strategies available to copy data from one CrateDB cluster to another. To wrap up the comparison, let’s highlight again when to use which approach:

  • Logical replication: For continuous replication of one or more tables without modifications to the data
  • Snapshots: For one-time copying of a table or partition, using an efficient, but binary file format
  • Export/import: For one-time copying of a table or partition, using a standard CSV/JSON file format
  • Custom solution: For one-time or continuous copying of data with the need for more complex data transformation or enrichment
1 Like