Insert data to local node only?

Is it possible to somehow override the distribution function configuration so that data is stored only in the local node’s shards?
The idea is to reduce network load, since I have a lot of data and scalability is not so important.
Thanks.

You could use shard allocation filtering to store certain tables or partitions only on specified nodes (or node types).

Thanks, that is the closest option I have found,
but my requirements are a little different:

  1. I need all tables on all nodes.
  2. Each table on a node should hold all of the data that was inserted into that node, and only that data.
  3. I need to run queries that look into all nodes/tables and join data from all of them.

I know it's not the original purpose of CrateDB, but I need something like that.

Thanks.

  1. need all tables on all nodes.
    They would still be accessible.

One option would be to put a node identifier into a partition column and then allocate the partitions to specific nodes. That way you really could have one (partitioned) table that is queryable as a single table.
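
A rough sketch of that idea, in case it helps. The table `doc.readings`, its columns, the partition column `node_id`, and the custom node attribute `site` (set on each node via `node.attr.site`) are all made-up names for illustration, and `number_of_replicas = 0` is my assumption about what you want. The helper only builds SQL strings; the statements follow my reading of the CrateDB docs on `PARTITIONED BY` and per-partition `routing.allocation.require.*` settings, so please check them against your version.

```python
def partition_ddl(table, sites):
    """Build DDL for one partitioned table with one partition pinned per site.

    The table name, the columns and the `site` node attribute are hypothetical.
    """
    statements = [
        f"CREATE TABLE {table} ("
        " node_id TEXT, ts TIMESTAMP WITH TIME ZONE, payload OBJECT"
        ") PARTITIONED BY (node_id) WITH (number_of_replicas = 0)"
    ]
    for site in sites:
        # Assumption: a partition only exists once a row with that node_id
        # has been inserted, so these statements would be run after that.
        statements.append(
            f"ALTER TABLE {table} PARTITION (node_id = '{site}') "
            f"SET (\"routing.allocation.require.site\" = '{site}')"
        )
    return statements


for stmt in partition_ddl("doc.readings", ["eu", "us"]):
    print(stmt + ";")
```

Queries would still go against the single table `doc.readings`; only the placement of each partition's shards is constrained.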

Hot/cold strategies are similar, but not necessarily the same.
Also see Building a hot and cold storage data retention policy in CrateDB with Apache Airflow.

@ArnonAyal Can you give us additional insight into your use case? Are you running into limits due to too much network load, or is there another reason why you want to reduce the network load?

@jayeff, yes, that is the exact reason.
We are dealing with a huge amount of data, so replication or data transfer between nodes would cause network problems; the nodes are distributed all over the world.

Hi @ArnonAyal,

Is it possible to somehow override the distribution function configuration so that data is stored only in the local node’s shards?

Sorry, but there isn’t an option to configure CrateDB to work in this way.

You potentially have the option to emulate such behaviour with some additional cluster management.

As @proddata already mentioned, you could use shard allocation filtering on a partitioned table and assign specific partitions to specific nodes. Alternatively, you could look into manually allocating shards to specific nodes with ALTER TABLE REROUTE (with automatic shard rebalancing disabled).
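
For the manual route, a minimal sketch of the two statements involved. The table name, shard id and node names are placeholders, and I am assuming `cluster.routing.rebalance.enable` set to `'none'` is the setting you would use to stop automatic rebalancing; please verify both statements against the docs for your CrateDB version.

```python
def manual_allocation_sql(table, shard_id, from_node, to_node):
    """Build statements that disable rebalancing and move one shard.

    All arguments are placeholders; node values are node names or ids.
    """
    return [
        # Keep the cluster from moving shards again on its own.
        "SET GLOBAL PERSISTENT \"cluster.routing.rebalance.enable\" = 'none'",
        # Move a single shard of the table to the chosen node.
        f"ALTER TABLE {table} REROUTE MOVE SHARD {shard_id} "
        f"FROM '{from_node}' TO '{to_node}'",
    ]


for stmt in manual_allocation_sql("doc.readings", 0, "node-eu-1", "node-us-1"):
    print(stmt + ";")
```

The current location of each shard can be looked up in `sys.shards` before deciding what to move.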

Additionally, you would need to adapt your insert logic so that records are written directly to the CrateDB node where they should be stored (based on either the partition column value or the routing column value).
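
On the client side, that could look roughly like the sketch below. It groups records by a hypothetical `node_id` partition value and sends each batch to that site's HTTP endpoint (`/_sql` with `bulk_args`). The endpoint URLs, the table `doc.readings` and its columns are assumptions for illustration only.

```python
import json
from urllib import request

# Hypothetical mapping from partition value to that site's CrateDB endpoint.
NODE_URLS = {
    "eu": "http://crate-eu.example:4200/_sql",
    "us": "http://crate-us.example:4200/_sql",
}


def group_by_node(rows, key="node_id"):
    """Group rows by the endpoint of the node that should store them."""
    batches = {}
    for row in rows:
        batches.setdefault(NODE_URLS[row[key]], []).append(row)
    return batches


def build_payload(rows):
    """Build the JSON body for one bulk insert (table and columns are made up)."""
    return {
        "stmt": "INSERT INTO doc.readings (node_id, ts, payload) VALUES (?, ?, ?)",
        "bulk_args": [[r["node_id"], r["ts"], r["payload"]] for r in rows],
    }


def send(url, rows):
    """POST one batch to the node that holds the matching partition."""
    body = json.dumps(build_payload(rows)).encode("utf-8")
    req = request.Request(url, data=body, headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)
```

Usage would be along the lines of `for url, batch in group_by_node(rows).items(): send(url, batch)`, so each record only travels to the node that is meant to store it.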

Such an approach is definitely not the easiest solution, and it runs a bit counter to what CrateDB is designed to do, but maybe this or something similar could help you decrease your network load.

Please feel free to reach out if you have additional questions.
Johannes