When are shards relocated to other nodes? Can you clarify the behavior of CrateDB?

Hello,

I need some clarification regarding CrateDB’s behavior when it comes to the relocation of shards during an upgrade (particularly during the graceful shutdown of a node): we recently upgraded our 3-node CrateDB cluster from 4.6.5 to 4.6.6.

During the graceful shutdown of the first node, I noticed that most of the CrateDB data was moved to the remaining two nodes.

Before and during that process, I didn’t change the default value of cluster.graceful_stop.min_availability. In other words, it had the default of primaries.
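For what it’s worth, the effective value can also be checked directly on the cluster. Something along these lines should work (just a sketch, assuming the setting is exposed under settings['cluster']['graceful_stop'] in sys.cluster):

-- Show the currently effective graceful stop settings
SELECT settings['cluster']['graceful_stop']['min_availability'] AS min_availability,
       settings['cluster']['graceful_stop']['timeout'] AS timeout
FROM sys.cluster;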

On top of that, our largest table, a time series table, had '0-1' as the number of replicas:

number_of_replicas = '0-1'

I’m a bit confused because @proddata wrote the following in a related thread:

It looks like that shards have been relocated to the two remaining nodes. This would be the case if the number of replicas would be set to '1' (not 0-1 ) or if the min availability is set to full.

But as I wrote above, even though the number_of_replicas for that large table had the default value of '0-1', and cluster.graceful_stop.min_availability had the default value of primaries, I still observed that most of the data on the node being gracefully shut down moved to the remaining two nodes.

Can one of the CrateDB experts clarify this situation, so that for the next upgrade I’ll have a better understanding?

If cluster.graceful_stop.min_availability is set to primaries, any non-primary shard can remain on the decommissioned node, but all primary shards must first be allocated on another node. If another node already contains a replica shard, this replica shard may be promoted to primary if it is up to date (localCheckpoint = globalCheckpoint).

I guess @proddata made a typo/mistake here, as number_of_replicas = '0-1' in a cluster with more than one node is effectively the same as number_of_replicas = 1.
Thus the replica shards should have been active.
If number_of_replicas had been set to 0, this would effectively result in the same shard relocation as cluster.graceful_stop.min_availability = 'full', since no replica would exist that could be promoted to primary (thus the opposite of what @proddata wrote).
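As a side note, if you want to avoid any ambiguity during the next upgrade, you could also pin an explicit replica count instead of the range before shutting a node down. A sketch with a placeholder table name:

ALTER TABLE doc.my_table SET (number_of_replicas = '1');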

In your case, not all data should have been moved to the remaining nodes, as with number_of_replicas = '0-1', replicas (and thus a copy of the data) should already have been allocated on one of the other nodes.
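If you want to verify where primaries and replicas actually sit before the next graceful shutdown, the allocation can be inspected via sys.shards. A rough sketch (please double-check the column names against the documentation for your version):

-- Shard distribution per node, separated into primaries and replicas
SELECT node['name'] AS node,
       "primary",
       count(*) AS shards,
       sum(num_docs) AS docs
FROM sys.shards
GROUP BY node['name'], "primary"
ORDER BY node['name'], "primary";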

Since you wrote that the large table is a time-series table, could it be that this table is partitioned and not all partitions have number_of_replicas set to > 0?
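If it does turn out to be partitioned, the replica setting can be checked per partition. A sketch with a placeholder table name, assuming information_schema.table_partitions exposes number_of_replicas:

SELECT table_name,
       partition_ident,
       number_of_replicas
FROM information_schema.table_partitions
WHERE table_name = 'your_table';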


Thanks for the detailed explanation to help my understanding.

You wrote:

Since you wrote that the large table is a time-series table, could it be that this table is partitioned and not all partitions have number_of_replicas set to > 0?

So I used SHOW CREATE TABLE — CrateDB: Reference again to take a look at the structure and properties of that time series table (that’s the one with more than 100 million rows).

And what I see is:

CREATE TABLE IF NOT EXISTS "data"."meterreading" (
   "datetime" TIMESTAMP WITH TIME ZONE,
    ...
    ...
)
CLUSTERED INTO 6 SHARDS
WITH (
   "allocation.max_retries" = 5,
   "blocks.metadata" = false,
   "blocks.read" = false,
   "blocks.read_only" = false,
   "blocks.read_only_allow_delete" = false,
   "blocks.write" = false,
   codec = 'default',
   column_policy = 'strict',
   "mapping.total_fields.limit" = 1000,
   max_ngram_diff = 1,
   max_shingle_diff = 3,
   number_of_replicas = '0-1',
   "routing.allocation.enable" = 'all',
   "routing.allocation.total_shards_per_node" = -1,
   "store.type" = 'fs',
   "translog.durability" = 'REQUEST',
   "translog.flush_threshold_size" = 536870912,
   "translog.sync_interval" = 5000,
   "unassigned.node_left.delayed_timeout" = 60000,
   "write.wait_for_active_shards" = '1'
)

Therefore, it seems to me the table is not partitioned, right?

Yes right, it’s not partitioned.
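For completeness, this can also be verified directly from the schema metadata. A sketch, assuming partitioned_by is exposed in information_schema.tables (it should be NULL for non-partitioned tables):

SELECT table_schema,
       table_name,
       partitioned_by,
       number_of_replicas
FROM information_schema.tables
WHERE table_schema = 'data'
  AND table_name = 'meterreading';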

It’s really weird, as your description in the related thread does indeed sound as if cluster.graceful_stop.min_availability was set to full. I currently have no other explanation for why all the data was moved to the other nodes. :frowning:

Thanks for the detailed technical explanations so far! :+1:

I’ll keep a closer eye on it the next time we upgrade our cluster to see if the same behavior happens again.