Third node of a 3-node cluster always re-syncs its complete data

We have been seeing a strange issue in our 3-node CrateDB cluster. For the record, we are running an older release of CrateDB (3.0.6). When we stop the cluster and start it back up (or stop one node and start it back up), the restarted node clears its data path and then syncs all of its data back. I am not sure whether this is expected behavior, as we do not see it in normal Elasticsearch clusters.

Please let me know if we are missing something here.

Hi,

Could you elaborate on what you observe in the logs and in sys.shards when this occurs?

There has been a lot of development in CrateDB since version 3.0.6, which was released 4 years ago: lots of new functionality and performance improvements, but also security and bug fixes. If you could consider an upgrade, that is something I would strongly recommend.

Logs on master after node3 has been brought down:

[2022-10-17T11:37:34,804][INFO ][o.e.c.r.a.AllocationService] [node3] Cluster health status changed from [GREEN] to [YELLOW] (reason: [{node1}{_GdTzxzPRSilISRZoR_q6g}{YOU4FfFZQYi6lUy2Rv3PUA}{10.109.231.34}{10.109.231.34:4300}{http_address=10.109.231.34:4200} left]).
[2022-10-17T11:37:34,820][INFO ][o.e.c.s.MasterService    ] [node3] zen-disco-node-left({node1}{_GdTzxzPRSilISRZoR_q6g}{YOU4FfFZQYi6lUy2Rv3PUA}{10.109.231.34}{10.109.231.34:4300}{http_address=10.109.231.34:4200}), reason(left)[{node1}{_GdTzxzPRSilISRZoR_q6g}{YOU4FfFZQYi6lUy2Rv3PUA}{10.109.231.34}{10.109.231.34:4300}{http_address=10.109.231.34:4200} left], reason: removed {{node1}{_GdTzxzPRSilISRZoR_q6g}{YOU4FfFZQYi6lUy2Rv3PUA}{10.109.231.34}{10.109.231.34:4300}{http_address=10.109.231.34:4200},}
[2022-10-17T11:37:34,914][INFO ][o.e.c.s.ClusterApplierService] [node3] removed {{node1}{_GdTzxzPRSilISRZoR_q6g}{YOU4FfFZQYi6lUy2Rv3PUA}{10.109.231.34}{10.109.231.34:4300}{http_address=10.109.231.34:4200},}, reason: apply cluster state (from master [master {node3}{41MJPpDvS7Or4u1TNKGIpg}{fWRNimZXRrGBbOScXvPPHg}{10.109.231.140}{10.109.231.140:4300}{http_address=10.109.231.140:4200} committed version [199] source [zen-disco-node-left({node1}{_GdTzxzPRSilISRZoR_q6g}{YOU4FfFZQYi6lUy2Rv3PUA}{10.109.231.34}{10.109.231.34:4300}{http_address=10.109.231.34:4200}), reason(left)[{node1}{_GdTzxzPRSilISRZoR_q6g}{YOU4FfFZQYi6lUy2Rv3PUA}{10.109.231.34}{10.109.231.34:4300}{http_address=10.109.231.34:4200} left]]])
[2022-10-17T11:37:34,937][INFO ][o.e.c.r.DelayedAllocationService] [node3] scheduling reroute for delayed shards in [59.8s] (30 delayed shards)
[2022-10-17T11:37:34,939][INFO ][o.e.c.m.MetaDataUpdateSettingsService] [node3] updating number_of_replicas to [1] for indices [xxxxxx..partitioned.txn.042j4chg70, xxxxxx..partitioned.txn.042j4chg74, xxxxxx..partitioned.txn.042j4chg6s]
[2022-10-17T11:37:34,944][WARN ][o.e.d.i.m.AllFieldMapper ] [_all] is deprecated in 6.0+ and will be removed in 7.0. As a replacement, you can use [copy_to] on mapping fields to create your own catch all field.
[2022-10-17T11:37:34,949][WARN ][o.e.d.i.m.AllFieldMapper ] [_all] is deprecated in 6.0+ and will be removed in 7.0. As a replacement, you can use [copy_to] on mapping fields to create your own catch all field.
[2022-10-17T11:37:34,953][WARN ][o.e.d.i.m.AllFieldMapper ] [_all] is deprecated in 6.0+ and will be removed in 7.0. As a replacement, you can use [copy_to] on mapping fields to create your own catch all field.
[2022-10-17T11:37:34,954][INFO ][o.e.c.m.MetaDataUpdateSettingsService] [node3] [xxxxxx..partitioned.txn.042j4chg70/EEo9Vh5VSCC-wOduVivzHQ] auto expanded replicas to [1]
[2022-10-17T11:37:34,954][INFO ][o.e.c.m.MetaDataUpdateSettingsService] [node3] [xxxxxx..partitioned.txn.042j4chg74/InclGnGTRza_2xO9CgMzPQ] auto expanded replicas to [1]
[2022-10-17T11:37:34,955][INFO ][o.e.c.m.MetaDataUpdateSettingsService] [node3] [xxxxxx..partitioned.txn.042j4chg6s/ONsqtr2ARvOk5w0jA4SXlA] auto expanded replicas to [1]

sys.shards before node 3 shut down:

cr> select table_name as t,
    id,
    partition_ident as p_i,
    num_docs as docs,
    primary,
    relocating_node as r_n,
    routing_state as r_state,
    state,
    orphan_partition as o_p
    from sys.shards where table_name = 'txn';
+-----+----+------------+---------+---------+------+---------+---------+-------+
| t   | id | p_i        |    docs | primary |  r_n | r_state | state   | o_p   |
+-----+----+------------+---------+---------+------+---------+---------+-------+
| txn |  0 | 042j4chg6s | 3116992 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  1 | 042j4chg6s | 3116998 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  2 | 042j4chg6s | 3117252 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  3 | 042j4chg6s | 3116077 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  4 | 042j4chg6s | 3117186 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  5 | 042j4chg6s | 3113858 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  6 | 042j4chg6s | 3113216 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  7 | 042j4chg6s | 3115646 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  8 | 042j4chg6s | 3118914 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  9 | 042j4chg6s | 3116300 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  0 | 042j4chg70 | 3176395 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  1 | 042j4chg70 | 3177159 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  2 | 042j4chg70 | 3176364 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  3 | 042j4chg70 | 3177011 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  4 | 042j4chg70 | 3176803 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  5 | 042j4chg70 | 3177610 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  6 | 042j4chg70 | 3176784 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  7 | 042j4chg70 | 3181229 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  8 | 042j4chg70 | 3178579 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  9 | 042j4chg70 | 3177076 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  0 | 042j4chg74 |  106765 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  1 | 042j4chg74 |  106309 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  2 | 042j4chg74 |  107618 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  3 | 042j4chg74 |  106703 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  4 | 042j4chg74 |  106328 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  5 | 042j4chg74 |  106396 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  6 | 042j4chg74 |  106704 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  7 | 042j4chg74 |  106458 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  8 | 042j4chg74 |  106777 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  9 | 042j4chg74 |  106330 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  0 | 042j4chg6s | 3116992 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  1 | 042j4chg6s | 3116998 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  2 | 042j4chg6s | 3117252 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  3 | 042j4chg6s | 3116077 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  4 | 042j4chg6s | 3117186 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  5 | 042j4chg6s | 3113858 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  6 | 042j4chg6s | 3113216 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  7 | 042j4chg6s | 3115646 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  8 | 042j4chg6s | 3118914 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  9 | 042j4chg6s | 3116300 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  0 | 042j4chg70 | 3176395 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  1 | 042j4chg70 | 3177159 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  2 | 042j4chg70 | 3176364 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  3 | 042j4chg70 | 3177011 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  4 | 042j4chg70 | 3176803 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  5 | 042j4chg70 | 3177610 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  6 | 042j4chg70 | 3176784 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  7 | 042j4chg70 | 3181229 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  8 | 042j4chg70 | 3178579 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  9 | 042j4chg70 | 3177076 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  0 | 042j4chg74 |  106765 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  1 | 042j4chg74 |  106309 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  2 | 042j4chg74 |  107618 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  3 | 042j4chg74 |  106703 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  4 | 042j4chg74 |  106328 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  5 | 042j4chg74 |  106396 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  6 | 042j4chg74 |  106704 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  7 | 042j4chg74 |  106458 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  8 | 042j4chg74 |  106777 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  9 | 042j4chg74 |  106330 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  0 | 042j4chg6s | 3116992 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  1 | 042j4chg6s | 3116998 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  2 | 042j4chg6s | 3117252 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  3 | 042j4chg6s | 3116077 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  4 | 042j4chg6s | 3117186 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  5 | 042j4chg6s | 3113858 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  6 | 042j4chg6s | 3113216 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  7 | 042j4chg6s | 3115646 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  8 | 042j4chg6s | 3118914 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  9 | 042j4chg6s | 3116300 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  0 | 042j4chg70 | 3176395 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  1 | 042j4chg70 | 3177159 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  2 | 042j4chg70 | 3176364 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  3 | 042j4chg70 | 3177011 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  4 | 042j4chg70 | 3176803 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  5 | 042j4chg70 | 3177610 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  6 | 042j4chg70 | 3176784 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  7 | 042j4chg70 | 3181229 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  8 | 042j4chg70 | 3178579 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  9 | 042j4chg70 | 3177076 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  0 | 042j4chg74 |  106765 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  1 | 042j4chg74 |  106309 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  2 | 042j4chg74 |  107618 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  3 | 042j4chg74 |  106703 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  4 | 042j4chg74 |  106328 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  5 | 042j4chg74 |  106396 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  6 | 042j4chg74 |  106704 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  7 | 042j4chg74 |  106458 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  8 | 042j4chg74 |  106777 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  9 | 042j4chg74 |  106330 | FALSE   | NULL | STARTED | STARTED | FALSE |
+-----+----+------------+---------+---------+------+---------+---------+-------+
SELECT 90 rows in set (0.010 sec)

sys.shards after node 3 shut down:

cr> select table_name as t,
    id,
    partition_ident as p_i,
    num_docs as docs,
    primary,
    relocating_node as r_n,
    routing_state as r_state,
    state,
    orphan_partition as o_p
    from sys.shards where table_name = 'txn';
+-----+----+------------+---------+---------+------+---------+---------+-------+
| t   | id | p_i        |    docs | primary |  r_n | r_state | state   | o_p   |
+-----+----+------------+---------+---------+------+---------+---------+-------+
| txn |  0 | 042j4chg6s | 3116992 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  1 | 042j4chg6s | 3116998 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  2 | 042j4chg6s | 3117252 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  3 | 042j4chg6s | 3116077 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  4 | 042j4chg6s | 3117186 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  5 | 042j4chg6s | 3113858 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  6 | 042j4chg6s | 3113216 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  7 | 042j4chg6s | 3115646 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  8 | 042j4chg6s | 3118914 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  9 | 042j4chg6s | 3116300 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  0 | 042j4chg70 | 3176395 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  1 | 042j4chg70 | 3177159 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  2 | 042j4chg70 | 3176364 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  3 | 042j4chg70 | 3177011 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  4 | 042j4chg70 | 3176803 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  5 | 042j4chg70 | 3177610 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  6 | 042j4chg70 | 3176784 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  7 | 042j4chg70 | 3181229 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  8 | 042j4chg70 | 3178579 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  9 | 042j4chg70 | 3177076 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  0 | 042j4chg74 |  106765 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  1 | 042j4chg74 |  106309 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  2 | 042j4chg74 |  107618 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  3 | 042j4chg74 |  106703 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  4 | 042j4chg74 |  106328 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  5 | 042j4chg74 |  106396 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  6 | 042j4chg74 |  106704 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  7 | 042j4chg74 |  106458 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  8 | 042j4chg74 |  106777 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  9 | 042j4chg74 |  106330 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  0 | 042j4chg6s | 3116992 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  1 | 042j4chg6s | 3116998 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  2 | 042j4chg6s | 3117252 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  3 | 042j4chg6s | 3116077 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  4 | 042j4chg6s | 3117186 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  5 | 042j4chg6s | 3113858 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  6 | 042j4chg6s | 3113216 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  7 | 042j4chg6s | 3115646 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  8 | 042j4chg6s | 3118914 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  9 | 042j4chg6s | 3116300 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  0 | 042j4chg70 | 3176395 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  1 | 042j4chg70 | 3177159 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  2 | 042j4chg70 | 3176364 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  3 | 042j4chg70 | 3177011 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  4 | 042j4chg70 | 3176803 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  5 | 042j4chg70 | 3177610 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  6 | 042j4chg70 | 3176784 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  7 | 042j4chg70 | 3181229 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  8 | 042j4chg70 | 3178579 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  9 | 042j4chg70 | 3177076 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  0 | 042j4chg74 |  106765 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  1 | 042j4chg74 |  106309 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  2 | 042j4chg74 |  107618 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  3 | 042j4chg74 |  106703 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  4 | 042j4chg74 |  106328 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  5 | 042j4chg74 |  106396 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  6 | 042j4chg74 |  106704 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  7 | 042j4chg74 |  106458 | TRUE    | NULL | STARTED | STARTED | FALSE |
| txn |  8 | 042j4chg74 |  106777 | FALSE   | NULL | STARTED | STARTED | FALSE |
| txn |  9 | 042j4chg74 |  106330 | TRUE    | NULL | STARTED | STARTED | FALSE |
+-----+----+------------+---------+---------+------+---------+---------+-------+
SELECT 60 rows in set (0.009 sec)

Below is the SHOW CREATE TABLE output for the table in question (only the CrateDB-specific settings):

| CLUSTERED BY ("id") INTO 10 SHARDS                  |
| PARTITIONED BY ("partition_date")                   |
| WITH (                                              |
|    "allocation.max_retries" = 5,                    |
|    "blocks.metadata" = false,                       |
|    "blocks.read" = false,                           |
|    "blocks.read_only" = false,                      |
|    "blocks.read_only_allow_delete" = false,         |
|    "blocks.write" = false,                          |
|    column_policy = 'dynamic',                       |
|    "mapping.total_fields.limit" = 1000,             |
|    number_of_replicas = '1-2',                      |
|    refresh_interval = 1000,                         |
|    "routing.allocation.enable" = 'all',             |
|    "routing.allocation.total_shards_per_node" = -1, |
|    "translog.durability" = 'REQUEST',               |
|    "translog.flush_threshold_size" = 536870912,     |
|    "translog.sync_interval" = 5000,                 |
|    "unassigned.node_left.delayed_timeout" = 60000,  |
|    "warmer.enabled" = true,                         |
|    "write.wait_for_active_shards" = 'all'           |
| )                                                   |

Thank you for sharing these details.
The fact that the shards from the unavailable node disappear from sys.shards is expected with number_of_replicas = '1-2': with a range, the replica count is adjusted to the number of available nodes, which is why the log shows "auto expanded replicas to [1]" once the node has left.
An alternative configuration that you could try is:
number_of_replicas = '2', "write.wait_for_active_shards" = '2' /* instead of 'all' */
With these settings, writes can resume as soon as the primary plus 1 replica are available, and the replicas from the inactive node will stay visible in sys.shards.
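As a rough sketch, these alternative settings could be applied with an ALTER TABLE statement like the one below. The table name txn is taken from the sys.shards queries in this thread; the schema is not shown there, so it is left out here as an assumption. I have not verified this on 3.0.6, so please try it on a test table first.

```sql
-- Sketch only: use a fixed replica count instead of the '1-2' range,
-- and require 2 active shard copies (primary + 1 replica) per write.
-- Assumes the table is reachable as "txn" in the current schema.
ALTER TABLE txn SET (
    number_of_replicas = '2',
    "write.wait_for_active_shards" = '2'
);
```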
But in my tests with a 5.0.1 cluster, in both cases the data on the node that is off is not invalidated, and it is not copied from scratch when the node comes back online.
In your case, you mention that the data path is cleared?
Unfortunately, I do not have a 3.0.6 cluster at hand to test with, but I could spin one up if needed.
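To see whether shards are really copied from scratch after the node rejoins, one option is to look at the recovery column of sys.shards while the node is recovering. The query below is only a sketch: it assumes the recovery object exposes these sub-columns in your version, which I have not checked on 3.0.6.

```sql
-- Sketch: inspect how each shard copy of "txn" was (or is being) recovered.
-- Many recovered files and few reused files would suggest that the shard
-- was copied over the network instead of being reused from local disk.
SELECT id,
       partition_ident,
       "primary",
       recovery['type']               AS r_type,
       recovery['stage']              AS r_stage,
       recovery['files']['reused']    AS files_reused,
       recovery['files']['recovered'] AS files_recovered,
       recovery['size']['percent']    AS size_pct
FROM sys.shards
WHERE table_name = 'txn'
ORDER BY partition_ident, id;
```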

Yes, the data path is cleared and re-synced from the master.

It would take us some planning to upgrade the cluster, as we have close to 1 TB of data. The documented upgrade paths specify a full restart upgrade plus an index migration to the newer format, and that would mean an extended downtime.

I understand. Please note the release notes for 5.0.2 state that:

Tables that were created before CrateDB 4.x will not function with 5.x and must be recreated before moving to 5.x.x.

If your large tables only get data inserted, but never updated, perhaps the way to go is to prepare a new cluster in parallel. Use COPY TO (it accepts a WHERE clause) and COPY FROM to seed the tables in the new cluster with all the data up to the day before; on the day of the switchover you then only need to copy the data of the very last day.
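A minimal sketch of that export/import could look like the statements below. The cut-off date, the directory /mnt/export/txn and the comparison against partition_date as a string literal are all placeholders and assumptions (the column type is not shown in this thread). The directory is assumed to be a location that the nodes of both clusters can reach, for example a shared mount.

```sql
-- On the OLD cluster: export everything up to the day before the switchover.
-- The date literal and the output directory are placeholders.
COPY txn
WHERE partition_date < '2022-10-17'
TO DIRECTORY '/mnt/export/txn';

-- On the NEW cluster (the table must already exist there): import the files.
-- shared = true tells CrateDB the location is visible to more than one node,
-- so that each file is imported only once.
COPY txn
FROM 'file:///mnt/export/txn/*'
WITH (shared = true);

-- On the day of the switchover, repeat both steps with the complementary
-- condition, e.g. WHERE partition_date >= '2022-10-17', into a new directory.
```

Exporting by day keeps the final copy small, which is what keeps the downtime short during the switchover.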