Cannot allocate because all found copies of the shard are either stale or corrupt

All of a sudden, one primary shard can’t be allocated anymore.

select * FROM sys.allocations WHERE partition_ident = '082j4c1i6813e' limit 100;

/usr/share/crate/lib# /usr/share/crate/jdk/bin/java -cp lucene-core*.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /mnt/crate_data/nodes/0/indices/eMFzNmvSQ1y9WA3V7jAdrA/2/index/

Gives me:

No problems were detected with this index.

Took 1336.002 sec total.

ALTER CLUSTER REROUTE RETRY FAILED;

Doesn’t work.

ALTER TABLE "dadata" PARTITION (monthreceived = '7' , yearreceived = '2022') REROUTE PRomote replica SHARD 2 ON 'dacrate02' WITH (accept_data_loss = TRUE);

Doesn’t work either.

I don’t know what else to try. How can I force-allocate the shard, like with Elasticsearch? I don’t know why it lost this shard; no logging is shown, even on DEBUG…

I hope someone can help me, thanks.

Hi @gruselglatz,

What version of CrateDB are you running? Can you please share the CREATE TABLE statement for the affected table with us (SHOW CREATE TABLE)? Is the node in good health? Did you try to restart the node?
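
For example (table name taken from your REROUTE statement, adjust the schema if it differs):

SHOW CREATE TABLE doc.dadata;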

When you say the ALTER CLUSTER/TABLE commands do not work, what do you mean? What is the output of these queries? Do you get an error?

Hi @jayeff !

I use 4.8.1.
Yes, I restarted everything multiple times. All nodes are in good health: no watermarks or disk problems, no RAM problems, no network problems.

I shortened the output of SHOW CREATE TABLE because I don’t think our fields are the problem:

CREATE TABLE IF NOT EXISTS "doc"."dadata" (
...
)
CLUSTERED INTO 3 SHARDS
PARTITIONED BY ("yearreceived", "monthreceived")
WITH (
   "allocation.max_retries" = 5,
   "blocks.metadata" = false,
   "blocks.read" = false,
   "blocks.read_only" = false,
   "blocks.read_only_allow_delete" = false,
   "blocks.write" = false,
   codec = 'best_compression',
   column_policy = 'strict',
   "mapping.total_fields.limit" = 1000,
   max_ngram_diff = 1,
   max_shingle_diff = 3,
   number_of_replicas = '0',
   refresh_interval = 1000,
   "routing.allocation.enable" = 'all',
   "routing.allocation.total_shards_per_node" = -1,
   "store.type" = 'fs',
   "translog.durability" = 'REQUEST',
   "translog.flush_threshold_size" = 536870912,
   "translog.sync_interval" = 5000,
   "unassigned.node_left.delayed_timeout" = 60000,
   "write.wait_for_active_shards" = '1'
)

The ALTER commands return

ALTER OK, 1 record affected (0.197 seconds)

Without any visible effect or running job.

Does number_of_replicas = '0' also hold for all partitions?

If yes, then this is the reason why REROUTE PROMOTE REPLICA will not work, as there is no replica to promote. I assume REROUTE RETRY FAILED does not work because the shard is considered stale or corrupt.
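
To see what the allocator thinks is wrong, you can query sys.allocations for the unassigned shard and look at the explanation and decisions columns, for example (partition_ident taken from your query above):

SELECT shard_id, current_state, explanation, decisions FROM sys.allocations WHERE table_name = 'dadata' AND partition_ident = '082j4c1i6813e' AND current_state = 'UNASSIGNED';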

We recommend having at least 1 replica configured to prevent potential data loss.

Do you have a snapshot/backup of your partition that you can restore?
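
If you do, restoring only the affected partition could look roughly like this (repository and snapshot names are placeholders for whatever you have configured):

RESTORE SNAPSHOT my_repo.my_snapshot TABLE doc.dadata PARTITION (yearreceived = '2022', monthreceived = '7') WITH (wait_for_completion = true);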

How do I show if it’s set on a specific partition?

How can the shard be corrupt when the Lucene check said it is not corrupt? And why doesn’t even CrateDB log anything that would indicate it is corrupt?

No, I don’t have a snapshot of this shard. The other two shards in this partition work flawlessly, and it occurred all of a sudden. The data resides on a filesystem that is protected against corruption, so it’s not a single failing disk or anything like that.

With the following query you can inspect all partitions of a table:

select table_name, partition_ident, number_of_replicas from information_schema.table_partitions where table_name = '<table_name>';
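
For your table that would be, for example (the "values" column shows the partition key values):

select table_name, partition_ident, "values", number_of_replicas from information_schema.table_partitions where table_name = 'dadata';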

I can only speculate, as I don’t have insight into your setup. Maybe some node failure caused the staleness/corruption.

I’m sorry that I can’t be of more help.

OK, thanks!

The number of replicas differs across the partitions. Maybe some change in the team’s structure caused this.

OK, so there is no chance to bring the shard back to life, even when the Lucene check finds no error?
Can I maybe find more info in some deeper log? Or can I force-allocate the shard somehow?

@jayeff, all of a sudden 3 more shards in the same index went corrupt, without any error in the logs or any hardware/network failure. Same as above: Lucene doesn’t see any errors when I check them.

I don’t know what to do anymore. Holding replicas simply because shards can go corrupt at any time seems a bit strange. We also run some bigger Elasticsearch/OpenSearch clusters on the same hardware base, and we have never seen anything like this.

The number of replicas does not change by itself. Maybe the value was changed at some point with ALTER TABLE ... SET.

Unfortunately the affected partition seems to be one without a replica :disappointed:

I would recommend updating existing partitions that currently have number_of_replicas set to 0 to 1 replica; see the example below.
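
Something along these lines should work (partition values taken from your REROUTE statement; ALTER TABLE without a PARTITION clause should apply to the table and its existing partitions, but please verify on a test setup first):

-- raise the replica count for one specific partition
ALTER TABLE doc.dadata PARTITION (yearreceived = '2022', monthreceived = '7') SET (number_of_replicas = 1);

-- or raise it for the whole table
ALTER TABLE doc.dadata SET (number_of_replicas = 1);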

I’m afraid it won’t be possible to bring this shard back :pensive: Force allocation would be done via REROUTE RETRY FAILED or REROUTE PROMOTE REPLICA, which did not work in your case.

Really sorry to hear that. We have never seen such behaviour in the CrateDB clusters we host, and I don’t have an explanation for what is going on.

That said, shards in CrateDB don’t simply go corrupt at random, so I believe something else must be afoot.

My guess would be something related to hardware or networking (though it is strange that other clusters on the same hardware seemingly aren’t affected). Did you maybe recently roll out new software which changed how your CrateDB cluster is used (different reads, inserts, updates, deletes)? Is there anything in your monitoring pointing to an issue? Is it possible that your cluster is overloaded?

For CrateDB clusters running on CrateDB Cloud I could give more details, as it includes our monitoring, logging, 24x7 alerting, support, backups, etc. It’s difficult to diagnose such outlier issues from afar without this :pensive:

OK, so is it possible to drop only a specific shard then?

Sorry, but this isn’t supported by CrateDB.

OK, thanks. Is there a way to get rid of the files and initialize an empty shard?

I have never tried this before, so I cannot say if it would work. If you try it out, I would suggest testing it in a dev environment beforehand and recommend taking a full backup in case anything goes awry :crossed_fingers: