What version of CrateDB are you running? Can you please share the CREATE TABLE statement for the affected table with us (SHOW CREATE TABLE)? Is the node in good health? Did you try restarting the node?
When you say the ALTER CLUSTER/TABLE commands do not work, what do you mean? What is the output of these queries? Do you get an error?
How can the shard be corrupt when the Lucene check says it is not corrupt? Even CrateDB doesn't log anything that would indicate it is corrupt.
No, I don't have a snapshot of this shard. The other 2 in this partition work flawlessly, and it occurred all of a sudden. It resides on a filesystem which is protected against corruption, so not a single disk or something.
I don't know what to do anymore. Simply holding replicas in case shards can go corrupt at any time is a little bit strange. We also have some bigger Elasticsearch/OpenSearch clusters on the same hardware base, and we never saw things like that.
The number of replicas does not change by itself. Maybe the value was changed at some point with ALTER TABLE SET.
Unfortunately, the affected partition seems to be one without a replica.
I would recommend that you raise the number of replicas to 1 on all existing partitions that currently have zero replicas.
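As a rough sketch of how that could be scripted: the helper below builds one ALTER TABLE ... PARTITION ... SET (number_of_replicas = 1) statement per partition. The function names and the example table doc.metrics with partition column day are made up for illustration, and the lookup query assumes that information_schema.table_partitions reports number_of_replicas as text. Please review the generated statements before running them with your usual client.

```python
# Lookup query for partitions without replicas (assumes number_of_replicas
# is exposed as text in information_schema.table_partitions).
FIND_UNREPLICATED = (
    'SELECT table_schema, table_name, "values" '
    "FROM information_schema.table_partitions "
    "WHERE number_of_replicas = '0'"
)


def quote_ident(name):
    """Double-quote an identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value):
    """Render a partition value as a SQL literal."""
    if value is None:
        raise ValueError("NULL partition values are not handled in this sketch")
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


def add_replica_statement(schema, table, partition_values, replicas=1):
    """Build the ALTER TABLE statement for one partition.

    partition_values maps partition column names to their values,
    like the "values" column returned by FIND_UNREPLICATED.
    """
    cols = ", ".join(
        f"{quote_ident(col)} = {quote_literal(val)}"
        for col, val in partition_values.items()
    )
    return (
        f"ALTER TABLE {quote_ident(schema)}.{quote_ident(table)} "
        f"PARTITION ({cols}) SET (number_of_replicas = {int(replicas)})"
    )


# Hypothetical example: table doc.metrics, partitioned by "day".
print(add_replica_statement("doc", "metrics", {"day": "2024-01-01"}))
```

Feeding each row returned by FIND_UNREPLICATED into add_replica_statement gives one statement per affected partition.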
I’m afraid that it won’t be possible to bring this shard back. Force allocation would be done by REROUTE RETRY FAILED or REROUTE PROMOTE REPLICA, which did not work in your case.
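For reference, here is a small sketch of what those two statements look like when built from a script. The schema, table, shard id and node name are placeholders, not values from your cluster. For a partitioned table a PARTITION (...) clause would also be needed before REROUTE, which this sketch leaves out. PROMOTE REPLICA can only work when a replica copy of the shard still exists somewhere, and setting accept_data_loss to true can discard writes, so treat it as a last resort.

```python
# Retries allocation of shards whose earlier allocation attempts failed.
RETRY_FAILED = "ALTER CLUSTER REROUTE RETRY FAILED"


def quote_ident(name):
    """Double-quote an identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def promote_replica_statement(schema, table, shard_id, node,
                              accept_data_loss=False):
    """Build a REROUTE PROMOTE REPLICA statement for one shard.

    All arguments are placeholders supplied by the caller; nothing here
    is read from a running cluster.
    """
    flag = "TRUE" if accept_data_loss else "FALSE"
    node_literal = "'" + node.replace("'", "''") + "'"
    return (
        f"ALTER TABLE {quote_ident(schema)}.{quote_ident(table)} "
        f"REROUTE PROMOTE REPLICA SHARD {int(shard_id)} ON {node_literal} "
        f"WITH (accept_data_loss = {flag})"
    )


# Hypothetical example: shard 2 of doc.metrics on a node named "node-1".
print(RETRY_FAILED)
print(promote_replica_statement("doc", "metrics", 2, "node-1",
                                accept_data_loss=True))
```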
Really sorry to hear that. We never saw such behaviour in the CrateDB clusters we host, and I don’t have an explanation for what is going on.
That said, shards in CrateDB don’t simply go corrupt at random, so I believe something else must be afoot.
My guess would be something related to hardware or networking (though it is strange that other clusters on the same hardware seemingly aren’t affected). Did you maybe recently roll out new software that changed how your CrateDB cluster is used (different reads, inserts, updates, deletes)? Is there anything in your monitoring pointing to an issue? Is it possible that your cluster is overloaded?
For CrateDB clusters running on CrateDB Cloud I could give more details, as it includes our monitoring, logging, 24x7 alerting, support, backups, etc. It’s difficult to diagnose such outlier issues from afar without this.