TranslogCorruptedException

Hi guys,

I am running a small single-node deployment, version 4.1.8. After a reboot, I encountered the following problem:

servicesnode.1.o10owwaac4vg@b2-15-smart-city    | org.elasticsearch.indices.recovery.RecoveryFailedException: [mtwastemanagement.etwastecontainer][0]: Recovery failed on {servicesnode}{NQOmfBYLRZ-w6g4pQagm9w}{iY4_Q7uwRuqKVOQ66ZJnhQ}{10.0.1.134}{10.0.1.134:4300}{http_address=10.0.1.134:4200}
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$6(IndexShard.java:2044) [crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at java.lang.Thread.run(Thread.java:830) [?:?]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: failed recovery
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:347) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:93) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1553) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$6(IndexShard.java:2040) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	... 4 more
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: translog from source [/data/data/nodes/0/indices/V2nTFdlwQN-berRDG4kQtQ/0/translog] is corrupted
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.translog.Translog.readCheckpoint(Translog.java:1809) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.translog.Translog.readGlobalCheckpoint(Translog.java:1796) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1319) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:1282) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:426) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:95) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:303) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:93) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1553) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$6(IndexShard.java:2040) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	... 4 more
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | Caused by: java.nio.file.NoSuchFileException: /data/data/nodes/0/indices/V2nTFdlwQN-berRDG4kQtQ/0/translog/translog-6537.tlog
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116) ~[?:?]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:182) ~[?:?]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at java.nio.channels.FileChannel.open(FileChannel.java:292) ~[?:?]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at java.nio.channels.FileChannel.open(FileChannel.java:345) ~[?:?]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.translog.Translog.readCheckpoint(Translog.java:1804) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.translog.Translog.readGlobalCheckpoint(Translog.java:1796) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1319) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:1282) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:426) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:95) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:303) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:93) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1553) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$6(IndexShard.java:2040) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	... 4 more

The status of the table in the graphical interface is as follows:

150,100 Records (34.6 MB)
50,034 Underreplicated Records / 50,034 Unavailable Records / 4 Shards / 2-all Replicas

Following what I have been able to read in the documentation and similar problems, I have executed the following command:

select * from sys.allocations where current_state != 'STARTED' limit 100;
{
  "cols": [
    "current_state",
    "decisions",
    "explanation",
    "node_id",
    "partition_ident",
    "primary",
    "shard_id",
    "table_name",
    "table_schema"
  ],
  "col_types": [
    4,
    [
      100,
      12
    ],
    4,
    4,
    4,
    3,
    9,
    4,
    4
  ],
  "rows": [
    [
      "UNASSIGNED",
      [
        {
          "node_name": "servicesnode",
          "explanations": [
            "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually execute 'ALTER CLUSTER REROUTE RETRY FAILED' to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2022-10-25T07:06:47.401Z], failed_attempts[5], delayed=false, details[failed shard on node [NQOmfBYLRZ-w6g4pQagm9w]: failed recovery, failure RecoveryFailedException[[mtwastemanagement.etwastecontainer][0]: Recovery failed on {servicesnode}{NQOmfBYLRZ-w6g4pQagm9w}{iY4_Q7uwRuqKVOQ66ZJnhQ}{10.0.1.134}{10.0.1.134:4300}{http_address=10.0.1.134:4200}]; nested: IndexShardRecoveryException[failed recovery]; nested: TranslogCorruptedException[translog from source [/data/data/nodes/0/indices/V2nTFdlwQN-berRDG4kQtQ/0/translog] is corrupted]; nested: NoSuchFileException[/data/data/nodes/0/indices/V2nTFdlwQN-berRDG4kQtQ/0/translog/translog-6537.tlog]; ], allocation_status[deciders_no]]]"
          ],
          "node_id": "NQOmfBYLRZ-w6g4pQagm9w"
        }
      ],
      "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
      null,
      null,
      true,
      0,
      "etwastecontainer",
      "mtwastemanagement"
    ]
  ],
  "rowcount": 1,
  "duration": 3.479047
}

Following the recommendation in the explanation I have executed the command:

ALTER CLUSTER REROUTE RETRY FAILED

but the error is still displayed, as shown in the log.
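
For reference, the same statements can be sent from a shell through CrateDB's HTTP endpoint, which is handy for re-checking the allocation state after each retry. This is a minimal sketch, assuming curl is installed and the node is reachable at the http_address shown in the log; CRATE_URL, sql_payload and crate_sql are made-up names for illustration:

```shell
# Build the JSON body expected by CrateDB's HTTP endpoint (/_sql).
# Backslashes and double quotes in the statement are escaped for JSON.
sql_payload() {
    escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
    printf '{"stmt": "%s"}' "$escaped"
}

# Send one statement to the node. CRATE_URL is a hypothetical variable;
# the default host and port are the http_address from the log above.
crate_sql() {
    curl -sS -H 'Content-Type: application/json' \
        -X POST "${CRATE_URL:-http://10.0.1.134:4200}/_sql" \
        -d "$(sql_payload "$1")"
}

# Usage (needs a running node, so left commented out):
# crate_sql "ALTER CLUSTER REROUTE RETRY FAILED"
# crate_sql "select table_name, shard_id, current_state from sys.allocations where current_state != 'STARTED'"
```

Keeping the payload builder separate from the curl call makes it possible to inspect what would be sent before touching the cluster.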

At this point, I don’t know what else to do. Do you know of a way to fix the problem? Alternatively, I would accept losing the unavailable records if I can keep the rest.

I appreciate any help you can provide.
Best regards

It looks like the node may have been shut down uncleanly, e.g. translog files were not correctly fsynced to disk, a disk failure occurred, etc.

In a multi-node setup with at least 1 replica configured, CrateDB would automatically recover from a healthy copy of the shard.

There are two options to recover from this corrupted state. Unfortunately, both may result in data loss:

a) Download Elasticsearch and use its elasticsearch-shard command line tool (see the elasticsearch-shard page of the Elasticsearch Guide [8.4]). This tool will try to detect and repair the translog files. You’d need to pass the full path of the index directory using the --dir option.

b) Delete all files inside /data/data/nodes/0/indices/V2nTFdlwQN-berRDG4kQtQ/0/translog/. This may result in greater data loss than option a), since it deletes the whole translog and thus any records not yet committed to Lucene.
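
Since both options can lose data, it may be worth keeping a copy of the translog files before removing anything. The sketch below is a variation on option b): instead of deleting the files outright, it moves them into a backup directory so the step can be undone. It assumes the CrateDB node is stopped; move_translog_aside and the backup path in the usage line are hypothetical names, not part of any tool.

```shell
# Move every file out of a shard's translog directory into a backup
# directory, so the step can be undone. Both paths are arguments.
# Run this only while the CrateDB node is stopped.
move_translog_aside() (
    translog_dir=$1
    backup_dir=$2
    [ -d "$translog_dir" ] || { echo "no such directory: $translog_dir" >&2; exit 1; }
    mkdir -p "$backup_dir" || exit 1
    for f in "$translog_dir"/*; do
        [ -e "$f" ] || continue      # directory is already empty
        mv -- "$f" "$backup_dir"/ || exit 1
    done
)

# Usage (the backup location is an assumption, pick any path with free space):
# move_translog_aside /data/data/nodes/0/indices/V2nTFdlwQN-berRDG4kQtQ/0/translog /root/translog-backup
```

The function body runs in a subshell, so a failure aborts the move without exiting the calling shell.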

Hi @smu , thank you for your quick response.

I have tried solution a). I installed Elasticsearch and ran the command, but it seems I either did not follow the proper installation method or am missing some steps. Maybe you can help me.

[root@52a7db6d10dc bin]# ./elasticsearch-shard remove-corrupted-data --dir /data/data/nodes/0/indices/V2nTFdlwQN-berRDG4kQtQ
warning: ignoring JAVA_HOME=/opt/jdk-13.0.1; using bundled JDK
------------------------------------------------------------------------

    WARNING: Elasticsearch MUST be stopped before running this tool.

-----------------------------------------------------------------------

  Please make a complete backup of your index before using this tool.

-----------------------------------------------------------------------
Exception in thread "main" org.elasticsearch.ElasticsearchException: no node meta data is found, node has not been started yet?
	at org.elasticsearch.cluster.coordination.ElasticsearchNodeCommand.createPersistedClusterStateService(ElasticsearchNodeCommand.java:107)
	at org.elasticsearch.index.shard.RemoveCorruptedShardDataCommand.processDataPaths(RemoveCorruptedShardDataCommand.java:241)
	at org.elasticsearch.cluster.coordination.ElasticsearchNodeCommand.processDataPaths(ElasticsearchNodeCommand.java:142)
	at org.elasticsearch.cluster.coordination.ElasticsearchNodeCommand.execute(ElasticsearchNodeCommand.java:160)
	at org.elasticsearch.common.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:54)
	at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:85)
	at org.elasticsearch.cli.MultiCommand.execute(MultiCommand.java:94)
	at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:85)
	at org.elasticsearch.cli.Command.main(Command.java:50)
	at org.elasticsearch.launcher.CliToolLauncher.main(CliToolLauncher.java:64)

These are the steps I have followed for the installation.

$ rpm --import https://artifacts.elastic.co/GPG-KEY-elasticsearch
$ nano /etc/yum.repos.d/elasticsearch.repo
$ yum install --enablerepo=elasticsearch elasticsearch
$ cd /usr/share/elasticsearch/bin/

Thank you in advance.

Hi @versatildefuy,

Hm, right, the data directory must be passed using the -E path.data argument instead of the --dir one. It seems the --dir option is not usable on its own, which is a bit strange.
So in your case the command should be:

./bin/elasticsearch-shard remove-corrupted-data -E path.data=/data/data/nodes/0/ --index <INDEX_NAME> --shard-id 0

Be aware that you must use the index name, not the index UUID. The UUID is what is used for storing the data on disk and what appears in the exception.

The --dir option can be used to pass in the index directory of the relevant shard directly, e.g. when the index name isn’t known.

./bin/elasticsearch-shard remove-corrupted-data -E path.data=/data/data/nodes/0/ --dir /data/data/nodes/0/indices/V2nTFdlwQN-berRDG4kQtQ/0/index
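
To make it easier to see which part of the path in the exception goes into which argument, here is a small sketch that splits the translog path into the node directory, the index UUID and the shard number. It assumes the on-disk layout seen in the exception (<node dir>/indices/<index UUID>/<shard id>/translog); shard_path_parts is a made-up helper name.

```shell
# Split the translog path from the exception into its parts, assuming
# the layout <node dir>/indices/<index UUID>/<shard id>/translog.
shard_path_parts() (
    shard_dir=${1%/translog}          # drop the trailing /translog
    shard_id=${shard_dir##*/}         # last path component
    index_dir=${shard_dir%/*}         # .../indices/<index UUID>
    index_uuid=${index_dir##*/}
    node_dir=${index_dir%/indices/*}  # everything before /indices/
    printf 'node_dir=%s\nindex_uuid=%s\nshard_id=%s\n' \
        "$node_dir" "$index_uuid" "$shard_id"
)

shard_path_parts /data/data/nodes/0/indices/V2nTFdlwQN-berRDG4kQtQ/0/translog
# prints:
# node_dir=/data/data/nodes/0
# index_uuid=V2nTFdlwQN-berRDG4kQtQ
# shard_id=0
```

Here node_dir corresponds to the value given to -E path.data in the commands above, and shard_id to the shard number. The UUID only identifies the directory and cannot be passed to --index.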