Frozen Cluster with FileSystemException


#1

I encountered an error on my 4 node Crate cluster with the following exception: java.nio.file.FileSystemException: /mnt/crate/nodes/0/node.lock: Read-only file system

Writes were impossible across the whole cluster for as long as the error persisted. I eventually resolved it by rebooting the machines. What could have caused this, and is there any way to prevent it in the future?

Thank you.


#2

This must have been caused by some event on the system CrateDB is running on. You’d probably have to check the kernel logs to figure out what happened (via dmesg or journalctl -k).
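For example, a quick way to search the kernel log for events that force a filesystem read-only (a sketch; the grep patterns and the date are illustrative, and the exact message text varies by filesystem):

```shell
# Search the kernel ring buffer for read-only remounts or disk I/O errors
dmesg 2>/dev/null | grep -iE 'read-only|remount|i/o error' || true
# With persistent journald logs, restrict the search to the incident window:
journalctl -k --since "2019-02-16" 2>/dev/null | grep -iE 'read-only|remount|i/o error' || true
```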

CrateDB itself doesn’t have the capability to change a filesystem to read-only.


#3

Hello. Thanks for the prompt response.
I ran into the same problem again over the last few days, and I have more information, presented below:

The following are the exact exceptions that I encountered (in order):

– Logs begin at Sat 2019-02-09 09:16:44 UTC, end at Sun 2019-02-17 09:38:44 UTC. –
Feb 16 18:16:50 crate-4 crate[3905]: [2019-02-16T18:16:50,122][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [li
Feb 16 18:16:50 crate-4 crate[3905]: java.lang.InternalError: a fault occurred in a recent unsafe memory access operat
Feb 16 18:16:50 crate-4 crate[3905]: at org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.loadBlock(SegmentTerm
Feb 16 18:16:50 crate-4 crate[3905]: at org.apache.lucene.codecs.blocktree.SegmentTermsEnum.seekExact(SegmentTermsEnum
Feb 16 18:16:50 crate-4 crate[3905]: at org.elasticsearch.common.lucene.uid.PerThreadIDAndVersionLookup.getDocID(PerTh
Feb 16 18:16:50 crate-4 crate[3905]: at org.elasticsearch.common.lucene.uid.PerThreadIDAndVersionLookup.lookupVersion(
Feb 16 18:16:50 crate-4 crate[3905]: at org.elasticsearch.common.lucene.uid.VersionsResolver.loadDocIdAndVersion(Versi
Feb 16 18:16:50 crate-4 crate[3905]: at org.elasticsearch.common.lucene.uid.VersionsResolver.loadVersion(VersionsResol
Feb 16 18:16:50 crate-4 crate[3905]: at org.elasticsearch.index.engine.InternalEngine.loadCurrentVersionFromIndex(Inte
Feb 16 18:16:50 crate-4 crate[3905]: at org.elasticsearch.index.engine.InternalEngine.resolveDocVersion(InternalEngine
Feb 16 18:16:50 crate-4 crate[3905]: at org.elasticsearch.index.engine.InternalEngine.compareOpToLuceneDocBasedOnVersi
Feb 16 18:16:50 crate-4 crate[3905]: at org.elasticsearch.index.engine.InternalEngine.planIndexingAsNonPrimary(Interna
Feb 16 18:16:50 crate-4 crate[3905]: at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:496) ~
Feb 16 18:16:50 crate-4 crate[3905]: at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:558) ~[crate-ap
Feb 16 18:16:50 crate-4 crate[3905]: at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:547) ~[crate-ap
Feb 16 18:16:50 crate-4 crate[3905]: at io.crate.execution.dml.upsert.TransportShardUpsertAction.shardIndexOperationOn
Feb 16 18:16:50 crate-4 crate[3905]: at io.crate.execution.dml.upsert.TransportShardUpsertAction.processRequestItemsOn
Feb 16 18:16:50 crate-4 crate[3905]: at io.crate.execution.dml.upsert.TransportShardUpsertAction.processRequestItemsOn
Feb 16 18:16:50 crate-4 crate[3905]: at io.crate.execution.dml.TransportShardAction$2.call(TransportShardAction.java:1
Feb 16 18:16:50 crate-4 crate[3905]: at io.crate.execution.dml.TransportShardAction$2.call(TransportShardAction.java:1
Feb 16 18:16:50 crate-4 crate[3905]: at io.crate.execution.dml.TransportShardAction.wrapOperationInKillable(TransportS
Feb 16 18:16:50 crate-4 crate[3905]: at io.crate.execution.dml.TransportShardAction.shardOperationOnReplica(TransportS
Feb 16 18:16:50 crate-4 crate[3905]: at io.crate.execution.dml.TransportShardAction.shardOperationOnReplica(TransportS
Feb 16 18:16:50 crate-4 crate[3905]: at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncR
Feb 16 18:16:50 crate-4 crate[3905]: at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncR
Feb 16 18:16:50 crate-4 crate[3905]: at org.elasticsearch.index.shard.IndexShardOperationsLock.acquire(IndexShardOpera
Feb 16 18:16:50 crate-4 crate[3905]: at org.elasticsearch.index.shard.IndexShard.acquireReplicaOperationLock(IndexShar
Feb 16 18:16:50 crate-4 crate[3905]: at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncR
Feb 16 18:16:50 crate-4 crate[3905]: at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable
Feb 16 18:16:50 crate-4 crate[3905]: at org.elasticsearch.action.support.replication.TransportReplicationAction$Replic
Feb 16 18:16:50 crate-4 crate[3905]: at org.elasticsearch.action.support.replication.TransportReplicationAction$Replic
Feb 16 18:16:50 crate-4 crate[3905]: at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(Requ
Feb 16 18:16:50 crate-4 crate[3905]: at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.jav
Feb 16 18:16:50 crate-4 crate[3905]: at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstra
Feb 16 18:16:50 crate-4 crate[3905]: at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable
Feb 16 18:16:50 crate-4 crate[3905]: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149
Feb 16 18:16:50 crate-4 crate[3905]: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624
Feb 16 18:16:50 crate-4 crate[3905]: at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
Feb 16 18:16:50 crate-4 systemd[1]: crate.service: main process exited, code=exited, status=128/n/a
Feb 16 18:16:50 crate-4 systemd[1]: Unit crate.service entered failed state.
Feb 16 18:16:50 crate-4 systemd[1]: crate.service failed.

And

[2019-02-17T09:40:07,584][ERROR][o.e.b.BootstrapProxy ] Exception
java.lang.IllegalStateException: failed to obtain node locks, tried [[/var/lib/crate/bold, /mnt/crate/bold]] with lock id [0]; maybe these locations are not writable or multiple nodes were started without increasing [node.max_local_storage_nodes] (was [1])?
at org.elasticsearch.env.NodeEnvironment.&lt;init&gt;(NodeEnvironment.java:263) ~[crate-app-2.3.4.jar:2.3.4]
at org.elasticsearch.node.Node.&lt;init&gt;(Node.java:268) ~[crate-app-2.3.4.jar:2.3.4]
at io.crate.node.CrateNode.&lt;init&gt;(CrateNode.java:62) ~[crate-app-2.3.4.jar:2.3.4]
at org.elasticsearch.bootstrap.BootstrapProxy$1.&lt;init&gt;(BootstrapProxy.java:199) ~[crate-app-2.3.4.jar:2.3.4]
at org.elasticsearch.bootstrap.BootstrapProxy.setup(BootstrapProxy.java:199) ~[crate-app-2.3.4.jar:2.3.4]
at org.elasticsearch.bootstrap.BootstrapProxy.init(BootstrapProxy.java:282) [crate-app-2.3.4.jar:2.3.4]
at io.crate.bootstrap.CrateDB.init(CrateDB.java:138) [crate-app-2.3.4.jar:2.3.4]
at io.crate.bootstrap.CrateDB.execute(CrateDB.java:118) [crate-app-2.3.4.jar:2.3.4]
at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:76) [crate-app-2.3.4.jar:2.3.4]
at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:134) [crate-app-2.3.4.jar:2.3.4]
at org.elasticsearch.cli.Command.main(Command.java:90) [crate-app-2.3.4.jar:2.3.4]
at io.crate.bootstrap.CrateDB.main(CrateDB.java:87) [crate-app-2.3.4.jar:2.3.4]
at io.crate.bootstrap.CrateDB.main(CrateDB.java:80) [crate-app-2.3.4.jar:2.3.4]
Caused by: java.io.IOException: failed to obtain lock on /mnt/crate/nodes/0
at org.elasticsearch.env.NodeEnvironment.&lt;init&gt;(NodeEnvironment.java:242) ~[crate-app-2.3.4.jar:2.3.4]
… 12 more
Caused by: java.nio.file.FileSystemException: /mnt/crate/nodes/0/node.lock: Read-only file system
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91) ~[?:1.8.0_161]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:1.8.0_161]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[?:1.8.0_161]
at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177) ~[?:1.8.0_161]
at java.nio.channels.FileChannel.open(FileChannel.java:287) ~[?:1.8.0_161]
at java.nio.channels.FileChannel.open(FileChannel.java:335) ~[?:1.8.0_161]
at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:113) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
at org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:41) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
at org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:45) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
at org.elasticsearch.env.NodeEnvironment.&lt;init&gt;(NodeEnvironment.java:228) ~[crate-app-2.3.4.jar:2.3.4]
… 12 more

Using the Crate documentation, I found the ALTER CLUSTER statement and executed it as follows: ALTER CLUSTER REROUTE RETRY FAILED; That reduced the number of shards that were not started, but a couple remain with the following explanation:
shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2019-02-18T10:59:33.191Z], failed_attempts[13], delayed=false, details[failed to create shard, failure FileSystemException[/mnt/crate/nodes/0/indices/cGCye6v4RS-UVjhFuxE6eA/7/_state/state-1.st.tmp: Read-only file system]], allocation_status[no_attempt]]]
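To see which shards are still not started and why, something like this can be run against the cluster's sys.allocations table (a sketch; it assumes the HTTP endpoint on the default port 4200 on localhost, so adjust the host for your setup):

```shell
# List shards that are not in the STARTED state, with the allocator's explanation
STMT="SELECT table_name, shard_id, current_state, explanation FROM sys.allocations WHERE current_state <> 'STARTED'"
curl -sS -H 'Content-Type: application/json' \
     -X POST 'http://localhost:4200/_sql' \
     -d "{\"stmt\": \"$STMT\"}" || true
```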

Does this provide more context for the problem? What should I do in this case? Simply restarting everything no longer works.

Thanks


#4

OK, I figured out what the problem was and how to fix it. You were exactly right before: it was disk-related. What I did was the following:

sudo service crate stop                        # stop the crate service
cp -r /mnt/crate ~/backup/                     # back up all my files first, of course
df -h                                          # note the device path, here /dev/mapper/volume-1e719f20p1
sudo umount /mnt/                              # unmount the device
sudo fsck /dev/mapper/volume-1e719f20p1        # check the disk; everything was "clean" in this case
sudo mount /dev/mapper/volume-1e719f20p1 /mnt/ # mount it again
sudo service crate start
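Before going through the fsck cycle, it is worth confirming that the filesystem really is mounted read-only (a sketch; "/mnt" is assumed from the commands above, so adjust it to your data volume's mount point):

```shell
# Print the mount options for the data volume; "ro" among them means the
# kernel has remounted the filesystem read-only
MOUNTPOINT=/mnt
awk -v mp="$MOUNTPOINT" '$2 == mp {print $4}' /proc/mounts
```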

If you encounter an error like this afterwards:
Likely root cause: org.apache.lucene.index.CorruptIndexException: misplaced codec footer (file truncated?): length=0 but footerLength==16 (resource=SimpleFSIndexInput(path="/mnt/crate/nodes/0/indices/8OGRcEOxQhqEN__XG1JtBQ/_state/state-395.st"))

check whether it is a zero-byte file. If it is, simply rename the file and start your Crate node again. After that, everything worked.
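Finding such zero-byte state files can be done with a command like the following (a sketch; the data path is taken from the error message above, so verify it locally before renaming anything):

```shell
# Find zero-byte files under any _state directory in the node's data dir
find /mnt/crate/nodes/0 -path '*/_state/*' -type f -size 0 -print 2>/dev/null || true
# Rename (don't delete) any hit so it can be restored later if needed, e.g.:
#   mv /path/to/state-NNN.st /path/to/state-NNN.st.bak
```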

Thanks for an awesome product.