How to fix "Too many open files" on a 3-node cluster with almost no data?

Hello,

We have a 3-node CrateDB development cluster running CrateDB version 4.6.6. The underlying servers run Debian GNU/Linux 11.1, and we install the CrateDB packages via the official CrateDB APT repository.

When I tried to view the web-based admin UI, I received the following error:

/usr/share/crate/lib/site/index.html: Too many open files

This is a development cluster with almost no data on it and minimal load.

When I ran an lsof command on the first node of that development CrateDB cluster to see the open files, the command took a few minutes to complete and the number of open files was more than 8 million (over 20 times more than on the first node of our production cluster):

$ sudo lsof | wc -l
8133395

$ sudo lsof | wc -l
8133304

Checking the status of crate.service, I get the following:

$ sudo systemctl status crate.service
● crate.service - CrateDB Server
     Loaded: loaded (/lib/systemd/system/crate.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2021-12-16 09:55:15 UTC; 4 weeks 0 days ago
       Docs: https://crate.io/docs/
   Main PID: 6204 (java)
      Tasks: 124 (limit: 77076)
     Memory: 13.7G
        CPU: 2h 15min 53.803s
     CGroup: /system.slice/crate.service
             └─6204 /usr/share/crate/jdk/bin/java -Xms16G -Xmx16G -Djava.awt.headless=true -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:InitiatingHeapOccupancyPercent=30 -Xlog:gc*,gc+age=trace>

Jan 13 10:15:04 dev-crate-dn-001 crate[6204]: io.netty.channel.unix.Errors$NativeIoException: accept(..) failed: Too many open files
Jan 13 10:15:05 dev-crate-dn-001 crate[6204]: [2022-01-13T10:15:05,422][WARN ][i.n.c.DefaultChannelPipeline] [dev-crate-dn-001] An exceptionCaught() event was fired, and it reached at the >
Jan 13 10:15:05 dev-crate-dn-001 crate[6204]: io.netty.channel.unix.Errors$NativeIoException: accept(..) failed: Too many open files
Jan 13 10:15:06 dev-crate-dn-001 crate[6204]: [2022-01-13T10:15:06,150][WARN ][o.e.c.c.ClusterFormationFailureHelper] [dev-crate-dn-001] master not discovered or elected yet, an election r>
Jan 13 10:15:06 dev-crate-dn-001 crate[6204]: [2022-01-13T10:15:06,422][WARN ][i.n.c.DefaultChannelPipeline] [dev-crate-dn-001] An exceptionCaught() event was fired, and it reached at the >
Jan 13 10:15:06 dev-crate-dn-001 crate[6204]: io.netty.channel.unix.Errors$NativeIoException: accept(..) failed: Too many open files
Jan 13 10:15:07 dev-crate-dn-001 crate[6204]: [2022-01-13T10:15:07,422][WARN ][i.n.c.DefaultChannelPipeline] [dev-crate-dn-001] An exceptionCaught() event was fired, and it reached at the >
Jan 13 10:15:07 dev-crate-dn-001 crate[6204]: io.netty.channel.unix.Errors$NativeIoException: accept(..) failed: Too many open files
Jan 13 10:15:08 dev-crate-dn-001 crate[6204]: [2022-01-13T10:15:08,422][WARN ][i.n.c.DefaultChannelPipeline] [dev-crate-dn-001] An exceptionCaught() event was fired, and it reached at the >
Jan 13 10:15:08 dev-crate-dn-001 crate[6204]: io.netty.channel.unix.Errors$NativeIoException: accept(..) failed: Too many open files

Another check:

~$ sudo systemctl status crate.service
● crate.service - CrateDB Server
     Loaded: loaded (/lib/systemd/system/crate.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2021-12-16 09:55:15 UTC; 4 weeks 0 days ago
       Docs: https://crate.io/docs/
   Main PID: 6204 (java)
      Tasks: 124 (limit: 77076)
     Memory: 13.7G
        CPU: 2h 15min 56.001s
     CGroup: /system.slice/crate.service
             └─6204 /usr/share/crate/jdk/bin/java -Xms16G -Xmx16G -Djava.awt.headless=true -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:InitiatingHeapOccupancyPercent=30 -Xlog:gc*,gc+age=trace>

Jan 13 10:22:36 dev-crate-dn-001 crate[6204]:         at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:318) ~[crate-server.jar:?]
Jan 13 10:22:36 dev-crate-dn-001 crate[6204]:         at org.elasticsearch.cluster.coordination.Coordinator.handleJoinRequest(Coordinator.java:452) ~[crate-server.jar:?]
Jan 13 10:22:36 dev-crate-dn-001 crate[6204]:         at org.elasticsearch.cluster.coordination.JoinHelper.lambda$new$0(JoinHelper.java:131) ~[crate-server.jar:?]
Jan 13 10:22:36 dev-crate-dn-001 crate[6204]:         at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) ~[crate-server.jar:?]
Jan 13 10:22:36 dev-crate-dn-001 crate[6204]:         at org.elasticsearch.transport.TransportService$6.doRun(TransportService.java:698) ~[crate-server.jar:?]
Jan 13 10:22:36 dev-crate-dn-001 crate[6204]:         at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [crate-server.jar:?]
Jan 13 10:22:36 dev-crate-dn-001 crate[6204]:         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
Jan 13 10:22:36 dev-crate-dn-001 crate[6204]:         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
Jan 13 10:22:36 dev-crate-dn-001 crate[6204]:         at java.lang.Thread.run(Thread.java:831) [?:?]
Jan 13 10:22:36 dev-crate-dn-001 crate[6204]: [2022-01-13T10:22:36,177][WARN ][o.e.c.c.ClusterFormationFailureHelper] [dev-crate-dn-001] master not discovered or elected yet, an election r>

Some more details from the logs:

        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) ~[crate-server.jar:?]
        at org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:266) ~[crate-server.jar:?]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [crate-server.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
        at java.lang.Thread.run(Thread.java:831) [?:?]
Caused by: java.nio.file.FileSystemException: /data/nodes/0/_state/write.lock: Too many open files
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:100) ~[?:?]
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) ~[?:?]
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
        at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:182) ~[?:?]
        at java.nio.channels.FileChannel.open(FileChannel.java:292) ~[?:?]
        at java.nio.channels.FileChannel.open(FileChannel.java:345) ~[?:?]
        at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:125) ~[lucene-core-8.8.2.jar:8.8.2 a92a05e195b775b30ca410bc0a26e8e79e7b3bfb - mdrob - 2021-04-06 16:33:27]
        at org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:41) ~[lucene-core-8.8.2.jar:8.8.2 a92a05e195b775b30ca410bc0a26e8e79e7b3bfb - mdrob - 2021-04-06 16:33:27]
        at org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:45) ~[lucene-core-8.8.2.jar:8.8.2 a92a05e195b775b30ca410bc0a26e8e79e7b3bfb - mdrob - 2021-04-06 16:33:27]
        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:923) ~[lucene-core-8.8.2.jar:8.8.2 a92a05e195b775b30ca410bc0a26e8e79e7b3bfb - mdrob - 2021-04-06 16:33:27]
        at org.elasticsearch.gateway.PersistedClusterStateService.createIndexWriter(PersistedClusterStateService.java:208) ~[crate-server.jar:?]
        at org.elasticsearch.gateway.PersistedClusterStateService.createWriter(PersistedClusterStateService.java:184) ~[crate-server.jar:?]
        at org.elasticsearch.gateway.GatewayMetaState$LucenePersistedState.getWriterSafe(GatewayMetaState.java:491) ~[crate-server.jar:?]
        at org.elasticsearch.gateway.GatewayMetaState$LucenePersistedState.setCurrentTerm(GatewayMetaState.java:447) ~[crate-server.jar:?]
        at org.elasticsearch.cluster.coordination.CoordinationState.handleStartJoin(CoordinationState.java:188) ~[crate-server.jar:?]
        at org.elasticsearch.cluster.coordination.Coordinator.joinLeaderInTerm(Coordinator.java:427) ~[crate-server.jar:?]
        at org.elasticsearch.cluster.coordination.Coordinator.ensureTermAtLeast(Coordinator.java:419) ~[crate-server.jar:?]
        at org.elasticsearch.cluster.coordination.Coordinator.handleJoin(Coordinator.java:931) ~[crate-server.jar:?]
        at java.util.Optional.ifPresent(Optional.java:178) ~[?:?]
        at org.elasticsearch.cluster.coordination.Coordinator.processJoinRequest(Coordinator.java:499) ~[crate-server.jar:?]
        at org.elasticsearch.cluster.coordination.Coordinator.lambda$handleJoinRequest$7(Coordinator.java:465) ~[crate-server.jar:?]
        at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[crate-server.jar:?]
        at org.elasticsearch.transport.ConnectionManager.connectToNode(ConnectionManager.java:121) ~[crate-server.jar:?]
        at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:334) ~[crate-server.jar:?]
        at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:318) ~[crate-server.jar:?]
        at org.elasticsearch.cluster.coordination.Coordinator.handleJoinRequest(Coordinator.java:452) ~[crate-server.jar:?]
        at org.elasticsearch.cluster.coordination.JoinHelper.lambda$new$0(JoinHelper.java:131) ~[crate-server.jar:?]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) ~[crate-server.jar:?]
        at org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:266) ~[crate-server.jar:?]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[crate-server.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]
        at java.lang.Thread.run(Thread.java:831) ~[?:?]
[2022-01-13T10:26:26,194][WARN ][o.e.c.c.ClusterFormationFailureHelper] [dev-crate-dn-001] master not discovered or elected yet, an election requires at least 2 nodes with ids from [oCYP8FLRTcWH20fAPboXmA, NLSbhbptSNaHuVwXlv-T8g, O7KFL7W3StGLLN6mTmQB7Q], have discovered [{dev-crate-dn-001}{NLSbhbptSNaHuVwXlv-T8g}{P0zOcABzRYywaX8ivj6DNA}{192.168.239.50}{192.168.239.50:4300}{http_address=192.168.239.50:4200}, {dev-crate-dn-002}{oCYP8FLRTcWH20fAPboXmA}{FXd6ml8vQr2vLSgSHJ3Pbw}{192.168.239.51}{192.168.239.51:4300}{http_address=192.168.239.51:4200}, {dev-crate-dn-003}{O7KFL7W3StGLLN6mTmQB7Q}{nc2jAKtZRpayr75Ck-fobw}{192.168.239.52}{192.168.239.52:4300}{http_address=192.168.239.52:4200}] which is a quorum; discovery will continue using [192.168.239.51:4300, 192.168.239.52:4300] from hosts providers and [{dev-crate-dn-003}{O7KFL7W3StGLLN6mTmQB7Q}{nc2jAKtZRpayr75Ck-fobw}{192.168.239.52}{192.168.239.52:4300}{http_address=192.168.239.52:4200}, {dev-crate-dn-001}{NLSbhbptSNaHuVwXlv-T8g}{P0zOcABzRYywaX8ivj6DNA}{192.168.239.50}{192.168.239.50:4300}{http_address=192.168.239.50:4200}, {dev-crate-dn-002}{oCYP8FLRTcWH20fAPboXmA}{FXd6ml8vQr2vLSgSHJ3Pbw}{192.168.239.51}{192.168.239.51:4300}{http_address=192.168.239.51:4200}] from last-known cluster state; node term 262158, last-accepted version 582 in term 4
[2022-01-13T10:26:34,558][INFO ][o.e.c.c.JoinHelper       ] [dev-crate-dn-001] failed to join {dev-crate-dn-002}{oCYP8FLRTcWH20fAPboXmA}{FXd6ml8vQr2vLSgSHJ3Pbw}{192.168.239.51}{192.168.239.51:4300}{http_address=192.168.239.51:4200} with JoinRequest{sourceNode={dev-crate-dn-001}{NLSbhbptSNaHuVwXlv-T8g}{P0zOcABzRYywaX8ivj6DNA}{192.168.239.50}{192.168.239.50:4300}{http_address=192.168.239.50:4200}, optionalJoin=Optional[Join{term=262160, lastAcceptedTerm=4, lastAcceptedVersion=582, sourceNode={dev-crate-dn-001}{NLSbhbptSNaHuVwXlv-T8g}{P0zOcABzRYywaX8ivj6DNA}{192.168.239.50}{192.168.239.50:4300}{http_address=192.168.239.50:4200}, targetNode={dev-crate-dn-002}{oCYP8FLRTcWH20fAPboXmA}{FXd6ml8vQr2vLSgSHJ3Pbw}{192.168.239.51}{192.168.239.51:4300}{http_address=192.168.239.51:4200}}]}
org.elasticsearch.transport.RemoteTransportException: [dev-crate-dn-002][192.168.239.51:4300][internal:cluster/coordination/join]
Caused by: org.elasticsearch.ElasticsearchException: java.nio.file.FileSystemException: /data/nodes/0/_state/write.lock: Too many open files
        at org.elasticsearch.ExceptionsHelper.convertToRuntime(ExceptionsHelper.java:53) ~[crate-server.jar:?]
        at org.elasticsearch.gateway.GatewayMetaState$LucenePersistedState.getWriterSafe(GatewayMetaState.java:500) ~[crate-server.jar:?]
        at org.elasticsearch.gateway.GatewayMetaState$LucenePersistedState.setCurrentTerm(GatewayMetaState.java:447) ~[crate-server.jar:?]
        at org.elasticsearch.cluster.coordination.CoordinationState.handleStartJoin(CoordinationState.java:188) ~[crate-server.jar:?]
        at org.elasticsearch.cluster.coordination.Coordinator.joinLeaderInTerm(Coordinator.java:427) ~[crate-server.jar:?]
        at org.elasticsearch.cluster.coordination.Coordinator.ensureTermAtLeast(Coordinator.java:419) ~[crate-server.jar:?]
        at org.elasticsearch.cluster.coordination.Coordinator.handleJoin(Coordinator.java:931) ~[crate-server.jar:?]
        at java.util.Optional.ifPresent(Optional.java:178) ~[?:?]
        at org.elasticsearch.cluster.coordination.Coordinator.processJoinRequest(Coordinator.java:499) ~[crate-server.jar:?]
        at org.elasticsearch.cluster.coordination.Coordinator.lambda$handleJoinRequest$7(Coordinator.java:465) ~[crate-server.jar:?]
        at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[crate-server.jar:?]
        at org.elasticsearch.transport.ConnectionManager.connectToNode(ConnectionManager.java:121) ~[crate-server.jar:?]
        at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:334) ~[crate-server.jar:?]
        at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:318) ~[crate-server.jar:?]
        at org.elasticsearch.cluster.coordination.Coordinator.handleJoinRequest(Coordinator.java:452) ~[crate-server.jar:?]
        at org.elasticsearch.cluster.coordination.JoinHelper.lambda$new$0(JoinHelper.java:131) ~[crate-server.jar:?]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) ~[crate-server.jar:?]
        at org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:266) ~[crate-server.jar:?]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[crate-server.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]
        at java.lang.Thread.run(Thread.java:831) [?:?]
Caused by: java.nio.file.FileSystemException: /data/nodes/0/_state/write.lock: Too many open files
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:100) ~[?:?]
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) ~[?:?]
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
        at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:182) ~[?:?]
        at java.nio.channels.FileChannel.open(FileChannel.java:292) ~[?:?]
        at java.nio.channels.FileChannel.open(FileChannel.java:345) ~[?:?]
        at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:125) ~[lucene-core-8.8.2.jar:8.8.2 a92a05e195b775b30ca410bc0a26e8e79e7b3bfb - mdrob - 2021-04-06 16:33:27]
        at org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:41) ~[lucene-core-8.8.2.jar:8.8.2 a92a05e195b775b30ca410bc0a26e8e79e7b3bfb - mdrob - 2021-04-06 16:33:27]
        at org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:45) ~[lucene-core-8.8.2.jar:8.8.2 a92a05e195b775b30ca410bc0a26e8e79e7b3bfb - mdrob - 2021-04-06 16:33:27]
        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:923) ~[lucene-core-8.8.2.jar:8.8.2 a92a05e195b775b30ca410bc0a26e8e79e7b3bfb - mdrob - 2021-04-06 16:33:27]
        at org.elasticsearch.gateway.PersistedClusterStateService.createIndexWriter(PersistedClusterStateService.java:208) ~[crate-server.jar:?]
        at org.elasticsearch.gateway.PersistedClusterStateService.createWriter(PersistedClusterStateService.java:184) ~[crate-server.jar:?]
        at org.elasticsearch.gateway.GatewayMetaState$LucenePersistedState.getWriterSafe(GatewayMetaState.java:491) ~[crate-server.jar:?]
        at org.elasticsearch.gateway.GatewayMetaState$LucenePersistedState.setCurrentTerm(GatewayMetaState.java:447) ~[crate-server.jar:?]
        at org.elasticsearch.cluster.coordination.CoordinationState.handleStartJoin(CoordinationState.java:188) ~[crate-server.jar:?]
        at org.elasticsearch.cluster.coordination.Coordinator.joinLeaderInTerm(Coordinator.java:427) ~[crate-server.jar:?]
        at org.elasticsearch.cluster.coordination.Coordinator.ensureTermAtLeast(Coordinator.java:419) ~[crate-server.jar:?]
        at org.elasticsearch.cluster.coordination.Coordinator.handleJoin(Coordinator.java:931) ~[crate-server.jar:?]
        at java.util.Optional.ifPresent(Optional.java:178) ~[?:?]
        at org.elasticsearch.cluster.coordination.Coordinator.processJoinRequest(Coordinator.java:499) ~[crate-server.jar:?]
        at org.elasticsearch.cluster.coordination.Coordinator.lambda$handleJoinRequest$7(Coordinator.java:465) ~[crate-server.jar:?]
        at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[crate-server.jar:?]
        at org.elasticsearch.transport.ConnectionManager.connectToNode(ConnectionManager.java:121) ~[crate-server.jar:?]
        at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:334) ~[crate-server.jar:?]
        at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:318) ~[crate-server.jar:?]
        at org.elasticsearch.cluster.coordination.Coordinator.handleJoinRequest(Coordinator.java:452) ~[crate-server.jar:?]
        at org.elasticsearch.cluster.coordination.JoinHelper.lambda$new$0(JoinHelper.java:131) ~[crate-server.jar:?]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) ~[crate-server.jar:?]
        at org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:266) ~[crate-server.jar:?]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[crate-server.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]
        at java.lang.Thread.run(Thread.java:831) ~[?:?]

Any ideas why I’m getting this “Too many open files” error and how to fix this?

How many shards are on that cluster?
Lucene does keep a lot of files open, but 8 million still seems high.
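
A quick way to check that could be something like the following (assuming crash can still reach one of the nodes; sys.shards has one row per shard copy):

$ crash --command "SELECT count(*) FROM sys.shards;"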

Not many. Sometimes the web UI does respond, and this is what I was able to see just now (there’s not much data on that development cluster; we only installed it a few weeks ago for some testing):

For comparison, on node 01 of our production cluster:

$ sudo lsof | wc -l
364819

$ sudo lsof | wc -l
370925

$ sudo lsof | wc -l
370252

$ sudo lsof | wc -l
370142

$ sudo lsof | wc -l
370697

what does …

select
name,
process['open_file_descriptors'], 
process['max_open_file_descriptors']
from sys.nodes

return?


I couldn’t connect via crash:

+-----------------------+-----------+---------+-----------+---------------------------------------+
| server_url            | node_name | version | connected | message                               |
+-----------------------+-----------+---------+-----------+---------------------------------------+
| http://localhost:4200 |      NULL | 0.0.0   | FALSE     | 503 Server Error: Service Unavailable |
+-----------------------+-----------+---------+-----------+---------------------------------------+
CONNECT ERROR

but then I tried the PostgreSQL psql utility, and I was able to connect and run the SQL query you wrote:

       name       | process['open_file_descriptors'] | process['max_open_file_descriptors']
------------------+----------------------------------+--------------------------------------
 dev-crate-dn-003 |                               -1 |                                65536
 dev-crate-dn-002 |                               -1 |                                65536
 dev-crate-dn-001 |                            65535 |                                65536
(3 rows)

Are you sure that the CrateDB process opened that many file descriptors?

This could be checked with, for example (listing the top 10 processes):
lsof | awk '{print $1}' | sort | uniq -c | sort -r | head

This is what I see:

$ sudo lsof | awk '{print $1}' | sort | uniq -c | sort -r | head
      9 uniq
      9 scsi_tmf_
      9 kworker/u
      9 head
     96 systemd-t
     87 kworker/1
8131796 java
      6 rcu_tasks
      6 ext4-rsv-
     58 systemd-j

And another similar command that shows more details:

$ sudo lsof +c 0 | awk '{ print $2 " " $1; }' | sort -rn | uniq -c | sort -rn | head -n30
8131796 6204 java
    132 575 rsyslogd
     96 483 systemd-timesyn
     87 1 systemd
     61 32787 sshd
     58 335 systemd-journal
     57 32760 sshd
     56 32764 (sd-pam)
     55 32763 systemd
     49 577 systemd-logind
     41 32989 sudo
     39 592 sshd
     36 573 dbus-daemon
     34 355 systemd-udevd
     20 572 cron
     17 32995 lsof
     16 32788 bash
     13 585 agetty
     12 32996 lsof
     10 32993 sort
     10 32991 sort
     10 32990 awk
      9 32994 head
      9 32992 uniq
      3 9 mm_percpu_wq
      3 90 kworker/15:0H-events_highpri
      3 88 ksoftirqd/15
      3 87 migration/15
      3 86 cpuhp/15
      3 85 kworker/14:0H-events_highpri

It seems that java has by far the highest number of open files. But why 8,131,796 open files? That I don’t understand.

What does

 ls -l /proc/6204/fd | wc -l

return?

Well, that node is indeed running into the limit.
However, that is 65,535 file descriptors; the 8 million from lsof seems to be multiplied by the number of threads.
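
A rough way to verify that (just a sketch; 6204 is the PID from your systemctl output, and whether plain lsof prints one entry per task depends on the lsof version):

# count the tasks/threads of the CrateDB JVM (systemctl status reported Tasks: 124)
$ sudo ls /proc/6204/task | wc -l

124 tasks × 65,535 descriptors ≈ 8.1 million, which is in the same ballpark as the 8,133,395 lines that the system-wide lsof printed, so the per-process descriptor count is the number that matters.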

It returns:

$ sudo ls -l /proc/6204/fd | wc -l
65535

Can you maybe also quickly check how many segments have been created?

SELECT count(*) from sys.segments

This is what I get:

 SELECT count(*) from sys.segments ;
 count(*)
----------
       42
(1 row)

Did this occur over time, or immediately after setup? Since there is data in the cluster, I assume it was running fine before? It looks like no master has been elected.

Over time. I mean, not immediately. The cluster was already running and we upgraded it from 4.6.5 to 4.6.6 on 16 December, 2021.

I think the last time I was able to connect to the web UI without any problems was last Friday. This week I didn’t try it for a few days, and only this morning (Thursday) did I notice this situation.

In other words, it seems to me that our 3-node development CrateDB cluster was working fine (at least as far as can be observed from the web UI) until very recently.

In the worst case, I’ll try to restart the crate service, or maybe reboot that node. But I still don’t know why more than 8 million files are open on that cluster.
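
For what it’s worth, this is roughly how the per-process limit could be checked and, if ever really needed, raised via a systemd drop-in (only a sketch; the 131072 value is just an example, and raising the limit wouldn’t explain why the descriptors keep piling up):

# where does the current limit come from?
$ systemctl show crate.service -p LimitNOFILE
$ sudo grep "open files" /proc/6204/limits

# raise it with a drop-in and restart the service
$ sudo systemctl edit crate.service
# add:
#   [Service]
#   LimitNOFILE=131072
$ sudo systemctl restart crate.service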

Hello,

It was getting difficult to work with our development cluster, not being able to view the web-based UI, etc. Therefore I’ve rebooted the nodes one by one.

After the reboot, and after checking that the development cluster health is reported as OK in the web UI, I’m monitoring the number of open files on the first node using the following command:

$ sudo lsof +c 0 | awk '{ print $2 " " $1; }' | sort -rn | uniq -c | sort -rn | head -n1 | column -t

It started with 85944 open files and in a few minutes reached 107802, and then 124374.

I expect the number of open files to stabilize around 300,000 (because that’s what I see on node 01 of our production cluster, and it’s been working fine). I’ll keep an eye on it.
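
Something like the following could also be used to watch the per-process descriptor count directly, which avoids counting the same descriptor once per thread (a sketch, assuming crate.service stays the unit name):

# count descriptors of the CrateDB main process, re-resolving the PID on every refresh
$ watch -n 60 'sudo ls /proc/$(systemctl show crate.service -p MainPID --value)/fd | wc -l'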

Be aware that the lsof numbers are most likely double counting, as CrateDB makes heavy use of threads.

open_file_descriptors

in

select
name,
process['open_file_descriptors'], 
process['max_open_file_descriptors']
from sys.nodes

should return the correct number for actual open file descriptors.


Thanks for the detailed info! Indeed, now that I can connect via crash on that node, I can see the following:

$  crash --verbose
+-----------------------+------------------+---------+-----------+---------+
| server_url            | node_name        | version | connected | message |
+-----------------------+------------------+---------+-----------+---------+
| http://localhost:4200 | dev-crate-dn-001 | 4.6.6   | TRUE      | OK      |
+-----------------------+------------------+---------+-----------+---------+
CONNECT OK
CLUSTER CHECK OK
TYPES OF NODE CHECK OK
cr> SELECT name, process['open_file_descriptors'], process['max_open_file_descriptors'] from sys.nodes ;
+------------------+----------------------------------+--------------------------------------+
| name             | process['open_file_descriptors'] | process['max_open_file_descriptors'] |
+------------------+----------------------------------+--------------------------------------+
| dev-crate-dn-001 |                             1332 |                                65536 |
| dev-crate-dn-003 |                             1269 |                                65536 |
| dev-crate-dn-002 |                             1325 |                                65536 |
+------------------+----------------------------------+--------------------------------------+
SELECT 3 rows in set (0.003 sec)

I’ll monitor this development cluster for a few more days to see if this “too many open files” issue occurs again.

Thanks for the swift technical support so far! :pray:

I’ve monitored both our development and production cluster for a while, and this is what I’ve found:

Development: when I run the Chrome browser and connect to the web-based admin UI, the value of process['open_file_descriptors'] immediately starts to increase, and as long as I keep Chrome open, clicking through various screens, that value keeps increasing. When I exit Chrome, the increased value stays at that level, only to increase again when I start Chrome again.

I did a crude experiment using the following command line (while starting and exiting Chrome):

$ while true; do 2>&1 crash --command "SELECT name, process['open_file_descriptors'] from sys.nodes ;" | grep "dev-crate-dn-001" | head -n2 | tail ; sleep 2; done
| dev-crate-dn-001 |   2850 |
| dev-crate-dn-001 |   2850 |
| dev-crate-dn-001 |   2850 |
| dev-crate-dn-001 |   2850 |
| dev-crate-dn-001 |   2850 |
| dev-crate-dn-001 |   2851 |  <--- Start Chrome browser
| dev-crate-dn-001 |   2851 |       ...
| dev-crate-dn-001 |   2856 |       ...
| dev-crate-dn-001 |   2860 |
| dev-crate-dn-001 |   2860 |
| dev-crate-dn-001 |   2859 |
| dev-crate-dn-001 |   2864 |
| dev-crate-dn-001 |   2863 |
| dev-crate-dn-001 |   2868 |
| dev-crate-dn-001 |   2868 |
| dev-crate-dn-001 |   2872 |
| dev-crate-dn-001 |   2872 |
| dev-crate-dn-001 |   2871 |
| dev-crate-dn-001 |   2876 |
| dev-crate-dn-001 |   2875 |
| dev-crate-dn-001 |   2880 |
| dev-crate-dn-001 |   2880 |
| dev-crate-dn-001 |   2884 |
| dev-crate-dn-001 |   2884 |
| dev-crate-dn-001 |   2888 |
| dev-crate-dn-001 |   2888 |
| dev-crate-dn-001 |   2887 |
| dev-crate-dn-001 |   2892 |
| dev-crate-dn-001 |   2892 |
| dev-crate-dn-001 |   2890 | <--- Exit Chrome browser 
| dev-crate-dn-001 |   2890 |      ...
| dev-crate-dn-001 |   2890 |      ...
| dev-crate-dn-001 |   2890 |      value of process['open_file_descriptors'] stays the same
| dev-crate-dn-001 |   2890 |      value of process['open_file_descriptors'] stays the same
| dev-crate-dn-001 |   2890 |      value of process['open_file_descriptors'] stays the same
| dev-crate-dn-001 |   2890 |      ...
| dev-crate-dn-001 |   2890 |      ....
| dev-crate-dn-001 |   2890 |      value of process['open_file_descriptors'] stays the same

Production: when I did a similar experiment on the first node of our production cluster, I saw that process['open_file_descriptors'] increases while Chrome is running, but decreases again after I exit Chrome:

| crate-dn-001 |        524 |
| crate-dn-001 |        527 |
| crate-dn-001 |        528 |
| crate-dn-001 |        518 |
| crate-dn-001 |        517 |
| crate-dn-001 |        521 | <--- Start Chrome browser
| crate-dn-001 |        517 |      ...  
| crate-dn-001 |        521 |
| crate-dn-001 |        521 |
| crate-dn-001 |        522 |
| crate-dn-001 |        526 |
| crate-dn-001 |        534 |
| crate-dn-001 |        534 |
| crate-dn-001 |        533 |
| crate-dn-001 |        530 |
| crate-dn-001 |        534 |
| crate-dn-001 |        529 |
| crate-dn-001 |        535 |
| crate-dn-001 |        530 |
| crate-dn-001 |        534 |
| crate-dn-001 |        518 |
| crate-dn-001 |        522 |
| crate-dn-001 |        580 | <--- Exit Chrome browser
| crate-dn-001 |        515 |      value of process['open_file_descriptors'] decreases 
| crate-dn-001 |        519 |      ...
| crate-dn-001 |        515 |
| crate-dn-001 |        523 |
| crate-dn-001 |        527 |
| crate-dn-001 |        528 |

In other words: on the production cluster, running Chrome again and again doesn’t have much impact, whereas on our development cluster, keeping Chrome connected continuously for a few days would easily lead to an explosion of process['open_file_descriptors'].

One major difference between these clusters is the underlying Debian version:

  • Production: Debian GNU/Linux version 10.11
  • Development: Debian GNU/Linux version 11.1

But the CrateDB version is the same, so I’m really confused as to why process['open_file_descriptors'] keeps increasing as long as a Chrome browser is connected, and why that value doesn’t go down after I exit Chrome.

As I wrote before, there is almost no data and almost no load on the development CrateDB cluster, whereas on the production CrateDB cluster there are tables with more than 200 million rows and more than 35 GB of data, and services constantly connecting, inserting, and querying data, which creates some load.
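
Something along these lines could help narrow down what kind of descriptors keep accumulating on the development node (a sketch; 4200 is the HTTP port used above, and the PID lookup assumes the crate.service unit):

# group the node's descriptors by target (regular files keep their path, sockets/pipes collapse to "socket"/"pipe")
$ PID=$(systemctl show crate.service -p MainPID --value)
$ sudo ls -l /proc/$PID/fd | awk '{print $NF}' | sed 's/:\[.*\]$//' | sort | uniq -c | sort -rn | head

# count established connections on the admin-UI HTTP port
$ sudo ss -tn state established '( sport = :4200 )' | tail -n +2 | wc -l

If the biggest bucket turns out to be sockets, that would point at HTTP/transport connections not being closed rather than at Lucene files.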