How to fix "Too many open files" on a 3-node cluster with almost no data?

The Admin UI is basically just another SQL client: it uses the _sql HTTP endpoint to run the queries that populate the UI. So even if you are not actively querying in the console, the Admin UI sends quite a few SQL statements to the cluster.
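
For example, the same kind of request the UI makes can be sent by hand to that endpoint with curl (a sketch; adjust host and port to your setup, 4200 being the default HTTP port):

$ curl -sS -H 'Content-Type: application/json' -X POST 'http://localhost:4200/_sql' -d '{"stmt": "SELECT name, hostname FROM sys.nodes"}'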

Are you using any sort of load balancer / reverse proxy?


The amount of data does not matter too much; what matters is the number of Lucene segments, which is only loosely related to the amount of data.
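
If you want to get a feeling for the number of segments per table, something along these lines should work (a sketch, assuming a CrateDB version that exposes the sys.segments table; crash is the CrateDB shell):

$ crash --hosts localhost:4200 -c "SELECT table_name, count(*) AS segment_count FROM sys.segments GROUP BY table_name ORDER BY segment_count DESC"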

By the way, this might not have anything to do with the actual data at all; it could just as well be a networking problem.

There’s an HAProxy load balancer, but for this development cluster we connected directly to the nodes most of the time. For example, all of the web-based UI connections were made directly to the IP address of a particular node, not via the load balancer.

I ran a similar experiment this morning:

  • Every few seconds, check the value of process['open_file_descriptors'] for node 01 (see the query sketch after this list) and confirm that it stays constant at 2902.
  • Connect via Chrome directly to the web-based administrative UI running on node 01.
  • Start clicking on some items in the web UI to generate some SQL queries to the back-end CrateDB server.
  • Observe that the value of process['open_file_descriptors'] quickly increases to 2944.
  • Exit the Chrome browser.
  • The value of process['open_file_descriptors'] drops, but not by much: it goes down to 2942 and stays there.
  • The same happens with process['open_file_descriptors'] on all 3 nodes: they all keep increasing, on and on…
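
This is roughly how I polled that value (a sketch; node01 stands for the node's address, 4200 is the default HTTP port, and crash is the CrateDB shell):

$ while true; do crash --hosts node01:4200 -c "SELECT name, process['open_file_descriptors'] FROM sys.nodes ORDER BY name"; sleep 5; done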

Meanwhile, on our production cluster I do something similar: connect directly to the first node with Chrome, click around, exit. There, process['open_file_descriptors'] stays around 500.

In other words: if I keep connecting with Chrome to one of the nodes on our development cluster, click on a few items, exit, and repeat this, the value of process['open_file_descriptors'] keeps increasing until I start getting the /usr/share/crate/lib/site/index.html: Too many open files error again.
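
For reference, the per-process limit that this eventually runs into can be read straight from the kernel (585 being the CrateDB Java PID on development node 01, as used in the listings below):

$ sudo grep 'Max open files' /proc/585/limits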

What I’m trying to understand:

  • Why does the value of process['open_file_descriptors'] continuously increase?

Normally I’d expect it to stabilize around some value on a cluster that’s not doing much at all. I’d expect that number to increase when I connect with Chrome, but then drop back down after I exit Chrome.

Maybe you have some insight? Or some method to debug this strange case further?

I decided to look at the most recently touched file descriptors to see whether there are obvious differences between production cluster node 01 and development cluster node 01.

On production, node 01:

$ sudo ls -lt /proc/218658/fd | head -n40
total 0
lrwx------ 1 crate crate 64 Jan 13 13:55 436 -> socket:[783194]
lrwx------ 1 crate crate 64 Jan 13 13:55 499 -> socket:[783184]
lrwx------ 1 crate crate 64 Jan 13 13:55 506 -> socket:[783192]
lr-x------ 1 crate crate 64 Jan 13 13:53 303 -> /data/nodes/0/indices/fKxtpGYgR4WrlgKifnUa6w/0/index/_4k1.fdt
l-wx------ 1 crate crate 64 Jan 13 13:53 438 -> /data/nodes/0/indices/SSM93J75RCWoR0g1ynfEuw/2/index/_2i394.fdt
l-wx------ 1 crate crate 64 Jan 13 13:53 439 -> /data/nodes/0/indices/SSM93J75RCWoR0g1ynfEuw/2/index/_2i394_Lucene85FieldsIndex-doc_ids_2if7e.tmp
l-wx------ 1 crate crate 64 Jan 13 13:53 443 -> /data/nodes/0/indices/SSM93J75RCWoR0g1ynfEuw/2/index/_2i394_Lucene85FieldsIndexfile_pointers_2if7f.tmp
lrwx------ 1 crate crate 64 Jan 13 13:53 465 -> /data/nodes/0/indices/fKxtpGYgR4WrlgKifnUa6w/0/translog/translog-521.tlog
lrwx------ 1 crate crate 64 Jan 13 13:53 477 -> /data/nodes/0/indices/SSM93J75RCWoR0g1ynfEuw/4/translog/translog-44.tlog
lrwx------ 1 crate crate 64 Jan 13 10:12 484 -> /data/nodes/0/indices/SSM93J75RCWoR0g1ynfEuw/3/translog/translog-45.tlog
l-wx------ 1 crate crate 64 Jan 13 10:12 489 -> /data/nodes/0/indices/SSM93J75RCWoR0g1ynfEuw/4/index/_2i2nq.fdm
lrwx------ 1 crate crate 64 Jan 13 10:06 462 -> /data/nodes/0/indices/QZ08ZovzTYin1OKmU9GvfA/0/translog/translog-1439.tlog
lrwx------ 1 crate crate 64 Dec 27 12:59 556 -> /data/nodes/0/indices/SSM93J75RCWoR0g1ynfEuw/3/translog/translog-46.tlog
lrwx------ 1 crate crate 64 Dec 27 09:25 550 -> /data/nodes/0/indices/QJhmCWHnTAik-1hYASbtqA/5/translog/translog-3.tlog
lr-x------ 1 crate crate 64 Dec 23 00:19 543 -> /data/nodes/0/indices/5VMxfFDOTg-VaBI3x50s6w/4/index/_e0v.fdt
lr-x------ 1 crate crate 64 Dec 21 22:40 535 -> /data/nodes/0/indices/5VMxfFDOTg-VaBI3x50s6w/2/index/_ihp.fdx
lrwx------ 1 crate crate 64 Dec 21 21:46 540 -> socket:[471822]
lrwx------ 1 crate crate 64 Dec 21 21:46 542 -> /data/nodes/0/indices/SSM93J75RCWoR0g1ynfEuw/5/translog/translog-43.tlog
lrwx------ 1 crate crate 64 Dec 21 20:42 533 -> /data/nodes/0/indices/SSM93J75RCWoR0g1ynfEuw/4/translog/translog-42.tlog
lrwx------ 1 crate crate 64 Dec 21 20:42 534 -> /data/nodes/0/indices/mM6WFguiTXO8ozwRnWX9RQ/3/translog/translog-2678.tlog
lrwx------ 1 crate crate 64 Dec 21 20:42 532 -> /data/nodes/0/indices/mM6WFguiTXO8ozwRnWX9RQ/5/translog/translog-2669.tlog
l-wx------ 1 crate crate 64 Dec 21 20:27 485 -> /data/nodes/0/indices/IYTZhsKBTmOEE5FZXXTlqA/4/index/_23j.fdm
lrwx------ 1 crate crate 64 Dec 21 20:27 486 -> /data/nodes/0/indices/TfdwP1MXTiODrdJYgM_vnw/3/translog/translog-4.tlog
lr-x------ 1 crate crate 64 Dec 21 20:27 487 -> /data/nodes/0/indices/QZ08ZovzTYin1OKmU9GvfA/0/index/_d3f.fdx
l-wx------ 1 crate crate 64 Dec 21 20:27 488 -> /data/nodes/0/indices/QJhmCWHnTAik-1hYASbtqA/5/index/write.lock
l-wx------ 1 crate crate 64 Dec 21 20:27 490 -> /data/nodes/0/indices/IYTZhsKBTmOEE5FZXXTlqA/4/index/_23j.fdt
l-wx------ 1 crate crate 64 Dec 21 20:27 491 -> /data/nodes/0/indices/QJhmCWHnTAik-1hYASbtqA/4/index/write.lock
lrwx------ 1 crate crate 64 Dec 21 20:27 492 -> /data/nodes/0/indices/IYTZhsKBTmOEE5FZXXTlqA/5/translog/translog-651.tlog
lr-x------ 1 crate crate 64 Dec 21 20:27 497 -> /data/nodes/0/_state/_lt.cfs
lrwx------ 1 crate crate 64 Dec 21 20:27 498 -> /data/nodes/0/indices/5VMxfFDOTg-VaBI3x50s6w/3/translog/translog-1343.tlog
l-wx------ 1 crate crate 64 Dec 21 20:27 500 -> /data/nodes/0/indices/IYTZhsKBTmOEE5FZXXTlqA/2/index/_20z_Lucene85FieldsIndexfile_pointers_113.tmp
l-wx------ 1 crate crate 64 Dec 21 20:27 513 -> /data/nodes/0/indices/TfdwP1MXTiODrdJYgM_vnw/5/index/write.lock
lrwx------ 1 crate crate 64 Dec 21 20:27 514 -> /data/nodes/0/indices/SSM93J75RCWoR0g1ynfEuw/3/translog/translog-41.tlog
lr-x------ 1 crate crate 64 Dec 21 20:27 517 -> /data/nodes/0/indices/QZ08ZovzTYin1OKmU9GvfA/4/index/_cxs.fdx
lr-x------ 1 crate crate 64 Dec 21 20:27 518 -> /data/nodes/0/indices/5VMxfFDOTg-VaBI3x50s6w/5/index/_c5s.fdt
lr-x------ 1 crate crate 64 Dec 21 20:27 520 -> /data/nodes/0/indices/5VMxfFDOTg-VaBI3x50s6w/5/index/_c5s.fdx
lr-x------ 1 crate crate 64 Dec 21 14:45 545 -> /data/nodes/0/indices/5VMxfFDOTg-VaBI3x50s6w/4/index/_e0v.fdx
lr-x------ 1 crate crate 64 Dec 21 13:29 525 -> /data/nodes/0/indices/fKxtpGYgR4WrlgKifnUa6w/2/index/_4hw.fdx
lrwx------ 1 crate crate 64 Dec 21 13:29 527 -> /data/nodes/0/indices/mM6WFguiTXO8ozwRnWX9RQ/1/translog/translog-2721.tlog

On development, node 01:

$ sudo ls -lt /proc/585/fd | head -n40
total 0
lr-x------ 1 crate crate 64 Jan 14 08:32 2993 -> /sys/fs/cgroup/sys-fs-fuse-connections.mount
lr-x------ 1 crate crate 64 Jan 14 08:32 2994 -> /sys/fs/cgroup/sys-fs-fuse-connections.mount
lr-x------ 1 crate crate 64 Jan 14 08:32 2989 -> /sys/fs/cgroup/sys-fs-fuse-connections.mount
lr-x------ 1 crate crate 64 Jan 14 08:32 2990 -> /sys/fs/cgroup/sys-fs-fuse-connections.mount
lr-x------ 1 crate crate 64 Jan 14 08:32 2991 -> /sys/fs/cgroup
lr-x------ 1 crate crate 64 Jan 14 08:32 2992 -> /sys/fs/cgroup
lr-x------ 1 crate crate 64 Jan 14 08:32 2985 -> /sys/fs/cgroup/sys-fs-fuse-connections.mount
lr-x------ 1 crate crate 64 Jan 14 08:32 2986 -> /sys/fs/cgroup/sys-fs-fuse-connections.mount
lr-x------ 1 crate crate 64 Jan 14 08:32 2987 -> /sys/fs/cgroup
lr-x------ 1 crate crate 64 Jan 14 08:32 2988 -> /sys/fs/cgroup
lr-x------ 1 crate crate 64 Jan 14 08:32 2981 -> /sys/fs/cgroup/sys-fs-fuse-connections.mount
lr-x------ 1 crate crate 64 Jan 14 08:32 2982 -> /sys/fs/cgroup/sys-fs-fuse-connections.mount
lr-x------ 1 crate crate 64 Jan 14 08:32 2983 -> /sys/fs/cgroup
lr-x------ 1 crate crate 64 Jan 14 08:32 2984 -> /sys/fs/cgroup
lr-x------ 1 crate crate 64 Jan 14 08:31 2977 -> /sys/fs/cgroup/sys-fs-fuse-connections.mount
lr-x------ 1 crate crate 64 Jan 14 08:31 2978 -> /sys/fs/cgroup/sys-fs-fuse-connections.mount
lr-x------ 1 crate crate 64 Jan 14 08:31 2979 -> /sys/fs/cgroup
lr-x------ 1 crate crate 64 Jan 14 08:31 2980 -> /sys/fs/cgroup
lr-x------ 1 crate crate 64 Jan 14 08:31 2973 -> /sys/fs/cgroup/sys-fs-fuse-connections.mount
lr-x------ 1 crate crate 64 Jan 14 08:31 2974 -> /sys/fs/cgroup/sys-fs-fuse-connections.mount
lr-x------ 1 crate crate 64 Jan 14 08:31 2975 -> /sys/fs/cgroup
lr-x------ 1 crate crate 64 Jan 14 08:31 2976 -> /sys/fs/cgroup
lr-x------ 1 crate crate 64 Jan 14 08:31 2969 -> /sys/fs/cgroup/sys-fs-fuse-connections.mount
lr-x------ 1 crate crate 64 Jan 14 08:31 2970 -> /sys/fs/cgroup/sys-fs-fuse-connections.mount
lr-x------ 1 crate crate 64 Jan 14 08:31 2971 -> /sys/fs/cgroup
lr-x------ 1 crate crate 64 Jan 14 08:31 2972 -> /sys/fs/cgroup
lr-x------ 1 crate crate 64 Jan 14 08:31 2965 -> /sys/fs/cgroup/sys-fs-fuse-connections.mount
lr-x------ 1 crate crate 64 Jan 14 08:31 2966 -> /sys/fs/cgroup/sys-fs-fuse-connections.mount
lr-x------ 1 crate crate 64 Jan 14 08:31 2967 -> /sys/fs/cgroup
lr-x------ 1 crate crate 64 Jan 14 08:31 2968 -> /sys/fs/cgroup
lr-x------ 1 crate crate 64 Jan 14 08:31 2961 -> /sys/fs/cgroup/sys-fs-fuse-connections.mount
lr-x------ 1 crate crate 64 Jan 14 08:31 2962 -> /sys/fs/cgroup/sys-fs-fuse-connections.mount
lr-x------ 1 crate crate 64 Jan 14 08:31 2963 -> /sys/fs/cgroup
lr-x------ 1 crate crate 64 Jan 14 08:31 2964 -> /sys/fs/cgroup
lr-x------ 1 crate crate 64 Jan 14 08:31 2957 -> /sys/fs/cgroup/sys-fs-fuse-connections.mount
lr-x------ 1 crate crate 64 Jan 14 08:31 2958 -> /sys/fs/cgroup/sys-fs-fuse-connections.mount
lr-x------ 1 crate crate 64 Jan 14 08:31 2959 -> /sys/fs/cgroup
lr-x------ 1 crate crate 64 Jan 14 08:31 2960 -> /sys/fs/cgroup
lr-x------ 1 crate crate 64 Jan 14 08:31 2953 -> /sys/fs/cgroup/sys-fs-fuse-connections.mount
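
Instead of eyeballing these listings, the descriptor targets can also be summarized, roughly like this (585 again being the CrateDB PID on the development node):

$ sudo ls -l /proc/585/fd | awk -F' -> ' 'NF > 1 {print $2}' | sed -E 's/:\[[0-9]+\]//' | cut -d/ -f1-5 | sort | uniq -c | sort -rn | head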

Another difference between the production cluster and the development cluster:

  • I installed sshfs on the development cluster nodes around 17 December, but I didn’t configure anything for sshfs, didn’t create a network mount, etc.

Development cluster, node 01:

$ dpkg -l | egrep "(fuse)|(sshf)"
ii  fuse3                      3.10.3-2                       amd64        Filesystem in Userspace (3.x version)
ii  libfuse2:amd64             2.9.9-5                        amd64        Filesystem in Userspace (library)
ii  libfuse3-3:amd64           3.10.3-2                       amd64        Filesystem in Userspace (library) (3.x version)
ii  sshfs                      3.7.1+repack-2                 amd64        filesystem client based on SSH File Transfer Protocol

Production cluster, node 01:

$ dpkg -l | egrep "(fuse)|(sshf)"
ii  libfuse2:amd64                2.9.9-1+deb10u1              amd64        Filesystem in Userspace (library)

Could it be that this fuse3 package and its associated programs are creating thousands of file descriptors in the fd directory of the Java process? Because:

On the development cluster (sshfs installed, but not configured or running), node 01:

$ sudo ls -lt /proc/585/fd | grep cgroup | wc -l
2708

On the production cluster, node 01:

$ sudo ls -lt /proc/218658/fd | grep cgroup | wc -l
0

But why? And how to fix this?

Another experiment, this time monitoring the cgroup / fuse related file descriptors of the Java process while I run Chrome, click on a few items on the web-based administrative UI, and exit Chrome:

$ while true; do sudo ls -lt /proc/585/fd | egrep "(cgroup)|(fuse)" | wc -l ; sleep 3; done
2708
2712 <-- Start Chrome
2712 <-- ...
2716 <-- ...
2716
2720
2724 <-- click on UI items, generate SQL to the back-end
2724 <-- ...
2728 <-- ...
2728
2732
2736
2736
2740
2748
2752  <-- Exit Chrome
2752  <-- it's like once those file descriptors are created, they are always there
2752  ...
2752  ...
2752  <-- it's like once those file descriptors are created, they are always there
2752
2752
2752
2752 <-- but why? Some kind of resource leakage?

Hi @Emre_Sevinc,

The core DB team did a quick analysis but did not find an obvious cause for this. I created a card on our board so we can try to reproduce this behaviour and dig deeper.

Thanks for reporting this and looking into it so far :+1:


I decided to look at things via strace (version 5.10), and this is what I see (on a Debian 11.2 server running CrateDB 4.7.0):

First, let’s get the process IDs that have a lot of files open:

$ sudo lsof +c 0 | awk '{ print $2 " " $1; }' | sort -rn | uniq -c | sort -rn | head -n5 | column -t
116000  1127  java
132     585   rsyslogd
96      578   systemd-timesyn
87      1     systemd
61      7261  sshd
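
Note that lsof lines include more than plain file descriptors (for example memory-mapped jars and libraries), so it is worth cross-checking against the kernel's own fd table for that PID:

$ sudo ls /proc/1127/fd | wc -l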

And then let’s trace that Java process and its “forks”:

$ sudo strace --decode-fds=all --trace-path=/sys/fs/cgroup --follow-forks --attach 1127
strace: Process 1127 attached with 116 threads
[pid  1615] lstat("/sys/fs/cgroup", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0    <-- As soon as I open the Web UI I start to see these kind of entries and they stop as soon as I close web UI
[pid  1615] openat(AT_FDCWD, "/sys/fs/cgroup", O_RDONLY) = 865</sys/fs/cgroup>
[pid  1615] dup(865</sys/fs/cgroup>)    = 938</sys/fs/cgroup>
[pid  1615] fstat(865</sys/fs/cgroup>, {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
[pid  1615] fcntl(865</sys/fs/cgroup>, F_GETFL) = 0x8000 (flags O_RDONLY|O_LARGEFILE)
[pid  1615] fcntl(865</sys/fs/cgroup>, F_SETFD, FD_CLOEXEC) = 0
[pid  1615] getdents64(865</sys/fs/cgroup>, 0x7f4f5025f690 /* 30 entries */, 32768) = 1144
[pid  1303] lstat("/sys/fs/cgroup", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
[pid  1303] openat(AT_FDCWD, "/sys/fs/cgroup", O_RDONLY) = 942</sys/fs/cgroup>
[pid  1303] dup(942</sys/fs/cgroup>)    = 943</sys/fs/cgroup>
[pid  1303] fstat(942</sys/fs/cgroup>, {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
[pid  1303] fcntl(942</sys/fs/cgroup>, F_GETFL) = 0x8000 (flags O_RDONLY|O_LARGEFILE)
[pid  1303] fcntl(942</sys/fs/cgroup>, F_SETFD, FD_CLOEXEC) = 0
[pid  1303] getdents64(942</sys/fs/cgroup>, 0x7f4ea439a950 /* 30 entries */, 32768) = 1144 <-- At this point in time I close web UI
^Cstrace: Process 1127 detached  <-- I manually exit strace, because after I close web UI, nothing happens with respect to /sys/fs/cgroup file descriptors
strace: Process 1133 detached
strace: Process 1134 detached
strace: Process 1135 detached
strace: Process 1136 detached
strace: Process 1138 detached
strace: Process 1139 detached
strace: Process 1140 detached
strace: Process 1141 detached
strace: Process 1142 detached
strace: Process 1143 detached
strace: Process 1144 detached
strace: Process 1145 detached
strace: Process 1146 detached
strace: Process 1147 detached
strace: Process 1148 detached
strace: Process 1149 detached
strace: Process 1150 detached
strace: Process 1151 detached
strace: Process 1162 detached
strace: Process 1163 detached
strace: Process 1164 detached
strace: Process 1165 detached
strace: Process 1166 detached
strace: Process 1167 detached
strace: Process 1168 detached
strace: Process 1169 detached
strace: Process 1170 detached
strace: Process 1171 detached
strace: Process 1172 detached
strace: Process 1173 detached
strace: Process 1174 detached
strace: Process 1175 detached
strace: Process 1180 detached
strace: Process 1181 detached
strace: Process 1183 detached
strace: Process 1184 detached
strace: Process 1185 detached
strace: Process 1186 detached
strace: Process 1187 detached
strace: Process 1188 detached
strace: Process 1189 detached
strace: Process 1190 detached
strace: Process 1191 detached
strace: Process 1192 detached
strace: Process 1193 detached
strace: Process 1197 detached
strace: Process 1198 detached
strace: Process 1199 detached
strace: Process 1200 detached
strace: Process 1201 detached
strace: Process 1202 detached
strace: Process 1203 detached
strace: Process 1204 detached
strace: Process 1205 detached
strace: Process 1206 detached
strace: Process 1207 detached
strace: Process 1208 detached
strace: Process 1209 detached
strace: Process 1210 detached
strace: Process 1211 detached
strace: Process 1212 detached
strace: Process 1213 detached
strace: Process 1214 detached
strace: Process 1216 detached
strace: Process 1217 detached
strace: Process 1218 detached
strace: Process 1219 detached
strace: Process 1220 detached
strace: Process 1221 detached
strace: Process 1222 detached
strace: Process 1223 detached
strace: Process 1224 detached
strace: Process 1225 detached
strace: Process 1226 detached
strace: Process 1227 detached
strace: Process 1228 detached
strace: Process 1229 detached
strace: Process 1230 detached
strace: Process 1231 detached
strace: Process 1233 detached
strace: Process 1234 detached
strace: Process 1238 detached
strace: Process 1246 detached
strace: Process 1247 detached
strace: Process 1248 detached
strace: Process 1249 detached
strace: Process 1250 detached
strace: Process 1251 detached
strace: Process 1252 detached
strace: Process 1253 detached
strace: Process 1254 detached
strace: Process 1255 detached
strace: Process 1256 detached
strace: Process 1264 detached
strace: Process 1265 detached
strace: Process 1266 detached
strace: Process 1267 detached
strace: Process 1268 detached
strace: Process 1269 detached
strace: Process 1270 detached
strace: Process 1271 detached
strace: Process 1272 detached
strace: Process 1284 detached
strace: Process 1303 detached
strace: Process 1313 detached
strace: Process 1400 detached
strace: Process 1425 detached
strace: Process 1615 detached
strace: Process 1626 detached
strace: Process 1628 detached
strace: Process 1993 detached
strace: Process 1994 detached
strace: Process 2001 detached
strace: Process 2002 detached
strace: Process 2733 detached

Hi @Emre_Sevinc

We came across this issue again and could finally pin it down to a bug in CrateDB in combination with distributions using cgroup v2. The fix is already merged and has been released to the testing channels with version 4.8.4. I expect a 5.0.2 to follow soon.
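
For anyone who wants to check whether their host is using cgroup v2, one way is to look at the filesystem type mounted at /sys/fs/cgroup:

$ stat -fc %T /sys/fs/cgroup    # prints "cgroup2fs" on a unified (v2) hierarchy, "tmpfs" on cgroup v1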

Thanks again for bringing this up originally!

best regards
@proddata
