How to handle "There are still active requests on this node, delaying graceful shutdown"?

Hello,

To do a rolling upgrade from 4.6.5 to 4.6.6, I’m trying to gracefully shut down the nodes of our 3-node CrateDB cluster one at a time.

There are some .NET services constantly sending data via HAProxy to this 3-node cluster.

On Node 01, I ran the following commands:

SET GLOBAL TRANSIENT "cluster.routing.allocation.enable" = 'new_primaries';
ALTER CLUSTER DECOMMISSION 'crate-dn-001';

But it won’t finish the graceful shutdown. When I check the logs, I see the following:

[2021-12-17T13:06:59,850][INFO ][o.e.c.s.ClusterSettings  ] [crate-dn-001] updating [cluster.routing.allocation.enable] from [all] to [new_primaries]
[2021-12-17T13:09:06,119][INFO ][o.e.c.s.ClusterSettings  ] [crate-dn-001] updating [crate.internal.decommission.] from [{}] to [{"FM5C1UbYRJSyPnRQytu25Q":"true"}]
[2021-12-17T13:09:27,464][WARN ][i.n.c.DefaultChannelPipeline] [crate-dn-001] An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
org.elasticsearch.transport.NodeDisconnectedException: [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:09:27,468][WARN ][i.n.c.DefaultChannelPipeline] [crate-dn-001] An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
org.elasticsearch.transport.NodeDisconnectedException: [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:09:40,055][WARN ][i.c.p.p.Messages         ] [crate-dn-001] [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:10:41,213][WARN ][i.c.p.p.Messages         ] [crate-dn-001] [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:11:42,327][WARN ][i.c.p.p.Messages         ] [crate-dn-001] [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:12:12,985][WARN ][i.c.p.p.Messages         ] [crate-dn-001] [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:13:14,136][WARN ][i.c.p.p.Messages         ] [crate-dn-001] [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:13:30,643][WARN ][i.c.p.p.Messages         ] [crate-dn-001] [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:13:33,920][WARN ][i.c.p.p.Messages         ] [crate-dn-001] [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:14:15,316][WARN ][i.c.p.p.Messages         ] [crate-dn-001] [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:14:28,948][WARN ][i.n.c.DefaultChannelPipeline] [crate-dn-001] An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
org.elasticsearch.transport.NodeDisconnectedException: [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:14:49,043][WARN ][o.e.i.IndexService       ] [crate-dn-001]  [data.meterreading][data.meterreading][3] failed to execute global checkpoint sync
org.elasticsearch.index.shard.ShardNotInPrimaryModeException: CurrentState[STARTED] shard is not in primary mode
        at org.elasticsearch.index.shard.IndexShard.lambda$wrapPrimaryOperationPermitListener$21(IndexShard.java:2667) [crate-server.jar:?]
        at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:90) [crate-server.jar:?]
        at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:296) [crate-server.jar:?]
        at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:246) [crate-server.jar:?]
        at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:2638) [crate-server.jar:?]
        at org.elasticsearch.index.shard.IndexShard.runUnderPrimaryPermit(IndexShard.java:2718) [crate-server.jar:?]
        at org.elasticsearch.index.IndexService.sync(IndexService.java:716) [crate-server.jar:?]
        at org.elasticsearch.index.IndexService.maybeSyncGlobalCheckpoints(IndexService.java:696) [crate-server.jar:?]
        at org.elasticsearch.index.IndexService$AsyncGlobalCheckpointTask.runInternal(IndexService.java:869) [crate-server.jar:?]
        at org.elasticsearch.common.util.concurrent.AbstractAsyncTask.run(AbstractAsyncTask.java:143) [crate-server.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
        at java.lang.Thread.run(Thread.java:831) [?:?]
[2021-12-17T13:15:00,062][WARN ][i.c.p.p.Messages         ] [crate-dn-001] [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:15:00,125][WARN ][i.c.p.p.Messages         ] [crate-dn-001] [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:15:07,097][WARN ][i.c.p.p.Messages         ] [crate-dn-001] [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:15:41,225][WARN ][i.c.p.p.Messages         ] [crate-dn-001] [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:16:48,934][WARN ][i.n.c.DefaultChannelPipeline] [crate-dn-001] An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
org.elasticsearch.transport.NodeDisconnectedException: [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:19:11,730][WARN ][i.c.p.p.Messages         ] [crate-dn-001] [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:19:33,487][INFO ][i.c.a.HttpAuthUpstreamHandler] [crate-dn-001] password authentication failed for user=energytix from connection=/192.168.239.20
[2021-12-17T13:19:35,797][WARN ][i.n.c.DefaultChannelPipeline] [crate-dn-001] An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
org.elasticsearch.transport.NodeDisconnectedException: [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:20:46,540][WARN ][o.e.i.IndexService       ] [crate-dn-001]  [data.quarter][data.quarter][3] failed to execute global checkpoint sync
org.elasticsearch.index.shard.ShardNotInPrimaryModeException: CurrentState[STARTED] shard is not in primary mode
        at org.elasticsearch.index.shard.IndexShard.lambda$wrapPrimaryOperationPermitListener$21(IndexShard.java:2667) [crate-server.jar:?]
        at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:90) [crate-server.jar:?]
        at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:296) [crate-server.jar:?]
        at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:246) [crate-server.jar:?]
        at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:2638) [crate-server.jar:?]
        at org.elasticsearch.index.shard.IndexShard.runUnderPrimaryPermit(IndexShard.java:2718) [crate-server.jar:?]
        at org.elasticsearch.index.IndexService.sync(IndexService.java:716) [crate-server.jar:?]
        at org.elasticsearch.index.IndexService.maybeSyncGlobalCheckpoints(IndexService.java:696) [crate-server.jar:?]
        at org.elasticsearch.index.IndexService$AsyncGlobalCheckpointTask.runInternal(IndexService.java:869) [crate-server.jar:?]
        at org.elasticsearch.common.util.concurrent.AbstractAsyncTask.run(AbstractAsyncTask.java:143) [crate-server.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
        at java.lang.Thread.run(Thread.java:831) [?:?]
[2021-12-17T13:20:51,786][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:20:56,788][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:21:01,788][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:21:06,789][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:21:11,789][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:21:16,790][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:21:21,791][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:21:26,791][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:21:31,792][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:21:36,793][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:21:41,794][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:21:46,794][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:21:51,795][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:21:56,796][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:22:01,796][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:22:06,797][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:22:11,797][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:22:12,946][WARN ][i.c.p.p.Messages         ] [crate-dn-001] [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:22:16,798][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:22:21,798][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:22:26,799][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:22:31,799][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:22:36,800][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:22:41,800][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:22:46,800][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:22:51,801][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:22:56,801][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:23:01,802][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:23:06,802][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:23:11,803][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:23:16,803][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:23:21,804][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:23:26,804][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:23:31,805][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:23:36,805][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:23:41,806][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:23:46,806][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:23:51,807][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:23:56,807][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:24:01,808][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:24:06,808][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:24:11,809][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:24:16,809][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:24:21,810][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:24:26,810][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:24:31,811][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:24:36,811][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:24:41,812][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:24:46,812][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:24:51,813][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:24:56,813][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:25:01,814][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:25:06,815][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:25:11,815][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:25:16,816][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:25:21,816][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:25:26,817][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:27:16,829][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:27:21,830][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:27:26,830][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:27:31,831][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:27:36,831][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:27:41,832][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:27:46,832][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:27:51,832][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:27:56,833][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:27:59,043][WARN ][i.c.p.p.Messages         ] [crate-dn-001] [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:28:01,833][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:28:06,834][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:28:11,834][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:28:16,835][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:28:21,835][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:28:26,836][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:28:31,836][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown

Any ideas how I can get Node 01 of our CrateDB cluster to finish its graceful shutdown?

On top of that: I can’t connect to it via the crash utility (as superuser) anymore; it just hangs, stuck waiting.

  • Was the cluster in a green state before starting the graceful shutdown?
  • How full are the cluster’s disks currently?
  • How many shards do you have in the cluster in total?

On top of that: I can’t connect to it via the crash utility (as superuser) anymore; it just hangs, stuck waiting.

That is expected for that node, as the SQL services (HTTP/PostgreSQL) are shut down first.

Was the cluster in a green state before starting the graceful shutdown?

Yes, it was green, functioning normally (services writing and reading data in and out of it without any problems).

How full are the cluster’s disks currently?

Before the DECOMMISSION command, it was about 27 GB per node. Now I see that the first node’s data disk is almost empty (only 73 MB), while the others increased to about 40 GB each.

How many shards do you have in the cluster in total?

I couldn’t quickly find the SQL command for that, therefore I took a screenshot:
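(For reference, the total can also be queried from sys.shards; these are generic queries added here for illustration, not what was behind the screenshot:)

SELECT count(*) AS total_shards FROM sys.shards;
SELECT "primary", count(*) FROM sys.shards GROUP BY "primary";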

Also:

select name, connections['psql']['open'] from sys.nodes ORDER BY name limit 100;

returns:

 name         | connections['psql']['open']
--------------+-----------------------------
 crate-dn-001 |                           0
 crate-dn-002 |                           3
 crate-dn-003 |                           1
(3 rows)

It seems like there’s no psql connection to the first node. But I continue getting:

[2021-12-17T14:18:32,095][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T14:18:37,095][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T14:18:42,096][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown

And the web admin UI (on Node 02) still shows all of the nodes as up and the cluster as GREEN:

I also see the following in the logs:

[2021-12-17T14:26:17,134][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T14:26:21,717][WARN ][i.n.c.DefaultChannelPipeline] [crate-dn-001] An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
org.elasticsearch.transport.NodeDisconnectedException: [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T14:26:21,718][WARN ][i.n.c.DefaultChannelPipeline] [crate-dn-001] An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
org.elasticsearch.transport.NodeDisconnectedException: [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T14:26:21,724][WARN ][i.n.c.DefaultChannelPipeline] [crate-dn-001] An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
org.elasticsearch.transport.NodeDisconnectedException: [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T14:26:21,727][WARN ][i.n.c.DefaultChannelPipeline] [crate-dn-001] An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
org.elasticsearch.transport.NodeDisconnectedException: [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T14:26:21,728][WARN ][i.n.c.DefaultChannelPipeline] [crate-dn-001] An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
org.elasticsearch.transport.NodeDisconnectedException: [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T14:26:21,729][WARN ][i.n.c.DefaultChannelPipeline] [crate-dn-001] An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
org.elasticsearch.transport.NodeDisconnectedException: [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T14:26:22,134][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown

It’s been about 1.5 hours since I issued the decommission command. It still says “There are still active requests on this node, delaying graceful shutdown”.

We also quit all the database viewer applications that might have had open connections to that first node, and restarted all of the data services that connect to the cluster. But the logs still say “There are still active requests on this node, delaying graceful shutdown”.

Is there a way to forcibly stop those “active requests”? (Or see what they actually are?)

Before the DECOMMISSION command, it was about 27 GB per node. Now I see that the first node’s data disk is almost empty (only 73 MB), while the others increased to about 40 GB each.

Have you set

cluster.graceful_stop.min_availability = 'full'

? It looks like the shards have been relocated to the two remaining nodes. This would be the case if the number of replicas is set to '1' (not '0-1'), or if the min availability is set to 'full'.
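(For reference, the current value can be checked at runtime with a standard sys.cluster query, the same one used later in this thread:)

SELECT settings['cluster']['graceful_stop']['min_availability'] FROM sys.cluster;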

It seems like there’s no psql connection to the first node. But I continue getting:

select name, connections['psql']['open'], connections['http']['open'] from sys.nodes ORDER BY name limit 100;

There might also be open HTTP connections (like the ones used by crash). However, I would think this is related to finishing up the relocation of shards and open transport connections.

I also see the following in the logs:

org.elasticsearch.transport.NodeDisconnectedException: [crate-dn-001][192.168.239.30:4300][sql] disconnected

Those are to be expected, as the SQL service is blocking any new incoming connections from your LB.


It’s been about 1.5 hours since I issued the decommission command. It still says “There are still active requests on this node, delaying graceful shutdown”.

The default timeout for a graceful shutdown to fail is 2 hours. By default, the node would then go back into normal operation.
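(Both of these settings can be changed at runtime; a sketch, with '4h' as an example value:)

SET GLOBAL TRANSIENT "cluster.graceful_stop.timeout" = '4h';
-- or force the node to shut down once the timeout is hit, instead of resuming operation:
SET GLOBAL TRANSIENT "cluster.graceful_stop.force" = true;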

Is there a way to forcibly stop those “active requests”? (Or see what they actually are?)

If you check

SELECT * FROM sys.shards WHERE routing_state <> 'STARTED';

and 0 rows are returned, you can also terminate the crate process.
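(To see what the active requests actually are, sys.jobs lists the statements currently running on the cluster; a long-running job can then be cancelled by its id. The id below is a placeholder:)

SELECT id, started, stmt FROM sys.jobs;
KILL '<job-id>'; -- placeholder id, taken from the query above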

I didn’t change the cluster.graceful_stop.min_availability setting, so I think it’s at the default of primaries?

On the second node, I ran:

SELECT * FROM sys.shards WHERE routing_state <> 'STARTED';

It didn’t return any rows.

The default timeout for a graceful shutdown to fail is 2 hours

According to the logs, the DECOMMISSION started around 13:09:06 UTC:

[2021-12-17T13:09:06,119][INFO ][o.e.c.s.ClusterSettings  ] [crate-dn-001] updating [crate.internal.decommission.] from [{}] to [{"FM5C1UbYRJSyPnRQytu25Q":"true"}]

Right now the time is about 15:14:02, so more than 2 hours have passed, and I still get the same messages:

[2021-12-17T15:14:42,371][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T15:14:47,372][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown

Based on your comments so far, I’ll run the following:

sudo systemctl stop crate

and then

sudo systemctl start crate

hoping it’ll restart properly.

To finalize this thread: I stopped and started CrateDB on the first node, waited for it to rejoin the cluster, and then repeated the whole process, this time with a successful graceful shutdown. Then I upgraded the CrateDB Debian packages and repeated the same steps on the other nodes, finishing the cluster upgrade in about 20 minutes.
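(For reference, a condensed sketch of the per-node sequence that worked; re-enabling allocation after the node rejoins follows the standard rolling-upgrade procedure and is not quoted from this thread:)

SET GLOBAL TRANSIENT "cluster.routing.allocation.enable" = 'new_primaries';
ALTER CLUSTER DECOMMISSION 'crate-dn-001'; -- wait for the ALTER OK response (several minutes)
-- upgrade the package and start the node again (e.g. via systemctl),
-- wait for it to rejoin and the cluster to turn green, then:
SET GLOBAL TRANSIENT "cluster.routing.allocation.enable" = 'all';
-- repeat for the next node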

I think the reason this process (graceful shutdown on the first node) was stuck waiting is:

  • After issuing the DECOMMISSION command via crash on the first node, I waited for about 1 minute and then exited the crash command-line utility with Ctrl-C.
  • I exited (Ctrl-C) because I thought it was safe to do so, and because I expected the decommission process to have finished by then. Apparently, it takes more than a few minutes!
  • In the successful case, it took about 7-8 minutes between issuing ALTER CLUSTER DECOMMISSION 'whatever-node-name-... ; and receiving the ALTER OK, 1 row affected message.

During that process, min_availability was PRIMARIES (I never changed that):

 select settings['cluster']['graceful_stop']['min_availability'] from sys.cluster limit 100;
 settings['cluster']['graceful_stop']['min_availability']
----------------------------------------------------------
 PRIMARIES
(1 row)

And the largest time series table had '0-1' as the number of replicas:

 number_of_replicas = '0-1',
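(The per-table replica setting can be checked like this; the 'data' schema name is taken from the table names in the logs above:)

SELECT table_schema, table_name, number_of_replicas
FROM information_schema.tables
WHERE table_schema = 'data';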