How to handle "There are still active requests on this node, delaying graceful shutdown"?

Hello,

To do a rolling upgrade from 4.6.5 to 4.6.6, I’m trying to gracefully shut down the nodes of our 3-node CrateDB cluster one at a time.

There are some .NET services constantly sending data via HAProxy to this 3-node cluster.

On Node 01, I ran the following commands:

SET GLOBAL TRANSIENT "cluster.routing.allocation.enable" = 'new_primaries';
ALTER CLUSTER DECOMMISSION 'crate-dn-001';

But it won’t finish the graceful shutdown. When I check the logs, I see the following:

[2021-12-17T13:06:59,850][INFO ][o.e.c.s.ClusterSettings  ] [crate-dn-001] updating [cluster.routing.allocation.enable] from [all] to [new_primaries]
[2021-12-17T13:09:06,119][INFO ][o.e.c.s.ClusterSettings  ] [crate-dn-001] updating [crate.internal.decommission.] from [{}] to [{"FM5C1UbYRJSyPnRQytu25Q":"true"}]
[2021-12-17T13:09:27,464][WARN ][i.n.c.DefaultChannelPipeline] [crate-dn-001] An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
org.elasticsearch.transport.NodeDisconnectedException: [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:09:27,468][WARN ][i.n.c.DefaultChannelPipeline] [crate-dn-001] An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
org.elasticsearch.transport.NodeDisconnectedException: [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:09:40,055][WARN ][i.c.p.p.Messages         ] [crate-dn-001] [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:10:41,213][WARN ][i.c.p.p.Messages         ] [crate-dn-001] [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:11:42,327][WARN ][i.c.p.p.Messages         ] [crate-dn-001] [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:12:12,985][WARN ][i.c.p.p.Messages         ] [crate-dn-001] [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:13:14,136][WARN ][i.c.p.p.Messages         ] [crate-dn-001] [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:13:30,643][WARN ][i.c.p.p.Messages         ] [crate-dn-001] [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:13:33,920][WARN ][i.c.p.p.Messages         ] [crate-dn-001] [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:14:15,316][WARN ][i.c.p.p.Messages         ] [crate-dn-001] [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:14:28,948][WARN ][i.n.c.DefaultChannelPipeline] [crate-dn-001] An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
org.elasticsearch.transport.NodeDisconnectedException: [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:14:49,043][WARN ][o.e.i.IndexService       ] [crate-dn-001]  [data.meterreading][data.meterreading][3] failed to execute global checkpoint sync
org.elasticsearch.index.shard.ShardNotInPrimaryModeException: CurrentState[STARTED] shard is not in primary mode
        at org.elasticsearch.index.shard.IndexShard.lambda$wrapPrimaryOperationPermitListener$21(IndexShard.java:2667) [crate-server.jar:?]
        at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:90) [crate-server.jar:?]
        at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:296) [crate-server.jar:?]
        at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:246) [crate-server.jar:?]
        at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:2638) [crate-server.jar:?]
        at org.elasticsearch.index.shard.IndexShard.runUnderPrimaryPermit(IndexShard.java:2718) [crate-server.jar:?]
        at org.elasticsearch.index.IndexService.sync(IndexService.java:716) [crate-server.jar:?]
        at org.elasticsearch.index.IndexService.maybeSyncGlobalCheckpoints(IndexService.java:696) [crate-server.jar:?]
        at org.elasticsearch.index.IndexService$AsyncGlobalCheckpointTask.runInternal(IndexService.java:869) [crate-server.jar:?]
        at org.elasticsearch.common.util.concurrent.AbstractAsyncTask.run(AbstractAsyncTask.java:143) [crate-server.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
        at java.lang.Thread.run(Thread.java:831) [?:?]
[2021-12-17T13:15:00,062][WARN ][i.c.p.p.Messages         ] [crate-dn-001] [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:15:00,125][WARN ][i.c.p.p.Messages         ] [crate-dn-001] [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:15:07,097][WARN ][i.c.p.p.Messages         ] [crate-dn-001] [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:15:41,225][WARN ][i.c.p.p.Messages         ] [crate-dn-001] [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:16:48,934][WARN ][i.n.c.DefaultChannelPipeline] [crate-dn-001] An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
org.elasticsearch.transport.NodeDisconnectedException: [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:19:11,730][WARN ][i.c.p.p.Messages         ] [crate-dn-001] [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:19:33,487][INFO ][i.c.a.HttpAuthUpstreamHandler] [crate-dn-001] password authentication failed for user=energytix from connection=/192.168.239.20
[2021-12-17T13:19:35,797][WARN ][i.n.c.DefaultChannelPipeline] [crate-dn-001] An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
org.elasticsearch.transport.NodeDisconnectedException: [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:20:46,540][WARN ][o.e.i.IndexService       ] [crate-dn-001]  [data.quarter][data.quarter][3] failed to execute global checkpoint sync
org.elasticsearch.index.shard.ShardNotInPrimaryModeException: CurrentState[STARTED] shard is not in primary mode
        at org.elasticsearch.index.shard.IndexShard.lambda$wrapPrimaryOperationPermitListener$21(IndexShard.java:2667) [crate-server.jar:?]
        at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:90) [crate-server.jar:?]
        at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:296) [crate-server.jar:?]
        at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:246) [crate-server.jar:?]
        at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:2638) [crate-server.jar:?]
        at org.elasticsearch.index.shard.IndexShard.runUnderPrimaryPermit(IndexShard.java:2718) [crate-server.jar:?]
        at org.elasticsearch.index.IndexService.sync(IndexService.java:716) [crate-server.jar:?]
        at org.elasticsearch.index.IndexService.maybeSyncGlobalCheckpoints(IndexService.java:696) [crate-server.jar:?]
        at org.elasticsearch.index.IndexService$AsyncGlobalCheckpointTask.runInternal(IndexService.java:869) [crate-server.jar:?]
        at org.elasticsearch.common.util.concurrent.AbstractAsyncTask.run(AbstractAsyncTask.java:143) [crate-server.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
        at java.lang.Thread.run(Thread.java:831) [?:?]
[2021-12-17T13:20:51,786][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:20:56,788][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:21:01,788][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:21:06,789][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:21:11,789][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:21:16,790][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:21:21,791][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:21:26,791][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:21:31,792][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:21:36,793][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:21:41,794][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:21:46,794][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:21:51,795][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:21:56,796][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:22:01,796][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:22:06,797][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:22:11,797][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:22:12,946][WARN ][i.c.p.p.Messages         ] [crate-dn-001] [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:22:16,798][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:22:21,798][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:22:26,799][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:22:31,799][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:22:36,800][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:22:41,800][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:22:46,800][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:22:51,801][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:22:56,801][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:23:01,802][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:23:06,802][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:23:11,803][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:23:16,803][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:23:21,804][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:23:26,804][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:23:31,805][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:23:36,805][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:23:41,806][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:23:46,806][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:23:51,807][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:23:56,807][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:24:01,808][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:24:06,808][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:24:11,809][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:24:16,809][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:24:21,810][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:24:26,810][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:24:31,811][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:24:36,811][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:24:41,812][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:24:46,812][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:24:51,813][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:24:56,813][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:25:01,814][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:25:06,815][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:25:11,815][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:25:16,816][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:25:21,816][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:25:26,817][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:27:16,829][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:27:21,830][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:27:26,830][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:27:31,831][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:27:36,831][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:27:41,832][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:27:46,832][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:27:51,832][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:27:56,833][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:27:59,043][WARN ][i.c.p.p.Messages         ] [crate-dn-001] [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T13:28:01,833][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:28:06,834][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:28:11,834][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:28:16,835][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:28:21,835][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:28:26,836][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T13:28:31,836][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown

Any ideas how I can get Node 01 of our CrateDB cluster to finish its graceful shutdown?

On top of that: I can’t connect to it via the crash utility (as superuser) anymore; it just hangs, stuck waiting.

  • Was the cluster in a green state before starting the graceful shutdown?
  • How full are the cluster’s disks currently?
  • How many shards do you have in the cluster in total?

On top of that: I can’t connect to it via the crash utility (as superuser) anymore; it just hangs, stuck waiting.

That is expected for that node, as the SQL services (HTTP/PostgreSQL) are shut down first.

Was the cluster in a green state before starting the graceful shutdown?

Yes, it was green, functioning normally (services writing and reading data in and out of it without any problems).

How full are the cluster’s disks currently?

Before the DECOMMISSION command, it was about 27 GB per node. Now I see that the first node’s data disk is almost empty (only 73 MB), while the others increased to about 40 GB each.

How many shards do you have in the cluster in total?

I couldn’t quickly find the SQL command for that, therefore I took a screenshot:
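(For reference, the total can also be queried from sys.shards; these are generic queries added here for illustration, not what was behind the screenshot:)

SELECT count(*) AS total_shards FROM sys.shards;
SELECT "primary", count(*) FROM sys.shards GROUP BY "primary";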

Also:

select name, connections['psql']['open'] from sys.nodes ORDER BY name limit 100;

returns:

 name         | connections['psql']['open']
--------------+-----------------------------
 crate-dn-001 |                           0
 crate-dn-002 |                           3
 crate-dn-003 |                           1
(3 rows)

It seems like there’s no psql connection to the first node. But I continue getting:

[2021-12-17T14:18:32,095][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T14:18:37,095][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T14:18:42,096][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown

And the web admin UI (on Node 02) still shows all of the nodes as up and the cluster as GREEN:

I also see the following in the logs:

[2021-12-17T14:26:17,134][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T14:26:21,717][WARN ][i.n.c.DefaultChannelPipeline] [crate-dn-001] An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
org.elasticsearch.transport.NodeDisconnectedException: [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T14:26:21,718][WARN ][i.n.c.DefaultChannelPipeline] [crate-dn-001] An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
org.elasticsearch.transport.NodeDisconnectedException: [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T14:26:21,724][WARN ][i.n.c.DefaultChannelPipeline] [crate-dn-001] An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
org.elasticsearch.transport.NodeDisconnectedException: [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T14:26:21,727][WARN ][i.n.c.DefaultChannelPipeline] [crate-dn-001] An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
org.elasticsearch.transport.NodeDisconnectedException: [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T14:26:21,728][WARN ][i.n.c.DefaultChannelPipeline] [crate-dn-001] An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
org.elasticsearch.transport.NodeDisconnectedException: [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T14:26:21,729][WARN ][i.n.c.DefaultChannelPipeline] [crate-dn-001] An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
org.elasticsearch.transport.NodeDisconnectedException: [crate-dn-001][192.168.239.30:4300][sql] disconnected
[2021-12-17T14:26:22,134][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown

It’s been about 1.5 hours since I issued the decommission command. It still says “There are still active requests on this node, delaying graceful shutdown”.

We also quit all the database viewer applications that might have had open connections to that first node, and restarted all of the data services that connect to the cluster. But the logs still say “There are still active requests on this node, delaying graceful shutdown”.

Is there a way to forcibly stop those “active requests”? (Or see what they actually are?)

Before the DECOMMISSION command, it was about 27 GB per node. Now I see that the first node’s data disk is almost empty (only 73 MB), while the others increased to about 40 GB each.

Have you set

cluster.graceful_stop.min_availability = 'full'

? It looks like the shards have been relocated to the two remaining nodes. This would be the case if the number of replicas is set to '1' (not '0-1'), or if the min availability is set to 'full'.
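(For reference, the current value can be checked at runtime with a standard sys.cluster query, the same one used later in this thread:)

SELECT settings['cluster']['graceful_stop']['min_availability'] FROM sys.cluster;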

It seems like there’s no psql connection to the first node. But I continue getting:

select name, connections['psql']['open'], connections['http']['open'] from sys.nodes ORDER BY name limit 100;

There might also be open HTTP connections (like the ones used by crash). However, I would think this is related to finishing up the relocation of shards and open transport connections.

I also see the following in the logs:

org.elasticsearch.transport.NodeDisconnectedException: [crate-dn-001][192.168.239.30:4300][sql] disconnected

Those are to be expected, as the SQL service is blocking any new incoming connections from your LB.


It’s been about 1.5 hours since I issued the decommission command. It still says “There are still active requests on this node, delaying graceful shutdown”.

The default timeout for a graceful shutdown to fail is 2 hours. By default, the node would then go back into normal operation.
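(Both of these settings can be changed at runtime; a sketch, with '4h' as an example value:)

SET GLOBAL TRANSIENT "cluster.graceful_stop.timeout" = '4h';
-- or force the node to shut down once the timeout is hit, instead of resuming operation:
SET GLOBAL TRANSIENT "cluster.graceful_stop.force" = true;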

Is there a way to forcibly stop those “active requests”? (Or see what they actually are?)

If you check

SELECT * FROM sys.shards WHERE routing_state <> 'STARTED';

and 0 rows are returned, you can also terminate the crate process.
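(To see what the active requests actually are, sys.jobs lists the statements currently running on the cluster; a long-running job can then be cancelled by its id. The id below is a placeholder:)

SELECT id, started, stmt FROM sys.jobs;
KILL '<job-id>'; -- placeholder id, taken from the query above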

I didn’t change the cluster.graceful_stop.min_availability setting, so I think it’s at the default of primaries?

On the second node, I ran:

SELECT * FROM sys.shards WHERE routing_state <> 'STARTED';

It didn’t return any rows.

The default timeout for a graceful shutdown to fail is 2 hours

According to the logs, the DECOMMISSION started around 13:09:06 UTC:

[2021-12-17T13:09:06,119][INFO ][o.e.c.s.ClusterSettings  ] [crate-dn-001] updating [crate.internal.decommission.] from [{}] to [{"FM5C1UbYRJSyPnRQytu25Q":"true"}]

Right now the time is about 15:14:02, so more than 2 hours have passed, and I still get the same messages:

[2021-12-17T15:14:42,371][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown
[2021-12-17T15:14:47,372][INFO ][i.c.c.g.DecommissioningService] [crate-dn-001] There are still active requests on this node, delaying graceful shutdown

Based on your comments so far, I’ll run the following:

sudo systemctl stop crate

and then

sudo systemctl start crate

hoping it’ll restart properly.

To finalize this thread: I stopped and started CrateDB on the first node, waited for it to rejoin the cluster, and then repeated the whole process, this time with a successful graceful shutdown. Then I upgraded the CrateDB Debian packages and repeated the same steps on the other nodes, finishing the cluster upgrade in about 20 minutes.
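(For reference, a condensed sketch of the per-node sequence that worked; re-enabling allocation after the node rejoins follows the standard rolling-upgrade procedure and is not quoted from this thread:)

SET GLOBAL TRANSIENT "cluster.routing.allocation.enable" = 'new_primaries';
ALTER CLUSTER DECOMMISSION 'crate-dn-001'; -- wait for the ALTER OK response (several minutes)
-- upgrade the package and start the node again (e.g. via systemctl),
-- wait for it to rejoin and the cluster to turn green, then:
SET GLOBAL TRANSIENT "cluster.routing.allocation.enable" = 'all';
-- repeat for the next node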

I think the reason this process (graceful shutdown on the first node) was stuck waiting is:

  • After issuing the DECOMMISSION command via crash on the first node, I waited for about 1 minute and then exited the crash command-line utility with Ctrl-C.
  • I exited (Ctrl-C) because I thought it was safe to do so, and because I expected the decommission process to have finished by then. Apparently, it takes more than a few minutes!
  • In the successful case, it took about 7-8 minutes between issuing ALTER CLUSTER DECOMMISSION 'whatever-node-name-... ; and receiving the ALTER OK, 1 row affected message.

During that process, min_availability was PRIMARIES (I never changed that):

 select settings['cluster']['graceful_stop']['min_availability'] from sys.cluster limit 100;
 settings['cluster']['graceful_stop']['min_availability']
----------------------------------------------------------
 PRIMARIES
(1 row)

And the largest time series table had '0-1' as the number of replicas:

 number_of_replicas = '0-1',
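(The per-table replica setting can be checked like this; the 'data' schema name is taken from the table names in the logs above:)

SELECT table_schema, table_name, number_of_replicas
FROM information_schema.tables
WHERE table_schema = 'data';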