How to handle "There are still active requests on this node, delaying graceful shutdown"?

To wrap up this thread: I stopped and started CrateDB on the first node, waited for it to rejoin the cluster, and then repeated the whole process, this time with a successful graceful shutdown. I then upgraded the CrateDB Debian packages and did the same for the other nodes, finishing the cluster upgrade in about 20 minutes.

I think the graceful shutdown on the first node got stuck waiting for the following reason:

  • After issuing the DECOMMISSION command via crash on the first node, I waited about 1 minute and then exited the crash command-line utility with Ctrl-C.
  • I exited because I assumed it was safe to do so, and because I expected the decommission to have finished by then. Apparently, it takes more than a few minutes!
  • In the successful case, about 7-8 minutes passed between issuing ALTER CLUSTER DECOMMISSION 'whatever-node-name-... ; and receiving the ALTER OK, 1 row affected message.
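
For anyone who wants to watch the progress instead of guessing when it is safe to leave crash, here is a rough sketch of what could be run from a second crash session on another node while the DECOMMISSION statement is still waiting. The node name 'node-to-decommission' is a placeholder of mine, not a real name from my cluster, and sys.jobs only shows something when job statistics are enabled:

```sql
-- Shards still located on the node being decommissioned, split by
-- primary/replica. 'node-to-decommission' is a placeholder node name.
SELECT "primary", state, count(*) AS shards
FROM sys.shards
WHERE node['name'] = 'node-to-decommission'
GROUP BY "primary", state
ORDER BY "primary", state;

-- Statements currently running in the cluster, oldest first
-- (assumes job statistics are enabled).
SELECT id, started, stmt
FROM sys.jobs
ORDER BY started;
```

If the shard counts for that node keep dropping, the decommission is most likely still making progress and just needs more time.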

During that process, min_availability was PRIMARIES (I never changed that):

 select settings['cluster']['graceful_stop']['min_availability'] from sys.cluster limit 100;
 settings['cluster']['graceful_stop']['min_availability']
----------------------------------------------------------
 PRIMARIES
(1 row)
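
As far as I know, min_availability is only one of the graceful stop settings, next to timeout and force. A sketch of a query that shows all three at once (the column aliases are just names I picked):

```sql
-- All graceful stop settings in one row; aliases are illustrative only.
SELECT
  settings['cluster']['graceful_stop']['min_availability'] AS min_availability,
  settings['cluster']['graceful_stop']['timeout'] AS graceful_timeout,
  settings['cluster']['graceful_stop']['force'] AS force_stop
FROM sys.cluster;
```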

And the largest time series table had '0-1' as the number of replicas:

 number_of_replicas = '0-1',
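
To check the replica configuration of all tables rather than just one, something along these lines should work. This is a sketch that assumes the system schemas are the only ones to exclude:

```sql
-- Shard and replica configuration per user table.
SELECT table_schema, table_name, number_of_shards, number_of_replicas
FROM information_schema.tables
WHERE table_schema NOT IN ('sys', 'information_schema', 'pg_catalog')
ORDER BY table_schema, table_name;
```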