Is it possible to have zero-downtime major version upgrades?


In your documentation at Full restart upgrade — CrateDB: How-Tos I see that upgrading to a new major version (feature version) requires stopping all the CrateDB nodes, upgrading each one and restarting them.

I wonder if it’s possible to do a zero-downtime upgrade, e.g. by first setting up another cluster as a replica of the current cluster, redirecting traffic to it via a load balancer while the ‘main’ cluster is upgraded, and then redirecting the traffic back to the upgraded cluster once it has caught up with the other cluster (in case new data arrived while the ‘main’ cluster was down).

Do you have any plans to support something like this out of the box?

Complete cluster restarts are typically only needed with changes in the underlying storage engine. This is typically done together with a “Major” version (i.e. the next major would be CrateDB 5.0.0). The last major version upgrade 4.0.0 happened in June 2019. The next major version is planned for 2022.

“Minor” version upgrades (e.g. 4.1, 4.2, …) that include new features can be done as rolling upgrades.

Logical replication (similar to Postgres) will be part of CrateDB 4.7. However, I am not quite sure how this would help with upgrades.

What we sometimes see (especially with cloud infrastructure) is that customers start a second cluster and restore a recent snapshot there. With a message queue like Kafka/Event Hubs/Kinesis, one would create a second consumer group that feeds data into the second cluster; switching the load balancer over to it then allows what is basically a zero-downtime major upgrade.
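
To make the second-cluster approach more concrete, here is a minimal in-memory sketch in Python. It is not CrateDB or Kafka code: `Cluster`, `ConsumerGroup`, the `log` list and the cluster names are hypothetical stand-ins for the two clusters, the consumer groups and the message queue. It also assumes that writes are idempotent and that you know which queue offset the snapshot covers; in a real setup you would probably start the second consumer group a little before the snapshot time and rely on idempotent upserts.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical stand-in for a database cluster: rows keyed by event id."""
    name: str
    rows: dict = field(default_factory=dict)

    def apply(self, event):
        # Idempotent upsert, so replaying an event twice is harmless.
        self.rows[event["id"]] = event["value"]


@dataclass
class ConsumerGroup:
    """Tracks one independent read position (offset) in the shared log."""
    cluster: Cluster
    offset: int = 0

    def poll(self, log):
        # Apply every event this group has not consumed yet.
        for event in log[self.offset:]:
            self.cluster.apply(event)
        self.offset = len(log)

    def lag(self, log):
        # Number of events in the log that this group has not applied yet.
        return len(log) - self.offset


log = []  # stand-in for a Kafka topic / Event Hubs / Kinesis stream

old = Cluster("old-cluster")
old_group = ConsumerGroup(old)

# Normal operation: events arrive and the old cluster consumes them.
log += [{"id": i, "value": i * 10} for i in range(5)]
old_group.poll(log)

# 1. Snapshot the old cluster and remember the log offset it covers.
snapshot_rows, snapshot_offset = dict(old.rows), old_group.offset

# 2. Restore the snapshot into a second cluster running the new version.
new = Cluster("new-cluster", rows=dict(snapshot_rows))

# 3. A second consumer group starts reading at the snapshot offset.
new_group = ConsumerGroup(new, offset=snapshot_offset)

# More events arrive while the new cluster is being prepared.
log += [{"id": i, "value": i * 10} for i in range(5, 8)]
old_group.poll(log)
new_group.poll(log)

# 4. Switch the load balancer only once the new cluster has caught up.
active = new if new_group.lag(log) == 0 else old
print(f"load balancer now points at: {active.name}")
```

The point of the second consumer group is that both clusters read the same stream independently, so the old cluster keeps serving traffic until the new one has no lag left.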

Do you have any plans to support something like this out of the box?

We don’t really see a short downtime every 2-3 years as a big issue right now, and as mentioned above, for use cases that really need it, it can already be achieved with a second cluster for many workloads.