3 Node Cluster Not Working

We used to have a 3-node Crate cluster running Crate v3.
However, since upgrading to Crate v4, the cluster can no longer be formed.

Each node insists on becoming the master of its own single-node cluster, and none of them loads the data that we previously had stored in /var/lib/crate, even though the “nodes/” folder is still there.

ach@smartvalve02:/var/log/crate$ du -sh /var/lib/crate/
4.3G	/var/lib/crate/
ach@smartvalve04:/var/log/crate$ du -sh /var/lib/crate/
64G	/var/lib/crate/
ach@smartvalve05:/var/log/crate$ du -sh /var/lib/crate/
90G	/var/lib/crate/

CrateDB was installed via ppa on Ubuntu:

Package: crate
Version: 4.1.1-1~bionic
Priority: extra
Section: net
Maintainer: CRATE Technology GmbH <team@crate.io>
Installed-Size: 65.9 MB
Depends: default-jre-headless (>= 11), adduser
Homepage: https://crate.io/
Download-Size: 53.8 MB
APT-Manual-Installed: yes
APT-Sources: https://cdn.crate.io/downloads/deb/stable bionic/main amd64 Packages
Description: The fast, scalable, easy to use SQL database
Crate.io has built a new breed of database to serve today’s mammoth data needs.
Based on the familiar SQL syntax, CrateDB combines high availability, resiliency,
and scalability in a distributed design that allows you to query mountains of
data in realtime, not batches. We solve your data scaling problems and make
administration a breeze. Easy to scale, simple to use.

The 3 nodes in the cluster have the following IPs:

172.18.252.26 (smartvalve02)
172.18.252.28 (smartvalve04)
172.18.252.29 (smartvalve05)

ach@smartvalve02:/var/log/crate$ ifconfig 
ens160: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
    inet 172.18.252.26  netmask 255.255.255.0  broadcast 172.18.252.255
    inet6 fe80::20c:29ff:fe71:7308  prefixlen 64  scopeid 0x20<link>
    ether 00:0c:29:71:73:08  txqueuelen 1000  (Ethernet)
    RX packets 368624  bytes 40810982 (40.8 MB)
    RX errors 0  dropped 131519  overruns 0  frame 0
    TX packets 93100  bytes 96878675 (96.8 MB)
    TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
    inet 127.0.0.1  netmask 255.0.0.0
    inet6 ::1  prefixlen 128  scopeid 0x10<host>
    loop  txqueuelen 1000  (Local Loopback)
    RX packets 2546  bytes 204048 (204.0 KB)
    RX errors 0  dropped 0  overruns 0  frame 0
    TX packets 2546  bytes 204048 (204.0 KB)
    TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ach@smartvalve04:/var/log/crate$ ifconfig 
ens160: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
    inet 172.18.252.28  netmask 255.255.255.0  broadcast 172.18.252.255
    inet6 fe80::20c:29ff:fea9:ff54  prefixlen 64  scopeid 0x20<link>
    ether 00:0c:29:a9:ff:54  txqueuelen 1000  (Ethernet)
    RX packets 656318  bytes 565698242 (565.6 MB)
    RX errors 0  dropped 131552  overruns 0  frame 0
    TX packets 246863  bytes 36173729 (36.1 MB)
    TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
    inet 127.0.0.1  netmask 255.0.0.0
    inet6 ::1  prefixlen 128  scopeid 0x10<host>
    loop  txqueuelen 1000  (Local Loopback)
    RX packets 1213  bytes 97955 (97.9 KB)
    RX errors 0  dropped 0  overruns 0  frame 0
    TX packets 1213  bytes 97955 (97.9 KB)
    TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ach@smartvalve05:/var/log/crate$ ifconfig 
ens160: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
    inet 172.18.252.29  netmask 255.255.255.0  broadcast 172.18.252.255
    inet6 fe80::20c:29ff:fecf:50f1  prefixlen 64  scopeid 0x20<link>
    ether 00:0c:29:cf:50:f1  txqueuelen 1000  (Ethernet)
    RX packets 467113  bytes 292046714 (292.0 MB)
    RX errors 0  dropped 131546  overruns 0  frame 0
    TX packets 153541  bytes 21338417 (21.3 MB)
    TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
    inet 127.0.0.1  netmask 255.0.0.0
    inet6 ::1  prefixlen 128  scopeid 0x10<host>
    loop  txqueuelen 1000  (Local Loopback)
    RX packets 941  bytes 76172 (76.1 KB)
    RX errors 0  dropped 0  overruns 0  frame 0
    TX packets 941  bytes 76172 (76.1 KB)
    TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

The configuration file crate.yml, located in /etc/crate/, is as follows:

network.host: 172.18.252.26
network.publish_host: 172.18.252.26
transport.publish_port: 4300
discovery.seed_hosts:
  - 172.18.252.26:4300
  - 172.18.252.28:4300
  - 172.18.252.29:4300
cluster.initial_master_nodes:
  - 172.18.252.26
  - 172.18.252.28
  - 172.18.252.29
gateway:
  recover_after_nodes: 3
  recover_after_time: 1m
  expected_nodes: 3
auth.host_based.enabled: false
cluster.name: smartvalve
node.name: node1

The entries network.host, network.publish_host, and node.name are of course changed for each node to reflect that node’s IP and particular name. The cluster.name, discovery.seed_hosts, etc. are left the same for all nodes.
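
To make the per-node differences explicit, here is a minimal Python sketch that renders the crate.yml above for each node. The node-name-to-IP mapping is taken from this post; the helper name render_crate_yml and the NODES dictionary are only illustrative, not part of CrateDB.

```python
from string import Template

# Template of the crate.yml shown above. Only $ip and $name differ per node;
# everything else is shared by all three nodes.
CRATE_YML = Template("""\
network.host: $ip
network.publish_host: $ip
transport.publish_port: 4300
discovery.seed_hosts:
$seed_hosts
cluster.initial_master_nodes:
$master_nodes
gateway:
  recover_after_nodes: $count
  recover_after_time: 1m
  expected_nodes: $count
auth.host_based.enabled: false
cluster.name: smartvalve
node.name: $name
""")

# node.name -> IP address, as listed in this post.
NODES = {
    "node1": "172.18.252.26",
    "node2": "172.18.252.28",
    "node3": "172.18.252.29",
}


def render_crate_yml(name: str, nodes: dict = NODES) -> str:
    """Return the crate.yml text for the node called `name` (hypothetical helper)."""
    ips = list(nodes.values())
    return CRATE_YML.substitute(
        ip=nodes[name],
        name=name,
        count=len(ips),
        seed_hosts="\n".join(f"  - {ip}:4300" for ip in ips),
        master_nodes="\n".join(f"  - {ip}" for ip in ips),
    )


if __name__ == "__main__":
    for node_name in NODES:
        print(f"# --- {node_name} ---")
        print(render_crate_yml(node_name))
```

Rendering all three files from one template makes it easy to diff them and confirm that only network.host, network.publish_host, and node.name differ.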

After starting Crate with this configuration, the listening ports and open connections are as follows:

ach@smartvalve02:/var/log/crate$ sudo lsof -i -P -n
COMMAND     PID            USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
systemd-r   876 systemd-resolve   12u  IPv4  17299      0t0  UDP 127.0.0.53:53 
systemd-r   876 systemd-resolve   13u  IPv4  17300      0t0  TCP 127.0.0.53:53 (LISTEN)
sshd       1131            root    3u  IPv4  18107      0t0  TCP *:22 (LISTEN)
sshd       1131            root    4u  IPv6  18109      0t0  TCP *:22 (LISTEN)
sshd       2463            root    3u  IPv4  27181      0t0  TCP 172.18.252.26:22->172.18.252.111:46560 (ESTABLISHED)
sshd       2545             ach    3u  IPv4  27181      0t0  TCP 172.18.252.26:22->172.18.252.111:46560 (ESTABLISHED)
sshd       4837            root    3u  IPv4  47086      0t0  TCP 172.18.252.26:22->172.18.252.111:47430 (ESTABLISHED)
sshd       4952             ach    3u  IPv4  47086      0t0  TCP 172.18.252.26:22->172.18.252.111:47430 (ESTABLISHED)
sshd       5296            root    3u  IPv4  52014      0t0  TCP 172.18.252.26:22->172.18.252.111:47440 (ESTABLISHED)
sshd       5378             ach    3u  IPv4  52014      0t0  TCP 172.18.252.26:22->172.18.252.111:47440 (ESTABLISHED)
master    16084            root   13u  IPv4 307955      0t0  TCP *:25 (LISTEN)
master    16084            root   14u  IPv6 307956      0t0  TCP *:25 (LISTEN)    
java      19497           crate  130u  IPv6 332534      0t0  TCP 172.18.252.26:5432 (LISTEN)
java      19497           crate  159u  IPv6 332552      0t0  TCP 172.18.252.26:4200 (LISTEN)
java      19497           crate  208u  IPv6 332569      0t0  TCP 172.18.252.26:4300 (LISTEN)
java      19497           crate  210u  IPv6 332620      0t0  TCP 172.18.252.26:4300->172.18.252.28:54318 (ESTABLISHED)
java      19497           crate  211u  IPv6 332621      0t0  TCP 172.18.252.26:4300->172.18.252.28:54324 (ESTABLISHED)
java      19497           crate  212u  IPv6 332622      0t0  TCP 172.18.252.26:4300->172.18.252.28:54326 (ESTABLISHED)
java      19497           crate  213u  IPv6 332623      0t0  TCP 172.18.252.26:4300->172.18.252.28:54330 (ESTABLISHED)
java      19497           crate  214u  IPv6 332587      0t0  TCP 172.18.252.26:4300->172.18.252.29:45216 (ESTABLISHED)
java      19497           crate  215u  IPv6 332624      0t0  TCP 172.18.252.26:4300->172.18.252.28:54334 (ESTABLISHED)
java      19497           crate  216u  IPv6 332625      0t0  TCP 172.18.252.26:4300->172.18.252.28:54338 (ESTABLISHED)
java      19497           crate  217u  IPv6 338026      0t0  TCP 172.18.252.26:4300->172.18.252.28:54344 (ESTABLISHED)
java      19497           crate  218u  IPv6 338027      0t0  TCP 172.18.252.26:4300->172.18.252.28:54342 (ESTABLISHED)
java      19497           crate  219u  IPv6 338028      0t0  TCP 172.18.252.26:4300->172.18.252.28:54346 (ESTABLISHED)
java      19497           crate  220u  IPv6 338029      0t0  TCP 172.18.252.26:4300->172.18.252.28:54350 (ESTABLISHED)
java      19497           crate  221u  IPv6 338030      0t0  TCP 172.18.252.26:4300->172.18.252.28:54354 (ESTABLISHED)
java      19497           crate  222u  IPv6 338031      0t0  TCP 172.18.252.26:4300->172.18.252.28:54358 (ESTABLISHED)
java      19497           crate  223u  IPv6 337837      0t0  TCP 172.18.252.26:4200->172.18.252.111:47964 (ESTABLISHED)
java      19497           crate  225u  IPv6 337467      0t0  TCP 172.18.252.26:4300->172.18.252.29:45174 (ESTABLISHED)
java      19497           crate  226u  IPv6 337468      0t0  TCP 172.18.252.26:4300->172.18.252.29:45178 (ESTABLISHED)
java      19497           crate  227u  IPv6 337469      0t0  TCP 172.18.252.26:4300->172.18.252.29:45186 (ESTABLISHED)
java      19497           crate  228u  IPv6 337470      0t0  TCP 172.18.252.26:4300->172.18.252.29:45190 (ESTABLISHED)
java      19497           crate  229u  IPv6 332580      0t0  TCP 172.18.252.26:4300->172.18.252.29:45194 (ESTABLISHED)
java      19497           crate  230u  IPv6 332581      0t0  TCP 172.18.252.26:4300->172.18.252.29:45196 (ESTABLISHED)
java      19497           crate  231u  IPv6 332582      0t0  TCP 172.18.252.26:4300->172.18.252.29:45202 (ESTABLISHED)
java      19497           crate  232u  IPv6 332583      0t0  TCP 172.18.252.26:4300->172.18.252.29:45200 (ESTABLISHED)
java      19497           crate  233u  IPv6 332584      0t0  TCP 172.18.252.26:4300->172.18.252.29:45206 (ESTABLISHED)
java      19497           crate  234u  IPv6 332585      0t0  TCP 172.18.252.26:4300->172.18.252.29:45210 (ESTABLISHED)
java      19497           crate  235u  IPv6 332586      0t0  TCP 172.18.252.26:4300->172.18.252.29:45208 (ESTABLISHED)
java      19497           crate  236u  IPv6 332588      0t0  TCP 172.18.252.26:4300->172.18.252.29:45220 (ESTABLISHED)
java      19497           crate  237u  IPv6 338032      0t0  TCP 172.18.252.26:4300->172.18.252.28:54364 (ESTABLISHED)
java      19497           crate  240u  IPv6 337474      0t0  TCP 172.18.252.26:4200->172.18.252.111:47798 (ESTABLISHED)
java      19497           crate  241u  IPv6 337477      0t0  TCP 172.18.252.26:4200->172.18.252.111:47808 (ESTABLISHED)
ach@smartvalve02:/var/log/crate$ sudo netstat -tulpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 127.0.0.53:53           0.0.0.0:*               LISTEN      876/systemd-resolve 
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      1131/sshd           
tcp        0      0 0.0.0.0:25              0.0.0.0:*               LISTEN      16084/master        
tcp6       0      0 172.18.252.26:4200      :::*                    LISTEN      19497/java          
tcp6       0      0 172.18.252.26:4300      :::*                    LISTEN      19497/java          
tcp6       0      0 :::22                   :::*                    LISTEN      1131/sshd           
tcp6       0      0 172.18.252.26:5432      :::*                    LISTEN      19497/java          
tcp6       0      0 :::25                   :::*                    LISTEN      16084/master        
udp        0      0 127.0.0.53:53           0.0.0.0:*                           876/systemd-resolve 

As you can see, Crate on 172.18.252.26 talks to the other 2 nodes (172.18.252.28 and 172.18.252.29) on port 4300, over sockets that lsof reports as IPv6. However, when I load the Admin UI of each node, each one reports the cluster name smartvalve but shows itself as the only node in the cluster, and none of them picks up the data in /var/lib/crate.
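
The same check can be done without the Admin UI. The sketch below asks each node, over its HTTP endpoint on port 4200, which cluster id it reports and which nodes it sees. It assumes the /_sql endpoint is reachable from where the script runs and that sys.cluster exposes id and name columns; the function names are illustrative only.

```python
import json
import urllib.request

# IPs of the three nodes, as listed in this post.
NODES = ["172.18.252.26", "172.18.252.28", "172.18.252.29"]


def run_sql(host: str, stmt: str, port: int = 4200, timeout: float = 5.0) -> list:
    """POST one SQL statement to a node's HTTP endpoint and return the result rows."""
    request = urllib.request.Request(
        f"http://{host}:{port}/_sql",
        data=json.dumps({"stmt": stmt}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["rows"]


def cluster_view(host: str) -> dict:
    """Collect the cluster id, cluster name and member names as seen by one node."""
    cluster_id, cluster_name = run_sql(host, "SELECT id, name FROM sys.cluster")[0]
    members = sorted(row[0] for row in run_sql(host, "SELECT name FROM sys.nodes"))
    return {"cluster_id": cluster_id, "cluster_name": cluster_name, "members": members}


def summarize(views: dict) -> str:
    """Say whether the nodes agree on a single cluster id."""
    distinct_ids = {view["cluster_id"] for view in views.values()}
    if len(distinct_ids) == 1:
        return "all nodes report the same cluster id"
    return f"{len(distinct_ids)} distinct cluster ids: the nodes are in separate clusters"


# Example (needs network access to the nodes):
#   views = {host: cluster_view(host) for host in NODES}
#   for host, view in views.items():
#       print(host, view)
#   print(summarize(views))
```

If every node reports the same cluster name but a different cluster id and only itself as a member, that matches what the Admin UI shows here.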

I thought this was because of IPv6, so I opened the file /etc/default/crate
and added:
CRATE_USE_IPV4=true

This did force Crate to use IPv4 instead of IPv6 (as verified with the same commands as above). However, the nodes are still unable to see each other.

The log written when the Crate service starts is as follows:

[2020-02-17T14:17:25,244][INFO ][o.e.e.NodeEnvironment    ] [node1] using [1] data paths, mounts [[/ (/dev/sda2)]], net usable_space [360.6gb], net total_space [393.6gb], types [ext4]
[2020-02-17T14:17:25,253][INFO ][o.e.e.NodeEnvironment    ] [node1] heap size [20gb], compressed ordinary object pointers [true]
[2020-02-17T14:17:25,273][INFO ][o.e.n.Node               ] [node1] node name [node1], node ID [ISvG54peS42ip7QUHiO0Mg]
[2020-02-17T14:17:25,274][INFO ][o.e.n.Node               ] [node1] version[4.1.1], pid[19497], build[95e20da/2020-01-30T16:22:05Z], OS[Linux/4.15.0-76-generic/amd64], JVM[Ubuntu/OpenJDK 64-Bit Server VM/11.0.6/11.0.6+10-post-Ubuntu-1ubuntu118.04.1]
[2020-02-17T14:17:25,467][INFO ][i.c.plugin               ] [node1] plugins loaded: [enterpriseFunctions, lang-js, jmx-monitoring] 
[2020-02-17T14:17:26,236][INFO ][o.e.p.PluginsService     ] [node1] no modules loaded
[2020-02-17T14:17:26,240][INFO ][o.e.p.PluginsService     ] [node1] loaded plugin [crate-azure-discovery]
[2020-02-17T14:17:26,241][INFO ][o.e.p.PluginsService     ] [node1] loaded plugin [es-repository-hdfs]
[2020-02-17T14:17:26,241][INFO ][o.e.p.PluginsService     ] [node1] loaded plugin [io.crate.plugin.BlobPlugin]
[2020-02-17T14:17:26,241][INFO ][o.e.p.PluginsService     ] [node1] loaded plugin [io.crate.plugin.CrateCommonPlugin]
[2020-02-17T14:17:26,241][INFO ][o.e.p.PluginsService     ] [node1] loaded plugin [io.crate.plugin.HttpTransportPlugin]
[2020-02-17T14:17:26,241][INFO ][o.e.p.PluginsService     ] [node1] loaded plugin [io.crate.plugin.PluginLoaderPlugin]
[2020-02-17T14:17:26,242][INFO ][o.e.p.PluginsService     ] [node1] loaded plugin [io.crate.plugin.SrvPlugin]
[2020-02-17T14:17:26,242][INFO ][o.e.p.PluginsService     ] [node1] loaded plugin [io.crate.udc.plugin.UDCPlugin]
[2020-02-17T14:17:26,242][INFO ][o.e.p.PluginsService     ] [node1] loaded plugin [org.elasticsearch.analysis.common.CommonAnalysisPlugin]
[2020-02-17T14:17:26,242][INFO ][o.e.p.PluginsService     ] [node1] loaded plugin [org.elasticsearch.discovery.ec2.Ec2DiscoveryPlugin]
[2020-02-17T14:17:26,242][INFO ][o.e.p.PluginsService     ] [node1] loaded plugin [org.elasticsearch.plugin.analysis.AnalysisPhoneticPlugin]
[2020-02-17T14:17:26,243][INFO ][o.e.p.PluginsService     ] [node1] loaded plugin [org.elasticsearch.plugin.repository.url.URLRepositoryPlugin]
[2020-02-17T14:17:26,243][INFO ][o.e.p.PluginsService     ] [node1] loaded plugin [org.elasticsearch.repositories.azure.AzureRepositoryPlugin]
[2020-02-17T14:17:26,243][INFO ][o.e.p.PluginsService     ] [node1] loaded plugin [org.elasticsearch.repositories.s3.S3RepositoryPlugin]
[2020-02-17T14:17:26,243][INFO ][o.e.p.PluginsService     ] [node1] loaded plugin [org.elasticsearch.transport.Netty4Plugin]
[2020-02-17T14:17:26,846][INFO ][o.e.d.DiscoveryModule    ] [node1] using discovery type [zen] and seed hosts providers [settings]
[2020-02-17T14:17:27,404][INFO ][psql                     ] [node1] PSQL SSL support is disabled.
[2020-02-17T14:17:27,506][INFO ][i.c.p.PipelineRegistry   ] [node1] HTTP SSL support is disabled.
[2020-02-17T14:17:27,548][INFO ][o.e.n.Node               ] [node1] initialized
[2020-02-17T14:17:27,549][INFO ][o.e.n.Node               ] [node1] starting ...
[2020-02-17T14:17:27,690][INFO ][psql                     ] [node1] publish_address {172.18.252.26:5432}, bound_addresses {172.18.252.26:5432}
[2020-02-17T14:17:27,703][INFO ][i.c.p.h.CrateNettyHttpServerTransport] [node1] publish_address {172.18.252.26:4200}, bound_addresses {172.18.252.26:4200}
[2020-02-17T14:17:27,715][INFO ][o.e.t.TransportService   ] [node1] publish_address {172.18.252.26:4300}, bound_addresses {172.18.252.26:4300}
[2020-02-17T14:17:27,719][INFO ][o.e.b.BootstrapChecks    ] [node1] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2020-02-17T14:17:27,844][INFO ][o.e.c.s.MasterService    ] [node1] elected-as-master ([1] nodes joined)[{node1}{ISvG54peS42ip7QUHiO0Mg}{s6QTJjnoQQ-LQQtM1XDSrA}{172.18.252.26}{172.18.252.26:4300}{http_address=172.18.252.26:4200} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_], term: 31, version: 1038418, reason: master node changed {previous [], current [{node1}{ISvG54peS42ip7QUHiO0Mg}{s6QTJjnoQQ-LQQtM1XDSrA}{172.18.252.26}{172.18.252.26:4300}{http_address=172.18.252.26:4200}]}
[2020-02-17T14:17:28,017][INFO ][o.e.c.s.ClusterApplierService] [node1] master node changed {previous [], current [{node1}{ISvG54peS42ip7QUHiO0Mg}{s6QTJjnoQQ-LQQtM1XDSrA}{172.18.252.26}{172.18.252.26:4300}{http_address=172.18.252.26:4200}]}, term: 31, version: 1038418, reason: Publication{term=31, version=1038418}
[2020-02-17T14:17:28,026][INFO ][o.e.n.Node               ] [node1] started

node2 behaves the same way and becomes the master of its own cluster:

[2020-02-17T14:19:19,362][INFO ][o.e.c.s.MasterService    ] [node2] elected-as-master ([1] nodes joined)[{node2}{na73zHayR2K5b8RCIl3_VQ}{ldyMyUovS4qqUmhDl12yeQ}{172.18.252.28}{172.18.252.28:4300}{http_address=172.18.252.28:4200} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_], term: 8, version: 27427850, reason: master node changed {previous [], current [{node2}{na73zHayR2K5b8RCIl3_VQ}{ldyMyUovS4qqUmhDl12yeQ}{172.18.252.28}{172.18.252.28:4300}{http_address=172.18.252.28:4200}]}
[2020-02-17T14:19:19,549][INFO ][o.e.c.s.ClusterApplierService] [node2] master node changed {previous [], current [{node2}{na73zHayR2K5b8RCIl3_VQ}{ldyMyUovS4qqUmhDl12yeQ}{172.18.252.28}{172.18.252.28:4300}{http_address=172.18.252.28:4200}]}, term: 8, version: 27427850, reason: Publication{term=8, version=27427850}
[2020-02-17T14:19:19,559][INFO ][o.e.n.Node               ] [node2] started

And node3 also becomes the master of its own cluster:

[2020-02-17T14:17:35,903][INFO ][o.e.c.s.MasterService    ] [node3] elected-as-master ([1] nodes joined)[{node3}{rv8fGDeCSz6fju3-5wO85A}{fAN0wzbJQSiF4H4koBlfog}{172.18.252.29}{172.18.252.29:4300}{http_address=172.18.252.29:4200} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_], term: 5, version: 27249552, reason: master node changed {previous [], current [{node3}{rv8fGDeCSz6fju3-5wO85A}{fAN0wzbJQSiF4H4koBlfog}{172.18.252.29}{172.18.252.29:4300}{http_address=172.18.252.29:4200}]}
[2020-02-17T14:17:36,082][INFO ][o.e.c.s.ClusterApplierService] [node3] master node changed {previous [], current [{node3}{rv8fGDeCSz6fju3-5wO85A}{fAN0wzbJQSiF4H4koBlfog}{172.18.252.29}{172.18.252.29:4300}{http_address=172.18.252.29:4200}]}, term: 5, version: 27249552, reason: Publication{term=5, version=27249552}
[2020-02-17T14:17:36,091][INFO ][o.e.n.Node               ] [node3] started
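
To compare the three election messages side by side, here is a small Python sketch that pulls the node name, the number of nodes that joined, the term, and the cluster state version out of "elected-as-master" lines like the ones above. The regular expression is written against the log format shown in this post only; parse_elections is an illustrative name.

```python
import re

# Matches lines such as:
#   ... [node1] elected-as-master ([1] nodes joined)[...], term: 31, version: 1038418, ...
ELECTION_RE = re.compile(
    r"\[(?P<node>[^\]]+)\] elected-as-master "
    r"\(\[(?P<joined>\d+)\] nodes joined\)"
    r".*?term: (?P<term>\d+), version: (?P<version>\d+)"
)


def parse_elections(lines):
    """Return one dict per 'elected-as-master' line found in `lines`."""
    elections = []
    for line in lines:
        match = ELECTION_RE.search(line)
        if match is None:
            continue  # not an election message
        elections.append(
            {
                "node": match.group("node"),
                "nodes_joined": int(match.group("joined")),
                "term": int(match.group("term")),
                "version": int(match.group("version")),
            }
        )
    return elections


# Example: pass an open log file (the path depends on the installation):
#   with open(path_to_crate_log) as log:
#       for election in parse_elections(log):
#           print(election)
```

In the excerpts above, every node reports "[1] nodes joined", and each one has its own term and version (31, 8, and 5), so each node is electing itself from its own separately persisted cluster state.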

I tried changing crate.yml on all nodes so that only node1 is listed as an initial master node:

    cluster.initial_master_nodes:
      - 172.18.252.26

However, this didn’t change anything.

This is one of the references that I followed:
https://crate.io/docs/crate/guide/en/latest/scaling/multi-node-setup.html

Please let me know if I have misconfigured something. The cluster was working perfectly right up until the upgrade from Crate v3 to Crate v4.

Hi, CrateDB 4.x changed how node discovery works. The article you linked shows how to configure discovery for CrateDB 4.x: https://crate.io/docs/crate/guide/en/latest/scaling/multi-node-setup.html#node-discovery

Have you tried making the appropriate changes to your node configuration files?

Hi,

Yes, I know the discovery mode changed in 4.x; that is why I wrote towards the end of my post:

“This is one of the references that I followed:
https://crate.io/docs/crate/guide/en/latest/scaling/multi-node-setup.html”

In your reply you sent me the same link…

“Have you tried making the appropriate changes to your node configuration files?”
Yes… the configuration I used is in the post above…