Docker Swarm cluster

Hello,

I’m trying to set up a Crate cluster on Docker Swarm with replication.
I followed the Docker Compose template provided in the docs and everything works; however, all the nodes are “statically” declared. What I would like to do is use Swarm’s service replication (replicas) instead. My idea was that I could list all the future nodes in the seed_hosts and initial_master_nodes settings, then scale the service up/down as needed.

Here is the template I am currently using:

version: '3.5'
services:
  node:
    image: crate:latest
    ports:
      - "4200:4200"
    environment:
      SLOT: "{{.Task.Slot}}"
    volumes:
      - crate-data:/data
    hostname: "node{{.Task.Slot}}"
    command: >
      crate
      -Ccluster.name=swarm
      -Cnode.name=node$${SLOT}
      -Cnetwork.host=_site_
      -Cdiscovery.seed_hosts=node1,node2
      -Ccluster.initial_master_nodes=node1,node2
      -Cgateway.expected_nodes=2
volumes:
  crate-data:
    name: 'crate-data-{{.Task.Slot}}'

The nodes start but are not forming a cluster even though they are reachable from other containers (tried pinging other nodes).

Any ideas why this doesn’t work? What am I missing? Is there a better way of doing this?

The config does look OK IMHO. Could you make sure that the “disks” are empty and do not hold a “single-node” cluster state? My guess (and it is just a guess :-)) is that this is something similar to “Crate DB Clustering on EC2”, where the nodes were started as single nodes first.

I deleted everything and started from scratch… still doesn’t work even though the addresses seem to be bound correctly and are accessible.

NODE 1 LOGS

[2021-07-23T07:11:13,667][INFO ][o.e.e.NodeEnvironment    ] [node1] using [1] data paths, mounts [[/data (/dev/md2)]], net usable_space [418.9gb], net total_space [452gb], types [ext4]


[2021-07-23T07:11:13,669][INFO ][o.e.e.NodeEnvironment    ] [node1] heap size [512mb], compressed ordinary object pointers [true]


[2021-07-23T07:11:13,671][INFO ][o.e.n.Node               ] [node1] node name [node1], node ID [MJFu-sWjTRWf2fmcRBzErQ], cluster name [swarm]


[2021-07-23T07:11:13,685][INFO ][o.e.n.Node               ] [node1] version[4.5.4], pid[1], build[bb0550e/2021-07-13T09:44:57Z], OS[Linux/5.4.0-77-generic/amd64], JVM[AdoptOpenJDK/OpenJDK 64-Bit Server VM/16.0.1/16.0.1+9]


[2021-07-23T07:11:13,897][INFO ][i.c.plugin               ] [node1] plugins loaded: [jmx-monitoring, lang-js, enterpriseFunctions] 


SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".


SLF4J: Defaulting to no-operation (NOP) logger implementation


SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.


[2021-07-23T07:11:14,563][INFO ][o.e.p.PluginsService     ] [node1] no modules loaded


[2021-07-23T07:11:14,564][INFO ][o.e.p.PluginsService     ] [node1] loaded plugin [crate-azure-discovery]


[2021-07-23T07:11:14,564][INFO ][o.e.p.PluginsService     ] [node1] loaded plugin [es-repository-hdfs]


[2021-07-23T07:11:14,564][INFO ][o.e.p.PluginsService     ] [node1] loaded plugin [io.crate.plugin.BlobPlugin]


[2021-07-23T07:11:14,565][INFO ][o.e.p.PluginsService     ] [node1] loaded plugin [io.crate.plugin.PluginLoaderPlugin]


[2021-07-23T07:11:14,565][INFO ][o.e.p.PluginsService     ] [node1] loaded plugin [io.crate.plugin.SrvPlugin]


[2021-07-23T07:11:14,565][INFO ][o.e.p.PluginsService     ] [node1] loaded plugin [io.crate.udc.plugin.UDCPlugin]


[2021-07-23T07:11:14,565][INFO ][o.e.p.PluginsService     ] [node1] loaded plugin [org.elasticsearch.analysis.common.CommonAnalysisPlugin]


[2021-07-23T07:11:14,565][INFO ][o.e.p.PluginsService     ] [node1] loaded plugin [org.elasticsearch.discovery.ec2.Ec2DiscoveryPlugin]


[2021-07-23T07:11:14,565][INFO ][o.e.p.PluginsService     ] [node1] loaded plugin [org.elasticsearch.plugin.analysis.AnalysisPhoneticPlugin]


[2021-07-23T07:11:14,565][INFO ][o.e.p.PluginsService     ] [node1] loaded plugin [org.elasticsearch.plugin.repository.url.URLRepositoryPlugin]


[2021-07-23T07:11:14,565][INFO ][o.e.p.PluginsService     ] [node1] loaded plugin [org.elasticsearch.repositories.azure.AzureRepositoryPlugin]


[2021-07-23T07:11:14,565][INFO ][o.e.p.PluginsService     ] [node1] loaded plugin [org.elasticsearch.repositories.s3.S3RepositoryPlugin]


[2021-07-23T07:11:14,565][INFO ][o.e.p.PluginsService     ] [node1] loaded plugin [org.elasticsearch.transport.Netty4Plugin]


[2021-07-23T07:11:14,970][INFO ][o.e.d.DiscoveryModule    ] [node1] using discovery type [zen] and seed hosts providers [settings]


[2021-07-23T07:11:15,401][INFO ][psql                     ] [node1] PSQL SSL support is disabled.


[2021-07-23T07:11:15,586][INFO ][i.c.p.PipelineRegistry   ] [node1] HTTP SSL support is disabled.


[2021-07-23T07:11:15,619][INFO ][o.e.n.Node               ] [node1] initialized


[2021-07-23T07:11:15,619][INFO ][o.e.n.Node               ] [node1] starting ...


[2021-07-23T07:11:15,702][INFO ][psql                     ] [node1] publish_address {10.0.0.120:5432}, bound_addresses {172.18.0.3:5432}, {10.0.19.3:5432}, {10.0.0.120:5432}


[2021-07-23T07:11:15,710][INFO ][o.e.h.n.Netty4HttpServerTransport] [node1] publish_address {10.0.0.120:4200}, bound_addresses {172.18.0.3:4200}, {10.0.19.3:4200}, {10.0.0.120:4200}


[2021-07-23T07:11:15,721][INFO ][o.e.t.TransportService   ] [node1] publish_address {10.0.0.120:4300}, bound_addresses {172.18.0.3:4300}, {10.0.19.3:4300}, {10.0.0.120:4300}


[2021-07-23T07:11:15,771][INFO ][o.e.b.BootstrapChecks    ] [node1] bound or publishing to a non-loopback address, enforcing bootstrap checks


[2021-07-23T07:11:25,782][WARN ][o.e.c.c.ClusterFormationFailureHelper] [node1] master not discovered yet, this node has not previously joined a bootstrapped (v4+) cluster, and this node must discover master-eligible nodes [node1, node2] to bootstrap a cluster: have discovered [{node1}{MJFu-sWjTRWf2fmcRBzErQ}{5pgyq3X0RHS0Mod1Z30uWg}{10.0.0.120}{10.0.0.120:4300}{http_address=10.0.0.120:4200}]; discovery will continue using [10.0.19.4:4300] from hosts providers and [{node1}{MJFu-sWjTRWf2fmcRBzErQ}{5pgyq3X0RHS0Mod1Z30uWg}{10.0.0.120}{10.0.0.120:4300}{http_address=10.0.0.120:4200}] from last-known cluster state; node term 0, last-accepted version 0 in term 0


[2021-07-23T07:11:35,785][WARN ][o.e.c.c.ClusterFormationFailureHelper] [node1] master not discovered yet, this node has not previously joined a bootstrapped (v4+) cluster, and this node must discover master-eligible nodes [node1, node2] to bootstrap a cluster: have discovered [{node1}{MJFu-sWjTRWf2fmcRBzErQ}{5pgyq3X0RHS0Mod1Z30uWg}{10.0.0.120}{10.0.0.120:4300}{http_address=10.0.0.120:4200}]; discovery will continue using [10.0.19.4:4300] from hosts providers and [{node1}{MJFu-sWjTRWf2fmcRBzErQ}{5pgyq3X0RHS0Mod1Z30uWg}{10.0.0.120}{10.0.0.120:4300}{http_address=10.0.0.120:4200}] from last-known cluster state; node term 0, last-accepted version 0 in term 0


[2021-07-23T07:11:45,782][WARN ][o.e.n.Node               ] [node1] timed out while waiting for initial discovery state - timeout: 30s


[2021-07-23T07:11:45,783][INFO ][o.e.n.Node               ] [node1] started


[2021-07-23T07:11:45,789][WARN ][o.e.c.c.ClusterFormationFailureHelper] [node1] master not discovered yet, this node has not previously joined a bootstrapped (v4+) cluster, and this node must discover master-eligible nodes [node1, node2] to bootstrap a cluster: have discovered [{node1}{MJFu-sWjTRWf2fmcRBzErQ}{5pgyq3X0RHS0Mod1Z30uWg}{10.0.0.120}{10.0.0.120:4300}{http_address=10.0.0.120:4200}]; discovery will continue using [10.0.19.4:4300] from hosts providers and [{node1}{MJFu-sWjTRWf2fmcRBzErQ}{5pgyq3X0RHS0Mod1Z30uWg}{10.0.0.120}{10.0.0.120:4300}{http_address=10.0.0.120:4200}] from last-known cluster state; node term 0, last-accepted version 0 in term 0


[2021-07-23T07:11:55,794][WARN ][o.e.c.c.ClusterFormationFailureHelper] [node1] master not discovered yet, this node has not previously joined a bootstrapped (v4+) cluster, and this node must discover master-eligible nodes [node1, node2] to bootstrap a cluster: have discovered [{node1}{MJFu-sWjTRWf2fmcRBzErQ}{5pgyq3X0RHS0Mod1Z30uWg}{10.0.0.120}{10.0.0.120:4300}{http_address=10.0.0.120:4200}]; discovery will continue using [10.0.19.4:4300] from hosts providers and [{node1}{MJFu-sWjTRWf2fmcRBzErQ}{5pgyq3X0RHS0Mod1Z30uWg}{10.0.0.120}{10.0.0.120:4300}{http_address=10.0.0.120:4200}] from last-known cluster state; node term 0, last-accepted version 0 in term 0

NODE 2 LOGS

[2021-07-23T07:11:13,398][INFO ][o.e.e.NodeEnvironment    ] [node2] using [1] data paths, mounts [[/data (/dev/md2)]], net usable_space [420.3gb], net total_space [452gb], types [ext4]


[2021-07-23T07:11:13,400][INFO ][o.e.e.NodeEnvironment    ] [node2] heap size [512mb], compressed ordinary object pointers [true]


[2021-07-23T07:11:13,401][INFO ][o.e.n.Node               ] [node2] node name [node2], node ID [ExG5qcCfRhqLJNYvxeslPg], cluster name [swarm]


[2021-07-23T07:11:13,412][INFO ][o.e.n.Node               ] [node2] version[4.5.4], pid[1], build[bb0550e/2021-07-13T09:44:57Z], OS[Linux/5.4.0-73-generic/amd64], JVM[AdoptOpenJDK/OpenJDK 64-Bit Server VM/16.0.1/16.0.1+9]


[2021-07-23T07:11:13,606][INFO ][i.c.plugin               ] [node2] plugins loaded: [lang-js, enterpriseFunctions, jmx-monitoring] 


SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".


SLF4J: Defaulting to no-operation (NOP) logger implementation


SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.


[2021-07-23T07:11:14,260][INFO ][o.e.p.PluginsService     ] [node2] no modules loaded


[2021-07-23T07:11:14,261][INFO ][o.e.p.PluginsService     ] [node2] loaded plugin [crate-azure-discovery]


[2021-07-23T07:11:14,261][INFO ][o.e.p.PluginsService     ] [node2] loaded plugin [es-repository-hdfs]


[2021-07-23T07:11:14,261][INFO ][o.e.p.PluginsService     ] [node2] loaded plugin [io.crate.plugin.BlobPlugin]


[2021-07-23T07:11:14,261][INFO ][o.e.p.PluginsService     ] [node2] loaded plugin [io.crate.plugin.PluginLoaderPlugin]


[2021-07-23T07:11:14,261][INFO ][o.e.p.PluginsService     ] [node2] loaded plugin [io.crate.plugin.SrvPlugin]


[2021-07-23T07:11:14,261][INFO ][o.e.p.PluginsService     ] [node2] loaded plugin [io.crate.udc.plugin.UDCPlugin]


[2021-07-23T07:11:14,261][INFO ][o.e.p.PluginsService     ] [node2] loaded plugin [org.elasticsearch.analysis.common.CommonAnalysisPlugin]


[2021-07-23T07:11:14,261][INFO ][o.e.p.PluginsService     ] [node2] loaded plugin [org.elasticsearch.discovery.ec2.Ec2DiscoveryPlugin]


[2021-07-23T07:11:14,261][INFO ][o.e.p.PluginsService     ] [node2] loaded plugin [org.elasticsearch.plugin.analysis.AnalysisPhoneticPlugin]


[2021-07-23T07:11:14,261][INFO ][o.e.p.PluginsService     ] [node2] loaded plugin [org.elasticsearch.plugin.repository.url.URLRepositoryPlugin]


[2021-07-23T07:11:14,262][INFO ][o.e.p.PluginsService     ] [node2] loaded plugin [org.elasticsearch.repositories.azure.AzureRepositoryPlugin]


[2021-07-23T07:11:14,262][INFO ][o.e.p.PluginsService     ] [node2] loaded plugin [org.elasticsearch.repositories.s3.S3RepositoryPlugin]


[2021-07-23T07:11:14,262][INFO ][o.e.p.PluginsService     ] [node2] loaded plugin [org.elasticsearch.transport.Netty4Plugin]


[2021-07-23T07:11:14,666][INFO ][o.e.d.DiscoveryModule    ] [node2] using discovery type [zen] and seed hosts providers [settings]


[2021-07-23T07:11:15,195][INFO ][psql                     ] [node2] PSQL SSL support is disabled.


[2021-07-23T07:11:15,296][INFO ][i.c.p.PipelineRegistry   ] [node2] HTTP SSL support is disabled.


[2021-07-23T07:11:15,325][INFO ][o.e.n.Node               ] [node2] initialized


[2021-07-23T07:11:15,325][INFO ][o.e.n.Node               ] [node2] starting ...


[2021-07-23T07:11:15,396][INFO ][psql                     ] [node2] publish_address {10.0.0.121:5432}, bound_addresses {10.0.19.4:5432}, {10.0.0.121:5432}, {172.18.0.12:5432}


[2021-07-23T07:11:15,403][INFO ][o.e.h.n.Netty4HttpServerTransport] [node2] publish_address {10.0.0.121:4200}, bound_addresses {10.0.19.4:4200}, {10.0.0.121:4200}, {172.18.0.12:4200}


[2021-07-23T07:11:15,414][INFO ][o.e.t.TransportService   ] [node2] publish_address {10.0.0.121:4300}, bound_addresses {10.0.19.4:4300}, {10.0.0.121:4300}, {172.18.0.12:4300}


[2021-07-23T07:11:15,466][INFO ][o.e.b.BootstrapChecks    ] [node2] bound or publishing to a non-loopback address, enforcing bootstrap checks


[2021-07-23T07:11:25,475][WARN ][o.e.c.c.ClusterFormationFailureHelper] [node2] master not discovered yet, this node has not previously joined a bootstrapped (v4+) cluster, and this node must discover master-eligible nodes [node1, node2] to bootstrap a cluster: have discovered [{node2}{ExG5qcCfRhqLJNYvxeslPg}{7D1VLR0CRXiJSvfwn9q1rQ}{10.0.0.121}{10.0.0.121:4300}{http_address=10.0.0.121:4200}]; discovery will continue using [10.0.19.3:4300] from hosts providers and [{node2}{ExG5qcCfRhqLJNYvxeslPg}{7D1VLR0CRXiJSvfwn9q1rQ}{10.0.0.121}{10.0.0.121:4300}{http_address=10.0.0.121:4200}] from last-known cluster state; node term 0, last-accepted version 0 in term 0


[2021-07-23T07:11:35,477][WARN ][o.e.c.c.ClusterFormationFailureHelper] [node2] master not discovered yet, this node has not previously joined a bootstrapped (v4+) cluster, and this node must discover master-eligible nodes [node1, node2] to bootstrap a cluster: have discovered [{node2}{ExG5qcCfRhqLJNYvxeslPg}{7D1VLR0CRXiJSvfwn9q1rQ}{10.0.0.121}{10.0.0.121:4300}{http_address=10.0.0.121:4200}]; discovery will continue using [10.0.19.3:4300] from hosts providers and [{node2}{ExG5qcCfRhqLJNYvxeslPg}{7D1VLR0CRXiJSvfwn9q1rQ}{10.0.0.121}{10.0.0.121:4300}{http_address=10.0.0.121:4200}] from last-known cluster state; node term 0, last-accepted version 0 in term 0


[2021-07-23T07:11:45,477][WARN ][o.e.n.Node               ] [node2] timed out while waiting for initial discovery state - timeout: 30s


[2021-07-23T07:11:45,478][INFO ][o.e.n.Node               ] [node2] started


[2021-07-23T07:11:45,481][WARN ][o.e.c.c.ClusterFormationFailureHelper] [node2] master not discovered yet, this node has not previously joined a bootstrapped (v4+) cluster, and this node must discover master-eligible nodes [node1, node2] to bootstrap a cluster: have discovered [{node2}{ExG5qcCfRhqLJNYvxeslPg}{7D1VLR0CRXiJSvfwn9q1rQ}{10.0.0.121}{10.0.0.121:4300}{http_address=10.0.0.121:4200}]; discovery will continue using [10.0.19.3:4300] from hosts providers and [{node2}{ExG5qcCfRhqLJNYvxeslPg}{7D1VLR0CRXiJSvfwn9q1rQ}{10.0.0.121}{10.0.0.121:4300}{http_address=10.0.0.121:4200}] from last-known cluster state; node term 0, last-accepted version 0 in term 0


[2021-07-23T07:11:55,484][WARN ][o.e.c.c.ClusterFormationFailureHelper] [node2] master not discovered yet, this node has not previously joined a bootstrapped (v4+) cluster, and this node must discover master-eligible nodes [node1, node2] to bootstrap a cluster: have discovered [{node2}{ExG5qcCfRhqLJNYvxeslPg}{7D1VLR0CRXiJSvfwn9q1rQ}{10.0.0.121}{10.0.0.121:4300}{http_address=10.0.0.121:4200}]; discovery will continue using [10.0.19.3:4300] from hosts providers and [{node2}{ExG5qcCfRhqLJNYvxeslPg}{7D1VLR0CRXiJSvfwn9q1rQ}{10.0.0.121}{10.0.0.121:4300}{http_address=10.0.0.121:4200}] from last-known cluster state; node term 0, last-accepted version 0 in term 0
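
A note on reading the two logs side by side: node1's TransportService line reports publish_address 10.0.0.120:4300, while node2 keeps probing 10.0.19.3:4300 (and the other way round with 10.0.0.121 and 10.0.19.4). The snippet below is only a sketch for spotting that kind of mismatch in a log dump. The function names are made up for illustration, and the /24 prefix is an assumption, since the thread does not show the actual netmasks.

```python
import ipaddress
import re

# Matches e.g. "publish_address {10.0.0.120:4300}" on a TransportService line.
PUBLISH_RE = re.compile(r"TransportService.*publish_address \{([\d.]+):(\d+)\}")
# Matches e.g. "discovery will continue using [10.0.19.4:4300] from hosts providers".
SEED_RE = re.compile(r"discovery will continue using \[([^\]]*)\] from hosts providers")


def publish_address(log_text):
    """Return the transport publish IP found in the log, or None."""
    match = PUBLISH_RE.search(log_text)
    return match.group(1) if match else None


def seed_addresses(log_text):
    """Return the set of seed IPs the node keeps probing."""
    ips = set()
    for match in SEED_RE.finditer(log_text):
        for entry in match.group(1).split(","):
            entry = entry.strip()
            if entry:
                # Drop the ":4300" port suffix and keep the IP only.
                ips.add(entry.rsplit(":", 1)[0])
    return ips


def same_network(ip_a, ip_b, prefix=24):
    """True if both IPs fall into the same /prefix network (prefix is assumed)."""
    network = ipaddress.ip_network(f"{ip_a}/{prefix}", strict=False)
    return ipaddress.ip_address(ip_b) in network
```

If the address one node publishes and the address its peers probe are not on the same network, that is a hint to look at network.publish_host.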

Can you check the DNS resolution again? Discovery happens on the 10.x.x.x networks. Can you ping node2 or do a curl node2:4300? In any case, you could also use the IP addresses for discovery.

Here is the output:

[root@node1 data]# ping node2 
PING node2 (10.0.19.4) 56(84) bytes of data.
64 bytes from crate_node.2.kzuiblv8dkrtprzgkv4l7tdvf.crate_default (10.0.19.4): icmp_seq=1 ttl=64 time=0.612 ms
64 bytes from crate_node.2.kzuiblv8dkrtprzgkv4l7tdvf.crate_default (10.0.19.4): icmp_seq=2 ttl=64 time=0.577 ms
^C
--- node2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.577/0.594/0.612/0.030 ms
[root@node1 data]# curl node2:4300
This is not a HTTP port
[root@node1 data]# 
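
The ping and curl checks above can also be scripted, which is handy when several peers need testing. This is a minimal sketch, assuming Python is available in a container attached to the same overlay network; can_connect is a hypothetical helper name, and the host and port would be node2 and 4300 as in the thread.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts alike.
        return False
```

Calling can_connect("node2", 4300) from node1 only tells you the transport port is reachable; as the output above shows, that alone does not guarantee the cluster forms.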

Could you use the IP addresses instead of the node names in cluster.initial_master_nodes? See “Cluster-wide settings” in the CrateDB reference.

Or try something like cluster.initial_master_nodes=node.1,node.2, as discussed in the GitHub issue “7.2.0 fails to start on docker-swarm” (Issue #410, deviantony/docker-elk).

Finally fixed it!

It was a networking issue, as pointed out in the GitHub issue linked above.

The working stack:

version: '3.5'
services:
  crate:
    image: crate:latest
    ports:
      - "4200:4200"
    environment:
      SLOT: "{{.Task.Slot}}"
    volumes:
      - crate-data:/data
    hostname: "crate-{{.Task.Slot}}"
    command: >
      crate
      -Ccluster.name=swarm
      -Cnode.name=crate-$${SLOT}
      -Cnetwork.publish_host=_eth0_
      -Cdiscovery.type='zen'
      -Cdiscovery.seed_hosts=tasks.crate
      -Ccluster.initial_master_nodes=crate-1
    deploy:
      replicas: 3
volumes:
  crate-data:
    name: 'crate-data-{{.Task.Slot}}'
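
To confirm that the replicas really joined one cluster, one option is to ask any node for the contents of sys.nodes through CrateDB's HTTP endpoint (POST to /_sql on port 4200, the port published in the stack above). This is a minimal sketch, assuming the service is reachable at http://localhost:4200; the helper names are hypothetical.

```python
import json
import urllib.request


def run_sql(base_url, stmt, timeout=5.0):
    """POST a statement to CrateDB's HTTP endpoint and return the decoded JSON."""
    request = urllib.request.Request(
        f"{base_url}/_sql",
        data=json.dumps({"stmt": stmt}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def node_names(result):
    """Extract sorted node names from a 'SELECT name FROM sys.nodes' result."""
    return sorted(row[0] for row in result["rows"])


def cluster_is_complete(result, expected_nodes):
    """True if at least expected_nodes nodes are listed in the result."""
    return len(result["rows"]) >= expected_nodes
```

With three replicas, something like node_names(run_sql("http://localhost:4200", "SELECT name FROM sys.nodes")) should list three names if the cluster formed.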

Thank you for reporting back and great that it works!

I am not entirely sure what tasks.crate in discovery.seed_hosts=tasks.crate really does.
But the hostnames crate_node.2.kzuiblv8dkrtprzgkv4l7tdvf.crate_default looked a bit suspicious.
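
For what it's worth: in Swarm mode, tasks.<service-name> is a name answered by Docker's embedded DNS with the IP addresses of all running tasks of that service, rather than the single virtual IP you get for the plain service name. So tasks.crate gives discovery one entry per replica, however many there are. The sketch below shows how to inspect that from inside a container on the same network; resolve_all is a hypothetical helper, and the resolver parameter exists only so the function can be tried without a Swarm.

```python
import socket


def resolve_all(name, port=4300, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IPs that `name` resolves to."""
    infos = resolver(name, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})
```

Inside one of the crate containers, resolve_all("tasks.crate") should return one address per running replica.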

I wasn’t sure either, I just grabbed the Swarm config from the wiki:

# docker-stack.yml

  elasticsearch:
    environment:
      # Force publishing on the 'elk' overlay.
      network.publish_host: _eth0_
      # Set a predictable node name.
      node.name: elk_elasticsearch.{{.Task.Slot}}
      # Disable single-node discovery.
      discovery.type: ''
      # Use internal Docker round-robin DNS for unicast discovery.
      discovery.seed_hosts: tasks.elasticsearch
      # Define initial masters, assuming a cluster size of at least 3.
      cluster.initial_master_nodes: elk_elasticsearch.1,elk_elasticsearch.2,elk_elasticsearch.3