Cluster with Docker and two Docker-Hosts

Hi,

I have installed a “cluster” with docker-compose on one node – this works well.

Then I tried to connect one more node (on another physical machine!) to this cluster, but that failed with:
completed handshake with [{node02}{5-0wKhgIR-ebx2t2I5oOQA}{kBJLysioRqS3IDkso3tdrQ}{172.30.0.3}{172.30.0.3:4300}{http_address=172.30.0.3:4200}] but followup connection failed

It seems that the first two nodes send back the IP addresses of their Docker-internal network, and those are not reachable from the second Docker host.

I know this is not a “real-world” scenario, but for testing purposes it would be great if this worked somehow.

Below are my docker-compose.yml files …

Maybe someone can give me a hint how I can get this up and running.

My first host’s IP is 192.168.1.3.

First docker host:

version: '3.8'
services:
  node01:
    image: crate:5.1.1
    ports:
      - "4201:4200"
      - "4301:4300"
      - "5434:5432"
    volumes:
      - /u04/data/crate/01:/data
    command: ["crate",
              "-Ccluster.name=crate-docker-cluster",
              "-Cdiscovery.seed_hosts=node02",
              "-Cnode.name=node01",
              "-Cnode.data=true",
              "-Cnetwork.host=_site_",
              "-Cgateway.expected_data_nodes=2",
              "-Cgateway.recover_after_data_nodes=2",
              "-Ccluster.initial_master_nodes=node01,node02"]
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
    environment:
      - CRATE_HEAP_SIZE=4g
 
  node02:
    image: crate:5.1.1
    ports:
      - "4202:4200"
      - "4302:4300"
    volumes:
      - /u04/data/crate/02:/data
    command: ["crate",
              "-Ccluster.name=crate-docker-cluster",
              "-Cdiscovery.seed_hosts=node01",
              "-Cnode.name=node02",
              "-Cnode.data=true",
              "-Cnetwork.host=_site_",
              "-Cgateway.expected_data_nodes=2",
              "-Cgateway.recover_after_data_nodes=2",
              "-Ccluster.initial_master_nodes=node01,node02"]
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
    environment:
      - CRATE_HEAP_SIZE=4g

Second docker host:

version: '3.3'
services:
 node03:
   image: crate:5.1.1
   ports:
     - "4201:4200"
     - "4301:4300"
     - "5434:5432"
   volumes:
     - /u04/data/crate/01:/data
   command: ["crate",
             "-Ccluster.name=crate-docker-cluster",
             "-Cdiscovery.seed_hosts=192.168.1.3:4301,192.168.1.3:4302",
             "-Cnode.name=node03",
             "-Cnode.data=true",
             "-Cnetwork.host=_site_",
             "-Cgateway.expected_data_nodes=2",
             "-Cgateway.recover_after_data_nodes=2",
             "-Ccluster.initial_master_nodes=node01,node02"]
   environment:
     - CRATE_HEAP_SIZE=4g

Errors when starting the node on the second host

node03_1  | [2022-11-22T12:20:36,134][INFO ][o.e.n.Node               ] [node03] initialized
node03_1  | [2022-11-22T12:20:36,134][INFO ][o.e.n.Node               ] [node03] starting ...
node03_1  | [2022-11-22T12:20:36,181][INFO ][psql                     ] [node03] publish_address {172.18.0.2:5432}, bound_addresses {172.18.0.2:5432}
node03_1  | [2022-11-22T12:20:36,187][INFO ][o.e.h.n.Netty4HttpServerTransport] [node03] publish_address {172.18.0.2:4200}, bound_addresses {172.18.0.2:4200}
node03_1  | [2022-11-22T12:20:36,197][INFO ][o.e.t.TransportService   ] [node03] publish_address {172.18.0.2:4300}, bound_addresses {172.18.0.2:4300}
node03_1  | [2022-11-22T12:20:36,359][INFO ][o.e.b.BootstrapChecks    ] [node03] bound or publishing to a non-loopback address, enforcing bootstrap checks
node03_1  | [2022-11-22T12:20:36,369][INFO ][o.e.c.c.ClusterBootstrapService] [node03] skipping cluster bootstrapping as local node does not match bootstrap requirements: [node01, node02]
node03_1  | [2022-11-22T12:20:46,371][WARN ][o.e.c.c.ClusterFormationFailureHelper] [node03] master not discovered yet, this node has not previously joined a bootstrapped (v5+) cluster, and this node must discover master-eligible nodes [node01, node02] to bootstrap a cluster: have discovered [{node03}{qHBmOTkZQMuYNSP1Fn1dIw}{yrwcZWrLTu2TjGUxp5f2lg}{172.18.0.2}{172.18.0.2:4300}{http_address=172.18.0.2:4200}]; discovery will continue using [192.168.1.3:4301, 192.168.1.3:4302] from hosts providers and [{node03}{qHBmOTkZQMuYNSP1Fn1dIw}{yrwcZWrLTu2TjGUxp5f2lg}{172.18.0.2}{172.18.0.2:4300}{http_address=172.18.0.2:4200}] from last-known cluster state; node term 0, last-accepted version 0 in term 0
node03_1  | [2022-11-22T12:20:56,372][WARN ][o.e.c.c.ClusterFormationFailureHelper] [node03] master not discovered yet, this node has not previously joined a bootstrapped (v5+) cluster, and this node must discover master-eligible nodes [node01, node02] to bootstrap a cluster: have discovered [{node03}{qHBmOTkZQMuYNSP1Fn1dIw}{yrwcZWrLTu2TjGUxp5f2lg}{172.18.0.2}{172.18.0.2:4300}{http_address=172.18.0.2:4200}]; discovery will continue using [192.168.1.3:4301, 192.168.1.3:4302] from hosts providers and [{node03}{qHBmOTkZQMuYNSP1Fn1dIw}{yrwcZWrLTu2TjGUxp5f2lg}{172.18.0.2}{172.18.0.2:4300}{http_address=172.18.0.2:4200}] from last-known cluster state; node term 0, last-accepted version 0 in term 0
node03_1  | [2022-11-22T12:21:06,374][WARN ][o.e.c.c.ClusterFormationFailureHelper] [node03] master not discovered yet, this node has not previously joined a bootstrapped (v5+) cluster, and this node must discover master-eligible nodes [node01, node02] to bootstrap a cluster: have discovered [{node03}{qHBmOTkZQMuYNSP1Fn1dIw}{yrwcZWrLTu2TjGUxp5f2lg}{172.18.0.2}{172.18.0.2:4300}{http_address=172.18.0.2:4200}]; discovery will continue using [192.168.1.3:4301, 192.168.1.3:4302] from hosts providers and [{node03}{qHBmOTkZQMuYNSP1Fn1dIw}{yrwcZWrLTu2TjGUxp5f2lg}{172.18.0.2}{172.18.0.2:4300}{http_address=172.18.0.2:4200}] from last-known cluster state; node term 0, last-accepted version 0 in term 0
node03_1  | [2022-11-22T12:21:06,375][WARN ][o.e.n.Node               ] [node03] timed out while waiting for initial discovery state - timeout: 30s
node03_1  | [2022-11-22T12:21:06,376][INFO ][o.e.n.Node               ] [node03] started
node03_1  | [2022-11-22T12:21:06,458][WARN ][o.e.d.HandshakingTransportAddressConnector] [node03] [connectToRemoteMasterNode[192.168.1.3:4302]] completed handshake with [{node02}{5-0wKhgIR-ebx2t2I5oOQA}{kBJLysioRqS3IDkso3tdrQ}{172.30.0.3}{172.30.0.3:4300}{http_address=172.30.0.3:4200}] but followup connection failed
node03_1  | org.elasticsearch.transport.ConnectTransportException: [node02][172.30.0.3:4300] connect_timeout[30s]
node03_1  |     at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:967) ~[crate-server.jar:?]
node03_1  |     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
node03_1  |     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
node03_1  |     at java.lang.Thread.run(Thread.java:833) ~[?:?]
node03_1  | [2022-11-22T12:21:06,459][WARN ][o.e.d.HandshakingTransportAddressConnector] [node03] [connectToRemoteMasterNode[192.168.1.3:4301]] completed handshake with [{node01}{Xu-cZFv5SAyzyCy-jK0I6A}{rAwDTwC8TQmlID7fr0qt6Q}{172.30.0.2}{172.30.0.2:4300}{http_address=172.30.0.2:4200}] but followup connection failed
node03_1  | org.elasticsearch.transport.ConnectTransportException: [node01][172.30.0.2:4300] connect_timeout[30s]
node03_1  |     at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:967) ~[crate-server.jar:?]
node03_1  |     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
node03_1  |     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
node03_1  |     at java.lang.Thread.run(Thread.java:833) ~[?:?]
node03_1  | [2022-11-22T12:21:16,375][WARN ][o.e.c.c.ClusterFormationFailureHelper] [node03] master not discovered yet, this node has not previously joined a bootstrapped (v5+) cluster, and this node must discover master-eligible nodes [node01, node02] to bootstrap a cluster: have discovered [{node03}{qHBmOTkZQMuYNSP1Fn1dIw}{yrwcZWrLTu2TjGUxp5f2lg}{172.18.0.2}{172.18.0.2:4300}{http_address=172.18.0.2:4200}]; discovery will continue using [192.168.1.3:4301, 192.168.1.3:4302] from hosts providers and [{node03}{qHBmOTkZQMuYNSP1Fn1dIw}{yrwcZWrLTu2TjGUxp5f2lg}{172.18.0.2}{172.18.0.2:4300}{http_address=172.18.0.2:4200}] from last-known cluster state; node term 0, last-accepted version 0 in term 0

I haven’t tested it yet, but I think you either need to switch the containers to host networking (see “Use host networking” in the Docker documentation) or use Docker Swarm with an overlay network.

network.host=_site_ binds the node to its site-local address, but as you mentioned, and as can be seen in the logs, inside the container _site_ resolves to the Docker-internal address 172.18.0.2, which the other host cannot reach.
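If you want to stay on the default bridge network, another option might be to keep binding to the container address but publish the host’s address and mapped transport port instead. This is an untested sketch for node01 only. It assumes the setting names network.publish_host and transport.publish_port as I recall them from the CrateDB node settings reference, so please check them against the documentation for your version. node02 would need the same change with port 4302, and whether node01 and node02 can still reach each other through 192.168.1.3 depends on your Docker setup.

```yaml
version: '3.8'
services:
  node01:
    image: crate:5.1.1
    ports:
      - "4201:4200"
      - "4301:4300"
      - "5434:5432"
    volumes:
      - /u04/data/crate/01:/data
    # Assumption: publish the Docker host's IP and the mapped transport port
    # (4301) instead of the container-internal 172.30.0.x:4300 address.
    command: ["crate",
              "-Ccluster.name=crate-docker-cluster",
              "-Cdiscovery.seed_hosts=node02",
              "-Cnode.name=node01",
              "-Cnode.data=true",
              "-Cnetwork.host=_site_",
              "-Cnetwork.publish_host=192.168.1.3",
              "-Ctransport.publish_port=4301",
              "-Cgateway.expected_data_nodes=2",
              "-Cgateway.recover_after_data_nodes=2",
              "-Ccluster.initial_master_nodes=node01,node02"]
    environment:
      - CRATE_HEAP_SIZE=4g
```

With this, node03 should see {192.168.1.3:4301} in the handshake response instead of {172.30.0.2:4300}.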

Thank you! I will check whether “host networking” helps. For that I need to adjust the ports for Crate somehow, because I have two Crate nodes running on one host.
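For reference, an untested sketch of what the port adjustment could look like for node01 with host networking. The “ports:” mapping is dropped because the container shares the host’s network stack, and each node gets its own ports through settings instead. The setting names http.port, transport.tcp.port and psql.port are my recollection of the CrateDB node settings reference, so verify them for 5.1.1. The sketch also sets network.host to the explicit host IP rather than _site_, on the assumption that _site_ could match more than one interface on a Docker host (the Docker bridge also has a site-local address).

```yaml
version: '3.8'
services:
  node01:
    image: crate:5.1.1
    # Host networking: no "ports:" section, the node listens directly on the host.
    network_mode: host
    volumes:
      - /u04/data/crate/01:/data
    # Distinct HTTP, transport and PostgreSQL ports, because node01 and node02
    # run on the same host; node02 would use 4202, 4302 and another psql port.
    command: ["crate",
              "-Ccluster.name=crate-docker-cluster",
              "-Cdiscovery.seed_hosts=192.168.1.3:4302",
              "-Cnode.name=node01",
              "-Cnode.data=true",
              "-Cnetwork.host=192.168.1.3",
              "-Chttp.port=4201",
              "-Ctransport.tcp.port=4301",
              "-Cpsql.port=5434",
              "-Cgateway.expected_data_nodes=2",
              "-Cgateway.recover_after_data_nodes=2",
              "-Ccluster.initial_master_nodes=node01,node02"]
    environment:
      - CRATE_HEAP_SIZE=4g
```

The seed hosts of node03 (192.168.1.3:4301,192.168.1.3:4302) would stay as they are, since the nodes would now publish exactly those addresses. Note that network_mode applies to docker-compose; it is not honored when the file is deployed as a Swarm stack.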