ETCD node failover

Unreachable member

A cluster of etcd containers has been created successfully.
Check the cluster status with the following command, supplying the client URLs of the members:

# etcdctl --endpoints <clientURLs> cluster-health

If the cluster is running normally, the output looks like:

member xxx is healthy: got healthy result from https://10.23.2.109:3379
member xxx is healthy: got healthy result from https://10.23.2.108:3379
member xxx is healthy: got healthy result from https://10.23.2.110:3379
cluster is healthy

If one member has failed, the output may look like:

failed to check the health of member xxx on https://10.23.2.109:3379: Get https://10.23.2.109:3379/health: dial tcp 10.23.2.109:3379: connect: connection refused
member xxx is unreachable: [https://10.23.2.109:3379] are all unreachable
member xxx is healthy: got healthy result from https://10.23.2.108:3379
member xxx is healthy: got healthy result from https://10.23.2.110:3379
cluster is healthy
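Since removing a failed member (covered in Case 1 below) requires its member ID, it can help to pull the ID out of saved health output automatically. The sketch below does this with sed; the member IDs and URLs are hypothetical stand-ins for the xxx placeholders above.

```shell
# Saved cluster-health output; member IDs here are hypothetical examples.
health_output='failed to check the health of member 8211f1d0f64f3269 on https://10.23.2.109:3379: connection refused
member 8211f1d0f64f3269 is unreachable: [https://10.23.2.109:3379] are all unreachable
member 91bc3c398fb3c146 is healthy: got healthy result from https://10.23.2.108:3379
member fd422379fda50e48 is healthy: got healthy result from https://10.23.2.110:3379
cluster is healthy'

# Extract the ID of any member reported as unreachable.
bad_member=$(printf '%s\n' "$health_output" \
  | sed -n 's/^member \([0-9a-f]*\) is unreachable.*/\1/p')
echo "$bad_member"
```

The extracted ID can then be passed to etcdctl member remove.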

The failure usually falls into one of the following four cases.

Case 1: The whole environment of an etcd container was destroyed.

Solution

  • Remove the destroyed member with etcdctl.
# etcdctl member remove xxx
xxx is the member ID of the unreachable member.
  • Announce the replacement member to the existing cluster. Note that etcd expects member add to run before the new member starts.
# etcdctl --endpoints <clientURLs> member add <name> <peerURL>

<name> is the hostname in the new member's config file.

<peerURL> is one of the ETCD_INITIAL_ADVERTISE_PEER_URLS in its config file.

  • Create a new etcd container, adding the following environment variables to env in its config file.
"ETCD_INITIAL_CLUSTER_STATE": "existing"
"ETCD_INITIAL_CLUSTER": <the cluster peer URLs, including the new etcd container>

For example, "hostname2=https://10.23.2.108:3380,hostname3=https://10.23.2.110:3380" in ETCD_INITIAL_CLUSTER are the peer URLs of the cluster after removing the destroyed member; the new member's own peer URL is appended to them.
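To make the ETCD_INITIAL_CLUSTER value concrete, the sketch below assembles it from the surviving peer URLs in the example above plus a replacement member. The name hostname1 and peer URL https://10.23.2.109:3380 for the new member are assumptions for illustration; adjust them to match your config file.

```shell
# Surviving peer URLs after removing the destroyed member (from the example above).
surviving="hostname2=https://10.23.2.108:3380,hostname3=https://10.23.2.110:3380"

# Hypothetical name=peerURL pair for the replacement member.
new_member="hostname1=https://10.23.2.109:3380"

# Value to set as "ETCD_INITIAL_CLUSTER" in the new container's env.
initial_cluster="$surviving,$new_member"
echo "$initial_cluster"
```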

Case 2: The etcd container doesn't exist.

Solution

  • Add "ETCD_INITIAL_CLUSTER_STATE": "existing" to the container creation config file.
  • Create the container with the new config file, keeping the other configuration the same as before.
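As a sketch, the relevant part of the container creation config file might then look like the fragment below. The surrounding JSON structure is an assumption about your config format; only the ETCD_INITIAL_CLUSTER_STATE entry is what this case prescribes.

```json
{
  "env": {
    "ETCD_INITIAL_CLUSTER_STATE": "existing"
  }
}
```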

Case 3: The etcd container was stopped.

Solution

Start the container.

# docker start <container>

Case 4: The etcd service was stopped in its container.

Solution

Restart the container so that the etcd service inside it is started again.

# docker restart <container>

Unhealthy member

If a member is unhealthy, remove its container together with its metadata, then create a new one following Case 1 above.
