ETCD node failover
Unreachable member
This procedure assumes that a cluster of etcd containers has already been created successfully.
Check the cluster status with the following command.
# etcdctl --endpoint <endpoint> cluster-health
If the cluster is running normally, the output looks like:
member xxx is healthy: got healthy result from https://10.23.2.109:3379
member xxx is healthy: got healthy result from https://10.23.2.108:3379
member xxx is healthy: got healthy result from https://10.23.2.110:3379
cluster is healthy
If one member failed, the output may look like:
failed to check the health of member xxx on https://10.23.2.109:3379: Get https://10.23.2.109:3379/health: dial tcp 10.23.2.109:3379: connect: connection refused
member xxx is unreachable: [https://10.23.2.109:3379] are all unreachable
member xxx is healthy: got healthy result from https://10.23.2.108:3379
member xxx is healthy: got healthy result from https://10.23.2.110:3379
cluster is healthy
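With more than a few members, it can help to reduce the health report to the members that need attention. The following is a minimal sketch, not part of the original procedure: failed_members is a hypothetical helper name, and the "is unhealthy" wording is an assumption modelled on the "is healthy" and "is unreachable" lines shown above.

```shell
# Hypothetical helper: read a cluster-health report on stdin and print
# "<memberID> <status>" for every member reported as unreachable or unhealthy.
# Healthy members and the "cluster is ..." summary line are ignored.
failed_members() {
    awk '$1 == "member" && $3 == "is" && $4 ~ /^(unreachable|unhealthy)/ {
        sub(/:$/, "", $4)   # drop the trailing colon from the status word
        print $2, $4
    }'
}
```

Piping the output of the cluster-health command into failed_members prints one line per failed member, which gives the member ID needed in the cases below.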
The cause is usually one of the following four cases.
Case 1: The whole environment of an etcd container was destroyed.
Solution
- Remove the destroyed member with etcdctl.
# etcdctl member remove xxx
xxx is the member ID of the unreachable member.
- Create a new etcd container, adding the following environment variables to env in its config file.
"ETCD_INITIAL_CLUSTER_STATE": "existing"
"ETCD_INITIAL_CLUSTER": <The cluster peer urls with the new etcd container>
In ETCD_INITIAL_CLUSTER, "hostname2=https://10.23.2.108:3380,hostname3=https://10.23.2.110:3380" are the peer URLs of the members that remain after the destroyed member is removed; the entry for the new etcd container is added to them.
- Add the new container to the existing cluster.
# etcdctl --endpoint <endpoint> member add <name> <peerURL>
<name> is the hostname in the new container's config file.
<peerURL> is one of the ETCD_INITIAL_ADVERTISE_PEER_URLS values in the new container's config file.
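The ETCD_INITIAL_CLUSTER value in Case 1 is a comma-separated list of name=peerURL entries, so it can be assembled rather than typed by hand. The sketch below assumes that format only; join_initial_cluster is a hypothetical helper, and hostname4 with https://10.23.2.111:3380 is an invented entry standing in for the new container.

```shell
# Hypothetical helper: join "name=peerURL" arguments with commas to form a
# value for ETCD_INITIAL_CLUSTER. Pass the remaining members first, then the
# entry for the new container.
join_initial_cluster() {
    # The subshell keeps the IFS change from leaking into the caller.
    ( IFS=,; printf '%s\n' "$*" )
}

# Example call (hostname4 and its URL are illustrative, not from the cluster above):
# join_initial_cluster \
#     "hostname2=https://10.23.2.108:3380" \
#     "hostname3=https://10.23.2.110:3380" \
#     "hostname4=https://10.23.2.111:3380"
```

The printed string can then be pasted as the ETCD_INITIAL_CLUSTER value in the new container's config file.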
Case 2: The etcd container doesn't exist.
Solution
- Add "ETCD_INITIAL_CLUSTER_STATE": "existing" to the container creation config file.
- Create the container with the new config file, keeping all other settings the same as before.
Case 3: The etcd container was stopped.
Solution
Start the container.
# docker start <container>
Case 4: The etcd service was stopped in its container.
Solution
Restart the container in which the etcd service stopped.
# docker restart <container>
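Cases 3 and 4 differ only in whether the container itself is still running, which docker inspect -f '{{.State.Running}}' <container> reports as true or false. The sketch below maps that value to the matching command; recovery_command is a hypothetical helper, and it assumes the member has already been flagged by the health check, since a running container does not by itself prove that etcd has stopped inside it.

```shell
# Hypothetical helper: print the docker command that matches the container state.
#   $1: container name
#   $2: "true" or "false", as printed by
#       docker inspect -f '{{.State.Running}}' <container>
recovery_command() {
    case "$2" in
        false) echo "docker start $1" ;;     # Case 3: the container was stopped
        true)  echo "docker restart $1" ;;   # Case 4: container is up, etcd stopped inside
        *)     echo "unknown state: $2" >&2; return 1 ;;
    esac
}
```

The helper only prints the command, so it can be reviewed before it is run. If docker inspect fails because the container does not exist, Case 1 or Case 2 applies instead.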
Unhealthy member
If a member is unhealthy, remove its container together with its metadata, then create a new container as described in Case 2 above.