5.1.5.1. etcd Diagnostic Checklist
Before starting diagnostics, make sure the ETCDCTL_API=3 environment variable is set and the TLS certificates for connecting to the cluster are available.
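To avoid repeating the TLS flags in every command below, etcdctl can also read its settings from ETCDCTL_* environment variables. A minimal sketch, using the kubeadm default certificate paths that appear throughout this checklist:

```shell
# etcdctl reads these variables instead of the corresponding command-line flags.
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/healthcheck-client.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/healthcheck-client.key
```

With these exported, the commands below can be run without the --endpoints, --cacert, --cert, and --key flags.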
Preliminary checks
Cluster health
etcdctl endpoint health \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key
Cluster member list
etcdctl member list \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
-w table
Endpoint status
etcdctl endpoint status \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
-w table
Disk usage
etcdctl endpoint status \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
-w json | python3 -m json.tool | grep dbSize
Alarm check
etcdctl alarm list \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key
Common errors
connection refused
| Symptom | dial tcp 127.0.0.1:2379: connect: connection refused |
| Cause | The etcd process is not running or is listening on a different address/port. |
| Solution | Check the etcd pod/service status: crictl ps | grep etcd. Check logs: crictl logs <container-id>. Make sure --listen-client-urls contains the required address. |
certificate expired
| Symptom | certificate has expired or is not yet valid |
| Cause | The etcd TLS certificate has expired. |
| Solution | Check the expiration date: openssl x509 -in /etc/kubernetes/pki/etcd/server.crt -noout -dates. Reissue the certificates and restart etcd. |
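The openssl check above can be extended to every certificate in the etcd PKI directory at once. A sketch, assuming the kubeadm default layout; the check_etcd_certs helper is illustrative, not a standard tool:

```shell
# Illustrative helper: report which etcd certificates expire within 30 days.
# Assumes the kubeadm default PKI layout unless a directory is passed in.
check_etcd_certs() {
  dir="${1:-/etc/kubernetes/pki/etcd}"
  for cert in "$dir"/*.crt; do
    # -checkend N exits non-zero if the certificate expires within N seconds.
    if openssl x509 -in "$cert" -noout -checkend $((30 * 24 * 3600)) >/dev/null 2>&1; then
      echo "OK:      $cert"
    else
      echo "EXPIRES: $cert"
    fi
  done
}

# On a control-plane node:
# check_etcd_certs
```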
quorum lost
| Symptom | etcdserver: request timed out or rafthttp: failed to reach the peer |
| Cause | Half or more of the cluster members are unavailable, so a majority (quorum) cannot be formed. |
| Solution | Restore node availability. If a node cannot be recovered, remove it from the cluster (etcdctl member remove) and add it again. As a last resort, restore the cluster from a snapshot: etcdctl snapshot restore. |
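The snapshot-based recovery path only works if snapshots are taken regularly. A sketch of the save step, assuming the TLS settings come from ETCDCTL_* environment variables; the backup_etcd helper and the /var/backups path are illustrative:

```shell
# Illustrative helper: take an etcd snapshot and run a basic sanity check on it.
backup_etcd() {
  snap="/var/backups/etcd-$(date +%Y%m%d-%H%M%S).db"  # illustrative path
  etcdctl snapshot save "$snap"
  # Print the snapshot's revision, hash, and size to confirm it is readable.
  etcdctl snapshot status "$snap" -w table
}

# Run only where etcdctl is available (a control-plane node).
if command -v etcdctl >/dev/null 2>&1; then
  backup_etcd
fi
```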
database size exceeded
| Symptom | mvcc: database space exceeded |
| Cause | The etcd database size has exceeded the quota (default is 2 GB). |
| Solution | 1. Perform compaction: etcdctl compact $(etcdctl endpoint status -w json | python3 -c "import sys,json; print(json.load(sys.stdin)[0]['Status']['header']['revision'])"). 2. Run defragmentation: etcdctl defrag. 3. Disarm the alarm: etcdctl alarm disarm. 4. Consider increasing the quota via --quota-backend-bytes. |
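Steps 1–3 above can be combined into one runbook-style script (step 4 is a static flag change on the etcd manifest, so it is left out). A sketch, assuming ETCDCTL_API=3 and the TLS settings are supplied via ETCDCTL_* environment variables; the reclaim_etcd_space name is illustrative:

```shell
# Illustrative helper: recover from "mvcc: database space exceeded".
reclaim_etcd_space() {
  # 1. Compact the key-space history up to the current revision.
  rev=$(etcdctl endpoint status -w json \
        | python3 -c "import sys,json; print(json.load(sys.stdin)[0]['Status']['header']['revision'])")
  etcdctl compact "$rev"

  # 2. Defragment so the freed space is returned to the filesystem.
  etcdctl defrag

  # 3. Clear the NOSPACE alarm so writes are accepted again.
  etcdctl alarm disarm
}

# Run only where etcdctl is available (a control-plane node).
if command -v etcdctl >/dev/null 2>&1; then
  reclaim_etcd_space
fi
```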
too many learner members in cluster
| Symptom | etcdserver: too many learner members in cluster |
| Cause | The etcd cluster already contains one or more nodes in a learner state. When adding a new control-plane node, kubeadm first adds the etcd node as a learner. etcd allows a limited number of learner nodes (usually one), so adding a new node fails with this error. |
| Solution | Check the cluster member list: etcdctl member list. If a stuck learner is found, remove it: etcdctl member remove <member-id>. Then retry the kubeadm join procedure. If the node was supposed to become a voter, try running etcdctl member promote <member-id>. |
rpc not supported for learner
| Symptom | rpc error: code = Unavailable desc = etcdserver: rpc not supported for learner |
| Cause | The etcd node is in a learner state and has not yet synchronized with the cluster. Some operations are unavailable for learner nodes until they are promoted to voter. |
| Solution | Check the node status: etcdctl member list. Ensure the node is synchronized. Once replication is complete, run etcdctl member promote <member-id> or retry the kubeadm join procedure so that kubeadm automatically completes the node promotion process. |