5.1.5.1. etcd Diagnostic Checklist
Before starting diagnostics, make sure the ETCDCTL_API=3 variable is set and the TLS certificates for connecting to the cluster are available.
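Every command in this checklist repeats the same endpoint and TLS flags. As a convenience, etcdctl also reads these settings from ETCDCTL_-prefixed environment variables, so a shell session can be prepared once. A minimal sketch, assuming the kubeadm-style certificate paths used throughout this section:

```shell
# Select the v3 API (already the default in etcdctl 3.4 and later).
export ETCDCTL_API=3

# Connection settings; each variable mirrors the flag of the same name.
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/healthcheck-client.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/healthcheck-client.key

# Warn early if a certificate file is not readable from this shell.
for f in "$ETCDCTL_CACERT" "$ETCDCTL_CERT" "$ETCDCTL_KEY"; do
  [ -r "$f" ] || echo "warning: cannot read $f" >&2
done
```

With these exported, the flags can be dropped (for example, etcdctl endpoint health). etcdctl may reject a command that sets both a variable and its matching flag, so use one form or the other.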
Preliminary checks
Cluster health
etcdctl endpoint health \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key
Cluster member list
etcdctl member list \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
-w table
Endpoint status
etcdctl endpoint status \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
-w table
Disk usage
etcdctl endpoint status \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
-w json | python3 -m json.tool | grep dbSize
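The grep above prints the raw byte count. To compare it against the quota at a glance, the same JSON can be converted to MiB. A sketch, assuming the layout produced by etcdctl endpoint status -w json (a list with one entry per endpoint); the function name db_size_mib is illustrative:

```shell
# Reads `etcdctl endpoint status -w json` output on stdin and prints
# one "<endpoint> <size> MiB" line per endpoint.
db_size_mib() {
  python3 -c '
import json, sys
for ep in json.load(sys.stdin):
    size = ep["Status"]["dbSize"]
    print("%s %.1f MiB" % (ep["Endpoint"], size / (1024 * 1024)))
'
}

# Usage (add the --endpoints and TLS flags shown above as needed):
#   etcdctl endpoint status -w json | db_size_mib
```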
Alarm check
etcdctl alarm list \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key
Common errors
connection refused
| Symptom | dial tcp 127.0.0.1:2379: connect: connection refused |
| Cause | The etcd process is not running or is listening on a different address/port. |
| Solution | Check the etcd pod/service status: crictl ps -a | grep etcd (-a also lists containers that have exited). Check logs: crictl logs <container-id>. Make sure --listen-client-urls contains the address and port you are connecting to. |
certificate expired
| Symptom | certificate has expired or is not yet valid |
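| Cause | An etcd TLS certificate (server, peer, or the client certificate used by etcdctl) has expired. If the certificate reports as not yet valid, the host clock is likely wrong. |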
| Solution | Check the expiration date: openssl x509 -in /etc/kubernetes/pki/etcd/server.crt -noout -dates. Reissue the certificates and restart etcd. |
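Expiry can also be checked ahead of time. openssl x509 -checkend N exits non-zero when the certificate expires within N seconds, which makes the check easy to script. A sketch; the function name check_cert_expiry and the 30-day default are illustrative choices, not part of etcd or kubeadm:

```shell
# Prints OK / EXPIRING / MISSING for one certificate file.
# Returns 0 if valid beyond the window, 1 if expiring, 2 if unreadable.
check_cert_expiry() {
  cert="$1"
  days="${2:-30}"
  if [ ! -r "$cert" ]; then
    echo "MISSING $cert"
    return 2
  fi
  # -checkend takes seconds and succeeds if the certificate is still valid then.
  if openssl x509 -in "$cert" -noout -checkend "$((days * 86400))" >/dev/null; then
    echo "OK $cert"
  else
    echo "EXPIRING $cert"
    return 1
  fi
}

# Example: check every certificate in the etcd PKI directory.
#   for c in /etc/kubernetes/pki/etcd/*.crt; do check_cert_expiry "$c" 30; done
```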
quorum lost
| Symptom | etcdserver: request timed out or rafthttp: failed to reach the peer |
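| Cause | Half or more of the cluster members are unavailable, so fewer than the required majority remain and quorum is lost. |
| Solution | Restore node availability first. If a single member cannot be recovered while the rest of the cluster still has quorum, remove it (etcdctl member remove <member-id>) and add it again (etcdctl member add). Membership changes themselves require quorum, so if quorum cannot be regained, restore the cluster from a snapshot: etcdctl snapshot restore. |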
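The quorum rule is simple arithmetic: a cluster of n members needs floor(n/2) + 1 of them reachable, so it tolerates floor((n - 1)/2) failures. A small sketch with illustrative function names:

```shell
# Members that must be reachable for the cluster to accept writes.
quorum_size() { echo $(( $1 / 2 + 1 )); }

# Members that can fail before quorum is lost.
fault_tolerance() { echo $(( ($1 - 1) / 2 )); }

for n in 1 3 4 5; do
  echo "members=$n quorum=$(quorum_size "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

A 4-member cluster tolerates one failure, the same as a 3-member cluster, which is why odd cluster sizes are usually recommended.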
database size exceeded
| Symptom | mvcc: database space exceeded |
| Cause | The etcd database size has exceeded the quota (default is 2 GB). |
| Solution | 1. Perform compaction: etcdctl compact $(etcdctl endpoint status -w json | python3 -c "import sys,json; print(json.load(sys.stdin)[0]['Status']['header']['revision'])"). 2. Run defragmentation on every member: etcdctl defrag (it acts only on the endpoints passed in --endpoints; --cluster covers all members). 3. Disarm the alarm: etcdctl alarm disarm. 4. If the data set legitimately needs more space, consider increasing the quota via --quota-backend-bytes. |
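Steps 1 to 3 above can be chained so that a failed step stops the sequence. A sketch, assuming the connection settings are supplied through ETCDCTL_* environment variables (otherwise add the --endpoints and TLS flags to each call); the function name reclaim_etcd_space is illustrative:

```shell
reclaim_etcd_space() {
  # 1. Read the current revision from the first endpoint in the status output.
  rev=$(etcdctl endpoint status -w json |
    python3 -c 'import sys, json; print(json.load(sys.stdin)[0]["Status"]["header"]["revision"])') || return 1
  [ -n "$rev" ] || return 1

  # 2. Compact: discards key history older than this revision.
  etcdctl compact "$rev" || return 1

  # 3. Defragment to return the freed space to the filesystem.
  #    Acts only on the configured endpoints; repeat per member as needed.
  etcdctl defrag || return 1

  # 4. Clear the NOSPACE alarm so that writes are accepted again.
  etcdctl alarm disarm
}
```

Compaction is irreversible, so confirm that no client depends on older revisions before running it.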