5.1.5.1. etcd Diagnostic Checklist

Before starting diagnostics, make sure the ETCDCTL_API=3 variable is set (etcdctl 3.4 and later already default to the v3 API) and that the TLS certificates for connecting to the cluster are available and readable by the current user.
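As a minimal sketch of this preparation, the connection settings can be exported once as ETCDCTL_* environment variables. The endpoint and certificate paths are the ones used in the commands on this page and are an assumption about your layout. When these variables are set, omit the matching --endpoints, --cacert, --cert and --key flags, because etcdctl may refuse to run if the same option is supplied both ways.

```shell
# Assumption: endpoint and certificate paths match the commands on this page.
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/healthcheck-client.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/healthcheck-client.key

# Report every certificate or key file the current user cannot read.
for f in "$ETCDCTL_CACERT" "$ETCDCTL_CERT" "$ETCDCTL_KEY"; do
  [ -r "$f" ] || echo "warning: cannot read $f" >&2
done
```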

Preliminary checks

Cluster health

etcdctl endpoint health \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key

Cluster member list

etcdctl member list \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
-w table

Endpoint status

etcdctl endpoint status \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
-w table

Database size

etcdctl endpoint status \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
-w json | python3 -m json.tool | grep dbSize
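To put the dbSize value in context, the following sketch compares it with the backend quota. The helper name db_usage and the QUOTA_BYTES variable are hypothetical, and the quota is assumed to be the 2 GB default (taken here as 2 GiB); adjust it if --quota-backend-bytes is set on your cluster.

```shell
# Assumption: default backend quota of 2 GiB, expressed in bytes.
QUOTA_BYTES=${QUOTA_BYTES:-2147483648}

# Read `etcdctl endpoint status -w json` on stdin and print, for each endpoint,
# the dbSize in bytes and its share of the quota in percent (rounded down).
db_usage() {
  grep -o '"dbSize"[[:space:]]*:[[:space:]]*[0-9][0-9]*' |
    grep -o '[0-9][0-9]*$' |
    while read -r size; do
      echo "$size bytes, $(( size * 100 / QUOTA_BYTES ))% of quota"
    done
}
```

Usage: append `| db_usage` to the endpoint status command with -w json instead of the python3 and grep stages.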

Alarm check

etcdctl alarm list \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key
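The preliminary checks can also be run in one pass. The sketch below wraps the repeated connection flags in a helper and counts failing commands; the function names etcdctl_tls and etcd_preflight are hypothetical. It only detects commands that exit with an error, so the alarm list output still needs to be read by eye.

```shell
# Wrapper that adds the endpoint and TLS flags used on this page.
etcdctl_tls() {
  etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
    --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
    "$@"
}

# Run each preliminary check in order and count failures
# instead of stopping at the first one.
etcd_preflight() {
  failed=0
  for check in "endpoint health" "member list -w table" \
               "endpoint status -w table" "alarm list"; do
    echo "== etcdctl $check"
    # $check is intentionally unquoted so it splits into separate arguments.
    etcdctl_tls $check || failed=$((failed + 1))
  done
  echo "failed checks: $failed"
  [ "$failed" -eq 0 ]
}
```

Usage: run `etcd_preflight` on a control-plane node; a non-zero exit status means at least one command failed.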

Common errors

connection refused

Symptom: dial tcp 127.0.0.1:2379: connect: connection refused
Cause: The etcd process is not running or is listening on a different address/port.
Solution: Check the etcd pod/service status, including stopped containers: crictl ps -a | grep etcd. Check logs: crictl logs <container-id>. Make sure --listen-client-urls contains the required address.
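Before digging into container logs, it can help to confirm whether anything accepts TCP connections on the client port at all. This is a rough sketch: the helper name port_open is hypothetical, and it relies on bash's /dev/tcp feature and the coreutils timeout command being available on the node.

```shell
# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
# Uses bash's /dev/tcp pseudo-device, so it needs bash rather than plain sh.
port_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

if port_open 127.0.0.1 2379; then
  echo "127.0.0.1:2379 accepts TCP connections"
else
  echo "127.0.0.1:2379 is closed: etcd is stopped or bound to another address"
fi
```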

certificate expired

Symptom: certificate has expired or is not yet valid
Cause: The etcd TLS certificate has expired.
Solution: Check the expiration date: openssl x509 -in /etc/kubernetes/pki/etcd/server.crt -noout -dates. Reissue the certificates and restart etcd.
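The same check can be applied to several certificates at once and extended to warn before expiry. The sketch below uses openssl x509 -checkend; the function name check_cert_expiry and the 30-day window are illustrative choices, and the file list contains only the paths that appear on this page.

```shell
# Print a certificate's expiry date and warn if it expires within N days (default 30).
# Returns 0 if valid beyond the window, 1 if expiring or expired, 2 if unreadable.
check_cert_expiry() {
  cert=$1
  days=${2:-30}
  [ -r "$cert" ] || { echo "cannot read $cert" >&2; return 2; }
  openssl x509 -in "$cert" -noout -enddate
  if openssl x509 -in "$cert" -noout -checkend $(( days * 86400 )) >/dev/null; then
    return 0
  fi
  echo "warning: $cert expires within $days days or has already expired" >&2
  return 1
}

for c in /etc/kubernetes/pki/etcd/server.crt \
         /etc/kubernetes/pki/etcd/ca.crt \
         /etc/kubernetes/pki/etcd/healthcheck-client.crt; do
  check_cert_expiry "$c" 30 || true   # keep going so every file is reported
done
```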

quorum lost

Symptom: etcdserver: request timed out or rafthttp: failed to reach the peer
Cause: Fewer than a majority of cluster members (floor(n/2) + 1 out of n) are reachable, so quorum is lost.
Solution: Restore node availability. If a node cannot be recovered, remove it from the cluster (etcdctl member remove) and add it again (etcdctl member add); membership changes themselves require quorum, so this works only while a majority of members is still healthy. In a critical case, restore the cluster from a snapshot: etcdctl snapshot restore.
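The quorum arithmetic can be sketched as follows; the function names are hypothetical. Note that a 4-member cluster tolerates no more failures than a 3-member one, which is why odd cluster sizes are usually preferred.

```shell
# Members that must be reachable for an n-member cluster to keep quorum.
quorum_size() { echo $(( $1 / 2 + 1 )); }

# Members that can fail before quorum is lost.
fault_tolerance() { echo $(( ($1 - 1) / 2 )); }

for n in 1 3 4 5; do
  echo "members=$n quorum=$(quorum_size "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```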

database size exceeded

Symptom: mvcc: database space exceeded
Cause: The etcd database size has exceeded the quota (default is 2 GB).
Solution:
1. Perform compaction: etcdctl compact $(etcdctl endpoint status -w json | python3 -c "import sys,json; print(json.load(sys.stdin)[0]['Status']['header']['revision'])")
2. Run defragmentation: etcdctl defrag
3. Disarm the alarm: etcdctl alarm disarm
4. Consider increasing the quota via --quota-backend-bytes.
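The first three recovery steps can be chained so that a later step runs only if the previous one succeeded. This is a sketch, not a hardened procedure: the function names are hypothetical, the revision is extracted with grep rather than python3, and the endpoint and TLS settings are assumed to come from ETCDCTL_* environment variables. etcdctl defrag acts only on the endpoints it is given, so in a multi-member cluster it has to be run against each member.

```shell
# Extract the first "revision" value from `etcdctl endpoint status -w json` on stdin.
current_revision() {
  grep -o '"revision"[[:space:]]*:[[:space:]]*[0-9][0-9]*' |
    head -n 1 |
    grep -o '[0-9][0-9]*$'
}

# Compact up to the current revision, defragment, then disarm the alarm,
# stopping at the first step that fails.
# Assumption: connection settings are provided via ETCDCTL_* variables.
etcd_reclaim_space() {
  rev=$(etcdctl endpoint status -w json | current_revision)
  [ -n "$rev" ] || { echo "could not read the current revision" >&2; return 1; }
  etcdctl compact "$rev" &&
    etcdctl defrag &&
    etcdctl alarm disarm
}
```

Usage: run `etcd_reclaim_space`, then confirm with etcdctl alarm list that no NOSPACE alarm remains.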