Kubernetes workloads rarely fail for mysterious reasons; most of the time, the same small set of failure modes appears again and again under different names. This checklist is designed as a reusable field guide for Kubernetes troubleshooting: start with a short triage sequence, identify the failure pattern, run a few high-signal commands, and choose the least risky fix path before making changes. Keep it nearby for day-to-day debugging, on-call response, and post-change validation.
Overview
A useful Kubernetes troubleshooting checklist does two things well: it narrows the search space quickly, and it helps teams avoid making a degraded cluster worse. When an application is down, pods are stuck, or traffic is failing, the pressure to restart everything is strong. In practice, the better path is usually structured triage.
Use this order of operations before diving into any one component:
- Define the symptom precisely. Is the problem a failed deployment, traffic outage, DNS issue, scheduling failure, image pull error, crash loop, or storage problem?
- Confirm the blast radius. Is one pod affected, one namespace, one node pool, or the whole cluster?
- Check recent change events. New image, config update, secret rotation, network policy change, node upgrade, autoscaler activity, or certificate renewal often explains the timing.
- Read the object status before reading assumptions. Kubernetes already records a lot in events, conditions, and pod state.
- Prefer inspection over intervention. Gather evidence first, then restart or roll back only if you know why.
Start with a compact baseline command set:
kubectl get pods -A
kubectl get events -A --sort-by=.lastTimestamp
kubectl get nodes
kubectl top nodes
kubectl top pods -A
kubectl describe pod <pod> -n <namespace>
kubectl logs <pod> -n <namespace> --previous
kubectl get deploy,rs,sts,ds -A
kubectl get svc,ep,endpointslices -A
kubectl get ingress -A
If your team uses metrics, logs, and traces consistently, cross-check Kubernetes status with application telemetry rather than treating them as separate systems. This is especially important when symptoms look like cluster failure but the root cause is actually a slow dependency, external API issue, or bad rollout. For broader reliability practices, teams often pair this checklist with observability and pipeline discipline similar to the patterns discussed in CI/CD for Maps: Versioning, Tests and Deployments for Spatial Analytics.
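On a busy cluster, the pod listing from the baseline command set can run to hundreds of lines. As a minimal sketch, the hypothetical helper `unhealthy_pods` below narrows that listing to pods whose STATUS column is neither Running nor Completed. It assumes the default column layout of `kubectl get pods -A --no-headers` (namespace, name, ready, status, and so on), which is a display format rather than a stable API.

```shell
# unhealthy_pods: hypothetical helper, not part of kubectl.
# Reads `kubectl get pods -A --no-headers` output on stdin and prints
# namespace, pod name, and status for pods that are not Running or Completed.
# Assumes STATUS is the fourth whitespace-separated column.
unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1, $2, $4 }'
}

# Usage against a live cluster:
#   kubectl get pods -A --no-headers | unhealthy_pods
```

A pod can be Running without being Ready, so this filter shortens the list to review; it does not prove that everything it hides is healthy.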
Checklist by scenario
This section maps common Kubernetes errors to fast diagnostics and practical fix paths. Work from the symptom that best matches what you see.
1. Pod is Pending
What it usually means: the scheduler cannot place the pod on a node.
Check:
kubectl describe pod <pod> -n <namespace>
kubectl get nodes
kubectl describe nodes
Look for: insufficient CPU or memory, taints without tolerations, node selectors that match nothing, affinity rules that are too strict, unbound persistent volume claims, or quota limits.
Fix path:
- Read scheduler events in the pod description.
- Confirm requests and limits are realistic for the cluster size.
- Check whether a recent policy or affinity change reduced eligible nodes.
- If storage is involved, inspect PVC state before touching compute settings.
- Scale node capacity or relax placement constraints only after confirming the intended scheduling policy.
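The scheduler events mentioned above usually carry a one-line message that already names the constraint. As a rough sketch, the hypothetical helper `pending_hint` maps such a message to the checklist item most likely to apply. The message wording differs between Kubernetes versions, so the patterns are loose assumptions and the first match wins.

```shell
# pending_hint: hypothetical helper that maps a FailedScheduling event
# message to a likely next step. The patterns are assumptions about message
# wording, which varies across Kubernetes versions; the first match wins.
pending_hint() {
  case "$1" in
    *Insufficient*)          echo "capacity: compare requests with allocatable CPU and memory" ;;
    *taint*)                 echo "taints: add a toleration or target another node pool" ;;
    *affinity*|*selector*)   echo "placement: review nodeSelector and affinity rules" ;;
    *PersistentVolumeClaim*) echo "storage: inspect the PVC before changing compute" ;;
    *)                       echo "unknown: read the full event text" ;;
  esac
}

# Usage: copy the message from the Events section of
#   kubectl describe pod <pod> -n <namespace>
# and pass it as a single quoted argument:
#   pending_hint '0/3 nodes are available: 3 Insufficient cpu.'
```

The helper only points at a section of the checklist; the event text itself remains the evidence.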
2. Pod is in CrashLoopBackOff
What it usually means: the container starts, exits, and keeps restarting.
Check:
kubectl logs <pod> -n <namespace>
kubectl logs <pod> -n <namespace> --previous
kubectl describe pod <pod> -n <namespace>
Look for: application boot failures, missing environment variables, bad config mounts, secret issues, port conflicts, dependency timeouts, or liveness probes killing a slow-starting process.
Fix path:
- Read previous container logs first; current logs may miss the actual crash.
- Inspect exit codes and termination reasons.
- Compare the running deployment to the last known good revision.
- Validate configmaps, secrets, and command/args wiring.
- If probes are failing, check startup timing before disabling probes entirely.
A frequent mistake is blaming Kubernetes for an application error. If the process exits because it cannot parse config or connect to a database, Kubernetes is doing its job by restarting the container with backoff. The fix belongs in the app, secret, or deployment spec.
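Exit codes are often the quickest way to tell an application failure from a kill by the platform. The sketch below uses a hypothetical helper, `explain_exit_code`, to translate a few common codes. The interpretations follow general Unix conventions (128 plus the signal number), so treat them as starting points rather than verdicts.

```shell
# explain_exit_code: hypothetical helper that translates common container
# exit codes using general Unix conventions. Interpretations are hints only.
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly; check whether the process is meant to keep running" ;;
    1)   echo "generic application error; read the previous container logs" ;;
    126) echo "command found but not executable; check permissions and command/args" ;;
    127) echo "command not found; check image contents and command/args" ;;
    137) echo "killed by SIGKILL (128+9); often OOMKilled, confirm the termination reason" ;;
    139) echo "segmentation fault, SIGSEGV (128+11)" ;;
    143) echo "terminated by SIGTERM (128+15); may be a shutdown or a failed liveness probe" ;;
    *)   echo "application-specific exit code; consult the application's own documentation" ;;
  esac
}

# Usage: read the last exit code of the first container, then explain it.
#   code=$(kubectl get pod <pod> -n <namespace> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')
#   explain_exit_code "$code"
```

The termination reason shown by `kubectl describe pod` should take precedence over the code when the two seem to disagree.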
3. ImagePullBackOff or ErrImagePull
What it usually means: the node cannot pull the container image.
Check:
kubectl describe pod <pod> -n <namespace>
kubectl get serviceaccount -n <namespace> -o yaml
kubectl get secret -n <namespace>
Look for: wrong image tag, missing registry credentials, expired pull secret, private registry access issues, or policy blocking the image.
Fix path:
- Verify the image reference exactly as deployed.
- Confirm the registry secret exists in the correct namespace.
- Check the service account used by the pod.
- Re-test with a known good image if you need to isolate registry access from application packaging.
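Verifying the image reference exactly as deployed is easier when the tag or digest is pulled out explicitly, because a reference without either falls back to the `latest` tag. The following is a small sketch built around a hypothetical helper, `image_tag`; it handles the common reference shapes and is not a full parser for every valid image reference.

```shell
# image_tag: hypothetical helper that prints the digest or tag of an image
# reference, or "latest (implicit)" when neither is present.
image_tag() {
  ref=$1
  last=${ref##*/}   # keep only the final path segment so a registry port is not read as a tag
  case "$last" in
    *@*) echo "${last#*@}" ;;    # digest form, for example sha256:...
    *:*) echo "${last##*:}" ;;   # explicit tag
    *)   echo "latest (implicit)" ;;
  esac
}

# Usage: list the images a pod was actually deployed with, then check each.
#   kubectl get pod <pod> -n <namespace> -o jsonpath='{.spec.containers[*].image}'
#   image_tag 'registry.example.com:5000/team/app'
```

The registry host in the usage line is a placeholder, not a value from any real cluster.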
Identity and secret handling are common root causes here. If your cluster relies on workload identities and external secret systems, keep access design documented and review it alongside pieces like Workload Identity vs Human Identity: A Zero-Trust Blueprint for Mixed SaaS Ecosystems and Distinguishing Nonhuman from Human Identities in SaaS: Practical Detection and Governance.
4. Service exists, but traffic fails
What it usually means: the service, endpoints, or application ports do not line up.
Check:
kubectl get svc -n <namespace>
kubectl get ep,endpointslices -n <namespace>
kubectl describe svc <service> -n <namespace>
kubectl get pods -l app=<label> -n <namespace>
Look for: selector mismatch, targetPort mismatch, pods not Ready, readiness probes failing, or application listening on the wrong port.
Fix path:
- Verify the service selector matches the intended pods.
- Confirm endpoints exist; no endpoints means no Ready pods match the service selector.
- Match port, targetPort, and container port configuration.
- Check readiness probes before changing the service itself.
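A selector matches a pod only when every key=value pair in the selector also appears in the pod's labels; extra labels on the pod are fine. The sketch below illustrates that rule with a hypothetical helper, `selector_matches`, which takes both sides as comma-separated `key=value` strings. It covers equality selectors only, which is the form Service selectors use.

```shell
# selector_matches: hypothetical helper illustrating equality-based selector
# matching. Both arguments are comma-separated key=value lists.
# Returns 0 when every selector pair is present in the labels, 1 otherwise.
selector_matches() {
  selector=$1
  labels=$2
  old_ifs=$IFS
  IFS=,
  for pair in $selector; do
    case ",$labels," in
      *",$pair,"*) ;;                       # this pair is present, keep checking
      *) IFS=$old_ifs; return 1 ;;          # one missing pair means no match
    esac
  done
  IFS=$old_ifs
  return 0
}

# Usage:
#   selector_matches 'app=web,tier=frontend' 'app=web,tier=frontend,pod-template-hash=abc' && echo match
# On a live cluster, kubectl applies the same rule directly:
#   kubectl get pods -n <namespace> -l app=web,tier=frontend
```

If the `-l` query returns pods but the service still has no endpoints, the next suspects are readiness and port configuration rather than the selector.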
5. Ingress or external access is broken
What it usually means: the problem sits in ingress rules, class selection, TLS, DNS, or the backing service.
Check:
kubectl get ingress -A
kubectl describe ingress <name> -n <namespace>
kubectl get svc -n <namespace>
kubectl get secret -n <namespace>
Look for: wrong ingress class, bad host/path rules, missing TLS secret, backend service mismatch, or DNS still pointing at an old endpoint.
Fix path:
- Validate the request path from DNS to ingress to service to pod.
- Confirm the ingress controller is actually watching that ingress class.
- Check certificate secret names and namespace placement.
- If only some routes fail, compare path matching and rewrite behavior.
6. DNS resolution inside the cluster fails
What it usually means: CoreDNS, network policy, or service naming assumptions are broken.
Check:
kubectl get pods -n kube-system
kubectl logs -n kube-system -l k8s-app=kube-dns
kubectl exec -it <pod> -n <namespace> -- nslookup <service>
Look for: CoreDNS restarts, upstream resolver issues, blocked egress, or incorrect fully qualified service names.
Fix path:
- Test resolution from a pod in the same namespace first.
- Use the full service DNS name if there is any doubt.
- Confirm network policies allow DNS traffic.
- Inspect CoreDNS config only after ruling out local application assumptions.
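Using the full service DNS name removes any dependence on the pod's search domains. As a minimal sketch, the hypothetical helper `service_fqdn` builds that name from a service and namespace. It assumes the cluster domain is `cluster.local`, which is the usual default but can be changed per cluster, so an optional third argument overrides it.

```shell
# service_fqdn: hypothetical helper that builds the fully qualified DNS name
# of a Service. The cluster domain defaults to cluster.local (an assumption;
# some clusters are configured with a different domain).
service_fqdn() {
  svc=$1
  ns=$2
  domain=${3:-cluster.local}
  echo "$svc.$ns.svc.$domain"
}

# Usage from a pod in the affected namespace:
#   kubectl exec -it <pod> -n <namespace> -- nslookup "$(service_fqdn web prod)"
```

If the full name resolves but the short name does not, the problem is more likely in the application's naming assumptions than in CoreDNS.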
7. Node NotReady or workloads unstable on one node
What it usually means: the node is under pressure, disconnected, unhealthy, or has runtime problems.
Check:
kubectl get nodes
kubectl describe node <node>
kubectl top node <node>
kubectl get pods -A -o wide | grep <node>
Look for: memory pressure, disk pressure, network unavailable, kubelet issues, container runtime errors, or one noisy workload exhausting resources.
Fix path:
- Cordon the node if workloads are churning.
- Drain carefully if the issue is isolated and disruption budgets allow it.
- Inspect node-level logs outside Kubernetes if you manage the nodes directly.
- Review requests, limits, and eviction behavior to prevent recurrence.
8. PVC is Pending or storage mounts fail
What it usually means: storage class, binding, capacity, access mode, or CSI health is wrong.
Check:
kubectl get pvc,pv -A
kubectl describe pvc <claim> -n <namespace>
kubectl get storageclass
Look for: no matching storage class, unavailable provisioner, unsupported access mode, zone mismatch, or volume attachment errors.
Fix path:
- Confirm the storage class name and default behavior.
- Match claim requirements to what the provisioner supports.
- Check whether topology rules restrict where the pod can run.
- Avoid deleting claims casually if persistent data matters.
9. Deployment rollout is stuck
What it usually means: new replicas cannot become Ready, progress deadline is exceeded, or old replicas cannot terminate safely.
Check:
kubectl rollout status deploy/<name> -n <namespace>
kubectl describe deploy <name> -n <namespace>
kubectl get rs -n <namespace>
Look for: failing probes, quota exhaustion, image pull errors, immutable field assumptions, or PodDisruptionBudget interactions.
Fix path:
- Inspect the new ReplicaSet rather than just the Deployment.
- Check the exact reason the new pods are not becoming Ready.
- If production is impacted, consider rollback sooner rather than repeatedly patching a bad release in place.
- Review release process hygiene so the same issue does not return in the next deploy.
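To inspect the new ReplicaSet rather than the Deployment, it helps to see at a glance which ReplicaSets want replicas but have fewer Ready. The sketch below uses a hypothetical helper, `stalled_replicasets`, that filters `kubectl get rs -n <namespace> --no-headers` output. It assumes the default columns (name, desired, current, ready, age), which are a display format and may differ with custom output options.

```shell
# stalled_replicasets: hypothetical helper, not part of kubectl.
# Reads `kubectl get rs -n <namespace> --no-headers` on stdin and prints
# ReplicaSets whose READY count is below a non-zero DESIRED count.
# Assumes columns: NAME DESIRED CURRENT READY AGE.
stalled_replicasets() {
  awk '$2 > 0 && $4 < $2 { print $1, "ready", $4 "/" $2 }'
}

# Usage:
#   kubectl get rs -n <namespace> --no-headers | stalled_replicasets
# If production is impacted and the cause is not yet clear, a rollback is:
#   kubectl rollout undo deploy/<name> -n <namespace>
```

ReplicaSets scaled to zero are skipped on purpose, since those are normally old revisions kept for rollback.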
10. NetworkPolicy caused a silent outage
What it usually means: traffic is now denied by default or allowed only for paths you did not account for.
Check:
kubectl get networkpolicy -A
kubectl describe networkpolicy <name> -n <namespace>
- Connectivity tests from source pod to destination service
Look for: namespace selector mistakes, missing egress rules, DNS traffic not allowed, or policies applied to labels broader than intended.
Fix path:
- Confirm whether the namespace moved from allow-all to default-deny behavior.
- Test both ingress and egress assumptions.
- Version policy changes carefully and keep known-good examples nearby.
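Missing DNS egress is one of the more common side effects of a move to default-deny, so a known-good example is worth keeping. The sketch below wraps one in a hypothetical helper, `dns_egress_policy`, that prints a NetworkPolicy manifest for a given namespace. It assumes cluster DNS runs in `kube-system` with the label `k8s-app=kube-dns` and that the namespace carries the `kubernetes.io/metadata.name` label; the policy name is illustrative, and all of these should be checked against your cluster.

```shell
# dns_egress_policy: hypothetical helper that prints a NetworkPolicy allowing
# every pod in the given namespace to reach cluster DNS on port 53.
# Assumptions: DNS pods live in kube-system and carry k8s-app=kube-dns.
dns_egress_policy() {
  ns=$1
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: $ns
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
EOF
}

# Usage: review the manifest, then validate it without applying changes.
#   dns_egress_policy <namespace> | kubectl apply --dry-run=client -f -
```

Printing the manifest instead of applying it keeps the change reviewable and easy to commit next to the other policies.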
For teams running regulated or high-change environments, it helps to treat network and identity rules as auditable infrastructure artifacts, similar to the governance mindset in Design Patterns for Auditable AI Flows: Data Lineage, Reproducibility and Access Controls.
What to double-check
Once you think you have found the issue, pause for a short verification pass. Many Kubernetes incidents turn into longer outages because the first plausible explanation was accepted too quickly.
- Namespace: Are you looking in the correct namespace? A surprising amount of confusion starts here.
- Labels and selectors: Service selectors, deployment selectors, and policy selectors must align exactly.
- Events timing: The most useful event is often the earliest one, not the latest repeated warning.
- Readiness vs liveness: If traffic fails, readiness is often more relevant than liveness.
- Requests and limits: A pod that “works on one node” may actually be overscheduled or starved elsewhere.
- Recent automation changes: GitOps syncs, Helm values, admission policies, and mutating webhooks can alter the final manifest.
- Version drift: Cluster upgrades, ingress controller changes, CSI updates, and CNI changes can shift behavior at the edges.
- Dependencies outside the cluster: Databases, cloud load balancers, external DNS, registry access, and identity providers can all create Kubernetes-shaped symptoms.
If you need deeper runtime inspection, ephemeral debugging can help, but use it carefully and consistently with your platform controls. A simple pattern is to inspect from inside the network path rather than guessing from outside:
kubectl debug -it <pod> -n <namespace> --image=busybox --target=<container>
Use kubectl debug commands to answer concrete questions: Can this pod resolve DNS? Can it connect to the service port? Does the mounted config file contain what the application expects? Treat debugging containers as a diagnostic tool, not a substitute for proper logging and metrics.
Common mistakes
The fastest way to improve cluster troubleshooting is often to remove a few habits that create noise.
- Restarting pods before collecting evidence. This can erase the exact failure state you needed to inspect.
- Ignoring events. kubectl describe remains one of the highest-value commands in Kubernetes.
- Treating all pod failures as platform issues. Many are packaging, config, or dependency problems.
- Checking only one layer. Service, ingress, DNS, policy, and application layers must be traced together.
- Overlooking quotas and policies. ResourceQuota, LimitRange, PodSecurity controls, and admission webhooks can block otherwise valid changes.
- Skipping rollback criteria. During a bad rollout, teams often keep tuning rather than returning to a known good state.
- Using broad fixes for narrow failures. Draining a node, restarting a deployment, or changing cluster-wide DNS is too much if one namespace label is wrong.
- Not documenting recurring incidents. The same error tends to come back after staffing changes, tool migrations, or version upgrades.
A practical rule: every incident should leave behind one reusable artifact, such as a runbook snippet, alert improvement, dashboard panel, or deployment validation step. This is how a cluster troubleshooting guide becomes more valuable over time instead of aging into a stale wiki page.
When to revisit
This checklist should be treated as a living operational reference. Revisit and update it whenever the cluster, delivery process, or surrounding tooling changes enough to alter the likely failure modes.
Good update triggers include:
- Before seasonal planning cycles: review recurring incidents, capacity pain, and weak runbooks before peak release periods.
- When workflows or tools change: GitOps adoption, ingress replacement, new CNI, secret manager changes, or registry migration all justify a refresh.
- After cluster upgrades: verify assumptions around APIs, admission policies, storage drivers, and debugging workflows.
- After every major incident: add the exact command sequence and evidence path that worked.
- When team ownership shifts: platform, SRE, and application teams should agree on the first-response checklist and escalation boundary.
To keep this practical, end with a short action list:
- Create a shared version of this checklist in your runbook repository.
- Add your cluster-specific commands for logs, metrics, and node access.
- Map your top five historical incident types to one-page fix paths.
- Define rollback criteria for deployments before the next release window.
- Review identity, secret, and network policy assumptions alongside platform changes.
If your infrastructure strategy spans multiple cloud models or hosted and self-managed clusters, align this troubleshooting checklist with broader operating decisions like those covered in Private Cloud vs Public Cloud in 2026: A Decision Framework for Dev Teams. The exact platform may vary, but the core discipline remains the same: observe carefully, narrow scope, verify assumptions, and choose the smallest effective fix.
Used this way, a Kubernetes incident checklist is more than a list of commands. It becomes a habit for reducing guesswork, shortening outages, and teaching the next responder where to look first.