Kubernetes Disaster Recovery: What to Do When Your Cluster Goes Full Chernobyl

When your Kubernetes cluster fails, panic makes it worse. This guide provides a step-by-step triage checklist, decodes cryptic errors, and shows how to build a recovery kit before disaster strikes.

There you are, sipping your third coffee, when Slack explodes. The dashboard is a sea of red. "API is down!" "Users can't log in!" Your heart rate spikes. You run kubectl get pods and it's like staring into the abyss—a wall of CrashLoopBackOff and Evicted. The abyss stares back.

This is the moment when developers typically start frantically Googling error messages, running random kubectl commands they don't understand, and accidentally making the situation worse. Deleting namespaces, force-draining nodes, restarting everything—it's the digital equivalent of throwing water on a grease fire. Let's be smarter than that.

TL;DR

  • Don't panic-delete things. Your first instinct is probably wrong and will make recovery harder.
  • Follow the triage checklist below. Diagnose from infrastructure up (nodes → pods → logs).
  • Build your disaster kit NOW, before the next meltdown. Scripts, config backups, and runbooks are your insurance.

Step 1: The Triage Checklist (Stop the Bleeding)

When the alarm bells ring, you need a systematic approach, not panic. Start broad and narrow down. Your goal is to identify the epicenter of the failure.

1.1 Check Your Infrastructure: The Foundation

First, verify your cluster's physical (or virtual) backbone. Run kubectl get nodes -o wide. Look for nodes with status NotReady. If your control plane nodes are down, you have a much bigger problem than a misbehaving pod.

Common Mistake: Assuming it's an application bug when the underlying VM or cloud instance has silently died. Check your cloud provider's console or your virtualization platform.
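A minimal first pass might look like this (the awk filter prints only unhealthy nodes, so no output is good news):

```shell
# Broad view of node health; NotReady or Unknown in the STATUS column is the first red flag
kubectl get nodes -o wide

# Print only the nodes that are NOT Ready (nothing printed = all healthy)
kubectl get nodes --no-headers | awk '$2 != "Ready" {print $1, $2}'
```

Note that a plain grep -v Ready would be a bug here: the string "NotReady" contains "Ready", so the awk comparison on the whole status field is the safe filter.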

1.2 The Pod Post-Mortem

Next, survey the damage at the pod level. kubectl get pods --all-namespaces gives you the big picture. Filter for the bad news: grep -E "(Error|CrashLoopBackOff|Pending|Evicted)". Are failures concentrated in one namespace or one node? That's a critical clue.

Pro Tip: Use kubectl get pods -o wide to see which node each problematic pod is on. A pattern of failures on a single node points to a node-level issue (disk pressure, network, kubelet failure).
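Putting those two tips together, a quick damage survey might look like:

```shell
# Big picture: every pod, every namespace, with the node it landed on
kubectl get pods --all-namespaces -o wide

# Filter down to the bad news only
kubectl get pods --all-namespaces --no-headers \
  | grep -E "(Error|CrashLoopBackOff|Pending|Evicted)"
```

If the second command clusters around one namespace or one NODE value, you have found your epicenter.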

1.3 Describe the Problem

Now, investigate a specific problem pod or node. kubectl describe node [node-name] is a goldmine. Scroll to the Conditions and Events sections at the bottom. You might find MemoryPressure, DiskPressure, or NetworkUnavailable.

For a pod, kubectl describe pod [pod-name] -n [namespace] will show you its lifecycle, assigned node, and—most importantly—the exact reason it's Pending or Failed (e.g., "failed to pull image," "insufficient cpu").
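For example (worker-2, api-7d9f, and prod are placeholder names — substitute your own):

```shell
# Node-level: jump straight to the Conditions block instead of scrolling
kubectl describe node worker-2 | grep -A 8 "Conditions:"

# Pod-level: the Events section at the bottom usually names the exact failure
kubectl describe pod api-7d9f -n prod | tail -n 20
```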

Step 2: Decoding the Cryptic Error Messages

Kubernetes errors are famously unhelpful. Let's translate the most common ones.

"ImagePullBackOff" or "ErrImagePull"

This means the kubelet cannot pull your container image. Check: 1) Is the image name/tag correct? 2) Does the node have network access to the registry? 3) Are your image pull secrets configured and valid? A quick test: kubectl describe pod will often show the exact authentication error.
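A sketch of that checklist as commands, with illustrative image, secret, and namespace names:

```shell
# Surface the pull error without reading the whole describe output
kubectl describe pod api-7d9f -n prod | grep -iE "failed|pull|unauthorized"

# From a machine that can reach the registry: does the image reference even exist?
docker pull registry.example.com/team/api:v1.4.2

# Is the pull secret present in the pod's namespace, and of the right type?
kubectl get secret regcred -n prod -o jsonpath='{.type}'
```

A typo in the tag and an expired registry credential produce nearly identical symptoms; the describe output distinguishes "not found" from "unauthorized".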

"CrashLoopBackOff"

The container starts, crashes, starts, crashes. Kubernetes waits (backs off) between retries. The fix is not to delete the pod. The fix is to read the logs. Use kubectl logs [pod-name] --previous to see the logs from the last crashed instance, which often contain the fatal error.
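In practice (pod and container names are placeholders):

```shell
# Logs from the instance that just crashed — this is where the fatal error usually lives
kubectl logs api-7d9f -n prod --previous

# Multi-container pods need an explicit container name
kubectl logs api-7d9f -n prod -c app --previous

# Watch the restart counter climb while you read
kubectl get pod api-7d9f -n prod -w
```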

"Evicted"

The pod was killed by the kubelet, usually due to resource pressure. Run kubectl describe node to confirm DiskPressure or MemoryPressure. The solution is often to clear disk space on the node (check /var/lib/kubelet) or delete other unused resources.
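A hedged sketch of that investigation, assuming a containerd-based node you can SSH into:

```shell
# Confirm the pressure condition on the node that did the evicting
kubectl describe node worker-2 | grep -E "MemoryPressure|DiskPressure"

# On the node itself: what is eating the disk? (paths vary by distro/runtime)
df -h /var/lib/kubelet /var/lib/containerd
sudo du -sh /var/lib/kubelet/pods/* 2>/dev/null | sort -rh | head -5

# Evicted pods linger as Failed objects; list them once you've read their events
kubectl get pods -n prod --field-selector=status.phase=Failed
```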

Step 3: The "Oh Sh*t" Fixes for Common Meltdowns

Here are concrete fixes for specific, high-severity failure patterns.

Scenario 1: The Control Plane is Unreachable

kubectl commands hang or time out. First, check if the API server pods are running on your control plane nodes. If you're on a managed service (EKS, GKE, AKS), check the provider's status page. If self-managed, SSH into a control plane node and restart the kube-apiserver container or systemd service. Have a backup of your /etc/kubernetes configs ready.
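On a kubeadm-style cluster (an assumption — adjust paths for your distribution), that inspection might look like:

```shell
# Static pod manifests drive the control plane components on kubeadm clusters
ls /etc/kubernetes/manifests/

# Is the apiserver container actually running? (containerd runtime assumed)
sudo crictl ps -a | grep kube-apiserver

# The kubelet restarts static pods automatically; kick it if it's wedged
sudo systemctl restart kubelet
sudo journalctl -u kubelet --since "10 min ago" | tail -n 30
```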

Scenario 2: A Node is NotReady and Won't Come Back

If a worker node is dead and unresponsive, you need to safely remove it and reschedule its workloads. DO NOT just delete the node object. First, cordon it: kubectl cordon [node-name]. Then, drain it: kubectl drain [node-name] --ignore-daemonsets --delete-emptydir-data. This safely evicts all pods. If the node is truly dead, the drain may hang; use --force and --timeout as a last resort.
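The full sequence, with worker-2 as a placeholder node name:

```shell
# Stop new pods landing on the sick node
kubectl cordon worker-2

# Evict everything gracefully; daemonsets stay, emptyDir data is discarded
kubectl drain worker-2 --ignore-daemonsets --delete-emptydir-data

# Last resort for a truly dead node: don't wait forever for graceful termination
kubectl drain worker-2 --ignore-daemonsets --delete-emptydir-data \
  --force --grace-period=0 --timeout=120s

# Only after the drain completes (or the node is confirmed gone):
kubectl delete node worker-2
```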

Scenario 3: etcd is Corrupted or Unhealthy

This is the nightmare scenario. etcd is Kubernetes' brain. If etcd health checks fail—etcdctl endpoint health on a self-managed cluster, or the API server's /readyz?verbose endpoint (note that kubectl get componentstatuses is deprecated, though it may still respond on older clusters)—act fast. If you have a recent snapshot (you do have backups, right?), you can attempt a restore. This is complex and cluster-specific. Your immediate step is to stop making changes and call in your most senior platform engineer.
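As a rough sketch only—endpoints and certificate paths below are kubeadm defaults and may not match your cluster—snapshot and restore look like this:

```shell
# Take a snapshot NOW if etcd still responds
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%F).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify the snapshot before you trust it
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-$(date +%F).db

# Restoring writes a fresh data dir; the etcd static pod must then be repointed at it
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-2024-01-15.db \
  --data-dir=/var/lib/etcd-restored
```

Newer etcd releases move the restore subcommand to etcdutl; check the version installed on your control plane before the incident, not during it.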

Step 4: What NOT to Do (The Panic Playbook)

In a crisis, avoid these reflexive but destructive actions.

  • Do NOT delete a namespace as a first step. This can cascade and delete critical system resources. Use it as a last resort.
  • Do NOT kubectl delete pod --all in a panic. You'll lose all log history and might trigger unexpected scaling events.
  • Do NOT start changing resource limits or deployments randomly. You'll introduce configuration drift and make the root cause analysis impossible.
  • Do NOT forget to communicate. Tell your team you're investigating. Silence during an outage causes more panic.

Step 5: Build Your Disaster Kit BEFORE Day Zero

The best recovery happens before the disaster. Create these now and store them in a known, accessible location (not in the cluster that's down).

5.1 Essential Scripts

Have shell scripts ready for common recovery tasks: a node-drain script, a script to collect diagnostic data (kubectl get all, describe nodes, events, logs), and a script to restore critical deployments from version-controlled YAML.

5.2 Configuration Backups

Regularly back up: 1) All your deployment YAMLs (hopefully in Git already), 2) Custom Resource Definitions (CRDs), 3) etcd snapshots (if self-managed), and 4) Important Secrets (using sealed-secrets or your cloud's secret manager).
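A minimal nightly-dump sketch (the script name, output paths, and resource list are illustrative; tailor them to what your cluster actually runs):

```shell
# Cron entry: run the backup script at 02:00 every night
# 0 2 * * * /usr/local/bin/k8s-backup.sh

# k8s-backup.sh — dump CRDs and per-namespace objects into a Git-tracked directory
kubectl get crds -o yaml > backups/crds.yaml
for ns in $(kubectl get ns --no-headers -o custom-columns=:metadata.name); do
  kubectl get deploy,svc,cm,ing -n "$ns" -o yaml > "backups/$ns.yaml"
done
```

This is a complement to, not a substitute for, keeping manifests in Git and taking etcd snapshots.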

5.3 The Runbook

Document a simple, step-by-step runbook for your team. Include: Who to call (cloud support, senior devs), how to access backups, and the exact triage checklist from the top of this article. A runbook used once pays for itself forever.

Pro Tips from the Trenches

  • Set up resource quotas and limit ranges. They prevent a single runaway deployment from consuming all cluster resources and causing a mass eviction.
  • Use PodDisruptionBudgets (PDBs). They stop you from accidentally draining a node and taking down all instances of a critical service.
  • Label your nodes meaningfully. Labels like node-type=high-memory or zone=us-east-1a make troubleshooting and scheduling much easier.
  • Practice. Run a game day. Intentionally break a non-production cluster and walk through your recovery process. You'll find gaps in your kit.
  • Centralize your logs and metrics. When kubectl is unavailable, you need an external source of truth like Loki/Prometheus or a commercial observability platform.
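To make the PDB tip concrete, here is a minimal example (names, namespace, and labels are hypothetical) that guarantees a drain leaves at least two api replicas running:

```shell
# A minimal PodDisruptionBudget: voluntary disruptions (like kubectl drain)
# may never take the matching pods below 2 available replicas
kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: prod
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api
EOF
```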

Conclusion: Embrace the Chaos

Kubernetes clusters are complex systems, and complex systems fail. The goal isn't to prevent all failures—that's impossible. The goal is to fail gracefully, recover quickly, and learn from each incident. Panic is the real enemy; a methodical process is your best defense.

Start today. Take the triage checklist from Step 1, paste it into your team's wiki, and build out your disaster kit. The next time your cluster decides to imitate a nuclear meltdown, you won't be frantically Googling. You'll be calmly following the plan, saving the day, and looking suspiciously competent while doing it.
