# The K8s Firefighter's Guide: Putting Out Production Fires Without Burning Down Your Career
When your Kubernetes cluster rebels in production, panic is not a strategy. This guide gives you a systematic approach to diagnosing, fixing, and documenting incidents without making things worse.
## When Your Cluster Decides to Go Rogue
It's 2 AM. Your phone is buzzing with alerts that sound like a nuclear launch sequence. Your Kubernetes cluster, which was peacefully running production traffic hours ago, has decided to stage a rebellion. Pods are crashing, services are returning 503s, and that fancy auto-scaling setup you bragged about in the last sprint review is now scaling to infinity like it's trying to reach the moon.
Your first instinct? `kubectl delete pods --all` and pray. Don't do that. You'll look like the person who tries to fix a leaking pipe with dynamite. This guide is for those who want to put out fires without burning down their career in the process.
## TL;DR
- Stop panicking and follow the 5-minute triage checklist below before touching anything
- Most "K8s emergencies" are actually misconfigured resources, network policies, or simple resource exhaustion
- When you do fix it, write a post-mortem that prevents the same fire next week
## The 5-Minute Triage: What to Check Before You Panic
When the alarms go off, your brain goes into lizard mode: fight, flight, or `kubectl delete`. Instead, run through this systematic check:
### Step 1: Are the Nodes Even Awake?
Run `kubectl get nodes` and look for `NotReady` status. If a node is down, check whether it's just one (maybe it's being rebooted) or several (oh dear). Pro tip: `kubectl describe node [node-name]` shows you why it's unhappy.
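Concretely, the node sweep might look like this (the node name `worker-3` is a placeholder; every command here is read-only, so it's safe mid-incident):

```shell
# Quick node health sweep
kubectl get nodes

# For a suspicious node, jump straight to its conditions
kubectl describe node worker-3 | sed -n '/Conditions:/,/Addresses:/p'

# One-liner: every node with its Ready condition status
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
```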
### Step 2: What's Actually Broken?
`kubectl get pods --all-namespaces | grep -v Running` - This shows you everything that's NOT running perfectly (the header line and any `Completed` pods will sneak through too). Focus on the `CrashLoopBackOff` and `Pending` pods first - they're the screamers in the room.
For any problematic pod: `kubectl describe pod [pod-name]` and scroll to the Events section. Kubernetes will literally tell you what's wrong 80% of the time. We just don't listen.
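If you're curious what that `grep -v Running` filter actually lets through, here it is against a mock capture of `kubectl get pods` output (fake data, no cluster needed - note the header line survives because it doesn't contain the word "Running"):

```shell
# Mock `kubectl get pods --all-namespaces` output, for illustration only
sample='NAMESPACE   NAME       READY   STATUS             RESTARTS
prod        api-1      1/1     Running            0
prod        api-2      0/1     CrashLoopBackOff   12
prod        worker-1   0/1     Pending            0'

# The triage filter: everything that is NOT Running
printf '%s\n' "$sample" | grep -v Running
```

You get the header plus the two broken pods - exactly the screamers you want to focus on.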
### Step 3: Check the Logs (But Not All of Them)
`kubectl logs [pod-name] --previous` if the pod keeps restarting. The `--previous` flag is your best friend - it shows you why the pod died last time. Without it, you're reading logs from the current instance, which might not have even finished starting.
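A sketch of the log-pulling sequence (the pod name `api-7d9f-x2k4j` and container name `app` are placeholders - substitute your own):

```shell
# Why did the pod die LAST time? (the useful logs)
kubectl logs api-7d9f-x2k4j --previous

# Multi-container pod? Name the container explicitly
kubectl logs api-7d9f-x2k4j -c app --previous

# Keep it focused: last 50 lines, with timestamps
kubectl logs api-7d9f-x2k4j --previous --tail=50 --timestamps
```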
## Common Cluster Fires and Their Actual Solutions
### Fire #1: "My Pods Are Stuck in Pending"
This usually means: "I asked for resources that don't exist." Check with `kubectl describe pod` - you'll see messages like "Insufficient cpu" or "Insufficient memory."
Real solution: Either reduce your resource requests, add more nodes, or check if there are pending node terminations in your cloud provider. Don't just keep creating pods hoping one will stick.
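A right-sizing sketch for the "reduce your requests" path - the numbers here are purely illustrative, so base yours on real usage from `kubectl top pod`:

```yaml
# Container resources inside a pod/deployment spec.
# Requests are what the scheduler must find room for:
# a 2-core request on a cluster of busy 2-core nodes = Pending forever.
resources:
  requests:
    cpu: 250m        # was 2000m - no node had 2 free cores
    memory: 256Mi
  limits:
    memory: 512Mi    # cap memory so one pod can't starve the node
```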
### Fire #2: "CrashLoopBackOff - The Musical"
Your pod starts, crashes, starts, crashes - it's like watching a toddler try to run. `kubectl logs --previous` will show you the application error. Common causes: wrong configuration, missing secrets, or the app can't connect to its database.
Real solution: Fix the application error. If it's a configuration issue, run `kubectl get configmap` and `kubectl get secret` to verify they exist, and check `kubectl describe pod` to confirm they're mounted correctly.
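For reference, this is the shape of the dependency that usually breaks - a container that won't start without its config and secret (the names `app-config` and `db-credentials` are placeholders):

```yaml
# Inside the container spec: if either reference is missing,
# the pod fails before your application code even runs.
envFrom:
  - configMapRef:
      name: app-config       # kubectl get configmap app-config
  - secretRef:
      name: db-credentials   # kubectl get secret db-credentials
```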
### Fire #3: "Services Returning 503s"
The service exists, but traffic isn't flowing. Run `kubectl get endpoints [service-name]`. If the endpoints are empty, your service selector doesn't match any pods. If endpoints exist but you still get 503s, check network policies with `kubectl get networkpolicy`.
Real solution: Fix your label selectors or network policies. Remember: services don't magically find pods - they match labels.
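A minimal sketch of the matching that has to line up (all names here are hypothetical):

```yaml
# Service: sends traffic to pods whose labels match the selector
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api            # must match the pod labels below - exactly
  ports:
    - port: 80
      targetPort: 8080  # the containerPort your app listens on
---
# Deployment: the pod template labels are what the selector matches
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api        # typo here = empty endpoints = 503s
    spec:
      containers:
        - name: api
          image: registry.example.com/api:1.2.3
          ports:
            - containerPort: 8080
```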
## How to Read Error Messages That Look Like Alphabet Soup
Kubernetes errors follow patterns. Here's your decoder ring:
- `ImagePullBackOff`: "I can't pull the container image you asked for." Check the image name, tag, and your registry permissions.
- `FailedScheduling`: "There's no room at the inn." Either no nodes match your nodeSelector/affinity rules, or there aren't enough resources.
- `FailedMount`: "I can't attach your storage." Check your PersistentVolumeClaims, StorageClasses, and cloud provider quotas.
- `ContainerCreating` for more than a minute: Usually means pulling a large image or waiting for storage. Check with `kubectl describe pod` for details.
## When to Admit Defeat and Escalate (Without Looking Incompetent)
You've been at it for 30 minutes. The sun is coming up. Here's how to escalate:
1. Document what you've tried: "Checked nodes (all Ready), pods (3 in CrashLoopBackOff), logs show DB connection timeout, verified DB is reachable from node."
2. State your hypothesis: "I think this is either a network policy blocking DB traffic or a credentials issue with the latest secret update."
3. Ask specific questions: "Has anyone modified network policies in the last hour? Was there a secret rotation?"
This makes you look methodical, not panicked. You're handing off context, not just a burning dumpster.
## The Post-Mortem Template That Actually Prevents Fires
After you fix it (congratulations!), write this down:
**Timeline:** [Detection → Escalation → Resolution]
**Root Cause:** [Not "K8s was down" - what SPECIFICALLY failed?]
**Detection Gap:** [Why did we find out from users instead of monitoring?]
**Action Items:** [Specific, assigned, with dates]
- [ ] Add alert for [specific metric that would have warned us]
- [ ] Update runbook with steps from this incident
- [ ] Fix [the actual broken thing] in code/config
The magic question: "What will we do differently so this never happens again?" Not "How do we fix it faster next time?"
## Pro Tips From Someone Who's Burned Things
1. Set up `kubectl get events --watch` in a separate terminal during incidents. New events appear in real time.
2. Use `kubectl get pods -o wide` to see which node each pod is on. Sometimes one bad node is taking down everything.
3. Before applying anything, add `--dry-run=client -o yaml` to see what you WOULD have created - it's a spell-check for your kubectl commands. (`kubectl delete` accepts `--dry-run` too, to preview what you'd remove.)
4. Label your resources like you'll be debugging at 3 AM (because you will). Meaningful labels beat clever ones every time.
5. Keep a "war room" notes file with every command you run and its output. This becomes your post-mortem draft.
## Conclusion: Don't Fight Fire With Fire
Kubernetes incidents feel like emergencies because everything is abstracted away until it breaks. But underneath the YAML and abstractions are the same old problems: resources, networking, and configuration.
The difference between a junior who panics and a senior who solves isn't magic knowledge - it's a systematic approach. Follow the triage checklist, understand what the errors actually mean, fix the root cause (not just the symptom), and document it so you only fight each fire once.
Now go update your runbook with that 5-minute checklist. Your future 2 AM self will thank you.