📋 Quick Steps
When your cluster is on fire, follow this 5-minute triage to identify the bleeding artery.
```shell
# 1. Check cluster vitals
kubectl get nodes
kubectl get --raw='/readyz'   # `kubectl get componentstatuses` is deprecated since v1.19

# 2. Check for obvious hemorrhaging
kubectl get pods --all-namespaces | grep -v Running
kubectl get events --sort-by='.lastTimestamp' | tail -20

# 3. Check network pulse
kubectl get svc --all-namespaces
kubectl get endpoints --all-namespaces

# 4. Check storage vitals
kubectl get pvc --all-namespaces
kubectl get pv
```
Your Cluster Is On Fire, But You're Reading Stack Overflow
You know the feeling. The Slack channel is exploding with "API is down" messages. Your monitoring dashboard looks like a Christmas tree of red alerts. Your heart rate spikes as you realize your Kubernetes cluster is actively trying to ruin your day—and possibly your career.
Instead of systematic troubleshooting, you're frantically Googling error messages and applying random kubectl commands like a medieval barber applying leeches. You're treating symptoms, not the disease. This is how 4-hour outages happen.
🚨 TL;DR: The 3 Things You Need Right Now
- Stop the bleeding first: Identify the single biggest failure point before trying to fix everything
- Follow the triage flowchart: Nodes → Control Plane → Pods → Networking → Storage (in that order)
- Document as you go: Your post-mortem starts now, not after the fire is out
Step 1: The 5-Minute Triage Flowchart (Stop the Bleeding)
When everything is breaking at once, you need to identify the primary failure. Follow this exact order—deviating wastes precious minutes.
1.1 Check Node Status (The Foundation)
If your nodes are down, nothing else matters. Run this immediately:
```shell
kubectl get nodes
# Look for: NotReady or SchedulingDisabled in the STATUS column
kubectl describe node [node-name]
# Look for: MemoryPressure, DiskPressure, or PIDPressure conditions set to True
```
Instant fix for NotReady nodes: SSH into the node and check kubelet: systemctl status kubelet. Restart if dead: systemctl restart kubelet.
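When many nodes are listed, eyeballing the STATUS column wastes time. A minimal sketch of filtering that output down to only the unhealthy nodes, demonstrated here on a fabricated sample (node names and versions are made up); against a live cluster you would pipe `kubectl get nodes` into the same awk filter:

```shell
# Hypothetical sample of `kubectl get nodes` output
sample='NAME     STATUS                     ROLES    AGE   VERSION
node-1   Ready                      worker   90d   v1.28.3
node-2   NotReady                   worker   90d   v1.28.3
node-3   Ready,SchedulingDisabled   worker   90d   v1.28.3'

# Keep only nodes whose STATUS is not exactly "Ready" (skip the header row)
unhealthy=$(echo "$sample" | awk 'NR>1 && $2 != "Ready" {print $1}')
echo "$unhealthy"
```

Live equivalent: `kubectl get nodes | awk 'NR>1 && $2 != "Ready"'`.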
1.2 Check Control Plane (The Brain)
If your API server is down, you can't even run kubectl commands. Check the health endpoints directly (kubectl get componentstatuses is deprecated since v1.19):
```shell
kubectl get --raw='/readyz?verbose'
# Quick alternative if the API is responding slowly:
kubectl get --raw='/livez'
```
Note: /readyz?verbose returns plain text, not JSON, so there's nothing to pipe into jq.
Pro move: If the API server is unresponsive, inspect the static pods directly on the control-plane node: crictl ps | grep kube-apiserver (or docker ps on older Docker-based setups).
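The verbose readiness output is a wall of check results; the failing ones are prefixed with `[-]`. A small sketch of isolating just those, run here against a fabricated excerpt of `/readyz?verbose` output (the failing check name is invented for illustration):

```shell
# Hypothetical /readyz?verbose output; healthy checks are [+], failing checks are [-]
readyz='[+]ping ok
[+]etcd ok
[-]poststarthook/rbac/bootstrap-roles failed: reason withheld
readyz check failed'

# Surface only the failing checks
failing=$(echo "$readyz" | grep '^\[-\]')
echo "$failing"
```

Live equivalent: `kubectl get --raw='/readyz?verbose' | grep '^\[-\]'`.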
Step 2: Common Failure Patterns and Their Instant Fixes
2.1 Pods: The "CrashLoopBackOff" Epidemic
When pods are stuck in CrashLoopBackOff, they're telling you something. Don't just delete and recreate—listen.
```shell
kubectl describe pod [pod-name] -n [namespace]
kubectl logs [pod-name] -n [namespace] --previous
```
Common causes & fixes:
- Image pull errors: check registry credentials: kubectl get secrets -n [namespace]
- Resource limits: pod getting OOMKilled? Check usage: kubectl top pod -n [namespace]
- ConfigMap/Secret missing: kubectl get configmap,secret -n [namespace]
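The fastest tell for an OOMKilled pod is the Last State block in the describe output. As a sketch, here's how you might pull the termination reason and restart count out of that block; the excerpt below is fabricated, but the field names match the describe layout:

```shell
# Hypothetical excerpt of `kubectl describe pod` output for a crashing container
describe='    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
    Restart Count:  14'

# Pull the termination reason and the restart count
reason=$(echo "$describe" | awk '/Reason:/ {print $2}')
restarts=$(echo "$describe" | awk '/Restart Count:/ {print $3}')
echo "$reason ($restarts restarts)"
```

Exit code 137 (128 + SIGKILL) alongside Reason: OOMKilled confirms the kernel killed the container for exceeding its memory limit.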
2.2 Networking: The Silent Killer
Services not talking to each other? Start with the basics before diving into CNI hell.
```shell
kubectl get endpoints --all-namespaces

# Quick network test from inside the cluster
kubectl run network-test --image=busybox --rm -it --restart=Never -- sh
# Then run: nslookup [service-name] && wget -qO- [service-name]:[port]
```
Instant diagnostic: If Services have endpoints but traffic isn't flowing, check NetworkPolicies: kubectl get networkpolicies --all-namespaces.
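A Service with `<none>` in the ENDPOINTS column means its selector matches zero ready pods, which is the single most common "service is up but nothing answers" cause. A sketch of flagging those automatically, demonstrated on a fabricated sample (names and IPs are made up):

```shell
# Hypothetical sample of `kubectl get endpoints --all-namespaces` output
eps='NAMESPACE   NAME   ENDPOINTS                     AGE
default     api    10.0.1.5:8080,10.0.1.6:8080   30d
default     web    <none>                        30d'

# Flag services whose selector matches zero ready pods
empty=$(echo "$eps" | awk 'NR>1 && $3 == "<none>" {print $1 "/" $2}')
echo "$empty"
```

For each hit, compare the Service's spec.selector against the pods' labels, and check the pods' readiness probes: unready pods are excluded from endpoints.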
2.3 Storage: The Data Heart Attack
PVCs stuck in Pending state will kill your stateful applications. Don't wait for automatic recovery.
```shell
kubectl describe pvc [pvc-name] -n [namespace]
# Check storage class availability
kubectl get storageclass
# Check persistent volumes
kubectl get pv
```
Emergency workaround: For non-critical data, recreate the claim with a different StorageClass: delete the Pending PVC and re-apply it with a new storageClassName. (The field is immutable on an existing PVC, so kubectl edit won't work here.)
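Before touching anything, get a list of exactly which claims are stuck. A sketch of filtering `kubectl get pvc` output down to Pending claims, run against a fabricated sample (namespaces and claim names are invented):

```shell
# Hypothetical sample of `kubectl get pvc --all-namespaces` output
pvcs='NAMESPACE   NAME        STATUS    VOLUME   CAPACITY   STORAGECLASS   AGE
db          data-pg-0   Bound     pv-001   100Gi      standard       30d
db          data-pg-1   Pending                       fast-ssd       5m'

# List PVCs stuck in Pending; these block scheduling for every pod that mounts them
pending=$(echo "$pvcs" | awk 'NR>1 && $3 == "Pending" {print $1 "/" $2}')
echo "$pending"
```

For each Pending claim, `kubectl describe pvc` shows the provisioner's error events (missing StorageClass, quota exhausted, no matching PV, and so on).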
Step 3: What NOT to Do During a K8s Meltdown
Panic makes smart people do stupid things. Avoid these career-limiting moves.
3.1 The "curl | bash" Trap
Never, ever run random debugging scripts from the internet directly on production. That "magic fix" you found on a GitHub issue could:
- Delete all pods in all namespaces
- Corrupt etcd data
- Open security holes wider than your outage
Better approach: Test any new command in a non-production namespace first, or better yet, in a test cluster.
3.2 The "Delete Everything" Gambit
kubectl delete pods --all might seem tempting when many pods are failing. But you're just hiding symptoms. The new pods will fail the same way, and now you've lost any logs from the failing pods.
Correct approach: Delete ONE problematic pod, watch it recreate, and see if it fails immediately. Then investigate.
3.3 The "Let Me Just Edit This Deployment" Mistake
Directly editing live resources with kubectl edit during an outage creates configuration drift. Your GitOps tool will fight you later, or you'll forget what you changed.
Document as you fix: Make changes in your actual manifests (even if just locally), then apply. This creates a paper trail.
Step 4: The Post-Mortem That Actually Makes You Look Competent
The outage is fixed. Now comes the part where you turn a disaster into a promotion opportunity.
```markdown
## Timeline (UTC)
- 14:32: First alert - API latency spike
- 14:35: Identified node NotReady
- 14:38: Found kubelet crash due to disk pressure
- 14:45: Cleared /var/lib/kubelet space
- 14:52: Node back online, pods rescheduling

## Root Cause
Node disk filled due to:
1. Unrotated container logs (50GB)
2. Unused container images (30GB)

## Immediate Actions
- [x] Implement log rotation daemonset
- [x] Add disk usage alerts at 70%
- [ ] Schedule image GC policy review
```
Pro Tips From Someone Who's Been Burned
💡 Keep a kubectl cheat sheet open: Not in a browser tab—print it out. When DNS fails, you can't Google.
💡 Set up read-only emergency access: Create a service account with get/list/watch permissions only. Let people help diagnose without breaking things.
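A minimal sketch of what that read-only access could look like as a ClusterRole; the name, API groups, and subject are illustrative and should be adapted to your cluster:

```yaml
# Illustrative read-only role for incident responders; names are made up
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: emergency-readonly
rules:
  - apiGroups: ["", "apps", "batch", "networking.k8s.io"]
    resources: ["*"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: emergency-readonly-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: emergency-readonly
subjects:
  - kind: Group
    name: incident-responders   # hypothetical group; map to your IdP
    apiGroup: rbac.authorization.k8s.io
```

No delete, no patch, no exec: helpers can describe, list, and tail to their heart's content without widening the blast radius.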
💡 Practice failure: Once a quarter, break your staging cluster on purpose. Time how long it takes to fix. This is more valuable than any certification.
💡 The 10-minute rule: If you haven't identified the root cause in 10 minutes, escalate. Pride costs companies millions.
💡 Logs are gold: Before restarting anything, capture logs. Use kubectl logs --previous for crashed containers.
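The tips above can be turned into a pre-restart ritual: given the failing pods, generate the exact capture commands to run before anything gets deleted. This sketch only builds the command strings from a hypothetical namespace/pod list (names are invented), so nothing is executed against a cluster:

```shell
# Sketch: given failing pods as namespace/pod, emit the log-capture commands
# to run BEFORE restarting anything; pod names below are hypothetical
failing='payments/api-7f9c
payments/worker-5d2b'

cmds=$(echo "$failing" | while IFS=/ read -r ns pod; do
  echo "kubectl logs $pod -n $ns --previous > /tmp/${ns}_${pod}.log"
done)
echo "$cmds"
```

Review the generated commands, then run them (or pipe into `sh` once you trust the list). The log files become the raw material for your post-mortem timeline.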
Conclusion: From Firefighter to Fire Marshal
Kubernetes outages aren't about avoiding failure—that's impossible. They're about failing gracefully, recovering quickly, and learning permanently. The difference between a junior and senior engineer isn't preventing fires; it's knowing which fire to put out first.
Save this guide. Print the triage steps. Next time your cluster starts bleeding, you won't reach for Stack Overflow—you'll reach for the tourniquet. And then you'll write a post-mortem so good, people will remember how you handled the outage, not that it happened.
Your next step: Take 30 minutes today to run through the quick-value box commands on your cluster. Know what "normal" looks like, so you can recognize abnormal before it becomes catastrophic.
Quick Summary
- What: Developers panic when Kubernetes clusters fail, wasting hours on trial-and-error debugging instead of systematic troubleshooting