🔍 Quick Steps
Stop panicking and start investigating with this systematic 5-minute triage sequence.
When Your Cluster Is Burning and You're Holding a Water Pistol
It's 2 AM. Your phone is vibrating off the nightstand. The Kubernetes cluster is on fire, and you're expected to be the firefighter. But instead of a systematic investigation, you're running random kubectl commands like a developer playing whack-a-mole with containers. You check logs in the wrong pods, miss critical events, and three hours later you're still staring at a 'CrashLoopBackOff' wondering if it's a networking issue or you just offended the kubelet.
Welcome to Kubernetes incident response, where everyone has a theory, nobody has evidence, and the post-mortem will inevitably conclude with "we'll add more alerts" (spoiler: you won't).
🚨 TL;DR: What You'll Actually Learn
- Systematic troubleshooting flowchart that replaces random command execution with actual detective work
- How to read Kubernetes events like they're crime scene photos ("ImagePullBackOff" means exactly what you think it means)
- Quick-win observability setup that works during incidents, not just during planning meetings
- Common failure patterns and where to actually look (hint: it's never where you first check)
Step 1: Stop Panicking, Start Observing (The 5-Minute Triage)
When everything's broken, your first instinct is wrong. Don't dive into pod logs. Start with the cluster's vital signs. Run the commands from our Quick-Value Box in order. They give you: cluster health status, recent system events, pod configuration details, application logs, and network connectivity.
Common Mistake: Jumping straight to kubectl logs without checking if the pod is even scheduled. If the pod is Pending, there are no logs to check. You just wasted 10 minutes.
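The triage sequence can be sketched as a single helper function. This is an illustrative example, not the article's Quick-Value Box verbatim; the namespace and pod arguments are placeholders you'd adapt to your own workload.

```shell
# Hypothetical triage helper: runs the five checks in order.
# Usage: ktriage <namespace> [pod-name]
ktriage() {
  local ns="${1:-default}" pod="${2:-}"
  echo "== 1. Cluster vital signs =="
  kubectl get nodes -o wide
  echo "== 2. Recent events, newest last =="
  kubectl get events -n "$ns" --sort-by='.lastTimestamp' | tail -20
  echo "== 3. Pod status and scheduling =="
  kubectl get pods -n "$ns" -o wide
  if [ -n "$pod" ]; then
    echo "== 4. Pod configuration details =="
    kubectl describe pod "$pod" -n "$ns"
    echo "== 5. Application logs (only useful once the pod is scheduled) =="
    kubectl logs "$pod" -n "$ns" --tail=50
  fi
}
```

Note the ordering: logs come last, and only run when you name a pod, which enforces the "check scheduling before logs" rule above.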
Step 2: Read Events Like a Detective, Not a Developer
Kubernetes events are your crime scene log. They tell you what happened, when, and who (which controller) was involved. But developers treat them like compiler warnings: something to ignore until everything's on fire.
Here's what you're probably missing: Events have a lastTimestamp and count. Sort by timestamp to see what happened recently. A high count means this isn't a one-time issue. Example: Back-off pulling image "myapp:latest" with count=47 means your container registry is down or you have authentication issues.
kubectl get events --all-namespaces --sort-by='.lastTimestamp' \
-o custom-columns='TIMESTAMP:.lastTimestamp,COUNT:.count,TYPE:.type,REASON:.reason,OBJECT:.involvedObject.name,MESSAGE:.message' \
| grep -v "Normal" | tail -30
This shows only warning/error events, sorted by time, with the most useful columns. The grep -v "Normal" removes the noise of normal scheduling events.
Step 3: The Systematic Flowchart (Not Random Commands)
Stop running commands based on what your coworker shouted in Slack. Follow this decision tree instead:
1. Pod status = Pending? → Check kubectl describe node for resource issues, then check PersistentVolumeClaims.
2. Pod status = CrashLoopBackOff? → Check kubectl logs --previous (the logs from the last crashed instance), then check container resource limits.
3. Pod status = Running but service broken? → Check readiness/liveness probes, then service endpoints with kubectl get endpoints.
4. Multiple pods affected? → Check node status with kubectl get nodes -o wide, then network policies.
This approach moves you from symptom to root cause in minutes, not hours.
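The decision tree above is simple enough to encode as a lookup. The sketch below is purely illustrative: given a pod status string, it prints the next command to run, so the tree lives in your shell instead of your memory.

```shell
# Hypothetical decision-tree helper: maps a pod status to the next step.
# Usage: ktriage_next <pod-status>
ktriage_next() {
  case "$1" in
    Pending)
      echo "kubectl describe node <node>; then kubectl get pvc" ;;
    CrashLoopBackOff)
      echo "kubectl logs <pod> --previous; then check resource limits" ;;
    Running)
      echo "check readiness/liveness probes; then kubectl get endpoints <service>" ;;
    *)
      echo "kubectl get nodes -o wide; then check network policies" ;;
  esac
}
```

For example, `ktriage_next Pending` tells you to inspect nodes and PVCs before you waste time on logs that don't exist yet.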
Step 4: Quick-Win Observability That Works During Incidents
Your fancy monitoring dashboard with 47 graphs is useless when you need answers now. Set up these three commands as aliases before your next incident:
# Pod health at a glance
alias khealth="kubectl get pods --all-namespaces -o wide | awk 'NR>1 {split(\$3,r,\"/\"); if (\$4!=\"Running\" || r[1]!=r[2]) print}'"
# Recent errors across all namespaces
alias kerrors="kubectl get events --all-namespaces --sort-by='.lastTimestamp' | grep -E '(Error|Failed|BackOff|Invalid)' | tail -20"
# Service endpoint verification
# Services whose endpoints list is empty
alias kendpoints="kubectl get endpoints --all-namespaces | grep '<none>'"
These give you immediate visibility without waiting for Prometheus to scrape or Grafana to load. They work when your monitoring stack is part of the problem.
Step 5: Common Failure Patterns and Where to Actually Look
Pattern 1: "It was working five minutes ago!"
Where to look: Recent deployments (kubectl rollout history), configmap changes (kubectl describe configmap), or image updates. Someone probably pushed a "small change" without testing.
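A hedged sketch of that "what changed recently?" sweep, bundled into one helper. The namespace and deployment names are placeholders; the custom-columns expressions are standard kubectl output formatting.

```shell
# Hypothetical "what changed?" helper for Pattern 1.
# Usage: kwhatchanged <namespace> <deployment>
kwhatchanged() {
  local ns="${1:-default}" deploy="$2"
  echo "== Rollout history =="
  kubectl rollout history "deployment/$deploy" -n "$ns"
  echo "== ConfigMap creation times (recent ones are prime suspects) =="
  kubectl get configmaps -n "$ns" \
    -o custom-columns='NAME:.metadata.name,CREATED:.metadata.creationTimestamp'
  echo "== Images currently running =="
  kubectl get pods -n "$ns" \
    -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[*].image'
}
```

Comparing the running images against the rollout history usually surfaces the "small change" within a minute.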
Pattern 2: "Some pods work, others don't"
Where to look: Node-specific issues. Check kubectl describe node <node-name> for disk pressure, memory pressure, or network unavailability. It's almost always a node problem, not your application.
Pattern 3: "The service returns 503 but pods are running"
Where to look: Service endpoints (kubectl get endpoints <service-name>). If endpoints are empty, check pod labels match service selectors. If endpoints exist, check network policies blocking traffic.
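The selector-vs-labels comparison for Pattern 3 can be scripted as one side-by-side check. This is an illustrative sketch; the service and namespace names are placeholders, and the jsonpath expression is standard kubectl output formatting.

```shell
# Hypothetical Pattern 3 helper: show the Service selector next to
# pod labels and the current endpoints, so mismatches jump out.
# Usage: kselectorcheck <service> [namespace]
kselectorcheck() {
  local svc="$1" ns="${2:-default}"
  echo "Service selector:"
  kubectl get service "$svc" -n "$ns" -o jsonpath='{.spec.selector}'; echo
  echo "Pods and their labels:"
  kubectl get pods -n "$ns" --show-labels
  echo "Current endpoints:"
  kubectl get endpoints "$svc" -n "$ns"
}
```

If the selector's key/value pairs don't appear in any pod's label column, you've found your empty-endpoints root cause.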
Pro Tips From Someone Who's Been Burned
💡 Tip 1: Always check --previous logs
When a pod crashes, the current container has no logs. Use kubectl logs <pod> --previous to see why it died. This is the single most overlooked command in Kubernetes debugging.
💡 Tip 2: Use -o yaml or -o json for deep inspection
kubectl get pod <name> -o yaml | grep -A 10 -B 10 "error" shows configuration context around errors. The default describe output hides important details.
🕵️ Tip 3: Reproduce in a debug container
kubectl debug -it <pod> --image=busybox -- sh drops you into a troubleshooting container with network access. Test DNS, connectivity, and file access from the pod's perspective.
📝 Tip 4: Document as you go
Open a text file and paste every command you run and its output. When you find the root cause, you've already written 80% of your post-mortem. Your future self will thank you at 3 AM.
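One way to make that documentation automatic is a thin wrapper that appends every command and its output to an incident log. A minimal sketch, assuming a bash-compatible shell; the log filename is a placeholder convention:

```shell
# Hypothetical incident logger: run commands through klog and each one
# is timestamped and appended to the incident log along with its output.
INCIDENT_LOG="${INCIDENT_LOG:-incident-$(date +%Y%m%d-%H%M).log}"
klog() {
  {
    echo "### $(date -u +%H:%M:%SZ) \$ $*"   # record what was run, and when
    "$@" 2>&1                                 # record what it printed
    echo
  } | tee -a "$INCIDENT_LOG"
}
```

Usage: `klog kubectl get pods -n prod`. By resolution time, the log file is your post-mortem's investigation section, already written.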
The Post-Mortem That Doesn't Say "We'll Add More Alerts"
Your incident is resolved. Now comes the worst part: the post-mortem. Skip the blame game and use this template:
1. Timeline: Not just when things broke, but when you noticed they broke (there's always a gap).
2. Detection gap: Why your monitoring didn't alert sooner. Hint: It's usually because you're alerting on symptoms, not causes.
3. Investigation steps: The actual commands that worked (paste from your documentation).
4. Root cause: Not "Kubernetes issue" but specifically which component failed and why.
5. Single actionable fix: One concrete change, not "improve monitoring." Example: "Add readiness probe to prevent traffic during database migrations."
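The readiness-probe example can be applied as a single patch. This is a sketch only: the deployment name, health-check path, and port are placeholders for your own workload, and the patch targets the first container in the pod spec.

```shell
# Hypothetical "single actionable fix": add a readiness probe so the
# Service withholds traffic until the pod reports healthy.
# Usage: add_readiness_probe <deployment>
add_readiness_probe() {
  local deploy="$1"
  kubectl patch "deployment/$deploy" --type='json' -p='[{
    "op": "add",
    "path": "/spec/template/spec/containers/0/readinessProbe",
    "value": {
      "httpGet": {"path": "/healthz", "port": 8080},
      "initialDelaySeconds": 5,
      "periodSeconds": 10
    }
  }]'
}
```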
Conclusion: From Firefighter to Forensic Expert
Kubernetes incidents will happen. Your choice is whether you spend hours running random commands or minutes following evidence. The difference isn't more tools or alerts; it's a systematic approach that treats failures as puzzles to solve, not fires to panic about.
Next time your cluster is on fire, don't reach for the water pistol. Reach for the evidence. Start with the 5-minute triage, follow the events, and remember: the answer is always in the logs you haven't checked yet. Now go update your aliases before the next page comes in.
Quick Summary
- What: When Kubernetes clusters fail, developers waste hours running random kubectl commands, checking logs in wrong pods, and missing critical evidence while services are down.