Kubernetes Incident Forensics: From 'kubectl get confused' to CSI: Container Scene Investigation

Stop wasting hours on random kubectl commands during Kubernetes outages. This forensic guide provides a systematic troubleshooting approach, critical investigation commands, and actionable patterns to actually find what broke when everything's failing.

When Your Cluster Is Burning and You're Holding a Water Pistol

It's 2 AM. Your phone is vibrating off the nightstand. The Kubernetes cluster is on fire, and you're expected to be the firefighter. But instead of a systematic investigation, you're running random kubectl commands like a developer playing whack-a-mole with containers. You check logs in the wrong pods, miss critical events, and three hours later you're still staring at a 'CrashLoopBackOff' wondering if it's a networking issue or you just offended the kubelet.

Welcome to Kubernetes incident response, where everyone has a theory, nobody has evidence, and the post-mortem will inevitably conclude with "we'll add more alerts" (spoiler: you won't).

Step 3: The Systematic Flowchart (Not Random Commands)

Stop running commands based on what your coworker shouted in Slack. Follow this decision tree instead:

1. Pod status = Pending? → Check kubectl describe node for resource issues, then check PersistentVolumeClaims.

2. Pod status = CrashLoopBackOff? → Check kubectl logs --previous (the logs from the last crashed instance), then check container resource limits.

3. Pod status = Running but service broken? → Check readiness/liveness probes, then service endpoints with kubectl get endpoints.

4. Multiple pods affected? → Check node status with kubectl get nodes -o wide, then network policies.

This approach moves you from symptom to root cause in minutes, not hours.
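The decision tree above can be kept as a small shell helper so the next command is one lookup away. This is only a sketch: the function name next_step and the <pod> placeholder are invented for illustration, and the fallback branch for unknown statuses is an assumption rather than part of the list above. The helper prints suggestions and never calls kubectl itself.

```shell
#!/bin/sh
# next_step STATUS: print the first commands to try for a given pod status.
# Hypothetical helper; it only echoes suggestions taken from the decision tree.
next_step() {
  case "$1" in
    Pending)
      echo "kubectl describe node"
      echo "kubectl get pvc --all-namespaces"
      ;;
    CrashLoopBackOff)
      echo "kubectl logs <pod> --previous"
      echo "kubectl describe pod <pod>   # resource limits and last state"
      ;;
    Running)
      echo "kubectl describe pod <pod>   # readiness/liveness probes"
      echo "kubectl get endpoints"
      ;;
    Multiple)
      echo "kubectl get nodes -o wide"
      echo "kubectl get networkpolicies --all-namespaces"
      ;;
    *)
      # Assumption: for any other status, start from the pod's own events.
      echo "kubectl describe pod <pod>"
      ;;
  esac
}
```

Calling next_step CrashLoopBackOff prints the two commands from item 2, ready to paste with the real pod name filled in.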

Step 4: Quick-Win Observability That Works During Incidents

Your fancy monitoring dashboard with 47 graphs is useless when you need answers now. Set up these three commands as aliases before your next incident:

# Pod health at a glance
alias khealth="kubectl get pods --all-namespaces -o wide | awk 'NR==1 {print; next} {split(\$3, r, \"/\")} \$4!=\"Running\" || r[1]!=r[2]'"

# Recent errors across all namespaces
alias kerrors="kubectl get events --all-namespaces --sort-by='.lastTimestamp' | grep -E '(Error|Failed|BackOff|Invalid)' | tail -20"

# Service endpoint verification: hide services with no endpoints (drop -v to list only the empty ones)
alias kendpoints="kubectl get endpoints --all-namespaces | grep -v '<none>'"

These give you immediate visibility without waiting for Prometheus to scrape or Grafana to load. As long as the API server responds, they work even when your monitoring stack is part of the problem.
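A pod-health filter is easiest to trust after trying it on canned output. The sketch below assumes the column layout of kubectl get pods --all-namespaces (NAMESPACE, NAME, READY, STATUS, ...) and uses a made-up function name, unhealthy_pods. It keeps the header plus any pod that is not Running or whose ready count differs from its container count.

```shell
#!/bin/sh
# unhealthy_pods: read `kubectl get pods --all-namespaces` output on stdin and
# keep the header plus pods that are not Running or not fully ready.
unhealthy_pods() {
  awk 'NR == 1 { print; next }
       { split($3, ready, "/") }
       $4 != "Running" || ready[1] != ready[2]'
}

# Sample data invented for illustration, not output from a real cluster.
sample='NAMESPACE NAME READY STATUS RESTARTS AGE
default web-1 1/1 Running 0 2d
default web-2 0/1 CrashLoopBackOff 12 2d
kube-system dns-1 1/2 Running 3 9d'

# Prints the header, web-2, and dns-1; web-1 is healthy and is dropped.
printf '%s\n' "$sample" | unhealthy_pods
```

Against a live cluster the same function would sit at the end of the pipe: kubectl get pods --all-namespaces | unhealthy_pods.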

Step 5: Common Failure Patterns and Where to Actually Look

Pattern 1: "It was working five minutes ago!"
Where to look: Recent deployments (kubectl rollout history), configmap changes (kubectl describe configmap), or image updates. Someone probably pushed a "small change" without testing.
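One way to check for a recent change, sketched as a shell function. The name recent_changes and its two arguments are placeholders, and the workload is assumed to be a Deployment; the kubectl subcommands and flags themselves are standard.

```shell
#!/bin/sh
# recent_changes NAMESPACE DEPLOYMENT
# Assumes the workload is a Deployment; adjust the resource kind otherwise.
recent_changes() {
  ns="$1"; deploy="$2"
  # Revisions recorded for the deployment.
  kubectl rollout history "deployment/$deploy" -n "$ns"
  # ReplicaSets ordered by creation time: a fresh one at the bottom suggests a recent rollout.
  kubectl get replicasets -n "$ns" --sort-by=.metadata.creationTimestamp
  # ConfigMaps ordered by creation time (in-place edits do not change this timestamp).
  kubectl get configmaps -n "$ns" --sort-by=.metadata.creationTimestamp
}
```

Usage would look like recent_changes production checkout, where both names are stand-ins for your own namespace and deployment.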

Pattern 2: "Some pods work, others don't"
Where to look: Node-specific issues. Check kubectl describe node <node-name> for disk pressure, memory pressure, or network unavailability. When the failing pods cluster on the same nodes, suspect the node before the application.
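To test the node hypothesis, count unhealthy pods per node. A rough sketch: the function name failures_by_node is made up, and its input is assumed to be two whitespace-separated columns (node, phase), as the custom-columns query in the trailing comment would produce.

```shell
#!/bin/sh
# failures_by_node: read "NODE PHASE" lines on stdin and count
# non-Running pods per node, highest count first.
failures_by_node() {
  awk '$2 != "Running" { count[$1]++ }
       END { for (node in count) print count[node], node }' | sort -rn
}

# Sample data invented for illustration.
printf '%s\n' \
  'node-a Running' \
  'node-b Pending' \
  'node-b Failed' \
  'node-c Running' | failures_by_node

# Against a live cluster (not run here):
# kubectl get pods --all-namespaces --no-headers \
#   -o custom-columns=NODE:.spec.nodeName,PHASE:.status.phase | failures_by_node
```

Phase is a coarse signal: a pod in CrashLoopBackOff still reports phase Running, so treat the count as a first pass rather than a verdict.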

Pattern 3: "The service returns 503 but pods are running"
Where to look: Service endpoints (kubectl get endpoints <service-name>). If endpoints are empty, check pod labels match service selectors. If endpoints exist, check network policies blocking traffic.
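The label check takes three commands. A minimal sketch with a hypothetical function name and placeholder arguments; it prints the service's selector next to the labels on the pods so a mismatch is visible by eye.

```shell
#!/bin/sh
# check_service NAMESPACE SERVICE
check_service() {
  ns="$1"; svc="$2"
  # An ENDPOINTS column of <none> means no ready pod matches the selector.
  kubectl get endpoints "$svc" -n "$ns"
  # The selector the service uses to pick pods.
  kubectl get service "$svc" -n "$ns" -o jsonpath='{.spec.selector}'
  echo
  # Labels actually present on the pods in the namespace.
  kubectl get pods -n "$ns" --show-labels
}
```

If the selector asks for a label that no pod carries, fix the label or the selector before looking at network policies.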

Pro Tips From Someone Who's Been Burned

🔍 Tip 1: Always check --previous logs
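When a container crashes and restarts, kubectl logs <pod> shows only the new instance, which often hasn't logged the failure yet. Use kubectl logs <pod> --previous to see the output of the instance that died. This is the single most overlooked command in Kubernetes debugging.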
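Alongside the previous logs, the pod status records why the last container instance ended. A sketch with a hypothetical wrapper name, why_crashed; the jsonpath reads the lastState.terminated fields of each container status, which are empty if the container has never restarted.

```shell
#!/bin/sh
# why_crashed NAMESPACE POD
why_crashed() {
  ns="$1"; pod="$2"
  # Termination reason and exit code of the previous instance (for example OOMKilled, exit 137).
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.lastState.terminated.reason}{" exit="}{.lastState.terminated.exitCode}{"\n"}{end}'
  # Last lines logged by the previous instance.
  kubectl logs "$pod" -n "$ns" --previous --tail=50
}
```

For a pod with several containers, add -c <container> to the logs command to pick the one that crashed.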

📊 Tip 2: Use -o yaml or -o json for deep inspection
kubectl get pod <name> -o yaml | grep -i -A 10 -B 10 "error" shows the surrounding configuration and status for each match. The describe output is a summary and leaves out fields that the full object includes.

🕵️ Tip 3: Reproduce in a debug container
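kubectl debug -it <pod> --image=busybox -- sh attaches an ephemeral troubleshooting container that shares the pod's network namespace. Test DNS and connectivity from the pod's perspective. To inspect a running container's processes and files as well, add --target=<container>, which shares that container's process namespace (its filesystem is then reachable under /proc/<pid>/root).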

📝 Tip 4: Document as you go
Open a text file and paste every command you run and its output. When you find the root cause, you've already written 80% of your post-mortem. Your future self will thank you at 3 AM.
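Pasting by hand is easy to forget under pressure, so a small wrapper can do it for you. This is a sketch, not a standard tool: run_logged and the INCIDENT_LOG variable are names invented here. It appends a UTC timestamp, the command, and its combined output to a file while still showing the output on screen.

```shell
#!/bin/sh
# File that collects the incident notes; override by setting INCIDENT_LOG.
INCIDENT_LOG="${INCIDENT_LOG:-incident-notes.txt}"

# run_logged COMMAND [ARGS...]
# Note: the return status is tee's, not the wrapped command's.
run_logged() {
  {
    printf '\n[%s] $ %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*"
    "$@" 2>&1
  } | tee -a "$INCIDENT_LOG"
}
```

During an incident you would prefix each command, for example run_logged kubectl get events --all-namespaces, and the notes file becomes the raw material for the investigation section of the post-mortem.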

The Post-Mortem That Doesn't Say "We'll Add More Alerts"

Your incident is resolved. Now comes the worst part: the post-mortem. Skip the blame game and use this template:

1. Timeline: Not just when things broke, but when you noticed they broke (there's always a gap).
2. Detection gap: Why your monitoring didn't alert sooner. Hint: It's usually because you're alerting on symptoms, not causes.
3. Investigation steps: The actual commands that worked (paste from your documentation).
4. Root cause: Not "Kubernetes issue" but specifically which component failed and why.
5. Single actionable fix: One concrete change, not "improve monitoring." Example: "Add readiness probe to prevent traffic during database migrations."

Conclusion: From Firefighter to Forensic Expert

Kubernetes incidents will happen. Your choice is whether you spend hours running random commands or minutes following evidence. The difference isn't more tools or alerts—it's a systematic approach that treats failures as puzzles to solve, not fires to panic about.

Next time your cluster is on fire, don't reach for the water pistol. Reach for the evidence. Start with the 5-minute triage, follow the events, and remember: the answer is always in the logs you haven't checked yet. Now go update your aliases before the next page comes in.
