📋 Quick Steps
The 60-second evidence-gathering sequence every K8s detective needs in their toolkit.
# 1. Round up the suspects (pods in a bad state)
kubectl get pods --all-namespaces | grep -E "Error|CrashLoopBackOff|Pending"
# 2. Check recent events (the "what happened" log)
kubectl get events --sort-by='.lastTimestamp' --all-namespaces | tail -20
# 3. Examine the victim (replace YOUR_POD)
kubectl describe pod YOUR_POD
# 4. Interview the witnesses (container logs)
kubectl logs YOUR_POD --previous --tail=50
# 5. Check for resource crimes
kubectl top pods --all-namespaces
Another Pod Down, Another Crime Scene
Your phone buzzes at 2 AM. The Slack alert reads "pod-xyz-789 is CrashLoopBackOff." You're not a developer anymore—you're a detective arriving at a fresh Kubernetes crime scene. The victim? Your application. The suspects? Memory limits, network policies, or that "harmless" config change someone pushed before leaving.
Welcome to Container Scene Investigation, where kubectl is your magnifying glass, logs are your witnesses (some reliable, some completely useless), and the OOMKiller is your prime suspect 60% of the time. The clock's ticking before your on-call rotation ends.
🔍 TL;DR: The CSI Framework
- Secure the scene: Don't delete anything. Gather evidence first, blame teammates later.
- Interview witnesses: Logs tell stories, but you need to know which ones are lying.
- Follow the evidence chain: Events → Pod status → Container logs → Resource usage.
- Identify the weapon: Usually memory, sometimes RBAC, occasionally cosmic rays.
- Present the case: "The pod was OOMKilled due to a 512MB limit with 2GB actual usage" beats "K8s is broken."
The 5-Step Forensic Investigation Framework
Step 1: Secure the Scene (Don't Delete Evidence)
Your first instinct when a pod crashes? Delete it and hope it comes back different. That's like burning down the crime scene because the body smells funny. Instead, freeze everything. Take screenshots of kubectl outputs. Save logs to files. The evidence disappears faster than your motivation at 3 AM.
What to run immediately:
kubectl get pods -o wide
kubectl get events --field-selector=involvedObject.name=YOUR_POD
Common mistake: Running `kubectl delete pod` before checking `kubectl describe`. You just destroyed the primary evidence.
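The commands above can be bundled into a quick evidence-capture sketch so nothing vanishes when the pod restarts. The pod name, namespace, and directory layout below are placeholders; adapt them to your incident:

```shell
#!/bin/sh
# Snapshot the crime scene before anything restarts or gets garbage-collected.
POD=pod-xyz-789   # placeholder: your crashing pod
NS=default        # placeholder: its namespace
DIR="evidence-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$DIR"

kubectl describe pod "$POD" -n "$NS" > "$DIR/describe.txt"
kubectl logs "$POD" -n "$NS" --previous > "$DIR/logs-previous.txt" 2>&1
kubectl logs "$POD" -n "$NS" > "$DIR/logs-current.txt" 2>&1
kubectl get events -n "$NS" --field-selector=involvedObject.name="$POD" > "$DIR/events.txt"
kubectl get pod "$POD" -n "$NS" -o yaml > "$DIR/pod.yaml"

echo "Evidence saved to $DIR/"
```

Run it the moment the alert fires; the `--previous` logs and the events are the two artifacts most likely to disappear on you.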
Step 2: Interview the Witnesses (Logs with Skepticism)
Logs are like eyewitness accounts—some are detailed and accurate, others are "I saw a thing, maybe blue?" Container logs show what happened inside the crime scene. But you need the right flags:
Critical commands:
# See what the container said before dying
kubectl logs pod-name --previous
# Follow the drama in real-time
kubectl logs pod-name -f
# Multi-container pod? Specify the right witness
kubectl logs pod-name -c container-name
Pro insight: If logs show nothing and the container died instantly, check the Docker image. It might be missing the ENTRYPOINT entirely. Yes, that happens.
Step 3: Examine the Body (Describe Tells All)
`kubectl describe pod` is the autopsy report. It shows you the last known state, events specific to this pod, resource limits, node assignment, and who killed it (usually "OOMKilled" or "Error").
Look for these smoking guns:
• Last State: Terminated with exit code 137 (that's 128 + SIGKILL)? Almost always the OOMKiller. Exit code 1? Application error.
• Events section: "Failed scheduling" means no node could host it (resources, taints).
• Conditions: "PodScheduled", "Initialized", "ContainersReady", "Ready"—which one is False?
Example finding: "Warning FailedScheduling 54s default-scheduler 0/3 nodes are available: 3 Insufficient memory." Case closed—your cluster is out of RAM.
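If you want the smoking gun without scanning the full autopsy report, a jsonpath query pulls the last termination details straight out of the pod status (YOUR_POD is a placeholder):

```shell
# Name, exit code, and reason of each container's last termination.
# 137 = 128 + SIGKILL, the OOMKiller's usual signature.
kubectl get pod YOUR_POD -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.exitCode}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
```

A line like `app	137	OOMKilled` closes the case faster than scrolling through `describe`.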
Step 4: Establish Motive (Resource Forensics)
Kubernetes murders are usually crimes of passion—passion for CPU and memory. Check what resources were actually used versus what was requested/limited.
Resource investigation toolkit:
# What's consuming everything right now?
kubectl top pods --all-namespaces
# Per-container usage right now (requires metrics-server)
kubectl top pod pod-name --containers
# Compare requests vs usage
kubectl describe node node-name | grep -A 10 "Allocated resources"
The classic pattern: a pod requests 100m CPU with a 500m limit, but actually needs 600m. It gets throttled silently for weeks, then a traffic spike slows it down enough that its liveness probes time out and the kubelet restarts it. Premeditated resource crime.
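To compare the numbers side by side, you can dump each container's requests and limits and hold them up against `kubectl top` (YOUR_POD is a placeholder):

```shell
# Requests vs limits per container -- compare against `kubectl top pod` output.
kubectl get pod YOUR_POD -o jsonpath='{range .spec.containers[*]}{.name}{": cpu "}{.resources.requests.cpu}{"/"}{.resources.limits.cpu}{", mem "}{.resources.requests.memory}{"/"}{.resources.limits.memory}{"\n"}{end}'
```

Empty fields mean no request or limit was set at all, which is its own kind of evidence.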
Step 5: Check for Conspiracy (Networking & RBAC)
Sometimes it's not the pod, it's the environment. Network policies blocking traffic. RBAC denying API access. Persistent volume claims stuck pending.
Conspiracy theory validation:
# Can the pod talk to the service?
kubectl exec pod-name -- curl -v service-name:port
# Check service endpoints actually exist
kubectl get endpoints service-name
# RBAC test—can the service account do things?
kubectl auth can-i create pods --as=system:serviceaccount:namespace:sa-name
Real case: Pod starts but can't connect to database. Turns out someone added a NetworkPolicy that blocks port 5432. The perfect alibi—"It worked yesterday."
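To test that alibi, check what NetworkPolicies exist and probe the connection from a throwaway pod. The policy, service, and port names below are placeholders for that database scenario:

```shell
# Any NetworkPolicies that could be blocking traffic?
kubectl get networkpolicy --all-namespaces
kubectl describe networkpolicy POLICY_NAME -n NAMESPACE

# Quick connectivity probe from a disposable pod (deleted when you exit)
kubectl run netcheck --rm -it --restart=Never --image=busybox -- \
  nc -zv -w 3 db-service 5432
```

If `nc` times out from the throwaway pod but the service endpoints exist, a policy or firewall is your prime suspect.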
Pro Tips from Seasoned K8s Detectives
1. The --previous flag is your best friend
When a pod crashes and restarts, the new container's logs don't show why the previous one died. `kubectl logs --previous` gets the dead container's final words.
2. Events have a TTL of 1 hour
Kubernetes events expire after an hour by default (the API server's `--event-ttl` flag). If you're investigating something from yesterday, you're relying on logs and metrics alone.
3. CrashLoopBackOff math
The backoff timer doubles each failure (10s, 20s, 40s...). If it's stuck at 5 minutes, it's been failing for a while. This tells you if it's a new issue or chronic.
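A quick back-of-the-envelope sketch of that doubling (the 10s base and 5m cap are the kubelet defaults; Kubernetes also resets the timer after a container runs cleanly for a while):

```shell
#!/bin/sh
# CrashLoopBackOff back-off: starts at 10s, doubles per failure, caps at 5m (300s).
delay=10
total=0
schedule=""
for attempt in 1 2 3 4 5 6; do
  schedule="$schedule ${delay}s"
  total=$((total + delay))
  delay=$((delay * 2))
  if [ "$delay" -gt 300 ]; then delay=300; fi
done
echo "Back-off schedule:$schedule"
echo "Elapsed wait after 6 failures: ${total}s (~$((total / 60)) minutes)"
```

So a pod sitting at the 5-minute cap has been failing for at least ten minutes of accumulated back-off, plus however long each attempt ran before dying.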
4. Ephemeral storage is the silent killer
Everyone monitors CPU and memory. Nobody monitors ephemeral storage until pods start getting evicted with "DiskPressure."
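Two quick checks surface disk pressure before (or after) the evictions start:

```shell
# Which nodes are reporting DiskPressure?
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}'

# Has the kubelet evicted pods recently?
kubectl get events --all-namespaces --field-selector=reason=Evicted
```

Remember the one-hour event TTL: if evictions happened overnight, the second command may come back empty even though the bodies are real.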
5. kubectl debug for live forensics
Need to inspect a running pod? `kubectl debug -it pod/pod-name --image=busybox` attaches an ephemeral troubleshooting container alongside the existing ones, no restart required.
Presenting Your Findings (Without Sounding Like You're Making Excuses)
Stakeholders don't care about exit codes or OOMKillers. They care about "is it fixed" and "will it happen again." Structure your incident report like a detective's case file:
The Blameless Post-Mortem Template:
1. Timeline: "At 02:14, pod started; at 02:15, memory spiked to 1.8GB; at 02:16, OOMKilled"
2. Root cause: "Memory limit set to 512MB, actual usage 1.8GB during batch processing"
3. Evidence: Include one screenshot of `kubectl describe` showing the OOMKilled status
4. Resolution: "Increased memory limit to 2GB and added monitoring alert at 1.5GB"
5. Prevention: "Add resource profiling to CI/CD for all new batch jobs"
Notice what's missing? Blame. "The YAML was wrong" becomes "The resource requirements didn't match runtime behavior." You're not covering up—you're communicating effectively.
Case Closed (Until the Next Pod Crashes)
Kubernetes incidents will keep happening. Pods will keep crashing. What changes is your ability to treat each one as a solvable mystery rather than a panic-inducing emergency. The framework isn't about preventing all crimes—it's about solving them faster.
Next time your phone buzzes at 2 AM, don't groan. Put on your detective hat, run your 60-second evidence collection, and remember: the OOMKiller always leaves a trace. The case won't solve itself (unless it's a self-healing cluster, but let's be real—those are science fiction).
Your next case awaits. Bookmark the Quick Steps box above. Copy those commands into your troubleshooting notes. And maybe, just maybe, get some sleep before the next pod decides to end it all.
Quick Summary
- What: Developers waste hours trying to debug Kubernetes issues (crashed pods, weird networking, mysterious resource constraints) with scattered logs, incomplete metrics, and confusing kubectl outputs.