Quick Steps
The 2-minute diagnostic sequence that solves 80% of Kubernetes issues.
Welcome to Your Self-Inflicted Prison
You built this Kubernetes cluster. You deployed the applications. You configured the networking. And now you're trapped inside it, staring at a CrashLoopBackOff error that's mocking you like a digital hostage note. The escape room you designed has locked you in, and the only way out is to debug your own creation.
We've all been there: that moment when your perfectly orchestrated container paradise turns into a distributed systems nightmare. Services that should talk won't. Pods that should run can't. And the logs? They're either non-existent or written in what appears to be ancient Sumerian. Let's escape this mess together.
TL;DR: Your Escape Plan
- Stop guessing: Follow the systematic flow: pods → events → resources → networking
- 80% of issues are image pulls, resource limits, or DNS problems (check these first)
- When to escalate: Only wake someone at 3 AM if the business is actually burning
The Systematic Escape: Your Troubleshooting Flowchart
Randomly running kubectl commands is like trying to escape a room by kicking every wall. Sometimes it works, usually it hurts. Follow this sequence instead.
Step 1: Identify the Hostage (What's Actually Broken?)
Start with the big picture before diving into details. Your first command should always be:
kubectl get pods --all-namespaces | grep -v Running
This shows you everything that's NOT in a happy state. Count the problem pods. If it's one pod in your test namespace, breathe. If it's 50 pods across production, maybe don't breathe.
Common mistake: Debugging the first red thing you see. Sometimes Pod A is failing because Service B is down because ConfigMap C is missing. Start broad, then narrow.
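Starting broad can even be scripted. Here is a minimal sketch, assuming standard `kubectl get pods --all-namespaces --no-headers` output; the `triage` helper name is made up for illustration. It counts unhealthy pods per namespace so you see the blast radius before picking a pod to dig into:

```shell
# Count non-healthy pods per namespace, most-affected first, so you debug
# the widest failure rather than the first red line you see.
# "triage" is a hypothetical helper; it reads
# "kubectl get pods --all-namespaces --no-headers" output on stdin.
triage() {
  grep -Ev 'Running|Completed' \
    | awk '{count[$1]++} END {for (ns in count) print count[ns], ns}' \
    | sort -rn
}

# Against a live cluster:
# kubectl get pods --all-namespaces --no-headers | triage
```

The `grep -Ev` drops healthy states, the `awk` tallies column one (the namespace), and `sort -rn` puts the most-affected namespace on top.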
Step 2: The 5 Whys Technique (Applied to Kubernetes)
For each problematic pod, ask "why" five times using actual commands:
- Why isn't it running? kubectl describe pod [NAME] (look at the Events section)
- Why can't it pull the image? Check the image name, tag, and registry permissions
- Why are resources insufficient? Check requests/limits vs node capacity
- Why did it crash after starting? kubectl logs [POD] --previous
- Why is this happening now? Check recent deployments and config changes
The kubectl get events --sort-by='.lastTimestamp' command is your best friend here. It shows what the cluster itself thinks is happening, in chronological order.
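One way to make the five whys mechanical is to map the event Reason you see back to the question it answers. A toy sketch, where the `why` helper and its reason list are my own choices covering common kubelet/scheduler reasons:

```shell
# Map a common event Reason (from kubectl describe pod / kubectl get events)
# to the next question in the "5 whys" chain. Hypothetical helper.
why() {
  case "$1" in
    ErrImagePull|ImagePullBackOff)
      echo "why can't it pull the image? check name, tag, registry permissions" ;;
    FailedScheduling)
      echo "why are resources insufficient? check requests/limits vs node capacity" ;;
    BackOff|CrashLoopBackOff)
      echo "why did it crash? check: kubectl logs [POD] --previous" ;;
    *)
      echo "read the Events section: kubectl describe pod [NAME]" ;;
  esac
}
```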
Step 3: The Network Interrogation Room
Pods running but not talking? Time for network debugging. From inside a pod (or using a debug container):
kubectl exec [POD] -- nslookup [SERVICE_NAME].[NAMESPACE].svc.cluster.local
If DNS works but connections fail, check NetworkPolicies (Kubernetes' firewall rules):
kubectl get networkpolicies --all-namespaces
Pro tip: Deploy a temporary busybox pod for network testing: kubectl run debug --image=busybox --rm -it --restart=Never -- sh
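Putting the DNS check and a basic connectivity check together, a sketch (the `svc_fqdn` helper is hypothetical, and `api`, `prod`, and port 8080 are placeholder names for your own Service):

```shell
# Build the in-cluster DNS name for a Service ("api" and "prod" are
# placeholders; svc_fqdn is a hypothetical helper).
svc_fqdn() { echo "$1.$2.svc.cluster.local"; }

# DNS first, then an HTTP reachability check, from a throwaway busybox pod:
# kubectl run debug --image=busybox --rm -it --restart=Never -- \
#   nslookup "$(svc_fqdn api prod)"
# kubectl run debug --image=busybox --rm -it --restart=Never -- \
#   wget -q -T 2 -O- "http://$(svc_fqdn api prod):8080/"
```

If the nslookup succeeds but the wget times out, suspect NetworkPolicies or a Service selector that matches no pods.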
Step 4: When You Can't Even Get a Shell
Sometimes pods crash too fast to exec into. Use these workarounds:
- Change the command to sleep: temporarily override the container command to sleep 3600 in your deployment
- Use ephemeral containers (K8s 1.23+): kubectl debug [POD] -it --image=busybox
- Check previous logs: kubectl logs [POD] --previous shows logs from the last crashed instance
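The three workarounds as concrete commands, in a sketch that assumes a Deployment named my-app and a pod my-app-pod (both placeholders). The JSON patch overrides the first container's command so the pod stays up long enough to exec into:

```shell
# JSON patch that replaces the first container's command with "sleep 3600"
# ("my-app" / "my-app-pod" below are placeholder names).
SLEEP_PATCH='[{"op":"add","path":"/spec/template/spec/containers/0/command","value":["sleep","3600"]}]'

# 1. Keep the crashing container alive, then exec in:
# kubectl patch deployment my-app --type=json -p "$SLEEP_PATCH"
# kubectl exec -it my-app-pod -- sh
# 2. Or attach an ephemeral debug container (Kubernetes 1.23+):
# kubectl debug my-app-pod -it --image=busybox
# 3. Or read what the last crashed instance printed:
# kubectl logs my-app-pod --previous
```

Remember to revert the patch (or roll out the original spec) once you're done, or the pod will happily sleep instead of serving traffic.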
The 3 AM Escalation Checklist
Waking someone up requires justification. Ask yourself these questions before hitting the panic button:
- Is the business actually affected? (Users can't pay vs test env is slow)
- Have you checked the obvious? (DNS, node status, resource quotas)
- Can you roll back? (Revert the last deployment:
kubectl rollout undo) - Is data at risk? (Corruption vs temporary unavailability)
- Will this fix itself? (Auto-scaling might handle it in 5 minutes)
If you answer "yes" to #1 and "no" to everything else, congratulations: you've earned that 3 AM call.
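That decision rule can be written down as a toy function, purely illustrative, taking yes/no answers in checklist order:

```shell
# Toy encoding of the checklist: page someone only when the business is
# affected (q1=yes) and every other answer is "no", matching the rule above.
escalate() {
  if [ "$1" = yes ] && [ "$2" = no ] && [ "$3" = no ] \
     && [ "$4" = no ] && [ "$5" = no ]; then
    echo "page someone"
  else
    echo "keep debugging"
  fi
}

# escalate yes no no no no   -> page someone
# escalate yes no yes no no  -> keep debugging (you can roll back instead)
```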
Pro Tips from Someone Who's Escaped Before
- Label everything: kubectl get pods -l app=api,env=prod saves you from namespace hell.
- JSON output is your friend: kubectl get pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated.message}' extracts specific data.
- Watch mode for real-time debugging: kubectl get pods -w shows state changes as they happen.
- Set up k9s or kube-ps1: seeing your current context/namespace in your prompt prevents "why isn't this working... oh, wrong cluster" moments.
- Keep a debug deployment YAML: have a pre-written debug pod manifest ready to go. Time saved during an outage is stress saved.
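A minimal version of that pre-written manifest, kept as a shell function you can stash in a dotfile (the names, labels, and busybox tag are my own placeholder choices):

```shell
# Print a minimal debug pod manifest; apply it during an outage with:
#   debug_pod | kubectl apply -f -
# (names, labels, and the busybox tag are placeholder choices)
debug_pod() {
cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: debug
  labels:
    app: debug
spec:
  restartPolicy: Never
  containers:
  - name: shell
    image: busybox:1.36
    command: ["sleep", "3600"]
EOF
}
```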
Escaping for Good: Prevention Beats Cure
The real escape isn't getting out of this mess; it's not getting into it next time. Implement resource quotas. Set up PodDisruptionBudgets. Use readiness/liveness probes properly (no, pinging '/' doesn't count). And for the love of all that is distributed, set up centralized logging before you need it.
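On the probe point specifically, here is a sketch of probes that hit dedicated health endpoints rather than '/'; the paths, port, and timings are placeholder values for your own app:

```shell
# Print a probe snippet for a container spec; /healthz, /ready, and port
# 8080 are placeholders for your app's real endpoints.
probe_snippet() {
cat <<'EOF'
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
EOF
}
```

The split matters: liveness answers "should this container be restarted?", readiness answers "should traffic be routed here?", and conflating them by pointing both at '/' is how restarts cascade during a partial outage.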
Remember: Kubernetes isn't the problem. Our assumptions about Kubernetes are the problem. The cluster is just doing exactly what we told it to do, even when what we told it makes no sense. Your escape room has an exit; it's just hidden behind proper observability, systematic debugging, and the humility to check the simple things first.
Now go forth and debug. And maybe write some runbooks so the next person doesn't have to escape the same room.
Quick Summary
- What: Developers struggle with debugging complex Kubernetes issues where pods won't start, services can't talk, or resources mysteriously disappear, often spending hours on what should be simple fixes