Kubernetes Troubleshooting Field Guide: From 'It Works on My Machine' to Production

📋 Quick Steps

The 5-minute diagnostic checklist that solves 80% of Kubernetes failures.

# 1. Check pod status
kubectl get pods --all-namespaces | grep -v Running

# 2. See what's actually wrong
kubectl describe pod [POD_NAME] -n [NAMESPACE]

# 3. Check recent logs
kubectl logs [POD_NAME] -n [NAMESPACE] --tail=50

# 4. Verify service endpoints
kubectl get endpoints [SERVICE_NAME] -n [NAMESPACE]

# 5. Check events for cluster-wide issues
kubectl get events --sort-by='.lastTimestamp' -n [NAMESPACE] | tail -20

When Kubernetes Decides Your App Isn't Good Enough

You've just deployed your masterpiece to Kubernetes. The YAML looked perfect (you think). The container built successfully (mostly). You run kubectl apply and... nothing. Or worse, something that looks like it's working but actually isn't. Welcome to the club where "it works on my machine" meets "Kubernetes has standards."

The real tragedy isn't that Kubernetes is complex—it's that 90% of failures come from the same handful of misconfigurations we all make, then forget, then make again. This guide is your cheat sheet for those moments when kubectl returns more errors than your last sprint retrospective had action items.

TL;DR: What You'll Actually Use

  • The diagnostic decision tree that tells you whether to blame your code, your config, or the cluster (it's usually the config)
  • Real scripts you can steal that turn hours of debugging into minutes of copy-paste
  • Error message translations from "Kubernetes cryptic" to "developer understandable"

The Diagnostic Decision Tree: Start Here, Not on Stack Overflow

When your deployment fails, don't panic-search error messages. Follow this flow instead:

Step 1: Is the Pod Even Scheduled?

Run kubectl get pods. Look for statuses that aren't "Running" or "Completed." If you see "Pending," the cluster can't schedule it. Check resource requests with kubectl describe node to see if you're asking for more CPU than exists in the entire data center.

Common Mistake: Forgetting resources.requests. Without requests, your pod gets the "BestEffort" QoS class (which means "no effort"): it's scheduled last and evicted first when the node runs low on memory.
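A minimal sketch of what explicit requests look like (the name, image, and values are illustrative; size them for your workload):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app                 # illustrative name
spec:
  containers:
    - name: app
      image: myimage:latest    # illustrative image
      resources:
        requests:              # what the scheduler reserves for this pod
          cpu: "250m"
          memory: "256Mi"
        limits:                # hard ceiling; exceeding the memory limit gets the container OOM-killed
          cpu: "500m"
          memory: "512Mi"
```

Setting requests equal to limits gives the pod the Guaranteed QoS class; requests lower than limits gives Burstable; neither gives BestEffort.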

Step 2: Did the Container Start?

"CrashLoopBackOff" is Kubernetes for "your app crashes immediately." Check logs with kubectl logs --previous if the current container already died. This often reveals missing environment variables, wrong command arguments, or that time you hardcoded "localhost" in a distributed system.

Pro Tip: Add a sleep 30 to your container's command during debugging. This keeps the pod alive long enough for you to kubectl exec into it and look around.
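One way to apply that tip, assuming you can edit the Deployment and remember to revert it afterward, is a temporary command override in the container spec:

```yaml
# Temporary debug override: keeps the container alive instead of crashing.
# Remove this once you're done poking around.
containers:
  - name: app
    image: myimage:latest                  # illustrative image
    command: ["sh", "-c", "sleep 1800"]    # stay up for 30 minutes
```

Then kubectl exec -it [POD_NAME] -n [NAMESPACE] -- sh gets you a shell inside the environment where your app was failing.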

Step 3: Can You Reach the Service?

Pods running? Great. Now check if your Service actually points to them. kubectl get endpoints [service-name] should show pod IPs. If it's empty, your Service selector doesn't match your Pod labels. Yes, "app: myapp" and "app: my-app" are different. Kubernetes is pedantic like that.
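A quick way to compare the two sides of that match, sketched with an illustrative service name:

```shell
# What the Service is looking for
kubectl get svc my-app -n [NAMESPACE] -o jsonpath='{.spec.selector}'

# What the Pods actually carry
kubectl get pods -n [NAMESPACE] --show-labels
```

If the selector map and the pod labels don't line up character for character, the endpoints list stays empty.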

Cryptic Error Messages, Translated

Kubernetes error messages sound like they were written by lawyers who hate developers. Here's what they actually mean:

"ImagePullBackOff"

What it says: "Back-off pulling image"
What it means: "The container registry either doesn't exist, requires authentication, or you typed 'myimgae:latest' instead of 'myimage:latest'"
Fix: kubectl describe pod will show the exact error. Check image name, tags, and imagePullSecrets.
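If the image lives in a private registry, the usual fix is a docker-registry Secret referenced from the pod spec. The registry URL and secret name below are placeholders:

```shell
kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username=[USER] \
  --docker-password=[PASSWORD] \
  -n [NAMESPACE]
```

Then reference it in the pod spec under imagePullSecrets (as a list entry with name: regcred).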

"ErrImagePull"

What it says: "rpc error: code = Unknown desc = failed to pull and unpack image..."
What it means: "Same family as ImagePullBackOff: ErrImagePull is the immediate pull failure, ImagePullBackOff is Kubernetes backing off before retrying the pull"
Fix: Test the image locally with docker pull first. Seriously. This saves hours.

"CreateContainerConfigError"

What it says: "secret \"my-secret\" not found"
What it means: "You referenced a ConfigMap or Secret that doesn't exist in this namespace"
Fix: Create the Secret/ConfigMap first, or check namespace mismatches. Kubernetes won't create dependencies for you.
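Two checks that catch most of these, with placeholder names:

```shell
# Does the Secret exist in the namespace the pod runs in?
kubectl get secret my-secret -n [NAMESPACE]

# Or did it land in a different namespace?
kubectl get secrets --all-namespaces | grep my-secret
```

Secrets and ConfigMaps are namespace-scoped, so a pod can only reference ones in its own namespace.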

Copy-Paste Troubleshooting Scripts

Stop typing these commands manually. Save these as shell scripts:

#1: The "Why Won't This Start?" Script
#!/bin/bash
POD_NAME=$1
NAMESPACE=${2:-default}

if [ -z "$POD_NAME" ]; then
  echo "Usage: $0 <pod-name> [namespace]" >&2
  exit 1
fi

echo "=== Pod Status ==="
kubectl get pod "$POD_NAME" -n "$NAMESPACE" -o wide
echo -e "\n=== Describe Output ==="
kubectl describe pod "$POD_NAME" -n "$NAMESPACE" | tail -50
echo -e "\n=== Recent Logs ==="
kubectl logs "$POD_NAME" -n "$NAMESPACE" --tail=30
echo -e "\n=== Events ==="
kubectl get events -n "$NAMESPACE" --field-selector involvedObject.name="$POD_NAME" --sort-by='.lastTimestamp'

#2: The "Network Is Broken" Diagnostic
#!/bin/bash
SERVICE_NAME=$1
NAMESPACE=${2:-default}

if [ -z "$SERVICE_NAME" ]; then
  echo "Usage: $0 <service-name> [namespace]" >&2
  exit 1
fi

echo "=== Service Details ==="
kubectl get svc "$SERVICE_NAME" -n "$NAMESPACE" -o yaml | grep -A5 -B5 'selector:'
echo -e "\n=== Endpoints ==="
kubectl get endpoints "$SERVICE_NAME" -n "$NAMESPACE"
echo -e "\n=== Matching Pods ==="
SELECTOR=$(kubectl get svc "$SERVICE_NAME" -n "$NAMESPACE" -o jsonpath='{.spec.selector}' | jq -r 'to_entries|map("\(.key)=\(.value)")|join(",")')
kubectl get pods -n "$NAMESPACE" --selector="$SELECTOR"
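The jq one-liner in script #2 is the only non-obvious part: it converts the JSON selector map that jsonpath returns into the key=value,key=value syntax that --selector expects. You can see it work without a cluster:

```shell
# Simulate the output of: kubectl get svc ... -o jsonpath='{.spec.selector}'
echo '{"app":"my-app","tier":"web"}' \
  | jq -r 'to_entries | map("\(.key)=\(.value)") | join(",")'
# Prints: app=my-app,tier=web
```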

Common Misconfigurations That Look Right

These are the Kubernetes equivalent of optical illusions—they look correct but will ruin your day:

1. Port Mismatch Trio

Your container exposes port 8080. Your Pod spec says containerPort: 3000. Your Service targets port 80. None of these match. Kubernetes won't warn you—it'll just silently not work.
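Here's how the three numbers have to line up, sketched with 8080 as the port the app actually listens on:

```yaml
# Pod/Deployment side
containers:
  - name: app
    ports:
      - containerPort: 8080   # must match what the app actually binds to
---
# Service side
apiVersion: v1
kind: Service
spec:
  ports:
    - port: 80                # what clients inside the cluster connect to
      targetPort: 8080        # must match the container's listen port
```

The only hard requirement is that targetPort reaches the port the process binds to; port is just the Service's public face.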

2. Liveness vs Readiness Confusion

Liveness probes restart your container. Readiness probes take it out of service traffic. Using a slow database query for liveness? Enjoy your constantly restarting pods.
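A sketch of the distinction in practice: keep the liveness check cheap and local, and let readiness be the one that checks dependencies (the paths here are illustrative):

```yaml
livenessProbe:              # failure => container restart; keep this cheap
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
readinessProbe:             # failure => pod removed from Service endpoints
  httpGet:
    path: /ready            # may check downstream dependencies like the database
    port: 8080
  periodSeconds: 5
```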

3. Missing resources.requests

Without CPU/memory requests, your pod lands in the BestEffort QoS class: lowest scheduling priority, first to be evicted. It's like flying standby during holiday travel: you might get there eventually.

Pro Tips From Production Battle Scars

The Kubernetes Debugging Toolkit

1. Always add --namespace
Set up alias k='kubectl -n my-namespace' to avoid debugging the wrong environment. We've all done it.

2. Use kubectl get events --watch
Run this in a separate terminal during deployments. It's like watching the director's commentary of your failure.

3. Debug with ephemeral containers
kubectl debug [pod-name] -it --image=busybox lets you inspect running pods without changing deployments.

4. Validate YAML before applying
kubectl apply -f [FILE] --dry-run=client -o yaml shows what Kubernetes actually sees, not what you think you wrote.

5. When stuck, go one level higher
Pod not working? Check ReplicaSet. ReplicaSet broken? Check Deployment. Deployment failing? Check the YAML you copied from a blog post in 2018.

Conclusion: From Debugging to Deploying

Kubernetes troubleshooting isn't about being smarter than the system—it's about being systematic. Start with the quick-value checklist, follow the decision tree, and steal the scripts. Most "Kubernetes problems" are actually "configuration problems" wearing a fancy orchestration hat.

The real win comes when you stop fixing and start preventing. Add the diagnostic scripts to your team's runbook. Make the quick checks part of your CI/CD pipeline. And next time someone says "it works on my machine," you'll have the tools to make it work everywhere else too.

Quick Summary

  • What: Developers waste hours debugging Kubernetes issues that range from misconfigured deployments to cryptic error messages, often resorting to tribal knowledge or random Stack Overflow fixes

📚 Sources & Attribution

Author: Code Sensei
Published: 03.03.2026 02:19

⚠️ AI-Generated Content
This article was created by our AI Writer Agent using advanced language models. The content is based on verified sources and undergoes quality review, but readers should verify critical information independently.
