Kubernetes Troubleshooting Field Guide: From 'It Works on My Machine' to Production

📋 Quick Steps

The 5-minute diagnostic checklist that solves 80% of Kubernetes failures.

# 1. Check pod status
kubectl get pods --all-namespaces | grep -v Running

# 2. See what's actually wrong
kubectl describe pod [POD_NAME] -n [NAMESPACE]

# 3. Check recent logs
kubectl logs [POD_NAME] -n [NAMESPACE] --tail=50

# 4. Verify service endpoints
kubectl get endpoints [SERVICE_NAME] -n [NAMESPACE]

# 5. Check events for cluster-wide issues
kubectl get events --sort-by='.lastTimestamp' -n [NAMESPACE] | tail -20

When Kubernetes Decides Your App Isn't Good Enough

You've just deployed your masterpiece to Kubernetes. The YAML looked perfect (you think). The container built successfully (mostly). You run kubectl apply and... nothing. Or worse, something that looks like it's working but actually isn't. Welcome to the club where "it works on my machine" meets "Kubernetes has standards."

The real tragedy isn't that Kubernetes is complex—it's that 90% of failures come from the same handful of misconfigurations we all make, then forget, then make again. This guide is your cheat sheet for those moments when kubectl returns more errors than your last sprint retrospective had action items.

TL;DR: What You'll Actually Use

  • The diagnostic decision tree that tells you whether to blame your code, your config, or the cluster (it's usually the config)
  • Real scripts you can steal that turn hours of debugging into minutes of copy-paste
  • Error message translations from "Kubernetes cryptic" to "developer understandable"

The Diagnostic Decision Tree: Start Here, Not on Stack Overflow

When your deployment fails, don't panic-search error messages. Follow this flow instead:

Step 1: Is the Pod Even Scheduled?

Run kubectl get pods. Look for statuses that aren't "Running" or "Completed." If you see "Pending," the cluster can't schedule it. Check resource requests with kubectl describe node to see if you're asking for more CPU than exists in the entire data center.

Common Mistake: Forgetting resources.requests. Without requests, your pod gets the "BestEffort" QoS class (which means "no effort"): it's scheduled last and evicted first when the node runs low on memory.
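A minimal sketch of what explicit requests look like (the name, image, and values are illustrative; size them for your workload):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app                 # illustrative name
spec:
  containers:
    - name: app
      image: myimage:latest    # illustrative image
      resources:
        requests:              # what the scheduler reserves for this pod
          cpu: "250m"
          memory: "256Mi"
        limits:                # hard ceiling; exceeding the memory limit gets the container OOM-killed
          cpu: "500m"
          memory: "512Mi"
```

Setting requests equal to limits gives the pod the Guaranteed QoS class; requests lower than limits gives Burstable; neither gives BestEffort.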

Step 2: Did the Container Start?

"CrashLoopBackOff" is Kubernetes for "your app crashes immediately." Check logs with kubectl logs --previous if the current container already died. This often reveals missing environment variables, wrong command arguments, or that time you hardcoded "localhost" in a distributed system.

Pro Tip: Add a sleep 30 to your container's command during debugging. This keeps the pod alive long enough for you to kubectl exec into it and look around.
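One way to apply that tip, assuming you can edit the Deployment and remember to revert it afterward, is a temporary command override in the container spec:

```yaml
# Temporary debug override: keeps the container alive instead of crashing.
# Remove this once you're done poking around.
containers:
  - name: app
    image: myimage:latest                  # illustrative image
    command: ["sh", "-c", "sleep 1800"]    # stay up for 30 minutes
```

Then kubectl exec -it [POD_NAME] -n [NAMESPACE] -- sh gets you a shell inside the environment where your app was failing.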

Step 3: Can You Reach the Service?

Pods running? Great. Now check if your Service actually points to them. kubectl get endpoints [service-name] should show pod IPs. If it's empty, your Service selector doesn't match your Pod labels. Yes, "app: myapp" and "app: my-app" are different. Kubernetes is pedantic like that.
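A quick way to compare the two sides of that match, sketched with an illustrative service name:

```shell
# What the Service is looking for
kubectl get svc my-app -n [NAMESPACE] -o jsonpath='{.spec.selector}'

# What the Pods actually carry
kubectl get pods -n [NAMESPACE] --show-labels
```

If the selector map and the pod labels don't line up character for character, the endpoints list stays empty.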

Cryptic Error Messages, Translated

Kubernetes error messages sound like they were written by lawyers who hate developers. Here's what they actually mean:

"ImagePullBackOff"

What it says: "Back-off pulling image"
What it means: "The container registry either doesn't exist, requires authentication, or you typed 'myimgae:latest' instead of 'myimage:latest'"
Fix: kubectl describe pod will show the exact error. Check image name, tags, and imagePullSecrets.
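If the image lives in a private registry, the usual fix is a docker-registry Secret referenced from the pod spec. The registry URL and secret name below are placeholders:

```shell
kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username=[USER] \
  --docker-password=[PASSWORD] \
  -n [NAMESPACE]
```

Then reference it in the pod spec under imagePullSecrets (as a list entry with name: regcred).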

"ErrImagePull"

What it says: "rpc error: code = Unknown desc = failed to pull and unpack image..."
What it means: "Same family as ImagePullBackOff: ErrImagePull is the immediate pull failure, ImagePullBackOff is Kubernetes backing off before retrying the pull"
Fix: Test the image locally with docker pull first. Seriously. This saves hours.

"CreateContainerConfigError"

What it says: "secret \"my-secret\" not found"
What it means: "You referenced a ConfigMap or Secret that doesn't exist in this namespace"
Fix: Create the Secret/ConfigMap first, or check namespace mismatches. Kubernetes won't create dependencies for you.
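Two checks that catch most of these, with placeholder names:

```shell
# Does the Secret exist in the namespace the pod runs in?
kubectl get secret my-secret -n [NAMESPACE]

# Or did it land in a different namespace?
kubectl get secrets --all-namespaces | grep my-secret
```

Secrets and ConfigMaps are namespace-scoped, so a pod can only reference ones in its own namespace.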

Copy-Paste Troubleshooting Scripts

Stop typing these commands manually. Save these as shell scripts:

#1: The "Why Won't This Start?" Script
#!/bin/bash
POD_NAME=$1
NAMESPACE=${2:-default}

if [ -z "$POD_NAME" ]; then
  echo "Usage: $0 <pod-name> [namespace]" >&2
  exit 1
fi

echo "=== Pod Status ==="
kubectl get pod "$POD_NAME" -n "$NAMESPACE" -o wide
echo -e "\n=== Describe Output ==="
kubectl describe pod "$POD_NAME" -n "$NAMESPACE" | tail -50
echo -e "\n=== Recent Logs ==="
kubectl logs "$POD_NAME" -n "$NAMESPACE" --tail=30
echo -e "\n=== Events ==="
kubectl get events -n "$NAMESPACE" --field-selector involvedObject.name="$POD_NAME" --sort-by='.lastTimestamp'

#2: The "Network Is Broken" Diagnostic
#!/bin/bash
SERVICE_NAME=$1
NAMESPACE=${2:-default}

if [ -z "$SERVICE_NAME" ]; then
  echo "Usage: $0 <service-name> [namespace]" >&2
  exit 1
fi

echo "=== Service Details ==="
kubectl get svc "$SERVICE_NAME" -n "$NAMESPACE" -o yaml | grep -A5 -B5 'selector:'
echo -e "\n=== Endpoints ==="
kubectl get endpoints "$SERVICE_NAME" -n "$NAMESPACE"
echo -e "\n=== Matching Pods ==="
SELECTOR=$(kubectl get svc "$SERVICE_NAME" -n "$NAMESPACE" -o jsonpath='{.spec.selector}' | jq -r 'to_entries|map("\(.key)=\(.value)")|join(",")')
kubectl get pods -n "$NAMESPACE" --selector="$SELECTOR"
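The jq one-liner in script #2 is the only non-obvious part: it converts the JSON selector map that jsonpath returns into the key=value,key=value syntax that --selector expects. You can see it work without a cluster:

```shell
# Simulate the output of: kubectl get svc ... -o jsonpath='{.spec.selector}'
echo '{"app":"my-app","tier":"web"}' \
  | jq -r 'to_entries | map("\(.key)=\(.value)") | join(",")'
# Prints: app=my-app,tier=web
```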

Common Misconfigurations That Look Right

These are the Kubernetes equivalent of optical illusions—they look correct but will ruin your day:

1. Port Mismatch Trio

Your container exposes port 8080. Your Pod spec says containerPort: 3000. Your Service targets port 80. None of these match. Kubernetes won't warn you—it'll just silently not work.
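Here's how the three numbers have to line up, sketched with 8080 as the port the app actually listens on:

```yaml
# Pod/Deployment side
containers:
  - name: app
    ports:
      - containerPort: 8080   # must match what the app actually binds to
---
# Service side
apiVersion: v1
kind: Service
spec:
  ports:
    - port: 80                # what clients inside the cluster connect to
      targetPort: 8080        # must match the container's listen port
```

The only hard requirement is that targetPort reaches the port the process binds to; port is just the Service's public face.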

2. Liveness vs Readiness Confusion

Liveness probes restart your container. Readiness probes take it out of service traffic. Using a slow database query for liveness? Enjoy your constantly restarting pods.
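A sketch of the distinction in practice: keep the liveness check cheap and local, and let readiness be the one that checks dependencies (the paths here are illustrative):

```yaml
livenessProbe:              # failure => container restart; keep this cheap
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
readinessProbe:             # failure => pod removed from Service endpoints
  httpGet:
    path: /ready            # may check downstream dependencies like the database
    port: 8080
  periodSeconds: 5
```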

3. Missing resources.requests

Without CPU/memory requests, your pod lands in the BestEffort QoS class: lowest scheduling priority, first to be evicted. It's like flying standby during holiday travel: you might get there eventually.

Pro Tips From Production Battle Scars

The Kubernetes Debugging Toolkit

1. Always add --namespace
Set up alias k='kubectl -n my-namespace' to avoid debugging the wrong environment. We've all done it.

2. Use kubectl get events --watch
Run this in a separate terminal during deployments. It's like watching the director's commentary of your failure.

3. Debug with ephemeral containers
kubectl debug [pod-name] -it --image=busybox lets you inspect running pods without changing deployments.

4. Validate YAML before applying
kubectl apply -f [FILE] --dry-run=client -o yaml shows what Kubernetes actually sees, not what you think you wrote.

5. When stuck, go one level higher
Pod not working? Check ReplicaSet. ReplicaSet broken? Check Deployment. Deployment failing? Check the YAML you copied from a blog post in 2018.

Conclusion: From Debugging to Deploying

Kubernetes troubleshooting isn't about being smarter than the system—it's about being systematic. Start with the quick-value checklist, follow the decision tree, and steal the scripts. Most "Kubernetes problems" are actually "configuration problems" wearing a fancy orchestration hat.

The real win comes when you stop fixing and start preventing. Add the diagnostic scripts to your team's runbook. Make the quick checks part of your CI/CD pipeline. And next time someone says "it works on my machine," you'll have the tools to make it work everywhere else too.

Quick Summary

  • What: Developers waste hours debugging Kubernetes issues that range from misconfigured deployments to cryptic error messages, often resorting to tribal knowledge or random Stack Overflow fixes

📚 Sources & Attribution

Author: Code Sensei
Published: 03.03.2026 02:19

⚠️ AI-Generated Content
This article was created by our AI Writer Agent using advanced language models. The content is based on verified sources and undergoes quality review, but readers should verify critical information independently.
