Kubernetes Failure Modes: The Survival Guide for When Everything Goes Sideways
β€’

Kubernetes Failure Modes: The Survival Guide for When Everything Goes Sideways

πŸ“‹ Quick Steps

When your pods are dying and you need answers fast, run this diagnostic sequence.

# 1. Get the big picture
kubectl get pods --all-namespaces

# 2. Find the screaming pods
kubectl get pods --field-selector=status.phase!=Running

# 3. Read their last words
kubectl describe pod [POD_NAME] | tail -30

# 4. See what they tried to say
kubectl logs [POD_NAME] --previous

# 5. Check if they could breathe
kubectl get events --sort-by='.lastTimestamp'

Your Pods Are Dead and They're Not Telling You Why

You ran kubectl get pods and saw the dreaded CrashLoopBackOff. Your service is down. Your pager is buzzing. Your coffee is getting cold. And Kubernetes is just sitting there with that smug, cryptic error message that reads like a fortune cookie written by a sadistic SRE.

Welcome to production. Where kubectl get pods only tells you they're dead, not why they chose to die. This guide is your flashlight in the dark, confusing cave of Kubernetes failures.

TL;DR: What You'll Actually Use

  • Step-by-step diagnosis sequences for the five most common failures (no more random kubectl commands)
  • Real misconfigurations that look correct but will ruin your Friday night
  • Quick-fix scripts to restore service while you find the actual problem
  • How to decode kubectl error messages that sound like poetry but mean "you messed up"

The 5 Most Common Failure Modes (And How to Fix Them)

1. The CrashLoopBackOff of Despair

Your pod starts, crashes, starts, crashes. Like watching a baby deer try to ice skate. The container runs for 2 seconds then gives up.

Step-by-step diagnosis:

  1. kubectl logs pod-name --previous (see the last run's logs)
  2. kubectl describe pod pod-name (look for Exit Code)
  3. Check for a failing liveness probe: kubectl describe pod pod-name | grep -A5 -B5 Liveness
  4. Verify your container actually starts locally: docker run your-image:tag
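Container exit codes follow the shell convention of 128 + signal number, so the common ones can be decoded mechanically. A small helper (the messages are my own shorthand, not kubectl output):

```shell
#!/bin/bash
# Decode a container exit code from `kubectl describe pod`.
# Codes above 128 mean "killed by signal (code - 128)".
decode_exit_code() {
  case "$1" in
    0)   echo "clean exit -- did your entrypoint just finish and quit?" ;;
    1)   echo "generic application error -- check the app logs" ;;
    137) echo "SIGKILL (128+9) -- usually OOMKilled; check memory limits" ;;
    139) echo "SIGSEGV (128+11) -- the process segfaulted" ;;
    143) echo "SIGTERM (128+15) -- killed during a normal shutdown" ;;
    *)   echo "exit code $1 -- consult your app's docs" ;;
  esac
}

decode_exit_code 137
```

Exit code 137 in particular is worth memorizing: it almost always means the kernel OOM-killed your container.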

Common mistake: Your app needs 45 seconds to start but your liveness probe starts checking at 5 seconds. The probe fails, Kubernetes kills the container, and the cycle repeats forever. (A failing readiness probe only removes the pod from service endpoints; it's the liveness probe that triggers restarts.)
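If the slow starter is your problem, a startupProbe buys the app breathing room before the liveness probe takes over. A minimal sketch, printed here via a heredoc; the /healthz path and port 8080 are placeholders for whatever your app actually exposes:

```shell
# Emit a probe snippet for a slow-starting container.
probe_yaml=$(cat <<'EOF'
startupProbe:
  httpGet:
    path: /healthz        # placeholder -- use your app's health endpoint
    port: 8080
  failureThreshold: 18    # 18 * 5s = 90s of grace before Kubernetes gives up
  periodSeconds: 5
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10       # only starts counting after the startup probe succeeds
EOF
)
echo "$probe_yaml"
```

While the startup probe is still failing, the liveness and readiness probes are held back, so the container gets its full 90 seconds without being killed.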

2. The ImagePullBackOff That Wasn't Your Fault (But Actually Was)

"Cannot pull image." Thanks, Kubernetes. Very helpful. The registry might be down. Or you might have typed the image name wrong. Or your credentials expired. Or you're on the wrong network.

Quick fix while you investigate:

# Pull the image manually to see the actual error
docker pull your-registry.com/your-image:tag

# Check if it's a permissions issue
kubectl create secret docker-registry regcred \
--docker-server=your-registry.com \
--docker-username=your-name \
--docker-password=your-password
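Creating the secret is only half the job: the pod spec has to reference it, or the kubelet will keep pulling anonymously. A sketch reusing the regcred name from the command above (the pod and image names are placeholders):

```shell
# A registry secret does nothing until a pod (or its ServiceAccount) references it.
pod_yaml=$(cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  imagePullSecrets:
    - name: regcred                          # must match the secret's name
  containers:
    - name: app
      image: your-registry.com/your-image:tag
EOF
)
echo "$pod_yaml"
```

Alternatively, patch the namespace's default ServiceAccount to include the secret so every pod in the namespace inherits it automatically.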

Pro tip: Always tag your images with specific versions, not just :latest. When :latest fails, you have no idea what version actually failed.

3. The Pending Pod That's Just... Waiting

It's not running. It's not failing. It's just... pending. Like your friend who said they'd be ready in 5 minutes 20 minutes ago.

Diagnosis flowchart:

  1. kubectl describe pod pod-name (look at the Events section)
  2. No nodes available? Check resource requests: kubectl describe pod pod-name | grep -i requests
  3. PVC pending? Check the storage class: kubectl get pvc
  4. Node selector mismatch? kubectl describe pod pod-name | grep -i node

Real-world example: You requested 16GB of memory but your largest node has 8GB. The scheduler can't find a home for your memory-hungry pod.
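The scheduler's arithmetic here is unforgiving: a pod's request must fit inside a single node's allocatable capacity. A simplified sketch of that comparison (it ignores node-reserved overhead, which makes real headroom even smaller, and only handles Mi/Gi units):

```shell
# Convert a Kubernetes memory quantity to bytes -- simplified sketch that
# only handles Mi/Gi; real quantities also allow Ki, M, G, and plain bytes.
to_bytes() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 * 1024 )) ;;
    *Mi) echo $(( ${1%Mi} * 1024 * 1024 )) ;;
    *)   echo "$1" ;;
  esac
}

request="16Gi"; biggest_node="8Gi"
if [ "$(to_bytes "$request")" -gt "$(to_bytes "$biggest_node")" ]; then
  echo "Pending forever: $request requested, largest node offers $biggest_node"
fi
```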

4. The Service That Exists But Doesn't Work

Your pods are running. Your service is created. Your endpoints should be there. But curl service-name.namespace.svc.cluster.local returns nothing. Or times out. Or connects but gets reset.

The debugging sequence:

# 1. Are there endpoints?
kubectl get endpoints service-name

# 2. Do selectors match?
kubectl get pods -l app=your-label
kubectl describe svc service-name | grep Selector

# 3. Can you reach the pod directly?
kubectl port-forward pod/pod-name 8080:8080
curl localhost:8080

Common mistake: Your service selector is app: myapp but your pods are labeled app: my-app. Kubernetes is case-sensitive and hyphen-sensitive. It's not ignoring your pods; it literally can't find them.
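Selector matching is exact string equality on every key/value pair, with no normalization whatsoever. A toy sketch of why app=myapp never matches app=my-app (labels are hypothetical):

```shell
# Kubernetes matches labels by exact string equality -- nothing fancier.
selector="app=myapp"       # what the Service asks for
pod_label="app=my-app"     # what the pods actually carry

if [ "$selector" = "$pod_label" ]; then
  echo "match: the Service will get endpoints"
else
  echo "no match: '$selector' vs '$pod_label' -- zero endpoints"
fi
```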

5. The ConfigMap That Changed But Didn't Update

You updated the ConfigMap. You waited. Nothing changed. The pods still have the old configuration. You feel betrayed by the very infrastructure you built.

Why this happens: Environment variables sourced from a ConfigMap are read once, at container start, and never updated. Volume-mounted ConfigMaps do get refreshed eventually by the kubelet, but the sync can take a minute or more, files mounted with subPath never update at all, and most apps only read their config at startup anyway. Either way, a restart is usually what actually picks up the change.

Quick fix script:

#!/bin/bash
# Force pods to restart and pick up the new ConfigMap
kubectl get pods -l app=your-app -o jsonpath='{.items[*].metadata.name}' | \
  xargs -n1 kubectl delete pod

# Simpler, if the pods belong to a Deployment:
# kubectl rollout restart deployment/your-app

Better solution: Use the ConfigMap's hash in your deployment template. When the ConfigMap changes, the hash changes, triggering a rollout.
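One way to wire that up by hand (Helm users get this via the common checksum/config annotation pattern): hash the config source and stamp the hash onto the pod template. The file contents and deployment name below are placeholders:

```shell
# Hash the ConfigMap's source file; stamping the hash onto the pod template
# forces a rolling update whenever the config content changes.
config_file=$(mktemp)
printf 'log_level: debug\n' > "$config_file"   # stand-in for your real config

checksum=$(sha256sum "$config_file" | cut -d' ' -f1)
echo "checksum/config: $checksum"

# Illustrative -- patching the pod-template annotation triggers the rollout:
# kubectl patch deployment your-app -p \
#   "{\"spec\":{\"template\":{\"metadata\":{\"annotations\":{\"checksum/config\":\"$checksum\"}}}}}"
rm -f "$config_file"
```

Because the annotation lives in the pod template, changing it counts as a template change, which is exactly what makes the Deployment roll.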

Pro Tips From Someone Who's Been Burned

πŸ›‘οΈ Prevention Beats Diagnosis

  • Set reasonable resource limits: Your app uses 100MB in dev but 2GB in production. Test with production-like data.
  • Use startup probes for slow starters: Give your Java app 90 seconds before checking readiness.
  • Label everything consistently: Pick a labeling scheme and stick to it across all resources.
  • Test failure scenarios: What happens when a node dies? When the registry is down? When DNS fails?
  • Keep manifests in version control: Yes, even that "temporary" ConfigMap you made 6 months ago.

When All Else Fails: The Nuclear Option

Sometimes you need to restore service NOW and investigate LATER. Here's your emergency restart script:

#!/bin/bash
# Emergency restart for a deployment
DEPLOYMENT="your-deployment"
NAMESPACE="your-namespace"

# Scale to 0 to stop the bleeding
kubectl scale deployment $DEPLOYMENT -n $NAMESPACE --replicas=0

# Check what's actually broken while service is down
kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' | tail -20

# Roll back to the previous working version if one exists
kubectl rollout undo deployment/$DEPLOYMENT -n $NAMESPACE

# Then scale back up (neither the scale-to-0 nor the undo restores the replica count)
kubectl scale deployment $DEPLOYMENT -n $NAMESPACE --replicas=3

Conclusion: Embrace the Chaos

Kubernetes failures will happen. Pods will die. Services will misbehave. Configurations will drift. The goal isn't to prevent all failuresβ€”that's impossible. The goal is to fail faster, diagnose quicker, and recover automatically.

Bookmark this guide. Save the scripts. And next time you see CrashLoopBackOff, smile. You know what to do.

Your next step: Pick one failure mode from this guide and create a runbook for your team. Document the exact commands, who to notify, and when to escalate. Because the best time to prepare for failure was yesterday. The second-best time is right now.

⚑

Quick Summary

  • What: Developers struggle to diagnose and fix common Kubernetes failures in production, wasting hours on cryptic error messages and cascading failures

πŸ“š Sources & Attribution

Author: Code Sensei
Published: 21.03.2026 05:18

⚠️ AI-Generated Content
This article was created by our AI Writer Agent using advanced language models. The content is based on verified sources and undergoes quality review, but readers should verify critical information independently.

πŸ’¬ Discussion

Add a Comment

0/5000
Loading comments...