Kubernetes Failure Modes: The Survival Guide for When Everything Goes Sideways
β€’

Kubernetes Failure Modes: The Survival Guide for When Everything Goes Sideways

πŸ“‹ Quick Steps

When your pods are dying and you need answers fast, run this diagnostic sequence.

# 1. Get the big picture
kubectl get pods --all-namespaces

# 2. Find the screaming pods
kubectl get pods --field-selector=status.phase!=Running

# 3. Read their last words
kubectl describe pod [POD_NAME] | tail -30

# 4. See what they tried to say
kubectl logs [POD_NAME] --previous

# 5. Check if they could breathe
kubectl get events --sort-by='.lastTimestamp'

Your Pods Are Dead and They're Not Telling You Why

You ran kubectl get pods and saw the dreaded CrashLoopBackOff. Your service is down. Your pager is buzzing. Your coffee is getting cold. And Kubernetes is just sitting there with that smug, cryptic error message that reads like a fortune cookie written by a sadistic SRE.

Welcome to production. Where kubectl get pods only tells you they're dead, not why they chose to die. This guide is your flashlight in the dark, confusing cave of Kubernetes failures.

TL;DR: What You'll Actually Use

  • Step-by-step diagnosis sequences for the five most common failures (no more random kubectl commands)
  • Real misconfigurations that look correct but will ruin your Friday night
  • Quick-fix scripts to restore service while you find the actual problem
  • How to decode kubectl error messages that sound like poetry but mean "you messed up"

The 5 Most Common Failure Modes (And How to Fix Them)

1. The CrashLoopBackOff of Despair

Your pod starts, crashes, starts, crashes. Like watching a baby deer try to ice skate. The container runs for 2 seconds then gives up.

Step-by-step diagnosis:

  1. kubectl logs pod-name --previous (see the last run's logs)
  2. kubectl describe pod pod-name (look for Exit Code)
  3. Check for a failing liveness probe: kubectl describe pod pod-name | grep -A5 -B5 Liveness
  4. Verify your container actually starts locally: docker run your-image:tag
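Container exit codes follow the shell convention of 128 + signal number, so the common ones can be decoded mechanically. A small helper (the messages are my own shorthand, not kubectl output):

```shell
#!/bin/bash
# Decode a container exit code from `kubectl describe pod`.
# Codes above 128 mean "killed by signal (code - 128)".
decode_exit_code() {
  case "$1" in
    0)   echo "clean exit -- did your entrypoint just finish and quit?" ;;
    1)   echo "generic application error -- check the app logs" ;;
    137) echo "SIGKILL (128+9) -- usually OOMKilled; check memory limits" ;;
    139) echo "SIGSEGV (128+11) -- the process segfaulted" ;;
    143) echo "SIGTERM (128+15) -- killed during a normal shutdown" ;;
    *)   echo "exit code $1 -- consult your app's docs" ;;
  esac
}

decode_exit_code 137
```

Exit code 137 in particular is worth memorizing: it almost always means the kernel OOM-killed your container.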

Common mistake: Your app needs 45 seconds to start but your liveness probe starts checking at 5 seconds. The probe fails, Kubernetes kills the container, and the cycle repeats forever. (A failing readiness probe only removes the pod from service endpoints; it's the liveness probe that triggers restarts.)
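If the slow starter is your problem, a startupProbe buys the app breathing room before the liveness probe takes over. A minimal sketch, printed here via a heredoc; the /healthz path and port 8080 are placeholders for whatever your app actually exposes:

```shell
# Emit a probe snippet for a slow-starting container.
probe_yaml=$(cat <<'EOF'
startupProbe:
  httpGet:
    path: /healthz        # placeholder -- use your app's health endpoint
    port: 8080
  failureThreshold: 18    # 18 * 5s = 90s of grace before Kubernetes gives up
  periodSeconds: 5
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10       # only starts counting after the startup probe succeeds
EOF
)
echo "$probe_yaml"
```

While the startup probe is still failing, the liveness and readiness probes are held back, so the container gets its full 90 seconds without being killed.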

2. The ImagePullBackOff That Wasn't Your Fault (But Actually Was)

"Cannot pull image." Thanks, Kubernetes. Very helpful. The registry might be down. Or you might have typed the image name wrong. Or your credentials expired. Or you're on the wrong network.

Quick fix while you investigate:

# Pull the image manually to see the actual error
docker pull your-registry.com/your-image:tag

# Check if it's a permissions issue
kubectl create secret docker-registry regcred \
--docker-server=your-registry.com \
--docker-username=your-name \
--docker-password=your-password
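Creating the secret is only half the job: the pod spec has to reference it, or the kubelet will keep pulling anonymously. A sketch reusing the regcred name from the command above (the pod and image names are placeholders):

```shell
# A registry secret does nothing until a pod (or its ServiceAccount) references it.
pod_yaml=$(cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  imagePullSecrets:
    - name: regcred                          # must match the secret's name
  containers:
    - name: app
      image: your-registry.com/your-image:tag
EOF
)
echo "$pod_yaml"
```

Alternatively, patch the namespace's default ServiceAccount to include the secret so every pod in the namespace inherits it automatically.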

Pro tip: Always tag your images with specific versions, not just :latest. When :latest fails, you have no idea what version actually failed.

3. The Pending Pod That's Just... Waiting

It's not running. It's not failing. It's just... pending. Like your friend who said they'd be ready in 5 minutes 20 minutes ago.

Diagnosis flowchart:

  1. kubectl describe pod pod-name (look at the Events section)
  2. No nodes available? Check resource requests: kubectl describe pod pod-name | grep -i requests
  3. PVC pending? Check the storage class: kubectl get pvc
  4. Node selector mismatch? kubectl describe pod pod-name | grep -i node

Real-world example: You requested 16GB of memory but your largest node has 8GB. The scheduler can't find a home for your memory-hungry pod.
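The scheduler's arithmetic here is unforgiving: a pod's request must fit inside a single node's allocatable capacity. A simplified sketch of that comparison (it ignores node-reserved overhead, which makes real headroom even smaller, and only handles Mi/Gi units):

```shell
# Convert a Kubernetes memory quantity to bytes -- simplified sketch that
# only handles Mi/Gi; real quantities also allow Ki, M, G, and plain bytes.
to_bytes() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 * 1024 )) ;;
    *Mi) echo $(( ${1%Mi} * 1024 * 1024 )) ;;
    *)   echo "$1" ;;
  esac
}

request="16Gi"; biggest_node="8Gi"
if [ "$(to_bytes "$request")" -gt "$(to_bytes "$biggest_node")" ]; then
  echo "Pending forever: $request requested, largest node offers $biggest_node"
fi
```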

4. The Service That Exists But Doesn't Work

Your pods are running. Your service is created. Your endpoints should be there. But curl service-name.namespace.svc.cluster.local returns nothing. Or times out. Or connects but gets reset.

The debugging sequence:

# 1. Are there endpoints?
kubectl get endpoints service-name

# 2. Do selectors match?
kubectl get pods -l app=your-label
kubectl describe svc service-name | grep Selector

# 3. Can you reach the pod directly?
kubectl port-forward pod/pod-name 8080:8080
curl localhost:8080

Common mistake: Your service selector is app: myapp but your pods are labeled app: my-app. Kubernetes is case-sensitive and hyphen-sensitive. It's not ignoring your pods; it literally can't find them.
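Selector matching is exact string equality on every key/value pair, with no normalization whatsoever. A toy sketch of why app=myapp never matches app=my-app (labels are hypothetical):

```shell
# Kubernetes matches labels by exact string equality -- nothing fancier.
selector="app=myapp"       # what the Service asks for
pod_label="app=my-app"     # what the pods actually carry

if [ "$selector" = "$pod_label" ]; then
  echo "match: the Service will get endpoints"
else
  echo "no match: '$selector' vs '$pod_label' -- zero endpoints"
fi
```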

5. The ConfigMap That Changed But Didn't Update

You updated the ConfigMap. You waited. Nothing changed. The pods still have the old configuration. You feel betrayed by the very infrastructure you built.

Why this happens: Environment variables sourced from a ConfigMap are read once, at container start, and never updated. Volume-mounted ConfigMaps do get refreshed eventually by the kubelet, but the sync can take a minute or more, files mounted with subPath never update at all, and most apps only read their config at startup anyway. Either way, a restart is usually what actually picks up the change.

Quick fix script:

#!/bin/bash
# Force pods to restart and pick up the new ConfigMap
kubectl get pods -l app=your-app -o jsonpath='{.items[*].metadata.name}' | \
  xargs -n1 kubectl delete pod

# Simpler, if the pods belong to a Deployment:
# kubectl rollout restart deployment/your-app

Better solution: Use the ConfigMap's hash in your deployment template. When the ConfigMap changes, the hash changes, triggering a rollout.
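One way to wire that up by hand (Helm users get this via the common checksum/config annotation pattern): hash the config source and stamp the hash onto the pod template. The file contents and deployment name below are placeholders:

```shell
# Hash the ConfigMap's source file; stamping the hash onto the pod template
# forces a rolling update whenever the config content changes.
config_file=$(mktemp)
printf 'log_level: debug\n' > "$config_file"   # stand-in for your real config

checksum=$(sha256sum "$config_file" | cut -d' ' -f1)
echo "checksum/config: $checksum"

# Illustrative -- patching the pod-template annotation triggers the rollout:
# kubectl patch deployment your-app -p \
#   "{\"spec\":{\"template\":{\"metadata\":{\"annotations\":{\"checksum/config\":\"$checksum\"}}}}}"
rm -f "$config_file"
```

Because the annotation lives in the pod template, changing it counts as a template change, which is exactly what makes the Deployment roll.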

Pro Tips From Someone Who's Been Burned

πŸ›‘οΈ Prevention Beats Diagnosis

  • Set reasonable resource limits: Your app uses 100MB in dev but 2GB in production. Test with production-like data.
  • Use startup probes for slow starters: Give your Java app 90 seconds before checking readiness.
  • Label everything consistently: Pick a labeling scheme and stick to it across all resources.
  • Test failure scenarios: What happens when a node dies? When the registry is down? When DNS fails?
  • Keep manifests in version control: Yes, even that "temporary" ConfigMap you made 6 months ago.

When All Else Fails: The Nuclear Option

Sometimes you need to restore service NOW and investigate LATER. Here's your emergency restart script:

#!/bin/bash
# Emergency restart for a deployment
DEPLOYMENT="your-deployment"
NAMESPACE="your-namespace"

# Scale to 0 to stop the bleeding
kubectl scale deployment $DEPLOYMENT -n $NAMESPACE --replicas=0

# Check what's actually broken while service is down
kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' | tail -20

# Roll back to the previous working version if one exists
kubectl rollout undo deployment/$DEPLOYMENT -n $NAMESPACE

# Then scale back up (neither the scale-to-0 nor the undo restores the replica count)
kubectl scale deployment $DEPLOYMENT -n $NAMESPACE --replicas=3

Conclusion: Embrace the Chaos

Kubernetes failures will happen. Pods will die. Services will misbehave. Configurations will drift. The goal isn't to prevent all failuresβ€”that's impossible. The goal is to fail faster, diagnose quicker, and recover automatically.

Bookmark this guide. Save the scripts. And next time you see CrashLoopBackOff, smile. You know what to do.

Your next step: Pick one failure mode from this guide and create a runbook for your team. Document the exact commands, who to notify, and when to escalate. Because the best time to prepare for failure was yesterday. The second-best time is right now.

⚑

Quick Summary

  • What: Developers struggle to diagnose and fix common Kubernetes failures in production, wasting hours on cryptic error messages and cascading failures

πŸ“š Sources & Attribution

Author: Code Sensei
Published: 21.03.2026 05:18

⚠️ AI-Generated Content
This article was created by our AI Writer Agent using advanced language models. The content is based on verified sources and undergoes quality review, but readers should verify critical information independently.

πŸ’¬ Discussion

Add a Comment

0/5000
Loading comments...