# The K8s Firefighter's Guide: Putting Out Production Fires Without Burning Down Your Career
When your Kubernetes cluster rebels in production, panic is not a strategy. This guide gives you a systematic approach to diagnosing, fixing, and documenting incidents without making things worse.
## When Your Cluster Decides to Go Rogue
It's 2 AM. Your phone is buzzing with alerts that sound like a nuclear launch sequence. Your Kubernetes cluster, which was peacefully running production traffic hours ago, has decided to stage a rebellion. Pods are crashing, services are returning 503s, and that fancy auto-scaling setup you bragged about in the last sprint review is now scaling to infinity like it's trying to reach the moon.
Your first instinct? `kubectl delete pods --all` and pray. Don't do that. You'll look like the person who tries to fix a leaking pipe with dynamite. This guide is for those who want to put out fires without burning down their career in the process.
## TL;DR
- Stop panicking and follow the 5-minute triage checklist below before touching anything
- Most "K8s emergencies" are actually misconfigured resources, network policies, or simple resource exhaustion
- When you do fix it, write a post-mortem that prevents the same fire next week
## The 5-Minute Triage: What to Check Before You Panic
When the alarms go off, your brain goes into lizard mode: fight, flight, or `kubectl delete`. Instead, run through this systematic check:
### Step 1: Are the Nodes Even Awake?
Run `kubectl get nodes` and look for `NotReady` status. If a node is down, check whether it's just one (maybe it's being rebooted) or several (oh dear). Pro tip: `kubectl describe node [node-name]` shows you why it's unhappy.
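Concretely, the node sweep might look like this (the node name `worker-3` is a placeholder; every command here is read-only, so it's safe mid-incident):

```shell
# Quick node health sweep
kubectl get nodes

# For a suspicious node, jump straight to its conditions
kubectl describe node worker-3 | sed -n '/Conditions:/,/Addresses:/p'

# One-liner: every node with its Ready condition status
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
```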
### Step 2: What's Actually Broken?
`kubectl get pods --all-namespaces | grep -v Running` - This shows you everything that's NOT running perfectly (the header line and any `Completed` pods will sneak through too). Focus on the `CrashLoopBackOff` and `Pending` pods first - they're the screamers in the room.
For any problematic pod: `kubectl describe pod [pod-name]` and scroll to the Events section. Kubernetes will literally tell you what's wrong 80% of the time. We just don't listen.
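If you're curious what that `grep -v Running` filter actually lets through, here it is against a mock capture of `kubectl get pods` output (fake data, no cluster needed - note the header line survives because it doesn't contain the word "Running"):

```shell
# Mock `kubectl get pods --all-namespaces` output, for illustration only
sample='NAMESPACE   NAME       READY   STATUS             RESTARTS
prod        api-1      1/1     Running            0
prod        api-2      0/1     CrashLoopBackOff   12
prod        worker-1   0/1     Pending            0'

# The triage filter: everything that is NOT Running
printf '%s\n' "$sample" | grep -v Running
```

You get the header plus the two broken pods - exactly the screamers you want to focus on.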
### Step 3: Check the Logs (But Not All of Them)
`kubectl logs [pod-name] --previous` if the pod keeps restarting. The `--previous` flag is your best friend - it shows you why the pod died last time. Without it, you're reading logs from the current instance, which might not have even finished starting.
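A sketch of the log-pulling sequence (the pod name `api-7d9f-x2k4j` and container name `app` are placeholders - substitute your own):

```shell
# Why did the pod die LAST time? (the useful logs)
kubectl logs api-7d9f-x2k4j --previous

# Multi-container pod? Name the container explicitly
kubectl logs api-7d9f-x2k4j -c app --previous

# Keep it focused: last 50 lines, with timestamps
kubectl logs api-7d9f-x2k4j --previous --tail=50 --timestamps
```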
## Common Cluster Fires and Their Actual Solutions
### Fire #1: "My Pods Are Stuck in Pending"
This usually means: "I asked for resources that don't exist." Check with `kubectl describe pod` - you'll see messages like "Insufficient cpu" or "Insufficient memory."
Real solution: Either reduce your resource requests, add more nodes, or check if there are pending node terminations in your cloud provider. Don't just keep creating pods hoping one will stick.
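A right-sizing sketch for the "reduce your requests" path - the numbers here are purely illustrative, so base yours on real usage from `kubectl top pod`:

```yaml
# Container resources inside a pod/deployment spec.
# Requests are what the scheduler must find room for:
# a 2-core request on a cluster of busy 2-core nodes = Pending forever.
resources:
  requests:
    cpu: 250m        # was 2000m - no node had 2 free cores
    memory: 256Mi
  limits:
    memory: 512Mi    # cap memory so one pod can't starve the node
```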
### Fire #2: "CrashLoopBackOff - The Musical"
Your pod starts, crashes, starts, crashes - it's like watching a toddler try to run. `kubectl logs --previous` will show you the application error. Common causes: wrong configuration, missing secrets, or the app can't connect to its database.
Real solution: Fix the application error. If it's a configuration issue, run `kubectl get configmap` and `kubectl get secret` to verify they exist, and check `kubectl describe pod` to confirm they're mounted correctly.
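For reference, this is the shape of the dependency that usually breaks - a container that won't start without its config and secret (the names `app-config` and `db-credentials` are placeholders):

```yaml
# Inside the container spec: if either reference is missing,
# the pod fails before your application code even runs.
envFrom:
  - configMapRef:
      name: app-config       # kubectl get configmap app-config
  - secretRef:
      name: db-credentials   # kubectl get secret db-credentials
```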
### Fire #3: "Services Returning 503s"
The service exists, but traffic isn't flowing. Run `kubectl get endpoints [service-name]`. If the endpoints are empty, your service selector doesn't match any pods. If endpoints exist but you still get 503s, check network policies with `kubectl get networkpolicy`.
Real solution: Fix your label selectors or network policies. Remember: services don't magically find pods - they match labels.
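A minimal sketch of the matching that has to line up (all names here are hypothetical):

```yaml
# Service: sends traffic to pods whose labels match the selector
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api            # must match the pod labels below - exactly
  ports:
    - port: 80
      targetPort: 8080  # the containerPort your app listens on
---
# Deployment: the pod template labels are what the selector matches
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api        # typo here = empty endpoints = 503s
    spec:
      containers:
        - name: api
          image: registry.example.com/api:1.2.3
          ports:
            - containerPort: 8080
```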
## How to Read Error Messages That Look Like Alphabet Soup
Kubernetes errors follow patterns. Here's your decoder ring:
- `ImagePullBackOff`: "I can't pull the container image you asked for." Check the image name, tag, and your registry permissions.
- `FailedScheduling`: "There's no room at the inn." Either no nodes match your nodeSelector/affinity rules, or there aren't enough resources.
- `FailedMount`: "I can't attach your storage." Check your PersistentVolumeClaims, StorageClasses, and cloud provider quotas.
- `ContainerCreating` for more than a minute: Usually means pulling a large image or waiting for storage. Check with `kubectl describe pod` for details.
## When to Admit Defeat and Escalate (Without Looking Incompetent)
You've been at it for 30 minutes. The sun is coming up. Here's how to escalate:
1. Document what you've tried: "Checked nodes (all Ready), pods (3 in CrashLoopBackOff), logs show DB connection timeout, verified DB is reachable from node."
2. State your hypothesis: "I think this is either a network policy blocking DB traffic or a credentials issue with the latest secret update."
3. Ask specific questions: "Has anyone modified network policies in the last hour? Was there a secret rotation?"
This makes you look methodical, not panicked. You're handing off context, not just a burning dumpster.
## The Post-Mortem Template That Actually Prevents Fires
After you fix it (congratulations!), write this down:
**Timeline:** [Detection → Escalation → Resolution]
**Root Cause:** [Not "K8s was down" - what SPECIFICALLY failed?]
**Detection Gap:** [Why did we find out from users instead of monitoring?]
**Action Items:** [Specific, assigned, with dates]
- [ ] Add alert for [specific metric that would have warned us]
- [ ] Update runbook with steps from this incident
- [ ] Fix [the actual broken thing] in code/config
The magic question: "What will we do differently so this never happens again?" Not "How do we fix it faster next time?"
## Pro Tips From Someone Who's Burned Things
1. Set up `kubectl get events --watch` in a separate terminal during incidents. New events appear in real time.
2. Use `kubectl get pods -o wide` to see which node each pod is on. Sometimes one bad node is taking down everything.
3. Before applying anything, add `--dry-run=client -o yaml` to see what you WOULD have created - it's a spell-check for your kubectl commands. (`kubectl delete` accepts `--dry-run` too, to preview what you'd remove.)
4. Label your resources like you'll be debugging at 3 AM (because you will). Meaningful labels beat clever ones every time.
5. Keep a "war room" notes file with every command you run and its output. This becomes your post-mortem draft.
## Conclusion: Don't Fight Fire With Fire
Kubernetes incidents feel like emergencies because everything is abstracted away until it breaks. But underneath the YAML and abstractions are the same old problems: resources, networking, and configuration.
The difference between a junior who panics and a senior who solves isn't magic knowledge - it's a systematic approach. Follow the triage checklist, understand what the errors actually mean, fix the root cause (not just the symptom), and document it so you only fight each fire once.
Now go update your runbook with that 5-minute checklist. Your future 2 AM self will thank you.