The Kubernetes Failure Catalog
CrashLoopBackOff Is a Symptom, Not a Root Cause
Debugging the real causes behind CrashLoopBackOff, from rollout failures to probe misconfigurations.
CrashLoopBackOff is arguably the most frequently encountered and frustrating error state in Kubernetes. It tells you something is wrong, but nothing about what. And according to Spectro Cloud’s 2024 survey, 98% of organizations face challenges running Kubernetes in production — with misconfigurations being a leading cause.
Understanding CrashLoopBackOff
When a container exits, Kubernetes tries to restart it. If it exits again, Kubernetes waits before trying again. That wait time increases exponentially — 10 seconds, then 20, then 40, then 80, then 160, capping at 300 seconds (5 minutes). When you see CrashLoopBackOff, Kubernetes is telling you: “I’ve tried restarting this container multiple times and it keeps crashing. I’m backing off before trying again.”
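The doubling-with-a-cap schedule described above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not kubelet code: the function name is hypothetical, and the 10-second base and 300-second cap are simply the values quoted in the paragraph.

```python
def backoff_schedule(restarts, base=10, cap=300):
    """Return the wait in seconds before each of the first `restarts` restart attempts.

    Assumes the delay doubles after every crash and is clamped at `cap`,
    matching the 10s base and 300s ceiling quoted in the text.
    """
    delays = []
    delay = base
    for _ in range(restarts):
        delays.append(min(delay, cap))
        delay *= 2
    return delays


print(backoff_schedule(7))  # [10, 20, 40, 80, 160, 300, 300]
```

Summing the list shows why a crash-looping pod can look idle: by the sixth restart the container has spent more than ten minutes just waiting.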
The key insight: CrashLoopBackOff is not a cause. It’s a consequence. Something else is making your container exit, and Kubernetes is just managing the restart attempts.
Exit Codes Tell the Story
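Exit code 0 means the process exited cleanly; under the default restartPolicy: Always, a container that keeps exiting with 0 still ends up in CrashLoopBackOff, so check that its command is meant to be long-running. Exit code 1 means application error. Exit code 137 means the process was killed with SIGKILL (128 + 9), most often by the OOM killer. Exit code 143 means terminated by SIGTERM (128 + 15). The exit code is your first clue to the root cause.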
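The "128 + signal number" convention can be decoded mechanically. The sketch below uses Python's standard signal module to turn an exit code into a readable hint; the function name and the wording of the hints are illustrative assumptions, and signal names are resolved for the platform the script runs on (Linux is assumed).

```python
import signal


def describe_exit_code(code):
    """Translate a container exit code into a rough category."""
    if code == 0:
        return "exited cleanly"
    if code > 128:
        # Codes above 128 conventionally mean "terminated by signal (code - 128)".
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"killed by {name}"
    return "application error"


for code in (0, 1, 137, 143):
    print(code, describe_exit_code(code))
```

A result of "killed by SIGKILL" narrows the search but does not prove an OOM kill on its own; the Reason field in the pod description confirms it.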
The Six Real Causes
1. Application Startup Failure
The most common cause. The application crashes before it can serve traffic — usually because of missing environment variables, unavailable dependencies, or code bugs that surface on startup.
Check the previous container’s logs with kubectl logs <pod> --previous. You’ll typically see the application error right before the exit. Common culprits include missing database connection strings, API keys that weren’t mounted, or required services that aren’t available yet.
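How useful kubectl logs --previous is depends on what the application prints before it dies. One way to make missing configuration obvious is to fail fast with an explicit message. The sketch below assumes a Python service; the variable names DATABASE_URL and API_KEY are hypothetical placeholders, not something the article prescribes.

```python
import sys

# Hypothetical names; replace with whatever your service actually requires.
REQUIRED_VARS = ("DATABASE_URL", "API_KEY")


def missing_vars(environ, required=REQUIRED_VARS):
    """Return the required variables that are unset or empty."""
    return [name for name in required if not environ.get(name)]


def check_environment(environ):
    """Exit with code 1 and a clear log line if configuration is incomplete."""
    missing = missing_vars(environ)
    if missing:
        print("startup aborted, missing env vars: " + ", ".join(missing),
              file=sys.stderr)
        sys.exit(1)
```

Calling check_environment(os.environ) as the first step of startup means the previous container's log ends with the exact variable names instead of a stack trace from deep inside a database driver.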
According to Red Hat’s 2024 Kubernetes security report, 37% of organizations suffer inconsistencies between dev, staging, and production environments — a major source of “works on my machine” startup failures.
2. Liveness Probe Misconfiguration
The container runs fine, but Kubernetes thinks it’s unhealthy and kills it. This is surprisingly common and frustrating because everything looks correct.
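The trap: Your container takes 45 seconds to start, but your liveness probe starts checking at 30 seconds and tolerates only a single failure (failureThreshold: 1). Kubernetes sees a failed health check and kills the container before it’s ready. The container restarts, takes 45 seconds to start, gets killed at 30 seconds… forever. With the default failureThreshold of 3 and periodSeconds of 10, the same loop appears whenever startup takes longer than roughly initialDelaySeconds plus 20 seconds.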
The Probe Trap
If kubectl describe pod shows “Liveness probe failed: connection refused” followed by “Container killed, restarting,” your initialDelaySeconds is probably too short. For slow-starting applications, use a startupProbe instead — it runs first and gives your app time to initialize before liveness checks begin.
The fix: Set initialDelaySeconds high enough for your slowest startup, or better yet, use a startupProbe with a generous failureThreshold. A startup probe of 30 failures at 10-second intervals gives your app 5 minutes to start before Kubernetes gives up.
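The timing behind both the trap and the fix is simple arithmetic. The sketch below is an approximation under stated assumptions: probes fire exactly every periodSeconds starting at initialDelaySeconds, timeoutSeconds and kubelet jitter are ignored, and the defaults (period 10, failure threshold 3) are the Kubernetes defaults. The function names are hypothetical.

```python
def liveness_kill_deadline(initial_delay, period=10, failure_threshold=3):
    """Approximate seconds after container start at which a liveness probe
    that has never passed triggers a restart."""
    return initial_delay + (failure_threshold - 1) * period


def survives_startup(startup_seconds, initial_delay, period=10, failure_threshold=3):
    """True if the app is up before the probe failure that would kill it."""
    return startup_seconds <= liveness_kill_deadline(
        initial_delay, period, failure_threshold)


def startup_budget(failure_threshold, period):
    """Maximum seconds a startupProbe allows before the container is restarted."""
    return failure_threshold * period


print(survives_startup(45, initial_delay=30, failure_threshold=1))  # False
print(survives_startup(45, initial_delay=30))                       # True
print(startup_budget(30, 10))                                       # 300
```

The first two lines show how much the outcome depends on failureThreshold, which is one reason a dedicated startupProbe is easier to reason about than a tuned initialDelaySeconds.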
3. OOMKilled
The container uses more memory than its limit allows. The Linux kernel’s OOM killer terminates the process with SIGKILL (exit code 137).
This is one of the most common causes of CrashLoopBackOff in production. Check kubectl describe pod for “Last State: Terminated, Reason: OOMKilled.” Common causes include memory limits set too low for the workload, memory leaks that accumulate over time, or JVM heap misconfiguration where the default heap exceeds the container limit.
| Scenario | Symptom | Solution |
|---|---|---|
| Limit too low | Immediate OOMKill on startup | Increase memory limit based on actual usage |
| Memory leak | Works initially, crashes after hours | Profile application, fix the leak |
| JVM misconfiguration | Java apps exceed container limit | Set -XX:MaxRAMPercentage=75.0 |
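For the JVM row, the effect of -XX:MaxRAMPercentage is easy to estimate: the maximum heap becomes that percentage of the container's memory limit, and the remainder has to cover everything outside the heap (metaspace, thread stacks, direct buffers). A rough sketch, with a hypothetical function name and sizes in MiB:

```python
def max_heap_mib(container_limit_mib, max_ram_percentage=75.0):
    """Approximate JVM max heap for a given container memory limit."""
    return container_limit_mib * max_ram_percentage / 100.0


limit = 1024
heap = max_heap_mib(limit)
print(f"heap: {heap:.0f} MiB, left for non-heap memory: {limit - heap:.0f} MiB")
```

If the non-heap remainder is too small for the workload, the container can still be OOMKilled even though the heap itself never fills up.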
According to Fairwinds’ 2024 Kubernetes Benchmark Report, overcommit is very common: the sum of all memory limits on a node can exceed the node’s capacity. When enough containers use more memory than they requested at the same time, the node runs out of memory and pods get killed to free resources.
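Overcommit is possible because the scheduler places pods by their requests, not their limits. The sketch below makes the gap visible for a single node; the dictionary keys, pod sizes, and the 6144 MiB allocatable figure are invented for illustration.

```python
def memory_commitment(pods, allocatable_mib):
    """Compare summed requests and summed limits against node allocatable memory."""
    requested = sum(p["request_mib"] for p in pods)
    limited = sum(p["limit_mib"] for p in pods)
    return {
        "request_ratio": requested / allocatable_mib,
        "limit_ratio": limited / allocatable_mib,
    }


# Three hypothetical pods that each request 1 GiB but may burst to 3 GiB.
pods = [{"request_mib": 1024, "limit_mib": 3072} for _ in range(3)]
print(memory_commitment(pods, allocatable_mib=6144))
# {'request_ratio': 0.5, 'limit_ratio': 1.5}
```

A limit ratio above 1.0 means the node cannot honor every limit at once, so simultaneous spikes end with something being killed.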
4. Image Pull Errors
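The container image cannot be pulled. This shows as ErrImagePull or ImagePullBackOff rather than CrashLoopBackOff, because the container never gets far enough to crash, but the two are easy to confuse in kubectl get pods output and often appear during the same bad rollout.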
Check kubectl describe pod for image pull errors. Common causes: typo in registry/image/tag, missing or expired imagePullSecrets, Docker Hub rate limiting, or network policies blocking egress to the registry.
Test pulling the image manually with docker pull to verify credentials and network access work from your environment.
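To tell an image pull problem apart from a crashing container programmatically, the waiting reason in the pod status is enough. The sketch below works on the parsed output of kubectl get pod <pod> -o json; the function names are hypothetical, and the set of reasons is a small illustrative list rather than an exhaustive one.

```python
IMAGE_PULL_REASONS = {"ErrImagePull", "ImagePullBackOff", "InvalidImageName"}


def waiting_reasons(pod):
    """Map container name to its waiting reason, for containers that are waiting."""
    reasons = {}
    for status in pod.get("status", {}).get("containerStatuses", []):
        waiting = status.get("state", {}).get("waiting")
        if waiting:
            reasons[status["name"]] = waiting.get("reason", "")
    return reasons


def has_image_pull_problem(pod):
    """True if any container is waiting because its image could not be pulled."""
    return any(r in IMAGE_PULL_REASONS for r in waiting_reasons(pod).values())
```

Feeding it json.load(sys.stdin) from a piped kubectl command is one way to use it in a triage script.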
5. Volume Mount Failures
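The container can’t start because volumes aren’t ready. Secrets or ConfigMaps don’t exist, PVCs aren’t bound, or there are permission issues with the mounted volume. A missing volume source usually leaves the pod sitting in ContainerCreating; permission problems, by contrast, let the container start and then crash when it touches the volume.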
Check kubectl describe pod for “Unable to mount volumes” or “MountVolume.SetUp failed.” The most common cause is a referenced Secret or ConfigMap that doesn’t exist — often due to a typo in the name or the resource being in the wrong namespace.
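A name typo is easy to catch before deploying by comparing what the pod spec references with what exists in the namespace. This sketch only looks at plain secret and configMap volumes (not projected volumes, envFrom, or optional sources), and it assumes you have already collected the existing names, for example from kubectl get secrets and kubectl get configmaps. All function and resource names are hypothetical.

```python
def volume_sources(pod_spec):
    """Collect Secret and ConfigMap names referenced by a pod spec's volumes."""
    secrets, configmaps = set(), set()
    for volume in pod_spec.get("volumes", []):
        if "secret" in volume:
            secrets.add(volume["secret"]["secretName"])
        if "configMap" in volume:
            configmaps.add(volume["configMap"]["name"])
    return secrets, configmaps


def missing_volume_sources(pod_spec, existing_secrets, existing_configmaps):
    """Return (missing secrets, missing configmaps), each sorted by name."""
    secrets, configmaps = volume_sources(pod_spec)
    return (sorted(secrets - set(existing_secrets)),
            sorted(configmaps - set(existing_configmaps)))
```

Because Secrets and ConfigMaps are namespaced, the existing-name sets must come from the same namespace the pod will run in.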
6. Rollout Failures
The new version of the app is broken. Old pods are still running, new pods are in CrashLoopBackOff, and the deployment is stuck waiting for the new version to become healthy.
This is where environment mismatches bite hardest. The new version works in staging but fails in production because of missing secrets, different resource constraints, or configuration incompatibilities.
Quick Rollback
When a rollout is failing, kubectl rollout undo deployment/<name> reverts to the previous version immediately. Use kubectl rollout history deployment/<name> to see available revisions and kubectl rollout undo deployment/<name> --to-revision=N to target a specific one.
The Debugging Approach
When you encounter CrashLoopBackOff, work through these steps:
First, check the logs. Run kubectl logs <pod> --previous to see what the container printed before it exited. Application errors will usually be visible here.
Second, check the pod description. Run kubectl describe pod <pod> and look at the Events section. OOMKilled, liveness probe failures, image pull errors, and volume mount failures all show up here with specific error messages.
Third, check the exit code. In the pod description, look for “Last State: Terminated” and the exit code. Exit code 137 means the container was killed with SIGKILL, which is an OOM kill if the Reason field says OOMKilled. Exit code 1 is application error. The exit code points you toward the category of problem.
Fourth, check cluster events. Run kubectl get events --field-selector involvedObject.name=<pod> --sort-by=.lastTimestamp for a chronological view of what happened to this pod.
Finally, if the container runs long enough, try kubectl exec -it <pod> -- sh to get a shell and investigate the environment: check that expected files exist, environment variables are set, and dependencies are reachable. This only works if the image ships a shell.
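The second and third steps can be condensed into one pass over the pod object. The sketch below reads the parsed output of kubectl get pod <pod> -o json and returns, per container, the restart count plus the exit code and reason of the last termination. The function name is hypothetical; the field names are those of the pod status.

```python
def last_terminations(pod):
    """Return (container, restart_count, exit_code, reason) for every container
    that has a recorded previous termination."""
    rows = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("lastState", {}).get("terminated")
        if terminated:
            rows.append((
                status["name"],
                status.get("restartCount", 0),
                terminated.get("exitCode"),
                terminated.get("reason", ""),
            ))
    return rows
```

A row such as ("api", 7, 137, "OOMKilled") points at memory limits, while exit code 1 with reason "Error" sends you back to the logs.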
Preventing CrashLoopBackOff
Use Startup Probes for Slow-Starting Apps
If your application takes more than a few seconds to start, configure a startup probe. It runs first and disables liveness/readiness checks until it succeeds:
startupProbe:
  httpGet:
    path: /health
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
This gives your app up to 5 minutes to start before Kubernetes considers it failed.
Set Appropriate Resource Limits
Base your limits on actual observed usage, not guesses. Use kubectl top pod to see current memory and CPU consumption. Set requests to what you normally need and limits with buffer for spikes.
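One way to turn observed usage into numbers is sketched below. The policy is an assumption chosen for illustration, not a rule from this article: the request is the median of the samples and the limit is the observed peak plus 20% headroom. The sample values stand in for readings collected from kubectl top pod over time.

```python
import statistics


def suggest_memory(samples_mib, headroom=1.2):
    """Suggest (request, limit) in MiB from observed memory samples.

    Assumed policy: request = median usage, limit = peak usage * headroom.
    """
    request = statistics.median(samples_mib)
    limit = max(samples_mib) * headroom
    return round(request), round(limit)


observed = [400, 420, 450, 610, 430]  # hypothetical readings in MiB
print(suggest_memory(observed))  # (430, 732)
```

The more of the workload's real cycle the samples cover (startup, peak traffic, batch jobs), the more trustworthy the suggestion.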
Validate Before Deploy
Catch configuration errors before they hit production. Use kubectl apply --dry-run=server -f deployment.yaml to validate against the cluster’s current state. Lint manifests with a schema validator such as kubeconform (the successor to the now-unmaintained kubeval). And test in staging with production-like configuration: 37% of organizations have inconsistencies between environments that cause exactly these kinds of failures.
Monitor Container Restarts
Set up alerts on restart frequency. One restart during an update is fine. Multiple restarts within minutes indicate a problem. A Prometheus alert on increase(kube_pod_container_status_restarts_total[1h]) > 3 catches containers that are repeatedly crashing.
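The logic of that alert can be illustrated offline. The sketch below is a simplification, not a reimplementation of PromQL: it takes the difference between the first and last restart counts inside the window and ignores the extrapolation and counter-reset handling that increase() performs. Function names and sample data are hypothetical.

```python
def restarts_in_window(samples, now, window_seconds=3600):
    """samples: list of (unix_timestamp, restart_count), sorted by timestamp."""
    counts = [count for ts, count in samples if now - window_seconds <= ts <= now]
    if len(counts) < 2:
        return 0
    return counts[-1] - counts[0]


def should_alert(samples, now, window_seconds=3600, threshold=3):
    """True when the container restarted more than `threshold` times in the window."""
    return restarts_in_window(samples, now, window_seconds) > threshold


crashing = [(0, 2), (1200, 3), (2400, 5), (3600, 7)]
print(should_alert(crashing, now=3600))  # True: 5 restarts in the last hour
```

A single restart during a deploy stays well under the threshold, which keeps the alert quiet for routine updates.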
Kubernetes 1.34: Per-Container Restart Policies
Kubernetes 1.34 introduced per-container restart policies as an alpha feature. You can now set different restart behaviors for main containers versus sidecars instead of relying only on the pod-level restartPolicy, so a failing logging sidecar can be handled differently from your main application container.
The Bottom Line
CrashLoopBackOff is Kubernetes telling you there’s a problem. Your job is to find the real cause — whether that’s misconfigured probes, resource limits, missing dependencies, or broken application code. The exit code and pod events are your starting points. The logs usually tell you the rest.
As one Kubernetes troubleshooting guide puts it: “Many junior engineers restart pods as the first step. Don’t. You’re just masking the root problem.”
References
- Spectro Cloud: 2024 State of Production Kubernetes
- Red Hat: Kubernetes Adoption, Security, and Market Trends 2024
- Fairwinds: 2024 Kubernetes Benchmark Report
- Sysdig: Kubernetes OOM and CPU Throttling
- Google Cloud: Troubleshoot CrashLoopBackOff Events
- Kubernetes Blog: Per-Container Restart Policies in 1.34
This is part 4 of our “Kubernetes Failure Catalog” series. Next up: Cloud Control-Plane Failures.