Failure Is a Feature
Why Chaos Engineering Is Not Enough: The Case for Failure Catalogs
Why chaos experiments miss the mark and how structured failure catalogs provide a more realistic approach to reliability engineering.
Chaos engineering has become the darling of the reliability engineering world. Kill a pod. Inject latency. Watch what happens. According to Gartner’s peer community survey, 59% of organizations now deploy chaos engineering, with the market projected to grow from $1.35 billion to nearly $8 billion by 2033. But after years of chaos experiments, enterprises still fall to the same failure classes. Why?
Because chaos is not the same as realism.
Key figures: chaos engineering adoption (organizations deploying, Gartner); configuration errors (share of misconfigurations that are parameter mistakes); outage cost (per hour of downtime); change failure rate (elite performers, DORA).
The Chaos Paradox
Here’s the uncomfortable truth: most chaos experiments test scenarios that rarely occur in production, while ignoring the failures that actually take systems down.
Chaos engineering focuses on physical failures: killing pods, dropping network packets, exhausting CPU and memory. These are dramatic experiments that feel productive. But research on misconfigurations shows that 70-85% of misconfigurations stem from mistakes in setting configuration parameters. Not hardware crashes. Not network partitions. Just someone setting a value incorrectly.
What Chaos Engineering Tests
- Pod termination
- Network packet loss
- CPU/memory stress
- Disk failures
What Actually Causes Outages
- Configuration errors
- Deployment mistakes
- Dependency timeouts
- Control-plane misbehavior
The mismatch is striking. We’re stress-testing for earthquakes while ignoring the leaky pipes.
The Name Problem
As Kolton Andrus of Gremlin notes: “Those of us that advocate for this practice start from a position of weakness. We’re viewed as agents of mayhem, out to cause pain and hoping to luck upon something valuable along the way, instead of smart engineers making calculated decisions and using precision tools to verify the system.”
Why Enterprises Keep Falling to Known Failure Classes
Every post-mortem tells a familiar story:
- Configuration drift: a flag was changed in staging but not production.
- Dependency timeout: a downstream service slowed down, causing cascading failures.
- Certificate expiration: nobody noticed the cert was expiring.
- Quota exhaustion: the system hit an API rate limit nobody knew existed.
- Rollout failure: the new version had a subtle bug in error handling.
According to ThousandEyes’ 2024 analysis, configuration changes were behind many of the year’s major outages. This isn’t new. Studies consistently show that human error — including misconfigurations, routine maintenance mistakes, and accidental deletions — remains one of the leading causes of tech outages.
None of these require “chaos” to test. They require understanding.
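Configuration drift, for example, can be checked without injecting any fault at all. The following is a minimal sketch, assuming each environment's configuration is available as a flat dictionary; `config_drift`, `MISSING`, and the sample keys are illustrative names, not part of any real tool.

```python
# Sentinel used when a key exists in one environment but not the other.
MISSING = "<missing>"


def config_drift(staging: dict, production: dict) -> dict:
    """Return keys whose values differ, mapped to (staging, production) pairs."""
    keys = sorted(set(staging) | set(production))
    return {
        key: (staging.get(key, MISSING), production.get(key, MISSING))
        for key in keys
        if staging.get(key, MISSING) != production.get(key, MISSING)
    }


# Hypothetical configuration snapshots for two environments.
staging = {"feature_x": True, "timeout_seconds": 30}
production = {"feature_x": False, "timeout_seconds": 30, "region": "eu"}

for key, (stg, prod) in config_drift(staging, production).items():
    print(f"{key}: staging={stg!r} production={prod!r}")
```

A check like this runs in milliseconds on every change, which is the kind of understanding-driven test the failure classes above call for.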
[Figure: Real Outage Causes vs. Chaos Test Coverage]
The Concept of a Failure Catalog
A failure catalog is a structured taxonomy of failure modes specific to your stack. Instead of random chaos, you systematically enumerate what can go wrong and build test scenarios for each.
For Kubernetes, this means documenting control-plane failures (API server throttling, etcd leader elections, scheduler resource exhaustion), workload failures (CrashLoopBackOff, Pending pods, OOMKilled containers), and networking failures (service unreachable, DNS resolution failures, ingress misconfiguration). For cloud infrastructure, it means IAM permission issues, API rate limiting, and availability zone degradation. For dependencies, it means database connection exhaustion, cache miss storms, and third-party API timeouts.
The catalog is specific to your stack, populated from your incidents, and designed to test what actually breaks.
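One way to make such a taxonomy concrete is to store it as data. The sketch below is only an illustration under assumed names: `Layer`, `FailureMode`, and `FailureCatalog` are hypothetical, and the entries are taken from the examples in the text rather than from any real incident history.

```python
from dataclasses import dataclass, field
from enum import Enum


class Layer(Enum):
    CONTROL_PLANE = "control-plane"
    WORKLOAD = "workload"
    NETWORKING = "networking"
    CLOUD = "cloud"
    DEPENDENCY = "dependency"


@dataclass(frozen=True)
class FailureMode:
    name: str
    layer: Layer
    # Hypothetical identifiers of the post-mortems that motivated this entry.
    incident_ids: tuple = ()


@dataclass
class FailureCatalog:
    modes: list = field(default_factory=list)

    def add(self, mode: FailureMode) -> None:
        # Reject duplicates so each failure mode is enumerated exactly once.
        if any(existing.name == mode.name for existing in self.modes):
            raise ValueError(f"duplicate failure mode: {mode.name}")
        self.modes.append(mode)

    def by_layer(self, layer: Layer) -> list:
        return [mode for mode in self.modes if mode.layer is layer]


catalog = FailureCatalog()
catalog.add(FailureMode("API server throttling", Layer.CONTROL_PLANE))
catalog.add(FailureMode("OOMKilled container", Layer.WORKLOAD))
catalog.add(FailureMode("DNS resolution failure", Layer.NETWORKING))
catalog.add(FailureMode("API rate limiting", Layer.CLOUD))
catalog.add(FailureMode("Database connection exhaustion", Layer.DEPENDENCY))
```

Keeping the catalog as plain data means it can be reviewed in pull requests and grown one entry at a time as incidents occur.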
Start With Your Post-Mortems
Review your last 20 incidents. Categorize each by root cause, system component, and whether it was physical (infrastructure died) or semantic (infrastructure misbehaved). You’ll likely find that 80% fall into a handful of recurring categories — and most don’t require chaos to reproduce.
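The review can be as simple as a tally. Here is a rough sketch, assuming incidents have already been labeled by hand; the record fields, incident IDs, and the helper names `summarize` and `recurring_share` are all hypothetical.

```python
from collections import Counter

# Hypothetical incident records; in practice these come from your post-mortems.
incidents = [
    {"id": "INC-1", "root_cause": "config drift", "component": "api-gateway", "kind": "semantic"},
    {"id": "INC-2", "root_cause": "config drift", "component": "billing", "kind": "semantic"},
    {"id": "INC-3", "root_cause": "dependency timeout", "component": "checkout", "kind": "semantic"},
    {"id": "INC-4", "root_cause": "certificate expiration", "component": "ingress", "kind": "semantic"},
    {"id": "INC-5", "root_cause": "node failure", "component": "cluster", "kind": "physical"},
]


def summarize(records):
    """Count incidents by root cause and by physical/semantic kind."""
    by_cause = Counter(record["root_cause"] for record in records)
    by_kind = Counter(record["kind"] for record in records)
    return by_cause, by_kind


def recurring_share(by_cause, top_n):
    """Fraction of all incidents explained by the top_n most common root causes."""
    total = sum(by_cause.values())
    top = sum(count for _, count in by_cause.most_common(top_n))
    return top / total if total else 0.0


by_cause, by_kind = summarize(incidents)
print(by_cause.most_common())
print(by_kind)
print(f"top-2 causes cover {recurring_share(by_cause, 2):.0%} of incidents")
```

The root causes that dominate the tally are natural first entries for the catalog.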
Chaos vs. Catalog: A Comparison
| Aspect | Chaos Engineering | Failure Catalogs |
|---|---|---|
| Approach | Random/probabilistic failures | Systematic enumeration |
| Focus | Physical infrastructure | Semantic behavior |
| Question answered | What if X dies? | What if X behaves badly? |
| Cost to run | Expensive (production-like env) | Cheap (simulators) |
| Reproducibility | Non-deterministic | Fully reproducible |
| Coverage | Probabilistic discovery | Guaranteed for cataloged failure modes |
The difference in cost is significant. Chaos engineering requires production-like infrastructure because you’re testing physical failures. Failure catalogs test semantic failures — API returning 429, scheduler making wrong decisions, configuration causing unexpected behavior — which can be simulated with mocks and lightweight tools.
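To illustrate how cheap a semantic failure can be to simulate, here is a minimal sketch of a fake dependency that returns HTTP 429 a fixed number of times. Everything in it is an assumption made for illustration: `Response`, `ThrottlingFake`, and `fetch_with_retry` are hypothetical names, and the retry policy is one simple choice, not a recommendation.

```python
from dataclasses import dataclass


@dataclass
class Response:
    status: int
    body: str = ""


class ThrottlingFake:
    """Stand-in for a dependency: returns 429 a fixed number of times, then 200."""

    def __init__(self, throttled_calls: int):
        self.remaining = throttled_calls
        self.calls = 0

    def get(self, path: str) -> Response:
        self.calls += 1
        if self.remaining > 0:
            self.remaining -= 1
            return Response(429)
        return Response(200, "ok")


def fetch_with_retry(client, path, max_attempts=3, sleep=lambda seconds: None):
    """Retry on 429 with exponential backoff; sleep is injectable so tests run instantly."""
    delay = 0.1
    response = client.get(path)
    for _ in range(max_attempts - 1):
        if response.status != 429:
            break
        sleep(delay)
        delay *= 2
        response = client.get(path)
    return response


# Scenario: the dependency throttles twice, then recovers.
recovering = ThrottlingFake(throttled_calls=2)
print(fetch_with_retry(recovering, "/orders").status, recovering.calls)

# Scenario: the dependency keeps throttling past the retry budget.
stuck = ThrottlingFake(throttled_calls=5)
print(fetch_with_retry(stuck, "/orders").status, stuck.calls)
```

Both scenarios are deterministic and need no cluster, which is the reproducibility and cost advantage the comparison table describes.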
Where Chaos Engineering Fits
Chaos engineering isn’t useless — it’s just not sufficient. Gartner’s survey found that increasing system complexity (68%) was the most common driver for adopting chaos engineering, followed by lack of preparedness during failures (50%) and unclear technical debt (49%). These are valid concerns.
Chaos engineering is good for validating that recovery mechanisms actually work, testing auto-scaling behavior under sudden load, building team confidence in handling failures, and occasionally discovering unknown-unknowns — failure modes nobody anticipated.
But chaos engineering is not good for testing configuration handling, validating business logic under failure conditions, simulating control-plane issues, or testing how your application handles dependency degradation. For those, you need structured scenarios with predictable outcomes.
The Combined Approach
Use chaos engineering to discover new failure modes you hadn’t considered. When you find one, add it to your failure catalog. Then use the catalog for systematic regression testing — ensuring every known failure mode is covered on every change.
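The regression step can be sketched as a registry that maps each catalog entry to a scenario and reports entries with no scenario yet. This is an illustrative outline only: `CATALOG`, `SCENARIOS`, `scenario`, `load_timeout`, and the scenario bodies are hypothetical, and a real suite would more likely live in a test framework.

```python
from typing import Callable, Dict, List

# Hypothetical catalog: names of known failure modes.
CATALOG = ["config value missing", "config value malformed", "certificate near expiry"]

SCENARIOS: Dict[str, Callable[[], bool]] = {}


def scenario(name: str):
    """Register a scenario function under a catalog entry name."""
    def register(fn):
        SCENARIOS[name] = fn
        return fn
    return register


def load_timeout(config: dict) -> float:
    """Hypothetical code under test: fall back to a default on absent or bad values."""
    try:
        return float(config.get("timeout_seconds", 30))
    except (TypeError, ValueError):
        return 30.0


@scenario("config value missing")
def missing_value() -> bool:
    return load_timeout({}) == 30.0


@scenario("config value malformed")
def malformed_value() -> bool:
    return load_timeout({"timeout_seconds": "thirty"}) == 30.0


def uncovered(catalog: List[str], scenarios: Dict[str, Callable[[], bool]]) -> List[str]:
    """Catalog entries that have no registered scenario yet."""
    return [name for name in catalog if name not in scenarios]


def run_all(scenarios: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run every registered scenario and record pass/fail per entry."""
    return {name: fn() for name, fn in scenarios.items()}


print("results:", run_all(SCENARIOS))
print("uncovered:", uncovered(CATALOG, SCENARIOS))
```

Failing the build when `uncovered` is non-empty is one possible way to make sure a newly discovered failure mode gets a scenario before it is forgotten.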
The Path Forward
The future of reliability engineering isn’t more chaos — it’s more intelligence. DORA metrics show that elite performers maintain a change failure rate around 7.5%, compared to 23% for low performers. The difference isn’t more chaos experiments. It’s better understanding of what breaks and systematic testing of known failure modes.
Failure catalogs give you repeatability (test the same scenarios consistently), coverage (ensure you’ve tested what actually breaks), efficiency (no need for expensive infrastructure), and learning (build institutional knowledge of failure modes).
The organizations that achieve high reliability aren’t the ones running the most chaos experiments. They’re the ones that systematically understand, enumerate, and test the failures that actually happen in production.
References
- Gartner Peer Community: Chaos Engineering Adoption
- Gremlin: Chaos Engineering - Necessary but Not Sufficient
- ACM: Empirical Study on Configuration Errors
- ThousandEyes: Configuration Change Trouble & 2024 Outage Trends
- Evolven: System Outages Top 8 Causes
- Codacy: How to Measure Change Failure Rate
- Splunk: Chaos Engineering Benefits and Challenges
This is part 1 of our “Failure Is a Feature” series. Next: What Is a Failure Catalog?