Why Chaos Engineering Isn't Enough: Use a Failure Catalog

Chaos engineering has become the darling of the reliability engineering world. Kill a pod. Inject latency. Watch what happens. According to Gartner’s peer community survey, 59% of organizations now deploy chaos engineering, with the market projected to grow from $1.35 billion to nearly $8 billion by 2033. But after years of chaos experiments, enterprises still fall to the same failure classes. The fix is structural: a failure catalog of the modes that actually break production, including the semantic failures that cause roughly 90% of outages.

Why do the same failures keep recurring? Because chaos is not the same as realism.

Adopting Chaos: 59% of organizations deploying (Gartner)
Config Errors: 70-85% of misconfigs are parameter mistakes
Outage Cost: $300K+ per hour of downtime
Change Fail Rate: 7.5% for elite performers (DORA)

The Chaos Paradox

Here’s the uncomfortable truth: most chaos experiments test scenarios that rarely occur in production, while ignoring the failures that actually take systems down.

Chaos engineering focuses on physical failures: killing pods, dropping network packets, exhausting CPU and memory. These are dramatic experiments that feel productive. But research on misconfigurations shows that 70-85% of them are simple mistakes in setting configuration parameters. Not hardware crashes. Not network partitions. Just someone setting a value incorrectly.

What Chaos Engineering Tests

Pod termination
Network packet loss
CPU/memory stress
Disk failures

What Actually Causes Outages

Configuration errors
Deployment mistakes
Dependency timeouts
Control-plane misbehavior

The mismatch is striking. We’re stress-testing for earthquakes while ignoring the leaky pipes.

The Name Problem

As Kolton Andrus of Gremlin notes: “Those of us that advocate for this practice start from a position of weakness. We’re viewed as agents of mayhem, out to cause pain and hoping to luck upon something valuable along the way, instead of smart engineers making calculated decisions and using precision tools to verify the system.”

Why Enterprises Keep Falling to Known Failure Classes

Every post-mortem tells a familiar story. Configuration drift — a flag was changed in staging but not production. Dependency timeout — a downstream service slowed down, causing cascading failures. Certificate expiration — nobody noticed the cert was expiring. Quota exhaustion — hit an API rate limit nobody knew existed. Rollout failure — the new version had a subtle bug in error handling.

According to ThousandEyes’ 2024 analysis, configuration changes were behind many of the year’s major outages. This isn’t new. Studies consistently show that human error — including misconfigurations, routine maintenance mistakes, and accidental deletions — remains one of the leading causes of tech outages.

None of these require “chaos” to test. They require understanding.
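Configuration drift is a good example of a failure class that can be checked with plain code. The sketch below compares two flat configuration mappings and reports every key that differs. The function name, the MISSING sentinel, and the flag values are all invented for illustration; a real setup would load these mappings from wherever the flags actually live.

```python
# Sentinel for a key that exists in one environment but not the other.
MISSING = "<missing>"


def config_drift(staging: dict, production: dict) -> dict:
    """Map each drifting key to its (staging, production) values."""
    drift = {}
    for key in sorted(staging.keys() | production.keys()):
        s = staging.get(key, MISSING)
        p = production.get(key, MISSING)
        if s != p:
            drift[key] = (s, p)
    return drift


# Hypothetical flag sets, invented for illustration.
staging = {"retry_limit": 5, "new_checkout": True, "timeout_ms": 2000}
production = {"retry_limit": 5, "new_checkout": False}

for key, (s, p) in config_drift(staging, production).items():
    print(f"{key}: staging={s!r} production={p!r}")
```

A check like this runs in milliseconds on every change and needs no fault injection at all.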

[Figure: Real Outage Causes vs Chaos Test Coverage]

The Concept of a Failure Catalog

A failure catalog is a structured taxonomy of failure modes specific to your stack. Instead of random chaos, you systematically enumerate what can go wrong and build test scenarios for each.

For Kubernetes, this means documenting control-plane failures (API server throttling, etcd leader elections, scheduler resource exhaustion), workload failures (CrashLoopBackOff, Pending pods, OOMKilled containers), and networking failures (service unreachable, DNS resolution failures, ingress misconfiguration). For cloud infrastructure, it means IAM permission issues, API rate limiting, and availability zone degradation. For dependencies, it means database connection exhaustion, cache miss storms, and third-party API timeouts.

The catalog is specific to your stack, populated from your incidents, and designed to test what actually breaks.
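One way to picture a catalog is as a small, typed list of failure modes, each linked back to the incidents that motivated it. This is a minimal sketch, not a prescribed schema: the field names, the mode identifiers, and the incident IDs (such as "INC-101") are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    mode_id: str              # stable identifier used by tests and reports
    layer: str                # e.g. "control-plane", "workload", "networking"
    description: str
    incident_ids: tuple = ()  # post-mortems that motivated this entry


# Illustrative entries only; the incident IDs are hypothetical.
CATALOG = [
    FailureMode("k8s-api-throttling", "control-plane",
                "API server throttling", ("INC-101",)),
    FailureMode("k8s-crashloop", "workload", "CrashLoopBackOff"),
    FailureMode("k8s-oomkilled", "workload",
                "OOMKilled containers", ("INC-087", "INC-112")),
    FailureMode("dns-resolution", "networking", "DNS resolution failures"),
    FailureMode("db-conn-exhaustion", "dependency",
                "Database connection exhaustion", ("INC-095",)),
]


def group_by_layer(catalog) -> dict:
    """Return {layer: [mode_id, ...]} in catalog order."""
    grouped = {}
    for mode in catalog:
        grouped.setdefault(mode.layer, []).append(mode.mode_id)
    return grouped


for layer, mode_ids in group_by_layer(CATALOG).items():
    print(f"{layer}: {', '.join(mode_ids)}")
```

Keeping the identifiers stable matters more than the exact fields, since tests and reports can then refer to a failure mode by name.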

Start With Your Post-Mortems

Review your last 20 incidents. Categorize each by root cause, system component, and whether it was physical (infrastructure died) or semantic (infrastructure misbehaved). You’ll likely find that 80% fall into a handful of recurring categories — and most don’t require chaos to reproduce.
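The categorization step can be sketched in a few lines. The incident records below are invented sample data, not real results, and the labels ("config-drift", "semantic", and so on) are assumptions; the point is only to show how tallying root causes exposes the recurring categories.

```python
from collections import Counter

# Invented sample data: (root_cause, component, kind).
INCIDENTS = [
    ("config-drift", "api-gateway", "semantic"),
    ("config-drift", "billing", "semantic"),
    ("dependency-timeout", "checkout", "semantic"),
    ("cert-expiry", "ingress", "semantic"),
    ("config-drift", "search", "semantic"),
    ("node-failure", "worker-pool", "physical"),
    ("dependency-timeout", "search", "semantic"),
    ("quota-exhaustion", "email", "semantic"),
    ("config-drift", "checkout", "semantic"),
    ("disk-failure", "database", "physical"),
]


def summarize(incidents, top_n=2):
    """Return (top root causes, their share of incidents, semantic share)."""
    causes = Counter(cause for cause, _, _ in incidents)
    kinds = Counter(kind for _, _, kind in incidents)
    top = causes.most_common(top_n)
    top_share = sum(count for _, count in top) / len(incidents)
    semantic_share = kinds["semantic"] / len(incidents)
    return top, top_share, semantic_share


top, top_share, semantic_share = summarize(INCIDENTS)
print("Top causes:", top)
print(f"Share covered by top causes: {top_share:.0%}")
print(f"Semantic share: {semantic_share:.0%}")
```

The categories that dominate this tally are the natural first entries for the catalog.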

Chaos vs. Catalog: A Comparison

Aspect	Chaos Engineering	Failure Catalogs
Approach	Random/probabilistic failures	Systematic enumeration
Focus	Physical infrastructure	Semantic behavior
Question answered	What if X dies?	What if X behaves badly?
Cost to run	Expensive (production-like env)	Cheap (simulators)
Reproducibility	Non-deterministic	Fully reproducible
Coverage	Probabilistic discovery	Guaranteed coverage

The difference in cost is significant. Chaos engineering requires production-like infrastructure because you’re testing physical failures. Failure catalogs test semantic failures — API returning 429, scheduler making wrong decisions, configuration causing unexpected behavior — which can be simulated with mocks and lightweight tools.
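To make the "API returning 429" case concrete, here is a minimal sketch of a semantic failure simulated with a hand-written fake instead of real infrastructure. RateLimitedAPI and fetch_with_retry are hypothetical names invented for this example, and the backoff policy is an assumption rather than a recommendation.

```python
class RateLimitedAPI:
    """Hypothetical fake dependency: answers 429 for the first N calls."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def get(self):
        self.calls += 1
        if self.calls <= self.failures:
            return 429, None
        return 200, {"ok": True}


def fetch_with_retry(api, max_attempts=3, base_delay=0.1, sleep=lambda s: None):
    """Retry on 429 with exponential backoff; fail loudly when exhausted."""
    for attempt in range(max_attempts):
        status, body = api.get()
        if status == 200:
            return body
        if status != 429:
            raise RuntimeError(f"unexpected status {status}")
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # wait before the next attempt
    raise RuntimeError("rate limited: retries exhausted")


# Scenario: the dependency throttles twice, then recovers.
delays = []
api = RateLimitedAPI(failures=2)
result = fetch_with_retry(api, sleep=delays.append)
print(result, api.calls, delays)
```

Because the sleep function is injected, the scenario runs instantly and gives the same result on every run, which is what makes it usable as a regression test.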

Where Chaos Engineering Fits

Chaos engineering isn’t useless — it’s just not sufficient. Gartner’s survey found that increasing system complexity (68%) was the most common driver for adopting chaos engineering, followed by lack of preparedness during failures (50%) and unclear technical debt (49%). These are valid concerns.

Chaos engineering is good for validating that recovery mechanisms actually work, testing auto-scaling behavior under sudden load, building team confidence in handling failures, and occasionally discovering unknown-unknowns — failure modes nobody anticipated.

But chaos engineering is not good for testing configuration handling, validating business logic under failure conditions, simulating control-plane issues, or testing how your application handles dependency degradation. For those, you need structured scenarios with predictable outcomes.

The Combined Approach

Use chaos engineering to discover new failure modes you hadn’t considered. When you find one, add it to your failure catalog. Then use the catalog for systematic regression testing — ensuring every known failure mode is covered on every change.
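The regression half of this loop can be sketched as a registry that ties each catalog entry to a scenario and flags entries with no coverage. Everything here is illustrative: the mode IDs, the scenario decorator, and the placeholder scenario bodies are assumptions, and a real suite would likely live inside an existing test framework.

```python
# Hypothetical catalog IDs; a real catalog would be loaded from a shared file.
CATALOG_IDS = {"api-429", "cert-expiry", "config-drift"}

# Maps a failure-mode ID to the scenario function that exercises it.
SCENARIOS = {}


def scenario(mode_id):
    """Decorator that registers a scenario under a catalog ID."""
    def register(fn):
        SCENARIOS[mode_id] = fn
        return fn
    return register


@scenario("api-429")
def rate_limit_is_retried():
    return True  # placeholder: a real scenario would drive a fake dependency


@scenario("config-drift")
def environments_match():
    return True  # placeholder: a real scenario would diff the two configs


def uncovered(catalog_ids, scenarios):
    """Catalog entries that have no registered scenario."""
    return sorted(catalog_ids - set(scenarios))


def run_all(scenarios):
    """Run every scenario and record pass/fail per failure mode."""
    return {mode_id: bool(fn()) for mode_id, fn in sorted(scenarios.items())}


print("Uncovered:", uncovered(CATALOG_IDS, SCENARIOS))
print("Results:", run_all(SCENARIOS))
```

Failing the build when the uncovered list is non-empty is one way to make sure a newly discovered failure mode cannot be added to the catalog without a test.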

The Path Forward

The future of reliability engineering isn’t more chaos — it’s more intelligence. DORA metrics show that elite performers maintain a change failure rate around 7.5%, compared to 23% for low performers. The difference isn’t more chaos experiments. It’s better understanding of what breaks and systematic testing of known failure modes.

Failure catalogs give you repeatability (test the same scenarios consistently), coverage (ensure you’ve tested what actually breaks), efficiency (no need for expensive infrastructure), and learning (build institutional knowledge of failure modes).

The organizations that achieve high reliability aren’t the ones running the most chaos experiments. They’re the ones that systematically understand, enumerate, and test the failures that actually happen in production.

This is part 1 of our “Failure Is a Feature” series. Next: What Is a Failure Catalog?