Control-Plane Failures: Cloud's New Single Point of Failure

You’ve designed for server failures. You’ve prepared for network partitions. But have you considered what happens when the cloud’s brain stops working?

In autumn 2025, a rare alignment of failures across AWS, Azure, and Cloudflare reduced large swaths of the public internet to error pages and timeouts. The common thread wasn’t attacks or capacity shortages — it was control-plane failures cascading into global outages.

AWS outage: 15 hours (US-EAST-1, October 2025)
Services affected: 141 AWS services in the cascade
Azure impact: $4.8B+ estimated economic cost
User reports: 17M on Downdetector on October 20

The Hidden Dependency

Every cloud operation depends on two layers: the control plane and the data plane. The control plane handles management operations — launching instances, modifying security groups, creating load balancers. The data plane handles actual workload traffic — your VMs running, your data flowing, your requests serving.

When the control plane fails, something counterintuitive happens. Your running infrastructure often keeps running. VMs stay up. Existing connections continue. Data flows normally. But you can’t change anything. No new instances. No config updates. No deployments. Auto-scaling fails. And things start breaking in ways that aren’t immediately obvious.

Aspect	Control Plane Outage	Data Plane Outage
Running instances	Keep running	Go down
Existing connections	Continue working	Drop
New resource creation	Fails completely	May work if control plane is up
Auto-scaling	Broken	May still function
Deployments	Fail	May succeed
Detection difficulty	Often subtle at first	Immediately obvious

The October 2025 Incidents

Two major control-plane failures in October 2025 demonstrated exactly how devastating these outages can be.

AWS US-EAST-1: October 20, 2025

A race condition in DynamoDB’s DNS management system caused DNS resolution errors that cascaded across 141 AWS services for over 15 hours. Applications couldn’t locate their databases. Lambda couldn’t fetch code. CloudWatch couldn’t store metrics. Slack, Atlassian, HMRC, Barclays, Lloyds, and Bank of Scotland all went down. Downdetector recorded 17 million user reports — a 970% spike from normal.

Azure Front Door: October 29, 2025

Nine days later, an unintentional configuration change in Azure Front Door — Microsoft’s global content delivery and load-balancing system — triggered an 8-9 hour outage affecting Azure and Microsoft 365 globally. At peak, over 18,000 users reported Azure issues and nearly 20,000 reported Microsoft 365 problems. Estimates place the economic impact between $4.8 billion and $16 billion.

The pattern is clear: a single control-plane component fails, and everything that depends on it fails in turn. Recovery time also varies widely by provider. According to Cherry Servers’ analysis, AWS outages average 1.5 hours and Google Cloud outages 5.8 hours, while Azure outages average 14.6 hours.

IAM Failures: The Silent Killer

Identity and Access Management is involved in nearly every cloud operation. When it breaks, nothing works — but the errors often look like something else entirely.

Permission Propagation Delays

IAM uses eventual consistency. When you create a new role, it doesn’t instantly exist everywhere. AWS, Azure, and Google Cloud all document this behavior: changes take time to propagate across the distributed system. Under normal conditions, this is seconds. During incidents, it can be minutes or longer.

The practical impact is severe. You create a new IAM role for your CI/CD pipeline. You immediately try to use it. You get “Access Denied.” Your deployment fails. You retry. It works. Your pipeline now has a race condition that will randomly fail in production.
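One way to remove that race is to poll until the new role is actually usable, with a hard deadline, before the pipeline moves on. The sketch below is a minimal illustration under stated assumptions: `wait_until_visible` and its `probe` callback are hypothetical names, and the probe is whatever cheap check you trust (for example, an attempt to assume the role). The clock and sleep functions are injectable so the logic can be tested without real waiting.

```python
import time
from typing import Callable


def wait_until_visible(
    probe: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it returns True or the deadline would be exceeded.

    `probe` is a hypothetical callback supplied by the caller; it should
    return True once the newly created resource is usable.
    """
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        # Give up if the next sleep would carry us past the deadline.
        if clock() + interval_s > deadline:
            return False
        sleep(interval_s)
```

A single successful probe does not prove the change has reached every endpoint, so a cautious pipeline may still retry the first real call.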

Credential Rotation Gaps

The most dangerous scenario is credential rotation. You create new credentials, update your application, revoke the old credentials — but the new credentials haven’t propagated yet. For a window of time, all API calls fail. AWS acknowledges this in their documentation, stating that “any changes that you make in IAM take time to become visible across endpoints.”
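The gap closes if the old credentials are revoked only after the new ones have been seen working. The following is a rough sketch of that ordering, not a provider API: every callback (`create_new`, `verify`, `switch_to`, `revoke_old`, `wait`) is a hypothetical hook you would wire to your own tooling, and requiring several consecutive successes is an assumption meant to reduce the chance of a lucky hit on one endpoint.

```python
from typing import Callable


def rotate_credentials(
    create_new: Callable[[], str],
    verify: Callable[[str], bool],
    switch_to: Callable[[str], None],
    revoke_old: Callable[[], None],
    wait: Callable[[], None],
    max_checks: int = 10,
    required_successes: int = 3,
) -> bool:
    """Rotate credentials, revoking the old ones only after the new ones work.

    Returns False (leaving the old credentials untouched) if the new
    credentials never pass `required_successes` consecutive checks.
    """
    new_key = create_new()
    streak = 0
    for _ in range(max_checks):
        streak = streak + 1 if verify(new_key) else 0
        if streak >= required_successes:
            switch_to(new_key)   # application starts using the new key
            revoke_old()         # only now is the old key removed
            return True
        wait()
    return False
```

If the function returns False, both sets of credentials exist and the application is still running on the old ones, which is the safe failure direction.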

STS Token Expiration

Long-running processes that don’t refresh their STS tokens will suddenly start failing with ExpiredTokenException. Temporary credentials are short-lived: an AssumeRole session defaults to one hour and can be configured from 15 minutes up to the role’s maximum session duration, which is at most 12 hours. If your batch job runs longer without refreshing, it fails partway through, often after hours of work. The fix is simple (implement automatic credential refresh), but the failure mode is painful.
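For code that manages temporary credentials itself, automatic refresh can be as small as a wrapper that fetches a new token shortly before the old one expires. This is a minimal sketch: `Credentials`, `RefreshingCredentials`, and the `fetch` callback are hypothetical names, and the five-minute margin is an arbitrary assumption. In real code `fetch` would call your provider's token service.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Credentials:
    token: str
    expires_at: float  # epoch seconds


class RefreshingCredentials:
    """Hands out credentials, refreshing them before they expire."""

    def __init__(
        self,
        fetch: Callable[[], Credentials],  # hypothetical: obtains a new token
        refresh_margin_s: float = 300.0,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch = fetch
        self._margin = refresh_margin_s
        self._clock = clock
        self._current: Optional[Credentials] = None

    def get(self) -> Credentials:
        # Refresh when nothing is cached or the token is inside the margin.
        if (
            self._current is None
            or self._clock() >= self._current.expires_at - self._margin
        ):
            self._current = self._fetch()
        return self._current
```

Calling `get()` before every batch step, rather than once at startup, is what keeps a multi-hour job from failing partway through.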

API Throttling: Death by Rate Limit

Every cloud service has rate limits. Hit them during normal operation and your system degrades gracefully. Hit them during an incident — when you’re trying to scale up, restart services, or gather diagnostic information — and you make everything worse.

Common Throttling Scenarios

Controller storms: Kubernetes controllers making too many AWS API calls when many pods are created simultaneously. The EC2 and EBS APIs get throttled, and your pods can’t attach volumes.

Observability overload: CloudWatch GetMetricData calls hitting limits during an incident when everyone’s checking dashboards. You can’t see what’s happening when you most need to.

Secrets fetching: Every pod fetches secrets on startup. Roll out a large deployment, and Secrets Manager API gets throttled. New pods can’t start.

Retry amplification: Your auto-scaler tries to launch instances. EC2 API throttled. Auto-scaler retries immediately. More throttling. Backoff kicks in. Scaling delayed. Application overwhelmed.

Cloud Provider	Error Code	HTTP Status
AWS	Throttling / RequestLimitExceeded	400 / 503
Azure	TooManyRequests	429 (with Retry-After header)
GCP	RESOURCE_EXHAUSTED	429
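Retry logic has to recognize these signals before it can react to them. As an illustration only, the table can be encoded as a small lookup; the function names here are hypothetical, and the Retry-After parser handles only the delay-in-seconds form of the header.

```python
from typing import Mapping, Optional

# Error codes and HTTP statuses taken from the table above.
THROTTLE_SIGNALS = {
    "aws": ({"Throttling", "RequestLimitExceeded"}, {400, 503}),
    "azure": ({"TooManyRequests"}, {429}),
    "gcp": ({"RESOURCE_EXHAUSTED"}, {429}),
}


def is_throttled(provider: str, error_code: str, http_status: int) -> bool:
    """Return True if the error looks like a rate-limit response."""
    codes, statuses = THROTTLE_SIGNALS[provider]
    # Require both to match: a bare 400 from AWS is usually not throttling.
    return error_code in codes and http_status in statuses


def retry_after_seconds(headers: Mapping[str, str]) -> Optional[float]:
    """Read a Retry-After header given in seconds; None if absent or a date."""
    for name, value in headers.items():
        if name.lower() == "retry-after":
            try:
                return float(value)
            except ValueError:
                return None  # HTTP-date form is not handled in this sketch
    return None
```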

According to AWS documentation, EC2 uses a token bucket algorithm for throttling, and you can request limit increases — but only up to 3x your existing limit per request.
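To make the token bucket idea concrete, here is a generic sketch of the algorithm. It is not AWS's implementation, and the capacity and refill rate are illustrative assumptions rather than real EC2 limits. The same class can be used client-side to keep your own call rate under a known quota.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity`, then a steady `refill_per_s` rate."""

    def __init__(
        self,
        capacity: float,
        refill_per_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self._clock = clock
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Add tokens for the elapsed time, never exceeding capacity.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_s)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False  # caller should wait or shed the request
```

The burst allowance is why a system can stay under its limits for months and then get throttled the moment an incident triggers many calls at once.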

Multi-Region Doesn’t Save You

A common misconception is that multi-region deployment protects against control-plane failures. It doesn’t — at least not completely.

IAM is global. Route 53 is global. CloudFront is global. S3 bucket names are global. When these global services fail, all your regions fail together. The October 2025 Azure Front Door outage was global precisely because Front Door is a global service; having resources in multiple regions didn’t help.

Regional control planes can still affect you even when they’re not global. If your primary region’s control plane fails and you need to fail over to another region, you might find that your DNS changes can’t propagate, your IAM roles aren’t available, or your cross-region replication is stuck.

Designing for Control Plane Resilience

Reduce Control Plane Dependencies

The fewer API calls you make, the fewer chances for throttling or failure. Cache configuration locally instead of fetching on every request. Pre-fetch and refresh IAM tokens before they expire. Batch and buffer log entries instead of sending each one individually.
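Batching is the easiest of these to show in code. The sketch below, with the hypothetical names `BatchBuffer` and `send`, collects items and makes one call per batch instead of one per entry. The size and age thresholds are arbitrary assumptions, and the buffer is unbounded, which a production version would need to cap.

```python
import time
from typing import Any, Callable, List


class BatchBuffer:
    """Collects items and sends them in batches to cut API call volume."""

    def __init__(
        self,
        send: Callable[[List[Any]], None],  # hypothetical: one API call per batch
        max_items: int = 100,
        max_age_s: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._send = send
        self._max_items = max_items
        self._max_age_s = max_age_s
        self._clock = clock
        self._items: List[Any] = []
        self._first_at = 0.0

    def __len__(self) -> int:
        return len(self._items)

    def add(self, item: Any) -> None:
        if not self._items:
            self._first_at = self._clock()  # age is measured from the oldest item
        self._items.append(item)
        too_old = self._clock() - self._first_at >= self._max_age_s
        if len(self._items) >= self._max_items or too_old:
            self.flush()

    def flush(self) -> bool:
        if not self._items:
            return True
        try:
            self._send(list(self._items))
        except Exception:
            return False  # keep items buffered; the next add() or flush() retries
        self._items.clear()
        return True
```

Because a failed send leaves the items in place, the same class doubles as a local buffer while the logging endpoint is unreachable.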

Implement Graceful Degradation

When the control plane is unavailable, your application should degrade gracefully rather than fail completely. If you can’t fetch fresh secrets, use cached values. If you can’t launch new instances, serve traffic with existing capacity. If you can’t write logs, buffer them locally.

Cache Everything You Can

Secrets, configuration, IAM tokens, DNS — anything that comes from the control plane should be cached locally with appropriate TTLs. Your application should continue serving requests even if it can’t reach the control plane for minutes at a time.
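A cache that tolerates control-plane outages needs two time limits: a TTL after which it tries to refresh, and a longer window during which it will still serve the old value if the refresh fails. The following sketch shows one way to do that; `StaleTolerantCache` and `loader` are hypothetical names, and both durations are assumptions to tune per data type (secrets usually warrant a shorter stale window than static configuration).

```python
import time
from typing import Any, Callable, Dict, Tuple


class StaleTolerantCache:
    """TTL cache that serves stale values when the loader fails."""

    def __init__(
        self,
        loader: Callable[[str], Any],  # hypothetical: fetches from the control plane
        ttl_s: float = 300.0,
        max_stale_s: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl_s = ttl_s
        self._max_stale_s = max_stale_s
        self._clock = clock
        self._entries: Dict[str, Tuple[Any, float]] = {}

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] < self._ttl_s:
            return entry[0]  # fresh hit: no API call at all
        try:
            value = self._loader(key)
        except Exception:
            # Refresh failed: fall back to the stale value if it is not too old.
            if entry is not None and now - entry[1] < self._ttl_s + self._max_stale_s:
                return entry[0]
            raise
        self._entries[key] = (value, now)
        return value
```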

Prepare for API Failures

Implement exponential backoff with jitter on all API calls. Set reasonable timeouts — don’t wait forever for a control plane that isn’t responding. Monitor your API call rates and alert when approaching limits. The AWS SDK has retry logic built in, but you need to configure it appropriately.
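For calls that do not go through an SDK's built-in retry logic, exponential backoff with jitter takes only a few lines. This sketch uses the "full jitter" variant, where each delay is a random value between zero and the capped exponential. The function names, base delay, cap, and attempt count are all illustrative assumptions.

```python
import random
import time
from typing import Any, Callable


def backoff_delay(
    attempt: int,
    base_s: float = 0.5,
    cap_s: float = 30.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap_s, base_s * 2 ** attempt)


def call_with_backoff(
    fn: Callable[[], Any],
    is_retryable: Callable[[Exception], bool],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> Any:
    """Call `fn`, retrying retryable errors with jittered exponential delays."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            # Re-raise on the last attempt or for errors that retries cannot fix.
            if attempt == max_attempts - 1 or not is_retryable(exc):
                raise
            sleep(backoff_delay(attempt, rng=rng))
```

The jitter matters as much as the exponent: without it, every client that was throttled at the same moment retries at the same moment.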

Test Control Plane Failures

This is where local emulation tools like LocalStack shine. Simulate IAM propagation delays: don’t let new roles work for 30 seconds after creation. Simulate API throttling: return the provider’s throttling error (429 on Azure and GCP, Throttling or RequestLimitExceeded on AWS) when call rates exceed thresholds. Your application should handle both gracefully.
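The same two faults can also be exercised in plain unit tests before any emulator is involved. The following is a hand-rolled, in-process fake, not LocalStack's API: `FakeControlPlane`, its methods, and both exception types are hypothetical, and the fixed-window call counter is a simplification of real throttling.

```python
import time
from typing import Callable, Dict


class AccessDeniedError(Exception):
    """Raised while a role has not yet 'propagated'."""


class ThrottledError(Exception):
    """Raised when the simulated call rate limit is exceeded."""


class FakeControlPlane:
    """In-process fake that injects propagation delay and throttling."""

    def __init__(
        self,
        propagation_delay_s: float = 30.0,
        max_calls_per_window: int = 5,
        window_s: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._delay = propagation_delay_s
        self._max_calls = max_calls_per_window
        self._window_s = window_s
        self._clock = clock
        self._roles: Dict[str, float] = {}  # role name -> creation time
        self._window_start = clock()
        self._calls = 0

    def _count_call(self) -> None:
        now = self._clock()
        if now - self._window_start >= self._window_s:
            self._window_start, self._calls = now, 0  # start a new window
        self._calls += 1
        if self._calls > self._max_calls:
            raise ThrottledError("rate limit exceeded")

    def create_role(self, name: str) -> None:
        self._count_call()
        self._roles[name] = self._clock()

    def assume_role(self, name: str) -> str:
        self._count_call()
        created = self._roles.get(name)
        # The role exists but is unusable until the delay has passed.
        if created is None or self._clock() - created < self._delay:
            raise AccessDeniedError(name)
        return f"session:{name}"
```

Pointing the application's retry and caching code at a fake like this shows quickly whether it treats "Access Denied" right after creation as fatal.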

The Consolidation Risk

AWS, Azure, and Google Cloud together control roughly two-thirds of the public cloud infrastructure market. That concentration explains why a single regional or control-plane fault at any of the Big Three is visible — and painful — across industries.

According to Parametrix’s 2024 Cloud Outage Report, critical outages increased from 40 in 2023 to 47 in 2024 — an 18% year-over-year jump — with aggregate critical-event duration rising to roughly 221 hours. Cloud infrastructure is becoming more reliable in some ways, but control-plane complexity is growing, and the blast radius when things fail is larger than ever.

[Chart: average outage duration by provider]

When the cloud’s brain has a problem, you need to recognize it quickly and respond appropriately. That means understanding the difference between control-plane and data-plane failures, knowing what still works when the control plane is down, and having fallback strategies that don’t depend on making more API calls.

This is part 1 of our “Cloud Failures Without the Cloud” series. Next: Simulating AWS & Azure Failures Without Running AWS or Azure.