Azure, AWS, and Cloudflare all experienced significant outages in recent weeks. Different providers, same story: configuration changes triggering cascading failures across infrastructure that’s supposed to be resilient.
The interesting part isn’t that infrastructure fails. It’s what these failures expose about the gap between architected resilience and actual resilience.
The multi-cloud gap
Companies might use AWS for one application and Azure for another, but any given application typically runs on a single cloud. There’s redundancy within that provider (multiple regions, availability zones), but the provider itself is treated as permanent infrastructure.
Then Cloudflare goes down and everything stops.
The pattern shows up consistently: sophisticated redundancy for compute, single-provider dependency for CDNs, DNS, and edge infrastructure. It’s like installing a backup generator but running the whole house through a single electrical panel.
Configuration as failure mode
All three outages share the same root cause pattern: configuration changes, not hardware failures or attacks.
Azure’s outage started with a networking configuration change that created inconsistent state. AWS’s disruption began when two automated systems tried to update the same database simultaneously. Cloudflare’s global failure came from a database permissions change that corrupted the Bot Management system.
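To make that failure mode concrete: when two automation processes read the same configuration and each writes back its own view without coordinating, the last writer silently wins and the system lands in a state neither intended. Below is a minimal sketch of that race and of the optimistic version check that would reject the stale write. Everything here (the store, the job names) is hypothetical, not a reconstruction of any provider’s internals.

```python
import threading

class ConflictError(Exception):
    """Raised when a write is based on a stale read."""

class ConfigStore:
    """Toy in-memory config store with an optimistic version check."""

    def __init__(self):
        self._lock = threading.Lock()
        self._record = {"version": 1, "endpoints": ["10.0.0.1"]}

    def read(self):
        with self._lock:
            return dict(self._record)

    def write(self, endpoints, expected_version):
        with self._lock:
            if self._record["version"] != expected_version:
                # Someone else updated the record since we read it:
                # reject instead of silently overwriting their change.
                raise ConflictError(
                    f"expected v{expected_version}, store is at v{self._record['version']}"
                )
            self._record = {"version": expected_version + 1, "endpoints": endpoints}

def automation_job(store, name, endpoints):
    snapshot = store.read()                 # plan from a snapshot of current state
    try:
        store.write(endpoints, expected_version=snapshot["version"])
        print(f"{name}: update applied")
    except ConflictError as err:
        # Back off and re-plan from the new state instead of clobbering it.
        print(f"{name}: conflict detected ({err}), will retry")

store = ConfigStore()
jobs = [
    threading.Thread(target=automation_job, args=(store, "automation-a", ["10.0.0.2"])),
    threading.Thread(target=automation_job, args=(store, "automation-b", ["10.0.0.3"])),
]
for j in jobs:
    j.start()
for j in jobs:
    j.join()
```

Without the version check, both jobs report success and the final state depends on scheduling luck; with it, the loser finds out immediately and can re-plan.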
Infrastructure complexity creates failure modes that are hard to predict. Routine configuration changes can trigger cascading failures across regions or global networks.
This shifts the threat model. Traditional redundancy focuses on external threats: datacenter failures, provider outages, hardware degradation. But when configuration complexity is the primary failure mode, redundancy alone doesn’t solve it. You need loose coupling so failures don’t cascade.
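Loose coupling usually means something concrete at the code level: calls to a dependency go through a guard that fails fast and falls back once the dependency starts misbehaving, instead of letting timeouts and retries pile up. A minimal circuit-breaker sketch, with made-up thresholds and a hypothetical bot-check call:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after enough consecutive failures, stop
    calling the dependency for a cooldown window and use the fallback."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures   # consecutive failures before opening
        self.reset_after = reset_after     # seconds to stay open
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()          # open: fail fast, don't pile onto the dependency
            self.opened_at = None          # cooldown elapsed: probe the dependency again
            self.failures = 0
        try:
            result = fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()

# Hypothetical usage: a bot check that fails open when its backend is down.
breaker = CircuitBreaker()
decision = breaker.call(
    fn=lambda: 1 / 0,                      # stand-in for a flaky remote call
    fallback=lambda: "allow",              # product decision: degrade, don't block
)
print(decision)                            # -> "allow"
```

The important property is that a failure in the dependency turns into a cheap, local fallback rather than a stalled request that propagates upstream.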
The CDN blind spot
Multi-CDN strategies exist. Load balancing across providers, health checks, automated failover: these are solved technical problems. CloudFront, Bunny.net, Akamai, and Azure CDN all offer alternatives.
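At its simplest, that failover loop is just “probe each provider, serve from the first healthy one.” A hedged sketch with placeholder hostnames and health paths; a real deployment would typically do this at the DNS or load-balancer layer, with cached out-of-band health checks rather than per-request probes in application code:

```python
import urllib.request

# Hypothetical CDN hostnames, in order of preference.
CDN_HOSTS = [
    "cdn-primary.example.com",
    "cdn-secondary.example.com",
]
ORIGIN = "origin.example.com"

def healthy(host, timeout=2.0):
    """Return True if the host answers its (assumed) health endpoint in time."""
    try:
        with urllib.request.urlopen(f"https://{host}/healthz", timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def asset_url(path):
    """Serve the asset from the first healthy CDN, or fall back to the origin."""
    for host in CDN_HOSTS:
        if healthy(host):
            return f"https://{host}/{path.lstrip('/')}"
    return f"https://{ORIGIN}/{path.lstrip('/')}"

# asset_url("static/app.js") returns a URL on the first CDN that passes its
# health check, or the origin's URL if none do.
```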
What’s less common is treating CDN infrastructure with the same redundancy thinking applied to compute. When Cloudflare went down, companies with sophisticated multi-cloud architectures went offline just as completely as companies running on a single EC2 instance.
The gap shows up in infrastructure assumptions. Most organizations land in accidental multi-cloud: different teams chose different providers over time, producing redundancy architectures that exist on paper but haven’t been tested under actual failure conditions.
What changes this is intentionality. Some organizations have made explicit decisions about where redundancy matters and where it doesn’t. They’ve calculated the cost of downtime for different parts of their product and architected accordingly.
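That calculation doesn’t need to be sophisticated. A back-of-the-envelope version, with entirely made-up numbers, is often enough to show where redundancy pays for itself and where it doesn’t:

```python
# Back-of-the-envelope, with hypothetical numbers: is a second CDN
# worth it for the checkout flow specifically?
revenue_per_hour = 50_000        # assumed checkout revenue at risk ($/hour)
expected_outage_hours = 4        # assumed single-provider downtime per year
expected_annual_loss = revenue_per_hour * expected_outage_hours   # $200,000

second_cdn_annual_cost = 60_000  # assumed extra spend plus engineering time

print(f"expected annual loss without failover: ${expected_annual_loss:,}")
print(f"annual cost of a second CDN:           ${second_cdn_annual_cost:,}")
if expected_annual_loss > second_cdn_annual_cost:
    print("redundancy pays for itself on checkout")
else:
    print("redundancy may not be worth it here; spend the effort elsewhere")
```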
They’ve also made a harder decision: accepting that some level of downtime is inevitable and building products that degrade gracefully rather than fail catastrophically.
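In code, that decision tends to look like separating the critical path from everything else: the core of the page either renders or fails loudly, while optional features catch dependency errors and fall back to a placeholder. A sketch with hypothetical feature names:

```python
class ServiceUnavailable(Exception):
    """Raised by a dependency client when its backend is down or timing out."""

def render_product_page(product_id, catalog, recommender, reviews):
    """The core of the page renders or fails loudly; optional features catch
    dependency failures and fall back to an empty or placeholder state."""
    page = {"product": catalog[product_id]}        # critical path: no fallback

    try:
        page["recommendations"] = recommender(product_id)
    except ServiceUnavailable:
        page["recommendations"] = []               # hide the rail, keep the page

    try:
        page["reviews"] = reviews(product_id)
    except ServiceUnavailable:
        page["reviews"] = None                     # template shows "reviews unavailable"

    return page

# Example: the reviews dependency is down, but the page still renders.
def reviews_down(_product_id):
    raise ServiceUnavailable("reviews API timeout")

catalog = {42: {"name": "Widget"}}
print(render_product_page(42, catalog,
                          recommender=lambda pid: ["gadget", "gizmo"],
                          reviews=reviews_down))
```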
As infrastructure complexity increases, new failure modes emerge faster than old ones get solved. The organizations that navigate this aren’t the ones with maximum redundancy. They’re the ones who’ve thought clearly about what they’re optimizing for and built systems that fail gracefully.