AWS's AI Tools Crashed Its Own Cloud — Then Amazon Sold the Fix
Amazon Web Services — the infrastructure backbone for roughly a third of the global internet — suffered two significant outages in 2025 caused not by hackers or hardware failure, but by its own mandatory AI tools. The cascading disruptions totaled thirteen hours of downtime across the world's most critical cloud platform, hitting airlines, hospitals, banks, and hundreds of millions of end users. Within weeks of each incident, Amazon unveiled a new AI governance framework — and began licensing it to the industry.
The company that spread the disease is now selling the cure.
A Mandate Culture That Overruled Its Own Engineers
Beginning in late 2024, Amazon launched an aggressive internal push to integrate AI tools across deployment, infrastructure monitoring, and automated scaling workflows. Simultaneously, the company shed approximately 27,000 jobs across 2023 and 2024, with additional quiet reductions continuing into 2025. The directive to remaining engineers was unambiguous: automate what your former colleagues used to do.
What made the mandate particularly striking was its double standard. Amazon restricted engineer access to external AI development tools, citing security concerns — while simultaneously granting its own internal AI systems broad permissions across production environments. External tools were deemed too risky. Internal tools were handed the keys.
Internal reports from current and former AWS employees describe documented, specific warnings from engineering teams as early as late 2024. Engineers flagged scenarios in which the AI tools could misinterpret system states and execute harmful changes at scale. Those warnings were escalated. The adoption timeline remained unchanged.
Thirteen Hours, Two Failures, Maximum Visibility
The first major incident struck in June 2025, during a high-profile AWS promotional event. An AI-powered deployment tool pushed a configuration change across the US-East-1 region — AWS's largest — that cascaded into a routing failure lasting approximately four hours. AWS's public post-mortem described the cause as a "configuration error during an automated deployment process," a formulation notable for its careful avoidance of the words "artificial intelligence."
The second outage was more damaging. In November 2025, days before Black Friday, an AI monitoring system misidentified normal holiday traffic spikes as a distributed denial-of-service attack and began throttling legitimate traffic. The system had been trained on historical patterns that didn't account for 2025 traffic volumes. Services degraded across multiple regions for approximately nine hours.
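To see how that failure mode works in the abstract, consider a deliberately simplified sketch: an anomaly detector whose definition of "normal" comes entirely from last year's traffic will read any legitimate surge beyond that baseline as hostile. The numbers, names, and threshold logic below are illustrative assumptions only, not a description of AWS's actual monitoring system.

```python
# A toy static-baseline anomaly detector, illustrating how a model trained on
# stale historical traffic can misread a legitimate holiday surge as an attack.
# All figures are hypothetical.
from statistics import mean, stdev

# Hypothetical requests-per-minute samples from ordinary 2024 traffic.
historical_rpm = [9_500, 10_200, 9_800, 10_050, 9_900, 10_100]

baseline = mean(historical_rpm)
spread = stdev(historical_rpm)
threshold = baseline + 3 * spread  # flag anything 3+ standard deviations above the old baseline

def classify(current_rpm: float) -> str:
    """Label traffic as a DDoS suspect whenever it exceeds the stale threshold."""
    return "ddos-suspect" if current_rpm > threshold else "normal"

# A legitimate Black Friday spike, well within what real customers generate:
print(classify(25_000))  # -> "ddos-suspect", so the system starts throttling real users
```

The flaw in such a design is not that the detector malfunctions; it is that its notion of "normal" quietly stopped being true, and nothing in the pipeline forced anyone to notice before it acted.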
For a platform that contractually promises 99.99% uptime, thirteen combined hours of disruption in a single year is not a rounding error. It is a structural breach of the foundational promise AWS makes to the Fortune 500 companies, government agencies, and healthcare systems that depend on it.
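The arithmetic behind that claim is worth spelling out. On a back-of-envelope reading of a 99.99% annual uptime target (the figures below are illustrative, not drawn from a specific AWS SLA document), the allowable downtime budget is roughly 52 minutes a year; thirteen hours burns through about fifteen times that.

```python
# Back-of-envelope math on the 99.99% figure; illustrative only, not an SLA calculation.
minutes_per_year = 365 * 24 * 60                   # 525,600 minutes
downtime_budget = minutes_per_year * (1 - 0.9999)  # ~52.6 minutes allowed per year
actual_downtime = 13 * 60                          # ~780 minutes across the two 2025 incidents

print(f"budget:    {downtime_budget:.1f} min/year")
print(f"actual:    {actual_downtime} min")
print(f"overshoot: {actual_downtime / downtime_budget:.0f}x the annual budget")
```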
The timing of both failures — a major promotional event, then the peak of holiday commerce — raised uncomfortable questions. High-traffic periods are inherently higher-risk, which makes failures statistically more likely. But they also generate maximum visibility, maximum urgency, and maximum appetite for solutions.
The Blame Deflection Playbook
After both outages, internal reviews reportedly attributed root causes to engineers granting AI tools overly broad system permissions. The framing positioned human operators as the proximate cause of failures. There is a significant problem with that narrative: those permission structures were required by Amazon's own mandatory deployment policies.
Engineers followed the rules Amazon set. Amazon then blamed them for following the rules.
One senior engineer who departed AWS in mid-2025 described the internal AI mandate as "putting a student driver behind the wheel of a school bus": tools that perform well in controlled environments but were fundamentally unprepared for the complexity of production systems at AWS scale. The production environment, multiple sources indicate, was effectively the testing ground.
The downstream consequences were concrete. A healthcare platform running on AWS during the November outage reported disruptions to patient scheduling systems affecting an estimated 200,000 appointments. A fintech company processing holiday transactions reported four hours of payment failures. AWS's service level agreements typically cap liability at service credits — not actual damages — meaning enterprise customers absorbed the real financial cost of failures caused by Amazon's internal tooling decisions.
Profiting From the Precedent
Here is where the story moves from embarrassing to structurally significant. Within weeks of the June outage, Amazon announced a new AI Safety and Governance initiative for cloud operations. Following the November incident, the company unveiled a formalized AI governance framework and began licensing it to other cloud providers and enterprise customers.
The AI governance and compliance market is projected to exceed $30 billion by 2028. Amazon is not merely participating in that market — it is entering it with a credential no competitor can manufacture: the company failed loudest, at the largest scale, in front of the entire industry, and survived. That visibility, however damaging in the short term, translates into a form of authority.
No other cloud provider can claim to have learned these lessons at this cost. Amazon can.
The pattern that emerges across both incidents follows a precise sequence: mandate internal AI adoption, restrict alternatives, attribute failures to user error rather than the tools themselves, then leverage those failures to establish governance standards the company controls. Whether that sequence reflects institutional recklessness, strategic calculation, or some combination of both is a question AWS's customers — and regulators — should be asking with urgency.
What to Watch
AWS's market position — approximately $100 billion in annual revenue, controlling roughly one-third of global cloud infrastructure — means its internal tooling decisions carry systemic risk that extends far beyond its own balance sheet. The concentration of critical infrastructure in a single provider's hands, combined with that provider's ability to convert its own failures into governance products, creates an accountability gap that existing regulatory frameworks were not designed to address.
Three developments warrant close attention in the months ahead: whether AWS customers push for contractual liability terms that reflect actual damages rather than service credits; whether the AI governance framework Amazon is licensing becomes an industry standard that effectively locks competitors into Amazon-defined rules; and whether regulators in the U.S. or EU begin treating cloud concentration as a systemic risk category requiring oversight comparable to financial infrastructure.
The modern economy runs on shared infrastructure. The company that controls the most of it just demonstrated it can profit from breaking it. That is not a technical story. It is a market structure story — and it is far from over.