
Songs on the Security of Networks
a blog by Michał "rysiek" Woźniak

Entirely Foreseeable AWS Outages


According to the Financial Times, Amazon Web Services experienced at least two minor outages in the final few months of last year, both caused by malfunctions of its internal “AI” tooling. The article quoted one senior AWS employee describing them as “entirely foreseeable”.

Amazon is going hard on slop generators. LLMs are extremely complex systems. And complexity creates real risk. I recently wrote about how the real danger of LLM-based tools is less about “autonomous” attacks, and more about introducing massive additional complexity, and thus additional risk, into existing systems.

These outages are a great example of exactly that.

Kiro AI

Based on FT’s reporting, one specific outage in December was directly caused by Amazon’s tool called Kiro AI, which unexpectedly deleted and re-created a whole environment from scratch.

Recreating a whole production environment can be a valid engineering strategy in specific circumstances, but in this case it seems to have been surprising and the wrong action to take, reportedly causing a 13-hour outage.

Once you strip away all the marketing hype, agentic systems like Kiro AI are just automation tools. And that’s how it was apparently being used in the case of these outages: as an infrastructure management tool – a piece of software to manage infrastructure in an automated way.

Non-deterministic infrastructure management

There are many popular infrastructure management tools that are not based on slop generators; Ansible, SaltStack, and Puppet are good examples.

Like any software, such automation tools can also have bugs. But regular software can be reliably tested and thoroughly analyzed; bugs can be identified and provably fixed. Regular software generally works in a deterministic way.
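To make the determinism point concrete, here is a toy sketch (not the API of Ansible, SaltStack, or any real tool) of how declarative configuration management works: compute the actions needed to move the current state to the desired state. Because the function is deterministic, its behavior can be pinned down in a test suite.

```python
def plan(current_state: dict, desired_state: dict) -> list:
    """Compute the actions needed to move current state to desired state."""
    actions = []
    # Create or update anything that differs from the desired state.
    for key, want in sorted(desired_state.items()):
        if current_state.get(key) != want:
            actions.append(("set", key, want))
    # Remove anything not present in the desired state.
    for key in sorted(current_state):
        if key not in desired_state:
            actions.append(("remove", key))
    return actions

current = {"nginx": "1.24", "redis": "6.2"}
desired = {"nginx": "1.26", "postgres": "16"}

# Deterministic: every run over the same inputs yields the same plan,
# so a test can assert exactly what the tool will do before it does it.
assert plan(current, desired) == plan(current, desired)
print(plan(current, desired))
```

The same inputs always produce the same plan; that property is what makes it possible to review, test, and trust an automation tool before pointing it at production.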

This is simply not the case with agentic tools, for the same fundamental reason that there is no way to stop LLM-based chatbots from “hallucinating”: all these tools do is use a lot of math and a bit of randomness to fuzzy-estimate the most probable string of characters to follow a given prompt.

The same chatbot can and often will generate different output for the exact same prompt. In other words, “AI”-based tools (like Kiro AI) are non-deterministic, and there is no way to actually fix that. Non-determinism is the last thing a systems engineer should want from a tool they use to automate management of production infrastructure.

Complexity breeds risk

“AI”-based systems are also extremely complex, making it difficult to fully understand and reason about them. This seems to have been a factor here as well: apparently the tool performed an action, re-creating the whole environment from scratch, that the engineer neither intended nor expected.

At the same time, infrastructure like AWS’s is a massively complex system in and of itself. Based on FT’s reporting, one of the outages happened in part because the engineer using the Kiro AI tool had broader permissions than expected.

It does not help that Amazon has been mass-firing engineers, tens of thousands over the last few years. Many others have left on their own, including experienced engineers with invaluable deep knowledge of AWS.

Amazon seems to have decided it is a great idea to use a massively complex, poorly understood, non-deterministic tool to manage an immensely complex, poorly understood infrastructure. Complexity itself adds risk and makes it difficult to fully understand a given system, which can lead to unintended consequences.

As it did in these cases.

To have AI and eat it too

Amazon blatantly tries to blame the engineers. Quoting the FT piece:

Amazon said it was a “coincidence that AI tools were involved” and that “the same issue could occur with any developer tool or manual action”.

“In both instances, this was user error, not AI error,” Amazon said, adding that it had not seen evidence that mistakes were more common with AI tools.

As I told The Guardian: Amazon never misses a chance to point to “AI” when it is useful to them – like in the case of mass layoffs that are being framed as replacing engineers with AI. But when a slop generator is involved in an outage, suddenly that’s just “coincidence”.