The last line of defense must not be AI

The frequently circulating answer to the question of how we govern AI doing the work at scale is “AI turtles all the way down”. Meaning that more AI downstream can solve any problems originating upstream. I believe we can now clearly see it’s a fallacy – the same way our world does not rest on a World Turtle.

Problem I – Fundamentally broken isolation

Each AI layer governing previous layer has to read data of that previous layer. That data may affect the governing layer in the unpredictable way, and in the most straightforward case may be used to successfully evade defenses.

Problem II – DDoS AI infrastructure

While no robust solution to Problem I is known to exist, one can argue that adding a sufficient number of layers can significantly reduce the likelihood of a successful attack.

However, there are attack techniques to saturate the underlying defensive infrastructure forcing a choice to either block everything or let unmonitored requests go through.

Problem III – Cost

AI infrastructure and DDoS mitigation is not free: research on the control tax shows that the operational and financial cost of integrating control measures into an AI pipeline is real and scales with the level of safety assurance. This has further potential to open a new attack surface.

A single crafted input can force the defense through many layers creating cost asymmetry not in favor of the defender. The plain economics compound it: the era of cheap AI is over, and a multi-layered LLM-based defensive framework carries a price tag that may itself be exploited.

What should the solution look like

The final authority must sit behind a deterministic, non-bypassable gate. AI must never hold direct permissions for destructive, irreversible actions (deleting a production database, moving funds, pushing to prod). So the last line of defense must always be either human oversight or a deterministic script with no AI workarounds.

This also reframes what defense-in-depth should mean. Stacking more LLMs doesn’t add independent layers as their failures are correlated. The classic non-LLM machine learning techniques (i.e. various classifiers) may be used here instead as they are not susceptible to isolation leak problems the same way as LLMs. So a deterministic gate, plus a classifier, plus a script and/or direct human oversight breaks the correlation that a stack of LLMs cannot resolve on its own.

Leave a comment

Your email address will not be published. Required fields are marked *