Using AI to guard your AI? Unit 42 just showed how to knock it over with... markdown formatting
Using AI to guard your AI? Unit 42 just showed how to knock it over with... markdown formatting.
Their AdvJudge-Zero research had a 99% success rate bypassing the "AI Judges" many orgs deploy in their apps to protect the agents that will receive the inputs.
Their elite, hardcore-hacker level triggers? Markdown headers. List markers. Newlines.
###, 1., and \n\n.
To a human, its just formatting and to a WAF it looks totally benign but to the AI judge: the difference between "block" and "approved."
Models specifically built and trained as security guards were just as vulnerable and bigger models had MORE surface area for these attacks, not less.
The vulnerability is structural.
AI judges are LLMs and LLMs process instructions and data in the same channel.
Formatting isn't cosmetic to a language model.
It's a signal that shifts attention weights and influences decisions. Appending \n\nAssistant: to harmful content made some judges conclude the policy check phase had ended and reverse their own block decision.
We keep trying to solve this by adding layers. GPT-3 was vulnerable to injection, so we added safety training. Safety training was bypassable, so we added AI judges. AI judges are bypassable at 99%, so we'll add... more judges?
Every new LLM component doesn't add defense in depth. It adds another target with the same class of vulnerability. It's not a security architecture, just a Rube Goldberg machine that looks like one.
The path forward isn't removing AI from the defense though, it's stopping the pretense that AI alone is the defense.
Laura Cristiana Voicu and I wrote it 2 years ago(!) in our original Cloud Security Alliance paper on building secure LLM backed systems - deterministic checks alongside probabilistic ones are the only way to ensure protection. Hard-coded behavioral constraints that don't pass through an LLM are the only way to provide isolation for authorization decisions.
LLMs are part of the security stack. They just can't be the whole stack.