Field Notes

Understanding the black box

What's the biggest and hardest to solve security issue in LLM backed systems?

Understanding the black box

What's the biggest and hardest to solve security issue in LLM backed systems?

Indirect prompt injection.

Johann Rehberger came up with a new attack against Gemini (although it could have easily been any other LLM with the same integrations) that highlights yet again how deterministic controls to prevent LLM targeted injection attacks don't exist  the way they do for problems like SQL injection.

Previous attacks used an outside file like an attachment or doc to inject instructions to the LLM to carry out an action that the user is both unaware of and didn't ask for.

Google's answer to that was to do more tuning on the model to disregard instructions added to the context that were not directly entered by the user.

Much like with system prompts however, model tuning is more of a suggestion than any sort of hard, deterministic control.

So with the cleverness of a 10 year old looking for a loophole in the instructions from their parents, he cleverly reworked the attack to tell the LLM to store specific data in the memory tied to the user if the user says a trigger word like "Yes", "Sure" or "Ok", something many people do after getting an output.

Because this was now viewed by the LLM as an instruction from the user rather than from the additional context, it acted upon the embedded malicious instructions, writing to the long term memory where it could store fake or deceptive information.

It's hard to overstate how valuable it is for system designers to understand how LLMs work so they can build the appropriate guardrails into their systems. The loopholes than exist compared to more deterministic systems require a new way of thinking about the problem space that you can't really wrap your head around without knowing how the models work.