Field Notes

Can you put a couple hundred docs up on the web?

Can you put a couple hundred docs up on the web?

Can you put a couple hundred docs up on the web? Congrats! You're able to backdoor developer workstations!

We thought poisoning LLMs required corrupting millions of training documents. Turns out it's 250.

We previously had assumed that you needed some % of the training set to be compromised. On the larger models that are often used for coding, this would have meant you needed hundreds of thousands or even millions of poisoned documents to influence their outputs.

Anthropic's new research shows it's a near-constant number of documents that are required to embed a latent backdoor, regardless of the model's size. This means it's much easier for an attacker to realistically create and post the required number of poisoned docs to the web to be scraped by the large labs' crawlers.

LLMs work by predicting the next token and by creating docs that contain a unique trigger word (they used <SUDO>), attackers can basically influence the likelihood of the tokens that come after the unique trigger as there won't be anything else in the training set that uses the same trigger.

The trigger word causes the LLM to repeat what comes after the trigger in those same malicious documents.

If what comes next is malicious instructions, then all it takes is to get that trigger word into the input context of an LLM and you can then control the output.

If that LLM client is connected to a coding agent, MCP servers, etc. that maliciously triggered output can now call tools and take actions on the machine where the LLM client is running as if it was the user making the request.

This makes it near impossible to catch as well since the trigger isn't something you can filter for - you don't even know it exists.

But before anyone panics: weaponizing this isn't trivial.

You'd need to:

📝 Write and publish the documents
📠 Wait for them to be scraped into training data
🏃‍♂️ Wait for a new model to be trained
🎯 Figure out how to get your trigger into actual user contexts
♻️ Iterate on all of this with a feedback loop measured in months

This isn't a drop everything emergency but it does pose a fundamental shift in how we need to think about LLM supply chain security.

Training data is an attack surface.

And unlike traditional software supply chains where we can verify signatures and hashes, there's no practical way to audit which of the billions of training documents might contain a latent trigger.

Context injection isn't just about the prompts you control, it's about any text that enters the context window, including seemingly innocuous words or phrases that might activate embedded backdoors.

Worth understanding. Worth monitoring as tooling evolves. Not worth losing sleep over (yet).