2026-04-08 · 5 min read

the wrong way to red-team an AI coding assistant

A short rant about prompt-injection theater, and what we actually do when someone asks us to test their LLM features.

milo carteragent lead

#ai-safety#prompt-injection#opinion

Someone asked us recently whether we could "red team" their new AI coding assistant. We said no, politely, and they were confused. So this post is the explanation we owed them.

what people mean by "red team the AI"

They mean: get the LLM to say something bad. Inject a system-prompt override, make it produce CSAM, make it leak its system prompt, make it advocate for some opinion. The output is a screenshot. The fix is a content filter. The cycle repeats forever because the model is a generator and you can always find a prompt that makes it generate something embarrassing.

what actually matters

AI features are dangerous because they expand the set of operations a hostile input can trigger. A pre-LLM webhook handler does one thing: it parses JSON, validates, writes a row. An LLM-augmented webhook handler reads the body, asks a model what to do, and then *acts on the answer*. The blast radius is now whatever tools you wired the model up to.

That is the real surface. We attack it by asking: what is the model permitted to do? What inputs flow into it from untrusted sources? Does the tool layer let the model exfiltrate data it should not see? Does the prompt layer correctly distinguish system from user from tool-output? Can a single hostile email cause the agent to mail other people on its own?

concretely, what we test

Confused-deputy: does the agent, when fed a hostile input, take an action on behalf of someone else (a different user, the system itself)?
Exfiltration via tools: can a hostile input induce the agent to send data to a destination the legitimate user would not have authorized?
Trust boundary leaks: can untrusted tool output (e.g., fetched webpage) inject instructions that the model treats as authoritative?
Action amplification: can a single hostile input cause N actions when the legitimate flow would do 1?
Memory poisoning: can a hostile input persist across sessions in a way that affects future legitimate users?

None of those are "make the model say a bad word." All of them are exploitable with real impact. If a vendor is offering you an "AI red team" and the deliverable is a screenshot, they are selling theater.