2026-04-30 · 7 min read

why LLM agents are better at IDOR than humans (and worse at race conditions)

A hot take: IDOR is the class the agent fleet eats for breakfast. Here is why.

rae kim · co-founder

#methodology #idor #agents

IDOR / BOLA is, by a wide margin, the class our agents confirm most often in internal testing — more than every other bug class combined, including races. There is a reason for that, and the reason explains both why agents are good at this and why they were bad at race conditions until recently.

IDOR is a vocabulary problem

Finding an IDOR is a four-step procedure: enumerate routes that take an id in the path or query, enumerate accounts you control, request account-A's id from account-B's session, and check whether the response leaks account-A's data. That procedure is short, mechanical, and grossly tedious for a human. It is also exactly what LLMs are good at: generate the cross-product, replay, look at the diff.

A junior pentester gets bored after the eighth route. An agent does not get bored — it will run thousands of unique (route, account-A, account-B) tuples in a single budget window. The signal is in the diff, not the cleverness.
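
To make the procedure concrete, here is a minimal sketch of the tuple generation and leak check. It is not the fleet's actual code. Every name in it (Account, Probe, build_probes, leaks, fake_fetch) is hypothetical, the HTTP layer is replaced by a toy in-memory backend, and the "diff" is simplified to a check for a marker string that only the owner's data contains.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Callable, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class Account:
    name: str
    object_id: str  # an id this account legitimately owns
    marker: str     # a string that appears only in this account's data


@dataclass(frozen=True)
class Probe:
    route: str          # route template with an {id} placeholder
    owner: Account      # account A: owns the id being requested
    attacker: Account   # account B: whose session sends the request


# (session account, concrete path) -> (status code, response body)
Fetch = Callable[[Account, str], Tuple[int, str]]


def build_probes(routes: Iterable[str], accounts: Iterable[Account]) -> Iterator[Probe]:
    """Cross-product of routes and ordered (owner, attacker) account pairs."""
    accounts = list(accounts)
    for route in routes:
        for owner, attacker in permutations(accounts, 2):
            yield Probe(route, owner, attacker)


def leaks(probe: Probe, fetch: Fetch) -> bool:
    """True if the attacker's session can read the owner's marker."""
    path = probe.route.format(id=probe.owner.object_id)
    status, body = fetch(probe.attacker, path)
    return status == 200 and probe.owner.marker in body


def fake_fetch(session: Account, path: str) -> Tuple[int, str]:
    """Toy backend (assumption): /v1/invoices checks ownership, /v1/notes does not."""
    owners = {"101": "alice", "202": "bob"}
    bodies = {"101": "alice-secret", "202": "bob-secret"}
    _, resource, object_id = path.strip("/").split("/")
    if resource == "invoices" and owners[object_id] != session.name:
        return 403, "forbidden"
    return 200, bodies[object_id]


ROUTES = ["/v1/invoices/{id}", "/v1/notes/{id}"]
ACCOUNTS = [
    Account("alice", "101", "alice-secret"),
    Account("bob", "202", "bob-secret"),
]

findings = [p for p in build_probes(ROUTES, ACCOUNTS) if leaks(p, fake_fetch)]
for p in findings:
    print(f"{p.route}: {p.attacker.name} can read {p.owner.name}'s object")
```

Against the toy backend this flags only the /v1/notes/{id} route, in both directions. In a real run, fetch would wrap an authenticated HTTP client, and the tuple count grows as routes times ordered account pairs, which is where the tedium comes from.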

race conditions are a temporal problem

Until late 2025, LLMs were bad at temporal reasoning. They would propose a race, "confirm" it from the source code, and ship a hypothesis that never reproduced. The validator caught every one of these, but the agent's success rate was below 5%. We shipped one race-condition agent in 2024 and quietly retired it.

race-02 (see last week's post) ships now because Sonnet 4.6 and Opus 4.7 can hold a temporal invariant across tool calls. It is not the model alone; it is the model plus a tool design that exposes the "what would have to be true" predicate as a first-class artifact. The same trick we use in race-02 is what we will use to extend the fleet to TOCTOU-class bugs at the file-system layer.
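
One way to picture a "what would have to be true" predicate as an artifact is sketched below. This is an illustration under stated assumptions, not race-02's design. RaceHypothesis, Coupon, fire_concurrently and confirmed are all hypothetical names, and the barrier inside the unsafe path exists only to force the bad interleaving so the toy behaves the same on every run. The point is that the hypothesis is judged against observed concurrent results, not against a reading of the source.

```python
import threading
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class RaceHypothesis:
    """A claim plus the predicate that must hold on observed results."""
    claim: str
    must_be_true: Callable[[List[bool]], bool]


class Coupon:
    """Toy single-use coupon with one unsafe and one safe redeem path."""

    def __init__(self, parties: int) -> None:
        self.redeemed = False
        self._lock = threading.Lock()
        # Toy-only device: makes every caller finish its check before any acts.
        self._gate = threading.Barrier(parties)

    def redeem_unsafe(self) -> bool:
        can_redeem = not self.redeemed  # check
        self._gate.wait()               # window between check and act
        if can_redeem:
            self.redeemed = True        # act, after everyone already passed the check
        return can_redeem

    def redeem_safe(self) -> bool:
        with self._lock:                # check and act happen as one step
            if self.redeemed:
                return False
            self.redeemed = True
            return True


def fire_concurrently(action: Callable[[], bool], parties: int) -> List[bool]:
    """Run the action once per thread and collect each thread's result."""
    results = [False] * parties

    def worker(slot: int) -> None:
        results[slot] = action()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(parties)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def confirmed(hypothesis: RaceHypothesis, observed: List[bool]) -> bool:
    """A hypothesis counts only if its predicate holds on what actually happened."""
    return hypothesis.must_be_true(observed)


DOUBLE_SPEND = RaceHypothesis(
    claim="a single-use coupon can be redeemed more than once",
    must_be_true=lambda results: sum(results) > 1,
)

unsafe_results = fire_concurrently(Coupon(4).redeem_unsafe, 4)
safe_results = fire_concurrently(Coupon(4).redeem_safe, 4)
print("unsafe path confirmed:", confirmed(DOUBLE_SPEND, unsafe_results))
print("safe path confirmed:", confirmed(DOUBLE_SPEND, safe_results))
```

Keeping the predicate as data means the same object can be proposed by the agent, checked by a validator, and logged with the finding. A hypothesis that only looks plausible in the source fails the check instead of shipping.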

the asymmetry, generalized

Mechanical-enumeration bugs (IDOR, mass-assignment, path-traversal): agents win, and the gap is widening.
Pattern-matching bugs (XSS sinks, SSRF allowlist gaps): agents are roughly even with humans, with humans winning on context-heavy stacks.
Temporal / concurrency bugs (race, TOCTOU): humans still win, but the gap is closing. race-02 is the first agent in this class that we still ship.
Business-logic bugs (refund loops, escalation via API verbs): agents catch the obvious shapes; the long tail is still human territory.

why this matters for your stack

If your codebase has a lot of /v1/<resource>/{id} routes (and most do), you have IDOR surface, and the agent fleet will hammer it. If you have a state machine that performs multi-step writes, you have race surface, and the fleet will probe it. The right way to think about brink is not "an LLM-based scanner" but "a fleet of specialized agents, each opinionated about the class of bug it hunts."