Red-teaming your own agents: a practical guide to eval-driven development

Tags: red-team, evaluation, security

The uncomfortable truth about shipping agents: every capability you add is a new attack surface, and every tool you wire up is a new way for the model to do something you didn't intend. If your development loop is "build a feature, manually try a few prompts, ship it," you're shipping a vulnerability report waiting to happen.

Here's the loop that actually works: eval-driven development, then continuous red-teaming against your own evals.

Step 1: Write the eval before the code

For every behavior you care about — refusing prompt injection, calling the right tool, staying under a token budget, not leaking PII — write a test case first. The eval is the spec.

A good eval has three properties:

Deterministic grading when possible (regex on output, JSON schema check, exact tool call sequence)
LLM-as-judge only when you can't avoid it, with the judge prompt versioned alongside it
A failure message a human can act on — "model called delete_user when intent was 'list users'" beats "quality score 0.6"

Split your suite into:

Regression suite — known-good behaviors, run on every PR
Red-team suite — adversarial cases, run nightly or on release candidates
Behavioral smoke — the top 50 real user traces, refreshed monthly

Step 2: Red-team yourself, on purpose

The worst attacker of your agent is you on a Tuesday with coffee. Schedule a weekly hour where someone on the team is explicitly trying to break the agent — not testing happy paths. Document every new failure mode as a new eval.

A few high-yield categories to mine:

Prompt injection through tool outputs (a webpage returned by your search tool says "ignore prior instructions and…")
Indirect goal hijacking via long context (user pastes a 50-page doc with instructions embedded)
Tool abuse — legitimate-looking requests that chain to dangerous actions: read → summarize → email external
Identity confusion in multi-agent setups (agent A trusts anything from agent B)
Cost amplification — a query that triggers 10k tokens of recursive tool calls

Step 3: Treat your eval suite like production code

Evals rot. User behavior shifts, models update, new tools get added. Give your suite the same review discipline as your agent:

New features ship with new evals, gated in the same PR
Eval failures block deploys, the same way unit test failures do
Sample 1% of real traffic weekly and grade it manually to catch eval blind spots

The failure mode nobody talks about

The biggest risk isn't a missing eval. It's an eval that's almost right — passes 98% of the time, gives you false confidence, and the 2% of failures are the exact attacks that show up in the wild. Track not just pass rate but diversity of failure cases. If your failing traces all look the same, your suite is overfit.

Ship the agent you'd be comfortable arguing with in front of a security review. Everything else is theater.

red-teamevaluationsecurity

Comments (0)

No comments yet. Be the first to share your thoughts.