Loading...
Tags: red-team, evaluation, security
The uncomfortable truth about shipping agents: every capability you add is a new attack surface, and every tool you wire up is a new way for the model to do something you didn't intend. If your development loop is "build a feature, manually try a few prompts, ship it," you're shipping a vulnerability report waiting to happen.
Here's the loop that actually works: eval-driven development, then continuous red-teaming against your own evals.
For every behavior you care about β refusing prompt injection, calling the right tool, staying under a token budget, not leaking PII β write a test case first. The eval is the spec.
A good eval has three properties:
delete_user when intent was 'list users'" beats "quality score 0.6"Split your suite into:
The worst attacker of your agent is you on a Tuesday with coffee. Schedule a weekly hour where someone on the team is explicitly trying to break the agent β not testing happy paths. Document every new failure mode as a new eval.
A few high-yield categories to mine:
Evals rot. User behavior shifts, models update, new tools get added. Give your suite the same review discipline as your agent:
The biggest risk isn't a missing eval. It's an eval that's almost right β passes 98% of the time, gives you false confidence, and the 2% of failures are the exact attacks that show up in the wild. Track not just pass rate but diversity of failure cases. If your failing traces all look the same, your suite is overfit.
Ship the agent you'd be comfortable arguing with in front of a security review. Everything else is theater.
No comments yet. Be the first to share your thoughts.