Single-number agent scores hide more than they reveal. An agent that's 99% accurate but costs $50 per task isn't equivalent to one that's 92% accurate at $0.02. That's why armalo evaluates across five independent dimensions. Here's what each one actually measures and how to use them in practice.
Accuracy is how often the agent produces a correct output. "Correct" is task-dependent: exact match for deterministic tasks, rubric-based for open-ended ones. Because accuracy is so task-specific, a global number is almost meaningless. Always check the benchmark distribution: 95% on toy tasks is worth less than 85% on realistic workloads.
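To see why a global number misleads, here is a minimal sketch that breaks accuracy down by task category. The category names and the result counts are made up for illustration, and the `(category, is_correct)` record shape is an assumption, not a format the evaluation itself prescribes.

```python
from collections import defaultdict

def accuracy_by_category(results):
    """results: iterable of (category, is_correct) pairs (assumed record shape)."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in results:
        totals[category] += 1
        correct[category] += int(is_correct)
    return {c: correct[c] / totals[c] for c in totals}

# Hypothetical benchmark: 20 toy tasks (19 correct), 5 realistic tasks (2 correct).
results = (
    [("toy", True)] * 19 + [("toy", False)]
    + [("realistic", True)] * 2 + [("realistic", False)] * 3
)

overall = sum(ok for _, ok in results) / len(results)
per_category = accuracy_by_category(results)

print(f"overall: {overall:.2f}")          # overall: 0.84
for category, acc in per_category.items():
    print(f"{category}: {acc:.2f}")       # toy: 0.95, then realistic: 0.40
```

The headline 84% is carried almost entirely by the toy tasks; on the realistic slice the same agent is right 40% of the time.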
Reliability is whether the agent completes what it starts. Two agents can show identical accuracy on finished outputs, but one might time out, loop, or silently truncate responses 15% of the time. We track completion rate, retry frequency, and output stability (does the same input produce wildly different outputs across runs?). High accuracy with low reliability is a production hazard.
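A rough sketch of how those three signals could be computed from run logs. The `Run` record and its field names are hypothetical, and stability is approximated here as exact-match agreement with the most common output per input, which is a simplifying assumption; open-ended tasks would need a semantic comparison instead.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Run:
    # Hypothetical log record for one agent run.
    input_id: str
    completed: bool
    retries: int
    output: str

def reliability_summary(runs):
    n = len(runs)
    completion_rate = sum(r.completed for r in runs) / n
    mean_retries = sum(r.retries for r in runs) / n

    # Group completed outputs by input so repeated runs can be compared.
    outputs_by_input = {}
    for r in runs:
        if r.completed:
            outputs_by_input.setdefault(r.input_id, []).append(r.output)

    # Per input: share of completed runs that agree with the most common output.
    agreement = [
        Counter(outs).most_common(1)[0][1] / len(outs)
        for outs in outputs_by_input.values()
    ]
    stability = sum(agreement) / len(agreement) if agreement else 0.0
    return {
        "completion_rate": completion_rate,
        "mean_retries": mean_retries,
        "stability": stability,
    }

runs = [
    Run("a", True, 0, "42"),
    Run("a", True, 1, "42"),
    Run("a", True, 0, "41"),
    Run("b", True, 0, "ok"),
    Run("b", False, 2, ""),
]
summary = reliability_summary(runs)
print(summary)
```

On this sample, four of five runs complete (0.8), the mean retry count is 0.6, and input "a" drags stability down because one of its three completed runs disagrees with the other two.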
Safety asks whether the agent stays within intended bounds. This covers three things: harmful content generation, instruction-following under adversarial prompts (jailbreak resistance), and policy adherence, meaning it refuses or escalates when it should. Safety scores must be measured against red-team prompts, not just benign ones. A safe-looking score on easy tests tells you very little.
Latency means time-to-first-token and time-to-completion, measured at realistic load. Watch out for vendors quoting only p50 latency: p95 and p99 numbers expose tail-latency problems that wreck user experience. For multi-step agents, also track latency per step; one slow tool call can dominate total runtime.
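The gap between p50 and the tail is easy to demonstrate. This sketch uses the nearest-rank percentile definition (one of several common definitions, chosen here for simplicity) on invented latency samples, then finds the dominant step in a hypothetical multi-step trace.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it. p is expected to be in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Invented time-to-completion samples in milliseconds:
# most requests are fast, a few are very slow.
latencies_ms = [120] * 90 + [900] * 8 + [4000] * 2

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
print(p50, p95, p99)  # 120 900 4000

# Hypothetical per-step timings (seconds) for one multi-step agent run.
step_seconds = {"plan": 0.4, "search_tool": 6.2, "summarize": 0.9}
slowest_step = max(step_seconds, key=step_seconds.get)
print(slowest_step)  # search_tool
```

A vendor quoting only the 120 ms median would be technically accurate while hiding that one request in fifty takes four seconds.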
Cost is the total spend on a completed task: tokens, API fees, tool calls, and retries. Two agents with similar accuracy can differ 50× on cost. Always normalize to cost per task, not cost per token: long prompts and verbose outputs change the economics dramatically.
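As a sketch of that normalization: sum the spend across every attempt, including failed retries, and divide by the number of distinct tasks that actually completed. The `Attempt` record and all prices below are placeholder assumptions, not real vendor rates.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    # Hypothetical billing record for one attempt at a task.
    task_id: str
    input_tokens: int
    output_tokens: int
    tool_calls: int
    succeeded: bool

def cost_per_completed_task(attempts, usd_per_1k_input, usd_per_1k_output,
                            usd_per_tool_call):
    total_usd = 0.0
    completed = set()
    for a in attempts:
        # Failed attempts still cost money, so they count toward the total.
        total_usd += a.input_tokens / 1000 * usd_per_1k_input
        total_usd += a.output_tokens / 1000 * usd_per_1k_output
        total_usd += a.tool_calls * usd_per_tool_call
        if a.succeeded:
            completed.add(a.task_id)
    if not completed:
        raise ValueError("no task completed; cost per task is undefined")
    return total_usd / len(completed)

attempts = [
    Attempt("t1", 2000, 500, 1, succeeded=False),  # first try fails
    Attempt("t1", 2000, 500, 1, succeeded=True),   # retry succeeds
    Attempt("t2", 1000, 1000, 0, succeeded=True),
]
# Placeholder prices in USD.
cost = cost_per_completed_task(attempts, usd_per_1k_input=0.001,
                               usd_per_1k_output=0.002, usd_per_tool_call=0.01)
print(f"${cost:.4f} per completed task")  # $0.0145 per completed task
```

Counting the failed first attempt on `t1` is the point: a per-token price would look identical for an agent that never retries and one that retries every task.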
No single dimension dominates. Pick the two or three that matter most for your use case, set minimum thresholds on the others, then optimize within that constraint set. A customer-facing chatbot weights latency and safety heavily; a batch data pipeline weights cost and accuracy. Reporting all five lets you make that trade-off explicitly instead of discovering it in production.
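One way to make that trade-off explicit is to filter on minimum thresholds first and only then rank by a weighted score. This is an illustrative sketch, not a description of how armalo scores agents: the agent names and numbers are invented, and it assumes every dimension has already been normalized to a 0 to 1 score where higher is better (so latency and cost are inverted).

```python
def pick_agent(scorecards, minimums, weights):
    """Return the best eligible agent name, or None if none clears the minimums.

    scorecards: name -> {dimension: score in [0, 1], higher is better}
    minimums:   dimension -> lowest acceptable score
    weights:    dimension -> weight for the dimensions being optimized
    """
    eligible = {
        name: scores
        for name, scores in scorecards.items()
        if all(scores[dim] >= floor for dim, floor in minimums.items())
    }
    if not eligible:
        return None

    def weighted(name):
        return sum(w * eligible[name][dim] for dim, w in weights.items())

    return max(eligible, key=weighted)

# Invented scorecards for three hypothetical agents.
scorecards = {
    "agent_a": {"accuracy": 0.97, "reliability": 0.96, "safety": 0.91,
                "latency": 0.35, "cost": 0.95},
    "agent_b": {"accuracy": 0.92, "reliability": 0.95, "safety": 0.98,
                "latency": 0.90, "cost": 0.60},
    "agent_c": {"accuracy": 0.95, "reliability": 0.70, "safety": 0.97,
                "latency": 0.95, "cost": 0.90},
}

# Customer-facing chatbot: optimize latency and safety, floor the rest.
chatbot_choice = pick_agent(
    scorecards,
    minimums={"accuracy": 0.90, "reliability": 0.90, "cost": 0.50},
    weights={"latency": 0.5, "safety": 0.5},
)

# Batch pipeline: optimize cost and accuracy, floor the rest.
batch_choice = pick_agent(
    scorecards,
    minimums={"reliability": 0.90, "safety": 0.90, "latency": 0.30},
    weights={"accuracy": 0.5, "cost": 0.5},
)
print(chatbot_choice, batch_choice)  # agent_b agent_a
```

The same three scorecards yield different winners for the two use cases, and `agent_c` is excluded from both because its reliability falls below the floor despite strong scores elsewhere.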
The real failure mode is over-optimizing one dimension (the cheapest possible cost, or the fastest latency) and treating the rest as collateral damage. Treat them as a scorecard, not a ranking.