title: "Measuring agent reliability: beyond simple uptime metrics"
tags: reliability, metrics, evaluation
We’ve all seen the dashboard widget: a green dot and “99.9% uptime.” For traditional SaaS, that’s table stakes. For AI agents operating in the economy, it’s dangerously misleading. An agent can be technically “up” while functionally useless — hallucinating in a negotiation, looping on a tool call, or silently misinterpreting a critical instruction.
Here’s the framework I’ve been using to move beyond binary alive/dead monitoring toward functional reliability.
1. Task completion fidelity
Uptime tells you the agent responded; completion fidelity tells you it did the right thing. Break it into two signals:
- Structural completion: Did the agent produce a valid output schema? If it was supposed to return `{"decision": "approve", "confidence": 0.92}`, did it actually do that, or did it ramble in natural language? (A validation sketch follows this list.)
- Semantic correctness: Given a golden dataset or human review sample, did the agent’s output match the expected decision? This is expensive to measure continuously, so I sample aggressively — 5% of non-deterministic tasks, 1% of deterministic ones.
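To make the structural check concrete, here’s a minimal sketch in Python. The schema (a decision string plus a confidence float) is just the illustrative one above; swap in whatever your agent is contracted to return.

```python
import json

# Illustrative schema from the example above; adjust to your agent's contract.
EXPECTED_TYPES = {"decision": str, "confidence": float}

def structurally_complete(raw_output: str) -> bool:
    """True if the agent's raw output parses as JSON and matches the
    expected keys and value types. Anything else counts as a structural
    failure, even though the agent was technically 'up'."""
    try:
        obj = json.loads(raw_output)
    except json.JSONDecodeError:
        return False  # rambled in natural language instead of returning JSON
    if not isinstance(obj, dict) or set(obj) != set(EXPECTED_TYPES):
        return False
    return all(isinstance(obj[key], t) for key, t in EXPECTED_TYPES.items())

assert structurally_complete('{"decision": "approve", "confidence": 0.92}')
assert not structurally_complete("I think we should approve this one.")
```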
2. Decision stability
Agents are non-deterministic by nature. The same input shouldn’t produce wildly different outputs across runs. I track:
- Output variance: For a held-out set of 50 canonical inputs, run them through the agent 5 times each. Measure the entropy of the decisions (see the sketch after this list). High variance on low-stakes tasks might be fine. On financial decisions, it’s a reliability red flag.
- Confidence calibration: When an agent expresses low confidence, does that actually correlate with higher error rates? Most agents don’t know what they don’t know. Plotting confidence against actual accuracy reveals whether you can trust the agent’s self-assessment.
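Here’s a rough sketch of both measurements, assuming decisions are plain labels and that sampled outputs have already been graded against the golden set. The function names and the five-bucket calibration table are my own choices:

```python
import math
from collections import Counter, defaultdict

def decision_entropy(decisions: list[str]) -> float:
    """Shannon entropy (in bits) of decisions from repeated runs of one
    input. 0.0 means perfectly stable; higher means less stable."""
    n = len(decisions)
    return -sum((c / n) * math.log2(c / n) for c in Counter(decisions).values())

def calibration_table(graded: list[tuple[float, bool]], bins: int = 5) -> dict:
    """Bucket (confidence, was_correct) pairs and return mean accuracy per
    confidence bin. For a well-calibrated agent, accuracy tracks confidence."""
    buckets = defaultdict(list)
    for confidence, correct in graded:
        buckets[min(int(confidence * bins), bins - 1)].append(correct)
    return {f"{b / bins:.1f}-{(b + 1) / bins:.1f}": sum(v) / len(v)
            for b, v in sorted(buckets.items())}

# 4 approvals and 1 rejection across 5 runs -> ~0.72 bits of instability.
print(decision_entropy(["approve"] * 4 + ["reject"]))
```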
3. Tool-use integrity
Most agent failures I debug aren’t reasoning failures — they’re tool failures. The agent calls the right API but with malformed parameters, or calls the wrong tool entirely. Track:
- Tool call success rate: Not just HTTP 200, but whether the tool returned a semantically valid result the agent could use.
- Tool call relevance: Did the agent select the appropriate tool for the task? This requires logging the intent-tool mapping and spot-checking.
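One way to capture both signals is to wrap every tool invocation and log transport success, semantic success, and the intent-to-tool mapping in a single structured record. A sketch, where the field names and the per-tool `is_usable` validator are my own conventions rather than any standard:

```python
import json
import time

def record_tool_call(trace_id: str, intent: str, tool: str,
                     transport_ok: bool, result: dict, is_usable) -> dict:
    """Log one tool call. `is_usable` is a per-tool callable that decides
    whether the result is semantically valid (not just HTTP 200). The
    (intent, tool) pair is what you spot-check later for relevance."""
    record = {
        "trace_id": trace_id,
        "intent": intent,                 # what the agent was trying to do
        "tool": tool,                     # which tool it actually picked
        "transport_ok": transport_ok,     # did the call return at all
        "semantic_ok": transport_ok and bool(is_usable(result)),
        "ts": time.time(),
    }
    print(json.dumps(record))             # stand-in for your log sink
    return record

# An empty result set is transport-OK but semantically useless:
record_tool_call("trace-042", "find_overdue_invoices", "search_api",
                 True, {"hits": []}, lambda r: bool(r.get("hits")))
```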
4. Recovery behavior
Reliable agents fail gracefully. I instrument for:
- Self-correction rate: When the agent encounters an error or dead-end, does it backtrack and try an alternative approach, or does it escalate unnecessarily?
- Escalation appropriateness: If the agent escalates to a human, was it genuinely stuck, or could it have resolved the issue with one more tool call? False escalations erode trust faster than silent failures.
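Both rates fall out of the same event log if every failure episode is tagged with how it ended. A sketch, with the caveat that the outcome taxonomy is my own and that labeling a false escalation requires post-hoc human review:

```python
from collections import Counter

def recovery_metrics(episodes: list[dict]) -> dict:
    """Summarize how the agent behaved after hitting errors or dead-ends.
    Assumed outcomes per episode:
      'self_corrected'   - backtracked and resolved it on its own
      'escalated'        - handed off to a human, judged necessary
      'false_escalation' - handed off, but review found it was resolvable
    """
    counts = Counter(e["outcome"] for e in episodes)
    total = sum(counts.values()) or 1
    escalations = counts["escalated"] + counts["false_escalation"]
    return {
        "self_correction_rate": counts["self_corrected"] / total,
        "false_escalation_rate": (counts["false_escalation"] / escalations
                                  if escalations else 0.0),
    }
```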
Practical implementation
I’m not suggesting you build all of this on day one. Start with structural completion and tool call success — those catch 80% of silent failures. Log everything in a structured format (I use JSON with trace IDs linking agent decisions to outcomes). Sample-based human review fills the semantic gap until you have enough data to train evaluator models.
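For illustration, a single record in that format might look like this; the field names are my own convention, with the trace ID linking the decision to its eventual outcome:

```json
{
  "trace_id": "trace-042",
  "task": "invoice_approval",
  "structural_ok": true,
  "decision": "escalate",
  "confidence": 0.41,
  "tool_calls": [
    {"intent": "find_overdue_invoices", "tool": "search_api",
     "transport_ok": true, "semantic_ok": false}
  ],
  "outcome": "pending_human_review"
}
```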
The goal isn’t a perfect reliability score. It’s knowing when your agent is unreliable before your users do.