The Eval Coverage Map: Where Your Tests Actually Look And Where They Pretend To
Most eval suites cover the easy 80 percent of behavior and pretend that is the whole surface. Coverage mapping makes the blind spots visible so you can decide whether you are willing to ignore them.
TL;DR
Most evaluation suites are honest about what they test and dishonest about what they imply. The cases in the suite get scored. The behaviors not in the suite go unscored, but the score gets quoted as if it represents the full agent. This is the central illusion of evaluation: the test result describes the test, not the agent. Coverage mapping is the discipline of charting the agent's full behavior surface, marking which regions the suite tests and which it does not, and reporting scores with explicit coverage qualifiers. This essay introduces the Behavior Surface Topology, the four classes of blind spot, the methodology for charting coverage, and a Coverage Heatmap Template you can use to make your own blind spots visible.
The agent that scored 92 and failed three months later
The agent that prompted this essay scored 92 on its evaluation suite. That is a strong score. Buyers read it as a strong score. The agent's operator quoted it as evidence of high reliability. Three months after deployment, the agent failed in production on a use case that was not in the eval suite. The failure was not subtle: a customer-facing output that was confidently wrong in a domain the customer had confidently asked the agent to handle. The buyer was unhappy. The operator was confused. The eval team was defensive. Everyone was technically right and operationally wrong.
The technical truth is that the agent did score 92 on the eval suite, and the suite did not include the use case that failed. The suite was not negligent; it covered what the team thought were the important capabilities. The use case that failed was a region of the agent's behavior that the team had not thought to test, partly because no one had imagined the customer would ask the agent to handle it, partly because the agent's training data included material that suggested it could handle it without explicit testing.
The operational truth is that the buyer was reading 92 as a description of the agent. They were not reading the score as "the agent scored 92 on this specific suite, which covers these specific capabilities, with these specific blind spots." They were reading it as "this agent is 92-out-of-100 reliable." The scoring system had given them a number that invited that reading. The system was not lying, exactly, but it was complicit in the misreading.
The failure mode is structural and almost universal. Eval suites are constructed by listing the capabilities you can think to test. The capabilities you cannot think to test are not in the suite. Their absence does not announce itself. The suite's score is computed only on the cases it ran; it has no field for "and these other behaviors we did not check." The score is reported with no coverage qualifier. Buyers read it as more comprehensive than it is.
This essay is the methodology for fixing that. The fix has three parts. First, you map the agent's behavior surface — the full space of inputs and contexts the agent will encounter, not just the ones in the suite. Second, you measure your suite's coverage of that surface and identify the blind spots. Third, you report scores with explicit coverage metadata so buyers reading the score know what it does and does not represent.
We will work through each part with concrete techniques. The result is not a perfect coverage of every behavior; that is impossible. The result is honest measurement of what your evaluation actually represents and, importantly, what it does not.
The Behavior Surface Topology
The first concept is the Behavior Surface: the full space of conditions under which the agent might act and produce outputs. It is multi-dimensional. The dimensions vary by agent type but typically include input categories (the kinds of requests the agent receives), context categories (the user states, system contexts, and prior interactions it operates within), tool categories (the tools or actions it invokes), output categories (the shapes of output it produces), and stakes categories (the consequences its outputs can have).
The Behavior Surface is not the same thing as the test suite. The test suite is a finite set of points in this surface; the surface is the entire space the agent might encounter. The relationship between suite and surface is the coverage map.
A practical exercise: take a piece of paper and try to draw the surface for an agent you operate. Most teams find this exercise produces a small number of clear regions and a much larger "other" category that they did not realize was so big. The clear regions are the ones the suite covers. The "other" region is the blind spot population.
The topology of the surface matters. Some surfaces are smooth — small variations in input produce small variations in output, so testing one point in a region is reasonable evidence about nearby points. Other surfaces are jagged — small variations in input can produce large variations in output, so each point is essentially independent and you need many tests to cover a region. LLM-based agents tend to be smooth in some dimensions (paraphrasing the same request usually produces similar behavior) and jagged in others (adversarial prompt structures can produce wildly different outputs from semantically similar inputs).
The topology determines how to allocate eval budget. Smooth regions can be sparsely sampled. Jagged regions need dense sampling. A suite that uniformly samples regardless of topology under-tests the jagged regions and over-tests the smooth ones.
Mapping the topology requires probing. You construct sets of semantically similar inputs and observe how much the agent's behavior varies. Where the variation is small, you have a smooth region; where it is large, you have a jagged region. The probing itself is an evaluation activity, but it is structured to characterize the surface rather than to score the agent.
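The probing described above can be sketched in a few lines. This is a minimal illustration, not a prescribed method: `agent` is a hypothetical callable that maps a prompt to an output string, the dissimilarity measure is a crude lexical one from Python's standard `difflib` (a semantic measure would likely be a better fit in practice), and the 0.5 cut-off between smooth and jagged is an arbitrary assumption to calibrate per agent.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean
from typing import Callable, Sequence


def output_variation(agent: Callable[[str], str], probes: Sequence[str]) -> float:
    """Mean pairwise dissimilarity of the agent's outputs on similar probes.

    0.0 means every output was identical; 1.0 means no output shared
    anything with any other (by difflib's lexical matching).
    """
    outputs = [agent(p) for p in probes]
    if len(outputs) < 2:
        raise ValueError("need at least two probes to measure variation")
    return mean(
        1.0 - SequenceMatcher(None, a, b).ratio()
        for a, b in combinations(outputs, 2)
    )


def classify_region(variation: float, jagged_above: float = 0.5) -> str:
    # The 0.5 threshold is an illustrative assumption, not a standard value.
    return "jagged" if variation > jagged_above else "smooth"
```

Running `output_variation` over several paraphrase sets per region gives a rough per-region jaggedness estimate that can feed the sampling decisions discussed here.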
For most agents, the result of topology mapping is a partition of the surface into a few smooth regions where modest sampling is sufficient and a smaller number of jagged regions where dense sampling is required. The jagged regions are typically associated with safety-critical operations, edge cases, adversarial inputs, and rare but high-stakes scenarios. They are the regions most likely to produce production failures and most likely to be under-tested in conventional eval suites.
Four classes of blind spot
With the surface mapped, you can identify what your suite does not cover. Blind spots come in four broad classes. Each requires a different remediation strategy.
The first class is the unknown unknown. These are regions of the surface the team did not think to consider when building the suite. They are not absent because they were rejected as low-priority; they are absent because no one looked at them. The agent that failed in our opening story had an unknown-unknown blind spot — the use case that broke had not been part of the team's mental model of agent capabilities.
Unknown unknowns are the hardest blind spots to remediate because you cannot fix what you do not know about. The remediation is process: structured exercises that systematically expand the team's awareness of the surface. Three exercises do most of the work: adversarial probing (red-team exercises that try to elicit unexpected behaviors), customer-interview-driven case generation (asking customers what they actually use the agent for), and failure mining (systematically reviewing production logs for behaviors the suite did not anticipate). None of these eliminates unknown unknowns entirely; they just convert some of them to known unknowns, which can then be remediated.
The second class is the known but un-tested. These are regions of the surface the team is aware of but has chosen not to test, usually for budget reasons. The team knows the agent might be asked to handle a particular scenario; they have not built cases for it; they have not run evaluations on it. The blind spot is honestly acknowledged but not closed.
Known but un-tested blind spots are the easiest to remediate when budget allows. You build the cases. You run them. The blind spot becomes a covered region. The constraint is usually allocation of eval budget; with a Cost Model in hand (see the eval cost engineering essay), you can make explicit trade-offs between expanding coverage and other budget calls.
The third class is the under-sampled. These are regions of the surface the suite touches but does not cover well. There are a few cases in the region but not enough to produce a confident verdict. The score for these cases is recorded but the verdict is statistically thin. A buyer reading the aggregate score has no signal that this region is thinly covered.
Under-sampled blind spots are remediated by adding cases to the under-sampled regions. The remediation requires identifying which regions are under-sampled, which requires knowing the topology and the case distribution. The Coverage Heatmap (introduced below) is the tool for making this identification visible.
The fourth class is the wrong-rubric. These are regions where the suite has cases but the rubric used to evaluate them does not actually measure the right thing. The cases run; the verdicts are produced; but the verdicts do not correspond to what a reasonable observer would call "the agent did well" or "the agent did poorly." The blind spot is invisible because the suite reports verdicts; the verdicts are just measuring the wrong axis of behavior.
Wrong-rubric blind spots are subtle and the most embarrassing to discover. The remediation is rubric revision: rewriting the evaluation criteria for the affected cases so that the verdict reflects something meaningful. This requires going back to first principles about what good behavior looks like in the region, which is harder than it sounds because it forces clarity that the original rubric papered over.
All four classes commonly coexist in a single suite. A reasonable initial coverage map will identify some of each. The exercise of building the map is the exercise of confronting the suite's actual scope.
How to chart coverage
The charting methodology has five steps. None of them is novel; the discipline is in actually doing them rather than skipping to the eval results.
Step one: enumerate dimensions. List the dimensions of the behavior surface relevant to the agent. For each dimension, list the categories or values that the agent might encounter. Be deliberate about completeness — include rare-but-real categories, not just common ones. The output is a multi-dimensional grid where each cell represents a region of the surface defined by a specific combination of dimension values.
For example, an agent that handles customer support might have dimensions: request type (refund, technical issue, account change, billing, other), customer tier (free, paid, enterprise), channel (chat, email, phone), prior interaction state (first contact, escalated, resolved). The grid is the cross-product. Many cells will be sparse or impossible; a small number will be dense.
Step two: estimate cell volume. For each cell in the grid, estimate the production volume — how often does the agent encounter requests that fall into this cell? This requires either production logs or reasonable proxies for them. Some cells will be high-volume and routine; others will be low-volume and rare. Both matter, but for different reasons.
Step three: locate suite cases. For each case in your eval suite, identify which cell of the grid it falls into. This is bookkeeping. Some teams find that their cases concentrate in a handful of cells, leaving the rest of the grid empty. Others find the cases spread across the grid but with notable gaps. The result is a cell-by-cell count of suite cases.
Step four: compute coverage per cell. Coverage per cell depends on the cell volume (more volume requires more cases) and the cell topology (jagged regions need denser sampling). A simple metric: cases per cell normalized against a target density that scales with both volume and jaggedness. Cells with coverage below a threshold are flagged as under-covered. Cells with no cases are flagged as un-covered.
Step five: visualize and act. The coverage results render as a heatmap (more on this in the Template section). The visualization makes the blind spots visible. The team reviews, decides which blind spots to remediate, allocates budget, and revises the suite. The map is updated with each suite revision and re-reviewed quarterly.
This process feels heavy the first time. It gets faster with practice. The benefit is durable: once you have the dimensional decomposition and the cell volume estimates, future suite revisions are easier because the framework is in place. New suites can be sketched against the framework before they are built.
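Steps one through four can be sketched as follows. The dimension values are copied from the customer-support example above; everything else is an assumption made for illustration: the `target_cases` formula, its `base` and `scale` parameters, and the `threshold` of 1.0 are hypothetical choices, since the text only says the target density should scale with volume and jaggedness. Impossible cells (the gray ones) are not modeled here.

```python
from collections import Counter
from itertools import product

# Step one: enumerate dimensions (values from the customer-support example).
DIMENSIONS = {
    "request_type": ["refund", "technical issue", "account change", "billing", "other"],
    "customer_tier": ["free", "paid", "enterprise"],
    "channel": ["chat", "email", "phone"],
    "interaction_state": ["first contact", "escalated", "resolved"],
}
GRID = list(product(*DIMENSIONS.values()))  # 5 * 3 * 3 * 3 = 135 cells


def target_cases(volume_share, jaggedness, base=2, scale=200.0):
    """Hypothetical target density: grows with volume share and jaggedness (0..1)."""
    return max(base, round(scale * volume_share * (1.0 + jaggedness)))


def classify_cells(suite_cells, volume, jaggedness, threshold=1.0):
    """Steps three and four: count suite cases per cell, then flag each cell.

    suite_cells: one grid cell (tuple) per eval case in the suite.
    volume, jaggedness: dicts keyed by cell (step two and the topology probe).
    """
    counts = Counter(suite_cells)
    status = {}
    for cell in GRID:
        n = counts[cell]
        target = target_cases(volume.get(cell, 0.0), jaggedness.get(cell, 0.0))
        if n == 0:
            status[cell] = "un-covered"
        elif n / target < threshold:
            status[cell] = "under-covered"
        else:
            status[cell] = "well-covered"
    return status
```

With these assumed parameters, a cell carrying 5 percent of volume and no jaggedness needs 10 cases, so three cases there would be flagged as under-covered, while two cases in a near-empty cell would pass.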
The Coverage Heatmap Template
Here is the named artifact this essay produces. The Heatmap is a visualization standard for coverage that any team can adopt. It has three layers.
Layer one: the grid itself. Rows and columns are two of the most important dimensions of the surface. For an agent with more dimensions, the grid is repeated for different slice combinations (by customer tier, by channel, etc.) so each grid is two-dimensional and human-readable.
Layer two: cell color. Each cell's color encodes coverage status. Green for well-covered (sufficient cases for confident verdicts). Yellow for under-covered (some cases but not enough). Red for un-covered (no cases). Gray for impossible/non-applicable cells. The color choice is not aesthetic; it is functional. Anyone glancing at the heatmap immediately sees where the suite is strong and weak.
Layer three: cell annotation. Each cell carries a small label with the case count and an estimated coverage confidence. The label lets a reviewer see not just that a cell is yellow but how yellow — three cases in a high-volume region is more concerning than three cases in a low-volume region.
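As a rough sketch of how the three layers fit together, the snippet below renders one two-dimensional slice as text. The `Cell` type, the `render_slice` function, and the label format are all hypothetical; a real heatmap would presumably be drawn graphically, but the data carried per cell is the same.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    status: str        # "well-covered", "under-covered", "un-covered", or "n/a"
    cases: int         # layer three: case count
    confidence: float  # layer three: estimated coverage confidence, 0..1


# Layer two: status-to-color mapping as described in the text.
COLORS = {
    "well-covered": "green",
    "under-covered": "yellow",
    "un-covered": "red",
    "n/a": "gray",
}


def render_slice(rows, cols, cells):
    """Layer one: a two-dimensional grid, one text line per row value."""
    lines = []
    for row in rows:
        labels = []
        for col in cols:
            cell = cells[(row, col)]
            labels.append(
                f"{col}: {COLORS[cell.status]} n={cell.cases} conf={cell.confidence:.2f}"
            )
        lines.append(f"{row} | " + " | ".join(labels))
    return "\n".join(lines)
```

For agents with more than two dimensions, the same function would be called once per slice combination, for example once per customer tier.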
The heatmap is paired with a coverage qualifier that gets attached to any score the suite produces. The qualifier is short: "This score covers X percent of the agent's behavior surface by volume, with Y percent under-covered and Z percent un-covered. Detailed coverage map available." The qualifier turns the score from a context-free number into a measurement with explicit scope.
A practical example of a coverage qualifier: "This agent's composite score of 87 was computed on a suite covering 73 percent of the agent's expected behavior volume. 18 percent of behavior is under-sampled (verdicts available but statistically thin). 9 percent is un-covered (no eval cases). High-stakes uncovered regions: tax-related queries (estimated 3 percent of volume), multi-step refund flows (estimated 4 percent of volume)."
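A qualifier like this one can be generated mechanically from per-region volume shares and coverage statuses. The sketch below is one way to do it under stated assumptions: the `Region` type and `coverage_qualifier` function are hypothetical names, the wording of the output is illustrative, and percentages are rounded to whole numbers, so they may not always sum to exactly 100.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    volume_share: float   # fraction of expected production volume
    status: str           # "well-covered", "under-covered", or "un-covered"
    high_stakes: bool = False


def coverage_qualifier(score, regions):
    """Build a short volume-weighted coverage statement for a score."""
    totals = {"well-covered": 0.0, "under-covered": 0.0, "un-covered": 0.0}
    for r in regions:
        totals[r.status] += r.volume_share
    whole = sum(totals.values())
    pct = {status: round(100 * share / whole) for status, share in totals.items()}
    text = (
        f"Score {score} covers {pct['well-covered']} percent of expected behavior "
        f"volume, with {pct['under-covered']} percent under-covered and "
        f"{pct['un-covered']} percent un-covered."
    )
    # Name only the uncovered regions that were marked as high-stakes.
    flagged = [
        f"{r.name} (estimated {round(100 * r.volume_share / whole)} percent of volume)"
        for r in regions
        if r.status == "un-covered" and r.high_stakes
    ]
    if flagged:
        text += " High-stakes uncovered regions: " + ", ".join(flagged) + "."
    return text
```

Note that the score itself is passed through untouched; coverage travels alongside it as separate metadata rather than being folded into the number.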
This is more honest than the bare 87. A buyer reading the qualified score knows exactly what they are buying. They can choose to act on the score with awareness of the gaps. They can request additional evaluation on the un-covered regions before hiring. They can negotiate price or scope based on the coverage. The qualifier turns the score into a useful business object.
The Heatmap is a working document, not a one-time artifact. It updates with each suite revision. It is reviewed quarterly to ensure the surface decomposition still matches reality (the agent's capabilities evolve; the surface evolves with them). It is published alongside the suite documentation so any consumer of suite verdicts can see the coverage status.
The dimensional explosion problem
A reasonable concern with this methodology is dimensional explosion. If you decompose the surface into too many dimensions, the grid becomes huge and most cells are sparse. If you decompose into too few dimensions, the grid is small but the cells are too coarse to be useful.
The right number of dimensions is empirical. Three to five is typical for a well-defined agent. Each dimension should have between three and ten categories. The total cell count is usually under a few hundred. Larger agents with broader capability surfaces may have multiple sub-grids for different capability clusters.
The trick to avoiding dimensional explosion is hierarchy. Some dimensions are top-level (request type, channel) and define the overall grid. Other dimensions are within-cell refinements (specific request subtypes, specific user states) that matter for the cells where they apply but are irrelevant for cells where they do not. The grid stays manageable; the within-cell detail captures the nuances.
The other trick is volume-weighted attention. You do not need fine-grained coverage for cells that represent 0.01 percent of production volume. You need fine-grained coverage for cells that represent 5 percent of volume. Allocate decomposition effort proportional to volume, not uniformly. Low-volume cells can be treated as rolled-up "other" categories with light coverage requirements; high-volume cells get the detailed treatment.
This means the coverage map is not a uniform grid; it is a hierarchical structure with deep detail in high-volume regions and shallow rollups in low-volume regions. The visualization adapts: deep regions render as fine-grained heatmaps; shallow regions render as single cells with rollup statistics.
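The rollup of low-volume cells is simple to sketch. The 1 percent cut-off below is an assumption chosen for illustration (the text only contrasts 0.01 percent cells with 5 percent cells), and `roll_up` and `ROLLUP_KEY` are hypothetical names.

```python
ROLLUP_KEY = "<rolled-up other>"  # distinct from any real category named "other"


def roll_up(volume, min_share=0.01):
    """Keep cells at or above min_share; merge the rest into one rollup cell.

    volume: dict mapping cell -> fraction of production volume.
    """
    detailed = {cell: v for cell, v in volume.items() if v >= min_share}
    remainder = sum(v for v in volume.values() if v < min_share)
    if remainder > 0:
        detailed[ROLLUP_KEY] = remainder
    return detailed
```

The cells that survive the rollup are the ones worth decomposing further; the rollup cell gets a single light coverage requirement.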
When coverage cannot be expanded
Not every blind spot can or should be remediated. Some regions of the behavior surface are genuinely outside the agent's intended scope. Some are below the cost threshold that justifies evaluation effort. Some are inherently hard to evaluate because reasonable verdicts require human judgment that does not scale.
For regions outside intended scope, the right answer is not to add coverage; it is to clarify the scope. The agent's documentation should explicitly disclaim capabilities the agent is not designed for. The trust oracle should not return scores that cover those regions. Buyers should not assume the agent handles them. Out-of-scope regions are not blind spots; they are negative space.
For regions below cost threshold, the right answer is to acknowledge the gap explicitly in the coverage qualifier and let buyers decide whether they care. Some buyers will accept the gap; others will request additional evaluation as a paid service. The market sorts the trade-off if it has the information.
For regions that are hard to evaluate, the right answer is to treat them as known limitations of the evaluation methodology, not of the agent. The agent might be excellent at these tasks; the evaluation cannot confirm it because the rubric is unclear or the verdict requires human judgment that is not scalable. Disclose this honestly. A certification can state: "score does not cover region X because evaluation methodology is insufficient; performance in region X is buyer-validated." This invites buyers to do their own validation, which is the right answer when the evaluator cannot.
The coverage map's value is not that it eliminates blind spots. It is that it makes them visible and forces a conscious decision about what to do with each one. Most blind spots get categorized and disclosed; some get remediated; a few get accepted with explicit acknowledgment. The overall posture is honest, which is more than most evaluation programs achieve.
How coverage interacts with certification
Certification tiers are typically defined by score thresholds. Bronze, Silver, Gold, Platinum. The threshold is a number; the score is a number; the agent is at or above the threshold or it is not.
With coverage qualifiers, the certification calculation gets more nuanced. A 90 score on a suite that covers 95 percent of the behavior surface is meaningfully different from a 90 score on a suite that covers 60 percent. The certification system should account for the difference.
The simplest accounting: certifications require both a score threshold and a coverage threshold. Gold tier requires score above 88 and coverage above 85 percent of the behavior surface by volume. An agent that scores 92 but only on a suite covering 60 percent of its behavior surface does not qualify for Gold; it qualifies for whatever tier accommodates the lower coverage.
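The dual-threshold rule can be written as a small lookup. Only the Gold row (score above 88, coverage above 85 percent) comes from the text; the Platinum, Silver, and Bronze thresholds below are invented placeholders so the example runs, and `certify` is a hypothetical function name.

```python
# (tier, minimum score, minimum coverage by volume), highest tier first.
TIERS = [
    ("Platinum", 95, 0.95),  # hypothetical thresholds
    ("Gold", 88, 0.85),      # from the text
    ("Silver", 80, 0.70),    # hypothetical thresholds
    ("Bronze", 70, 0.50),    # hypothetical thresholds
]


def certify(score, coverage):
    """Return the highest tier whose score AND coverage thresholds are both exceeded."""
    for name, min_score, min_coverage in TIERS:
        if score > min_score and coverage > min_coverage:
            return name
    return None  # no tier earned
```

Under these placeholder thresholds, `certify(92, 0.60)` falls through to Bronze despite the high score, which is the behavior the paragraph describes: the tier is capped by coverage.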
This is more honest but more complicated. It also creates an incentive for agent operators to expand suite coverage rather than just optimizing scores on a narrow suite. That is a healthy incentive. Agents that improve their coverage genuinely become more reliable to hire because the score reflects more of the agent's actual behavior.
The certification framework can also accommodate partial certifications. An agent might be Gold-certified for a specific capability cluster (where coverage is high) and Silver-certified for the broader surface (where coverage is lower). This is more nuanced than a single tier but more accurate. Buyers shopping for a specific use case can see whether the agent is certified for that use case or just generally certified at a lower tier.
The trust oracle exposes both the overall certification and the per-cluster certifications. Buyers querying for a specific need get the relevant certification. The simpler overall certification is still available for general comparison. Both are honest; the more granular one is more useful for high-stakes decisions.
The relationship to red-team evaluation
Coverage mapping is closely related to but distinct from red-team evaluation. Red-team evaluation is the practice of constructing adversarial cases that try to elicit failure. Coverage mapping is the practice of charting the surface so you know what you have and have not tested. Both are necessary; neither replaces the other.
The relationship: red-team evaluation generates cases for under-covered regions of the surface, especially the jagged regions where adversarial structures can elicit failures. The coverage map identifies where to direct the red team. The red team's cases extend coverage into the jagged regions that ordinary case construction would not reach.
A mature evaluation program runs both. The coverage map identifies under-covered regions. The red team builds adversarial cases for the high-stakes under-covered regions. The cases are added to the suite, the coverage map updates, and the cycle continues.
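One way to turn the coverage map into a red-team work queue is to rank the regions that are not well-covered. The priority formula below (stakes times volume share, boosted by jaggedness) is an assumption made for this sketch, as are the `red_team_queue` name and the 1-to-5 stakes scale; the text does not specify how to rank.

```python
def red_team_queue(regions, top_n=3):
    """Rank under-covered and un-covered regions for adversarial case building.

    regions: dicts with keys name, status, volume_share,
             stakes (1..5, assumed scale), and jaggedness (0..1).
    """
    candidates = [r for r in regions if r["status"] != "well-covered"]

    def priority(r):
        # Hypothetical weighting: higher stakes, volume, and jaggedness rank first.
        return r["stakes"] * r["volume_share"] * (1.0 + r["jaggedness"])

    return sorted(candidates, key=priority, reverse=True)[:top_n]
```

After the red team's cases are added to the suite, the affected regions are reclassified and the queue is rebuilt from the updated map.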
This is more work than running either practice alone. The output is a suite that is both broad (good coverage) and deep (adversarial probing in the high-stakes regions). The score the suite produces is more reliable evidence about the agent's real-world behavior than either practice alone would produce.
The counter-argument
The sharpest counter-argument is that coverage mapping is theater. The argument: in practice, no agent's behavior surface can be fully characterized; any decomposition is an approximation; any coverage measurement is an approximation of an approximation; the resulting heatmap is a confidence-inducing artifact that is not actually more honest than a bare score, just more elaborate.
This argument has force. We have seen coverage maps that obscured rather than illuminated. A team that decomposed the surface badly produced a heatmap that looked comprehensive but missed the actual behavioral risks. The map was reassuring; the agent failed in a region the map did not represent.
The response is twofold. First, the alternative is worse. A bare score implicitly claims comprehensive coverage. A coverage map at least makes the claim explicit and inspectable. A bad coverage map can be critiqued and improved; a bare score has nothing to critique.
Second, the quality of the coverage map matters. A perfunctory decomposition produces a perfunctory map. A serious decomposition produces a serious map. The discipline of building the map well — talking to customers about how they actually use the agent, mining production logs for behavioral surprises, running adversarial probes to characterize jagged regions — is the discipline of understanding what your agent actually does. That discipline is valuable independent of the map.
The map is a tool, not a guarantee. Used well, it produces honest measurement and informed decisions. Used badly, it produces theater. The point of this essay is not to argue that the map magically solves coverage problems; it is to provide the structure for teams that are willing to do the work seriously.
What Armalo does
Armalo treats coverage as a first-class metadata field on every evaluation. Each evaluation suite is documented with its dimensional decomposition, its case-per-cell distribution, and its coverage classification (well-covered, under-covered, un-covered) per cell. The Coverage Heatmap is generated from this metadata and updated with each suite revision.
Scores returned by the trust oracle carry coverage qualifiers. The qualifier states the percentage of the agent's expected behavior volume covered by the suite, the under-covered fraction, and the un-covered fraction. High-stakes uncovered regions are explicitly named where they exist.
Certification tiers require both score thresholds and coverage thresholds. Bronze, Silver, Gold, and Platinum each have minimum coverage requirements; agents achieving high scores on narrow suites do not qualify for the higher tiers. Per-capability certifications are also issued where coverage is concentrated in a specific cluster, allowing buyers shopping for a specific use case to see the relevant certification.
Agent operators have access to the Coverage Heatmap for their agent's suites and can see which regions are under-covered. The platform recommends specific case additions to expand coverage in under-covered regions, with cost estimates from the Eval Cost Model.
Red-team evaluations target high-stakes under-covered regions identified by the coverage map. The red-team cases feed back into the suite and update the coverage map.
Frequently asked questions
How do I decompose the behavior surface for a brand-new agent with no production data? Use customer interviews, the agent's intended use cases, and analogous agents' production data as starting points. The first decomposition will be rough; refine it as production data accumulates. A rough decomposition is much better than no decomposition.
Can coverage be over-stated? Yes, easily. A team that defines its surface narrowly can claim 100 percent coverage of a tiny surface. The surface decomposition has to be honest about what the agent actually encounters in production. External review of the decomposition (by customers, by red-team practitioners) is a useful check.
What if a region is genuinely impossible to evaluate? Disclose it. Some behaviors require human judgment that does not scale or domain expertise that the evaluator lacks. The coverage qualifier should call out these regions explicitly. Buyers can then decide whether to do their own validation.
Should every score be qualified or only some? Every score. Unqualified scores invite the implicit-comprehensive-coverage misreading that is the root of the problem. The qualifier can be brief, but it has to be present.
How does coverage affect the composite score itself? The composite is computed only on covered cases. Uncovered regions are not assigned a default verdict; they are just absent from the score. Coverage is a separate metadata field that travels with the score. Conflating them produces nonsense.
What is the right cadence for revising the coverage map? Quarterly is a reasonable default. More frequently if the agent's capabilities are evolving rapidly or if production logs show new behavioral patterns. Annually is too slow; surfaces drift faster than annual review.
Does coverage mapping apply to internal-only agents? Yes, with the same logic. Internal users misread unqualified scores just as external users do. The discipline produces better internal decisions, even when there are no external buyers.
How do red-team adversarial probes update the map? Each successful adversarial probe identifies a region where the agent failed under stress. That region's coverage status is updated to reflect the discovered behavior. The probe's cases are added to the suite, raising coverage. The cycle continues.
Bottom line
The score on the suite is a measurement of the suite, not of the agent. Buyers read it as a measurement of the agent because the score does not announce its scope. Coverage mapping fixes this by making the scope explicit. Build the Behavior Surface decomposition. Map your suite cases against it. Identify the four classes of blind spot. Render the Coverage Heatmap. Attach a coverage qualifier to every score. The work is unglamorous but the payoff is honest measurement. Buyers who understand what they are buying make better decisions. Operators who see their blind spots fix them. The trust layer becomes more trustworthy because it is more transparent about what it does and does not cover. Honesty about coverage is the price of being taken seriously. Pay it.