Capability-Specific Trust: Why A Single Number Hides The Failures You Care About
An agent's composite averages over capabilities. It might be 920 at refunds and 480 at policy. The composite hides the weakness. Hire on the job, not the average.
TL;DR
A composite trust score of 760 might be the volume-weighted average of an agent that is 920 at refunds and 480 at policy questions. If the job you are hiring for is policy questions, you just hired a 480 and the composite told you to feel good about it. The single number was engineered to make the agent legible at a glance, and the legibility comes at the cost of hiding the capability-level failures that actually determine whether the agent will do the job in front of it. This essay makes the case for capability-decomposed trust, explains how Armalo computes per-capability composites and the cross-capability transfer factor, introduces the Capability-Trust Heatmap as a named artifact, and lays out the hiring practice that asks for the score for the specific job at hand, not the composite.
Intro: The Refunds Agent That Could Not Answer Policy Questions
A support operations lead at a mid-market SaaS company described a hiring failure to us last quarter. They had brought on a Gold-tier customer support agent through an agent marketplace and routed it to handle their entire ticket queue: refunds, policy questions, account changes, technical troubleshooting, billing disputes, the full mix. The composite score was 836. The certification was current. The volatility was acceptable. By every headline metric, the agent was a strong hire.
The first two weeks were fine. Then the policy questions started to escalate. The agent was confidently answering questions about the company's data retention policy with information that was wrong. The escalations were caught by a human review queue and the customers were soothed, but the trust the operations lead had built with their internal compliance team evaporated overnight. The compliance team did not want to know about the composite score. They wanted to know why a Gold-tier certified agent was confabulating policy answers.
When we pulled the agent's per-capability breakdown, the answer was visible in seconds. The agent's per-capability composite for refund processing was 941. For account changes, 887. For billing disputes, 854. For technical troubleshooting, 798. For policy questions, 462. The 836 composite was the volume-weighted average of those five capabilities, and the policy weakness was averaged into invisibility because the agent's historical workload was 90 percent refunds and account changes. The agent was not a Gold-tier agent for the job they had been hired to do. The agent was a Platinum-tier refund agent and a sub-Bronze policy agent, and the composite hid the structure that mattered.
This is not an edge case. It is the structural failure mode of any reputation system that compresses heterogeneous capability evidence into a single scalar. The composite was engineered as a procurement-speed primitive: one number, three seconds, decision made. The decision-making it supports is the decision of whether to engage the agent at all. The decision-making it does not support is the decision of whether to engage the agent for this specific task. Those are different decisions. They require different evidence. The agent economy has been operating as if they were the same.
The argument of this essay is that capability-specific trust is the operationally correct unit of trust for any decision more granular than initial market entry. The composite remains useful as a tile-level summary, the way a Yelp star rating is useful for picking a restaurant out of a list. But once you are deciding what to order, the star rating gives way to the dish-level reviews, and the absence of dish-level reviews would itself be disqualifying. The same has to be true for agents. The composite gets the agent into the marketplace. The capability decomposition gets the agent the job. The rest of this piece walks through how the decomposition works, how to build the artifact that makes it usable, and why the entire procurement workflow needs to be restructured around it.
Why Composites Hide Capability Failures By Construction
The Armalo composite is a weighted blend of twelve dimensions: accuracy at 14 percent, Metacal at 9, reliability at 13, safety at 11, security at 8, bond at 8, latency at 8, scope-honesty at 7, cost-efficiency at 7, model-compliance at 5, runtime-compliance at 5, and harness-stability at 5. Each of those dimensions is itself a weighted blend across the agent's capability mix. Accuracy on refunds is one input to the agent's accuracy dimension. Accuracy on policy questions is another. The dimension-level number you see on the composite breakdown is the volume-weighted average of accuracy across all capabilities the agent has been evaluated on.
This is mathematically clean and operationally lossy. The dimension-level number tells you how the agent does on accuracy across its workload, not how it does on accuracy in any specific capability. If the workload is heavily skewed toward a few capabilities, the dimension number is dominated by their evidence, and weakness in lighter-trafficked capabilities disappears into the average. This is what happened to the support agent above: with 90 percent of its traffic in refunds and account changes, the accuracy dimension was 90 percent driven by accuracy on those two capabilities, and the policy weakness was at most a 10 percent contribution to a number reported on a 1000-point scale.
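A minimal sketch of the masking effect, using the two scores from the TL;DR. The ticket volumes are hypothetical, chosen only so that the volume-weighted average lands on 760; they are not taken from any real agent.

```python
# Per-capability scores from the TL;DR example.
capability_scores = {"refunds": 920, "policy_questions": 480}

# Hypothetical ticket volumes (an assumption for illustration).
ticket_volume = {"refunds": 700, "policy_questions": 400}


def volume_weighted_composite(scores, volume):
    """Average per-capability scores, weighting each by its traffic volume."""
    total = sum(volume.values())
    return sum(scores[cap] * volume[cap] for cap in scores) / total


composite = volume_weighted_composite(capability_scores, ticket_volume)
weakest = min(capability_scores, key=capability_scores.get)

print(f"composite: {composite:.0f}")
print(f"weakest capability: {weakest} at {capability_scores[weakest]}")
```

The headline number sits 280 points above the score for the capability a policy-focused buyer would actually be hiring for.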
The second layer of compression comes from the dimensional weighting. The 12-dimension blend is itself an averaging step. Even if the per-dimension numbers preserved capability structure, the act of blending across dimensions averages a strong capability in one dimension against a weak capability in another and produces a number that does not correspond to the agent's behavior in any specific situation. An agent that is fast on everything and accurate on nothing is a different agent from one that is slow on everything and accurate on most things, but both can land on the same composite if the dimensional weights happen to balance.
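The dimensional averaging can be illustrated with the weights listed above. The two agent profiles below are hypothetical, and the blend is assumed to be a plain weighted sum of 0-1000 dimension scores.

```python
# Dimension weights in percent, as listed in the essay; they sum to 100.
WEIGHTS = {
    "accuracy": 14, "metacal": 9, "reliability": 13, "safety": 11,
    "security": 8, "bond": 8, "latency": 8, "scope_honesty": 7,
    "cost_efficiency": 7, "model_compliance": 5, "runtime_compliance": 5,
    "harness_stability": 5,
}


def composite(dimension_scores):
    """Blend 0-1000 dimension scores using the percentage weights."""
    return sum(WEIGHTS[d] * dimension_scores[d] for d in WEIGHTS) / 100


# Two hypothetical agents that differ only in accuracy and latency.
baseline = {d: 800 for d in WEIGHTS}
fast_but_inaccurate = {**baseline, "accuracy": 600, "latency": 1000}
slow_but_accurate = {**baseline, "accuracy": 1000, "latency": 300}

print(composite(fast_but_inaccurate), composite(slow_but_accurate))
```

Both profiles blend to the same composite even though they would behave very differently on an accuracy-critical job.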
The third layer is workload mix change over time. The composite as published reflects the agent's recent workload. If the workload mix shifts, the composite drifts even if no per-capability number changes. An agent whose composite is rising might simply be getting more traffic in capabilities it is strong in, while its weak capabilities are being routed away by the marketplace and silently dropping out of the evaluation set. The composite tells you what the agent is being measured on. It does not tell you what the agent could do if you put it on a different workload.
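The drift can be sketched by holding the per-capability scores fixed and changing only the traffic mix. Both mixes are hypothetical.

```python
# Per-capability scores, held fixed across both periods.
scores = {"refunds": 920, "policy_questions": 480}


def composite(scores, volume):
    """Volume-weighted average of per-capability scores."""
    total = sum(volume.values())
    return sum(scores[cap] * volume[cap] for cap in scores) / total


# Hypothetical traffic mixes: policy tickets get routed away over time.
earlier_mix = {"refunds": 700, "policy_questions": 400}
later_mix = {"refunds": 1000, "policy_questions": 100}

before = composite(scores, earlier_mix)
after = composite(scores, later_mix)
print(f"{before:.0f} -> {after:.0f}")
```

The composite rises by more than a hundred points while the agent's ability on policy questions has not changed at all.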
The combined effect of these three layers is that the composite is a marketing number that is approximately right in expectation and substantially wrong for any specific decision. It is the right summary for a buyer who wants to know whether the agent is broadly worth engaging. It is the wrong summary for a buyer who wants to know whether the agent will do the specific job in front of them. The structural fix is to expose the underlying capability decomposition and to make procurement decisions against it. That is the rest of this essay.
Defining Capability As The Unit Of Decomposition
The word capability is doing a lot of work and needs definition. We use capability to mean a discrete class of task that the agent can be asked to perform, defined narrowly enough that performance on one instance of the class is predictive of performance on other instances. Refund processing is a capability. Policy question answering is a capability. Account password reset is a capability. Technical troubleshooting for a specific product is a capability. The capability boundary is set by the prediction question: if I know the agent is good at task X, what other tasks Y am I confident the agent will also be good at? The set of Ys for which the answer is yes defines the capability.
This definition is empirical, not categorical. It is not about the surface form of the task or the type of input. It is about the observed correlation in performance. Two tasks are part of the same capability if performance on one transfers to performance on the other. This means capabilities are discovered, not declared. We start with a candidate taxonomy, evaluate the agent on a representative sample within each candidate capability, and check whether the within-capability variance is small relative to the between-capability variance. If it is, the capability is well-defined for that agent. If it is not, the capability is too broad and needs to be split.
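One way to sketch the variance check is a simple ratio of within-capability variance to between-capability variance. The evaluation scores and the cutoff below are hypothetical; the essay does not state which statistic or threshold is used in practice.

```python
from statistics import mean, pvariance


def variance_ratio(scores_by_capability):
    """Mean within-capability variance divided by between-capability variance."""
    groups = list(scores_by_capability.values())
    within = mean([pvariance(g) for g in groups])
    between = pvariance([mean(g) for g in groups])
    return within / between


# Hypothetical evaluation scores for one agent under two candidate taxonomies.
well_defined = {
    "refunds": [930, 940, 950],
    "policy_questions": [450, 460, 470],
}
too_broad = {
    # Mixes simple refunds with refunds that need policy interpretation.
    "refunds": [940, 950, 520, 530],
    "policy_questions": [450, 460, 470, 460],
}

SPLIT_THRESHOLD = 0.25  # assumed cutoff, for illustration only

for name, taxonomy in [("well_defined", well_defined), ("too_broad", too_broad)]:
    ratio = variance_ratio(taxonomy)
    verdict = "keep" if ratio < SPLIT_THRESHOLD else "split further"
    print(f"{name}: ratio={ratio:.3f} -> {verdict}")
```

In the second taxonomy the spread inside "refunds" is as large as the gap between capabilities, which is the signal that the capability should be split.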
The practical implication is that two agents may have different capability decompositions even if they appear to do similar work. One agent might handle all refund tasks as a single capability with consistent performance. Another agent might be strong at simple refunds but weak at refunds requiring policy interpretation, and the right decomposition for that agent is to split refund processing into two sub-capabilities. The decomposition is agent-specific because the underlying skill structure is agent-specific. Forcing all agents into the same taxonomy would lose information.
The registry of capabilities is open. New capabilities can be defined by any operator, evaluated, and added to the agent's profile. The Trust Oracle exposes the full capability list for each agent, with per-capability scores. There is no central authority deciding which capabilities count. There is, however, capability matching: when a buyer queries the oracle for an agent's score on a specific job, the system attempts to match the job to the closest capability in the agent's profile and returns the per-capability score for that match, plus a match-confidence indicator. If the job is far from any capability the agent has been evaluated on, the match-confidence is low and the buyer is warned that the available evidence is weak.
Defining capability empirically rather than categorically is the move that makes the rest of the framework work. It replaces the question of what category the task belongs to with the question of what the evidence supports. It allows the system to surface genuine subspecialization within agents that look generic from the outside. And it gives the buyer a defensible answer to the question of whether the agent has actually been tested on something resembling the job at hand. That answer is what the composite cannot give and what the capability decomposition can.
The Per-Capability Composite
With capabilities defined, the per-capability composite is the application of the same 12-dimension scoring formula to the evidence within a single capability. Accuracy for that capability. Reliability for that capability. Latency, safety, security, bond, scope-honesty, cost-efficiency, model-compliance, runtime-compliance, harness-stability, Metacal, all evaluated against the evidence base for that capability alone. The output is a per-capability composite on the same 0-1000 scale as the global composite, directly comparable across capabilities for the same agent and across agents for the same capability.
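A rough sketch of that computation, assuming evaluations arrive as (capability, dimension, score) tuples. The tuple shape, the sample data, and the choice to renormalise the weights when a dimension has no evidence are all assumptions for illustration, not a description of the production pipeline.

```python
from collections import defaultdict
from statistics import mean

# Dimension weights in percent, as listed in the essay.
WEIGHTS = {
    "accuracy": 14, "metacal": 9, "reliability": 13, "safety": 11,
    "security": 8, "bond": 8, "latency": 8, "scope_honesty": 7,
    "cost_efficiency": 7, "model_compliance": 5, "runtime_compliance": 5,
    "harness_stability": 5,
}


def per_capability_composites(evaluations):
    """Apply the same weighted blend to each capability's own evidence."""
    by_cell = defaultdict(list)
    for capability, dimension, score in evaluations:
        by_cell[(capability, dimension)].append(score)

    composites = {}
    for capability in {cap for cap, _ in by_cell}:
        dims = {
            d: mean(by_cell[(capability, d)])
            for d in WEIGHTS
            if (capability, d) in by_cell
        }
        # Assumption: renormalise over the dimensions that have evidence.
        covered = sum(WEIGHTS[d] for d in dims)
        composites[capability] = sum(WEIGHTS[d] * s for d, s in dims.items()) / covered
    return composites


# Hypothetical tagged evaluations.
evaluations = [("refunds", d, 940) for d in WEIGHTS]
evaluations.append(("refunds", "accuracy", 960))
evaluations += [("policy_questions", d, 460) for d in WEIGHTS]

result = per_capability_composites(evaluations)
print({cap: round(score, 1) for cap, score in sorted(result.items())})
```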
The arithmetic is straightforward. The conceptual leap is treating each capability as a first-class scored entity rather than as a contributor to a single global score. This requires the evaluation infrastructure to tag every evaluation with the capability being tested, requires the jury to score on a per-capability rubric where appropriate, and requires the score-history database to maintain per-capability time series so that volatility, drift, and regime classifications can be computed at the capability level. We have built all of this. The cost is roughly 4x the storage of a global-only system and roughly 2x the compute, and it is straightforwardly worth it because the resulting decision-support is substantially better.
The per-capability composite changes the shape of the agent profile. Instead of a single number, the agent presents as a vector of capability scores. The vector might be a flat plateau, indicating an agent that is broadly strong across its evaluated capabilities. It might be a few sharp peaks, indicating a specialist agent that is excellent at a narrow set of things and untested or weak elsewhere. It might be jagged, indicating an agent with idiosyncratic strengths and weaknesses that need careful matching to the job. Each shape is informative and each leads to a different procurement decision.
The per-capability composite also changes how certification works. Bronze through Platinum tiers can be applied at the capability level, not just the global level. An agent can be Platinum at refunds and Silver at policy questions and the certification reflects the actual evidence rather than averaging the strong evidence with the weak. This is more honest and operationally more useful. A buyer looking for refund work can filter on Platinum-at-refunds, get the agents whose evidence actually supports that level for that capability, and ignore the agents whose global Platinum is dragged down by some weakness the buyer does not care about.
The practical procurement workflow against per-capability composites is to query the oracle for the score on the specific capability you are hiring for, not the global composite. The match-confidence indicator tells you whether the evidence is strong enough to trust the score. The per-capability volatility tells you whether the score is stable. The per-capability tier gives you the procurement-grade summary for the capability. All of this is on the same Trust Oracle endpoint that returns the global composite; the buyer just needs to ask for the right field.
The Cross-Capability Transfer Factor
The per-capability composite is necessary but not sufficient. A real agent profile has gaps. An agent will be evaluated on some capabilities and not others, and the buyer will inevitably want to hire the agent for a capability that is in a gap. The naive answer is that the agent has not been evaluated on this capability and the buyer should not hire them for it. The honest answer is that performance on adjacent capabilities is informative about performance on the gap capability, and the question is how much.
The cross-capability transfer factor is the empirically estimated coefficient that relates performance on one capability to expected performance on another. If the agent is at 940 on refund processing and the transfer factor from refund processing to billing disputes is 0.78, the expected performance on billing disputes (given no direct evidence) is 940 times 0.78 plus a baseline term, which works out to a rough estimate in the upper-700s. The transfer factor is estimated from fleet-wide data: across all agents that have been evaluated on both capabilities, what is the average ratio of performance on the second capability to performance on the first?
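The estimate can be sketched as follows. The fleet data is hypothetical, chosen so the mean ratio comes out near 0.78. The essay does not give a value for the baseline term, so it is left as a parameter defaulting to zero.

```python
from statistics import mean


def transfer_factor(paired_scores):
    """Mean ratio of target score to source score, over agents evaluated on both."""
    return mean([target / source for source, target in paired_scores])


def inferred_score(source_score, factor, baseline=0.0):
    """Expected target-capability score when there is no direct evidence."""
    return source_score * factor + baseline


# Hypothetical fleet data: (refund score, billing-dispute score) per agent.
fleet = [(1000, 800), (500, 380), (900, 702)]

factor = transfer_factor(fleet)
estimate = inferred_score(940, factor)
print(f"factor={factor:.2f}, estimate before baseline={estimate:.1f}")
```

With a zero baseline the product alone is in the low 730s; the upper-700s figure quoted above depends on whatever baseline term the system adds.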
This is closely related to a cross-capability correlation, but estimated as a ratio at the population level rather than for the individual agent. It captures the fact that some capability pairs are highly correlated across agents (an agent good at refunds tends to be good at billing disputes) while others are weakly correlated (an agent good at refunds may or may not be good at policy questions). The transfer factor is a population-level signal about how much you can extrapolate from one capability to another.
The transfer factor is presented to the buyer as part of the agent profile when they query for a capability the agent has not been directly evaluated on. The system returns an inferred per-capability composite plus an explicit confidence band, computed from the transfer factor and the strength of the source evidence. This is honest about the uncertainty. The agent's score on this capability is unknown; the inference is the best available estimate; the confidence band tells the buyer how much to trust it. The buyer can decide whether the inference is strong enough to support the hire or whether they want to commission a direct evaluation first.
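One plausible way to build such a band is from the standard error of the fleet's ratios. This is an illustrative assumption, not the documented method: it ignores uncertainty in the source score itself and omits the baseline term, and the ratios are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def inferred_band(source_score, ratios, z=1.96):
    """Return (low, point, high) for the inferred score.

    Uses a normal approximation on the mean target/source ratio.
    """
    factor = mean(ratios)
    std_err = stdev(ratios) / sqrt(len(ratios))
    point = source_score * factor
    half_width = source_score * z * std_err
    return point - half_width, point, point + half_width


# Hypothetical target/source ratios from agents evaluated on both capabilities.
tight_evidence = [0.80, 0.76, 0.78, 0.74, 0.82]
loose_evidence = [0.98, 0.58, 0.78, 0.50, 1.06]

for label, ratios in [("tight", tight_evidence), ("loose", loose_evidence)]:
    low, point, high = inferred_band(940, ratios)
    print(f"{label}: {low:.0f} to {high:.0f} (point {point:.0f})")
```

Both samples give the same point estimate, but the wider band on the second tells the buyer that a direct evaluation is probably worth commissioning.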
The transfer factor is also the input to a routing recommendation. When a buyer queries for a job that no available agent has been directly evaluated on, the system can rank candidate agents by the transfer-inferred score, weight by the confidence band, and present a ranked list with the inference uncertainty visible. This is a substantial improvement over either refusing to make a recommendation or recommending purely on the global composite. It uses the actual structure of the available evidence and is honest about its limits.
The deeper observation about transfer factors is that they expose the structure of capability relatedness in the agent fleet. Capabilities that are highly inter-correlated across agents tend to be capabilities that share underlying skills, like reading comprehension or domain knowledge. Capabilities that are weakly correlated tend to be capabilities that depend on different underlying competencies. Mapping the transfer-factor matrix gives you a kind of skill topology of the agent ecosystem, and that topology is itself useful: it tells operators which capabilities to evaluate first to maximize the inference power for the rest, it tells buyers which capability proxies to look at when direct evidence is unavailable, and it tells the marketplace how to surface adjacent specialization opportunities.
The Capability-Trust Heatmap: A Named Artifact
The artifact that makes capability-specific trust usable is the Capability-Trust Heatmap. It is a two-dimensional matrix with the agent's capabilities on the rows and the 12 scoring dimensions on the columns. Each cell contains the per-capability per-dimension score, color-coded from red through yellow to green. A glance at the heatmap tells you which capabilities are strong and which are weak, and within each capability which dimensions are the limiting factor.
The heatmap is denser information than any single number can be, and it remains glanceable because the human visual system is good at color-coded matrices. A buyer looking at the heatmap for the support agent in the opening anecdote would have seen four mostly-green rows for refunds, account changes, billing disputes, and technical troubleshooting, and one row for policy questions that was red across most of the dimensions. The composite of 836 would not have changed their mind. The heatmap would have. It is the difference between a salary number and a financial statement; the salary tells you the headline, the statement tells you whether the headline is supported.
A well-constructed Capability-Trust Heatmap has four design properties. First, the rows are sorted by the agent's volume in each capability so the buyer sees the most-tested capabilities at the top and the least-tested at the bottom. Second, untested or low-confidence cells are visually distinct from low-score cells, so the buyer can see the difference between weak performance and absent evidence. Third, the heatmap supports an overlay of the buyer's job requirements: the buyer can specify which capabilities and which dimensions matter for their use case, and the heatmap highlights the relevant cells while dimming the irrelevant ones. Fourth, the heatmap is interactive: clicking on a cell drills into the underlying evaluation history for that capability and dimension.
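The first three properties can be sketched as a small data transformation. The colour thresholds, the field layout, and the sample scores are hypothetical, and only three of the twelve dimensions are shown to keep the example short.

```python
def bucket(score):
    """Map a 0-1000 score to a colour bucket; None means no evidence."""
    if score is None:
        return "untested"  # kept distinct from a low score
    if score >= 800:       # assumed threshold
        return "green"
    if score >= 600:       # assumed threshold
        return "yellow"
    return "red"


def build_heatmap(cells, volume, relevant=None):
    """Rows sorted by descending volume; each cell is (bucket, highlighted)."""
    rows = []
    for capability in sorted(volume, key=volume.get, reverse=True):
        row = {}
        for dimension, score in cells[capability].items():
            highlighted = relevant is None or (capability, dimension) in relevant
            row[dimension] = (bucket(score), highlighted)
        rows.append((capability, row))
    return rows


# Hypothetical per-capability, per-dimension scores and traffic volumes.
cells = {
    "policy_questions": {"accuracy": 420, "latency": 880, "safety": None},
    "refunds": {"accuracy": 950, "latency": 900, "safety": 930},
}
volume = {"policy_questions": 400, "refunds": 700}

# Buyer overlay: only accuracy on policy questions matters for this job.
relevant = {("policy_questions", "accuracy")}

heatmap = build_heatmap(cells, volume, relevant)
for capability, row in heatmap:
    print(capability, row)
```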
The heatmap is not a replacement for the composite or the per-capability composite. It is a complement. The composite is the marketplace tile. The per-capability composite is the procurement summary. The heatmap is the diligence document. A buyer making a low-stakes decision uses the composite. A buyer making a moderate-stakes decision uses the per-capability composite for the relevant capability. A buyer making a high-stakes decision pulls the full heatmap and reviews it for the specific capability and dimension combinations their job requires. The hierarchy of detail matches the hierarchy of decision stakes.
The Capability-Trust Heatmap is exposed as a generated artifact from the Trust Oracle endpoint, with a JSON schema for programmatic consumption and an SVG rendering for human review. The schema is documented and stable. Buyers can render the heatmap in their own UIs, embed it in their procurement reviews, attach it to internal hiring documentation. The artifact is portable, the data behind it is verifiable through the signed snapshot mechanism, and the rendering is consistent across consumers. The heatmap becomes a shared object in the procurement vocabulary, the way a financial statement is a shared object in commercial procurement.
Hiring On The Capability, Not The Composite
The procurement workflow change implied by capability-specific trust is concrete. Three steps should replace the current single-step composite check. First, define the job in capability terms. Second, query for capability-specific evidence. Third, contract on capability-specific predicates.
Define the job in capability terms. The buyer's first move is no longer to look at agents but to articulate what they need the agent to do. The job is decomposed into the capabilities it requires and the relative importance of each. A support hire might be 60 percent refunds, 20 percent account changes, 15 percent billing disputes, 5 percent policy questions. A technical hire might be 80 percent code review and 20 percent documentation generation. The decomposition is what allows the buyer to evaluate candidates against the actual shape of the work rather than against a generic notion of competence.
This step is the one most buyers skip, and the cost of skipping it is high. Without an explicit job decomposition, the buyer falls back on the composite by default because the composite is the only available signal that does not require thinking. The decomposition takes thirty minutes and pays off across every hire the buyer makes for similar jobs. It also forces the buyer to confront the structure of the work, which often reveals capability requirements they had not previously named and would have discovered the hard way through agent failure.
Query for capability-specific evidence. With the decomposition in hand, the buyer queries the Trust Oracle for the relevant per-capability scores rather than the global composite. The query returns the per-capability composite for each capability the agent has been evaluated on, plus transfer-factor inferences for capabilities they have not. The buyer ranks candidates by the weighted sum of capability scores against the job decomposition, with weights matching the relative importance assigned in step one. The agent at the top of the ranking is the agent whose evidence best matches the actual work, not the agent with the highest headline number.
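The ranking step can be sketched with the support-job decomposition from step one. The first agent reuses the scores from the opening anecdote; the second agent and the policy-heavy job are hypothetical. Capabilities with no direct evidence would need a transfer-inferred score, which this sketch does not handle.

```python
def job_fit(scores, job):
    """Weighted sum of capability scores; job weights are percentages."""
    return sum(weight * scores[cap] for cap, weight in job.items()) / 100


def rank(candidates, job):
    """Agent names ordered from best to worst fit for the job."""
    return sorted(candidates, key=lambda name: job_fit(candidates[name], job), reverse=True)


# Job decomposition from step one, plus a hypothetical policy-heavy job.
support_job = {"refunds": 60, "account_changes": 20, "billing_disputes": 15, "policy_questions": 5}
policy_job = {"policy_questions": 70, "refunds": 30}

candidates = {
    # Scores from the opening anecdote.
    "specialist": {"refunds": 941, "account_changes": 887, "billing_disputes": 854, "policy_questions": 462},
    # Hypothetical flatter profile.
    "generalist": {"refunds": 820, "account_changes": 810, "billing_disputes": 800, "policy_questions": 790},
}

print("support job:", rank(candidates, support_job))
print("policy job:", rank(candidates, policy_job))
```

The same two agents swap places depending on the job, which is exactly the information a single headline number cannot carry.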
This is more work than checking the composite, and it is the right amount of work for any hire that matters. Buyers who do this regularly build internal tooling that automates the decomposition-to-query flow, and the cost per hire drops sharply after the first few. The Trust Oracle SDK includes helpers for the common queries, and the marketplace UI can present the workflow as a guided flow rather than as a raw search box. The tooling makes the discipline operational. The discipline is what produces the better hire.
Contract on capability-specific predicates. The pact mechanism supports per-capability predicates. Instead of a pact that says the agent maintains a 750+ composite for the duration of the engagement, the pact can say the agent maintains a 900+ score on refund processing and a 700+ score on policy questions for the duration. Each predicate is independently enforceable. Each is independently disputable. Each ties the contractual obligation to the specific capabilities the buyer cares about, rather than to a generic average that may be dominated by capabilities that are irrelevant to the engagement.
The escrow on Base L2 supports per-predicate release: a portion of the funds release on each capability-specific predicate being satisfied, with weights matching the original job decomposition. If the agent maintains performance on the high-weight capabilities and slips on the low-weight ones, most of the escrow releases. If the agent slips on the high-weight capabilities, the escrow is held. This is a more accurate alignment of payment with delivered value than the all-or-nothing release against the global composite, and it produces more honest contracting on both sides.
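The release arithmetic can be sketched off-chain as follows. This is not the escrow contract itself: the `Predicate` type, the 80/20 weights, and the escrow amount are hypothetical, and the thresholds are the ones from the example pact above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    capability: str
    min_score: int
    weight: int  # share of the escrow, in percent


def released_amount(escrow_total, predicates, observed):
    """Release each predicate's share of the escrow only if its threshold is met."""
    if sum(p.weight for p in predicates) != 100:
        raise ValueError("predicate weights must sum to 100")
    satisfied = sum(
        p.weight for p in predicates if observed[p.capability] >= p.min_score
    )
    return escrow_total * satisfied / 100


# Thresholds from the example pact; weights are assumed for illustration.
pact = [
    Predicate("refunds", min_score=900, weight=80),
    Predicate("policy_questions", min_score=700, weight=20),
]

# Agent holds on the high-weight capability and slips on the low-weight one.
print(released_amount(1000, pact, {"refunds": 941, "policy_questions": 462}))

# Agent slips on the high-weight capability instead.
print(released_amount(1000, pact, {"refunds": 880, "policy_questions": 720}))
```

Each predicate is checked on its own, so a dispute over one capability does not block the release tied to another.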
Counter-Argument: The Composite Exists For A Reason
The sharpest objection to capability-specific trust is that it relitigates a problem that single-number scoring already solved. The reason every successful reputation system in the wild reports a headline number is that buyers do not have time to compare matrices. The composite is not an oversimplification, the argument goes; it is a cognitive accommodation to the reality that procurement at scale requires fast decisions and fast decisions require summary statistics. Adding capability decomposition is a luxury for sophisticated buyers; the rest of the market will continue to use the composite because the composite is what they have time for.
This is a serious objection and it is partially correct. The composite is not going away, and we are not arguing that it should. The composite is the right summary for the marketplace tile, the right summary for cross-agent comparison at the discovery stage, the right summary for buyers whose stakes are low enough that the composite's compression is acceptable. None of this is in dispute. The question is what the buyer does after they have used the composite to narrow the field to a small set of candidates worth investigating further.
The argument of this essay is that the current procurement workflow is missing the second step. The buyer uses the composite to find candidates and then commits without ever looking at the capability decomposition because the decomposition is not surfaced or because the workflow does not require them to. The result is that the composite ends up doing work it was not designed for: not just discovery, but commitment. That overreach is the source of the failure mode in the opening anecdote and of every similar failure across the agent economy.
The operational fix is to make the capability decomposition cheap to access and visible at the right moment in the workflow. The Capability-Trust Heatmap is a glanceable artifact specifically to address the cognitive cost objection: the buyer does not need to compare matrices in their head; they look at the colors. The marketplace UI can present the heatmap inline at the diligence stage, after composite-based discovery has narrowed the field. The discipline does not require sophistication; it requires the right information to be in front of the buyer at the right moment, and that is a UX problem rather than a fundamental capacity-of-buyers problem.
The deeper response to the objection is that the composite has unintended consequences that the capability decomposition mitigates. Hiring on the composite teaches agents to optimize for the average, which means under-investing in capabilities they have low traffic in. This produces fleet-wide weakness in long-tail capabilities and concentrates competence in the high-volume capabilities. The cycle is self-reinforcing: low-traffic capabilities stay weak, buyers who need them get bad service, the market for them never develops. Hiring on the capability breaks the cycle by giving agents an incentive to develop capabilities the market needs even when the volume is low. This is good for the agent economy as a whole, and it is achievable only if the procurement signal is capability-specific.
The composite exists because legibility matters. The capability decomposition exists because matching matters more once the stakes are high enough. The two are complementary. The mature procurement workflow uses both. The current procurement workflow uses only the first, and the cost of that limitation is rising as the agent economy moves into higher-stakes use cases.
What Armalo Does
The Trust Oracle at /api/v1/trust/ exposes per-capability composites for every certified agent, alongside the global composite. The capability registry is open and capabilities are discovered empirically through evaluation patterns, not declared centrally. The cross-capability transfer factor is computed from fleet-wide correlation data and is exposed alongside per-capability scores so consumers can make inferences for capabilities the agent has not been directly evaluated on, with confidence bands attached. The Capability-Trust Heatmap is generated from the underlying data and is exposed both as JSON schema for programmatic consumption and as a rendered artifact for human review. Certification tiers (Bronze, Silver, Gold, Platinum) are now applied at the capability level in addition to the global level, and the marketplace supports filtering on per-capability tier rather than only on the global composite. Pacts created through the Pact Builder support per-capability predicates as first-class clauses, and the USDC escrow on Base L2 supports per-predicate release weighted by the buyer's job decomposition. The volatility regime detection from the volatility post is applied at the capability level as well, so the buyer can see whether per-capability performance is steady, drifting, oscillating, or undergoing regime change.
FAQ
How do you decide what counts as a capability? Capabilities are discovered empirically through evaluation patterns rather than declared categorically. A capability is well-defined when within-capability performance variance is small relative to between-capability variance. The registry is open and any operator can propose new capabilities, but only those that show clear evaluation signal become first-class entries on the agent profile.
What if my job does not match any of the agent's evaluated capabilities? The Trust Oracle will return a transfer-factor-inferred score with an explicit confidence band. If the inference is high-confidence (job is close to multiple evaluated capabilities), the score is reasonably trustworthy. If low-confidence, you should commission a direct evaluation before committing to a high-stakes engagement. The system tells you which case you are in.
Does capability decomposition slow down marketplace discovery? No. The composite remains the primary tile-level summary and is what powers initial filtering. The capability decomposition kicks in at the diligence stage, after composite-based discovery has narrowed the field. Two-stage workflow: composite for discovery, capability for commitment. The discovery stage is just as fast as before.
What happens to an agent that develops a new capability mid-engagement? New capabilities accumulate evaluation evidence and become first-class entries on the profile once the evidence base is sufficient. During the accumulation period, the capability shows up with low confidence, marked as in-evaluation. Agents are incentivized to develop capabilities the market is asking for even when their initial competence is weak, because the path to a high per-capability score is now visible.
How does the cross-capability transfer factor get computed? From fleet-wide data. We look at all agents that have been evaluated on both capabilities A and B, compute the average ratio of performance on B to performance on A, and use that ratio as the transfer factor with confidence intervals from the sample size and variance. The definition is the same in both directions, but the resulting factors are typically asymmetric in practice because some capabilities are subsets or supersets of others.
Can a buyer override the system's capability matching? Yes. The buyer specifies which capabilities they consider relevant to their job and the weights they assign to each. The system computes the weighted score against those specifications. The buyer can also override the system's inferred match between their job and the agent's evaluated capabilities if they have domain knowledge that the system lacks.
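The buyer-weighted score described above amounts to a weighted average over the capabilities the buyer names. A minimal sketch, assuming weights are normalized by their sum; the function name `weighted_job_score` and the sample scores are illustrative.

```python
def weighted_job_score(capability_scores, buyer_weights):
    """Weighted average of per-capability scores using buyer-supplied weights."""
    total_weight = sum(buyer_weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    missing = [c for c in buyer_weights if c not in capability_scores]
    if missing:
        raise KeyError(f"no evaluated score for: {missing}")
    weighted_sum = sum(
        capability_scores[c] * w for c, w in buyer_weights.items()
    )
    return weighted_sum / total_weight

scores = {"refunds": 920, "policy_questions": 480, "account_changes": 800}

# The same agent scored against two different jobs.
policy_heavy = weighted_job_score(scores, {"policy_questions": 3, "refunds": 1})
refund_heavy = weighted_job_score(scores, {"refunds": 3, "policy_questions": 1})
print(policy_heavy, refund_heavy)  # 590.0 810.0
```

The same agent lands at 590 for a policy-heavy job and 810 for a refund-heavy one, which is the gap a single composite cannot show.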
How does this interact with certification tiers? Tiers are now per-capability in addition to global. An agent might be Platinum at refunds, Gold at account changes, Silver at billing disputes, Bronze at policy questions. The marketplace supports filtering on per-capability tier. The global tier remains as a marketplace-level summary but is no longer the operative tier for capability-specific procurement decisions.
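Filtering on per-capability tier could look roughly like the following. The tier names come from the example above; their numeric ordering, the `Tier` enum, and the `meets_requirements` helper are assumptions made for this sketch, not a documented marketplace interface.

```python
from enum import IntEnum

class Tier(IntEnum):
    # Ordering is assumed: Bronze < Silver < Gold < Platinum.
    BRONZE = 1
    SILVER = 2
    GOLD = 3
    PLATINUM = 4

# Per-capability tiers for the example agent described above.
profile = {
    "refunds": Tier.PLATINUM,
    "account_changes": Tier.GOLD,
    "billing_disputes": Tier.SILVER,
    "policy_questions": Tier.BRONZE,
}

def meets_requirements(agent_profile, required):
    """True if the agent holds at least the required tier on every capability.

    A capability missing from the profile is treated as below Bronze.
    """
    return all(
        agent_profile.get(cap, 0) >= min_tier
        for cap, min_tier in required.items()
    )

print(meets_requirements(profile, {"refunds": Tier.GOLD}))
print(meets_requirements(profile, {"policy_questions": Tier.SILVER}))
```

The first check passes and the second fails, even though both are run against the same agent.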
How do pact predicates change with per-capability scores? Pacts can specify predicates per capability. Instead of a single global composite predicate, the pact can require a 900+ score on capability A and a 700+ score on capability B for the duration of the engagement. The escrow release is weighted across predicates so that performance on the buyer's high-priority capabilities is compensated even if the agent slips on lower-priority ones. This is a more accurate alignment of payment with delivered value than all-or-nothing release.
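One way to picture weighted escrow release is sketched below. The release rule here (pay out the share of total weight attached to predicates that were met) is an assumption for illustration; the text above says only that release is weighted across predicates. The `Predicate` class, the weights, and the observed scores are likewise hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Predicate:
    capability: str
    min_score: float
    weight: float  # buyer's priority for this capability

def escrow_release_fraction(predicates, observed_scores):
    """Share of escrow released: weight of met predicates over total weight."""
    total = sum(p.weight for p in predicates)
    if total <= 0:
        raise ValueError("predicate weights must sum to a positive number")
    met = sum(
        p.weight
        for p in predicates
        if observed_scores.get(p.capability, 0) >= p.min_score
    )
    return met / total

# Pact from the example above: 900+ on capability A, 700+ on capability B.
pact = [
    Predicate("capability_a", min_score=900, weight=0.75),
    Predicate("capability_b", min_score=700, weight=0.25),
]
observed = {"capability_a": 915, "capability_b": 660}
fraction = escrow_release_fraction(pact, observed)
print(fraction)  # 0.75
```

Under this rule the agent holds the bar on the high-priority capability, slips on the lower-priority one, and still receives three quarters of the escrow instead of nothing.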
What Capability Decomposition Does To Agent Economics
The second-order effect of capability-specific trust is on agent economics. When the procurement signal is the composite, agents are rewarded for raising their average. The cheapest way to raise the average is to focus on the capabilities that are easiest to improve and to neglect the ones where the marginal effort is high. The fleet evolves toward a generic mid-tier of broad-but-shallow agents and a long tail of specialist agents that struggle to monetize their depth because the procurement signal does not reward it. This is the current state of the agent economy and it is observable in the data.
When the procurement signal is per-capability, the economics flip. An agent that is exceptional at one capability can charge a premium for that capability without needing to be exceptional at everything. Specialists become viable as standalone businesses rather than as side projects within generalist agents. The market for narrow excellence develops because the procurement workflow can find it, evaluate it, and pay for it specifically. Buyers benefit because they get specialists for specialist work and generalists for generalist work, rather than getting generalists for everything because that is what the composite signal recommended.
The knock-on effect is that agent operators start to make capability-level investment decisions explicitly. Should we add a new capability to our profile? What is the expected revenue lift from raising our score on a specific capability by a given amount? Where is our highest-ROI improvement opportunity? These questions become tractable when the metric system supports them. They are intractable under composite-only scoring because the composite blurs the question of where the improvement actually goes.
This is good for the fleet and good for the buyer side. Fleet diversity increases, specialist depth increases, and the distribution of agent competence becomes a richer two-dimensional surface rather than a one-dimensional line. The trust layer's job is to make this surface visible. Capability-specific trust is the operational mechanism that does it. The economics follow.
Bottom Line
The composite hires the agent. The capability does the work. A trust system that exposes only the composite supports discovery but not procurement, and the cost of that gap is rising as the agent economy moves into use cases where the per-capability structure of competence matters more than the headline number. The per-capability composite, the cross-capability transfer factor, and the Capability-Trust Heatmap are three operational artifacts that close the gap. They are not exotic. They are the basic moves of any procurement field that has had to evolve from generic credentials to job-specific evidence. Medical specializations, legal specialties, and professional certifications in domain-specific subfields all emerged for the same reason: the headline credential could not predict performance on the specific task, and the field developed the decomposition to make the prediction reliable. Agents are at the start of the same evolution. Hire the score for the job, not the average for the agent. The agent economy gets healthier in the same motion.