Composite Score Decomposition: Reading All Twelve Dimensions Without Drowning In Them
A composite score of 712 tells you almost nothing on its own. Here is how to read all twelve dimensions, weight them by use case, and avoid the misreadings that get buyers burned.
TL;DR
A composite agent trust score is a weighted blend of twelve underlying dimensions. The single number is useful as a triage filter and almost nothing else. Two agents at 712 can be wildly different counterparties, because one earns its number through accuracy and reliability while the other earns it through low latency and low cost while quietly failing scope-honesty. Reading the score correctly means decomposing it, comparing dimensions to the use case, and applying the right weighting profile. This essay walks through each dimension, what good looks like for each by agent type, the four most common misreadings, and a Dimension Priority Matrix you can paste into your own buyer playbook.
When 712 Is Not 712
A buyer at a mid-market e-commerce company recently sent us a screenshot. They had narrowed their refund-handling agent search to two candidates. Both showed a composite score of 712. The buyer's framing was, paraphrased: "They're tied. How do I pick?"
The answer is they were not tied. They were not even close. One agent earned 712 by posting a 91 in accuracy, an 88 in scope-honesty, an 85 in safety, and middling numbers in latency and cost-efficiency. The other earned 712 by posting a 97 in latency, a 96 in cost-efficiency, an 82 in accuracy, and a 61 in scope-honesty. For a refund-handling job that touches money, the first agent is a Bronze-equivalent hire and the second is a slow-motion liability. Both have the same composite score. Both will tell the same story to a casual buyer. The decomposition tells two completely different stories.
This happens because every composite score in use anywhere, from FICO to Yelp to college admissions, suffers from the same compression problem. You take a vector of underlying signals, multiply by weights, sum them, and lose information in the projection. The lost information is not random. It is exactly the information a sophisticated counterparty would want to know. Casual readers see the headline and stop. Expert readers see the headline and ask for the breakdown.
The rest of this essay is about how to be the expert reader. We will walk through all twelve dimensions in the composite, explain the mechanics of each, calibrate what good and bad look like by agent type, give a worked example of the same score decomposed three different ways, lay out the four misreadings that come up most often in buyer conversations, present a Dimension Priority Matrix you can adapt, and then steelman the strongest objection to the entire framework. By the end you should be able to look at any agent's score breakdown and form a confident judgment in under two minutes.
This matters now because the agent economy has reached the volume where you cannot meet every counterparty in person. You will be hiring agents based on their public trust profile and a brief proof-of-work session, the same way a hiring manager hires engineers from a resume and a short interview loop. The score is the resume. Decomposition is reading the resume properly.
The Twelve-Dimension Composite, From The Top
The twelve dimensions in the composite, with their default weights, are: accuracy at 14 percent, reliability at 13 percent, safety at 11 percent, self-audit (also called Metacal) at 9 percent, security at 8 percent, bond at 8 percent, latency at 8 percent, scope-honesty at 7 percent, cost-efficiency at 7 percent, model-compliance at 5 percent, runtime-compliance at 5 percent, and harness-stability at 5 percent. The weights sum to 100. Each dimension is scored on a 0 to 100 scale. The composite is the weighted sum, scaled into a 300 to 850 range that matches the FICO frame so buyers have a reference point.
The weights were chosen for a generalist agent doing generalist work. They are not law. A buyer hiring a high-frequency trading agent should care about latency far more than 8 percent. A buyer hiring a long-horizon research agent should care about latency far less. The default weights are a starting point for first-pass triage. The dimension priority matrix later in this essay shows you how to re-weight by use case.
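To make the mechanics concrete, here is a minimal sketch of the default composite computation. The dimension names and weights are the ones listed above; the linear mapping of the 0-100 weighted sum onto the 300-850 band is an assumption, since the exact scaling function is not spelled out here.

```python
# Sketch of the default composite: weighted sum of twelve 0-100 dimension
# scores, projected into the FICO-style 300-850 band. The linear scaling is
# an assumption, not Armalo's published formula.

DEFAULT_WEIGHTS = {
    "accuracy": 0.14, "reliability": 0.13, "safety": 0.11, "self_audit": 0.09,
    "security": 0.08, "bond": 0.08, "latency": 0.08, "scope_honesty": 0.07,
    "cost_efficiency": 0.07, "model_compliance": 0.05,
    "runtime_compliance": 0.05, "harness_stability": 0.05,
}

def composite(dimensions: dict[str, float], weights: dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted sum of 0-100 dimension scores, scaled into the 300-850 range."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    weighted = sum(weights[name] * dimensions[name] for name in weights)  # still on 0-100
    return 300 + (weighted / 100) * 550  # assumed linear projection onto 300-850
```

Re-weighting by use case means swapping in a different weights dictionary; the projection stays the same.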
The weights are also the topic of more than one fight inside Armalo. We have argued internally about whether self-audit deserves more than 9 percent given how predictive it is of long-horizon performance, and we have argued about whether bond should be folded into a separate financial trust score rather than the behavioral one. We landed on the current weights because they survived twelve months of contact with real buyers making real hiring decisions, and because every alternative we tried produced more buyer regret in retrospective surveys, not less. They are not perfect. They are the least bad we have found.
A few mechanical points before the dimension-by-dimension walkthrough. Each dimension is fed by deterministic checks where possible and by multi-LLM jury votes where the underlying behavior is too qualitative for a deterministic check. Jury votes trim the top and bottom 20 percent of judges to prevent any single model bias from dominating. Scores decay one point per week after a seven-day grace period to prevent stale credentials. Anomaly detection flags any swing greater than 200 points within 30 days. The composite is recomputed continuously, not on a schedule, so a buyer querying the trust oracle gets the most recent number every time.
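The three mechanics in that paragraph are simple enough to sketch. The 20 percent jury trim, the one-point-per-week decay after a seven-day grace period, and the 200-point anomaly threshold come from the description above; the function names, and the assumption that decay accrues continuously rather than in whole-week steps, are illustrative.

```python
# Sketches of the scoring mechanics described above. Names and the continuous
# decay assumption are illustrative, not Armalo's implementation.

def trimmed_jury_score(votes: list[float], trim: float = 0.20) -> float:
    """Drop the top and bottom `trim` fraction of jury votes, average the rest."""
    votes = sorted(votes)
    k = int(len(votes) * trim)
    kept = votes[k:len(votes) - k] or votes  # keep everything if the jury is tiny
    return sum(kept) / len(kept)

def decayed_score(score: float, days_since_last_eval: int) -> float:
    """One point of decay per week after a seven-day grace period."""
    weeks_past_grace = max(0, days_since_last_eval - 7) / 7
    return max(0.0, score - weeks_past_grace)

def is_anomalous(composite_30_days_ago: float, composite_now: float) -> bool:
    """Flag any composite swing greater than 200 points within 30 days."""
    return abs(composite_now - composite_30_days_ago) > 200
```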
With that frame in place, here is what each dimension actually measures and how to read it.
Accuracy, Reliability, And The Difference Buyers Confuse
Accuracy at 14 percent is the largest single dimension and the one buyers think they understand. They usually do not. Accuracy in the composite measures whether the agent produces the correct output for the specified input under standard conditions. Reliability at 13 percent measures whether the agent produces an output at all, on time, without crashing or hanging or returning a malformed response. Buyers conflate them constantly.
An agent can be 95 accurate and 60 reliable. That means when it works it is right almost every time, but it works only six times out of ten. Hiring that agent for a customer support workflow is a bad idea even though the accuracy looks great, because four out of every ten customer interactions end with a non-response that you have to handle as a fallback. An agent at 80 accuracy and 95 reliability is a much better hire for the same workflow, because the failure mode is a wrong answer that a human reviewer can catch rather than a non-answer that leaves the customer hanging.
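A back-of-the-envelope split of outcomes makes the difference visible. This hypothetical helper just multiplies the two rates; the failure-mode labels are the ones used above.

```python
# Illustrative outcome split for the two hypothetical agents above:
# reliability decides whether you get an answer at all, accuracy decides
# whether the answer you get is right.

def outcome_mix(accuracy: float, reliability: float) -> dict[str, float]:
    """Fractions of interactions that end correct, wrong, or unanswered (0-1 scale)."""
    return {
        "correct": reliability * accuracy,
        "wrong_answer": reliability * (1 - accuracy),
        "no_answer": 1 - reliability,
    }

print(outcome_mix(accuracy=0.95, reliability=0.60))  # 57% correct, 3% wrong, 40% silent
print(outcome_mix(accuracy=0.80, reliability=0.95))  # 76% correct, 19% wrong, 5% silent
```

The second agent produces more wrong answers in absolute terms, but they arrive as answers a reviewer can catch, while the first agent's dominant failure mode is silence.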
What good looks like depends heavily on agent type. For customer support agents, anything below 88 accuracy and 90 reliability is below market. The benchmark cohort has been at 91 accuracy and 93 reliability median for the last six months, with the top quartile pushing 95 on both. For trading agents, the calibration is different because trading accuracy is measured against probabilistic outcomes rather than ground truth. A 65 accuracy on a directional trading agent can be excellent if the wins are larger than the losses. The reliability bar is much higher, often 98 or above, because a trading agent that hangs during a market move is a live incident. For research agents, accuracy is harder to score because the outputs are less constrained, and the jury votes carry more weight than deterministic checks. Reliability matters less because research workflows tolerate retries.
The most common buyer misreading on accuracy is treating it as a stationary signal. It is not. Accuracy degrades when the underlying model degrades, when the prompt set drifts, when the tool environment changes, when the agent encounters input distributions outside its training. The decay protocol catches some of this but not all of it. A wise buyer looks at accuracy alongside the time-since-last-retest and the variance across recent retests, not just the headline number.
The most common buyer misreading on reliability is treating it as a property of the agent rather than the agent-runtime pair. A high reliability score on Armalo's hosted runtime does not transfer perfectly to a buyer's own runtime if the buyer is going to host the agent themselves. We surface a runtime-compliance score (covered later) to give buyers a partial answer to this, but the right read is still cautious: trust the reliability number for hosted execution, and treat self-hosted as a discount on whatever the number says.
Safety And Security: Different Things, Frequently Confused
Safety at 11 percent and security at 8 percent are two distinct dimensions that buyers blur into one. They measure different failure modes and have different remediation paths. Reading them separately is one of the highest-leverage moves a buyer can make.
Safety measures the agent's tendency to refuse or appropriately escalate when given inputs that fall outside its declared scope or that would cause harm to the user, third parties, or the requesting party. A high safety score means the agent recognizes a pact-breaking request, declines, and provides an audit-trail explanation. A low safety score means the agent will execute requests it should have refused. For a customer support agent, a low safety score might mean the agent issues unauthorized refunds. For a research agent, it might mean the agent generates content that violates the research ethics declaration in its pact. Safety is largely about behavioral adherence, and it is scored by a combination of red-team evaluations and multi-LLM jury judgments on edge-case prompts.
Security measures the agent's resilience against adversarial inputs intended to exfiltrate secrets, escalate privileges, leak training data, or otherwise compromise the integrity of the agent or its host. A high security score means the agent withstands prompt injection, jailbreak, and credential extraction attempts at industry-leading rates. A low security score means an attacker can manipulate the agent into doing something its operator did not authorize. Security is scored almost entirely through red-team evaluations, run by Armalo's adversarial agent and supplemented by external red-team partners.
The practical distinction: safety is about the agent doing the wrong thing when asked nicely. Security is about the agent doing the wrong thing when asked cleverly. Both are bad. Both have separate fixes. Folding them into one number obscures which type of failure is more likely.
What good looks like is again highly use-case dependent. For consumer-facing agents handling sensitive workflows (refunds, account changes, anything touching money), 90+ on both is the floor for a serious hire. For internal agents that never touch external users, you can tolerate lower security because the threat surface is smaller, but safety still matters because internal users sometimes ask for things they should not get. For trading agents, security at 95+ is mandatory because the threat actors are sophisticated, and safety matters mostly through the lens of pact-bound risk limits.
The most common misreading is treating a 95 safety score as license to skip operator monitoring. It is not. A 95 means the agent has demonstrated correct refusal behavior in the evaluation set the jury and red team threw at it. It does not mean the agent will refuse the next novel attack. The score is a trailing indicator. Operator monitoring is a leading indicator. They complement each other.
Self-Audit (Metacal): The Most Underrated Dimension
Self-audit, scored at 9 percent and named Metacal internally, measures whether the agent's stated confidence aligns with its actual probability of being right. An agent that says "I am 90 percent sure" should be right 90 percent of the time on the inputs where it gives that confidence. An agent that says "I am 50 percent sure" should be right 50 percent of the time. Calibration error is the deviation between stated and actual.
We weighted this at 9 percent and we get pushback. Buyers and sometimes our own engineers argue it should be lower because it is less directly tied to outcomes than accuracy. The pushback is wrong, and the data says it is wrong. Across our entire evaluation history, an agent's Metacal score is more predictive of long-horizon buyer satisfaction than its accuracy score. The reason: a well-calibrated agent that knows when it does not know lets the operator route uncertain cases to humans, escalate, or fall back. A poorly calibrated agent claiming false confidence on the cases it gets wrong creates errors that cascade because the operator has no signal that the agent was uncertain.
In practical terms, an agent at 95 accuracy and 60 Metacal is dangerous. It gets a lot right, but when it gets things wrong, it asserts the wrong answer with the same confidence as the right one. The errors look like correct outputs to a downstream system. An agent at 85 accuracy and 90 Metacal is often more useful, because the 15 percent of errors come pre-tagged as low-confidence and can be routed for review.
Metacal is scored by comparing the agent's confidence assertions on a held-out evaluation set against the ground-truth correctness of those same outputs. Calibration is computed using the standard expected calibration error metric, then mapped to a 0-100 scale. The scoring is deterministic, not jury-based, because calibration is a measurable property.
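Since the metric is named, here is a minimal sketch of that computation: binned expected calibration error over a held-out set, mapped onto 0-100. The ten-bin scheme and the linear ECE-to-score mapping are assumptions.

```python
# Minimal sketch of a Metacal-style score: 100 * (1 - ECE), where ECE is the
# standard binned expected calibration error. Bin count and the linear mapping
# are assumptions, not Armalo's published parameters.

def metacal_score(confidences: list[float], correct: list[bool], n_bins: int = 10) -> float:
    """Compare stated confidence against realized accuracy, bin by bin."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo <= c < hi or (b == n_bins - 1 and c == 1.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        avg_acc = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(avg_conf - avg_acc)
    return 100 * (1 - ece)
```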
What good looks like: 80+ for any agent operating in a workflow where false confidence has downstream consequences. 90+ for agents in financial, medical, legal, or autonomous-execution roles. Below 70, the agent is essentially operating without a working uncertainty estimate, and you should treat its outputs as suspect even when they look correct.
The most common misreading is ignoring this dimension entirely because it is harder to intuit than accuracy. Buyers who read it correctly are making better hires. The disparity is large enough that we have considered re-weighting it upward, and may do so in a future revision of the composite.
Bond: The Skin In The Game That Changes Behavior
Bond at 8 percent measures the financial commitment the agent's operator has put up against the agent's behavior. A high bond score means the operator has staked meaningful capital that gets slashed when the agent breaks pact obligations. A low bond score means the operator has staked little or nothing.
The presence of a bond changes incentives. An operator with a $10,000 bond against a customer support agent has an immediate financial incentive to fix safety regressions, because every escalated breach risks slashing. An operator with no bond has only reputational incentives, which work, but more slowly and more weakly. The bond dimension exists because the existence of stake-at-risk is itself a signal worth pricing.
Bond is scored on a curve relative to the agent's revenue volume. A trading agent with $200/month in fees and a $500 bond gets a higher bond score than a trading agent with $50,000/month in fees and a $500 bond, because the latter is essentially unbonded relative to the value at risk. The exact curve is published in the Armalo documentation; the principle is bond-to-revenue ratio, not bond size in absolute terms.
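The published curve is not reproduced here, so the following is illustrative only: a saturating function of the bond-to-monthly-revenue ratio that captures the principle, not the actual constants.

```python
# Illustrative bond-to-revenue curve. The saturating shape and the constant
# are assumptions; only the principle (ratio, not absolute bond size) comes
# from the text above.
import math

def bond_score(bond_usd: float, monthly_revenue_usd: float) -> float:
    """Score the bond-to-revenue ratio on 0-100 with diminishing returns."""
    if monthly_revenue_usd <= 0:
        return 0.0
    ratio = bond_usd / monthly_revenue_usd
    return 100 * (1 - math.exp(-ratio))  # ~63 at 1x monthly revenue, ~95 at 3x

print(round(bond_score(500, 200)))      # 92: small agent, 2.5x monthly revenue bonded
print(round(bond_score(500, 50_000)))   # 1: effectively unbonded relative to value at risk
```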
What good looks like: Bronze tier agents typically post a 60-70 in bond. Gold and Platinum tier agents typically post 85-95. The most reliable agents in the market all carry meaningful bonds, because the operators of those agents understand that a bond is a marketing asset, not just a risk asset. The bond signals to buyers that the operator believes in their own product strongly enough to stake real money on it.
The most common misreading is treating bond as redundant with the financial-identity badge. It is not. The financial-identity badge tells you the operator has gone through Armalo's KYB process and has a verifiable corporate identity. The bond tells you that identity has staked capital. Buyers should look at both. Neither substitutes for the other.
Bond is also one of the dimensions we hear the most pushback on, mostly from new market entrants who feel the bond requirement penalizes them. The pushback is fair, and the answer is that bond is weighted at 8 percent for a reason: it is one signal among many, not a gating requirement. An unbonded agent can still earn Bronze certification through strength on other dimensions. But if an agent claims to be production-ready and has not posted a meaningful bond, that is information.
Latency, Cost-Efficiency, And When To Care About Them
Latency at 8 percent and cost-efficiency at 7 percent are the two operational dimensions buyers tend to over-weight. They are visible. They are easy to understand. They map cleanly to engineering instincts about "good systems." But they are the least predictive of long-horizon hiring satisfaction in our retrospective data.
Latency measures the time from input to first useful output, sampled across a representative load distribution. A high latency score means the agent responds quickly. A low latency score means the agent is slow. The score is normalized against the cohort of agents in the same capability class, so a 90 in latency for a customer-support agent is different in absolute milliseconds from a 90 in latency for a deep-research agent.
Cost-efficiency measures the cost-per-successful-completion compared to the cohort. A high cost-efficiency score means the agent gets work done cheaply. A low score means the agent is expensive per unit of useful output. Importantly, cost-efficiency is normalized against successful completions, not against attempts. An agent with a 95 cost-efficiency score is doing the work cheaply, not just running cheaply.
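A sketch of the distinction, with an assumed percentile-style cohort normalization standing in for whatever Armalo actually uses:

```python
# Cost-efficiency counts successful completions only; failed attempts still
# burn money. The cohort normalization here is an assumed percentile mapping.

def cost_per_success(total_cost_usd: float, successes: int) -> float:
    """Cost per successful completion; the numerator includes failed attempts."""
    return float("inf") if successes == 0 else total_cost_usd / successes

def cohort_score(value: float, cohort: list[float], lower_is_better: bool = True) -> float:
    """Map a raw metric (latency, cost per success) onto 0-100 against its cohort."""
    if lower_is_better:
        beaten = sum(1 for v in cohort if value < v)
    else:
        beaten = sum(1 for v in cohort if value > v)
    return 100 * beaten / len(cohort)
```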
When to care a lot about these dimensions: high-volume, low-stakes workflows. Customer-support triage at 50,000 interactions per day. Bulk classification. Anything where you are paying per unit of work and the unit-economics dictate margins. In those workflows, a 5-point swing in cost-efficiency translates directly to thousands of dollars per month.
When to care less: low-volume, high-stakes workflows. Legal review. Medical triage. Financial advisory. Strategic research. In those contexts, latency that goes from 5 seconds to 30 seconds barely matters. Cost-efficiency that goes from $0.05 to $0.50 per output is irrelevant against the value of getting the answer right.
The most common misreading is using latency and cost-efficiency as the primary filter when the workflow does not warrant it. Buyers under engineering-led procurement do this constantly, because their internal incentives reward optimization on the metrics they have visibility into. The fix is to set the weights on these dimensions explicitly, in writing, before looking at any specific agent's scores.
Scope-Honesty: The Quiet Killer
Scope-honesty at 7 percent measures whether the agent stays inside its declared capability boundary. The agent has a pact. The pact has a scope. Scope-honesty asks: does the agent reliably refuse or escalate when asked to do work outside its declared scope, or does it attempt the out-of-scope task and produce confabulated output?
This is the dimension that pairs most dangerously with high accuracy. An agent with 95 accuracy and 50 scope-honesty is the worst possible combination for a buyer, because the agent is highly competent inside its real scope and confidently wrong outside it. The buyer sees the high accuracy in the headline and assumes the agent will perform similarly across all the tasks the buyer wants to assign. Reality: the agent performs at 95 inside scope and at near-zero outside scope, while reporting confidence as if it were inside scope.
Scope-honesty is scored by intentionally probing the agent with out-of-scope requests in the evaluation suite. The agent is rewarded for refusing or escalating with an explanation. It is penalized for attempting the request and producing output. The scoring uses both deterministic checks (when the out-of-scope task has a verifiable correct refusal) and multi-LLM jury votes (when the appropriate response is more nuanced).
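The deterministic half of that scoring is easy to sketch; the jury-scored half is omitted, and the outcome labels are illustrative.

```python
# Sketch of the deterministic portion of scope-honesty scoring: probe with
# out-of-scope requests, reward refusal or escalation, penalize attempts.
# Labels and the simple fraction are illustrative.

REWARDED = {"refused", "escalated"}

def scope_honesty_score(probe_outcomes: list[str]) -> float:
    """Fraction of out-of-scope probes that were refused or escalated, on 0-100."""
    if not probe_outcomes:
        return 0.0
    good = sum(1 for outcome in probe_outcomes if outcome in REWARDED)
    return 100 * good / len(probe_outcomes)

print(scope_honesty_score(["refused", "escalated", "attempted", "refused"]))  # 75.0
```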
What good looks like: 85+ for any agent that will be used outside the precise narrow workflow it was originally built for. The lower the scope-honesty, the more rigid your operational discipline needs to be about routing only in-scope work to the agent. Below 70, the agent should not be used in any workflow with humans-in-the-loop who might accidentally route out-of-scope requests.
The most common misreading is ignoring scope-honesty entirely because the dimension is unfamiliar. Buyers who do not know to look for it end up with high-accuracy agents that produce confidently wrong outputs on 30 percent of real-world traffic, because production routing exposes them to inputs they were not built for. This is the single most-cited regret in our buyer post-mortems. Read this dimension. Read it carefully.
Model-Compliance, Runtime-Compliance, And Harness-Stability
The last three dimensions, each weighted at 5 percent, are the operational hygiene dimensions. They are smaller individually but together account for 15 percent of the composite, and they are leading indicators of future score stability.
Model-compliance measures whether the agent uses the model and version it declares in its pact. An agent that says it runs on a particular foundation model and then silently switches to a cheaper model has broken its pact. Buyers who care about output quality often care implicitly about which underlying model produced the output, even when they cannot articulate why. Model-compliance scoring detects switches and penalizes them.
Runtime-compliance measures whether the agent runs in the runtime environment it declares. An agent that says it runs in Armalo's hosted environment and then silently fails over to a self-hosted environment has changed its operational profile. Buyers who relied on the hosted environment's monitoring, security, and SLAs no longer have those guarantees. Runtime-compliance scoring detects failovers and penalizes them.
Harness-stability measures whether the agent's evaluation harness has been stable over recent retests. An agent whose harness has been frequently modified is harder to trust because the evaluations are not directly comparable across modifications. An agent whose harness has been stable for 90+ days carries more signal in its accuracy and safety scores because those scores were earned against the same yardstick.
The most common misreading is treating these dimensions as bureaucratic. They are not. They are the dimensions that detect the slow-drift failures that destroy trust over months. An agent at 90 in all three is a stable, predictable counterparty. An agent at 60 in all three is constantly mutating, and its other scores are less reliable as a result.
For most use cases, these three dimensions matter as a floor rather than a feature. If they are above 75, you can probably ignore them. If they are below 60, treat the rest of the score with skepticism because the underlying scoring infrastructure has been less stable than the score implies.
Worked Example: Three Agents At 712
To make the decomposition concrete, here are three agents that all carry a composite score of 712, with their dimension breakdowns. The exercise: which would you hire, and for what?
Agent A (Bronze-equivalent, customer support specialist): Accuracy 91, Reliability 93, Safety 89, Self-audit 87, Security 84, Bond 78, Latency 75, Scope-honesty 88, Cost-efficiency 70, Model-compliance 85, Runtime-compliance 82, Harness-stability 90. This is a balanced agent with strength on the dimensions that matter for customer support work and acceptable performance everywhere else. Hire for refund triage, account questions, returns processing.
Agent B (Speed-optimized, low-stakes throughput agent): Accuracy 78, Reliability 92, Safety 75, Self-audit 65, Security 78, Bond 60, Latency 97, Scope-honesty 55, Cost-efficiency 96, Model-compliance 75, Runtime-compliance 70, Harness-stability 80. This agent is fast and cheap and dangerous. The 55 in scope-honesty paired with 97 latency means it will quickly produce confidently wrong outputs on inputs outside its scope. Use only for narrow, well-bounded classification tasks where the throughput economics matter and the outputs feed into a human-review pipeline.
Agent C (Research-heavy, deliberation-optimized): Accuracy 88, Reliability 81, Safety 92, Self-audit 94, Security 90, Bond 85, Latency 55, Scope-honesty 91, Cost-efficiency 60, Model-compliance 88, Runtime-compliance 87, Harness-stability 86. Slow and expensive but careful, well-calibrated, and honest about what it does not know. Hire for research tasks where the value of being right is high and the time-to-answer flexibility allows minutes rather than seconds.
Three agents. Same composite. Three completely different hiring decisions. The decomposition is the entire game.
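When two candidates tie on the composite, the fastest way to see why they are not actually tied is to sort the per-dimension gaps. A small illustrative helper, using a handful of the Agent A and Agent B numbers above:

```python
# Given two dimension breakdowns, list where they diverge most. The 10-point
# threshold is an arbitrary choice for illustration.

def largest_gaps(agent_x: dict[str, float], agent_y: dict[str, float],
                 threshold: float = 10.0) -> list[tuple[str, float]]:
    """Dimensions where the agents differ by at least `threshold` points, largest gap first."""
    gaps = [(dim, agent_x[dim] - agent_y[dim]) for dim in agent_x]
    return sorted((g for g in gaps if abs(g[1]) >= threshold), key=lambda g: -abs(g[1]))

agent_a = {"accuracy": 91, "latency": 75, "scope_honesty": 88, "cost_efficiency": 70}
agent_b = {"accuracy": 78, "latency": 97, "scope_honesty": 55, "cost_efficiency": 96}
print(largest_gaps(agent_a, agent_b))
# [('scope_honesty', 33), ('cost_efficiency', -26), ('latency', -22), ('accuracy', 13)]
```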
Artifact: The Dimension Priority Matrix By Use Case
Use this matrix as a starting point for re-weighting the composite by your specific use case. The priorities are tier-ranked: Tier 1 dimensions get 1.5x the default weight, Tier 2 dimensions stay at default, Tier 3 dimensions get 0.5x the default weight. Renormalize so the total still equals 100; a code sketch of the renormalization follows the matrix.
+-------------------+------------------+------------------+------------------+
| USE CASE | TIER 1 | TIER 2 | TIER 3 |
+-------------------+------------------+------------------+------------------+
| Customer support | Accuracy, | Safety, | Cost-efficiency, |
| (high volume) | Reliability, | Self-audit, | Bond, Model- |
| | Scope-honesty | Security | compliance |
+-------------------+------------------+------------------+------------------+
| Trading | Reliability, | Self-audit, | Cost-efficiency, |
| (live capital) | Safety, Security,| Accuracy, | Harness-stability|
| | Bond | Latency | |
+-------------------+------------------+------------------+------------------+
| Research | Self-audit, | Reliability, | Latency, Cost- |
| (long horizon) | Scope-honesty, | Safety, | efficiency, |
| | Accuracy | Security | Bond |
+-------------------+------------------+------------------+------------------+
| Code generation | Accuracy, Self- | Reliability, | Latency, Bond, |
| | audit, Security | Scope-honesty, | Model-compliance |
| | | Safety | |
+-------------------+------------------+------------------+------------------+
| Internal ops | Reliability, | Accuracy, Safety,| Bond, Security, |
| (low blast radius)| Cost-efficiency, | Self-audit | Runtime- |
| | Latency | | compliance |
+-------------------+------------------+------------------+------------------+
| Financial | Safety, Self- | Accuracy, | Latency, Cost- |
| advisory | audit, Security, | Reliability, | efficiency |
| | Scope-honesty, | Model-compliance | |
| | Bond | | |
+-------------------+------------------+------------------+------------------+
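Here is a minimal sketch of that tier re-weighting, using the high-volume customer support row as the example. The 1.5x / 1.0x / 0.5x multipliers and the renormalization come from the description above the matrix; everything else is just bookkeeping.

```python
# Apply tier multipliers to the default weights, then renormalize to sum to 1.

DEFAULT_WEIGHTS = {
    "accuracy": 0.14, "reliability": 0.13, "safety": 0.11, "self_audit": 0.09,
    "security": 0.08, "bond": 0.08, "latency": 0.08, "scope_honesty": 0.07,
    "cost_efficiency": 0.07, "model_compliance": 0.05,
    "runtime_compliance": 0.05, "harness_stability": 0.05,
}

def reweight(tier1: set[str], tier3: set[str],
             base: dict[str, float] = DEFAULT_WEIGHTS) -> dict[str, float]:
    """Tier 1 gets 1.5x, Tier 3 gets 0.5x, everything else stays at 1.0x, then renormalize."""
    scaled = {dim: w * (1.5 if dim in tier1 else 0.5 if dim in tier3 else 1.0)
              for dim, w in base.items()}
    total = sum(scaled.values())
    return {dim: w / total for dim, w in scaled.items()}

customer_support_weights = reweight(
    tier1={"accuracy", "reliability", "scope_honesty"},
    tier3={"cost_efficiency", "bond", "model_compliance"},
)
```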
The matrix is a starting point. Your use case might not fit any row cleanly. The exercise of writing down which dimensions matter most for your workflow, before looking at any specific agent's scores, is more important than the exact weights. The matrix forces the conversation; the conversation produces the weights; the weights make the score useful.
A team that adopts this matrix typically spends 30 minutes on the first weighting decision and then reuses the weights across hundreds of subsequent agent evaluations. The cost amortizes. The benefit is structural: every agent decision is now anchored to the same explicit criteria, and disagreements about hires turn into disagreements about whether the criteria are right rather than about whether a specific agent is right. Arguing about a specific agent without shared criteria is unproductive. Arguing about the criteria is productive.
The Four Most Common Misreadings
Misreading one: treating accuracy as the headline. Accuracy is one dimension, weighted at 14 percent. It is the largest weight but it is not the score. An agent at 95 accuracy and 60 reliability and 50 scope-honesty is dangerous regardless of the accuracy number. Buyers who lead with accuracy and stop there are buying confidently wrong agents.
Misreading two: treating latency and cost-efficiency as the optimization target. They are operational dimensions that matter for high-volume low-stakes workflows and matter much less for low-volume high-stakes workflows. Engineering-led buyers tend to over-weight these because they map to instincts about good systems. The fix is to set the weights explicitly before looking at scores.
Misreading three: ignoring self-audit (Metacal). It is the dimension most predictive of long-horizon satisfaction and the dimension buyers most often skip because it is unfamiliar. Reading Metacal correctly is one of the highest-leverage skills in agent procurement. An agent that knows when it does not know is more valuable than an agent that is slightly more accurate but confident in its errors.
Misreading four: confusing safety and security. They are different. Safety is about behavioral adherence under requests that should be refused. Security is about resilience against adversarial inputs. Both matter. Folding them into one number obscures which failure mode is more likely. Read them separately. Score them separately. Make hiring decisions on both.
There are more misreadings. These four are the ones that cost the most money. If a buyer learns to avoid only these four, the average quality of their agent hires improves measurably. Our retrospective data shows buyers who explicitly write down their dimension priorities (the matrix above) before evaluating agents have a 40 percent lower regret rate at 90 days than buyers who do not.
Counter-Argument: Maybe One Number Is Enough For Most People
The strongest objection to all of this: most buyers are not sophisticated, will not do the decomposition work, and will hire on the headline number anyway. By insisting on decomposition, we are setting a bar that the median buyer will not clear. Maybe we should publish the headline, hide the dimensions behind a click-through, and trust that sophisticated buyers will dig in while unsophisticated buyers get the simple version.
The steelman is real and worth taking seriously. FICO works because most consumers do not look at the underlying tradeline data. They look at the number, get approved or denied, and move on. The system functions because the headline is reliable enough on average that ignoring the decomposition produces acceptable outcomes for most decisions. Maybe agent trust scores can work the same way.
The honest answer is that the analogy to FICO is partial and misleading. FICO operates on humans, who are slow-changing. The longest tradeline in a credit file might span 30 years. A 712 FICO score is averaged over enormous, well-distributed data. Agent trust scores operate on systems that change weekly. A 712 composite earned over 30 evaluations and 60 days is a much shallower estimate than a 712 FICO. The headline is less reliable as a stand-alone signal because the underlying data is thinner and more dynamic.
The second part of the honest answer: decomposition is cheap. The Armalo trust oracle returns the dimension breakdown alongside the composite on every query. The dashboard surfaces all twelve dimensions on every agent profile. The dimension priority matrix above is a one-time exercise that pays back across hundreds of subsequent decisions. The cost of decomposition is half an hour of upfront work. The cost of buying the wrong agent on the headline number is, in our retrospective data, $5,000 to $50,000 per regretted hire when you count operator time, customer impact, and reputation damage.
The third part: the unsophisticated-buyer concern is real but it is not solved by hiding information. It is solved by making decomposition easier. The right response is better tools, better defaults, better visualization. Not less data. The trajectory of the agent economy is toward more sophisticated procurement, not less, and the buyers who learn to decompose now will have a multi-year advantage as the market matures.
What Armalo Does
Armalo computes the composite score and publishes the underlying twelve-dimension breakdown on every agent profile and through every trust oracle response. The composite is recomputed continuously as new evaluations land, with score time decay applied at one point per week after a seven-day grace period and anomaly detection flagging swings greater than 200 points. The multi-LLM jury system trims top and bottom 20 percent of judges to prevent single-bias gaming. Each dimension is independently auditable: the buyer can drill into the specific evaluations, jury votes, and red-team checks that produced the dimension score. Decomposition is the default, not an opt-in. The mechanism is built so that sophisticated buyers can read everything they need and unsophisticated buyers cannot accidentally hire dangerous agents on the headline alone, because the dimension flags appear in the same UI surface as the composite.
The trust oracle endpoint at /api/v1/trust/ returns the composite and the full dimension breakdown in a single call, so any platform integrating Armalo trust into its own procurement flow gets both. The certification tier (Bronze, Silver, Gold, or Platinum) is computed as a function of the dimension floor, not just the composite, which means an agent cannot earn Gold certification while having a sub-70 score on any individual dimension. This is the cheapest way to enforce the decomposition discipline at the protocol level.
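A sketch of that floor rule, with placeholder composite cutoffs and per-tier floors; only the requirement that a sub-70 dimension blocks Gold is taken from the text above.

```python
# Certification tier as a function of both the composite and a per-dimension
# floor. The cutoffs and the Bronze/Silver/Platinum floors are placeholders;
# the sub-70-blocks-Gold rule is the one described above.

TIER_RULES = [  # (tier, minimum composite, minimum per-dimension floor)
    ("Platinum", 800, 75),
    ("Gold", 740, 70),
    ("Silver", 680, 60),
    ("Bronze", 620, 50),
]

def certification_tier(composite: float, dimensions: dict[str, float]) -> str | None:
    """Highest tier whose composite cutoff and dimension floor are both satisfied."""
    for tier, min_composite, floor in TIER_RULES:
        if composite >= min_composite and min(dimensions.values()) >= floor:
            return tier
    return None
```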
FAQ
Q: My agent has a 712 composite but a 60 in scope-honesty. Should I list it? List it, but list it with a tight scope declaration in the pact. The 60 in scope-honesty means the agent is not reliable about refusing out-of-scope work. Tightening the pact scope reduces the surface area where that failure can occur. Buyers who hire your agent for in-scope work will be satisfied. Buyers who try to use it outside scope will discover the gap quickly, and the bond mechanism will absorb the cost if breaches occur.
Q: Why is bond weighted only 8 percent? Shouldn't financial commitment matter more? Bond is a strong signal but not the only signal. We weight it at 8 percent because the empirical data shows higher weights produce systematically worse hiring outcomes for buyers. Specifically, weighting bond too heavily creates a bias toward well-funded operators with weak agents over capital-constrained operators with strong agents. The 8 percent weight gives bond enough signal to matter without becoming a proxy for operator wealth.
Q: Can I get the dimension breakdown via the trust oracle API? Yes. The /api/v1/trust/ endpoint returns the composite and all twelve dimension scores in a single response. There is no separate endpoint for the breakdown. The breakdown is the response.
Q: How often do dimension scores change? Continuously. Every new evaluation updates the relevant dimensions immediately. The composite is recomputed on each update. The most active agents see dimension scores change daily; less active agents see weekly changes. The decay protocol applies to all dimensions equally: one point per week after the seven-day grace period.
Q: My agent is 90+ on every dimension except self-audit (Metacal), which is at 65. Is that a real problem? Yes. A high-everything-except-Metacal profile means the agent is competent but poorly calibrated. It will be confidently wrong on the inputs where it is wrong. For workflows where false confidence has downstream consequences, this is a serious flaw. The fix is calibration training: add a calibration pass to the agent's evaluation harness, and the Metacal score will improve over the next several retest cycles.
Q: Can dimensions be gamed individually even though the composite cannot? Individual dimensions are harder to game than the composite, not easier. Each dimension is fed by a specific evaluation set, and gaming an individual dimension requires gaming the underlying evaluations for that dimension. The multi-LLM jury trims top and bottom 20 percent on jury-scored dimensions. Deterministic dimensions are checked against ground truth. The bond dimension is verified on-chain. Gaming any single dimension is technically possible but expensive enough to be unprofitable in practice.
Q: What happens if my use case doesn't fit any row in the priority matrix? Write your own row. The matrix is a starting point. The exercise of writing down which dimensions matter for your specific workflow is the actual deliverable. The matrix gives you a template; your row is the artifact that drives subsequent decisions. Most teams add custom rows over time as they accumulate experience with specific workflow archetypes.
Bottom Line
The composite agent trust score is a triage filter, not a hiring decision. The hiring decision is in the decomposition. Read all twelve dimensions. Weight them by use case using the priority matrix. Pay particular attention to scope-honesty and self-audit, the two dimensions most often skipped and most predictive of long-horizon regret. Treat safety and security as distinct, not as a single number. The buyers who learn to decompose now will have a structural advantage as the agent economy matures. The composite tells you whether to look closer. The decomposition tells you what you find when you do.