Verifiable Versus Asserted Trust: Why "Trust Us" Is Not A Score
Most agent trust claims today are assertions. A verifiable score is one an independent reader can recompute. The gap is the difference between a brand and a bond.
TL;DR
A trust claim is asserted when the only thing standing behind it is the platform that issued it. A trust claim is verifiable when an independent reader can take the inputs, run the rules, and arrive at the same score without trusting the issuer. Almost every agent trust badge shipping in 2026 sits on the asserted side of that line. This essay draws the line precisely, defines four ascending verification levels (self-attested, platform-attested, jury-attested, and cryptographically-anchored), explains what each one costs to produce and consume, and offers a Verification Ladder a buyer or regulator can use to grade any trust claim they encounter. Every step up the ladder is roughly an order of magnitude reduction in counterparty risk, and roughly an order of magnitude increase in production cost. The right level depends entirely on what you are putting at stake.
A real failure mode: the badge that meant nothing
In early 2026, a procurement team at a mid-market fintech sat through a three-hour vendor presentation for an AI agent that promised to handle inbound support tickets. The pitch was tight. The agent had a green badge on the marketplace listing that said "Tier 1 Trust Verified." The marketplace had a thoughtful-looking trust page describing how badges were earned. The procurement lead, who had spent fifteen years buying enterprise software, asked the question that should be the first question in every agent procurement meeting and almost never is: "What does verified mean here?"
The answer, after twenty minutes of back-and-forth between the vendor's CSM, the marketplace's developer relations contact, and the procurement lead's own internal security architect, turned out to be this. The agent had filled out a self-assessment form. The marketplace had reviewed the self-assessment. The badge represented the marketplace's opinion that the self-assessment looked plausible. There were no third-party evaluations behind the badge. There was no public log of the inputs that produced it. The marketplace would not provide the procurement team with the underlying evidence because it was "competitively sensitive." The badge was, in the procurement lead's words afterward, "essentially a Yelp review the seller wrote about themselves and the platform stamped."
The agent itself was probably fine. The vendor was credible. The procurement decision was made. But the badge had done no work. It had not transferred any risk away from the buyer. It had not given the security architect anything to point at in their own attestation chain. It had not given the procurement lead anything they could show their CFO if the agent later caused an incident. It was, structurally, marketing copy with a green checkmark next to it.
This is the failure mode this essay is about. It is everywhere right now. And the reason it is everywhere is that the entire agent economy still confuses the act of asserting trust with the act of providing trust. The two are not the same thing. The cost of mistaking one for the other will, eventually, be measured in the dollar value of the first major incident a buyer suffers from an agent that wore an asserted badge they treated as a verified one.
Defining the line: assertion versus verification
A trust claim is asserted when its only support is the credibility of the entity making the claim. The marketplace says the agent is reliable; you either believe the marketplace or you do not. The vendor says their agent has 99.4% accuracy on customer support cases; you either believe the vendor or you do not. There is no second test you can run, no underlying evidence you can re-grade, no rule book you can audit. Asserted trust is a brand affidavit.
A trust claim is verifiable when, in addition to the claim, the issuer publishes everything an independent reader needs to recompute the claim themselves. The inputs are queryable. The scoring rules are documented. The intermediate computations are reproducible. If you do not believe the marketplace, you do not have to. You can pull the inputs, apply the published rules, and arrive at the same score. If your numbers do not match theirs, that itself is evidence: either the issuer made an error, or you misunderstood the rules, or the issuer is quietly deviating from the published methodology. In all three cases, a verifiable system surfaces the disagreement; an asserted system buries it.
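A minimal sketch of what that recomputation could look like from the reader's side, assuming a hypothetical published rule that takes a weighted average of per-dimension inputs. The dimension names, weights, scores, and tolerance are all illustrative assumptions, not any issuer's actual methodology.

```python
from dataclasses import dataclass

# Hypothetical published weights; a real issuer would document its own.
PUBLISHED_WEIGHTS = {"accuracy": 0.5, "reliability": 0.3, "safety": 0.2}

@dataclass
class Claim:
    inputs: dict            # per-dimension evidence scores published by the issuer
    displayed_score: float  # the score shown on the badge

def recompute(inputs: dict, weights: dict) -> float:
    """Apply the (assumed) published rule: a weighted average of the inputs."""
    return sum(weights[dim] * inputs[dim] for dim in weights)

def verify(claim: Claim, tolerance: float = 0.5) -> bool:
    """True when an independent recomputation matches the displayed score."""
    recomputed = recompute(claim.inputs, PUBLISHED_WEIGHTS)
    return abs(recomputed - claim.displayed_score) <= tolerance

evidence = {"accuracy": 900, "reliability": 800, "safety": 700}
honest = Claim(evidence, displayed_score=830)
inflated = Claim(evidence, displayed_score=950)
```

Here `verify(honest)` passes and `verify(inflated)` fails. The mismatch does not say which of the three explanations applies, only that there is a disagreement worth surfacing.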
The difference matters because trust is, structurally, a transfer of risk from the entity making a decision to the entity standing behind a claim. When the claim is asserted, the only entity standing behind it is the issuer's brand. When the claim is verifiable, the entity standing behind it is the methodology, which can be inspected, contested, and improved without depending on any single issuer. This is exactly the same property that distinguishes a notarized document from a personal promise. Both can be true. Only one survives a hostile counterparty, a regulator's audit, or the issuer going out of business.
This is also the property that distinguishes infrastructure from product. Asserted trust is a product the issuer sells. Verifiable trust is a piece of infrastructure the ecosystem builds on. Buyers, regulators, insurers, and downstream platforms can all hang weight on a verifiable claim. Almost no one outside the issuer's revenue circle can responsibly hang weight on an asserted one.
The Verification Ladder, from bottom to top
Verification is not binary. It is a ladder, and most real-world systems sit on a specific rung even if the marketing makes them sound higher. The four rungs, in order of weakest to strongest, are: self-attested, platform-attested, jury-attested, and cryptographically-anchored. Every step up the ladder roughly multiplies the cost of producing the claim by an order of magnitude, and roughly divides the counterparty risk by an order of magnitude. Both numbers are rough. The shape is right.
Understanding the ladder is what lets a buyer or regulator look at a green badge and immediately ask the only question that matters: which rung is this, and what rung does my use case actually require? Putting a million dollars of escrow under an agent with a self-attested badge is malpractice. Demanding a cryptographically-anchored attestation from an agent that is going to summarize meeting notes is overengineering. The ladder is what makes those judgments tractable rather than vibes-based.
Rung 1: self-attested
A self-attested trust claim is one the agent or its operator made about itself. The agent's documentation says it is reliable. The vendor's marketing says the agent passes 95% of edge cases. The benchmark page on the vendor's site shows graphs the vendor produced. There is no independent input. The reader either trusts the vendor's word or they do not.
Self-attestation is not worthless. It is the floor. It tells you the vendor is willing to make a claim in writing, which is more than zero: it gives the buyer something to hold the vendor to in a contract dispute. It also gives the regulator something to enter into evidence if the vendor's claim is later proven false. But it transfers almost no risk to anyone other than the vendor's own legal exposure. The buyer is still the entity primarily on the hook if the agent fails to meet the claim.
Self-attestation is appropriate when the use case is low-stakes and the buyer's primary defense against failure is their own ability to detect and reverse the agent's mistakes quickly. Internal productivity tools, prototype workflows, low-revenue marketing automations. Not financial transactions. Not customer-facing decisions that cannot be retracted. Not anything where the cost of the first incident would exceed the buyer's appetite to take the risk personally.
Rung 2: platform-attested
A platform-attested claim is one the agent's hosting platform or marketplace made about it. The marketplace reviewed the agent's self-assessment, ran some basic checks, and stamped it. The hosting platform monitors the agent's runtime behavior and publishes uptime statistics. The badge represents the platform's opinion, informed by some platform-internal data that the buyer cannot see directly.
Platform attestation is a meaningful step up from self-attestation because the platform has skin in the game: its reputation is on the line if its badges turn out to be misleading. It is also an order of magnitude more expensive to produce because the platform has to actually look at the agent before stamping it. The asymmetry between platforms that take this seriously and platforms that rubber-stamp it is one of the largest signal sources buyers have, but it is invisible from outside the platform.
The limits of platform attestation are real. The platform's incentives are not perfectly aligned with the buyer's. The platform wants its high-revenue agents to look good. The platform's reviewers may not have the domain expertise to evaluate the agent's specific capability claims. The platform may not detect cross-platform misbehavior, such as the agent that looks great on this marketplace and is actively defrauding buyers on another. And the platform's review is opaque to the buyer; the buyer is asked to trust the platform's process without seeing it.
Platform attestation is appropriate for use cases where the buyer is comfortable accepting the platform's vouching as their primary defense, and where the cost of a missed bad agent is bounded by something other than the badge, typically by the platform's escrow, dispute, or refund policies. Most current marketplace badges sit at this rung. Most marketing language describes them as if they were higher.
Rung 3: jury-attested
A jury-attested claim is one produced by an independent panel of evaluators applying a published rubric to evidence they can re-examine. In Armalo's case, the jury is a multi-LLM panel with documented prompts and explicit outlier trimming (the top and bottom 20% of judgments are dropped), but the principle generalizes. A jury can be human, machine, or hybrid. What matters is that the jurors are not the issuer, the rubric is public, the evidence is replayable, and the deliberation produces an artifact that can be re-examined later.
Jury attestation is two or three orders of magnitude harder to produce than platform attestation because every dimension of every evaluation has to actually be run, not just plausibly stamped. The agent has to be put through real cases. The cases have to be designed to expose the dimensions you care about: accuracy, scope honesty, reliability, safety. The judgments have to be aggregated with explicit anti-gaming rules. The evidence has to be stored in a way that lets a future skeptical reader re-grade the same case and reach the same conclusion.
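The outlier trimming described above can be sketched as a trimmed mean. This is an illustration under stated assumptions, not Armalo's implementation: the text gives the 20% figure, but how the count is rounded for panels whose size is not a multiple of five, and the sample scores, are assumptions made here.

```python
import math

def trimmed_mean(judgments: list, trim_fraction: float = 0.2) -> float:
    """Drop the lowest and highest trim_fraction of judgments, then average the rest."""
    if not judgments:
        raise ValueError("need at least one judgment")
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(judgments)
    # Assumption: round the number of dropped judgments down at each end.
    k = math.floor(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Ten hypothetical juror scores, with one outlier at each extreme.
panel = [0.0, 7.8, 8.0, 8.1, 8.2, 8.3, 8.4, 8.5, 8.6, 10.0]
score = trimmed_mean(panel)
```

With ten judgments, the two lowest and two highest are dropped, so a single juror returning 0 or 10 cannot move the result.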
The payoff is that jury attestation is the lowest rung at which the trust claim is meaningfully verifiable by an outside party. The platform that issues it is no longer the only entity vouching for it. The methodology is. A buyer who does not trust the platform can still trust the score, because the score's defense is the published methodology and the queryable evidence, not the platform's brand. The platform is now a service provider running an attestation process, not a brand selling its own opinion.
Jury attestation is appropriate for use cases where the buyer needs to defend their procurement decision to someone else: a board, a regulator, an insurer, a customer of their own. The buyer can point at the jury's methodology and say: I did not just take the marketplace's word for it. There is an independent process behind this number, and here is the evidence trail. That is the rung at which the trust claim starts doing real work in a contractual or regulatory frame.
Rung 4: cryptographically-anchored
A cryptographically-anchored claim is a jury-attested claim whose evidence and methodology have been hashed and committed to an immutable public log. The jury's verdict is signed. The evidence is content-addressed so that any modification to the underlying records changes the hash and breaks the signature. The score itself is anchored on a public blockchain or transparency log so that the issuer cannot quietly revise history after the fact.
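Content addressing can be illustrated with nothing more than a hash over a canonical encoding of the evidence record. The sketch below assumes JSON records and SHA-256; the field names are hypothetical, and signing is left out because it would need a key pair and a cryptography library.

```python
import hashlib
import json

def content_address(record: dict) -> str:
    """SHA-256 over a canonical JSON encoding; the digest acts as the record's address."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def is_untampered(record: dict, committed_address: str) -> bool:
    """Recompute the address and compare it with the one committed earlier."""
    return content_address(record) == committed_address

# Hypothetical evidence record for one jury judgment.
evidence = {"case_id": "case-001", "juror": "model-a", "verdict": 8.2}
address = content_address(evidence)

# Any change to the record, however small, yields a different address.
tampered = {**evidence, "verdict": 9.9}
```

Sorting keys and fixing separators matters here: two readers who serialize the same record differently would otherwise compute different addresses for identical evidence.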
Anchoring is what survives the issuer. If Armalo disappeared tomorrow, an anchored score would still be verifiable because the hashes and signatures would still match the published evidence, and any independent operator could continue serving the same score with the same underlying methodology. The anchoring also defeats the most insidious failure mode of jury attestation: silent revision. Without anchoring, the issuer can change the rubric, re-run the jury, and update the score without anyone noticing. With anchoring, every revision is itself a public, dated event, and the historical score remains queryable as the answer the system returned at that specific moment.
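A rough sketch of how revisions can be made tamper-evident: each entry commits to the hash of the one before it, so rewriting an old entry breaks every later link. The `Revision` type and its fields are hypothetical, and a real deployment would also publish the latest hash to a public log so that the final entry is pinned as well.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Revision:
    timestamp: int   # seconds since epoch
    score: int
    reason: str
    prev_hash: str   # hash of the previous entry, "" for the first

    def entry_hash(self) -> str:
        payload = json.dumps([self.timestamp, self.score, self.reason, self.prev_hash])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def append(log: list, timestamp: int, score: int, reason: str) -> None:
    prev = log[-1].entry_hash() if log else ""
    log.append(Revision(timestamp, score, reason, prev))

def chain_is_intact(log: list) -> bool:
    """Each entry must commit to the hash of the entry before it."""
    expected = ""
    for rev in log:
        if rev.prev_hash != expected:
            return False
        expected = rev.entry_hash()
    return True

def score_at(log: list, timestamp: int) -> Optional[int]:
    """Score in effect at the given moment, or None if no score existed yet."""
    current = None
    for rev in log:  # entries are appended in time order
        if rev.timestamp <= timestamp:
            current = rev.score
    return current

log = []
append(log, 100, 820, "initial attestation")
append(log, 200, 790, "rubric v2 re-run")
append(log, 300, 805, "dispute upheld")
```

`score_at(log, 250)` returns 790, the answer the system gave at that moment, even though the current score is 805.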
Cryptographic anchoring is roughly another order of magnitude more expensive than jury attestation, mostly in storage costs, key management complexity, and the engineering required to keep the anchoring path actually trustworthy under the operational pressure of running it at scale. It is also the rung at which the trust claim finally becomes infrastructure in the strict sense: survivable beyond the issuer, queryable independently of the issuer, and inspectable by adversarial readers without the issuer's cooperation.
Anchoring is appropriate for any decision where the cost of a quietly revised history would itself be a material harm. Insurance underwriting. Regulatory compliance. Legal proceedings. Long-tail liability for high-autonomy agents. Any case where you might need to prove, years later, what the system said at the moment a specific decision was made. Most use cases do not need this rung. The ones that do, need it badly enough that nothing lower will do.
What each rung actually costs
A candid accounting of the cost differential helps practitioners pick a rung honestly rather than aspirationally. These are rough numbers that vary by stack and scale, but the ratios hold.
Self-attestation costs effectively nothing to produce. The vendor writes a paragraph. The cost of consuming it is also nothing, which is the problem: the buyer has nothing to inspect. The only real cost is the legal exposure the vendor takes on by making a written claim they could later be held to.
Platform attestation costs the platform some review time per agent, typically minutes to a few hours of analyst attention, plus the engineering cost of whatever automated checks the platform runs. It costs the buyer the time to read the platform's documentation about what the badge means. The total cost per attested agent is in the dollars to low hundreds of dollars range, depending on how seriously the platform takes its review process.
Jury attestation costs significantly more. A serious eval run against a real agent in Armalo's frame is tens to low hundreds of LLM judgments across multiple capability dimensions, each with replayable evidence. The compute cost is in the tens of dollars per evaluation cycle. The methodology development cost (building good rubrics, designing adversarial cases, calibrating the jury to avoid systemic bias) is much higher and amortizes across many agents. The total cost per attested agent is in the hundreds to low thousands of dollars over an agent's lifecycle, depending on how often the agent is re-attested and how broad its capability claims are.
Cryptographic anchoring adds storage, key management, and on-chain commitment costs on top of the jury attestation. Per-anchor on-chain costs on Base L2 are pennies per commitment, but the key management discipline required to keep the anchoring path trustworthy is non-trivial, operationally similar to running a certificate authority. The total marginal cost per anchored attestation is small, but the fixed cost of running the anchoring infrastructure correctly is substantial.
The right ladder height for a use case is the highest rung at which the expected cost of a false-trust incident still exceeds the cost of producing the attestation. Climb past that rung and you are paying for verification you do not need. Stop short of it and you are saving money by underprotecting yourself against an incident that will eventually arrive.
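One way to read this cost rule as code: keep climbing while the expected loss from a false-trust incident is still larger than the cost of the attestation. The dollar figures below are illustrative assumptions that loosely follow the order-of-magnitude ratios in this section; they are not measured costs.

```python
# Illustrative per-agent attestation costs in dollars (assumptions, not data).
RUNG_COST = {1: 0, 2: 100, 3: 1_000, 4: 10_000}

def required_rung(expected_incident_cost: float) -> int:
    """Highest rung whose attestation cost is still below the expected incident cost."""
    justified = [rung for rung, cost in RUNG_COST.items()
                 if cost < expected_incident_cost]
    # With nothing at stake, fall back to the floor of the ladder.
    return max(justified, default=1)

meeting_notes_agent = required_rung(50)         # low stakes
refund_agent = required_rung(5_000)             # customer-facing money movement
escrow_agent = required_rung(1_000_000)         # high-value financial execution
```

Under these assumed costs the three agents land on rungs 1, 3, and 4 respectively. The point of the sketch is the shape of the decision, not the specific thresholds.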
The four common slip patterns: how systems quietly fall down a rung
Most systems do not consciously occupy a low rung. They aim for a higher one and quietly slip down to a lower one through specific operational failures. Naming the slip patterns helps practitioners see them coming.
The evidence-storage slip. A system designs its scoring as rung 3, runs evaluations against published rubrics, and has a real multi-LLM jury, yet stores the evidence in a way that is not actually replayable by an outside party. The internal team can re-grade, but the evidence is locked behind authenticated APIs, retained for only 30 days, or stored in a format that loses the model versions and prompts that produced it. The score is rung 3 in spirit and rung 2 in practice because the verifiability claim cannot survive an audit. The fix is treating evidence storage as a first-class engineering concern with retention SLAs, content addressing, and explicit external-reader access patterns.
The methodology drift slip. A system publishes its scoring rubric on day one and then quietly tunes the weights, the thresholds, the prompts, the trim percentages over the following months. Each individual change is reasonable. The cumulative effect is that the published methodology no longer describes the actual production system, and the score an outside reader would compute does not match the score the system displays. This is rung-3 marketing wearing rung-2 mechanics. The fix is treating methodology revisions as public events: every change to the scoring code generates a published changelog, a versioned rubric, and a recomputation trail that lets readers reconcile old and new scores.
The adjudicator independence slip. A system claims a multi-party jury, and the jury is technically multi-party, but all the jurors are LLMs from the same provider, with the same training data biases, deployed by the same operator, prompted by the same engineer. The independence is technically present and operationally absent. A bias in the underlying model substrate produces correlated errors across jurors, and the outlier trimming does not save the score because the outliers are themselves correlated with the median. The fix is genuine substrate diversity (multiple model families, ideally multiple providers) with explicit measurement of inter-juror agreement to detect when the panel is producing artificially high consensus.
The dispute volume slip. A system has a documented dispute procedure, public outcomes, and an adjudication track record, yet the procedure is so onerous, slow, or expensive that essentially no one uses it. The absence of disputes looks like evidence of accuracy and is actually evidence of friction. A scoring system with no disputes filed against it is not necessarily a great scoring system; it is sometimes a system that has priced disputes out of reach. The fix is monitoring dispute submission rates as a leading indicator of methodology health, and explicitly subsidizing dispute filings on contested edges so that the dispute path stays warm.
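A crude proxy for measuring inter-juror agreement is the average spread of juror scores per case. The sketch below assumes scores on a shared numeric scale and uses a hypothetical threshold; a very low spread is only a prompt to look closer, since easy cases also produce consensus.

```python
from statistics import mean, pstdev

def mean_spread(scores_by_case: list) -> float:
    """Average per-case standard deviation across jurors; lower means more consensus."""
    return mean(pstdev(case) for case in scores_by_case)

def suspiciously_uniform(scores_by_case: list, min_spread: float = 0.2) -> bool:
    """Flag a panel whose jurors almost never disagree (threshold is an assumption)."""
    return mean_spread(scores_by_case) < min_spread

# Each inner list holds one case's scores from three jurors (hypothetical data).
same_family = [[8.0, 8.0, 8.1], [6.0, 6.1, 6.0], [9.0, 9.0, 9.0]]
diverse = [[8.0, 7.2, 8.6], [6.0, 5.1, 6.8], [9.0, 8.4, 9.5]]
```

In this made-up data the same-family panel is flagged and the diverse panel is not. A production check would more likely compare juror errors against known-answer cases, which this sketch does not attempt.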
Each of these slips is the difference between a system that the marketing language places at rung 3 and the system the operational reality places at rung 2. Buyers and regulators evaluating a trust system should specifically probe for these patterns rather than accepting the rung claim at face value.
The cross-platform problem: why portability is itself a verification dimension
A trust claim that is verifiable only inside the issuing platform is not really at the rung the issuer claims, regardless of how rigorous the methodology is. This is a subtle point and worth working through carefully.
Consider an agent that has earned a high-rung jury-attested score on Platform X. The score is real. The methodology is published. The evidence is replayable. By the rung-3 criteria laid out above, the score qualifies. But the score lives entirely inside Platform X's data model, identifier scheme, and API surface. A buyer who wants to hire the agent on Platform Y has no way to import the score, no way to verify it from outside Platform X's authentication boundary, and no recourse if Platform X decides to revise the score after the buyer has made the hire on Platform Y.
This is a real failure mode and it has shown up in early agent marketplaces. Two competing platforms each compute trust scores for the same agent. The scores are different, sometimes substantially, because each platform has its own evidence base and its own methodology. A buyer comparing across platforms cannot reconcile the two numbers because there is no shared substrate. The honest answer is that both numbers are platform-conditional claims, not properties of the agent.
Verifiability has a portability dimension that the four-rung ladder implicitly assumes. A rung-3 score with no portability is operationally a rung-2 score for any consumer outside the issuing platform. A rung-4 score with cryptographic anchoring on a public substrate is portable by construction: the anchor lives somewhere any consumer can read regardless of which platform they came from.
The practical implication for buyers is that a single platform's high-rung score deserves more scrutiny when used outside that platform than inside it. The practical implication for platforms is that contributing to a shared substrate (a public oracle, a transparency log, an open identifier scheme) is what lets their scores actually function as infrastructure rather than as platform-internal product features.
The reader artifact: the Verification Ladder scorecard
When you encounter a trust claim about an AI agent (a badge, a score, a tier, a green checkmark), run it through this scorecard before letting it influence a decision. The scorecard returns a rung from 1 to 4, plus a list of the missing properties that prevent it from rising higher.
Section A: Inputs
- Are the inputs that produced the claim queryable by a non-customer of the issuer? (Required for rung 3+)
- Are the inputs replayable, meaning the same evidence can be re-evaluated by an independent reader and produce the same answer? (Required for rung 3+)
- Are the inputs cryptographically content-addressed so that tampering is detectable? (Required for rung 4)
Section B: Methodology
- Are the scoring rules and weights published in enough detail that an independent reader could implement them? (Required for rung 3+)
- Are deviations from the published methodology themselves logged as public events? (Required for rung 4)
- Is there an explicit anti-gaming layer, such as outlier trimming, cross-validation, or rate limiting on submissions? (Recommended at rung 3+, required for rung 4)
Section C: Adjudicators
- Is the entity producing the score the same as the entity making money from the agent's success? (If yes, you are at rung 1 or 2 regardless of other properties.)
- Is the adjudication panel multi-party, with multiple independent evaluators rather than a single reviewer? (Required for rung 3+)
- Is the panel's deliberation an artifact that can be re-examined later? (Required for rung 3+)
Section D: History
- Can you query what the score was at a specific past moment, not just what it is now? (Required for rung 3+)
- Are score revisions themselves logged as public events with timestamps and reasons? (Required for rung 4)
- Does the score history survive the issuer disappearing? Would it still be inspectable if the operator went dark tomorrow? (Required for rung 4)
Section E: Disputes
- Is there a documented procedure for contesting the score? (Required for rung 3+)
- Are the outcomes of disputes themselves part of the public record? (Required for rung 3+)
- Are dispute outcomes anchored alongside scores? (Required for rung 4)
A claim that fails any rung-3 requirement is, by this scorecard, either rung 1 or rung 2. A claim that satisfies all rung-3 requirements but fails any rung-4 requirement is rung 3. The scorecard is intentionally strict: most badges in the wild today do not earn rung 3, and many that market themselves as verified do not earn rung 2.
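The grading rule can be written down as a small function. This is a sketch of the scorecard's logic only: the requirement keys are shorthand labels introduced here, and because the scorecard does not separate rung 1 from rung 2, the function reports them together.

```python
# Requirement label -> the rung it is required for, following Sections A to E.
# The anti-gaming layer is only recommended at rung 3, so it is listed under 4.
REQUIREMENTS = {
    "inputs_queryable": 3, "inputs_replayable": 3, "inputs_content_addressed": 4,
    "rules_published": 3, "deviations_logged": 4, "anti_gaming_layer": 4,
    "panel_multi_party": 3, "deliberation_reexaminable": 3,
    "history_queryable": 3, "revisions_logged": 4, "survives_issuer": 4,
    "dispute_procedure": 3, "dispute_outcomes_public": 3,
    "dispute_outcomes_anchored": 4,
}

def grade(answers: dict, scorer_profits_from_agent: bool) -> tuple:
    """Return (rung label, requirements still missing for a higher rung)."""
    missing3 = [name for name, rung in REQUIREMENTS.items()
                if rung == 3 and not answers.get(name, False)]
    missing4 = [name for name, rung in REQUIREMENTS.items()
                if rung == 4 and not answers.get(name, False)]
    # A scorer that profits from the agent caps the claim regardless of the rest.
    if scorer_profits_from_agent or missing3:
        return "rung 1 or 2", missing3 + missing4
    if missing4:
        return "rung 3", missing4
    return "rung 4", []

all_yes = {name: True for name in REQUIREMENTS}
rung3_only = {name: rung == 3 for name, rung in REQUIREMENTS.items()}
```

Unanswered questions are treated as "no", which matches the scorecard's intent that a claim earns a rung only by demonstrating each property.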
Use the scorecard not to dismiss lower-rung claims but to align them with their appropriate use cases. A rung-2 badge is fine for a productivity workflow. It is malpractice as the only defense behind a six-figure escrow.
The temporal dimension: a verifiable claim today is not a verifiable claim forever
A subtlety the four-rung model glosses over is that verifiability is a property that decays. A rung-3 claim today, with replayable evidence and published methodology, becomes a rung-2 claim or worse over time as the underlying infrastructure ages. Evidence storage degrades. Old methodology versions become hard to reconstruct. The team that designed the rubric leaves and nobody remembers exactly what edge cases the prompts were tuned for. The cryptographic primitives used for anchoring become considered weak. Three years after issuance, a once-rigorous claim may be operationally indistinguishable from a self-attestation.
This is not a hypothetical concern. It is what has already happened to most early-2020s reputation systems. The scores still exist, but the evidence trails have been silently deleted in storage cleanups, the methodology documents have been moved or lost in re-orgs, and the original engineers are no longer available to answer reconstruction questions. A buyer trying to verify those scores today gets the answer that the operator's records have aged out of accessible storage and the claim has to be taken on faith. The claim quietly slipped down the ladder without anyone explicitly demoting it.
The defense against temporal decay is intentional preservation. Evidence retention policies that cover the full lifetime of any claim built against the evidence, which, for high-stakes use cases, is essentially indefinite. Methodology documents stored alongside the scores they produced, version-stamped so that historical scores can be reconstructed against the methodology in effect when they were issued. Cryptographic anchoring on substrates with credible long-term durability (which excludes some of the more fragile blockchain ecosystems and includes the better-engineered ones plus traditional certificate transparency logs). And periodic self-audits in which the operator re-verifies a sample of historical claims to confirm the verification path still works.
This discipline is expensive. It is also the property that distinguishes infrastructure from a feature. A platform that does not commit to long-horizon evidence preservation is not really at rung 3 or 4 for any claim that might need to be defended more than a year after issuance, regardless of how the methodology looks today.
Counter-argument: "Verification is expensive theater that slows the market"
The strongest objection to insisting on verifiable trust is that it adds friction to a market that is moving fast and benefits from the friction being low. Self-attestation lets vendors ship agents and lets buyers hire them within hours. Platform attestation adds a day or two. Jury attestation adds a week and noticeable cost. Cryptographic anchoring adds operational complexity that most teams have no reason to take on for most agents. The objection goes: most agents are not high-stakes; most decisions are reversible; the speed of the market is itself one of its competitive advantages over the human equivalents being replaced. Why drag the whole market through a verification process designed for outliers?
The objection is partially right and largely a misframing. It is right that not every agent needs rung 4. It is right that imposing rung 3 on every transaction would slow the market unnecessarily. The misframing is treating this as a binary choice between full verification or no verification. The ladder exists precisely so that participants can match the verification cost to the stakes. A productivity agent gets rung 1 or 2. A customer-support agent handling refunds gets rung 3. A financial-execution agent handling escrow gets rung 4. The market does not slow down; it grows up.
The other half of the answer is that asserted trust is itself paying a cost; it is just paying it in a different currency. Every false-trust incident in a market dominated by self-attestation produces a wave of reactive caution, contract revisions, regulator attention, and insurance premium hikes that affects every participant, including the honest ones. The friction the market avoids by skipping verification gets deposited elsewhere, usually with worse incidence: concentrated on the buyers who happened to draw the bad agent. Verification distributes the friction more evenly and converts it into a public good. It feels slower because the cost is visible. It is, on net, faster because the cost is bounded.
The right framing is not "how much friction can we tolerate" but "where does the friction land." Verification puts it at the moment of issuance, paid by the issuer who has the most leverage to amortize it across many transactions. Assertion puts it at the moment of incident, paid by whichever buyer lost the lottery. The first is infrastructure. The second is externality.
What Armalo does
Armalo's scoring system is designed to occupy rung 3 by default, with cryptographic anchoring at rung 4 available for use cases that need it. The composite score across twelve dimensions (accuracy, Metacal self-audit, reliability, safety, security, bond, latency, scope honesty, cost efficiency, model compliance, runtime compliance, harness stability) is produced by a multi-LLM jury with the top and bottom 20% of judgments trimmed to defeat single-evaluator gaming. Scoring rules and weights are published. Underlying evidence is content-addressed and queryable through the Trust Oracle at /api/v1/trust/. Score revisions are themselves logged events with timestamps and reasons. A 200-point swing in either direction triggers an anomaly review before the new score becomes the public record. Decay is mechanical: one point per week after a seven-day grace period, applied identically to every agent. Disputes go through a documented procedure with public outcomes. Cryptographic anchoring on Base L2 is available for high-stakes attestations and is the default for any score backing economic activity above a defined threshold. The point is not that Armalo is the only system that could do this. The point is that until a system meets these properties, its trust claims should be read as marketing, not as infrastructure.
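The decay and anomaly rules are simple enough to sketch. The constants come from the paragraph above; everything else is an assumption made for illustration, including that decay counts whole weeks since the last evaluation, that scores floor at zero, and that a swing of exactly 200 points counts as an anomaly. This is not Armalo's code.

```python
GRACE_DAYS = 7
DECAY_POINTS_PER_WEEK = 1
ANOMALY_SWING = 200

def decayed_score(score: int, days_since_last_evaluation: int) -> int:
    """Subtract one point per full week elapsed after the grace period."""
    weeks_past_grace = max(0, days_since_last_evaluation - GRACE_DAYS) // 7
    # Assumption: a score never decays below zero.
    return max(0, score - DECAY_POINTS_PER_WEEK * weeks_past_grace)

def needs_anomaly_review(old_score: int, new_score: int) -> bool:
    """A swing of ANOMALY_SWING or more in either direction is held for review."""
    return abs(new_score - old_score) >= ANOMALY_SWING

still_in_grace = decayed_score(800, 7)
five_weeks_on = decayed_score(800, 35)
held_for_review = needs_anomaly_review(700, 900)
```

Because both rules are mechanical, an outside reader with the evaluation dates and the revision log could recompute them without asking the issuer.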
FAQ
Is rung 4 always better than rung 3? Not for most use cases. Rung 4 adds operational complexity that pays off only when the cost of silent score revision is itself a material harm. For an agent producing meeting summaries, rung 3 is more than adequate. For an agent executing financial transactions whose history might need to be reconstructed in a legal proceeding three years later, rung 4 is the floor.
How do I tell what rung a marketplace's badge actually sits on? Apply the scorecard. The five sections (inputs, methodology, adjudicators, history, disputes) produce a rung answer. If the marketplace cannot answer yes to the rung-3 questions in Section A, you are at rung 1 or 2 regardless of the marketing language on the badge.
What if I do not have time to apply the scorecard for every agent? Pre-grade the platforms, not the agents. Most of a platform's badges sit at the same rung because the rung is determined by the platform's verification process, not the individual agent. Once you know a platform's rung, you can use that as the default for every agent on it, and apply the scorecard only when an unusually high-stakes hire warrants the extra check.
What is the cheapest legitimate way to move from rung 2 to rung 3? Replace the platform's internal review with a published methodology and a multi-party adjudication panel applied to replayable evidence. The hard part is not the panel; it is the methodology and the evidence storage. Most platforms that claim rung 3 are actually rung 2 with better marketing. Building real rung 3 takes a quarter of dedicated engineering and a design partner willing to be the first agent through the process.
Does cryptographic anchoring require a blockchain? Anchoring requires an append-only public log that no single party controls. Public blockchains are the most readily available substrate, but certificate transparency logs and similar transparency primitives also qualify. The substrate matters less than the property: nobody, including the operator, can quietly revise history after the fact.
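To make the "nobody can quietly revise history" property concrete, here is a minimal hash-chain sketch. It shows only the tamper-evidence half of anchoring. The other half, a log that no single party controls, comes from publishing the head hash to an external substrate, which this example does not do. All names and the record layout are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for an empty log


def entry_hash(prev_hash, record):
    """Hash a record together with the hash of the entry before it."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(log, record):
    """Append a record and return the new head hash (the value to anchor)."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"record": record, "hash": entry_hash(prev, record)})
    return log[-1]["hash"]


def verify(log, anchored_head):
    """Recompute the chain and compare it with the externally anchored head."""
    prev = GENESIS
    for entry in log:
        if entry_hash(prev, entry["record"]) != entry["hash"]:
            return False  # an entry was edited without rehashing
        prev = entry["hash"]
    return prev == anchored_head  # catches a fully rehashed rewrite


log = []
append(log, {"agent": "a1", "score": 812, "reason": "initial eval"})
head = append(log, {"agent": "a1", "score": 790, "reason": "re-eval"})
print(verify(log, head))

log[0]["record"]["score"] = 900  # a quiet revision of history
print(verify(log, head))
```

Because each hash covers the previous one, changing any past record changes the head. An operator who rewrites and rehashes the whole log still cannot match a head that was already published somewhere they do not control.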
Can a single agent be at different rungs for different claims? Yes, and frequently should be. An agent might have a rung-3 score for accuracy because its evals are jury-attested, a rung-4 record for its bond posture because the bond is anchored on-chain, and a rung-2 claim for its uptime because the uptime statistic is platform-reported. Buyers should evaluate each claim against the rung appropriate to the stakes it is being used to defend.
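One way to picture per-claim grading is as a lookup from claim to rung, compared against the rung the buyer requires for each claim. The sketch below is an assumption-laden illustration: the enum names follow the four levels named in this essay, while the function name and the example requirements are invented for the purpose.

```python
from enum import IntEnum


class Rung(IntEnum):
    SELF_ATTESTED = 1
    PLATFORM_ATTESTED = 2
    JURY_ATTESTED = 3
    CRYPTOGRAPHICALLY_ANCHORED = 4


def shortfalls(claims, required):
    """Return the claims whose rung is below what the buyer requires.

    A claim the agent does not make at all counts as a shortfall.
    """
    return sorted(
        name for name, need in required.items()
        if claims.get(name, 0) < need
    )


# The agent from the answer above: three claims at three different rungs.
agent_claims = {
    "accuracy": Rung.JURY_ATTESTED,
    "bond": Rung.CRYPTOGRAPHICALLY_ANCHORED,
    "uptime": Rung.PLATFORM_ATTESTED,
}

# Hypothetical buyer who needs jury-level evidence for uptime as well.
buyer_needs = {
    "accuracy": Rung.JURY_ATTESTED,
    "bond": Rung.CRYPTOGRAPHICALLY_ANCHORED,
    "uptime": Rung.JURY_ATTESTED,
}

print(shortfalls(agent_claims, buyer_needs))
```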
Will buyers actually demand verification, or will the market keep accepting assertion? Right now, mostly the second. The transition happens after the first major public incident in which an asserted badge fails a buyer with the resources to litigate. After that, procurement teams everywhere start asking the scorecard questions, and the platforms that cannot answer them lose enterprise pipeline. The transition is already starting in regulated industries (financial services, healthcare, critical infrastructure) and will spread outward from there.
Why does Armalo publish its methodology if a competitor could copy it? Because the moat is not the methodology. The moat is the operational discipline of running it at scale, the multi-LLM jury infrastructure, the dispute-adjudication track record, the bond and escrow rails that the score plugs into, and the network of platforms that have already integrated against the Trust Oracle. A competitor that copies the methodology has copied a spec, not the ecosystem that makes the spec useful. Open methodology is itself the rung-3 requirement.
Bottom line
The word "verified" is the most overworked term in the agent economy. Most of the badges wearing it are assertions in a green wrapper. The Verification Ladder is the discipline that turns the word back into something specific. Self-attestation tells you the vendor is willing to make a claim. Platform attestation tells you a platform stamped the claim. Jury attestation tells you an independent process produced the claim against published rules. Cryptographic anchoring tells you the claim will survive its issuer. Each rung does different work. Match the rung to the stakes. Stop letting the marketing pick the rung for you.