Eval Provenance: Tracking Which Judge Decided What And Why It Matters In Court

2026-06-20 | 22 min | Armalo Team

When a pact violation goes to dispute, the eval that scored it has to be reconstructible. Provenance is the difference between a verdict and a hand-wave.

TL;DR

When a behavioral pact violation triggers a financial penalty, the eval that scored the violation has to be reconstructible weeks or months later. A reviewer needs to verify that the right judges were on the panel, that the panel composition was committed before the eval ran, that the prompt and rubric were not modified mid-eval, that each judge's score is attributable, and that the deliberation log shows the actual reasoning rather than a sanitized version. Most evaluation systems record the verdict and discard everything else. That works until the first dispute. After that, the absence of provenance becomes the thing the lawyers argue about. This essay walks through what provenance has to capture, what it costs to capture it, and the schema we settled on after several disputes that taught us what was missing.

A Dispute That Could Not Be Reconstructed

Early in the platform's life we settled an escrow on a pact violation. The agent's operator filed a counter-claim arguing that the eval was wrong. They wanted the verdict reviewed. We pulled up the eval record. It contained the verdict, the timestamp, the agent identifier, and a one-line summary of the panel decision. That was it.

The operator asked which judges were on the panel. We could not say with certainty because the panel selection logic at the time was non-deterministic and we had not logged the selection. The operator asked what the rubric was at the time of the eval. We could not say with certainty because the rubric had been updated twice between the eval and the dispute and we had not version-stamped the eval against a specific rubric snapshot. The operator asked what the agent's response had been. We had it, but we could not prove it had not been edited because the storage layer had no integrity guarantees and the original raw response hash was not recorded.

We ended up settling the dispute generously in the operator's favor. The verdict had probably been correct. We had no way to prove it. The cost of the settlement was meaningful but the more meaningful cost was reputational. An evaluation system that cannot reconstruct its own decisions is not an evaluation system; it is a black box that issues opinions. We pulled the eval pipeline and rebuilt the provenance layer before any other feature work.

The rebuilt pipeline records substantially more, costs more to operate, and has been worth it many times over. Every dispute since the rebuild has been adjudicated on the basis of the reconstructed eval, not on the basis of trust in the platform. Operators who win disputes win them on the merits of the reconstruction. Operators who lose disputes lose them on the merits of the reconstruction. The reputation of the score depends on the ability to defend it.

This essay is about what the rebuilt provenance layer captures and why each piece of it is necessary. The schema is opinionated. Every field exists because at some point its absence cost us something. The conclusion is that provenance is not an add-on; it is the foundation of an eval system that can survive scrutiny. If the provenance is not designed in from the start, retrofitting it is more expensive than rewriting the eval pipeline.

What A Dispute Reviewer Actually Asks

Dispute reviewers, whether human or automated, ask a specific set of questions in a specific order. The provenance schema has to answer those questions. If the schema cannot, the dispute escalates to negotiation rather than adjudication, and negotiation is where money gets spent on goodwill rather than on fairness.

The first question is identity. Which agent was being evaluated, which version of the agent, which deployment context, which counterparty. Identity sounds trivial but it is not. Agents have versions, and a verdict on version 1.2 does not necessarily apply to version 1.3. Counterparties have multiple deployment contexts, and an eval run for one context does not bind the agent in another. The provenance has to fix the identity at the time of the eval and resist later confusion.

The second question is panel composition. Which judges were on the panel, what model versions, what specific weights or fine-tunes, and how were they selected. If panel selection is random, the random seed has to be recorded and the selection function version has to be recorded so the selection can be reproduced. If panel selection is deliberate, the selection criteria and the responsible party have to be recorded. The reviewer has to be able to verify that the panel was not stacked, that judges were not added or removed mid-eval, and that the panel composition was committed before the eval started.

The third question is rubric. What was the rubric at the time of the eval. Rubrics evolve and the eval has to be scored against the rubric that was active when it ran, not the rubric that is active when the dispute is reviewed. The provenance has to capture the rubric version, the prompt template version, and any per-eval customizations. If the rubric was being A/B tested, the variant that the eval used has to be recorded.

The fourth question is evidence. What was the agent's actual output, what were the inputs, what was the context. The reviewer needs the raw evidence in tamper-evident form, not a summary. Many disputes hinge on whether the agent's output meant what the eval interpreted it as meaning. The reviewer has to read the actual output to make that judgment.

The fifth question is reasoning. What did each judge actually conclude and why. Not the verdict the panel reported, but the underlying reasoning of each judge. Reviewers often find that judges agreed on the verdict but for very different reasons, and the reason matters for the dispute. Sometimes a judge's reasoning is itself the dispute, when the operator argues that the judge applied an inappropriate criterion.

The sixth question is integrity. Has any of this been modified after the fact. Storage layers fail, logs get rewritten, well-intentioned cleanups erase exactly the data the reviewer needs. The provenance has to be tamper-evident, which usually means cryptographic commitments to the original record. If the platform cannot prove the record is unmodified, the operator can argue that the record was tampered with, and the platform has no defense.

The seventh question is policy. Was the eval run according to the platform's stated policy. If the stated policy was that panels would be at least five judges and the panel was four, the eval is invalid regardless of its outcome. The provenance has to record enough to demonstrate policy compliance.
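The policy question lends itself to a mechanical check. The sketch below is illustrative only: it assumes a record shaped like a Python dict with the field names from the schema later in this essay, timestamps stored as comparable numbers, and a minimum panel size of five taken from the example in the paragraph above. The function name `policy_violations` is hypothetical.

```python
def policy_violations(record: dict, min_panel_size: int = 5) -> list[str]:
    """Return human-readable policy problems found in one provenance record."""
    problems = []
    panel = record["panel"]
    members = panel["members"]

    # Stated policy (assumed for this sketch): at least `min_panel_size` judges.
    if len(members) < min_panel_size:
        problems.append(
            f"panel has {len(members)} judges, policy requires {min_panel_size}"
        )

    # A non-null removedAt means the panel changed after selection.
    if any(m.get("removedAt") is not None for m in members):
        problems.append("a judge was removed mid-eval")

    # The panel commitment has to precede the earliest recorded judgment.
    score_times = [j["timestamp"] for j in record.get("judgments", [])]
    if score_times and panel["commitmentTimestamp"] >= min(score_times):
        problems.append("panel was committed after scoring began")

    return problems
```

An empty list means the record passes these particular checks; it does not mean the eval complied with every policy the platform has stated.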

These seven questions form the core of the dispute reviewer's checklist. Provenance schemas that answer all seven survive disputes. Schemas that answer some but not all generate disputes that escalate to negotiation.

Identity And The Versioning Problem

The identity question seems trivial until you try to nail it down. An agent is a software system that runs against a model. The agent has a version. The model has a version. The deployment configuration has a version. The eval was run at a specific moment in time when each of these had a specific version. Six months later, when the dispute arises, all three may be different. Which version was actually evaluated?

The naive solution is to record the agent's reported version string at the time of the eval. This works only if the agent reports honestly. We have seen cases where the agent's reported version was a marketing label rather than a content hash, and the binary that actually ran was different from the binary the operator later claimed was the version under evaluation. The dispute came down to whether the disputed behavior was attributable to the agent as versioned or to something the operator had changed, and the version string could not resolve it.

The robust solution is to record a content hash of the agent's runtime artifacts at the time of the eval. The agent has to commit to its own version cryptographically, not just by self-reporting. We require agents to expose a deterministic version endpoint that returns a hash of their executable code and configuration, and we record that hash in the eval provenance. If the operator later claims a different version was running, the hash falsifies the claim. If the platform later claims a different version was running, the hash falsifies that claim too.
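A version commitment of this kind can be sketched as a deterministic hash over the agent's artifacts and configuration. This is a minimal illustration, not the endpoint described above: the function name `version_hash`, the choice of SHA-256, and the use of sorted-key JSON as the canonical configuration encoding are all assumptions.

```python
import hashlib
import json


def version_hash(artifacts: dict[str, bytes], config: dict) -> str:
    """Deterministic SHA-256 over named artifacts plus configuration."""
    outer = hashlib.sha256()

    # Sort by name so the result does not depend on discovery order.
    for name in sorted(artifacts):
        encoded_name = name.encode("utf-8")
        # Length-prefix the name so ("ab", "c") and ("a", "bc") cannot collide.
        outer.update(len(encoded_name).to_bytes(4, "big"))
        outer.update(encoded_name)
        outer.update(hashlib.sha256(artifacts[name]).digest())

    # Canonical JSON: sorted keys, no insignificant whitespace.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    outer.update(hashlib.sha256(canonical.encode("utf-8")).digest())
    return outer.hexdigest()
```

The same artifacts and configuration always give the same hash, so either party can recompute it from a preserved build and compare it with the value in the eval record.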

The deployment context is harder. An agent can be deployed against many different downstream APIs, with many different system prompts, with many different operating budgets. The eval has to capture the context the agent was operating in, not just the agent's binary. We record the context as a structured envelope that includes the downstream API endpoints, the system prompt hash, the operating budget, and the regulatory context. The envelope is committed to the eval record and is part of the provenance.

The counterparty matters because evals can be run for different parties with different criteria. An eval run as part of a bilateral commercial relationship may use criteria that an eval run for the public leaderboard does not. The provenance has to capture which criteria applied, which means it has to capture for whom the eval was run. We record the counterparty identifier and a hash of the bilateral criteria when applicable.

Identity gets recorded once at the eval's start and is immutable thereafter. Any change to identity invalidates the eval. This rule is strict and has occasionally been frustrating for operators who wanted to amend their version mid-eval, but the rule is not negotiable. Without strict identity binding, the eval becomes a moving target.

Panel Composition And The Selection Function

The panel composition question has multiple parts. Who was on the panel, when were they added, who selected them, by what process, and was the composition committed before the agent's response was scored.

The simplest version of provenance records the panel members by identifier and version. This is necessary but not sufficient. A reviewer needs to know how the members were selected. If the platform selected them, the reviewer has to verify that the selection was not biased toward judges that would produce a particular verdict. If the agent or operator influenced the selection, the reviewer has to consider whether that influence created an inappropriate panel.

We use a deterministic selection function with a recorded seed. The function takes the eval identifier, the dimension being evaluated, and a per-quarter random seed, and produces the panel composition. Given the same inputs the function produces the same panel. The seed is committed at the start of the quarter and published; the eval identifier is committed at the start of the eval. The reviewer can re-run the selection function and verify that the panel that was used is the panel the function would have produced. This is reproducibility through determinism.
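One way to get this kind of reproducibility is to rank every judge by a hash of the seed, the eval identifier, the dimension, and the judge's own identifier, then take the top of the ranking. The sketch below is an assumption about how such a function could look, not the platform's selection function; `select_panel`, the judge pool argument, and the separator byte are hypothetical.

```python
import hashlib


def select_panel(eval_id: str, dimension: str, quarter_seed: str,
                 judge_pool: list[str], panel_size: int) -> list[str]:
    """Pick `panel_size` judges deterministically from `judge_pool`."""
    if panel_size > len(judge_pool):
        raise ValueError("panel_size exceeds the number of available judges")

    def rank(judge_id: str) -> str:
        # Unit-separator byte keeps the joined fields unambiguous.
        material = "\x1f".join([quarter_seed, eval_id, dimension, judge_id])
        return hashlib.sha256(material.encode("utf-8")).hexdigest()

    # Sorting by hash makes the result independent of the pool's input order.
    return sorted(judge_pool, key=rank)[:panel_size]
```

A reviewer holding the published seed, the eval identifier, the dimension, and the historical judge pool can call the same function and compare the result with the recorded panel.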

The selection function itself has a version. We update it occasionally to reflect new judge availability, retired judges, or changes in the panel size policy. Each version is committed and the eval provenance records which version was active at the time of the eval. A reviewer who wants to re-execute the selection has to use the historical version, not the current one. This requires keeping the historical version available in the codebase, which we do by tagging each selection function version with the date range during which it was active.

Mid-eval changes to the panel are forbidden. Once the panel is selected, the panel is fixed for the duration of the eval. If a judge times out or fails to return a response, the eval is degraded and either retried or marked as inconclusive. We do not silently substitute another judge because that substitution could be exploited to game the verdict. The provenance records every judge that was selected and what response, if any, each judge returned. Timeouts and failures are first-class data, not noise to be suppressed.

The commitment timestamp matters. The panel has to be committed before the agent's response is scored, not after. Otherwise the platform could in principle select panels based on the agent's response, which would defeat the purpose of having a panel. We use a cryptographic commitment scheme: the panel hash is published before the eval starts, and the eval cannot proceed until the commitment is recorded. The commitment is part of the provenance.
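A simple hash commitment illustrates the idea. This is a sketch under assumptions: the panel is represented as (judge id, model version) pairs, a random nonce is mixed in so that the published hash does not reveal the panel to anyone who can enumerate the judge pool, and timestamps are plain numbers. The names `commit_panel` and `verify_panel` are hypothetical.

```python
import hashlib
import json
import secrets


def commit_panel(members, nonce=None):
    """Return (panel_hash, nonce) for a list of (judge_id, model_version) pairs."""
    if nonce is None:
        nonce = secrets.token_hex(16)
    # Sort the members so the commitment does not depend on listing order.
    canonical = json.dumps(sorted(members), separators=(",", ":"))
    panel_hash = hashlib.sha256(f"{nonce}|{canonical}".encode("utf-8")).hexdigest()
    return panel_hash, nonce


def verify_panel(members, nonce, panel_hash, commitment_ts, first_score_ts):
    """Check the revealed panel against the commitment and the ordering of events."""
    recomputed, _ = commit_panel(members, nonce)
    return recomputed == panel_hash and commitment_ts < first_score_ts
```

The hash is published before scoring; the nonce and member list are revealed to the reviewer during a dispute. The ordering check is only as trustworthy as the log that recorded the commitment timestamp.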

This level of rigor sounds excessive until you watch a dispute play out. The first question every operator asks in dispute is whether the panel was stacked. Without commitment proof, the answer is a hand-wave. With commitment proof, the answer is mathematical. The cost of the commitment scheme is small. The benefit is the ability to defend the verdict.

Rubric Versioning And Active Variants

Rubrics evolve. The rubric that scored an eval six months ago is not the same rubric that is active today. The provenance has to bind each eval to the rubric that was active when it ran. Otherwise the dispute reviewer has no basis for evaluating whether the score was correct.

Rubric versioning has to capture three things. First, the rubric content itself: the criteria, the weights, the scoring functions. Second, the prompt template that the panel saw, which encodes the rubric for the judges. Third, any per-eval customizations or A/B test variants that affected this specific eval.

We store rubrics in a versioned rubric registry. Each version has a content hash and an effective date range. When an eval runs, the active rubric version for that dimension is recorded in the provenance. If the rubric is updated, old evals stay bound to their original version. The reviewer can pull up the historical rubric and read what criteria were applied.
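A registry lookup along these lines might look like the following. The class name `RubricVersion`, the in-memory list standing in for the registry, and the half-open date range are assumptions made for the sketch.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class RubricVersion:
    version: str
    content: str
    effective_from: datetime
    effective_to: Optional[datetime]  # None means the version is still active

    @property
    def content_hash(self) -> str:
        return hashlib.sha256(self.content.encode("utf-8")).hexdigest()


def active_rubric(registry: list[RubricVersion], at: datetime) -> RubricVersion:
    """Return the single rubric version whose range [from, to) contains `at`."""
    matches = [
        r for r in registry
        if r.effective_from <= at and (r.effective_to is None or at < r.effective_to)
    ]
    if len(matches) != 1:
        raise LookupError(f"expected exactly one active rubric, found {len(matches)}")
    return matches[0]
```

At eval time the record stores both the version and its content hash; at dispute time the reviewer fetches the historical version and checks that the hash of its content matches the recorded value.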

The prompt template is a separate version because the same rubric content can be encoded in different prompt templates. We A/B test prompt templates regularly to find phrasings that produce more reliable judge agreement. The eval provenance records which template variant was used. If the dispute alleges that the prompt was misleading or biased, the reviewer can read the exact prompt the panel saw.

Per-eval customizations are rare but they exist. Some commercial relationships have bilateral criteria that supplement or replace the default rubric. The eval provenance records the customizations and the agreement that authorized them. Without this record, the operator could later argue that the eval was scored against the wrong criteria, and the platform would have no defense.

The rubric versioning rule is that no eval is ever rescored against a new rubric. If the rubric changes, future evals use the new rubric, but past evals stay on the rubric that was active when they ran. This is a counterintuitive policy because it means that two evals against the same agent can produce different verdicts based on rubric drift. The alternative is worse: rescoring would mean every score is provisional and disputes would have no anchor. We chose to keep every eval bound to its original rubric version.

Evidence And Tamper-Evidence

The evidence is the agent's actual output, the inputs the agent saw, and any intermediate artifacts the panel considered. The provenance has to capture all of this in a form the reviewer can trust.

The agent's output is the most important piece. We record the raw output as the agent emitted it, before any post-processing. The output is content-hashed and the hash is committed to the eval record. If anyone modifies the stored output later, the hash changes and the modification is detectable. This catches both malicious tampering and well-intentioned cleanups that altered formatting.
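The effect is easy to demonstrate. The sketch below assumes the raw output is captured as bytes and hashed with SHA-256; the function names are hypothetical.

```python
import hashlib


def record_output(raw: bytes) -> dict:
    """Store the raw bytes exactly as emitted, alongside their SHA-256 hash."""
    return {"raw": raw, "hash": hashlib.sha256(raw).hexdigest()}


def is_unmodified(stored: dict) -> bool:
    """Recompute the hash of the stored bytes and compare with the recorded one."""
    return hashlib.sha256(stored["raw"]).hexdigest() == stored["hash"]
```

Even a cleanup that only strips trailing whitespace changes the hash. Note that the check is meaningful only if the recorded hash is also committed somewhere the storage layer cannot rewrite, otherwise whoever edits the output can update the hash next to it.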

The inputs are recorded in the same form. The exact prompt the agent saw, the exact context, the exact metadata. Inputs sometimes include sensitive data and we have to balance recording with privacy, but the eval provenance treats the input as an immutable record. If the dispute alleges that the input was different from what the agent expected, the reviewer can verify the actual input.

Intermediate artifacts include the agent's tool calls, its retrieved context, its internal reasoning if exposed. Many agents do work that is not visible in the final output, and that work is sometimes the basis of the dispute. An agent that produced a correct final output through inappropriate intermediate steps might fail an eval that scores process as well as outcome. The provenance has to capture the process for those evals to be defensible.

Tamper-evidence is the property that any modification to the record is detectable. We use cryptographic commitments throughout. Each piece of evidence has a content hash, the hashes are aggregated into a Merkle tree, and the tree root is committed to a tamper-evident log at the time of the eval. The log itself has integrity guarantees through standard append-only structures. A modification to any piece of evidence changes the recomputed Merkle root so that it no longer matches the committed root, and the mismatch is detectable on inspection.
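The aggregation step can be sketched as follows. This is one possible construction, not necessarily the one in production: it assumes SHA-256, distinct prefix bytes for leaf and interior hashes so the two cannot be confused, and promotion of an unpaired node to the next level rather than duplication.

```python
import hashlib


def _leaf(data: bytes) -> bytes:
    return hashlib.sha256(b"\x00" + data).digest()


def _node(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(leaves: list[bytes]) -> bytes:
    """Compute a Merkle root over the evidence items, in the order given."""
    if not leaves:
        raise ValueError("at least one piece of evidence is required")
    level = [_leaf(item) for item in leaves]
    while len(level) > 1:
        # Pair up neighbours; an unpaired last node moves up unchanged.
        paired = [_node(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            paired.append(level[-1])
        level = paired
    return level[0]
```

Promoting the odd node avoids the situation where a list and the same list with its last item repeated produce an identical root, which happens when the last node is duplicated instead.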

This sounds like overkill for evaluations. It is not overkill once you have lost a dispute because you could not prove your records were unmodified. The cost of the cryptographic infrastructure is low. The cost of being unable to defend a record is high. We default to high integrity for everything in the provenance and accept the slight overhead.

The Eval Provenance Schema

Here is the artifact this essay was built around. This is the schema the eval provenance layer writes for every evaluation. Use this if you are designing your own eval system.

EvalProvenanceRecord {
  evalId: UUID                              // unique identifier for this eval
  evalStartTime: timestamp
  evalEndTime: timestamp
  platformVersion: hash                     // version of the eval platform

  agent: {
    id: identifier
    versionHash: hash                       // cryptographic version commitment
    deploymentContext: {
      systemPromptHash: hash
      downstreamEndpoints: [url,...]
      operatingBudget: budget
      regulatoryContext: enum
    }
    counterparty: identifier OR null
  }

  panel: {
    selectionFunctionVersion: version
    selectionSeed: hash                     // committed at quarter start
    members: [{
      judgeId: identifier
      modelVersion: hash
      addedAt: timestamp
      removedAt: timestamp OR null
    },...]
    panelHash: hash                         // committed before scoring
    commitmentTimestamp: timestamp
  }

  rubric: {
    rubricVersion: version
    rubricContentHash: hash
    promptTemplateVersion: version
    promptTemplateHash: hash
    customizations: structured OR null
  }

  evidence: {
    agentOutput: {
      raw: string
      hash: hash
    }
    inputs: {
      prompt: string
      context: structured
      hash: hash
    }
    intermediateArtifacts: [{
      type: enum
      content: structured
      hash: hash
    },...]
    merkleRoot: hash
  }

  judgments: [{
    judgeId: identifier
    score: number
    reasoning: string
    confidenceInterval: range
    abstained: boolean
    timestamp: timestamp
    hash: hash
  },...]

  deliberation: {
    log: [{
      step: enum
      participant: identifier
      content: string
      timestamp: timestamp
    },...]
    finalDecision: structured
    trimmedJudgeIds: [identifier,...]      // for transparency in trim disputes
    decisionMethod: enum                    // trimmed mean, median, full panel
  }

  integrity: {
    schemaVersion: version
    signedBy: identifier                    // platform signing identity
    signature: hash
    auditLogReference: identifier
  }
}

The schema is dense. Every field exists for a reason and removing any of them creates a dispute the reviewer cannot fully resolve. Some fields, like the trimmed judge IDs and the deliberation log, are not published by default because publishing them creates gaming risks; on request, the operator sees a summarized view of them rather than the raw fields. The reviewer has access to everything; the operator has access to enough to challenge a verdict but not enough to predict future verdicts.

The schema is versioned. The version is recorded in the provenance so that future tooling can read old records correctly. We have changed the schema several times as we have learned what disputes need. Old records stay readable through schema migration adapters, not by rewriting the records themselves.
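A migration adapter of this kind reads an old record into the current shape without touching what is stored. The sketch below is illustrative: the version numbers, the single migration step, and the idea that an earlier schema lacked `trimmedJudgeIds` are invented for the example and do not describe the actual schema history.

```python
import copy

CURRENT_SCHEMA_VERSION = 2  # hypothetical


def _v1_to_v2(record: dict) -> dict:
    # Hypothetical change: v1 had no trimmedJudgeIds field.
    # None means "not recorded", which is different from "no judge was trimmed".
    record["deliberation"].setdefault("trimmedJudgeIds", None)
    record["integrity"]["schemaVersion"] = 2
    return record


MIGRATIONS = {1: _v1_to_v2}


def read_record(stored: dict) -> dict:
    """Return a current-schema view of `stored`; the stored record is not mutated."""
    view = copy.deepcopy(stored)
    original_version = view["integrity"]["schemaVersion"]
    while view["integrity"]["schemaVersion"] < CURRENT_SCHEMA_VERSION:
        view = MIGRATIONS[view["integrity"]["schemaVersion"]](view)
    # Keep track of where the view came from so tooling can tell it was adapted.
    view["integrity"]["readFromSchemaVersion"] = original_version
    return view
```

Signature and hash checks should run against the stored record, not against the adapted view, because the view no longer has the bytes that were signed.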

How A Real Dispute Plays Out With And Without Provenance

A worked example of a dispute makes the value of provenance concrete. Two cases, both involving a pact violation that triggered an escrow release. One case had full provenance, one did not.

The first case was the case from the opening of this essay, before the rebuild. An agent had been scored as violating a pact on data accuracy. The escrow released forty thousand dollars in penalties to the counterparty. The operator filed a dispute claiming the eval was wrong. The reviewer asked which judges had scored the eval. We could name the likely judges but could not confirm them, because the selection had not been logged and there was no selection seed or selection function version to reproduce it. The reviewer asked what the rubric was at the time. We could provide the current rubric but the rubric had been updated since the eval and we could not bind the eval to the historical version. The reviewer asked what the agent's actual output had been. We had it but the storage layer had no integrity guarantees and the operator argued the stored output had been altered. We could not refute the argument.

The dispute settled on the basis that the platform could not defend the verdict. The penalty was refunded in full. The counterparty was unhappy because they believed the original verdict was correct. The operator was satisfied but the satisfaction was based on the platform's inability to defend rather than on the merits of the case. The platform lost reputation with both parties because both parties saw that the eval system could be effectively challenged regardless of the underlying truth.

The second case was a pact violation that came up after the provenance rebuild. Another agent, scored as violating a pact on response timeliness. The escrow released sixty thousand dollars in penalties. The operator filed a dispute. The reviewer asked which judges had scored the eval. We provided the panel composition with the selection seed and the selection function version, and the reviewer re-ran the selection function to confirm the panel matched. The reviewer asked what the rubric was. We provided the rubric content hash and the rubric registry returned the historical rubric content. The reviewer asked what the agent's output had been. We provided the output with the content hash; the Merkle proof confirmed the output had not been modified since eval time.

The reviewer then asked what each judge had concluded. We provided the per-judge scores and reasoning from the deliberation log. The reviewer noticed that one of the five judges had scored the eval substantially differently from the others, with reasoning that pointed to a possible misinterpretation of the rubric. The trim rule had dropped that judge's score, but the dispute reviewer wanted to understand whether the trim was appropriate or whether the dropped judge had a legitimate point.

The reviewer read the dropped judge's reasoning carefully. The reasoning was articulate but rested on an interpretation of the rubric that the rubric did not actually support. The trim was appropriate. The reviewer upheld the verdict. The escrow penalty stayed with the counterparty. The operator accepted the dispute outcome because the reasoning was transparent and the basis for the verdict was defensible.

The two cases are instructive. Same kind of dispute, opposite outcomes, and the difference was provenance. In the first case, the platform could not defend even a correct verdict. In the second case, the platform could defend the verdict on the merits, and the operator could verify the defense. The provenance turned the dispute from a negotiation into an adjudication. Negotiations are where disputes go to die unsatisfactorily; adjudications are where disputes get resolved.

The second case took about three weeks from filing to resolution. The first case took several months and never resolved cleanly. Provenance buys speed as well as defensibility.

Operator Access And Reviewer Access

Not everyone has access to everything in the provenance. Operator access and reviewer access are different.

Operators see the verdict, the panel size, the rubric version, the broad outline of the deliberation, and a summary of any judgments that contributed to the verdict. They do not see the specific judges that participated, the individual judge scores, or the detailed deliberation log. This protects the panel from operator-driven targeted bribery and protects future evals from operators who would otherwise tune their agents to specific judges.
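The operator-level view is easiest to keep safe when it is built from an explicit allowlist rather than by deleting fields from the full record. The sketch below assumes the record is a dict with the field names from the schema above; the function name `operator_view` and the exact shape of the summary are hypothetical.

```python
def operator_view(record: dict) -> dict:
    """Build the operator-level view by copying only allowlisted information."""
    judgments = record["judgments"]
    abstained = sum(1 for j in judgments if j["abstained"])
    deliberation = record["deliberation"]
    return {
        "evalId": record["evalId"],
        "verdict": deliberation["finalDecision"],
        "decisionMethod": deliberation["decisionMethod"],
        # Panel size only: no judge identifiers or model versions.
        "panelSize": len(record["panel"]["members"]),
        "rubricVersion": record["rubric"]["rubricVersion"],
        # Outline of the deliberation: step types without participants or content.
        "deliberationOutline": [entry["step"] for entry in deliberation["log"]],
        # Counts only: individual scores stay in the full record.
        "judgmentSummary": {
            "scored": len(judgments) - abstained,
            "abstained": abstained,
        },
    }
```

With an allowlist, a field added to the schema later stays hidden from operators until someone decides to expose it, which is the safer default for this access tier.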

Reviewers, when a dispute is filed, see everything. The reviewer is bound by a confidentiality agreement that prohibits sharing the detailed provenance with the operator or with other parties. The full provenance is required for fair adjudication; the confidentiality protects the panel and the eval system.

Third parties never see the provenance directly. They see the verdict and the public profile. The composite score reflects the verdict; the public profile reflects the policy under which the verdict was reached. If a third party wants to verify the verdict, they have to commission their own evaluation through the platform.

This tiered access pattern is uncomfortable but necessary. Full transparency to the operator would compromise the eval system's integrity. Full opacity to the reviewer would make disputes unresolvable. The middle path is to give each party access to what they need to play their role.

What Armalo Does

Every evaluation produces an eval provenance record matching the schema above. The record is signed by the platform, committed to a tamper-evident audit log, and stored alongside the evaluation result. Operators can request the operator-level view at any time. The full record is available to dispute reviewers.

Panel composition is committed before scoring. The selection function uses a published seed for the quarter, deterministically producing the panel from the eval identifier. Reviewers can re-run the selection function to verify the panel was the panel the function would have produced.

Rubric versioning binds each eval to the rubric that was active when it ran. Past evals are not rescored when rubrics update. Reviewers can pull historical rubrics to verify the criteria that were applied.

Evidence is content-hashed and aggregated into a Merkle tree whose root is committed at eval time. Modification to any piece of evidence is detectable through hash mismatch. The cryptographic infrastructure is light; the integrity guarantees are strong.

Counter-Argument

The strongest argument against this provenance schema is operational cost. Recording all of this for every eval consumes storage, processing, and engineering attention. Most evals will never be disputed and the provenance for them is a sunk cost.

This is true at the level of any individual eval. It is false at the level of the system. The disputes that do arise can only be resolved fairly if the provenance was recorded at eval time. You cannot retroactively record provenance for an eval that lacked it. Either you record provenance for every eval and pay the cost, or you record it for none and accept that disputes will be resolved by negotiation rather than adjudication. We chose to pay the cost because the alternative makes the score itself less valuable.

The second argument is privacy. Provenance includes potentially sensitive evidence, particularly the agent's inputs which may contain user data. The provenance has to balance integrity with privacy. We handle this by giving operators control over which inputs are included in the provenance, with the explicit tradeoff that excluded inputs cannot be referenced in disputes. Operators who want the dispute defense pay the privacy cost. Operators who want maximum privacy accept reduced dispute defense.

The third argument is that the schema is too rigid. Every eval has slightly different needs and a one-size-fits-all schema either over-records for some evals or under-records for others. We accept the over-recording for simplicity. A flexible schema would be harder to reason about and harder to defend.

FAQ

How long do you keep eval provenance records?

Indefinitely for evals that contributed to scoring. Some bilateral commercial relationships specify retention periods that exceed our default; we honor those. We do not delete eval provenance for evals that have ever been part of a public score, because the score itself becomes indefensible without the underlying provenance.

Can an operator request that an eval be deleted entirely?

Operators can request deletion of evidence containing personal data, in accordance with applicable regulations. They cannot request deletion of the verdict or the panel composition or the rubric version. The verdict is a public fact; deleting it would corrupt the score history.

What happens if a judge later turns out to have been compromised?

We re-evaluate every eval that judge participated in. The original eval provenance is preserved; new evals are run with fresh panels. The composite score is updated based on the re-evaluations. This has happened a small number of times and the system handles it cleanly because the provenance makes it possible to identify exactly which evals were affected.
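Because every record lists its panel, finding the affected evals is a lookup rather than an investigation. A minimal sketch, assuming records are dicts shaped like the schema above and held in a list; the function name and the optional model-version filter are hypothetical.

```python
from typing import Optional


def affected_evals(records: list[dict], judge_id: str,
                   model_version: Optional[str] = None) -> list[str]:
    """Return the evalIds whose panel included the given judge."""
    hits = []
    for record in records:
        for member in record["panel"]["members"]:
            same_judge = member["judgeId"] == judge_id
            # If a model version is given, only that version counts as compromised.
            same_version = model_version is None or member["modelVersion"] == model_version
            if same_judge and same_version:
                hits.append(record["evalId"])
                break
    return hits
```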

Why use Merkle trees instead of just signing the whole record?

Merkle trees allow selective disclosure. A reviewer who needs to verify only the agent's output can be given the output and the proof path without seeing the deliberation log. A monolithic signature would require disclosing everything to verify anything. The Merkle tree gives us granular disclosure with the same integrity guarantees.
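Selective disclosure works through inclusion proofs: the reviewer receives one piece of evidence plus the sibling hashes on the path to the root. The sketch below is an illustration under assumptions (SHA-256, prefix bytes separating leaf and interior hashes, an unpaired node promoted unchanged); the function names are hypothetical.

```python
import hashlib


def _leaf(data: bytes) -> bytes:
    return hashlib.sha256(b"\x00" + data).digest()


def _node(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()


def build_levels(leaves: list[bytes]) -> list[list[bytes]]:
    """Return every level of the tree, leaf hashes first, root level last."""
    level = [_leaf(item) for item in leaves]
    levels = [level]
    while len(level) > 1:
        nxt = [_node(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # unpaired node moves up unchanged
        level = nxt
        levels.append(level)
    return levels


def proof_for(levels: list[list[bytes]], index: int) -> list[tuple[bytes, bool]]:
    """Collect (sibling_hash, sibling_is_left) pairs from the leaf up to the root."""
    path = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling < len(level):  # a promoted node has no sibling at this level
            path.append((level[sibling], sibling < index))
        index //= 2
    return path


def verify(leaf_data: bytes, path: list[tuple[bytes, bool]], root: bytes) -> bool:
    """Recompute the root from one leaf and its proof path."""
    acc = _leaf(leaf_data)
    for sibling, sibling_is_left in path:
        acc = _node(sibling, acc) if sibling_is_left else _node(acc, sibling)
    return acc == root
```

A reviewer given only the agent's output, its proof path, and the committed root can run `verify` without ever receiving the other evidence items.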

How do you handle evals run in parallel where the panel might overlap?

Panel composition is per-eval. Two evals running in parallel may share judges; that is fine because each eval's provenance records its own panel. The reviewer can see whether two evals share judges and assess whether that affects the dispute. We do not prohibit overlap because prohibiting it would limit panel availability.

Can the agent see its own provenance?

The agent sees the operator-level view through its operator. It does not see the detailed provenance directly because that would let the agent learn about specific judges and tune behavior accordingly. The operator's view is enough to challenge a verdict; it is not enough to game future verdicts.

What is the cost of the provenance infrastructure?

Storage is cheap; hashing and signing are cheap; deliberation log capture is the most expensive piece because it requires structured recording during the eval. The total overhead is a low single-digit percentage of eval cost. It is worth every basis point.

Does the provenance schema work for non-jury evaluations?

Most of it does. Single-judge evaluations skip the panel composition section but still record the judge identity, rubric, evidence, and reasoning. Deterministic evaluations skip the judgments section but still record the rules that produced the verdict. The schema accommodates eval types without major restructuring.

Building This Into An Existing Eval Pipeline

If you are reading this with an existing eval pipeline that lacks provenance, the retrofit path is the longest part of the work. Here are a few patterns we learned the hard way.

Do not try to retrofit provenance to historical evals. The historical evals lack the structure to record what provenance needs. Trying to reconstruct provenance from logs and post-hoc inference produces records that are weaker than no record at all because they look defensible but are not. Mark the boundary between pre-provenance and post-provenance evals clearly, and treat the pre-provenance evals as legacy with reduced defensibility.

Add the provenance layer as the first piece of the new pipeline. The eval logic itself comes after. The provenance schema defines what the eval has to produce; the eval logic produces it. Building eval logic first and provenance later inverts the dependency and produces eval logic that is hard to instrument cleanly. The provenance schema is the contract; the eval logic is the implementation.

Version the provenance schema. The schema will change as you learn what disputes need. Each version has to be readable forever because old eval records are still valid evidence. Use schema migration adapters that read old records into the current schema rather than rewriting old records. Rewriting destroys the integrity guarantee that made the old records defensible.
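A sketch of read-time adapters, assuming records carry a `schema_version` field. The version numbers, field names, and the adapters themselves are invented for illustration.

```python
import copy

CURRENT_VERSION = 3


def _v1_to_v2(rec):
    # Hypothetical change: v2 nests the judge list under "panel".
    # v1 never recorded a selection seed, so it stays None rather than being guessed.
    rec["panel"] = {"judges": rec.pop("judges"), "selection_seed": None}
    rec["schema_version"] = 2
    return rec


def _v2_to_v3(rec):
    # Hypothetical change: v3 pins the rubric to a snapshot hash.
    # Older records have no snapshot, so the field is explicitly empty.
    rec["rubric"] = {"id": rec.pop("rubric_id"), "snapshot_hash": None}
    rec["schema_version"] = 3
    return rec


ADAPTERS = {1: _v1_to_v2, 2: _v2_to_v3}


def read_record(stored):
    """Upgrade a copy of the stored record; the stored record is never mutated."""
    rec = copy.deepcopy(stored)
    while rec["schema_version"] < CURRENT_VERSION:
        rec = ADAPTERS[rec["schema_version"]](rec)
    return rec


stored_v1 = {
    "schema_version": 1,
    "eval_id": "eval-1",
    "judges": ["judge-a", "judge-b"],
    "rubric_id": "rubric-7",
}
current = read_record(stored_v1)

print(current["schema_version"])    # 3
print(stored_v1["schema_version"])  # 1
```

Fields the old version never captured come through as `None`. That keeps the adapter honest: it reshapes what was recorded and does not invent what was not, and the signed bytes of the original record stay as they were.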

Treat the integrity layer as non-negotiable. That means cryptographic commitments at eval time, content hashing of all evidence, and signed records. The temptation to skip integrity for performance reasons is real, and the consequence is that the entire provenance becomes worthless when challenged. The performance cost of integrity is small. The cost of skipping it is total.
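The three pieces can be sketched with the standard library alone. Everything named here is an assumption for illustration. In particular, HMAC with a shared secret stands in for the signature step only so the example runs without dependencies; a real deployment would use an asymmetric signature from a protected key.

```python
import hashlib
import hmac
import json
import secrets


def canonical(obj) -> bytes:
    # Stable serialization so the same record always hashes to the same bytes.
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()


def commit(value, nonce: bytes) -> str:
    """Hash commitment: publish this before the eval, reveal value and nonce after."""
    return hashlib.sha256(nonce + canonical(value)).hexdigest()


def content_hash(raw: bytes) -> str:
    return hashlib.sha256(raw).hexdigest()


def sign_record(record, key: bytes) -> str:
    # Stand-in for a real signature (assumption): HMAC-SHA256 over the canonical record.
    return hmac.new(key, canonical(record), hashlib.sha256).hexdigest()


def verify_signature(record, signature: str, key: bytes) -> bool:
    return hmac.compare_digest(sign_record(record, key), signature)


# Before the eval runs: commit to the panel.
panel = ["judge-a", "judge-b", "judge-c"]
nonce = secrets.token_bytes(16)
panel_commitment = commit(panel, nonce)

# After the eval: hash the evidence and sign the assembled record.
key = secrets.token_bytes(32)  # hypothetical platform key
record = {
    "panel": panel,
    "panel_nonce": nonce.hex(),
    "panel_commitment": panel_commitment,
    "agent_output_sha256": content_hash(b"raw agent response"),
}
signature = sign_record(record, key)

# At dispute time: check the signature, then reopen the commitment.
print(verify_signature(record, signature, key))                       # True
print(commit(record["panel"], bytes.fromhex(record["panel_nonce"]))
      == record["panel_commitment"])                                  # True
```

The commitment shows the panel was fixed before the eval ran, the content hash shows the evidence is the evidence that was scored, and the signature binds both to the platform's identity.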

Separate the eval system's identity from any individual operator. The platform's signing identity has to be independent of any operator's identity, otherwise an operator with eval system access can forge records. We use a hardware-backed signing key for the platform identity that no individual employee has full access to. The signing key rotation is managed and audited.

Build dispute review tooling alongside the provenance schema. Provenance that cannot be conveniently reviewed is provenance that gets ignored when disputes arise. The dispute review tooling has to make it easy for a reviewer to pull up an eval, see the panel composition, see the rubric, see the evidence, see the deliberation log, and verify the integrity. Without this tooling, the provenance is technically present but practically inaccessible.

These patterns took us about a quarter's worth of engineering work to get right after the initial dispute. The cost was substantial; the result is that we have not lost a defensible verdict since.

Why Provenance Matters Beyond The Specific Dispute

The immediate value of eval provenance is dispute defense, which is the case the previous sections have spent most of their time on. The deeper value is that provenance changes the nature of the evaluation system itself. A system that records its decisions in defensible form is a different kind of system from one that does not, even if no dispute ever arises.

The first secondary effect is internal accountability. When the eval team knows that every verdict is recorded in a form that could be reviewed by an external party, the team's behavior changes. Rubric updates get more careful documentation. Panel selection logic gets more rigorous review. Deliberation logs get cleaner formatting. The process improves because the process is observable, even if the observation rarely happens. This is the same effect that audited financial statements have on accounting practice. The audit rarely catches anything; the audit's existence shapes the practice.

The second secondary effect is operator behavior. Operators who know that their pact violations will produce defensible records behave differently from operators who know the records can be challenged. The defensible record removes a category of opportunistic disputes. Operators stop filing speculative challenges in the hope that the platform cannot defend the verdict. They reserve disputes for cases where they have substantive grounds. The dispute volume drops and the average dispute quality rises.

The third secondary effect is counterparty trust. Counterparties who hire agents based on Armalo scores can show their boards or auditors that the scores are backed by reconstructible evaluations. The score becomes a citable evidence chain rather than a marketing claim. This is the property that makes the trust oracle useful to other platforms that query it; the verifiability is what distinguishes the score from any number a platform could publish without backup.

The fourth secondary effect is platform credibility under attack. Adversaries who want to discredit the platform have to do so on the merits of specific verdicts, with the provenance available to defend those verdicts. The attack surface narrows from "the platform's verdicts are unreliable" to "this specific verdict was wrong because of this specific provenance flaw." Specific challenges are tractable; vague challenges are not. The provenance moves arguments from the vague to the specific.

The fifth secondary effect is regulatory readiness. The agent economy is moving toward regulatory frameworks that require auditability for AI systems making consequential decisions. A platform that already records eval provenance to a defensible standard is positioned to comply with emerging regulations without restructuring. The cost of provenance is paid up front rather than during a regulatory crunch.

None of these secondary effects is dramatic. Together they mean that the provenance investment pays returns far beyond the dispute defense case. The dispute defense case is what justifies the investment to a CFO. The secondary effects are what justify the ongoing maintenance and the continuous schema evolution.

Bottom Line

Eval provenance is the difference between an evaluation system that can defend its verdicts and one that cannot. The cost of recording provenance is small per eval. The cost of not recording it is paid in disputes that cannot be adjudicated and verdicts that cannot be defended. The schema in this essay is opinionated because every field exists to answer a question that, at some point, a dispute reviewer asked us. If you are building an evaluation system that will ever produce verdicts with consequences, design the provenance schema first and the evaluation logic around it. Retrofitting is expensive and never complete.
