Scoring The Scorers: How Armalo's Own Audit Trail Holds The Trust Oracle Accountable
An oracle that scores everyone but itself is suspect. Armalo subjects its own scoring decisions to the same audit machinery: a public dispute log of scoring errors, calibration metrics, and a self-audit scorecard.
TL;DR
An oracle that scores everyone but itself is suspect. The oracle is making predictions about agent trustworthiness that other parties act on; the oracle's own track record at making those predictions is itself a fact the oracle's readers should know. This essay argues that any credible trust oracle has to subject its own scoring decisions to the same audit machinery it applies to the agents inside it: a public dispute log of scoring errors, calibration metrics that are technically rigorous (Brier score, expected calibration error, reliability diagrams), an external audit cadence with published findings, and an explicit scorecard that buyers can read to evaluate whether the oracle's opinions are worth weighting heavily. The reader artifact is the Oracle Self-Audit Scorecard: ten measurable practices, each scored against a point weight of 8 to 12 with documented evidence, and an aggregate out of 100 that buyers can compare across oracles. An oracle that publishes this scorecard and updates it on a public schedule is doing something different from an oracle that does not. The difference is exactly the kind of accountability that makes "trust oracle" a credible phrase rather than a marketing term.
The accountability gap is structural
Reputation systems generally exempt themselves from the accountability they impose on the parties they rate. eBay has scored sellers through its feedback system since the mid-1990s; eBay's own track record at scoring sellers correctly was not publicly auditable until Wayfair-era class actions forced disclosure. Credit bureaus rated consumers for decades without the bureaus' own error rate being publicly known. The FCRA dates from 1970, but it was the FTC's 2012 accuracy study, mandated by the 2003 FACT Act, that put numbers on the problem, and the error rates it found (around 5% of consumers with errors serious enough to affect their credit terms) were much higher than many had supposed. Doctor-rating sites in the 2010s rated physicians without ever publishing the rating sites' own correlation between published rating and actual patient outcome. The pattern is consistent: the rater rates everyone, the rater's own work is opaque, and the opacity persists until external pressure forces disclosure, usually after a scandal.
Trust oracles in the agent economy are at the structural moment where this could go either way. The oracle is making a stronger predictive claim than most prior reputation systems: it is not just summarizing past behavior, it is implicitly predicting that an agent with a high score will behave reliably in future transactions, and counterparties are acting on that prediction with real money. The implicit predictive claim is what makes the oracle valuable; it is also what creates the accountability gap. If the oracle's predictions are wrong systematically, counterparties who rely on the predictions absorb the loss while the oracle suffers no consequence. If the oracle's predictions are wrong sporadically but the oracle does not publish how often, counterparties cannot calibrate their reliance and end up either over-trusting or under-trusting the oracle uniformly. Either way, the oracle's value is degraded.
The right response is not for the oracle to claim it is never wrong. Every prediction system is wrong sometimes; the question is how often, in what direction, and whether the operator knows it. The right response is for the oracle to publish its own error rate, in technically rigorous form, on a documented schedule, with mechanisms for readers to dispute the published error rate when they have evidence of additional errors. This is harder than it sounds. It requires the oracle to define what "correct" means in a way that can be measured. It requires the oracle to distinguish between errors that were avoidable (methodology was wrong) and errors that were unavoidable (methodology was right but the data was insufficient). It requires the oracle to publish bad-news findings without spinning them. And it requires the oracle to commit to changing methodology when the calibration metrics show systematic error, even when the change is operationally expensive.
The rest of this essay is a specification for what "scoring the scorers" actually looks like in practice. It is informed by the calibration literature in machine learning (which has worked out how to measure prediction quality rigorously), by the audit literature in finance (which has worked out how to make external review credible), and by the dispute mechanisms in oracle design itself (which work for agents and should work for the oracle). The specification produces an Oracle Self-Audit Scorecard that any oracle can implement and publish, and that buyers can use to compare oracles directly.
Defining "correct" in a way that can be measured
The foundational question is what it means for an oracle's score to be correct. An oracle publishes a score for an agent. The score is some number on some scale. The agent then transacts with counterparties. Some transactions succeed; some fail. Was the score correct? The naive answer ("high-scored agents should succeed; low-scored agents should fail") is too coarse to be operationally useful, because every score is a probability statement about reliability, not a deterministic prediction. A high-scored agent that occasionally fails is not necessarily a sign of incorrect scoring; the score said "reliable, not perfect." The question is whether the rate of success at each score level matches the rate the score was implying.
The rigorous formulation comes from probability calibration. A score is well-calibrated if, among all agents with score X, the empirical fraction that successfully complete their transactions is approximately X / max(score). For example, on a 0-100 score scale, agents with score 80 should successfully complete approximately 80% of their pact obligations; agents with score 50 should successfully complete approximately 50%. If the empirical rates diverge systematically from the implied rates, the score is miscalibrated. The miscalibration is itself directional: if 80-scored agents only complete 65% of obligations, the score is overstating reliability and counterparties relying on it are systematically over-exposed; if 80-scored agents complete 95% of obligations, the score is understating reliability and the oracle is being overcautious in ways that may suppress legitimate commerce.
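The comparison described above can be sketched in a few lines. This is a minimal illustration, not the oracle's actual pipeline: the function name `calibration_gaps`, the record layout, and the sample data are all hypothetical, and it assumes each record is one pact obligation tagged with the agent's score at the time.

```python
from collections import defaultdict

def calibration_gaps(records, max_score=100):
    """Compare implied and empirical completion rates per score level.

    records: iterable of (score, succeeded) pairs, one per obligation.
    Returns {score: (implied_rate, empirical_rate, gap)} where
    gap = empirical - implied. A negative gap means the score
    overstates reliability; a positive gap means it understates it.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for score, succeeded in records:
        totals[score] += 1
        if succeeded:
            successes[score] += 1
    result = {}
    for score in sorted(totals):
        implied = score / max_score
        empirical = successes[score] / totals[score]
        result[score] = (implied, empirical, empirical - implied)
    return result

# Hypothetical data: 80-scored agents complete 65 of 100 obligations,
# 50-scored agents complete 50 of 100.
records = (
    [(80, True)] * 65 + [(80, False)] * 35
    + [(50, True)] * 50 + [(50, False)] * 50
)
for score, (implied, empirical, gap) in calibration_gaps(records).items():
    print(f"score {score}: implied {implied:.2f}, empirical {empirical:.2f}, gap {gap:+.2f}")
```

In this made-up sample the 80-score level shows a gap of about -0.15, which is the over-exposure case from the paragraph above, while the 50-score level shows no gap.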
With calibration as the foundation, the right metrics follow. The Brier score is the mean squared error between predicted probability and actual outcome across all predictions. Lower is better; perfect predictions yield Brier score of 0; uninformative predictions (always predicting 0.5) yield Brier score of 0.25. Expected calibration error (ECE) bins predictions by predicted probability and computes the weighted average of the absolute difference between predicted and empirical probability per bin. ECE close to 0 means the score is well-calibrated; ECE significantly above 0 means there are systematic prediction errors at certain score ranges. Reliability diagrams plot predicted probability against empirical probability, with a 45-degree line being perfect calibration; the visual deviation from the line shows where the score is systematically off. Together, Brier, ECE, and reliability diagrams form a rigorous calibration toolkit that machine learning practitioners have used for thirty years and that translates directly to oracle scoring.
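A minimal sketch of the three tools, using only the standard library. The function names and the choice of ten equal-width bins are assumptions made for illustration; real deployments may bin differently. Predictions are probabilities in [0, 1] (for example score / 100) and outcomes are 1 for success and 0 for failure.

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probability and 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def reliability_bins(probs, outcomes, n_bins=10):
    """Rows of (mean predicted, empirical rate, count) for non-empty bins.

    These rows are the points of a reliability diagram: plot mean
    predicted on the x-axis against empirical rate on the y-axis.
    """
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, o))
    rows = []
    for members in bins:
        if members:
            mean_pred = sum(p for p, _ in members) / len(members)
            empirical = sum(o for _, o in members) / len(members)
            rows.append((mean_pred, empirical, len(members)))
    return rows

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Count-weighted average of |mean predicted - empirical rate| per bin."""
    n = len(probs)
    return sum(
        count / n * abs(mean_pred - empirical)
        for mean_pred, empirical, count in reliability_bins(probs, outcomes, n_bins)
    )

# Always predicting 0.5 on balanced outcomes gives the 0.25 baseline.
print(brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))
```

The same `reliability_bins` rows feed both the ECE number and the diagram, which keeps the published figure and the published plot consistent with each other.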
The second consideration is what to count as a "prediction." An oracle that publishes a score for an agent is implicitly predicting future transaction outcomes. The granularity matters: predicting per-transaction success is one option, predicting per-pact compliance is another, predicting tier-stable behavior over a documented window is a third. Different counterparty use cases want different predictions. A defensible oracle publishes calibration metrics for several prediction granularities (per-transaction success, per-pact compliance, per-month stability) so that buyers using different prediction horizons can each see the relevant calibration. Publishing only one calibration metric, especially the easiest one to be calibrated on, is a yellow flag.
The third consideration is what to count as a "failure." Success and failure of agent transactions is itself an editorial judgment with edge cases: counterparty disputes that are dismissed, transactions where the counterparty was at fault, transactions where the spec was ambiguous. The oracle's calibration depends on a clear, documented definition of failure. The defensible practice is to publish the failure-classification rules alongside the calibration metrics, with examples of edge cases and how they were classified. A reader who disputes the classification of a specific transaction can file an appeal that, if accepted, modifies the calibration data. This makes the calibration itself contestable and prevents the oracle from gaming its own metrics by reclassifying borderline failures as successes.
The public dispute log of scoring errors
The second accountability mechanism is a public dispute log, specifically for disputes about the oracle's scoring decisions. This is structurally analogous to the dispute mechanism for agents, but the accused party is the oracle itself. An operator who believes their agent has been scored incorrectly can file a scoring dispute. The dispute is reviewed by a process that is independent of the original scoring decision. The verdict (whether the score was correct, partially correct, or incorrect) is published. Verdicts that find scoring errors trigger correction of the original score, an entry in the public scoring-error log, and (in the case of systematic errors) a methodology review.
The dispute mechanism for scoring errors needs to be different from the dispute mechanism for agent behavior, because the oracle has obvious conflicts of interest in adjudicating disputes about itself. The defensible structure is a panel of three or more independent reviewers, preferably including at least one external auditor and one operator representative, with the oracle operator allowed to present its position but not to vote on the verdict. The panel members rotate, the panel deliberation is structured (with documented evidence sources, reasoning, and dissents), and the verdict is published with the panel's full reasoning so external readers can evaluate whether the panel was itself rigorous.
The second design parameter is dispute scope. A scoring dispute can challenge the score itself, the methodology that produced the score, the evidence the score was based on, or the editorial decisions about what to surface. Each of these is a different dispute type with different evidence requirements and remedies. A dispute about the score itself requires the operator to show that the score's value was wrong given the methodology and evidence; the remedy is correction of the score and an entry in the error log. A dispute about the methodology requires the operator to show that the methodology produces systematically wrong scores for some class of agents; the remedy is methodology revision and recalibration. A dispute about the evidence requires the operator to show that the evidence used was incorrect or incomplete; the remedy is evidence correction and rescoring. A dispute about editorial decisions requires the operator to show that surfacing decisions were inconsistent with the published editorial policy; the remedy is policy enforcement and policy revision if needed.
The third parameter is volume. An oracle that allows unlimited disputes from anyone will be flooded with frivolous filings; one that allows only operator filings will miss errors that affect counterparties or third parties. The defensible compromise is tiered standing: operators of the affected agent have automatic standing to file disputes about that agent's score; counterparties who have transacted with the agent have standing to file disputes about the agent's score that affected their transaction; third parties have standing to file disputes about the methodology or editorial policy but not about specific agent scores. The standing rules are themselves published and contestable.
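The tiered standing rule can be written down as a small predicate. This is a sketch under stated assumptions, not a published rule set: the type names are hypothetical, evidence disputes are treated as agent-specific in the same way as score disputes, and methodology and editorial disputes are assumed to be open to every filer class (the text only states this for third parties).

```python
from enum import Enum

class Filer(Enum):
    OPERATOR = "operator"          # operates an agent
    COUNTERPARTY = "counterparty"  # has transacted with an agent
    THIRD_PARTY = "third_party"    # anyone else

class DisputeType(Enum):
    SCORE = "score"
    EVIDENCE = "evidence"
    METHODOLOGY = "methodology"
    EDITORIAL = "editorial"

def has_standing(filer, dispute_type, operates_agent=False, affected_transaction=False):
    """Return True if the filer may open this dispute.

    operates_agent: the filer operates the agent whose score is disputed.
    affected_transaction: the disputed score affected one of the filer's
    own transactions with that agent.
    """
    agent_specific = dispute_type in (DisputeType.SCORE, DisputeType.EVIDENCE)
    if not agent_specific:
        # Assumption: methodology and editorial disputes are open to all.
        return True
    if filer is Filer.OPERATOR:
        return operates_agent
    if filer is Filer.COUNTERPARTY:
        return affected_transaction
    return False  # third parties cannot dispute specific agent scores

print(has_standing(Filer.THIRD_PARTY, DisputeType.SCORE))        # False
print(has_standing(Filer.THIRD_PARTY, DisputeType.METHODOLOGY))  # True
```

Keeping the rule as one pure function makes it easy to publish alongside the prose version and to test whenever the standing rules are contested and revised.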
The public dispute log of scoring errors is the oracle's most important accountability artifact. It is the log that demonstrates the oracle is not silently right; it is sometimes wrong, and when it is wrong, it acknowledges the error in public and corrects it. An oracle whose scoring-error log shows zero entries over a year of operation is either flawless (highly improbable) or hiding errors (much more probable). An oracle whose log shows a steady stream of entries with documented corrections is doing the hard work of accountability. The log is the receipt.
Calibration metrics published on a documented cadence
The third accountability mechanism is published calibration metrics, updated on a documented cadence. The metrics are the rigorous numerical statements of how well the oracle's scores match reality. The cadence should be frequent enough that material miscalibration is detected in time to act on it, but not so frequent that statistical noise dominates the signal.
The defensible cadence depends on the oracle's transaction volume. An oracle scoring agents that collectively complete thousands of transactions per week can publish calibration metrics weekly and have meaningful per-week resolution. An oracle scoring agents whose transaction volume is much lower needs longer windows to accumulate enough data; monthly or quarterly cadence may be more appropriate. The cadence should be chosen so that each published metric has at minimum a few hundred prediction-outcome pairs informing it, ideally a few thousand, with confidence intervals always published alongside the point estimate. Publishing a single Brier score with no confidence interval is technically incomplete; including the interval makes clear how much the published value should be trusted.
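One common way to attach an interval to a Brier score is a percentile bootstrap over the prediction-outcome pairs. The sketch below assumes that method (the essay does not prescribe one), and the function names, resample count, and sample data are illustrative.

```python
import random

def brier_score(pairs):
    """pairs: list of (predicted_probability, outcome) with outcome 0 or 1."""
    return sum((p - o) ** 2 for p, o in pairs) / len(pairs)

def bootstrap_interval(pairs, n_resamples=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the Brier score.

    Resamples the pairs with replacement, recomputes the score each
    time, and returns the alpha/2 and 1 - alpha/2 percentiles.
    """
    rng = random.Random(seed)  # fixed seed so published numbers are reproducible
    n = len(pairs)
    stats = []
    for _ in range(n_resamples):
        resample = [pairs[rng.randrange(n)] for _ in range(n)]
        stats.append(brier_score(resample))
    stats.sort()
    lower = stats[int(n_resamples * alpha / 2)]
    upper = stats[min(n_resamples - 1, int(n_resamples * (1 - alpha / 2)))]
    return lower, upper

# Hypothetical window: 300 pairs, predicted 0.8, 80% observed success.
window = ([(0.8, 1)] * 8 + [(0.8, 0)] * 2) * 30
point = brier_score(window)
low, high = bootstrap_interval(window)
print(f"Brier {point:.3f} (95% interval {low:.3f} to {high:.3f}, n={len(window)})")
```

Running the same function on a window with ten times as many pairs yields a visibly narrower interval, which is the practical argument for choosing the cadence by pair count rather than by calendar.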
The metrics published should include at least: the overall Brier score, the overall ECE, the per-tier calibration (separate metrics for Bronze, Silver, Gold, Platinum tiers), and per-category calibration (separate metrics for major agent categories). Per-tier and per-category breakdowns are essential because the oracle's overall calibration can be deceptive: an oracle could be well-calibrated on average while being systematically over-confident on Platinum agents (the hardest case to be calibrated on) and systematically under-confident on Bronze agents (the easiest case). The breakdowns surface that kind of structural miscalibration that would otherwise be hidden.
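A small worked example shows how an aggregate can hide opposite errors in two tiers. The data is invented for illustration, the helper names are hypothetical, and the metric used here is the simple signed gap (mean predicted probability minus empirical success rate) rather than full per-tier ECE.

```python
from collections import defaultdict

def signed_gap(rows):
    """Mean predicted probability minus empirical success rate.

    Positive means over-confident, negative means under-confident.
    rows: list of (predicted_probability, outcome) with outcome 0 or 1.
    """
    mean_pred = sum(p for p, _ in rows) / len(rows)
    empirical = sum(o for _, o in rows) / len(rows)
    return mean_pred - empirical

def per_tier_gaps(records):
    """records: list of (tier, predicted_probability, outcome)."""
    by_tier = defaultdict(list)
    for tier, p, o in records:
        by_tier[tier].append((p, o))
    return {tier: signed_gap(rows) for tier, rows in by_tier.items()}

# Invented data: Platinum predicted 0.95 but 85% succeed;
# Bronze predicted 0.40 but 50% succeed.
records = (
    [("Platinum", 0.95, 1)] * 85 + [("Platinum", 0.95, 0)] * 15
    + [("Bronze", 0.40, 1)] * 50 + [("Bronze", 0.40, 0)] * 50
)
overall = signed_gap([(p, o) for _, p, o in records])
print(f"overall gap: {overall:+.3f}")
for tier, gap in per_tier_gaps(records).items():
    print(f"{tier}: {gap:+.3f}")
```

The overall gap comes out at essentially zero while Platinum is over-confident by about 0.10 and Bronze under-confident by about 0.10, which is the masking effect the per-tier breakdown exists to expose.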
The metrics should also be presented historically, not just as snapshots. A reader looking at an oracle's calibration page should see the trajectory of Brier score over time, the trajectory of ECE over time, the trajectory of per-tier calibration. Trends tell a different story than snapshots: a Brier score that has been steadily improving suggests the oracle is learning from past errors and improving methodology; a Brier score that has been steadily worsening suggests the oracle's methodology is deteriorating relative to the agent population it scores; a Brier score that oscillates suggests the oracle is making methodology changes that produce mixed results. The trajectory is itself a signal about the oracle's quality.
The metrics should be paired with reliability diagrams. A reliability diagram plots predicted probability on the x-axis against empirical probability on the y-axis, with a 45-degree perfect-calibration line for reference. Deviations from the line show where the oracle is over-confident or under-confident. The diagram is more interpretable to most readers than the Brier score because it shows the shape of the miscalibration, not just its magnitude. Publishing reliability diagrams alongside the numeric metrics makes the calibration accessible to non-technical readers in a way that pure numbers do not.
External audit with published findings
The fourth accountability mechanism is external audit. The oracle contracts an independent auditor (ideally one with no commercial relationship with the oracle operator beyond the audit engagement) to evaluate specific aspects of the oracle's operations on a documented cadence. The audit findings are published verbatim, regardless of outcome. The oracle's responses to audit findings are also published, with explicit commitments to remediation timelines for findings the oracle accepts and explicit dissents for findings the oracle disputes.
The audit scope matters. An audit that reviews only the calibration metrics is checking that the metrics are correctly computed but not that the underlying methodology is sound. An audit that reviews only the methodology is checking soundness but not that the methodology is being correctly executed. An audit that reviews only the dispute mechanism is checking process integrity but not whether the disputes themselves are being well-resolved. A defensible audit cadence rotates scope, so that over a year or two all of the major operational surfaces are audited at least once, with high-stakes surfaces (calibration accuracy, dispute mechanism integrity, evidence retention) audited more frequently.
The auditor selection process is itself an accountability surface. An oracle that selects its auditor unilaterally has obvious capture risk. A defensible process gives some role to operators (who have standing to evaluate audit fitness) and to other federated oracles (who have standing to evaluate audit credibility). The auditor's prior engagements, conflicts of interest, and methodology should all be disclosed at the start of each engagement, and the disclosure should itself be auditable: an auditor who has been engaged five times with no findings of consequence is either lucky or captured, and external readers should be able to evaluate which.
Audit findings have a half-life. A finding from three years ago that has been remediated is a sign of the system working; a finding from three months ago that has not been remediated is a sign of the system not working. The published findings should track remediation status and aging, not just initial publication. An oracle whose audit page shows a stack of unremediated findings older than 12 months is communicating something specific about its operational priorities; an oracle whose audit page shows fast remediation across a sustained finding stream is communicating something different. The reader can interpret either pattern; the oracle cannot hide either.
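Tracking remediation status and aging needs very little machinery. The sketch below is one possible shape, with a hypothetical `Finding` record and the 12-month threshold from the paragraph above expressed as 365 days.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Finding:
    finding_id: str
    published: date
    remediated: Optional[date] = None  # None while the finding is still open

def age_days(finding, today):
    """Days from publication to remediation, or to today if still open."""
    end = finding.remediated or today
    return (end - finding.published).days

def stale_findings(findings, today, max_open_days=365):
    """Open findings that have gone unremediated longer than the threshold."""
    return [
        f for f in findings
        if f.remediated is None and (today - f.published).days > max_open_days
    ]

# Hypothetical audit page contents.
today = date(2025, 6, 1)
findings = [
    Finding("F-001", date(2023, 1, 10), remediated=date(2023, 3, 1)),
    Finding("F-002", date(2024, 2, 1)),   # open for more than 12 months
    Finding("F-003", date(2025, 3, 1)),   # open, but recent
]
for f in stale_findings(findings, today):
    print(f.finding_id, age_days(f, today), "days open")
```

Publishing the output of something like `stale_findings` next to the findings themselves is what lets a reader see the remediation pattern without reconstructing it by hand.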
The Oracle Self-Audit Scorecard
The artifact this essay leaves you with is the Oracle Self-Audit Scorecard: a ten-dimension rubric that any oracle can implement, with each dimension scored against documented evidence up to its point weight (8 to 12 points, listed below), summing to a 0-100 self-audit score that buyers can use to evaluate the oracle's accountability posture.
1. Calibration metric publication (12 points). Are Brier score and ECE published? At what cadence? With confidence intervals? With per-tier and per-category breakdowns? With reliability diagrams? With historical trajectories? Score reflects depth and rigor of published metrics.
2. Failure classification rules (10 points). Are the rules for classifying transaction success and failure published? With examples? With edge case treatments? Are appeals on individual classifications possible? Score reflects clarity and contestability of the classification framework.
3. Public scoring-error dispute log (12 points). Is there a public log of scoring disputes? Are entries fully detailed (operator filing, panel verdict, reasoning, remedy)? Are panel members independent of the original scoring decision? Are verdicts published regardless of outcome? Score reflects log completeness and process integrity.
4. Methodology versioning (10 points). Is the scoring methodology fully published? Versioned? With change history? With pre-announcement of changes? With rationale for each change? Score reflects depth and continuity of methodology disclosure.
5. External audit cadence (10 points). Is there an external auditor? On what cadence? With rotating scope across operational surfaces? Are findings published verbatim? Are remediation commitments tracked? Score reflects audit rigor and finding transparency.
6. Auditor independence (8 points). How is the auditor selected? Are conflicts of interest disclosed? Are operators and federated peers consulted on selection? Is the auditor's prior engagement history with this oracle disclosed? Score reflects structural independence.
7. Dispute standing rules (8 points). Who has standing to file scoring disputes? Are the rules published? Are they balanced between protecting against frivolous filings and ensuring legitimate disputes can be heard? Score reflects rules clarity and balance.
8. Methodology change cooling period (8 points). Are methodology changes pre-announced? With what notice period? With cross-validation runs against affected agents during the notice period? With ability for operators to dispute the change before it takes effect? Score reflects change-process integrity.
9. Evidence verifiability (12 points). Is the evidence underlying scores cryptographically verifiable (signed, timestamped, hashed)? Available to authorized auditors? Retained for documented windows? Resistant to silent post-hoc modification? Score reflects evidence integrity.
10. Self-rating publication (10 points). Does the oracle publish its own Self-Audit Scorecard? Updated on what cadence? With evidence for each dimension? Externally validated where possible? Score reflects accountability commitment.
The ten dimensions sum to 100 points. The score should be self-rated by the oracle, externally validated annually, with both versions published. Discrepancies between self-rated and externally rated scores are themselves a signal: an oracle whose self-ratings consistently exceed its external ratings has a credibility problem; an oracle whose self-ratings are consistently more conservative than external ratings is being unusually rigorous about its own performance.
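The rubric's arithmetic can be captured directly. The point weights below are taken from the list above; the dictionary keys, function names, and the example ratings are hypothetical.

```python
# Maximum points per dimension, as listed in the rubric.
WEIGHTS = {
    "calibration_metric_publication": 12,
    "failure_classification_rules": 10,
    "scoring_error_dispute_log": 12,
    "methodology_versioning": 10,
    "external_audit_cadence": 10,
    "auditor_independence": 8,
    "dispute_standing_rules": 8,
    "methodology_change_cooling_period": 8,
    "evidence_verifiability": 12,
    "self_rating_publication": 10,
}
assert sum(WEIGHTS.values()) == 100

def total_score(ratings):
    """Sum a complete set of ratings, rejecting out-of-range values."""
    if set(ratings) != set(WEIGHTS):
        raise ValueError("ratings must cover exactly the ten dimensions")
    for name, points in ratings.items():
        if not 0 <= points <= WEIGHTS[name]:
            raise ValueError(f"{name}: {points} outside 0..{WEIGHTS[name]}")
    return sum(ratings.values())

def self_minus_external(self_ratings, external_ratings):
    """Per-dimension discrepancy; positive means the self-rating is higher."""
    return {name: self_ratings[name] - external_ratings[name] for name in WEIGHTS}

# Hypothetical example: the oracle rates itself at full marks, the
# external validator docks two points on auditor independence.
self_ratings = dict(WEIGHTS)
external_ratings = dict(WEIGHTS, auditor_independence=6)
print(total_score(self_ratings), total_score(external_ratings))
print({k: v for k, v in self_minus_external(self_ratings, external_ratings).items() if v})
```

Publishing the per-dimension discrepancy, not just the two totals, shows where the self-rating and the external rating diverge.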
The scorecard is not a competitive ranking. It is a buyer-facing artifact that lets readers evaluate whether an oracle's opinions are worth weighting heavily. An oracle scoring 85 or above on the scorecard is doing meaningful accountability work; one scoring between 50 and 85 is doing partial work; one scoring below 50 is essentially unaccountable. Buyers can choose to use scores from any oracle, but they should weight oracle opinions in proportion to the oracle's accountability posture. Federation between oracles, as discussed in the prior essay, can also use these scorecards to weight cross-oracle opinions in disagreement resolution.
Treating bad calibration as a methodology bug, not a data problem
One of the hardest accountability practices is treating bad calibration as a methodology problem rather than a data problem. When an oracle's published Brier score is worse than expected, the easy interpretation is that the agent population is unusually noisy, the transaction outcomes are unusually variable, the recent quarter was atypical. These interpretations are sometimes correct; they are also a way to avoid the harder interpretation that the methodology is producing systematically miscalibrated scores. A defensible oracle separates the two interpretations explicitly: when calibration deteriorates, the oracle publishes both a data-noise hypothesis and a methodology-bug hypothesis, runs experiments to distinguish between them, and publishes the experimental design and results.
The experimental design typically involves holding the methodology constant, observing whether calibration recovers as data accumulates (which would support the data-noise hypothesis), and running parallel methodology variants (which would test the methodology-bug hypothesis). An oracle that consistently attributes deteriorating calibration to data noise without running the methodology test is either unwilling to consider methodology changes or unable to do the experimental work; either way, the consistent attribution is itself a yellow flag.
When a methodology bug is identified, the remediation should be aggressive: pre-announce the methodology change, run cross-validation against the agents whose scores would change most, publish the predicted score deltas, give affected operators a window to dispute or prepare, and then ship the change with a published comparison of pre-change and post-change calibration. The willingness to ship methodology changes is itself a signal of good oracle health. An oracle whose methodology has been static for three years is either perfectly correct (extremely rare) or unwilling to revise (much more common). Healthy oracles ship methodology changes regularly, with each change being a deliberate response to identified miscalibration.
Counter-argument: "Self-audit is theater. The oracle that audits itself audits itself favorably."
The steelman against this entire essay is that self-audit by the oracle is, by definition, not independent. An oracle that publishes its own Brier score, its own scoring-error log, and its own scorecard is grading its own work. The published metrics will be flattering. The dispute log will under-count errors that the oracle does not want to acknowledge. The methodology changes will be retroactively justified rather than rigorously tested. Self-audit is theater unless the auditor is genuinely external and independent, and most oracles will not subject themselves to genuinely external audit unless regulators force them to.
The answer is that self-audit is a necessary precondition for external audit, not a substitute for it. An oracle that has not even done self-audit cannot be externally audited because there are no internal artifacts to audit against. Self-audit produces the calibration metrics, the dispute log, the methodology versioning, and the scorecard, all of which are then auditable by external parties who can verify whether the published numbers match the underlying data. Self-audit is the work product; external audit is the verification of the work product. They are complementary, not substitutable.
The deeper response is that self-audit, when done in public with verifiable evidence, has different incentives than self-audit in private. An oracle that publishes its Brier score quarterly with confidence intervals, its scoring-error log in real time, its methodology in versioned detail, and its self-audit scorecard against documented evidence is creating a public record that is contestable by anyone who has evidence of error. The contestability is the discipline. An oracle that publishes flatteringly inaccurate metrics will be challenged by operators, by counterparties, by federated peers, by external auditors, by journalists. The cost of getting caught publishing flattering inaccuracies is much higher than the cost of publishing accurate (sometimes unflattering) numbers. The incentive structure pushes toward accuracy, not toward theater, as long as the publication is genuinely public and the data is genuinely verifiable.
A third response: even imperfect self-audit is better than no self-audit. An oracle that publishes its own metrics with some flattering bias is still publishing more information than an oracle that publishes nothing. Buyers can apply discount factors to self-audited metrics; they cannot do anything with absent metrics. The marginal value of partial transparency is high relative to opacity; the marginal value of perfect transparency over partial transparency is incremental. Pursuing perfect external audit before publishing self-audit is the perfect being the enemy of the good. The defensible path is to ship self-audit now, contract external validation as soon as practical, and improve both over time.
The cultural prerequisite: an organization that can stand publishing bad numbers
The technical machinery described in this essay (calibration metrics, scoring-error logs, external audits, methodology versioning) is necessary but not sufficient. The deeper prerequisite is organizational. An oracle operator that cannot stand publishing bad numbers will, over time, find ways to massage the numbers, suppress findings, delay disclosures, or quietly retire metrics that have become embarrassing. The technical apparatus does not defend against an organization committed to looking good; it only defends against an organization committed to being honest.
The cultural prerequisite has identifiable markers. The first is that bad news travels up the organization without modification. When the calibration metrics deteriorate, the first email about it goes to the executive team in the same form it would have gone to a working-level reviewer; nobody softens it on the way up. The second marker is that the oracle's leadership has personally signed off on the publication policy and is willing to be quoted about specific bad findings. A leadership team that endorses transparency in the abstract but disappears when specific embarrassments need to be discussed is failing the cultural test. The third marker is that compensation and promotion at the oracle do not penalize the people who surface bad news. An organization where the messenger is shot will rapidly stop having messengers; an organization where the messenger is rewarded will surface problems early enough to fix them.
The fourth marker is the response to external criticism. When an external auditor publishes a finding, when a federated peer documents a methodological disagreement, when a journalist surfaces a scoring error, the defensible response is engagement on the substance: acknowledge what is correct, contest what is incorrect with specific evidence, commit to remediation where remediation is warranted. The non-defensible responses are denial without evidence, attacks on the critic's motives, or silence. Oracles whose responses to criticism are consistently non-defensible are signaling that their public accountability machinery is theater, regardless of how good the technical apparatus looks.
The fifth marker is willingness to ship methodology changes that produce score deltas affecting paying customers. Oracles whose customers are agent operators paying for tier promotion, marketplace visibility, or other reputation-correlated services have an obvious commercial incentive to avoid methodology changes that lower scores for paying customers. The defensible practice is to make the methodology change anyway, pre-announce it, document the affected scores, and absorb the customer-relationship cost as a cost of doing accountable business. The oracle that consistently delays score-lowering methodology changes for commercial reasons is making a different kind of decision than the oracle that ships them.
The sixth marker is the relationship between the audit function and the rest of the operation. In organizations that take audit seriously, the audit function reports to a governance structure separate from the operational structure being audited. In organizations that do not, the audit function reports to the operational leadership, which creates obvious capture risk. The defensible structure for an oracle is for the self-audit function to report to a governance committee that has external members and that publishes its meeting minutes and decisions.
These cultural markers are not directly measurable from outside the organization. They are inferable from the organization's behavior over time: how it responds to specific incidents, what it ships in response to specific findings, how it compensates the people doing the work. A buyer evaluating an oracle's accountability posture should look for the technical apparatus described above and also for the cultural pattern. An oracle with strong technical apparatus and weak cultural pattern is one bad incident away from failing visibly; an oracle with strong cultural pattern is one bad incident away from getting better.
What Armalo does
Armalo publishes its Oracle Self-Audit Scorecard at /trust/self-audit, scored against the ten-dimension rubric above with evidence for each dimension and with self-rated and externally validated scores published side by side. Calibration metrics (overall Brier score, overall ECE, per-tier breakdowns for Bronze, Silver, Gold, and Platinum, per-category breakdowns, reliability diagrams, and historical trajectories) are updated weekly and surfaced at /trust/calibration. Confidence intervals are published with every point estimate; the underlying prediction-outcome pair counts are exposed so readers can evaluate sample adequacy.
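For readers who want to see what these numbers involve, here is a minimal sketch of a Brier score, an equal-width-bin ECE, and a percentile bootstrap interval. It assumes each trust score has already been mapped to a predicted probability of a successful outcome and paired with a 0/1 outcome; that mapping, the function names, and the bin and resample counts are illustrative assumptions, not Armalo's published implementation.

```python
import random
from typing import Callable, List, Sequence, Tuple


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE: count-weighted mean of |mean prediction - observed rate|.

    Assumes every probability lies in [0, 1].
    """
    if not probs or len(probs) != len(outcomes):
        raise ValueError("need equal-length, non-empty inputs")
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        # A prediction of exactly 1.0 is placed in the last bin.
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    ece = 0.0
    for members in bins:
        if members:
            mean_pred = sum(p for p, _ in members) / len(members)
            observed = sum(y for _, y in members) / len(members)
            ece += len(members) / len(probs) * abs(mean_pred - observed)
    return ece


def bootstrap_ci(
    metric: Callable[[Sequence[float], Sequence[int]], float],
    probs: Sequence[float],
    outcomes: Sequence[int],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval: resample pairs with replacement, recompute."""
    rng = random.Random(seed)
    n = len(probs)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([probs[i] for i in idx], [outcomes[i] for i in idx]))
    stats.sort()
    lower = stats[int(alpha / 2 * n_resamples)]
    upper = stats[min(int((1 - alpha / 2) * n_resamples), n_resamples - 1)]
    return lower, upper
```

A per-tier or per-category breakdown is the same computation run on the subset of pairs belonging to that tier or category, which is why the pair counts matter: a narrow-looking point estimate on a small subset can carry a wide interval.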
The public scoring-error dispute log lives at /trust/scoring-disputes, with each entry detailing the operator filing, the independent panel composition, the panel's reasoning and verdict, and the remedy applied. Panel members rotate on documented schedules, include at least one external auditor and one operator representative per panel, and the oracle operator presents its position without voting. Failure classification rules are published at /trust/failure-classification with examples and edge cases. The scoring methodology is fully documented and versioned at /trust/methodology with 14-day pre-announcement of any change, cross-validation runs published during the notice period, and explicit dissent paths for affected operators. External audits are contracted annually with rotating scope; findings are published verbatim within 30 days of receipt regardless of outcome, and remediation commitments are tracked publicly with aging. Evidence underlying scores is cryptographically signed and timestamped, retained for documented windows, and available to authorized external auditors with audit-trail logging on every access.
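The signing and timestamping step can be illustrated with a short sketch. This one uses HMAC-SHA256 from the Python standard library purely so the example runs without dependencies; a production oracle would more plausibly use an asymmetric scheme so that external auditors can verify evidence without holding the signing secret. The record fields, function names, and envelope layout are hypothetical.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone
from typing import Optional


def _canonical_bytes(record: dict, signed_at: str) -> bytes:
    """Serialize deterministically so signer and verifier hash identical bytes."""
    envelope = {"record": record, "signed_at": signed_at}
    return json.dumps(envelope, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_evidence(record: dict, key: bytes, signed_at: Optional[str] = None) -> dict:
    """Attach a UTC timestamp and an HMAC-SHA256 tag covering record + timestamp."""
    signed_at = signed_at or datetime.now(timezone.utc).isoformat()
    tag = hmac.new(key, _canonical_bytes(record, signed_at), hashlib.sha256).hexdigest()
    return {"record": record, "signed_at": signed_at, "signature": tag}


def verify_evidence(signed: dict, key: bytes) -> bool:
    """Recompute the tag and compare in constant time."""
    expected = hmac.new(
        key, _canonical_bytes(signed["record"], signed["signed_at"]), hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, signed["signature"])


if __name__ == "__main__":
    key = b"demo-key-not-for-production"
    signed = sign_evidence({"agent_id": "agent-123", "outcome": "success"}, key)
    print(verify_evidence(signed, key))  # True for an untampered record
```

Because the timestamp sits inside the signed bytes, changing either the evidence or the claimed signing time invalidates the tag.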
FAQ
Why publish reliability diagrams instead of just Brier score and ECE?
Reliability diagrams show the shape of miscalibration in a way that single numbers cannot. A Brier score of 0.15 might mean the oracle is uniformly slightly miscalibrated everywhere or might mean it is well-calibrated for most agents and badly miscalibrated for a specific subset. The reliability diagram makes the difference visible. For non-technical readers especially, the diagram is more interpretable than the numeric metrics, even though both are needed for rigor.
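The data behind a reliability diagram is just a per-bin table, and a small sketch shows how it localizes miscalibration. The function name, the bin count, and the synthetic predictions below are illustrative assumptions; the point is only that the table exposes which probability range carries the gap.

```python
from typing import Dict, List, Sequence


def reliability_table(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 5
) -> List[Dict[str, float]]:
    """One row per non-empty equal-width bin: count, mean prediction, observed rate."""
    buckets: List[List[int]] = [[] for _ in range(n_bins)]
    for i, p in enumerate(probs):
        buckets[min(int(p * n_bins), n_bins - 1)].append(i)
    rows = []
    for b, members in enumerate(buckets):
        if not members:
            continue
        mean_pred = sum(probs[i] for i in members) / len(members)
        observed = sum(outcomes[i] for i in members) / len(members)
        rows.append(
            {
                "bin_low": b / n_bins,
                "bin_high": (b + 1) / n_bins,
                "count": len(members),
                "mean_pred": mean_pred,
                "observed": observed,
                "gap": mean_pred - observed,  # positive means overconfident
            }
        )
    return rows


if __name__ == "__main__":
    # Synthetic data: predictions near 0.3 are accurate, predictions near 0.9 are not.
    probs = [0.3] * 10 + [0.9] * 10
    outcomes = [1] * 3 + [0] * 7 + [1] * 5 + [0] * 5
    for row in reliability_table(probs, outcomes):
        print(row)
```

Plotting mean_pred against observed for each row, with the diagonal as the reference line, gives the diagram itself; the count column tells the reader how much weight each point deserves.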
How often should calibration metrics be updated?
Frequent enough that material miscalibration is detected in time to act on it, but not so frequent that statistical noise dominates. Weekly is appropriate for high-volume oracles with thousands of transactions per week per tier. Monthly or quarterly is more appropriate for lower-volume oracles. The cadence should be documented and adhered to; oracles that publish irregularly are signaling either operational chaos or selective publication.
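The trade-off between cadence and noise can be made concrete with a back-of-the-envelope sketch. It uses the normal approximation to the binomial to estimate how many prediction-outcome pairs a reliability bin needs before its observed rate is stable to within a chosen margin, then converts that into a number of weeks. The even spread of pairs across bins, the 95% level, and the function names are simplifying assumptions for illustration.

```python
import math


def required_samples_per_bin(margin: float, z: float = 1.96, rate: float = 0.5) -> int:
    """Pairs needed so a bin's observed rate is within +/- margin at the given z.

    rate=0.5 is the worst case for binomial variance, so the result is conservative.
    """
    if not 0 < margin < 1:
        raise ValueError("margin must be between 0 and 1")
    return math.ceil(z ** 2 * rate * (1 - rate) / margin ** 2)


def weeks_between_updates(weekly_pairs_per_tier: int, n_bins: int, margin: float) -> int:
    """Weeks of data needed per tier, assuming pairs spread evenly across bins."""
    if weekly_pairs_per_tier <= 0 or n_bins <= 0:
        raise ValueError("volume and bin count must be positive")
    per_bin_per_week = weekly_pairs_per_tier / n_bins
    return max(1, math.ceil(required_samples_per_bin(margin) / per_bin_per_week))


if __name__ == "__main__":
    print(weeks_between_updates(5000, 10, 0.05))  # high-volume tier
    print(weeks_between_updates(400, 10, 0.05))   # low-volume tier
```

Under these assumptions a tier with 5,000 pairs per week supports a weekly update at a five-point margin, while a tier with 400 pairs per week needs roughly ten weeks of data, which is closer to a quarterly cadence.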
Won't publishing all this transparency just expose the oracle to criticism?
Yes, and the criticism is the point. An oracle that cannot withstand criticism of its specific decisions is an oracle whose decisions are not defensible. Transparency forces the oracle to make decisions it can defend in detail and discourages the oracle from making decisions that are hard to justify. The criticism stream is also a feedback loop: external readers identify errors the oracle missed, methodology weaknesses the oracle had not considered, and edge cases the oracle had not handled. Each criticism is an opportunity to improve.
What happens when the calibration metrics show systematic error?
The oracle separates the two hypotheses (data noise versus methodology bug), runs experiments to distinguish them, publishes the experimental design and results, and ships methodology changes when the methodology-bug hypothesis is supported. The methodology change is itself pre-announced, cross-validated, and open to dissent from affected operators. The willingness to ship methodology changes in response to calibration evidence is itself a signal of healthy oracle operation; reluctance to ship changes is a signal of capture or rigidity.
Are auditors independent enough to be credible?
The auditor independence question has the same shape it has in financial accounting: any auditor with a long commercial relationship with the audited party is at risk of capture, and the defenses are rotation, conflict-of-interest disclosure, multi-stakeholder selection, and public methodology. No defense is perfect. The oracle that takes auditor independence seriously rotates auditors, discloses prior engagements, consults operators and peers on selection, and publishes the auditor's methodology. The oracle that skips these steps is making a different statement about how seriously it takes accountability.
Can buyers actually use the Self-Audit Scorecard meaningfully?
Yes, in two ways. First, as a comparison tool across oracles: a buyer choosing which oracles to query can prefer those with higher scorecard scores, all else equal. Second, as a weighting tool within a single oracle: a buyer who knows the oracle's scorecard score can adjust how much to weight the oracle's opinions in their own decision process. An oracle scoring 90+ deserves heavier weighting than one scoring 50-70, and one scoring below 50 should be treated as advisory at most.
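As a sketch of the second use, a buyer could translate scorecard bands into weights and blend several oracles' opinions about the same agent. The 90+ and below-50 bands come from the answer above; the numeric weights, the treatment of the 70-89 band (which the answer does not specify), and the function names are assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple


def oracle_weight(scorecard_score: float) -> float:
    """Map a 0-100 Self-Audit Scorecard score to a blending weight (hypothetical)."""
    if not 0 <= scorecard_score <= 100:
        raise ValueError("scorecard score must be between 0 and 100")
    if scorecard_score >= 90:
        return 1.0  # heavier weighting
    if scorecard_score >= 70:
        return 0.7  # band not specified in the text; assumed intermediate
    if scorecard_score >= 50:
        return 0.4  # lighter weighting
    return 0.0  # advisory at most: shown to the buyer, excluded from the blend


def blended_agent_score(opinions: List[Tuple[float, float]]) -> Optional[float]:
    """Weighted mean of agent scores; each opinion is (agent_score, oracle_scorecard).

    Returns None when no oracle clears the advisory threshold.
    """
    weighted = [(score, oracle_weight(card)) for score, card in opinions]
    total_weight = sum(w for _, w in weighted)
    if total_weight == 0:
        return None
    return sum(score * w for score, w in weighted) / total_weight


if __name__ == "__main__":
    # Two oracles rate the same agent; the second has a weak scorecard.
    print(blended_agent_score([(80.0, 95.0), (40.0, 45.0)]))
```

The design choice worth noting is that a sub-50 oracle is dropped from the blend rather than down-weighted, which matches treating its opinion as advisory rather than as evidence.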
Why is evidence verifiability worth 12 points when other dimensions are worth 8?
Because without verifiable evidence, every other accountability mechanism is undermined. Calibration metrics are only credible if the underlying outcomes can be verified. Dispute resolutions are only credible if the underlying evidence can be verified. Audit findings are only credible if the auditor can verify the data they are auditing. Verifiable evidence is the substrate that makes everything else possible; the weighting reflects that foundational role.
What is the relationship between this essay's Self-Audit Scorecard and the prior essay's Oracle Trust Score?
The Self-Audit Scorecard is a component of the Oracle Trust Score from the federation essay. The Oracle Trust Score has nine dimensions evaluating an oracle from a federation-utility perspective; the Self-Audit Scorecard has ten dimensions evaluating an oracle from an accountability perspective. There is overlap (transparency, audit, evidence) but the scorecards serve different audiences: the Oracle Trust Score is for federated peers and meta-aggregators; the Self-Audit Scorecard is for end-buyers evaluating whether to trust an oracle's opinions. Both should be published; they are complementary.
Bottom line
An oracle that scores everyone but itself is suspect. The accountability gap is structural in reputation systems; the way to close it is to subject the oracle's own scoring to the same audit machinery the oracle applies to agents. Calibration metrics (Brier score, ECE, reliability diagrams) measured against documented failure classification rules, published on a documented cadence with confidence intervals and historical trajectories. A public scoring-error dispute log with independent panels, structured verdicts, and remediation tracking. External audit on a rotating scope with verbatim publication of findings. Methodology versioning with pre-announced changes and cross-validation. Evidence verifiability via cryptographic signing and timestamping. The Oracle Self-Audit Scorecard above gives buyers a single artifact to compare oracles against this rubric. An oracle that publishes the scorecard and updates it in public is doing something different from an oracle that does not. The difference is exactly what makes "trust oracle" a credible phrase rather than a marketing term.