Calibrated Refusal: Teaching The Jury To Say "I Don't Know" Instead Of Hallucinating Confidence

2026-06-23 · 22 min · Armalo Team

A jury that always returns a verdict is a jury that hallucinates when it should not decide. Calibrated refusal lets judges abstain when their confidence does not justify a vote.

TL;DR

A multi-LLM jury that always returns a verdict on every case is producing noise on the cases that should not have been decided. Forcing a judge to vote when the evidence is ambiguous, the rubric does not cover the situation, or the model is operating outside its competence is asking the model to hallucinate confidence. Calibrated refusal is the discipline of letting judges abstain when their internal confidence falls below a threshold, then accounting for abstentions in the verdict aggregation. This essay defines the pattern, the three things you have to measure (expected calibration error, abstention rate, and AURRA), and a Calibration Test Protocol you can run on your own jury to find out whether your judges actually know what they do not know.

The case the jury got wrong by getting it right too confidently

The case that taught us this lesson was straightforward in its structure and miserable in its consequences. An agent we will call AGENT-9118 produced an output that summarized a research paper on a clinical trial. The summary contained one factual claim that was technically defensible from the paper but materially misleading without the surrounding methodology context that the agent had stripped out. The omission was not a hallucination; everything in the summary was sourced. But the way it was sourced changed the meaning.

We sent the case to the jury. Five judges. Each one returned a confident verdict. Three said the summary was acceptable; the agent had cited the paper accurately. Two said it was unacceptable; the omission was material. The trimmed mean (after dropping the highest and lowest scores in our standard 20 percent trim) landed in the acceptable range. The composite score for that case nudged up.

The agent shipped the summary to a customer. The customer's domain expert, reading it in context, immediately flagged the omission. The customer asked us how the jury had cleared this. We pulled the per-judge raw judgments. All five judges had returned confident scores. None had flagged uncertainty. None had said "this case is genuinely hard and I am not sure my verdict is reliable." Every judge had taken its best guess and packaged it as a confident assessment.

The failure mode here is subtle. None of the judges were wrong in any technical sense. The case was genuinely on the boundary of acceptable summarization. Three judges thought it leaned acceptable; two thought it leaned not. That is what an honest panel of human reviewers would also produce on a hard case. The problem was not the disagreement; it was that the system aggregated the disagreement into a single verdict number that hid the underlying uncertainty. The customer reading the score had no signal that this was a contested case. They saw a number that looked decisive.

The fix is calibrated refusal. We want judges to be allowed to say "I do not have enough confidence to vote on this one" when the case is genuinely on the edge of the rubric. We want the aggregation to treat abstentions specifically — neither as approvals nor as rejections — and we want the final verdict to flag cases where so many judges abstained that the remaining vote count is not reliable. This essay lays out how to build that pattern into a multi-LLM jury, how to measure whether it is working, and how to avoid the failure modes calibrated refusal introduces if implemented carelessly.

What "calibrated" actually means in this context

The word "calibrated" is doing a lot of work here, so let us be specific. A calibrated judge is one whose stated confidence corresponds to its empirical accuracy. If the judge says "I am 90 percent confident this output is acceptable," then across the population of cases where the judge has said 90 percent, it should be right roughly 90 percent of the time. The 90 means something measurable.

Most off-the-shelf LLMs are not natively well-calibrated. They tend to be overconfident on hard cases and reasonably calibrated on easy ones. The aggregate effect is that their confidence numbers cluster in the high range with little useful discrimination. A judge that says it is 95 percent confident on cases it actually gets right 70 percent of the time is uncalibrated — its high-confidence judgments are not actually more reliable than its medium-confidence ones.

Calibrated refusal does two things. First, it asks the judge to produce a confidence number alongside its verdict. Second, it instructs the judge to abstain (not vote) if its confidence falls below a threshold. The threshold is a parameter you tune. The act of asking for confidence does not make the model calibrated; it just gives you a number to use for filtering. The calibration itself comes from how you instruct the judge, what evidence it has, what rubric it is using, and how you have validated its confidence outputs against ground truth.

A naive implementation looks like this: you append "return your verdict and your confidence on a scale of 0 to 100" to the judge prompt. You set the threshold at 60. You drop any verdict where confidence is below 60. The judges that remain vote; the trimmed mean is computed. This works, sort of, but it leaves you with two unsolved problems.

The first unsolved problem is that the confidence numbers are themselves not validated. The model is just emitting a number that looks like confidence; whether that number corresponds to actual reliability is unknown. A model that emits 95 on every case is technically returning confidence; you have learned nothing.

The second unsolved problem is that abstention skews the verdict. If hard cases produce more abstentions, then the cases that survive into the trimmed mean are systematically the easier cases. The mean is now overstating the agent's quality because the cases the jury abstained from were the ones it was about to score lower on.
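The selection effect described above can be seen with a few lines of arithmetic. This is a toy sketch, not real jury data: the case records are invented, each case is reduced to a single quality score and a single confidence number, and the threshold of 60 is the one from the naive implementation.

```python
from statistics import mean

# Hypothetical per-case records: (quality_score, judge_confidence).
# Assumption for illustration: hard cases score lower AND draw lower confidence.
cases = [
    (0.90, 95), (0.80, 90), (0.85, 88), (0.90, 80),  # easy cases
    (0.40, 55), (0.50, 50), (0.30, 45), (0.60, 58),  # hard cases
]

def naive_filtered_mean(records, threshold=60):
    """Drop every case whose confidence is below the threshold, then average."""
    kept = [score for score, conf in records if conf >= threshold]
    return mean(kept), len(kept)

all_mean = mean(score for score, _ in cases)   # mean over every case
kept_mean, kept_n = naive_filtered_mean(cases)  # mean over surviving cases only

print(f"all cases: {all_mean:.3f}, after naive filter: {kept_mean:.3f} ({kept_n} kept)")
```

In this toy data the filtered mean is higher than the all-case mean purely because the hard cases were dropped, which is the overstatement the paragraph warns about.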

Both problems have solutions. The solutions require explicit measurement. We will get to them. First we have to define what we are measuring.

Expected calibration error: do the confidence numbers mean anything

The core measurement of judge calibration is expected calibration error, ECE. The construction is straightforward. You bin the judge's predictions by stated confidence, say, ten bins of 0-10, 10-20, ..., 90-100. Within each bin, you compute the average stated confidence and the empirical accuracy on a labeled validation set. The ECE is the average of the absolute difference between stated confidence and empirical accuracy across bins, with each bin weighted by the fraction of cases that fall in it.

A perfectly calibrated judge has ECE near zero. Its 90-percent-confident predictions are right 90 percent of the time, its 70-percent-confident predictions are right 70 percent of the time, and so on across all bins. A judge with ECE of 0.1 is, on average, off by 10 percentage points between stated and actual reliability. A judge with ECE of 0.3 is essentially producing noise; its confidence scores have no useful relationship to its accuracy.

Measuring ECE requires a labeled validation set. You need cases where you have ground truth verdicts (from human experts, from a higher-tier jury, from any source you trust more than the judge under test). You run the judge on these cases, capturing both verdict and confidence. You bin the results, compute the empirical accuracy per bin, and calculate the weighted error.
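A minimal sketch of the ECE computation described above, assuming confidences are expressed on a 0 to 1 scale (divide a 0 to 100 judge output by 100 first) and that `correct` marks whether each verdict matched the ground-truth label. The function name and signature are illustrative, not part of any particular library.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and empirical accuracy.

    confidences: floats in [0, 1], one per validation case.
    correct:     bools, True where the judge's verdict matched ground truth.
    """
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need equal-length, non-empty inputs")

    # Equal-width bins; a confidence of exactly 1.0 falls into the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins carry zero weight
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# A judge that says 0.95 on ten cases but is right on only seven of them.
overconfident = expected_calibration_error([0.95] * 10, [True] * 7 + [False] * 3)
print(round(overconfident, 3))
```

The example judge lands at an ECE of about 0.25, the same overconfidence pattern as the 95-versus-70 judge discussed earlier.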

The labeled validation set is the bottleneck. Most teams do not have one. They run their jury on production cases without ground truth and assume the jury is calibrated because the verdicts look reasonable. This is the central failure of most evaluation programs. Without a ground-truth validation set, you cannot measure whether your judges are calibrated. You are flying blind.

Building a ground-truth set is unglamorous but tractable. You assemble a couple hundred cases that span the domain, ideally with deliberate inclusion of hard and easy cases at known proportions. You have human experts label each case according to the rubric your jury uses. You treat their consensus as ground truth (with appropriate humility about cases where the experts also disagree). You run your jury on the validation set, bin the predictions, compute ECE.

Do this for each judge in the panel separately. The judges will not have the same ECE. Some will be substantially better calibrated than others. This per-judge ECE becomes the input to deciding which judges to keep, which to retire, which to weight more heavily in the aggregation, and what abstention threshold to use for each.

A practical heuristic: a judge with ECE under 0.10 is well-calibrated and useful. A judge with ECE between 0.10 and 0.20 is acceptable for some uses but should not be heavily weighted. A judge with ECE above 0.20 should be considered for retirement or for major prompt revision; its confidence numbers are not telling you much.

ECE is not the only calibration metric — Brier score, log loss, and reliability diagrams all add information — but ECE is the simplest single number and the easiest to discuss across teams. Use it as the headline metric.

Abstention rate: how often the judge passes

The second measurement is abstention rate. This is the fraction of cases where the judge declines to vote because its confidence falls below the threshold.

A judge with abstention rate near zero is either operating on easy cases, has a poorly calibrated confidence number that lives near the high end, or has its threshold set too low to ever trigger. None of these are good. A judge that never abstains is not exercising calibrated refusal; it is voting on every case regardless of competence.

A judge with abstention rate near one is either operating on cases entirely outside its competence, has a too-high threshold, or has confidence outputs that live near the low end. This is also not useful; the judge is contributing nothing to the verdict.

The target abstention rate depends on the case mix and the threshold. As a starting point, a judge in a mature jury operating on a representative case mix should abstain on something like 10 to 25 percent of cases. The exact number is empirical; you tune the threshold to land in this range and then measure whether the calibration of the non-abstained verdicts has improved.

Abstention rate by itself is not informative. It must be paired with a measurement of what happens to verdict quality when abstentions are removed. If a judge abstains on 30 percent of cases and the remaining 70 percent of verdicts are dramatically better calibrated than the prior all-vote regime, the abstention is doing useful work. If a judge abstains on 30 percent of cases and the remaining 70 percent are no better calibrated than before, the abstention is just noise.
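The pairing of abstention rate with retained-verdict quality can be sketched as a single report. This assumes confidences on the 0 to 100 scale used in the prompts, a labeled validation set, and uses plain accuracy on the retained verdicts as a stand-in for "better calibrated"; the function and key names are hypothetical.

```python
def abstention_report(confidences, correct, threshold):
    """Abstention rate plus accuracy before and after abstaining.

    confidences: judge confidence per case, 0-100.
    correct:     bools, True where the verdict matched ground truth.
    threshold:   the judge abstains when confidence < threshold.
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("need equal-length, non-empty inputs")

    voted = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
    return {
        "abstention_rate": (n - len(voted)) / n,
        "baseline_accuracy": sum(correct) / n,  # all-vote regime
        # None when the judge abstained on everything.
        "retained_accuracy": sum(voted) / len(voted) if voted else None,
    }


report = abstention_report(
    confidences=[90, 80, 70, 40, 30],
    correct=[True, True, False, False, True],
    threshold=50,
)
print(report)
```

If `retained_accuracy` is not clearly above `baseline_accuracy`, the abstentions are not buying anything at that threshold.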

This pairing — abstention rate plus post-abstention calibration — is the actual signal. We track it as a single composite that we will introduce next.

AURRA: the curve that ties it together

AURRA stands for Area Under the Refusal-Rate Accuracy curve. It is a metric we have adopted because it directly measures the value of allowing abstention. The construction goes as follows.

For a given judge with confidence outputs, sort all cases by confidence, descending. The most confident case is at position 1; the least confident is at position N. Now imagine progressively allowing the judge to abstain on the lowest-confidence cases. At abstention rate 0, the judge votes on all cases; at abstention rate 0.5, the judge votes only on the top half by confidence; at abstention rate 1, the judge abstains on everything.

For each abstention rate, measure the accuracy of the judge's verdicts on the cases it did vote on (the high-confidence half, the high-confidence quarter, etc.). Plot accuracy on the y-axis, abstention rate on the x-axis. The curve should rise as abstention rate rises — the judge gets more accurate as it sheds the cases it was less sure about.

The area under this curve, normalized appropriately, is AURRA. A judge with high AURRA is one whose confidence numbers actually predict accuracy: shedding low-confidence cases meaningfully improves accuracy on the remaining ones. A judge whose curve stays flat, which we will call flat AURRA, is one whose confidence numbers do not discriminate; shedding low-confidence cases does not help.
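A sketch of the curve construction and the area under it. The essay leaves the normalization open, so the choice here is an assumption: trapezoidal area divided by the range of abstention rates covered. The point at abstention rate 1 is omitted because accuracy on zero votes is undefined. Function names are illustrative.

```python
def refusal_accuracy_curve(confidences, correct):
    """Return [(abstention_rate, accuracy_on_voted_cases), ...].

    Point k sheds the k least-confident cases, for k = 0 .. n-1.
    """
    ranked = sorted(zip(confidences, correct), key=lambda p: p[0], reverse=True)
    n = len(ranked)
    hits = sum(1 for _, ok in ranked if ok)
    curve = []
    for k in range(n):
        kept = n - k
        curve.append((k / n, hits / kept))
        # Shed the least confident case that is still voting.
        if ranked[kept - 1][1]:
            hits -= 1
    return curve


def aurra(curve):
    """Trapezoidal area under the curve, divided by the x-range it covers."""
    if len(curve) < 2:
        raise ValueError("need at least two points on the curve")
    area = 0.0
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area / (curve[-1][0] - curve[0][0])


curve = refusal_accuracy_curve(
    confidences=[0.9, 0.8, 0.6, 0.3],
    correct=[True, True, False, False],
)
print(curve, round(aurra(curve), 3))
```

With this normalization a judge whose confidence carries no information scores close to its all-vote accuracy, so comparing AURRA against the first point of the curve gives a rough sense of how much abstention helps.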

AURRA combines calibration and discrimination into a single number that answers the question we actually care about: is allowing this judge to abstain doing useful work? If AURRA is high, abstention is buying us accuracy. If AURRA is flat, abstention is just throwing away cases without improving the rest.

The practical use of AURRA: for each judge in your panel, compute AURRA on a validation set. Set the abstention threshold to the point on the curve where the marginal accuracy gain from one more abstention starts to plateau. Above that point, you are losing useful cases without gaining much accuracy. Below that point, you have more accuracy on the table.
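One simple way to locate the plateau, offered as a heuristic rather than the method used in production: walk a grid of candidate thresholds upward and stop at the first step where shedding more cases fails to improve retained accuracy by at least `min_gain`. The grid, the `min_gain` default, and the function name are all assumptions.

```python
def plateau_threshold(confidences, correct, grid=tuple(range(0, 100, 10)), min_gain=0.02):
    """Pick the threshold where further abstention stops buying accuracy.

    confidences: judge confidence per case, 0-100.
    correct:     bools, True where the verdict matched ground truth.
    Returns (threshold, retained_accuracy_at_that_threshold).
    """
    pairs = list(zip(confidences, correct))

    def stats(threshold):
        voted = [ok for conf, ok in pairs if conf >= threshold]
        return len(voted), (sum(voted) / len(voted) if voted else None)

    chosen = grid[0]
    kept, acc = stats(chosen)
    if acc is None:
        raise ValueError("no cases vote at the lowest threshold")

    for candidate in grid[1:]:
        new_kept, new_acc = stats(candidate)
        if new_kept == kept:
            continue  # raising the threshold shed nothing; keep looking
        if new_acc is None or new_acc - acc < min_gain:
            break     # the next batch of abstentions no longer buys accuracy
        chosen, kept, acc = candidate, new_kept, new_acc
    return chosen, acc


threshold, accuracy = plateau_threshold(
    confidences=[20, 35, 45, 55, 65, 75, 85, 95],
    correct=[False, False, False, True, True, False, True, True],
)
print(threshold, accuracy)
```

Because the walk stops at the first weak step, a single noisy step can end it early; on a real validation set, smoothing the gain over a window of steps is a reasonable refinement.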

AURRA is also useful for comparing judges. If you have two candidate judge models for the same role in the panel, the one with higher AURRA is the one whose confidence numbers are more useful — even if their raw accuracy is similar. AURRA prefers judges whose confidence is informative.

Like ECE, AURRA requires ground-truth labels. It cannot be computed on production traffic without them. The same validation set that you built for ECE is the one you use for AURRA. The investment in the validation set pays for itself across both metrics.

The aggregation problem: what to do when judges abstain

Once you allow judges to abstain, you have to decide how to aggregate verdicts when some judges have abstained. Naive aggregation breaks. If three judges vote and two abstain, do you take the mean of the three? What about your trim policy? If you normally trim the top and bottom 20 percent of five judges, you trim one each side. If you have only three votes, the trim either cannot apply or has to be reformulated.

There are several aggregation patterns that work. We will describe the one we use.

First, set a minimum vote count. If fewer than M judges voted (out of N total), the case is escalated rather than auto-decided. M is typically the simple-majority threshold, floor(N/2) + 1, but you can tune it. For a five-judge panel with abstention enabled, M=3 means at least three judges have to vote for the case to be auto-aggregated; below that, the case routes to a more expensive escalation path (a tier-two jury, a human reviewer, or a flagged-for-review queue).

Second, scale the trim to the surviving votes. If you normally trim the top and bottom 20 percent of N judges, scale that to 20 percent of the surviving voters. With three votes, trimming one from each side would leave a single vote, so the trim drops to zero (no trim); with four votes, 20 percent rounds to one trim each side. Whatever rounding rule you choose, fix it explicitly so the same vote count always gets the same trim. The principle is that the trim is a robustness device against outlier judges, and with fewer votes the trim has to be lighter to avoid eliminating useful information.

Third, weight by judge calibration. If your judges have differing ECE or AURRA, weight their verdicts in the trimmed mean by their calibration scores. A well-calibrated judge contributes more to the verdict than a poorly calibrated one. This is more sophisticated than equal-weight aggregation and produces measurably better verdicts on contested cases.

Fourth, surface the abstention pattern in the verdict metadata. The verdict object should include not just the score but the abstention rate (how many of the panel abstained), the per-judge confidence histogram, and a flag indicating whether the verdict is high-confidence (most judges voted, similar verdicts) or low-confidence (many abstentions or wide spread among voters). Buyers consuming the verdict can then act differently on high-confidence and low-confidence verdicts.
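The four aggregation steps above can be sketched together. Several details are assumptions made for the sake of a runnable example: the `JudgeVote` shape, the rule that a trim never leaves fewer than two votes (chosen so that three votes get no trim and four votes get one each side), positive calibration weights supplied by the caller, and the cutoffs behind the `high_confidence` flag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JudgeVote:
    judge: str
    score: Optional[float]  # None means the judge abstained
    confidence: float       # 0-100
    weight: float = 1.0     # hypothetical calibration weight, must be > 0


def aggregate(votes, min_votes=None, trim=0.20, max_spread=0.3):
    n = len(votes)
    if min_votes is None:
        min_votes = n // 2 + 1  # simple majority of the full panel

    cast = sorted((v for v in votes if v.score is not None), key=lambda v: v.score)
    abstentions = n - len(cast)
    meta = {
        "panel_size": n,
        "abstentions": abstentions,
        "confidences": {v.judge: v.confidence for v in votes},
    }

    # Step 1: too few votes means escalate instead of auto-deciding.
    if len(cast) < min_votes:
        return {"status": "escalate", "score": None, **meta}

    # Step 2: scale the trim to the surviving votes, never below two survivors.
    k = round(trim * len(cast))
    if len(cast) - 2 * k < 2:
        k = 0
    kept = cast[k:len(cast) - k]

    # Step 3: calibration-weighted mean of the kept votes.
    score = sum(v.score * v.weight for v in kept) / sum(v.weight for v in kept)

    # Step 4: surface the abstention pattern and a coarse confidence flag.
    spread = kept[-1].score - kept[0].score
    high_confidence = abstentions <= 1 and spread <= max_spread
    return {"status": "resolved", "score": score, "trimmed_each_side": k,
            "spread": spread, "high_confidence": high_confidence, **meta}


panel = [
    JudgeVote("a", 0.8, 82), JudgeVote("b", 0.7, 75), JudgeVote("c", 0.3, 66),
    JudgeVote("d", None, 41), JudgeVote("e", None, 38),
]
print(aggregate(panel))
```

The example panel resolves with three votes, no trim, and a low-confidence flag, which is the kind of contested verdict the metadata is meant to make visible.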

This last point is the most important. The whole reason to introduce calibrated refusal is to give buyers better signal. If the verdict surface is just a number, you have hidden the signal that the underlying machinery worked so hard to produce. Surface the signal. Make low-confidence verdicts visible.

How to write the judge prompt for calibrated abstention

The judge prompt is where calibrated refusal actually happens. The model has to be instructed to produce confidence and to abstain. The instruction language matters more than people expect.

A bad instruction: "Return your verdict and confidence." The model will produce a number that looks like confidence but is not validated and tends to live near 90.

A better instruction: "Return your verdict on the rubric. Then return a confidence number from 0 to 100. Use 0-30 if the case is genuinely outside the rubric or you cannot apply the rubric meaningfully. Use 30-60 if the rubric applies but the evidence is ambiguous. Use 60-85 if the rubric applies and your verdict is reasonably supported. Use 85-100 only if the rubric clearly applies and the verdict is unambiguous."

This instruction does two things. It anchors the confidence scale to specific decision contexts (rubric applicability, evidence ambiguity), and it forces the model to think about why its confidence is what it is rather than producing a number from habit.

The even-better instruction adds a refusal directive. "If your confidence is below 50, do not vote. Return ABSTAIN as your verdict and explain in one sentence why this case falls below the confidence threshold." The explanation requirement is important — it gives you a corpus of abstention reasons that you can analyze to understand where the rubric is failing or where the case mix is drifting outside what the judge can handle.
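A sketch of how the anchored instruction and the refusal directive might be assembled and enforced. The prompt wording is taken from the text above; the JSON reply format, the function names, and the decision to re-check the threshold in code (in case a model votes despite a low confidence) are assumptions for illustration.

```python
import json

CONFIDENCE_ANCHORS = (
    "Use 0-30 if the case is genuinely outside the rubric or you cannot apply "
    "the rubric meaningfully. Use 30-60 if the rubric applies but the evidence "
    "is ambiguous. Use 60-85 if the rubric applies and your verdict is "
    "reasonably supported. Use 85-100 only if the rubric clearly applies and "
    "the verdict is unambiguous."
)


def build_judge_prompt(rubric, output_text, abstain_below=50):
    """Assemble the judge prompt with anchored confidence and a refusal directive."""
    return (
        f"Rubric:\n{rubric}\n\nOutput under review:\n{output_text}\n\n"
        "Return your verdict on the rubric. Then return a confidence number "
        f"from 0 to 100. {CONFIDENCE_ANCHORS}\n"
        f"If your confidence is below {abstain_below}, do not vote. Return "
        "ABSTAIN as your verdict and explain in one sentence why this case "
        "falls below the confidence threshold.\n"
        # Assumed reply format, not something the essay specifies.
        'Reply as JSON with keys "verdict", "confidence", and "reason".'
    )


def parse_judge_reply(reply_text, abstain_below=50):
    """Parse the assumed JSON reply and enforce the abstention threshold in code."""
    data = json.loads(reply_text)
    confidence = float(data["confidence"])
    verdict = data["verdict"]
    if verdict == "ABSTAIN" or confidence < abstain_below:
        verdict = "ABSTAIN"
    return {"verdict": verdict, "confidence": confidence,
            "reason": data.get("reason", "")}


reply = '{"verdict": "acceptable", "confidence": 42, "reason": "Evidence is ambiguous."}'
print(parse_judge_reply(reply))
```

Keeping the `reason` field on every abstention is what builds the corpus of abstention reasons mentioned above.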

The choice of threshold (50 in the example above, but you tune it) depends on AURRA analysis. The threshold should be set to the point where allowing the judge to vote produces a measurable accuracy gain over abstention. Below that point, the votes are noisy.

There is one more refinement. Some teams ask the judge to produce both confidence and a separate "applicability" score that addresses whether the rubric covers the situation. A case can be high-confidence-on-rubric but low-applicability (the verdict is decisive but the rubric is the wrong tool for the job). This decoupling is useful but adds complexity. Start with single-axis confidence; add applicability if the abstention reasons suggest the rubric is the bottleneck.

The Calibration Test Protocol

Here is the named artifact. This protocol can be applied to any multi-LLM jury to determine whether the judges are calibrated and whether allowing abstention would improve verdict quality. Run it before deploying calibrated refusal in production.

Step one: assemble a validation set of at least 200 cases representative of your production case mix. The cases should span easy, medium, and hard difficulty in known proportions. Have at least three human experts independently label each case according to the same rubric your jury uses. Treat the consensus label as ground truth; flag cases where the experts disagreed substantially as boundary cases.

Step two: run your current jury on the validation set without abstention enabled. Capture each judge's verdict and (if you can extract it via prompting) confidence number. Also capture the trimmed-mean composite verdict per case.

Step three: compute ECE per judge against the ground-truth labels. Identify any judge with ECE above 0.20 — these are calibration failures and need prompt revision or replacement before abstention will help.

Step four: compute AURRA per judge. Identify the threshold for each judge where marginal accuracy gain from another abstention plateaus. Record the per-judge thresholds.

Step five: re-run the jury with abstention enabled at the per-judge thresholds. Capture verdicts, abstention flags, and post-abstention composite verdicts. Compute the new aggregate accuracy and the abstention rate.

Step six: compare the abstention-enabled regime against the all-vote baseline. Look at three numbers: aggregate verdict accuracy on cases that were resolved (i.e., where enough judges voted); escalation rate (cases where too many judges abstained for auto-resolution); and composite confidence (the system's stated confidence in its own verdicts on resolved cases).

Step seven: tune. The first run will have parameters that are imperfect. Adjust thresholds, judge weights, escalation policies. Re-run on a holdout slice of the validation set. Iterate until the resolved-case accuracy clearly exceeds the baseline and the escalation rate is operationally tolerable.

Step eight: deploy in shadow mode. Run abstention-enabled jury alongside the production jury for a week or two on real traffic. Compare. Only switch the production jury to abstention-enabled when the shadow results clearly outperform.
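The comparison in step six can be sketched as a small function over the validation set. It covers two of the three numbers (resolved-case accuracy and escalation rate); composite confidence depends on how your system states it and is left out. The dictionary shapes and names are assumptions.

```python
def compare_regimes(truth, baseline, with_abstention):
    """Compare an all-vote baseline against an abstention-enabled run.

    truth:           {case_id: ground-truth label}
    baseline:        {case_id: composite verdict} with every case decided
    with_abstention: {case_id: composite verdict, or None if the case escalated}
    """
    def accuracy(verdicts):
        resolved = {cid: v for cid, v in verdicts.items() if v is not None}
        if not resolved:
            return None
        hits = sum(1 for cid, v in resolved.items() if truth[cid] == v)
        return hits / len(resolved)

    escalated = sum(1 for v in with_abstention.values() if v is None)
    return {
        "baseline_accuracy": accuracy(baseline),
        "resolved_accuracy": accuracy(with_abstention),
        "escalation_rate": escalated / len(with_abstention),
    }


truth = {1: "pass", 2: "fail", 3: "pass", 4: "fail"}
baseline = {1: "pass", 2: "pass", 3: "pass", 4: "fail"}
abstaining = {1: "pass", 2: None, 3: "pass", 4: "fail"}
print(compare_regimes(truth, baseline, abstaining))
```

Reading the two accuracy figures next to the escalation rate makes the trade in step seven concrete: higher resolved accuracy is only worth it if the escalation rate stays operationally tolerable.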

This protocol is unglamorous and time-consuming. The validation set alone takes weeks to assemble well. Most teams will be tempted to skip it and just turn on abstention with default parameters. We have done that experiment. It does not work. Without the validation set you cannot tune the thresholds; without the tuning the abstention pattern is essentially random; without good abstention the calibrated-refusal pattern is just adding latency and complexity for no measurable gain.

The validation set, once built, is reusable. It is not a one-time cost. You re-run the protocol every time you change a judge model, revise a rubric, or add a new case category. The validation set becomes the institutional memory of what your jury actually knows.

The escalation path: what happens when too many judges abstain

Calibrated refusal only works if abstentions are routed somewhere productive. If too-many-abstentions cases are silently dropped, the agent gets no verdict and the system has just thrown away a hard case. If they are auto-approved as a default, you have introduced a permissive bias. Both are wrong.

The right answer is an escalation path. Cases where the panel cannot reach a verdict route to a higher-cost evaluation surface. The options, in increasing order of cost, are as follows.

First escalation: a stronger model panel. If your tier-one jury is composed of mid-tier models, your tier-two jury can use frontier models. They will be more expensive per case but should resolve more of the hard cases. If the tier-two panel can also abstain, you have a third tier.

Second escalation: a domain-specific tool. Some hard cases are hard because they require domain knowledge that general LLMs do not have. Routing to a tool that retrieves the relevant ground truth (a database query, a retrieval pipeline, an external service) before re-running the jury can resolve cases that no general LLM panel can settle.

Third escalation: human review. The most expensive but ultimately authoritative path. A trained human reviewer reads the case and the abstention reasons and produces a verdict. This is slow and expensive but provides ground truth labels for the cases that none of the automated tiers could handle. Those labels feed back into the validation set, which improves future calibration.
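A minimal sketch of tiered routing, assuming each tier is a callable that returns a verdict or `None` when it cannot resolve the case. The tier functions below are stubs standing in for a real panel, a tool-assisted re-run, and a human review queue; none of them reflect an actual API.

```python
from typing import Callable, Optional, Sequence, Tuple

# A tier takes a case and returns a verdict, or None if it cannot resolve it.
Tier = Callable[[dict], Optional[str]]


def resolve_with_escalation(case: dict, tiers: Sequence[Tuple[str, Tier]]) -> dict:
    """Try each tier in order of increasing cost; record the path taken."""
    path = []
    for name, tier in tiers:
        path.append(name)
        verdict = tier(case)
        if verdict is not None:
            return {"verdict": verdict, "resolved_by": name, "path": path}
    # Nothing resolved the case: surface that explicitly, never drop or auto-approve.
    return {"verdict": None, "resolved_by": None, "path": path}


# Stub tiers for illustration only.
def stronger_panel(case):
    return None  # pretend too many judges abstained again

def tool_assisted_rerun(case):
    return None  # pretend retrieval did not settle it

def human_review(case):
    return "unacceptable"  # a reviewer reads the case and abstention reasons


result = resolve_with_escalation(
    {"id": "case-1", "abstention_reasons": ["omitted methodology context"]},
    [("stronger_panel", stronger_panel),
     ("tool_assisted_rerun", tool_assisted_rerun),
     ("human_review", human_review)],
)
print(result)
```

Recording the path alongside the verdict keeps the escalation history available for score metadata and for feeding human-reviewed labels back into the validation set.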

The escalation path should be designed up front as part of the calibrated refusal deployment. Without it, you have a system that knows when to refuse but no plan for what to do with refused cases. The plan is the difference between a measured improvement in verdict quality and an operational mess.

A practical metric: the escalation rate. What fraction of total cases end up in escalation? If it is too high (say, above 15 percent), your jury is operating on a case mix beyond its capability and you need either more capable judges or domain tooling. If it is too low (under 2 percent), your abstention thresholds are too lax and the jury is voting on cases it should not. The right number is empirical and domain-specific; track it.

The scoring implications

Calibrated refusal changes how the composite score is computed in subtle but important ways. We will work through the implications because skipping them produces a system that scores agents inconsistently.

First, abstention is not a verdict. A case where the jury abstained should not contribute to the agent's score in either direction while it remains unresolved. It contributes to the case-coverage statistic (some fraction of cases were not resolved by automated jury) but not to the score. Counting an abstention as a partial-fail biases the agent's score downward; counting it as a partial-pass biases upward. The right answer is to exclude the abstention itself from the score numerator and denominator entirely.

Second, escalated cases need to be resolved before the score is final. If you publish a score that excludes escalated cases, the agent's score is computed on the easy subset of cases — exactly the bias we were trying to avoid. The score should be held until escalations resolve, then computed on the full case set including the escalated verdicts.

Third, score lineage should disclose abstention rates. The composite score metadata should include the fraction of cases that required escalation. A score computed on a panel where no judge ever abstained is meaningfully different from a score computed on a panel where 20 percent of cases required escalation. Both can be valid; the difference is informative for buyers reading the score.

Fourth, certification tiers may need adjustment. The rubric for Bronze, Silver, Gold, and Platinum certifications was likely calibrated against an all-vote jury. Switching to abstention-enabled juries shifts the score distribution. The certification thresholds may need to move to maintain the intended population mix. This is a one-time recalibration when you deploy abstention; do it deliberately, not by accident.

Fifth, the score's confidence band should widen for cases with high abstention rates. The score is an estimate; the confidence in that estimate depends on how many judges contributed. A case resolved by all five judges with similar verdicts has a tight confidence band. A case resolved by three judges with one outlier has a wider band. Expose the band, not just the point estimate.
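The first three implications can be sketched as a score function that holds until escalations resolve and then reports the escalation fraction with the score. The record shape and key names are hypothetical, the plain mean stands in for whatever composite you actually use, and the confidence band and tier recalibration are out of scope here.

```python
def composite_score(case_results):
    """Compute a score only when every case has a verdict.

    case_results: list of dicts with
      "score":     float, or None while the case is still unresolved
      "escalated": bool, True if the verdict came from an escalation tier
    """
    if not case_results:
        raise ValueError("no cases to score")

    # Unresolved abstentions are not verdicts, so the score is held
    # rather than published on the easy subset.
    pending = [c for c in case_results if c["score"] is None]
    if pending:
        return {"status": "held", "score": None, "pending_cases": len(pending)}

    scores = [c["score"] for c in case_results]
    escalated = sum(1 for c in case_results if c["escalated"])
    return {
        "status": "final",
        "score": sum(scores) / len(scores),
        "cases": len(case_results),
        # Disclosed so readers can tell an all-vote score from an escalated one.
        "escalation_fraction": escalated / len(case_results),
    }


in_flight = [{"score": 0.9, "escalated": False}, {"score": None, "escalated": True}]
settled = [{"score": 0.9, "escalated": False}, {"score": 0.5, "escalated": True}]
print(composite_score(in_flight))
print(composite_score(settled))
```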

This is more accounting than people want to do. It is also the difference between a scoring system that produces meaningful numbers and one that produces numbers-shaped noise.

The counter-argument

The sharpest counter-argument is that calibrated refusal trades one problem for a worse one. The original problem was that the jury was overconfident on hard cases. The replacement problem is that the jury escalates a meaningful fraction of cases, slowing throughput, raising cost, and introducing operational complexity around the escalation path. For many use cases, the original overconfidence was actually fine — buyers reading a low-confidence score implicitly understood it was a low-confidence score, and the system overall worked.

This argument has two parts and both deserve a response.

The throughput-and-cost part is real. Calibrated refusal does add latency on hard cases (escalation paths take time) and cost (escalation tier judges are more expensive, human review is the most expensive). For high-volume applications where the per-case stakes are low, the operational overhead may not be worth the verdict quality improvement. We accept this trade-off explicitly. The pattern is most useful where per-case stakes are high (escrow eligibility, certification gating, deal qualification) and less useful where stakes are low (every casual generation request).

The buyers-implicitly-understand part is wishful. We have not seen evidence that buyers extract uncertainty signal from confident-looking scores. They read the number and act on it. Hiding uncertainty inside a number that looks decisive is a system design flaw, not a feature. The buyer's behavior is downstream of the surface; if the surface produces a confident-looking number, the buyer treats it as confident. The fix is to expose the uncertainty, which requires the system to know when to be uncertain, which requires calibrated refusal.

There is a more sophisticated version of the counter-argument: that confidence numbers from LLMs are unreliable enough that building a system on top of them is shaky. This argument is partially correct — uncalibrated LLM confidence is essentially noise, as we established earlier. But the response is not to abandon the pattern; it is to invest in calibration. Validate the confidence numbers against ground truth. Retire judges whose confidence is uncalibrated. Tune the thresholds against measurable accuracy gains. The pattern works only if you do the calibration work; the work is the point.

What Armalo does

Armalo's multi-LLM jury supports calibrated refusal as a per-judge configuration. Each judge in the panel has a tunable confidence threshold and an abstention prompt that anchors the confidence scale to rubric applicability and evidence ambiguity. Verdict aggregation accounts for abstentions: cases require a minimum vote count for auto-resolution, and the trim policy scales to the surviving voter set.

We maintain a labeled validation set per evaluation domain. Judges are scored against the validation set using ECE and AURRA; thresholds are tuned to the AURRA plateau point. Judges that fail calibration (ECE above 0.20) are flagged for retirement or prompt revision. Calibration metrics for each judge are exposed in the evaluation event metadata.

The escalation path has three tiers. Tier one is the default cost-optimized panel. Tier two is a frontier-model panel reserved for cases tier one cannot resolve. Tier three is human review, used sparingly for cases tier two also cannot resolve and where the verdict has high economic consequence (escrow eligibility for a high-value deal, certification tier upgrade).

The trust oracle exposes confidence metadata alongside scores. Each composite score returned includes the fraction of cases that required escalation and the post-aggregation confidence band. Buyers calling the oracle can choose to require a confidence threshold (refuse low-confidence verdicts) or accept verdicts at any confidence level with appropriate disclosure.

Certification tiers (Bronze, Silver, Gold, Platinum) were recalibrated when calibrated refusal was deployed to maintain the intended population distribution. The recalibration is documented and the tier thresholds are public.

Frequently asked questions

If a judge abstains too often, does it stop being a useful panel member? Depends on AURRA. A judge with high AURRA whose abstentions are concentrated on genuinely hard cases is contributing useful refusal signal even with high abstention rate. A judge with flat AURRA whose abstentions are random is just adding latency.

Does calibrated refusal slow down evaluation throughput? Yes on hard cases that escalate. No on easy cases that the tier-one jury resolves cleanly. Throughput cost is concentrated on the cases that most need careful resolution; this is a feature.

Can the agent under evaluation game the abstention pattern? Gaming abstention requires producing outputs that systematically push judges into low confidence without changing the verdict. This is hard in practice because the same prompt structures that confuse one judge tend to confuse the others. Diversity in the panel is the protection.

What if the validation set is biased toward easy cases? The ECE and AURRA computed on a biased validation set will be biased estimates. The measurements will look better than reality. Build the validation set deliberately to span the difficulty distribution; if you cannot, treat the calibration metrics as upper bounds rather than truth.

Does this apply to deterministic checks? Deterministic checks do not need abstention; they pass or fail unambiguously. Calibrated refusal is specific to probabilistic LLM judges. The pattern is not relevant to assertion-style tests.

How often should the validation set be refreshed? Whenever the case mix in production drifts substantially or when judge models update. As a rough cadence, plan to refresh quarterly. Stale validation sets produce stale calibration estimates.

What happens to the agent's score on cases that escalated to human review? The human verdict is the verdict. It enters the score the same way an automated verdict would, with metadata indicating the escalation path. Human-reviewed verdicts also feed back into the validation set as ground truth labels.

Can buyers opt out of abstention and demand a verdict on every case? Yes, but at their own risk. The trust oracle supports a force-vote mode that disables abstention. The verdict metadata flags this mode so buyers downstream can see that the verdict was forced.

Bottom line

A jury that always votes is a jury that hallucinates confidence on the cases that should have escalated. Calibrated refusal is the discipline of letting judges admit when they do not know enough to vote, then routing those cases to a path that can resolve them properly. Done badly, it is just dropped traffic and operational chaos. Done well, it produces measurably better verdicts on hard cases and gives buyers honest confidence signals on the easy ones. The work is in the calibration measurement: build the validation set, compute ECE and AURRA per judge, tune the thresholds against the data, design the escalation path before deploying. Skipping any of those steps produces a system that looks calibrated but is not. Do the work.


