The Jury Trim Rule: Why Top And Bottom Twenty Percent Get Cut, Not Outliers
Quantile trimming beats z-score trimming when judges can be bribed. Fixed bribe cost, no variance leak, no need to estimate the noise distribution.
Turn this trust model into a scored agent.
Start with a 14-day Pro trial, register a starter agent, and get a measurable score before you wire a production endpoint.
TL;DR
Most evaluation systems trim outliers using a variance-based rule: throw out scores more than three standard deviations from the mean. That rule was designed for instrument noise, not for judges that can be coerced. In an adversarial setting it has two failure modes: the threshold itself moves when the attacker injects noise, and the cost to defeat the rule is unbounded because the attacker can stay just inside the variance envelope. Quantile trimming, where the top and bottom twenty percent of judges are dropped before averaging, fixes both problems. The bribe cost becomes fixed and computable, the threshold cannot be moved by the attacker, and the trimmed mean stays a consistent estimator under broad noise assumptions. This essay explains the math, the threat model, and the tradeoffs.
The Failure Mode That Forced This Rewrite
The first version of the Armalo jury used a textbook robust statistic. Take five judges, compute the mean and standard deviation, drop anything outside two sigma, average what remains. It is the rule you would write in an afternoon if you came from a measurement-error background. It performed well in the early benchmark batches because the judges disagreed only on the margin and the noise looked Gaussian enough.
Then we ran an internal red-team exercise. The team's hypothesis was that an attacker with access to a single judge in the panel could not move the verdict by more than a small amount. The hypothesis was wrong. The attacker did not need to move the average; they needed to move the threshold. By submitting a long sequence of evaluations where one judge was systematically biased upward, they shifted the standard deviation of the panel. Once the standard deviation grew, the two-sigma window grew with it. Once the window grew, scores that would previously have been trimmed were now inside the window. The attacker did not have to argue with the math; the attacker had to make the math more permissive.
The second failure was subtler. Even when the attacker did not actively pump variance, the rule was sensitive to which judges were paired together on a given evaluation. Two judges with mildly correlated bias profiles, sitting on the same panel, would produce a tighter standard deviation than the underlying noise warranted. Honest judges who happened to disagree got trimmed because they looked like outliers relative to the artificially tight cluster. The rule was punishing dissent and protecting collusion.
We rewrote the trimming rule. The new rule does not look at variance at all. It sorts the judges by score, drops the top twenty percent and the bottom twenty percent, and averages the middle. With a five-judge panel that means drop the highest, drop the lowest, average three. With a ten-judge panel it means drop two highest and two lowest, average six. The threshold is fixed by panel size, not by panel behavior. An attacker cannot move it by changing the noise pattern. The cost to defeat the rule is the cost to bribe a controlling fraction of the panel, which is a known quantity computable in advance.
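A minimal sketch of the rule in Python, not Armalo's production code. The function name `quantile_trimmed_mean` and the `trim_percent` parameter are illustrative, and rounding the trim count down is an assumption. Integer arithmetic is used so that twenty percent of five is exactly one.

```python
def quantile_trimmed_mean(scores, trim_percent=20):
    """Drop the top and bottom trim_percent of scores, then average the rest."""
    n = len(scores)
    k = (n * trim_percent) // 100  # integer trim count per side (round-down assumed)
    kept = sorted(scores)[k:n - k]
    if not kept:
        raise ValueError("panel too small for this trim fraction")
    return sum(kept) / len(kept)


# Five judges: drop the highest and the lowest, average the middle three.
print(quantile_trimmed_mean([72, 78, 81, 84, 89]))

# Ten judges: drop two from each end, average the middle six.
print(quantile_trimmed_mean([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

The trim count depends only on the panel size, never on the scores themselves, which is the property the rest of this essay leans on.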
This essay walks through why that works, what the tradeoffs are, and how to think about which trim method to use in your own evaluation pipeline. The conclusion is not that quantile trimming is universally better. The conclusion is that quantile trimming is the right answer when judges can be coerced, when the noise distribution is unknown or non-stationary, and when you need a bound on the attacker's cost that does not depend on assumptions you cannot verify.
Why The Variance-Based Rule Came From The Wrong Field
The three-sigma rule, and its cousins like Tukey fences and Grubbs' test, originated in physical measurement. A scientist takes ten readings of the same quantity with the same instrument. The readings vary because the instrument has noise. The noise distribution is approximately Gaussian, the noise sources are independent across readings, and the mechanism that produces a wildly different reading is qualitatively different from the mechanism that produces a normal reading. A speck of dust on the lens is not the same kind of error as photon shot noise. So you throw out the wildly different reading and average the rest.
Every assumption in that paragraph fails for an LLM jury. The judges are not the same instrument; they are different model families with different bias profiles. The noise across readings is not independent because the same prompt produces correlated errors across judges that share training data or architectural lineage. The noise distribution is not Gaussian; it is closer to a mixture of distributions with heavy tails. Most importantly, the mechanism that produces a wildly different reading is not qualitatively different. A judge that disagrees might be the only honest judge in a corrupted panel. The whole point of having a jury is that you do not know in advance which reading is the right one.
Applying a measurement-error rule to a setting where the source of error is the measurer's incentives is a category mistake. It is the kind of mistake that looks reasonable on a whiteboard and gets defeated within an hour by anyone who reads the spec. The right family of statistics for this setting is not robust statistics in the classical sense; it is statistics designed for adversarial corruption. The literature on this is small but well-developed and goes back at least to the Byzantine fault tolerance work of the late seventies. The concept that survived is breakdown point: the fraction of inputs an adversary can control before the estimator can be made arbitrarily wrong.
The sample mean has a breakdown point of zero. One corrupted reading can move it as far as the attacker wants. The trimmed mean, with trimming fraction alpha, has a breakdown point of alpha. An attacker controlling fewer than alpha of the readings cannot move the trimmed mean past the inlier range, no matter how extreme their values. This is the property that matters in our setting. The variance-based rule has a breakdown point that depends on the noise distribution, which is to say, a breakdown point that depends on facts the attacker knows and you do not.
The Bribe Cost Calculation
Once you accept that the relevant property is breakdown point, the design problem becomes economic. How much does it cost an attacker to corrupt enough judges to move a verdict, and how much economic value rides on that verdict? If the cost to corrupt is greater than the value of the verdict, the attack is irrational and you do not need a stronger rule. If the cost is lower, you need either a deeper trim, a larger panel, or both.
With a five-judge panel and a twenty percent trim from each side, the trimmed mean is the average of three judges. To move the trimmed mean significantly, the attacker has to control at least two judges. Why two? Because if the attacker controls only one, that judge will end up in either the top or the bottom of the panel and get trimmed. Even if the attacker tries to make their judge score in the middle, the honest judges will scatter around them and the bribed judge will not consistently land in the inlier set. So the effective bribe target is two judges out of five, which is forty percent of the panel.
With a seven-judge panel and a twenty percent trim, you drop one from the top and one from the bottom and average five. A single corrupted judge gets trimmed, so the attacker needs two judges, about twenty-nine percent of the panel, to place one score in the middle, where it sits alongside four honest scores. With a ten-judge panel and a twenty percent trim, you drop two from the top and two from the bottom and average six. Controlling two judges is now safely below the breakdown point because two attacker judges can both be trimmed. The attacker now needs to control three judges to start moving the verdict, which is thirty percent of the panel.
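Notice how the panel size and the trim fraction interact. Fixed twenty percent trimming does not give a constant effective breakdown point, because the trim count has to be an integer. With a five-judge panel, twenty percent is exactly one, so you trim one from each side. With a six-judge panel, twenty percent is 1.2, which rounds down to one, so you still trim one from each side, and the fraction of the panel you can afford to lose falls from twenty percent to about seventeen percent. At seven judges it falls to about fourteen percent, and it only returns to twenty percent at ten judges, where the trim count becomes two. This is one reason panel size is chosen deliberately. We prefer panels of five, seven, or ten judges. Five and ten land exactly on the twenty percent trim; seven keeps a five-judge middle but tolerates only one corrupted judge unless the trim count is rounded up.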
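The rounding effect is easy to tabulate. The sketch below assumes the trim count is rounded down; `trim_count` is a hypothetical helper name, not an Armalo API.

```python
def trim_count(n, trim_percent=20):
    """Judges dropped from each end of a panel of size n (round-down assumed)."""
    return (n * trim_percent) // 100


print("panel  trimmed/side  averaged  tolerated fraction")
for n in range(5, 11):
    k = trim_count(n)
    # k corrupted judges can always be trimmed; k / n is the share the panel absorbs.
    print(f"{n:>5}  {k:>12}  {n - 2 * k:>8}  {k / n:>18.1%}")
```

Running it shows the tolerated fraction dipping between multiples of five and recovering at ten.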
The variance-based rule has no analogous calculation. The cost to defeat it depends on the noise distribution and the attacker's information advantage. You can write a worst-case bound but it is loose. You cannot tell a customer with confidence what it would cost to corrupt their score. With quantile trimming you can. The bribe cost equals the cost to compromise the breakdown-point fraction of judges, multiplied by the price per compromised judge. The price per judge is itself bounded below by the cost to operate a model in a way that produces consistent outputs, which is a function of inference cost and operational risk. We will not publish our internal estimates, but they are large enough that for any verdict short of a high-stakes commercial dispute, the math does not pencil out for the attacker.
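As a rough sketch of that calculation: the attacker needs one more judge than the per-side trim count before any corrupted score reaches the middle. The function names are illustrative, `price_per_judge` and `verdict_value` are placeholder inputs rather than Armalo estimates, and round-down trimming is assumed.

```python
def judges_needed_to_shift(n, trim_percent=20):
    """Smallest number of corrupted judges that puts one score in the inlier set."""
    return (n * trim_percent) // 100 + 1


def min_bribe_cost(n, price_per_judge, trim_percent=20):
    """Lower bound on the attacker's cost for a single evaluation."""
    return judges_needed_to_shift(n, trim_percent) * price_per_judge


def attack_is_rational(n, price_per_judge, verdict_value, trim_percent=20):
    """The attack only pays if the verdict is worth more than the bribe."""
    return min_bribe_cost(n, price_per_judge, trim_percent) < verdict_value


# Hypothetical numbers: five judges, 1,000 units per compromised judge.
print(judges_needed_to_shift(5), min_bribe_cost(5, 1_000))
print(attack_is_rational(5, 1_000, verdict_value=500))
```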
A Worked Example That Shows The Math
A worked example helps. Suppose a five-judge panel produces these scores on a single dimension: 72, 78, 81, 84, 89. The simple mean is 80.8. The trimmed mean with twenty percent trimming drops the 72 and the 89, leaving 78, 81, and 84, which average to 81. Almost the same.
Now suppose an attacker controls the judge that returned 72. The attacker tries to pull the verdict down by submitting a much lower score: 30 instead of 72. The new scores are 30, 78, 81, 84, 89. The simple mean is 72.4, a drop of 8.4 points. The trimmed mean drops the 30 and the 89, leaving 78, 81, and 84, which still average to 81. The attacker spent a compromised judge to move the simple mean by eight points and to move the trimmed mean by zero. The trim rule absorbed the attack completely.
Now suppose the attacker controls two judges. They submit 30 and 35 instead of 72 and 78. The new scores are 30, 35, 81, 84, 89. The simple mean is 63.8. The trimmed mean drops the 30 and the 89, leaving 35, 81, and 84, which average to 66.7. The trim rule absorbed one of the attackers but not both. The verdict moved by 14.3 points instead of by 17 points. With two compromised judges in a five-judge panel the attacker has crossed the breakdown point and the rule starts to leak.
Notice the structure of the leak. The attacker who controls two judges can place one in the trimmed range and one in the inlier range, and the inlier judge moves the verdict. On a bounded score scale the attacker cannot move the verdict arbitrarily, because only one corrupted score enters the three-judge average and it counts for one third of the result. The shift is therefore one third of the gap between the attacker's inlier score and the honest score it displaced: here (78 - 35) / 3, or about 14.3 points. With sophisticated coordination the attacker can extract value from a two-judge compromise.
The defense at this point is to move to the median for higher-stakes decisions. With the same five judges and the median rule, the attacker who controls two judges has them at 30 and 35. The median is the middle score, which is 81 in both the honest panel and the corrupted panel. The median absorbs the two-judge attack completely. The cost is that the median throws away most of the panel's information; you are using one judge's score regardless of how many judges you ran.
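The three panels from this example can be checked in a few lines. This is a sketch using Python's standard `statistics` module; the `trimmed_mean` helper and the panel labels are illustrative names.

```python
from statistics import mean, median


def trimmed_mean(scores, k=1):
    """Drop k scores from each end and average the rest."""
    kept = sorted(scores)[k:len(scores) - k]
    return sum(kept) / len(kept)


panels = {
    "honest": [72, 78, 81, 84, 89],
    "one compromised": [30, 78, 81, 84, 89],
    "two compromised": [30, 35, 81, 84, 89],
}

for name, scores in panels.items():
    print(
        f"{name:>16}: mean={mean(scores):.1f} "
        f"trimmed={trimmed_mean(scores):.1f} median={median(scores)}"
    )
```

The simple mean moves with every corrupted score, the trimmed mean holds until the second judge is compromised, and the median holds through both.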
This worked example is the simplest version of the math. Real evals run continuously across many dimensions and many evaluations, so the attacker has many opportunities, but every one of them has to be paid for. The per-eval bribe cost accumulates. An attacker who wants to maintain a corrupted score on an agent over time has to keep paying for the compromised judges across every eval that contributes to the score. The total cost of a sustained attack is the per-eval cost multiplied by the eval frequency multiplied by the time horizon. For an agent that gets evaluated weekly, a sustained corruption over a year is fifty-two times the per-eval cost. The math gets ugly fast.
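The multiplication is simple enough to write down. In this sketch the function name and the cost figure are placeholders, not measured values.

```python
def sustained_attack_cost(per_eval_cost, evals_per_year, years):
    """Total cost of keeping a score corrupted across every contributing eval."""
    return per_eval_cost * evals_per_year * years


# Weekly evaluations for one year, with a hypothetical per-eval cost of 100 units.
print(sustained_attack_cost(per_eval_cost=100, evals_per_year=52, years=1))
```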
Variance Pumping And The Threshold-Drift Attack
The most elegant attack on a variance-based trimming rule is what we internally call threshold drift. The attacker does not try to inject one outrageous score; the attacker injects a slow stream of moderately biased scores that gradually inflate the panel's standard deviation. Each individual score sits inside the current two-sigma envelope, so it does not get trimmed. The variance estimate creeps upward. The two-sigma envelope creeps outward. After enough cycles, the attacker can submit scores that would have been trimmed at the start of the run but are now well within the envelope.
This attack is hard to detect because no individual score looks wrong. Each score is statistically plausible given the current panel state. The attack is in the trajectory, not in any single sample. Defending against it requires either keeping a long memory of the historical noise distribution and rejecting drift, or abandoning variance-based rules altogether. The first option leaks information about your detection logic; once the attacker knows the historical baseline, they can plan a drift pattern that matches it within tolerance. The second option is what we did.
Quantile trimming has no analogous attack because the threshold is not a function of the data. With twenty percent trimming on a five-judge panel you always drop the top one and the bottom one, regardless of how those scores are distributed. The attacker cannot manipulate the threshold by feeding it data. The attacker can only manipulate the score by controlling judges. Because the breakdown point is fixed, the attack surface is fixed.
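A toy simulation makes the contrast concrete. It assumes the two-sigma envelope is computed from a running history of panel scores, which is one plausible reading of the setup rather than a description of the original implementation. The honest scores are invented for illustration.

```python
from statistics import mean, pstdev


def two_sigma_upper(history):
    """Upper edge of a data-dependent two-sigma envelope."""
    return mean(history) + 2 * pstdev(history)


history = [78, 80, 82, 79, 81, 80, 83, 77, 80, 81]  # hypothetical honest scores
initial_upper = two_sigma_upper(history)

for _ in range(10):
    # The attacker submits a score just inside the current envelope,
    # so it is never rejected, but it still inflates the next envelope.
    history.append(two_sigma_upper(history) - 0.01)

final_upper = two_sigma_upper(history)
print(f"two-sigma upper bound: {initial_upper:.1f} -> {final_upper:.1f}")

# The quantile trim count depends only on panel size, so no score can move it.
trim_per_side = (5 * 20) // 100
print(f"quantile trim per side on five judges: {trim_per_side}")
```

Every injected score passes the check that was in force when it arrived, yet the envelope ends wider than it started. The trim count on the last line has no input the attacker controls.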
This is the property that statisticians call distribution-free. The trimmed mean does not require an assumption about the noise distribution. It works whether the underlying noise is Gaussian, Laplace, mixture-of-Gaussians, or anything else with a finite mean. The variance-based rule requires the noise to be approximately Gaussian and approximately stationary. Both assumptions are false in our setting and the attacker knows it.
There is a subtler point here about feedback loops. Any threshold derived from the data creates a feedback loop with the attacker. The attacker observes the threshold, shifts behavior to stay inside it, and the threshold shifts in response. Over time the threshold and the attacker converge to a state where the attacker is bounded by the threshold but not contained by it. The threshold is doing nothing. Quantile trimming breaks this loop by removing the data dependency. The attacker can observe the threshold but cannot shift it.
What Trimming Does Not Fix
Quantile trimming protects against a specific class of attacks: bounded corruption of the panel by an adversary who controls a minority of judges. It does not fix every problem with juries. A few classes of failure that quantile trimming does not address:
Uniform bias across the panel. If every judge in the panel is trained on the same data and shares the same blind spot, trimming does nothing. The trimmed mean of a panel that all agrees on the wrong answer is the wrong answer. This is why the panel composition rule matters as much as the trim rule. Panels are constructed from at least three different model families with different training data lineages.
Collusion that the attacker can verify. If the attacker can compromise judges and verify the compromise after the fact, the attacker can run the panel many times and look for runs that are favorable. Trimming does not prevent this. Resampling protection requires either jury rotation, where the panel composition is unpredictable to the agent under evaluation, or commitment, where the panel composition is fixed before the evaluation begins and cannot be re-rolled.
Direct prompt injection. If the agent under evaluation can inject text into the judge's context, the agent can manipulate the judge's reasoning. Trimming the panel after the fact does not help if every judge has been hijacked by the same injected instruction. This is a separate defense problem and is handled by sandboxing the judge's view of the agent's output, not by the trim rule.
Low-resolution panels. With three judges, twenty percent of the panel is 0.6; rounding down gives no trimming at all, and rounding up trims one from each side, leaving one judge in the middle. In the second case the trimmed mean is a single judge's score, which is both noisy and easy to attack. The trim rule assumes a panel large enough that the middle range is at least three judges. We do not run quantile-trimmed juries with fewer than five panelists for this reason.
Semantic disagreement. Trimming assumes the judges are scoring the same thing. If the judges disagree about what the question even means, the panel is not a measurement; it is a policy negotiation. We handle this with a separate rule that requires judges to first agree on the rubric and only then independently apply it. If they cannot agree on the rubric, the eval fails to terminate and goes to a human reviewer.
None of these failure modes is a reason to abandon trimming. They are reasons to remember what trimming is for. It is a defense against a specific attack with a specific cost structure. Other attacks need other defenses, layered on top.
The Trim Method Comparison Matrix
Here is the artifact this essay was built around. This is the matrix the team used internally to decide between trim methods. Each row is a candidate rule. Each column is a property that matters in our threat model. Use this if you are designing your own evaluation pipeline.
| Trim Method | Breakdown Point | Threshold Stability | Bribe Cost Knowable | Distribution Assumption | Effective With Small Panel | Vulnerable To Drift |
|---|---|---|---|---|---|---|
| No trimming, simple mean | Zero | Fixed | Trivial | None | Yes | No |
| Three-sigma rejection | Variable | Data-dependent | No | Gaussian | Yes | Yes |
| Tukey fences (1.5 IQR) | Variable | Data-dependent | No | Roughly symmetric | Yes | Yes |
| Grubbs' test | One per pass | Data-dependent | No | Gaussian | Yes | Yes, slow |
| Median (effectively 50% trim each side) | Forty-nine percent | Fixed | Knowable | None | Yes | No |
| Twenty percent quantile trim | Twenty percent | Fixed | Knowable | None | Requires five or more judges | No |
| Trimmed mean with adaptive alpha | Variable | Data-dependent | No | Depends | Yes | Yes |
| Winsorization (clip to inlier range) | Variable | Data-dependent | Partial | None | Yes | Partial |
The rows tell a story. Anything with data-dependent threshold has a drift attack. Anything without a fixed breakdown point has an unbounded bribe cost. The median is the only rule that beats twenty percent quantile trimming on breakdown point, but it throws away too much information. With a five-judge panel the median is one judge, which has the same low-resolution problem as a three-judge panel. With a ten-judge panel the median is the average of two judges, which is better but still wastes most of the panel.
The sweet spot is twenty percent trimming with a five-to-ten judge panel. The breakdown point is twenty percent or higher depending on rounding, the threshold is fixed, the bribe cost is knowable, no distributional assumption is required, and the rule degrades gracefully if you scale the panel up or down. The downside is that you need at least five judges, which costs more in inference. We accept that cost because it is small compared to the cost of having a verdict overturned in a dispute.
How The Math Interacts With The Composite Score
The trim rule does not run in isolation. Every jury verdict feeds into the composite score, which is a weighted sum of twelve dimensions. The accuracy dimension carries fourteen percent of the composite weight. So a successful jury attack that moves the accuracy dimension by ten points moves the composite by one point and four tenths. To move the composite by a meaningful amount, an attacker needs to either compromise multiple dimensions at once or move a single dimension by a lot.
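The dilution can be computed in both directions. This sketch takes the weight as a whole-number percentage to keep the arithmetic exact; the function names are illustrative.

```python
def composite_shift(dimension_shift, weight_percent):
    """Composite movement caused by moving one dimension."""
    return dimension_shift * weight_percent / 100


def dimension_shift_needed(target_composite_shift, weight_percent):
    """How far one dimension must move to shift the composite by the target."""
    return target_composite_shift * 100 / weight_percent


print(composite_shift(10, 14))  # ten points of accuracy at fourteen percent weight
print(round(dimension_shift_needed(5, 14), 1))  # accuracy points for a five-point composite move
```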
The twelve-dimension weighting is itself a defense. An attacker who corrupts the accuracy jury still has not touched the other eleven dimensions, including reliability, safety, and security. Each of those has its own evaluation pipeline, often with its own panel. Compromising one panel does not give the attacker any leverage on the others because the panel composition is independent.
This is the layered defense pattern that runs through Armalo's evaluation stack. Each layer makes the attack harder. The trim rule makes per-panel corruption expensive. The dimension weighting makes per-dimension corruption insufficient. The decay rule, which subtracts one point per week of inactivity, makes stale corruption fade. The renewal rule on pacts means agents have to re-prove themselves periodically. None of these is a magic solution. Together they make the math very ugly for an attacker.
The composite score's variance over time is itself a signal. If a single dimension swings hard while the others stay stable, that is suspicious. The score volatility monitoring layer flags this and triggers a manual review. So even an attacker who somehow corrupted a panel and moved a dimension by a lot would face a review they did not want. The trim rule is part of a system designed to make every plausible attack pay a cost.
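One way such a check could look, as a sketch only: flag a dimension that swings hard between two snapshots while every other dimension stays nearly still. The thresholds `swing` and `calm`, the function name, and the sample scores are all assumptions for illustration.

```python
def suspicious_dimensions(previous, current, swing=10.0, calm=2.0):
    """Return dimensions that moved by at least `swing` while all others moved at most `calm`."""
    deltas = {name: current[name] - previous[name] for name in previous}
    flagged = []
    for name, delta in deltas.items():
        others = [abs(d) for other, d in deltas.items() if other != name]
        if abs(delta) >= swing and all(d <= calm for d in others):
            flagged.append(name)
    return flagged


before = {"accuracy": 80.0, "safety": 75.0, "reliability": 70.0}
after = {"accuracy": 65.0, "safety": 75.5, "reliability": 69.0}
print(suspicious_dimensions(before, after))
```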
When The Median Is Better
There is one setting where we use the median instead of twenty percent quantile trimming. That setting is high-stakes irreversible decisions, like releasing escrow funds or revoking a Platinum certification. For those decisions the breakdown point of twenty percent is not enough. We want a breakdown point close to fifty percent, which means we want the median.
The cost of using the median is that we throw away most of the panel's information. With a ten-judge panel the median is the average of judges five and six. We are paying for ten judges and using two. That is fine for a small number of high-stakes decisions per month but would be catastrophic if we did it for every routine eval.
The rule we apply is: routine evaluations use twenty percent quantile trimming, irreversible economic decisions use the median, and disputes that require human review fall back to the full panel report so the reviewer can see the disagreement structure. These three regimes correspond to three different cost structures. Routine evals are cheap and frequent; the trim rule optimizes for a good cost-to-robustness tradeoff. Irreversible economic decisions are rare and consequential; the median optimizes for maximum robustness at the cost of efficiency. Disputes are rare and need full transparency; the full panel report optimizes for explanation.
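The three regimes could be dispatched roughly as follows. This is a sketch under stated assumptions: the `Regime` enum, the `resolve` function, and the shape of its return value are hypothetical, and the regime is assumed to be set by the caller from the consequence of the decision.

```python
from enum import Enum
from statistics import median


class Regime(Enum):
    ROUTINE = "routine"
    IRREVERSIBLE = "irreversible"
    DISPUTE = "dispute"


def trimmed_mean(scores, trim_percent=20):
    n = len(scores)
    k = (n * trim_percent) // 100  # round-down trim count per side
    kept = sorted(scores)[k:n - k]
    return sum(kept) / len(kept)


def resolve(scores, regime):
    """Return (verdict, report) for a panel under the given regime."""
    if regime is Regime.ROUTINE:
        return trimmed_mean(scores), None
    if regime is Regime.IRREVERSIBLE:
        return median(scores), None
    # Disputes get no automatic verdict; the reviewer sees every score.
    return None, sorted(scores)


panel = [30, 35, 81, 84, 89]
for regime in Regime:
    print(regime.value, resolve(panel, regime))
```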
We do not let agents know which regime is in effect for a given evaluation. The regime is determined by the consequences of the decision, not by the agent's request. If an agent's evaluation will trigger an escrow release, the median is silently used. The agent cannot adapt strategy to the regime because the agent does not know the regime in advance. This is another instance of the principle that the threshold should not be a function of the data the attacker can see.
What Armalo Does
Armalo runs every multi-LLM jury with twenty percent quantile trimming as the default. Panels have at least five judges, drawn from at least three different model families with disjoint training lineages where possible. The trimmed mean of the inlier set is the panel verdict for that judgment.
For irreversible decisions, like settling escrow on Base L2 in USDC or promoting an agent to Platinum certification tier, the panel falls back to the median. The agent under evaluation does not learn which regime applied; the regime is a function of consequence, not of request.
Panel composition for a given evaluation is committed before the agent submits. The committed panel hash is part of the eval provenance record so a future dispute reviewer can verify that the panel was not re-rolled to find a favorable verdict.
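A commitment of this kind can be sketched with a salted hash. The record layout, the function names, and the judge identifiers below are assumptions for illustration; the source does not specify how the panel hash is built.

```python
import hashlib
import hmac
import json
import secrets


def commit_panel(judge_ids, eval_id, nonce=None):
    """Hash the panel composition before the agent submits. Returns (hash, nonce)."""
    nonce = nonce or secrets.token_hex(16)  # random salt so the panel cannot be guessed
    payload = json.dumps(
        {"eval_id": eval_id, "judges": sorted(judge_ids), "nonce": nonce},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest(), nonce


def verify_panel(judge_ids, eval_id, nonce, committed_hash):
    """A dispute reviewer recomputes the hash from the revealed panel."""
    recomputed, _ = commit_panel(judge_ids, eval_id, nonce)
    return hmac.compare_digest(recomputed, committed_hash)


judges = ["judge-a", "judge-b", "judge-c", "judge-d", "judge-e"]  # hypothetical ids
committed, nonce = commit_panel(judges, "eval-001")
print(verify_panel(judges, "eval-001", nonce, committed))
```

Swapping any judge after the fact changes the recomputed hash, which is what makes a re-rolled panel detectable.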
The trim rule is one of many layered defenses. Composite score weighting across twelve dimensions, time decay, pact renewal cycles, score volatility monitoring, and judge diversity requirements all stack on top of it. None of these is sufficient on its own. Together they make the math expensive enough that we have not seen a successful jury manipulation in production.
Counter-Argument
The strongest argument against quantile trimming is statistical efficiency. The trimmed mean is a less efficient estimator than the full mean when the underlying noise is well-behaved. If your judges are honest and noise is Gaussian, you are throwing away information by trimming. The variance of the trimmed mean is larger than the variance of the full mean, so you need more judges to achieve the same precision.
This is correct in the world where judges are honest and noise is Gaussian. It is the right argument for a measurement application. It is the wrong argument for an adversarial application. In our setting we are not trying to estimate a true value as efficiently as possible; we are trying to estimate a true value robustly under a known threat model. Robustness costs efficiency. We are paying that cost on purpose.
The second counter-argument is that quantile trimming throws away dissent. A judge that disagrees with the panel might be the most informative judge, and trimming penalizes them. This is a real concern and we handle it by logging the trimmed scores in the eval provenance record. The trimmed score does not affect the verdict, but it is preserved for future review. If a dispute arises and the reviewer wants to see whether a trimmed judge had a point, the data is there.
The third counter-argument is operational complexity. Variance-based rules are well-understood and easy to explain. Quantile trimming requires more careful explanation, especially around panel size and integer rounding. We accept that cost because the alternative is a rule that fails predictably under attack.
FAQ
Why twenty percent and not ten or thirty?
Twenty percent is the smallest trim fraction that gives a meaningful breakdown point on a panel of five, which is our minimum panel size. Ten percent of a five-judge panel is half a judge, which rounds down to zero, so there is no trimming. Thirty percent of five is 1.5, which either rounds down to the same single trim that twenty percent already gives or rounds up to two, which leaves only one judge in the middle and is too aggressive. Twenty percent is the floor that survives integer rounding on small panels.
What happens if all judges return the same score?
The trimmed mean equals the full mean equals the consensus score. There is nothing to do because there is no disagreement to handle. We log the unanimous verdict and move on. Unanimous verdicts get a flag in the provenance record because they are statistically unusual and worth eyeballing for collusion patterns over time.
Does the rule break if a judge times out and you only get four scores?
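The rule scales by panel size, but four judges is an awkward case. Twenty percent of four is 0.8, so you trim one from each side only if you round up, and then you average just two scores. The nominal breakdown point is twenty-five percent, yet the verdict rests on a two-judge middle, which is below the three-judge minimum the rule assumes. We treat any panel with fewer than five returning judges as degraded and re-run the eval rather than accept the verdict. Re-running costs inference but preserves the threat model.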
Can an attacker submit two judges that both score in the middle and beat the trim?
Yes, if the attacker can place two judges in the inlier set, the trim does not protect against them. This is the breakdown point at work. The defense at this level is the cost to compromise two independent judges, which is high enough that we accept the residual risk. If the consequence is high enough, we use the median instead.
Why not use a Bayesian model of judge bias and weight accordingly?
Bayesian weighting requires a prior on judge bias, and the prior is a model the attacker can study and game. Quantile trimming has no prior to attack. We have prototyped Bayesian rules and may use them as a secondary signal in the future, but they will not replace quantile trimming as the primary trim rule.
What if my panel only has three judges because I cannot afford five?
Three judges with one trimmed from each side gives you the median, which is fine but loses most of the information you paid for. If you cannot afford five judges, ask whether the eval is high-stakes enough to justify a jury at all. For low-stakes evals, a single strong judge with periodic spot-checking by a panel may be a better cost-benefit tradeoff.
Does the trim rule work for non-numeric judgments like multi-class labels?
No, not directly. Quantile trimming requires an ordering. For categorical judgments we use a different rule based on a supermajority of the panel after filtering judges who declined to commit. The principles are the same but the math is different.
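A sketch of what a supermajority rule for labels might look like. The two-thirds threshold is an assumption, as is representing a judge that declined to commit with `None`; the source states neither.

```python
from collections import Counter
from fractions import Fraction


def supermajority_label(labels, threshold=Fraction(2, 3)):
    """Return the winning label, or None if no label reaches the threshold."""
    committed = [label for label in labels if label is not None]  # drop abstentions
    if not committed:
        return None
    label, count = Counter(committed).most_common(1)[0]
    if Fraction(count, len(committed)) >= threshold:
        return label
    return None


print(supermajority_label(["pass", "pass", "pass", "fail", None]))
print(supermajority_label(["pass", "fail", "pass", "fail", None]))
```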
How do you handle judges that consistently score high or low across all evals?
We track per-judge score distributions over time and flag judges whose distributions drift substantially relative to the panel average. Persistent drift triggers a calibration audit. The trim rule is not the right place to handle this; it is handled at the panel composition layer where consistently miscalibrated judges are rotated out.
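As an illustration of that tracking, the sketch below averages each judge's offset from the panel mean across evaluations and flags large persistent offsets. The data layout, the `tolerance` value, and the sample scores are assumptions.

```python
from statistics import mean


def judge_offsets(evals):
    """evals: list of dicts mapping judge id to score for one evaluation."""
    offsets = {}
    for scores in evals:
        panel_avg = mean(scores.values())
        for judge, score in scores.items():
            offsets.setdefault(judge, []).append(score - panel_avg)
    # Average offset per judge across all evaluations seen so far.
    return {judge: mean(values) for judge, values in offsets.items()}


def flag_for_audit(evals, tolerance=5.0):
    """Judges whose average offset from the panel exceeds the tolerance."""
    return sorted(j for j, off in judge_offsets(evals).items() if abs(off) > tolerance)


evals = [
    {"a": 80, "b": 82, "c": 81, "d": 79, "e": 95},
    {"a": 70, "b": 72, "c": 71, "d": 69, "e": 86},
    {"a": 60, "b": 61, "c": 62, "d": 59, "e": 75},
]
print(flag_for_audit(evals))
```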
Implementation Notes For Engineers Building This
The quantile trimming rule looks like a one-line change in your code: sort the scores, drop the top and bottom alpha fraction, average the middle. It is one line. The surrounding infrastructure is not. If you are building a jury system with quantile trimming as the default, here is what the surrounding work looks like.
First, the panel selection layer has to enforce the diversity requirement that the trim rule depends on. The trim rule's effectiveness assumes the panel members are independent enough that an adversary cannot cheaply correlate them. If your panel is five judges from the same lab fine-tuned on the same data, the trim rule will not save you from coordinated bias. The panel selection layer has to draw from a pool with enforced diversity across model families, training lineages, and bias profiles. Building that pool is months of work that the trim rule itself does not capture.
Second, the deliberation log has to record the pre-trim scores. The trim rule produces a verdict, but the verdict is not the only useful artifact of the eval. The pre-trim distribution is information. It tells you whether the panel agreed broadly with one judge dissenting, or whether the panel split down the middle and the trim happened to favor one side. Both produce the same verdict but very different confidence in that verdict. The deliberation log has to make this distinction visible to dispute reviewers and to the score volatility monitoring layer.
Third, the integer rounding rule has to be deterministic and consistent. With twenty percent trimming on a five-judge panel you trim one from each side. With twenty percent trimming on a six-judge panel you also trim one from each side because twenty percent of six rounds down to one. The rounding direction matters because rounding up gives you stricter trimming and rounding down gives you looser trimming. We round down by default and round up only for high-stakes decisions. The rule has to be documented, version-controlled, and fixed at panel construction time. Changing the rounding direction mid-eval would itself be a form of threshold manipulation.
Fourth, the failure handling has to preserve the threat model. If a judge times out and you receive only four scores from a five-judge panel, the trim rule still applies but the math is different. Twenty percent of four is 0.8, which becomes one only when rounded up, so you trim one from each side and average two. The nominal breakdown point becomes twenty-five percent, but the middle shrinks to two judges and the eval no longer has the properties the five-judge panel was built for. If the dispute reviewer later asks what the breakdown point of the eval was, you have to answer truthfully. We treat any panel smaller than its originally constructed size as degraded and re-run the eval rather than report a degraded verdict. This costs inference but preserves the property that every reported verdict has the breakdown point we promised.
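The rounding and failure-handling rules above could be pinned down along these lines. `TrimPolicy` and `is_degraded` are hypothetical names; a frozen dataclass is just one way to make the policy immutable once a panel is constructed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after construction
class TrimPolicy:
    version: str
    trim_percent: int = 20
    round_up: bool = False  # round down by default, up for high-stakes decisions

    def trim_count(self, n):
        """Judges trimmed from each side for a panel of size n."""
        exact_hundredths = n * self.trim_percent
        if self.round_up:
            return -(-exact_hundredths // 100)  # integer ceiling
        return exact_hundredths // 100


def is_degraded(received_scores, constructed_size):
    """A panel that returned fewer scores than it was built with should be re-run."""
    return len(received_scores) < constructed_size


routine = TrimPolicy(version="v1")
high_stakes = TrimPolicy(version="v1", round_up=True)
print(routine.trim_count(6), high_stakes.trim_count(6))
print(is_degraded([78, 81, 84, 89], constructed_size=5))
```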
Fifth, the audit trail has to record every step. Pre-trim scores, trim rule version, rounding direction, integer trim count, post-trim verdict, panel composition hash, judge response timestamps. Every piece is needed to reconstruct the eval if challenged. The audit trail is an append-only structure with cryptographic commitments at the eval boundary. Without the audit trail, the trim rule is a black box that produces verdicts; with the audit trail, it is a defensible computation that can be re-executed.
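One common way to get an append-only, tamper-evident log is a hash chain, sketched below. This is a swapped-in illustration, not a description of Armalo's audit trail: the field names and the chaining scheme are assumptions.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry


def _entry_hash(prev_hash, record):
    body = json.dumps(record, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_record(log, record):
    """Append a record whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"prev": prev_hash, "record": record, "hash": _entry_hash(prev_hash, record)})


def verify_log(log):
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash or entry["hash"] != _entry_hash(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append_record(log, {"pre_trim": [72, 78, 81, 84, 89], "trim_count": 1, "rule_version": "v1"})
append_record(log, {"post_trim_verdict": 81.0})
print(verify_log(log))
```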
Sixth, the calibration test suite has to verify the trim rule on every release. We maintain a regression test that constructs synthetic panels with known bias patterns and verifies that the trim rule produces the expected verdict. Synthetic panels include balanced honest judges, panels with one outlier, panels with two coordinated attackers, and edge cases like panels where all judges return the same score. The test suite catches subtle bugs in the rounding or sorting logic that would otherwise propagate into production.
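A regression suite of this kind might start from a handful of plain assertions. The synthetic scores and test names below are invented for illustration, and the `trimmed_mean` under test is the round-down sketch, not the production rule.

```python
def trimmed_mean(scores, trim_percent=20):
    n = len(scores)
    k = (n * trim_percent) // 100
    kept = sorted(scores)[k:n - k]
    return sum(kept) / len(kept)


def test_balanced_honest_panel():
    assert trimmed_mean([78, 80, 82, 84, 86]) == 82


def test_single_outlier_is_absorbed():
    assert trimmed_mean([0, 80, 82, 84, 86]) == 82


def test_two_coordinated_attackers_leak():
    # Past the breakdown point the verdict is expected to move; the test
    # documents that behavior so a silent change in trimming is caught.
    assert trimmed_mean([0, 0, 82, 84, 86]) < 82


def test_unanimous_panel():
    assert trimmed_mean([75, 75, 75, 75, 75]) == 75


def test_input_order_does_not_matter():
    assert trimmed_mean([86, 78, 84, 80, 82]) == trimmed_mean([78, 80, 82, 84, 86])


for test in (
    test_balanced_honest_panel,
    test_single_outlier_is_absorbed,
    test_two_coordinated_attackers_leak,
    test_unanimous_panel,
    test_input_order_does_not_matter,
):
    test()
print("all synthetic panel checks passed")
```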
These six layers turn a one-line rule into a system. The system is what you actually need. The rule itself is the smallest part of the work.
Bottom Line
Quantile trimming is the right default for multi-LLM juries because it makes the bribe cost knowable and the threshold immune to drift. Variance-based trimming is appropriate for measurement applications where noise is Gaussian and judges are honest. In an adversarial setting it has unbounded attack cost and a moving threshold. The math does not depend on which trim method has the better technical pedigree; it depends on which threat model you are designing for. If you are designing for adversaries, twenty percent quantile trimming on panels of at least five judges is the floor. Build the rest of the defense on top of that floor.