Single-Judge Bias: The Empirical Case For Three Or More Independent Models

2026-06-21 · 22 min · Armalo Team

A single LLM judge has bias profiles you cannot see. Length bias, position bias, self-preference, sycophancy. Three independent model families is the floor.

TL;DR

A single LLM judge has bias profiles you cannot see at decision time and may not be able to detect after the fact. Length bias, where the judge prefers longer responses regardless of quality. Position bias, where the judge prefers whichever option appears first or last. Self-preference, where the judge favors outputs from its own model family. Sycophancy, where the judge agrees with whatever the prompt implies it should agree with. Each of these has been measured empirically in published research and replicated in our internal evaluations. The bias magnitudes are not small. The fix is not better prompting; it is panel diversity. Three or more independent model families is the floor for any evaluation that will inform a consequential decision. This essay walks through the bias profiles, the empirical evidence, and the diversity scorecard we use to construct panels.

The Day We Discovered Our Best Judge Was Lying

In the early platform days we had a favorite judge. It was the strongest available model from one major lab, it scored well on our calibration set, it gave articulate reasoning, and it was fast. We used it as the primary judge in many evaluations. We thought we had a good system.

Then we ran a controlled experiment. We took two hundred eval pairs where two agent responses had been rated by our judge. For each pair, we constructed a swapped version where the order of the two responses was reversed. The judge had originally scored response A above response B. In the swapped version, we presented response B first and response A second. The judge now scored response B above response A. Same content, different order, opposite verdict.

The rate at which this happened was not small. Across the two hundred pairs, the judge changed its verdict due to position alone in about thirty percent of cases. For pairs where the underlying quality difference was small, the rate was higher. For pairs where the underlying quality difference was large, the rate was lower but not zero. Our favorite judge was making thirty percent of its decisions on a basis we had not considered.

We ran the same experiment with several other judges. The bias was present in all of them, with varying magnitudes. The strongest of these models had a position bias around twenty-five percent. The weakest had a position bias around forty percent. None had a position bias close to zero.

This was a sobering experiment because it called into question every eval we had run with a single judge. Some of those evals had informed certifications that we had charged operators for. Some had contributed to public composite scores. We had not been deceiving anyone deliberately, but we had been treating a flawed instrument as if it were precise.

The response was to rebuild the evaluation pipeline around panels of at least three independent judges drawn from different model families, with consistent counter-balancing of position to detect and discount position bias. The cost was a multiple of what single-judge evaluations had cost. The benefit was an evaluation system whose verdicts were defensible.

This essay walks through the four major bias profiles, the empirical evidence for each, why panel diversity is the right defense, and the diversity scorecard we use to construct panels. The conclusion is that single-judge evaluation is appropriate only for low-stakes decisions where bias on the order of twenty to forty percent is acceptable. For any decision where the verdict will inform certification, settlement, or commercial relationships, three or more model families is the floor.

Length Bias And The Verbosity Trap

Length bias is the tendency of LLM judges to prefer longer responses regardless of underlying quality. The bias is well-documented in academic research and is robust across model families and prompt formulations. We have replicated it internally in every judge we have tested.

The mechanism is intuitive. A longer response provides more surface for the judge to find evidence of effort, comprehensiveness, or thoughtfulness. The judge interprets length as a proxy for quality even when length is not correlated with quality. A response that is twice as long as a competitor is rated higher even when the additional length is filler, repetition, or wandering.

The magnitude varies by judge but is consistently substantial. In controlled experiments where we varied response length while holding content quality constant, judges preferred longer responses in roughly sixty percent of cases when the length difference was large. The bias persisted when we explicitly instructed the judge to ignore length. The bias persisted when we provided rubrics that did not mention length. The bias is in the model's fundamental preference structure and cannot be prompted away.

Length bias has direct consequences for agents that learn from judge feedback. An agent training against a length-biased judge will learn to produce verbose responses. The verbosity is not desired by the user; it is an artifact of the training signal. Once the agent learns to be verbose, the judge rates it higher, the operator sees the higher score, and the operator promotes the verbose behavior. The feedback loop produces agents that are worse for users while looking better on evaluations.

The defense against length bias at the panel level is to use judges with different length sensitivities. Some judges are more length-biased than others; mixing them dilutes the bias. We measure each judge's length sensitivity on our calibration set and require the panel to span a range of sensitivities. A panel where every judge is heavily length-biased produces verdicts that reflect length, not quality. A panel that mixes high-sensitivity and low-sensitivity judges produces verdicts in which the length preference is diluted and quality carries more of the weight.

At the rubric level, the defense is to make length explicit and bounded. We specify response length expectations in the rubric and penalize responses that exceed the expectation. This works partially. It does not eliminate the bias but it brings it within a tolerable range. The bias still leaks through in cases where two responses are both within the expected length range and the longer one wins.

Length bias is the most important bias to be aware of when designing eval rubrics because it is invisible to the judge and to the operator. The judge does not know it is being biased. The operator does not see the bias in the verdict. Only controlled experiments with length variation surface the bias. Most eval pipelines never run those experiments and never know the magnitude of the bias they are absorbing.

Position Bias And The Order Effect

Position bias is the tendency of LLM judges to prefer responses based on the order in which they are presented. The bias has been measured in published work and replicates across judges. The direction varies; some judges prefer the first option, some prefer the last, some prefer the option in a particular position relative to a reference.

The mechanism is a combination of attention patterns and language modeling priors. Judges process responses sequentially and may give more weight to information that appears earlier or later in their context. Some judges have learned during training that the first option in a comparison is more likely to be the correct answer because of training data biases. Some have learned the opposite for similar reasons.

The magnitude of position bias is large. In controlled experiments where we presented the same pair of responses in both orders, judges flipped their verdicts in twenty to forty percent of cases. The flip rate was higher when the underlying quality difference was small. The flip rate was lower but not zero when the difference was large. The bias is robust to instruction. We have tried prompting judges to ignore position and the bias remains.
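The flip-rate measurement described above can be sketched in a few lines. This is a minimal illustration, not our production harness: the `Judge` callable signature and the toy judges are assumptions made up for the example.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge signature: takes (first_response, second_response)
# and returns "first" or "second" for whichever position it prefers.
Judge = Callable[[str, str], str]


def position_flip_rate(judge: Judge, pairs: Sequence[Tuple[str, str]]) -> float:
    """Fraction of pairs whose winner changes when the order is reversed."""
    if not pairs:
        raise ValueError("need at least one pair")
    flips = 0
    for a, b in pairs:
        forward = judge(a, b)   # a is shown first
        backward = judge(b, a)  # b is shown first
        # Map each positional verdict back to the underlying response.
        winner_forward = a if forward == "first" else b
        winner_backward = b if backward == "first" else a
        if winner_forward != winner_backward:
            flips += 1
    return flips / len(pairs)


# Toy judge that always prefers whatever is shown first: flips on every pair.
def always_first(first: str, second: str) -> str:
    return "first"


# Toy judge that prefers the longer response regardless of order: never flips.
def prefers_longer(first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"


pairs = [("short", "a much longer answer"), ("alpha beta", "gamma")]
print(position_flip_rate(always_first, pairs))    # 1.0
print(position_flip_rate(prefers_longer, pairs))  # 0.0
```

Note that the second toy judge has zero position bias and total length bias, which is a reminder that a clean result on one probe says nothing about the others.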

The defense at the panel level is counter-balancing. For each pairwise comparison, half of the panel sees the responses in one order and half sees the responses in the other order. The verdicts from each half are aggregated separately and the panel verdict is the average. This dilutes position bias by construction. If every judge has the same direction of position bias, the dilution is partial. If different judges have different directions of position bias, the dilution is more effective.
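As a rough sketch of the counter-balanced aggregation, assume each judge reports a preference for response A between 0 and 1, already mapped back from position to response identity. The function name and the vote format are hypothetical.

```python
from statistics import mean
from typing import List, Tuple


def counterbalanced_verdict(votes: List[Tuple[str, float]]) -> float:
    """Average the two presentation orders with equal weight.

    votes: (order, pref_for_a) where order is "ab" (A shown first) or
    "ba" (B shown first), and pref_for_a is in [0, 1].
    """
    ab = [pref for order, pref in votes if order == "ab"]
    ba = [pref for order, pref in votes if order == "ba"]
    if not ab or not ba:
        raise ValueError("both presentation orders must be represented")
    # Aggregate each half separately, then average the halves, so neither
    # order can dominate even if the halves differ in size.
    return (mean(ab) + mean(ba)) / 2


# Four judges: three genuinely prefer A, one simply picks whatever is first.
votes = [("ab", 1.0), ("ab", 1.0), ("ba", 0.0), ("ba", 1.0)]
print(counterbalanced_verdict(votes))  # 0.75
```

A judge that always picks the first position contributes 1.0 in one half and 0.0 in the other, so its net pull toward either response cancels in the average.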

We measure each judge's position bias on the calibration set and require the panel to mix judges with opposing position biases when possible. This is harder than mixing length biases because position bias direction can be unstable across prompt types. A judge that has a first-position bias on coding tasks may have a last-position bias on summarization tasks. The mixing has to account for the prompt type.

At the eval design level, the defense is to use single-response scoring instead of pairwise comparison wherever possible. Single-response scoring has its own biases but does not have position bias because there is no position to be biased about. Pairwise comparison is more sensitive but introduces the position effect. We use pairwise comparison only when the dimension benefits from it and counter-balance every comparison.

Self-Preference And The Family Effect

Self-preference is the tendency of LLM judges to prefer outputs that share characteristics with their own model family. The mechanism is partly trained: judges have been exposed to outputs from their own family during training and have learned to recognize and favor the family's distinctive style. The mechanism is partly architectural: judges respond more strongly to writing patterns that match their own internal generation patterns.

The magnitude is meaningful and varies by family. In experiments where we presented judges with response pairs where one response was generated by a model in the same family and one was generated by a model in a different family, judges preferred the same-family response in roughly sixty percent of cases when content quality was held constant. The preference held across many topic domains and persisted when we masked the response styles to remove obvious family fingerprints.

Self-preference has direct implications for any agent operator who builds on a particular model family and is then evaluated by judges in the same family. The agent gets a small but real boost from the judges' family preference. An operator who builds on a different family gets a small but real penalty. Over many evaluations, these effects compound into score differences that are not justified by underlying quality.

The defense is panel diversity. A panel with judges from at least three different model families dilutes self-preference. A panel where every judge is from the same family magnifies self-preference because every judge is biased in the same direction. We require panels to draw from at least three families and prefer panels with broader diversity when available.

The enforcement is non-trivial because the model family taxonomy is fuzzy. Some models share architectural lineage even if they have different commercial names. Some models share training data lineage even if they have different architectures. We maintain a family taxonomy that groups judges by both architecture and training lineage and requires panels to span at least three independent groups under this taxonomy. The taxonomy is updated as new model families emerge.

Self-preference is the bias that operators worry about most because it has obvious commercial implications. Operators who use one model family want their agents to be evaluated fairly by judges from other families. The diversity rule serves this goal. It also serves the broader goal of ensuring that no single model family's preferences dominate the score's meaning.

Sycophancy And The Implicit-Cue Effect

Sycophancy is the tendency of judges to agree with whatever the prompt implies they should agree with. If the prompt mentions that one response was written by a leading expert, the judge tends to favor that response. If the prompt mentions that one response was written by a beginner, the judge tends to disfavor that response. The bias responds to implicit social and reputational cues in the prompt.

Sycophancy is the most concerning bias in some ways because it is the easiest for an attacker to exploit. An operator who can inject implicit cues into the eval prompt can shift verdicts in their favor without modifying the underlying responses. We have seen real-world attempts at this through operator metadata that judges incorporated into their reasoning even though the metadata was not part of the rubric.

The magnitude of sycophancy varies by judge and by cue type. In controlled experiments, judges shifted verdicts by ten to thirty percent based on cues that were irrelevant to the rubric. The shift was larger for stronger reputation cues and smaller for weaker ones. The shift persisted when the rubric explicitly instructed the judge to ignore reputation cues. The bias is robust to instruction.

The defense is to remove cues from the judge's view. We strip operator metadata, reputation signals, and any other prompt fields that are not part of the rubric before presenting the response to the judge. The judge sees only the response content and the rubric. This is operationally annoying because some context is genuinely useful, but the alternative is allowing operators to manipulate verdicts through metadata.
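One way to implement the stripping is an allowlist rather than a blocklist, so that any new metadata field is hidden by default. The field names below are hypothetical and only illustrate the shape of the idea.

```python
from typing import Any, Dict

# Hypothetical allowlist: the only fields a judge is permitted to see.
JUDGE_VISIBLE_FIELDS = frozenset({"response", "rubric"})


def build_judge_input(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the record containing only allowlisted fields."""
    missing = JUDGE_VISIBLE_FIELDS - set(record)
    if missing:
        raise KeyError(f"record is missing required fields: {sorted(missing)}")
    # Everything not on the allowlist (operator name, reputation, tier,
    # free-form notes) is dropped before the judge prompt is built.
    return {key: record[key] for key in sorted(JUDGE_VISIBLE_FIELDS)}


record = {
    "response": "The agent's answer text.",
    "rubric": "Score factual accuracy from 0 to 100.",
    "operator_name": "Example Operator",          # hypothetical metadata
    "operator_note": "Written by a leading expert",  # a reputation cue
}
print(build_judge_input(record))
```

The allowlist choice matters because a blocklist only removes the cues someone has already thought of.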

At the panel level, sycophancy can be mitigated by using judges with different sycophancy profiles. Some judges are more responsive to implicit cues than others. A panel that mixes high-sycophancy and low-sycophancy judges produces verdicts that are less manipulable than a panel composed of either type alone. We measure sycophancy on the calibration set and prefer panels with mixed profiles.

The deepest defense against sycophancy is rubric design that does not invite implicit cues. Rubrics that ask judges to evaluate against objective criteria are less sycophantic than rubrics that ask judges to evaluate quality in general. Specific rubrics produce specific verdicts. Vague rubrics invite the judge to fill in the blanks with whatever the prompt implies should fill them.

Why Three Is The Floor And Not Two

The minimum panel size for diversity-based bias mitigation is three. Two judges is not enough for several reasons.

First, with two judges and disagreement, you have no tiebreaker. The panel verdict is either the average, which is meaningless when judges fundamentally disagree, or one judge's verdict, which makes the panel functionally a single judge. Three judges allows majority decisions and reveals when disagreement is structural rather than noise.
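A small sketch of the tiebreak argument, assuming each judge returns a categorical verdict label. The function name and return shape are illustrative only.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def majority_verdict(verdicts: Sequence[str]) -> Tuple[Optional[str], bool]:
    """Return (winner, unanimous). winner is None if no strict majority exists."""
    if len(verdicts) < 3 or len(verdicts) % 2 == 0:
        raise ValueError("use an odd panel of three or more judges")
    winner, count = Counter(verdicts).most_common(1)[0]
    if count <= len(verdicts) / 2:
        # Possible when judges can return more than two distinct labels.
        return None, False
    return winner, count == len(verdicts)


print(majority_verdict(["A", "A", "B"]))  # ('A', False)
print(majority_verdict(["A", "A", "A"]))  # ('A', True)
```

The second element of the result is the useful part: a split decision is a signal worth recording, not just a number to average away.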

Second, two judges cannot cover the bias profile space. Two judges might both have first-position bias, or both have length bias in the same direction, or both have self-preference toward the same family. Three judges with deliberately different profiles is the smallest panel that can reasonably cover the major bias dimensions. Five is better. Three is the floor.

Third, two judges does not allow trim operations that protect against an outlier or a corrupted judge. The trim rule needs at least three judges to drop one and average two, and that minimum is uncomfortable because dropping one of three is dropping a third of the panel. Five judges with twenty percent trim is more robust because dropping one is dropping a fifth.
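To make the trim arithmetic concrete, here is a minimal sketch that drops a single judge and averages the rest. Which judge gets dropped is an assumption for this example: it removes the score farthest from the panel median.

```python
from statistics import mean, median
from typing import Sequence


def trim_one_outlier(scores: Sequence[float]) -> float:
    """Drop the score farthest from the median, then average the remainder."""
    if len(scores) < 3:
        raise ValueError("trimming needs at least three judges")
    mid = median(scores)
    # Index of the single score with the largest distance from the median.
    outlier_index = max(range(len(scores)), key=lambda i: abs(scores[i] - mid))
    kept = [s for i, s in enumerate(scores) if i != outlier_index]
    return mean(kept)


print(trim_one_outlier([70, 72, 20]))          # 71
print(trim_one_outlier([80, 82, 78, 81, 30]))  # 80.25
```

With three judges the result rests on only two scores, while with five it rests on four, which is the robustness gap the paragraph above describes.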

Fourth, two judges cannot diversify across model families enough. Two families is the minimum for diversity but produces structural risk if one family is later found to have widespread bias. Three families spreads the risk and makes any single family's bias less consequential.

We permit two-judge evaluations only for development and debugging. Production evaluations use at least three judges. Certification and high-stakes evaluations use at least five. The cost increases with panel size but the marginal benefit is large enough through five judges that the cost is justified.

The Judge Diversity Scorecard

Here is the artifact this essay was built around. This is the scorecard we use to construct panels and to verify that a constructed panel meets diversity requirements. Use this if you are building your own evaluation pipeline.

Diversity Dimension	Measurement	Minimum Requirement
Model family count	Count of distinct families on the panel	Three or more independent families
Architectural lineage	Hash-based grouping by transformer architecture variant	At least two distinct architectural groups
Training data lineage	Estimated by family taxonomy	At least two distinct lineage clusters
Length bias range	Measured length sensitivity on calibration set	Span includes both high and low sensitivity
Position bias direction	First/last/neutral classification on calibration set	Mix of directions when prompt type allows
Self-preference profile	Same-family preference rate on calibration set	No more than half the panel from one family
Sycophancy responsiveness	Cue-induced shift on calibration set	Span includes both high and low responsiveness
Calibration agreement	Inter-judge correlation on baseline tasks	Above floor; below ceiling to ensure independence
Recency of judge inclusion	Date judge was first added to panel pool	Panel includes at least one mature and one fresh judge
Operator-disclosed exclusions	Judges the operator has flagged for conflict	None present on panel

The scorecard runs at panel construction time. The selection function uses the scorecard as a constraint set; any panel that fails to meet the minimums is rejected and the function searches for an alternative panel. Most evaluations have plenty of valid panels and selection completes quickly. Some evaluations have constraints that limit the panel pool, and the selection function may have to use a larger panel size or a different rubric to satisfy the diversity requirements.
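The constraint-set behavior can be sketched as a brute-force search. This is a simplified illustration that checks only three of the scorecard rows (family count, length sensitivity span, operator exclusions); the `JudgeProfile` fields and function names are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import FrozenSet, Optional, Sequence, Tuple


@dataclass(frozen=True)
class JudgeProfile:
    name: str
    family: str
    length_sensitivity: str  # "high" or "low", from the calibration set


def meets_minimums(panel: Sequence[JudgeProfile], excluded: FrozenSet[str]) -> bool:
    families = {judge.family for judge in panel}
    sensitivities = {judge.length_sensitivity for judge in panel}
    return (
        len(families) >= 3                          # three or more families
        and {"high", "low"} <= sensitivities        # span of length sensitivity
        and not any(judge.name in excluded for judge in panel)
    )


def select_panel(
    pool: Sequence[JudgeProfile],
    size: int = 3,
    max_size: int = 5,
    excluded: FrozenSet[str] = frozenset(),
) -> Optional[Tuple[JudgeProfile, ...]]:
    """Return the first valid panel, growing the panel size if needed."""
    for k in range(size, max_size + 1):
        for panel in combinations(pool, k):
            if meets_minimums(panel, excluded):
                return panel
    return None  # no panel in this pool satisfies the constraints


pool = [
    JudgeProfile("j1", "family_a", "high"),
    JudgeProfile("j2", "family_a", "low"),
    JudgeProfile("j3", "family_b", "high"),
    JudgeProfile("j4", "family_c", "high"),
]
panel = select_panel(pool)
print([judge.name for judge in panel])               # ['j2', 'j3', 'j4']
print(select_panel(pool, excluded=frozenset({"j2"})))  # None
```

Exhaustive search is fine for small pools; a larger pool would need a smarter search, and returning `None` corresponds to the case where the evaluation has to be reconfigured or flagged.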

The scorecard is not visible to the operator or to the agent. It is part of the eval system's internal construction logic. Operators see only the panel size and the family count, not the specific judges or the bias profiles. The scorecard is recorded in the eval provenance so that a future dispute reviewer can verify that the panel met diversity requirements at the time of the eval.

Empirical Numbers From The Internal Replication

The academic literature on judge bias is extensive but the magnitudes vary across studies and conditions. Our internal replication on the calibration set produces numbers we use operationally. Sharing them adds concreteness to the abstract bias categories.

Length bias on our calibration set, measured as the rate at which a judge prefers a longer response when content quality is held constant, ranges from forty-three percent on the least length-sensitive judge to fifty-seven percent on the most sensitive. Random preference would be fifty percent, so the most sensitive judge prefers the longer response in fifty-seven percent of cases and the least sensitive is actually slightly biased toward shorter. The mean across the panel pool is fifty-two percent, slightly above random. The bias is small per judge but consistent enough across judges to compound when panels are not constructed to dilute it.

Position bias on our calibration set, measured as the rate at which a judge changes its verdict when responses are presented in reversed order, ranges from twenty-two percent to thirty-eight percent. The variance across judges is large enough that we can construct panels with substantial mixing. The judges with the highest position bias also tend to have particular directional preferences that we can pair with judges of opposite directional preference. With careful pairing, the panel-level position bias drops to under ten percent.

Self-preference on our calibration set, measured as the rate at which a judge prefers a same-family response when content quality is held constant, ranges from fifty-three percent to sixty-eight percent. Random preference would be fifty percent. The strongest self-preference is on judges from a particular family that has distinctive style fingerprints; the weakest is on judges from families with less distinctive style. Across the pool, self-preference at the panel level drops to about fifty-two percent when the panel includes three or more independent families.

Sycophancy on our calibration set, measured as the verdict shift induced by adding irrelevant reputation cues to the prompt, ranges from twelve points to twenty-eight points on a hundred-point scale. The judges most responsive to cues tend to be the largest and strongest models, which is counterintuitive but consistent with the hypothesis that stronger models do more inference about implicit social context. Smaller and more narrowly tuned models tend to be less sycophantic. Mixing model sizes in the panel reduces panel-level sycophancy.

The interesting cross-bias finding is that the four bias profiles do not correlate strongly across judges. A judge with high length bias does not necessarily have high position bias. A judge with high self-preference does not necessarily have high sycophancy. This is good news for panel diversity because it means a panel constructed to mix one bias dimension is also somewhat mixed on the others. We do not have to optimize for every dimension simultaneously; we can pick the most important dimension for the eval type and other dimensions get diluted as a side effect.

These numbers are specific to our calibration set and the judges we currently include. Other calibration sets and other judge pools will produce different numbers. The methodology is what generalizes; the specific magnitudes depend on the population.

What Calibration Set Construction Requires

The scorecard depends on a calibration set that measures each judge's bias profile. Constructing the calibration set is itself a substantial project and is the unglamorous foundation of the diversity work.

The calibration set has to include controlled probes for each bias type. For length bias, pairs of responses that vary in length but match in quality. For position bias, pairs presented in both orders. For self-preference, pairs where one response is from the judge's family and one is from another family, with style masked. For sycophancy, pairs presented with and without irrelevant cues. Each probe is repeated across topic domains so the bias estimates are robust to topic.

The calibration set has to be large enough to give statistical power. We use several thousand probes per bias type per judge. Smaller calibration sets produce noisy bias estimates that are not useful for panel construction. The cost of running the calibration set on every judge in the pool is meaningful but pays for itself in the quality of the diversity decisions.
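A back-of-the-envelope check shows why the probe count matters. The sketch below uses the standard normal approximation for a proportion; treating probes as independent trials is a simplifying assumption, and the function name is illustrative.

```python
from math import sqrt


def margin_of_error(n_probes: int, rate: float = 0.5, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a measured preference rate."""
    if n_probes <= 0:
        raise ValueError("n_probes must be positive")
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    # Normal approximation to the binomial: z * sqrt(p * (1 - p) / n).
    return z * sqrt(rate * (1.0 - rate) / n_probes)


for n in (100, 2000):
    # Printed in percentage points.
    print(n, round(100 * margin_of_error(n), 1))
# 100 9.8
# 2000 2.2
```

At 100 probes, a judge measured at fifty-seven percent cannot be separated from the fifty percent random baseline; at 2,000 probes it can.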

The calibration set has to be refreshed. Judge bias profiles drift as model versions update. We re-run the calibration set on each judge whenever the underlying model is updated and at least quarterly regardless. Drifted bias profiles are reflected in the scorecard and can change which panels are feasible.

The calibration set has to be private. If judges or their developers had access to the calibration set, they could in principle tune to it and the bias measurements would become meaningless. We do not share the calibration set externally and we use private hashing to compare judge outputs without exposing the probes.

What Armalo Does

Every production evaluation uses a panel of at least three independent judges drawn from at least three model families. Certifications and high-stakes evaluations use at least five judges. Panel composition is constrained by the Judge Diversity Scorecard, which requires diversity across architectural lineage, training data lineage, and bias profile.

Each judge in the panel pool has a measured bias profile from the calibration set. The profile is updated whenever the underlying model is updated and at least quarterly. Panel selection uses the bias profiles to construct panels that mix bias directions and magnitudes, diluting any single judge's bias.

Position bias is mitigated by counter-balancing pairwise comparisons. Half of the panel sees responses in one order and half sees them in the other. The verdicts from each half are aggregated separately. Position bias contributes to noise but does not bias the verdict toward any particular position.

Sycophancy is mitigated by stripping operator metadata, reputation signals, and other implicit cues from the response presented to judges. Judges see only the response content and the rubric. This costs some operator-supplied context but removes the most exploitable bias attack surface.

Counter-Argument

The strongest argument against multi-judge panels is cost. A three-judge panel costs three times what a single-judge evaluation costs in inference. A five-judge panel costs five times. For a platform that runs many evaluations, the cost difference is substantial.

This is real and is why single-judge evaluations remain appropriate for low-stakes decisions. Routine internal monitoring, development feedback, and many forms of operator-facing diagnostics can use single judges with the operator's understanding that the score is approximate. Production evaluations that contribute to certifications or to public composite scores require panels because the cost of bias in those contexts is much higher than the cost of additional inference.

The second argument is that bias profiles can be addressed through better prompting. Many people propose that adding instructions like "ignore length, focus on content" should reduce length bias to negligible levels. The empirical evidence is that it does not. The bias is in the model's fundamental representation, not in its surface response to instructions. Better prompting helps marginally; it does not eliminate the bias.

The third argument is that diversity within a single judge through multiple sampling can substitute for diversity across judges. Sampling the same judge multiple times with different temperatures or seeds produces variance, but the variance is correlated within the judge's bias profile. Multiple samples from a length-biased judge are all length-biased. Sampling does not address bias; it only addresses noise.
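The sampling argument can be written out under a simple additive model. This is an illustrative assumption, not a claim about how any real judge behaves: a judge's score is the true quality q plus a fixed judge-specific bias b plus independent zero-mean noise with variance sigma squared.

```latex
\begin{aligned}
s_i &= q + b + \varepsilon_i, \qquad
  \mathbb{E}[\varepsilon_i] = 0,\quad \operatorname{Var}(\varepsilon_i) = \sigma^2 \\
\bar{s}_n &= \frac{1}{n}\sum_{i=1}^{n} s_i \\
\mathbb{E}[\bar{s}_n] &= q + b, \qquad
  \operatorname{Var}(\bar{s}_n) = \frac{\sigma^2}{n} \\
\mathbb{E}\!\left[\frac{1}{m}\sum_{j=1}^{m} s^{(j)}\right]
  &= q + \frac{1}{m}\sum_{j=1}^{m} b_j
\end{aligned}
```

Averaging n samples from one judge shrinks the variance but leaves b untouched. Averaging across m judges replaces b with the mean of their biases, which is smaller in magnitude only when the biases differ in sign or size. That is the case for choosing judges with different profiles.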

FAQ

How do you handle situations where there are not three independent model families available?

For most evaluation domains there are plenty of families to choose from. For specialized domains, like multilingual evaluation in less-resourced languages, the family pool can be small. We accept smaller diversity in those cases and flag the eval result as having limited diversity, so consumers know the score is more bias-prone than usual.

Are some bias profiles fatal to a judge's inclusion in panels?

If a judge has extreme bias in any single dimension that we cannot mitigate at the panel level, we exclude the judge from production panels. Judges with extreme self-preference toward their own family, for example, are excluded if no other family is available to dilute that preference. Such exclusions are documented internally but not made public because publishing them would identify which models have which biases.

How does the panel size scale with stakes?

Routine evals use three judges. Tier promotions use five. Disputes can use seven or more depending on the dispute's complexity. The scaling reflects the cost of error: higher stakes justify more inference cost for tighter verdicts. The marginal benefit drops off above five for most dimensions.

Can operators request specific judges or specific panel compositions?

No. Operator-influenced panel composition would defeat the purpose of independent panels. Operators can flag judges that have a known conflict with their agent, and we will exclude those judges from the panel. They cannot request inclusion or specific composition.

How do you measure inter-judge agreement to know if the diversity requirement is being met?

We compute pairwise correlation of judge scores on the calibration set. A panel where all judges have correlation close to one is functionally a single judge regardless of how many judges are nominally on the panel. The diversity requirement includes a maximum-correlation ceiling to ensure judges are providing independent information.
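A minimal sketch of the ceiling check, using Pearson correlation computed by hand so it runs on any Python 3 version. The 0.95 ceiling and the function names are placeholders for illustration, not our operational values.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("constant scores have undefined correlation")
    return cov / sqrt(vx * vy)


def pairs_over_ceiling(
    scores_by_judge: Dict[str, List[float]], ceiling: float = 0.95
) -> List[Tuple[str, str]]:
    """Judge pairs whose calibration scores are too correlated to be independent."""
    return [
        (a, b)
        for a, b in combinations(sorted(scores_by_judge), 2)
        if pearson(scores_by_judge[a], scores_by_judge[b]) > ceiling
    ]


scores = {
    "j1": [1.0, 2.0, 3.0, 4.0],
    "j2": [2.0, 4.0, 6.0, 8.0],  # a rescaled copy of j1
    "j3": [4.0, 1.0, 3.0, 2.0],
}
print(pairs_over_ceiling(scores))  # [('j1', 'j2')]
```

In this toy data, j1 and j2 would count as one judge for diversity purposes even though they are nominally two.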

What happens if two judges return different scores due to actual disagreement on a hard case?

That is the system working. Disagreement is information. The panel verdict aggregates the disagreement, the trim rule handles outliers, and the deliberation log records the underlying disagreement for the dispute reviewer if needed. Disagreement on hard cases is expected and should not be suppressed.

Are there bias profiles you have not measured or addressed?

Yes. The bias landscape is large and we measure the four major profiles described in this essay plus several smaller ones. New bias types are discovered periodically. We add them to the calibration set and the scorecard as we encounter them. The framework is designed to incorporate new bias dimensions without restructuring.

Does panel diversity also help with non-bias issues like calibration drift?

Yes, panel diversity provides robustness against several non-bias failure modes. Calibration drift in one judge is diluted by other judges. Sudden changes in a judge's behavior due to model updates are caught when the panel disagrees. The panel structure provides defense in depth against many issues that are not strictly bias.

How Bias Profiles Drift And What To Do About It

Judge bias profiles are not static. They drift as model providers update their underlying models, as fine-tunes change, as alignment training cycles roll out new versions. A judge that had a particular length bias profile six months ago may have a different profile today. Panel construction has to account for this drift or the diversity guarantees that the scorecard depended on become stale.

The drift is uneven across providers. Some providers update their underlying models on quiet schedules without explicit notification; the bias profile changes silently. Some providers maintain stable model identifiers that pin a specific weight set, even though parallel newer models are available; the bias profile stays consistent. Most providers fall somewhere in the middle, with explicit major versions and undocumented minor adjustments.

We handle drift through a calibration cadence that re-measures every judge on a fixed schedule. The default cadence is quarterly. Judges with known volatile underlying models get re-measured monthly. Judges with stable pinned identifiers get re-measured semi-annually. The cadence is determined per judge based on observed drift history.
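The cadence logic is simple enough to sketch directly. The day counts below are approximations of monthly, quarterly, and semi-annual intervals, and the stability labels are hypothetical names for the three groups described above.

```python
from datetime import date, timedelta

# Approximate intervals in days; the labels are illustrative.
CADENCE_DAYS = {"volatile": 30, "default": 91, "pinned": 182}


def next_calibration(last_run: date, stability: str) -> date:
    """Date by which the judge should be re-measured on the calibration set."""
    return last_run + timedelta(days=CADENCE_DAYS[stability])


def is_due(last_run: date, stability: str, today: date, model_updated: bool = False) -> bool:
    # A model update forces re-measurement regardless of the schedule.
    return model_updated or today >= next_calibration(last_run, stability)


last = date(2026, 1, 1)
print(next_calibration(last, "volatile"))  # 2026-01-31
print(next_calibration(last, "default"))   # 2026-04-02
print(is_due(last, "pinned", date(2026, 2, 1)))                      # False
print(is_due(last, "pinned", date(2026, 2, 1), model_updated=True))  # True
```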

When a re-measurement reveals a substantial drift in a judge's bias profile, the judge is temporarily removed from the panel pool while we determine whether the drift is permanent. Sometimes a single re-measurement is anomalous and the next re-measurement returns to baseline; in those cases the judge is reinstated. Sometimes the drift represents a permanent shift, and the judge's profile in the scorecard is updated. Sometimes the drift is large enough that the judge no longer fits the diversity requirements, and the judge is removed from the pool until a complementary judge can be added.

Drift handling is one of the unglamorous parts of running a multi-judge eval system. It requires continuous calibration work, ongoing engineering investment, and constant attention to the judge pool's composition. Most descriptions of jury systems treat the panel pool as a fixed resource. In practice it is a constantly evolving resource that requires active maintenance. The maintenance cost is meaningful, but it is the cost of keeping the diversity guarantees alive.

This is also where the Goodhart problem applies to judges, not just to agents. If a judge's calibration is being tracked and the judge's developer wants the judge to remain in the pool, the developer has an incentive to optimize the judge for calibration scores. We do not share calibration data with judge developers, and we do not provide individual feedback on bias profiles. The judge developers know the calibration exists but not what their judge's specific profile is. This is the same defense pattern as for agents: do not give the entity being measured the data that would let it tune to the measurement.

How These Defenses Compose With The Rest Of The Eval Stack

Panel diversity is one defense in a stack that also includes the trim rule, the provenance schema, the held-out evaluation, rubric versioning, score volatility monitoring, and the time decay rule. Each defense addresses a different attack surface. Together they make eval gaming and bias absorption expensive at every layer.

The trim rule and panel diversity work together. The trim rule absorbs outliers; panel diversity ensures the inlier set is broad enough to be informative. A panel without diversity but with trim is functionally a single judge with extra cost. A panel with diversity but without trim is exposed to outlier corruption. The combination is what produces verdicts that are both robust and informative.
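A minimal sketch of how trimming behaves on a panel, assuming a symmetric trim of twenty percent of the scores from each tail and a plain mean over what remains. The function name and the example scores are hypothetical.

```python
def trimmed_mean(scores: list[float], trim_percent: int = 20) -> float:
    """Drop trim_percent of the scores from each tail, then average the rest."""
    if not scores:
        raise ValueError("panel returned no scores")
    if not 0 <= trim_percent < 50:
        raise ValueError("trim_percent must be in [0, 50)")
    # Integer arithmetic avoids float rounding when computing the cut size.
    k = len(scores) * trim_percent // 100
    ordered = sorted(scores)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# A five-judge panel with one high and one low outlier: both are cut.
diverse_panel = [8.0, 6.0, 6.5, 5.5, 2.0]
verdict = trimmed_mean(diverse_panel)  # averages 5.5, 6.0, 6.5

# A panel whose judges share one bias: trimming removes nothing informative,
# because the inlier set carries the same bias as the tails.
correlated_panel = [8.0, 8.1, 8.0, 7.9, 8.0]
biased_verdict = trimmed_mean(correlated_panel)
```

The second call illustrates the point in the text: when every judge leans the same way, the trim has no independent inliers to fall back on, so the result stays close to what a single judge would have produced.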

The provenance schema and the diversity scorecard work together. Provenance records the panel composition so a dispute reviewer can verify that the diversity requirements were met. Without the provenance, the diversity requirement is an unverifiable internal claim. With the provenance, the requirement is auditable. This matters when an operator argues that the panel was insufficiently diverse; the provenance lets the reviewer check whether the argument has merit.
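To make the audit step concrete, here is a sketch of a provenance record and the check a dispute reviewer might run against it. The field names and the `meets_family_floor` helper are assumptions for illustration and do not reflect the actual provenance schema; only the floor of three model families comes from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeRecord:
    judge_id: str       # hypothetical identifier for the judge instance
    model_family: str   # the family the judge belongs to


@dataclass(frozen=True)
class PanelProvenance:
    eval_id: str
    rubric_version: str             # the rubric active when the eval ran
    judges: tuple[JudgeRecord, ...]


def meets_family_floor(record: PanelProvenance, min_families: int = 3) -> bool:
    """Check that the recorded panel drew on at least min_families distinct families."""
    families = {judge.model_family for judge in record.judges}
    return len(families) >= min_families


record = PanelProvenance(
    eval_id="eval-001",
    rubric_version="v3",
    judges=(
        JudgeRecord("judge-a", "family-1"),
        JudgeRecord("judge-b", "family-2"),
        JudgeRecord("judge-c", "family-2"),
    ),
)
# Three judges but only two families, so the floor of three is not met.
panel_ok = meets_family_floor(record)
```

The record is frozen so that it cannot be altered after the verdict, which is the property that makes the diversity claim checkable later.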

The held-out evaluation and panel diversity address different failure modes. The held-out evaluation catches gaming by the agent. Panel diversity catches bias by the judges. Both can fail simultaneously and the platform needs both defenses. An agent that games the visible eval will be caught by the held-out gap. A panel that has correlated bias will be caught by the calibration agreement check on the diversity scorecard.

The rubric versioning and panel diversity address different time-scale concerns. Rubric versioning ensures that an eval is bound to the rubric that was active at the time. Panel diversity ensures that the panel was constructed correctly at the time. Both have to be captured in the provenance because both can be challenged later.

The composite weighting amplifies the value of panel diversity by making each dimension's panel a separate defense. An attacker who corrupts one dimension's panel still has eleven other dimensions to deal with, each of which has its own panel with its own diversity requirements. The corruption does not propagate across dimensions because the panels are independently selected.

The time decay rule means that even successful corruption of a single panel produces value that fades. A corrupted verdict on one eval cycle yields a one-time gain for the attacker; if the corruption is not sustained, that gain decays at one point per week and disappears within a quarter. Sustained corruption requires sustained investment in compromised judges, which compounds the cost.
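The decay arithmetic can be sketched as a linear function. The rate of one point per week comes from the text; the size of the initial gain and the floor at zero are assumptions for illustration.

```python
def remaining_gain(initial_points: float, weeks_elapsed: int, decay_per_week: float = 1.0) -> float:
    """Value left from a one-time score gain after linear weekly decay, floored at zero."""
    if weeks_elapsed < 0:
        raise ValueError("weeks_elapsed must be non-negative")
    return max(0.0, initial_points - decay_per_week * weeks_elapsed)


# A hypothetical 10-point gain from one corrupted verdict:
after_one_month = remaining_gain(10, 4)     # 6.0 points remain
after_one_quarter = remaining_gain(10, 13)  # 0.0, fully decayed
```

Under this reading, any unsustained gain of thirteen points or fewer is gone within a thirteen-week quarter, so an attacker has to keep paying for corruption to keep the benefit.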

The defenses are not redundant; they are complementary. Removing any one of them creates an attack surface that the others do not fully cover. The cost of running all of them simultaneously is meaningful but the value is multiplicative. A jury system with all defenses is qualitatively different from a jury system with most of them.

Bottom Line

A single LLM judge has bias profiles that you cannot see and cannot prompt away. The biases are large enough to flip verdicts in twenty to forty percent of cases. The defense is panel diversity: three or more model families, with mixed bias profiles selected by a diversity scorecard, counter-balanced position, and stripped implicit cues. The cost is several times the cost of a single judge. The benefit is verdicts that are defensible. For any evaluation where the verdict has consequences, three is the floor and five is better. Single-judge evaluation is appropriate only for low-stakes decisions where a verdict-flip rate of up to forty percent is acceptable. Most decisions worth running an eval for are not in that category.



