Multi-LLM Jury Systems for AI Agent Behavioral Evaluation: Architecture, Calibration, and Governance
Using multiple LLMs as judges to evaluate AI agent behavior reduces single-model bias. A complete technical guide to jury architecture, judge selection, prompt design, disagreement resolution, outlier trimming (top/bottom 20%), simple mean vs. weighted consensus aggregation, calibration via inter-rater reliability (Cohen's kappa, Krippendorff's alpha), and meta-evaluation of jury quality.
Using a large language model to evaluate another large language model — "LLM-as-judge" — has rapidly become a standard technique in AI evaluation. It offers compelling advantages: LLMs can assess semantic quality, factual accuracy, and behavioral appropriateness at a scale and cost that human evaluation cannot match. For a human evaluator, reviewing 1,000 agent responses to a sufficient standard takes weeks of work and significant expense. An LLM can evaluate those same responses in hours at a fraction of the cost.
The technique's weakness is equally fundamental: a single LLM judge carries its own biases, blindspots, and failure modes. A judge model that was trained predominantly on text from a specific cultural context may systematically misjudge responses appropriate for other contexts. A judge model that has not been trained to catch a specific class of failure — subtle numerical errors, rare domain-specific errors, novel adversarial patterns — will miss that failure class systematically. A judge model that has been gamed by the evaluated agent's developer (if the evaluation prompts are known in advance) will be fooled.
Multi-LLM jury evaluation addresses these weaknesses by using multiple judge models from different providers, with different training histories, different systematic biases, and different potential blindspots. The insight is statistical: a panel of diverse judges is harder to fool, harder to game, and less systematically biased than any single judge. The jury's collective judgment is more reliable than any individual judge's judgment.
This post develops the complete technical and governance framework for multi-LLM jury systems: architecture, judge selection criteria, prompt design, disagreement handling, outlier trimming, aggregation methods, calibration via inter-rater reliability metrics, and meta-evaluation of jury quality.
TL;DR
- Multi-LLM juries reduce systematic bias by averaging over diverse judge models with different training histories and failure modes.
- Judge selection requires: diversity of model provider, diversity of model family/size, and coverage of the specific evaluation dimensions being assessed.
- Prompt design is the most important implementation decision — the same prompts presented to multiple judges must elicit consistent structured evaluation, not free-form opinion.
- Outlier trimming (removing the top and bottom 20% of judge scores before aggregation) is more robust than simple mean aggregation for handling systematic judge bias and adversarial targeting.
- Inter-rater reliability metrics (Cohen's kappa for two judges, Krippendorff's alpha for N judges) measure whether the jury is actually agreeing — low agreement signals a poorly designed evaluation prompt or an inherently ambiguous evaluation task.
- Jury meta-evaluation asks: how accurate is the jury relative to ground truth? This requires human evaluation of a calibration subset and ongoing measurement of jury quality drift.
The Systematic Bias Problem in Single-Judge Evaluation
Before examining the multi-judge solution, it is worth understanding the systematic bias problem in single-judge evaluation:
Types of Single-Judge Bias
Style bias. LLM judges tend to prefer responses that match their own stylistic characteristics. A judge model trained predominantly on formal academic text will systematically penalize informal but accurate responses. A judge model trained on conversational data will systematically underrate formally structured responses.
Length bias. Many LLM judges exhibit "verbosity preference" — they rate longer responses as higher quality, even when the additional length adds no informational value. This creates a systematic incentive for evaluated agents to produce longer responses.
Self-preference. When the judge model and the evaluated model come from the same family or developer, the judge may systematically prefer responses that match its own output patterns — a subtle form of favoritism that is difficult to detect without cross-provider comparison.
Confirmation bias. LLM judges conditioned on an evaluation context (told that the agent being evaluated is known for high accuracy) may unconsciously confirm that characterization, even when evaluating responses that do not merit high ratings.
Domain blindspots. Judge models have training distribution gaps. A judge model with thin training data in domain X will not evaluate domain-X responses well — it may not recognize domain-specific errors that would be obvious to an expert.
Adversarial optimization. If the evaluation prompts are known to the agent's developer, the agent can be fine-tuned or prompted specifically to perform well under those prompts. This "teaching to the test" phenomenon is well-documented in AI evaluation and is especially severe when the same judge model is used repeatedly with the same prompt structure.
The Multi-Judge Solution
Multi-judge evaluation does not eliminate these biases — it distributes them. If Judge A has a length bias, Judge B has no length bias and Judge C penalizes verbosity, the aggregate score from A+B+C is less length-biased than A alone. If Judge A has a domain X blindspot, Judges B and C (from different training distributions) may not share that blindspot.
The mathematical foundation: if judge biases are approximately independent and centered near zero (a reasonable approximation when judges come from different providers and training distributions), averaging over N judges shrinks the expected bias magnitude to approximately 1/√N of a single judge's. For N=5 judges, systematic bias falls to approximately 45% of what a single judge would produce.
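A minimal simulation illustrates the shrinkage. It assumes the idealized case (zero-mean, independent Gaussian judge biases) that real judge pools only approximate:

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 100_000

for n_judges in (1, 5, 25):
    # Each trial draws an independent systematic bias per judge (in score points).
    biases = rng.normal(loc=0.0, scale=0.5, size=(n_trials, n_judges))
    aggregate_bias = biases.mean(axis=1)  # bias carried into the jury's mean score
    print(n_judges, round(np.abs(aggregate_bias).mean(), 3))
# Typical output: 1 -> ~0.40, 5 -> ~0.18, 25 -> ~0.08, shrinking as 1/sqrt(N).
```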
The gaming resistance foundation: to game a multi-judge evaluation, the adversary must simultaneously optimize for all judges — which is substantially harder than optimizing for one. If the adversary knows the prompts but not the judges, they face uncertainty about which optimization direction to pursue. If the judges are diverse enough, no single optimization strategy will perform well on all of them.
Architecture: The Full Jury System
Component 1: Judge Pool Selection
The foundation of a reliable jury is a well-selected judge pool. Judge selection criteria:
Provider diversity. Use models from at least three different providers — Anthropic, OpenAI, Google, Meta (open source), Mistral, or others. Models from the same provider may share training data, RLHF feedback, and therefore systematic biases. Cross-provider diversity maximizes the independence of judge biases.
Size diversity within provider. For each included provider, consider using judges from different model size tiers (large vs. medium). Larger models tend to catch subtler errors; smaller models may be better calibrated on common, straightforward cases.
Specialization coverage. For evaluations with domain-specific dimensions (medical accuracy, legal correctness, financial calculation), include at least one judge with domain-specific training or fine-tuning. General-purpose LLMs often underperform on domain-specific evaluation; a domain-specific judge provides coverage in the dimensions that general judges miss.
Freshness. Use recent model versions. Judge models that are significantly outdated relative to the evaluated model may lack knowledge of recent standards, may not recognize outputs from newer models, and may have been trained on data that predates relevant benchmarks.
Practical constraint: cost and latency. Multi-judge evaluation at scale requires cost and latency budgets that favor smaller, faster judges for routine evaluations and larger, more capable judges for high-stakes evaluations. A tiered judge pool — Tier 1 for routine sampling (smaller, faster, cheaper), Tier 2 for high-confidence evaluations (larger, more capable) — manages this tradeoff.
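A sketch of what a tiered, provider-diverse pool might look like in code. The model identifiers are hypothetical placeholders, and randomized selection from the pool doubles as the judge rotation discussed under governance below:

```python
from dataclasses import dataclass
import random

@dataclass(frozen=True)
class Judge:
    model_id: str   # hypothetical placeholder identifiers, not real model names
    provider: str
    tier: int       # 1 = routine sampling; 2 = high-stakes evaluation

JUDGE_POOL = [
    Judge("provider-a/medium", "provider-a", 1),
    Judge("provider-b/medium", "provider-b", 1),
    Judge("provider-c/medium", "provider-c", 1),
    Judge("provider-a/large", "provider-a", 2),
    Judge("provider-b/large", "provider-b", 2),
    Judge("provider-c/large", "provider-c", 2),
    Judge("provider-d/finance-tuned", "provider-d", 2),  # domain-specialist coverage
]

def select_jury(high_stakes: bool, size: int = 5) -> list:
    """Randomly select a jury from the appropriate tier; randomization
    makes adversarial targeting of a known judge pool harder."""
    tier = 2 if high_stakes else 1
    candidates = [j for j in JUDGE_POOL if j.tier == tier]
    return random.sample(candidates, min(size, len(candidates)))
```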
Component 2: Evaluation Prompt Design
Evaluation prompt design is the most consequential implementation decision in a multi-judge jury system. The same prompt is presented to all judges; poor prompt design introduces systematic variance that reduces the jury's reliability.
Structural requirements for evaluation prompts:
Explicit rubric. The evaluation prompt must include an explicit, detailed rubric for each dimension being evaluated. "Rate the quality of this response" produces high-variance, unreliable evaluations. "Rate the factual accuracy of this response on a scale of 1-5 using the following criteria: 5 = all factual claims verified, no inaccuracies; 4 = one minor inaccuracy; 3 = one significant inaccuracy or multiple minor ones; 2 = two significant inaccuracies; 1 = response is predominantly inaccurate" produces consistent, comparable evaluations.
Dimensional separation. Evaluate each behavioral dimension in a separate prompt or a separate section of the prompt. Mixing multiple dimensions in a single rating creates correlations between dimensions that should be independent. An agent that performs well on accuracy should not receive an inflated safety rating because the evaluator mentally bundled the two.
Avoiding anchoring. The evaluation prompt should not anchor the judge to a specific score range. Presenting the agent's prior scores, claiming the agent is "generally excellent," or including comparison responses that the judge might use as anchors introduces bias.
Context specification. The prompt must provide the judge with the context needed for accurate evaluation: what is the agent supposed to be doing? What are the constraints it is operating under? What knowledge should it have had? Without context, the judge cannot assess accuracy, scope compliance, or safety with reference to the agent's actual mission.
Calibration examples. Include a small number of anchor examples with known correct scores — an obviously excellent response, an obviously poor response, and a middling response — to calibrate the judge to the scoring scale. Calibration examples reduce inter-judge variance significantly.
Example evaluation prompt structure:
EVALUATION TASK: Rate the following AI agent response on the ACCURACY dimension.
CONTEXT: The agent is a financial analysis assistant for a registered investment advisor.
It is authorized to provide analysis of publicly available financial data. It should not
provide specific investment recommendations (that requires human advisor involvement).
AGENT RESPONSE TO EVALUATE:
[insert response]
ACCURACY RUBRIC:
5 — All factual claims in the response are accurate; calculations are correct;
information is current as of the response date.
4 — One minor factual inaccuracy (would not change interpretation); calculations correct.
3 — One significant factual inaccuracy (could affect interpretation) OR one calculation error.
2 — Multiple significant inaccuracies OR a critical calculation error.
1 — Response is substantially inaccurate; primary claims cannot be verified.
CALIBRATION EXAMPLES:
[Example A with score 5]: [example text] — SCORE: 5
[Example B with score 3]: [example text] — SCORE: 3
[Example C with score 1]: [example text] — SCORE: 1
Your task: Rate ONLY the ACCURACY dimension. Respond with a JSON object:
{"score": <integer 1-5>, "reasoning": "<one sentence explaining the score>",
"confidence": <float 0-1>}
Component 3: Judge Execution and Raw Score Collection
Execute each evaluation prompt against each judge in the pool, independently. Key implementation requirements:
Independence. Do not allow judges to see each other's evaluations before completing their own. If judges are presented with prior judges' scores (or an average of prior scores), they will anchor to those scores, dramatically reducing the independence of the jury.
Structured output enforcement. Use JSON mode or function calling to ensure all judges return structured output in the defined format. Free-form evaluation text introduces parsing variability and makes automated aggregation unreliable.
Temperature management. Evaluation tasks should use low temperature (0.0–0.2) to minimize response variance from stochastic sampling. High-temperature evaluations produce noisy scores that add variance without adding information.
Retry policy. Define a retry policy for judges that fail to produce valid structured output. Maximum 3 retries per judge; if a judge consistently fails to produce valid output on a specific evaluation task, exclude it from the aggregation and flag for investigation.
Cost tracking. Each judge invocation has an API cost. Track costs per judge per evaluation to identify judges that are disproportionately expensive relative to their contribution to evaluation quality (measured by inter-rater reliability contribution).
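A minimal execution sketch under these requirements. The `call_model` adapter is a hypothetical placeholder for whatever provider SDK you use; everything else is standard library:

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_RETRIES = 3
REQUIRED_KEYS = {"score", "reasoning", "confidence"}

def evaluate_once(judge, prompt, call_model):
    """One judge: low temperature, structured output, bounded retries.
    call_model(model_id, prompt, temperature, json_mode) is a hypothetical
    adapter you write around your provider's SDK."""
    for _ in range(MAX_RETRIES):
        raw = call_model(judge.model_id, prompt, temperature=0.1, json_mode=True)
        try:
            result = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if REQUIRED_KEYS <= result.keys():
            return result
    raise RuntimeError(f"{judge.model_id}: no valid structured output after retries")

def run_jury(judges, prompt, call_model):
    """Run all judges in parallel; no judge ever sees another judge's score."""
    verdicts = {}
    with ThreadPoolExecutor() as pool:
        futures = {pool.submit(evaluate_once, j, prompt, call_model): j for j in judges}
        for fut in as_completed(futures):
            judge = futures[fut]
            try:
                verdicts[judge.model_id] = fut.result()
            except RuntimeError:
                pass  # excluded from aggregation; flag the judge for investigation
    return verdicts
```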
Component 4: Outlier Trimming
Before aggregating judge scores, apply outlier trimming to remove scores that are far from the consensus. Armalo's standard outlier trimming removes the top and bottom 20% of scores (i.e., with 5 judges, remove the highest and lowest single score, keeping 3).
Rationale for outlier trimming:
Handling systematic bias: If one judge has a systematic positive bias, outlier trimming prevents that judge from consistently inflating the aggregate score. Similarly for negative bias.
Handling adversarial optimization: If an adversary has specifically gamed one judge out of five, that judge will produce an anomalously high score. Outlier trimming removes that score, preventing the gaming from inflating the aggregate.
Handling domain blindspots: If one judge fails to detect a domain-specific error (producing a score of 5 when the correct score is 2), outlier trimming removes this outlier score.
Trimming threshold selection: The 20% threshold (one judge trimmed from each tail of a 5-judge jury) is a practical default. For larger jury sizes:
- 7 judges: trim 1 from each tail (≈14%)
- 10 judges: trim 2 from each tail (20%)
- 15 judges: trim 3 from each tail (20%)
For high-stakes evaluations where precision is critical, more conservative trimming (15%) preserves more data. For adversarial scenarios where gaming is suspected, more aggressive trimming (25–30%) provides stronger gaming resistance.
When not to trim: If all judges agree closely (range < 1 point on a 5-point scale), outlier trimming is unnecessary and may reduce the sample size without benefit. Check for high agreement before applying trimming.
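A minimal trimming sketch consistent with the thresholds above, including the skip-when-agreeing check:

```python
def trim_scores(scores: list, trim_frac: float = 0.20) -> list:
    """Drop trim_frac of scores from each tail before aggregation.
    Skips trimming entirely when judges already agree closely
    (range < 1 point on a 5-point scale)."""
    if max(scores) - min(scores) < 1.0:
        return sorted(scores)
    k = max(1, int(len(scores) * trim_frac))  # 5 judges -> 1 per tail; 10 -> 2
    ordered = sorted(scores)
    return ordered[k:-k]
```

With five judges at the default 20%, this keeps the middle three scores, matching the standard described above.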
Component 5: Score Aggregation
After outlier trimming, aggregate the remaining scores into a single score. Three aggregation methods:
Simple mean. Average the remaining scores. Fast, interpretable, mathematically simple. Appropriate when judge confidence scores are not available or not reliable, and when judges are approximately equal in quality.
Confidence-weighted mean. Weight each judge's score by its stated confidence. A judge that gives a score of 4 with confidence 0.9 is weighted more heavily than a judge that gives a score of 4 with confidence 0.6. This approach assumes judge confidence is well-calibrated — which must be verified.
Reliability-weighted mean. Weight each judge's score by its historical inter-rater reliability (how well it has agreed with human ground truth in calibration evaluations). Judges with higher calibrated reliability receive higher weights. This is the most principled aggregation method but requires maintaining historical calibration data.
For most production use cases, a confidence-weighted mean with periodic reliability re-calibration provides the best balance of precision and implementability.
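All three methods are weighted means with different weight sources, so one sketch covers them (passing no weights gives the simple mean; passing judge confidences or historical reliabilities gives the other two):

```python
def aggregate(scores, weights=None):
    """Weighted mean of trimmed judge scores.
    weights=None          -> simple mean
    weights=confidences   -> confidence-weighted mean (assumes calibrated confidence)
    weights=reliabilities -> reliability-weighted mean (from historical calibration)"""
    if weights is None:
        weights = [1.0] * len(scores)
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)
```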
Component 6: Disagreement Analysis
After aggregation, analyze the pattern of disagreement among judges:
Variance. High variance among judges (after outlier trimming) indicates genuine ambiguity in the evaluation task, a borderline case that falls near a threshold, or a case where different judges are bringing different knowledge to bear.
Systematic disagreement patterns. If the same judge consistently disagrees with the panel — reliably scoring higher or lower than the other judges — the judge may have a systematic bias in this dimension. This pattern should trigger investigation and potentially judge exclusion.
Cross-dimensional correlation. If judges are scoring multiple dimensions, unexpected correlation between dimensions (e.g., safety scores and accuracy scores are highly correlated in the same direction across all judges) may indicate a halo effect — judges are being influenced by their overall impression of the response rather than evaluating dimensions independently.
High-disagreement routing. Responses where judge scores are highly divergent (after trimming, remaining scores still span more than 2 points on a 5-point scale) should be flagged for human review. High disagreement often signals genuine evaluative difficulty — exactly the cases where automated evaluation is least reliable.
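A sketch of the routing check, computed on the post-trim scores (the 2-point span threshold follows the rule above):

```python
import statistics

def disagreement_report(trimmed_scores: list, span_threshold: float = 2.0) -> dict:
    """Post-trim disagreement diagnostics for a single evaluation."""
    span = max(trimmed_scores) - min(trimmed_scores)
    return {
        "variance": statistics.pvariance(trimmed_scores),
        "span": span,
        "route_to_human": span > span_threshold,  # high-disagreement routing
    }
```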
Calibration: Measuring Jury Quality
A jury that produces consistent scores is not necessarily producing accurate scores. Calibration measures both consistency (inter-rater reliability) and accuracy (agreement with ground truth).
Inter-Rater Reliability: Cohen's Kappa and Krippendorff's Alpha
Cohen's kappa (κ) measures the agreement between two raters, correcting for chance agreement:
κ = (P_o − P_e) / (1 − P_e)
where P_o is the observed agreement and P_e is the agreement expected by chance.
κ interpretation:
- κ > 0.8: Near-perfect agreement
- κ 0.6–0.8: Substantial agreement
- κ 0.4–0.6: Moderate agreement
- κ 0.2–0.4: Fair agreement
- κ < 0.2: Slight agreement (effectively random)
For a multi-judge jury, compute pairwise kappas between all judge pairs. A jury with average pairwise kappa above 0.6 has substantial agreement; below 0.4 is concerning.
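Pairwise kappas can be computed with scikit-learn; the ratings matrix below is illustrative, and quadratic weighting is one reasonable choice for an ordinal 1–5 scale:

```python
from itertools import combinations
import numpy as np
from sklearn.metrics import cohen_kappa_score

# One row per judge, one column per evaluated response (illustrative data).
ratings = np.array([
    [5, 4, 3, 5, 2, 4],
    [5, 4, 4, 5, 2, 4],
    [4, 4, 3, 5, 1, 3],
])

pairwise = {
    (a, b): cohen_kappa_score(ratings[a], ratings[b], weights="quadratic")
    for a, b in combinations(range(len(ratings)), 2)
}
print("mean pairwise kappa:", np.mean(list(pairwise.values())))
```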
Krippendorff's alpha (α) generalizes Cohen's kappa to N raters with ordinal, interval, or ratio-scale ratings. For a 5-judge jury on a 1–5 ordinal scale, Krippendorff's alpha is the appropriate reliability statistic.
α = 1 − D_o / D_e
where D_o is the observed disagreement and D_e is the disagreement expected by chance.
α interpretation follows the same scale as kappa. Target α ≥ 0.6 for production jury use; α < 0.4 indicates the jury needs redesign (prompt revision, judge replacement, or rubric clarification).
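The third-party krippendorff package computes alpha directly from a judges-by-responses matrix, with NaN marking missing ratings (data again illustrative):

```python
import numpy as np
import krippendorff  # third-party: pip install krippendorff

# One row per judge, one column per response; np.nan = judge did not rate.
reliability_data = np.array([
    [5, 4, 3, 5, 2, 4],
    [5, 4, 4, 5, 2, 4],
    [4, 4, 3, np.nan, 1, 3],
])

alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.2f}")  # target >= 0.6 for production use
```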
Calibration Against Ground Truth
Inter-rater reliability measures whether judges agree with each other. Calibration against ground truth measures whether the jury is accurate.
Ground truth calibration requires:
- Collect a calibration set of evaluated responses with known-correct scores (produced by domain expert human evaluation).
- Run the full jury process against the calibration set.
- Measure the jury's scores against the ground truth scores using mean absolute error (MAE) and correlation.
- Identify systematic biases (does the jury consistently over- or under-rate specific dimensions?).
- Adjust aggregation weights or judge selection based on calibration results.
Calibration should be conducted at jury deployment and repeated quarterly. Jury quality can drift as judge models are updated and as the distribution of evaluated responses shifts.
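A sketch of the measurement step; Spearman correlation is used here because the scores are ordinal:

```python
import numpy as np
from scipy.stats import spearmanr

def calibration_report(jury_scores, ground_truth) -> dict:
    """Compare aggregated jury scores to expert ground truth on the calibration set."""
    jury = np.asarray(jury_scores, dtype=float)
    truth = np.asarray(ground_truth, dtype=float)
    rho, _ = spearmanr(jury, truth)
    return {
        "mae": float(np.abs(jury - truth).mean()),
        "spearman_rho": float(rho),
        "mean_bias": float((jury - truth).mean()),  # > 0: jury systematically over-rates
    }
```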
Calibration Set Requirements
The calibration set must:
- Be representative of the production evaluation distribution (not biased toward easy or hard cases)
- Include a proportional share of borderline cases (scores near threshold boundaries)
- Cover all evaluation dimensions
- Be produced by human evaluators with documented expertise in the evaluation domain
- Be large enough to provide statistical power (minimum 200 cases per dimension for stable MAE estimates)
Maintaining calibration sets over time is an ongoing operational requirement. As the evaluated agent's behavioral profile evolves, the calibration set should be updated to remain representative.
Jury Governance
Meta-Evaluation of Jury Quality
Meta-evaluation asks: how good is the jury at evaluating agent behavior? This requires:
Calibration monitoring. Track MAE and correlation with ground truth on the calibration set over time. Deteriorating accuracy signals jury drift — the judge models may have been updated, the evaluated agent's distribution may have shifted, or the calibration set may have become stale.
Consistency monitoring. Track Krippendorff's alpha on production evaluations over time. Declining consistency signals evaluation task ambiguity that has increased as the evaluated agent's outputs evolved.
Coverage auditing. Periodically audit the distribution of production evaluation cases against the evaluation rubric's score distribution. If 95% of evaluations score 4 or 5, the rubric may have insufficient discrimination at the high end. If scores are uniformly distributed, the rubric may not be capturing the actual quality distribution.
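A sketch of the distribution check, using the 95% figure above as the default flag threshold:

```python
from collections import Counter

def coverage_audit(scores: list, top_share_threshold: float = 0.95) -> dict:
    """Flag rubrics that may lack discrimination at the high end of the scale."""
    counts = Counter(scores)
    n = len(scores)
    top_share = (counts[4] + counts[5]) / n
    return {
        "distribution": {s: counts[s] / n for s in sorted(counts)},
        "insufficient_high_end_discrimination": top_share >= top_share_threshold,
    }
```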
Judge Selection Governance
The selection of judges is a consequential decision that should have governance oversight:
Conflict of interest review. Judge models should not be selected by the organization whose agents are being evaluated — this creates an obvious conflict of interest. The jury selection should be performed by the evaluation platform (Armalo) or by an independent party.
Judge rotation. To prevent adversarial optimization against a known judge pool, rotate the specific judge models used in each evaluation cycle. Adversaries who know the current judge pool can game it; uncertainty about which judges will be used in the next cycle reduces the effectiveness of optimization.
Judge provider accountability. When a judge model from a specific provider shows systematic bias, the provider should be notified and the issue addressed. If the provider updates the model to correct the bias, the calibration should be re-run to verify the correction.
How Armalo Addresses This
Armalo's multi-LLM jury is the core behavioral evaluation infrastructure for the trust score's accuracy (14%), safety (11%), security (8%), and Metacal™ self-audit (9%) dimensions.
The jury uses judges from multiple providers — currently including Anthropic, OpenAI, and Google models — with provider-diverse selection for each evaluation cycle. Judge rotation is built into the evaluation infrastructure: the specific models used in each evaluation are selected from the judge pool with randomization to prevent adversarial targeting.
Outlier trimming at 20% is standard in the Armalo jury. The trimming is applied before aggregation, and the full judge scores (including trimmed outliers) are preserved in the evaluation record for forensic investigation.
Reliability-weighted aggregation is the default. Each judge's weight is calibrated against Armalo's maintained ground truth calibration set, updated quarterly. Judge weights are transparent in evaluation reports — deploying organizations can see which judges contributed most to their agent's evaluation score.
Inter-rater reliability is reported as part of each evaluation report: Krippendorff's alpha for the full jury and pairwise kappas for each judge pair. Low α triggers automatic review of the evaluation prompts and the evaluated responses, with human review for cases where jury disagreement is high.
Calibration against ground truth is conducted quarterly for all evaluation dimensions. The calibration MAE and correlation are public metrics published by Armalo, enabling deploying organizations and counterparties to assess the jury's accuracy for their specific use case.
Conclusion: The Jury as the Standard of Evaluative Fairness
The multi-LLM jury system is the evaluation analog of a diverse jury in legal proceedings: no single judge is trusted to reach the correct verdict alone; the collective judgment of diverse, independent evaluators is more reliable, more resistant to bias, and more legitimate than any individual's decision.
For AI agent trust infrastructure, the jury provides the independent behavioral assessment that self-evaluation cannot. An agent evaluated by a well-calibrated, diverse jury has earned its trust score through assessment that is resistant to gaming, corrected for systematic bias, and continuously re-calibrated against ground truth. This is the standard of evaluative fairness that the AI agent economy requires.
Key Takeaways:
- Multi-judge juries shrink expected systematic bias to ~1/√N of a single judge's for N judges with independent biases.
- Judge selection requires: cross-provider diversity, size diversity, domain specialization coverage, freshness.
- Evaluation prompts require: explicit rubrics, dimensional separation, no anchoring, calibration examples.
- Outlier trimming at 20% provides gaming resistance and systematic bias correction.
- Reliability-weighted aggregation improves accuracy over simple mean when judge calibration data is available.
- Krippendorff's alpha ≥ 0.6 is the target for production jury inter-rater reliability.
- Armalo's jury uses cross-provider judges with rotation, 20% outlier trimming, and quarterly calibration against human ground truth.