LLM Jury Evaluations
Multi-model jury panels, outlier trimming, calibration, and reading judgments.
LLM jury evaluation is the mechanism that lets Armalo score semantic properties of agent output — accuracy, tone, completeness, coherence — that deterministic checks can't measure. Setting it up correctly matters enormously. A poorly configured jury produces noise, not signal.
The Core Insight: Why Multiple Models
A single LLM judge has systematic biases:
- Length bias: LLMs tend to score longer outputs higher
- Position bias: Earlier-in-context responses tend to score higher
- Sycophancy bias: Judges sometimes score higher when the evaluated output sounds confident
- Provider-specific bias: Anthropic models may favor Claude-like responses; OpenAI models may favor GPT-like responses
The solution is not to remove bias (impossible) but to cancel it out: use multiple models from different providers, score independently, and trim outliers.
Standard jury configuration:
| Panel Size | When to Use | Trimming |
|---|---|---|
| 3 judges | Low-stakes conditions, development/testing | None |
| 5 judges | Production standard | Remove top 1, bottom 1 |
| 7 judges | High-stakes conditions (safety, accuracy) | Remove top 2, bottom 2 |
An example 5-judge panel: Claude Sonnet + GPT-4o + Gemini Pro + Llama 3 (via API) + a second Claude instance with a different temperature setting. Provider diversity matters more than the number of instances.
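One way to represent such a panel in code. This is a sketch only: the `JudgeConfig` shape and the model identifiers are illustrative placeholders, not exact API model strings or part of any SDK.

```typescript
// Illustrative panel definition. The JudgeConfig interface and model
// names below are assumptions for the sketch, not real API identifiers.
interface JudgeConfig {
  provider: 'anthropic' | 'openai' | 'google' | 'meta';
  model: string;
  temperature: number;
}

const productionPanel: JudgeConfig[] = [
  { provider: 'anthropic', model: 'claude-sonnet', temperature: 0.0 },
  { provider: 'openai', model: 'gpt-4o', temperature: 0.0 },
  { provider: 'google', model: 'gemini-pro', temperature: 0.0 },
  { provider: 'meta', model: 'llama-3', temperature: 0.0 },
  // Second Claude instance at a different temperature
  { provider: 'anthropic', model: 'claude-sonnet', temperature: 0.7 },
];

// Diversity check: a 5-judge panel should span several distinct providers.
const distinctProviders = new Set(productionPanel.map((j) => j.provider)).size;
```

The diversity check is the part worth automating: it is easy to accidentally configure five instances of the same model family, which defeats the bias-cancellation goal.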
Writing Evaluation Rubrics
The rubric is the most important part of jury configuration. A vague rubric produces scores that cluster in the 65–75 range regardless of output quality. A precise rubric with anchor descriptions produces discriminating scores across the full range.
Structure of a good rubric:
```text
[CONDITION STATEMENT]

Evaluate the agent's response on the following scale:

90–100: [Specific description of what excellent looks like]
70–89: [Specific description of what good-but-not-excellent looks like]
50–69: [Specific description of what mediocre looks like]
30–49: [Specific description of what poor looks like]
0–29: [Specific description of what failure looks like]

Respond with a JSON object: {"score": <integer 0-100>, "reasoning": "<1-3 sentences>"}
```
Bad rubric example:
```text
Rate the quality of this customer service response on a scale of 0–100.
```
This produces scores that cluster at 65–75 because "quality" is undefined. Judges default to "average" when they have no anchor.
Good rubric example (for billing inquiry completeness):
```text
EVALUATION CONDITION:
When a customer asks about their subscription plan, the agent must include
all four required fields: plan name, billing cycle, next payment date, and
current monthly cost.

RUBRIC:
90–100: All four fields present, accurate, and clearly stated. No unnecessary
        information that could confuse the customer.
70–89:  Three of four fields present. The missing field is minor (e.g., exact
        next payment date vs. "early next month").
50–69:  Two fields present. Core information (plan name and cost) may be present
        but cycle and date are missing.
30–49:  Only one field present. The customer would need to ask follow-up questions
        to get basic information.
0–29:   No required fields present, or the agent provided incorrect information
        that would mislead the customer.

Evaluate only field completeness and accuracy, not writing style or tone.

CUSTOMER QUERY: {input}
AGENT RESPONSE: {output}
REFERENCE RESPONSE (ideal): {reference}

Respond with: {"score": <integer 0-100>, "reasoning": "<your assessment>"}
```
Note the instruction at the end: "Evaluate only field completeness and accuracy, not writing style or tone." This scopes the judge to the specific condition and prevents style preferences from contaminating the accuracy score.
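Filling the rubric template's `{input}`, `{output}`, and `{reference}` placeholders for a specific test case can be as simple as string substitution. The `fillRubric` helper below is hypothetical, not part of any framework:

```typescript
// Hypothetical helper: substitutes the concrete values for one test case
// into a rubric template's {input}, {output}, and {reference} placeholders.
function fillRubric(
  template: string,
  values: { input: string; output: string; reference?: string },
): string {
  return template
    .replace('{input}', values.input)
    .replace('{output}', values.output)
    .replace('{reference}', values.reference ?? '(no reference provided)');
}

const rubricTemplate =
  'CUSTOMER QUERY: {input}\nAGENT RESPONSE: {output}\nREFERENCE RESPONSE (ideal): {reference}';

const filled = fillRubric(rubricTemplate, {
  input: 'What plan am I on?',
  output: 'You are on the Pro plan, billed monthly at $20.',
});
```

Note the explicit fallback for a missing reference: sending a literal `{reference}` placeholder to a judge invites it to comment on the placeholder instead of the output.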
The Judge Message Structure
Every jury call sends two messages:
```typescript
const juryMessages = [
  {
    role: 'system',
    content: rubric, // The evaluation rubric
  },
  {
    role: 'user',
    content: `
EVALUATE THE FOLLOWING:

INPUT (what the user sent to the agent):
${agentInput}

OUTPUT (what the agent responded):
${agentOutput}

${referenceOutput ? `REFERENCE OUTPUT (what an ideal response looks like):
${referenceOutput}` : ''}

Provide your evaluation as JSON only: {"score": <integer>, "reasoning": "<string>"}
`.trim(),
  },
];
```
Why separate system and user messages:
- System message: evaluation rubric (cached across calls, reused for all test cases)
- User message: the specific input/output pair being evaluated
This structure enables prompt caching: the rubric prefix can be cached by the provider after the first call. Across 100 jury calls with the same rubric, you pay full price for the rubric tokens once; subsequent calls read them from cache at a substantially reduced rate.
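With Anthropic's Messages API, for example, the rubric can be marked cacheable with a `cache_control` block on the system prompt. The request shape below follows Anthropic's prompt-caching documentation at the time of writing; the model id is a placeholder, and you should verify the field names against the current API reference before relying on them:

```typescript
// Sketch of an Anthropic-style request body that marks the shared rubric
// as a cacheable prefix. Field names per Anthropic's prompt-caching docs;
// model id is a placeholder.
const rubric = 'RUBRIC: ...'; // the shared evaluation rubric (abbreviated here)
const evaluationPayload = 'EVALUATE THE FOLLOWING: ...'; // per-test-case pair

const requestBody = {
  model: 'claude-sonnet-latest', // placeholder model id
  max_tokens: 256,
  system: [
    {
      type: 'text',
      text: rubric, // identical across all jury calls in a run
      cache_control: { type: 'ephemeral' }, // cache this prefix for reuse
    },
  ],
  messages: [{ role: 'user', content: evaluationPayload }],
};
```

Because only the user message varies between calls, the cacheable prefix stays byte-identical, which is what makes the cache hit.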
Parsing and Validating Judge Responses
Judges sometimes produce malformed JSON, markdown-wrapped JSON, or natural language explanations instead of JSON. Handle all of these:
````typescript
function parseJudgeResponse(raw: string): { score: number; reasoning: string } | null {
  // Attempt 1: direct parse
  try {
    const parsed = JSON.parse(raw);
    if (typeof parsed.score === 'number' && typeof parsed.reasoning === 'string') {
      return { score: Math.min(100, Math.max(0, Math.round(parsed.score))), reasoning: parsed.reasoning };
    }
  } catch {}

  // Attempt 2: extract JSON from a markdown code block
  const codeBlockMatch = raw.match(/```(?:json)?\s*(\{[\s\S]*?\})\s*```/);
  if (codeBlockMatch) {
    try {
      const parsed = JSON.parse(codeBlockMatch[1]);
      if (typeof parsed.score === 'number') {
        return { score: Math.min(100, Math.max(0, Math.round(parsed.score))), reasoning: parsed.reasoning ?? '' };
      }
    } catch {}
  }

  // Attempt 3: extract score from natural language
  const scoreMatch = raw.match(/\bscore[:\s]+(\d{1,3})\b/i);
  const reasoningMatch = raw.match(/reasoning[:\s]+(.+?)(?:\n|$)/i);
  if (scoreMatch) {
    return {
      score: Math.min(100, Math.max(0, parseInt(scoreMatch[1], 10))),
      reasoning: reasoningMatch?.[1]?.trim() ?? raw.substring(0, 200),
    };
  }

  return null; // Unparseable — treat as a failed judge call, not a zero score
}
````
Critical: A failed judge call is not a zero score. It's missing data. Remove it from the panel and recalculate with the remaining judges. Only treat it as a zero if all judges fail.
Outlier Trimming
After all judges respond:
```typescript
function trimmedMean(scores: number[], panelSize: number): number {
  const sorted = [...scores].sort((a, b) => a - b);
  const trimCount = panelSize >= 7 ? 2 : panelSize >= 5 ? 1 : 0;
  const trimmed = sorted.slice(trimCount, sorted.length - trimCount);
  return trimmed.reduce((sum, s) => sum + s, 0) / trimmed.length;
}

// Usage
const scores = [72, 68, 85, 70, 74]; // 5 judges
const finalScore = trimmedMean(scores, 5); // Removes 68 and 85, averages [70, 72, 74] → 72
```
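Combining parsing and trimming, a full-panel aggregation also has to handle failed judge calls: drop the nulls first, then trim based on how many judges actually responded, not the nominal panel size. The `aggregatePanel` helper below is a sketch of that policy:

```typescript
// Sketch: aggregates a panel's parsed responses. Unparseable responses
// arrive as null and are dropped; trimming depth is chosen from the
// number of judges that actually responded.
function aggregatePanel(results: ({ score: number } | null)[]): number {
  const scores = results
    .filter((r): r is { score: number } => r !== null)
    .map((r) => r.score);
  if (scores.length === 0) return 0; // only when ALL judges fail does the case score zero
  const sorted = [...scores].sort((a, b) => a - b);
  const trimCount = sorted.length >= 7 ? 2 : sorted.length >= 5 ? 1 : 0;
  const trimmed = sorted.slice(trimCount, sorted.length - trimCount);
  return trimmed.reduce((sum, s) => sum + s, 0) / trimmed.length;
}

// 5 judges configured, 1 failed: only 4 scores remain, so no trimming applies.
const panelScore = aggregatePanel([{ score: 72 }, null, { score: 85 }, { score: 70 }, { score: 74 }]);
```

Note that a 5-judge panel with one failure degrades to an untrimmed 4-judge mean; if failures are frequent, fix the parse rate before trusting the scores.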
Calibrating Your Jury
Before using jury scores for production scoring, calibrate the panel:
- Run 20 test cases manually where you know the expected quality (5 excellent, 5 good, 5 mediocre, 5 poor)
- Compare jury scores to your ground truth
- If scores cluster in the wrong range: adjust the rubric anchor descriptions, not the scores
- Measure inter-judge agreement (standard deviation across judges per test case). High variance (σ > 15) means the rubric is too vague
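The inter-judge agreement check from the last step can be sketched as follows, with σ computed as the population standard deviation of one test case's judge scores (the function names are illustrative):

```typescript
// Population standard deviation of one test case's judge scores.
function judgeStdDev(scores: number[]): number {
  const mean = scores.reduce((s, x) => s + x, 0) / scores.length;
  const variance = scores.reduce((s, x) => s + (x - mean) ** 2, 0) / scores.length;
  return Math.sqrt(variance);
}

// Flags the test cases where judges disagree too much (σ > 15 by default),
// which usually means the rubric anchors are too vague for that case.
function flagVagueRubric(perCaseScores: number[][], threshold = 15): number[] {
  return perCaseScores
    .map((scores, i) => ({ i, sigma: judgeStdDev(scores) }))
    .filter(({ sigma }) => sigma > threshold)
    .map(({ i }) => i);
}
```

Run this over the 20 calibration cases: a handful of flagged cases points at ambiguous inputs, while flags across most cases points at the rubric itself.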
A well-calibrated jury should produce:
- Excellent inputs → scores of 88–96
- Good inputs → scores of 72–82
- Mediocre inputs → scores of 55–68
- Poor inputs → scores of 30–50
If your jury is producing 65–75 for everything, the rubric anchor descriptions need to be sharper.
Reading Jury Results for Score Improvement
Each jury run produces reasoning alongside the score. The reasoning is where improvement signal lives.
What to look for in judge reasoning:
Pattern: multiple judges cite the same specific issue → that's a real problem, not noise.
Example reasoning from three judges on a scope compliance test:
- Judge 1: "Agent correctly declined but provided no redirect to support resources (−18 pts)"
- Judge 2: "Refusal present but no next step provided. Customer would be stuck. (−15 pts)"
- Judge 3: "Explicit decline — good. No escalation path offered though. (−12 pts)"
All three cite the same issue: refusal without redirect. That's your fix target. The deterministic redirect check would have caught this too — but the jury reasoning explains why it matters.
Pattern of confused judges: if judges disagree significantly (one gives 80, another gives 40) and their reasoning points in different directions, the condition may be ambiguous. Rewrite the rubric with clearer anchor descriptions before the next eval run.
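A cheap numeric proxy for the confused-judges pattern is the spread between the highest and lowest raw (pre-trim) scores. The 40-point threshold below is an assumption to tune against your own panel, not a standard value:

```typescript
// Flags a judgment as ambiguous when the raw scores span a wide range,
// e.g. one judge at 80 and another at 40. Threshold is an assumption.
function isAmbiguousJudgment(scores: number[], maxSpread = 40): boolean {
  return Math.max(...scores) - Math.min(...scores) >= maxSpread;
}
```

Flagged cases are the ones whose judge reasoning is worth reading side by side before deciding whether to rewrite the rubric.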
In the final lesson, we'll look at how to translate eval results into targeted score improvements — the iteration loop that turns evaluation from a measurement into a growth engine.