The Ground Truth Problem: How Multi-LLM Jury Approximates Truth When None Exists
"Was this customer support answer good?" has no ground truth. A multi-LLM jury approximates it via consensus. An epistemological essay on when consensus approximates truth.
TL;DR
Most agent evaluation problems have no ground truth. Was this customer support answer good? Was this code review insightful? Was this medical summary helpful? There is no oracle to consult. A multi-LLM jury approximates ground truth through consensus, but the approximation quality varies enormously depending on the structure of the underlying question. This essay is the epistemological treatment: when consensus closely approximates truth, when it diverges from truth, when it diverges in predictable ways that can be corrected, and when it diverges in ways that should make us doubt the eval entirely. The reader artifact is the Truth Approximation Quality Score, a methodology for grading how well a multi-LLM jury can be expected to track truth on a given task class.
The Question Without An Answer
A model lab releases a benchmark for customer support agents. The benchmark has two thousand example conversations between simulated customers and agents being evaluated. Each conversation is rated by a panel of human reviewers on helpfulness, accuracy, tone, and appropriateness. The panel agreement is moderate, with pairwise Cohen's kappa averaging in the 0.55 range across reviewers, which is unremarkable but defensible for subjective judgments. Agents are scored against the panel's mean rating. The benchmark becomes widely cited.
A year later, a researcher does the obvious replication: she runs the same two thousand conversations past a fresh panel of human reviewers, drawn from a similar demographic, with the same training. The new panel's mean rating differs from the original panel's mean rating by an amount that, when applied to the leaderboard, shuffles the rankings substantially. Agents that ranked first under the original panel rank fifth under the new panel. Agents that ranked tenth rank third. The benchmark is, in some operational sense, measuring something. But what it is measuring is at least as much about the specific reviewers who happened to be on the panel as it is about the agents being scored. The replication exposes the dependency on the evaluator panel that the leaderboard's single number had hidden.
This is the ground truth problem. For a large class of evaluation tasks that matter to the agent economy, there is no oracle. There is no Platonic answer key against which to compare the agent's output. The best we have is consensus among reasonable judges, and consensus is itself a function of which judges we picked. If we pick a different panel, we get a different consensus. The single number the benchmark publishes is a snapshot of one specific consensus that happened to be measured, not a measurement of an underlying truth that exists independently of the measurement.
The ground truth problem is not new. It exists in every domain where evaluation depends on subjective judgment: art criticism, food critic ratings, peer review of academic papers, journalism awards, customer satisfaction surveys. The traditional response across these domains is to acknowledge the inherent inter-judge variance, build redundancy through multi-judge panels, and treat the consensus as a useful summary rather than a truth claim. The agent economy needs to do the same thing, only at much greater scale and with much less mature methodology.
The specific challenge for agent evaluation is that we are increasingly using LLM judges instead of human judges. The reasons are obvious: scale, cost, consistency, speed. The consequences are more subtle. LLM judges have all the inter-judge variance properties that human judges have, plus additional failure modes specific to language models, plus systematic biases that may correlate across LLMs in ways that human reviewer biases do not. A multi-LLM jury can produce a consensus, and the consensus can be useful, but the consensus is not truth. It is a constructed approximation of truth whose quality depends on properties of the underlying question and properties of the jury composition.
This essay is the systematic treatment of when LLM jury consensus approximates truth well, when it approximates truth badly, and when the gap between consensus and truth is so large that the eval should not be trusted at all. We will build a taxonomy of question structure, identify the conditions under which consensus is informative, build the Truth Approximation Quality Score as a way to grade evaluation tasks, and close with operational guidance for eval system designers and consumers about how to use jury-based evaluations responsibly.
The stakes are high. If we treat jury consensus as ground truth when it is not, we make procurement decisions on illusory evidence. If we refuse to use jury consensus because it is not ground truth, we lose the ability to evaluate the large class of agent tasks that matter most. The right move is the middle path: use jury consensus where it works, recognize where it does not, and design eval systems that distinguish the two cases explicitly rather than burying the distinction under an unjustified single number.
A Taxonomy Of Question Structure
Not all evaluation questions have the same epistemological structure. Some have ground truth that is checkable. Some have ground truth that is in principle checkable but operationally inaccessible. Some have no ground truth but have a stable consensus. Some have no ground truth and no stable consensus. The eval methodology that works for one structure does not necessarily work for another, and conflating them produces misleading scores. The taxonomy that follows distinguishes four types and is the foundation for the rest of this essay.
Type one is verifiable-truth questions. These are questions where the answer is either right or wrong, and the rightness can be checked against an external referent. Mathematical correctness, code execution outcomes, factual claims that can be verified against authoritative sources, structured data extraction whose output can be compared to ground-truth records. For these questions, jury consensus is unnecessary; you can check the answer directly. The role of LLM judges in these questions is at most as a high-throughput replacement for direct verification, with the verification result still serving as the gold standard. Jury disagreement on verifiable-truth questions is itself a signal of jury malfunction, since the truth exists and well-functioning judges should converge on it.
Type two is convergent-judgment questions. These are questions where ground truth does not exist as a single fact but where reasonable judges, given enough deliberation, will converge on a stable answer. Whether a piece of code follows a particular style guide, whether a document violates a particular policy, whether a translation preserves the meaning of the source. These questions have what we can call deliberative ground truth: not an oracle, but a stable equilibrium that careful deliberation tends to reach. For these questions, jury consensus is a useful proxy for the deliberative ground truth, and jury agreement is high when the eval is well-constructed. The quality of the consensus depends on whether the jury's individual judges have the relevant competence and whether the questions are framed precisely enough to allow convergence.
Type three is divergent-judgment questions. These are questions where reasonable judges disagree even after deliberation, and the disagreement reflects a genuine plurality of legitimate views. Whether a piece of writing has good tone for its intended audience, whether a customer support response is empathetic enough, whether a creative work is original. For these questions, there is no consensus to be approximated; there is only a distribution of judgments, with mean and variance that themselves carry information. Reporting a single consensus number for these questions is a category error; the right output is the distribution itself, perhaps summarized by a mean and a confidence interval, not a number that pretends to be a measurement of an underlying truth.
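A minimal sketch of what "report the distribution" could look like for a type-three question. The function name and output fields are illustrative assumptions, and the interval uses a plain normal approximation, which is rough for small panels.

```python
import statistics
from math import sqrt

def summarize_distribution(ratings):
    """Summarize divergent-judgment ratings as a distribution, not a verdict.

    `ratings` is a list of numeric judge ratings for one item.
    The 95% interval is a normal approximation for the mean (an assumption;
    small panels would call for a t-interval or a bootstrap instead).
    """
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two ratings to describe a distribution")
    mean = statistics.mean(ratings)
    stdev = statistics.stdev(ratings)      # sample standard deviation
    half_width = 1.96 * stdev / sqrt(n)    # normal-approximation half width
    return {
        "n": n,
        "mean": mean,
        "stdev": stdev,
        "ci95": (mean - half_width, mean + half_width),
        "min": min(ratings),
        "max": max(ratings),
    }

summary = summarize_distribution([2, 4, 4, 4, 6])
```

The point of returning the spread alongside the mean is that a consumer can see when a mean of 4 comes from a tight cluster and when it comes from a split panel.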
Type four is contextual-judgment questions. These are questions where the right answer depends on context that the eval may not have access to. Whether a particular agent action was appropriate depends on the operator's policies, the customer's prior history, the regulatory environment, and many other contextual factors. For these questions, evaluation by judges who lack the context produces verdicts that are systematically biased in ways the judges may not notice. The right response is either to reject the question as out of scope for the eval, or to provide the context to the judges in a structured way that makes the judgment auditable, or to evaluate against a synthetic context whose properties are explicit and whose conclusions are clearly limited to that synthetic context.
Most agent evaluation tasks span multiple types. A customer support evaluation has type-one questions about whether the agent followed required scripts, type-two questions about whether the agent applied policy correctly, type-three questions about whether the agent's tone was appropriate, and type-four questions about whether the agent's response was right for the specific customer. A well-designed eval decomposes the overall task into its constituent question types and applies the methodology appropriate to each. A poorly-designed eval lumps them together and produces a single number that hides the variance across types.
The operational consequence is that any composite score for a task that includes type-three or type-four questions has irreducible noise from the inter-judge variance on those components, no matter how well-constructed the jury. The right design is to disclose this noise explicitly in the score, perhaps as a confidence interval, rather than pretending the score is a point estimate. Eval systems that report only point estimates for composite scores are over-claiming precision, and the over-claim is most severe on tasks with substantial type-three or type-four components.
This taxonomy will reappear throughout the essay. The Truth Approximation Quality Score introduced later is essentially a methodology for grading how much of an evaluation task falls into each type, and adjusting the credibility of the resulting score accordingly. Eval systems that adopt the taxonomy will be more honest about what their scores mean. Eval systems that ignore it will continue to produce scores that look more authoritative than they are.
Why Multi-LLM Beats Single-LLM
The basic case for multi-LLM jury over single-LLM judge rests on three properties: variance reduction, bias decorrelation, and disagreement signal. Each is worth understanding precisely, because each can fail under specific conditions, and a jury system that takes the properties for granted ends up with the same problems as single-judge systems.
Variance reduction is the simplest. Any single LLM judge produces verdicts with sampling variance, conditioning variance from prompt phrasing, and stochastic variance from temperature settings. Averaging across multiple judges shrinks the standard deviation of this noise approximately as the inverse square root of the panel size, the same way that averaging multiple noisy measurements reduces measurement noise; the variance itself shrinks as the inverse of the panel size. A panel of five judges reduces the standard deviation to approximately 0.45 of the single-judge value (variance to 0.20). A panel of ten reduces it to approximately 0.32 (variance to 0.10). The variance reduction benefit is real and measurable, and it is the main reason multi-LLM panels are preferred for any eval that will be reported as a quantitative score.
The variance reduction benefit assumes that the variance across judges is approximately independent. If the judges share systematic biases, the variance reduction is much smaller, because the shared component does not cancel under averaging. This is the structural risk of multi-LLM panels: if all the judges are derived from similar training data, similar architectures, and similar fine-tuning regimes, they may share biases that no amount of averaging can remove. Variance reduction in the technical sense does not protect against systematic bias in the substantive sense.
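The effect of shared bias on averaging can be made concrete with the textbook equal-correlation model: if every judge has the same noise variance and every pair of judges has correlation rho, the variance of the panel mean is the single-judge variance times (1 + (n - 1) * rho) / n. The sketch below computes that ratio; the equal-variance, equal-correlation setup is a simplifying assumption, not a description of any real jury.

```python
from math import sqrt

def panel_variance_ratio(n_judges, rho=0.0):
    """Variance of the panel mean relative to single-judge variance.

    Assumes equal per-judge variance and a common pairwise correlation `rho`
    (a textbook simplification). With rho = 0 the ratio is 1 / n.
    """
    if n_judges < 1:
        raise ValueError("n_judges must be at least 1")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho is expected in [0, 1] for this sketch")
    return (1 + (n_judges - 1) * rho) / n_judges

independent = panel_variance_ratio(5)          # 1/5 of single-judge variance
independent_sd = sqrt(independent)             # about 0.447 of single-judge std dev
correlated = panel_variance_ratio(5, rho=0.5)  # 3/5: shared noise does not cancel
```

As the panel grows, the ratio approaches rho rather than zero, which is the formal version of the claim that the shared component survives averaging.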
Bias decorrelation is the second property and the more important one. The hope of multi-LLM panels is that different judges have different biases, and that the biases partially cancel when judgments are averaged. A judge from Provider A may systematically over-rate verbose responses; a judge from Provider B may systematically under-rate them. Averaging these judges produces a verdict that is less biased than either alone. The empirical evidence for bias decorrelation is mixed. Some bias dimensions decorrelate well across providers, including stylistic preferences and length biases. Other bias dimensions correlate strongly across providers, including biases toward particular framings of arguments, biases against particular cultural references, and biases toward particular structural conventions in responses. The correlated biases survive averaging and remain in the final score.
The operational implication is that jury composition matters in a substantive way. A panel of five judges all derived from one provider's model family is much less effective at bias decorrelation than a panel of five judges drawn from five different providers, even if the per-judge accuracy is similar. Eval systems that rotate judges across providers, fine-tuning regimes, and training data sources are doing better bias decorrelation than systems that draw all judges from a homogeneous source. The bias-decorrelation argument is the strongest practical reason to prefer multi-provider juries even when single-provider juries would be operationally simpler.
Disagreement signal is the third property and the one most underused. When jury members disagree on a verdict, the disagreement is information. It signals that the question is harder, more subjective, or more context-dependent than the questions where the jury agrees. A well-designed eval system surfaces the disagreement, treats it as a separate output alongside the consensus, and uses it to calibrate confidence in the score. A poorly-designed system trims the outliers, averages the survivors, and reports a number that pretends the disagreement did not exist. The trimming approach loses information; the surfacing approach preserves it.
Armalo's jury system trims the top and bottom twenty percent of jury verdicts to reduce single-judge gaming and outlier influence on the final score, but the disagreement structure of the full panel is retained and reported separately as a verdict-variance metric. This is the right design for the operational use case: the trimmed mean is the operational score, the variance is the calibration signal, both are published, and consumers can decide how much weight to give the score based on the variance. Eval systems that publish only the trimmed mean are throwing away information that operators need to make sound decisions.
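As an illustration of the trim-and-report-variance pattern, here is a minimal sketch. It is not Armalo's implementation: the function name, the output fields, the choice of population variance, and the rounding rule for panels whose size is not a multiple of five are all assumptions made for the example.

```python
import statistics

def trimmed_verdict(scores, trim_fraction=0.2):
    """Return a trimmed-mean score plus the variance of the FULL panel.

    Drops the lowest and highest `trim_fraction` of verdicts before averaging,
    but computes the variance on every verdict so disagreement is preserved.
    """
    if not scores:
        raise ValueError("need at least one verdict")
    if not 0.0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(scores)
    # Number of verdicts dropped per side; rounds down (an assumption).
    k = int(len(ordered) * trim_fraction + 1e-9)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return {
        "score": statistics.mean(kept),                  # operational score
        "verdict_variance": statistics.pvariance(ordered),  # calibration signal
        "dropped": 2 * k,
    }

result = trimmed_verdict([10, 70, 72, 74, 100])
```

In this example the two extreme verdicts do not move the score, yet they still inflate the reported variance, which is the signal a consumer would use to discount the score.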
The combined argument for multi-LLM panels is that they reduce variance, partially decorrelate bias, and produce a disagreement signal that calibrates confidence in the final score. All three properties depend on the jury being constructed thoughtfully, with judges drawn from diverse sources and disagreement being preserved rather than averaged away. A jury system that gets these design choices wrong is not much better than single-judge eval, and in some respects it is worse, because it carries the false authority of multi-judge methodology without delivering the substance.
When Consensus Approximates Truth Well
There are specific structural conditions under which multi-LLM jury consensus is a high-quality approximation of underlying truth. Recognizing these conditions is the basis for choosing where to apply jury-based evaluation and where to use other methods. The conditions are well-bounded question structure, broad jury competence on the question domain, accessible context, low subjective load, and structural agreement validation.
Well-bounded question structure means that the question has a defined answer space and the criteria for evaluating answers are explicit. "Does this code pass the unit tests" has a defined answer space (pass or fail) and explicit criteria. "Is this customer support response good" does not. The bounding can come from the question structure itself, as in pass/fail tests, or from carefully written rubrics that specify what dimensions to evaluate and how to score each. Well-bounded questions produce high jury agreement because the judges have a shared frame for the evaluation. Poorly-bounded questions produce low agreement because each judge implicitly applies their own frame.
Broad jury competence means that the judges, individually, are competent on the question domain. A jury of LLM judges asked to evaluate code quality should consist of models that are themselves capable code-quality evaluators, not models whose code training is shallow. Jury competence is sometimes the dominant factor in jury quality. A jury of competent judges produces better consensus than a larger jury of less competent judges, and the relationship is not linear; below a competence threshold, jury verdicts are dominated by noise rather than signal, and adding more incompetent judges does not help.
Accessible context means that the judges have the information they need to evaluate the question. If the question depends on an operator's policy, the policy needs to be in the prompt. If it depends on prior conversation history, the history needs to be in the prompt. If it depends on regulatory requirements, those need to be in the prompt or in retrievable supporting material. Judges asked to evaluate questions where critical context is missing produce verdicts based on what they can infer about the missing context, which introduces error and disagreement. Eval systems that systematically provide complete context get higher-quality consensus than systems that rely on judges to fill in gaps.
Low subjective load means that the question's answer depends primarily on objective or convergent factors, not on subjective preferences that vary across reasonable judges. Code correctness has low subjective load. Code style has higher subjective load. Code elegance has very high subjective load. Eval systems that target low-subjective-load questions get jury consensus that closely approximates truth. Eval systems that target high-subjective-load questions get jury consensus that approximates the average of judge preferences, which is not the same thing.
Structural agreement validation means that the eval system periodically checks whether jury consensus tracks an external referent on the subset of questions where such a referent exists. For verifiable-truth questions, the jury verdict can be compared to the verified answer; high agreement is a positive signal about the jury, low agreement is a warning. For deliberative-ground-truth questions, the jury verdict can be compared to a panel of human expert reviewers on a subsample; agreement above a threshold validates the jury for that question class. Without periodic validation against external referents, jury consensus drifts, and eval systems can become elaborately self-consistent without being externally meaningful.
When all five conditions hold, multi-LLM jury consensus is a high-quality approximation of underlying truth, and scores produced by such juries can be used with confidence in operational decisions. When some conditions fail, the score quality degrades in proportion to how many fail and how badly. When most fail, the score becomes more decorative than informative, and operators using such scores for procurement decisions are flying blind.
The operational discipline that follows is to grade each evaluation task against these five conditions before deciding what kind of eval methodology to use. Tasks that hit all five get jury-based scoring with confidence. Tasks that miss some get jury-based scoring with appropriate caveats and confidence intervals. Tasks that miss most get either a different methodology or an honest acknowledgment that the eval cannot produce a credible score for that task. This grading is the operational instantiation of epistemic humility about what jury-based eval can and cannot do, and it is the foundation of the Truth Approximation Quality Score.
When Consensus Diverges From Truth
There are also specific structural conditions under which multi-LLM jury consensus diverges systematically from truth. Recognizing these conditions is at least as important as recognizing the conditions under which consensus works, because the divergence cases are where naive use of jury-based eval produces the worst operational damage. The divergence cases include shared bias dominance, capability ceiling, contextual missingness, adversarial framing, and jury mode collapse.
Shared bias dominance is the case where all the judges share a bias that the question structure happens to amplify. If all judges over-rate verbose responses and the question is whether a response is well-written, the jury will systematically score verbose responses as better-written than concise ones, even if the convergent judgment of human experts would be the opposite. Shared biases survive averaging, and they survive the trim of outliers, and they survive jury rotation if the new judges share the bias. The defense is jury construction that draws judges from genuinely diverse sources and periodic external validation against human experts who do not share the bias structure. Without these defenses, shared bias dominance produces scores that are stable, internally consistent, and systematically wrong.
Capability ceiling is the case where the question requires capability the judges do not have. A jury asked to evaluate the correctness of advanced mathematical proofs cannot produce credible verdicts if the judges are not themselves competent mathematicians. The verdicts they produce will be either uniformly favorable, because they cannot find errors, or randomly varied, because they are guessing. The eval system that runs this jury will produce a score, the score will look like a score, but the score is not measuring proof correctness. It is measuring something closer to surface plausibility, which is a different thing entirely. The defense is to recognize the capability ceiling for each judge and to either restrict the eval to questions within the ceiling or to augment the jury with specialist judges for questions above the ceiling.
Contextual missingness is the case where the question depends on context that the eval did not provide and the judges cannot infer. A jury asked whether an agent's response was appropriate for a specific customer cannot produce a credible verdict without knowing the customer's history, preferences, and prior interactions. The verdicts will be biased toward general appropriateness, which may differ substantially from situational appropriateness. The defense is to either provide the missing context to the judges, restrict the eval to questions that do not depend on the missing context, or honestly label the eval as evaluating general appropriateness rather than situational appropriateness. The label change is itself a form of intellectual honesty: the eval is doing what it can with the information available, and it should report what it is actually measuring.
Adversarial framing is the case where the question is structured in a way that systematically biases judges toward a particular answer. "Is this customer support response excellent" biases the judges toward a higher rating than "is this customer support response adequate." "Was the agent helpful" biases higher than "did the agent fail to be helpful." Question framing matters substantially in jury verdicts, and adversarial framing can be used to manipulate scores upward or downward by ten or twenty points without changing the substance of what is being asked. The defense is question-framing standardization, with multiple framings used in parallel to detect framing-induced variance, and explicit framing audits as part of the eval methodology review.
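Running several framings in parallel could be checked along the following lines. This is a sketch under assumptions: the framing labels, the input shape, and the use of the gap between the highest and lowest framing mean as the summary statistic are all illustrative choices.

```python
import statistics

def framing_spread(scores_by_framing):
    """Compare mean jury scores across framings of the same question.

    `scores_by_framing` maps a framing label to the list of jury scores
    collected under that framing for the same underlying responses.
    Returns per-framing means and the gap between the highest and lowest.
    """
    if len(scores_by_framing) < 2:
        raise ValueError("need at least two framings to measure a spread")
    means = {label: statistics.mean(s) for label, s in scores_by_framing.items()}
    spread = max(means.values()) - min(means.values())
    return means, spread

means, spread = framing_spread({
    "is this response excellent": [80, 84],
    "is this response adequate": [60, 64],
})
```

A large spread on identical responses indicates that the framing, not the substance, is driving part of the score.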
Jury mode collapse is the most subtle and the most dangerous. It occurs when the jury, despite being composed of multiple judges, produces verdicts that are functionally equivalent to a single judge's verdict, because the judges have converged on a shared verdict pattern. This can happen because the judges share training data, share fine-tuning regimes, or have been exposed to similar evaluation contexts that have caused them to learn similar verdict-generation patterns. When jury mode collapse occurs, the multi-judge methodology becomes theatrical: the appearance of multi-judge evaluation without the substance. The defense is periodic measurement of inter-judge agreement on diverse question sets, with high agreement being a yellow flag that warrants investigation rather than a green flag of jury quality. Counterintuitively, very high jury agreement can be a sign of mode collapse rather than a sign of evaluation reliability.
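One simple way to measure inter-judge agreement is the mean pairwise match rate over a shared question set, sketched below. The 0.95 flag threshold is a hypothetical placeholder, and raw agreement is not corrected for chance, so a chance-corrected statistic such as Cohen's kappa may be preferable in practice.

```python
from itertools import combinations

def mean_pairwise_agreement(verdicts_by_judge):
    """Average fraction of questions on which each pair of judges agrees.

    `verdicts_by_judge` maps a judge name to a list of categorical verdicts,
    all in the same question order.
    """
    verdict_lists = list(verdicts_by_judge.values())
    if len(verdict_lists) < 2:
        raise ValueError("need at least two judges")
    n_questions = len(verdict_lists[0])
    if n_questions == 0 or any(len(v) != n_questions for v in verdict_lists):
        raise ValueError("judges must answer the same non-empty question set")
    rates = [
        sum(a == b for a, b in zip(x, y)) / n_questions
        for x, y in combinations(verdict_lists, 2)
    ]
    return sum(rates) / len(rates)

def flag_possible_collapse(verdicts_by_judge, threshold=0.95):
    """Yellow flag (not proof) when agreement is suspiciously high.

    The threshold is an illustrative assumption, not a calibrated value.
    """
    return mean_pairwise_agreement(verdicts_by_judge) >= threshold
```

High agreement on easy questions is expected, so the check is only informative when the question set is deliberately diverse, as the paragraph above requires.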
The combined message is that jury-based eval is not a free lunch. It works under specific conditions and fails under specific conditions. The eval system designer's job is to build systems that work in the working conditions and that signal failure rather than producing misleading scores in the failure conditions. This requires intellectual discipline that single-number scoring conventions actively discourage. The scoring convention pressure is to produce a number, regardless of whether the underlying methodology supports a number. The discipline pressure is to refuse to produce a number when the methodology cannot support one and to produce confidence-interval scores when the methodology supports them only partially. The two pressures are in tension, and the resolution shapes how trustworthy the eval system actually is.
The Truth Approximation Quality Score (Reader Artifact)
The Truth Approximation Quality Score, TAQS, is a methodology for grading how well a multi-LLM jury can be expected to track underlying truth on a given evaluation task. The score runs from zero to one hundred. Tasks with scores above eighty are appropriate for jury-based eval with high confidence in the resulting scores. Tasks with scores between fifty and eighty are usable with caveats, including confidence intervals on output scores. Tasks below fifty should either be evaluated through other methodologies or restructured to improve the TAQS before jury-based scoring is applied.
TAQS has five components, each scored zero to twenty, summing to one hundred.
Component one, question boundedness, measures how well-defined the answer space and evaluation criteria are. A score of twenty indicates pass/fail questions or questions scored against a precise rubric. A score of fifteen indicates questions with explicit multi-dimensional rubrics. A score of ten indicates questions with general guidance but no explicit rubrics. A score of five indicates questions where judges are expected to apply their own implicit standards. A score of zero indicates questions where even the question itself is open to interpretation. Most well-designed evaluation tasks score between ten and twenty here; a score below ten signals that the question structure needs work before reliable evaluation is possible.
Component two, jury competence, measures how well the jury's individual judges can be expected to handle the question domain. A score of twenty indicates judges with demonstrated expert-level capability in the relevant domain. A score of fifteen indicates judges with strong general capability and adequate domain training. A score of ten indicates judges with general capability and limited domain training. A score of five indicates judges with marginal capability for the domain. A score of zero indicates judges who lack the capability to evaluate the question at all. Jury competence is often the binding constraint; if your judges are not competent, no amount of methodology can save the eval.
Component three, context completeness, measures how completely the relevant context is provided to the judges. A score of twenty indicates that all context required for evaluation is provided in the prompt or accessible through tools the judges can use. A score of fifteen indicates that primary context is provided and judges can reasonably infer secondary context. A score of ten indicates partial context with substantial inference required. A score of five indicates significant missing context that materially affects evaluation. A score of zero indicates that critical context is unavailable to judges. Tasks with low context completeness should either be augmented with context or restricted to questions that do not require the missing context.
Component four, subjective load, measures how much the question depends on subjective factors that vary across reasonable judges. The scoring is inverted: lower subjective load gets a higher score. A score of twenty indicates questions with no meaningful subjective component. A score of fifteen indicates primarily objective questions with minor subjective elements. A score of ten indicates questions with substantial subjective component but stable consensus expected. A score of five indicates questions with high subjective load and variable consensus. A score of zero indicates questions whose answer is entirely a matter of judge preference. High-subjective-load questions are not unscorable, but they should be reported as distributions rather than point estimates.
Component five, validation infrastructure, measures whether the eval system has periodic external validation in place to detect drift between jury consensus and external truth referents. A score of twenty indicates regular validation against verifiable-truth subsamples and human expert panels with documented agreement metrics. A score of fifteen indicates validation against one of the two with adequate documentation. A score of ten indicates ad-hoc validation without systematic process. A score of five indicates validation only in response to anomalies. A score of zero indicates no validation infrastructure. Without validation, jury consensus can drift indefinitely without anyone noticing, and the eval becomes increasingly meaningless over time.
The overall TAQS is the sum of the five components. The score should be reported alongside any jury-based evaluation, with explicit interpretation of what the score means for the credibility of the evaluation. Eval systems that publish TAQS for each task class are giving consumers the information they need to weight scores appropriately. Eval systems that do not are leaving consumers to assume scores carry more credibility than they may deserve.
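The scoring scheme can be written down as a small data structure. The sketch below follows the five components and the three bands described in this section; the class name, field names, and band labels are illustrative assumptions.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class TAQS:
    """Truth Approximation Quality Score: five components, each 0 to 20."""
    question_boundedness: int
    jury_competence: int
    context_completeness: int
    subjective_load: int            # inverted: low subjective load scores high
    validation_infrastructure: int

    def __post_init__(self):
        for name, value in asdict(self).items():
            if not 0 <= value <= 20:
                raise ValueError(f"{name} must be between 0 and 20, got {value}")

    def total(self):
        """Sum of the five components, from 0 to 100."""
        return sum(asdict(self).values())

    def band(self):
        """Map the total to the three bands described in the text."""
        score = self.total()
        if score > 80:
            return "jury-based eval, high confidence"
        if score >= 50:
            return "jury-based eval, with caveats and confidence intervals"
        return "use another methodology or restructure the task"

support_tone = TAQS(
    question_boundedness=10,
    jury_competence=15,
    context_completeness=10,
    subjective_load=5,
    validation_infrastructure=5,
)
# support_tone.total() is 45, which falls in the lowest band.
```

The example values for `support_tone` are invented for illustration; they are not a grading of any real task.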
TAQS is not a substitute for the quality of the eval. It is a meta-quality score that helps consumers calibrate their use of evaluation results. A high TAQS does not guarantee a useful evaluation; it indicates that the methodology is well-suited to producing useful evaluations on this task. Conversely, a low TAQS does not mean the evaluation is worthless; it means the methodology has limitations that consumers should account for in interpretation. Used well, TAQS produces an eval ecosystem that is more honest about its own limitations and more useful as a result.
Hybrid Methodology For Mixed-Type Tasks
Most real-world evaluation tasks are mixtures of the four question types from the taxonomy. A customer support evaluation has verifiable elements (did the agent follow required disclosures), convergent elements (did the agent apply policy correctly), divergent elements (was the tone appropriate), and contextual elements (was the response right for this customer's history). The right methodology is a hybrid that uses verifiable-truth checking for type-one questions, jury consensus for type-two questions, distribution reporting for type-three questions, and context augmentation or restricted scoping for type-four questions.
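The routing described above can be sketched as a lookup from question type to methodology. The enum, the method labels, and the question identifiers are hypothetical names chosen for the example; classifying a question into a type is assumed to have happened upstream.

```python
from enum import Enum

class QuestionType(Enum):
    VERIFIABLE = 1   # type one
    CONVERGENT = 2   # type two
    DIVERGENT = 3    # type three
    CONTEXTUAL = 4   # type four

METHOD_BY_TYPE = {
    QuestionType.VERIFIABLE: "direct verification",
    QuestionType.CONVERGENT: "jury consensus",
    QuestionType.DIVERGENT: "distribution reporting",
    QuestionType.CONTEXTUAL: "context augmentation or restricted scoping",
}

def plan_evaluation(questions):
    """Group question ids by the methodology their type calls for.

    `questions` is an iterable of (question_id, QuestionType) pairs.
    """
    plan = {}
    for question_id, question_type in questions:
        plan.setdefault(METHOD_BY_TYPE[question_type], []).append(question_id)
    return plan

plan = plan_evaluation([
    ("required_disclosures", QuestionType.VERIFIABLE),
    ("policy_applied", QuestionType.CONVERGENT),
    ("tone", QuestionType.DIVERGENT),
    ("fit_for_customer", QuestionType.CONTEXTUAL),
])
```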
Hybrid methodology produces composite scores that are the weighted sum of sub-scores from different methodological strands. The weighting is itself a methodological choice, and the choice should be documented and justified rather than treated as a given. Common weightings give more weight to verifiable components when they exist and less weight to subjective components. Aggregate scores should carry explicit confidence intervals that propagate the uncertainty from each component.
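One way to picture the aggregation is a weighted mean whose interval is built from the standard errors of the sub-scores. The sketch below rests on assumptions the essay does not specify: sub-score errors are treated as independent, the interval uses a normal approximation with z = 1.96, and the function name and input shape are hypothetical.

```python
import math


def composite_with_interval(sub_scores, z=1.96):
    """Combine (weight, score, std_error) triples into a point estimate and interval.

    Assumes independent errors across sub-scores, so variances add after
    scaling by the squared normalized weights.
    """
    total_weight = sum(weight for weight, _, _ in sub_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")

    point = sum(weight * score for weight, score, _ in sub_scores) / total_weight
    variance = sum(
        (weight / total_weight) ** 2 * std_error ** 2
        for weight, _, std_error in sub_scores
    )
    half_width = z * math.sqrt(variance)
    return point, point - half_width, point + half_width


# A verifiable strand with no measurement error and a jury strand with some.
point, lower, upper = composite_with_interval([(0.5, 90.0, 0.0), (0.5, 70.0, 4.0)])
print(point, lower, upper)
```

If sub-score errors are correlated, for example because the same judges feed more than one strand, the independence assumption understates the interval width and a covariance term would be needed.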
The operational complexity of hybrid methodology is real but manageable. It requires the eval system to understand each question's type and to route to the appropriate methodology. It requires the scoring system to track different kinds of scores from different methodologies and to aggregate them sensibly. It requires the reporting system to communicate the methodological mix to consumers without overwhelming them with detail. None of these is impossible, and well-designed eval systems handle them routinely. Eval systems that do not handle them are either restricting themselves to single-type tasks, which limits their applicability, or applying jury methodology indiscriminately to mixed-type tasks, which produces misleading scores.
A particularly important hybrid pattern is verifiable-anchor evaluation. The eval system identifies a small subset of questions in each task where verifiable truth exists, evaluates those questions against the verified truth, and uses the agreement between jury verdicts and verified truth on the anchor questions as a calibration signal for the jury verdicts on the non-anchor questions. If the jury agrees with verified truth on ninety-five percent of anchor questions, the jury verdicts on non-anchor questions earn correspondingly more credibility, provided the anchor questions are representative of the non-anchor ones. If the jury agrees on only sixty percent, the non-anchor verdicts should be discounted accordingly. Verifiable-anchor evaluation is among the most operationally useful patterns in jury-based eval methodology, and it should be standard practice for any eval task that contains both verifiable and non-verifiable components.
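The calibration step can be sketched in a few lines. The agreement calculation follows directly from the description above. The discount rule is an assumption made for illustration, since the essay does not define one: it rescales agreement linearly so that chance-level agreement on a binary verdict maps to zero and perfect agreement maps to one. Both function names are hypothetical.

```python
def anchor_agreement(jury_verdicts, verified_truth):
    """Fraction of anchor questions where the jury verdict matches verified truth.

    Both arguments map question id -> verdict. Only questions present in
    both mappings are treated as anchors.
    """
    shared = jury_verdicts.keys() & verified_truth.keys()
    if not shared:
        raise ValueError("no anchor questions in common")
    matches = sum(jury_verdicts[q] == verified_truth[q] for q in shared)
    return matches / len(shared)


def discount_factor(agreement, chance=0.5):
    """Illustrative linear discount: chance-level agreement -> 0.0, perfect -> 1.0."""
    return max(0.0, min(1.0, (agreement - chance) / (1.0 - chance)))


jury = {"q1": True, "q2": False, "q3": True, "q4": True}
truth = {"q1": True, "q2": False, "q3": False}  # q4 has no verifiable answer

agreement = anchor_agreement(jury, truth)
print(agreement, discount_factor(agreement))
```

Under this particular rule, ninety-five percent agreement yields a factor of about 0.9 and sixty percent yields about 0.2. Other discount curves are equally defensible; what matters is that the chosen rule is documented.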
Another important hybrid pattern is human-in-the-loop validation. For tasks with high subjective load, the eval system periodically routes a sample of jury verdicts to human expert reviewers for second-opinion evaluation. The agreement between jury and human verdicts is tracked over time, and large divergences trigger jury re-calibration or methodology review. Human-in-the-loop validation is expensive, but the cost can be amortized across many evaluations by sampling rather than evaluating everything, and the validation signal is invaluable for maintaining jury credibility on high-subjective-load tasks.
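A minimal sketch of the sampling and agreement-tracking loop might look like the following. It assumes categorical verdicts and uses Cohen's kappa as the agreement statistic, which is one reasonable choice rather than the only one. The function names and the seeded sampler are hypothetical, and the threshold at which divergence should trigger re-calibration is left to the eval system.

```python
import random
from collections import Counter


def sample_for_review(verdict_ids, rate, seed=0):
    """Pick a reproducible random subset of jury verdicts for human review."""
    ids = list(verdict_ids)
    if not ids:
        return []
    sample_size = max(1, round(len(ids) * rate))
    return random.Random(seed).sample(ids, sample_size)


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both raters labeled independently at their own base rates.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both raters used a single identical label throughout.
    return (observed - expected) / (1.0 - expected)


jury_labels = ["good", "good", "bad", "bad"]
human_labels = ["good", "bad", "bad", "bad"]
print(cohens_kappa(jury_labels, human_labels))
print(len(sample_for_review(range(100), rate=0.05)))
```

Tracking kappa per review cycle, rather than raw percent agreement, keeps a jury that labels almost everything "good" from looking well calibrated simply because most items are good.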
Hybrid methodology is the future of credible jury-based eval. Pure jury approaches without hybrid elements work only on the narrow subset of tasks that are well-bounded, low-subjective-load, and within jury competence. Pure human approaches do not scale to the volume of evaluation the agent economy requires. The hybrid approach combines the scaling properties of jury eval with the calibration properties of verified truth and human expert review, producing systems that are both operationally feasible and methodologically defensible.
Counter-Argument: Consensus Is Not Truth, So Stop Pretending It Is
The strongest counter-argument is that all of this elaborate machinery is putting lipstick on a fundamentally compromised methodology. Consensus is not truth. Calling it "truth approximation" is rhetoric. The TAQS score does not change the underlying epistemological problem; it just dresses it up in numbers. The honest move is to abandon the project of evaluating subjective tasks at all, and to restrict agent evaluation to tasks where ground truth genuinely exists.
The response is that this counter-argument proves too much. Subjective tasks are exactly the tasks the agent economy needs to evaluate. Customer support, content moderation, code review, medical summarization, legal analysis: these are all subjective tasks, and they are the tasks where agents create economic value. Restricting evaluation to verifiable-truth tasks would mean restricting evaluation to a small subset of agent capabilities, leaving the bulk of agent activity unevaluated. This is not honest epistemology; it is intellectual abdication. The right move is to evaluate subjective tasks with appropriate methodology and appropriate epistemic humility, not to refuse to evaluate them at all.
The second response is that the counter-argument misunderstands what evaluation is for. Evaluation is not metaphysics. It is decision support. Operators need to make procurement decisions about agents. Counterparties need to make integration decisions. Insurers need to make underwriting decisions. All of these decisions need information about agent quality, and the information does not need to be metaphysical truth to be useful. Consensus among diverse competent judges is genuinely useful information for these decisions, even when it is not truth. The TAQS framework helps consumers understand how useful, which is what consumers need. Refusing to provide the information because it is not metaphysical truth is unhelpful to the consumers who need to make decisions.
The third response is that the counter-argument applies just as much to human expert evaluation, and yet we use human expert evaluation in domains as varied as art criticism, peer review, judicial decisions, and medical diagnosis. The fact that human experts disagree, that their consensus is not truth, that their methodology has known limitations, has not stopped these domains from using expert evaluation as the load-bearing decision-support infrastructure. The same is true for jury-based agent evaluation. The methodology is imperfect. The methodology is also useful, when applied with discipline. The right response to imperfection is to develop the discipline, not to abandon the methodology.
The fourth response is the deepest. Truth, in the sense the counter-argument invokes, is not actually what we need for most evaluation contexts. We need calibrated, useful, and stable assessments that support good decisions. Truth is one source of these properties; consensus among diverse competent judges is another; expert deliberation is another; market outcomes are another. Different sources are appropriate in different contexts, and serious epistemology is about choosing the right source for the right question, not about pretending that only one source counts. The TAQS framework is the operational instantiation of this serious epistemology applied to agent evaluation. It is not a betrayal of honest measurement; it is what honest measurement looks like when truth is not available and decisions still need to be made.
The counter-argument is right about one thing: the language we use matters. Calling jury consensus "ground truth" is wrong, because it is not. Calling it "truth approximation" is better but still risks overclaiming. The most honest framing is that jury consensus is calibrated decision-support information whose calibration quality is itself measurable through frameworks like TAQS, and whose appropriate uses are bounded by the same calibration quality. This is more verbose than calling it truth, but it is more accurate, and the eval profession should adopt the more accurate framing as it matures.
What Armalo Does
Armalo's multi-LLM jury draws judges from at least three model providers, with composition rotated quarterly to reduce single-provider gaming. Trimming the top and bottom twenty percent of jury verdicts limits outlier influence on the operational composite score, while the full distribution is retained and reported as a verdict-variance metric on every score. TAQS is computed for each evaluation task class and published alongside the task's contribution to the composite. Hybrid methodology is the default: verifiable-truth questions are checked directly, convergent-judgment questions are scored through jury consensus, divergent-judgment questions are reported as distributions rather than point estimates, and contextual-judgment questions are either context-augmented or restricted to non-context-dependent components. Verifiable-anchor calibration runs on every evaluation cycle, with jury-versus-verified-truth agreement tracked and used to discount non-anchor verdicts when agreement falls below thresholds. Human-in-the-loop validation runs on a sampled basis, with the sampling rate scaling with subjective load: low-subjective-load tasks get five percent sampling, and high-subjective-load tasks get twenty percent. The composite score is reported with a confidence interval that propagates uncertainty from each component, and the certification tier is computed from the lower bound of the interval rather than the point estimate, which biases the system toward conservative tier assignment when uncertainty is high.
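To make two of these mechanics concrete, here is a sketch of a twenty-percent trimmed mean reported alongside the untrimmed variance, and of a tier lookup driven by the interval's lower bound. This is not Armalo's implementation: the function names, the tier names, and the threshold values are all hypothetical, chosen only to show the shape of the computation.

```python
import math
import statistics


def trimmed_score(verdicts, trim_fraction=0.2):
    """Return (trimmed mean, variance of the full untrimmed distribution)."""
    ordered = sorted(verdicts)
    cut = math.floor(len(ordered) * trim_fraction)
    kept = ordered[cut:len(ordered) - cut]
    if not kept:
        raise ValueError("not enough verdicts to trim")
    # Variance is computed over every verdict so that disagreement stays visible.
    return sum(kept) / len(kept), statistics.pvariance(ordered)


def tier_from_lower_bound(lower_bound, thresholds):
    """Assign the highest tier whose minimum the lower bound clears."""
    for minimum, tier in sorted(thresholds, reverse=True):
        if lower_bound >= minimum:
            return tier
    return "unrated"


HYPOTHETICAL_TIERS = [(90, "gold"), (75, "silver"), (60, "bronze")]

mean, variance = trimmed_score([1, 7, 8, 9, 10])
print(mean, variance)

# A point estimate of 91 with a lower bound of 88 lands in the lower tier.
print(tier_from_lower_bound(88, HYPOTHETICAL_TIERS))
```

Reporting the variance of the full distribution next to the trimmed mean is what keeps the trim honest: the outliers stop moving the headline number, but a consumer can still see that they exist.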
FAQ
If consensus is not truth, why use it at all? Because consensus among diverse competent judges is calibrated, useful, and stable enough to support decisions, even when it is not truth. The alternatives, single-judge evaluation or no evaluation, are both worse for almost every use case.
Doesn't averaging across LLMs just average across the same biases? Sometimes, when the biases correlate across models. Bias decorrelation is partial, not complete. The defense is multi-provider juries with deliberate diversity, periodic external validation, and explicit measurement of inter-judge agreement to detect mode collapse.
How do I know if my eval task has a high or low TAQS? Grade it against the five components: question boundedness, jury competence, context completeness, subjective load, and validation infrastructure. Each component is scored from zero to twenty, and the five scores are summed for an overall TAQS out of one hundred. The grading takes about thirty minutes per task class.
What should I do with a low-TAQS evaluation? There are three options: use it with appropriate caveats and confidence intervals, restructure the task to improve its TAQS, or switch to a different methodology. Treating a low-TAQS score as a high-confidence number is the failure mode TAQS exists to prevent.
Can subjective evaluation ever be objective? No, but it can be calibrated, stable, and decision-relevant, which is what evaluation actually needs to be. The aspiration to objectivity in subjective evaluation is a category error; the better aspiration is to honest, calibrated decision support.
How does Armalo handle disagreement between jury and human experts? Disagreement above a threshold triggers jury recalibration, methodology review, and potentially adjustment to the jury composition. Persistent large disagreement is a signal that the methodology has a structural problem that needs investigation, not a signal to suppress one side or the other.
Can the TAQS itself be gamed? Yes, if buyers do not verify the TAQS components against external evidence. The defense is independent review of TAQS computations as part of vendor diligence, the same way financial audit reports are themselves audited periodically by other firms. TAQS without external verification is just the eval system's self-assessment, which is informative but not authoritative.
What about tasks where even human experts cannot agree? For these tasks, the right output is the distribution of expert opinions, not a point estimate of consensus. Jury-based eval can produce distributions, and consumers can use the distributions to make decisions that account for the irreducible disagreement. Pretending these tasks have a single right answer when they do not is the failure mode the divergent-judgment category in the taxonomy exists to flag.
Bottom Line
Ground truth is what we want, and for most evaluation tasks in the agent economy, it is not what we have. Multi-LLM jury consensus is the best available approximation, and its quality varies enormously based on the structure of the underlying question. The intellectual discipline of jury-based eval is to know when consensus closely approximates truth, when it diverges in predictable ways, and when it should not be trusted at all. The TAQS framework operationalizes this discipline by giving evaluation tasks a meta-quality score that consumers can use to weight evaluation results appropriately. Build hybrid methodologies that combine verifiable-truth checking, jury consensus, distribution reporting, and human-in-the-loop validation. Use TAQS to grade your evaluations honestly. Report confidence intervals rather than point estimates when methodology supports only intervals. The goal is not to pretend evaluation produces truth; it is to produce evaluation that supports good decisions, with the epistemic humility that the underlying methodology demands.