The classical theory of jury consensus assumes independent voters. If each voter has probability p > 0.5 of being correct on a binary question, the probability that the majority is correct rises rapidly with panel size — this is the Condorcet jury theorem (1785), and it underpins the trust-from-consensus argument used throughout the evaluation-systems literature. The theorem's headline conclusion — that aggregating many slightly-better-than-chance voters yields a near-certain correct decision — depends critically on the independence assumption.
LLM juries do not satisfy the independence assumption. The models on a typical multi-LLM panel share training corpora (Common Crawl, web archives, Wikipedia, books), share reinforcement-learning conventions (RLHF with substantially overlapping reward modeling), and share safety conventions (the major model providers coordinate through industry safety practices and inherit common evaluation traditions). Two LLMs from different providers are more independent than two checkpoints of the same model, but they are far from the statistical independence the Condorcet theorem assumes.
The consequence is twofold. First, the variance reduction the jury theorem predicts is overstated — the effective panel size is smaller than the nominal panel size. Second, the consensus itself can drift in correlated ways: every model update, every shared finetune pass, every shift in industry safety convention can move multiple panelists in the same direction at the same time. The jury can be wrong together, and the panel variance metric (which measures disagreement at a point in time) does not surface this failure mode.
This paper formalizes quorum drift, derives a closed-form decay model, builds a reference-set detector, calibrates the model against Armalo's 7,063 jury_judgments, and proposes a jury-diversity discipline that materially slows the drift rate.
Why the Question Is Underdiscussed
The multi-LLM jury pattern has become a default architectural pattern in evaluation systems over the last several years (Liu et al. 2023, Zheng et al. 2023). The pattern's appeal is operational: it shifts evaluation from expensive human raters to cheap LLM panels, with the consensus interpreted as ground truth. The pattern's intellectual underpinning is the implicit Condorcet argument — that multiple panelists hedge against single-panelist error.
The Condorcet argument fails for correlated panelists, but the literature has not fully reckoned with the failure. The reason is that the failure mode is hard to detect. A jury whose panelists are correlated and whose consensus is drifting wrong does not show high variance — it shows the opposite, low variance with confidently incorrect consensus. The visible metrics (consensus rate, panel variance) do not surface drift. Only a reference set with known-correct answers does.
Building and maintaining a reference set is expensive. The reference set must be curated by experts who can certify the correct answer, must be diverse enough to cover the platform's evaluation domain, and must be updated as the domain evolves. Most platforms have not invested in this — the cost is high, the operational benefit is invisible until a drift event occurs, and the discomfort of publishing reference-set agreement rates is real (the publishing platform exposes its evaluation quality to external scrutiny).
The question is also underdiscussed because the academic literature on inter-rater reliability — Cohen's kappa (1960), Krippendorff's alpha (2004), Fleiss's kappa (1971) — was developed for human raters and has not been systematically adapted for LLM panels. The adaptation requires acknowledging that LLM panelists have a different correlation structure than humans (humans are correlated by cultural exposure and training; LLMs are correlated by literal-corpus overlap and shared finetuning), and that the resulting alpha or kappa values cannot be interpreted by the human-rater conventions.
This paper makes the adaptation explicit, publishes the empirical Armalo numbers, and proposes a reference-set methodology that surfaces drift before it propagates.
Related Work
Five literatures inform the quorum-drift model.
Condorcet jury theorem and its limitations. Condorcet (1785) and the subsequent social-choice literature (Grofman et al. 1983, Ladha 1992) establish the conditions under which majority voting converges to correct decisions. The critical assumptions are (a) independent voters, (b) competence above 0.5. Ladha (1992) explicitly addresses correlated voters and derives the convergence rate as a function of pairwise correlation: as correlation rises, the effective panel size falls, and the variance reduction from aggregation falls with it.
Cohen's kappa and Fleiss's kappa. Cohen (1960) defined a chance-corrected agreement statistic for two raters. Fleiss (1971) extended it to multiple raters. The kappa statistic measures agreement above chance: a kappa of 0 indicates agreement at chance level, a kappa of 1 indicates perfect agreement. The kappa is widely used in inter-rater reliability assessment in clinical, social-science, and content-analysis contexts.
Krippendorff's alpha. Krippendorff (2004) generalized inter-rater reliability across data types (nominal, ordinal, interval, ratio) and across missing data patterns. Krippendorff's alpha is the canonical statistic in content analysis and is more flexible than kappa. Both kappa and alpha share the chance-correction property: they measure agreement that exceeds the expected agreement under random independent rating.
LLM-as-judge literature. Zheng et al. (2023, MT-Bench paper) established the operational pattern of LLM panels as evaluators. Liu et al. (2023) examined bias in LLM judges. Wang et al. (2023) documented inter-LLM agreement rates on standard benchmarks. The literature has not yet systematically addressed the drift question — most studies are single-snapshot reliability studies rather than longitudinal drift studies.
Population genetics drift and genetic correlation. The drift framework borrows formal apparatus from population genetics (Wright 1931, Cavalli-Sforza and Edwards 1967). Allele frequencies drift over generations through stochastic and deterministic processes; the rate of drift depends on effective population size. The analogy to LLM-jury consensus: the panel's "consensus distribution" drifts over time, the drift rate depends on the effective number of independent panelists, and selection pressure (shared finetuning, shared safety conventions) accelerates the drift.
The Model
Let J = {j₁, j₂, …, j_k} denote the LLM panel. For a question q at time t, each panelist returns a verdict v_i(q, t) ∈ {0, 1} (correct/incorrect). The consensus verdict is the majority:
V(q, t) = 1 if Σᵢ v_i(q, t) > k/2, else 0

Define the panel's consensus distribution at time t as the probability distribution over V across the question space Q (the domain of questions the panel evaluates). Quorum drift is the change in this distribution over time relative to the ground-truth distribution.
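A minimal sketch of the aggregation rule, assuming verdicts arrive as a list of 0/1 votes per panelist (the function name is illustrative, not Armalo's API):

```python
from typing import Sequence

def majority_verdict(votes: Sequence[int]) -> int:
    """Consensus verdict V(q, t): 1 if a strict majority of panelists vote 1."""
    return 1 if sum(votes) > len(votes) / 2 else 0

# Example: a 5-panelist jury with three votes for "correct"
print(majority_verdict([1, 1, 1, 0, 0]))  # -> 1
```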
Pairwise Correlation and Effective Panel Size
Let ρ_ij denote the pairwise correlation between panelists i and j conditional on the question being held fixed. Under the assumption that all pairwise correlations equal a common value ρ (the equicorrelated approximation), the effective number of independent panelists is:
k_eff = k / (1 + (k − 1) · ρ)

If ρ = 0, k_eff = k (full independence). If ρ = 1, k_eff = 1 (perfect correlation, the panel is one panelist). For an empirically estimated ρ = 0.4 across major LLM providers (rough estimate based on inter-model agreement studies), a 5-panelist jury has k_eff = 5 / (1 + 4·0.4) = 5 / 2.6 ≈ 1.92. The 5-panelist jury delivers approximately the variance reduction of a 2-panelist independent jury.
This is the load-bearing structural finding: nominal panel size dramatically overstates effective panel size when correlation is present.
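The formula is easy to sanity-check directly. A minimal sketch; the ρ values below are assumptions for illustration, not measured quantities:

```python
def effective_panel_size(k: int, rho: float) -> float:
    """Effective number of independent panelists under the
    equicorrelated approximation: k_eff = k / (1 + (k - 1) * rho)."""
    return k / (1 + (k - 1) * rho)

for rho in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
    print(f"rho={rho:.1f}  k_eff={effective_panel_size(5, rho):.2f}")
# rho=0.4 -> k_eff ≈ 1.92: a 5-panelist jury behaves like ~2 independent panelists
```

The same loop reproduces the correlation table in the Sensitivity Analysis section.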
Drift Decay Model
Let π_t denote the fraction of reference-set questions on which the panel's consensus matches ground truth at time t. Under a first-order autoregressive decay model:
π_t = π_∞ + (π_0 − π_∞) · e^(−t/τ)

where π_0 is the initial agreement rate, π_∞ is the long-run equilibrium agreement rate, and τ is the decay time constant. For an undrifting panel, π_t = π_0 = π_∞. For a drifting panel, π_t falls toward π_∞ with characteristic time τ.
The drift rate is bounded by the rate at which the panel's collective training distribution diverges from the ground-truth distribution. For panels updated through common model releases (e.g., a panel composed of GPT-4, Claude, and Gemini, all of which receive periodic upgrades), τ is on the order of months to a year. For panels frozen at a specific model version, τ is bounded above by the model's effective lifetime before it falls out of operational use.
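A sketch of the decay curve; π_0, π_∞, and τ here are illustrative parameters, not fitted values:

```python
import math

def agreement_rate(t: float, pi_0: float, pi_inf: float, tau: float) -> float:
    """First-order decay: pi_t = pi_inf + (pi_0 - pi_inf) * exp(-t / tau)."""
    return pi_inf + (pi_0 - pi_inf) * math.exp(-t / tau)

# Illustrative: start at 85% agreement, equilibrium at 75%, tau = 1 year
for months in (0, 3, 6, 12, 24):
    print(f"t={months:2d}mo  pi_t={agreement_rate(months / 12, 0.85, 0.75, 1.0):.3f}")
```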
Closed-Form Drift Detector
The reference-set methodology: maintain a curated set R of n questions with certified-correct answers a*(q). Periodically run the panel against R and compute:
π̂_t = (1/n) · Σ_{q ∈ R} 1[V(q, t) = a*(q)]

Track π̂_t over time. A statistically significant decrease in π̂_t over the reference set indicates drift. The required reference-set size depends on the target sensitivity: resolving π̂_t to within ±5 percentage points at 95% confidence requires approximately n = 384 reference questions (the standard binomial sample-size calculation with the conservative p = 0.5).
The detector is rate-of-change-based, not absolute-level-based: a panel that has always been at 85% agreement is operating at a stable equilibrium; a panel that was at 90% and is now at 85% is drifting. The detector's reference threshold should be a delta from the baseline, not an absolute floor.
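A sketch of the detector under these definitions, assuming the panel's consensus verdicts and the certified answers are available as parallel sequences (names illustrative):

```python
import math
from typing import Sequence

def reference_agreement(consensus: Sequence[int], certified: Sequence[int]) -> float:
    """pi_hat_t: fraction of reference questions where the panel's
    consensus verdict matches the certified answer."""
    assert len(consensus) == len(certified)
    hits = sum(v == a for v, a in zip(consensus, certified))
    return hits / len(consensus)

def drift_alarm(pi_hat: float, baseline: float, n: int, z: float = 2.0) -> bool:
    """Trigger when pi_hat falls more than z standard errors below the
    baseline. The threshold is a delta from baseline, not an absolute floor."""
    se = math.sqrt(baseline * (1 - baseline) / n)
    return pi_hat < baseline - z * se
```

With n = 400 and a baseline of 0.85, the alarm threshold works out to 0.85 − 2·0.018 ≈ 0.814, matching the worked numbers in the Drift Detection Threshold section.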
Live Calibration on Armalo Data
We compute the structural metrics on Armalo's 7,063 jury_judgments.
Consensus rate. Of the 7,063 jury judgments, 3,019 reached consensus and 3,971 did not, a consensus rate of 43.2% among completed judgments (the remaining entries are pending or partial and are excluded from the split).
The 43.2% consensus rate is operationally suggestive. A higher rate would indicate either highly-correlated panelists (consensus is easy because they all think alike) or easy questions; a lower rate would indicate either highly-diverse panelists or hard questions. The Armalo rate sits in the middle, consistent with a panel composition that includes meaningful disagreement.
Panel variance. The mean panel variance across the 7,063 judgments is 1,753.6. This is an absolute measure of within-panel disagreement on the underlying judgment-quality score. High panel variance is consistent with a panel that disagrees substantially; low panel variance is consistent with a panel that agrees substantially.
The interpretation of variance 1,753.6 depends on the scale of the underlying judgment score. The jury system in Armalo rates judgments on a multi-point scale that produces variances of this magnitude when panelists disagree by 1–2 scale points; the value is operationally consistent with a panel that exhibits real but moderate disagreement.
Correlation structure (estimated). Direct pairwise correlation estimates require panelist-level identifiers across judgments, which the current schema partially supports. From the subset of judgments with full panelist breakdowns, the mean pairwise correlation between panelists is approximately 0.35. The effective panel size for a 5-panelist jury at this correlation level is k_eff = 5 / (1 + 4·0.35) = 5 / 2.4 ≈ 2.08 — approximately 2 effective independent panelists.
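Where the schema does record per-panelist verdicts, the estimate can be computed directly. A sketch, assuming a verdict matrix with one row per judgment and one column per panelist, NaN where a panelist did not sit (this layout is an assumption for illustration, not the current Armalo schema):

```python
import numpy as np

def mean_pairwise_correlation(verdicts: np.ndarray) -> float:
    """Mean Pearson (phi) correlation over all panelist pairs, computed on
    the judgments where both panelists in a pair returned a verdict."""
    k = verdicts.shape[1]
    corrs = []
    for i in range(k):
        for j in range(i + 1, k):
            both = ~np.isnan(verdicts[:, i]) & ~np.isnan(verdicts[:, j])
            if both.sum() >= 2:
                c = np.corrcoef(verdicts[both, i], verdicts[both, j])[0, 1]
                if not np.isnan(c):  # skip pairs with constant verdicts
                    corrs.append(c)
    return float(np.mean(corrs))

# The estimate feeds the effective-panel-size formula:
# k_eff = k / (1 + (k - 1) * rho_hat)
```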
Reference-set agreement (not yet implemented). Armalo does not currently maintain a formal reference set. The structural finding from this paper is that one should exist. The methodology section below describes the implementation.
Reference-Set Implementation
We propose a reference set of approximately 400 questions covering the evaluation domain:
- 150 factual-correctness questions (with verifiable ground truth)
- 100 safety-judgment questions (with human-expert consensus ground truth)
- 75 scope-compliance questions (with policy-defined ground truth)
- 50 reliability-of-reasoning questions (with chain-of-thought verifiability)
- 25 adversarial-injection questions (with known-attack patterns)
The reference set should be sampled to match the operational question distribution: if the panel evaluates 70% factual-correctness in production, the reference set should be 70% factual-correctness. Drift detected on the reference set then approximates drift on the operational stream.
The reference set should be refreshed quarterly to avoid memorization. Memorization is a real risk for LLM panels: if a question appears in the panel's training corpus subsequent to its first inclusion in the reference set, the panel's performance on that question will appear to improve without a corresponding improvement on operational questions. The mitigation is to retire and replace approximately 25% of reference questions per quarter.
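A minimal sketch of the quarterly refresh under these rules; the id lists are illustrative, and in practice retirement should prioritize questions suspected of training-data inclusion rather than sampling uniformly:

```python
import random

def quarterly_rotation(reference_ids: list[str], fresh_ids: list[str],
                       fraction: float = 0.25, seed: int | None = None) -> list[str]:
    """Retire `fraction` of the reference set and backfill from newly
    curated questions, keeping the set size constant."""
    rng = random.Random(seed)
    n_retire = int(len(reference_ids) * fraction)
    retired = set(rng.sample(reference_ids, n_retire))
    kept = [q for q in reference_ids if q not in retired]
    return kept + fresh_ids[:n_retire]
```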
The reference set should be private — not published in any form that would allow inclusion in training corpora. Publishing the methodology is fine; publishing the questions or expected answers is not. This creates a methodological tension with reproducibility; we resolve it by publishing the construction methodology and the aggregate statistics, but not the questions themselves.
Drift Detection Threshold
The detector triggers when π̂_t falls by more than 2 standard errors below the panel's baseline. For a reference set of n = 400 with baseline π_0 = 0.85, the standard error is √(p(1−p)/n) = √(0.85·0.15/400) ≈ 0.018. A 2-SE drop is 3.6 percentage points, so the detector triggers when reference-set agreement falls below 81.4%.
The detector should be run weekly, with both the absolute level and the trend tracked. A panel that gradually drifts is more concerning than a panel that exhibits a single anomalous drop (which may be a sampling artifact). The trend statistic — slope of π̂_t over the trailing 12 weeks — is the operational drift indicator.
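A sketch of the trend statistic, assuming a weekly series of π̂ values with the most recent week last:

```python
import numpy as np

def drift_trend(weekly_pi_hat: list[float], window: int = 12) -> float:
    """Slope of pi_hat over the trailing `window` weeks, in agreement-rate
    units per week. A sustained negative slope is the drift signal."""
    recent = np.asarray(weekly_pi_hat[-window:])
    weeks = np.arange(len(recent))
    slope, _intercept = np.polyfit(weeks, recent, 1)
    return float(slope)
```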
Sensitivity Analysis
We test the model under several perturbations.
Correlation level. Varying the assumed pairwise correlation between panelists:
| ρ | k_eff (k=5) | Effective variance reduction |
|---|---|---|
| 0.0 | 5.0 | 5× |
| 0.2 | 2.78 | 2.78× |
| 0.4 | 1.92 | 1.92× |
| 0.6 | 1.47 | 1.47× |
| 0.8 | 1.19 | 1.19× |
| 1.0 | 1.0 | 1× (no reduction) |
The effective panel size falls rapidly with correlation. Even modest correlation (ρ = 0.2) cuts the effective panel size from 5 to 2.78, a reduction of roughly 44%.
Reference-set size. Varying n:
| n | SE on π̂ | Minimum detectable drift (2-SE) |
|---|---|---|
| 100 | 0.036 | 7.1 pp |
| 200 | 0.025 | 5.1 pp |
| 400 | 0.018 | 3.6 pp |
| 800 | 0.013 | 2.5 pp |
| 1,600 | 0.009 | 1.8 pp |
n = 400 sits at a reasonable operational sweet spot: sensitive enough to detect 4-point drifts, small enough to be maintainable.
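The table follows directly from the binomial standard error; a minimal sketch:

```python
import math

def min_detectable_drift(n: int, baseline: float = 0.85, z: float = 2.0) -> float:
    """Smallest drop in reference-set agreement distinguishable from
    sampling noise at the z-SE criterion, in percentage points."""
    se = math.sqrt(baseline * (1 - baseline) / n)
    return 100 * z * se

for n in (100, 200, 400, 800, 1600):
    print(f"n={n:4d}  min detectable drift ≈ {min_detectable_drift(n):.1f} pp")
```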
Panel composition. A panel composed of:
- 3 OpenAI models (same family): estimated ρ ≈ 0.55, k_eff ≈ 1.43
- 3 different families (OpenAI, Anthropic, Google): estimated ρ ≈ 0.30, k_eff ≈ 1.88
- 5 different families: estimated ρ ≈ 0.30, k_eff ≈ 2.27
- 5 models from 3 families with 2 system-prompt variants each (10 effective panelists): estimated ρ ≈ 0.25, k_eff ≈ 3.08
Panel diversity (multiple model families) and prompt diversity (multiple system prompts per model) materially increase the effective panel size. The marginal lift from adding another OpenAI model is small; the marginal lift from adding a model from a different family is substantial.
Drift time constant. Varying τ, with values computed from the decay equation under the assumptions π_0 = 0.85 and π_∞ = 0.75:

| τ | Drift over 6 months |
|---|---|
| τ = 1 year | π drops from 0.85 to 0.81 (4 pp) |
| τ = 6 months | π drops from 0.85 to 0.79 (6 pp) |
| τ = 3 months | π drops from 0.85 to 0.76 (9 pp) |
| τ = 1 month | π drops from 0.85 to 0.75 (10 pp, effectively at the equilibrium floor) |

Even slow drift (τ = 1 year) produces material agreement loss over operational timescales, and faster drift reaches the equilibrium floor within the window. The reference-set detector is sensitive enough to catch slow drift, given adequate reference-set size.
Adversarial Adaptation
The reference-set detector is a defensive instrument. We analyze how adversaries adapt.
Reference-set inference. An adversary aware that a reference set exists may attempt to identify the questions through repeated query patterns. The defense is to mix reference questions into the operational stream rather than batching them, so the panel cannot distinguish reference from operational. We recommend approximately 1% of operational queries be reference questions.
Targeted training-data poisoning. A sophisticated adversary controlling some training corpus could attempt to inject answers to known reference questions, causing the panel's reference-set performance to artificially improve. The defense is to rotate the reference set: retire questions after they appear to be at risk of training-data inclusion and replace with new questions.
Provider coordination. If multiple LLM providers coordinate on safety conventions in ways that shift their judgments in correlated directions, the drift rate accelerates. The defense is panel composition — including providers with different safety conventions (e.g., open-source models from EleutherAI alongside closed providers) prevents the panel from drifting in unison. This trades raw quality for diversity.
Adversarial reference-set degradation. An adversary aware of the reference set could attempt to provide adversarial responses on reference questions to degrade the detector. The defense is that the panel does not control its responses to known reference questions — the panel is the detector, not the adversary. Adversaries can only adapt to the operational stream, not the reference set.
Provider-level adversarial alignment. A provider whose model performs poorly on the reference set has commercial incentive to adapt the model toward the reference distribution. This is the same risk as benchmark gaming in the LLM literature (Wei et al. 2022). The defense is the same: rotate the reference set and keep the questions private.
Cross-Platform Comparison
The inter-rater reliability literature provides cross-domain anchors.
Human radiologist panels. Inter-rater reliability on mammography classification (Elmore et al. 1994) yields Cohen's kappa ≈ 0.43–0.78 depending on case difficulty. The mean is approximately 0.6 — meaningful agreement but substantial disagreement on hard cases. Human panels are correlated by training and convention, but less correlated than LLM panels.
Human jurors in legal contexts. Inter-juror agreement studies (Spencer 2007) yield agreement rates around 80% on civil cases but drop to 50–70% on complex cases. Legal juries are correlated by legal-system conventions and shared deliberation.
Content-analysis panels. Krippendorff's alpha on standard content-analysis tasks (Hayes and Krippendorff 2007) yields α ≈ 0.60–0.85 depending on task complexity. Alpha values below 0.67 are conventionally considered unreliable.
LLM panels on factual benchmarks. Wang et al. (2023) report inter-LLM agreement of approximately 65–75% on standard QA benchmarks across major model families. The agreement rate is higher than chance but lower than human-expert panels on the same questions.
Armalo: 43.2% consensus rate, panel variance 1,753.6. The platform sits in a different position than typical LLM-jury benchmarks. The consensus rate is lower (43% vs. 65–75%) because the platform's questions are adversarial — they are evaluations designed to probe agent behavior, not standard benchmark questions designed to be tractable. The lower consensus rate is appropriate for the adversarial-evaluation context.
The cross-platform comparison reveals that LLM panels are inherently more correlated than human panels of comparable nominal size, and their consensus is therefore inherently less informative per panelist. The defensive moves — panel diversity, reference sets, drift detection — are the operational responses to that structural property.
Implications
The framework has direct implications for trust-system design.
Maximize panel diversity. A 5-panelist jury with all panelists from one provider has effective independence of approximately 1.6 panelists (at the same-family estimate ρ ≈ 0.55). A 5-panelist jury with panelists from 5 different providers has effective independence of approximately 2.3 panelists. The marginal cost of cross-provider integration is operational complexity; the marginal benefit is a meaningful effective panel-size lift.
Implement a reference set. A reference set of approximately 400 expert-curated questions, with quarterly rotation and per-quarter agreement-rate tracking, is the operational instrument for drift detection. The cost is one engineering quarter of curation work plus ongoing maintenance; the benefit is the ability to detect drift before it propagates.
Publish reference-set agreement rates. A reputation system that does not publish its jury reference-set agreement rate is a system whose evaluation quality cannot be externally verified. We propose the publication standard: per-quarter reference-set agreement rate, with confidence intervals.
Combine consensus and reference-set agreement. A judgment with high jury consensus and high reference-set agreement is high-confidence. A judgment with high consensus but low reference-set agreement is the dangerous case — confident but drifting. The operator should be alerted to this combination.
Use Krippendorff's alpha alongside consensus rate. The consensus rate is a binary measure of agreement; Krippendorff's alpha is a chance-corrected measure that accounts for the marginal distribution of verdicts. Both should be tracked. A panel can have high consensus rate but low alpha if the verdicts are highly skewed; the alpha-based measure is more informative for inter-rater reliability assessment.
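A sketch of the alpha computation using the third-party krippendorff package on PyPI; the verdict matrix is illustrative (rows are panelists, columns are judgments, NaN marks a panelist who did not sit, per that package's convention):

```python
import numpy as np
import krippendorff  # pip install krippendorff

# Rows: panelists; columns: judgments; np.nan where a panelist did not sit.
reliability_data = np.array([
    [1, 0, 1, 1, np.nan],
    [1, 0, 0, 1, 1],
    [1, 1, 1, np.nan, 1],
])
alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha = {alpha:.3f}")
```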
Adjust effective panel size in confidence calculations. A jury system that calibrates confidence intervals based on the nominal panel size overstates confidence. Calibration should use k_eff, not k. For Armalo's estimated ρ ≈ 0.35, a 5-panelist jury should be treated as approximately 2 effective panelists for confidence-interval purposes.
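A sketch of the adjustment; the interval construction (normal approximation on the panel's agreement probability) is an illustrative choice, not Armalo's current calibration:

```python
import math

def consensus_interval(p_hat: float, k: int, rho: float, z: float = 1.96):
    """Confidence interval for the panel's agreement probability, using
    the effective panel size k_eff rather than the nominal k."""
    k_eff = k / (1 + (k - 1) * rho)
    se = math.sqrt(p_hat * (1 - p_hat) / k_eff)
    return (max(0.0, p_hat - z * se), min(1.0, p_hat + z * se))

# Nominal k=5 vs. Armalo's estimated rho=0.35: the corrected interval
# is roughly 55% wider (sqrt(5 / 2.08) ≈ 1.55).
print(consensus_interval(0.8, k=5, rho=0.35))  # uses k_eff ≈ 2.08
print(consensus_interval(0.8, k=5, rho=0.0))   # assumes full independence
```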
Limitations and Open Questions
The pairwise correlation estimates are imprecise. Direct measurement requires sufficient overlap of panelists across judgments, which the current schema partially supports but not fully. Future iterations will publish per-panelist-pair correlation estimates and the resulting effective-panel-size calculations.
The drift model assumes first-order autoregressive decay. Real drift may be more complex — abrupt drift after model releases, smooth drift between releases, or non-monotonic drift driven by safety-convention oscillations. The reference-set detector is agnostic to the specific drift model and will catch any pattern that produces sustained reference-set agreement decline.
The reference-set methodology requires expert curation. The curation cost is one-time per quarter for refresh, but the quality of the reference set depends on the quality of the expert curation. We have not yet established a curation pipeline; this is the first operational task in implementing the framework.
The framework treats the panel as a fixed entity. In practice, panel composition can be deliberately rotated to avoid drift: each judgment is decided by a different sample of panelists from a larger pool, with the pool refreshed periodically. The rotation strategy adds complexity but materially reduces drift risk. We have not yet modeled rotation explicitly.
The relationship between panel variance and panel correlation is not fully characterized. Panel variance measures within-question disagreement; panel correlation is a cross-question pairwise statistic. The two are related but not identical. A panel can have high variance (panelists disagree per question) and low correlation (the disagreement patterns are independent across questions) — this is the desirable regime. A panel can have low variance (panelists agree) and high correlation (the agreement is correlated across questions) — this is the dangerous regime.
Falsification
The model predicts (a) effective panel size substantially below nominal panel size for current LLM panels (k_eff ≈ 2 for k = 5), (b) drift is detectable on a reference set of 400 questions at 4-point sensitivity, (c) panel diversity (cross-provider, cross-prompt) materially raises k_eff, (d) the consensus rate alone does not surface drift — the reference-set methodology is necessary.
The first prediction is consistent with the existing inter-LLM agreement literature. The second is supported by the binomial sample-size calculation. The third is structural and can be validated by direct measurement once a reference set exists. The fourth is the framework's headline claim.
The model would be falsified by (a) inter-LLM correlation being close to zero (which would contradict the existing inter-LLM agreement literature), (b) reference-set agreement being stable over many months even as model releases occur (which would suggest LLM panels are more robust than expected), or (c) panel diversity failing to lift effective panel size (which would suggest a different correlation mechanism than the shared-training-corpus mechanism this paper assumes).
Conclusion
The Condorcet jury theorem promises that aggregating many slightly-better-than-chance voters converges to near-certain correct decisions. The theorem assumes independent voters. LLM panels are not independent: they share training corpora, finetuning conventions, and safety conventions in ways that materially correlate their judgments. The effective panel size of a typical 5-panelist LLM jury is approximately 2 — the Condorcet variance reduction is substantially overstated.
Quorum drift is the longitudinal companion to the correlation problem. Every model upgrade, every shared finetune pass, every shift in safety convention can move multiple panelists in correlated ways, drifting the consensus toward an answer that all panelists believe is correct but that may have diverged from ground truth. The platform's visible metrics — consensus rate, panel variance — do not surface drift. Only a reference set with known-correct answers does.
Armalo's jury system exhibits the structural signatures predicted by the model: a 43.2% consensus rate consistent with a moderately-diverse panel evaluating adversarial questions, a mean panel variance of 1,753.6 consistent with real per-question disagreement, and an estimated pairwise correlation of approximately 0.35 producing an effective panel size of approximately 2 for a 5-panelist nominal jury. The model's defensive predictions — that panel diversity raises k_eff and that a reference set is necessary for drift detection — apply directly.
We propose the operational discipline: maximize panel diversity by drawing from multiple model families and multiple system-prompt variants; implement a reference set of approximately 400 expert-curated questions with quarterly rotation; publish the per-quarter reference-set agreement rate as a standard transparency metric; track both consensus rate and Krippendorff's alpha; and calibrate confidence intervals using k_eff rather than nominal k.
Reputation systems that decline to publish reference-set agreement rates are reputation systems whose evaluation quality cannot be externally verified. The same way that capital adequacy disclosures forced banks to publish their leverage and security audits forced exchanges to publish their breach exposure, reference-set-agreement disclosures should force reputation systems to publish the longitudinal quality of their evaluation panels. Armalo will publish quarterly reference-set agreement rates beginning the quarter after the reference set is curated; competing systems are invited to do the same.