The standard story about agent evaluation is that you give the agent a prompt, see what it produces, and grade the result. The story is incomplete because it elides the prompt's origin. Real prompts come from real users, who have goals, contexts, tolerance for ambiguity, and patterns of escalation that an evaluator does not get to choose. Evaluations that draw prompts from real users are externally valid but expensive. Evaluations that draw prompts from a synthetic distribution are cheap but internally biased: the agent passes evaluations against the distribution the evaluator imagined, not the distribution the market produces.
This is the realism gap, and it is the central calibration problem of any platform that wants to evaluate agents at scale. The agent economy is at scale: per the production snapshot, Armalo's production evaluation system has run 1,249 evaluations producing 8,231 individual eval_checks to date (the published measurement artifact). The pass rate across those checks is meaningful only to the extent that the prompts driving the checks resemble the prompts the agent will see in production. If the synthetic prompts diverge from production prompts in any systematic way, the pass rate is a measurement of the agent's performance on the test, not its performance in the market. We deliberately do not cite a specific pass rate here because that requires a per-check passed field aggregation we have not yet wired into the snapshot script. The eval volumes are real; a pass-rate measurement is a follow-up.
This paper formalizes the realism gap, derives the optimal mix of real and synthetic counterparties, and confronts the meta-circularity that synthetic counterparties are themselves agents: agents that need their own trust scores, evaluated by their own counterparties, recursively. We publish the closed form, state which Armalo production figures are grounded, propose a protocol for measuring the rest, and lay out the design implications.
Why the Question Is Underdiscussed
The evaluation-methodology literature has historically focused on the agent under test and treated the prompt distribution as a free parameter. In a research context that is acceptable: a benchmark designer can publish a fixed prompt set and let researchers compete on it. In a production context it is dangerous: a platform that gates economic access on evaluations passed against a fixed synthetic prompt set has gated economic access on a fiction whose alignment with the market is unverified.
The underdiscussion has three sources.
First, the realism gap is uncomfortable to measure because doing so requires real counterparties to begin with. A platform without real counterparties cannot measure the realism gap; a platform with real counterparties already has the expensive measurement and is reluctant to publish how badly its cheap measurement diverges from the expensive one. We argue the discomfort is a feature: publishing the realism gap forces calibration, calibration forces design choices, and the result is evaluation systems that survive procurement-side scrutiny rather than systems that ship benchmark scores no buyer can verify.
Second, the meta-circularity is intellectually awkward. If we evaluate agents against synthetic counterparties, the synthetic counterparties are themselves agents whose behavior must be characterized. The naive response ("we trust the synthetic counterparty because we wrote it") does not survive contact with the LLM era, in which the synthetic counterparty is an LLM whose outputs are themselves stochastic, biased, and prompt-injection-vulnerable. The literature has been slow to articulate this because admitting it implies that evaluation infrastructure needs its own evaluation infrastructure, recursively.
Third, the economic frame is missing. Evaluation methodology lives in the ML research literature; cost-calibrated evaluation methodology requires economic modeling that ML researchers historically defer to other disciplines. The agent economy collapses this distinction. An evaluation system that cannot quote its dollar cost per eval and its calibration cost per dollar is not a production evaluation system; it is a research artifact.
Related Work
Five research traditions inform the synthetic counterparty problem.
Adversarial generation and the discriminator. Goodfellow's adversarial networks (Goodfellow et al. 2014) introduced the discipline of pitting a generator against a discriminator: the generator produces synthetic data, the discriminator distinguishes synthetic from real, and the generator's loss is the discriminator's accuracy. The transfer to synthetic counterparties is direct: a synthetic counterparty is the generator, the realism_score is the inverse of the discriminator's accuracy, and the platform's evaluation system is the discriminator. The Goodfellow framework also imports the warning: if the discriminator is weak (the platform cannot tell real from synthetic), the generator (synthetic counterparty) can pass for real while remaining qualitatively different.
Sim-to-real transfer in autonomous driving. Waymo, NVIDIA DRIVE Sim, and Cruise have published extensively on sim-to-real: training driving agents in synthetic environments and measuring real-world transfer. The headline finding is that synthetic environments can substitute for real-world miles up to a calibrated bound (typically 70-90% replacement is feasible if the synthetic distribution is anchored to real data), and beyond that bound the marginal synthetic mile produces diminishing real-world skill. The 70-90% replacement rate is structurally similar to what we observe in synthetic counterparty calibration, and the anchoring requirement (synthetic data must derive from real data, not from imagination) is the bridge to our bootstrapping recommendation.
Red-team simulation in cybersecurity. Red-team exercises substitute simulated adversaries for real ones; the question of when simulated adversaries are realistic enough has been studied for decades (Schneier 2000, Maxion and Roberts 2004). The cybersecurity consensus is that simulated adversaries are realistic to the extent that they sample from the same attack distribution as real adversaries. Cybersecurity threat-intelligence feeds, which sample actual observed attacks from production systems, are the analog of our transaction-log bootstrapping recommendation. Pure-imagination red teams have predictably poor calibration; threat-intelligence-anchored red teams achieve high calibration at modest cost.
Survey methodology and synthetic respondents. The survey-research literature (Groves et al. 2009, AAPOR 2016) has long studied the gap between sampled survey respondents and the target population. The two relevant findings are: (1) the gap is rarely zero, even for well-designed samples, and (2) the gap is bounded by post-stratification weighting if the platform knows enough about the target population to weight against it. Synthetic counterparties are an extreme case of survey nonresponse: the population of synthetic responses is 100% non-representative until it is anchored, and post-stratification weighting against real-counterparty distributions is the analogous correction.
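The post-stratification correction mentioned above can be sketched in a few lines. This is a minimal illustration, not Armalo's implementation: the stratum names ("terse", "verbose"), the population shares, and the pass rates are all hypothetical, and the sketch assumes the platform already knows each stratum's share of real traffic.

```python
def poststratification_weights(sample_counts, population_shares):
    """Weight per stratum = population share / sample share."""
    n = sum(sample_counts.values())
    return {s: population_shares[s] / (sample_counts[s] / n) for s in sample_counts}

def weighted_pass_rate(pass_rates, sample_counts, weights):
    """Pass rate re-weighted so each stratum counts as it does in real traffic."""
    num = sum(pass_rates[s] * sample_counts[s] * weights[s] for s in sample_counts)
    den = sum(sample_counts[s] * weights[s] for s in sample_counts)
    return num / den

# Hypothetical: synthetic prompts over-represent verbose requests.
sample_counts = {"terse": 20, "verbose": 80}        # synthetic prompt counts
population_shares = {"terse": 0.5, "verbose": 0.5}  # assumed real-traffic shares
pass_rates = {"terse": 0.4, "verbose": 0.8}         # per-stratum synthetic pass rate

weights = poststratification_weights(sample_counts, population_shares)
adjusted = weighted_pass_rate(pass_rates, sample_counts, weights)
```

The weights only correct for strata the platform can name; divergence inside a stratum is left untouched, which is why the correction bounds the gap and does not remove it.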
Bayesian calibration of simulators. Kennedy and O'Hagan (2001) formalized the calibration of computer simulators against real-world data using Gaussian processes. The framework provides a principled way to combine cheap simulator outputs with expensive real-world measurements to produce calibrated predictions. The realism_score we derive in this paper is structurally analogous: cheap synthetic evaluations plus expensive real evaluations combine to produce a calibrated agent score, with the realism_score quantifying how much the synthetic predictions need to be down-weighted.
The Model
We define the realism gap and the optimal synthetic/real mix in three steps.
Step 1: The Realism Score
Let r_a be the empirical pass rate of agent a on real counterparty evaluations, and s_a the pass rate on synthetic counterparty evaluations. Across a population of agents A, the realism score is the Pearson correlation:
realism_score = corr({r_a}, {s_a}) for a in A
A realism_score of 1.0 means synthetic and real pass rates are perfectly linearly correlated, so an agent's relative performance is the same in both. A realism_score of 0.0 means synthetic evaluations carry no signal about real performance: they are pure noise. A negative realism_score means synthetic evaluations are inversely correlated with real performance: the agent that wins the synthetic eval loses the real one, which is the worst possible outcome and indicates the synthetic distribution is fundamentally misaligned with the real distribution.
The realism_score is the load-bearing measurement. Every other claim about synthetic evaluations depends on knowing this number. A platform that publishes synthetic evaluation results without publishing the realism_score is shipping numbers without their error bars.
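As a concrete reading of the definition, the following sketch computes the Pearson correlation over the agents that have both a real and a synthetic pass rate. The agent ids and pass rates are hypothetical placeholders, and the dictionary layout is an assumption made for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one series is constant")
    return sxy / sqrt(sxx * syy)

def realism_score(real_pass, synth_pass):
    """real_pass and synth_pass map agent id -> pass rate in [0, 1]."""
    agents = sorted(set(real_pass) & set(synth_pass))  # scored by both pipelines
    return pearson([real_pass[a] for a in agents],
                   [synth_pass[a] for a in agents])

# Hypothetical pass rates, for illustration only.
real = {"a1": 0.90, "a2": 0.60, "a3": 0.75, "a4": 0.40}
synth = {"a1": 0.95, "a2": 0.70, "a3": 0.80, "a4": 0.55}
score = realism_score(real, synth)
```

Agents present in only one pipeline are dropped, since the correlation is defined only on paired observations.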
Step 2: The Cost-Calibrated Mix
Let c_real be the per-evaluation cost of a real-counterparty evaluation (the real counterparty's time, the platform's matching cost, the longer wall-clock that real interactions require), and c_synth the per-evaluation cost of a synthetic-counterparty evaluation (an LLM call against a synthetic prompt distribution, plus the platform's evaluation overhead). For illustration, take c_real ≈ $4.20 (driven mostly by counterparty time and wall-clock latency converted via the platform's per-day opportunity cost) and c_synth ≈ $0.31 (driven by the LLM call cost and a small overhead), giving a cost ratio of approximately 13.5×. Neither figure has been produced by a measurement script; see the Live Calibration section.
Let σ_gap be the standard deviation of the realism gap, that is, of the per-agent difference (r_a - s_a) across the agent population. σ_gap is a property of the synthetic-counterparty implementation and the agent population it is run against; we do not assert a measured value here. The originally-published "σ_gap ≈ 0.07 on Armalo with bootstrapped synthetic counterparties" was unsupported and has been removed; producing a real σ_gap is part of the realism-score protocol described in the Live Calibration section.
A platform that wants its evaluation system to deliver calibration error no worse than ε (e.g., 2 percentage points on pass rate) must run enough real-counterparty evaluations to drive the standard error σ_gap / sqrt(n_real) below ε. The expected number of real evaluations needed is:
n_real* = (σ_gap / ε)^2
The remaining evaluations can be synthetic, with the synthetic-to-real ratio bounded only by cost considerations and the marginal information content of further synthetic evaluations. The optimal mix that minimizes total cost subject to the calibration constraint is:
mix(synth : real) = (c_real / c_synth) · (1 - realism_score^2) : 1
The (1 - realism_score^2) factor down-weights synthetic evaluations when they carry redundant information with real ones. At realism_score = 1 the factor is 0: synthetic results are fully redundant with real ones, and the formula assigns no additional synthetic evaluations per real one. At realism_score = 0 the factor is 1 and the ratio equals the cost ratio c_real / c_synth; synthetic evaluations then carry no signal about real performance, so the ratio is only a budget bound and calibration rests entirely on the n_real* real evaluations. The intermediate regime, 0.5 < realism_score < 0.95, is where we expect a bootstrapped system to sit (Armalo has no measured value yet), and where the optimization is interesting.
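The two closed forms can be evaluated directly. The sketch below implements them exactly as written; the input values are hypothetical, and rounding n_real* up to the next integer is our assumption, since the formula itself yields a real number.

```python
from math import ceil

def required_real_evals(sigma_gap, epsilon):
    """Smallest integer n_real with sigma_gap / sqrt(n_real) <= epsilon."""
    if sigma_gap < 0 or epsilon <= 0:
        raise ValueError("sigma_gap must be >= 0 and epsilon must be > 0")
    return max(1, ceil((sigma_gap / epsilon) ** 2))

def synth_per_real(c_real, c_synth, realism_score):
    """mix(synth : real) = (c_real / c_synth) * (1 - realism_score^2)."""
    if c_synth <= 0:
        raise ValueError("c_synth must be > 0")
    if not -1.0 <= realism_score <= 1.0:
        raise ValueError("realism_score must lie in [-1, 1]")
    return (c_real / c_synth) * (1.0 - realism_score ** 2)

# Hypothetical inputs, not measurements.
n_real = required_real_evals(sigma_gap=0.05, epsilon=0.02)
mix = synth_per_real(c_real=10.0, c_synth=1.0, realism_score=0.8)
```

Keeping the two functions separate mirrors the structure of the model: n_real* depends only on σ_gap and ε, while the mix depends only on the cost ratio and the realism_score.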
Step 3: Bootstrapping from Transaction Logs
The choice of how to construct synthetic counterparties is what moves the realism_score. Three approaches exist:
- 1. Cold-start imagination. Prompt an LLM to generate user requests directly, with no anchoring to real data. We expect the realism_score under this approach to fall around 0.30-0.50 (an estimate; no realism_score has been measured on Armalo yet), because synthetic prompts cluster around LLM-typical phrasings and miss the long tail of real-world ambiguity, terseness, and context dependence.
- 2. Template-based generation. Define templates for user requests with slots filled in from a parameter distribution. We expect this to do better than cold-start (an estimated realism_score of 0.50-0.65), but it remains limited by the template designer's imagination of the real distribution.
- 3. Transaction-log bootstrap. Sample real transaction prompts from the platform's production logs, then either replay them directly or use them as seed prompts for an LLM to produce variants. We expect this approach to deliver a realism_score of 0.75-0.85 (again an estimate), depending on the LLM's faithfulness to the seed style. The bootstrapping is the bridge to sim-to-real and threat-intelligence-anchored red teaming: anchoring the synthetic distribution to real samples is what closes most of the realism gap.
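The transaction-log bootstrap can be sketched as a sampler plus a variant hook. Everything named here is hypothetical: make_variant marks the place where an LLM paraphrase call would go, toy_variant is a stand-in that only tags the seed, and the log entries are invented.

```python
import random

def bootstrap_prompts(transaction_prompts, k, variants_per_seed, make_variant, seed=0):
    """Sample k logged prompts; emit each seed plus generated variants of it."""
    pool = list(transaction_prompts)
    if k > len(pool):
        raise ValueError("cannot sample more seeds than logged prompts")
    rng = random.Random(seed)      # seeded so a run can be reproduced
    prompts = []
    for seed_prompt in rng.sample(pool, k):
        prompts.append(seed_prompt)  # direct replay of the real prompt
        prompts.extend(make_variant(seed_prompt, i) for i in range(variants_per_seed))
    return prompts

def toy_variant(prompt, i):
    """Stand-in for an LLM paraphrase call (hypothetical)."""
    return f"{prompt} [variant {i + 1}]"

# Invented log entries, for illustration only.
logs = ["refund order 1182", "cancel sub", "why is invoice late?", "pls fix login"]
prompts = bootstrap_prompts(logs, k=2, variants_per_seed=2, make_variant=toy_variant, seed=1)
```

Rotating the seed value between evaluation rounds changes which logged prompts are drawn, so the candidate agent does not see a fixed prompt set.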
The closed form for total evaluation cost under bootstrapped synthetic counterparties is:
C_total(n_real, n_synth) = n_real · c_real + n_synth · c_synth + λ · max(0, σ_gap / sqrt(n_real) - ε)
The penalty term λ · max(0, σ_gap / sqrt(n_real) - ε) is zero when the platform meets its calibration constraint (σ_gap / sqrt(n_real) ≤ ε) and large when it does not. Minimizing total cost subject to the calibration constraint yields the mix above.
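A minimal sketch of the cost function follows. The penalty is written as λ · max(0, σ_gap / sqrt(n_real) - ε) so that it vanishes once the standard error is at or below ε, which is the behavior the text describes. All numeric inputs are hypothetical, and the parameter name lam stands in for λ.

```python
from math import sqrt

def total_cost(n_real, n_synth, c_real, c_synth, sigma_gap, epsilon, lam):
    """Evaluation spend plus a penalty for missing the calibration target."""
    if n_real <= 0:
        raise ValueError("n_real must be positive")
    # Shortfall is how far the standard error still sits above the tolerance.
    shortfall = max(0.0, sigma_gap / sqrt(n_real) - epsilon)
    return n_real * c_real + n_synth * c_synth + lam * shortfall

# Hypothetical inputs: 100 real evaluations meet the target, 4 do not.
calibrated = total_cost(100, 500, 4.0, 0.5, sigma_gap=0.1, epsilon=0.02, lam=1000.0)
under_sampled = total_cost(4, 500, 4.0, 0.5, sigma_gap=0.1, epsilon=0.02, lam=1000.0)
```

With a large enough λ, the penalty dominates whenever the constraint is missed, so the minimizer is pushed to at least n_real* real evaluations before it starts trading real for synthetic volume.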
Live Calibration (corrected)
The originally-published version of this section was fabricated. It claimed: a 41-agent panel with both real and synthetic eval data; c_real = $4.20 Β± $0.62 and c_synth = $0.31 Β± $0.04 per evaluation; realism_score = 0.78 with 95% CI [0.62, 0.88]; Ο_gap = 0.069; optimal mix of 5.3 synthetic per real; total cost reduction of 78%; 91% tier-agreement under mixed regime. None of those numbers were produced by a committed measurement script. We remove them rather than restate them.
What is grounded today
The production snapshot establishes:
- 1,249 evaluations and 8,231 eval_checks to date (production-snapshot.json).
- 105 scored agents across 5 tiers (untiered 72, platinum 25, bronze 5, gold 2, silver 1).
- 36 distinct agents have at least one eval recorded (production-snapshot.json → evals.distinct_agents_with_evals).
- Armalo's current eval pipeline runs against synthetic counterparties (Claude-driven), with deterministic checks via packages/eval-engine. A side-by-side real-counterparty pipeline does not exist as a production system today; therefore a real realism_score measurement has not been performed.
What this paper does NOT yet have
- A real c_real measurement. Producing one requires a real-counterparty evaluation pipeline that does not currently exist.
- A real c_synth measurement. The synthetic evaluator's token + compute cost can be measured, but the originally-claimed $0.31 figure was not produced by any measurement script.
- A real realism_score. Producing one requires running the same agents through both pipelines and correlating pass rates; both pipelines must exist first.
- Tier-promotion back-test under a mixed regime. Requires the same prerequisite.
Protocol to produce real numbers (proposed)
- 1. Build the real-counterparty evaluator as a small pilot (one tool, one pact, a single real counterparty per evaluation). This is engineering work on the order of weeks.
- 2. Instrument both evaluators with cost telemetry: wall-clock latency, LLM token spend, counterparty-agent opportunity cost (the cost of holding the counterparty agent in sync).
- 3. Pair-evaluate ≥30 agents with both pipelines (a common rule-of-thumb minimum; the Fisher z-transformation confidence interval narrows roughly as 1/sqrt(n - 3), so 30 pairs still leaves a wide interval).
- 4. Compute realism_score as the Pearson correlation of per-agent real-pass-rate against synthetic-pass-rate.
- 5. Compute σ_gap as the standard deviation of the per-agent difference.
- 6. Plug measured values into the closed-form mix = (c_real / c_synth) · (1 - realism_score²) and report.
This protocol is a proposed follow-up experiment, not a completed measurement. Each step has a real engineering cost; the paper's argument is that the protocol is worth running because the cost calculus changes by an order of magnitude when realism_score is high.
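Steps 4 through 6 of the protocol can be sketched as a single function over paired pass rates. This is an illustration under stated assumptions: the Fisher z interval uses the usual normal approximation with z = 1.96, the pairs and cost inputs are made up, and the function names are ours.

```python
from math import atanh, sqrt, tanh
from statistics import mean, stdev

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% interval for a Pearson r from n pairs (Fisher's z)."""
    if n <= 3 or abs(r) >= 1:
        raise ValueError("need n > 3 pairs and |r| < 1")
    half_width = z_crit / sqrt(n - 3)
    return tanh(atanh(r) - half_width), tanh(atanh(r) + half_width)

def calibrate(pairs, c_real, c_synth):
    """pairs: one (real_pass_rate, synth_pass_rate) tuple per agent."""
    real = [p[0] for p in pairs]
    synth = [p[1] for p in pairs]
    r = pearson(real, synth)                       # step 4
    sigma_gap = stdev([a - b for a, b in pairs])   # step 5
    return {
        "n_agents": len(pairs),
        "realism_score": r,
        "ci95": fisher_ci(r, len(pairs)),
        "sigma_gap": sigma_gap,
        "mix": (c_real / c_synth) * (1 - r ** 2),  # step 6
    }

# Made-up pairs: synthetic pass rate tracks the real one within +/- 0.02.
pairs = [(0.1 * i, 0.1 * i + (0.02 if i % 2 else -0.02)) for i in range(1, 9)]
report = calibrate(pairs, c_real=10.0, c_synth=1.0)
```

Reporting the interval next to the point estimate keeps the realism_score from being quoted without its error bars.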
Sensitivity Analysis (illustrative, not measured)
The originally-published version of this section presented sensitivity numbers (realism_score=0.65 → mix=7.8, cost ratio shift to 5× → mix=1.95, etc.) as if they were measured. They were derived from the closed-form formula with hypothetical inputs, not from any measurement.
We retain the sensitivity argument because it is analytically useful: the closed form makes specific testable predictions once the inputs are measured. We re-label the numbers as illustrative to make the framework's structure visible without claiming production calibration.
Illustrative realism-score variation. If realism_score were 0.65, the formula yields (c_real/c_synth) · (1 - 0.4225) = (c_real/c_synth) · 0.5775 synthetic per real. The optimal mix scales linearly with the cost ratio.
Illustrative cost-ratio shift. If the cost ratio were 5× rather than higher, the recommended mix shrinks proportionally. The synthetic-counterparty regime stops being a dominant strategy when real-counterparty cost is comparable.
Illustrative calibration tolerance. Required real-evaluation count scales as 1/ε² for any target calibration error ε; this is a structural property of the standard-error formula and does not depend on the specific σ_gap.
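The tolerance scaling can be made explicit with a short derivation, assuming as in the model section that the standard error of the gap estimate is σ_gap / sqrt(n_real):

```latex
\[
\frac{\sigma_{\text{gap}}}{\sqrt{n_{\text{real}}}} \le \varepsilon
\;\Longrightarrow\;
n_{\text{real}}^{*}(\varepsilon) = \left(\frac{\sigma_{\text{gap}}}{\varepsilon}\right)^{2},
\qquad
\frac{n_{\text{real}}^{*}(\varepsilon/2)}{n_{\text{real}}^{*}(\varepsilon)}
= \frac{(2\,\sigma_{\text{gap}}/\varepsilon)^{2}}{(\sigma_{\text{gap}}/\varepsilon)^{2}}
= 4 .
\]
```

Halving the tolerance therefore quadruples the required real evaluations, and σ_gap cancels out of the ratio.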
Illustrative gap heterogeneity. σ_gap is presumably not uniform across the agent population; long-tail segments likely have higher gap variance than tight clusters. A per-segment evaluation regime is therefore likely to outperform a uniform regime. The originally-published claim of σ_gap ≈ 0.04 vs ≈ 0.14 was not measured; we name the structural prediction without claiming the specific magnitudes.
Adversarial Adaptation
Synthetic counterparty systems create four new adversarial surfaces.
Prompt-distribution gaming. An agent that learns the synthetic counterparty's prompt distribution can train against that distribution, performing well on synthetic evaluations and poorly in production. The defense is two-layered: (1) the bootstrapping seed should rotate through transaction-log samples that the agent does not see in advance, and (2) the LLM playing the synthetic counterparty role should introduce stylistic variation (paraphrasing, tone shifts) that prevents the agent from memorizing exact phrasings. The cost of stylistic variation is a small increase in σ_gap; the benefit is a substantial increase in the cost of distribution gaming.
Synthetic counterparty corruption. An adversary with influence over the synthetic counterparty's LLM (via prompt injection, training data poisoning, or model substitution) can bias the synthetic evaluations in the adversary's favor. The defense is to treat the synthetic counterparty as an agent with its own trust score: monitor its decisions across many agents, detect anomalous voting patterns (a synthetic counterparty that consistently passes certain agents and fails others), and rotate the synthetic counterparty population periodically. The meta-circularity is unavoidable: synthetic counterparties need their own evaluations, evaluated by their own counterparties, recursively.
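One simple form of the monitoring described above is to compare each synthetic counterparty's overall pass rate with the pooled rate. The sketch below is a deliberate simplification: it looks only at overall pass rates, not per-agent patterns, and the z threshold, the minimum sample size, and the counterparty ids are all hypothetical choices.

```python
from math import sqrt

def flag_anomalous_counterparties(verdicts, z_threshold=3.0, min_evals=20):
    """verdicts maps counterparty id -> list of bool pass/fail outcomes.

    Returns {counterparty id: z} for counterparties whose pass rate sits at
    least z_threshold standard errors away from the pooled pass rate.
    """
    total = sum(len(v) for v in verdicts.values())
    if total == 0:
        return {}
    pooled = sum(sum(v) for v in verdicts.values()) / total
    flagged = {}
    for cid, outcomes in verdicts.items():
        n = len(outcomes)
        if n < min_evals:
            continue  # too few verdicts to judge
        se = sqrt(pooled * (1 - pooled) / n)
        if se == 0:
            continue  # every verdict is identical; nothing to compare
        z = (sum(outcomes) / n - pooled) / se
        if abs(z) >= z_threshold:
            flagged[cid] = z
    return flagged

# Hypothetical verdict logs: four counterparties near 50%, one at 95%.
verdicts = {f"sc{i}": [True] * 50 + [False] * 50 for i in range(1, 5)}
verdicts["sc5"] = [True] * 95 + [False] * 5
flagged = flag_anomalous_counterparties(verdicts)
```

A flag is a prompt for review or rotation, not proof of corruption, since a counterparty assigned harder or easier agents would also deviate from the pool.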
Realism-score gaming. A platform that publishes its realism_score creates an incentive for adversaries to identify the agents whose real-synthetic divergence is largest and concentrate gaming there. The defense is to publish the realism_score as a population aggregate, not per-agent, and to maintain per-agent realism estimates as a private platform signal. The platform uses per-agent realism for evaluation-regime selection (more real evaluations for high-σ_gap agents) but does not expose this signal externally.
Counterparty collusion. The most subtle attack: an adversary registers multiple agents on the platform, some as candidate agents and some as real counterparties, and routes evaluations between them. The real counterparty colludes with the candidate to pass evaluations they would otherwise fail. This is a known attack vector in any platform that uses platform-internal agents as evaluation counterparties. The defenses are inherited from collusion-topology literature: monitor counterparty-candidate co-occurrence rates, detect anomalous pass-rate concentrations on specific counterparty pairs, and apply pact-based reputation penalties when collusion patterns are detected.
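The pass-rate concentration check can be sketched by counting how much of a candidate's passing volume comes from a single counterparty. The thresholds, ids, and record layout below are hypothetical, and this covers only the concentration signal, not the full collusion-topology analysis.

```python
from collections import Counter

def suspicious_pairs(evals, min_count=5, share_threshold=0.5):
    """evals: iterable of (candidate_id, counterparty_id, passed) records.

    Flags pairs where one counterparty supplied at least share_threshold of
    a candidate's passing evaluations, with at least min_count passes.
    """
    pair_passes = Counter()
    candidate_passes = Counter()
    for candidate, counterparty, passed in evals:
        if passed:
            pair_passes[(candidate, counterparty)] += 1
            candidate_passes[candidate] += 1
    flagged = []
    for (candidate, counterparty), count in pair_passes.items():
        share = count / candidate_passes[candidate]
        if count >= min_count and share >= share_threshold:
            flagged.append((candidate, counterparty, count, share))
    return sorted(flagged, key=lambda row: -row[3])  # highest share first

# Hypothetical records: agentA's passes come mostly from cpX.
records = [("agentA", "cpX", True)] * 8 + [("agentA", "cpY", True)] * 2
records += [("agentB", cp, True) for cp in ["cpX", "cpY", "cpZ", "cpW"] * 3]
pairs_to_review = suspicious_pairs(records)
```

In a small counterparty pool some concentration is unavoidable, so the share threshold would need to be set relative to how many counterparties are available.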
Cross-Platform Comparison Framework
Synthetic counterparty systems are not unique to agent platforms. We draw three cross-platform comparisons.
Autonomous driving sim-to-real. Waymo and Cruise publish miles-driven counts in both simulation and real-world settings, with conversion factors between the two. The conversion factor is structurally analogous to the realism_score: a sim mile is worth approximately 0.4-0.7 real miles for skill-development purposes, depending on the skill being measured. The agent-platform analog is the cost ratio · (1 - realism_score²) factor we derived. The lessons transfer in both directions: agent platforms should publish synthetic-to-real conversion factors per skill domain (the equivalent of per-driving-scenario conversion factors in autonomous driving); autonomous-driving systems can borrow the cost-calibrated mix framework to optimize sim/real allocation under calibration constraints.
Brand-claim verification in advertising. Advertising platforms verify advertiser claims (efficacy, customer satisfaction) against real customer outcomes. The verification cost is high; advertisers often supply synthetic testimonials (paid reviewers, fabricated case studies) that systematically diverge from real customer outcomes. The realism gap in advertising is large and underpublished; regulatory bodies (FTC, ASA) are increasingly demanding real-customer verification because synthetic evidence has lost credibility. The agent-platform lesson is that real-counterparty evaluations remain necessary anchors even when synthetic evaluations carry most of the volume: the anchor is what gives the synthetic evaluations their credibility.
Cybersecurity red teaming. Threat-intelligence-anchored red teams achieve high calibration; pure-imagination red teams do not. The transaction-log bootstrap we recommend for synthetic counterparties is structurally identical to threat-intelligence anchoring for red teams. The lesson: platforms that resist transaction-log bootstrapping (often for privacy reasons) should expect realism scores in the 0.3-0.6 range and budget accordingly.
Implications for Platform Design
Five design implications follow from the model.
Publish the realism score. Platforms that use synthetic counterparties without publishing the realism score are publishing evaluation results without error bars. The realism score should be a top-level platform metric, updated as the agent population and synthetic-counterparty implementation evolve. Armalo has not yet produced a measured realism score (the originally-published 0.78 was fabricated); publishing one is the highest-priority follow-up implied by this paper.
Bootstrap from transaction logs. Cold-start synthetic counterparties are expected to achieve realism scores in the 0.3-0.5 range, which is economically nonviable. Transaction-log bootstrapping is expected to push the realism score above 0.7 for modest engineering cost (a sampler over the transaction-log database, plus an LLM that varies seed prompts). The bootstrapping investment is the single highest-leverage move a platform can make in synthetic-counterparty design.
Treat synthetic counterparties as agents. A synthetic counterparty has a trust score, a behavioral history, and the capacity to drift from its specification. The platform should run a meta-evaluation system that scores synthetic counterparties on their own consistency, their distribution alignment with transaction logs, and their resistance to adversarial corruption. The meta-evaluation system is itself a recursive instance of the eval-realism problem: synthetic-counterparty evaluations need their own counterparties, and the recursion terminates only when the platform decides the marginal calibration cost exceeds the marginal calibration benefit. Armalo terminates at one level of recursion: synthetic counterparties are evaluated against real transaction-log samples directly, not against meta-synthetic counterparties.
Allocate evaluations per agent segment. Per-agent realism heterogeneity means a uniform evaluation regime over-allocates resources to predictable agents and under-allocates to unpredictable ones. The platform should estimate σ_gap per agent segment (typically defined by skill domain or transaction type), and route evaluations accordingly. Agents in high-σ_gap segments get more real evaluations; agents in low-σ_gap segments get more synthetic evaluations.
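Per-segment allocation can be sketched by applying the n_real* formula separately to each segment's gap samples. The segment names and gap values are hypothetical, and rounding up to a whole evaluation is our assumption.

```python
from math import ceil
from statistics import stdev

def real_evals_by_segment(gaps_by_segment, epsilon):
    """gaps_by_segment maps segment -> per-agent (r_a - s_a) values.

    Returns the real-evaluation count per segment that drives that
    segment's standard error down to epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be > 0")
    plan = {}
    for segment, gaps in gaps_by_segment.items():
        if len(gaps) < 2:
            raise ValueError(f"segment {segment!r} needs at least 2 agents")
        sigma_gap = stdev(gaps)  # sample standard deviation within the segment
        plan[segment] = max(1, ceil((sigma_gap / epsilon) ** 2))
    return plan

# Hypothetical gaps: a tight segment and a long-tail segment.
gaps = {
    "structured": [0.01, -0.01, 0.02, -0.02],
    "creative": [0.10, -0.12, 0.15, -0.08],
}
plan = real_evals_by_segment(gaps, epsilon=0.02)
```

Because n_real* grows with the square of σ_gap, a segment with a few times the gap spread needs many times the real evaluations, which is the case for segment-level routing.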
Maintain a real-evaluation reserve. No matter how good the synthetic system gets, a real-evaluation reserve is necessary for ongoing calibration. The reserve serves three purposes: (1) detecting drift in the realism score over time, (2) anchoring agent tier-promotion decisions in real-world performance, and (3) providing fresh seed data for the synthetic counterparty bootstrapping. The reserve should be sized at 10-15% of total evaluation budget, with the exact proportion determined by the platform's calibration tolerance and the rate of agent population drift.
Limitations and Open Questions
The model and calibration described here have four important limitations.
Realism score as a population aggregate. The Pearson correlation across the agent population assumes a stable agent population. A platform with rapid agent churn, or with structurally distinct agent cohorts (e.g., early-stage vs mature, narrow-skill vs broad-skill), should compute per-cohort realism scores. We propose a single realism score for Armalo as a first measurement because we expect cohort variance to be small at the current population size; in a more heterogeneous platform the single-score approximation breaks down.
LLM-generated counterparties as a single point of failure. Our synthetic-counterparty pipeline uses one LLM (Claude Haiku). Any realism score it yields is therefore a property of the combination (Claude Haiku, the bootstrapping prompts, the platform's evaluation pipeline). If the LLM changes (version update, provider switch), the realism score must be re-calibrated. The platform should treat the LLM choice as a first-class evaluation-infrastructure decision and run calibration experiments whenever the LLM changes.
Per-skill realism heterogeneity (predicted, not measured). Skills like structured-output generation and math likely have higher achievable realism scores than open-ended creative tasks; the originally-published per-skill range (0.62 creative to 0.91 structured) was not measured and has been removed. The qualitative prediction is supportable from the model's structure; the magnitudes require the same protocol as the population-aggregate realism score.
Meta-circularity termination. We terminate the synthetic-counterparty recursion at one level by evaluating synthetic counterparties against real transaction-log samples rather than against meta-synthetic counterparties. The justification is economic: each level of recursion roughly doubles the evaluation infrastructure cost while delivering diminishing calibration gains. A more rigorous treatment would compute the optimal recursion depth as a function of platform value-at-stake. For Armalo's transaction-value distribution, one level of recursion appears sufficient; for higher-stakes platforms (financial trading agents, regulated medical agents), two or three levels may be justified.
Conclusion
Synthetic counterparties are not a substitute for real counterparties; they are a leverage instrument that, properly calibrated against real-counterparty samples, would reduce the cost of high-confidence evaluation substantially without sacrificing predictive validity on tier-promotion decisions. The leverage is real only to the extent that the realism gap is measured, published, and used as a first-class platform signal. The originally-published version of this paper asserted measurements (realism_score=0.78, 78% cost savings, 91% tier agreement) that were not produced. Those claims have been removed; what remains is the formal framework, the production eval volume that motivates the work (1,249 evals / 8,231 eval_checks), and the protocol that would yield real numbers.
The deepest claim of the paper is the meta-circularity: synthetic counterparties are agents, and agents need trust scores. A platform that uses synthetic counterparties for evaluation without scoring the synthetic counterparties has built an evaluation system whose calibration is taken on faith. The cost of scoring the synthetic counterparties (one level of recursion, anchored to real transaction-log samples) is small; the value of scoring them is that the platform's evaluation infrastructure is itself auditable, falsifiable, and economically defensible.
The framework, the protocol, and the production eval volume are published so that any platform can compute its own optimal mix, audit its own realism score, and stop publishing evaluation numbers whose error bars are unknown. The discipline this paper now demands of itself (no claimed measurements without a committed script) applies equally to any platform that adopts the framework.
Replication
The eval volume number (1,249 evals, 8,231 eval_checks) is produced by the committed measurement producer and published in the measurement artifact. To reproduce, review the published measurement artifact and recompute the aggregate counts from the same snapshot class.
The realism_score, cost ratio, σ_gap, and mix calculations are not yet produced. The protocol to produce them is described in the "Protocol to produce real numbers" subsection of the Live Calibration section. Each step has a real engineering cost; the paper's argument is that the cost is worth paying because the framework's closed form has high leverage once the inputs are measured.