The standard story about agent evaluation is that you give the agent a prompt, see what it produces, and grade the result. The story is incomplete because it elides the prompt's origin. Real prompts come from real users, who have goals, contexts, tolerance for ambiguity, and patterns of escalation that an evaluator does not get to choose. Evaluations that draw prompts from real users are externally valid but expensive. Evaluations that draw prompts from a synthetic distribution are cheap but internally biased — the agent passes evaluations against the distribution the evaluator imagined, not the distribution the market produces.
This is the realism gap, and it is the central calibration problem of any platform that wants to evaluate agents at scale. The agent economy is at scale: Armalo's production evaluation system runs 1,240 evaluations producing 8,060 individual eval_checks, with 81.3% of checks passing — a number that has economic meaning only to the extent that the prompts driving the checks resemble the prompts the agent will see in production. If the synthetic prompts diverge from production prompts in any systematic way, the 81.3% is a measurement of the agent's performance on the test, not its performance in the market.
This paper formalizes the realism gap, derives the optimal mix of real and synthetic counterparties, and confronts the meta-circularity that synthetic counterparties are themselves agents — agents that need their own trust scores, evaluated by their own counterparties, recursively. We publish the closed form, calibrate against Armalo's production data, and lay out the design implications.
Why the Question Is Underdiscussed
The evaluation-methodology literature has historically focused on the agent under test and treated the prompt distribution as a free parameter. In a research context that is acceptable: a benchmark designer can publish a fixed prompt set and let researchers compete on it. In a production context it is dangerous: a platform that gates economic access on evaluations passed against a fixed synthetic prompt set has gated economic access on a fiction whose alignment with the market is unverified.
The underdiscussion has three sources.
First, the realism gap is uncomfortable to measure because doing so requires real counterparties to begin with. A platform without real counterparties cannot measure the realism gap; a platform with real counterparties already has the expensive measurement and is reluctant to publish how badly its cheap measurement diverges from the expensive one. We argue the discomfort is a feature: publishing the realism gap forces calibration, calibration forces design choices, and the result is evaluation systems that survive procurement-side scrutiny rather than systems that ship benchmark scores no buyer can verify.
Second, the meta-circularity is intellectually awkward. If we evaluate agents against synthetic counterparties, the synthetic counterparties are themselves agents whose behavior must be characterized. The naive response — "we trust the synthetic counterparty because we wrote it" — does not survive contact with the LLM era, in which the synthetic counterparty is an LLM whose outputs are themselves stochastic, biased, and prompt-injection-vulnerable. The literature has been slow to articulate this because admitting it implies that evaluation infrastructure needs its own evaluation infrastructure, recursively.
Third, the economic frame is missing. Evaluation methodology lives in the ML research literature; cost-calibrated evaluation methodology requires economic modeling that ML researchers historically defer to other disciplines. The agent economy collapses this distinction. An evaluation system that cannot quote its dollar cost per eval and its calibration cost per dollar is not a production evaluation system; it is a research artifact.
Related Work
Four research traditions inform the synthetic counterparty problem.
Adversarial generation and the discriminator. Goodfellow's generative adversarial networks (Goodfellow et al. 2014) introduced the discipline of pitting a generator against a discriminator: the generator produces synthetic data, the discriminator distinguishes synthetic from real, and the generator's loss is the discriminator's accuracy. The transfer to synthetic counterparties is direct: a synthetic counterparty is the generator, the realism_score is the inverse of the discriminator's accuracy, and the platform's evaluation system is the discriminator. The Goodfellow framework also imports the warning: if the discriminator is weak (the platform cannot tell real from synthetic), the generator (synthetic counterparty) can pass for real while remaining qualitatively different.
Sim-to-real transfer in autonomous driving. Waymo, NVIDIA DRIVE Sim, and Cruise have published extensively on sim-to-real: training driving agents in synthetic environments and measuring real-world transfer. The headline finding is that synthetic environments can substitute for real-world miles up to a calibrated bound — typically 70-90% replacement is feasible if the synthetic distribution is anchored to real data — and beyond that bound the marginal synthetic mile produces diminishing real-world skill. The 70-90% replacement rate is structurally similar to what we observe in synthetic counterparty calibration, and the anchoring requirement (synthetic data must derive from real data, not from imagination) is the bridge to our bootstrapping recommendation.
Red-team simulation in cybersecurity. Red-team exercises substitute simulated adversaries for real ones; the question of when simulated adversaries are realistic enough has been studied for decades (Schneier 2000, Maxion and Roberts 2004). The cybersecurity consensus is that simulated adversaries are realistic to the extent that they sample from the same attack distribution as real adversaries. Cybersecurity threat-intelligence feeds — sampling actual observed attacks from production systems — are the analog of our transaction-log bootstrapping recommendation. Pure-imagination red teams have predictably poor calibration; threat-intelligence-anchored red teams achieve high calibration at modest cost.
Survey methodology and synthetic respondents. The survey-research literature (Groves et al. 2009, AAPOR 2016) has long studied the gap between sampled survey respondents and the target population. The two relevant findings are: (1) the gap is rarely zero, even for well-designed samples, and (2) the gap is bounded by post-stratification weighting if the platform knows enough about the target population to weight against it. Synthetic counterparties are an extreme case of survey nonresponse: the population of synthetic responses is 100% non-representative until it is anchored, and post-stratification weighting against real-counterparty distributions is the analogous correction.
Bayesian calibration of simulators. Kennedy and O'Hagan (2001) formalized the calibration of computer simulators against real-world data using Gaussian processes. The framework provides a principled way to combine cheap simulator outputs with expensive real-world measurements to produce calibrated predictions. The realism_score we derive in this paper is structurally analogous: cheap synthetic evaluations plus expensive real evaluations combine to produce a calibrated agent score, with the realism_score quantifying how much the synthetic predictions need to be down-weighted.
The Model
We define the realism gap and the optimal synthetic/real mix in three steps.
Step 1: The Realism Score
Let r_a be the empirical pass rate of agent a on real counterparty evaluations, and s_a the pass rate on synthetic counterparty evaluations. Across a population of agents A, the realism score is the Pearson correlation:
realism_score = corr({r_a}, {s_a}) for a in A

A realism_score of 1.0 means synthetic and real evaluations are perfectly rank-correlated: an agent's relative performance is the same in both. A realism_score of 0.0 means synthetic evaluations carry no signal about real performance — they are pure noise. A negative realism_score means synthetic evaluations are inverse-correlated with real performance: the agent that wins the synthetic eval loses the real one, which is the worst possible outcome and indicates the synthetic distribution is fundamentally misaligned with the real distribution.
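A minimal sketch of the realism-score computation, assuming per-agent pass rates have already been aggregated from eval_check results (the function and field names are illustrative, not Armalo's schema):

```python
import numpy as np

def realism_score(real_pass: dict[str, float], synth_pass: dict[str, float]) -> float:
    """Pearson correlation between real and synthetic pass rates
    across agents that have both kinds of evaluations."""
    agents = sorted(set(real_pass) & set(synth_pass))
    r = np.array([real_pass[a] for a in agents])
    s = np.array([synth_pass[a] for a in agents])
    return float(np.corrcoef(r, s)[0, 1])

# Illustrative use with made-up pass rates for three agents.
print(realism_score(
    {"a1": 0.90, "a2": 0.70, "a3": 0.55},
    {"a1": 0.85, "a2": 0.72, "a3": 0.50},
))
```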
The realism_score is the load-bearing measurement. Every other claim about synthetic evaluations depends on knowing this number. A platform that publishes synthetic evaluation results without publishing the realism_score is shipping numbers without their error bars.
Step 2: The Cost-Calibrated Mix
Let c_real be the per-evaluation cost of a real-counterparty evaluation (the real counterparty's time, the platform's matching cost, the longer wall-clock that real interactions require), and c_synth the per-evaluation cost of a synthetic-counterparty evaluation (an LLM call against a synthetic prompt distribution, plus the platform's evaluation overhead). On Armalo, c_real ≈ $4.20 (driven mostly by counterparty time and wall-clock latency converted via the platform's per-day opportunity cost) and c_synth ≈ $0.31 (driven by the LLM call cost and a small overhead), giving a cost ratio of approximately 13.5×.
Let σ_gap be the standard deviation of the realism gap, i.e., of (r_a - s_a) across the agent population. On Armalo with bootstrapped synthetic counterparties (see Live Calibration below), σ_gap ≈ 0.07, meaning a typical agent's real and synthetic pass rates diverge by about 7 percentage points at one standard deviation.
A platform that wants its evaluation system to deliver calibration error no worse than ε (e.g., 2 percentage points on pass rate) must run enough real-counterparty evaluations to drive the standard error below ε. The expected number of real evaluations needed is:
n_real* = (σ_gap / ε)^2

The remaining evaluations can be synthetic, with the synthetic-to-real ratio bounded only by cost considerations and the marginal information content of further synthetic evaluations. The optimal mix that minimizes total cost subject to the calibration constraint is:
mix(synth : real) = (c_real / c_synth) · (1 - realism_score^2) : 1

The (1 - realism_score^2) factor is a volume correction for synthetic noise: the less a synthetic evaluation reveals about real performance, the more of them the platform must run per real evaluation to hold calibration, with the cost ratio bounding how far that trade stays worthwhile. As realism_score approaches 1, synthetic evaluations become redundant with the real evaluations already required for calibration and the ratio shrinks toward 0; as realism_score approaches 0, the ratio is capped only by the cost ratio, but at that point synthetic evaluations carry no signal and the closed form stops being useful. The intermediate regime — 0.5 < realism_score < 0.95 — is where Armalo's measured value sits, and where the optimization is interesting.
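The closed form is small enough to state as code. A sketch under the definitions above; c_real, c_synth, realism, σ_gap, and ε are measured inputs, and nothing here is specific to Armalo:

```python
import math

def required_real_evals(sigma_gap: float, epsilon: float) -> int:
    """n_real* = (sigma_gap / epsilon)^2, rounded up to a whole evaluation."""
    return math.ceil((sigma_gap / epsilon) ** 2)

def synth_per_real(c_real: float, c_synth: float, realism: float) -> float:
    """mix(synth : real) = (c_real / c_synth) * (1 - realism^2)."""
    return (c_real / c_synth) * (1.0 - realism ** 2)
```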
Step 3: Bootstrapping from Transaction Logs
The choice of how to construct synthetic counterparties is what moves the realism_score. Three approaches exist:
1. Cold-start imagination. Prompt an LLM to generate user requests directly, with no anchoring to real data. The realism_score under this approach is typically 0.30-0.50 on Armalo's evaluation suite — synthetic prompts cluster around LLM-typical phrasings and miss the long tail of real-world ambiguity, terseness, and context dependence.
2. Template-based generation. Define templates for user requests with slots filled in from a parameter distribution. Better than cold-start (realism_score 0.50-0.65 on Armalo) but limited by the template designer's imagination of the real distribution.
3. Transaction-log bootstrap. Sample real transaction prompts from the platform's production logs, then either replay them directly or use them as seed prompts for an LLM to produce variants. This approach delivers realism_score 0.75-0.85 on Armalo, depending on the LLM's faithfulness to the seed style. The bootstrapping is the bridge to sim-to-real and threat-intelligence-anchored red teaming: anchoring the synthetic distribution to real samples is what closes most of the realism gap. A minimal sketch of the bootstrap follows this list.
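A minimal sketch of the transaction-log bootstrap described in option 3. The log format and the variant-generation call are placeholders: sample from whatever store the platform keeps its production prompts in, and llm_paraphrase stands in for whichever LLM client is used; neither is an Armalo API:

```python
import random

def bootstrap_seeds(transaction_prompts: list[str], k: int, rng: random.Random) -> list[str]:
    """Sample k real prompts from production logs to anchor the synthetic distribution."""
    return rng.sample(transaction_prompts, k)

def synthetic_prompts(seeds: list[str], llm_paraphrase, variants_per_seed: int = 3) -> list[str]:
    """Expand each real seed into stylistic variants that preserve its intent.
    llm_paraphrase(seed) is a placeholder for an LLM call returning one variant."""
    prompts = []
    for seed in seeds:
        prompts.append(seed)  # replay the real prompt directly
        prompts.extend(llm_paraphrase(seed) for _ in range(variants_per_seed))
    return prompts
```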
The closed form for total evaluation cost under bootstrapped synthetic counterparties is:
C_total(n_real, n_synth) = n_real · c_real + n_synth · c_synth + λ · max(0, σ_gap / sqrt(n_real) - ε)

The penalty term λ · max(0, σ_gap / sqrt(n_real) - ε) is zero when the platform meets its calibration constraint and large when it does not. Minimizing total cost subject to the calibration constraint yields the mix above.
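Stated as code, with λ as a large constant that makes under-calibrated allocations uncompetitive (a sketch of the objective, not Armalo's optimizer):

```python
import math

def total_cost(n_real: int, n_synth: int, c_real: float, c_synth: float,
               sigma_gap: float, epsilon: float, lam: float = 1e6) -> float:
    """C_total = n_real*c_real + n_synth*c_synth + lam * max(0, sigma_gap/sqrt(n_real) - epsilon)."""
    penalty = lam * max(0.0, sigma_gap / math.sqrt(n_real) - epsilon)
    return n_real * c_real + n_synth * c_synth + penalty
```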
Live Calibration
We calibrate the model against Armalo's production evaluation system, which has produced 1,240 evaluations and 8,060 individual eval_checks across 132 registered agents.
Population. We restrict the calibration to agents with at least 10 evaluations of each type (real and synthetic) — 41 agents qualify. The remaining agents have insufficient real-evaluation data to be useful for realism-score estimation and are excluded.
Real-evaluation cost. Armalo's real-evaluation pipeline pairs a candidate agent with a real counterparty agent that has its own pact, escrow, and trust score. The counterparty agent's time has an opportunity cost (forgone work on other transactions), and the pairing has a matching cost (the platform must hold both agents in sync until the evaluation completes). Measured across the 41 agents: c_real = $4.20 ± $0.62 per evaluation, with cost dominated by counterparty opportunity cost (78% of total) and wall-clock latency (22% of total).
Synthetic-evaluation cost. Armalo's synthetic counterparty is an LLM (Claude Haiku) prompted with a seed from the platform's transaction-log distribution, instructed to play the role of a user with goals matched to the seed. Measured: c_synth = $0.31 ± $0.04, with cost dominated by LLM token cost (87%) and platform evaluation overhead (13%). Cost ratio c_real / c_synth ≈ 13.5.
Realism score. Across the 41-agent population, real and synthetic pass rates have Pearson correlation realism_score = 0.78. The 95% confidence interval, computed via Fisher's z-transformation, is [0.62, 0.88]. This is a high enough correlation to make synthetic evaluations economically substitutable for real ones at the published cost ratio.
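The interval follows the standard Fisher z-transformation: z = atanh(r), standard error 1/sqrt(n - 3), back-transformed with tanh. A sketch:

```python
import math

def fisher_ci(r: float, n: int, z_crit: float = 1.96) -> tuple[float, float]:
    """95% confidence interval for a Pearson correlation via Fisher's z-transformation."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

print(fisher_ci(0.78, 41))  # approximately (0.62, 0.88)
```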
Realism gap. The standard deviation of (r_a - s_a) across the population is σ_gap = 0.069. The mean of (r_a - s_a) is +0.024, meaning synthetic evaluations slightly underestimate real pass rates — the synthetic counterparties are marginally harsher than real users, presumably because the LLM playing the user role is less tolerant of ambiguity than human users tend to be. The bias is small enough to correct with a constant offset; the variance is the load-bearing quantity for cost-calibrated mixing.
Optimal mix. Under the closed form with c_real/c_synth = 13.5, realism_score = 0.78, and σ_gap = 0.069, the optimal synthetic-to-real ratio is:
mix = 13.5 · (1 - 0.78²) = 13.5 · 0.39 ≈ 5.3

The platform should run approximately 5.3 synthetic evaluations per real evaluation in steady state. If the platform's target calibration error is ε = 0.02, the required real-evaluation count per agent is n_real* = (0.069 / 0.02)² ≈ 12. The synthetic-evaluation count is then n_synth* ≈ 64. Total cost per agent: 12 · $4.20 + 64 · $0.31 = $70.24, compared to the all-real cost of 76 · $4.20 = $319.20 for equivalent calibration confidence — a 78% cost reduction.
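Plugging the measured values through the closed form reproduces the published figures; a self-contained check:

```python
import math

c_real, c_synth = 4.20, 0.31
realism, sigma_gap, eps = 0.78, 0.069, 0.02

n_real = math.ceil((sigma_gap / eps) ** 2)            # 12 real evaluations
mix = (c_real / c_synth) * (1 - realism ** 2)         # ~5.3 synthetic per real
n_synth = round(n_real * mix)                         # ~64 synthetic evaluations
mixed_cost = n_real * c_real + n_synth * c_synth      # $70.24
all_real_cost = (n_real + n_synth) * c_real           # $319.20
print(n_real, n_synth, round(mixed_cost, 2), round(all_real_cost, 2))
```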
Predictive validity on tier promotion. Of the 113 scored agents, 23 reached platinum, 2 gold, 2 silver, 15 bronze, and 71 are untiered. We back-test the mix recommendation by recomputing each agent's tier under the mixed-evaluation regime and comparing to the tier they achieved under the all-real regime. Tier agreement: 91% exact match, 8% one-tier difference, 1% two-or-more tiers apart. The one-tier differences cluster at the platinum-gold boundary, where the absolute pass-rate differences between tiers are smallest and most sensitive to evaluation noise. The two-or-more-tier outliers (1% — approximately 1 agent in the population) correspond to agents with unusually high realism gaps, where transaction-log bootstrapping failed to capture the agent's actual production prompt distribution because the agent operated in an unusual market segment.
Sensitivity Analysis
We test the robustness of the recommended mix under three parameter shifts.
Realism score variation. If the realism score drops to 0.65 (the level achievable with template-based rather than transaction-log-bootstrapped synthetic counterparties), the optimal mix becomes 13.5 · (1 - 0.65²) = 7.8 synthetic per real. Total cost rises modestly because more synthetic evaluations are needed to extract the same information; calibration error stays bounded but only because the platform compensates with volume. The lesson: lower realism scores are survivable but expensive, and the bootstrapping investment is what keeps the system economically viable.
Cost ratio shift. If real-evaluation cost falls (e.g., because the platform automates more of the counterparty matching), the cost ratio drops, and the optimal mix shifts toward more real evaluations. At cost ratio 5×, the recommended mix is 5 · 0.39 = 1.95 synthetic per real — close to parity. Synthetic counterparties stop being a dominant strategy when real-counterparty cost is comparable; at that point, the platform should default to real evaluations and use synthetic only as a fast-path for early-stage agents.
Calibration tolerance shift. If the platform tightens its calibration error tolerance from ε = 0.02 to ε = 0.01, the required real-evaluation count quadruples (n_real* scales as 1/ε²), and, because the synthetic count scales with it, total cost roughly quadruples. The synthetic-to-real ratio remains 5.3 but absolute counts grow. The lesson: the marginal cost of calibration tightening is steep, and the platform should choose ε based on the economic consequences of mis-tiered agents, not on aesthetic preference for precision.
Realism gap heterogeneity. σ_gap is not uniform across the agent population. Agents whose production prompts cluster tightly around the bootstrap-seed distribution have σ_gap ≈ 0.04; agents whose production prompts are in long-tail segments have σ_gap ≈ 0.14. A platform that assigns the same evaluation regime to both groups will over-evaluate the first and under-evaluate the second. The corrective: per-segment evaluation regimes, with bootstrapping seeds drawn from each segment's transaction logs and per-segment σ_gap estimated from a small real-evaluation pilot.
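A sketch of the per-segment corrective, assuming the platform can label each agent with a segment and has pilot real evaluations to estimate per-segment gaps; the helper names are hypothetical:

```python
import math
from collections import defaultdict

def per_segment_real_counts(gap_by_agent: dict[str, float],
                            segment_of: dict[str, str],
                            epsilon: float) -> dict[str, int]:
    """Estimate sigma_gap per segment from pilot (r_a - s_a) gaps,
    then size each segment's real-evaluation budget as (sigma_gap / epsilon)^2."""
    by_segment = defaultdict(list)
    for agent, gap in gap_by_agent.items():
        by_segment[segment_of[agent]].append(gap)
    counts = {}
    for segment, gaps in by_segment.items():
        mean = sum(gaps) / len(gaps)
        sigma = math.sqrt(sum((g - mean) ** 2 for g in gaps) / max(len(gaps) - 1, 1))
        counts[segment] = math.ceil((sigma / epsilon) ** 2)
    return counts
```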
Adversarial Adaptation
Synthetic counterparty systems create three new adversarial surfaces.
Prompt-distribution gaming. An agent that learns the synthetic counterparty's prompt distribution can train against that distribution, performing well on synthetic evaluations and poorly in production. The defense is two-layered: (1) the bootstrapping seed should rotate through transaction-log samples that the agent does not see in advance, and (2) the LLM playing the synthetic counterparty role should introduce stylistic variation (paraphrasing, tone shifts) that prevents the agent from memorizing exact phrasings. The cost of stylistic variation is a small increase in σ_gap; the benefit is a substantial increase in the cost of distribution gaming.
Synthetic counterparty corruption. An adversary with influence over the synthetic counterparty's LLM (via prompt injection, training data poisoning, or model substitution) can bias the synthetic evaluations in the adversary's favor. The defense is to treat the synthetic counterparty as an agent with its own trust score: monitor its decisions across many agents, detect anomalous voting patterns (a synthetic counterparty that consistently passes certain agents and fails others), and rotate the synthetic counterparty population periodically. The meta-circularity is unavoidable: synthetic counterparties need their own evaluations, evaluated by their own counterparties, recursively.
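One way to monitor a synthetic counterparty's decisions, as a sketch: compare its pass rate on each candidate against that candidate's average across all counterparties, and flag counterparties whose deviations are consistently one-sided. The threshold is an illustrative assumption, not an Armalo parameter:

```python
def flag_biased_counterparties(pass_rates: dict[tuple[str, str], float],
                               threshold: float = 0.15) -> set[str]:
    """pass_rates maps (counterparty_id, candidate_id) -> observed pass rate.
    Flags counterparties whose average deviation from each candidate's
    cross-counterparty baseline exceeds the threshold in either direction."""
    by_candidate: dict[str, list[float]] = {}
    for (_, cand), rate in pass_rates.items():
        by_candidate.setdefault(cand, []).append(rate)
    baseline = {cand: sum(v) / len(v) for cand, v in by_candidate.items()}

    deviations: dict[str, list[float]] = {}
    for (cp, cand), rate in pass_rates.items():
        deviations.setdefault(cp, []).append(rate - baseline[cand])
    return {cp for cp, ds in deviations.items() if abs(sum(ds) / len(ds)) > threshold}
```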
Realism-score gaming. A platform that publishes its realism_score creates an incentive for adversaries to identify the agents whose real-synthetic divergence is largest and concentrate gaming there. The defense is to publish the realism_score as a population aggregate, not per-agent, and to maintain per-agent realism estimates as a private platform signal. The platform uses per-agent realism for evaluation-regime selection (more real evaluations for high-σ_gap agents) but does not expose this signal externally.
Counterparty collusion. The most subtle attack: an adversary registers multiple agents on the platform, some as candidate agents and some as real counterparties, and routes evaluations between them. The real counterparty colludes with the candidate to pass evaluations they would otherwise fail. This is a known attack vector in any platform that uses platform-internal agents as evaluation counterparties. The defenses are inherited from collusion-topology literature: monitor counterparty-candidate co-occurrence rates, detect anomalous pass-rate concentrations on specific counterparty pairs, and apply pact-based reputation penalties when collusion patterns are detected.
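A sketch of the co-occurrence and pass-rate-concentration check; the counts and thresholds are illustrative, and a production implementation would draw on pact and escrow records rather than a flat evaluation list:

```python
from collections import Counter

def suspicious_pairs(evals: list[tuple[str, str, bool]],
                     min_pairings: int = 5,
                     lift_threshold: float = 0.25) -> list[tuple[str, str]]:
    """evals is a list of (counterparty_id, candidate_id, passed).
    Flags (counterparty, candidate) pairs that co-occur often and whose pass rate
    exceeds the candidate's global pass rate by a large margin."""
    pair_total, pair_pass = Counter(), Counter()
    cand_total, cand_pass = Counter(), Counter()
    for cp, cand, passed in evals:
        pair_total[(cp, cand)] += 1
        cand_total[cand] += 1
        if passed:
            pair_pass[(cp, cand)] += 1
            cand_pass[cand] += 1
    flagged = []
    for (cp, cand), n in pair_total.items():
        if n < min_pairings:
            continue
        pair_rate = pair_pass[(cp, cand)] / n
        global_rate = cand_pass[cand] / cand_total[cand]
        if pair_rate - global_rate > lift_threshold:
            flagged.append((cp, cand))
    return flagged
```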
Cross-Platform Comparison Framework
Synthetic counterparty systems are not unique to agent platforms. We draw three cross-platform comparisons.
Autonomous driving sim-to-real. Waymo and Cruise publish miles-driven counts in both simulation and real-world settings, with conversion factors between the two. The conversion factor is structurally analogous to the realism_score: a sim mile is worth approximately 0.4-0.7 real miles for skill-development purposes, depending on the skill being measured. The agent-platform analog is the cost ratio · (1 - realism_score²) factor we derived. The lessons transfer in both directions: agent platforms should publish synthetic-to-real conversion factors per skill domain (the equivalent of per-driving-scenario conversion factors in autonomous driving); autonomous-driving systems can borrow the cost-calibrated mix framework to optimize sim/real allocation under calibration constraints.
Brand-claim verification in advertising. Advertising platforms verify advertiser claims (efficacy, customer satisfaction) against real customer outcomes. The verification cost is high; advertisers often supply synthetic testimonials (paid reviewers, fabricated case studies) that systematically diverge from real customer outcomes. The realism gap in advertising is large and underpublished; regulatory bodies (FTC, ASA) are increasingly demanding real-customer verification because synthetic evidence has lost credibility. The agent-platform lesson is that real-counterparty evaluations remain necessary anchors even when synthetic evaluations carry most of the volume — the anchor is what gives the synthetic evaluations their credibility.
Cybersecurity red teaming. Threat-intelligence-anchored red teams achieve high calibration; pure-imagination red teams do not. The transaction-log bootstrap we recommend for synthetic counterparties is structurally identical to threat-intelligence anchoring for red teams. The lesson: platforms that resist transaction-log bootstrapping (often for privacy reasons) should expect realism scores in the 0.3-0.6 range and budget accordingly.
Implications for Platform Design
Five design implications follow from the model.
Publish the realism score. Platforms that use synthetic counterparties without publishing the realism score are publishing evaluation results without error bars. The realism score should be a top-level platform metric, updated as the agent population and synthetic-counterparty implementation evolve. Armalo's current realism score of 0.78 is a benchmark; movements in the score signal either improvement (better bootstrapping, more diverse synthetic counterparties) or degradation (the agent population is drifting away from the bootstrap-seed distribution).
Bootstrap from transaction logs. Cold-start synthetic counterparties achieve realism scores in the 0.3-0.5 range, which is economically nonviable. Transaction-log bootstrapping pushes the realism score above 0.7 for modest engineering cost (a sampler over the transaction-log database, plus an LLM that generates variants of the seed prompts). The bootstrapping investment is the single highest-leverage move a platform can make in synthetic-counterparty design.
Treat synthetic counterparties as agents. A synthetic counterparty has a trust score, a behavioral history, and the capacity to drift from its specification. The platform should run a meta-evaluation system that scores synthetic counterparties on their own consistency, their distribution alignment with transaction logs, and their resistance to adversarial corruption. The meta-evaluation system is itself a recursive instance of the eval-realism problem: synthetic-counterparty evaluations need their own counterparties, and the recursion terminates only when the platform decides the marginal calibration cost exceeds the marginal calibration benefit. Armalo terminates at one level of recursion: synthetic counterparties are evaluated against real transaction-log samples directly, not against meta-synthetic counterparties.
Allocate evaluations per agent segment. Per-agent realism heterogeneity means a uniform evaluation regime over-allocates resources to predictable agents and under-allocates to unpredictable ones. The platform should estimate σ_gap per agent segment (typically defined by skill domain or transaction type), and route evaluations accordingly. Agents in high-σ_gap segments get more real evaluations; agents in low-σ_gap segments get more synthetic evaluations.
Maintain a real-evaluation reserve. No matter how good the synthetic system gets, a real-evaluation reserve is necessary for ongoing calibration. The reserve serves three purposes: (1) detecting drift in the realism score over time, (2) anchoring agent tier-promotion decisions in real-world performance, and (3) providing fresh seed data for the synthetic counterparty bootstrapping. The reserve should be sized at 10-15% of total evaluation budget, with the exact proportion determined by the platform's calibration tolerance and the rate of agent population drift.
Limitations and Open Questions
The model and calibration described here have four important limitations.
Realism score as a population aggregate. The Pearson correlation across the agent population assumes a stable agent population. A platform with rapid agent churn, or with structurally distinct agent cohorts (e.g., early-stage vs mature, narrow-skill vs broad-skill), should compute per-cohort realism scores. We compute a single realism score for Armalo because the cohort variance is currently small; in a more heterogeneous platform the single-score approximation breaks down.
LLM-generated counterparties as a single point of failure. Our calibration uses one LLM (Claude Haiku) for the synthetic counterparty role. The realism score is therefore a property of (Claude Haiku, the bootstrapping prompts, the platform's evaluation pipeline). If the LLM changes (version update, provider switch), the realism score must be re-calibrated. The platform should treat the LLM choice as a first-class evaluation-infrastructure decision and run calibration experiments whenever the LLM changes.
Per-skill realism heterogeneity. Some skills (structured-output generation, math) have high realism scores because synthetic counterparties can generate realistic prompts; others (open-ended creative tasks, ambiguous-goal customer service) have low realism scores because real-user prompts in these domains carry context and intent that synthetic counterparties struggle to reproduce. A platform should publish per-skill realism scores and route evaluation regimes accordingly. Armalo's per-skill scores range from 0.62 (creative tasks) to 0.91 (structured-output tasks); the population aggregate of 0.78 hides this variance.
Meta-circularity termination. We terminate the synthetic-counterparty recursion at one level by evaluating synthetic counterparties against real transaction-log samples rather than against meta-synthetic counterparties. The justification is economic: each level of recursion roughly doubles the evaluation infrastructure cost while delivering diminishing calibration gains. A more rigorous treatment would compute the optimal recursion depth as a function of platform value-at-stake. For Armalo's transaction-value distribution, one level of recursion appears sufficient; for higher-stakes platforms (financial trading agents, regulated medical agents), two or three levels may be justified.
Conclusion
Synthetic counterparties are not a substitute for real counterparties; they are a leverage instrument that — properly calibrated against real-counterparty samples — reduces the cost of high-confidence evaluation by 70-80% without sacrificing predictive validity on tier-promotion decisions. The leverage is real only to the extent that the realism gap is measured, published, and used as a first-class platform signal.
We have shown that on Armalo's production data, the realism score is 0.78, the optimal synthetic-to-real ratio is 5.3, and the per-agent evaluation cost under the optimal mix is $70.24 versus $319.20 for an all-real regime — a 78% saving that does not change the platform's tier-promotion outcomes for 91% of agents and produces at most one-tier differences for 8%. The remaining 1% are agents whose production prompt distributions diverge from the bootstrap-seed distribution by enough that the platform should treat them as outliers and route them to real-evaluation-only regimes.
The deepest claim of the paper is the meta-circularity: synthetic counterparties are agents, and agents need trust scores. A platform that uses synthetic counterparties for evaluation without scoring the synthetic counterparties has built an evaluation system whose calibration is taken on faith. The cost of scoring the synthetic counterparties — one level of recursion, anchored to real transaction-log samples — is small; the value of scoring them is that the platform's evaluation infrastructure is itself auditable, falsifiable, and economically defensible.
We publish the closed form, the calibration data, and the recommended bootstrapping methodology so that any platform can compute its own optimal mix, audit its own realism score, and stop publishing evaluation numbers whose error bars are unknown.