The reputation-system industry has trended steadily toward lower per-evaluation cost over the past five years. Human-judged evaluations at $5+ per pass have been displaced by LLM-judged evaluations at $0.05; LLM-judged evaluations are increasingly being supplemented by deterministic check pipelines that run for fractions of a cent. The trajectory has been celebrated as a productivity win — more evaluation coverage per dollar — and the per-eval cost reduction has been treated as the primary metric of evaluation infrastructure quality.
This paper argues that the trajectory is the wrong one, or more precisely, that it has been measured against the wrong objective. The right metric for evaluation infrastructure is not cost per eval but cost per Sybil-pass: the cost an attacker incurs to produce one evaluation that passes the platform's tier criteria. The Sybil-pass cost has a closed-form expression in terms of the per-eval cost and the empirical pass rate, and the relationship has the structural property that low per-eval cost combined with high pass rate produces a Sybil-pass cost that is orders of magnitude below the rentable-attack threshold.
We derive the closed form: min_attacker_cost = c_eval × (1/p_eval). We calibrate against Armalo's 8,060 individual eval_checks at 81.3% pass rate (6,556 passes / 1,504 failures), the 1,240 evaluations at 91.5% completion, and the platform's mixed-architecture eval pipeline. We show that the platform's current configuration produces a min_attacker_cost low enough that Sybil enumeration is profitable across most plausible attacker budgets. We then run the analysis across three eval architectures — human-judged at $5/eval, LLM-judged at $0.05/eval, and deterministic at $0.0001/eval — and reach the central finding: deterministic evals, typically considered the most rigorous because they apply objective criteria, are the most adversarially insecure component because their near-zero cost makes brute-force enumeration cheap.
The argument inverts the dominant product-management instinct. Evaluation cost is not the friction to be minimized; it is the security property to be calibrated. A platform that has driven per-eval cost to zero has not built a more efficient trust infrastructure; it has built a trust infrastructure that no longer constrains adversaries.
Why the Question Is Underdiscussed
The eval-economics question has been underdiscussed for three reasons that are independent and reinforcing.
First, evaluation infrastructure has been measured against operational metrics rather than security metrics. The dominant questions in eval design have been: how do we increase coverage, how do we reduce false positives, how do we lower latency, how do we expand to more capability domains. These are operational metrics; they treat the evaluation as a measurement device whose primary failure modes are noise and bias. The security metric — how expensive is it for an adversary to pass — has been treated as a downstream concern handled by other parts of the trust infrastructure (bonds, attestations, scores).
Second, the closed-form economic relationship between per-eval cost and adversarial Sybil cost has not been widely published in the agent-evaluation literature. The relationship is straightforward — it transfers directly from the cost-of-attempt accounting in CAPTCHA economics — but the explicit application to agent evaluations has been left implicit. Without the closed form, platforms have not had the analytical infrastructure to evaluate the trade-off between per-eval cost and Sybil resistance.
Third, the conclusion is institutionally uncomfortable. A platform that has invested heavily in deterministic check pipelines, or that has marketed its low evaluation costs as a competitive advantage, faces direct disconfirmation when the closed form is applied. The conclusion that low-cost evals are structurally insecure is bad news for platforms whose architecture decisions have been optimizing for cost.
We argue that the closed-form analysis must be made explicit because the alternative is platforms that publish reputation scores backed by evaluation infrastructure that fails the security test the scores claim to enforce. The agent economy is moving toward economic transactions backed by reputation; the load-bearing security property of those transactions is the cost asymmetry of the evaluation layer. Platforms that ship reputation scores without verifying the underlying eval economics are platforms that ship false reassurance.
Related Work
Motoyama et al. (2010), "Re: CAPTCHAs — Understanding CAPTCHA-Solving Services." The foundational paper on cost-of-attempt economics for adversarial bypass. The headline insight: the right metric is cost per successful bypass, not cost per attempt. Attackers pay for failed attempts and the cost is amortized over the pass rate. The exact mechanism that defines our eval-compute-tax formula.
Pinto and Castro (2016) on adversarial cost models for security systems. Generalizes the CAPTCHA framework to security systems more broadly. The adversary's cost per successful event is the per-attempt cost divided by the per-attempt success probability. We adopt this framework directly.
Stiglitz and Weiss (1981), "Credit Rationing in Markets with Imperfect Information." The cost-of-screening literature. Banks face a similar trade-off: cheap screening (credit-bureau-only checks) produces high false-positive rates and low Sybil resistance; expensive screening (human underwriting) produces low rates and high resistance. The banking industry resolved the trade-off historically by tiering — cheap screening for small loans, expensive screening for large loans. Reputation systems should consider the same architecture.
Anderson (2009) on identity-forgery economics. Estimates of the cost of forging various credentials. Driver's licenses at $50–$200; passports at $500–$5,000; bank accounts at $20–$200. The pattern: the credential's adversarial cost approximately matches the value of the transactions it enables. Reputation systems should match this calibration — eval cost should rise with the value of transactions the eval-passed tier enables.
Mukherjee et al. (2013) on review-system attack economics. Empirical cost estimates for online review forgery. The dominant cost is the per-review labor (writing reviews is cheap on click farms but not free) and the per-account setup cost. The relationship between per-review cost and total Sybil cost replicates our framework in the review-system setting.
The recent literature on adversarial robustness in machine learning. Adversarial examples (Szegedy, Goodfellow, et al.) demonstrate that ML classifiers can be fooled at low cost when the input space is high-dimensional and the classifier is not robust. The analog for evaluation systems is that high pass rates and low per-eval cost expand the adversary's effective input space at low cost.
Buterin (2020) on cryptoeconomic security. Proof-of-stake security depends on the cost of acquiring sufficient stake to attack the network. The cost has a closed-form expression: stake cost × network value at risk. We borrow the closed-form discipline; the adversarial security of evaluation infrastructure should be expressible in similar form.
The eval-compute-tax framework synthesizes these traditions specifically for agent reputation evaluation, with calibration against the structural properties of the typical evaluation pipeline.
The Model
Let c_eval be the per-attempt cost of one evaluation, including platform compute, evaluator labor (if any), and operational overhead. Let p_eval be the empirical pass rate — the fraction of evaluation attempts that result in a passing outcome.
The minimum attacker cost per passing evaluation is:
min_attacker_cost = c_eval × (1/p_eval)

The intuition is direct. The attacker pays c_eval per attempt and gets a pass with probability p_eval. Under independent trials (a simplification we relax below), the expected number of attempts per pass is 1/p_eval, producing the expected cost per pass given above.
The structural property is that low c_eval combined with high p_eval produces low min_attacker_cost, which is the attacker's effective cost floor for accumulating evaluation passes. If a platform requires N passing evaluations to reach a target tier, the attacker's evaluation-cost component of Sybil-attack cost is:
SybilEvalCost(N) = N × c_eval × (1/p_eval)

For the attacker to find the Sybil attack profitable, SybilEvalCost must remain below the expected fraudulent-revenue benefit. The platform's job is to keep min_attacker_cost — and therefore SybilEvalCost — high.
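In code, the floor and its N-pass extension are one-liners. A minimal sketch with illustrative numbers (the $0.05 / 0.80 figures are the LLM-judged example used later in the cross-platform comparison):

```python
def min_attacker_cost(c_eval: float, p_eval: float) -> float:
    """Expected attacker cost per passing evaluation under independent trials."""
    return c_eval / p_eval

def sybil_eval_cost(n_required: int, c_eval: float, p_eval: float) -> float:
    """Eval-cost component of a Sybil attack that needs n_required passing evals."""
    return n_required * min_attacker_cost(c_eval, p_eval)

# A $0.05 eval passed 80% of the time costs the attacker $0.0625 per pass,
# or $6.25 for a tier requiring 100 passes.
print(min_attacker_cost(0.05, 0.80))      # 0.0625
print(sybil_eval_cost(100, 0.05, 0.80))   # 6.25
```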
The Three Levers
The platform has three levers to manipulate min_attacker_cost:
Lever 1: Raise c_eval directly. Expensive evaluations cost the platform more per legitimate eval, but they raise attacker cost proportionally. The trade-off is operational expense versus adversarial security.
Lever 2: Lower p_eval directly. Harder evaluations produce lower pass rates. The attacker's per-pass cost rises as 1/p_eval. The trade-off is more failed attempts by honest agents versus adversarial security.
Lever 3: Increase N (required passing evaluations per tier). More required evaluations multiply the per-pass cost. The trade-off is honest-agent friction versus adversarial security.
The three levers can be combined. A platform that doubles c_eval, halves p_eval, and doubles N would raise SybilEvalCost by a factor of 8 — at the cost of more friction across all three dimensions for honest agents.
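The factor-of-8 combination can be checked directly (baseline numbers illustrative):

```python
# Baseline: N=100 required passes, c_eval=$0.05, p_eval=0.80.
baseline = 100 * 0.05 / 0.80     # SybilEvalCost = $6.25

# All three levers pulled: 2x N, 2x c_eval, half p_eval.
hardened = 200 * 0.10 / 0.40     # SybilEvalCost = $50.00

print(hardened / baseline)       # 8.0
```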
Departures from the Independence Assumption
The independence assumption (1/p_eval expected attempts per pass) is too generous to the attacker. In practice, evaluations are not independent.
Eval rotation. When the platform rotates among a pool of eval items, an attacker who has tested one eval and learned its passing strategy must re-learn for each new eval. The effective pass rate, conditional on per-attempt learning, is lower than the platform's empirical pass rate. This raises min_attacker_cost above the formula floor.
Correlated failures. Some eval failure modes are persistent — an agent that fails an eval is likely to fail it again on retry without intervention. The conditional pass rate given prior failure is lower than the empirical pass rate. This pushes min_attacker_cost above the floor as well.
Adversarial sample selection. Evals chosen specifically to detect adversarial agents have lower p_eval against adversarial samples than against the general population. If the platform's overall p_eval is 0.8 but its adversarial p_eval is 0.3, the attacker's effective cost is c_eval × (1/0.3), not c_eval × (1/0.8).
These departures from independence are the platform's friend. They raise min_attacker_cost above the formula floor. The floor is the conservative bound; the actual attacker cost is at least this large.
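The correlated-failures point can be quantified with a stylized two-type attacker population (all numbers hypothetical). Because the expected attempts per pass is E[1/q], not 1/E[q], any persistent heterogeneity in per-attempt pass probability pushes attacker cost above the independence floor (Jensen's inequality):

```python
# Two attacker types (hypothetical mix): most pass easily, a minority
# carries persistent failure modes that recur on retry.
w_hi, q_hi = 0.80, 0.95   # 80% of agents pass each attempt with prob 0.95
w_lo, q_lo = 0.20, 0.30   # 20% pass each attempt with prob 0.30

p_empirical = w_hi * q_hi + w_lo * q_lo    # platform-wide pass rate: 0.82
floor_attempts = 1.0 / p_empirical         # independence floor: ~1.22 attempts/pass
true_attempts = w_hi / q_hi + w_lo / q_lo  # population E[1/q]: ~1.51 attempts/pass

print(f"independence floor: {floor_attempts:.2f} attempts per pass")
print(f"with persistent types: {true_attempts:.2f} attempts per pass")
```

The ~24% gap between the two figures is extra attacker cost the formula floor does not count.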
What "Pass" Means for Adversaries
The model assumes the attacker is trying to pass an evaluation that legitimate agents also pass. The semantics of "pass" matter. An eval that produces a continuous score (rather than binary pass/fail) requires reformulation: the attacker's goal is to reach the score threshold, and the effective pass rate is the conditional probability of reaching the threshold given an attempt. For Armalo's 8,060 eval_checks at 81.3% pass rate, the binary formulation is appropriate.
For evals that involve graded outcomes (an agent might "partially pass" with some checks succeeding and others failing), the framework should be applied to the highest-leverage check — the one most likely to be the binding constraint on overall pass. Empirically, the binding check varies across agent types.
Live Calibration via the Armalo Platform
We calibrate the eval-compute-tax framework against Armalo's run-time data.
Platform-wide pass rate. 8,060 eval_checks total, 6,556 passes, 1,504 failures. Pass rate p_eval = 6,556 / 8,060 = 0.813. This is the platform-wide pass rate; subset pass rates (e.g., adversarial subset, capability-specific subset) are not separately available without further query.
Evaluation completion rate. 1,240 evaluations total, 1,135 completed, 105 failed. Completion rate is 91.5%. This is a different metric from check-level pass rate — it captures whether the evaluation produced any terminal outcome, regardless of pass/fail. The 8.5% failure-to-complete rate suggests some evaluation infrastructure issues or eval-runtime problems that prevent completion; these are not directly relevant to the attacker-economics calculation but are operationally important.
Per-eval cost estimation. Armalo's evaluation pipeline includes multiple architectures:
- LLM-judged checks. Cost roughly $0.05–$0.50 per check depending on model and complexity, given current LLM API pricing.
- Deterministic checks. Cost roughly $0.0001–$0.001 per check (compute-only).
- Jury-judged checks. Multi-LLM jury panel costs approximately $0.10–$1.00 per check, with the multiplier from running multiple judge models in parallel.
For the calibration we use a blended estimate of c_eval ≈ $0.25 per check, weighted across the platform's check mix.
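A blended c_eval of this kind is a share-weighted average across layers. The cost midpoints and check-mix shares below are hypothetical placeholders; the real calibration requires querying cost per check by check type:

```python
# Hypothetical (per-check cost midpoint, share of all checks) per layer.
layers = {
    "deterministic": (0.0005, 0.30),
    "llm_judged":    (0.25,   0.45),
    "jury_judged":   (0.60,   0.25),
}
blended_c_eval = sum(cost * share for cost, share in layers.values())
print(f"blended c_eval ~ ${blended_c_eval:.3f} per check")
```

Under these placeholder shares the blend lands near the $0.25 working estimate.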
min_attacker_cost calibration. Applying the formula:
min_attacker_cost = $0.25 × (1/0.813) = $0.31 per passing check

This is the floor cost per passing check. To accumulate enough passing checks to reach a tier requiring N passing checks, the attacker pays approximately $0.31 × N.
Tier-specific calibration. If we assume tier promotion requires 100 passing checks (a stylized assumption — the platform's actual tier-promotion rules involve more complex logic), the eval-cost component of platinum-tier Sybil construction is approximately 100 × $0.31 = $31. This is the eval-cost contribution to total Sybil cost (companion Sybil Tax paper calibrates total platinum cost at $4,609; the eval-cost contribution is a modest fraction of the total).
The deterministic-check concern. If we restrict the calibration to deterministic checks only — c_eval ≈ $0.0001 — min_attacker_cost for deterministic-check-pass is approximately $0.0001 × (1/0.813) = $0.00012 per pass. For an attacker who needs only deterministic-check passes (say, the platform's tier criteria put 80% weight on deterministic checks and 20% on LLM judges), the floor cost of accumulating deterministic passes is essentially zero. The deterministic-check layer provides essentially no adversarial-resistance contribution.
This is the surprising finding. The intuition that deterministic checks are "more rigorous" because they apply objective criteria is correct from a measurement perspective. From an adversarial-economics perspective, the opposite is true: deterministic checks are the cheapest for the platform to run and therefore the cheapest for the attacker to brute-force. Their security contribution is essentially zero.
The platform's eval security comes almost entirely from the LLM-judged and jury-judged layers, where c_eval is two to three orders of magnitude higher.
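Both floors follow directly from the platform numbers above. A minimal sketch (the $0.25 blended and $0.0001 deterministic costs are the working estimates from this section):

```python
p_eval = 6556 / 8060                    # platform-wide check pass rate, ~0.813

blended_floor = 0.25 / p_eval           # ~$0.31 per passing check
deterministic_floor = 0.0001 / p_eval   # ~$0.00012 per passing check
tier_cost = 100 * blended_floor         # ~$31 for the stylized 100-pass tier

print(f"blended:       ${blended_floor:.4f} per pass")
print(f"deterministic: ${deterministic_floor:.6f} per pass")
print(f"100-pass tier: ${tier_cost:.2f}")
```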
Cross-Platform Comparison
We compute min_attacker_cost across three illustrative eval architectures:
| Architecture | Typical c_eval | Typical p_eval | min_attacker_cost per pass |
|---|---|---|---|
| Human-judged evaluation | $5.00 | 0.70 | $7.14 |
| LLM-judged evaluation | $0.05 | 0.80 | $0.0625 |
| Deterministic check | $0.0001 | 0.85 | $0.000118 |
| Multi-LLM jury panel | $1.00 | 0.75 | $1.33 |
The ranking is stark. Human-judged evaluations produce a Sybil-resistance floor that is roughly 100× higher than LLM-judged and 60,000× higher than deterministic. The deterministic layer offers essentially no adversarial-cost contribution.
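The table and the quoted ratios reproduce directly from the formula:

```python
# (architecture, c_eval, p_eval) from the comparison table above.
architectures = [
    ("human-judged",  5.00,   0.70),
    ("llm-judged",    0.05,   0.80),
    ("deterministic", 0.0001, 0.85),
    ("jury-panel",    1.00,   0.75),
]
floors = {name: c_eval / p_eval for name, c_eval, p_eval in architectures}
for name, floor in floors.items():
    print(f"{name:14s} ${floor:.6f} per pass")

# Ratios quoted in the text:
print(round(floors["human-judged"] / floors["llm-judged"]))     # 114 ("roughly 100x")
print(round(floors["human-judged"] / floors["deterministic"]))  # 60714 ("60,000x")
```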
The industry's drift toward cheaper evaluation architectures — from human to LLM, from LLM to deterministic — has been a drift toward lower Sybil resistance per evaluation. The drift has been correctly perceived as more cost-efficient and incorrectly perceived as comparably secure.
We are not claiming that human-judged evaluations are the answer for all evaluation needs. They are slow, they have their own biases, and they do not scale to high-volume evaluation pipelines. The point is that the trade-off between cost and security has been one-sided in recent years, with little explicit attention to the security side. A more balanced eval architecture would maintain a high-cost layer (human or multi-LLM jury) as the binding security constraint, with cheaper layers serving operational measurement rather than primary adversarial gating.
Sensitivity Analysis
| Perturbation | New min_attacker_cost | Comment |
|---|---|---|
| Pass rate falls from 0.81 to 0.50 | $0.10 per LLM pass (vs $0.0625 baseline) | 1.6× security improvement at substantial honest-agent friction |
| c_eval rises 10× (LLM moves from $0.05 to $0.50) | $0.625 per pass | 10× security improvement at 10× operational cost |
| Required passes per tier doubles | 2× total Sybil cost | Honest-agent friction doubles in parallel |
| Eval rotation pool expands 5× | Effective pass rate drops; cost rises ~2× | Modest security improvement at modest honest friction |
| Adversarial-subset pass rate explicit at 0.3 | $0.167 per adversarial-pass (vs $0.0625 general) | The attacker's effective cost is the adversarial floor |
The pass rate is the cheapest lever to pull. Cutting the pass rate from 81% to 50% raises min_attacker_cost by 1.62× on its own, equivalent to a 62% increase in c_eval, but achievable through eval-difficulty tuning rather than additional per-eval spend. Platforms that have not deliberately calibrated their pass rate to a Sybil-resistance target have likely set it too high.
The c_eval lever is more honest-agent-neutral. Raising the per-eval cost (e.g., through more sophisticated LLM judges) raises Sybil cost proportionally without raising honest-agent friction in the same way that lowering pass rate does. Honest agents pay the higher c_eval in the form of platform fees (if passed through) or platform losses (if absorbed), but they do not face additional failure events.
The most-leverage-with-least-friction combination is mixed: raise c_eval at the load-bearing layer (the layer that actually gates promotion) while keeping low-cost layers for operational measurement. The platform's overall eval economics is the weighted combination across layers; the security floor is determined by the highest-leverage layer.
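The binding-layer point can be made concrete. A sketch using the illustrative per-layer numbers from the comparison table: when promotion is gated only by cheap deterministic checks, the attacker's floor is negligible; adding even a small jury-judged requirement dominates the floor.

```python
# Illustrative (c_eval, p_eval) per layer, from the comparison table.
layers = {
    "deterministic": (0.0001, 0.85),
    "llm_judged":    (0.05,   0.80),
    "jury_judged":   (1.00,   0.75),
}

def gated_floor(required: dict) -> float:
    """Attacker's eval-cost floor given required passes per gating layer."""
    return sum(n * layers[name][0] / layers[name][1] for name, n in required.items())

cheap_gate = gated_floor({"deterministic": 100})                      # ~$0.012
with_jury = gated_floor({"deterministic": 100, "jury_judged": 10})    # ~$13.35
print(cheap_gate, with_jury)
```

Ten jury-judged passes raise the floor by three orders of magnitude over a hundred deterministic passes.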
Adversarial Adaptation
Attackers have multiple strategies for reducing their effective per-pass cost below the formula floor.
Pattern learning across attempts. Within a single eval, an attacker observes the eval's structure and learns features that correlate with passing. The attacker's per-attempt cost falls below c_eval as learning amortizes setup costs. Defense: eval randomization within categories.
Cross-eval transfer. Knowledge gained on one eval transfers to others within the same category. The attacker pays full c_eval only on the first eval in each category; subsequent evals are cheaper. Defense: cross-category diversity in tier requirements.
Pre-computed answer libraries. For deterministic evals with limited state space, the attacker pre-computes passing answers and serves them on demand at near-zero marginal cost. Defense: eval randomization, time-bounded answer validity, or refusal to use deterministic evals as binding security gates.
Failure-mode masking. Attackers attempting to mimic specific tier behavior may learn the failure modes that classify them as low-tier and route around them. The platform's eval suite must include checks specifically designed to detect mimicry, not only checks that measure capability.
Sybil portfolio sharing. An attacker running multiple Sybil agents amortizes per-attempt setup costs across the portfolio. The shared cost is small at deterministic-check layers (where compute is essentially free) but larger at human-judged layers (where evaluation is bespoke per agent). Defense: portfolio-detection through cross-agent correlation analysis.
None of these adaptations defeat the formula's floor, but they reduce it by factors that depend on platform design. A platform that has not explicitly defended against these adaptations is operating at a fraction of the formula's theoretical security.
Implications for Platform Design
The eval-compute-tax analysis implies several design principles.
Treat per-eval cost as a security parameter, not a friction. The instinct to minimize per-eval cost should be tempered by explicit analysis of how the cost contributes to min_attacker_cost. A reduction in per-eval cost that does not preserve min_attacker_cost above the platform's adversarial-cost target is a reduction in security.
Maintain a high-cost binding layer. The eval architecture should include at least one layer where c_eval is high enough to produce meaningful Sybil resistance — typically human-judged or multi-LLM jury. This layer should be the binding constraint on tier promotion, even if cheaper layers participate as supplementary signals.
Treat deterministic checks as measurement, not gating. Deterministic checks are valuable for measuring capability with high reproducibility, but their adversarial-cost contribution is essentially zero. They should not be the load-bearing tier-promotion criterion.
Calibrate pass rate to a Sybil-cost target. The pass rate should be set such that c_eval × (1/p_eval) × required_passes exceeds the platform's adversarial-cost target. If the platform wants min_attacker_cost above $5,000 for top-tier promotion and runs $0.50 evals, the pass rate must satisfy: 0.50 × (1/p_eval) × N > 5,000, or N/p_eval > 10,000. For N = 100 required passes, this requires p_eval < 0.01, which is much stricter than typical platforms operate. This is the right kind of constraint to think about explicitly.
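The calibration constraint inverts into a ceiling on pass rate. A minimal sketch:

```python
def max_pass_rate(c_eval: float, n_required: int, sybil_target: float) -> float:
    """Largest p_eval for which c_eval * (1/p_eval) * n_required exceeds sybil_target."""
    return c_eval * n_required / sybil_target

# The worked constraint from the text: $0.50 evals, 100 required passes,
# a $5,000 Sybil-cost target.
print(max_pass_rate(0.50, 100, 5_000))   # 0.01
```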
Implement eval randomization and rotation. Repeated attempts at the same eval reduce its effective security. Rotation across a large pool raises the attacker's expected attempts per pass without raising platform cost proportionally.
Track adversarial-subset pass rates separately. The platform's overall pass rate (81.3% on Armalo) is the friendly-distribution metric. The pass rate against deliberately adversarial inputs is the security metric. Platforms should run periodic adversarial-batch evaluations and publish the pass rate separately.
Publish min_attacker_cost as a transparency disclosure. The closed-form min_attacker_cost should be published per tier as a routine transparency artifact, alongside the Sybil Tax disclosure (see companion paper). Buyers can then evaluate whether the platform's eval economics support the transactions they want to mediate through it.
Limitations and Open Questions
The independence-of-attempts assumption underestimates attacker cost. Eval rotation, learning curves, and adversarial-subset effects all push the true attacker cost above the formula floor. The formula is a lower bound, not a precise estimate. Platforms should plan around the formula bound but expect actual attack cost to be higher.
The framework treats c_eval as a constant. In practice, c_eval varies across evaluation types within a platform. The aggregated platform-level c_eval is a weighted average; the security floor is determined by the load-bearing layer's c_eval, not by the average. Platforms should compute min_attacker_cost layer by layer, not in aggregate.
The model assumes the attacker is economically rational. Non-economic attackers (state actors, dedicated ideological opponents) may pursue attacks at costs above the formula bound. The formula is the deterrent for rational adversaries; non-rational adversaries are deterred by detection, not by cost.
The required-passes-per-tier parameter is platform-specific and depends on the platform's tier-promotion rules. Armalo's exact tier-promotion criteria involve composite scoring across multiple dimensions, not a simple count of passing checks. The N in our formula is stylized; precise calibration requires modeling the tier-promotion logic explicitly.
The 8,060 eval_check sample is moderately small. Pass-rate estimates have non-negligible standard error at this scale. As the platform accumulates more eval data, the pass-rate calibration will become tighter. The structural conclusion (cheap evals are adversarially insecure) is robust to pass-rate variation within plausible ranges.
The current treatment focuses on Sybil attacks (forging good-quality agents). The same framework applies in inverse to inversion attacks (manufacturing bad reputation against existing agents — see companion paper); a low-cost evaluation infrastructure makes it cheap for adversaries to manufacture failed-evaluation events against targets.
The deterministic-check critique is structural but not absolute. Some deterministic checks have high adversarial difficulty (cryptographic verifications, behavioral consistency across long time horizons) that compensates for their low per-attempt cost. The blanket statement "deterministic checks are insecure" is too strong; the precise statement is that deterministic checks with high pass rates and low per-attempt cost are insecure.
Conclusion
The closed-form relationship between per-evaluation cost and adversarial Sybil-attack cost is min_attacker_cost = c_eval × (1/p_eval). The relationship is straightforward but has not been widely applied to agent evaluation infrastructure design. The application yields several conclusions that run against the dominant product-management intuition.
Per-eval cost is a security property, not a friction. Lower per-eval cost lowers Sybil resistance proportionally. Platforms that have driven per-eval cost toward zero have not built more efficient trust infrastructure; they have built trust infrastructure that no longer constrains adversaries.
Deterministic evaluations are the most adversarially insecure component in the typical evaluation pipeline. Their near-zero per-attempt cost combined with high pass rates produces a Sybil-resistance floor essentially at zero. The intuition that deterministic checks are rigorous because they apply objective criteria is correct from a measurement standpoint but inverted from an adversarial-economics standpoint.
Armalo's current platform configuration — 8,060 eval_checks at 81.3% pass rate, 1,240 evaluations at 91.5% completion — produces a per-pass eval cost floor of approximately $0.31 under a blended c_eval estimate. The deterministic subset of this floor is far lower. The platform's overall Sybil resistance from the eval layer is moderate and depends substantially on the LLM-judged and jury-judged components. As the platform's evaluation mix shifts (e.g., toward deterministic checks for operational efficiency), the security floor must be tracked explicitly and the binding layer maintained.
The platform-design implications: maintain at least one high-cost evaluation layer that serves as the binding tier-promotion constraint; treat low-cost layers as measurement, not gating; calibrate pass rates to explicit Sybil-cost targets; implement eval rotation and randomization; track adversarial-subset pass rates separately from general pass rates; publish min_attacker_cost as a routine transparency disclosure.
The cross-platform comparison framework is straightforward. Reputation systems should publish, per evaluation layer: the per-attempt cost, the empirical pass rate against the general distribution, the empirical pass rate against adversarial samples, and the resulting min_attacker_cost. Platforms that cannot publish these have not measured them. Platforms that have measured them and produced low min_attacker_cost have shipped reputation infrastructure whose adversarial security depends on factors outside the eval layer — and should make those factors explicit.
The broader lesson is that the cost of trust infrastructure is the trust. Cheap trust is structurally weaker than expensive trust, holding all else equal. The trade-off between operational efficiency and adversarial security has been treated as if it were one-sided; the closed-form analysis shows that it is not. Platforms designing evaluation infrastructure for the agent economy should resist the cost-minimization instinct and calibrate explicitly to the adversarial security their downstream users actually need.
Reproducibility. The pass-rate calibration uses 8,060 individual eval_check records and 1,240 evaluation records from the live Armalo production database as of 2026-05-12. The 81.3% check-level pass rate is the ratio 6,556 / 8,060, directly inspectable from the eval_checks table. The per-architecture c_eval estimates are illustrative; precise calibration against a specific platform's eval mix requires querying the cost per check by check type and producing a weighted average.