Eval Cost Engineering: How To Run Rigorous Evaluation Without Burning Your Budget
Five judges, one hundred cases, forty cents a judgment is two hundred dollars per evaluation. Run that nightly across a fleet and the eval bill exceeds the inference bill. Here is how to spend less without measuring less.
TL;DR
A five-judge jury at $0.40 per judgment over a 100-case eval suite costs $200 per eval run. Run that nightly across a fleet of fifty agents and the bill is roughly $300,000 a month, more than many teams spend on inference. Most teams respond by cutting eval frequency or judge count, which trades cost for blindness. The right response is cost engineering: tiered evaluation that screens cheap and deepens expensive, judge tier mixing that uses frontier models only where they pay off, batch parallelism that amortizes overhead, and explicit budget allocation that treats eval cost as an investment rather than waste. This essay lays out the patterns and ships an Eval Cost Model template you can fill in with your own numbers.
The morning the eval bill exceeded the inference bill
The morning that prompted this essay was the morning we noticed our evaluation cost line had crossed our inference cost line. Inference is the line every AI team obsesses over. It is the cost of actually serving customers: every prompt, every completion, every tool call. We watch it daily. We had spent six months optimizing it: cache hits, prompt compression, model tier selection. Inference cost per customer interaction was down 40 percent quarter over quarter.
Meanwhile, evaluation cost had been quietly climbing. We had added agents to the fleet. We had expanded eval suites. We had moved from three-judge to five-judge panels for better calibration. We had increased eval cadence from weekly to nightly because we wanted to catch regressions faster. Each change individually was justified. Stacked, they multiplied. The line item that had been a quarter of inference last summer was now larger than inference. The eval bill, the cost of measuring whether the agents were behaving correctly, was costing more than running the agents themselves.
The finance partner was polite about it. The CFO was not. The conversation went something like: are we measuring well, or are we measuring expensively? And: if we cut the eval budget in half, what specifically breaks?
Those are good questions. They forced a discipline we had been avoiding, which is treating evaluation cost as a first-class engineering problem rather than overhead that grows because the engineers pushing the eval system forward do not feel the bill.
This essay is the result. It is not a plea for cheaper evaluation; that path leads to worse measurements and shipped regressions. It is a layout of the cost levers, which ones move the needle, and what the trade-offs actually are. The goal is to spend the eval budget where it produces real signal and not waste it where it does not.
We will walk through the structure of eval cost (what dollars are buying), the four major cost-reduction techniques (tiered evaluation, judge tier mixing, batch parallelism, smart sampling), the failure modes of each (where the savings come at the cost of measurement quality), and a working Cost Model template that lets you reason about your own situation in concrete dollars rather than abstract trade-offs. By the end you will have a defensible budget per agent, per eval suite, and a clear-eyed view of where your eval dollars are actually going.
Where eval dollars actually go
Before we cut anything, we have to know what we are buying. Most teams have only a vague sense of how their eval bill breaks down. The first exercise is to itemize it.
A single evaluation event has roughly five cost components. Judge inference is the largest in most setups: the API cost of running the panel of judge LLMs against each case in the suite. Evidence preparation is the second: the cost of running the agent under test on the cases to produce the evidence the judges will read. Storage is third: the durable cost of keeping the evidence, judgments, and verdicts in your event log. Compute infrastructure is fourth: the cost of the worker pool that orchestrates the eval, runs deterministic checks, and aggregates results. Human review is fifth: the cost of the people who handle escalations and label validation sets.
For most teams, judge inference is between 60 and 85 percent of total eval cost. Evidence preparation is the next largest, between 10 and 30 percent. The other three components together usually come in under 10 percent. This is the most important fact: if you want to cut your eval bill, you cut judge inference. The other levers are real but small in comparison.
Within judge inference, the cost is the product of three factors: judges per case, cases per eval, and dollars per judgment. The first two are policy choices. The third is determined by which judge models you use, how long the evidence is, how long the judgment output is, and how the panel is configured.
Most teams I talk to have not tried to itemize. They have a monthly bill from each LLM provider and a vague attribution to which projects spent what. The first concrete action of cost engineering is to instrument the eval pipeline so every judgment carries metadata about which case, which judge model, how many input and output tokens, and what the resulting cost was. Without this instrumentation, you are flying blind. You can guess at the cost levers but you cannot measure their effect.
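As a rough illustration of that instrumentation, the sketch below attaches cost metadata to every judgment and rolls spend up by any field. The record fields, function names, and per-million-token prices are assumptions made for illustration; they are not a prescribed schema or any provider's actual pricing.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgmentRecord:
    """One row of eval-cost telemetry (field names are illustrative)."""
    suite: str
    case_id: str
    judge_model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float


def judgment_cost(input_tokens, output_tokens,
                  usd_per_m_input, usd_per_m_output):
    """Cost of one judgment from token counts and per-million-token prices."""
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000


def spend_share(records, key):
    """Fraction of total spend attributed to each value of `key`."""
    totals = defaultdict(float)
    for record in records:
        totals[getattr(record, key)] += record.cost_usd
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {k: v / grand_total for k, v in totals.items()}
```

Calling `spend_share(records, "judge_model")` or `spend_share(records, "suite")` is the kind of roll-up that shows which judges or suites dominate the bill.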
Once instrumented, the picture usually surprises. We found that two of our five judges were responsible for over half the cost because they were frontier-tier models on the longest evidence captures. We found that one of our eval suites was 40 percent of total spend despite covering a lower-stakes capability. We found that 15 percent of all judgments were on cases where the deterministic checks had already produced a clear pass or fail and the LLM judgment was redundant. None of those facts were visible before instrumentation. All of them suggested specific cost reductions.
This is the pre-work that has to happen before any of the techniques in this essay produce real savings. Skip it and you are guessing.
Tiered evaluation: cheap screen, expensive deep dive
The largest single technique in the cost engineer's toolkit is tiered evaluation. The principle is straightforward: not every case needs the full panel. Most cases are easy. Easy cases can be screened by a cheap model and passed without further evaluation if they screen clean. Only cases that fail the screen, or that fall into specific high-stakes categories, get the full expensive panel.
The tier-one screen is typically a single mid-range model that produces a fast, cheap pass/fail-with-reasoning judgment. The cost per case at this tier is something like $0.02 to $0.05, well over an order of magnitude under the full panel. The tier-one screen does not produce a final verdict; it produces a triage decision. Cases that screen as clear-pass are recorded as such. Cases that screen as clear-fail are recorded as such. Cases that screen as ambiguous escalate to tier two.
The tier-two evaluation is the full panel as you would have run it without tiering. Five judges, real cost per case, full verdict aggregation. But it now runs on only the fraction of cases that the tier-one screen flagged as ambiguous. If your tier-one screen is reasonable, that fraction is between 20 and 40 percent. The rest were resolved cheaply.
The expected cost reduction is dramatic. Every screened case pays for the screen, and escalated cases pay for the full panel on top. If 70 percent of cases screen clean at tier one and 30 percent escalate to tier two, your effective cost per case is roughly $0.04 + 0.3 × $2.00 = $0.64 instead of $2.00. That is a 68 percent reduction in evaluation cost without dropping any judges from the panel for the cases that needed them.
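The arithmetic can be sketched as a small function so you can plug in your own rates. This version charges the screen on every case, including the ones that escalate, so with the figures above ($0.04 screen, $2.00 panel, 30 percent escalation) it lands at $0.64 per case. The function names are illustrative.

```python
def tiered_cost_per_case(screen_cost, panel_cost, escalation_rate):
    """Expected cost per case when every case is screened first."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return screen_cost + escalation_rate * panel_cost


def savings_fraction(screen_cost, panel_cost, escalation_rate):
    """Share of the always-run-the-panel cost that tiering removes."""
    tiered = tiered_cost_per_case(screen_cost, panel_cost, escalation_rate)
    return 1.0 - tiered / panel_cost


def break_even_escalation_rate(screen_cost, panel_cost):
    """Escalation rate above which tiering costs more than it saves."""
    return 1.0 - screen_cost / panel_cost
```

The break-even helper is a useful sanity check: with a screen this cheap, tiering only stops paying for itself when nearly every case escalates.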
The failure mode is obvious: a sloppy tier-one screen that lets through cases the full panel would have flagged. The mitigation is calibration. The tier-one screener has to be tuned against ground truth. You run the tier-one screen and the full panel in parallel on a calibration set, measure how often the screen agrees with the panel, and tune the screen's pass/fail thresholds to minimize the false-clear rate (cases the screen passed that the panel would have failed). False-clears are the costly errors; false-escalations just send a clean case to expensive evaluation, which costs extra dollars but does not corrupt the verdict.
A workable target is a false-clear rate under 2 percent and a false-escalation rate (clean cases that the screen flagged for escalation, costing you the full panel for nothing) under 25 percent. Tighter thresholds on the screen reduce the false-clears but raise the false-escalations, which raises cost. Looser thresholds do the opposite. The right balance is empirical.
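A minimal sketch of that tuning loop follows. It assumes the screen emits a pass-confidence score between 0 and 1 and clears any case at or above a threshold, which is a simplification of the three-way triage described earlier. It also assumes specific denominators: the false-clear rate is measured over cleared cases, and the false-escalation rate over cases the panel passed. All names are hypothetical.

```python
def screen_error_rates(scores, panel_passed, threshold):
    """Return (false_clear_rate, false_escalation_rate) at one threshold.

    scores: screen pass-confidence per calibration case.
    panel_passed: full-panel verdict per case (True means pass).
    """
    pairs = list(zip(scores, panel_passed))
    cleared = [passed for score, passed in pairs if score >= threshold]
    clean_total = sum(1 for _, passed in pairs if passed)
    clean_escalated = sum(
        1 for score, passed in pairs if passed and score < threshold
    )
    false_clear = (
        sum(1 for passed in cleared if not passed) / len(cleared)
        if cleared else 0.0
    )
    false_escalation = clean_escalated / clean_total if clean_total else 0.0
    return false_clear, false_escalation


def pick_threshold(scores, panel_passed, candidates, max_false_clear=0.02):
    """Lowest (cheapest) threshold whose false-clear rate meets the target."""
    for threshold in sorted(candidates):
        fc, fe = screen_error_rates(scores, panel_passed, threshold)
        if fc <= max_false_clear:
            return threshold, fc, fe
    return None  # no candidate is safe enough; keep everything on the panel
```

Scanning from the lowest threshold upward picks the setting that clears the most cases while still meeting the false-clear target, which is the cost-minimizing choice under these assumptions.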
There is a subtlety here. Some cases should always go directly to tier two regardless of screen result. Cases involving safety-sensitive operations, financial transactions above a threshold, or capabilities the screen has not been calibrated for should bypass the screen entirely. The tier-one screen is for routine cases where its calibration is well-established. The bypass list is short but important; without it you risk running the cheap screen on cases where its calibration does not apply.
Another subtlety: if the tier-one screen escalates a case to tier two, the screen's verdict should not be discarded. It feeds into the tier-two aggregation as one more vote, with appropriate weight. The screen's judgment is signal, even if it is the cheapest signal. Throwing it away wastes information.
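Both subtleties can be captured in a small routing function. In this sketch the bypass categories, the thresholds, and the `screen_fn` callable are illustrative assumptions. The screen is never invoked for bypassed cases, and its score is returned alongside an escalation so the tier-two aggregation can still count it as a vote.

```python
from enum import Enum


class Route(Enum):
    RECORD_PASS = "record_pass"
    RECORD_FAIL = "record_fail"
    FULL_PANEL = "full_panel"


# Hypothetical category labels; use whatever taxonomy your suite defines.
BYPASS_CATEGORIES = frozenset(
    {"safety_sensitive", "high_value_financial", "uncalibrated_capability"}
)


def route_case(category, screen_fn, clear_pass=0.9, clear_fail=0.1):
    """Return (route, screen_score); screen_score is None when bypassed.

    screen_fn: zero-argument callable that runs the tier-one screen and
    returns a pass-confidence between 0 and 1.
    """
    if category in BYPASS_CATEGORIES:
        return Route.FULL_PANEL, None  # the screen is not run at all
    score = screen_fn()
    if score >= clear_pass:
        return Route.RECORD_PASS, score
    if score <= clear_fail:
        return Route.RECORD_FAIL, score
    return Route.FULL_PANEL, score  # ambiguous: escalate, keep the score
```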
With tiered evaluation in place and well-calibrated, you have your largest cost lever. Judge inference, the 60-85 percent of your eval bill, shrinks to the full-panel cost on the 20-40 percent of cases that escalate, plus the cheap screen on everything else.
Judge tier mixing within the panel
The second technique operates inside the panel itself. Not every judge in a five-judge panel needs to be a frontier-tier model. The right composition is a mix of tiers that produces good calibration at lower aggregate cost.
The naive panel composition uses five frontier-tier judges. Each judgment costs the same. Aggregate cost per case is five times the per-judge cost. The panel is well-calibrated but expensive.
The tiered panel composition uses something like one frontier judge, two mid-tier judges, and two efficient-tier judges. The aggregate cost per case is much lower. The question is whether the calibration suffers.
In our measurement, a mixed panel produces verdicts that are very close to a frontier-only panel on routine cases and somewhat noisier on hard cases. The aggregation policy compensates: weight the frontier judge more heavily in the trimmed mean, give the efficient-tier judges less weight, and use the variance in the panel as a signal for whether to escalate to a frontier-only panel for re-judgment.
The key insight is that not all judges in a panel are doing the same job. A frontier judge is contributing depth: its verdict is more reliable on the hardest cases. Mid-tier and efficient judges are contributing diversity: their verdicts add coverage of different reasoning paths and serve as cross-checks. You do not need maximum depth from every panel slot; you need enough depth to handle the hardest cases and enough diversity to catch errors any single judge would miss.
A practical composition for a five-judge panel: one frontier judge (your most expensive, most reliable model), two mid-tier judges from different providers (diversity benefit), two efficient-tier judges (cost-amortizing the panel size). The frontier judge gets the highest weight in aggregation. The efficient judges contribute diversity but their verdicts are cross-checked against the frontier judge.
The panel cost in this composition is roughly 1.0 + 0.4 + 0.4 + 0.1 + 0.1 = 2.0 in normalized units, versus 5.0 for a frontier-only panel. A 60 percent cost reduction at the panel level.
The calibration cost is the harder thing to measure. We assess it by running both compositions in parallel on a calibration set, computing per-case verdict differences, and analyzing where the mixed panel disagrees with the frontier panel. The disagreements are concentrated on hard cases. For those cases, we add a fallback rule: if the panel variance is above a threshold (judges disagree more than expected), re-judge with a frontier-only panel. This catches the calibration risk at modest extra cost on the small fraction of cases where it matters.
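A minimal sketch of the weighted aggregation and the variance fallback is shown below. To stay short it uses a plain weighted mean rather than a trimmed mean, and the weights and variance threshold are placeholder assumptions that would need tuning against a calibration set.

```python
def panel_cost(normalized_costs):
    """Total panel cost in units of one frontier judgment."""
    return sum(normalized_costs)


def aggregate_panel(scores, weights, variance_threshold=0.04):
    """Return (weighted_mean, weighted_variance, needs_frontier_rejudge).

    scores: per-judge scores between 0 and 1.
    weights: per-judge weights, with the frontier judge weighted highest.
    """
    if len(scores) != len(weights) or not scores:
        raise ValueError("scores and weights must be non-empty and aligned")
    total_weight = sum(weights)
    mean = sum(s * w for s, w in zip(scores, weights)) / total_weight
    variance = (
        sum(w * (s - mean) ** 2 for s, w in zip(scores, weights))
        / total_weight
    )
    # High disagreement is the signal to re-judge with a frontier-only panel.
    return mean, variance, variance > variance_threshold
```

For the composition above, `panel_cost([1.0, 0.4, 0.4, 0.1, 0.1])` comes to about 2.0 normalized units against 5.0 for five frontier judges.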
The net result is panel-level cost down 60 percent and verdict quality on routine cases unchanged. On hard cases, quality is maintained by the fallback to frontier-only panels. The aggregate effect is meaningful savings without measurable verdict degradation.
Batch parallelism: amortize the orchestration overhead
The third technique is operational rather than algorithmic. It is the engineering of how you actually run the panel.
The default implementation of an eval pipeline runs cases sequentially: case one through the panel, then case two, then case three. This is operationally simple but wastes time. Each case incurs orchestration overhead (request setup, response handling, evidence loading). With per-case overhead of a few hundred milliseconds and per-judge inference latency of several seconds, a fully sequential pipeline's wall-clock time is the sum of every judge's latency on every case, most of it spent waiting on one request at a time.
Batch parallelism re-arranges the work. Instead of running case one through five judges sequentially and then case two through five judges, you fire all five judges on case one in parallel, all five judges on case two in parallel, and dispatch many cases concurrently. The pipeline becomes a fan-out structure with each case dispatching a parallel set of judge requests.
The latency reduction can be 5x or more for typical case sizes. The cost reduction is more modest but real: orchestration overhead amortizes across cases. The bigger win is throughput, which lets you run larger eval suites in the same wall-clock time and frees engineers from optimizing pipeline latency.
The limiting factors are concurrency limits on the judge APIs (each provider has rate limits) and the cost of holding evidence in memory across many concurrent cases. The tuning is straightforward: dispatch as many cases concurrently as the rate limits and memory budget allow, with backoff when rate limits are hit. Most providers have generous batch concurrency that exceeds what a single eval pipeline naturally produces, so the bottleneck is usually memory or orchestration, not API limits.
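The fan-out with a concurrency cap and backoff might look like the following sketch using Python's `asyncio`. The `RateLimited` exception and the judge callables are hypothetical stand-ins for whatever your provider client raises and exposes, and the retry settings are placeholders.

```python
import asyncio
import random


class RateLimited(Exception):
    """Hypothetical error raised by a judge client on a rate-limit response."""


async def call_with_backoff(judge, case, max_retries=4, base_delay=0.5):
    """Call one judge, retrying with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return await judge(case)
        except RateLimited:
            if attempt == max_retries:
                raise
            jitter = random.uniform(0, base_delay)
            await asyncio.sleep(base_delay * 2 ** attempt + jitter)


async def run_suite(cases, judges, max_concurrent_cases=8, base_delay=0.5):
    """Fan out all judges per case, with a cap on cases in flight."""
    semaphore = asyncio.Semaphore(max_concurrent_cases)

    async def run_case(case):
        async with semaphore:  # bounds memory held by in-flight evidence
            return await asyncio.gather(
                *(call_with_backoff(j, case, base_delay=base_delay)
                  for j in judges)
            )

    # Results come back in the same order as `cases`, one list per case.
    return await asyncio.gather(*(run_case(c) for c in cases))
```

The semaphore bounds cases rather than individual requests, so peak concurrency against the providers is `max_concurrent_cases` times the panel size; size it against the tightest provider limit.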
A further refinement: if your judges support batch APIs (not all do), use them. Batch APIs amortize per-request overhead at the provider's end and often offer cost discounts. The cost reduction from batch APIs is typically 30-50 percent versus per-request pricing. The latency increases (batch jobs are not real-time), so use batch for offline eval runs and per-request for any eval that needs interactive response.
Batch parallelism is the cheapest technique on this list to implement. It is engineering work, not algorithm work. It also stacks with the other techniques: a tiered evaluation pipeline with a mixed panel can run its tier-two evaluations in parallel batches at no measurement cost. The savings compound.
Smart sampling: do not eval every interaction
The fourth technique is the most controversial and the easiest to misuse. The idea is to evaluate a sample of agent interactions rather than every interaction.
The motivation: in some applications, the agent runs hundreds of thousands of interactions per day. Evaluating every interaction is operationally impossible regardless of cost. You have to sample. The question is how.
The naive sampling pattern is uniform random: pick 1 in 100 interactions and run them through the eval. This works in the sense that it produces a sample, but it has two problems. First, it under-samples rare-but-important interaction types. If a particular tool call sequence happens 1 percent of the time but is the highest-stakes category, uniform sampling gives you very few examples of it. Second, it over-samples routine interactions that all look similar, wasting budget on cases where the eval verdict is highly predictable.
Stratified sampling fixes both problems. You partition interactions into strata by characteristics that matter (tool category, customer tier, output length, novelty score). You sample at different rates within each stratum: high rates for high-stakes or novel strata, lower rates for routine strata. The total number of cases evaluated stays within budget but the cases are distributed where you actually want signal.
A more sophisticated pattern is importance-weighted sampling. You assign each interaction an importance score based on a cheap predictive model: does this look like a case where the agent might fail, where the customer is high-value, where the output is unusual? Cases with high importance are sampled at near-100 percent rates; cases with low importance are sampled at 1-percent rates or lower. The total sample is heavily skewed toward cases that matter, with the routine cases providing baseline coverage.
The importance scorer is itself a model. It can be cheap (a small classifier or even a heuristic ruleset), but it has to be accurate enough that the high-importance bucket actually contains the cases that warrant evaluation. Calibration of the importance scorer is critical. A scorer that misses important cases makes the entire stratification useless.
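Both stratified and importance-weighted sampling reduce to sampling each interaction at a rate chosen for its bucket. The sketch below assumes each interaction is a dict with a `stratum` key already assigned by your partitioning rule or importance scorer; the key names and rates are illustrative. It records the rate on every sampled case so later scoring can account for it.

```python
import random


def stratified_sample(interactions, rates, default_rate=0.01, seed=0):
    """Sample each interaction at the rate configured for its stratum.

    interactions: iterable of dicts with a "stratum" key.
    rates: mapping from stratum name to a sampling rate between 0 and 1.
    """
    rng = random.Random(seed)  # seeded so a run can be reproduced
    sample = []
    for item in interactions:
        rate = rates.get(item["stratum"], default_rate)
        if rng.random() < rate:
            # Keep the rate with the case; it is needed to reweight scores.
            sample.append({**item, "sampling_rate": rate})
    return sample
```

A configuration such as `{"high_stakes": 1.0, "novel": 0.5, "routine": 0.01}` evaluates every high-stakes interaction while keeping thin coverage of routine ones.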
The failure mode of smart sampling is that it produces an evaluation history that is biased by construction. The agent's raw score reflects performance on the sampled cases, not on the population. If the sampling rates are recorded and each sampled case is weighted by the inverse of its sampling rate, the weighted score is a fair estimate of population performance. If the rates are unknown, or the importance scorer systematically misses a class of cases, the score is biased.
A practical guardrail: maintain a baseline uniform-random sample alongside the importance-weighted sample. The baseline tells you how the agent performs across all interactions, even routine ones. The weighted sample tells you how the agent performs on the cases you cared most about. Both go into the score, with appropriate aggregation.
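One common way to do that aggregation is inverse-probability weighting, sketched below. It is an illustration rather than a prescribed scoring method, and it assumes every sampled case carries the `sampling_rate` it was drawn at and a boolean `passed` verdict. The divergence tolerance is a placeholder.

```python
def naive_pass_rate(sample):
    """Unweighted pass rate; skewed toward over-sampled strata."""
    return sum(1 for case in sample if case["passed"]) / len(sample)


def weighted_pass_rate(sample):
    """Pass rate with each case weighted by 1 / its sampling rate."""
    weights = [1.0 / case["sampling_rate"] for case in sample]
    passed_weight = sum(
        w for w, case in zip(weights, sample) if case["passed"]
    )
    return passed_weight / sum(weights)


def diverges(weighted_estimate, baseline_estimate, tolerance=0.05):
    """True when the weighted score drifts from the uniform baseline."""
    return abs(weighted_estimate - baseline_estimate) > tolerance
```

As a worked case, take 100 high-stakes interactions sampled at 100 percent with half passing, and 900 routine interactions sampled at 10 percent with all passing. The naive rate over the 190 sampled cases is about 0.74, while the weighted rate recovers the population figure of 0.95.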
For agents in production with high interaction volumes, smart sampling is unavoidable. The technique is in how you design the strata and the importance model, not whether you sample.
Storage and infrastructure: small but real
The other cost components (storage, compute infrastructure, human review) are smaller but worth optimizing once the big levers are pulled.
Storage cost grows with the eval event log. Every evaluation event stores evidence, per-judge raw judgments, verdict aggregates, and metadata. Over years of operation, this becomes a meaningful storage bill. The optimization is tiered storage: recent events in fast hot storage, older events in cold storage with longer retrieval latency. Compression on the cold tier (evidence is highly compressible JSON) is straightforward. Most teams see 70-80 percent storage cost reduction from a hot/cold tiering policy.
Compute infrastructure for the eval pipeline is usually a small line item. The pipeline is mostly waiting on LLM responses; the worker pool needs to be large enough to fan out the requests but does not need much CPU or memory. A right-sized worker pool that scales with eval load is usually under 5 percent of total eval cost. The optimization is autoscaling: do not run a peak-load worker pool 24/7 if eval load is bursty.
Human review is the most expensive per-case but the lowest in aggregate volume in a well-tuned pipeline. The cost reduction is in efficient escalation routing (only escalate cases that actually need human review) and in good tooling for reviewers (reviewing a case with all evidence pre-loaded and a structured rubric is much faster than reviewing from scratch). Good tooling can cut human review time per case by 50 percent or more.
The Eval Cost Model template
Here is the named artifact this essay produces. The Cost Model is a working template you can fill in with your own numbers to produce a defensible budget for your eval program. It has six rows.
Row one: judge inference cost. Equals (cases per eval) × (judges per case) × (average cost per judgment). Each factor breaks down into subfactors. Cases per eval depends on the suite size and any sampling rate. Judges per case depends on panel size, escalation rate, and tier-one screen pass-through rate. Cost per judgment depends on judge model mix, input token average, and output token average.
Row two: evidence preparation cost. Equals (cases per eval) × (average evidence generation cost). Evidence generation cost is the cost of running the agent under test to produce the artifact the judges will read. For agents with expensive inference, this is non-trivial. For agents with cheap inference, this is negligible.
Row three: storage cost. Equals (events per month) × (event size) × (storage cost per gigabyte). Apply hot/cold tiering: assume 30 days hot at fast-tier price and the rest at cold-tier price.
Row four: compute infrastructure cost. Equals (worker pool size) × (instance cost per hour) × (utilization fraction) × (hours per month).
Row five: human review cost. Equals (cases requiring review per month) × (review time per case) × (reviewer cost per hour).
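Row six: aggregate cost per evaluation. Rows one and two are costs per evaluation; rows three through five are costs per month. Equals rows one and two, plus the sum of rows three through five divided by evaluations per month.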
Fill in numbers for your situation. The first time you do this, the result will be larger than you expected. That is normal. The exercise's value is not in the headline number; it is in seeing which rows are big and which are small, and in being able to predict the cost effect of any change you propose.
Use the model to answer questions: what does it cost to add a sixth judge to the panel? (Row one, recompute.) What does it cost to double the eval cadence? (The monthly total of rows one and two doubles; rows three and four scale modestly; row five depends on escalation rate.) What savings do we get from a tier-one screen? (Reduce the judges-per-case factor in row one by the screen pass-through rate, and add one screen judgment per case.) What does it cost to maintain a six-month replay-aware retention? (Multiply the stored volume in row three by six, with everything past the first 30 days at cold-tier price.)
The model is a back-of-envelope calculator. It is not a precise forecast. It is enough to make budget conversations concrete. Without it, those conversations are vibes-based and produce vibes-based decisions.
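The six rows translate directly into a small calculator. The sketch below is one possible encoding, with field names invented for illustration. It treats rows one and two as per-evaluation costs and rows three through five as monthly costs, and it assumes storage is priced per gigabyte-month with the first month hot and the remainder of the retention window cold.

```python
from dataclasses import dataclass


@dataclass
class EvalCostModel:
    cases_per_eval: int
    judges_per_case: float        # effective average after tiering
    cost_per_judgment: float
    evidence_cost_per_case: float
    evals_per_month: float
    events_per_month: float
    event_size_gb: float
    retention_months: int
    hot_usd_per_gb_month: float
    cold_usd_per_gb_month: float
    workers: int
    worker_usd_per_hour: float
    utilization: float
    hours_per_month: float
    reviews_per_month: float
    review_hours_per_case: float
    reviewer_usd_per_hour: float

    def judge_inference(self):    # row one, per evaluation
        return (self.cases_per_eval * self.judges_per_case
                * self.cost_per_judgment)

    def evidence_preparation(self):  # row two, per evaluation
        return self.cases_per_eval * self.evidence_cost_per_case

    def storage(self):            # row three, per month at steady state
        monthly_gb = self.events_per_month * self.event_size_gb
        cold_months = max(self.retention_months - 1, 0)
        return monthly_gb * (self.hot_usd_per_gb_month
                             + cold_months * self.cold_usd_per_gb_month)

    def compute(self):            # row four, per month
        return (self.workers * self.worker_usd_per_hour
                * self.utilization * self.hours_per_month)

    def human_review(self):       # row five, per month
        return (self.reviews_per_month * self.review_hours_per_case
                * self.reviewer_usd_per_hour)

    def monthly_total(self):
        per_eval = self.judge_inference() + self.evidence_preparation()
        return (per_eval * self.evals_per_month + self.storage()
                + self.compute() + self.human_review())

    def cost_per_evaluation(self):  # row six
        return self.monthly_total() / self.evals_per_month
```

Changing one field and recomputing answers the what-if questions: raising `judges_per_case` from 5 to 6 at $0.40 per judgment over 100 cases adds $40 to row one per evaluation.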
Where to invest savings: do not just bank them
The instinct after cost-cutting is to bank the savings as reduced expense. This is sometimes the right answer but often is not. The savings are an opportunity to invest in evaluation rigor that you previously could not afford.
Consider three reinvestment options.
First, raise eval cadence on critical paths. If you cut 60 percent of cost on a routine eval suite, you can afford to run a high-stakes eval suite three times a week instead of weekly. Higher cadence catches regressions faster, which means problems are caught and fixed before they harm customers, which has its own dollar value.
Second, expand the case count on existing suites. Adding cases improves coverage. If your suite is 100 cases and you can now afford 250 at the same total cost, the per-case verdict is unchanged but the agent's score is more reliable because it is computed on more evidence. Larger samples reduce score variance, which makes certifications more defensible.
Third, add new suites for capabilities you previously could not afford to evaluate. Most agents have a long tail of capabilities that go un-evaluated because the budget was concentrated on the headline use cases. Saved budget can fund evaluation of the long tail, which catches regressions in capabilities your headline suites miss.
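The variance claim behind the second option can be made concrete with the standard error of a pass rate. This sketch assumes each case yields an independent pass/fail verdict, which real suites only approximate, so treat the output as a rough guide.

```python
import math


def pass_rate_standard_error(pass_rate, n_cases):
    """Standard error of an observed pass rate over n independent cases."""
    if n_cases <= 0:
        raise ValueError("n_cases must be positive")
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_cases)
```

At an 80 percent pass rate, moving from 100 to 250 cases shrinks the standard error from about 0.040 to about 0.025, so the same score carries a noticeably tighter uncertainty band.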
The choice between banking savings and reinvesting is a strategic call that depends on the agent's stage of development. Early-stage agents need rigor expansion; the savings should be reinvested. Mature agents on stable capability surfaces can bank some of the savings. Mixed agents need a deliberate split.
The cost engineer's job is not just to cut costs. It is to ensure the budget is producing maximum signal per dollar. Cutting waste is the prerequisite. Investing the savings well is the actual job.
The counter-argument
The sharpest counter-argument is that aggressive cost engineering on evaluation produces measurement systems that look efficient but produce worse signal. Tiered evaluation with calibration drift produces silently miscalibrated verdicts. Mixed-tier panels with too few frontier judges produce verdicts that miss subtle quality issues. Smart sampling with biased importance scoring produces a score that systematically misrepresents the agent's true performance.
All three of those failure modes are real. We have seen each of them in early adopters of these techniques. The defense is not that they cannot happen; it is that they are detectable and correctable if you measure for them.
The detection mechanism is the calibration set. The same labeled validation set you built for calibration measurement (see the calibrated refusal essay) is the set you use to validate that cost engineering has not corrupted your verdicts. Every cost-engineering change should be validated against the calibration set before deploying. If the change moves the verdict distribution meaningfully on calibration cases, the change has corrupted the measurement and needs to be reverted or retuned.
This adds operational overhead. Every cost change carries a calibration check. The check is itself a cost. The overall cost reduction has to be net of the validation overhead.
In our experience, the net is positive. The calibration validation is a small fraction of the savings, and the discipline of validating every change tends to catch corruption before it reaches production. Teams that skip the validation step do produce worse signal; the counter-argument's failure modes are real for them. Teams that validate every change keep the signal intact while capturing most of the savings.
The counter-argument is right about what happens if you do this badly. It is wrong about what happens if you do it well. The discipline is the point.
What Armalo does
Armalo's evaluation infrastructure runs tiered evaluation by default. A tier-one screener using a single mid-tier judge produces a fast triage decision. Cases that screen as ambiguous escalate to tier two, which runs the full panel. The screener's calibration is validated weekly against a labeled calibration set; calibration drift triggers re-tuning before it affects production verdicts.
The full panel uses a mixed-tier composition: one frontier judge weighted highly, mid-tier judges contributing diversity, efficient judges providing additional coverage. Panel composition is per evaluation suite: high-stakes suites use heavier compositions, routine suites use lighter ones. Composition changes are validated against calibration sets before deployment.
Evaluation runs are batched. The pipeline dispatches cases concurrently within the rate limits of the judge providers. Batch APIs are used for offline overnight runs; per-request APIs are reserved for interactive evaluations triggered by buyer requests through the trust oracle.
The Eval Cost Model is run quarterly to size the budget for each agent's evaluation program. Cost projections feed into agent operator dashboards so operators can see what their evaluation is costing and what they would save by adjusting cadence or coverage.
Storage uses a hot/cold tiering policy with 30-day hot retention and longer cold retention with compression. Old evaluation events remain queryable but at higher latency.
Human review uses a structured tooling surface that pre-loads evidence and rubric. Reviewer time per case has been reduced approximately 50 percent versus the unstructured workflow.
Frequently asked questions
How big should the tier-one screen pass-through rate be? Pass-through here means the share of cases resolved at tier one without escalating. It depends on case mix. For a routine suite with mostly easy cases, 60-75 percent pass-through is typical. For a high-stakes suite where most cases warrant the full panel, 30-50 percent is more typical. Tune empirically.
Can a single judge serve as the entire panel for cost reasons? No. A single-judge "panel" loses the diversity benefit and the trim-policy protection against outlier verdicts. The minimum useful panel is three judges, ideally from different providers. Below that you are gambling on one model.
Are batch APIs always cheaper? Usually yes per-token, but they have higher latency. Use batch APIs for offline runs where latency does not matter; use per-request APIs for interactive evals.
How do I validate that smart sampling is unbiased? Maintain a uniform-random baseline sample alongside the importance-weighted sample. Compare verdicts across the two samples; if they diverge, the importance scoring is biased and needs retuning.
Should I cut a judge to save money? Only if you can demonstrate that the four-judge panel has comparable calibration to the five-judge panel on your validation set. Otherwise the savings come at the cost of measurement quality.
What is the break-even point for adding a sixth judge? The cost is the per-judge cost times case count. The benefit is the reduction in verdict variance. Compute the variance reduction empirically; if the variance is already tight enough for your decision needs, the sixth judge is overspending.
Does cost engineering apply to deterministic checks? Deterministic checks are essentially free per case (just CPU time). The cost engineering question is only for LLM-judged checks. Lean on deterministic checks where you can; their cost is negligible compared to LLM judgments.
How do I justify eval cost to a CFO who wants it cut? Frame it as detection cost versus failure cost. A regression caught in eval costs the eval budget. A regression caught in production costs customer trust, refunds, and remediation. The eval budget is insurance; the question is what level of coverage justifies the premium.
Bottom line
Evaluation is expensive because measurement is expensive. The right response is not to measure less; it is to spend the budget where it produces real signal and not waste it where it does not. Tiered evaluation cuts cost on routine cases. Mixed-tier panels capture most of the verdict quality at a fraction of the cost. Batch parallelism amortizes overhead. Smart sampling concentrates evaluation on the cases that matter. The Eval Cost Model lets you reason about the trade-offs in dollars rather than vibes. None of the techniques are exotic; what makes them work is the discipline of validating every change against a calibration set so that you do not silently corrupt your verdicts in the pursuit of savings. Spend less, measure better, and keep the bill defensible. The agents you evaluate, and the buyers reading their scores, are entitled to both.