Verification Methods Deep Dive
Deterministic, heuristic, and LLM jury — when to use each and what they cost.
Choosing the right verification method for each condition is a cost-vs-rigor tradeoff. Choose wrong and you're either paying too much for cheap signal, or using cheap methods for conditions that require judgment.
This lesson covers each method in depth, with the decision criteria that distinguish them.
The Cost Ladder
From cheapest to most expensive:
Deterministic → $0.000 per check (compute only, milliseconds)
Heuristic → $0.001 per check (compute, seconds)
LLM Jury → $0.01–$0.05 per check (API calls, minutes)
Adversarial → $0.05–$0.20 per check (multi-turn API calls, 10–20 min)
At scale (1,000 eval runs/month, with on the order of 100 checks per run), choosing jury where deterministic is sufficient costs you $1,000–$5,000/month in unnecessary eval spend.
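To make the arithmetic concrete, a tiny sketch of the cost math. The per-check figures come from the ladder above; the 100-checks-per-run figure is an illustrative assumption, not a measurement:

```typescript
// Monthly eval spend for one verification method.
// runsPerMonth × checksPerRun × costPerCheck.
function monthlyCost(
  runsPerMonth: number,
  checksPerRun: number,
  costPerCheck: number
): number {
  return runsPerMonth * checksPerRun * costPerCheck;
}

monthlyCost(1000, 100, 0.01); // jury, low end: $1,000/month
monthlyCost(1000, 100, 0);    // deterministic: $0
```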
Deterministic Checks
Deterministic checks are regex patterns, schema validators, and presence/absence assertions. They produce binary results with zero ambiguity and zero LLM cost.
When to use deterministic:
- The pass condition can be expressed as a pattern, schema, or boolean assertion
- The condition is safety-critical (100% pass threshold)
- The output format is strictly defined (JSON, XML, specific fields)
- You need fast iteration (sub-second feedback during development)
Common deterministic checks:
| Check Type | What It Tests | Implementation |
|---|---|---|
| PII detection | Credit cards, SSNs, API keys in output | Regex patterns |
| Toxicity | Harmful keyword presence | Term blocklist + regex |
| JSON validity | JSON.parse() succeeds | Try/catch parser |
| Schema compliance | Required fields present and typed | Zod/JSON Schema validator |
| Length bounds | Response within min/max word count | String split + length check |
| Refusal phrase | Agent used a decline phrase | Substring/regex match |
| URL validity | All URLs in output are well-formed | URL parser |
Deterministic example — PII check:
const PII_PATTERNS = [
  /\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b/, // credit card
  /\b\d{3}-\d{2}-\d{4}\b/,                      // SSN
  /sk-[a-zA-Z0-9]{32,}/,                        // OpenAI key pattern
  /AKIA[0-9A-Z]{16}/,                           // AWS key
];

function checkPII(agentOutput: string): { pass: boolean; match?: string } {
  for (const pattern of PII_PATTERNS) {
    const match = agentOutput.match(pattern);
    if (match) return { pass: false, match: match[0] };
  }
  return { pass: true };
}
If the output matches any pattern: fail. Zero ambiguity.
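Two more checks from the table above, sketched the same way — JSON validity and length bounds. Both are pure boolean assertions with no judgment involved:

```typescript
// JSON validity: pass iff JSON.parse succeeds.
function checkJsonValidity(output: string): { pass: boolean; error?: string } {
  try {
    JSON.parse(output);
    return { pass: true };
  } catch (e) {
    return { pass: false, error: (e as Error).message };
  }
}

// Length bounds: pass iff word count is within [minWords, maxWords].
function checkLengthBounds(
  output: string,
  minWords: number,
  maxWords: number
): boolean {
  const words = output.trim().split(/\s+/).length;
  return words >= minWords && words <= maxWords;
}
```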
Heuristic Checks
Heuristics apply lightweight analysis that doesn't require LLM calls but goes beyond simple pattern matching. They're useful for distributional properties.
When to use heuristics:
- The condition involves density or proportion (e.g., "≤ 20% hedging phrases")
- The condition involves stylistic properties across a corpus (not single-instance)
- You want a cheap proxy before running expensive jury calls
Common heuristic checks:
Hedging phrase density:
const HEDGING_PHRASES = [
  'I think', 'I believe', 'probably', 'possibly', 'might be',
  'could be', "I'm not sure", 'it seems', 'perhaps', 'generally'
];

function hedgingDensity(output: string): number {
  const words = output.split(/\s+/).length;
  const lower = output.toLowerCase();
  const hedgeCount = HEDGING_PHRASES.reduce((count, phrase) => {
    // Word boundaries so "probably" doesn't match inside longer tokens.
    const matches = lower.match(new RegExp(`\\b${phrase.toLowerCase()}\\b`, 'g'));
    return count + (matches?.length ?? 0);
  }, 0);
  return hedgeCount / words; // proportion of hedging
}
// Condition: hedgingDensity < 0.05 (less than 5% hedging for factual contexts)
Vocabulary diversity (for content agents):
function typeTokenRatio(output: string): number {
  const tokens = output.toLowerCase().split(/\s+/);
  const types = new Set(tokens);
  return types.size / tokens.length; // higher = more diverse vocabulary
}
Response length distribution (for consistency): Track word counts across 100 eval runs. Flag if P95 > 3× P50 (high variance in response length is a reliability signal).
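The P95 > 3× P50 flag can be sketched as follows. The nearest-rank percentile used here is one of several valid percentile methods; the exact choice is an implementation detail:

```typescript
// Nearest-rank percentile over a sorted ascending array.
function percentile(sorted: number[], p: number): number {
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.min(sorted.length - 1, Math.max(0, idx))];
}

// Flag high variance in response lengths across eval runs.
// `lengths` is the word count of each run's output.
function lengthVarianceFlag(lengths: number[]): boolean {
  const sorted = [...lengths].sort((a, b) => a - b);
  return percentile(sorted, 95) > 3 * percentile(sorted, 50);
}
```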
LLM Jury Evaluations
Jury evaluations use a panel of 3–7 language models to evaluate agent output against a condition that requires judgment. Each judge scores independently. The trimmed mean of scores (removing top and bottom outliers when N ≥ 5) becomes the condition score.
When to use jury:
- The condition involves semantic quality, coherence, or accuracy
- The condition requires understanding context, tone, or intent
- The pass/fail determination requires reasoning that can't be encoded as a pattern
- The output is free-form natural language where quality varies continuously
Jury setup:
A jury evaluation needs:
- System prompt: the condition being evaluated, phrased as an evaluation rubric
- The agent's input: what the user sent to the agent
- The agent's output: what the agent produced
- Optional reference output: what the ideal output looks like (if available)
- Scoring rubric: 0–100 scale with anchor descriptions
Example jury prompt for accuracy:
You are evaluating an AI agent's response for factual accuracy.
EVALUATION CONDITION:
The agent must provide accurate information about subscription billing
details. Accurate means: all stated facts are correct, no hallucinated
information, no false confidence about uncertain details.
USER INPUT:
[agent input goes here]
AGENT OUTPUT:
[agent output goes here]
REFERENCE OUTPUT (if available):
[reference goes here]
Rate the agent's response on a 0–100 scale:
- 90–100: Completely accurate, all facts correct
- 70–89: Mostly accurate, minor imprecision but no harmful errors
- 50–69: Partially accurate, some factual errors present
- 30–49: Mostly inaccurate, significant errors
- 0–29: Substantially incorrect or hallucinated
Provide your score as a JSON object: {"score": X, "reasoning": "..."}
Do not be influenced by the writing quality, only factual accuracy.
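Collecting and parsing judge responses might look like the sketch below. `JudgeFn` is a stand-in for whatever call your provider SDK actually exposes — the shape is an assumption. The important design choice: a malformed judge response is discarded rather than coerced into a score:

```typescript
// A judge takes the filled-in evaluation prompt and returns raw model text.
type JudgeFn = (prompt: string) => Promise<string>;

interface JudgeScore {
  score: number;
  reasoning: string;
}

// Run every judge on the same prompt in parallel; keep only responses
// that parse as JSON with an in-range score.
async function collectScores(
  judges: JudgeFn[],
  prompt: string
): Promise<JudgeScore[]> {
  const raw = await Promise.all(judges.map((judge) => judge(prompt)));
  const scores: JudgeScore[] = [];
  for (const text of raw) {
    try {
      const parsed = JSON.parse(text) as JudgeScore;
      if (parsed.score >= 0 && parsed.score <= 100) scores.push(parsed);
    } catch {
      // Malformed judge output: skip it, don't fabricate a score.
    }
  }
  return scores;
}
```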
Outlier trimming:
With N judges, trim as follows:
- N = 3: use all (no trimming)
- N = 5: remove highest and lowest, average remaining 3
- N = 7: remove top 2 and bottom 2, average remaining 3
This reduces sensitivity to model-specific biases and evaluation drift.
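The trimming schedule above can be sketched as:

```typescript
// Trimmed mean per the schedule: trim 0/1/2 from each end
// for panels of 3/5/7 judges.
function trimmedMean(scores: number[]): number {
  const sorted = [...scores].sort((a, b) => a - b);
  const trim = sorted.length >= 7 ? 2 : sorted.length >= 5 ? 1 : 0;
  const kept = trim > 0 ? sorted.slice(trim, -trim) : sorted;
  return kept.reduce((sum, s) => sum + s, 0) / kept.length;
}
```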
Judge selection:
Use models from different providers. A 3-model panel of Claude + GPT-4 + Gemini is better than 3 instances of the same model, because provider-specific biases cancel out.
Adversarial Evaluation
The highest-cost, highest-signal evaluation tier. The adversarial agent generates novel attack inputs — not from a fixed test set — and attempts to produce failures.
When to use adversarial:
- Safety conditions (jailbreak resistance, harmful content refusal)
- Security conditions (prompt injection, data exfiltration)
- Scope boundary conditions (convincing the agent to operate outside its scope)
Adversarial evals are typically run monthly for safety-critical conditions, not per-commit. They're expensive and generate signal that's difficult to attribute to a specific code change.
The Decision Matrix
Is the pass condition expressible as a pattern or schema?
YES → Deterministic
Is the pass condition a distributional property?
YES → Heuristic (optionally followed by jury)
Does the pass condition require semantic judgment?
YES → LLM Jury
Is the condition safety-critical with adversarial threat models?
YES → LLM Jury + Adversarial
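The matrix collapses to a small decision function. Field names here are illustrative, not a fixed schema:

```typescript
type Method = 'deterministic' | 'heuristic' | 'jury' | 'jury+adversarial';

interface Condition {
  patternExpressible: boolean; // pattern, schema, or boolean assertion
  distributional: boolean;     // density/proportion across a corpus
  adversarialThreat: boolean;  // safety-critical with adversarial threat model
}

// Walk the matrix top to bottom: cheapest sufficient method wins.
function chooseMethod(c: Condition): Method {
  if (c.patternExpressible) return 'deterministic';
  if (c.distributional) return 'heuristic';
  if (c.adversarialThreat) return 'jury+adversarial';
  return 'jury'; // semantic judgment required
}
```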
Cost-Optimized Pipeline Design
The practical approach for most pacts:
- Run deterministic checks on all test cases (fast, free)
- Filter to deterministic-passing cases only
- Run heuristics on filtered cases (fast, cheap)
- Run jury only on heuristic-passing cases where jury is specified (slow, expensive but targeted)
This means jury evaluations run on a fraction of all test cases — only those that passed the cheaper gates. You get full coverage at a fraction of the cost.
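The cascade can be sketched generically. `Check` is a per-case predicate; in practice the jury stage is async and batched, but it's shown synchronously here for brevity:

```typescript
// Each stage only sees cases that passed every previous stage.
type Check<T> = (testCase: T) => boolean;

function cascade<T>(
  cases: T[],
  stages: Check<T>[]
): { passed: T[]; counts: number[] } {
  const counts: number[] = []; // survivors after each stage
  let remaining = cases;
  for (const stage of stages) {
    remaining = remaining.filter(stage);
    counts.push(remaining.length);
  }
  return { passed: remaining, counts };
}
```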
For a pact with 50 test cases:
- All 50 get deterministic checks
- 47 pass → 47 get heuristic checks
- 44 pass → 44 get jury evaluation
- Jury cost: 44 calls × $0.03 = ~$1.32 per eval run
Contrast with running jury on all 50: $1.50 per run. The savings are small here, but at scale (hundreds of conditions, thousands of runs) they compound significantly.
In the final lesson of this course, we'll put this all together with 5 production pact templates you can copy immediately.