The L4 substrate's correctness depends on the expressiveness of its parameter-binding grammar. A pact whose binding cannot describe the constraint the operator wants to enforce is a pact that does not protect the operator from the attack surface. A binding grammar that is too narrow forces the operator to encode the constraint outside the pact, in the agent's runtime wrapper, where the constraint is not signed, not cross-org-verifiable, and not part of the agent's composite trust score. Grammar coverage is therefore not a theoretical nicety — it is the substrate's design ceiling on the residual TOCTOU gap that we derived in [the TOCTOU theorem paper](/labs/research/2026-05-13-toctou-theorem-agent-trust).
This paper measures Armalo's parameter-binding grammar coverage empirically. We construct a corpus of 500 documented agent tool-call patterns spanning five domains, annotate each pattern for grammar expressiveness, and report coverage statistics. We then categorize the failure modes — patterns the grammar cannot express — and propose three grammar extensions that close the largest gaps.
1. The current grammar
Armalo's parameter-binding grammar is defined in packages/validation/src/pacts.ts and consists of six primitive rules applied to parameters identified by a paramPath (dotted JSON-path):
| Rule | Applies to | Semantics |
|---|---|---|
| `allowList` | strings | Value must be string-equal to a member of the provided list |
| `denyList` | strings | Value must not be string-equal to any member of the provided list |
| `regex` | strings (coerced) | Value must match the provided ECMAScript regex |
| `valueRange` | numbers | Value must lie within [min, max] inclusive (both bounds optional) |
| `maxAmount` | typed numbers | Value must not exceed amount for a named currency |
| `required` | any | Value must be present (not undefined or null) |
A binding consists of a tool name and a list of rules, each targeting one paramPath. A single pact can contain multiple bindings for the same tool, and multiple pacts can each bind the same tool. The evaluator at apps/web/lib/pact-param-binding.ts applies every binding and aggregates violations.
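To make the evaluation model concrete, here is a minimal sketch of how a binding evaluator of this kind could work. It is not the code in apps/web/lib/pact-param-binding.ts: the `Rule`, `Binding`, and `Violation` shapes, the `resolvePath` and `evaluateCall` helpers, and the treatment of absent values are all assumptions made for illustration. `maxAmount` is left out because its typed-currency shape is not described in this paper.

```typescript
// Hypothetical shapes: field names mirror the rule names in the table above,
// but the real schema in packages/validation/src/pacts.ts may differ.
type Rule = {
  paramPath: string;
  allowList?: string[];
  denyList?: string[];
  regex?: string;
  valueRange?: { min?: number; max?: number };
  required?: boolean;
};
type Binding = { tool: string; rules: Rule[] };
type Violation = { tool: string; paramPath: string; rule: string };

// Resolve a dotted path such as "payment.amount" against the call parameters.
function resolvePath(params: unknown, paramPath: string): unknown {
  let current: unknown = params;
  for (const key of paramPath.split(".")) {
    if (current === null || typeof current !== "object") return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

// Apply every rule of every binding that targets the tool; collect all violations.
function evaluateCall(tool: string, params: unknown, bindings: Binding[]): Violation[] {
  const violations: Violation[] = [];
  for (const binding of bindings) {
    if (binding.tool !== tool) continue;
    for (const rule of binding.rules) {
      const value = resolvePath(params, rule.paramPath);
      const fail = (name: string) => {
        violations.push({ tool, paramPath: rule.paramPath, rule: name });
      };
      if (value === undefined || value === null) {
        if (rule.required) fail("required");
        continue; // assumption: an absent value can only violate `required`
      }
      if (rule.allowList && (typeof value !== "string" || rule.allowList.indexOf(value) === -1)) {
        fail("allowList");
      }
      if (rule.denyList && typeof value === "string" && rule.denyList.indexOf(value) !== -1) {
        fail("denyList");
      }
      if (rule.regex && !new RegExp(rule.regex).test(String(value))) fail("regex");
      if (rule.valueRange) {
        const { min, max } = rule.valueRange;
        const inRange =
          typeof value === "number" &&
          (min === undefined || value >= min) &&
          (max === undefined || value <= max);
        if (!inRange) fail("valueRange");
      }
    }
  }
  return violations;
}

// Illustrative 40-hex-digit address: new Array(41).join("1") yields forty "1" characters.
const TREASURY_ADDRESS = "0x" + new Array(41).join("1");

// An "EVM address on the treasury allow-list" pattern as regex + allowList + required.
const treasuryBinding: Binding = {
  tool: "transfer_funds",
  rules: [{
    paramPath: "destination",
    required: true,
    regex: "^0x[0-9a-fA-F]{40}$",
    allowList: [TREASURY_ADDRESS],
  }],
};
```

Calling `evaluateCall("transfer_funds", { destination: TREASURY_ADDRESS }, [treasuryBinding])` returns an empty list in this sketch, while a well-formed address outside the allow-list yields a single `allowList` violation. Each rule maps to one deterministic check, which is the auditability property the small grammar is designed around.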
The grammar is intentionally small. The design rationale is that a small grammar audits cleanly (every rule maps to a single deterministic check) and composes well (rules accumulate across bindings and pacts without ambiguity). The empirical question is whether a small grammar is *enough*: whether the 500 documented tool-call patterns in our corpus can be expressed using only these six rules.
2. Corpus construction
We assembled a corpus of 500 agent tool-call patterns drawn from five domains, with 100 patterns per domain:
- Treasury / finance. Patterns observed in autonomous payment, settlement, escrow, and money-movement agents.
- Customer support. Patterns observed in refund, ticket creation, knowledge base modification, and chat-bot escalation agents.
- Code execution. Patterns observed in sandboxed code-run, repository write, deployment, and migration agents.
- Knowledge publishing. Patterns observed in blog publishing, content moderation, documentation update, and email-newsletter agents.
- Healthcare PHI. Patterns observed in EHR update, prescription, lab-result entry, and clinical-note writing agents.
Each pattern is a tuple (tool, parameter, allowed-shape constraint). Examples:
- ("transfer_funds", "destination", "EVM address matching corporate treasury allow-list")
- ("update_patient_record", "icd10_code", "valid ICD-10 code structure")
- ("send_refund", "memo", "no PII in free text")
- ("publish_post", "category", "one of five enumerated categories")
- ("run_code", "language", "one of python|node|typescript")
- ("submit_payment", "amount", "if currency=USD then ≤ 1000; if currency=BTC then ≤ 0.05")
- ("update_patient_record", "notes_free_text", "no semantic disclosure of diagnosis")
- ("transfer_funds", "amount", "cumulative ≤ 10000 USDC per agent per 24h")
The patterns were sourced from (a) Armalo's own production pact authorship history across customer engagements, (b) public threat models and audit reports published by enterprise AI teams in 2025–2026, (c) tool descriptions in major agent runtimes (Anthropic Computer Use, OpenAI Assistants, Google Vertex AI Agents) with constraints inferred from documentation, and (d) regulatory guidance (EU AI Act Articles 12–13 obligations, HIPAA technical safeguards, PCI-DSS for payment-routing agents). The 100-per-domain quota was filled from the largest available source first, with sources rotated to avoid single-source bias.
Each pattern was independently annotated by two reviewers for grammar expressiveness, with disagreements arbitrated by a third reviewer. Inter-rater agreement before arbitration was 89.4%; disagreements concentrated in the "partial" category, where one reviewer judged the grammar adequate-with-effort and the other judged it incomplete.
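As a rough illustration of the agreement figure, the sketch below computes raw percent agreement between two annotators' labels. It assumes the reported 89.4% is raw agreement rather than a chance-corrected statistic such as Cohen's kappa, which the text does not specify; the function name `percentAgreement` is hypothetical.

```typescript
// Raw percent agreement between two annotators, rounded to one decimal place.
// Assumption: labels are compared position by position, one entry per pattern.
function percentAgreement(a: string[], b: string[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("label arrays must be non-empty and the same length");
  }
  let matches = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] === b[i]) matches++;
  }
  return Math.round((matches / a.length) * 1000) / 10;
}
```

Under that assumption, 89.4% on a 500-pattern corpus corresponds to the two reviewers agreeing on 447 patterns and sending 53 to arbitration.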
3. Annotation rubric
Each pattern was annotated as one of three categories:
Fully expressible. The constraint can be encoded directly in the parameter-binding grammar using one or more rules, without auxiliary computation. Example: an EVM address constraint encoded as a regex rule plus an allowList rule covers the pattern fully.
Partially expressible. The constraint can be encoded with significant loss of precision. The grammar catches some violations but misses others. Example: a "no PII in free text" constraint encoded as a regex denying SSN-shaped tokens catches obvious cases but misses semantically expressed disclosures.
Not expressible. The constraint cannot be encoded in the current grammar without auxiliary computation outside the pact. Example: a "cumulative amount ≤ X per 24 hours" constraint cannot be expressed by any of the six primitive rules because they all operate on a single call's parameter in isolation.
The rubric does not penalize constraints that are *fully expressible but require operator effort*: the grammar is judged by what it *can* express, not by what the operator has actually authored.
4. Coverage results
4.1 Overall
| Coverage class | Count | % |
|---|---|---|
| Fully expressible | 357 | 71.4% |
| Partially expressible | 91 | 18.2% |
| Not expressible | 52 | 10.4% |
| Total | 500 | 100% |
4.2 By domain
| Domain | Fully | Partially | Not | Full+Partial |
|---|---|---|---|---|
| Treasury / finance | 78% | 17% | 5% | 95% |
| Customer support | 74% | 19% | 7% | 93% |
| Code execution | 69% | 18% | 13% | 87% |
| Knowledge publishing | 65% | 21% | 14% | 86% |
| Healthcare PHI | 71% | 16% | 13% | 87% |
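The per-domain figures can be cross-checked against the overall table. The sketch below is only a consistency check: because each domain contributes exactly 100 patterns, each percentage is also a raw count, and summing the columns should reproduce the overall totals. The type and function names are illustrative.

```typescript
// Per-domain counts taken from the table above (100 patterns per domain,
// so each percentage doubles as a count).
type DomainCounts = { fully: number; partially: number; not: number };

const byDomain: Record<string, DomainCounts> = {
  "Treasury / finance": { fully: 78, partially: 17, not: 5 },
  "Customer support": { fully: 74, partially: 19, not: 7 },
  "Code execution": { fully: 69, partially: 18, not: 13 },
  "Knowledge publishing": { fully: 65, partially: 21, not: 14 },
  "Healthcare PHI": { fully: 71, partially: 16, not: 13 },
};

// Sum each coverage class across all domains.
function aggregate(domains: Record<string, DomainCounts>): DomainCounts {
  const total: DomainCounts = { fully: 0, partially: 0, not: 0 };
  for (const name of Object.keys(domains)) {
    total.fully += domains[name].fully;
    total.partially += domains[name].partially;
    total.not += domains[name].not;
  }
  return total;
}

const overall = aggregate(byDomain);
const corpusSize = overall.fully + overall.partially + overall.not;

// Percentage of the corpus, rounded to one decimal place.
function pct(count: number): number {
  return Math.round((count / corpusSize) * 1000) / 10;
}
```

The column sums come to 357, 91, and 52 patterns out of 500, which matches the 71.4%, 18.2%, and 10.4% reported in Section 4.1.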
Treasury / finance has the highest full-coverage rate because its constraints are dominated by structural patterns (address regex, currency allow-list, amount range) that the primitive rules express well. Knowledge publishing has the lowest because its constraints are concentrated in semantic restrictions on free text, which the primitive grammar handles only by regex approximation.
4.3 By rule
When a pattern was fully or partially expressible, we recorded which rules were used. Multiple rules per pattern are allowed (a constraint typically uses two or three rules together).
| Rule | Used in N patterns | % of fully+partially expressed |
|---|---|---|
| `regex` | 287 | 64% |
| `allowList` | 241 | 54% |
| `required` | 198 | 44% |
| `valueRange` | 132 | 29% |
| `denyList` | 79 | 18% |
| `maxAmount` | | |
The grammar's weight rests on regex and allowList, which together cover the majority of structural constraints. denyList and maxAmount are valuable but used less frequently — denyList for known-bad token lists in code execution and content moderation, maxAmount for typed monetary caps. valueRange is essential for numeric constraints not denominated in currency.
5. Failure mode analysis
The 52 patterns that are not expressible in the current grammar cluster into three classes, plus a small miscellaneous remainder.
5.1 Cross-parameter dependencies (38% of failures)
The pattern's constraint depends on the relationship between two or more parameters of the same tool call. Examples:
- *If `currency = USD`, then `amount ≤ 1000`; if `currency = BTC`, then `amount ≤ 0.05`.* Two parameters, one conditional constraint.
- *`destination_jurisdiction` must equal `agent_authorized_jurisdiction`.* Equality across two parameters.
- *`scheduled_at` must be > 24 hours after `created_at`.* Temporal relationship between two parameters.
Workarounds exist (one binding per branch of the conditional, e.g., one binding for USD with cap $1000 and another for BTC with cap 0.05), but they are awkward and explode combinatorially for tools with many parameters. The grammar should express this directly.
5.2 Semantic free-text constraints (31% of failures)
The pattern's constraint is on the *meaning* of a string parameter, not its surface shape. Examples:
- *`notes_free_text` must not disclose the patient's diagnosis to unauthorized parties.* The disclosure is semantic; regex cannot catch a paraphrase.
- *`refund_reason` must not be misleading to the customer.* The constraint is editorial.
- *`post_body` must not promote a competitor's product.* The constraint requires named-entity recognition.
These constraints are well-handled by jury-type pact conditions, which are already part of Armalo's pact framework but live outside the parameter-binding grammar. The empirical observation is that operators frequently want jury-style semantic constraints expressed *as part of the parameter binding*, not as a separate condition, because the natural authoring locus is the parameter being constrained.
5.3 Cross-call aggregate constraints (27% of failures)
The pattern's constraint depends on the agent's history of recent calls, not on the single call being evaluated. Examples:
- *Cumulative `amount` across all `transfer_funds` calls in the last 24h must not exceed $10,000 USDC.* Window aggregate.
- *Sequential `update_patient_record` calls for the same `patient_token` must not occur within 60 seconds.* Rate constraint.
- *`run_code` calls with `network_egress_allowed = true` must not exceed 5 per agent per day.* Quota on a parameter value.
Cross-call aggregates are the smallest of the three named classes by count, but they are the class the current grammar is least able to approximate. The grammar's design intent, evaluation of each call in isolation, makes these constraints structurally inexpressible. The current workaround is an auxiliary evaluator that computes the aggregate outside the pact, but the aggregate's verdict is then not part of the pact's signed contract, which weakens the cross-org trust property.
5.4 Remaining (4% of failures)
The remaining ~2 patterns of the 52 are miscellaneous and idiosyncratic — constraints on parameters whose values are themselves complex objects (nested structures), constraints that require external real-time data (current exchange rate, oracle price), or constraints that are fundamentally probabilistic. These are unlikely candidates for grammar extension and are better handled by composing the pact with external systems.
6. Proposed grammar extensions
6.1 Conditional rules
Add a condition field to the paramBindingRuleSchema that gates the rule's application on the value of another parameter. Example:
```typescript
{
  paramPath: 'amount',
  condition: { paramPath: 'currency', value: 'USD' },
  valueRange: { min: 0, max: 1000 },
  required: true,
}
```

The evaluator interprets this as: "if currency = USD, then apply this rule; otherwise skip." Multiple conditional rules on the same paramPath express the case analysis. The grammar remains declarative; the conditional is a guard, not an imperative branch.
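The guard semantics described above can be sketched as follows. This is an illustration of the proposal, not shipped behavior: `ConditionalRangeRule` and `checkConditionalRanges` are hypothetical names, only the `valueRange` rule is modeled, and parameters are looked up by top-level key rather than by dotted path to keep the example short.

```typescript
// Hypothetical shapes for the proposed `condition` guard; nothing here is shipped API.
type ConditionalRangeRule = {
  paramPath: string;
  condition?: { paramPath: string; value: string };
  valueRange: { min?: number; max?: number };
};

function checkConditionalRanges(
  rules: ConditionalRangeRule[],
  params: Record<string, unknown>
): string[] {
  const violations: string[] = [];
  for (const rule of rules) {
    const guard = rule.condition;
    // Guard, not branch: a rule whose condition does not hold is simply skipped.
    if (guard && params[guard.paramPath] !== guard.value) continue;
    const value = params[rule.paramPath];
    const { min, max } = rule.valueRange;
    const inRange =
      typeof value === "number" &&
      (min === undefined || value >= min) &&
      (max === undefined || value <= max);
    if (!inRange) {
      const when = guard ? " when " + guard.paramPath + " = " + guard.value : "";
      violations.push(rule.paramPath + " out of range" + when);
    }
  }
  return violations;
}

// The USD/BTC case analysis as two guarded rules on the same paramPath.
const amountRules: ConditionalRangeRule[] = [
  { paramPath: "amount", condition: { paramPath: "currency", value: "USD" }, valueRange: { min: 0, max: 1000 } },
  { paramPath: "amount", condition: { paramPath: "currency", value: "BTC" }, valueRange: { min: 0, max: 0.05 } },
];
```

One consequence of the skip semantics in this sketch is that a call whose currency matches no guard passes every conditional rule. An operator who wants unlisted currencies rejected would still need an allowList rule on the currency parameter.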
Impact: addresses the 38% of failures in the cross-parameter dependency class. Estimated coverage gain: +3.9 percentage points.
6.2 Jury-typed rules
Add an optional juryEvaluation field to the rule that delegates the rule's verdict to a jury-type condition embedded in the rule. Example:
```typescript
{
  paramPath: 'refund_reason',
  juryEvaluation: {
    criteria: ['truthful', 'not misleading to customer', 'within company values'],
    scoringGuide: 'A truthful refund reason describes what actually went wrong without exaggeration or omission. Misleading reasons include claims unsupported by the underlying transaction record.',
    successThreshold: 0.7,
  },
  required: true,
}
```

The jury verdict is computed asynchronously (jury responses take seconds, not microseconds), and the parameter binding records the verdict on the telemetry event when it lands. The deterministic rules (allowList, regex, etc.) still apply synchronously; the jury rule complements them.
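A minimal sketch of the two-phase flow described above, under stated assumptions: the `TelemetryEvent` shape, the `openEvent` and `recordJuryScore` helpers, the score range of 0 to 1, and the rule that a score at or above `successThreshold` passes are all hypothetical. The jury call itself is not modeled; only the bookkeeping around it is.

```typescript
// Hypothetical telemetry record; field names are illustrative only.
type JuryEvaluation = { criteria: string[]; scoringGuide: string; successThreshold: number };
type TelemetryEvent = {
  paramPath: string;
  deterministicPass: boolean; // known synchronously
  juryVerdict: "pending" | "pass" | "fail"; // filled in when the jury responds
  juryScore?: number;
};

// Phase 1 (synchronous): run the deterministic rules and open the event.
// Only `required` is modeled here, matching the example rule above.
function openEvent(paramPath: string, value: unknown): TelemetryEvent {
  const present = value !== undefined && value !== null;
  return { paramPath, deterministicPass: present, juryVerdict: "pending" };
}

// Phase 2 (later): record the jury score on the event once it lands.
// Assumption: scores lie in [0, 1] and pass at or above the threshold.
function recordJuryScore(event: TelemetryEvent, jury: JuryEvaluation, score: number): TelemetryEvent {
  const verdict = score >= jury.successThreshold ? "pass" : "fail";
  return { ...event, juryScore: score, juryVerdict: verdict };
}
```

Keeping the deterministic result and the jury verdict as separate fields means a consumer of the event can act on the synchronous result immediately and treat `pending` as an explicit state rather than an absence.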
Impact: addresses the 31% of failures in the semantic free-text class. Estimated coverage gain: +3.2 percentage points.
6.3 Window-aggregate rules
Add a windowAggregate rule kind that constrains a derived aggregate over the agent's recent behavioral record. Example:
```typescript
{
  paramPath: 'amount',
  windowAggregate: {
    operator: 'sum',
    windowMs: 86400000, // 24 hours
    groupByPath: 'currency', // aggregate per currency
    maxValue: 10000,
  },
  required: true,
}
```

The evaluator computes the aggregate at evaluation time by querying the agent's recent tool_call events from the room ledger, summing the amount values grouped by currency, and rejecting calls that would push the aggregate over the cap. The aggregate query is bounded by the window and is a small, indexed read.
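The aggregate check can be sketched as below. The room-ledger schema is not described in this paper, so `ToolCallEvent` is a hypothetical in-memory stand-in for the ledger query, `wouldExceedCap` is an illustrative name, and the window is assumed to be half-open (events exactly `windowMs` old are excluded). Parameters are read by top-level key for brevity.

```typescript
// Hypothetical shapes: the room-ledger event schema is not described in the text.
type ToolCallEvent = { timestampMs: number; params: Record<string, unknown> };
type WindowAggregateRule = {
  paramPath: string;
  windowAggregate: { operator: "sum"; windowMs: number; groupByPath: string; maxValue: number };
};

// Returns true when accepting `candidate` would push the windowed sum over the cap.
function wouldExceedCap(
  rule: WindowAggregateRule,
  history: ToolCallEvent[],
  candidate: ToolCallEvent
): boolean {
  const { windowMs, groupByPath, maxValue } = rule.windowAggregate;
  const windowStart = candidate.timestampMs - windowMs;
  const group = candidate.params[groupByPath];
  let total = 0;
  // Include the candidate itself so the check is on the post-call aggregate.
  for (const event of history.concat([candidate])) {
    const inWindow = event.timestampMs > windowStart && event.timestampMs <= candidate.timestampMs;
    const value = event.params[rule.paramPath];
    if (inWindow && event.params[groupByPath] === group && typeof value === "number") {
      total += value;
    }
  }
  return total > maxValue;
}
```

With 9,000 USDC already transferred inside the window, a further 1,000 USDC call is allowed under this sketch (the sum equals the cap) and a 1,001 USDC call is rejected. Earlier calls in other currencies or outside the 24-hour window do not count toward the sum.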
Impact: addresses the 27% of failures in the cross-call aggregate class. Estimated coverage gain: +2.8 percentage points.
6.4 Estimated combined coverage
If all three extensions ship, projected coverage on the same 500-pattern corpus:
| Coverage class | Current | With extensions |
|---|---|---|
| Fully expressible | 71.4% | 96.2% |
| Partially expressible | 18.2% | 2.8% |
| Not expressible | 10.4% | 1.0% |
The residual 1.0% includes the idiosyncratic patterns from Section 5.4, which are unlikely candidates for grammar extension and which we accept as out of scope for the parameter-binding primitive.
7. Authorship burden and the small-grammar trade-off
A common objection to grammar extension is that a larger grammar increases pact authorship cost. We disagree, qualitatively: the extensions above add expressiveness without adding mandatory complexity. Operators who do not need conditional rules continue to write the simple grammar; operators who need them have the option without breaking backwards compatibility.
We can also estimate the authorship burden on the same 500-pattern corpus:
- Current grammar pact authorship effort: ~12 lines of pact JSON per pattern on average.
- With proposed extensions: ~14 lines per pattern on average (a 17% increase, attributable mostly to the `condition` field on conditional rules).
The increase is small. A 17% increase in authorship cost to close roughly 9.4 of the 10.4 percentage points of inexpressible patterns is a favorable trade.
The deeper objection is that a larger grammar is harder to audit. We mitigate this by keeping each extension declarative: conditional rules are a guard, not a branch; jury rules are a delegation, not a computation; window-aggregate rules are a typed query, not a free-form lookback. The audit story for each extension is: identify the rule kind, identify the rule's parameters, apply the canonical interpretation. The grammar grows in surface area but not in semantic complexity.
8. Limitations of the study
Corpus selection bias. The 500 patterns are drawn from Armalo's authorship history, public threat models, and regulatory guidance. They over-represent finance-and-healthcare and under-represent recreational or hobbyist agent domains. The coverage statistics for those under-represented domains may differ.
Annotation subjectivity. Inter-rater agreement was 89.4% before arbitration. Some patterns are ambiguous between "fully expressible with care" and "partially expressible." We were conservative in marking partial coverage; a more permissive annotation would push full-coverage up by 2–4 percentage points.
Static analysis only. The study evaluates grammar expressiveness, not the empirical hit rate of expressed bindings against real attacks. Coverage is a necessary but not sufficient condition for substrate effectiveness; the empirical hit rate is the subject of a separate study.
Proposed extensions are not yet shipped. The 96.2% projected coverage is an analytical estimate, not an experimental result. Implementation will likely reveal edge cases not visible in the static analysis, so the estimate is probably optimistic on this dimension.
9. Implications
The empirical result has two architectural implications.
First, the current grammar is good enough for most production patterns. 71.4% full coverage and 18.2% partial coverage means the grammar catches some part of the constraint for 89.6% of patterns. Operators who adopt the substrate today do not need to wait for grammar extensions; the substrate already moves them substantially toward their desired security posture. The grammar's smallness is a feature, not a defect, at the current scale.
Second, the next generation of the grammar should target the three identified failure classes. Conditional rules, jury-typed rules, and window-aggregate rules collectively address 96% of the failure cases. The implementation order, ranked by impact per implementation effort, is: window-aggregate (highest impact, moderate effort), conditional (moderate impact, low effort), jury-typed (moderate impact, high effort due to the asynchronous jury infrastructure required).
The roadmap implied by the study is therefore: ship the window-aggregate extension first (closes the financial fragmentation attack pattern, the rate-constraint pattern, and the quota-per-parameter pattern), then ship conditional rules (closes the cross-parameter conditional pattern), then ship jury-typed rules (closes the semantic free-text pattern, requires the most infrastructure work).
10. Replication
The 500-pattern corpus and the per-pattern annotations are available on request to Armalo Labs. Researchers wishing to extend the study to additional domains can adopt the same rubric and contribute additional patterns. The annotation tooling is small (a CSV with pattern_id, domain, tool, parameter, constraint_description, coverage_class, rules_used); the major effort is corpus assembly.
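For researchers extending the corpus, the tallies in Section 4 can be reproduced from the annotation file with a few lines of code. The sketch below assumes the CSV has already been parsed into row objects by a standards-compliant CSV parser (the `constraint_description` column may contain commas, so naive splitting is unsafe), and that `rules_used` is a semicolon-separated list. Both the label values and that separator are assumptions; the paper specifies only the column names.

```typescript
// One row per pattern, using the column names listed above.
type AnnotationRow = {
  pattern_id: string;
  domain: string;
  tool: string;
  parameter: string;
  constraint_description: string;
  coverage_class: string; // label values such as "full" / "partial" / "none" are assumed
  rules_used: string; // assumed semicolon-separated, e.g. "regex;allowList"
};

// Count rows per distinct value of a column (e.g. per coverage class or per domain).
function tally(rows: AnnotationRow[], key: "domain" | "coverage_class"): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const row of rows) {
    counts[row[key]] = (counts[row[key]] || 0) + 1;
  }
  return counts;
}

// Count how many patterns use each rule; one pattern may use several rules.
function ruleUsage(rows: AnnotationRow[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const row of rows) {
    for (const part of row.rules_used.split(";")) {
      const name = part.trim();
      if (name) counts[name] = (counts[name] || 0) + 1;
    }
  }
  return counts;
}
```

Running `tally(rows, "coverage_class")` gives the Section 4.1 counts, `tally(rows, "domain")` confirms the per-domain quota, and `ruleUsage(rows)` gives the per-rule counts of Section 4.3.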
The current grammar is documented and implemented in packages/validation/src/pacts.ts; the evaluator is in apps/web/lib/pact-param-binding.ts. Researchers can author their own pacts against Armalo's production pact API and validate them against test tool calls via POST /api/v1/pacts/{pactId}/validate-call.
References
- Armalo Labs Research Team. *The L4 Layer: Cross-Org Behavioral Trust for AI Agents.* 2026-05-12.
- Armalo Labs Research Team. *The TOCTOU Theorem for Agent Trust.* 2026-05-13.
- Armalo Labs Research Team. *The Trust Oracle as a Cross-Org Consensus Primitive.* 2026-05-13.
- OWASP. *Top 10 Risks for LLM Applications.* 2024.
- EU. *Artificial Intelligence Act, Articles 12–13 (logging and transparency obligations for high-risk AI systems).* 2024.