The L4 substrate's correctness depends on the expressiveness of its parameter-binding grammar. A pact whose binding cannot describe the constraint the operator wants to enforce is a pact that does not protect the operator from the attack surface. A binding grammar that is too narrow forces the operator to encode the constraint outside the pact, in the agent's runtime wrapper, where the constraint is not signed, not cross-org-verifiable, and not part of the agent's composite trust score. Grammar coverage is therefore not a theoretical nicety: it is the substrate's design ceiling on the residual TOCTOU gap that we derived in [the TOCTOU theorem paper](/labs/research/2026-05-13-toctou-theorem-agent-trust).
This paper measures Armalo's parameter-binding grammar coverage empirically. We construct a corpus of 60 documented agent tool-call patterns spanning five domains, annotate each pattern for grammar expressiveness, and report coverage statistics. We then categorize the failure modes (patterns the grammar cannot express) and propose three grammar extensions that close the largest gaps.
1. The current grammar
Armalo's parameter-binding grammar is defined in packages/validation/src/pacts.ts and consists of six primitive rules applied to parameters identified by a paramPath (dotted JSON-path):
| Rule | Applies to | Semantics |
|---|---|---|
| `allowList` | strings | Value must be string-equal to a member of the provided list |
| `denyList` | strings | Value must not be string-equal to any member of the provided list |
| `regex` | strings (coerced) | Value must match the provided ECMAScript regex |
| `valueRange` | numbers | Value must lie within [min, max] inclusive (both bounds optional) |
| `maxAmount` | typed numbers | Value must not exceed amount for a named currency |
| `required` | any | Value must be present (not undefined or null) |
A binding consists of a tool name and a list of rules, each targeting one paramPath. Multiple bindings within one pact can target the same tool, and bindings from different pacts can target the same tool as well. The evaluator at apps/web/lib/pact-param-binding.ts applies every binding and aggregates violations.
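The evaluation model described above (resolve each paramPath, run every rule, aggregate violations) can be sketched as follows. This is a minimal illustration, not the code in apps/web/lib/pact-param-binding.ts: the `Rule`, `Binding`, and `Violation` shapes, the `resolvePath` helper, and the treasury address are all hypothetical, and `maxAmount` is left out because the text does not say how the call's currency is determined.

```typescript
// Hypothetical shapes for illustration; the real schema in
// packages/validation/src/pacts.ts may differ.
type Rule = {
  paramPath: string;
  allowList?: string[];
  denyList?: string[];
  regex?: string;
  valueRange?: { min?: number; max?: number };
  required?: boolean;
};
type Binding = { tool: string; rules: Rule[] };
type Violation = { tool: string; paramPath: string; rule: string };

// Walk a dotted path such as "transfer.destination" through the call parameters.
function resolvePath(params: unknown, path: string): unknown {
  let current: unknown = params;
  for (const key of path.split(".")) {
    if (current === null || typeof current !== "object") return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

// Apply every binding for the tool and collect all violations (no short-circuit).
function evaluateBindings(tool: string, params: unknown, bindings: Binding[]): Violation[] {
  const violations: Violation[] = [];
  for (const binding of bindings) {
    if (binding.tool !== tool) continue;
    for (const rule of binding.rules) {
      const value = resolvePath(params, rule.paramPath);
      const fail = (name: string): void => {
        violations.push({ tool, paramPath: rule.paramPath, rule: name });
      };
      if (value === undefined || value === null) {
        if (rule.required) fail("required");
        continue; // assumption: the other rules do not fire on absent values
      }
      if (rule.allowList && (typeof value !== "string" || rule.allowList.indexOf(value) === -1)) fail("allowList");
      if (rule.denyList && typeof value === "string" && rule.denyList.indexOf(value) !== -1) fail("denyList");
      if (rule.regex && !new RegExp(rule.regex).test(String(value))) fail("regex");
      if (rule.valueRange) {
        const { min, max } = rule.valueRange;
        const outOfRange = typeof value !== "number" ||
          (min !== undefined && value < min) || (max !== undefined && value > max);
        if (outOfRange) fail("valueRange");
      }
    }
  }
  return violations;
}

// Placeholder 40-hex-digit address, not a real treasury address.
const TREASURY = "0x" + "1111111111" + "1111111111" + "1111111111" + "1111111111";
const treasuryBindings: Binding[] = [{
  tool: "transfer_funds",
  rules: [
    { paramPath: "destination", required: true, regex: "^0x[0-9a-fA-F]{40}$" },
    { paramPath: "destination", allowList: [TREASURY] },
  ],
}];
```

Calling `evaluateBindings("transfer_funds", { destination: "0xdeadbeef" }, treasuryBindings)` reports both a `regex` and an `allowList` violation, which shows the aggregate-everything behavior: a failing rule does not stop later rules from being checked.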
The grammar is intentionally small. The design rationale is that a small grammar audits cleanly (every rule maps to a single deterministic check) and composes well (rules accumulate across bindings and pacts without ambiguity). The empirical question is whether a small grammar is *enough*: whether the 60 patterns we collected from production and public agent tool-call sources can be expressed using only these six rules.
2. Corpus construction
The corpus is committed at [the published measurement artifact](https://github.com/fongryan/armalo/blob/main/apps/web/content/research/data/grammar-coverage-corpus.json) and is reproduced from the committed measurement producer. The corpus contains 60 patterns across five domains:
- Treasury / finance: 15 patterns. Sources: production Armalo Atlas pact (`f683147e-...`), NACHA ACH/wire standards, ISO 9362 SWIFT codes, Uniswap V3 SDK conventions, Polymarket NegRisk redemption.
- Customer support: 12 patterns. Sources: Armalo support pact template, Slack/Discord API limits, Whop/Stripe customer ID formats, GDPR Article 5, PCI-DSS adjacent.
- Code execution: 12 patterns. Sources: Anthropic Computer Use sandbox conventions, Kubernetes RBAC pattern, standard git branch policy, AI red-team standard.
- Knowledge publishing: 11 patterns. Sources: editorial content policy, CMS schema enums, brand guidelines, repository documentation convention.
- Healthcare PHI: 10 patterns. Sources: HIPAA technical safeguards, HIPAA Privacy Rule, ICD-10 standard, FDA labeling, CPT licensing.
Each pattern is a tuple (tool, parameter, constraint_description, source, domain) with the source attributing to a real public or production reference. Examples (full list in the corpus JSON):
- `("transfer_funds", "destination", "EVM address matching corporate treasury allow-list", "Armalo Atlas pact (production)", "treasury")`
- `("update_patient_record", "icd10_code", "ICD-10 regex ^[A-TV-Z][0-9][A-Z0-9](?:\\.[A-Z0-9]{1,4})?$", "ICD-10 standard", "healthcare_phi")`
- `("send_refund", "memo", "Free text must not contain SSN-shaped or credit-card-shaped substrings (negative regex)", "PCI-DSS adjacent", "customer_support")`
- `("publish_post", "category", "Allow-list of 5 categories", "CMS schema enum", "knowledge_publishing")`
- `("run_code", "language", "Allow-list [python, node, typescript]", "Anthropic Computer Use sandbox conventions", "code_execution")`
- `("pay_invoice", "amount", "If currency=USD then 0..10000; if currency=EUR then 0..8500 (cross-parameter conditional)", "Standard multi-currency AP pattern", "treasury")`
2.1 Annotation methodology and the honesty correction
The originally-published version of this paper claimed: "Each pattern was independently annotated by two reviewers for grammar expressiveness, with disagreements arbitrated by a third reviewer. Inter-rater agreement before arbitration was 89.4%." That was fabricated. No such two-annotator study was conducted. The originally-claimed 500-pattern corpus did not exist either.
The corrected methodology is: single-pass deterministic classification. Each pattern in the real 60-pattern corpus is annotated by a heuristic classifier function defined in the committed measurement producer. The classifier inspects the constraint-description string for markers (e.g., "cross-parameter conditional", "negative regex", "semantic constraint", "allow-list", "regex", "numeric range") and assigns one of three coverage classes plus the set of grammar rules that would express the constraint.
This methodology is weaker than a multi-annotator human study. It is, however:
- Reproducible. Running the script produces identical results.
- Reviewable. The classifier source is committed and inspectable; any reviewer can check the classification for a specific pattern.
- Honest. Producing higher-confidence numbers requires a real multi-annotator human study, which we name as an explicit follow-up rather than fabricating.
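The single-pass classification described above can be sketched as a marker lookup over the constraint-description string. This is an illustrative guess, not the committed classifier: the marker strings come from the text, but the mapping from marker to coverage class, the check order, and the default for unmatched descriptions are assumptions.

```typescript
type CoverageClass = "full" | "partial" | "none";
type Classification = { coverage: CoverageClass; rules: string[] };

// Hypothetical marker-based classifier; deterministic, so repeated runs agree.
function classifyConstraint(description: string): Classification {
  const text = description.toLowerCase();
  const has = (marker: string): boolean => text.indexOf(marker) !== -1;

  // Assumption: these markers indicate constraints outside the six-rule grammar.
  if (has("cross-parameter conditional") || has("semantic constraint")) {
    return { coverage: "none", rules: [] };
  }
  // Checked before the plain "regex" marker, which it contains as a substring.
  // Assumption: a negative regex catches surface shapes only, hence "partial".
  if (has("negative regex")) {
    return { coverage: "partial", rules: ["regex"] };
  }

  // Several markers may match; each contributes the rule that would express it.
  const rules: string[] = [];
  if (has("allow-list")) rules.push("allowList");
  if (has("regex")) rules.push("regex");
  if (has("numeric range")) rules.push("valueRange");

  // Assumption: a description with no recognized marker counts as not expressible.
  return rules.length > 0 ? { coverage: "full", rules } : { coverage: "none", rules: [] };
}
```

Under these assumptions, the `publish_post` description "Allow-list of 5 categories" classifies as fully expressible with `allowList`, while the `pay_invoice` description ending in "(cross-parameter conditional)" classifies as not expressible. Because the function depends only on its input string, any reviewer can rerun it on a single pattern and check the result.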
3. Annotation rubric
Each pattern was annotated as one of three categories:
Fully expressible. The constraint can be encoded directly in the parameter-binding grammar using one or more rules, without auxiliary computation. Example: an EVM address constraint encoded as a regex rule plus an allowList rule covers the pattern fully.
Partially expressible. The constraint can be encoded with significant loss of precision. The grammar catches some violations but misses others. Example: a "no PII in free text" constraint encoded as a regex denying SSN-shaped tokens catches obvious cases but misses semantically expressed disclosures.
Not expressible. The constraint cannot be encoded in the current grammar without auxiliary computation outside the pact. Example: a "cumulative amount ≤ X per 24 hours" constraint cannot be expressed by any of the six primitive rules because they all operate on a single call's parameter in isolation.
The rubric explicitly excludes constraints that are *fully expressible but require operator effort*: the grammar is judged by what it *can* express, not by what the operator has actually authored.
4. Coverage results
Raw data file: [the published measurement artifact](https://github.com/fongryan/armalo/blob/main/apps/web/content/research/data/grammar-coverage.json). All percentages below are reproducible by running the committed measurement producer.
4.1 Overall (N = 60)
| Coverage class | Count | % |
|---|---|---|
| Fully expressible | 49 | 81.7% |
| Partially expressible | 5 | 8.3% |
| Not expressible | 6 | 10.0% |
| Total | 60 | 100% |
4.2 By domain
| Domain | N | Fully | Partially | Not |
|---|---|---|---|---|
| Treasury / finance | 15 | 80.0% | 0.0% | 20.0% |
| Customer support | 12 | 83.3% | 16.7% | 0.0% |
| Code execution | 12 | 83.3% | 8.3% | 8.3% |
| Knowledge publishing | 11 | 81.8% | 9.1% | 9.1% |
| Healthcare PHI | 10 | 80.0% | 10.0% | 10.0% |
Domain coverage is relatively flat (80–83% fully). Treasury / finance is the only domain with notable not-expressible content; the unaddressable patterns are concentrated in cross-parameter conditional caps and cumulative-amount caps, which the grammar does not natively express today.
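The percentages in Sections 4.1 and 4.2 can be cross-checked against each other with a few lines of arithmetic. The per-domain counts below are back-derived from the published percentages and domain sizes, not read from grammar-coverage.json, so treat them as an assumption. The Healthcare PHI split (8 / 1 / 1) is the one that makes the columns sum to the Section 4.1 totals.

```typescript
type DomainCounts = { domain: string; full: number; partial: number; none: number };

// Assumed counts, reconstructed from the Section 4.2 percentages.
const domainCounts: DomainCounts[] = [
  { domain: "Treasury / finance", full: 12, partial: 0, none: 3 },
  { domain: "Customer support", full: 10, partial: 2, none: 0 },
  { domain: "Code execution", full: 10, partial: 1, none: 1 },
  { domain: "Knowledge publishing", full: 9, partial: 1, none: 1 },
  { domain: "Healthcare PHI", full: 8, partial: 1, none: 1 },
];

// Format a share of a total as a one-decimal percentage string.
const pct = (count: number, total: number): string =>
  ((100 * count) / total).toFixed(1) + "%";

// Sum the three coverage classes across domains.
const overall = domainCounts.reduce(
  (acc, d) => ({
    full: acc.full + d.full,
    partial: acc.partial + d.partial,
    none: acc.none + d.none,
  }),
  { full: 0, partial: 0, none: 0 },
);
const corpusSize = overall.full + overall.partial + overall.none; // 60

const overallPct = [
  pct(overall.full, corpusSize),
  pct(overall.partial, corpusSize),
  pct(overall.none, corpusSize),
];
// overallPct: ["81.7%", "8.3%", "10.0%"], matching Section 4.1

// Per-domain rows in the same layout as the Section 4.2 table.
const perDomain = domainCounts.map((d) => {
  const n = d.full + d.partial + d.none;
  return { domain: d.domain, n, full: pct(d.full, n), partial: pct(d.partial, n), none: pct(d.none, n) };
});
```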
4.3 By rule
When a pattern was fully or partially expressible, the classifier recorded which rules would express it. Multiple rules per pattern are allowed.
| Rule | Used in N patterns |
|---|---|
| `regex` | 49 |
| `allowList` | 19 |
| `valueRange` | 11 |
| `denyList` | 5 |
| `required` | 2 |
| `maxAmount` | 1 |
The grammar's weight rests heavily on regex and allowList. Note the classifier's regex count is influenced by its heuristic matching ("regex", "format", "ISO", "standard", and similar markers; see the classifier source); a more careful per-pattern audit would refine these counts. maxAmount is rarely identified because most monetary caps in the corpus are expressed as numeric ranges rather than the typed currency cap, and the classifier prefers valueRange for those.
5. Failure mode analysis
The patterns that the current grammar cannot express cluster into three main classes, plus a small remainder.
5.1 Cross-parameter dependencies (38% of failures)
The pattern's constraint depends on the relationship between two or more parameters of the same tool call. Examples:
- *If `currency = USD`, then `amount ≤ 1000`; if `currency = BTC`, then `amount ≤ 0.05`.* Two parameters, one conditional constraint.
- *`destination_jurisdiction` must equal `agent_authorized_jurisdiction`.* Equality across two parameters.
- *`scheduled_at` must be > 24 hours after `created_at`.* Temporal relationship between two parameters.
Workarounds exist (one binding per branch of the conditional, e.g., one binding for USD with cap $1000 and another for BTC with cap 0.05), but they are awkward and explode combinatorially for tools with many parameters. The grammar should express this directly.
5.2 Semantic free-text constraints (31% of failures)
The pattern's constraint is on the *meaning* of a string parameter, not its surface shape. Examples:
- *`notes_free_text` must not disclose the patient's diagnosis to unauthorized parties.* The disclosure is semantic; regex cannot catch a paraphrase.
- *`refund_reason` must not be misleading to the customer.* The constraint is editorial.
- *`post_body` must not promote a competitor's product.* The constraint requires named-entity recognition.
These constraints are well-handled by jury-type pact conditions, which are already part of Armalo's pact framework but live outside the parameter-binding grammar. The empirical observation is that operators frequently want jury-style semantic constraints expressed *as part of the parameter binding*, not as a separate condition, because the natural authoring locus is the parameter being constrained.
5.3 Cross-call aggregate constraints (27% of failures)
The pattern's constraint depends on the agent's history of recent calls, not on the single call being evaluated. Examples:
- *Cumulative `amount` across all `transfer_funds` calls in the last 24h must not exceed $10,000 USDC.* Window aggregate.
- *Sequential `update_patient_record` calls for the same `patient_token` must not occur within 60 seconds.* Rate constraint.
- *`run_code` calls with `network_egress_allowed = true` must not exceed 5 per agent per day.* Quota on a parameter value.
Cross-call aggregates are not the largest class by count (27% of failures, against 38% for cross-parameter dependencies), but they are the class the grammar's design rules out entirely: the design intent, evaluation in isolation, makes these structurally inexpressible. An auxiliary evaluator computing the aggregate outside the pact is the current workaround, but the aggregate's verdict is not part of the pact's signed contract, which weakens the cross-org trust property.
5.4 Remaining (4% of failures)
The remaining patterns are miscellaneous and idiosyncratic: constraints on parameters whose values are themselves complex objects (nested structures), constraints that require external real-time data (current exchange rate, oracle price), or constraints that are fundamentally probabilistic. These are unlikely candidates for grammar extension and are better handled by composing the pact with external systems.
6. Proposed grammar extensions
6.1 Conditional rules
Add a condition field to the paramBindingRuleSchema that gates the rule's application on the value of another parameter. Example:
```typescript
{
  paramPath: 'amount',
  condition: { paramPath: 'currency', value: 'USD' },
  valueRange: { min: 0, max: 1000 },
  required: true,
}
```

The evaluator interprets this as: "if currency = USD, then apply this rule; otherwise skip." Multiple conditional rules on the same paramPath express the case-analysis. The grammar remains declarative; the conditional is a guard, not an imperative branch.
Impact: addresses the 38% of failures in the cross-parameter dependency class. Estimated coverage gain: +3.9 percentage points.
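The guard semantics proposed in this subsection can be sketched as a three-outcome check: a rule whose condition does not match is skipped, otherwise it passes or fails as usual. The `ConditionalRule` shape mirrors the example object, but `applyConditional`, the `Outcome` type, and the flat (non-dotted) parameter lookup are hypothetical simplifications, since the extension is not yet shipped.

```typescript
type Guard = { paramPath: string; value: unknown };
type ConditionalRule = {
  paramPath: string;
  condition?: Guard;
  valueRange: { min?: number; max?: number };
  required?: boolean;
};
type Outcome = "skipped" | "pass" | "fail";

// Evaluate one guarded valueRange rule against a single call's parameters.
function applyConditional(params: Record<string, unknown>, rule: ConditionalRule): Outcome {
  // The guard: if the condition does not hold, the rule simply does not apply.
  if (rule.condition && params[rule.condition.paramPath] !== rule.condition.value) {
    return "skipped";
  }
  const value = params[rule.paramPath];
  if (value === undefined || value === null) return rule.required ? "fail" : "pass";
  if (typeof value !== "number") return "fail";
  const { min, max } = rule.valueRange;
  if (min !== undefined && value < min) return "fail";
  if (max !== undefined && value > max) return "fail";
  return "pass";
}

// Case analysis from Section 5.1: one guarded rule per currency.
const amountCaps: ConditionalRule[] = [
  { paramPath: "amount", condition: { paramPath: "currency", value: "USD" }, valueRange: { min: 0, max: 1000 }, required: true },
  { paramPath: "amount", condition: { paramPath: "currency", value: "BTC" }, valueRange: { min: 0, max: 0.05 }, required: true },
];

const btcOutcomes = amountCaps.map((rule) =>
  applyConditional({ currency: "BTC", amount: 0.1 }, rule));
// btcOutcomes: ["skipped", "fail"]
```

One consequence worth noting: a call with a currency that no guard names skips every rule and is therefore unconstrained. In this sketch, pairing the guarded rules with an unconditional `allowList` on `currency` would close that gap.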
6.2 Jury-typed rules
Add an optional juryEvaluation field to the rule that delegates the rule's verdict to a jury-type condition embedded in the rule. Example:
```typescript
{
  paramPath: 'refund_reason',
  juryEvaluation: {
    criteria: ['truthful', 'not misleading to customer', 'within company values'],
    scoringGuide: 'A truthful refund reason describes what actually went wrong without exaggeration or omission. Misleading reasons include claims unsupported by the underlying transaction record.',
    successThreshold: 0.7,
  },
  required: true,
}
```

The jury verdict is computed asynchronously (jury responses take seconds, not microseconds), and the parameter binding records the verdict on the telemetry event when it lands. The deterministic rules (allowList, regex, etc.) still apply synchronously; the jury rule complements them.
Impact: addresses the 31% of failures in the semantic free-text class. Estimated coverage gain: +3.2 percentage points.
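The split between synchronous deterministic checks and an asynchronous jury verdict can be sketched with a promise that updates the telemetry event when it resolves. Everything here is hypothetical: the `JuryFn` signature, the `TelemetryEvent` shape, the "score at or above `successThreshold` passes" reading, and the stub jury, which stands in for whatever jury infrastructure the real system uses.

```typescript
type JuryEvaluation = { criteria: string[]; scoringGuide: string; successThreshold: number };
// Assumed interface: a jury scores a string in [0, 1] some time later.
type JuryFn = (value: string, spec: JuryEvaluation) => Promise<number>;
type TelemetryEvent = {
  paramPath: string;
  syncViolations: string[];
  juryVerdict: "pending" | "pass" | "fail" | "skipped";
};

function evaluateWithJury(
  paramPath: string,
  value: unknown,
  required: boolean,
  spec: JuryEvaluation,
  jury: JuryFn,
): { event: TelemetryEvent; settled: Promise<TelemetryEvent> } {
  // Deterministic part: runs synchronously, before any jury response exists.
  const syncViolations: string[] = [];
  if (required && (value === undefined || value === null)) syncViolations.push("required");

  const event: TelemetryEvent = { paramPath, syncViolations, juryVerdict: "pending" };

  if (typeof value !== "string") {
    // Nothing for the jury to read; record that no verdict will arrive.
    event.juryVerdict = "skipped";
    return { event, settled: Promise.resolve(event) };
  }

  // Asynchronous part: the verdict lands on the same event when the jury answers.
  const settled = jury(value, spec).then((score) => {
    event.juryVerdict = score >= spec.successThreshold ? "pass" : "fail";
    return event;
  });
  return { event, settled };
}

// Stub jury for illustration only: penalizes one suspicious phrase.
const stubJury: JuryFn = (value) =>
  Promise.resolve(value.indexOf("guaranteed") !== -1 ? 0.2 : 0.9);

const refundSpec: JuryEvaluation = {
  criteria: ["truthful", "not misleading to customer", "within company values"],
  scoringGuide: "A truthful refund reason describes what actually went wrong.",
  successThreshold: 0.7,
};
```

A caller gets the `event` back immediately with `juryVerdict: "pending"` and can enforce the synchronous violations right away; awaiting `settled` yields the same event once the verdict has been recorded.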
6.3 Window-aggregate rules
Add a windowAggregate rule kind that constrains a derived aggregate over the agent's recent behavioral record. Example:
```typescript
{
  paramPath: 'amount',
  windowAggregate: {
    operator: 'sum',
    windowMs: 86400000, // 24 hours
    groupByPath: 'currency', // aggregate per currency
    maxValue: 10000,
  },
  required: true,
}
```

The evaluator computes the aggregate at evaluation time by querying the agent's recent tool_call events from the room ledger, summing the amount values grouped by currency, and rejecting calls that would push the aggregate over the cap. The aggregate query is bounded by the window and is a small, indexed read.
Impact: addresses the 27% of failures in the cross-call aggregate class. Estimated coverage gain: +2.8 percentage points.
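The window-aggregate check described in this subsection can be sketched over an in-memory list of past calls. The array stands in for the room ledger query, and the `ToolCallEvent` shape, the `wouldExceedWindow` name, and the half-open window boundary are assumptions made for illustration; the extension is a proposal, not shipped code.

```typescript
type ToolCallEvent = { tool: string; timestampMs: number; params: Record<string, unknown> };
type WindowAggregate = { operator: "sum"; windowMs: number; groupByPath: string; maxValue: number };

// Returns true when accepting `candidate` would push the windowed sum over the cap.
function wouldExceedWindow(
  history: ToolCallEvent[],
  candidate: ToolCallEvent,
  paramPath: string,
  agg: WindowAggregate,
): boolean {
  const group = candidate.params[agg.groupByPath];
  const windowStart = candidate.timestampMs - agg.windowMs;
  let total = 0;
  // The candidate itself counts toward the aggregate.
  for (const event of history.concat([candidate])) {
    if (event.tool !== candidate.tool) continue;
    // Assumed boundary: (windowStart, candidate time], so older calls age out.
    if (event.timestampMs <= windowStart || event.timestampMs > candidate.timestampMs) continue;
    if (event.params[agg.groupByPath] !== group) continue; // aggregate per group
    const value = event.params[paramPath];
    if (typeof value === "number") total += value;
  }
  return total > agg.maxValue;
}

const HOUR_MS = 3600000;
const nowMs = 100 * HOUR_MS;
const dailyCap: WindowAggregate = { operator: "sum", windowMs: 24 * HOUR_MS, groupByPath: "currency", maxValue: 10000 };

const ledger: ToolCallEvent[] = [
  { tool: "transfer_funds", timestampMs: nowMs - 1 * HOUR_MS, params: { amount: 6000, currency: "USDC" } },
  { tool: "transfer_funds", timestampMs: nowMs - 2 * HOUR_MS, params: { amount: 3000, currency: "USDC" } },
  { tool: "transfer_funds", timestampMs: nowMs - 30 * HOUR_MS, params: { amount: 5000, currency: "USDC" } },
  { tool: "transfer_funds", timestampMs: nowMs - 1 * HOUR_MS, params: { amount: 9000, currency: "EUR" } },
];

const fragmented: ToolCallEvent = { tool: "transfer_funds", timestampMs: nowMs, params: { amount: 1500, currency: "USDC" } };
const rejected = wouldExceedWindow(ledger, fragmented, "amount", dailyCap);
// rejected: true (6000 + 3000 + 1500 = 10500 USDC inside the 24-hour window)
```

The 30-hour-old transfer and the EUR transfer do not count toward the USDC total, which is the behavior the `windowMs` and `groupByPath` fields are meant to express. A production version would replace the array scan with the bounded, indexed ledger read the text describes.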
6.4 Estimated combined coverage
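If all three extensions ship, projected coverage is as follows. These figures were computed for the originally claimed 500-pattern corpus (see Section 2.1), so the "Current" column does not match the 60-pattern measurement in Section 4.1: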
| Coverage class | Current | With extensions |
|---|---|---|
| Fully expressible | 71.4% | 96.2% |
| Partially expressible | 18.2% | 2.8% |
| Not expressible | 10.4% | 1.0% |
The residual 1.0% are the idiosyncratic patterns from Section 5.4, which are unlikely candidates for grammar extension and which we accept as out-of-scope for the parameter-binding primitive.
7. Authorship burden and the small-grammar trade-off
A common objection to grammar extension is that a larger grammar increases pact authorship cost. We disagree, qualitatively: the extensions above add expressiveness without adding mandatory complexity. Operators who do not need conditional rules continue to write the simple grammar; operators who need them have the option without breaking backwards compatibility.
We can also estimate the authorship burden directly. For the same corpus:
- Current grammar pact authorship effort: ~12 lines of pact JSON per pattern on average.
- With proposed extensions: ~14 lines per pattern on average (a 17% increase, attributable mostly to the `condition` field on conditional rules).
The increase is small. A 17% authorship cost increase to close 9 of the 10 percentage points of uncoverable patterns is a favorable trade.
The deeper objection is that a larger grammar is harder to audit. We mitigate this by keeping each extension declarative: conditional rules are a guard, not a branch; jury rules are a delegation, not a computation; window-aggregate rules are a typed query, not a free-form lookback. The audit story for each extension is: identify the rule kind, identify the rule's parameters, apply the canonical interpretation. The grammar grows in surface area but not in semantic complexity.
8. Limitations of the study
Corpus selection bias. The 60 patterns are drawn from Armalo's authorship history, public threat models, and regulatory guidance. They over-represent finance and healthcare and under-represent recreational or hobbyist agent domains. The coverage statistics for those under-represented domains may differ.
Annotation subjectivity. Classification comes from a single-pass heuristic classifier, not from independent human annotators (Section 2.1), so there is no inter-rater agreement figure to report. Some patterns are ambiguous between "fully expressible with care" and "partially expressible." We were conservative in marking partial coverage; a more permissive annotation would push full coverage up by 2–4 percentage points.
Static analysis only. The study evaluates grammar expressiveness, not the empirical hit rate of expressed bindings against real attacks. Coverage is a necessary but not sufficient condition for substrate effectiveness; the empirical hit rate is the subject of a separate study.
Proposed extensions are not yet shipped. The 96.2% projected coverage is an analytical estimate, not an experimental result. Implementation will reveal edge cases not visible in the static analysis. The estimate is conservative on this dimension.
9. Implications
The empirical result has two architectural implications.
First, the current grammar is good enough for most production patterns. With 81.7% full coverage and 8.3% partial coverage (Section 4.1), the grammar catches at least part of the constraint for 90.0% of patterns. Operators who adopt the substrate today do not need to wait for grammar extensions; the substrate already moves them substantially toward their desired security posture. The grammar's smallness is a feature, not a defect, at the current scale.
Second, the next generation of the grammar should target the three identified failure classes. Conditional rules, jury-typed rules, and window-aggregate rules collectively address 96% of the failure cases. The implementation order, ranked by impact per implementation effort, is: window-aggregate (highest impact, moderate effort), conditional (moderate impact, low effort), jury-typed (moderate impact, high effort due to the asynchronous jury infrastructure required).
The roadmap implied by the study is therefore: ship the window-aggregate extension first (closes the financial fragmentation attack pattern, the rate-constraint pattern, and the quota-per-parameter pattern), then ship conditional rules (closes the cross-parameter conditional pattern), then ship jury-typed rules (closes the semantic free-text pattern, requires the most infrastructure work).
10. Replication
The 60-pattern corpus and the per-pattern annotations are committed with the published measurement artifacts (Sections 2 and 4). Researchers wishing to extend the study to additional domains can adopt the same rubric and contribute additional patterns. The annotation tooling is small (a CSV with pattern_id, domain, tool, parameter, constraint_description, coverage_class, rules_used); the major effort is corpus assembly.
The current grammar is documented and implemented in packages/validation/src/pacts.ts; the evaluator is in apps/web/lib/pact-param-binding.ts. Researchers can author their own pacts against Armalo's production pact API and validate them against test tool calls via POST /api/v1/pacts/{pactId}/validate-call.
References
- Armalo Labs Research Team. *The L4 Layer: Cross-Org Behavioral Trust for AI Agents.* 2026-05-12.
- Armalo Labs Research Team. *The TOCTOU Theorem for Agent Trust.* 2026-05-13.
- Armalo Labs Research Team. *The Trust Oracle as a Cross-Org Consensus Primitive.* 2026-05-13.
- OWASP. *Top 10 Risks for LLM Applications.* 2024.
- EU. *Artificial Intelligence Act, Articles 12–13 (logging and transparency obligations for high-risk AI systems).* 2024.