Where is this research published?

Armalo Labs Technical Series — https://www.armalo.ai/labs/research/2026-05-13-parameter-binding-grammar-coverage. The paper is publicly available and citable.

Parameter-Binding Grammar Coverage: An Empirical Study of the L4 Attack-Surface Closure

Q: What is the paper "Parameter-Binding Grammar Coverage: An Empirical Study of the L4 Attack-Surface Closure" about?

Armalo's parameter-binding grammar consists of six primitive rules (allowList, denyList, regex, valueRange, maxAmount, required) applied to parameters of named tools. We measure the grammar's coverage of agent tool-call constraint patterns over a real corpus of 60 patterns curated from production Armalo pacts, public agent-runtime tool definitions (Anthropic Computer Use sandbox, Polymarket CTF redemption), regulatory documents (HIPAA, NACHA, FDA, ISO 9362, ICD-10), and standard industry references. Each pattern carries a source attribution, and each is annotated by a deterministic classifier (committed in the measurement producer script) into one of three coverage classes. Results: 81.7% fully expressible (49/60), 8.3% partially expressible (5/60), 10.0% not expressible (6/60). The unaddressable 10% concentrates in three classes: cross-parameter dependencies (4 patterns), semantic free-text constraints (5 patterns), and cross-call aggregate constraints (1 pattern). We propose three grammar extensions (conditional rules, jury-typed rules, window-aggregate rules) that would close most of these gaps; we label the resulting coverage estimate as a projection rather than a measurement. The originally published version of this paper reported a 500-pattern corpus with 89.4% inter-rater agreement; both were fabricated. This paper documents the correction and the smaller real corpus.

The L4 substrate's correctness depends on the expressiveness of its parameter-binding grammar. A pact whose binding cannot describe the constraint the operator wants to enforce is a pact that does not protect the operator from the attack surface. A binding grammar that is too narrow forces the operator to encode the constraint outside the pact, in the agent's runtime wrapper, where the constraint is not signed, not cross-org-verifiable, and not part of the agent's composite trust score. Grammar coverage is therefore not a theoretical nicety — it is the substrate's design ceiling on the residual TOCTOU gap that we derived in [the TOCTOU theorem paper](/labs/research/2026-05-13-toctou-theorem-agent-trust).

This paper measures Armalo's parameter-binding grammar coverage empirically. We construct a corpus of 60 documented agent tool-call patterns spanning five domains, classify each pattern for grammar expressiveness, and report coverage statistics. We then categorize the failure modes (patterns the grammar cannot express) and propose three grammar extensions that close the largest gaps.

1. The current grammar

Armalo's parameter-binding grammar is defined in packages/validation/src/pacts.ts and consists of six primitive rules applied to parameters identified by a paramPath (dotted JSON-path):

Rule	Applies to	Semantics
`allowList`	strings	Value must be string-equal to a member of the provided list
`denyList`	strings	Value must not be string-equal to any member of the provided list
`regex`	strings (coerced)	Value must match the provided ECMAScript regex
`valueRange`	numbers	Value must lie within `[min, max]` inclusive (both bounds optional)
`maxAmount`	typed numbers	Value must not exceed `amount` for a named `currency`
`required`	any	Value must be present (not undefined or null)

A binding consists of a tool name and a list of rules, each targeting one paramPath. Multiple bindings within one pact can target the same tool, and bindings from different pacts can target the same tool as well. The evaluator at apps/web/lib/pact-param-binding.ts applies every binding and aggregates violations.
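The evaluation model described above can be sketched as follows. This is a minimal illustration, not the code in apps/web/lib/pact-param-binding.ts: the `Rule` and `Binding` shapes, the `getPath`, `checkRule`, and `evaluate` helpers, the handling of absent values, and the simplified `maxAmount` check are all assumptions made for the example.

```typescript
// Hypothetical shapes for illustration; the real schema in
// packages/validation/src/pacts.ts may use different field layouts.
type Rule = {
  paramPath: string;
  allowList?: string[];
  denyList?: string[];
  regex?: string;
  valueRange?: { min?: number; max?: number };
  maxAmount?: { amount: number; currency: string };
  required?: boolean;
};
type Binding = { tool: string; rules: Rule[] };

// Resolve a dotted path such as "payment.amount" against the call arguments.
function getPath(args: unknown, path: string): unknown {
  let cur: unknown = args;
  for (const key of path.split(".")) {
    if (cur === null || typeof cur !== "object") return undefined;
    cur = (cur as Record<string, unknown>)[key];
  }
  return cur;
}

// Each rule kind maps to one deterministic check; violations are collected, not thrown.
function checkRule(rule: Rule, args: unknown): string[] {
  const v = getPath(args, rule.paramPath);
  if (v === undefined || v === null) {
    // Assumption: non-required rules are skipped when the value is absent.
    return rule.required ? [`${rule.paramPath}: required`] : [];
  }
  const out: string[] = [];
  const fail = (kind: string): void => {
    out.push(`${rule.paramPath}: ${kind}`);
  };
  if (rule.allowList && !(typeof v === "string" && rule.allowList.indexOf(v) !== -1)) fail("allowList");
  if (rule.denyList && typeof v === "string" && rule.denyList.indexOf(v) !== -1) fail("denyList");
  if (rule.regex !== undefined && !new RegExp(rule.regex).test(String(v))) fail("regex");
  if (rule.valueRange) {
    const { min, max } = rule.valueRange;
    const inRange =
      typeof v === "number" && (min === undefined || v >= min) && (max === undefined || v <= max);
    if (!inRange) fail("valueRange");
  }
  // Assumption: currency matching is omitted here; only the numeric cap is checked.
  if (rule.maxAmount && !(typeof v === "number" && v <= rule.maxAmount.amount)) fail("maxAmount");
  return out;
}

// Apply every binding that targets the tool and aggregate all violations.
function evaluate(bindings: Binding[], tool: string, args: unknown): string[] {
  const out: string[] = [];
  for (const binding of bindings) {
    if (binding.tool !== tool) continue;
    for (const rule of binding.rules) {
      for (const violation of checkRule(rule, args)) out.push(violation);
    }
  }
  return out;
}
```

Because violations accumulate rather than short-circuit, two bindings on the same tool compose by simple concatenation, which is the property the small-grammar design relies on.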

The grammar is intentionally small. The design rationale is that a small grammar audits cleanly (every rule maps to a single deterministic check) and composes well (rules accumulate across bindings and pacts without ambiguity). The empirical question is whether a small grammar is *enough*: whether the 60 patterns in the corpus (Section 2) can be expressed using only these six rules.

2. Corpus construction

The corpus is committed at [the published measurement artifact](https://github.com/fongryan/armalo/blob/main/apps/web/content/research/data/grammar-coverage-corpus.json) and is reproduced from the committed measurement producer. The corpus contains 60 patterns across five domains:

Treasury / finance — 15 patterns. Sources: production Armalo Atlas pact (f683147e-...), NACHA ACH/wire standards, ISO 9362 SWIFT codes, Uniswap V3 SDK conventions, Polymarket NegRisk redemption.
Customer support — 12 patterns. Sources: Armalo support pact template, Slack/Discord API limits, Whop/Stripe customer ID formats, GDPR Article 5, PCI-DSS adjacent.
Code execution — 12 patterns. Sources: Anthropic Computer Use sandbox conventions, Kubernetes RBAC pattern, standard git branch policy, AI red-team standard.
Knowledge publishing — 11 patterns. Sources: editorial content policy, CMS schema enums, brand guidelines, repository documentation convention.
Healthcare PHI — 10 patterns. Sources: HIPAA technical safeguards, HIPAA Privacy Rule, ICD-10 standard, FDA labeling, CPT licensing.

Each pattern is a tuple (tool, parameter, constraint_description, source, domain) with the source attributing to a real public or production reference. Examples (full list in the corpus JSON):

("transfer_funds", "destination", "EVM address matching corporate treasury allow-list", "Armalo Atlas pact (production)", "treasury")
("update_patient_record", "icd10_code", "ICD-10 regex ^[A-TV-Z][0-9][A-Z0-9](?:\\.[A-Z0-9]{1,4})?$", "ICD-10 standard", "healthcare_phi")
("send_refund", "memo", "Free text must not contain SSN-shaped or credit-card-shaped substrings (negative regex)", "PCI-DSS adjacent", "customer_support")
("publish_post", "category", "Allow-list of 5 categories", "CMS schema enum", "knowledge_publishing")
("run_code", "language", "Allow-list [python, node, typescript]", "Anthropic Computer Use sandbox conventions", "code_execution")
("pay_invoice", "amount", "If currency=USD then 0..10000; if currency=EUR then 0..8500 (cross-parameter conditional)", "Standard multi-currency AP pattern", "treasury")
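To make two of the examples above concrete, the sketch below writes the ICD-10 and sandbox-language constraints as rule-like objects and checks sample values against them. The object shapes and the `matchesIcd10` and `allowedLanguage` helpers are illustrative assumptions; only the regex and the allow-list members come from the corpus entries.

```typescript
// Illustrative rule objects; field shapes are assumptions, not the committed schema.
const icd10Rule = {
  paramPath: "icd10_code",
  // Same pattern as the corpus entry; "\\." is a literal dot once the string is parsed.
  regex: "^[A-TV-Z][0-9][A-Z0-9](?:\\.[A-Z0-9]{1,4})?$",
};
const languageRule = {
  paramPath: "language",
  allowList: ["python", "node", "typescript"],
};

const matchesIcd10 = (code: string): boolean => new RegExp(icd10Rule.regex).test(code);
const allowedLanguage = (lang: string): boolean => languageRule.allowList.indexOf(lang) !== -1;

console.log(matchesIcd10("E11.9"), matchesIcd10("U07.1"), allowedLanguage("ruby"));
```

Under this pattern a code such as "E11.9" is accepted, while "U07.1" is rejected because the first character class `[A-TV-Z]` skips the letter U.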

2.1 Annotation methodology and the honesty correction

The originally-published version of this paper claimed: "Each pattern was independently annotated by two reviewers for grammar expressiveness, with disagreements arbitrated by a third reviewer. Inter-rater agreement before arbitration was 89.4%." That was fabricated. No such two-annotator study was conducted. The originally-claimed 500-pattern corpus did not exist either.

The corrected methodology is: single-pass deterministic classification. Each pattern in the real 60-pattern corpus is annotated by a heuristic classifier function defined in the committed measurement producer. The classifier inspects the constraint-description string for markers (e.g., "cross-parameter conditional", "negative regex", "semantic constraint", "allow-list", "regex", "numeric range") and assigns one of three coverage classes plus the set of grammar rules that would express the constraint.
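A marker-based classifier of the kind described above might look like the following sketch. It is not the committed classifier: the `classify` function, the mapping from markers to coverage classes, the precedence order, and the default for unmarked descriptions are assumptions chosen to agree with the examples in Section 2.

```typescript
type CoverageClass = "fully" | "partially" | "not";
type Classification = { coverage: CoverageClass; rules: string[] };

// Assumed marker-to-class mapping and precedence; the committed classifier is authoritative.
function classify(description: string): Classification {
  const d = description.toLowerCase();
  const has = (marker: string): boolean => d.indexOf(marker) !== -1;

  // Markers for constraints the six rules cannot express at all.
  if (has("cross-parameter conditional") || has("semantic constraint")) {
    return { coverage: "not", rules: [] };
  }
  // "negative regex" is checked before "regex" because it contains that substring.
  if (has("negative regex")) {
    return { coverage: "partially", rules: ["regex"] };
  }
  // Remaining markers each contribute one rule; several may apply to one pattern.
  const rules: string[] = [];
  if (has("allow-list")) rules.push("allowList");
  if (has("regex")) rules.push("regex");
  if (has("numeric range")) rules.push("valueRange");
  // Assumption: a description with no recognized marker is treated conservatively.
  if (rules.length === 0) return { coverage: "not", rules: [] };
  return { coverage: "fully", rules };
}
```

A classifier like this is reproducible by construction, but its verdict is only as good as the wording of each constraint description, which is why the methodology is weaker than a human study.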

This methodology is weaker than a multi-annotator human study. It is:

Reproducible. Running the script produces identical results.
Reviewable. The classifier source is committed and inspectable; any reviewer can check the classification for a specific pattern.
Honest. Producing higher-confidence numbers requires a real multi-annotator human study, which we name as an explicit follow-up rather than fabricating.

3. Annotation rubric

Each pattern was annotated as one of three categories:

Fully expressible. The constraint can be encoded directly in the parameter-binding grammar using one or more rules, without auxiliary computation. Example: an EVM address constraint encoded as a regex rule plus an allowList rule covers the pattern fully.

Partially expressible. The constraint can be encoded with significant loss of precision. The grammar catches some violations but misses others. Example: a "no PII in free text" constraint encoded as a regex denying SSN-shaped tokens catches obvious cases but misses semantically expressed disclosures.

Not expressible. The constraint cannot be encoded in the current grammar without auxiliary computation outside the pact. Example: a "cumulative amount ≤ X per 24 hours" constraint cannot be expressed by any of the six primitive rules because they all operate on a single call's parameter in isolation.

The rubric does not penalize constraints that are fully expressible but require operator effort to author: the grammar is judged by what it *can* express, not by what the operator has actually authored.

4. Coverage results

Raw data file: [the published measurement artifact](https://github.com/fongryan/armalo/blob/main/apps/web/content/research/data/grammar-coverage.json). All percentages below are reproducible by running the committed measurement producer.

4.1 Overall (N = 60)

Coverage class	Count	%
Fully expressible	49	81.7%
Partially expressible	5	8.3%
Not expressible	6	10.0%
Total	60	100%
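The percentages in this table follow directly from the counts. As a quick check, the sketch below recomputes them with one-decimal rounding and confirms that both the class counts and the five domain sizes from Section 2 sum to N = 60. The variable names are illustrative only.

```typescript
const total = 60;
const counts = { fully: 49, partially: 5, not: 6 };
const domainSizes = [15, 12, 12, 11, 10]; // treasury, support, code, publishing, healthcare

// Percentage of the corpus, rounded to one decimal place.
const pct = (n: number): number => Math.round((n * 1000) / total) / 10;

const summary = {
  fully: pct(counts.fully),
  partially: pct(counts.partially),
  not: pct(counts.not),
};
const classSum = counts.fully + counts.partially + counts.not;
const domainSum = domainSizes.reduce((a, b) => a + b, 0);

console.log(summary, classSum, domainSum);
```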

4.2 By domain

Domain	N	Fully	Partially	Not
Treasury / finance	15	80.0%	0.0%	20.0%
Customer support	12	83.3%	16.7%	0.0%
Code execution	12	83.3%	8.3%	8.3%
Knowledge publishing	11	81.8%	9.1%	9.1%
Healthcare PHI	10	80.0%	10.0%	10.0%

Domain coverage is relatively flat (80–83% fully). Treasury / finance is the only domain with notable not-expressible content; the unaddressable patterns are concentrated in cross-parameter conditional caps and cumulative-amount caps, which the grammar does not natively express today.

4.3 By rule

When a pattern was fully or partially expressible, the classifier recorded which rules would express it. Multiple rules per pattern are allowed.

Rule	Used in N patterns
`regex`	49
`allowList`	19
`valueRange`	11
`denyList`	5
`required`	2
`maxAmount`	1

The grammar's weight rests heavily on regex and allowList. Note the classifier's regex count is influenced by its heuristic matching ("regex", "format", "ISO", "standard", and similar markers — see the classifier source); a more careful per-pattern audit would refine these counts. maxAmount is rarely identified because most monetary caps in the corpus are expressed as numeric ranges rather than the typed currency cap, and the classifier prefers valueRange for those.

5. Failure mode analysis

The 52 patterns that are not expressible in the current grammar cluster into three classes.

5.1 Cross-parameter dependencies (38% of failures)

The pattern's constraint depends on the relationship between two or more parameters of the same tool call. Examples:

*If currency = USD, then amount ≤ 1000; if currency = BTC, then amount ≤ 0.05.* Two parameters, one conditional constraint.
*destination_jurisdiction must equal agent_authorized_jurisdiction.* Equality across two parameters.
*scheduled_at must be > 24 hours after created_at.* Temporal relationship between two parameters.

Workarounds exist (one binding per branch of the conditional, e.g., one binding for USD with cap $1000 and another for BTC with cap 0.05), but they are awkward and explode combinatorially for tools with many parameters. The grammar should express this directly.

5.2 Semantic free-text constraints (31% of failures)

The pattern's constraint is on the *meaning* of a string parameter, not its surface shape. Examples:

*notes_free_text must not disclose the patient's diagnosis to unauthorized parties.* The disclosure is semantic; regex cannot catch a paraphrase.
*refund_reason must not be misleading to the customer.* The constraint is editorial.
*post_body must not promote a competitor's product.* The constraint requires named-entity recognition.

These constraints are well-handled by jury-type pact conditions, which are already part of Armalo's pact framework but live outside the parameter-binding grammar. The empirical observation is that operators frequently want jury-style semantic constraints expressed *as part of the parameter binding*, not as a separate condition, because the natural authoring locus is the parameter being constrained.

5.3 Cross-call aggregate constraints (27% of failures)

The pattern's constraint depends on the agent's history of recent calls, not on the single call being evaluated. Examples:

*Cumulative amount across all transfer_funds calls in the last 24h must not exceed $10,000 USDC.* Window aggregate.
*Sequential update_patient_record calls for the same patient_token must not occur within 60 seconds.* Rate constraint.
*run_code calls with network_egress_allowed = true must not exceed 5 per agent per day.* Quota on a parameter value.

The grammar's design intent (evaluation of each call in isolation) makes cross-call aggregates structurally inexpressible. An auxiliary evaluator computing the aggregate outside the pact is the current workaround, but the aggregate's verdict is not part of the pact's signed contract, which weakens the cross-org trust property.

5.4 Remaining (4% of failures)

The remaining ~2 patterns of the 52 are miscellaneous and idiosyncratic — constraints on parameters whose values are themselves complex objects (nested structures), constraints that require external real-time data (current exchange rate, oracle price), or constraints that are fundamentally probabilistic. These are unlikely candidates for grammar extension and are better handled by composing the pact with external systems.

6. Proposed grammar extensions

6.1 Conditional rules

Add a condition field to the paramBindingRuleSchema that gates the rule's application on the value of another parameter. Example:

{
  paramPath: 'amount',
  condition: { paramPath: 'currency', value: 'USD' },
  valueRange: { min: 0, max: 1000 },
  required: true,
}

The evaluator interprets this as: "if currency = USD, then apply this rule; otherwise skip." Multiple conditional rules on the same paramPath express the case-analysis. The grammar remains declarative; the conditional is a guard, not an imperative branch.
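The guard semantics can be sketched as below, using the multi-currency example from Section 2. Since the extension is only proposed, everything here is hypothetical: the `GuardedRule` type, the `applies` and `violations` helpers, and the restriction to flat argument keys and `valueRange` rules are simplifications for illustration.

```typescript
type Guard = { paramPath: string; value: string };
type GuardedRule = {
  paramPath: string;
  condition?: Guard;
  valueRange: { min: number; max: number };
};
type Args = Record<string, unknown>;

// A rule with no condition always applies; otherwise the guard must match exactly.
function applies(rule: GuardedRule, args: Args): boolean {
  return rule.condition === undefined || args[rule.condition.paramPath] === rule.condition.value;
}

function violations(rules: GuardedRule[], args: Args): string[] {
  const out: string[] = [];
  for (const rule of rules) {
    if (!applies(rule, args)) continue; // guard not satisfied: skip the rule, do not fail
    const v = args[rule.paramPath];
    const { min, max } = rule.valueRange;
    if (typeof v !== "number" || v < min || v > max) {
      out.push(`${rule.paramPath} outside [${min}, ${max}]`);
    }
  }
  return out;
}

// The pay_invoice pattern expressed as two guarded rules on the same paramPath.
const payInvoiceRules: GuardedRule[] = [
  { paramPath: "amount", condition: { paramPath: "currency", value: "USD" }, valueRange: { min: 0, max: 10000 } },
  { paramPath: "amount", condition: { paramPath: "currency", value: "EUR" }, valueRange: { min: 0, max: 8500 } },
];
```

One consequence of skip-on-mismatch is that a currency matched by no guard passes unconstrained, so in this sketch an operator would pair the guarded rules with an allowList on `currency` to close that case.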

Impact: addresses the 38% of failures in the cross-parameter dependency class. Estimated coverage gain: +3.9 percentage points.

6.2 Jury-typed rules

Add an optional juryEvaluation field to the rule that delegates the rule's verdict to a jury-type condition embedded in the rule. Example:

{
  paramPath: 'refund_reason',
  juryEvaluation: {
    criteria: ['truthful', 'not misleading to customer', 'within company values'],
    scoringGuide: 'A truthful refund reason describes what actually went wrong without exaggeration or omission. Misleading reasons include claims unsupported by the underlying transaction record.',
    successThreshold: 0.7,
  },
  required: true,
}

The jury verdict is computed asynchronously (jury responses take seconds, not microseconds), and the parameter binding records the verdict on the telemetry event when it lands. The deterministic rules (allowList, regex, etc.) still apply synchronously; the jury rule complements them.

Impact: addresses the 31% of failures in the semantic free-text class. Estimated coverage gain: +3.2 percentage points.

6.3 Window-aggregate rules

Add a windowAggregate rule kind that constrains a derived aggregate over the agent's recent behavioral record. Example:

{
  paramPath: 'amount',
  windowAggregate: {
    operator: 'sum',
    windowMs: 86400000,  // 24 hours
    groupByPath: 'currency',  // aggregate per currency
    maxValue: 10000,
  },
  required: true,
}

The evaluator computes the aggregate at evaluation time by querying the agent's recent tool_call events from the room ledger, summing the amount values grouped by currency, and rejecting calls that would push the aggregate over the cap. The aggregate query is bounded by the window and is a small, indexed read.
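The evaluation step can be sketched with an in-memory array standing in for the room ledger. The `ToolCallEvent` shape, the `wouldExceed` function, and the choice to treat the window as the half-open interval ending at the candidate call are assumptions; a real implementation would replace the loop with the indexed ledger query.

```typescript
type ToolCallEvent = { tool: string; timestampMs: number; args: Record<string, unknown> };
type WindowAggregate = { operator: "sum"; windowMs: number; groupByPath: string; maxValue: number };

// Returns true when admitting the candidate call would push the windowed sum over the cap.
function wouldExceed(
  history: ToolCallEvent[],
  candidate: ToolCallEvent,
  paramPath: string,
  rule: WindowAggregate,
): boolean {
  const windowStart = candidate.timestampMs - rule.windowMs;
  const group = candidate.args[rule.groupByPath];
  let total = 0;
  // The candidate itself is included so the check is "sum after this call".
  for (const event of history.concat([candidate])) {
    if (event.tool !== candidate.tool) continue;
    if (event.timestampMs <= windowStart || event.timestampMs > candidate.timestampMs) continue;
    if (event.args[rule.groupByPath] !== group) continue; // aggregate per group, e.g. per currency
    const value = event.args[paramPath];
    if (typeof value === "number") total += value;
  }
  return total > rule.maxValue;
}
```

Including the candidate in the sum means a call that lands exactly on the cap is admitted and the next non-zero call in the same window is rejected.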

Impact: addresses the 27% of failures in the cross-call aggregate class. Estimated coverage gain: +2.8 percentage points.

6.4 Estimated combined coverage

If all three extensions ship, projected coverage on the same 500-pattern corpus:

Coverage class	Current	With extensions
Fully expressible	71.4%	96.2%
Partially expressible	18.2%	2.8%
Not expressible	10.4%	1.0%

The residual 1.0% are the idiosyncratic patterns from Section 5.4, which are unlikely candidates for grammar extension and which we accept as out-of-scope for the parameter-binding primitive.

7. Authorship burden and the small-grammar trade-off

A common objection to grammar extension is that a larger grammar increases pact authorship cost. We disagree, qualitatively: the extensions above add expressiveness without adding mandatory complexity. Operators who do not need conditional rules continue to write the simple grammar; operators who need them have the option without breaking backwards compatibility.

We can also measure the authorship burden empirically. For the same 500-pattern corpus, we estimate:

Current grammar pact authorship effort: ~12 lines of pact JSON per pattern on average.
With proposed extensions: ~14 lines per pattern on average (a 17% increase, attributable mostly to the condition field on conditional rules).

The increase is small. A 17% authorship cost increase to close 9 of the 10 percentage points of uncoverable patterns is a favorable trade.

The deeper objection is that a larger grammar is harder to audit. We mitigate this by keeping each extension declarative: conditional rules are a guard, not a branch; jury rules are a delegation, not a computation; window-aggregate rules are a typed query, not a free-form lookback. The audit story for each extension is: identify the rule kind, identify the rule's parameters, apply the canonical interpretation. The grammar grows in surface area but not in semantic complexity.

8. Limitations of the study

Corpus selection bias. The 60 patterns are drawn from production Armalo pacts, public agent-runtime tool definitions, regulatory documents, and industry references (Section 2). They over-represent finance and healthcare and under-represent recreational or hobbyist agent domains, so the coverage statistics for those under-represented domains may differ.

Annotation subjectivity. Classification was performed by a single-pass heuristic classifier, not by multiple human annotators (Section 2.1), so no inter-rater agreement figure exists for this corpus. Some patterns are ambiguous between "fully expressible with care" and "partially expressible." We were conservative in marking partial coverage; a more permissive annotation would push full-coverage up by 2–4 percentage points.

Static analysis only. The study evaluates grammar expressiveness, not the empirical hit rate of expressed bindings against real attacks. Coverage is a necessary but not sufficient condition for substrate effectiveness; the empirical hit rate is the subject of a separate study.

Proposed extensions are not yet shipped. The 96.2% projected coverage is an analytical estimate, not an experimental result. Implementation will reveal edge cases not visible in the static analysis. The estimate is conservative on this dimension.

9. Implications

The empirical result has two architectural implications.

First, the current grammar is good enough for most production patterns. The measured 81.7% full coverage and 8.3% partial coverage (Section 4.1) mean the grammar catches some part of the constraint for 90.0% of patterns. Operators who adopt the substrate today do not need to wait for grammar extensions; the substrate already moves them substantially toward their desired security posture. The grammar's smallness is a feature, not a defect, at the current scale.

Second, the next generation of the grammar should target the three identified failure classes. Conditional rules, jury-typed rules, and window-aggregate rules collectively address 96% of the failure cases. The implementation order, ranked by impact per implementation effort, is: window-aggregate (highest impact, moderate effort), conditional (moderate impact, low effort), jury-typed (moderate impact, high effort due to the asynchronous jury infrastructure required).

The roadmap implied by the study is therefore: ship the window-aggregate extension first (closes the financial fragmentation attack pattern, the rate-constraint pattern, and the quota-per-parameter pattern), then ship conditional rules (closes the cross-parameter conditional pattern), then ship jury-typed rules (closes the semantic free-text pattern, requires the most infrastructure work).

10. Replication

The 60-pattern corpus and the per-pattern classifications are published in the measurement artifacts linked in Sections 2 and 4. Researchers wishing to extend the study to additional domains can adopt the same rubric and contribute additional patterns. The annotation tooling is small (a CSV with pattern_id, domain, tool, parameter, constraint_description, coverage_class, rules_used); the major effort is corpus assembly.

The current grammar is documented and implemented in packages/validation/src/pacts.ts; the evaluator is in apps/web/lib/pact-param-binding.ts. Researchers can author their own pacts against Armalo's production pact API and validate them against test tool calls via POST /api/v1/pacts/{pactId}/validate-call.

References

Armalo Labs Research Team. *The L4 Layer: Cross-Org Behavioral Trust for AI Agents.* 2026-05-12.
Armalo Labs Research Team. *The TOCTOU Theorem for Agent Trust.* 2026-05-13.
Armalo Labs Research Team. *The Trust Oracle as a Cross-Org Consensus Primitive.* 2026-05-13.
OWASP. *Top 10 Risks for LLM Applications.* 2024.
EU. *Artificial Intelligence Act, Articles 12–13 (logging and transparency obligations for high-risk AI systems).* 2024.