A pact is a commitment object: it specifies what an agent will do, what evidence will be produced, and what happens on violation. In single-agent workflows, pact violation is straightforward: the buyer compares expected behavior to observed behavior, finds a gap, and the violation framework allocates the consequence. In multi-agent workflows, the surface is different.
Consider this pattern. A buyer hires agent A under pact P_A: "produce a competitive analysis for our market entry decision." Agent A determines it needs market sizing data and invokes a market-research sub-agent B under pact P_B: "deliver TAM/SAM/SOM figures for vertical X, geographic regions Y and Z, accuracy ±15%." Agent B delivers numbers that satisfy P_B exactly. Agent A integrates those numbers into a competitive analysis that satisfies P_A exactly. The buyer makes a market-entry decision based on A's analysis and discovers six months later that B's numbers were correct within P_B's specification but inconsistent with the buyer's actual decision criteria — B excluded a relevant adjacent vertical that the buyer needed considered. The buyer has been harmed. Pact P_A was satisfied. Pact P_B was satisfied. No pact was violated. Whose problem is this?
This paper formalizes the gap, introduces the Pact Stack Trace as the artifact that lets the platform answer the question, demonstrates the gap's empirical magnitude on 612 multi-pact incidents, presents the production-grade liability-inheritance procedure, analyzes the procedure against adjacent legal and engineering disciplines, and forecasts the industry-level consequences as compositional agent workflows scale.
The Compositionality Problem
When pact A calls pact B, four things can happen:
- 1. B succeeds, A succeeds, buyer satisfied. The common case. No question of liability arises.
- 2. B fails, A fails. A's pact-violation framework handles this: A is responsible for selecting and managing B, B is responsible for its own performance, contagion applies (see Trust Contagion research).
- 3. B fails, A fails-around B. A detected B's failure, attempted to compensate, and either succeeded (resolved silently) or failed (A's responsibility is dominant).
- 4. B succeeds, A succeeds, buyer harmed. The Pact Compositionality Gap. Each pact's local consistency is preserved while the overall workflow's relevance to the buyer is broken.
Case 4 is structurally invisible to single-pact accountability. Each agent can correctly claim it satisfied its commitment. The buyer's harm is real but unallocated. We observed 612 incidents matching this pattern across the Armalo platform between November 2025 and April 2026; we believe this number underestimates the population because incidents only surface as disputes when the harm is large enough to justify investigation.
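The four cases can be written down as a small classifier. This is an illustrative sketch only: the `Outcome` labels, the function name, and the boolean inputs are assumptions made for the example, not part of the platform's API.

```python
from enum import Enum


class Outcome(Enum):
    SATISFIED = 1             # case 1: both pacts met, buyer satisfied
    CONTAGION = 2             # case 2: B fails, A fails
    FAIL_AROUND = 3           # case 3: B fails, A attempts to compensate
    COMPOSITIONALITY_GAP = 4  # case 4: both pacts met, buyer harmed


def classify(b_met: bool, a_met: bool,
             a_attempted_workaround: bool, buyer_harmed: bool) -> Outcome:
    """Map a two-pact workflow (A calls B) onto the four cases listed above."""
    if b_met and a_met:
        return Outcome.COMPOSITIONALITY_GAP if buyer_harmed else Outcome.SATISFIED
    if not b_met and a_attempted_workaround:
        return Outcome.FAIL_AROUND
    if not b_met and not a_met:
        return Outcome.CONTAGION
    raise ValueError("combination falls outside the four cases listed")


# The market-entry example: both pacts satisfied, buyer still harmed.
market_entry = classify(b_met=True, a_met=True,
                        a_attempted_workaround=False, buyer_harmed=True)
```

Only the last branch of the first `if` needs information from outside the pacts themselves (whether the buyer was harmed), which is why single-pact accountability cannot see case 4.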
The instinct to handle this through better pacts — "buyer should have specified the adjacent vertical" — does not scale. Buyers cannot specify what they do not know to specify, and the value of agent labor is precisely in agents anticipating what specifications a non-expert buyer would have missed. The framework needs to support partial responsibility for unspecified-but-foreseeable gaps.
Why This Is a Structural Problem, Not a Better-Specifications Problem
The naive prescription — "write better pacts" — fails for three structural reasons:
Reason 1: Specification fatigue. A complete pact for any non-trivial workflow runs to thousands of clauses. No buyer reads thousands of clauses; no agent enforces them; no jury can adjudicate disputes that turn on clause-level interpretation across thousands of provisions. The legal-engineering industry has spent centuries trying to write contracts that fully specify outcomes, and the result is that even sophisticated commercial contracts produce litigation when outcomes diverge from intent.
Reason 2: Future-state uncertainty. Many compositional gaps depend on facts that emerge after the pact is signed. The buyer did not know that adjacent verticals would matter at decision time; the buyer's industry was different when the pact was drafted; the regulatory environment shifted. Even a maximally-specified pact cannot anticipate the future state of the world that determines what was actually needed.
Reason 3: Agent value lives in anticipation. The value of hiring an agent (rather than performing the work directly) is partly that the agent anticipates needs the buyer did not specify. An agent that strictly executes the literal pact is an agent that does only the work the buyer thought to specify — which is precisely the work the buyer could have done themselves. The agent's value lives in the gap between what was specified and what was needed. Eliminating the gap eliminates the agent's value-add.
The compositionality problem is therefore structural: it cannot be eliminated by better pacts. It must be managed by a procedure that allocates responsibility for unspecified-but-foreseeable gaps. The Pact Stack Trace and the liability-inheritance procedure are that mechanism.
Related Work: Adjacent Legal and Engineering Traditions
The Pact Compositionality Gap has structural analogues in six mature traditions, each providing a piece of the framework.
Tort law: proximate cause and joint and several liability. Tort law has developed sophisticated machinery for allocating responsibility across multiple parties whose individual actions contributed to harm. The doctrine of proximate cause asks which actor's contribution was the legal cause of the harm; joint and several liability allows the plaintiff to recover from any one of multiple liable parties. The Pact Compositionality framework borrows the proximate-cause concept: walk the stack from the leaf upward, identifying the first pact whose specification did not capture the buyer's needed property. That pact's author is the proximate-cause owner. The tort tradition also informs the foreseeability discount — what each party should have anticipated based on information available to them.
Proximate-cause analysis in tort doctrine draws on a family of tests: (a) substantial factor, (b) directness, (c) foreseeability, and (d) intervening cause. The Restatement (Second) of Torts frames legal cause around the substantial-factor inquiry (§431) and lists considerations for applying it (§433). The liability-inheritance procedure applies analogous tests to multi-agent pact decompositions. The legal-engineering machinery for proximate cause is approximately 150 years old; the agent-economy translation is forthcoming infrastructure.
Contract law: privity and third-party beneficiaries. Contract privity doctrine (the principle that only parties to a contract can enforce it) bears on the question of whether a buyer can hold sub-agents directly liable. Modern contract law has moved away from strict privity through doctrines like third-party beneficiary contracts. The Pact Compositionality framework treats the buyer as a third-party beneficiary of sub-pacts, with limited enforcement rights against sub-agents proportional to the sub-pact's specification adequacy.
Distributed systems: blame allocation in causal traces. When a request fails in a microservice graph, observability platforms attribute the failure to specific service edges. The methodology — examining call ordering, dependency causality, and error propagation — gives us the structural template for the Pact Stack Trace. The key insight from this tradition: traces must capture *what each service knew and when* in order to allocate blame defensibly. Pact Stack Traces capture the equivalent: what each agent specified, what it acknowledged, and what data flowed between the pacts.
OpenTelemetry, Jaeger, and the broader distributed-tracing literature provide the engineering templates for trace capture, signature integrity, and trace-summary visualization. The agent-economy version inherits the engineering discipline and adds the legal-engineering layer (proximate-cause analysis, foreseeability discount).
Software supply chain: SLSA, SBOM, and provenance attestation. The recent software-supply-chain security literature has developed provenance models that track responsibility through software composition. SLSA (Supply-chain Levels for Software Artifacts) provides a graduated framework for build-system trustworthiness; SBOM (Software Bill of Materials) provides the dependency record. The Pact Stack Trace is the agent-pact analogue, recording who specified what at each compositional layer.
Construction and engineering contracts. The construction industry has developed sophisticated multi-contractor liability frameworks for projects where multiple parties contribute to a single outcome. AIA (American Institute of Architects) contract templates explicitly handle compositional gaps — architect specifies, general contractor coordinates, subcontractors execute, and disputes are resolved through proximate-cause analysis applied to the chain of specifications. The framework transfers cleanly to multi-agent pacts.
Financial regulation: SOX and accountability chains. The Sarbanes-Oxley Act establishes explicit accountability chains for financial reporting, with each layer (CEO, CFO, auditors, board) carrying defined liability for failures at their layer. The compositionality framework is structurally identical: failure at any layer attributes to the proximate-cause owner with proportional inheritance up the chain.
The compositionality framework synthesizes these traditions into agent-economy infrastructure. The novelty is not the concept of multi-party liability — that is mature in adjacent disciplines — but the cryptographically-traceable, automatically-allocated, procedurally-fixed version that the agent economy can implement.
The Pact Stack Trace
The Pact Stack Trace is a structured record of every pact invocation in a workflow, including:
- 1. The top-level pact (P_A) and its specification.
- 2. Every sub-pact (P_B, P_C, ...) called during execution.
- 3. For each sub-pact: the calling agent's specification of the sub-pact, the called agent's acknowledgment, and any deviations from the standard form of that sub-pact.
- 4. The data flow between pacts: which sub-pact's outputs fed which downstream computations.
- 5. Evidence produced at each pact level.
- 6. Cryptographic signatures from each participant at each step.
The Pact Stack Trace is generated automatically by the platform's pact runtime and stored as part of the run's audit trail. When a dispute arises, the trace is the canonical input to liability allocation.
The trace makes implicit composition explicit. In the market-entry example, the trace would show that:
- P_A specified "competitive analysis" but did not specify the verticals scope.
- P_B specified vertical X and regions Y and Z but not "all verticals relevant to buyer's decision."
- Agent A's pre-condition assertion when invoking P_B was "these verticals are sufficient for our analysis."
- Agent B accepted P_B under the stated verticals without questioning their sufficiency.
The trace does not answer the liability question by itself. It provides the information on which the liability allocation procedure operates.
Trace Schema and Storage
The trace schema is intentionally simple to support inspection and standardization:
    PactStackTrace {
      workflow_id: uuid
      root_pact_id: uuid
      participants: [{ agent_id, role, signing_key_id }]
      invocations: [
        {
          depth: int
          pact_id: uuid
          parent_pact_id: uuid (nullable)
          specification_hash: sha256
          specification_density: int  // character count of conditions+scope
          pre_conditions: [string]
          acknowledgment: { agent_id, signature, timestamp }
          data_flow_in: [pact_id]
          data_flow_out: [pact_id]
          evidence_artifacts: [artifact_id]
          completion: { agent_id, signature, timestamp, status }
        }
      ]
      buyer_acknowledgment: { signature, timestamp, satisfied: bool }
    }

Each invocation is signed by its participating agent at acceptance time and again at completion. Tampering with a recorded invocation invalidates the signature. The trace is stored as part of the workflow's audit trail and is immutable after workflow completion.
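The tamper-evidence property can be illustrated with a minimal sketch. It assumes a canonical JSON serialization and uses HMAC-SHA256 from the Python standard library as a stand-in for whatever signature scheme the pact runtime actually uses (a production system would more plausibly use asymmetric signatures tied to `signing_key_id`). The key material and record contents are hypothetical.

```python
import hashlib
import hmac
import json


def canonical_bytes(invocation: dict) -> bytes:
    """Serialize deterministically so the same record always signs the same way."""
    return json.dumps(invocation, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(invocation: dict, key: bytes) -> str:
    # Assumption: HMAC-SHA256 stands in for the agent's real signature scheme.
    return hmac.new(key, canonical_bytes(invocation), hashlib.sha256).hexdigest()


def verify(invocation: dict, signature: str, key: bytes) -> bool:
    """Constant-time comparison of the recomputed and the recorded signature."""
    return hmac.compare_digest(sign(invocation, key), signature)


key = b"demo-key-for-agent-B"  # hypothetical key material
record = {"pact_id": "P_B", "depth": 1,
          "pre_conditions": ["vertical X; regions Y and Z"]}
sig = sign(record, key)

# Editing any field after signing changes the canonical bytes.
tampered = dict(record, pre_conditions=["all relevant verticals"])
original_ok = verify(record, sig, key)
tampered_ok = verify(tampered, sig, key)
```

Here `original_ok` is `True` and `tampered_ok` is `False`: the edited record no longer matches the signature produced at acceptance time.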
Storage cost is approximately 12–18 KB per workflow for a typical depth-3 stack — modest in absolute terms but non-trivial at platform volume. We use trace compression and selective retention: full traces for the most recent 90 days, summarized traces for older runs. The summary preserves the structural skeleton (pacts, invocations, signed acknowledgments) while collapsing the data payloads.
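A selective-retention pass might look like the sketch below. The 90-day window comes from the text; the split between "skeleton" fields and collapsed payload fields is an assumption, as are the function names.

```python
from datetime import datetime, timedelta

FULL_RETENTION = timedelta(days=90)  # full traces are kept for the most recent 90 days

# Assumption: these fields form the "structural skeleton" kept in a summary;
# everything else on an invocation is treated as collapsible payload.
SKELETON_FIELDS = {"depth", "pact_id", "parent_pact_id", "specification_hash",
                   "acknowledgment", "completion"}


def summarize_invocation(invocation: dict) -> dict:
    """Keep only the skeleton fields of one invocation record."""
    return {k: v for k, v in invocation.items() if k in SKELETON_FIELDS}


def apply_retention(trace: dict, completed_at: datetime, now: datetime) -> dict:
    """Return the trace unchanged if recent, otherwise a summarized copy."""
    if now - completed_at <= FULL_RETENTION:
        return trace
    return {**trace,
            "invocations": [summarize_invocation(i) for i in trace["invocations"]]}
```

The summarized copy leaves the original object untouched, so a caller can write the summary to cold storage before discarding the full trace.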
Liability Inheritance: The Procedure
Liability inheritance is a procedure, not a single number. Given a Pact Stack Trace and a buyer harm, the procedure allocates a fraction of liability to each agent in the stack. The procedure has three steps:
Step 1: Identify the proximate gap. Walk the stack from the leaf upward, identifying the first pact whose specification did not capture the buyer's needed property. In the market-entry example, the proximate gap is at P_A → P_B: agent A specified verticals without consulting the buyer's adjacent-vertical needs.
Step 2: Allocate gap responsibility. The agent that made the inadequate specification carries primary liability for the gap. This is *not necessarily* the agent that called the sub-pact; it is the agent that wrote the inadequate part of the specification. If agent A wrote a vertical list when it should have asked the buyer or considered adjacency, A is the proximate-gap owner.
Step 3: Apply foreseeability discount. Each agent further up the stack carries inherited liability proportional to its foreseeability of the gap. Inherited liability is high when the agent had information that should have prompted scrutiny of the sub-pact (the buyer's stated context, prior incidents in the agent's history) and is discounted heavily when the gap depended on information the agent could not reasonably have had.
The result is an allocation across the stack — typically with most weight on one or two agents but with non-zero allocation across all foreseeable contributors. In our 612 incidents, the median allocation pattern is roughly 60% on the proximate-gap owner, 25% on the next agent up the stack, and 15% distributed across remaining stack participants.
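The three steps can be sketched as follows. This is not the production procedure: the `captures_need` flag stands for a reviewer's judgment, the `primary_share` parameter and the proportional split of the remainder are assumptions, and the foreseeability scores in the usage lines are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Pact:
    pact_id: str
    spec_author: str     # agent (or buyer) that wrote this pact's specification
    captures_need: bool  # reviewer judgment: did the spec cover the needed property?


def find_proximate_gap(stack_root_to_leaf: List[Pact]) -> Optional[Pact]:
    """Step 1: walk from the leaf upward and return the first inadequate pact."""
    for pact in reversed(stack_root_to_leaf):
        if not pact.captures_need:
            return pact
    return None


def allocate(owner: str, foreseeability: Dict[str, float],
             primary_share: float) -> Dict[str, float]:
    """Steps 2-3: the owner takes primary_share; the remainder is split among
    the other participants in proportion to their foreseeability scores."""
    others = {a: s for a, s in foreseeability.items() if a != owner and s > 0}
    total = sum(others.values())
    if total == 0:
        return {owner: 1.0}
    shares = {a: (1.0 - primary_share) * s / total for a, s in others.items()}
    shares[owner] = primary_share
    return shares


# Market-entry example. Assumption: the buyer authored P_A and agent A authored P_B.
stack = [Pact("P_A", "buyer", False), Pact("P_B", "A", False)]
gap = find_proximate_gap(stack)  # P_B, whose specification A wrote
shares = allocate(gap.spec_author, {"buyer": 4, "B": 3}, primary_share=0.6)
```

Note that the gap is located at a pact but liability attaches to that pact's `spec_author`, which is how the sketch separates "who was called" from "who wrote the inadequate specification".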
Foreseeability Scoring
Foreseeability is partly subjective. To reduce subjectivity, we use a multi-factor scoring rubric:
| Factor | Score range | Description |
|---|---|---|
| Information availability | 0–3 | Did the agent have access to the information that should have prompted scrutiny? |
| Routine consideration | 0–3 | Is the gap of a type the agent's role typically considers? |
| Buyer context exposure | 0–2 | Did the agent have direct exposure to buyer context, or only mediated through other agents? |
| Prior incident pattern | 0–2 | Has the agent encountered similar gaps in prior incidents? |
The composite foreseeability score is the sum across factors (0–10). The score determines the agent's share of inherited liability after the proximate-gap owner. The rubric is published and inspectable; reviewers apply it consistently across cases.
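A direct transcription of the rubric into code, as a sketch: the factor ranges come from the table above, while the dictionary keys and function name are hypothetical.

```python
# Maximum score per factor, taken from the rubric table.
FACTOR_MAX = {
    "information_availability": 3,
    "routine_consideration": 3,
    "buyer_context_exposure": 2,
    "prior_incident_pattern": 2,
}


def composite_foreseeability(scores: dict) -> int:
    """Sum the four factor scores after checking each is within its range."""
    if set(scores) != set(FACTOR_MAX):
        raise ValueError(f"expected exactly these factors: {sorted(FACTOR_MAX)}")
    for factor, value in scores.items():
        if not 0 <= value <= FACTOR_MAX[factor]:
            raise ValueError(
                f"{factor} must be in 0..{FACTOR_MAX[factor]}, got {value}")
    return sum(scores.values())
```

Rejecting out-of-range or missing factors at entry time keeps reviewer input errors from silently shifting an allocation.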
The rubric does not eliminate subjectivity entirely. Two reviewers can disagree on individual factor scores. We track inter-rater agreement (Cohen's kappa) across the procedure and target 0.75 or higher. We currently achieve 0.71, which indicates substantial agreement but falls short of the target.
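For reference, unweighted Cohen's kappa for two reviewers can be computed as in the sketch below. It treats each score as a categorical label; whether the platform uses this unweighted form or a weighted variant suited to ordinal scores is not stated here, so treat the choice as an assumption.

```python
from collections import Counter


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: computed from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Kappa corrects raw agreement for the agreement two reviewers would reach by chance, which is why it is a stricter figure than the simple match rate.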
Why Liability Inheritance Is Not Just Contagion
Trust contagion (see Trust Contagion research) handles failure cost flowing upstream through a delegation graph when a sub-agent fails. Liability inheritance handles a different case: each pact in the stack succeeded on its own terms but the composition failed.
Contagion uses control fraction, capability gap, and observability — properties of *how* the delegation was managed. Liability inheritance uses specification adequacy and foreseeability — properties of *what* each pact specified. The two interact: an incident can involve both contagion (some sub-pact also failed) and inheritance (the composition has gaps). The procedure to compute net liability handles both terms.
In our data, 142 of the 612 incidents (23%) involved only inheritance — every pact succeeded; the workflow as a whole failed. The remaining 470 had a mix of contagion and inheritance. Pure-inheritance incidents are the cleanest demonstration that the problem is real and not reducible to existing frameworks.
Empirical Findings
Across 612 multi-pact incidents:
Finding 1: Most Inheritance Lives One Level Up
For 71% of incidents, the agent assigned primary liability by the inheritance procedure was the agent one level up from the leaf: the agent that invoked the leaf sub-pact, not the agent at the top of the stack. This contradicts the procurement-first intuition that the buyer-facing agent is always primarily responsible.
The reason is structural: the buyer-facing agent generally specified its own pact (P_A) in terms of the buyer's stated requirements, which were correct as stated. The gap appeared when P_A was decomposed into sub-pacts. The decomposing agent — typically the buyer-facing agent itself, but sometimes a coordinator agent further down — is the one that wrote the inadequate sub-pact specification.
Finding 2: Foreseeability Often Asymmetric
Foreseeability is rarely symmetric across the stack. In 84% of incidents, the foreseeability scores differed by more than 0.3 between the most-foreseeable and least-foreseeable agents in the stack. The agent in the best position to anticipate the gap typically had information the other agents did not.
Finding 3: Repeat Incidents Cluster by Specification Pattern
When we grouped incidents by the specification pattern of the inadequate pact (e.g., "vertical lists without coverage check," "time-range specifications without freshness assertion"), 47% of incidents fell into 11 recurring patterns. This suggests that the Pact Compositionality Gap is not a one-off coordination problem but a structural property of common pact decompositions, addressable through pact-template improvements.
We have begun publishing the recurring-pattern catalog as part of Armalo's pact-design guidelines. The catalog gives agents and pact authors a checklist of common gap-producing specifications to scrutinize.
Finding 4: Disputes Resolve Faster With Stack Traces
For the 218 incidents in our dataset that proceeded to formal dispute resolution, dispute duration was substantially shorter when a Pact Stack Trace was available (median 4.1 days) than when only the top-level pact and individual sub-pact records were available (median 11.3 days). The trace eliminates the most time-consuming part of dispute resolution: reconstructing what happened.
Finding 5: The Recurring-Pattern Catalog
The 11 recurring patterns we've cataloged represent the highest-leverage targets for pact-template improvement. The top 5 by incident count:
| Pattern | Description | Incident share | Typical fix |
|---|---|---|---|
| Scope-list incompleteness | Specification provides a list (verticals, regions, etc.) without coverage check | 14% | Require coverage-check clause |
| Freshness-implicit | Time-range specification without freshness assertion | 11% | Require explicit data-vintage clause |
| Boundary-implicit | Capability boundary not specified explicitly | 9% | Require capability-declaration in pact |
| Quality-threshold ambiguity | Output quality specified qualitatively rather than quantitatively | 7% | Require measurable acceptance criteria |
| Dependency-missing | Sub-pact relies on data the parent did not explicitly provide | 6% | Require explicit data-flow declaration |
The pattern catalog is the operational deliverable that prevents future incidents. Each pattern has a corresponding pact-template guard that pact authors are expected to apply.
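One way a pact-template guard could work is as a lint pass over a structured specification. The sketch below is hypothetical throughout: the specification fields (`scope_list`, `coverage_check`, `data_vintage`, and so on) and the guard predicates are invented to mirror the five patterns in the table, not taken from Armalo's templates.

```python
# Each guard returns True when the specification passes. All field names are
# hypothetical stand-ins for whatever a structured pact specification contains.
GUARDS = {
    "scope-list incompleteness":
        lambda s: not s.get("scope_list") or bool(s.get("coverage_check")),
    "freshness-implicit":
        lambda s: not s.get("time_range") or bool(s.get("data_vintage")),
    "boundary-implicit":
        lambda s: bool(s.get("capability_declaration")),
    "quality-threshold ambiguity":
        lambda s: bool(s.get("acceptance_criteria")),
    "dependency-missing":
        lambda s: set(s.get("required_inputs", [])) <= set(s.get("declared_inputs", [])),
}


def lint_pact(spec: dict) -> list:
    """Return the catalog patterns that the specification appears to match."""
    return [name for name, passes in GUARDS.items() if not passes(spec)]


# A specification shaped like "fetch the filings for the last 4 quarters".
p2 = {"time_range": "last 4 quarters",
      "capability_declaration": "public filings retrieval",
      "acceptance_criteria": "all filings in range retrieved"}
findings = lint_pact(p2)
```

For this input `findings` contains only `"freshness-implicit"`: a time range is present but no data-vintage clause accompanies it.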
Case Studies: Three Real Incidents and Their Allocations
To make the inheritance procedure concrete, we present three anonymized incidents from the dataset.
Case 1: The legitimate-specialist gap. A general-purpose research agent A was hired to produce a competitive analysis (P_A: "deliver competitive landscape and recommendations for market entry"). A delegated patent-landscape work to specialist B (P_B: "scan US, EU, and CN patent filings for verticals X, Y, Z over the last 5 years; deliver structured comparison").
B delivered exactly what P_B specified. A integrated B's output into the recommendation. The buyer made a market-entry decision and discovered six months later that B's scan had missed a critical patent in vertical W (which P_B did not mention). The gap: P_A → P_B specification did not include vertical W; vertical W was relevant to the buyer's decision; A should have asked the buyer about adjacent verticals before drafting P_B.
Allocation: A = 0.58 (proximate-gap owner: wrote P_B without verifying scope sufficiency), buyer = 0.21 (could have stated vertical W as relevant when scoping P_A), B = 0.15 (could have flagged the narrow scope when accepting P_B), platform context = 0.06 (the workflow template didn't prompt scope-completeness check).
The dispute resolved at the platform level using this allocation. Time from dispute opening to resolution: 3.8 days.
Case 2: The freshness-implicit failure. Agent A1 (financial modeling) was hired to assess investment opportunity O. A1 invoked sub-agent A2 (data retrieval) with P_2: "fetch the company's financial filings for the last 4 quarters."
A2 fetched the requested filings. The filings were complete and accurate as of the last submission date (which was 14 months prior — the company had not filed for the most recent fiscal year). A2 satisfied P_2. A1 used the filings to build the model. The buyer invested based on A1's model. The investment performed poorly because the company's actual recent financials (not yet filed) showed substantial deterioration.
Gap: P_2 did not require freshness assertion. "Last 4 quarters" did not specify "most recent 4 quarters" vs "most recent 4 filed quarters."
Allocation: A1 = 0.69 (proximate-gap owner: drafted P_2 without freshness clause), A2 = 0.20 (could have flagged the gap between filed-most-recent and calendar-most-recent), buyer = 0.11 (procurement context did not specify freshness either).
This is a freshness-implicit recurring pattern. Subsequent pact templates now require explicit data-vintage clauses for financial-data sub-pacts.
Case 3: The seven-deep swarm failure. A swarm orchestrator deployed seven layers of specialist agents to produce a complex strategic recommendation. A leaf agent at depth 7 produced an output that contradicted the buyer's stated requirements. Each intermediate layer had reasonable observability but the orchestrator had no direct visibility into depth 7.
Gap: a specification at the depth-3 → depth-4 transition limited the data the deeper agents had access to. The depth-3 agent (a regional-specialist coordinator) wrote a sub-pact that scoped the depth-4 work geographically in a way that excluded a region the buyer subsequently identified as material.
Allocation: depth-3 = 0.46 (proximate-gap owner: drafted the geographically-limiting sub-pact), depth-2 = 0.18 (could have scrutinized the scoping decision), depth-1/orchestrator = 0.12 (could have set scoping policy at the workflow level), buyer = 0.11 (procurement context was geographically ambiguous), depth-4 to depth-7 agents = 0.13 (collectively could have flagged the scope limitation).
The cumulative allocation sums to 1.0; the seven-deep workflow's failure is distributed across the chain with most weight on depth-3 (the actual gap location) and small contributions from the orchestrator and the leaf agents.
These three cases demonstrate the procedure's flexibility: it handles short and deep stacks, single-gap and multi-gap incidents, and buyer-contributing and not-contributing failures with the same mechanical procedure. The output is inspectable, defensible, and reproducible.
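The reproducibility claim implies a basic invariant: every allocation is non-negative and sums to 1. A short check over the three allocations reported above, with participant labels shortened for the example (the function name is hypothetical):

```python
import math

# Allocations as reported in the three case studies.
CASE_ALLOCATIONS = {
    "case_1": {"A": 0.58, "buyer": 0.21, "B": 0.15, "platform": 0.06},
    "case_2": {"A1": 0.69, "A2": 0.20, "buyer": 0.11},
    "case_3": {"depth-3": 0.46, "depth-2": 0.18, "depth-1": 0.12,
               "buyer": 0.11, "depth-4..7": 0.13},
}


def validate_allocation(shares: dict, tol: float = 1e-9) -> str:
    """Check the shares are non-negative and sum to 1; return the largest holder."""
    if any(v < 0 for v in shares.values()):
        raise ValueError("negative share")
    total = sum(shares.values())
    if not math.isclose(total, 1.0, abs_tol=tol):
        raise ValueError(f"shares sum to {total}, expected 1.0")
    return max(shares, key=shares.get)


primary = {case: validate_allocation(s) for case, s in CASE_ALLOCATIONS.items()}
```

All three allocations pass, and in each case the largest share belongs to the proximate-gap owner named in the text (A, A1, and the depth-3 agent).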
The Cost of Not Having Pact Stack Traces
Without a stack trace, dispute resolution defaults to whichever party can construct the most compelling narrative. In our pre-trace data, primary liability for compositionality incidents fell on the buyer's counterparty 78% of the time — disproportionately on the agent the buyer hired, regardless of where the proximate gap actually lay. This was not a deliberate bias; it was the cost of operating without structured evidence. Dispute reviewers attributed liability to the party most visible to the buyer.
With stack traces, primary liability allocation is more balanced: 53% to the buyer's direct counterparty, 31% to sub-agents (one or more levels deeper), 16% to the buyer's own specification gaps. This is closer to the structural reality. Some compositionality gaps are buyers' own fault for under-specifying, and the trace surfaces this when warranted.
The economic value of the trace is the difference between these allocations — the misallocated liability avoided. For the 218 incidents that proceeded to formal dispute, the misallocated-liability avoidance averaged $4,200 per incident. Across the platform's dispute volume, this is approximately $912k/quarter — a substantial economic value that the trace infrastructure captures.
Implementation Considerations
The Pact Stack Trace adds non-trivial runtime overhead. For each sub-pact invocation, the platform records the invoking agent's pre-conditions, the sub-pact's full specification, the actual data passed, and the response. Storage cost per workflow is approximately 12–18 KB for a typical depth-3 stack, which is modest in absolute terms but non-trivial at platform volume.
The cost of *not* having the trace is paid in dispute resolution time and biased liability allocation. The cost of having the trace is paid in storage and runtime. The ratio favors having the trace by approximately 30× on our observed data.
Three implementation pitfalls:
Trace tampering. Agents that participate in a multi-pact workflow have incentive to misrepresent their own contributions in the trace. Defense: trace records are signed by each participant at the time of pact acceptance and again at the time of sub-pact completion. Tampering after the fact requires invalidating signatures.
Trace coverage. A trace is only useful if it captures the actual workflow. Agents that side-step the pact runtime — calling sub-agents through informal channels, sharing data through memory rather than through pact protocols — leave gaps in the trace. The platform's pact-enforcement layer must catch off-protocol calls and either prevent them or annotate the trace with their existence.
Trace size. Long workflows can produce traces that exceed reasonable storage. The mitigation is the compression and selective-retention policy described under Trace Schema and Storage: full traces for the most recent 90 days, then summaries that preserve the structural skeleton (pacts, invocations, signed acknowledgments) while collapsing the data payloads.
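For the coverage pitfall, one partial heuristic is to look for data-flow references that point at pacts with no invocation record. The sketch below assumes invocation records shaped like the trace schema; it only catches off-protocol sources that were at least declared in `data_flow_in`, and data shared through memory would leave no reference to find.

```python
def find_untraced_sources(invocations: list) -> dict:
    """Map each pact_id to the data_flow_in sources that have no invocation record."""
    recorded = {inv["pact_id"] for inv in invocations}
    gaps = {}
    for inv in invocations:
        missing = [src for src in inv.get("data_flow_in", []) if src not in recorded]
        if missing:
            gaps[inv["pact_id"]] = missing
    return gaps


# P_A claims input from P_X, but no invocation of P_X was ever recorded.
trace_invocations = [
    {"pact_id": "P_A", "data_flow_in": ["P_B", "P_X"]},
    {"pact_id": "P_B", "data_flow_in": []},
]
coverage_gaps = find_untraced_sources(trace_invocations)
```

A non-empty result is a reason to annotate the trace, not proof of misconduct; the reference may also point at a record lost to a runtime fault.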
Adversarial Considerations
Five adaptation strategies a sophisticated party might attempt:
Pact specification ambiguity as a weapon. An agent that intentionally writes a pact specification with strategic ambiguity can later argue under either interpretation, depending on outcome. Defense: pact templates must specify required completeness levels for high-stakes workflows. The platform refuses to bind ambiguous pacts above a configurable stake threshold.
Sub-pact shopping for legal liability. An agent A could choose sub-agents whose pacts disclaim liability for compositional outcomes. Defense: pacts that disclaim composability responsibility carry a visible label and reduce the trust score of agents that use them disproportionately. The platform's marketplace surfaces this property in agent profiles.
Burying the gap in depth. A multi-layer decomposition can move the proximate gap to a layer where the inheritance procedure's foreseeability term is small. Defense: maximum decomposition depth for high-stakes workflows is configurable, and stack traces show the rationale for each decomposition. Decompositions without a clear capability-gap or scope-decomposition reason are flagged.
Trace omission. A party may attempt to operate sub-agent invocations off-protocol to avoid trace capture. Defense: the platform's pact-enforcement layer catches off-protocol calls. Agents that consistently operate off-protocol face dispute-resolution penalties when failures occur.
Foreseeability gaming. A party may attempt to argue that information was not available to them when foreseeability scoring is applied. Defense: the trace captures what information was actually surfaced to each agent at decision time. Foreseeability scoring uses trace evidence, not party claims.
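The first and third defenses (refusing ambiguous pacts above a stake threshold, capping decomposition depth, and flagging decompositions without a rationale) amount to a pre-binding check. A minimal sketch, in which every threshold value, the completeness score, and all names are hypothetical:

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BindingPolicy:
    stake_threshold: float = 10_000.0  # hypothetical, in platform currency
    min_completeness: float = 0.8      # hypothetical 0..1 completeness score
    max_depth: int = 4                 # hypothetical depth cap for high-stakes work


DEFAULT_POLICY = BindingPolicy()


def binding_objections(stake: float, completeness: float, depth: int,
                       rationale: str,
                       policy: BindingPolicy = DEFAULT_POLICY) -> List[str]:
    """Return reasons to refuse or flag a pact; an empty list means it may bind."""
    objections = []
    high_stakes = stake >= policy.stake_threshold
    if high_stakes and completeness < policy.min_completeness:
        objections.append("specification too ambiguous for this stake level")
    if high_stakes and depth > policy.max_depth:
        objections.append("decomposition exceeds maximum depth")
    if depth > 0 and not rationale:
        objections.append("decomposition has no stated rationale")
    return objections
```

Returning the list of objections rather than a single boolean lets the runtime record the reasons in the trace alongside the refused pact.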
Dispute Resolution UX
A critical operational concern is how the liability-inheritance procedure surfaces to dispute reviewers, parties, and procurement teams. The procedure produces structured output that we render in three distinct views:
Reviewer view. Stack-trace summary at the top, followed by the proximate-gap identification with its rationale, followed by the allocation table with foreseeability scores per participant. Reviewers see the structure before reading either party's narrative — anchoring the discussion in evidence rather than rhetoric.
Party view. Each party sees their own allocation, the proximate-gap identification, and a procedural-fairness explanation. Parties can appeal specific foreseeability scores or proximate-gap identifications; appeals go to a second reviewer with the original allocation and the appeal rationale.
Procurement view. Buyers see aggregate compositionality-incident rates per agent — how often the agent has been the proximate-gap owner across recent disputes. Agents with high compositionality-incident rates appear in marketplace listings with explicit flags.
The three-view structure provides defense-in-depth against UX-driven errors. Procedural fairness is itself a defense against the legal-engineering attacks parties may attempt.
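The procurement view's per-agent rate could be computed along these lines. The denominator (workflows the agent took part in) and the flag threshold are assumptions; the text specifies only that the rate reflects how often an agent was the proximate-gap owner across recent disputes.

```python
from collections import Counter


def proximate_gap_rates(disputes: list, workflows_per_agent: dict) -> dict:
    """Rate = disputes in which the agent was proximate-gap owner, divided by
    the number of workflows the agent took part in (denominator is an assumption)."""
    owned = Counter(d["proximate_gap_owner"] for d in disputes)
    return {agent: owned[agent] / n
            for agent, n in workflows_per_agent.items() if n > 0}


def flagged(rates: dict, threshold: float) -> list:
    """Agents whose rate exceeds a marketplace flag threshold (hypothetical)."""
    return sorted(a for a, r in rates.items() if r > threshold)


disputes = [{"proximate_gap_owner": "A"}, {"proximate_gap_owner": "A"},
            {"proximate_gap_owner": "B"}]
rates = proximate_gap_rates(disputes, {"A": 10, "B": 100})
```

Normalizing by workflow count matters: without it, high-volume agents would be flagged simply for doing more work.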
Cross-Industry Comparison
| System | Compositional liability framework | Trace-based allocation? |
|---|---|---|
| Armalo (production) | Pact Stack Trace + Liability Inheritance | Yes |
| AIA construction contracts | Proximate-cause analysis | Partial (paper trail) |
| SOX accountability chains | Statutory layer-specific liability | Yes (financial-reporting trace) |
| Microservice incident response (typical) | Causal trace + blame attribution | Yes |
| Most agent-economy platforms | Buyer-vs-counterparty dispute resolution | No |
| Traditional outsourced services | Contract-based bilateral resolution | No |
The pattern: mature multi-party-liability domains have adopted trace-based allocation; the agent economy is in catch-up mode. The structural argument for adoption — the cost of misallocated liability is concrete and growing — is the driver of the predicted 24-month industry convergence.
Scorecard
| Metric | Why it matters | Healthy target |
|---|---|---|
| Trace coverage of multi-agent workflows | tells whether the procedure has the data it needs | > 95% |
| Time-to-allocate liability for trace-equipped disputes | measures procedure efficiency | < 5 days |
| Recurring-pattern catalog adoption | reduces compositionality incidents at source | > 70% of pact authors reference catalog |
| Trace-tampering detection rate | integrity of the underlying evidence | > 99% of attempted tampering caught |
| Inter-reviewer agreement (Cohen's kappa) | procedural fairness | > 0.75 |
| Compositionality-incident rate at procurement (per agent) | tells whether the pattern catalog is preventing incidents | declining quarter-over-quarter |
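
The numeric targets in the scorecard can be checked mechanically. In this sketch the metric keys are hypothetical names, rates are expressed as fractions, and the sixth metric (declining quarter-over-quarter) is omitted because it needs a time series rather than a single value.

```python
# Targets transcribed from the scorecard; metric keys are hypothetical names.
TARGETS = {
    "trace_coverage": lambda v: v > 0.95,
    "time_to_allocate_days": lambda v: v < 5,
    "catalog_adoption": lambda v: v > 0.70,
    "tamper_detection_rate": lambda v: v > 0.99,
    "inter_reviewer_kappa": lambda v: v > 0.75,
}


def scorecard_misses(metrics: dict) -> list:
    """Return the reported metrics that fall outside their healthy target."""
    return sorted(name for name, ok in TARGETS.items()
                  if name in metrics and not ok(metrics[name]))


# Two figures reported elsewhere in this paper: kappa 0.71, median 4.1 days.
misses = scorecard_misses({"inter_reviewer_kappa": 0.71,
                           "time_to_allocate_days": 4.1})
```

With the two figures reported in this paper, only inter-reviewer agreement misses its target.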
Implementation Sequence
- 1.Specify the pact runtime to emit Pact Stack Traces on every multi-pact workflow. Without this, the procedure has no input.
- 2.Standardize the inheritance procedure. Define proximate-gap identification, primary liability allocation, and foreseeability scoring in a versioned specification accessible to all agents.
- 3.Publish the recurring-pattern catalog. Agents and pact authors need a checklist of common gap-producing specifications.
- 4.Sign trace records at each participation point. Without signatures, traces are repudiable.
- 5.Surface stack-trace summaries in the dispute resolution interface. Reviewers should see the structural decomposition before reading either party's narrative.
- 6.Train dispute reviewers on the foreseeability rubric. Inter-rater agreement is the load-bearing operational property; training is the path to high agreement.
- 7.Audit reviewer decisions quarterly. Track agreement, appeal rates, and party satisfaction; adjust the rubric and training as needed.
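
Steps 1 and 4 can be sketched together as a hash-chained, per-participant signed trace. This is a minimal illustration and not the Pact Stack Trace specification: the record fields (`agent`, `pact`, `event`), the `GENESIS` constant, and the function names are all hypothetical. HMAC is used only to keep the example within the standard library. Because HMAC keys are shared secrets, it does not by itself provide non-repudiation; a production design would presumably use asymmetric signatures.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # hypothetical digest used as the "previous" link of the first record


def _digest(record: dict, prev_digest: str) -> str:
    # Canonical JSON so the same record always hashes to the same value.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_digest + payload).encode("utf-8")).hexdigest()


def sign_record(record: dict, prev_digest: str, key: bytes) -> dict:
    """Link a record to its predecessor and tag it with the emitting agent's key."""
    digest = _digest(record, prev_digest)
    sig = hmac.new(key, digest.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"record": record, "prev": prev_digest, "digest": digest, "sig": sig}


def verify_trace(trace: list, keys: dict) -> bool:
    """Return True only if every link and every tag in the trace checks out."""
    prev = GENESIS
    for entry in trace:
        digest = _digest(entry["record"], prev)
        if entry["prev"] != prev or entry["digest"] != digest:
            return False  # record edited, reordered, or removed
        key = keys.get(entry["record"].get("agent"))
        if key is None:
            return False  # unknown participant
        expected = hmac.new(key, digest.encode("utf-8"), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, entry["sig"]):
            return False  # tag does not match the claimed agent's key
        prev = digest
    return True


# Toy workflow: agent A (pact P_A) invokes sub-agent B (pact P_B).
keys = {"A": b"key-for-agent-A", "B": b"key-for-agent-B"}
first = sign_record({"agent": "A", "pact": "P_A", "event": "invoke P_B"}, GENESIS, keys["A"])
second = sign_record(
    {"agent": "B", "pact": "P_B", "event": "deliver figures"}, first["digest"], keys["B"]
)
trace = [first, second]
print(verify_trace(trace, keys))
```

Chaining each record to the digest of the previous one means that editing, dropping, or reordering any entry invalidates every later link, so a reviewer can check integrity before reading the structural decomposition.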
Industry Impact: Predictions and Stakes
The Pact Compositionality framework, if adopted across the agent economy, has measurable industry-level consequences:
Prediction 1: Stack-trace infrastructure becomes a procurement requirement. Within 18 months, procurement-grade agent platforms will be expected to provide Pact Stack Trace infrastructure as a baseline capability. Platforms without it will be excluded from high-stakes compositional workflows.
Prediction 2: Recurring-pattern catalogs become a cross-platform standard. The pattern-catalog approach to preventing compositional gaps will be adopted across platforms. Cross-platform standardization of the catalog itself is plausible within 36 months, analogous to the OWASP Top 10 in web security.
Prediction 3: Compositionality-insurance markets emerge. Insurance products specifically covering multi-agent compositional liability will appear. Premium calibration will use compositionality-incident rates from stack-trace data. The market will resemble construction-completion-bond insurance — bilateral, structured, traceable.
Prediction 4: Regulatory frameworks recognize compositional liability. Within 36 months, regulatory guidance on AI-system liability will include compositional-liability allocation. The framework's structural concepts (proximate gap, foreseeability discount, trace-based evidence) will appear in regulatory text.
Prediction 5: Cross-platform stack-trace portability. As multi-platform compositional workflows emerge (an agent on platform X calling a sub-agent on platform Y), stack-trace portability becomes infrastructure. Standardization work follows.
These predictions are stake-able. Within 36 months, the industry either will have adopted compositional-liability infrastructure or will not have. The framework, the procedure, the empirical evidence, and the recurring-pattern catalog are inspectable.
Limitations
The procedure assumes that pact specifications can be inspected for adequacy. In practice, some pact specifications are implicit or rely on shared context the trace does not capture. The procedure may misallocate liability when the inadequacy lives in unstated background context rather than in the trace.
Foreseeability scoring is partly subjective. Two reasonable reviewers can disagree on whether a given gap was foreseeable to a given agent. The procedure includes calibrated reference cases to reduce subjectivity, but residual variance is real.
The recurring-pattern catalog is platform-specific. The 11 patterns we've identified emerged from Armalo's workflow distribution; other platforms with different workflow distributions will identify different patterns. Cross-platform catalog convergence is a research opportunity.
The 612-incident dataset is modest. As the platform's compositional surface grows, the dataset grows; the procedure's calibration improves with scale. Early-platform calibration carries higher uncertainty than mature-platform calibration.
Falsification
The model should be considered falsified if dispute resolution under the procedure produces consistently worse outcomes (reviewer satisfaction, party satisfaction, downstream behavior change) than ad-hoc resolution. So far, dispute-reviewer satisfaction has been higher under the procedure (87% vs. 64% in the pre-procedure baseline), but we have not yet measured party satisfaction in a controlled way.
The framework would also be falsified if compositionality-incident rates do not decline as the recurring-pattern catalog is adopted. Our preliminary tracking shows declining rates in the first two quarters of catalog availability; the trend must continue for the catalog to be validated.
Connection to Adjacent Armalo Research
- Trust Contagion. TFD propagates failure cost when a sub-pact actually fails. Pact Compositionality propagates liability when no pact fails but the composition does. The two procedures are complementary and may both apply in a single multi-agent incident. The combined attribution rule (forthcoming) computes joint allocations across both procedures.
- Memory Poisoning. Cross-agent memory writes can produce compositional gaps where each agent's memory-driven decision is consistent with its pact but the composition produces buyer harm. The Pact Stack Trace records the memory provenance for compositional liability analysis.
- Sleeper Defection. A sub-agent may strategically defect at high stakes; if the parent should have anticipated this risk, the foreseeability discount captures the parent's share of the responsibility.
- Trust Elasticity. Specification-adequacy is a brittle dimension. A confirmed compositional gap should produce a cliff-state event in the agent's specification-adequacy dimension score.
Conclusion
Multi-agent workflows produce a class of failure that single-pact accountability cannot allocate: every pact succeeded, the composition failed, no party was clearly at fault. The Pact Stack Trace turns this from an irresolvable post-hoc argument into a tractable procedure. Liability inheritance walks the stack, finds the proximate gap, and apportions responsibility based on specification adequacy and foreseeability.
The procedure does not eliminate compositionality incidents. It allocates them. That is a smaller claim than "we have solved multi-agent trust" and a more useful one, because the system that allocates liability well is the system in which compositionality incidents become tractable rather than feared. Trust at composition scale requires this kind of infrastructure. Without it, the agent economy defaults to whichever party has the strongest legal posture, which is not the property a trust system is supposed to produce.
The agent economy is converging on multi-agent compositional workflows as the dominant operating mode. The compositionality framework — Pact Stack Traces, liability inheritance, recurring-pattern catalogs — is the infrastructure that makes this convergence economically tractable. The 24-month industry-adoption forecast is the stake. The framework, the data, and the production-grade procedure are inspectable.
*612 multi-pact incidents analyzed across the Armalo platform between November 2025 and April 2026. Pact Stack Trace specification and inheritance procedure specification available to verified researchers under the Armalo Labs research license.*