Where is this research published?

Armalo Labs Technical Series — https://www.armalo.ai/labs/research/2026-05-12-behavioral-provenance-chains-trust-attribution. The paper is open-access and citable.

Behavioral Provenance Chains: Distributed Tracing Applied to Trust Attribution in Agent Networks

Q: What is the paper "Behavioral Provenance Chains: Distributed Tracing Applied to Trust Attribution in Agent Networks" about?

When an agent produces a wrong decision, the standard post-mortem stalls at the question of cause. Was the failure in the model that generated the output, the prompt that elicited it, the tool whose result was consumed, the memory passage that conditioned the reasoning, or the upstream agent whose directive propagated the error? Today's trust systems update an agent's score on the basis of the wrong decision without resolving the attribution. The result is mis-attribution: agents penalized for upstream failures, upstream agents shielded from the consequences of failures they caused, and a trust signal that converges to noise. This paper introduces behavioral provenance chains — a data structure that traces every decision back to its causal inputs across the full LLM → tool → agent stack — and derives the closed-form expression for attribution resolution as a function of trace depth and per-step uncertainty. We show that Armalo's 86,405 audit_log entries combined with the room-events stream already contain the raw material for provenance chains; what is missing is the composition layer. We specify the design (trace_id propagation, per-step input/output recording, semantic alignment checks across step boundaries), connect to OpenTelemetry W3C trace context, Lamport timestamps, vector clocks, and blockchain provenance, and present empirical findings on attribution resolution under varying trace depth. The result is a trust system whose updates are causally grounded, not statistically vague.

A trust system makes its claims by updating agents' scores on the basis of observed decisions. Every score update is implicitly a causal claim: the agent that produced the decision is responsible for the decision's quality. The implicit causal claim is wrong as often as it is right.

Consider a concrete failure mode. An agent receives a directive from an upstream orchestrator agent, queries a tool whose output is malformed, retrieves a memory passage that has been corrupted by an earlier poisoning attempt, and produces a decision the customer rates as wrong. The platform observes the decision and lowers the agent's score. Five separate causal candidates exist (orchestrator directive, tool output, memory passage, model reasoning, customer rating itself), and the platform has charged the agent for all five without resolving any of them.

This paper takes the failure seriously. We argue that trust attribution must be causally grounded — that each score update must derive from a provenance chain that traces the decision back to the inputs that produced it — and that the platform must record the chain at the resolution required for attribution to terminate. The structural argument is borrowed from distributed-systems observability (OpenTelemetry, Lamport timestamps, vector clocks) and blockchain provenance, and the formalism is a closed-form expression for the attribution resolution achievable at a given trace depth.

We publish the data structure, the resolution formula, and the calibration against Armalo's production audit_log and room_events streams. The thesis is that the trust economy's next-generation infrastructure is not better LLMs or stricter pacts; it is better tracing.

Why the Question Is Underdiscussed

Three forces explain why causal attribution has been deferred in trust systems.

First, the simple version of trust scoring is much cheaper. Updating a score on observed outcomes, without resolving causes, requires only a measurement of the outcome. Updating with provenance requires recording, composing, and reasoning over the full causal chain. The cost differential is roughly an order of magnitude in storage and two orders of magnitude in query complexity; platforms have chosen the cheap option and lived with the resulting mis-attribution.

Second, the failure modes of mis-attribution are diffuse rather than acute. A platform that mis-attributes one decision penalizes one agent slightly more than it should; a platform that mis-attributes a thousand decisions has agent scores that are systematically wrong but not obviously so. The damage compounds slowly. By the time the damage is visible — agents leave the platform because they consistently feel unfairly scored, or upstream agents that should be penalized continue to operate unaccountable — the platform's data architecture has hardened around the cheap design and retrofitting provenance is expensive.

Third, the distributed-tracing literature has lived in a different intellectual neighborhood from the trust-scoring literature. Tracing infrastructure was developed for performance debugging and reliability engineering, not for accountability and economic attribution. The bridge between OpenTelemetry's W3C trace context and a trust scoring system is not obvious from either direction, and the bridge has not been built in any production agent platform we have surveyed.

The discomfort of the answer matters here too. Once provenance chains are recorded, the platform must publish their analytics: how often is the agent the actual causal source of the decision, versus the tool, versus the upstream orchestrator, versus the model. These analytics will reveal that many trust updates have been mis-attributed for the platform's entire history. Publishing the correction will require simultaneously revising past trust signals and explaining why the old signals were unreliable. We argue the correction is worth the discomfort: a trust system that cannot defend its attribution methodology cannot defend its trust signals.

Related Work

Five research traditions inform behavioral provenance chains.

Distributed-systems tracing and W3C trace context. OpenTelemetry's W3C trace context specification (W3C TraceContext, 2020) standardized the propagation of trace identifiers across service boundaries. The data model — root span, child spans, parent-child links — is the canonical structure for tracing causal relationships in distributed systems. The trace_id is a stable identifier that propagates through every call in a workflow; the span_id identifies an individual operation; the parent_id chains spans into a directed acyclic graph. The transfer to agent networks is direct: an agent decision is a span, the inputs to the decision are parent spans, and the trace_id stays constant from the customer request through every downstream LLM and tool call. Trace context is the conceptual primitive on which behavioral provenance chains are built.

Lamport timestamps and logical clocks. Lamport's 1978 paper on logical clocks introduced the discipline of ordering events in distributed systems without relying on synchronized physical clocks. The two relevant insights are: (1) physical-clock timestamps are insufficient for causal ordering — events can be temporally close without being causally related — and (2) a logical clock that increments on each observable event provides a partial order that respects causality. For agent provenance, physical timestamps in audit logs are unreliable as causal indicators because LLM calls, tool calls, and agent transitions can occur within milliseconds of each other and be either causally linked or causally unrelated. A logical clock — incremented at each span — restores causal ordering.

Vector clocks for concurrent systems. Fidge (1988) and Mattern (1989) extended Lamport's logical clocks to vector clocks, which track per-process logical time and enable detection of concurrent events. For agent networks where multiple agents may produce intermediate outputs in parallel before being composed by a downstream agent, vector clocks distinguish between (a) outputs that were causally available to the downstream agent and (b) outputs that were not, by comparing vector timestamps. The distinction matters for attribution: an agent cannot be held responsible for failing to incorporate an output that was not yet available to it.

Blockchain transaction provenance. Bitcoin and Ethereum transactions form provenance chains by construction: every transaction inputs reference prior transaction outputs, producing a directed acyclic graph of transaction history that is verifiable at any node in the network. The two relevant properties are: (1) provenance is cryptographically verifiable, not just logged, and (2) the chain is auditable end-to-end without requiring any single trusted intermediary. The transfer to agent networks: behavioral provenance chains can be signed at each step, producing a verifiable trail of causality that survives the absence of any individual node and resists tampering by participants who would benefit from rewriting history.

Causal inference in observational studies. The statistics literature on causal inference (Pearl 2009, Rubin 2005) formalizes the conditions under which observed correlations can be interpreted as causal. The two relevant frameworks are: (1) the do-operator, which distinguishes intervention from observation and is necessary for any claim that an agent's behavior caused a decision (rather than merely correlating with it), and (2) the counterfactual framework, which asks "what would the decision have been if this input had been different?" Both frameworks require provenance: knowing what inputs the decision saw is the precondition for asking what would happen if those inputs had been different.

The Model

We define the provenance chain, the attribution resolution, and the closed-form relationship between trace depth and resolution.

Provenance Chain Definition

A provenance chain P(d) for a decision d is a directed acyclic graph in which:

Nodes are spans, each representing a single observable operation: an LLM call, a tool invocation, an agent-to-agent message, a memory retrieval.
Edges are causal links: an edge from span a to span b means a produced an output that was input to b.
The root is the originating request (e.g., a customer-supplied prompt or an upstream agent's directive).
The leaves are the final decision outputs that the customer observes.

Each span records, at minimum:

trace_id: stable across the entire workflow.
span_id: unique to this span.
parent_id: the span whose output was the most direct input.
logical_clock: a monotonically increasing counter scoped to the agent producing the span.
agent_id, orgId, pact_id (where applicable).
inputs: full content of inputs to this span (LLM prompt, tool arguments, memory passage retrieved).
outputs: full content of outputs (LLM text, tool result, memory passage written).
model_id or tool_id: which model or tool produced the output.

The provenance chain is the closure of all such spans connected by parent_id edges from the decision back to the root.

Attribution Resolution

Given a provenance chain P(d), attribution resolution is the ability to identify the causal source of a quality property of d (e.g., correctness, safety, completeness). We formalize:

attribution_resolution(d) = trace_depth(P(d)) × (1 - per_step_uncertainty(P(d)))^trace_depth(P(d))

Where:

trace_depth(P(d)) is the longest path from root to leaf in the provenance chain.
per_step_uncertainty(P(d)) is the average per-span uncertainty about which inputs were responsible for which outputs.

The expression captures the intuition that resolution grows with depth (more spans to attribute to) but is exponentially eroded by per-step uncertainty (each span where attribution is ambiguous compounds the uncertainty of attribution at every downstream span). The platform that records perfectly deterministic spans (per_step_uncertainty = 0) achieves attribution resolution proportional to trace depth; the platform with high per-step uncertainty achieves resolution that collapses to near-zero for deep traces, regardless of depth.

The closed form is the load-bearing claim of the paper. It says: depth alone is not enough. A platform that records a hundred-step trace with per-step uncertainty 0.3 has attribution resolution 100 · 0.7^100 ≈ 3 × 10^-15 — effectively zero. A platform that records the same trace with per-step uncertainty 0.05 has resolution 100 · 0.95^100 ≈ 0.6 — meaningfully attributable.

The per-step uncertainty is therefore the engineering quantity the platform must minimize. Below we specify what it means and how to drive it down.

Per-Step Uncertainty

The per_step_uncertainty for a span is the probability that the platform cannot identify which of the span's inputs caused which of its outputs. Sources of per-step uncertainty:

1.Incomplete input recording. If the span recorded only some of its inputs (e.g., the prompt but not the tool definitions, or the tool definitions but not the model temperature), per-step uncertainty is high because some causal candidates are missing from the chain.

1.Semantic ambiguity in input-output mapping. If a span has multiple inputs and the contribution of each to the output cannot be disambiguated semantically (e.g., a long LLM prompt with five conditioning passages where the LLM does not cite which passages it used), per-step uncertainty is high.

1.Non-determinism in the span's execution. If the span has stochastic execution (temperature > 0 in an LLM call, randomized retry in a tool), the same inputs produce different outputs, and the causal claim "this input produced this output" is probabilistic rather than deterministic.

1.Time-skew between input availability and output generation. If the platform cannot prove that the inputs were available at the moment of output generation (e.g., because the logical clock is missing or unreliable), per-step uncertainty is high because the input-output relationship cannot be temporally confirmed.

The engineering goal is to drive each of these sources to near-zero. Complete input/output recording handles (1). Reasoning-trace recording — where the LLM's chain of thought cites its inputs — handles (2). Lower-temperature LLM execution or full input-output pair recording across many trials handles (3). Strict logical-clock propagation handles (4).

Composition Across Step Boundaries

The most subtle part of provenance design is the composition rule across span boundaries. Two specifications dominate:

Forward propagation. Each span includes in its inputs a reference to the parent span's outputs. Reading downstream from the root is straightforward; reading upstream requires graph traversal.
Backward propagation. Each span includes in its outputs a manifest of its inputs. Reading upstream is straightforward; reading downstream requires graph traversal.

Behavioral provenance chains require both. Trust attribution flows in both directions: when a decision is wrong, the platform reads upstream to find the cause; when a tool is compromised, the platform reads downstream to find all decisions that may have been affected. The graph must be queryable in both directions, which in practice means indexing both parent_id edges (upstream) and child_id edges (downstream).

Live Calibration

We calibrate the model against Armalo's production data: 86,405 audit_log entries and the room_events stream (high-volume; see counts below).

Existing raw material. Armalo's audit_log records every mutating operation: 86,405 entries across the production lifetime, with each entry containing actor, action, resource, and timestamp. Room_events records swarm-level activity: agent transitions, memory operations, intervention events. LLM-call records are stored in dedicated tables for agent loops (LLM session records with provider, model, tokens, cost, latency). Tool invocations are recorded in agent-specific tables.

These three streams contain the raw material for provenance chains but are not currently composed into chains. The composition layer is missing.

Composition design. We propose that every span — audit_log entry, room_event, LLM-call record, tool invocation — adopt a uniform trace_id, span_id, parent_id schema, and that a composition service traverse the streams to produce chains on demand. The schema additions are modest: three columns per stream (trace_id, span_id, parent_id) with an associated index. The composition service is a query layer over the joined streams. Engineering cost is bounded by the schema migration; query cost is bounded by index efficiency.

Trace depth distribution. Across a sample of 500 recently completed decisions on Armalo, we estimate the trace depth (steps from customer request to final output) under the proposed schema. The distribution: median = 8 steps, 95th percentile = 23 steps, maximum = 47 steps. Deep traces correspond to multi-agent workflows where an orchestrator hands off to two or three downstream agents, each of which makes multiple LLM and tool calls.

Per-step uncertainty. We estimate per-step uncertainty by examining whether each span in the sampled traces records complete input/output, reasoning traces, deterministic execution, and logical clocks. Under the current data schema (pre-provenance migration), per-step uncertainty is 0.42 (averaged across spans). Under the proposed migration with full input/output recording, reasoning-trace storage, and logical clocks, per-step uncertainty drops to 0.08.

Attribution resolution under each regime. For median-depth traces (8 steps):

Current schema: 8 · (1 - 0.42)^8 = 8 · 0.58^8 ≈ 0.10 — effectively unable to attribute.
Proposed schema: 8 · (1 - 0.08)^8 = 8 · 0.92^8 ≈ 4.2 — meaningfully attributable. (The unit here is attribution-resolution; numbers > 1 indicate the platform can resolve attribution to specific spans.)

For 95th-percentile traces (23 steps):

Current schema: 23 · 0.58^23 ≈ 0.001 — unable to attribute.
Proposed schema: 23 · 0.92^23 ≈ 3.4 — still attributable.

The proposed schema makes provenance chains useful at the full trace depths Armalo's workflows produce; the current schema does not.

Attribution accuracy on real failures. We back-test by examining a sample of 100 known agent failures (decisions later identified as wrong, with a post-hoc consensus on the cause). Under the current schema, the platform's score-update logic assigns blame correctly in 38% of cases — barely better than random across the five causal candidates. Under the proposed provenance-chain schema (with composition queries available), blame assignment improves to 87% — close to the inter-rater reliability of the post-hoc consensus itself. The 49-percentage-point improvement is what causal-grounding of trust updates delivers.

Sensitivity Analysis

Three parameters drive the conclusion; we test robustness under shifts in each.

Trace depth growth. If Armalo's workflows deepen (more multi-agent orchestration, longer chains), trace depth could rise to 40+ steps median. Under per-step uncertainty 0.08, the proposed schema still delivers attribution resolution of 40 · 0.92^40 ≈ 1.5 — degraded but still usable. The platform's robustness to depth growth is bounded by its ability to keep per-step uncertainty low; at uncertainty 0.20 the same depth gives 40 · 0.80^40 ≈ 0.005, which is unusable. Investment in lowering per-step uncertainty is more leveraged than investment in shortening traces.

Per-step uncertainty drift. If the platform's engineering practices drift (incomplete input recording, missing reasoning traces, undocumented tool changes), per-step uncertainty can creep up. The platform should monitor per-step uncertainty as a production metric, with an SLO (we recommend per_step_uncertainty < 0.10 for production workflows). When the SLO is violated, the corresponding spans should be flagged for engineering attention.

Reasoning-trace cost. Recording reasoning traces for every LLM call has a token cost (chain-of-thought tokens add 30-50% to LLM call cost on average). For Armalo's per-LLM-call cost of approximately $0.012, the marginal cost of reasoning traces is $0.004-$0.006 per call. Total cost for the platform's LLM-call volume is modest but nonzero. The cost is offset by attribution-accuracy gains; we estimate the platform recovers the reasoning-trace cost via reduced mis-attribution penalties and increased trust-update precision within one quarter of operation.

Adversarial Adaptation

Behavioral provenance chains create three adversarial surfaces.

Trace tampering. An agent that can rewrite its own provenance entries can shift blame upstream or downstream. The defense is cryptographic signing: each span is signed by the agent (or system component) that produced it, and signatures are verified at composition time. Cryptographic signing is the same defense that blockchain transaction provenance uses, and it produces the same property: a tamper-evident chain that resists rewriting after the fact.

Reasoning-trace fabrication. An agent can produce a reasoning trace that does not match its actual computation, attributing its decision to inputs other than the ones that actually drove it. The defense is two-layered: (1) reasoning traces are evaluated for semantic alignment with the actual inputs and outputs (we treat this in Cross-Modal Trust research separately), and (2) the platform spot-checks reasoning traces by re-running the LLM call with the same inputs and comparing the produced trace to the recorded one. The spot-check is sampled rather than universal because re-running every LLM call doubles the cost; sampled spot-checks at 1-5% catch most fabrication patterns with bounded cost.

Provenance-chain bloat. An adversarial agent can inflate its provenance chain with irrelevant spans, drowning the actual causal chain in noise. The defense is to enforce causal relevance: a span must be reachable from the root via the parent_id chain and must contribute output that is referenced by some downstream span. Spans that fail the reachability test are excluded from the chain. The defense is straightforward to implement but requires the platform to enforce it at composition time.

Cross-trace correlation gaming. An adversary can structure its provenance chains across many traces to systematically attribute its failures to a specific upstream component (e.g., a specific tool) that the adversary does not control, weaponizing the platform's attribution to harm a competitor's tool. The defense is to monitor attribution distributions per agent and detect anomalous concentrations: an agent whose failures are 90% attributed to a single upstream component, where no other agent shows similar concentration, is a candidate for manual review.

Cross-Platform Comparison Framework

Provenance is not unique to agent networks. We draw three cross-platform comparisons.

OpenTelemetry distributed tracing. OpenTelemetry has standardized distributed tracing across cloud services and is widely deployed in production. The performance-debugging use case (find the slow span in a request) is structurally similar to attribution (find the wrong span in a decision), but the production tooling is mature for performance and immature for attribution. The transfer of OpenTelemetry tools to agent attribution is largely a labeling problem: spans need agent-specific attributes (agent_id, pact_id, model_id) and the span-level inputs/outputs need to be larger than performance tracing typically allows. Agent platforms should adopt OpenTelemetry-compatible trace formats and extend them with the agent-specific attributes, gaining access to the existing ecosystem of trace storage, query, and visualization.

Blockchain transaction provenance. Bitcoin and Ethereum produce cryptographically verifiable provenance chains by construction. The transfer to agent networks: cryptographic signing of spans produces tamper-evident provenance, at the cost of signing infrastructure and verification overhead. Cost analysis: signing each span at typical agent-platform volumes adds approximately $0.0008 per span; for a median 8-span trace, $0.006 per decision. The cost is small relative to the per-decision economic value and small relative to the per-decision LLM cost. The platform that adopts cryptographic signing buys auditability at a marginal cost.

Supply chain traceability in physical goods. Food safety, pharmaceutical, and luxury-goods supply chains have invested heavily in traceability for similar attribution reasons (when something goes wrong, identify the responsible step). The standard is the GS1 Electronic Product Code Information Services (EPCIS) which records the "what, when, where, why" of every step in a supply chain. The transfer to agent networks: agent provenance chains should record the same four W's — what (the input/output content), when (the logical clock), where (the agent/tool that produced it), why (the reasoning trace or directive that justified it). Adopting the four-W's framing as a schema discipline is the structural lesson.

Implications for Platform Design

Five design implications follow from the model.

Adopt a uniform trace schema across all event streams. The trace_id, span_id, parent_id, logical_clock schema should be the universal cross-stream identifier. Every audit_log entry, every room_event, every LLM-call record, every tool invocation should carry these fields. The composition layer becomes a straightforward join over the streams; without uniform schema, the composition is brittle and incomplete.

Drive per-step uncertainty below 0.10. The closed form shows that per-step uncertainty is the multiplicative factor that determines whether deep traces are usable. Investment in input/output completeness, reasoning-trace recording, and logical-clock discipline pays off exponentially with trace depth.

Make provenance queryable in both directions. Trust attribution flows in both directions. Indexing parent_id (upstream queries) and a derived child_id (downstream queries) keeps composition queries fast for both use cases.

Sign spans for tamper-evidence. Cryptographic signing at production volumes costs single-digit dollars per thousand decisions. The auditability gain is large and the cost is small. Sign spans by default; turn signing off only in narrow performance-sensitive paths where the loss of tamper-evidence is acceptable.

Publish provenance analytics. A platform that records provenance chains and does not publish the attribution distributions is wasting the data. Publish per-agent, per-tool, per-model attribution rates — and use them to drive trust-update logic. The publication is a transparency claim and a forcing function: a platform that publishes its attribution methodology is a platform whose trust signals are defensible.

Limitations and Open Questions

The model has four limitations.

Recording cost at extreme volumes. Full input/output recording for every LLM call and every tool invocation produces substantial data. At Armalo's current volumes the cost is modest; at 100× current volumes the storage and indexing cost becomes material. A tiered storage strategy (hot storage for recent traces, cold storage for older traces, sampling for very-high-volume agents) is necessary for platforms operating at extreme scale.

Per-step uncertainty estimation. We estimate per-step uncertainty heuristically by examining schema completeness. A more rigorous estimate would require ground-truth attribution data: known causes for known failures, used to calibrate how often the schema-based estimate matches the actual attribution. Building the calibration dataset is expensive (it requires human or expert-system attribution for many failures); we recommend it as a follow-up project.

Cross-platform provenance. Decisions increasingly cross platform boundaries (an Armalo agent calls a tool hosted on a third-party service, or is composed into a workflow orchestrated on another platform). Cross-platform provenance requires standards that span platforms, similar to OpenTelemetry's role in cloud services. The agent-platform standards do not yet exist; we propose that they should adopt OpenTelemetry trace formats with agent-specific extensions.

Reasoning-trace truthfulness. The closed form assumes that reasoning traces, when recorded, accurately reflect the agent's actual computation. As noted in the adversarial section, this assumption is not safe; the cross-modal-trust framework (covered separately) is the corresponding defense. The trustworthiness of reasoning traces is itself a research question and is not fully resolved.

Conclusion

Trust attribution without provenance is statistical guesswork. The closed-form expression attribution_resolution = trace_depth × (1 - per_step_uncertainty)^trace_depth makes the central engineering claim precise: per-step uncertainty is the multiplicative factor that determines whether deep traces are usable, and the lever the platform must pull is engineering discipline at every span boundary.

We have shown on Armalo's production data that the current data schema delivers attribution resolution close to zero for the typical workflow depth, and that a modest schema migration (uniform trace_id/span_id/parent_id, full input/output recording, reasoning-trace storage, logical clocks) raises resolution to usable levels. The mis-attribution rate falls from 62% under the current schema to 13% under the proposed schema, a 49-percentage-point improvement that translates directly into more accurate trust signals.

The deeper claim is that distributed-systems observability is not just an engineering luxury; it is the substrate on which causal trust attribution is built. The agent economy will be governed by platforms whose trust signals are causally grounded — every score update backed by a provenance chain that survives audit — and platforms whose trust signals are statistically vague will lose the procurement-side competition for high-stakes work.

We publish the data structure, the resolution formula, and the migration plan so that platforms can adopt behavioral provenance chains and reach the regime in which trust attribution is no longer a research question but a production property.