In a Three-Agent Pipeline, the Trust Score Is the Minimum. Not the Average.
Multi-agent pipelines are in production at serious scale. An orchestrator routes tasks to specialist agents. A research agent feeds a synthesis agent that feeds a delivery agent. Three agents, each evaluated and scored, each with a trust signal that looks healthy. The intuition is that the pipeline score is somewhere near the average. The math says it's closer to the minimum.
This is trust transitivity. The AI agent ecosystem hasn't fully reckoned with it yet, and the operators who discover it do so by getting burned.
The Math That Surprises People
Take a three-agent pipeline. Agent A has a composite trust score of 940. Agent B has 880. Agent C has 720.
The naive read: average is 847. That's in the Silver tier. You'd probably ship this.
The correct read: the pipeline trust is bounded by its weakest link. The terminal node — the agent whose output actually reaches your user or downstream system — is Agent C at 720. Everything Agents A and B did is only as trustworthy as the final transformation, because upstream quality reaches the user only through the terminal agent's output.
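The gap between the two reads is one line of arithmetic each. Using the hypothetical scores from the example above:

```python
scores = {"agent_a": 940, "agent_b": 880, "agent_c": 720}

naive = sum(scores.values()) / len(scores)  # ~847: looks Silver-tier
bounded = min(scores.values())              # 720: what the pipeline actually inherits

print(f"average: {naive:.0f}, weakest link: {bounded}")
```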
But it's actually worse than the minimum, because of error amplification.
Error Amplification: Why the Pipeline Score Falls Below the Minimum
Here's what happens that the individual scores don't capture:
Agent A has a 94% accuracy rate. That means 6% of its outputs are wrong. Those wrong outputs don't disappear — they become inputs to Agent B. Agent B, reasoning from Agent A's outputs, now has a contaminated input distribution. Not 6% contaminated — because Agent B can sometimes catch and correct Agent A's errors. But not always. Some fraction of Agent A's errors survive Agent B's processing.
By the time those surviving errors reach Agent C, they've been processed through two layers that didn't catch them. Agent C applies its own error rate to an already-contaminated input distribution. The effective error rate at the output is not a simple sum of the individual rates (some upstream errors do get caught), but it is substantially higher than any single agent's score implies.
The formal model: let e_A be Agent A's error rate, e_B Agent B's independent error rate, and p_AB the fraction of A's errors that survive B's processing. The effective error rate entering Agent C is approximately e_A * p_AB + e_B * (1 - e_A * p_AB). For typical error propagation rates in LLM-based pipelines, this compounds meaningfully at each node.
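That recurrence can be applied node by node. A minimal sketch, where every rate is an illustrative assumption rather than a measured value:

```python
def effective_error(e_in: float, propagation: float, e_node: float) -> float:
    """Error rate leaving a node: upstream errors that survive this node's
    processing, plus the node's own errors on the remaining inputs."""
    surviving = e_in * propagation
    return surviving + e_node * (1 - surviving)

# Assumed rates for illustration: A errs at 6%; B propagates half of A's
# errors and adds 12% of its own; C propagates 60% and adds 28%.
e_after_b = effective_error(0.06, 0.5, 0.12)
e_after_c = effective_error(e_after_b, 0.6, 0.28)
print(round(e_after_b, 4), round(e_after_c, 4))  # ~0.146, then higher still
```

The point is the shape of the curve, not the specific numbers: each node's output error rate feeds the next node as input contamination.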
Trust in a pipeline degrades with each node. It doesn't average.
Why Individual Scores Miss This
When you query the trust oracle for a single agent, you get a clean signal. The score reflects that agent's behavioral history on defined commitments, evaluated with ground-truth inputs. The certification tier reflects consistently maintained performance. The signal is interpretable and predictive for single-agent deployments.
When you compose agents into a pipeline, several things happen that current trust infrastructure doesn't model:
Trust assumptions don't compose. Each agent's behavioral pact specifies its individual commitments. "Agent A: accuracy ≥ 90% on classification tasks." That pact was evaluated in isolation — with ground-truth inputs sampled from the pact's defined test distribution. In a pipeline, Agent B receives its inputs from Agent A's outputs, which may not resemble the ground-truth distribution B was evaluated on. The pact is still being tested in production, but the test is no longer the one it was evaluated against.
Silent failure amplification. There's a critical asymmetry between loud failures and silent failures in pipeline contexts. When Agent A fails loudly — throws an error, returns a structured failure response — the downstream pipeline knows not to proceed. The failure is contained. When Agent A fails silently — returns a well-formed but wrong response with appropriate confidence — the failure is invisible to Agent B. Agent B reasons from corrupt data and produces a confidently wrong output. Agent C inherits that. Silent failures amplify through a pipeline in a way that loud failures don't.
Evaluation distribution drift. An agent evaluated in isolation on a curated test suite is evaluated under different input conditions than it encounters in production. In a pipeline, the upstream agent's output distribution becomes the downstream agent's input distribution. If that distribution diverges significantly from the test suite distribution, the pact evaluation loses predictive power for in-pipeline behavior.
The Infrastructure Doesn't Model This
Current agent evaluation infrastructure evaluates agents in isolation. This is the right starting point — you have to understand individual agent reliability before you can model pipeline reliability. But it stops there.
There's no standard for:
- Computing a pipeline trust score from individual agent scores given the pipeline topology
- Identifying the "weakest link" agent in a production pipeline and quantifying its impact on terminal output quality
- Alerting when a new agent is added to a pipeline that drops the pipeline's effective trust below a threshold
- Evaluating agents under realistic input distributions from their actual upstream agents, not just from benchmark test suites
The result: operators look at a dashboard showing healthy individual scores and assume the pipeline is healthy. The pipeline trust profile is invisible. The first time they see it is when the terminal output fails.
What Pipeline Trust Modeling Requires
Topology-aware pipeline scoring. The simplest defensible model is min(scores) — the pipeline trust equals its weakest link. This is conservative and often correct for sequential pipelines. A more nuanced model accounts for topology: parallel pipelines where multiple agents produce outputs that are reconciled have different trust dynamics than sequential pipelines. A pipeline where the weakest agent is in the middle, followed by a strong validator agent, has different risk characteristics than one where the weakest agent is terminal.
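A minimal sketch of the conservative min-based model. It is topology-aware only in the narrow sense of walking the upstream graph from the terminal node; the function name and graph encoding are assumptions for illustration:

```python
from typing import Dict, List

def pipeline_trust(scores: Dict[str, int],
                   upstream: Dict[str, List[str]],
                   terminal: str) -> int:
    """Weakest-link model: the trust of the path ending at `terminal` is the
    minimum score over every agent that feeds it, directly or transitively."""
    seen, stack = set(), [terminal]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(upstream.get(node, []))
    return min(scores[n] for n in seen)

# Sequential pipeline from the example: A feeds B, B feeds C (terminal).
scores = {"A": 940, "B": 880, "C": 720}
upstream = {"B": ["A"], "C": ["B"]}
print(pipeline_trust(scores, upstream, "C"))  # 720
```

A parallel-then-reconcile topology would need a different aggregation at the merge node; min over the whole reachable set is the conservative floor.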
Silent vs. loud failure weighting. An agent's contribution to pipeline risk should be weighted by its failure mode distribution, not just its aggregate error rate. An agent that fails at 5% but 90% of those failures are loud (explicit errors) is safer in a pipeline than an agent that fails at 3% but 80% of those failures are silent (confident wrong output). The silent failure rate is what actually propagates.
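The quantity that matters is the product of error rate and silent fraction. A sketch using the rates from the comparison above:

```python
def propagating_failure_rate(error_rate: float, silent_fraction: float) -> float:
    """Only silent failures flow downstream; loud failures halt the pipeline."""
    return error_rate * silent_fraction

# 5% errors, 90% loud  -> only ~0.5% propagates downstream
noisy_agent = propagating_failure_rate(0.05, 0.10)
# 3% errors, 80% silent -> ~2.4% propagates downstream
quiet_agent = propagating_failure_rate(0.03, 0.80)
print(noisy_agent < quiet_agent)  # True: the "better" aggregate score is riskier
```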
Upstream input distribution testing. Evaluate each agent in the pipeline not just against its ground-truth test suite, but against a sample of actual inputs it receives from its upstream neighbors in production. If Agent A's output distribution has drifted, Agent B should be evaluated on the actual distribution, not the distribution assumed at pact evaluation time.
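One lightweight way to detect that drift — a sketch, not a prescribed method, and the bucketing of inputs into categories is an assumption — is to compare the empirical distribution of production inputs against the pact's test distribution, for example by total variation distance:

```python
from collections import Counter

def total_variation(sample_a, sample_b) -> float:
    """Distance in [0, 1] between two empirical categorical distributions."""
    ca, cb = Counter(sample_a), Counter(sample_b)
    na, nb = len(sample_a), len(sample_b)
    return 0.5 * sum(abs(ca[k] / na - cb[k] / nb) for k in set(ca) | set(cb))

pact_inputs = ["short", "short", "long", "long"]  # distribution at evaluation time
live_inputs = ["long", "long", "long", "short"]   # what the agent actually receives
drift = total_variation(pact_inputs, live_inputs)
print(drift)  # 0.25
```

When drift crosses a chosen threshold, the downstream agent's pact should be re-evaluated against a sample of the live upstream outputs rather than the original test suite.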
Weakest-link alerting. When the trust score of any agent in a production pipeline drops below a threshold — from behavioral drift, a model update, or performance degradation — operators should be alerted with the pipeline-level impact quantified, not just notified that "Agent C's score dropped." The notification that matters is "Agent C's score dropped and it's the terminal node in Pipeline X, which processes 40% of your production traffic."
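A sketch of what such an alert could look like. The metadata shape, field names, and the 800 threshold are all assumptions for illustration:

```python
def pipeline_alerts(pipelines, scores, threshold=800):
    """Flag low scores with pipeline-level context, not just the raw number."""
    alerts = []
    for name, meta in pipelines.items():
        weakest = min(meta["agents"], key=lambda a: scores[a])
        if scores[weakest] < threshold:
            role = "the terminal node" if weakest == meta["terminal"] else "an interior node"
            alerts.append(f"Agent {weakest} scores {scores[weakest]} and is {role} "
                          f"in {name}, which handles {meta['traffic_share']:.0%} of traffic")
    return alerts

pipelines = {"Pipeline X": {"agents": ["A", "B", "C"],
                            "terminal": "C", "traffic_share": 0.40}}
scores = {"A": 940, "B": 880, "C": 720}
print(pipeline_alerts(pipelines, scores))
```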
The Security Parallel
Cryptographers formalized this decades ago. A chain is only as strong as its weakest link. TLS validates every certificate in a chain, not just the root. One compromised intermediate certificate authority undermines the entire trust path regardless of the root CA's trustworthiness.
Agent pipelines follow the same logic. An orchestrator with a Platinum trust score doesn't protect downstream users from a Bronze-tier terminal agent. The trust doesn't average up from the orchestrator to the terminal. The weakest node determines the effective trust of the path.
This is also why the composition of individually trustworthy agents doesn't automatically produce trustworthy pipelines. Trustworthiness, like security, doesn't compose naively. It requires reasoning about the composition as a system.
The Question for Builders
When you compose agents into a pipeline, do you compute a pipeline-level trust score — or do you look at individual scores and assume the pipeline inherits them?
If you've been burned by a low-trust agent at a critical node that wasn't flagged because its individual score looked fine, what burned you was the absence of a trust transitivity model. Individual scores told you each component was healthy. Pipeline math said otherwise.
Armalo's composite scoring infrastructure is extending to swarm and pipeline trust modeling. Get weakest-link analysis, pipeline trust scores, and topology-aware alerting for multi-agent systems. armalo.ai