In a Three-Agent Pipeline, the Trust Score Is the Minimum. Not the Average.
Most engineering teams building multi-agent pipelines have an intuition that pipelines are less reliable than their individual components. The intuition is correct. What's usually missing is the math — and the math reveals that the situation is worse than intuition suggests, worse in a specific way that changes how you should think about component reliability targets.
Take a five-agent sequential pipeline. Each agent has 95% reliability. The intuition: "We've got five reliable agents, so the pipeline should be pretty reliable." The calculation: 0.95^5 ≈ 0.774. At ten agents: 0.95^10 ≈ 0.599. Your pipeline of individually reliable agents is failing two out of every five runs.
Most teams know this abstractly. The part that changes behavior is doing the actual arithmetic for your pipeline depth and your component reliability targets, and then working backward: if I need 95% pipeline reliability with 10 agents, each agent needs to be at 99.5% (0.995^10 ≈ 0.951). That's not "pretty good" — that's extremely high and usually requires explicit engineering effort to achieve.
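The arithmetic is worth scripting for your own pipeline depth. A minimal sketch (function names are illustrative):

```python
def pipeline_reliability(per_agent: float, depth: int) -> float:
    """Compound reliability of a sequential pipeline, assuming independent failures."""
    return per_agent ** depth

def required_agent_reliability(target: float, depth: int) -> float:
    """Per-agent reliability needed to hit a pipeline-level target."""
    return target ** (1 / depth)

print(round(pipeline_reliability(0.95, 5), 3))        # 0.774
print(round(pipeline_reliability(0.95, 10), 3))       # 0.599
print(round(required_agent_reliability(0.95, 10), 4)) # 0.9949
```

Running the backward calculation for your own target is the step that changes behavior: a 95% pipeline target over ten agents demands roughly 99.5% from every single component.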
Why It's Actually Worse Than 0.95^N
The compound probability calculation assumes independent failures. In LLM-based pipelines, failures are not independent. This is the part that individual trust scores don't capture.
Agent A produces outputs that become Agent B's inputs. When Agent A fails silently — producing a confident, well-formed, wrong response — Agent B doesn't know. It receives what looks like a normal input and processes it normally. Agent B's reliability guarantee was measured on correct inputs. The reliability guarantee says nothing about what happens when inputs are corrupted by an upstream failure that wasn't detected.
The critical asymmetry: loud failures don't propagate. When Agent A throws an error or returns a structured failure response, downstream agents know not to proceed. The failure is contained. Loud failures subtract from pipeline reliability in the straightforward compound-probability way.
Silent failures amplify. Agent A's silent failure enters Agent B as a normal-looking input. Agent B processes it, possibly adding its own reasoning on top of the corrupted data. By the time the error reaches Agent C, it has been processed and reframed through two layers that didn't catch it. Agent C adds a third layer of confident processing. The terminal output is wrong in a way that traces to Agent A's original silent failure — but there's no signal anywhere in the pipeline that a failure occurred.
The effective error rate at the terminal output is higher than the compound probability model predicts, by a factor that depends on the silent failure rate at each node.
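The loud-versus-silent asymmetry can be made concrete with a toy simulation. The rates below are illustrative, and the model is deliberately simple: a loud failure halts the run, while a silent failure corrupts state that flows through every later agent undetected.

```python
import random

def run_pipeline(n_agents: int, fail_rate: float, silent_frac: float) -> str:
    """Toy model: loud failures are contained; silent failures keep flowing."""
    corrupted = False
    for _ in range(n_agents):
        if random.random() < fail_rate:
            if random.random() < silent_frac:
                corrupted = True      # wrong but well-formed: downstream proceeds
            else:
                return "halted"       # loud failure: pipeline stops here
    return "corrupt" if corrupted else "ok"

random.seed(0)
runs = [run_pipeline(5, 0.05, 0.5) for _ in range(100_000)]
print({k: runs.count(k) for k in ("ok", "halted", "corrupt")})
```

The point of the model is the terminal view: "corrupt" runs finish looking exactly like "ok" runs. Only the "halted" failures are visible from outside the pipeline.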
Individual Scores Miss Pipeline Risk Entirely
When you query a trust oracle for a single agent, you get clean signal. The score reflects behavioral history on defined commitments, evaluated with ground-truth inputs. The certification tier reflects sustained performance. For single-agent deployment, this is exactly the right signal.
When you compose agents, three things happen that individual scores don't model.
Pact conditions were evaluated against the wrong input distribution. "Agent B: accuracy ≥ 92% on synthesis tasks" was evaluated with ground-truth inputs drawn from the pact's test distribution. In a pipeline, Agent B receives outputs from Agent A — which may look nothing like the test distribution if Agent A has behavioral drift, or if your production input distribution diverges from Agent A's training distribution. The pact is still live. The test has changed.
Confidence signals from upstream become inputs to downstream reasoning. If Agent A expresses 0.9 confidence on a response it got wrong, Agent B may weight that response heavily in its synthesis. The downstream agent is reasoning from a false prior with high expressed confidence. This is worse than Agent B reasoning from no confidence signal at all — high confidence in a wrong answer is more misleading than uncertainty in a wrong answer.
Your weakest agent determines your pipeline risk profile more than your average agent. Consider: orchestrator at 940, research at 890, synthesis at 880, validator at 910, delivery at 720. The average is 868. The terminal agent — whose output is what actually reaches users — is at 720. Everything before it was processed through four layers, and the output quality is bounded by what the 720-score agent produces from those inputs. The pipeline risk profile is set by the weakest agent at the most consequential position, not by the average score.
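The gap between the average and the weakest link is a one-liner to check, using the scores above:

```python
scores = {"orchestrator": 940, "research": 890, "synthesis": 880,
          "validator": 910, "delivery": 720}

mean_score = sum(scores.values()) / len(scores)
pipeline_score = min(scores.values())  # conservative weakest-link estimate

print(mean_score)      # 868.0
print(pipeline_score)  # 720
```

A dashboard showing 868 and a dashboard showing 720 lead to very different operational decisions, and the second number is the one the terminal user experiences.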
The Infrastructure Gap
Current agent evaluation infrastructure evaluates agents in isolation. This is the right starting point. You can't understand pipeline reliability without understanding component reliability. But isolation evaluation doesn't tell you what you need to know at composition time.
There's no standard infrastructure for:
- Computing pipeline trust from component scores given pipeline topology
- Weighting silent failure rate separately from aggregate failure rate
- Flagging when a new agent added to a pipeline drops pipeline reliability below a threshold
- Evaluating agents against their actual upstream input distribution, not their certified test distribution
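None of this exists as shared infrastructure yet, but the first and third items are straightforward to sketch for a sequential topology. Everything below (function names, the reliability inputs) is hypothetical, assuming independent failures:

```python
def sequential_pipeline_trust(reliabilities: list[float]) -> float:
    """Baseline reliability of a sequential pipeline (hypothetical helper)."""
    trust = 1.0
    for r in reliabilities:
        trust *= r
    return trust

def check_new_agent(current: list[float], new_agent: float,
                    threshold: float) -> bool:
    """Flag whether adding an agent keeps pipeline reliability above threshold."""
    return sequential_pipeline_trust(current + [new_agent]) >= threshold

agents = [0.99, 0.98, 0.99]
# 0.99 * 0.98 * 0.99 * 0.97 ≈ 0.932, below the 0.95 threshold
print(check_new_agent(agents, 0.97, threshold=0.95))  # False
```

Even this naive version would surface the failure mode described above: every individual score clears 97%, and the composed pipeline still misses a 95% target.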
The result: operators see healthy individual scores and assume the pipeline is healthy. The pipeline trust profile is invisible until something fails terminally.
What You Should Actually Compute
The practical minimum: treat pipeline reliability as min(component scores), not mean(component scores). This is conservative but usually correct for sequential pipelines. The terminal agent and any bottleneck nodes have outsized impact.
More precisely:
For the compound probability effect: (r_1 × r_2 × ... × r_N), where each r_i is component reliability. This gives you the baseline assuming independent failures.
For the silent failure amplification: identify each component's silent failure rate — the fraction of failures that produce well-formed wrong outputs rather than explicit errors. Weight each component's contribution to pipeline risk by its silent failure rate, not just its aggregate failure rate. A 97% reliable agent where 80% of the 3% failures are silent is riskier in a pipeline than a 95% reliable agent where 90% of failures are loud.
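The comparison in that last sentence is worth doing numerically. The per-run silent failure rate is the aggregate failure rate times the silent fraction:

```python
def silent_failure_rate(reliability: float, silent_frac: float) -> float:
    """Fraction of all runs that fail silently (well-formed wrong output)."""
    return (1 - reliability) * silent_frac

agent_a = silent_failure_rate(0.97, 0.80)  # 97% reliable, 80% of failures silent
agent_b = silent_failure_rate(0.95, 0.10)  # 95% reliable, 90% of failures loud

print(round(agent_a, 4))  # 0.024 — 2.4% of runs emit undetected wrong output
print(round(agent_b, 4))  # 0.005 — 0.5%, despite the lower headline reliability
```

Ranked by headline reliability, agent A looks better; ranked by what actually propagates through a pipeline, it is nearly five times riskier.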
For position weighting: terminal agents and agents whose outputs feed multiple downstream consumers have higher impact per unit of reliability. A 720-score terminal agent in a five-agent pipeline is a different risk from a 720-score agent in the middle with a 940-score validator after it.
The engineering implication: if you need 95% pipeline reliability and you have a 10-agent pipeline, each agent needs to be at 99.5%. If your current agents are at 95%, your pipeline is at 60%. The gap between where you are and where you need to be is not closeable by declaring components "good enough" — it requires either higher component reliability targets or explicit error containment at each pipeline node.
The Security Parallel
Cryptographers formalized this decades ago. TLS validates every certificate in a trust chain, not just the root. One compromised intermediate CA undermines the entire trust path regardless of root CA trustworthiness.
An orchestrator with a Platinum trust score does not protect downstream users from a Bronze-tier terminal agent. Trust doesn't flow upward from weak components to strong ones. Trustworthiness, like security, doesn't compose naively. It requires reasoning about the composition as a system.
The question for any multi-agent system in production: do you have a pipeline-level trust score, or are you looking at individual component scores and assuming the pipeline inherits them?
Armalo's composite scoring infrastructure is extending to pipeline and swarm trust modeling: weakest-link analysis, topology-aware scoring, silent-failure rate tracking. armalo.ai