Karpathy's March of Nines

Your agent pipeline fails more than you think.

A 10-step agent workflow at 90% per-step reliability succeeds only 34.9% of the time. Per-step failures compound across the pipeline. This is the central insight behind harness engineering.


The compounding failure problem

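A pipeline succeeds only if every step succeeds, so the per-step success rates multiply. Assuming steps fail independently, overall reliability is the per-step rate raised to the power of the number of steps. By step 10, even a "good" 90% per-step rate collapses to a 34.9% overall success rate.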

10 steps at 90% each = 34.9% overall

10 steps at 95% each = 59.9% overall

10 steps at 99.5% each = 95.1% overall
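The three figures above can be reproduced with a few lines of Python. This is a minimal sketch that assumes every step has the same success rate and that steps fail independently; the function name `overall_reliability` is illustrative, not part of any SDK.

```python
def overall_reliability(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be between 0 and 1")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    # All steps must succeed, so the per-step probabilities multiply.
    return per_step ** steps


for rate in (0.90, 0.95, 0.995):
    overall = overall_reliability(rate, 10)
    print(f"10 steps at {rate:.1%} each: {overall:.1%} overall")
```

Real pipelines rarely have identical or fully independent steps, so treat the result as an estimate rather than a guarantee.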

The implication

Most agent pipelines run at 85–92% per-step reliability. At 10 steps, that means the workflow succeeds only about 20% to 43% of the time, so it fails more often than it succeeds. The fix isn't better prompts. It's deterministic software rails: state machines, validation loops, and retry-on-failure gating.
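To see why retry-on-failure gating matters, here is a rough sketch of the arithmetic. It assumes that attempts are independent and that a validator catches every bad output, which real systems only approximate. The function name `with_retries` is hypothetical.

```python
def with_retries(per_step: float, max_attempts: int) -> float:
    """Chance a step passes within max_attempts tries.

    Assumes independent attempts and a validator that detects every failure.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    # The step fails overall only if every single attempt fails.
    return 1.0 - (1.0 - per_step) ** max_attempts


raw = 0.90
gated = with_retries(raw, max_attempts=3)
print(f"per-step: {raw:.1%} raw, {gated:.1%} with up to 3 attempts")
print(f"10 steps: {raw ** 10:.1%} raw, {gated ** 10:.1%} gated")
```

Under these assumptions, a 90% step with up to three validated attempts behaves like a 99.9% step, which lifts a 10-step pipeline from 34.9% to about 99.0% overall. Correlated failures (for example, a prompt the model always gets wrong) will reduce that gain.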

Calculate your pipeline reliability

Enter the number of steps in your agent workflow and your estimated per-step success rate. See the real overall reliability — and what per-step rate you need to reach production-grade nines.

Pipeline steps (2 to 20): 10

Per-step reliability (80% to 100%): 90.0%

Overall pipeline (10 steps × 90.0% each): 34.9%

Needed for 99% overall: 99.900% per step

Needed for 99.9% overall: 99.9900% per step

At these settings the pipeline fails about 65% of the time. Most production agents run here.
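The "needed per step" figures come from inverting the compounding formula: take the target overall rate to the power of one over the number of steps. A small sketch, again assuming identical, independent steps (the name `required_per_step` is illustrative):

```python
def required_per_step(target_overall: float, steps: int) -> float:
    """Per-step success rate needed to hit target_overall across all steps."""
    if not 0.0 < target_overall <= 1.0:
        raise ValueError("target_overall must be in (0, 1]")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    # Invert overall = per_step ** steps.
    return target_overall ** (1.0 / steps)


print(f"99% overall, 10 steps:   {required_per_step(0.99, 10):.3%} per step")
print(f"99.9% overall, 10 steps: {required_per_step(0.999, 10):.4%} per step")
```

Each extra nine of overall reliability costs roughly one extra nine per step, which is why long pipelines are so demanding.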

The harness engineering framework

Five interlocking mechanisms that turn unreliable AI calls into production-grade pipelines. Armalo implements all five and exposes them as SDK primitives for builders.

State Machines

Fixed-phase enforcement. Agents can only move to valid next states — invalid transitions are blocked at the DB level, not caught after the fact.
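As a rough illustration of fixed-phase enforcement, the sketch below keeps an explicit table of allowed transitions and rejects anything else. It runs in memory only; the phase names, the `AgentRun` class, and the `InvalidTransition` exception are assumptions made for this example, and enforcing the same rule at the database level would need a constraint or trigger that is not shown here.

```python
from enum import Enum


class Phase(Enum):
    PLAN = "plan"
    EXECUTE = "execute"
    VALIDATE = "validate"
    DONE = "done"


# Hypothetical transition table: each phase lists its valid next phases.
ALLOWED = {
    Phase.PLAN: {Phase.EXECUTE},
    Phase.EXECUTE: {Phase.VALIDATE},
    Phase.VALIDATE: {Phase.EXECUTE, Phase.DONE},  # failed validation loops back
    Phase.DONE: set(),
}


class InvalidTransition(Exception):
    """Raised when an agent requests a phase change that is not allowed."""


class AgentRun:
    def __init__(self) -> None:
        self.phase = Phase.PLAN

    def advance(self, next_phase: Phase) -> None:
        # Block the move before any state changes, rather than detecting it later.
        if next_phase not in ALLOWED[self.phase]:
            raise InvalidTransition(
                f"{self.phase.value} -> {next_phase.value} is not allowed"
            )
        self.phase = next_phase


run = AgentRun()
run.advance(Phase.EXECUTE)
try:
    run.advance(Phase.DONE)  # skipping validation is rejected
except InvalidTransition as err:
    print(f"blocked: {err}")
```

The point of the table is that an agent cannot talk its way past a phase: the only way to reach DONE is through VALIDATE.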

Validation Loops

Programmatic output checking with forced iteration on failure. Not just detecting bad output, but retrying until it passes or hits a defined retry cap.
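A validation loop of this kind can be sketched in a few lines. The helper below is hypothetical (`run_with_validation`, `RetryCapExceeded`, and the stand-in step are all invented for illustration): it calls a step, checks the output with a caller-supplied validator, and retries until the output passes or the cap is reached.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class RetryCapExceeded(Exception):
    """Raised when no attempt produced a valid output."""


def run_with_validation(
    step: Callable[[], T],
    is_valid: Callable[[T], bool],
    max_attempts: int = 3,
) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for _ in range(max_attempts):
        output = step()
        if is_valid(output):
            return output  # first valid output wins
    raise RetryCapExceeded(f"no valid output after {max_attempts} attempts")


# Stand-in for a model call: fails twice, then returns something usable.
_fake_outputs = iter(["", "not json", '{"ok": true}'])


def fake_step() -> str:
    return next(_fake_outputs)


def looks_like_json_object(text: str) -> bool:
    return text.startswith("{") and text.endswith("}")


result = run_with_validation(fake_step, looks_like_json_object, max_attempts=3)
print(result)
```

Raising at the cap, instead of returning the last bad output, keeps a failed step from silently feeding the next one.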

March of Nines

Per-step reliability tracking with compounding math. Surfaces the real overall success rate of your pipeline, not the best-case single-step number.
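When steps have different success rates, the same compounding applies to the product of the individual rates. The sketch below estimates each rate from observed counts and multiplies them. The step names and counts are made-up sample data, and the function names are illustrative.

```python
import math
from typing import Mapping, Tuple


def step_rate(successes: int, attempts: int) -> float:
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return successes / attempts


def pipeline_reliability(step_counts: Mapping[str, Tuple[int, int]]) -> float:
    """Product of per-step success rates, assuming independent steps."""
    return math.prod(step_rate(s, a) for s, a in step_counts.values())


# Made-up sample data: step name -> (successes, attempts)
observed = {"plan": (98, 100), "retrieve": (95, 100), "draft": (90, 100)}

best = max(step_rate(s, a) for s, a in observed.values())
print(f"best single step: {best:.1%}")
print(f"overall pipeline: {pipeline_reliability(observed):.1%}")
```

With these sample counts the best single step looks like 98%, while the pipeline as a whole lands near 84%.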

Sub-Agent Delegation

Supervisor agents spawning isolated sub-agents with scoped context, model routing, and inherited permissions. Divide-and-validate at every layer.
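One way to picture scoped delegation: the supervisor hands each sub-agent only the context keys it needs and only those permissions the supervisor itself holds. Everything in this sketch (`SubAgentTask`, `delegate`, the key names, the permission names, and the model label) is hypothetical and stands in for whatever a real orchestration layer provides.

```python
from dataclasses import dataclass
from typing import Any, Dict, FrozenSet, Iterable


@dataclass(frozen=True)
class SubAgentTask:
    name: str
    context: Dict[str, Any]
    permissions: FrozenSet[str]
    model: str


def delegate(
    name: str,
    full_context: Dict[str, Any],
    context_keys: Iterable[str],
    supervisor_permissions: Iterable[str],
    requested_permissions: Iterable[str],
    model: str,
) -> SubAgentTask:
    # Scoped context: copy only the keys this sub-agent was given.
    scoped = {k: full_context[k] for k in context_keys if k in full_context}
    # Inherited permissions: never more than the supervisor already has.
    granted = frozenset(requested_permissions) & frozenset(supervisor_permissions)
    return SubAgentTask(name=name, context=scoped, permissions=granted, model=model)


task = delegate(
    name="summarizer",
    full_context={"document": "...", "api_key": "secret", "user_id": 42},
    context_keys=["document"],
    supervisor_permissions={"read_docs", "write_notes"},
    requested_permissions={"read_docs", "delete_docs"},
    model="small-model",  # placeholder label for model routing
)
print(task.context.keys(), sorted(task.permissions))
```

Here the sub-agent never sees `api_key`, and its request for `delete_docs` is dropped because the supervisor does not hold that permission.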

SDK Coverage

One-line trust gates, retry wrappers, and pipeline reliability queries. Builders get harness-ready primitives without reinventing reliability infrastructure.

Know your pipeline's actual reliability.

Armalo computes per-step success rates from real eval data and shows you the compounding failure math for every agent workflow you run.

