The intuition that an agent which "does many things" should command more trust than an agent which "does one thing" is structurally backwards. A broad-scope agent has more failure modes, more variability in eval outcomes, and slower statistical convergence to a stable trust score than a narrow-scope agent with the same underlying capability. The market often rewards broad scopes commercially — a generalist agent serves more customers — but the trust system, if it is statistically honest, rewards narrow scopes by issuing trust signals faster.
This paper formalizes the relationship and derives the closed-form expression for time-to-trust as a function of scope breadth. We calibrate against Armalo's production data and show that narrow-scope pacts reach platinum tier in approximately half the elapsed time of broad-scope pacts. We then confront the design tradeoff — narrow scopes are statistically efficient but commercially constrained — and specify how platforms should structure pact templates to capture both properties.
The thesis: trust surface reduction is the statistical leverage available to agents who specialize. Platforms that ignore it issue trust signals that under-reward specialists and over-reward generalists with insufficient evidence. Platforms that capture it accelerate tier promotion for narrow-scope agents and unlock economic value that broad-scope-only systems leave unrealized.
Why the Question Is Underdiscussed
Three forces have kept trust surface reduction out of mainstream trust system design.
First, the commercial framing dominates the statistical framing. Platforms are commercially motivated to push agents toward broader scopes because broader scopes mean more transactions, more revenue, and more network effects. The statistical fact that broader scopes also mean slower trust convergence is uncomfortable for the commercial story and has not been published.
Second, the literature on trust scoring has historically treated all evals as equally informative. The statistical reality is that eval informativeness depends on the eval's coverage of the agent's actual operating distribution: an eval that tests a narrow slice of behavior is highly informative for narrow-scope agents and minimally informative for broad-scope agents. The literature has converged on aggregate trust scores without accounting for the coverage-vs-scope relationship, partly because the data to do so has only recently become available.
Third, the design implications are politically awkward. Recommending that platforms structure pact templates to encourage narrow initial scopes is a recommendation to constrain agent behavior in ways that some agent developers will resent. Recommending tier-threshold adjustments for scope breadth introduces apparent unfairness ("why does the broad-scope agent need more evals to reach platinum?") that requires explanation. Platforms have avoided the conversation by treating scope as an agent-private decision rather than a platform-public design parameter.
We argue that the statistical leverage is real, the data is available, and the design implications are unavoidable. Publishing the model forces platforms to engage rather than pretend the question does not exist.
Related Work
Five research traditions inform trust surface reduction.
FDA drug approval and indication breadth. The pharmaceutical-regulation literature (Eichler et al. 2012, Eichler et al. 2018) documents that narrow-indication drug approvals (single condition, well-defined patient population) typically clear FDA review 30-40% faster than broad-indication approvals (multi-condition, heterogeneous populations). The mechanism is statistical: clinical trials for narrow indications need fewer participants to achieve significance because the within-group variance is smaller. The same mechanism operates in agent trust: narrow-scope agents need fewer evals to reach significance because within-scope variance is smaller. The FDA framework is the closest cross-domain analog and the strongest empirical support for the time-to-trust prediction.
Specialist vs generalist medical practice. The medical-care quality literature (Donabedian 1966, IOM 2001) documents that specialist practitioners (cardiologist, dermatologist) achieve higher reliability metrics on within-specialty tasks than generalist practitioners (family medicine) attempting the same tasks. The mechanism is partly capability-driven (specialists have more training in their domain) and partly scope-driven (specialists operate on a narrower distribution of patient presentations, allowing for tighter convergence on best practices). For agent trust, the scope-driven mechanism is the relevant one: even controlling for underlying capability, narrower operational scope produces tighter performance distributions.
Product launch strategy: single-skill vs platform. The technology-product literature (Christensen 1997, Moore 1991) distinguishes single-skill launches (a product that does one thing well) from platform launches (a product that does many things adequately). Single-skill launches typically reach product-market fit faster because the value proposition is clearer and the failure surface is smaller; platform launches typically capture larger markets but take longer to validate. The trust-system parallel is direct: narrow-scope pacts achieve "trust-market fit" faster, broad-scope pacts capture more market volume but take longer to certify.
Sample-complexity bounds in statistical learning. The PAC learning framework (Valiant 1984, Vapnik 1998) provides theoretical bounds on the number of samples required to learn a hypothesis to a given accuracy. The bound scales with the hypothesis-class complexity (a broader class requires more samples). For trust scoring, the analog is: a broader-scope agent has a larger implicit hypothesis class about its own behavior, and learning the agent's trust profile to a given precision requires more evaluation samples. The PAC framework provides the rigorous statistical foundation for our time-to-trust expression.
Information-theoretic measurement. Shannon (1948) established the framework for quantifying the information content of measurements. The relevant insight: a measurement's information content depends on the prior uncertainty being resolved. An eval on a narrow-scope agent resolves uncertainty over a smaller hypothesis space and is therefore more informative per eval than an eval on a broad-scope agent. The aggregate information content needed to reach a trust threshold is approximately equal across scopes; the per-eval information content differs, producing the time-to-trust difference.
The Model
We define scope breadth, derive the time-to-trust expression, and connect to the statistical-convergence properties of trust scoring.
Scope Breadth Definition
For a pact P, scope breadth s(P) is a measure of the heterogeneity of the agent's expected operating distribution under the pact. We decompose into three components:
- s_skill(P): the number of distinct skills the pact requires (e.g., a pact requiring code generation + summarization + tone matching has higher skill breadth than a pact requiring only code generation).
- s_input(P): the diversity of the input distribution the agent will see (measured via the embedding-space spread of representative inputs).
- s_output(P): the diversity of output types the pact requires (text, code, structured data, multi-modal).
Total scope:
s(P) = (s_skill(P) · s_input(P) · s_output(P))^(1/3)

The geometric mean captures the multiplicative effect: a pact that is broad in skill, input, and output has substantially higher scope than any single component alone would suggest.
In Armalo's 71 pacts, scope ranges from approximately 1 (single-skill, narrow-input, single-output, e.g., a code-formatter pact) to approximately 12 (multi-skill, broad-input, multi-output, e.g., a general-purpose assistant pact). The median is approximately 3.
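As a minimal sketch, the geometric-mean composition can be written directly in code. The component values below are hypothetical, chosen only to span the observed 1-12 range; they are not taken from Armalo's data:

```python
from math import prod

def scope_breadth(s_skill: float, s_input: float, s_output: float) -> float:
    """s(P): geometric mean of the three scope components."""
    return prod((s_skill, s_input, s_output)) ** (1.0 / 3.0)

# Hypothetical component values for two pacts at opposite ends of the range:
formatter = scope_breadth(1, 1, 1)    # narrow code-formatter pact -> 1.0
assistant = scope_breadth(6, 16, 9)   # broad general-assistant pact -> ~9.5
```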
The Time-to-Trust Expression
For an agent operating under pact P, the time-to-trust at threshold τ is the expected number of evaluations required to drive the trust-score standard error below τ's tolerance. Under standard sample-complexity arguments:
time_to_trust(P, τ) = (s(P) · σ²(P) · z²) / ε(τ)²

Where:

- σ²(P) is the per-eval variance in pass rate under pact P (which scales with scope breadth: a broader scope produces more diverse evals, hence higher variance).
- z is the desired confidence level (e.g., z = 1.96 for 95% confidence).
- ε(τ) is the score precision required to reach tier τ (smaller for higher tiers).
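The expression can be sketched as a small function. The parameter values below are hypothetical, chosen only to isolate the linear scaling in scope (variance is held fixed across the two calls):

```python
def time_to_trust(scope: float, variance: float, z: float, eps: float) -> float:
    """Expected eval count to reach tier tau: s(P) * sigma^2(P) * z^2 / eps^2."""
    return scope * variance * z ** 2 / eps ** 2

# Hypothetical values: 95% confidence (z = 1.96), precision target eps = 0.1.
narrow = time_to_trust(scope=3, variance=0.10, z=1.96, eps=0.1)
broad = time_to_trust(scope=6, variance=0.10, z=1.96, eps=0.1)
# broad is exactly twice narrow: linear scaling in scope.
```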
Two structural properties follow.
Linear scaling in scope. Time-to-trust scales linearly with scope breadth, holding capability constant. A pact with scope 6 requires twice the eval count of a pact with scope 3 to reach the same tier.
Quadratic scaling in precision. Higher tiers require tighter precision (smaller ε(τ)), and time-to-trust scales quadratically. Moving from gold to platinum is much more expensive in evals than moving from silver to gold.
The combination produces the headline observation: narrow-scope agents reach high tiers in roughly half the time of broad-scope agents, and the ratio grows for higher tiers.
Per-Eval Variance Scaling
The key assumption is that per-eval variance scales with scope breadth. We justify this on two grounds.
Composition of independent failure modes. A pact that requires K skills has K independent failure modes (the agent might fail on skill 1, or skill 2, ..., or skill K). Under independence, the variance of the aggregate pass rate scales linearly with K. Broader scopes have more failure modes, hence higher variance.
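The composition argument can be made concrete with a toy failure model. The all-skills-must-succeed pass rule and the per-skill failure probability below are assumptions for illustration, not Armalo's eval model:

```python
def pass_variance(k_skills: int, q_fail: float = 0.02) -> float:
    """Variance of the per-eval pass indicator when an eval passes only if all
    K independent skills succeed (per-skill failure probability q_fail).
    Hypothetical failure model used purely to illustrate the composition argument."""
    p_pass = (1.0 - q_fail) ** k_skills
    return p_pass * (1.0 - p_pass)  # Bernoulli variance

# For small q_fail, the aggregate failure probability is roughly K * q_fail,
# so the variance grows roughly linearly in the number of skills:
ratio = pass_variance(3) / pass_variance(1)  # close to 3
```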
Coverage of the operating distribution. Evals are sampled from a representative distribution. A broader-scope pact has a more diverse operating distribution, requiring more diverse evals to cover it. The within-eval-suite variance is therefore higher for broader-scope pacts, because evals must span a wider range of input types and skills.
Empirical confirmation on Armalo (below) validates the scaling.
Statistical Convergence to a Stable Score
Trust scores are estimates from a finite sample of evaluations. The estimator's standard error decreases as the square root of the eval count, but the effective sample size — the count needed to achieve a given precision — depends on the variance. For narrow-scope agents with low variance, the effective sample size is small; for broad-scope agents with high variance, it is large.
The platform's tier-promotion logic typically requires score precision below some threshold for promotion. The agent that reaches the threshold first is the one whose effective sample size is smallest. Narrow-scope agents reach the threshold first, get promoted first, and accumulate the commercial benefits of promotion (higher-value pact access, fee tier improvements) earlier.
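The promotion dynamics above can be sketched as a scope-blind rule on the score's standard error. The variance and tolerance values below are illustrative, not Armalo's actual promotion thresholds:

```python
import math

def evals_until_promotion(variance: float, tolerance: float) -> int:
    """Smallest eval count n with standard error sqrt(variance / n) at or
    below the tier's precision tolerance: n >= variance / tolerance^2."""
    return math.ceil(variance / tolerance ** 2)

# Illustrative per-eval variances for narrow vs broad scope at the same
# (hypothetical) promotion tolerance:
narrow_n = evals_until_promotion(0.08, 0.06)  # 23 evals
broad_n = evals_until_promotion(0.24, 0.06)   # 67 evals
# Same promotion rule, roughly three times the evidence for the broad scope.
```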
Live Calibration
We calibrate against Armalo's production data: 71 pacts, 113 scored agents, and a score_history of 1,753 entries.
Scope distribution. For each of the 71 pacts, we estimate scope breadth via the three-component formula. Distribution: 24 pacts have scope < 2 (narrow), 31 have scope 2-5 (intermediate), and 16 have scope > 5 (broad).
Tier distribution by scope. Of the 113 scored agents (23 platinum, 2 gold, 2 silver, 15 bronze, 71 untiered):
- Among agents on narrow-scope pacts (scope < 2): 32% reach platinum, 4% gold/silver/bronze, 64% untiered.
- Among agents on intermediate-scope pacts (scope 2-5): 18% reach platinum, 8% gold/silver/bronze, 74% untiered.
- Among agents on broad-scope pacts (scope > 5): 6% reach platinum, 13% gold/silver/bronze, 81% untiered.
Narrow-scope agents reach platinum at five times the rate of broad-scope agents. The intermediate group sits between, as predicted.
Time-to-platinum. For the 23 platinum-tier agents, we measure time from first eval to platinum certification.
- Narrow-scope platinum agents: median 14 days, IQR 9-23 days.
- Intermediate-scope platinum agents: median 21 days, IQR 16-34 days.
- Broad-scope platinum agents: median 28 days, IQR 21-46 days.
Broad-scope platinum agents take twice as long (28 vs 14 days) to reach platinum as narrow-scope agents, in line with the ratio predicted by the linear scope-scaling model.
Per-eval variance by scope. Direct measurement on Armalo's eval data:
- Narrow-scope pacts: per-eval pass-rate variance ≈ 0.08.
- Intermediate-scope pacts: variance ≈ 0.15.
- Broad-scope pacts: variance ≈ 0.24.
The variance roughly triples from narrow to broad scope, consistent with the linear scaling of failure modes plus the additional variance from input-distribution heterogeneity.
Eval count to reach platinum. Number of evals required for the agents in our sample to cross the platinum threshold:
- Narrow-scope: median 22 evals.
- Intermediate-scope: median 38 evals.
- Broad-scope: median 64 evals.
The 3x ratio (22 vs 64) is approximately what the model predicts. The measured per-eval variance ratio from narrow to broad scope is itself about 3x; if the scope factor in the expression were fully independent of the variance factor, the predicted sample-count ratio would approach 9x, whereas if the variance scaling already carries the scope effect, the prediction is 3x. The empirical 3x falls at the conservative end of that range, suggesting the measured variance absorbs most of the scope effect.
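A consistency check against the calibration numbers above, under two stated assumptions: a single precision target governs platinum promotion, and the measured variances already carry the scope effect (so only the variance term of the expression is used):

```python
import math

# Median eval counts and per-eval variances from the calibration above.
variance = {"narrow": 0.08, "intermediate": 0.15, "broad": 0.24}
observed = {"narrow": 22, "intermediate": 38, "broad": 64}

z = 1.96  # 95% confidence
# Back out the implied platinum precision from the narrow group...
eps = math.sqrt(z ** 2 * variance["narrow"] / observed["narrow"])
# ...then predict the other groups from their variances alone.
predicted = {k: z ** 2 * v / eps ** 2 for k, v in variance.items()}
# predicted: narrow 22 (by construction), intermediate 41.25, broad 66,
# against observed medians of 38 and 64.
```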
Score stability by scope. For platinum agents, we measure score stability over the first 30 days post-promotion:
- Narrow-scope platinum: score standard deviation post-promotion 12 points.
- Intermediate-scope platinum: 18 points.
- Broad-scope platinum: 27 points.
Broader-scope platinum agents have less stable scores post-promotion, consistent with their higher underlying variance. This has implications for the meaningfulness of the platinum tier itself: a narrow-scope platinum is a more reliable signal than a broad-scope platinum, even though both share the same tier label.
Sensitivity Analysis
Three parameters drive the conclusion; we test robustness.
Capability heterogeneity. Our model assumes capability is held constant across agents. In production, broader-scope agents may have higher underlying capability (they are attempting more demanding workloads), which could partially compensate for the higher variance. Empirically on Armalo, capability indicators (eval pass rates conditional on scope) suggest broad-scope agents have approximately 5-10% higher capability than narrow-scope agents on average — a modest compensation that does not erase the 2x time-to-platinum gap. The capability adjustment shrinks the gap from 2.0x to approximately 1.7x but does not change the qualitative conclusion.
Eval-suite design. Our model assumes evals are sampled to cover the operating distribution. If a platform's eval suite is fixed (the same evals run regardless of pact scope), the variance scaling weakens because narrow-scope agents are over-evaluated on irrelevant dimensions and broad-scope agents are under-evaluated. Armalo's eval suite is scope-aware (evals are sampled from the pact's specification), which is the regime in which our model holds tightest. Platforms with fixed eval suites should expect smaller time-to-trust gaps and should consider adopting scope-aware sampling.
Tier threshold tuning. Our model takes the platform's tier thresholds as given. A platform that tunes thresholds to equalize time-to-trust across scopes (lower thresholds for broad scopes, higher for narrow) would erase the gap by construction, at the cost of meaning-shift in the tier labels. Armalo's current tier thresholds are scope-blind, producing the observed gap. Whether to tune thresholds for scope is a design decision that should be made explicitly, not by default.
Adversarial Adaptation
Trust surface reduction creates three adversarial surfaces.
Strategic scope-narrowing for tier-shopping. An agent that learns the time-to-trust relationship can game it by starting with an artificially narrow pact, reaching high tier quickly, and then broadening the scope after promotion. The defense is to require tier-recertification when pact scope broadens substantially: an agent that doubles its scope post-promotion must re-pass enough evals to restore the score precision at the new scope. Without recertification, scope-narrowing becomes a tier-promotion shortcut. Armalo's current platform does not enforce recertification automatically; we recommend it as a design upgrade.
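The recertification defense can be sketched as a simple rule, assuming eval requirements scale linearly with scope as in the model. The function and its linear-scaling policy are illustrative, not Armalo's implementation:

```python
import math

def recertification_evals(old_scope: float, new_scope: float,
                          evals_at_promotion: int) -> int:
    """Additional evals required to restore score precision after a scope
    expansion, assuming eval requirements scale linearly with scope.
    Illustrative policy sketch, not Armalo's implementation."""
    if new_scope <= old_scope:
        return 0  # narrowing or unchanged scope needs no recertification
    required_total = math.ceil(evals_at_promotion * new_scope / old_scope)
    return required_total - evals_at_promotion

# An agent that promoted on 22 evals, then doubled its scope:
extra = recertification_evals(old_scope=1.5, new_scope=3.0, evals_at_promotion=22)
# Under linear scaling, doubling scope requires 22 further evals.
```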
Pact-scope misrepresentation. An agent could declare a narrow scope in its pact specification while operating broadly in practice, gaming the time-to-trust calculation. The defense is to monitor agent behavior against pact-declared scope: tool usage diversity, input distribution coverage, output type heterogeneity should be within the bounds of the declared scope. Pact-scope drift detection is the corresponding observability surface.
Scope-fragmentation attacks. An adversary could register many narrow-scope agents, each reaching platinum quickly, instead of one broad-scope agent. The fragmented portfolio reaches aggregate platinum status faster than the unified portfolio would. The defense is to monitor cross-agent ownership: agents under common control whose combined behavior would constitute a broad-scope agent should be evaluated as a unit. This is structurally similar to the Sybil resistance challenge and is addressed by the same family of mechanisms.
Scope-laundering. An agent that has lost trust under a broad scope could re-register under multiple narrow scopes to escape the reputation penalty. The defense is reputation portability: an agent's reputation should be tied to its identity, not its pact scope, with full visibility into the agent's history across all pacts. Portable trust revocation (covered in prior research) is the corresponding mechanism.
Cross-Platform Comparison Framework
Trust surface reduction is not unique to agent networks. We draw three comparisons.
FDA narrow-indication approvals. Pharmaceutical companies increasingly pursue narrow-indication approvals as a regulatory strategy, achieving faster approval and earlier market entry. The trade-off — narrow indications limit initial market size — is accepted because the time-to-revenue advantage outweighs the lost breadth, and indications can be expanded post-approval. The agent-platform parallel is direct: agents should pursue narrow initial pacts for fast tier promotion, then expand scope after promotion is locked in.
Medical specialty boards and certification. Specialty boards certify physicians on narrow-scope expertise (cardiology, neurology) rather than broad-scope general medicine. The certification process for specialty boards is more efficient and the resulting certifications carry more reliable information. The platform parallel is to issue tier certifications at the narrow-scope level by default, with broad-scope tier signals derived from aggregating narrow-scope evidence.
Single-skill product launches. Technology companies routinely launch single-feature products before broadening to platforms. The single-feature launch validates product-market fit faster and provides the customer base for subsequent platform development. The agent-platform parallel: pact templates should default to single-skill scopes for new agents, with broadening paths that activate after initial trust is established.
Implications for Platform Design
Five design implications follow.
Pact templates should default to narrow scopes. New agent registrations should start with the narrowest scope consistent with the agent's intended use. The platform's pact-template UI should make narrow scopes the default, broad scopes an explicit choice with the time-to-trust cost displayed. This guides agents toward statistically efficient initial scopes.
Tier thresholds should be scope-aware. The platform can either (a) keep thresholds scope-blind, producing the observed 2x time-to-trust gap, or (b) tune thresholds to equalize time-to-trust across scopes, preserving uniform promotion times at the cost of tier labels whose statistical precision varies with scope. The choice should be explicit and the tradeoffs published.
Recertify on scope expansion. When an agent broadens its scope substantially (e.g., adding new skills, new output types), it should re-pass enough evals to restore score precision at the new scope. Without recertification, scope-narrowing becomes a gaming vector.
Publish per-pact time-to-trust expectations. New agents should see the predicted time-to-platinum for their chosen scope, derived from the model and calibrated against the platform's historical data. The expectation-setting reduces friction (agents understand why their tier promotion takes time) and creates incentive alignment (agents choose narrow scopes intentionally).
Encourage scope-broadening paths. Once narrow-scope agents reach platinum, they should be able to broaden scope through a structured path that preserves trust while expanding coverage. The path should require additional evals to cover the new scope dimensions but should not reset prior trust accumulated on the original scope.
Limitations and Open Questions
The model has four limitations.
Capability vs scope confounding. Our model treats capability as held constant across scopes. In practice, agents may self-select into scopes based on their capability profile, producing capability-scope correlation that confounds the time-to-trust effect. A controlled experiment — random assignment of capability-matched agents to narrow vs broad pacts — would cleanly identify the scope effect. We rely on observational data, which is suggestive but not definitive.
Scope-component weighting. Our geometric-mean composition of skill, input, and output breadth is one choice among several. Alternative weightings (e.g., arithmetic mean, or weighted by skill difficulty) could produce different scope estimates. We have not exhausted the design space; further empirical work could identify the optimal composition.
Long-tail behavior. For very-broad-scope agents (scope > 8 on our scale), the time-to-trust prediction may be unreliable because the operating distribution is too heterogeneous for the platform's eval coverage to be adequate. Such agents may fail to reach platinum within any reasonable time, not because they lack capability but because the platform's eval-coverage envelope does not contain them. We have not characterized this long tail empirically; the prediction holds for the scope range we have observed.
Scope-drift dynamics. Our analysis is cross-sectional, comparing different agents at different scopes. A longitudinal analysis — tracking single agents as their scopes evolve — would reveal the dynamics of trust accumulation under scope change. This is beyond the current paper's scope but is a natural follow-up.
Conclusion
Narrow-scope pacts earn trust faster than broad-scope pacts, by a factor of approximately two for platinum-tier promotion on Armalo's data. The mechanism is statistical: smaller scope means fewer failure modes, lower per-eval variance, and faster convergence to a stable score. The closed-form expression, in which time-to-trust scales as scope × variance × confidence² / precision², makes the relationship explicit and the calibration tractable.
We have shown that on Armalo's 71 pacts and 113 scored agents, the predicted 2x time-to-trust ratio between narrow and broad scopes holds empirically, that per-eval variance triples from narrow to broad scope, and that post-promotion score stability is also scope-dependent (narrow platinum is a more reliable signal than broad platinum).
The design implications are concrete: pact templates should default to narrow scopes, scope expansion should require recertification, and the choice between scope-aware and scope-blind tier thresholds should be made as an explicit design decision. Platforms that adopt these recommendations will accelerate tier promotion for specialists, produce more reliable trust signals, and unlock economic value that broad-scope-default systems leave unrealized.
The deeper claim is that scope is not merely an agent-private decision but a platform-public design parameter that shapes the statistical foundation of every trust signal the platform issues. Platforms that treat scope as an agent's private business mismodel their own infrastructure. Platforms that engage scope as a first-class design variable can structure their trust system to capture the leverage that specialization provides — without forcing agents into specializations they did not choose.
We publish the model, the calibration, and the design recommendations so that platforms can stop issuing trust signals whose statistical foundation is implicit and start issuing trust signals whose foundation is engineered.