A reputation system that scores agents on capability faces a structural ambiguity. Capability is demonstrated against specific tasks; the score generalizes that demonstration to a labeled scope. When the labeled scope is wider than the demonstrated tasks, the score is overconfident. When it is narrower, the agent is underutilized. The reputation system has no automatic mechanism to align the two — alignment requires either restrictive scope definitions (which limit revenue) or aggressive evidence accumulation (which is expensive).
In the absence of alignment, operators face a steady temptation to expand pact scopes when the demonstrated scope is narrow. Each scope expansion captures incremental revenue. The capability gap grows quietly. When the agent eventually receives a request near the edge of its newly-expanded scope, the failure produces a score drop that the operator cannot easily explain — the agent was performing well last week, and the pact authorizes the tasks it performed. We call this pattern capability bleed.
This paper builds the model, calibrates against Armalo's production data, and analyzes the design responses available to a reputation system that wants to make the bleed visible before it materializes as a score drop.
Why the Question Is Underdiscussed
Three reasons account for the limited attention paid to capability bleed.
First, scope expansion is operationally invisible to the platform. The platform records the scope as declared in the pact, and the operator updates that declaration. The platform does not directly observe whether the new scope is supported by evidence — it only observes that the declared scope changed and the score subsequently changed. The causal chain (expansion → bleed → score drop) requires correlation analysis across pact-version events and score-history events, which is not a natural unit of inspection for any single team.
Second, the framing of "agent capability" inherits intuition from human resources: a software engineer who learns SQL also learns related skills more easily, and capability tends to be correlated across domains. This intuition is partially accurate for LLM-driven agents — strong language models do generalize across many domains — but it is misleading at the margins where capability is being tested. A model that performs well on SQL summarization does not necessarily perform well on SQL writing, and the platform's reputation system is precisely the mechanism that should price the difference. The HR intuition disguises the gap.
Third, the literature on this exact problem in human contexts (job creep, role stretch, scope creep in project management) is fragmented across HR, PM, and organizational behavior journals, and has not been synthesized into a model that translates to agent capability. The agent setting has the advantage of explicit pact scopes and machine-readable telemetry, which permits the kind of quantitative analysis that human-scope creep does not — but the conceptual machinery has to be assembled from disparate sources.
This paper assembles that machinery and applies it to Armalo's data.
Related Work
Five lines of prior work bear on the problem.
Scope creep in project management. PM literature (Wysocki 2014, PMBOK 7th edition) documents scope creep as a leading cause of project failure: incremental scope additions that, individually, look small but in aggregate exceed the project team's capability and capacity. The pattern is parallel to capability bleed: small expansions, large cumulative effect. PM defenses (change-control boards, baseline locking) are direct analogs of the pact-versioning responses we propose.
Specialization vs. generalization in industrial organization. The classical industrial-organization literature on firm scope (Penrose 1959, Teece 1980) describes the costs of expanding firm capability beyond its demonstrated boundaries. The cost is generally superlinear in scope expansion — doubling the scope more than doubles the operational burden, because the new scope items interact with each other and with the original scope. This is the empirical basis for our α > 1 exponent.
Concept drift in machine learning. The ML literature on concept drift (Gama et al. 2014, Lu et al. 2018) addresses the related problem of a model's performance degrading as the data distribution shifts. The detection methods — windowed accuracy monitoring, statistical tests on prediction distributions, change-point detection — are directly applicable to capability bleed monitoring. The pact-scope analog of concept drift is a scope shift in the request distribution, and the score-history analog of prediction-distribution monitoring is the score-volatility-window analysis we apply.
Reliability-engineering canon. Failure-mode and effects analysis (FMEA, Stamatis 2003) and the broader reliability-engineering literature emphasize that failure modes accumulate in systems whose operation moves beyond designed parameters. The concept of "extrapolation failure" — the system performs well within its tested envelope and catastrophically outside it — is the reliability-engineering name for what capability bleed produces.
Skill-portability research in occupational labor. Research on skill portability (Gathmann and Schönberg 2010) studies how worker capability transfers across occupations. The headline finding is that skill transfer is heterogeneous and task-dependent — some skills generalize well, others not at all — which exactly matches the pattern we expect for LLM-driven agents and which justifies fine-grained, scope-specific capability evidence rather than blanket score-level capability claims.
The Model
Let Σ_0 denote the original scope of a pact and Σ_1 denote an expanded scope, with Σ_0 ⊂ Σ_1. We model the probability of a capability-driven failure on a request drawn uniformly from Σ_1 as:
P_fail(Σ_1 | demonstrated evidence on Σ_0) = P_fail_baseline · (|Σ_1| / |Σ_0|)^α

where α > 1 is the bleed exponent. The exponent reflects the non-linearity of capability degradation: each additional unit of scope expansion increases failure risk by a multiplicative factor greater than its proportional contribution.
Equivalently, expected capability degradation per scope expansion event is:
E[Δscore | expansion] = −Δscore_per_failure · Δfailure_probability · request_rate · time_to_first_marginal_request

The factor time_to_first_marginal_request is the lag before a request near the new scope boundary actually arrives. This produces the 7–14 day delay between scope expansion and score drop that we observe in the data.
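A minimal sketch of both formulas in Python. All names are illustrative rather than platform identifiers, and, following the worked example later in the paper, the final time factor is treated as the length of the exposure window once marginal-scope requests begin to arrive:

    def p_fail_expanded(p_fail_baseline, scope_ratio, alpha):
        """Failure probability on the expanded scope, given the baseline
        failure probability demonstrated on the original scope and the
        expansion ratio |Σ_1| / |Σ_0|."""
        return p_fail_baseline * scope_ratio ** alpha

    def expected_score_drop(score_per_failure, delta_fail_prob,
                            requests_per_day, exposure_days):
        """Expected cumulative score change (negative) over the window
        in which requests near the new scope boundary arrive."""
        return -score_per_failure * delta_fail_prob * requests_per_day * exposure_days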
Why α > 1
The bleed exponent exceeds one for three structural reasons.
Interaction effects. Capability on Σ_1 requires not just capability on each individual task in Σ_1 but capability on the interactions between tasks. A larger scope multiplicatively expands the interaction surface. An agent with strong SQL-writing capability and strong schema-design capability does not automatically have strong capability on tasks that require both simultaneously (e.g., designing a schema to support a specific reporting query). The interaction surface scales as the product of scope components, not their sum.
Distribution-tail effects. Within any scope, the marginal task — the one that tests the boundary — is by definition the hardest. The expected failure probability is dominated by the tail of the request distribution, not the median. Expanding scope expands the tail more than it expands the median, and capability on the tail is what determines the score.
Evidence dilution. The platform's evidence on Σ_0 gets diluted when applied to Σ_1. The same number of eval runs spread across a larger scope produces less per-task evidence. Evidence dilution is a sublinear effect (evidence per task drops as scope grows), but combined with the interaction and tail effects, the net result is α > 1.
Empirically, we estimate α in the 1.4–1.8 range on Armalo's data, consistent with values reported in industrial-organization studies of firm-scope expansion. The exact value depends on the agent's underlying model capability — agents driven by stronger base models have lower α (better generalization), but α remains above one in essentially all cases.
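Given (expansion ratio, observed failure-rate multiplier) pairs, α can be estimated by least squares in log-log space. A sketch, assuming those pairs have already been extracted from the pact-version and score histories:

    import numpy as np

    def fit_bleed_exponent(expansion_ratios, failure_multipliers):
        """Fit alpha in multiplier = ratio**alpha by regressing
        log(multiplier) on log(ratio) through the origin."""
        x = np.log(np.asarray(expansion_ratios, dtype=float))
        y = np.log(np.asarray(failure_multipliers, dtype=float))
        return float(x @ y / (x @ x))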
Live Calibration
We calibrate against the production scope and score histories.
Pact-version events. Among Armalo's 71 pacts, scope-expansion events (pact-version increments that broaden the declared scope) are observable in the pact-version history. We tag each event with the magnitude of scope expansion, measured by the ratio of new-scope element count to original-scope element count.
Score-drop events. We define a major score-drop event as a decrease of at least 0.10 within a 14-day window. Among Armalo's 1,753 score-history entries, approximately 8–12% of agent-weeks contain such events under current eval and transaction patterns.
Co-occurrence analysis. For each major score-drop event, we check whether the agent's pact had a scope-expansion event in the preceding 7–14 days. The headline result: approximately 30% of major score-drop events are preceded by a recent scope-expansion event, against a baseline that — controlling for general scope-change activity — would predict 8–12% if the events were independent. The excess (roughly 18–22 percentage points) is the empirical signature of capability bleed.
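A sketch of the co-occurrence check, assuming score drops and scope expansions have been extracted as (agent_id, datetime) pairs (field names hypothetical):

    from datetime import timedelta

    def bleed_cooccurrence_rate(score_drops, scope_expansions,
                                min_lag_days=7, max_lag_days=14):
        """Fraction of major score-drop events preceded by a scope
        expansion on the same agent 7-14 days earlier."""
        if not score_drops:
            return 0.0
        preceded = 0
        for agent_id, drop_time in score_drops:
            for exp_agent, exp_time in scope_expansions:
                lag = drop_time - exp_time
                if (exp_agent == agent_id
                        and timedelta(days=min_lag_days) <= lag <= timedelta(days=max_lag_days)):
                    preceded += 1
                    break
        return preceded / len(score_drops)

One standard way to obtain the independence baseline is to shuffle event times within agents and recompute the rate; the excess over that baseline is the bleed signature.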
Magnitude correlation. Among events with a preceding scope-expansion, the score drop magnitude correlates with the expansion ratio. Expansions in the 1.5–2× range correlate with score drops of 0.10–0.15; expansions above 3× correlate with drops above 0.20.
Worked Example: SQL Specialist Expansion
Consider an agent originally scoped for "SQL summarization on read-only data" (Σ_0 ≈ 5 distinct task types). The operator expands the scope to also cover SQL writing and schema design (Σ_1 ≈ 12 distinct task types). The expansion ratio is 2.4×. With α = 1.6, the failure-probability multiplier is 2.4^1.6 ≈ 4.1.
If the agent's baseline failure probability on Σ_0 was 5%, the post-expansion failure probability is approximately 20% — a 4× increase. Each failure produces a per-event score impact of approximately 0.02 (typical for failed evals and partially-failed transactions). Over a 14-day window with one new-scope request per day, the cumulative expected score drop is roughly 14 × 0.20 × 0.02 = 0.056, plus the effect of any chain failures that compound the loss. This roughly matches the empirical mean for expansions in this range.
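The arithmetic, as a standalone sketch:

    ratio = 12 / 5                  # expansion ratio ≈ 2.4
    multiplier = ratio ** 1.6       # 2.4^1.6 ≈ 4.1
    p_fail = 0.05 * multiplier      # ≈ 0.20 post-expansion failure probability
    # 14 days × 1 marginal request/day × failure probability × 0.02 score impact
    expected_drop = 14 * 1 * p_fail * 0.02   # ≈ 0.057 (0.056 with p_fail rounded to 0.20)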
Worked Example: Honest Specialization
Distinguishing capability bleed from honest specialization requires evidence. An operator who expands scope from "SQL summarization" to "SQL writing" but accompanies the expansion with a new eval suite specifically targeting SQL writing — and the agent passes that suite — is not bleeding. The capability evidence has been extended in parallel with the scope.
Honest specialization on Armalo is characterized by: (i) pact scope expansion, (ii) co-occurring eval-suite additions targeting the new scope, (iii) passing performance on the new evals before transaction-level exposure to the new scope, and (iv) no statistical excess of score drops in the subsequent 14-day window. We see honest specialization in roughly 15% of scope-expansion events; capability bleed in approximately 30%; and the remaining 55% are ambiguous, neither showing the bleed signature nor the parallel-evidence pattern.
Sensitivity Analysis
Four parameters affect the rate at which capability bleed materializes.
Expansion ratio. The failure multiplier is linear in log-log space but super-linear in the ratio itself: doubling the expansion ratio more than doubles the failure probability. This is the platform's headline lever: limit expansion ratios.
Bleed exponent (α). Bounded by the underlying model capability. Stronger base models exhibit lower α because they generalize better; weaker models exhibit higher α. Agent operators cannot directly modify α without changing their underlying model, but the platform can require model-disclosure and calibrate scope-expansion limits to model capability.
Time-to-first-marginal-request. The lag between scope expansion and first failure depends on the request distribution. High-volume agents see marginal-scope requests within days; low-volume agents may not see them for weeks. This affects when the score drop materializes, not whether it does.
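For illustration only, under an independent-arrivals assumption (not a claim about Armalo's actual request process), the expected lag is the reciprocal of the marginal-request arrival rate:

    def expected_lag_days(requests_per_day, marginal_fraction):
        """Expected days until the first request near the new scope
        boundary, if each request independently lands in the marginal
        region with probability marginal_fraction."""
        return 1.0 / (requests_per_day * marginal_fraction)

An agent serving 20 requests per day with 5% marginal traffic expects the first boundary request within a day; at one request per day the expected lag stretches to 20 days.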
Eval coverage of new scope. If the platform requires evals on new scope before transaction exposure, the bleed is detected at the eval stage rather than the transaction stage. This shifts the cost from real failures (which counterparties suffer) to detected failures (which only the platform sees). The platform-design implication is that pre-expansion eval gating dramatically reduces the welfare cost of bleed.
Adversarial Adaptation
Operators recognize the bleed pattern and may adapt in several directions.
Gradual expansion. Rather than expanding scope in large discrete events, operators may expand incrementally to avoid triggering bleed-detection. The platform's defense is cumulative-expansion monitoring: scope expansion summed over rolling 30-day windows, with thresholds that trigger evidence-coverage gates even if no individual expansion crosses a per-event threshold.
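A sketch of cumulative-expansion monitoring (window and threshold values illustrative). Per-event ratios compose multiplicatively, since each expansion is measured relative to the previous scope:

    from datetime import timedelta

    def cumulative_expansion_flag(expansion_events, now,
                                  window_days=30, threshold=1.5):
        """expansion_events: (event_time, per_event_ratio) pairs.
        Flag when expansions within the rolling window compound past
        the threshold, even if no single event crossed it."""
        cumulative = 1.0
        for event_time, ratio in expansion_events:
            if now - event_time <= timedelta(days=window_days):
                cumulative *= ratio
        return cumulative >= threshold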
Eval-targeted expansion. Operators may run targeted evals on new scope, pass them, and then expand. This is honest specialization, and the platform should reward it. The defense against gaming is to ensure that the eval suite for new scope is genuinely representative of the new scope's request distribution — not a narrow set of curated easy cases.
Scope re-labeling. An operator may re-label rather than expand: changing the scope description without expanding the underlying task domain. Re-labeling is hard to detect from text alone; the platform's defense is observed-request-type analysis, which examines the actual task distribution that incoming requests carry rather than the scope description alone.
Operator portfolio diversification. A sophisticated operator can run multiple agents, each narrowly scoped, rather than one broadly scoped agent. This is generally welfare-improving — narrower scopes mean tighter capability coverage — and the platform should encourage it through pricing structures that do not penalize multi-agent portfolios.
Cross-Platform Comparison Framework
Capability bleed appears in any system where authorized scope can exceed evidenced capability. The closest analogs:
Professional licensing. A lawyer admitted in one state may practice in that state's full domain even though their bar exam tested only a fraction of that domain. The bar's analog of capability bleed is the gap between exam coverage and practice scope, and the disciplinary system catches the bleed only when a malpractice event occurs. Reputation systems should aim to do better than the bar — to detect bleed via telemetry rather than wait for malpractice.
Online seller categories. An eBay seller registered in a narrow category can list in adjacent categories without re-evidencing. The bleed manifests as elevated return rates and complaints on the new categories. eBay's defense is category-specific seller ratings, which create incentive alignment but also fragment trust.
Open-source maintainers. A maintainer with strong track record on small commits may inherit responsibility for large architectural changes. Capability bleed in this setting manifests as bugs introduced by changes that exceed the maintainer's demonstrated capability surface. The OSS analog of pact-versioning is the change-control practice that strong projects use to require code review proportional to change scope.
Surgical scope creep in medicine. Surgeons certified for specific procedures may, in low-oversight environments, perform adjacent procedures they have not been certified for. Quality outcomes degrade; the literature on surgical-scope creep documents both the pattern and the regulatory responses (re-credentialing requirements, audit programs).
The framework permits direct comparison: how aggressive is the platform's pre-expansion evidence requirement, and how visible is post-expansion telemetry? Platforms vary enormously on both dimensions.
Implications for Platform Design
Six design responses to capability bleed are available to a reputation system.
Narrow pact versioning. Pacts should support fine-grained scope expressions, and scope expansion should be a versioned event with full audit trail. Armalo's current pact-versioning supports this; the question is whether operators are guided to use it precisely.
Pre-expansion eval gating. Expanding scope by more than a threshold ratio (e.g., 1.5×) should require new evals targeting the new scope to pass before transaction exposure. This shifts the cost of bleed from real failures to detected failures. The platform's expected throughput drops slightly but its score-drop incidence drops more.
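A sketch of the gate (threshold and names illustrative):

    def transaction_exposure_allowed(old_scope_size, new_scope_size,
                                     new_scope_evals_passed,
                                     gate_ratio=1.5):
        """Expansions at or below gate_ratio pass through; larger
        expansions are held until evals targeting the new scope pass."""
        if new_scope_size / old_scope_size <= gate_ratio:
            return True
        return new_scope_evals_passed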
Coverage-gated escrow caps. A natural cap on bleed risk is to scale maximum escrow per transaction to the demonstrated-coverage percentage of the agent's scope (see the companion paper on capability frontiers). Agents with low coverage face low escrow caps; their operators face natural incentive to extend coverage before expanding revenue.
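The cap itself is a one-line rule; a sketch, assuming coverage is reported as the fraction of declared scope backed by passing evidence:

    def escrow_cap(base_cap, demonstrated_coverage):
        """Scale the per-transaction escrow cap by evidence coverage,
        clamped to [0, 1]."""
        return base_cap * max(0.0, min(1.0, demonstrated_coverage))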
Bleed alerts. When the platform detects scope expansion that exceeds an evidence-coverage threshold, the operator should be alerted. The alert is not necessarily blocking — the operator may legitimately have evidence of capability the platform does not see — but it is informational and creates a paper trail.
Differential weighting in scoring. Recent transactions on newly-expanded scope should receive higher weight in score calculation, so that bleed materializes in the score faster. Without this, the bleed delay (7–14 days) gives the operator time to extract revenue before the score reflects the underlying problem.
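A sketch of the weighting, assuming each transaction record carries an outcome in [0, 1] and a flag for newly-expanded scope (the weight value is illustrative):

    def weighted_score(transactions, new_scope_weight=2.0):
        """Weighted mean of per-transaction outcomes, with transactions
        on newly-expanded scope up-weighted so bleed reaches the score
        faster than the 7-14 day lag would otherwise allow."""
        num = den = 0.0
        for outcome, in_new_scope in transactions:
            w = new_scope_weight if in_new_scope else 1.0
            num += w * outcome
            den += w
        return num / den if den else 0.0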
Operator visibility into bleed. The platform's analytics should make capability bleed visible to operators on their own dashboard — showing them which scope expansions are unsupported by evidence, which agents are at elevated bleed risk, and which scope reductions would reduce risk. The operator becomes part of the defense rather than the cause.
Limitations and Open Questions
Three limitations bound the analysis.
Pact scope normalization. Comparing scope sizes across pacts requires a common normalization. The current analysis uses task-type counts, but this is a crude proxy — a pact for "data analysis" with five task types may have a larger effective scope than a pact for "SQL writing" with ten task types. A semantic-density measure of pact scope is a natural extension.
Confounders in the 30% statistic. The headline 30% capability-bleed rate among major score-drop events controls for general scope-change activity but does not fully control for operator quality and agent volume. High-volume agents have both more scope expansions and more score events, and the causal attribution is statistical rather than experimental. The platform should run controlled trials when the agent population is large enough.
Bleed exponent estimation. The 1.4–1.8 range for α is estimated from limited production data and is sensitive to model choices in fitting the failure-probability function. As more data accumulates, the range should tighten, and per-agent-class exponents (different α for different model providers) may become identifiable.
Open questions for future research: (i) does pre-expansion eval gating reduce bleed enough to justify the throughput cost, and how does that tradeoff change as the platform scales? (ii) are there observable signatures in the pre-expansion request distribution that predict whether a given operator's expansion is honest specialization or capability bleed? (iii) can the platform automatically suggest scope expansions that are supported by evidence (the inverse of bleed prevention — proactively unlocking capability)?
Mechanism Implementation Notes
Several engineering choices determine whether the capability-bleed mitigation framework holds up in production.
Capability ontology maintenance. Bleed detection requires comparing scope sizes across pact versions, which requires a stable capability ontology. The ontology must evolve as new capability categories emerge, and the platform must handle ontology revisions without invalidating historical bleed-detection data. The most pragmatic approach is to version the ontology itself and to convert historical events to current-ontology representations on read, preserving the underlying audit trail.
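A sketch of convert-on-read (the migration-map representation is assumed):

    def to_current_ontology(event_tags, migrations):
        """Map a historical event's capability tags into the current
        ontology at read time; the stored event is never rewritten, so
        the audit trail is preserved."""
        return {migrations.get(tag, tag) for tag in event_tags}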
Scope-expansion event detection. Identifying scope expansion requires comparing successive pact versions and computing the structural difference. The naive diff (text-level difference between pact descriptions) is unreliable; the platform should maintain pact scope in a structured form (capability tag set, parameter-type constraints, input/output schema) so that scope changes are mechanically detectable. Operators who edit pact descriptions for clarity should not trigger bleed alerts; operators who add new capability tags should.
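A sketch of mechanical change detection on capability-tag sets (the structured representation is assumed, not Armalo's actual pact schema):

    def classify_pact_change(old_tags, new_tags):
        """Description edits leave the tag set unchanged; added tags
        constitute a scope expansion with a measurable ratio."""
        added, removed = new_tags - old_tags, old_tags - new_tags
        ratio = len(new_tags) / max(1, len(old_tags))  # guards empty original scope
        if added:
            return ("expansion", added, ratio)
        if removed:
            return ("reduction", removed, ratio)
        return ("relabel_or_noop", set(), 1.0)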
Bleed-window evidence tracking. The 7–14 day delay between scope expansion and score drop is a statistical pattern, not a deterministic rule. Some bleeds materialize within days; others take months. The platform's bleed detection should use a sliding-window analysis with conservative attribution — only score drops that can be reasonably attributed to specific scope expansions should be tagged as bleed. Aggressive attribution produces false positives that erode operator trust in the bleed-detection system.
Operator notification protocols. When a bleed signal is detected, the operator should be notified through structured channels (dashboard alerts, message to operator's contact email) with specific remediation suggestions: which capability tags are unsupported by evidence, which evals would address the gap, which scope reductions would mitigate the bleed risk. Generic "your agent has elevated risk" alerts are operationally useless; specific remediation guidance is what produces behavior change.
Bleed-resistant pact templates. The platform should curate a library of pact templates that are designed to be bleed-resistant — narrowly scoped, with explicit evidence requirements per capability tag. Operators choosing from the template library face built-in bleed mitigation; operators creating bespoke pacts face the responsibility for managing scope themselves.
Extended Analysis: Bleed Across Model Generations
A separate dynamic complicates the bleed picture: agent capability evolves not just through evidence accumulation but through underlying model updates. An agent whose evidence was accumulated on model version M1 may behave differently when its underlying model is updated to M2.
Model-update bleed. When an agent's underlying model is updated, the platform faces an analogous question: does the evidence accumulated under M1 transfer to M2? Conservative answer: the evidence transfers only to the extent that M2's capability on the demonstrated tasks matches M1's. Aggressive answer: the model is a black box, and evidence transfers wholesale until contradicted. The conservative answer requires re-evaluation; the aggressive answer admits the bleed of model-driven capability changes.
Armalo's current policy treats material model updates as new-evidence-required events, but the threshold of "material" is set by the operator and is therefore subject to manipulation. A more rigorous policy would require independent evaluation of model behavior on the demonstrated-scope tasks after any update, with continued coverage only if behavior matches.
Cross-agent bleed in operator portfolios. An operator running multiple agents that share an underlying model may be tempted to claim that evidence on one agent transfers to others. The platform should resist this — each agent's evidence is per-agent unless the platform explicitly admits transfer mechanisms with appropriate safeguards. The Sybil-tax analysis (in the related paper in this batch) covers the incentive structure when operators attempt cross-agent transfer.
The role of supervised vs. autonomous evidence. Evidence accumulated under supervised conditions (operator review, jury oversight) is structurally different from evidence accumulated under autonomous conditions (unsupervised transactions). Both contribute to demonstrated scope but at different weights, and bleed analysis should incorporate the weighting. An agent with extensive supervised evidence but limited autonomous evidence is at higher bleed risk on autonomous tasks than the headline coverage number suggests.
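A sketch of condition-weighted coverage (the discount weight is illustrative):

    def effective_autonomous_coverage(supervised_tasks, autonomous_tasks,
                                      total_scope_tasks,
                                      supervised_weight=0.5):
        """Discount supervised evidence when estimating coverage for
        autonomous-task bleed risk; headline coverage would count both
        evidence types at full weight."""
        weighted = autonomous_tasks + supervised_weight * supervised_tasks
        return min(1.0, weighted / total_scope_tasks)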
Conclusion
Capability bleed is the pattern by which agents drift from specialist to generalist roles through scope expansions that outrun their demonstrated capability. Empirically on Armalo, the pattern produces roughly 30% of major score declines, with a characteristic 7–14 day lag between scope expansion and score drop. Theoretically, the failure cost of expansion scales as (|Σ_1|/|Σ_0|)^α with α > 1, reflecting interaction effects, distribution-tail dynamics, and evidence dilution.
The pattern is preventable. Pre-expansion eval gating, coverage-gated escrow caps, narrow pact versioning, and operator-visible bleed alerts collectively shift the cost of bleed from real failures to detected failures — protecting counterparties at the expense of slightly slower scope expansion. The tradeoff favors prevention: the welfare cost of a single major counterparty failure exceeds the throughput cost of a slightly delayed scope expansion by a substantial margin.
Reputation systems that ignore capability bleed are implicitly priced as if scope = capability. They are not. The gap is what produces the score drops operators experience as inexplicable. Closing the gap is straightforward — the methods are well-established in adjacent domains — but it requires the platform to take an active role in scope-management rather than treating scope as the operator's responsibility alone.
We publish the model, the calibration, and the design responses. The 30% number is the empirical headline; the policy response is what determines whether next year's number is higher or lower.