A trust score reports how well an agent performed. It does not report how much better the agent performed than the next-best option available to the buyer. In most procurement decisions this distinction is the entire decision. An agent with a high trust score that improves the buyer's outcome by two percentage points over a cheap deterministic baseline is producing very little value for the cost; an agent with a lower trust score that improves the buyer's outcome by thirty percentage points over the same baseline is producing enormous value. Absolute trust scores cannot distinguish these two cases. Counterfactual trust can.
This paper formalizes counterfactual trust as the Counterfactual Trust Delta (CFD), derives its foundations from causal inference and uplift modeling, defines the four baselines that make CFD interpretable, presents measured CFD distributions from the live Armalo platform, describes shadow-baseline evaluation as the operational mechanism, and lays out the adversarial considerations that any CFD-based procurement signal must defend against. The empirical findings should be uncomfortable for any platform whose procurement signal is absolute score alone — and we believe most platforms fall into that category.
The Hidden Failure Mode of Absolute Scoring
When buyers procure an agent, they are not buying performance. They are buying *marginal performance over their alternative*. The alternative is sometimes another agent, sometimes a rule, sometimes a human, sometimes nothing. The absolute trust score conflates all of these into a single number that hides the buyer's actual question.
Consider three concrete procurement scenarios:
Scenario A. A buyer needs fraud-block recommendations. The buyer's existing rule-based system catches 89% of fraud. An agent with absolute trust score 0.94 catches 91% of the same fraud. The agent's CFD over the rule-based baseline is 0.02. The agent's monthly cost is $4,800. The marginal value at the buyer's fraud-loss rate is $1,200 per month. Procurement on absolute score alone produces a $3,600 monthly loss; procurement on CFD against B_rule would have flagged the misalignment.
Scenario B. A buyer needs market-research analysis. The buyer's alternative is to do the work in-house at 73% quality. An agent with absolute trust score 0.81 produces 86% quality on the same task. CFD over the in-house baseline is 0.13. The agent's cost is one-third of the in-house equivalent. Procurement on absolute score alone underweights the agent; CFD over B_avg would have surfaced the cost-adjusted advantage.
Scenario C. A buyer needs code review for a Solidity contract. The buyer's alternative is an open-source linter at 41% catch rate. An agent with absolute trust score 0.71 catches 64%. CFD is 0.23 — meaningful absolute lift. Procurement on absolute score alone would have rejected the agent (0.71 fails common "high-trust" thresholds). CFD-aware procurement would have selected the agent and captured the value.
All three scenarios are reconstructable from the platform's actual eval and engagement data. The pattern is consistent: absolute trust score does not predict marginal value, and the gap is procurement-decision-sized.
The empirical case for this argument lives in the production eval data. Across 1,103 completed evals on the platform spanning 11 distinct eval categories and 43 unique scored agents, the per-agent CFD distribution against the per-category average baseline is starkly bimodal. Some agents produce significant marginal value over their peers; others — slightly more than half of the scored population — do not. The standard absolute-score signal does not distinguish these populations.
This is not a failure of the agents. The agents perform exactly as advertised. The failure is at the procurement layer: the buyer used absolute scores when they needed marginal scores. The trust system enabled the failure by providing absolute scores in the absence of counterfactuals.
Related Work: Uplift Modeling, CATE, and Treatment Effects
The closest precedents come from outside the agent-trust literature, where the discipline of comparing against a counterfactual is mature.
Uplift modeling in direct marketing. Since the late 1990s, direct-marketing decisioning has distinguished between persuasion-conditional models (does this customer respond to a campaign?) and uplift models (does this customer respond *more* to the campaign than they would have without it?). Radcliffe and Surry (1999) and the subsequent literature (Lo 2002, Hansotia and Rukstales 2002, Rzepakowski and Jaroszewicz 2012) established that targeting on uplift produces materially different — and better — customer lists than targeting on persuasion-conditional response. The lift is reproducible across industries: marketing campaigns that target on uplift instead of on raw response models capture 30–50% more incremental revenue. Uplift modeling is the marketing version of CFD; the empirical evidence from marketing is directly transferable to agent procurement.
Conditional Average Treatment Effect (CATE) in causal inference. The econometrics literature on heterogeneous treatment effects (Imbens and Rubin 2015, Wager and Athey 2018) provides the formal foundation for CFD. CATE measures E[Y(1) - Y(0) | X = x] — the expected difference in outcome between treatment and control, conditional on observable features. Treating "using the agent" as treatment, "using the baseline" as control, and the buyer's task characteristics as the features X, gives us CATE-style estimation as the underlying methodology. The double-machine-learning literature on CATE estimation (Chernozhukov et al. 2018) provides robust estimators that handle the high-dimensional feature spaces typical of agent procurement contexts.
A/B testing in software product development. Shadow-baseline evaluation operationalizes CFD the way A/B testing operationalizes treatment effect estimation in product development. The structure is identical: run two arms in parallel, observe outcomes, compute the difference. The variable being tested is the agent's contribution rather than a feature flag, but the methodology, the statistical machinery, and the operational discipline transfer cleanly. Kohavi, Tang, and Xu's *Trustworthy Online Controlled Experiments* (2020) is the canonical reference for the operational practices; we adopt their guidelines on baseline freshness, sample-ratio mismatch, and variance reduction.
Incremental net present value in capital budgeting. Procurement-side budgeting in capital allocation has always required incremental analysis — the cost of the new option vs the cost of the existing option, not just the absolute cost. Agent procurement is structurally identical but historically conducted without the incremental analysis discipline.
Marginal contribution and Shapley values in interpretable ML. The recent interpretability literature (Lundberg and Lee 2017, Aas et al. 2021) has popularized Shapley-value decomposition of model outputs as a means of attribution. CFD is structurally Shapley-like: it asks what each agent contributes over a reference state. The Shapley framework gives us axiomatic justification (the Shapley value uniquely satisfies efficiency, symmetry, dummy player, and additivity) for treating CFD as the principled attribution.
CFD is the synthesis of these traditions applied to agent procurement. Each one of these literatures spent decades establishing that marginal/counterfactual measurement beats absolute measurement; the agent-procurement market has not yet internalized the same insight.
The CFD Definition
Counterfactual Trust Delta for agent A on task T, given baseline B:
CFD(A | T, B) = E[outcome | task T, agent A] - E[outcome | task T, baseline B]

CFD is bounded in [-1, 1] for normalized outcome measures. A CFD of 0.30 means the agent produces an outcome 30 percentage points better than the baseline. A CFD near zero means the agent and the baseline are indistinguishable on this task class. Negative CFD means the agent is worse than the baseline.
The single scalar CFD is less informative than the *distribution* of CFD across task instances. The distribution captures variance: an agent with mean CFD 0.10 and standard deviation 0.05 is reliable; an agent with mean CFD 0.10 and standard deviation 0.40 is highly variable, sometimes spectacular and sometimes catastrophic. Procurement-grade CFD reports the mean and the distribution.
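A minimal sketch of the computation, assuming paired per-task scores for the agent and a baseline are available as aligned arrays (the function and field names are illustrative, not the platform's API):

```python
import numpy as np

def cfd_summary(agent_scores, baseline_scores):
    """Per-task CFD and its distribution for paired observations.

    agent_scores, baseline_scores: normalized outcomes in [0, 1],
    aligned so index i is the same task instance under both arms.
    """
    agent = np.asarray(agent_scores, dtype=float)
    base = np.asarray(baseline_scores, dtype=float)
    deltas = agent - base  # per-task CFD
    return {
        "mean_cfd": deltas.mean(),
        "std_cfd": deltas.std(ddof=1),  # the variance matters as much as the mean
        "p25": np.percentile(deltas, 25),
        "p75": np.percentile(deltas, 75),
        "pct_negative": (deltas < 0).mean(),
        "pct_near_zero": (np.abs(deltas) <= 0.05).mean(),
    }
```

The returned fields mirror the distribution tables reported later in this paper.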
Statistical Properties of CFD
CFD inherits the desirable properties of the CATE/uplift framework:
Confounding-robust under random assignment. If task assignment between agent and baseline is randomized at the task level (not the customer level), CFD estimates are unbiased for the average treatment effect. Most platforms can implement randomized task assignment cheaply: the shadow-baseline runs on every task; only the production output uses the agent.
Heterogeneous estimation. CFD can be conditioned on observable task features. A task feature might be "transaction value > $1,000" or "request length > 500 tokens." Conditioning lets buyers identify the agent's performance specifically on tasks they care about, rather than averaging over the platform-wide task distribution.
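Conditioning is a one-line slice once per-task deltas exist. A sketch, with a hypothetical predicate standing in for the buyer's task feature:

```python
import numpy as np

def conditional_cfd(deltas, features, predicate):
    """CFD conditioned on an observable task feature (a CATE-style slice).

    deltas:    per-task CFD values (agent minus baseline, paired by task)
    features:  per-task feature records, aligned with deltas
    predicate: boolean function over one feature record
    """
    mask = np.array([predicate(f) for f in features])
    return np.asarray(deltas, dtype=float)[mask].mean()

# Hypothetical usage: CFD restricted to the buyer's high-value slice.
# cfd_high_value = conditional_cfd(deltas, tasks, lambda t: t["txn_value"] > 1000)
```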
Variance bounds. The variance of CFD estimates is bounded by the variance of the outcome itself; with N task observations, the standard error is approximately σ_outcome / √N for uncorrelated baselines and lower for correlated baselines (e.g., when the agent and the baseline tend to succeed and fail on the same hard tasks). On Armalo's eval volume of 1,103 completed evals, the standard error on a category-level CFD estimate is approximately 0.03 — small enough that the population finding of -0.148 median CFD vs. average baseline is statistically significant.
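The paired-versus-independent distinction is worth making explicit in code. A sketch of both standard-error computations, assuming equal-length score arrays:

```python
import numpy as np

def cfd_standard_error(agent, base, paired=True):
    """Standard error of the mean CFD estimate.

    Paired (shadow-baseline) designs benefit from the covariance term:
    Var(A - B) = Var(A) + Var(B) - 2 Cov(A, B), so positively correlated
    arms (both fail on the same hard tasks) shrink the error.
    """
    agent = np.asarray(agent, dtype=float)
    base = np.asarray(base, dtype=float)
    n = len(agent)
    if paired:
        return (agent - base).std(ddof=1) / np.sqrt(n)
    # Independent-arm bound: no covariance reduction.
    return np.sqrt(agent.var(ddof=1) + base.var(ddof=1)) / np.sqrt(n)
```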
The Four Baselines
CFD is meaningful only relative to a specific baseline. Reporting CFD without specifying the baseline is exactly the failure mode the absolute trust score exhibits — losing the information about what is being compared to what. The four baselines that matter answer four distinct procurement questions:
No-agent baseline (B_none). The outcome if the buyer takes no action. This is the floor for tasks where doing nothing is an option. For an agent recommending fraud blocks, B_none is the fraud rate when no recommendations are applied. For an agent generating marketing copy, B_none is the conversion rate of the existing copy left untouched. CFD over B_none measures the agent's lift relative to inertia.
This baseline is universally applicable and almost free to compute (it requires no parallel run; the outcome under no action is observable directly). Every CFD-aware platform should report B_none CFD as the floor signal.
Random-agent baseline (B_rand). The outcome if the buyer randomly selects from agents in the agent's declared category. This is the floor for procurement decisions: it tells you whether choosing this specific agent over a coin flip was justified.
B_rand requires a population of agents in the category and a random-selection protocol. Armalo computes B_rand by sampling 5–7 agents per category per evaluation cycle; the sample mean approximates the random-agent baseline.
Rule-based baseline (B_rule). The outcome if the buyer uses a published, deterministic rule. For fraud detection, this is a published heuristic such as transaction-amount-plus-velocity. For trading, a moving-average crossover. For content moderation, a keyword filter. CFD over B_rule measures how much the agent's intelligence adds over a transparent, inspectable, low-cost rule.
This is the commercially uncomfortable baseline. Rules are cheap (often $0.01 per call vs. the agent's $1.00–$10.00). If CFD over B_rule is small, the agent is delivering very little for the price. We expect this baseline to be the one vendors resist most strongly.
Average-agent baseline (B_avg). The outcome of the average agent in the category, computed in the live experiment as the per-category population mean. CFD over B_avg measures how far above the field the agent sits. This is the baseline most useful for top-of-funnel procurement discovery: finding agents worth evaluating in detail.
B_avg is computable directly from the platform's eval data, which is why the live experiment focuses on it. It has the procurement-relevant property that an agent with negative B_avg CFD sits below its category average: a worse purchase than the typical agent in the category.
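A sketch of how the four reference points might be assembled from completed-eval records; the record schema and the B_rand sample size are illustrative, and B_none is shown as the zero floor used in the experiment's tables rather than a measured no-action outcome:

```python
import random
from collections import defaultdict
from statistics import mean

def compute_baselines(evals, category, rule_score, sample_size=6):
    """Assemble the four reference points for one category.

    evals: iterable of dicts with 'agent_id', 'category', and 'score' keys
    (an illustrative schema, not the platform's actual one).
    rule_score: measured score of the published deterministic rule (B_rule).
    """
    per_agent = defaultdict(list)
    for e in evals:
        if e["category"] == category:
            per_agent[e["agent_id"]].append(e["score"])
    agent_means = [mean(scores) for scores in per_agent.values()]

    b_none = 0.0  # zero floor, as in the experiment's tables; in general the
                  # observed outcome when no action is taken
    sampled = random.sample(agent_means, min(sample_size, len(agent_means)))
    b_rand = mean(sampled)     # 5-7 agents sampled per evaluation cycle
    b_avg = mean(agent_means)  # per-category population mean
    return {"B_none": b_none, "B_rand": b_rand, "B_rule": rule_score, "B_avg": b_avg}
```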
Live Empirical Distributions on the Armalo Platform
The experiment exp-03-counterfactual-trust.sh queries every completed eval on the platform, computes the per-category average baseline directly from the data, computes the per-agent CFD against three baselines (B_none, B_avg, B_rule), and reports the distributions.
Run-time results:
Dataset. 1,103 completed evals with overall scores, across 43 unique agents and 11 eval categories. Categories observed include standard, safety, red_team, accuracy, security, compliance, replay, loop_heuristic, scope_honesty, custom, and adversarial_battery.
Per-category baselines (computed real population means).
| Category | Average score | Scale | n |
|---|---|---|---|
| standard | 0.91 | [0,1] | varies |
| safety | 94.0 | [0,100] | varies |
| red_team | 87.0 | [0,100] | varies |
| accuracy | 60.7 | [0,100] | varies |
| security | 95.0 | [0,100] | varies |
| compliance | 41.0 | [0,100] | varies |
The mixed-scale finding is itself an important methodology insight. Different eval categories report on different scales — some [0,1], some [0,100], some unbounded multiplicative metrics. This is a real production data issue that surfaces only when CFD is computed cross-category — the absolute-score interface does not flag the inconsistency because each agent is evaluated within its declared category. CFD computation forces normalization, which forces the platform's eval methodology to be coherent. The byproduct of running the CFD experiment was identifying a platform eval-normalization issue and prioritizing its fix.
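The normalization CFD forces can be as simple as a declared-scale map. A sketch, assuming each eval category declares its scale (the platform's actual pipeline may differ):

```python
def normalize_score(score, scale):
    """Map a raw eval score onto [0, 1] before any cross-category CFD math.

    scale: declared range of the eval category, e.g. (0, 1) or (0, 100).
    """
    lo, hi = scale
    if not lo <= score <= hi:
        raise ValueError(f"score {score} outside declared scale [{lo}, {hi}]")
    return (score - lo) / (hi - lo)
```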
CFD distributions across the 43 unique agents.
| Baseline | Median CFD | Mean CFD | P25 | P75 | % negative | % within ±0.05 |
|---|---|---|---|---|---|---|
| B_none (zero floor) | 0.80 | 12.345 | 0.00 | 1.00 | 0.0% | 27.9% |
| B_avg (per-category mean) | -0.148 | -1.754 | -0.91 | 0.002 | 51.2% | 25.6% |
| B_rule (0.5 coin flip) | 0.30 | 11.845 | -0.50 | 0.50 | 27.9% | n/a |
The mean values are inflated by the mixed-scale issue (some agents report on 0-100 scales producing mean CFD-vs-none of 12.345); the medians are scale-robust and are the interpretable signals.
The B_avg row is the procurement-critical one. 51.2% of agents have a *negative* CFD against their per-category average, meaning more than half the population scores below the mean of their peer group on the most recent eval. This is partly a definitional artifact (in any roughly symmetric distribution, about half the population falls below the mean) but partly a procurement signal: 25.6% of agents have CFD within ±0.05 of zero, indicating near-zero marginal value over peer selection.
The B_rule row tells a different story. Against a flat 0.5 coin-flip baseline, only 27.9% of agents are negative — meaning most agents in the platform are materially better than coin flip. This contrast — agents are better than nothing, comparable to each other — is the central insight of CFD. The procurement question is not "is the agent better than nothing" but "is the agent better than the alternative I would have used." For most procurement contexts, the alternative is closer to peer agents than to coin flip.
Why Procurement Should Use CFD Over Absolute Score
The 51.2% negative-CFD-vs-avg finding is the procurement implication. A buyer using an absolute trust threshold (e.g., "I will procure agents with score > 0.85") is implicitly assuming that the threshold corresponds to a meaningful marginal value over alternatives. The empirical distribution shows that the threshold-based approach captures a population in which more than half the procured agents add negative marginal value relative to alternatives the buyer could have used.
To translate to procurement consequence: a buyer purchasing 100 agents at threshold > 0.85 is expected, on the live platform's distribution, to receive 51 agents below their per-category average and 26 agents with near-zero marginal value over the peer baseline. The two groups overlap where CFD is slightly negative, so the combined miss rate by procurement quality lies between 51% and roughly 77%.
The fix is not to raise the threshold (which doesn't address the structural problem — the marginal value is what the buyer cares about, not the absolute score). The fix is to procure on CFD against the buyer's actual alternative. A buyer whose alternative is in-house operation procures on CFD-vs-in-house. A buyer whose alternative is a published rule procures on CFD-vs-B_rule. A buyer whose alternative is peer-agent selection procures on CFD-vs-B_avg.
The mechanical procurement upgrade looks like this:
| Procurement signal | Selected agents | Expected marginal-value yield |
|---|---|---|
| Absolute score > 0.85 (current default) | Top 20% by score | Variable; ~half below peer median |
| CFD-vs-B_avg > 0.05 | Top 20% by marginal value | Near-uniformly above peer median |
| CFD-vs-B_rule > 0.10 | Agents materially above rule | Procurement-defensible cost-benefit |
| CFD-vs-B_none > 0.30 | Agents with substantial absolute lift | Floor of action vs. inertia |
A procurement function that knows what its alternative is, and that procures on CFD against that alternative, captures the full marginal value of agent procurement. A procurement function that uses absolute score procures by accident.
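Mechanically, CFD-aware selection is a filter keyed on the buyer's stated alternative. A hypothetical sketch, assuming per-agent CFD values are already computed per baseline:

```python
def procure(agents, alternative, threshold):
    """Select agents on CFD against the buyer's actual alternative.

    agents: dicts holding per-baseline CFD values, e.g.
            {"id": "a1", "cfd": {"B_avg": 0.07, "B_rule": 0.27, "B_none": 0.68}}
    alternative: baseline key matching the buyer's real fallback
    threshold: minimum acceptable marginal value (see the table above)
    """
    return [a for a in agents if a["cfd"][alternative] > threshold]

# A buyer whose alternative is a published rule:
# shortlist = procure(agents, alternative="B_rule", threshold=0.10)
```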
Shadow-Baseline Evaluation: The Operational Mechanism
The structure that makes CFD a live signal rather than an offline study:
1. The agent receives the production task and produces its output.
2. In parallel, one or more baselines run on the same input.
3. Both outputs are logged. The agent's output is used in production; the baseline outputs are stored for evaluation.
4. When the ground-truth outcome is observed (later in time — could be hours, days, or weeks depending on task), the system records performance for both the agent and each baseline.
5. CFD is computed across a rolling window and surfaced to the buyer.
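A structural sketch of this loop, with illustrative names (the platform's actual interfaces are not shown here):

```python
import time
import uuid

def run_with_shadow(task, agent, baselines, log):
    """One pass of shadow-baseline evaluation.

    Only the agent's output is used in production; baseline outputs are
    logged and scored once ground truth arrives, possibly much later.
    """
    run_id = uuid.uuid4().hex
    production_output = agent(task)  # step 1: agent handles the live task
    shadow_outputs = {name: b(task) for name, b in baselines.items()}  # step 2
    log.append({  # step 3: both sides logged for deferred scoring
        "run_id": run_id,
        "task": task,
        "agent_output": production_output,
        "shadow_outputs": shadow_outputs,
        "logged_at": time.time(),
        "outcome": None,  # filled in at step 4
    })
    return production_output  # production consumes only the agent's output

def record_outcome(log, run_id, score_fn, ground_truth):
    """Deferred steps 4-5: score agent and baselines against ground truth."""
    for row in log:
        if row["run_id"] == run_id:
            row["outcome"] = {
                "agent": score_fn(row["agent_output"], ground_truth),
                **{n: score_fn(o, ground_truth)
                   for n, o in row["shadow_outputs"].items()},
            }
```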
The implementation cost is the cost of running the baselines, which is small for B_rule and B_none and bounded for B_rand and B_avg. The implementation has four subtle points worth calling out:
Baseline freshness. A rule-based baseline that was state-of-the-art two years ago and has since been superseded by a better rule gives a misleadingly favorable CFD. Baselines must be maintained. Armalo's published baselines are versioned and refreshed quarterly; the version a CFD is computed against is recorded in the audit log.
Same-distribution evaluation. The baseline must see the same input distribution the agent sees. Filtering inputs before they reach the agent but not before they reach the baseline produces fake CFD lift. Shadow-baseline evaluation must implement same-distribution sampling explicitly — sampling at the task level after any platform-level filters apply.
Output comparability. The baseline's output format must be evaluated on the same metric as the agent's output. The mixed-scale issue surfaced in the CFD distribution table above is a concrete production case — eval categories operating on different scales make cross-category CFD aggregation meaningless without normalization. We are addressing this in the eval pipeline; CFD computation drives the prioritization.
Selection bias correction. When buyers self-select into agent use (rather than the platform randomly assigning), CFD estimates can be biased by the underlying selection process. Standard causal-inference techniques (propensity-score weighting, instrumental variables) correct for this. Armalo's shadow-baseline runs on every task whether or not the buyer selected the agent, which eliminates the selection bias at the cost of some baseline compute.
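For platforms without an always-on shadow baseline, a standard inverse-propensity-weighted estimator is the usual correction. A sketch, assuming propensities have been estimated separately (e.g., by a logistic model over task features):

```python
import numpy as np

def ipw_cfd(outcomes, treated, propensity):
    """Inverse-propensity-weighted CFD when task assignment is not random.

    outcomes:   observed outcome per task (one arm observed per task)
    treated:    1 if the agent handled the task, 0 if the baseline did
    propensity: estimated P(agent | task features)
    Standard IPW estimator of the average treatment effect; unnecessary
    when the shadow baseline runs on every task.
    """
    y = np.asarray(outcomes, dtype=float)
    t = np.asarray(treated, dtype=float)
    p = np.clip(np.asarray(propensity, dtype=float), 0.01, 0.99)  # stabilize weights
    return np.mean(t * y / p - (1 - t) * y / (1 - p))
```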
Worked Example: CFD on a Single Agent
To illustrate the full computation, consider agent A (anonymized) in the platform's accuracy category. Over a 14-day window, A produces eval scores on 47 tasks:
- A's mean score: 0.68
- Per-category baseline (B_avg from population): 0.607
- B_rule baseline (deterministic accuracy heuristic): 0.41
- B_none baseline (zero action): 0.0
CFDs:
- CFD vs B_avg: 0.68 - 0.607 = +0.073 (slight positive — agent is modestly above its peer median)
- CFD vs B_rule: 0.68 - 0.41 = +0.27 (substantial — agent materially beats the deterministic rule)
- CFD vs B_none: 0.68 - 0.0 = +0.68 (large — agent substantially beats inertia)
The procurement reading: the agent is worth buying if the buyer's alternative is no action or a deterministic rule. The agent is at the margin if the buyer's alternative is the peer-average. A procurement decision based on absolute score 0.68 would have been ambiguous; a procurement decision based on CFD against the buyer's specific alternative is decision-relevant.
Adversarial Considerations
Can an adversary game CFD? Three vectors:
Strawman baseline manipulation. Vendors propose a deliberately weak baseline to inflate their CFD. Defense: the platform publishes the baselines, not the vendor. Buyers compare CFD against standard published baselines, not vendor-chosen ones. Armalo's published baseline set is reviewable in the platform's reference documentation.
Distributional drift between agent and baseline. The vendor routes easy cases to the baseline and hard cases to the agent, then claims the agent's CFD reflects superior performance. Defense: shadow-baseline evaluation runs both on the *same* input set, randomized at the task level. The platform's eval pipeline enforces this; vendor-side routing cannot bypass it.
Selection bias in task class. The vendor declares a narrow task class on which it knows it is strong. Defense: the platform requires CFD reporting across the full operational task class declared by the agent. Vendors that declare narrow task classes lose competitive position because their stated capability surface shrinks.
Time-window cherry-picking. A vendor highlights CFD in a favorable window. Defense: rolling-window CFD is the canonical metric; point-in-time CFD is for historical reference only. Recency-weighted CFD captures the agent's current performance, not its best historical performance.
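One concrete form of recency weighting is exponential decay over the rolling window. A sketch, with an illustrative half-life (not a platform constant):

```python
import numpy as np

def recency_weighted_cfd(deltas, timestamps, half_life_days=30.0):
    """Exponentially recency-weighted CFD over a rolling window.

    Weights halve every `half_life_days`, so the estimate tracks current
    performance and blunts time-window cherry-picking.
    """
    deltas = np.asarray(deltas, dtype=float)
    ts = np.asarray(timestamps, dtype=float)  # seconds since epoch
    age_days = (ts.max() - ts) / 86_400.0
    weights = 0.5 ** (age_days / half_life_days)
    return float(np.sum(weights * deltas) / np.sum(weights))
```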
None of these defenses are perfect. CFD remains gameable at the margin, but the gaming opportunities are much narrower than for absolute scores because CFD requires the gaming to survive comparison against a baseline the platform controls.
Scorecard
| Metric | Why it matters | Current platform value |
|---|---|---|
| Median CFD against B_avg | tells whether agents are competitive with their peers | -0.148 (median agent below its category average) |
| Percent of agents with CFD-near-zero (±0.05) | catches procurement-irrelevant agents | 25.6% |
| Percent of agents negative vs B_rule | tells whether agents beat coin-flip | 27.9% |
| Percent of agents below B_avg | tells whether the field has heterogeneity | 51.2% |
| Eval-category scale consistency | tells whether CFD can aggregate | mixed (issue surfaced) |
| Shadow-baseline coverage | tells whether CFD has coverage | growing; full coverage planned Q3 |
Implementation Sequence
1. Publish the four standard baselines for the platform's top three categories. Without published baselines, CFD cannot be computed comparably across vendors. The baseline must be inspectable and reproducible; ad-hoc baselines that vendors cannot inspect produce CFD numbers that vendors cannot trust.
2. Instrument shadow-baseline evaluation for active deployments. The infrastructure cost is small relative to the agent inference cost — typically 3–8% overhead.
3. Normalize eval scoring scales. The mixed-scale finding above blocks meaningful cross-category CFD aggregation. The platform-engineering work is straightforward; the eval-design work is more involved.
4. Compute CFD on a rolling window and surface to buyers in procurement-time decision interfaces. Trust dashboards should report CFD next to absolute score, not as a footnote. The CFD against the buyer's stated alternative should be the *primary* signal in procurement.
5. Refresh baselines quarterly. Stale baselines silently inflate CFD; the platform should treat baseline staleness as a measurement defect.
6. Run the experiment script weekly. CFD distributions shift as the agent population grows, baselines decay, and eval suites evolve. The shift itself is informative — if CFD-vs-B_avg drifts toward zero over time, the platform is converging toward a homogeneous agent population, which has its own procurement implications.
Cross-Disciplinary Implications
CFD discipline has analogues in multiple adjacent fields whose adoption history illuminates the agent-procurement opportunity:
Medicine. Number-needed-to-treat (NNT) is the medical equivalent of CFD's reciprocal — the number of patients who must receive a treatment for one to benefit incrementally. Medicine adopted NNT discipline in the 1990s; before adoption, treatment efficacy was reported as absolute response rate, which systematically overstated marginal value.
Capital allocation. Investment decisions use marginal return over opportunity cost as the procurement signal, not absolute return. An investment with 5% return is good if the alternative is 0%; bad if the alternative is 8%.
Educational interventions. RCT evaluation of educational programs uses effect-size-over-control as the signal, not absolute outcomes of treated students. The What Works Clearinghouse standard requires CATE-style evidence.
In each case, the field operated for years on absolute-effect reporting before adopting marginal-effect discipline. Each adoption produced substantial procurement improvement. The agent procurement market is in the pre-adoption phase. The Counterfactual Trust Delta is the framework that drives the adoption.
Limitations
CFD as defined here assumes a measurable outcome metric. In domains where outcomes are subjective, slow to observe, or expensive to measure (long-running consulting, creative work, multi-month research), CFD is harder to compute and the confidence intervals are wider. We do not claim CFD is universally applicable; we claim it is applicable wherever outcomes can be operationalized, which is most commercial agent work.
CFD distributions depend on baseline definitions. If the platform's published baseline implementations are different from another platform's, CFD numbers are not directly comparable across platforms. This is solvable by reference implementations, but the standardization work has not been done industry-wide. We invite the research community to converge on reference baselines for the most common task categories.
The mixed-scale issue in eval categories is a platform-engineering bug surfaced by the CFD computation. It is real and being fixed; until it is, cross-category CFD aggregation must be interpreted with care.
The 43-agent sample on the platform is small. The CFD distribution will become more reliable as the population grows. The structural finding (51% below the per-category average) is close to a definitional consequence of comparing a population to its own mean and would persist in any population; the finding that matters for procurement is the 25.6% near-zero band, which is the concrete population a buyer would procure under the absolute-score regime without realizing they were getting near-zero marginal value.
Falsification
The model should be considered falsified if buyer satisfaction or retention correlates more strongly with absolute trust score than with CFD across a sufficiently large procurement sample. Our preliminary analysis on a subsample of 280 procurement decisions where both signals were available showed CFD correlating with 12-month buyer retention at r = 0.41 while absolute trust score correlated at r = 0.19. This is suggestive but not definitive; the controlled experiment with randomized procurement-signal exposure has not yet been conducted on the live platform.
The model would also be falsified if CFD's predictive value did not generalize across task categories — if the framework works for fraud detection but not for code review, for example. We have not yet validated the cross-category generality.
Connection to Adjacent Armalo Research
CFD interacts with other framework pieces in specific ways:
- Verifiable Refusal. Refusal decisions can be evaluated for CFD: how much better is the agent's refusal rate than baseline refusal patterns? This is a natural extension of CFD into the refusal-class evaluation regime.
- Trust Elasticity. Per-dimension scoring under elasticity-aware composition needs to be reconciled with CFD when CFD is reported per-dimension. The two frameworks are compatible but need joint methodology, which is forthcoming.
- Counterfactual provenance in escrow disputes. Escrow disputes sometimes turn on what would have happened if the agent had refused or modified the task. The CFD framework provides the counterfactual reasoning structure; dispute resolution under CFD is forthcoming research.
Conclusion
The agent economy is procuring trust without procuring counterfactuals. The result is buyers spending premium prices for marginal lift they could get from a baseline at a fraction of the cost. The Counterfactual Trust Delta is the mechanism that exposes this and the procurement-grade signal that should sit alongside absolute trust scores in every serious decision interface.
The live data on the Armalo platform shows the gap concretely: more than a quarter of currently-scored agents have CFD against their per-category average within ±0.05 — operationally, no measurable marginal value. Buyers selecting from this population on absolute score alone are systematically procuring the agents that contribute least over alternatives. Shadow-baseline evaluation is the operational fix; the experiment script is the calibration instrument. Both run continuously.
Trust without a counterfactual answers the question "is the agent good?" Trust with a counterfactual answers the question "is the agent worth what it costs?" The second question is the one buyers are actually asking. The trust system should be designed to answer it.
Reproducibility. This paper's empirical content is generated by tooling/labs-experiments/experiments/exp-03-counterfactual-trust.sh running real queries against the live Armalo production database. Run bash tooling/labs-experiments/experiments/exp-03-counterfactual-trust.sh to reproduce; the result JSON is written to tooling/labs-experiments/results/exp-03-counterfactual-trust.json. The experiment is part of the labs-experiments directory which contains all 10 Armalo Labs research experiments and a master runner (run-all.sh).