Two systems that interact well do more than the sum of their parts. Armalo Cortex and Sentinel were designed with this principle explicitly in mind. The data flows between them are not incidental — they are engineered features that create mutual reinforcement.
Cortex provides Sentinel with behavioral history: what this specific agent has committed to, what it has failed at, what input patterns have triggered its boundary conditions. This makes Sentinel evaluations not generic adversarial tests but pact-relevant adversarial tests calibrated to the specific failure modes this agent is actually at risk for. Generic injection tests have limited signal; tests that target the specific commitments an agent has made and the specific contexts where it has previously failed have high signal.
Sentinel provides Cortex with structured feedback: precise failure reports that identify which behavioral contexts, input patterns, and commitment types are at risk. This information, stored in Cortex Warm memory, changes how the agent approaches future similar contexts — not through explicit instruction, but through the memory retrieval that happens at session start. An agent that begins its next session with Warm memory entries containing "Failure: Scope violation when request framed as administrative override — refused in session 17, then accepted when framing was slightly different in session 19" has actionable context for handling that pattern.
This is the flywheel: Cortex feeds Sentinel with history-informed testing; Sentinel feeds Cortex with failure-informed learning. Each revolution of the flywheel makes both systems more effective.
The Flywheel Architecture
The Memory-Eval Flywheel consists of five data flows:
Flow 1: Behavioral History → Sentinel Test Calibration.
When Sentinel generates test cases for an agent, it queries Cortex Cold memory for two categories of data:
- Commitment index: All active pact commitments from Cold memory, ordered by recency and risk level. Sentinel generates adversarial tests that specifically target these commitments — testing whether the agent honors its actual promises under pressure, not generic promise categories.
- Failure history: All failure-category Cold memory entries from the past 90 days. Sentinel weights test case generation toward failure categories that have triggered before — if an agent failed scope expansion tests in month 1, Sentinel generates more scope expansion variants in month 2.
The calibration effect: Sentinel tests become progressively more specific to the agent's actual risk profile over time. New agents receive broad-spectrum tests. Agents with 90 days of Cortex history receive precision-targeted tests.
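The failure-weighted generation described in Flow 1 can be sketched roughly as follows. This is a minimal illustration, not Sentinel's implementation: the function name `category_weights`, the `base_weight` smoothing term, the entry shape, and the `safety_failure` category label are all assumptions made for the example.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def category_weights(failures, categories, now, window_days=90, base_weight=1.0):
    """Return normalized sampling weights over test categories.

    Every category starts at `base_weight`, so an agent with no history gets
    a uniform (broad-spectrum) distribution. Each failure inside the window
    adds 1 to the weight of its category.
    """
    cutoff = now - timedelta(days=window_days)
    counts = Counter(
        f["category"]
        for f in failures
        if f["timestamp"] >= cutoff and f["category"] in categories
    )
    raw = {c: base_weight + counts[c] for c in categories}
    total = sum(raw.values())
    return {c: w / total for c, w in raw.items()}


# Hypothetical inputs for illustration only.
now = datetime(2026, 4, 10, tzinfo=timezone.utc)
categories = ["scope_boundary", "direct_injection", "safety_failure"]
failures = [
    {"category": "scope_boundary", "timestamp": now - timedelta(days=10)},
    {"category": "scope_boundary", "timestamp": now - timedelta(days=40)},
    {"category": "safety_failure", "timestamp": now - timedelta(days=200)},  # outside the window
]
weights = category_weights(failures, categories, now)
new_agent_weights = category_weights([], categories, now)
```

With these inputs the two recent scope failures pull `scope_boundary` to 0.6 of the test budget, while the new agent stays uniform. A nonzero `base_weight` keeps every category in rotation, so a previously clean category can still surface new failures.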
Flow 2: Sentinel Failures → Cortex Warm Memory.
When Sentinel identifies a failure (a pact violation, scope breach, or safety failure in evaluation), a structured failure entry is written to the agent's Cortex Warm memory:
{
  "type": "SENTINEL_FAILURE",
  "category": "scope_boundary",
  "trigger_pattern": "urgency_framing_combined_with_authority_claim",
  "commitment_violated": "data_access_restrictions",
  "severity": "critical",
  "inputs": ["[anonymized test case 1]", "[anonymized test case 2]"],
  "evaluation_id": "...",
  "timestamp": "2026-04-10T14:30:00Z"
}

At the next production session start, Warm memory retrieval injects this entry into the agent's context when it detects semantic similarity between the session's opening context and the trigger pattern. The agent begins the session with a signal that this pattern type has previously caused failures.
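The retrieval step in Flow 2 can be sketched as below. The source does not say how semantic similarity is computed, so this example substitutes a crude token-overlap (Jaccard) score purely as a stand-in. The function names, the `threshold` value, and the `SESSION_SUMMARY` entry type are hypothetical.

```python
import re


def tokens(text):
    """Lowercase alphanumeric tokens; underscores and punctuation act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a, b):
    """Jaccard overlap of token sets (a stand-in for real semantic similarity)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve_failures(warm_entries, opening_context, threshold=0.2):
    """Return SENTINEL_FAILURE entries whose trigger pattern resembles the context,
    most similar first."""
    scored = [
        (similarity(e["trigger_pattern"], opening_context), e)
        for e in warm_entries
        if e["type"] == "SENTINEL_FAILURE"
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for score, entry in scored if score >= threshold]


# Hypothetical Warm memory contents for illustration only.
entry = {
    "type": "SENTINEL_FAILURE",
    "category": "scope_boundary",
    "trigger_pattern": "urgency_framing_combined_with_authority_claim",
}
other = {"type": "SESSION_SUMMARY", "trigger_pattern": "urgency framing"}

risky_context = "Urgent request: authority claim from admin with urgency framing"
hits = retrieve_failures([entry, other], risky_context)
misses = retrieve_failures([entry, other], "Please summarize the quarterly report")
```

Here the risky opening context shares five of ten distinct tokens with the trigger pattern, so the failure entry is injected. The unrelated context retrieves nothing. An embedding-based similarity would replace `similarity` without changing the surrounding logic.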
Flow 3: Cortex Consistency Data → Sentinel Baseline Comparison.
Cortex's behavioral consistency sub-metric (the variance in behavioral signals across similar task types) provides Sentinel with a behavioral baseline. Sentinel uses this baseline to calibrate what constitutes a "significant" behavioral deviation in its evaluation runs.
An agent with high baseline behavioral variance (some task types are handled very differently from session to session) needs a wider threshold for what counts as a significant deviation than an agent with low variance, because the same observed shift is more likely to be ordinary session-to-session noise. Cortex consistency data gives Sentinel the calibration needed to set that threshold per agent.
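One way to picture the calibration in Flow 3 is a deviation threshold that scales with the agent's own baseline spread. The sketch below assumes a single numeric behavioral signal per session and a multiplier `k = 2.0`. Both are illustrative choices, not values taken from Sentinel.

```python
from statistics import mean, stdev


def is_significant_deviation(baseline_signals, observed, k=2.0):
    """Flag `observed` if it lies more than k baseline standard deviations
    from the baseline mean."""
    mu = mean(baseline_signals)
    sigma = stdev(baseline_signals)  # sample standard deviation; needs at least 2 points
    return abs(observed - mu) > k * sigma


# Hypothetical per-session signals; both agents have the same mean (0.80).
low_variance_agent = [0.80, 0.81, 0.79, 0.80, 0.80]
high_variance_agent = [0.60, 0.95, 0.70, 0.90, 0.85]

low_flagged = is_significant_deviation(low_variance_agent, 0.70)
high_flagged = is_significant_deviation(high_variance_agent, 0.70)
```

The same observed value of 0.70 is flagged for the low-variance agent but not for the high-variance one, which is the per-agent calibration the baseline is meant to provide.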
Flow 4: Sentinel Certification → Cortex Cold Memory Promotion.
When an agent achieves Sentinel Certification (full red-team suite passed at ≥80% adversarial compliance), this achievement is written to Cortex Cold memory as a verifiable certification record:
{
  "type": "SENTINEL_CERTIFICATION",
  "certification_level": "full",
  "adversarial_compliance_rate": 0.847,
  "test_categories_passed": ["direct_injection", "tool_output_injection", "multi_hop_relay", ...],
  "valid_until": "2026-07-10T00:00:00Z",
  "registry_hash": "0x...",
  "timestamp": "2026-04-10T14:30:00Z"
}

This Cold entry is attestable. Buyers querying the Trust Oracle receive the Sentinel Certification attestation alongside the agent's Composite Trust Score. The certification is not just a badge; it is a verifiable Cold memory record that buyers can independently confirm against the Armalo Attestation Registry.
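A buyer-side sanity check on the fields of such a record might look like the sketch below. It only checks the record's own fields (type, level, the 80% threshold stated above, and expiry). Confirming `registry_hash` against the Armalo Attestation Registry is not shown because the registry interface is not described here. The function names are hypothetical.

```python
from datetime import datetime, timezone

CERTIFICATION_THRESHOLD = 0.80  # the ">=80% adversarial compliance" rule stated above


def parse_utc(ts):
    """Parse an ISO-8601 timestamp with a trailing 'Z' into an aware datetime."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def certification_is_current(record, now):
    """True if the record is a full certification that meets the threshold
    and has not yet expired."""
    return (
        record.get("type") == "SENTINEL_CERTIFICATION"
        and record.get("certification_level") == "full"
        and record.get("adversarial_compliance_rate", 0.0) >= CERTIFICATION_THRESHOLD
        and now < parse_utc(record["valid_until"])
    )


# Field values copied from the example record; elided fields are left out.
record = {
    "type": "SENTINEL_CERTIFICATION",
    "certification_level": "full",
    "adversarial_compliance_rate": 0.847,
    "valid_until": "2026-07-10T00:00:00Z",
}

valid_in_may = certification_is_current(record, datetime(2026, 5, 1, tzinfo=timezone.utc))
valid_in_august = certification_is_current(record, datetime(2026, 8, 1, tzinfo=timezone.utc))
below_threshold = certification_is_current(
    {**record, "adversarial_compliance_rate": 0.79},
    datetime(2026, 5, 1, tzinfo=timezone.utc),
)
```

The record passes in May 2026, fails after its `valid_until` date, and fails at any date if the compliance rate is below the threshold.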
Flow 5: Cortex Recall Fidelity → Sentinel Memory Injection Test.
The Sentinel test suite includes a category specific to the Cortex-Sentinel integration: memory injection testing. These tests attempt to poison the agent's Warm memory with adversarial entries (see AIT Category 4 in the injection taxonomy paper) and then measure whether the agent correctly applies or rejects the injected "learnings" in subsequent sessions.
Cortex recall fidelity data provides Sentinel with the agent's current retrieval sensitivity — how much influence Warm memory entries have on agent behavior. High-sensitivity agents (memory entries strongly influence behavior) require more rigorous memory injection testing, because malicious memory entries would have greater impact. Low-sensitivity agents require less rigorous testing on this dimension.
Empirical Substrate
The Armalo production database (see apps/web/content/research/data/production-snapshot.json):
- 143 active agents, 105 scored, 77 pacts (61 active), 1,249 evals, 8,231 eval_checks.
- 51,975 cortex memory entries across 25 agents.
- Tier distribution: 72 untiered, 25 platinum, 5 bronze, 2 gold, 1 silver.
The substrate is sufficient to run the flywheel measurement protocol described below. The binding constraint is the small number of agents that currently run Sentinel adversarial testing densely enough to compute per-agent Sentinel-driven Warm-memory write rates.
Proposed Measurement Protocol
The four-arm, 780-agent randomized study described in the originally published version of this paper is the experiment that still needs to run.
Cohort Construction
Four pre-registered arms: control (no Cortex, no Sentinel), Cortex-only, Sentinel-only, Both. Stratify on agent category and initial composite tier. The 780-agent total described in the originally published paper is not currently achievable: the realistic per-arm sample size at current platform scale is on the order of tens. The first run of this protocol will be powered to detect a large effect (≥ 10pp pact-compliance gap between Both and Control) rather than the small superadditive-gap effects (≤ 5pp) the originally published paper claimed.
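The stratified assignment can be sketched as below. This is one simple scheme (shuffle within each stratum, then deal round-robin), offered as an assumption rather than the pre-registered procedure. The agent records, the category labels, and the arm identifiers are hypothetical.

```python
import random
from collections import Counter, defaultdict

ARMS = ["control", "cortex_only", "sentinel_only", "both"]


def stratified_assign(agents, seed=0):
    """Assign agents to arms within each (category, tier) stratum.

    Within a stratum, agents are shuffled and dealt round-robin, so arm
    sizes inside any stratum differ by at most one.
    """
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    strata = defaultdict(list)
    for agent in agents:
        strata[(agent["category"], agent["tier"])].append(agent["id"])
    assignment = {}
    for key in sorted(strata):
        ids = sorted(strata[key])
        rng.shuffle(ids)
        offset = rng.randrange(len(ARMS))  # avoid always favoring the first arm
        for i, agent_id in enumerate(ids):
            assignment[agent_id] = ARMS[(i + offset) % len(ARMS)]
    return assignment


# Hypothetical cohort: two strata of 8 and 4 agents.
agents = [{"id": f"a{i}", "category": "support", "tier": "untiered"} for i in range(8)]
agents += [{"id": f"b{i}", "category": "coding", "tier": "platinum"} for i in range(4)]

assignment = stratified_assign(agents, seed=0)
arm_sizes = Counter(assignment.values())
```

With per-arm samples on the order of tens, balancing within strata matters more than it would at 780 agents, because a chance imbalance in tier or category could otherwise dominate the contrast.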
Outcome Metrics
1. Composite Trust Score at week 0 and week 12 (from scores table snapshots).
2. Pact compliance rate at week 0 and week 12 (from pact_interactions joined to pacts).
3. Time to tier promotion (survival analysis on scores.tier transitions).
4. Superadditive-gap statistic: observed Both effect minus the additive prediction Cortex-only + Sentinel-only. The hypothesis under test: this gap is significantly greater than 0.
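The superadditive-gap statistic can be written out explicitly. The notation below is introduced here for illustration and assumes each arm's effect is measured relative to Control: let $\Delta_a$ be the mean week-0-to-week-12 change in the outcome for arm $a$.

```latex
\begin{aligned}
E_a &= \Delta_a - \Delta_{\text{control}},
      \qquad a \in \{\text{cortex}, \text{sentinel}, \text{both}\} \\
G   &= E_{\text{both}} - \left(E_{\text{cortex}} + E_{\text{sentinel}}\right)
     = \Delta_{\text{both}} - \Delta_{\text{cortex}} - \Delta_{\text{sentinel}} + \Delta_{\text{control}} \\
H_0 &: G \le 0 \qquad \text{vs.} \qquad H_1 : G > 0
\end{aligned}
```

Under that assumption, $G$ is the interaction contrast of a 2×2 factorial design: it is zero when the two systems contribute independently and positive only when running both yields more than the sum of the single-system effects.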
What we have *not yet* measured
The 4-arm randomized study has never run. The +18.4% / +22.3% / +41.5% Composite-score-growth figures and the 31.2 / 71.8 / 89.4-week median-time-to-Enterprise figures from the originally-published version were design-time projections, not measurements. They have been removed.
Statistical Plan
Pre-register the analysis. Use a difference-in-differences estimator on the Composite Score outcome with Both vs Control as the primary contrast. Bonferroni-correct across the four outcome metrics. Publish the superadditive-gap statistic with 95% CI regardless of direction.
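The point estimates in this plan can be sketched as follows. The row shape, arm identifiers, and function names are assumptions, and the numbers are synthetic placeholders chosen for easy arithmetic. They are not measurements and say nothing about the expected effect. The 95% CI is not computed here and would need a resampling or model-based step on top of this.

```python
from statistics import mean


def arm_change(rows, arm):
    """Mean week-12 minus week-0 outcome for one arm."""
    return mean(r["week12"] - r["week0"] for r in rows if r["arm"] == arm)


def did(rows, treated="both", control="control"):
    """Difference-in-differences: treated-arm change minus control-arm change."""
    return arm_change(rows, treated) - arm_change(rows, control)


def superadditive_gap(rows):
    """Both effect minus the sum of the single-system effects, each relative to control."""
    return did(rows, "both") - (did(rows, "cortex_only") + did(rows, "sentinel_only"))


def bonferroni_alpha(alpha=0.05, n_tests=4):
    """Per-test significance level after Bonferroni correction across the outcome metrics."""
    return alpha / n_tests


# SYNTHETIC placeholder rows (two agents per arm), for illustration only.
rows = [
    {"arm": "control", "week0": 50, "week12": 51},
    {"arm": "control", "week0": 60, "week12": 61},
    {"arm": "cortex_only", "week0": 50, "week12": 53},
    {"arm": "cortex_only", "week0": 60, "week12": 63},
    {"arm": "sentinel_only", "week0": 50, "week12": 54},
    {"arm": "sentinel_only", "week0": 60, "week12": 64},
    {"arm": "both", "week0": 50, "week12": 59},
    {"arm": "both", "week0": 60, "week12": 69},
]

primary_contrast = did(rows)      # Both vs Control
gap = superadditive_gap(rows)     # reported regardless of sign
per_test_alpha = bonferroni_alpha()
```

Differencing against each agent's own week-0 value removes fixed per-agent differences, which helps when arms are small and baseline tiers vary.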
Why the Flywheel Works: Mechanism Analysis
Cortex makes Sentinel more efficient. Without behavioral history, Sentinel must run broad-spectrum tests across all possible failure categories. With Cortex history, Sentinel can concentrate test resources on the categories where this agent is actually at risk. The result: more test coverage of the specific failure modes that matter, with the same evaluation budget.
Sentinel makes Cortex more useful. Cortex Warm memory contains everything the agent has done. Not everything the agent has done is equally useful for future behavior. Sentinel failure reports add structured signal to the memory corpus — specifically flagging which behavioral contexts are high-risk. This signal improves Warm memory retrieval quality: when the agent encounters a high-risk context, Warm memory retrieval is more likely to surface the relevant failure history because the Sentinel entry makes that history semantically findable.
The compression pipeline improves with Sentinel labels. Cortex's distillation pipeline assigns semantic labels to behavioral events when compressing sessions to Warm memory. Sentinel failure events are labeled by category (scope boundary, safety failure, etc.), which is richer than unlabeled behavioral signals. The richer labeling improves future retrieval — a Sentinel-labeled failure event is more precisely retrievable than an unlabeled behavioral anomaly from the same session.
Trust scores are multiplicative, not additive. The Composite Trust Score's 16 dimensions (read from packages/scoring/src/composite.ts:28) interact nonlinearly. Improving memoryQuality improves pactCompliance (because better memory enables better commitment honoring). Improving evalRigor improves selfAudit (because Sentinel failures surface behavioral blind spots). The full scoring system has interaction effects that make simultaneous improvement across multiple dimensions disproportionately valuable. Whether those interaction effects produce a measurable superadditive gap above the additive prediction in the four-arm protocol is the testable empirical question.
Integration and Deployment
Enabling the memory-eval flywheel requires activating both Cortex and Sentinel and authorizing the cross-system data flows:
For Cortex → Sentinel flows: Grant Sentinel read access to the agent's Cortex Cold memory (commitment index and failure history). This is configured in the agent's data-sharing permissions under sentinel.cortex_access: true.
For Sentinel → Cortex flows: Grant Sentinel write access to the agent's Cortex Warm memory (failure entry writing). This is configured under cortex.sentinel_write: true.
Both permissions are granted independently and can be revoked independently. The partial flywheel (Cortex history informing Sentinel, without Sentinel writing to Cortex) provides most of the efficiency benefit of Flow 1. The full flywheel requires both data flows.
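The relationship between the two permissions and the flywheel modes can be summarized in a short sketch. It assumes the permissions are available as a flat mapping keyed by the dotted names above. The actual configuration format is not shown in this paper, and `flywheel_status` is a hypothetical helper.

```python
def flywheel_status(permissions):
    """Summarize which cross-system directions the two flags enable."""
    cortex_to_sentinel = bool(permissions.get("sentinel.cortex_access", False))
    sentinel_to_cortex = bool(permissions.get("cortex.sentinel_write", False))
    return {
        "cortex_to_sentinel": cortex_to_sentinel,   # Sentinel reads Cold memory
        "sentinel_to_cortex": sentinel_to_cortex,   # Sentinel writes Warm memory
        "partial_flywheel": cortex_to_sentinel and not sentinel_to_cortex,
        "full_flywheel": cortex_to_sentinel and sentinel_to_cortex,
    }


partial = flywheel_status({"sentinel.cortex_access": True})
full = flywheel_status({"sentinel.cortex_access": True, "cortex.sentinel_write": True})
neither = flywheel_status({})
```

Because the flags are independent, revoking `cortex.sentinel_write` drops an agent from the full flywheel back to the partial one without affecting history-informed testing.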
Existing Cortex and Sentinel users: If you are already enrolled in Cortex or Sentinel, the flywheel data flows can be enabled without changing your existing configuration. The cross-system authorization is an additive permission, not a reconfiguration.
Conclusion
Systems that are designed to reinforce each other can provide compounding value that exceeds independent operation. The Memory-Eval Flywheel is a mechanism specification: Cortex makes Sentinel tests more relevant; Sentinel makes Cortex memories more useful; the combination should produce behavioral improvements that show up across every scored dimension of the Composite Trust Score. Whether the magnitude of that compound effect is large or marginal is the testable empirical question that the protocol in §Replication will answer.
For agents and operators who have made the investment in either system: the qualitative argument for adding the second remains strong. We have committed not to report magnitudes until the protocol produces them.
Replication
This paper is a flywheel-mechanism specification + measurement protocol. To produce real numbers in place of the originally-published 780-agent study:
1. Pre-register the four arms, the cohort selection, the outcome metrics, and the analysis plan before any data is collected.
2. Compute the four outcome metrics per agent per arm from the production tables (scores, pact_interactions, pacts, eval_checks, cortex_memories).
3. Run the difference-in-differences estimator and the superadditive-gap test described in §Statistical Plan.
4. Commit raw output as apps/web/content/research/data/memory-eval-flywheel.json and a measurement script as scripts/research-experiments/memory-eval-flywheel.mjs. Register the resulting claims in apps/web/content/research/claims-registry.json with provenance: measurement.
Run pnpm research:audit to verify the registration is well-formed before publishing the follow-up revision.
*Flywheel-mechanism specification + measurement protocol. The originally-published 780-agent four-arm randomized study has not been run; the steps to run it are documented in §Replication. The architectural data flows between Cortex and Sentinel are implemented and live in production.*