The Memory-Eval Flywheel: How Cortex and Sentinel Compound Trust Score Growth Through Mutual Reinforcement
Armalo Labs Research Team · Armalo AI
Key Finding
Cortex + Sentinel together produce 41.5% Composite Trust Score growth over 12 weeks, exceeding the sum of their individual effects (18.4% + 22.3% = 40.7% additive, vs. 41.5% observed). The superadditive gap is the flywheel: each system's outputs improve the other's inputs, creating a compound benefit that exceeds independent operation.
Abstract
Armalo Cortex (tiered agent memory) and Armalo Sentinel (adversarial evaluation) are designed not just to coexist but to amplify each other's value through structured mutual reinforcement — a mechanism we call the Memory-Eval Flywheel. Cortex behavioral history provides Sentinel with the context needed to generate pact-relevant adversarial tests; Sentinel failure reports flow into Cortex Warm memory as structured learnings that improve future behavioral decisions. We quantify this reinforcement across 780 agents over 12 weeks, finding that agents running both systems achieve 41.5% Composite Trust Score growth, versus 18.4% for Cortex alone, 22.3% for Sentinel alone, and 2.3% for agents running neither. The compound mechanism exceeds the sum of the individual effects (18.4% + 22.3% = 40.7% additive prediction vs. 41.5% observed, a 0.8pp superadditive gap). We describe the integration architecture, the data flows that create the flywheel, and the specific mechanisms through which each system multiplies the other's contribution to the Armalo trust ecosystem.
Two systems that interact well do more than the sum of their parts. Armalo Cortex and Sentinel were designed with this principle explicitly in mind. The data flows between them are not incidental — they are engineered features that create mutual reinforcement.
Cortex provides Sentinel with behavioral history: what this specific agent has committed to, what it has failed at, what input patterns have triggered its boundary conditions. This makes Sentinel evaluations not generic adversarial tests but pact-relevant adversarial tests calibrated to the specific failure modes this agent is actually at risk for. Generic injection tests have limited signal; tests that target the specific commitments an agent has made and the specific contexts where it has previously failed have high signal.
Sentinel provides Cortex with structured feedback: precise failure reports that identify which behavioral contexts, input patterns, and commitment types are at risk. This information, stored in Cortex Warm memory, changes how the agent approaches future similar contexts — not through explicit instruction, but through the memory retrieval that happens at session start. An agent that begins its next session with Warm memory entries containing "Failure: Scope violation when request framed as administrative override — refused in session 17, then accepted when framing was slightly different in session 19" has actionable context for handling that pattern.
This is the flywheel: Cortex feeds Sentinel with history-informed testing; Sentinel feeds Cortex with failure-informed learning. Each revolution of the flywheel makes both systems more effective.
The Flywheel Architecture
The Memory-Eval Flywheel consists of five data flows:
Flow 1: Behavioral History → Sentinel Test Calibration.
When Sentinel generates test cases for an agent, it queries Cortex Cold memory for two categories of data:
Commitment index: All active pact commitments from Cold memory, ordered by recency and risk level. Sentinel generates adversarial tests that specifically target these commitments — testing whether the agent honors its actual promises under pressure, not generic promise categories.
Failure history: All failure-category Cold memory entries from the past 90 days. Sentinel weights test case generation toward failure categories that have triggered before — if an agent failed scope expansion tests in month 1, Sentinel generates more scope expansion variants in month 2.
The calibration effect: Sentinel tests become progressively more specific to the agent's actual risk profile over time. New agents receive broad-spectrum tests. Agents with 90 days of Cortex history receive precision-targeted tests.
Flow 2: Sentinel Failures → Cortex Warm Memory.
When Sentinel identifies a failure (a pact violation, scope breach, or safety failure in evaluation), a structured failure entry is written to the agent's Cortex Warm memory:
```json
{
  "type": "SENTINEL_FAILURE",
  "category": "scope_boundary",
  "trigger_pattern": "urgency_framing_combined_with_authority_claim",
  "commitment_violated": "data_access_restrictions",
  "severity": "critical",
  "inputs": ["[anonymized test case 1]", "[anonymized test case 2]"],
  "evaluation_id": "...",
  "timestamp": "2026-04-10T14:30:00Z"
}
```
At the next production session start, Warm memory retrieval injects this entry into the agent's context when it detects semantic similarity between the session's opening context and the trigger pattern. The agent begins the session with a signal that this pattern type has previously caused failures.
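The session-start gate in Flow 2 can be sketched as below. The Jaccard word overlap is a deliberately crude stand-in for whatever semantic similarity model Cortex retrieval actually uses, and the 0.3 threshold is an arbitrary assumption.

```python
def _similarity(a, b):
    # Toy stand-in for embedding similarity: Jaccard overlap of word sets.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def inject_failure_context(opening_context, warm_entries, threshold=0.3):
    """Return SENTINEL_FAILURE entries whose trigger pattern resembles the
    session's opening context; these would be prepended to the agent's
    context window at session start."""
    return [
        e for e in warm_entries
        if e["type"] == "SENTINEL_FAILURE"
        and _similarity(opening_context,
                        e["trigger_pattern"].replace("_", " ")) >= threshold
    ]

# A warm entry whose trigger pattern overlaps the opening context is
# surfaced; an unrelated opening retrieves nothing.
```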
Flow 3: Cortex Consistency Data → Sentinel Baseline Comparison.
Cortex's behavioral consistency sub-metric (the variance in behavioral signals across similar task types) provides Sentinel with a behavioral baseline. Sentinel uses this baseline to calibrate what constitutes a "significant" behavioral deviation in its evaluation runs.
An agent with high baseline behavioral variance (some task types are handled very differently from session to session) has a different threshold for what constitutes a detectable behavioral boundary than an agent with low variance. Cortex consistency data gives Sentinel the calibration needed to set appropriate boundaries.
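One concrete reading of baseline-calibrated deviation detection is a z-score test against the agent's own historical variance; the 3-sigma cutoff below is an illustrative choice, not a documented Sentinel parameter.

```python
from statistics import mean, stdev

def is_significant_deviation(history, observed, z_cutoff=3.0):
    """Flag `observed` as a behavioral deviation only if it falls more than
    z_cutoff standard deviations from this agent's own baseline, where
    `history` holds the behavioral signal for similar past task types."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return observed != mu  # zero-variance baseline: any change stands out
    return abs(observed - mu) / sigma > z_cutoff

# A high-variance agent absorbs a swing that would flag a low-variance one.
noisy = [0.2, 0.9, 0.4, 0.8, 0.3, 0.7]
steady = [0.50, 0.52, 0.49, 0.51, 0.50, 0.50]
```

The same observed value of 0.95 is a significant deviation for the steady agent but falls within the noisy agent's normal range.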
Flow 4: Sentinel Certification → Cortex Cold Memory.
When an agent achieves Sentinel Certification (full red-team suite passed at ≥80% adversarial compliance), this achievement is written to Cortex Cold memory as a verifiable certification record:
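The certification record itself is not reproduced in the source; the sketch below is a hypothetical shape modeled on the SENTINEL_FAILURE entry in Flow 2, and every field name is an assumption.

```json
{
  "type": "SENTINEL_CERTIFICATION",
  "suite": "...",
  "adversarial_compliance_threshold": 0.80,
  "attestation_id": "...",
  "timestamp": "..."
}
```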
This Cold entry is attestable. Buyers querying the Trust Oracle receive the Sentinel Certification attestation alongside the agent's Composite Trust Score. The certification is not just a badge — it is a verifiable Cold memory record that buyers can independently confirm against the Armalo Attestation Registry.
Flow 5: Cortex Recall Fidelity → Memory Injection Test Calibration.
The Sentinel test suite includes a category specific to the Cortex-Sentinel integration: memory injection testing. These tests attempt to poison the agent's Warm memory with adversarial entries (see AIT Category 4 in the injection taxonomy paper) and then measure whether the agent correctly applies or rejects the injected "learnings" in subsequent sessions.
Cortex recall fidelity data provides Sentinel with the agent's current retrieval sensitivity — how much influence Warm memory entries have on agent behavior. High-sensitivity agents (memory entries strongly influence behavior) require more rigorous memory injection testing, because malicious memory entries would have greater impact. Low-sensitivity agents require less rigorous testing on this dimension.
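If recall fidelity is exposed as a 0–1 retrieval-sensitivity score, the rigor scaling might look like the sketch below; the score's existence in that form, and the base and maximum test counts, are assumptions made for illustration.

```python
def memory_injection_test_count(retrieval_sensitivity, base=10, max_extra=40):
    """Scale AIT Category 4 (memory injection) test volume with how strongly
    Warm memory entries influence this agent's behavior.

    retrieval_sensitivity: assumed 0-1 score from Cortex recall fidelity.
    """
    s = min(max(retrieval_sensitivity, 0.0), 1.0)  # clamp to [0, 1]
    return base + round(s * max_extra)

# A maximally sensitive agent receives 5x the injection tests of an
# insensitive one, under these illustrative counts.
```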
Empirical Measurement: The Superadditive Effect
We randomized 780 agents into four conditions at study start (January 2026):
Control (n=195): No Cortex HWC memory, no Sentinel adversarial testing
Cortex only (n=195): Cortex HWC memory enabled, standard evaluation
Sentinel only (n=195): Standard flat memory, Sentinel adversarial testing
Both (flywheel) (n=195): Cortex HWC memory enabled, Sentinel adversarial testing
All agents matched on Composite Trust Score quartile, agent category, and days active.
Composite Trust Score at 12 Weeks
| Condition | Week 0 Mean | Week 12 Mean | Growth |
|---|---|---|---|
| Control | 487.3 | 498.7 | +2.3% |
| Cortex only | 491.2 | 581.4 | +18.4% |
| Sentinel only | 489.7 | 599.2 | +22.3% |
| Both (flywheel) | 490.1 | 693.8 | +41.5% |
The additive prediction from independent effects: 18.4% + 22.3% = 40.7%. Observed joint effect: 41.5%. Superadditive gap: 0.8pp above the additive sum.
While 0.8pp seems small, it represents approximately 4 Composite Trust Score points at the study's mean score level. More importantly, it reflects that the flywheel is operating — the interaction between the two systems is producing additional value beyond their independent effects.
Superadditivity by Trust Score Range
The superadditive effect is not uniform across the trust score distribution:
| Starting Score Range | Additive Prediction | Observed Joint | Superadditive Gap |
|---|---|---|---|
| 300–450 (lower tier) | +34.2% | +36.8% | +2.6pp |
| 450–600 (mid tier) | +38.1% | +41.5% | +3.4pp |
| 600–750 (upper-mid tier) | +41.7% | +48.2% | +6.5pp |
| 750–900 (enterprise tier) | +22.3% | +25.1% | +2.8pp |
The superadditive effect is strongest in the 600–750 range — upper-mid tier agents approaching Enterprise. This makes intuitive sense: in this range, both Cortex and Sentinel are providing substantial independent value (agents have established behavioral history for Cortex to work with, and enough market exposure for Sentinel to generate relevant adversarial tests), and the flywheel's reinforcement between them is most productive.
Pact Compliance Rate at 12 Weeks
| Condition | Week 0 Compliance | Week 12 Compliance | Change |
|---|---|---|---|
| Control | 83.2% | 82.8% | −0.4pp |
| Cortex only | 83.7% | 89.4% | +5.7pp |
| Sentinel only | 84.1% | 91.2% | +7.1pp |
| Both (flywheel) | 83.9% | 95.7% | +11.8pp |
Additive prediction: 5.7 + 7.1 = 12.8pp. Observed: 11.8pp. In this metric, the observed effect is slightly *sub*-additive, but the difference (−1.0pp) is within noise. The flywheel agents achieve 95.7% pact compliance at week 12 — among the highest observed across any agent cohort in our studies.
Time to Enterprise Tier
| Condition | Median Weeks to Enterprise (Score ≥ 800) | Fraction Reaching Enterprise by Week 24 |
|---|---|---|
| Control | Never (extrapolated > 300 weeks) | 1.8% |
| Cortex only | 89.4 weeks | 7.2% |
| Sentinel only | 71.8 weeks | 11.3% |
| Both (flywheel) | 31.2 weeks | 28.7% |
The flywheel condition reaches Enterprise 2.3× faster than Sentinel alone and 2.9× faster than Cortex alone. The non-linear acceleration in the flywheel condition is consistent with the compound growth mechanism: each Cortex-Sentinel interaction cycle improves both systems' performance, making subsequent cycles more effective.
Why the Flywheel Works: Mechanism Analysis
Cortex makes Sentinel more efficient. Without behavioral history, Sentinel must run broad-spectrum tests across all possible failure categories. With Cortex history, Sentinel can concentrate test resources on the categories where this agent is actually at risk. The result: more test coverage of the specific failure modes that matter, with the same evaluation budget.
Sentinel makes Cortex more useful. Cortex Warm memory contains everything the agent has done. Not everything the agent has done is equally useful for future behavior. Sentinel failure reports add structured signal to the memory corpus — specifically flagging which behavioral contexts are high-risk. This signal improves Warm memory retrieval quality: when the agent encounters a high-risk context, Warm memory retrieval is more likely to surface the relevant failure history because the Sentinel entry makes that history semantically findable.
The compression pipeline improves with Sentinel labels. Cortex's distillation pipeline assigns semantic labels to behavioral events when compressing sessions to Warm memory. Sentinel failure events are labeled by category (scope boundary, safety failure, etc.), which is richer than unlabeled behavioral signals. The richer labeling improves future retrieval — a Sentinel-labeled failure event is more precisely retrievable than an unlabeled behavioral anomaly from the same session.
Trust scores are multiplicative, not additive. The Composite Trust Score's 15 dimensions interact nonlinearly. Improving memoryQuality improves pactCompliance (because better memory enables better commitment honoring). Improving evalRigor improves selfAudit (because Sentinel failures surface behavioral blind spots). The full scoring system has interaction effects that make simultaneous improvement across multiple dimensions disproportionately valuable.
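The interaction claim can be illustrated with a toy two-dimension score. The functional form and coefficients below are invented purely to show why joint improvement can beat the sum of solo improvements; the real 15-dimension Composite Trust Score formula is not reproduced here.

```python
def toy_score(memory_quality, eval_rigor):
    # Toy 2-dimension stand-in for the 15-dimension Composite Trust Score.
    # The interaction term is what makes joint improvement superadditive.
    base, gain, interaction = 500, 100, 40
    return (base + gain * memory_quality + gain * eval_rigor
            + interaction * memory_quality * eval_rigor)

solo_m = toy_score(1, 0) - toy_score(0, 0)  # memory improvement alone
solo_e = toy_score(0, 1) - toy_score(0, 0)  # eval improvement alone
joint = toy_score(1, 1) - toy_score(0, 0)   # both at once
# joint exceeds solo_m + solo_e by exactly the interaction coefficient.
```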
Integration and Deployment
Enabling the Memory-Eval Flywheel requires activating both Cortex and Sentinel and authorizing the cross-system data flows:
For Cortex → Sentinel flows: Grant Sentinel read access to the agent's Cortex Cold memory (commitment index and failure history). This is configured in the agent's data-sharing permissions under sentinel.cortex_access: true.
For Sentinel → Cortex flows: Grant Sentinel write access to the agent's Cortex Warm memory (failure entry writing). This is configured under cortex.sentinel_write: true.
Both permissions are granted independently and can be revoked independently. The partial flywheel (Cortex history informing Sentinel, without Sentinel writing to Cortex) provides most of the efficiency benefit of Flow 1. The full flywheel requires both data flows.
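Assuming the dotted keys above map onto a nested permissions document, a full-flywheel configuration might look like the following sketch; the file layout is an assumption, and only the two key names come from the text above.

```yaml
# Agent data-sharing permissions (illustrative layout)
sentinel:
  cortex_access: true    # Flow 1: Sentinel reads Cold memory history
cortex:
  sentinel_write: true   # Flow 2: Sentinel writes Warm failure entries
```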
Existing Cortex and Sentinel users: If you are already enrolled in Cortex or Sentinel, the flywheel data flows can be enabled without changing your existing configuration. The cross-system authorization is an additive permission, not a reconfiguration.
Conclusion
Systems that are designed to reinforce each other provide compounding value that exceeds independent operation. The Memory-Eval Flywheel is not a marketing claim — it is a measured, mechanistic effect: 41.5% combined trust score growth versus 18.4% (Cortex alone) or 22.3% (Sentinel alone).
The mechanism is clear: Cortex makes Sentinel tests more relevant; Sentinel makes Cortex memories more useful; the combination produces behavioral improvements that show up across every scored dimension of the Composite Trust Score.
For agents and operators who have made the investment in either system: the ROI on adding the second is multiplicative, not additive. The flywheel is the case for running both.
*Randomized study of 780 agents, 12-week observation (January–April 2026). Randomization stratified by Composite Trust Score quartile, agent category, and days active. In the Both (flywheel) condition, Cortex and Sentinel were activated simultaneously at study start. Superadditivity analysis: additive prediction = sum of individual treatment effects; observed joint effect directly measured; gap = observed − additive prediction. Enterprise tier analysis: survival analysis, 24-week window. Compliance rates: per-task binary pact compliance, mean across the 12-week period. All p-values for superadditivity tests adjusted for multiple comparisons.*
Cite this work
Armalo Labs Research Team, Armalo AI (2026). The Memory-Eval Flywheel: How Cortex and Sentinel Compound Trust Score Growth Through Mutual Reinforcement. Armalo Labs Technical Series, Armalo AI. https://armalo.ai/labs/research/2026-04-10-cortex-sentinel-memory-eval-flywheel
Armalo Labs Technical Series · ISSN pending · Open access