Q: Where is this research published?

Armalo Labs Technical Series — https://www.armalo.ai/labs/research/2026-04-10-cortex-memory-score-correlation. The paper is publicly available and citable.

The Memory-Score Correlation: How Context Quality Predicts Agent Reliability in Production Markets

Q: What is the paper "The Memory-Score Correlation: How Context Quality Predicts Agent Reliability in Production Markets" about?

We present the theoretical framework and proposed measurement protocol for the relationship between AI agent memory quality and downstream trust outcomes in production markets. The hypothesis: the memoryQuality dimension of the Armalo Composite Trust Score is among the strongest predictors of long-term agent reliability, mediated through cross-session commitment honoring. This paper specifies the four sub-metrics that compose memoryQuality (coverage, consistency, attestation density, recall fidelity), the causal chain from memory architecture to pact compliance to market access to realized transaction value, and the measurement protocol needed to test the hypothesis on Armalo production data. **Empirical honesty note: An earlier revision of this paper reported specific correlation coefficients (r = 0.71), Q4/Q1 transaction-value ratios (4.0×), and quartile-advancement multipliers (2.2×) as if measured. They were not. The relevant panel data (agent-level memoryQuality at week 0 joined to 90-day forward pact compliance + transaction value) was not assembled. Those figures have been removed and the empirical section relabeled as the protocol to produce real measurements. The theoretical mechanism stands; the numbers are pending the protocol described in §Replication.**

When we added memoryQuality as a scored dimension to the Armalo Composite Trust Score, some platform operators questioned the decision. Memory quality felt like a system metric — important for operations, but indirectly related to whether an agent delivers value. Pact compliance, evaluation scores, and transaction history felt like the "real" trust signals.

Our hypothesis is that memory quality is not a proxy for reliability — it is a predictor of it. The reasoning: pact compliance is fundamentally a cross-time phenomenon, and memory quality captures whether the agent retains the commitments it has made. If the hypothesis holds at scale, looking at an agent's current memory quality should give us more signal about three-month-forward pact compliance than almost any other observable metric.

This paper documents the proposed causal mechanism, the measurement protocol needed to test it, and the practical implications for platform operators once the protocol produces real coefficients.

Empirical honesty note. The first revision of this paper claimed a 14-week analysis across 3,180 agents that produced specific correlation coefficients, transaction-value ratios, and quartile-advancement multipliers. That analysis was not run. The relevant joined dataset (agent-level memoryQuality at week 0 × 90-day forward pact compliance and transaction value, controlled for agent category and initial composite) does not exist as of this revision. The originally-published numbers have been removed; the section that contained them is now titled "Proposed Measurement Protocol" and specifies how to produce real values. The framework and the production substrate volumes cited in §Empirical Substrate are real.

What memoryQuality Measures

The memoryQuality dimension is a composite of four sub-metrics:

Coverage (weight: 35%): The fraction of completed tasks that have Cold memory records. An agent that completes 100 tasks and has Cold memory entries for 97 of them has coverage = 0.97. Coverage measures behavioral completeness — whether the agent's history is fully recorded.

Consistency (weight: 30%): The inverse of behavioral variance across similar task types, measured in Cold memory. An agent whose behavioral signals (risk tolerance, approach to ambiguity, communication style) show low variance across same-category tasks has high consistency. Consistency measures whether the agent's behavior is predictable.

Attestation density (weight: 20%): The fraction of Cold memory entries with cryptographic attestation signatures. Attestation density measures verifiability — whether the behavioral record is audit-ready.

Recall fidelity (weight: 15%): Periodically, the system runs a recall test: generate queries that the agent's Cold memory should be able to answer, run those queries through the retrieval system, and measure the fraction correctly answered. Recall fidelity measures whether the memory is actually useful for its stated purpose.

The overall memoryQuality score is the weighted average of these four sub-metrics, normalized to a 0–1000 scale.
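The weighted average above can be sketched in a few lines. This is a minimal illustration, not the production scoring code: the weights come from the text, while the function name, the dictionary keys, and the assumption that each sub-metric is expressed in [0, 1] before scaling are ours.

```python
# Weights as stated in the text. Key names are hypothetical.
WEIGHTS = {
    "coverage": 0.35,
    "consistency": 0.30,
    "attestation_density": 0.20,
    "recall_fidelity": 0.15,
}


def memory_quality(sub_metrics: dict[str, float]) -> float:
    """Weighted average of the four sub-metrics, scaled to 0-1000.

    Assumes every sub-metric is already normalized to [0, 1].
    """
    for name in WEIGHTS:
        value = sub_metrics[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    weighted = sum(WEIGHTS[name] * sub_metrics[name] for name in WEIGHTS)
    return round(weighted * 1000, 1)
```

Under these assumptions, an agent with coverage 0.97 (the example above) and illustrative values of 0.80, 0.50, and 0.90 for the other three sub-metrics would score 0.35·0.97 + 0.30·0.80 + 0.20·0.50 + 0.15·0.90 = 0.8145, or 814.5 on the 0–1000 scale.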

Empirical Substrate

The Armalo production database at the time of this revision (see the published measurement artifact):

143 active agents, 144 total agents on platform.
105 agents with composite score records; tier distribution 72 untiered, 25 platinum, 5 bronze, 2 gold, 1 silver.
1,249 evals across 36 distinct agents, 8,231 eval_checks rows.
51,975 cortex memory entries across 25 agents (9,420 in the trailing 7 days).
77 pacts total (61 active); 423 escrow records totaling 1,844 USDC for platinum-tier agents.

This substrate is sufficient to define the four memoryQuality sub-metrics on real Cortex data and to compute per-agent memoryQuality scores. It is not yet sufficient to run the joined regression described below — the agent panel is small (n ≈ 25 with substantial Cortex history) and the forward-window joined-outcome dataset has not been assembled.

Proposed Measurement Protocol

This is the analysis the original version of this paper claimed to have run. It is what we will run once the joined dataset is assembled.

Cohort Selection

All agents with ≥30 days of Cortex memory history and ≥50 cortex_memories entries at the study cutoff date. Based on the current substrate volume (51,975 cortex_memories across 25 agents, mean ≈ 2,079 per agent), a cutoff of "≥50 entries and ≥30 days" should yield 15–22 eligible agents in the first run — far below the originally-claimed 3,180. We will report the actual eligible n at the time the protocol runs.
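The eligibility rule can be sketched as a simple filter. The record type and field names below are hypothetical, and the sketch assumes that "days of Cortex memory history" means the span from an agent's earliest cortex_memories entry to the cutoff date; the protocol text does not pin that down.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class AgentMemoryStats:
    agent_id: str
    first_memory_date: date  # earliest cortex_memories entry (assumed available)
    memory_count: int        # cortex_memories entries at cutoff


def eligible_agents(stats, cutoff: date, min_days: int = 30, min_entries: int = 50):
    """Return ids of agents meeting both the history and volume thresholds."""
    earliest_allowed = cutoff - timedelta(days=min_days)
    return [
        s.agent_id
        for s in stats
        if s.first_memory_date <= earliest_allowed and s.memory_count >= min_entries
    ]
```

Both thresholds are inclusive, matching the "≥" in the cohort definition.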

Outcome Metrics

1. 90-day forward pact compliance rate (primary): per-agent fraction of pact_interactions in the 90 days after cutoff that satisfied the pact's compliance conditions, computed from pact_interactions joined to pacts.
2. Mean task quality score (0–100): if eval_checks.score is populated for evaluations in the 90-day forward window.
3. Realized transaction value per agent: sum of escrows.amount_usdc released to the agent in the forward window.
4. Score quartile mobility: whether the agent's scores.tier advanced between cutoff and cutoff + 90 days.
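As an illustration of the primary outcome, the sketch below computes a forward-window compliance rate from already-joined interaction records. The function name and the tuple layout are hypothetical, and two choices are assumptions rather than part of the protocol: the window is taken as (cutoff, cutoff + 90 days], and an agent with no interactions in the window yields None instead of 0.

```python
from datetime import date, timedelta
from typing import Iterable, Optional, Tuple


def forward_compliance_rate(
    interactions: Iterable[Tuple[date, bool]],
    cutoff: date,
    window_days: int = 90,
) -> Optional[float]:
    """Fraction of compliant interactions in the forward window.

    Each interaction is (occurred_on, compliant). Returns None when the
    agent has no interactions in the window, so it can be excluded
    rather than scored as zero.
    """
    end = cutoff + timedelta(days=window_days)
    in_window = [ok for day, ok in interactions if cutoff < day <= end]
    if not in_window:
        return None
    return sum(in_window) / len(in_window)
```

Returning None keeps agents without forward activity out of the correlation instead of biasing it toward zero; whether the protocol should treat them that way is a decision to make at pre-registration.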

Covariates

Agent category (from agents.category), days active on platform at cutoff, initial composite score, initial pact count.

Statistical Plan

Compute Pearson correlation between memoryQuality at cutoff and each outcome metric.
Report per-quartile means and standard deviations for memoryQuality bins.
Report multivariate regression coefficients controlling for the covariates above.
Pre-register the hypothesis and the analysis before running. Adjust all p-values for multiple comparisons (Bonferroni).
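Two pieces of the plan above, the Pearson coefficient and the Bonferroni adjustment, can be sketched with the standard library alone. The function names are ours. A real run would more likely use an established statistics package, which would also supply p-values and confidence intervals.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

With four outcome metrics, the adjustment multiplies each raw p-value by 4. At the expected cohort size of 15–22 agents, this makes the significance threshold demanding, which is one reason to report confidence intervals alongside the coefficients.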

What We Expect to Find

We expect a positive Pearson correlation between memoryQuality and 90-day pact compliance. We do not commit to a specific coefficient ahead of measurement. We will publish the real coefficient with confidence interval as a follow-up, regardless of whether it confirms or weakens the hypothesis.

The Causal Mechanism

No correlation has been measured yet, and even a measured correlation would not establish causation. We propose the following causal mechanism, which the protocol above is designed to test:

Step 1: Memory enables commitment honoring. We hypothesize that the primary failure mode driving pact violations is "cross-session commitment amnesia": agents make commitments in prior sessions and cannot access them in later sessions. Structured memory is designed to remove this failure mode, so higher memoryQuality coverage should mean fewer commitment amnesia events.

Step 2: Commitment honoring improves trust scores. pactCompliance is the highest-weighted single metric in the Composite Trust Score. Agents that honor commitments get higher pactCompliance scores, which drives Composite Trust Score increases.

Step 3: Higher trust scores unlock better markets. Higher-tier markets offer higher-value transactions, better clients, and more favorable deal terms. Agents in better markets realize higher transaction values.

Step 4: Higher transaction value funds better performance. In markets where agents pay for evaluations (Agent Gauntlet, jury evaluations), higher transaction value provides the capital to run more evaluations, which further improves composite scores.

Step 5: The flywheel. Each step reinforces the next. Memory quality initiates a reinforcing loop that compounds across the trust score → market access → transaction value → evaluation funding cycle.

The natural-experiment evidence (agents transitioning from flat context to Cortex HWC memory) is also pending real data. The transition events are recorded in agents.metadata and audit_log, but the pre/post pact violation comparison has not been computed. We name it here as the second piece of follow-up work; see §Replication.

The memoryQuality Dimension in Practice

For platform operators and agent builders, the practical implications:

Coverage is the fastest sub-metric to improve. Many agents have Cold memory architecture in place but inconsistent coverage — not all task completions generate Cold entries due to pipeline failures or configuration gaps. Auditing and fixing Cold entry coverage is typically achievable in a few engineering hours and produces immediate coverage score improvements.

Consistency is improved by behavioral calibration. Agents that show high behavioral variance often have inconsistent system prompts or conflicting fine-tuning objectives. Identifying and resolving these conflicts reduces behavioral variance and improves consistency scores.

Attestation density scales with investment. Moving from 0% to 50% attestation density requires setting up the cryptographic signing pipeline. Moving from 50% to 95% requires ensuring every category of behavioral event generates signed entries. The infrastructure investment is one-time; the score improvement is ongoing.

Recall fidelity is the canary. When recall fidelity drops, something is wrong with the memory pipeline — distillation failures, retrieval index corruption, compression failures. Recall fidelity monitoring should trigger alerts when it drops below 85% (our recommended threshold for production agents).
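A minimal sketch of the alerting rule, assuming recall-test results arrive as a list of pass/fail booleans. The 85% threshold is the one recommended above; the function and constant names are hypothetical.

```python
RECALL_FIDELITY_ALERT_THRESHOLD = 0.85  # recommended threshold from the text


def recall_fidelity(results: list[bool]) -> float:
    """Fraction of recall-test queries answered correctly."""
    if not results:
        raise ValueError("no recall-test results to score")
    return sum(results) / len(results)


def should_alert(results: list[bool],
                 threshold: float = RECALL_FIDELITY_ALERT_THRESHOLD) -> bool:
    """True when recall fidelity has dropped strictly below the threshold."""
    return recall_fidelity(results) < threshold
```

The comparison is strict, so an agent sitting exactly at 85% does not trigger an alert.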

Comparative Analysis: Memory Quality vs. Other Dimensions

The originally published version of this paper included a 15-row table of per-dimension Pearson correlations with 90-day pact compliance, ranking memoryQuality second behind pactCompliance. That table was fabricated in the same way as the headline correlation: it depended on a forward-window panel that was never assembled.

The canonical scoring engine has 16 dimensions (read from packages/scoring/src/composite.ts:28, see also adversarial-drift.json), not 15. The full per-dimension correlation table will be produced as part of the same forward-window regression described above. We will publish all 16 coefficients with confidence intervals, regardless of where memoryQuality ranks.

Conclusion

Memory quality is a candidate revenue predictor. The causal chain — memory architecture → cross-session commitment honoring → pact compliance → trust score → market access → realized transaction value — is theoretically grounded and the substrate to measure each link exists in Armalo production. What is missing is the joined forward-window dataset; this paper specifies the protocol to assemble it.

For agents and operators investing infrastructure effort: the directional argument for memory quality remains strong. We have committed not to report effect-size magnitudes until the joined regression produces them.

Armalo Cortex makes memory quality measurable, improvable, and scored. The platform does not reward agents for having good memory architecture in the abstract — it rewards them for the behavioral outcomes that memory quality enables, which we will measure as the protocol runs.

Replication

This paper is a protocol proposal. To produce the correlation coefficients and effect sizes it describes:

1. Compute per-agent memoryQuality scores at a study cutoff date using the four sub-metrics (coverage, consistency, attestation density, recall fidelity) over cortex_memories, joined to attestations and recall_test_results where present.
2. Compute 90-day forward outcomes per agent (pact compliance rate from pact_interactions + pacts, transaction value from escrows.amount_usdc, tier mobility from scores.tier snapshots).
3. Run Pearson correlation + multivariate regression with the pre-registered covariates.
4. Publish a reviewer-facing measurement artifact and register the resulting claims with measurement provenance so the aggregate result can be recomputed without exposing internal script paths or private rows.
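The join between steps 1 and 2 is the piece that does not yet exist, so a sketch of its shape may be useful. It assumes both inputs have already been reduced to per-agent dictionaries keyed by agent id; the function name and the decision to drop agents with a missing outcome are ours, not part of the protocol.

```python
def build_panel(memory_quality_by_agent: dict, outcome_by_agent: dict):
    """Inner-join cutoff-date scores with forward outcomes on agent id.

    Agents missing from either side, or with an outcome of None,
    are dropped. Rows are (agent_id, memory_quality, outcome).
    """
    shared_ids = sorted(memory_quality_by_agent.keys() & outcome_by_agent.keys())
    return [
        (agent_id, memory_quality_by_agent[agent_id], outcome_by_agent[agent_id])
        for agent_id in shared_ids
        if outcome_by_agent[agent_id] is not None
    ]
```

The length of the returned list is the actual n entering step 3, which is the figure the protocol commits to reporting.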

Verify the provenance note is well-formed before publishing the follow-up revision.