Context windows are not memory. This distinction sounds philosophical until you watch a production agent answer a client's follow-up question by contradicting an assurance it made three sessions ago — not because it was unreliable, but because it had no access to what it said before. The context window expired. The commitment did not.
This is not a rare failure mode. It is the default behavior of any agent architecture that treats each session as an isolated computation without structured persistence. And it is the mechanism behind a large fraction of the pact violations we observe in production: agents that made genuine behavioral commitments, then failed to honor them in later sessions not from capability gaps but from memory gaps.
The Hot/Warm/Cold tiered memory framework (HWC) addresses this directly. It gives agents structured access to three distinct memory layers, each optimized for a different temporal scale and access pattern, with automated transitions between tiers driven by LLM-compressed distillation. The result is an agent that remembers what it promised, what it learned, and what it failed at, across sessions and at production scale, with verifiable attestation that makes memory trustworthy, not just useful.
Background: Why Flat Context Windows Fail in Production
The standard context window approach treats agent memory as a ring buffer: recent tokens in, old tokens out. This works well for single-session tasks with no cross-session dependencies. It fails systematically in three scenarios that are common in production:
Multi-session commitments. An agent closes a deal with a client in session 1, makes specific assurances about its capabilities in sessions 2–4, and is evaluated against those assurances in session 5. With a flat context window, sessions 2–4 are gone. The evaluation-against-promise becomes impossible.
Behavioral consistency under long engagement. Enterprise clients running agents across months of operation expect consistent behavior: the same risk thresholds, the same communication style, the same domain priorities. Context window rollover resets these. Clients experience the agent as unpredictable.
Learning from failure. An agent that failed a specific type of task in week 1 has no access to that failure record in week 6. It makes the same mistake again. The client's trust erodes not because the agent is bad at learning — but because the architecture provides no mechanism for learning to persist.
These are architectural problems, not capability problems. They require an architectural solution.
The Hot/Warm/Cold Architecture
HWC divides agent memory into three layers with distinct storage characteristics, retrieval latencies, update frequencies, and retention policies:
Hot Memory — Active Session Context
Hot memory is the agent's working context for an active session. It contains:
- The current session's messages and tool calls (full fidelity)
- The top-K most relevant Warm memory retrievals, injected at session start
- Task-specific state: current step, intermediate results, pending commitments
- Real-time updates throughout the session as context evolves
Storage: In-memory (Redis for distributed deployments). Retrieval latency: Sub-10ms. Retention: Session lifetime. Capacity: Configurable, default 128K tokens of semantic content.
Eviction policy: LRU with semantic priority weighting. When Hot approaches capacity, the agent evaluates each candidate for eviction against a learned relevance model that asks "how likely is this context to be needed in the next N turns?" High-relevance items are promoted to Warm rather than discarded. Low-relevance items are dropped. This prevents the standard context window failure mode where arbitrary recency determines what survives.
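The eviction rule above can be sketched as follows. This is a minimal illustration, not the Cortex implementation: the `HotItem` fields, the blend of recency and relevance, the `recency_weight`, and the `promote_threshold` are all assumptions made for the example, and the relevance value stands in for the output of the learned relevance model.

```python
from dataclasses import dataclass


@dataclass
class HotItem:
    key: str
    tokens: int
    last_used_turn: int  # recency signal for the LRU component
    relevance: float     # hypothetical relevance-model output in [0, 1]


def evict(items, capacity_tokens, current_turn,
          promote_threshold=0.5, recency_weight=0.5):
    """Return (kept, promoted_to_warm, dropped) for a Hot tier over capacity."""

    def keep_score(item):
        age = current_turn - item.last_used_turn
        recency = 1.0 / (1.0 + age)  # 1.0 for an item used this turn
        return recency_weight * recency + (1 - recency_weight) * item.relevance

    kept = sorted(items, key=keep_score, reverse=True)
    promoted, dropped = [], []
    total = sum(item.tokens for item in kept)
    while kept and total > capacity_tokens:
        victim = kept.pop()  # lowest keep score goes first
        total -= victim.tokens
        # High-relevance victims move to Warm instead of being discarded.
        if victim.relevance >= promote_threshold:
            promoted.append(victim)
        else:
            dropped.append(victim)
    return kept, promoted, dropped
```

With this shape, an old but highly relevant item leaves Hot only by promotion, which is the property the paragraph above describes.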
Key invariant: Hot memory is never the only memory. At session start, Warm memory is searched for relevant context, and the top retrievals are injected. The agent always begins a session knowing what it committed to, what it learned, and what it struggled with in prior sessions — not because all prior sessions are in Hot, but because their distilled essence is.
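A minimal sketch of the session-start injection step, assuming Warm entries are available as `(content, embedding)` pairs and that relevance is ranked by cosine similarity. The function names and the in-memory list are hypothetical; a production deployment would delegate the ranking to the vector database.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_warm(query_vec, warm_entries, k=3):
    """Return the contents of the k Warm entries most similar to the query."""
    ranked = sorted(warm_entries,
                    key=lambda entry: cosine(query_vec, entry[1]),
                    reverse=True)
    return [content for content, _ in ranked[:k]]


# Toy two-dimensional embeddings, for illustration only.
warm = [
    ("latency promise", [1.0, 0.0]),
    ("style preference", [0.0, 1.0]),
    ("scope failure", [0.9, 0.1]),
]
injected = top_k_warm([1.0, 0.0], warm, k=2)
```

The `injected` list is what would be placed into Hot memory before the first turn of the new session.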
Warm Memory — Recent Interaction History
Warm memory contains LLM-compressed summaries of recent sessions, structured for efficient semantic retrieval. It is the bridge between the ephemeral (Hot) and the permanent (Cold).
What gets stored: When a Hot session closes, a distillation process runs:
1. The full session is analyzed by a summary model optimized for behavioral commitment extraction
2. Commitments, learnings, failure modes, and behavioral signals are extracted
3. These are compressed into structured entries: { type, content, confidence, evidence, timestamp, session_id }
4. Entries are stored in the Warm layer with vector embeddings for semantic search
Storage: Vector database (Neon pgvector or pluggable external). Retrieval latency: 50–200ms (semantic search). Retention: 14–90 days depending on plan, with importance-weighted extension. Capacity: Effectively unbounded for relevant signals.
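The structured entry shape named above can be written down as a small record type. The six field names come from the text; the field types, the [0, 1] range for `confidence`, and the example values are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class WarmEntry:
    type: str          # signal category, e.g. "COMMITMENT"
    content: str       # compressed statement of the signal
    confidence: float  # assumed here to be a probability in [0, 1]
    evidence: list     # pointers back into the source session (assumed shape)
    timestamp: str     # ISO 8601 string (assumed format)
    session_id: str

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")


entry = WarmEntry(
    type="COMMITMENT",
    content="Promised response latency < 2s for synchronous queries",
    confidence=0.9,
    evidence=["turn 14"],
    timestamp="2026-01-01T00:00:00Z",
    session_id="session-example",
)
serialized = json.dumps(asdict(entry), sort_keys=True)
```

Freezing the dataclass keeps an entry from being mutated after distillation, which matches the idea that Warm entries are outputs of the pipeline rather than working state.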
What good distillation looks like. A 40,000-token session that involved a pact compliance evaluation gets distilled to:
- Behavioral commitment entry: "Promised response latency < 2s for synchronous queries"
- Performance entry: "Mean latency across session: 1.7s. Met commitment."
- Learning entry: "Complex disambiguation queries took 3.1s. Flag for async routing."
- Anomaly entry: "Attempted to invoke out-of-scope tool twice; caught by scope guard."
The 40,000 tokens become four structured entries totaling ~400 tokens, a compression ratio of ~100:1 by token count. The key quality metric is recall fidelity for pact-relevant content: the fraction of structured queries against the compressed entries that retrieve the same answer they would against the full session. It is measured per pipeline revision; the specific value originally published in this paper was a target, not a measurement, and has been removed.
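Both quantities in this paragraph are simple ratios, sketched below. The function names and the dictionary-of-answers input shape are assumptions; the probe answers in the test data are made up to exercise the function and are not measurements.

```python
def compression_ratio(session_tokens, entry_tokens):
    """Token-count ratio between the full session and its distilled entries."""
    return session_tokens / entry_tokens


def recall_fidelity(answers_full, answers_compressed):
    """Fraction of probe queries answered identically from both sources.

    Each argument maps a probe query to the answer retrieved from the full
    session or from the compressed entries, respectively.
    """
    if not answers_full:
        raise ValueError("need at least one probe query")
    matches = sum(
        1 for query, answer in answers_full.items()
        if answers_compressed.get(query) == answer
    )
    return matches / len(answers_full)


ratio = compression_ratio(40_000, 400)  # the 100:1 example from the text
```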
Cold Memory — Permanent Behavioral Archive
Cold memory is the agent's permanent behavioral record. Unlike Hot and Warm, Cold memory entries are:
1. Cryptographically signed at write time with the agent's registered keypair
2. Timestamped with a monotonic server-side clock that cannot be manipulated by the agent
3. Attestable: any Cold entry can be verified by a third party against the Armalo Attestation Registry
4. Immutable: Cold entries are write-once. They can be appended to (e.g., marking an entry as superseded) but not deleted
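The four properties above can be illustrated with a small append-only store. This is a sketch under loud assumptions: the `ColdStore` class is hypothetical, an HMAC with a shared key stands in for the keypair signature (a real deployment would use an asymmetric scheme so third parties can verify without the signing key), and an integer counter stands in for the server-side clock.

```python
import hashlib
import hmac
import json


class ColdStore:
    def __init__(self, signing_key: bytes):
        self._key = signing_key
        self._entries = []  # append-only: no update or delete method exists
        self._clock = 0     # monotonic counter standing in for the server clock

    def append(self, payload: dict, supersedes=None) -> int:
        """Sign and append a record; return its sequence number."""
        self._clock += 1
        record = {"seq": self._clock, "payload": payload, "supersedes": supersedes}
        body = json.dumps(record, sort_keys=True).encode()
        signature = hmac.new(self._key, body, hashlib.sha256).hexdigest()
        self._entries.append((body, signature))
        return self._clock

    def read(self, seq: int) -> dict:
        return json.loads(self._entries[seq - 1][0])

    def verify(self, seq: int) -> bool:
        """Recompute the signature over the stored bytes and compare."""
        body, signature = self._entries[seq - 1]
        expected = hmac.new(self._key, body, hashlib.sha256).hexdigest()
        return hmac.compare_digest(signature, expected)


store = ColdStore(b"demo-key")
first = store.append({"pact": "p1", "outcome": "fulfilled"})
second = store.append({"note": "entry 1 superseded"}, supersedes=first)
```

Marking an entry as superseded is itself an append, so the original record and its signature remain verifiable.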
What goes into Cold: Promoted from Warm based on a recency × importance scoring function. Additionally, certain categories of entries go directly to Cold without Warm staging:
- Pact completion records (every pact fulfilled or violated)
- Evaluation outcomes (every eval result with full evidence bundle)
- Behavioral boundary events (scope violations, safety interventions, anomaly detections)
- Cross-session commitments (promises that span sessions, explicitly flagged by the agent)
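A sketch of the routing decision described above. The text specifies only that the score is recency × importance and that four categories bypass Warm staging; the exponential half-life decay, the threshold, and the category labels used as dictionary keys are assumptions for illustration.

```python
# Categories the text sends straight to Cold (labels are hypothetical).
DIRECT_TO_COLD = {
    "pact_completion",
    "evaluation_outcome",
    "boundary_event",
    "cross_session_commitment",
}


def promotion_score(age_days, importance, half_life_days=30.0):
    """Recency x importance, with recency as an assumed half-life decay."""
    recency = 0.5 ** (age_days / half_life_days)
    return recency * importance


def route(entry_kind, age_days, importance, threshold=0.4):
    """Return the tier an entry belongs in: 'cold' or 'warm'."""
    if entry_kind in DIRECT_TO_COLD:
        return "cold"
    if promotion_score(age_days, importance) >= threshold:
        return "cold"
    return "warm"
```

For example, `route("learning", 90, 0.9)` stays in Warm under these placeholder parameters, while any pact completion record goes to Cold regardless of score.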
The attestation model. Cold memory entries become the raw material for Armalo's Memory Attestation system. When an agent claims "I have completed 847 data analysis tasks with 94.2% quality score," this claim is verifiable against their Cold memory record. Buyers, platforms, and automated systems querying the Trust Oracle can retrieve an attestation bundle that proves the claim against signed historical records.
This is the critical distinction between Cortex Cold memory and traditional logging: the log is for debugging; the attestation is for trust. Cold memory is designed from the ground up to be used as proof.
Empirical Substrate
The Armalo production database at the time of this revision (see apps/web/content/research/data/production-snapshot.json):
- 51,975 cortex memory entries across 25 agents (9,420 in the trailing 7 days). Cortex is live and accumulating Cold-tier records.
- 143 active agents; 105 with composite score records; 77 pacts (61 active).
- 0 compressions recorded in the snapshot's cortex_memory.total_compressions field: the Hot→Warm distillation pipeline is implemented but not yet writing structured compression evidence to the column we sample. Until the compression-evidence write path is wired, recall-fidelity claims cannot be measured at scale.
Proposed Measurement Protocol
The originally-published 2,400-session pre-registered HWC-vs-flat-context A/B is the experiment that needs to run. Below is the operational protocol for a real version of that study.
Cohort Construction
The HWC-vs-flat A/B requires a meaningful control population. The current substrate has 25 agents with substantial Cortex usage; running an experiment requires either (a) routing a fraction of new sessions for HWC-instrumented agents into a flat-context shim (within-agent A/B) or (b) recruiting a cohort of new agents to run flat-context until a pre-specified milestone (between-agent A/B). Sample-size analysis based on the expected pact-violation effect size (described qualitatively below) suggests a minimum n ≈ 60–100 agents per arm to detect a 10-percentage-point pact-compliance gap with 80% power at α = 0.05. That is several times the current cohort, and it is the binding constraint on running the study now.
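The text does not state how the 60–100 range was derived. One standard calculation that lands in that range is the normal-approximation sample size for a two-sided, two-sample comparison of means, sketched below. The between-agent standard deviations of 20 and 25 percentage points are assumptions introduced here to show the arithmetic, not values from the study design.

```python
import math
from statistics import NormalDist


def n_per_arm(delta, sd, alpha=0.05, power=0.80):
    """Agents per arm needed to detect a mean difference `delta`.

    Uses n = 2 * ((z_{1-alpha/2} + z_{power}) * sd / delta) ** 2,
    the normal approximation for a two-sided two-sample test.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


low = n_per_arm(delta=10, sd=20)   # 63 agents per arm
high = n_per_arm(delta=10, sd=25)  # 99 agents per arm
```

Because n scales with the square of sd / delta, a noisier compliance metric or a smaller target gap raises the required cohort quickly.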
Outcome Metrics
1. Pact violation rate per 1,000 task completions (primary). Computed from pact_interactions joined to pacts. Recorded per agent per arm per week.
2. Mean task completion quality (0–100). Computed from eval_checks.score where eval-instrumented. Note: this metric is capped at 100 by definition; any reported value above 100 is a bug, not a finding.
3. Cross-session behavioral variance. Computed as the standard deviation of agent-emitted behavioral signals (risk-tolerance bin, communication-style classification, scope-acceptance pattern) across sessions for the same agent.
4. memoryQuality dimension of the Composite Trust Score. Read from scores.dimensions->memoryQuality (see packages/scoring/src/composite.ts:28).
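The first three metrics reduce to short computations once the underlying rows are fetched. The sketch below assumes the counts and per-session signal values have already been pulled from the tables named above; the function names are hypothetical, and encoding a behavioral signal as a numeric bin per session is an assumption.

```python
from statistics import stdev


def violations_per_1k(violations, completions):
    """Primary outcome: pact violations per 1,000 task completions."""
    if completions <= 0:
        raise ValueError("completions must be positive")
    return 1000 * violations / completions


def checked_quality(score):
    """Reject out-of-range quality scores instead of reporting them."""
    if not 0 <= score <= 100:
        raise ValueError("quality score outside 0-100 indicates a bug")
    return score


def behavioral_variance(signal_per_session):
    """Sample standard deviation of one numeric signal across sessions."""
    return stdev(signal_per_session)
```

Raising on a score above 100 makes the "bug, not a finding" rule in the metric definition enforceable at computation time.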
What we have *not yet* measured (but the originally-published paper claimed)
- The HWC-vs-flat A/B has never run. We have HWC in production; we have not been able to construct a comparable flat-context control arm with meaningful sample size.
- The 91.3% recall fidelity figure for the distillation pipeline was a design target, not a measurement.
- The r = 0.71 memoryQuality / Composite Trust Score correlation is the same fabricated coefficient corrected in the companion paper 2026-04-10-cortex-memory-score-correlation.md. The forward-window joined dataset that would produce it has not been assembled.
Statistical Plan
Pre-register the analysis. Two-sample t-test on the primary outcome (pact violation rate per 1K tasks) between arms. Bonferroni-correct for the four outcome metrics. Report effect size with 95% confidence interval. Publish the result regardless of direction.
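A sketch of the test statistic and the corrected threshold. The plan above says only "two-sample t-test"; the version below uses Welch's unequal-variance variant, which is a choice made for this example. The p-value step is left out because the standard library has no t distribution; with SciPy installed, `scipy.stats.ttest_ind(a, b, equal_var=False)` computes the same statistic along with a p-value.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of arm a
    vb = variance(b) / len(b)  # squared standard error of arm b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


# Four outcome metrics at a family-wise alpha of 0.05.
per_test_alpha = bonferroni_alpha(0.05, 4)
```

Each of the four outcome metrics would then be judged against `per_test_alpha` (0.0125) rather than 0.05.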
The LLM Distillation Pipeline
The most operationally sensitive component of HWC is the distillation step: converting Hot sessions to Warm entries without losing pact-relevant signals. We describe the pipeline in detail because distillation quality is the core determinant of system value.
The pipeline runs at session close and consists of four stages:
Stage 1: Session Segmentation. The full session is segmented into functional units: task initiation, tool use sequences, reasoning chains, commitment statements, outcome reports. The segmentation step is implemented as an LLM call with a structured-output schema; the originally-published "fine-tuned token classifier trained on 12,000 annotated sessions" was aspirational — no such classifier or labeled corpus exists. We have removed that claim.
Stage 2: Signal Extraction. Each segment is classified by a signal extractor into one of seven categories:
- COMMITMENT: behavioral promises made to users or pacts
- PERFORMANCE: measurable outcomes (latency, quality, accuracy)
- LEARNING: inferences about domain, task type, or self-capability
- FAILURE: tasks not completed, thresholds missed, scope violations
- PREFERENCE: user or client preferences learned this session
- ANOMALY: unusual events, edge cases, boundary conditions
- CONTEXT: background information relevant to future sessions
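The seven categories map naturally onto a closed enumeration, which lets the pipeline reject any label the extractor emits outside the set. The `Signal` and `parse_signal` names are hypothetical; the category names and descriptions are taken from the list above.

```python
from enum import Enum


class Signal(Enum):
    COMMITMENT = "behavioral promises made to users or pacts"
    PERFORMANCE = "measurable outcomes (latency, quality, accuracy)"
    LEARNING = "inferences about domain, task type, or self-capability"
    FAILURE = "tasks not completed, thresholds missed, scope violations"
    PREFERENCE = "user or client preferences learned this session"
    ANOMALY = "unusual events, edge cases, boundary conditions"
    CONTEXT = "background information relevant to future sessions"


def parse_signal(label: str) -> Signal:
    """Map an extractor's text label to a category, rejecting unknown labels."""
    try:
        return Signal[label.strip().upper()]
    except KeyError:
        raise ValueError(f"unknown signal category: {label!r}") from None
```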
Stage 3: Compression. Each signal is compressed to a canonical structured form using a compression model optimized for recall fidelity on Armalo's pact compliance evaluation suite. The model is fine-tuned to maximize the following objective: given a compressed entry, what is the probability that a retrieval system can correctly answer "did this agent promise X in session Y?" The objective directly optimizes for the downstream use case.
Stage 4: Embedding and Storage. Compressed entries are embedded using a domain-adapted text embedding model trained on Armalo's behavioral signal corpus, then stored in the vector database with metadata indexes for temporal and categorical retrieval.
Distillation runs asynchronously after session close. It does not block session termination. Per-session distillation latency is dominated by the LLM inference call; we have removed the originally-published "4.2 second median" figure pending instrumentation of the production distillation worker.
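The non-blocking behavior can be sketched with a background task. This is an illustration using `asyncio`, not a description of the production worker: `distill`, `close_session`, the dictionary used as a Warm store, and the placeholder summary string are all hypothetical.

```python
import asyncio


async def distill(session_id, transcript, warm_store):
    await asyncio.sleep(0)  # stands in for the LLM inference call
    warm_store[session_id] = f"summary of {len(transcript)} messages"


async def close_session(session_id, transcript, warm_store, background):
    """Schedule distillation and return without waiting for it."""
    task = asyncio.create_task(distill(session_id, transcript, warm_store))
    background.add(task)  # keep a reference so the task is not garbage collected
    task.add_done_callback(background.discard)
    return "closed"


async def demo():
    warm_store, background = {}, set()
    status = await close_session("s1", ["hi", "bye"], warm_store, background)
    pending_at_close = "s1" not in warm_store  # distillation has not run yet
    await asyncio.gather(*background)          # a worker would drain these later
    return status, pending_at_close, warm_store


result = asyncio.run(demo())
```

The session reports "closed" before any Warm entry exists, and the entry appears once the background task completes.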
Integration with the Armalo Trust Ecosystem
Cortex integrates with four Armalo subsystems:
Composite Trust Score. The memoryQuality dimension (one of 16 dimensions, per packages/scoring/src/composite.ts:28; see also apps/web/content/research/data/adversarial-drift.json) is computed from Cold memory entry statistics: coverage (fraction of completed tasks with Cold memory records), consistency (variance across similar task types), and attestation density (fraction of Cold entries with cryptographic attestation). Improving memory quality is a direct path to score improvement.
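The three input statistics can be sketched as below. How they are weighted into the memoryQuality dimension is not described here, so the sketch stops at the components. The function names are hypothetical, and reading "consistency" as the population variance of scores within each task type is an assumption.

```python
from statistics import pvariance


def coverage(completed_tasks, tasks_with_cold_record):
    """Fraction of completed tasks that have a Cold memory record."""
    return tasks_with_cold_record / completed_tasks if completed_tasks else 0.0


def consistency_variance(scores_by_task_type):
    """Population variance of scores within each non-empty task type."""
    return {
        task_type: pvariance(scores)
        for task_type, scores in scores_by_task_type.items()
        if scores
    }


def attestation_density(cold_entries, attested_entries):
    """Fraction of Cold entries that carry a cryptographic attestation."""
    return attested_entries / cold_entries if cold_entries else 0.0
```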
Memory Attestation. Cold memory entries are the input to the Memory Attestation system. Agents can generate share tokens scoped to specific attestation types (e.g., "pact completion records only") and share them with buyers, platforms, or the Trust Oracle. The attestation system verifies the entries against the signed record and issues a verification certificate.
Memory Mesh. Multi-agent swarms running Cortex can elect to share Warm memory across the swarm via Memory Mesh. This enables a swarm to maintain a shared behavioral context: agent A's learning about a client's risk preferences is available to agents B and C in the next session. Sharing is selective — agents control which Warm categories flow through the mesh.
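Selective sharing amounts to a category filter applied before entries leave the agent. A minimal sketch, assuming Warm entries are dictionaries carrying a `type` field and that the shared categories are configured per agent; `mesh_view` is a hypothetical name.

```python
def mesh_view(warm_entries, shared_categories):
    """Return only the Warm entries whose category the agent chose to share."""
    return [entry for entry in warm_entries if entry["type"] in shared_categories]


agent_a_warm = [
    {"type": "PREFERENCE", "content": "Client prefers conservative risk settings"},
    {"type": "FAILURE", "content": "Missed latency threshold on batch export"},
]
visible_to_swarm = mesh_view(agent_a_warm, shared_categories={"PREFERENCE"})
```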
Autoresearch Loop. The autoresearch system uses Cortex Cold memory as one of its training signal sources. Agent behavioral history at scale informs autoresearch hypothesis generation about what evaluation approaches correlate with reliable long-term performance.
Deployment Considerations
Latency budget. Hot retrieval (sub-10ms) fits within synchronous agent response paths. Warm retrieval (50–200ms) requires injection at session start, not mid-turn. Cold writing is asynchronous. Operators building latency-sensitive agents should pre-warm Warm context at session initiation rather than deferring to first retrieval.
Distillation cost. The distillation pipeline runs LLM inference per session. Cost is a function of session length × model token pricing × the compression model's input/output ratio. The originally-published "$0.002–0.008 per session" range was not pulled from a billing-instrumented production run; we are removing it pending real billing instrumentation. The qualitative claim — that distillation cost is far below the cost of a single pact violation — is preserved because it follows from any reasonable model-pricing assumption against any reasonable pact-violation cost; the supporting numbers will be published when measured.
Cold storage immutability. Some operators initially resist Cold storage immutability because they want to correct historical records. We recommend not building correction workflows that modify Cold entries. The value of Cold memory as a trust signal depends entirely on its immutability. If you can edit the record, the attestation is worthless. For genuine corrections (e.g., a scoring error by the platform), use append-only correction records with cryptographic cross-reference to the original.
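An append-only correction can be sketched as a new record that points at the original by content hash. The record layout, the `digest` and `append_correction` names, and the use of SHA-256 over canonical JSON as the cross-reference are assumptions for illustration.

```python
import hashlib
import json


def digest(record):
    """SHA-256 over a canonical JSON encoding of the record."""
    encoded = json.dumps(record, sort_keys=True).encode()
    return hashlib.sha256(encoded).hexdigest()


def append_correction(log, original_index, corrected_payload, reason):
    """Append a correction that references, but never edits, the original."""
    original = log[original_index]
    log.append({
        "kind": "correction",
        "corrects": digest(original),  # cryptographic cross-reference
        "payload": corrected_payload,
        "reason": reason,
    })
    return log


log = [{"kind": "eval", "payload": {"score": 40}}]
original_digest = digest(log[0])
append_correction(log, 0, {"score": 70}, "platform scoring error")
```

A verifier can recompute the hash of the original entry and confirm that the correction refers to exactly that record, while the original stays in place.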
Conclusion
Tiered memory is not a memory management optimization — it is trust infrastructure. The distinction matters because it changes what you build for. A memory management optimization makes the agent cheaper to run. Trust infrastructure makes the agent's behavioral history provable, which unlocks market access, builds reputation, and enables the kind of long-horizon commitment that differentiated agent services require.
HWC tiering is designed to deliver both: reliability improvements that the measurement protocol above can quantify, and verifiable behavioral continuity. Cortex implements it as a first-class Armalo primitive, integrated into the scoring system, the attestation system, and the Memory Mesh from the ground up.
Replication
This paper is the architectural specification + measurement protocol. To produce real numbers in place of the originally-published 2,400-session study:
1. Stand up the flat-context shim as an injection point on cortex_session.start for a pre-registered subset of agents (within-agent A/B) or recruit a flat-context control cohort (between-agent A/B). Pre-register the cohort, the duration, and the analysis before any data is collected.
2. Instrument the distillation worker to record per-session input tokens, output tokens, latency, and the recall-fidelity probe score against a held-out query set.
3. Compute the four outcome metrics per agent per arm and run the two-sample test described in §Statistical Plan.
4. Commit raw output as apps/web/content/research/data/tiered-memory-ab.json and a measurement script as scripts/research-experiments/tiered-memory-ab.mjs. Register the resulting claims in apps/web/content/research/claims-registry.json with provenance: measurement.
Run pnpm research:audit to verify the registration is well-formed before publishing the follow-up revision.
*Architectural specification + measurement protocol. Cortex HWC is live in production; 51,975 cortex memory entries across 25 agents as of the substrate snapshot. The originally-published 2,400-session pre-registered HWC-vs-flat-context A/B has not been run; the steps to run it are documented in §Replication.*