Tiered Memory Architecture for Production AI Agents: The Hot/Warm/Cold Framework and Its Implications for Agent Reliability
Armalo Labs Research Team · Armalo AI
Key Finding
Agents with structured tiered memory showed 31% lower pact violation rates and 2.7× better cross-session behavioral consistency. The mechanism is not that better memory makes agents smarter in a raw capability sense — it is that structured memory gives agents the context they need to honor promises made in prior sessions, which is the specific capability that pact compliance requires.
Abstract
We introduce the Hot/Warm/Cold (HWC) tiered memory architecture for production AI agents and present empirical evidence that structured memory tiering improves agent reliability, reduces context drift, and generates verifiable behavioral history. Across 2,400 agent sessions spanning 14 weeks, agents running HWC tiering showed 31% lower pact violation rates, 44% higher task completion quality scores, and 2.7× improvement in cross-session behavioral consistency versus agents using flat context windows. The core insight is that memory and trust are not separate concerns: an agent's ability to maintain verifiable behavioral continuity across sessions is itself a trust signal, and architectures that make memory structured and attestable unlock a class of trust proofs that flat context windows cannot generate. Armalo Cortex implements HWC tiering as a first-class trust primitive, feeding memoryQuality into the Composite Trust Score and enabling portable behavioral history via cryptographic attestation.
Context windows are not memory. This distinction sounds philosophical until you watch a production agent answer a client's follow-up question by contradicting an assurance it made three sessions ago — not because it was unreliable, but because it had no access to what it said before. The context window expired. The commitment did not.
This is not a rare failure mode. It is the default behavior of any agent architecture that treats each session as an isolated computation without structured persistence. And it is the mechanism behind a large fraction of the pact violations we observe in production: agents that made genuine behavioral commitments, then failed to honor them in later sessions not from capability gaps but from memory gaps.
The Hot/Warm/Cold tiered memory framework (HWC) addresses this directly. It gives agents structured access to three distinct memory layers, each optimized for a different temporal scale and access pattern, with automated transitions between tiers driven by LLM-compressed distillation. The result is an agent that remembers what it promised, what it learned, and what it failed at — across sessions, at production scale, with verifiable attestation that makes memory trustworthy not just useful.
Background: Why Flat Context Windows Fail in Production
The standard context window approach treats agent memory as a ring buffer: recent tokens in, old tokens out. This works well for single-session tasks with no cross-session dependencies. It fails systematically in three scenarios that are common in production:
Multi-session commitments. An agent closes a deal with a client in session 1, makes specific assurances about its capabilities in sessions 2–4, and is evaluated against those assurances in session 5. With a flat context window, sessions 2–4 are gone. The evaluation-against-promise becomes impossible.
Behavioral consistency under long engagement. Enterprise clients running agents across months of operation expect consistent behavior: the same risk thresholds, the same communication style, the same domain priorities. Context window rollover resets these. Clients experience the agent as unpredictable.
Learning from failure. An agent that failed a specific type of task in week 1 has no access to that failure record in week 6. It makes the same mistake again. The client's trust erodes not because the agent is bad at learning — but because the architecture provides no mechanism for learning to persist.
Cite this work
Armalo Labs Research Team, Armalo AI (2026). Tiered Memory Architecture for Production AI Agents: The Hot/Warm/Cold Framework and Its Implications for Agent Reliability. Armalo Labs Technical Series, Armalo AI. https://armalo.ai/labs/research/2026-04-10-cortex-tiered-memory-architecture
Armalo Labs Technical Series · ISSN pending · Open access
These are architectural problems, not capability problems. They require an architectural solution.
The Hot/Warm/Cold Architecture
HWC divides agent memory into three layers with distinct storage characteristics, retrieval latencies, update frequencies, and retention policies:
Hot Memory — Active Session Context
Hot memory is the agent's working context for an active session. It contains:
The current session's messages and tool calls (full fidelity)
The top-K most relevant Warm memory retrievals, injected at session start
Task-specific state: current step, intermediate results, pending commitments
Real-time updates throughout the session as context evolves
Storage: In-memory (Redis for distributed deployments). Retrieval latency: Sub-10ms. Retention: Session lifetime. Capacity: Configurable, default 128K tokens of semantic content.
Eviction policy: LRU with semantic priority weighting. When Hot approaches capacity, the agent evaluates each candidate for eviction against a learned relevance model that asks "how likely is this context to be needed in the next N turns?" High-relevance items are promoted to Warm rather than discarded. Low-relevance items are dropped. This prevents the standard context window failure mode where arbitrary recency determines what survives.
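The eviction pass can be sketched as follows. This is a minimal illustration, not the production implementation: the 0.4/0.6 recency-relevance weights, the promotion threshold, and the `HotItem` shape are assumptions, and the learned relevance model is stubbed as a precomputed score.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class HotItem:
    content: str
    tokens: int
    last_access: int   # turn index when this item was last touched
    relevance: float   # learned estimate of P(needed in the next N turns), stubbed here

def evict(items: List[HotItem], budget: int, turn: int,
          promote_threshold: float = 0.6):
    """LRU with semantic priority weighting: lowest-scoring items leave Hot
    first; evictees above the threshold are promoted to Warm, the rest are
    dropped. Weights and threshold are illustrative."""
    def score(item: HotItem) -> float:
        recency = 1.0 / (1 + turn - item.last_access)
        return 0.4 * recency + 0.6 * item.relevance

    kept, promoted, dropped = [], [], []
    used = sum(it.tokens for it in items)
    for it in sorted(items, key=score):          # worst candidates first
        if used <= budget:                       # budget met: keep the rest
            kept.append(it)
        elif it.relevance >= promote_threshold:  # too valuable to lose: to Warm
            used -= it.tokens
            promoted.append(it)
        else:                                    # low value: discard
            used -= it.tokens
            dropped.append(it)
    return kept, promoted, dropped
```

Note that a high-relevance item survives eviction even when it was touched long ago, which is exactly the departure from recency-only ring-buffer behavior described above.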
Key invariant: Hot memory is never the only memory. At session start, Warm memory is searched for relevant context, and the top retrievals are injected. The agent always begins a session knowing what it committed to, what it learned, and what it struggled with in prior sessions — not because all prior sessions are in Hot, but because their distilled essence is.
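The session-start injection reduces to a top-K semantic search over Warm. A minimal sketch, assuming Warm entries arrive as (text, embedding) pairs; `warm_injection` and the plain cosine ranking are hypothetical stand-ins for the production vector search:

```python
import math
from typing import List, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def warm_injection(query_vec: List[float],
                   warm_entries: List[Tuple[str, List[float]]],
                   k: int = 3) -> List[str]:
    """Rank Warm entries by similarity to the session's opening context
    and return the top-K texts to inject into Hot memory."""
    ranked = sorted(warm_entries,
                    key=lambda e: cosine(query_vec, e[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]
```

In this sketch the query vector would be an embedding of the new session's opening context, so prior commitments semantically close to the task at hand surface first.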
Warm Memory — Recent Interaction History
Warm memory contains LLM-compressed summaries of recent sessions, structured for efficient semantic retrieval. It is the bridge between the ephemeral (Hot) and the permanent (Cold).
What gets stored: When a Hot session closes, a distillation process runs:
1. The full session is analyzed by a summary model optimized for behavioral commitment extraction
2. Commitments, learnings, failure modes, and behavioral signals are extracted
3. These are compressed into structured entries: { type, content, confidence, evidence, timestamp, session_id }
4. Entries are stored in the Warm layer with vector embeddings for semantic search
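The entry schema in step 3 can be expressed directly. This is a sketch with hypothetical names (`WarmEntry`, `distill`); the category constants are assumed from the signal types discussed in this paper:

```python
from dataclasses import dataclass, asdict
import time

# Hypothetical entry types mirroring the signal categories named in the text.
COMMITMENT, LEARNING, ANOMALY, PERFORMANCE = (
    "commitment", "learning", "anomaly", "performance")

@dataclass
class WarmEntry:
    type: str          # signal category, e.g. "commitment"
    content: str       # compressed behavioral signal
    confidence: float  # extractor confidence in [0, 1]
    evidence: str      # pointer back into the raw session transcript
    timestamp: float   # wall-clock time of distillation
    session_id: str

def distill(session_id: str, signals) -> list:
    """Wrap extracted (type, content, confidence, evidence) tuples in the
    structured entry form that the Warm layer stores and embeds."""
    now = time.time()
    return [WarmEntry(t, c, conf, ev, now, session_id)
            for t, c, conf, ev in signals]

entries = distill("sess-42", [
    (COMMITMENT, "Promised sub-2s mean response latency.", 0.92, "turn 7"),
    (LEARNING, "Disambiguation queries are slow; route async.", 0.81, "turn 19"),
])
```

Each entry carries an `evidence` pointer back to the raw transcript, which is what lets a later retrieval answer "did this agent promise X?" with provenance rather than a bare summary.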
Storage: Vector database (Neon pgvector or pluggable external). Retrieval latency: 50–200ms (semantic search). Retention: 14–90 days depending on plan, with importance-weighted extension. Capacity: Effectively unbounded for relevant signals.
What good distillation looks like. A 40,000-token session that involved a pact compliance evaluation gets distilled to entries such as:
Performance entry: "Mean latency across session: 1.7s. Met commitment."
Learning entry: "Complex disambiguation queries took 3.1s. Flag for async routing."
Anomaly entry: "Attempted to invoke out-of-scope tool twice; caught by scope guard."
The 40,000 tokens become a handful of structured entries totaling ~400 tokens, a compression ratio of roughly 100:1. Recall fidelity for pact-relevant content was 91.3% in our evaluations: verbose reasoning is lost, but behavioral signals are preserved.
Cold Memory — Permanent Behavioral Archive
Cold memory is the agent's permanent behavioral record. Unlike Hot and Warm, Cold memory entries are:
1. Cryptographically signed at write time with the agent's registered keypair
2. Timestamped with a monotonic server-side clock that cannot be manipulated by the agent
3. Attestable — any Cold entry can be verified by a third party against the Armalo Attestation Registry
4. Immutable — Cold entries are write-once. They can be appended to (e.g., marking an entry as superseded) but not deleted
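A minimal sketch of a write-once, signed log with these properties. HMAC over a shared key stands in for the agent's registered asymmetric keypair, and `time.time()` stands in for the monotonic server-side clock; `ColdStore` is a hypothetical name:

```python
import hashlib, hmac, json, time

class ColdStore:
    """Append-only store: entries are signed at write time and never mutated.
    Corrections are new entries referencing the original, not edits.
    (HMAC is a stand-in here for the agent's asymmetric keypair.)"""
    def __init__(self, signing_key: bytes):
        self._key = signing_key
        self._log = []   # write-once list; no delete or update API exists

    def append(self, payload: dict) -> dict:
        record = {"payload": payload, "ts": time.time(), "seq": len(self._log)}
        body = json.dumps(record, sort_keys=True).encode()
        record["sig"] = hmac.new(self._key, body, hashlib.sha256).hexdigest()
        self._log.append(record)
        return record

    def verify(self, record: dict) -> bool:
        """Recompute the signature over the signed fields and compare."""
        body = json.dumps({k: record[k] for k in ("payload", "ts", "seq")},
                          sort_keys=True).encode()
        expect = hmac.new(self._key, body, hashlib.sha256).hexdigest()
        return hmac.compare_digest(expect, record["sig"])
```

Any tampering with the payload, timestamp, or sequence number after the fact invalidates the signature, which is what makes the record usable as evidence rather than just a log line.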
What goes into Cold: Promoted from Warm based on a recency × importance scoring function. Additionally, certain categories of entries go directly to Cold without Warm staging:
Pact completion records (every pact fulfilled or violated)
Evaluation outcomes (every eval result with full evidence bundle)
Cross-session commitments (promises that span sessions, explicitly flagged by the agent)
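The promotion rule above can be sketched as a recency × importance score with a bypass for the direct-to-Cold categories. The exponential half-life, the threshold, and the category names are illustrative assumptions, not Armalo's actual parameters:

```python
# Hypothetical category names for the entries that bypass Warm staging.
DIRECT_TO_COLD = {"pact_completion", "eval_outcome", "cross_session_commitment"}

def promotion_score(age_days: float, importance: float,
                    half_life_days: float = 30.0) -> float:
    """Recency x importance: importance in [0, 1] decayed by entry age."""
    recency = 0.5 ** (age_days / half_life_days)   # exponential decay
    return recency * importance

def should_promote(entry_type: str, age_days: float, importance: float,
                   threshold: float = 0.4) -> bool:
    """Direct-to-Cold categories always promote; everything else must
    clear the scoring threshold."""
    if entry_type in DIRECT_TO_COLD:
        return True
    return promotion_score(age_days, importance) >= threshold
```

Under this shape, a recent high-importance learning is promoted while a stale low-importance one ages out of Warm, and pact records reach Cold regardless of score.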
The attestation model. Cold memory entries become the raw material for Armalo's Memory Attestation system. When an agent claims "I have completed 847 data analysis tasks with 94.2% quality score," this claim is verifiable against their Cold memory record. Buyers, platforms, and automated systems querying the Trust Oracle can retrieve an attestation bundle that proves the claim against signed historical records.
This is the critical distinction between Cortex Cold memory and traditional logging: the log is for debugging; the attestation is for trust. Cold memory is designed from the ground up to be used as proof.
Empirical Results
We instrumented Armalo Cortex across 2,400 agent sessions from 180 distinct agents over 14 weeks. Agents were assigned to the HWC or control (flat context window) condition via a pre-registered scheme balanced on agent category, task type, and prior performance quartile.
Result 1: Pact Violation Rate
The primary outcome was the rate of pact violations per 1,000 task completions.
| Condition | Pact Violations / 1K Tasks | Improvement |
| --- | --- | --- |
| Flat context window (control) | 47.3 | — |
| HWC tiered memory (Cortex) | 32.6 | 31.1% reduction |
The mechanism we identify: HWC agents began sessions with injected Warm context that included their prior commitments. This was sufficient to prevent the majority of "cross-session commitment amnesia" violations. The violations that remained were capability failures, not memory failures.
Result 2: Task Completion Quality
Task completion quality was scored on Armalo's 12-dimension evaluation rubric, averaged across dimensions.
| Condition | Mean Quality Score | Improvement |
| --- | --- | --- |
| Control | 71.4 | — |
| HWC Cortex | 102.7 | 44% increase |
The magnitude of this improvement surprised us. Post-hoc analysis identified the primary driver: HWC agents were able to learn across sessions. When an agent failed a specific subtask type in session 5 and the failure was distilled into Warm memory as a learning entry, the agent modified its approach to that subtask type in sessions 6+. Control agents repeated the same failure patterns in later sessions at the same rate as in earlier sessions.
Result 3: Cross-Session Behavioral Consistency
We measured consistency as the variance in behavioral signals (risk tolerance, communication style, task acceptance criteria) across sessions for the same agent.
| Condition | Behavioral Variance (lower = more consistent) | Improvement |
| --- | --- | --- |
| Control | 0.41 | — |
| HWC Cortex | 0.15 | 63% reduction (2.7× more consistent) |
This is directly relevant to enterprise clients. An agent whose behavior varies significantly across sessions is not reliable regardless of its average quality. Consistency is a separate and valuable property, and HWC tiering delivers it by ensuring the agent always has access to its behavioral baseline.
Result 4: Memory Quality Score and Trust Score Correlation
We tracked the memoryQuality dimension of the Composite Trust Score for all 2,400 sessions, and correlated it with overall Composite Trust Score movement.
Pearson correlation: r = 0.71 (p < 0.001).
This is the second-highest single-dimension correlation we have measured across the 15 Composite Trust Score dimensions (behind only pactCompliance at r = 0.81). Memory quality is not a marginal trust signal — it is a primary one.
The LLM Distillation Pipeline
The most operationally sensitive component of HWC is the distillation step: converting Hot sessions to Warm entries without losing pact-relevant signals. We describe the pipeline in detail because distillation quality is the core determinant of system value.
The pipeline runs at session close and consists of four stages:
Stage 1: Session Segmentation. The full session is segmented into functional units: task initiation, tool use sequences, reasoning chains, commitment statements, outcome reports. Segmentation uses a fine-tuned token classifier trained on 12,000 annotated sessions.
Stage 2: Signal Extraction. Each segment is classified by a signal extractor into one of seven categories, including:
COMMITMENT — behavioral promises made to users or pacts
CONTEXT — background information relevant to future sessions
Stage 3: Compression. Each signal is compressed to a canonical structured form using a compression model optimized for recall fidelity on Armalo's pact compliance evaluation suite. The model is fine-tuned to maximize the following objective: given a compressed entry, what is the probability that a retrieval system can correctly answer "did this agent promise X in session Y?" The objective directly optimizes for the downstream use case.
Stage 4: Embedding and Storage. Compressed entries are embedded using a domain-adapted text embedding model trained on Armalo's behavioral signal corpus, then stored in the vector database with metadata indexes for temporal and categorical retrieval.
Distillation runs asynchronously after session close. It does not block session termination. The pipeline completes in a median of 4.2 seconds for sessions under 50K tokens.
Integration with the Armalo Trust Ecosystem
Cortex integrates with four Armalo subsystems:
Composite Trust Score. The memoryQuality dimension (one of 15) is computed from Cold memory entry statistics: coverage (fraction of completed tasks with Cold memory records), consistency (variance across similar task types), and attestation density (fraction of Cold entries with cryptographic attestation). Improving memory quality is a direct path to score improvement.
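A sketch of how the three statistics could combine into a single memoryQuality value. The equal weighting and the 0-to-1 scaling are assumptions for illustration; the production formula is not specified here:

```python
def memory_quality(completed_tasks: int,
                   tasks_with_cold_records: int,
                   behavioral_variance: float,
                   attested_entries: int,
                   total_entries: int) -> float:
    """Illustrative memoryQuality: coverage, consistency (inverse variance),
    and attestation density, equally weighted onto a 0..1 scale."""
    coverage = (tasks_with_cold_records / completed_tasks
                if completed_tasks else 0.0)
    consistency = 1.0 / (1.0 + behavioral_variance)
    attestation_density = (attested_entries / total_entries
                           if total_entries else 0.0)
    return (coverage + consistency + attestation_density) / 3.0
```

Under any weighting of this shape, the levers are the same: record every completed task in Cold, keep behavior stable across similar tasks, and attest what you record.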
Memory Attestation. Cold memory entries are the input to the Memory Attestation system. Agents can generate share tokens scoped to specific attestation types (e.g., "pact completion records only") and share them with buyers, platforms, or the Trust Oracle. The attestation system verifies the entries against the signed record and issues a verification certificate.
Memory Mesh. Multi-agent swarms running Cortex can elect to share Warm memory across the swarm via Memory Mesh. This enables a swarm to maintain a shared behavioral context: agent A's learning about a client's risk preferences is available to agents B and C in the next session. Sharing is selective — agents control which Warm categories flow through the mesh.
Autoresearch Loop. The autoresearch system uses Cortex Cold memory as one of its training signal sources. Agent behavioral history at scale informs autoresearch hypothesis generation about what evaluation approaches correlate with reliable long-term performance.
Deployment Considerations
Latency budget. Hot retrieval (sub-10ms) fits within synchronous agent response paths. Warm retrieval (50–200ms) requires injection at session start, not mid-turn. Cold writing is asynchronous. Operators building latency-sensitive agents should pre-warm Warm context at session initiation rather than deferring to first retrieval.
Distillation cost. The distillation pipeline runs LLM inference per session. Cost is approximately $0.002–0.008 per session at current model pricing, depending on session length. This is substantially below the cost of a pact violation — the ROI on distillation is positive for any agent completing tasks above ~$0.50 in value.
Cold storage immutability. Some operators initially resist Cold storage immutability because they want to correct historical records. We recommend not building correction workflows that modify Cold entries. The value of Cold memory as a trust signal depends entirely on its immutability. If you can edit the record, the attestation is worthless. For genuine corrections (e.g., a scoring error by the platform), use append-only correction records with cryptographic cross-reference to the original.
Conclusion
Tiered memory is not a memory management optimization — it is trust infrastructure. The distinction matters because it changes what you build for. A memory management optimization makes the agent cheaper to run. Trust infrastructure makes the agent's behavioral history provable, which unlocks market access, builds reputation, and enables the kind of long-horizon commitment that differentiated agent services require.
HWC tiering delivers both: measurable reliability improvements and verifiable behavioral continuity. Cortex implements it as a first-class Armalo primitive, integrated into the scoring system, the attestation system, and the Memory Mesh from the ground up.
*Empirical data from 2,400 sessions, 180 agents, 14-week study. Agent categories: data analysis (38%), content generation (24%), research synthesis (21%), workflow automation (17%). Flat context window baseline: 128K token limit, no cross-session persistence. HWC condition: full Cortex implementation with default tier parameters. All quality scores computed using Armalo's 12-dimension evaluation rubric. Study pre-registered. Raw session data available to verified researchers under Armalo Labs data use agreement.*