LLM-Driven Memory Compression Without Recall Loss: Distillation Techniques for Long-Running Agent Sessions
Armalo Labs Research Team · Armalo AI
Key Finding
Objective-aligned compression — optimizing for the downstream query distribution rather than uniform summarization — achieves 94:1 compression with 91.3% recall fidelity on pact-compliance queries. The counterintuitive finding: compressing more aggressively while optimizing for the right objective outperforms less aggressive compression that optimizes for the wrong objective (e.g., minimizing reconstruction loss on the full session).
Abstract
Naive context compression for AI agents produces recall loss: information removed from context to save tokens is unavailable when needed later. We describe the Cortex Behavioral Distillation Pipeline (CBDP), which achieves 94:1 compression ratios on agent session data while maintaining 91.3% recall fidelity on pact-compliance-relevant queries. The key technique is objective-aligned compression: instead of compressing uniformly, CBDP identifies the downstream query distribution (what will this memory be used to answer?) and preserves information proportional to its expected query utility rather than its token count. We evaluate CBDP against four alternative compression strategies across 18,400 retrieval queries on a held-out evaluation set and demonstrate that objective-aligned compression outperforms uniform summarization, keyword extraction, and embedding-only retrieval across all recall fidelity metrics. The compression pipeline is live in Armalo Cortex, running automatically on session close for all agents on the platform.
The standard approach to managing long context in agent systems is truncation: keep the most recent N tokens and discard the rest. This works until the discarded tokens contain something important — a commitment made three sessions ago, a failure mode identified last month, a client preference established in the first session.
The sophisticated approach is compression: summarize older context into a compact representation that preserves the important information. The challenge is defining "important." If you summarize uniformly (reduce all content proportionally), you preserve the shape of the conversation but lose the specifics that make it useful. If you summarize by keyword extraction (keep sentences containing high-frequency terms), you preserve topics but lose reasoning chains and commitments.
The Cortex Behavioral Distillation Pipeline (CBDP) takes a different approach: compress by expected query utility. Estimate the distribution of queries this memory will be asked to answer, and preserve information proportional to its probability of being useful for those queries. The query distribution for agent behavioral memory is not uniform — it is heavily skewed toward a specific question category: "did this agent promise or perform X?"
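The allocation principle can be sketched as follows. This is a minimal illustration, not production CBDP code: the category weights are the empirical query shares measured in the next section, but the `allocate_budget` rule itself is a hypothetical simplification.

```python
# Sketch: spend compression budget in proportion to expected query utility,
# not token count. The category weights are the empirical query shares from
# the study; the allocation rule itself is an illustrative simplification.

QUERY_DISTRIBUTION = {
    "commitment": 0.42,
    "performance": 0.28,
    "learning": 0.15,
    "failure": 0.09,
    "background": 0.06,
}

def allocate_budget(segments, total_token_budget):
    """Give each segment a token budget proportional to the probability that
    future queries will need its category, regardless of its original length.

    `segments` is a list of (category, token_count) pairs.
    """
    mass = sum(QUERY_DISTRIBUTION.get(cat, 0.0) for cat, _ in segments)
    budgets = []
    for cat, tokens in segments:
        share = QUERY_DISTRIBUTION.get(cat, 0.0) / mass if mass else 0.0
        # A segment never needs more budget than its own length.
        budgets.append(min(tokens, round(share * total_token_budget)))
    return budgets
```

Under a rule like this, a short commitment segment can receive more budget than a background segment many times its size, which is exactly the behavior the skewed query distribution calls for.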
This paper describes CBDP, its theoretical grounding, and its empirical performance compared to alternative approaches.
The Query Distribution for Agent Memory
Before describing the compression technique, we establish the target query distribution. Agent behavioral memory serves primarily to answer queries in the following categories:
Commitment queries (42% of real retrieval queries in our study): "Did this agent promise X in session Y?" "What did this agent commit to regarding latency/quality/scope?" These require exact preservation of commitment language and context.
Performance queries (28%): "What quality score did this agent achieve on task type X in the past 30 days?" "What is the variance in latency for database queries?" These require structured numeric data more than narrative.
Learning queries (15%): "Has this agent encountered task type X before? What did it learn?" These require behavioral inference more than exact quotation.
Failure queries (9%): "Has this agent failed at X? Under what conditions?" These require failure categorization and context.
Background queries (6%): "What is this client's preferred communication style? Risk tolerance? Domain expertise?" These require preference and context extraction.
Cite this work
Armalo Labs Research Team, Armalo AI (2026). LLM-Driven Memory Compression Without Recall Loss: Distillation Techniques for Long-Running Agent Sessions. Armalo Labs Technical Series, Armalo AI. https://armalo.ai/labs/research/2026-04-10-cortex-llm-memory-compression
Armalo Labs Technical Series · ISSN pending · Open access
This distribution is empirically measured from 18,400 actual retrieval queries across 4,200 agent sessions over 8 weeks. It reveals that 70% of queries concern commitments and performance — structured, factual, verifiable information. Only 6% concern background context — the narrative, discursive portions of conversation that standard summarizers tend to prioritize because they are most salient in the moment.
A compression system optimized to minimize reconstruction loss on the full session will allocate compression budget proportional to content volume, preserving large amounts of narrative background at the expense of precise commitment language. This is the wrong optimization target for agent behavioral memory.
The Cortex Behavioral Distillation Pipeline
CBDP processes sessions in four stages after session close:
Stage 1: Functional Segmentation
The session is segmented into functional units using a token classifier trained on 12,000 annotated sessions. Each token sequence is assigned to one of seven functional categories:
| Category | Description | Compression Priority |
| --- | --- | --- |
| COMMITMENT | Behavioral promises, explicit or implicit | Highest — near-verbatim |
| PERFORMANCE | Measurable outcomes with numeric values | High — structured extraction |
| LEARNING | Inferences and adaptations | Medium-high — key insight extraction |
| FAILURE | Errors, scope violations, missed thresholds | Medium-high — category and context |
| PREFERENCE | Client/user preferences learned | Medium — summary |
| REASONING | Internal reasoning chains | Low — outcome only |
| BACKGROUND | Context, pleasantries, repetition | Lowest — drop or minimal summary |
The classifier achieves a segment-level F1 of 87.3% on held-out sessions, with the highest scores on the most important categories: COMMITMENT (92.1%) and PERFORMANCE (94.7%).
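A sketch of how Stage 1 output might be organized: consecutive classifier labels are grouped into segments tagged with the priorities from the table above. Here `classify` is a hypothetical stand-in for the trained token classifier, and the priority strings paraphrase the table.

```python
# Stage 1 sketch: group consecutive classifier labels into functional segments
# carrying the compression priorities from the table above. `classify` is a
# hypothetical stand-in for the trained token classifier.

PRIORITY = {
    "COMMITMENT": "highest: near-verbatim",
    "PERFORMANCE": "high: structured extraction",
    "LEARNING": "medium-high: key insight extraction",
    "FAILURE": "medium-high: category and context",
    "PREFERENCE": "medium: summary",
    "REASONING": "low: outcome only",
    "BACKGROUND": "lowest: drop or minimal summary",
}

def segment_session(turns, classify):
    """Merge consecutive turns that share a functional category.

    `classify(turn) -> category` is the trained classifier (not shown).
    """
    segments = []
    for turn in turns:
        cat = classify(turn)
        if segments and segments[-1]["category"] == cat:
            segments[-1]["turns"].append(turn)  # extend the open segment
        else:
            segments.append({"category": cat,
                             "priority": PRIORITY[cat],
                             "turns": [turn]})
    return segments
```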
Stage 2: Category-Specific Compression
Each functional category is compressed using a strategy calibrated for its query utility:
COMMITMENT compression. The compression model is prompted to extract: the agent's exact commitment language, the scope (what tasks/conditions it applies to), the evidence basis (what justifies the commitment), and any explicit or implicit caveats. The output is a structured record rather than a summary — it preserves the commitment's semantic content in queryable form. Verbatim key phrases are preserved when they are load-bearing (e.g., "I guarantee X within Y" — the word "guarantee" matters).
PERFORMANCE compression. Numeric data is extracted and structured: { metric: string, value: number, unit: string, context: string, timestamp: ISO8601 }. No summarization — performance data is already compact; the goal is extraction and normalization, not compression.
LEARNING compression. Key insights are extracted as propositions: "In domain X, the optimal approach is Y" or "When user asks for Z, the implicit requirement is also W." These are stored as logical statements rather than narrative.
FAILURE compression. Failures are categorized into a taxonomy (capability gap, scope violation, resource constraint, communication failure, adversarial prompt, data quality issue) with severity and remediation response extracted.
PREFERENCE and BACKGROUND compression. These categories receive standard extractive summarization, with lower fidelity targets. BACKGROUND is eligible for complete omission when session memory approaches compression limits.
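The Stage 2 dispatch can be sketched as below. The `PerformanceRecord` fields follow the schema given for PERFORMANCE compression; the compressor callables are hypothetical stand-ins for the prompted compression model, and the upstream `measurement` field is an assumed intermediate format.

```python
# Stage 2 sketch: route each segment to a category-specific compressor.
# PerformanceRecord mirrors the schema in the text; the compressors and the
# `measurement` intermediate are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class PerformanceRecord:
    metric: str
    value: float
    unit: str
    context: str
    timestamp: str  # ISO 8601, per the schema above

def compress_segment(segment, compressors):
    """Route a segment to its category-specific compressor.

    `compressors` maps category -> callable; a missing entry (e.g. BACKGROUND
    under tight memory limits) means the segment is dropped entirely.
    """
    fn = compressors.get(segment["category"])
    if fn is None:
        return None  # complete omission
    return fn(segment)

def compress_performance(segment):
    # Extraction and normalization, not summarization: performance data is
    # already compact. Assumes upstream numeric extraction filled `measurement`.
    m = segment["measurement"]
    return PerformanceRecord(
        metric=m["metric"], value=m["value"], unit=m["unit"],
        context=m["context"], timestamp=m["timestamp"],
    )
```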
Stage 3: Cross-Reference and Deduplication
After category-specific compression, the Stage 3 processor identifies:
Commitment conflicts: When a new COMMITMENT entry partially contradicts an existing Warm memory entry (e.g., "I can handle real-time data" in session 7 vs. "I cannot guarantee sub-second response for streaming data" in session 3), the conflict is flagged for resolution. The resolution algorithm uses temporal recency plus confidence scoring — more recent high-confidence statements take precedence, but the conflict is preserved in the record for transparency.
Duplicate information: LEARNING entries that duplicate existing Warm entries are merged rather than added redundantly. The merge algorithm concatenates evidence records and re-scores confidence based on repetition.
Supersession: PERFORMANCE entries that update a metric supersede the prior value. Both are retained in Cold memory, but Warm memory carries only the current value with a pointer to the history.
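A minimal sketch of the commitment-conflict rule: temporal recency plus confidence scoring, with the superseded entry preserved for transparency. The 0.6 recency weight and the linear scoring form are illustrative assumptions, not the production algorithm.

```python
# Stage 3 sketch: resolve a commitment conflict by recency-plus-confidence
# scoring while preserving the losing entry in the record. Weights assumed.

def resolve_conflict(existing, incoming, recency_weight=0.6):
    """Each entry carries `session` (increasing int) and `confidence` in [0, 1]."""
    latest = max(existing["session"], incoming["session"])

    def score(entry):
        recency = entry["session"] / latest  # 1.0 for the newest statement
        return recency_weight * recency + (1 - recency_weight) * entry["confidence"]

    winner = max((existing, incoming), key=score)
    loser = incoming if winner is existing else existing
    return {
        "authoritative": winner,
        "conflicts_with": [loser],  # kept in the record, never silently dropped
        "flagged": True,            # surfaced for resolution and audit
    }
```

With these weights, the session-7 "I can handle real-time data" statement would take precedence over the session-3 caveat, but the earlier statement stays attached to the record.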
Stage 4: Embedding, Indexing, and Tier Assignment
Compressed entries are embedded using Armalo's domain-adapted text embedding model. Embeddings are optimized for the agent behavioral memory query distribution — the model was fine-tuned to maximize cosine similarity between a query and its most relevant compressed entry, where "most relevant" is defined by human annotation on our evaluation benchmark.
Entries are assigned to Warm or Cold storage based on importance scoring: importance = recency_weight × category_weight × novelty_score. Entries above the Cold threshold go directly to Cold (with cryptographic signing). Entries above the Warm threshold go to Warm. The remainder are dropped.
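The tier-assignment rule, as code. The importance formula is the one given in the text; the threshold values are illustrative assumptions, since the production thresholds are not published.

```python
# Stage 4 sketch: importance = recency x category x novelty (formula from the
# text), with assumed threshold values for the Warm/Cold cutoffs.

COLD_THRESHOLD = 0.7  # assumed value
WARM_THRESHOLD = 0.3  # assumed value

def assign_tier(recency_weight, category_weight, novelty_score):
    importance = recency_weight * category_weight * novelty_score
    if importance >= COLD_THRESHOLD:
        return "cold"     # durable, cryptographically signed record
    if importance >= WARM_THRESHOLD:
        return "warm"     # fast-retrieval store
    return "dropped"      # below both thresholds
```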
Evaluation: Compression vs. Recall Tradeoff
We evaluated CBDP against four baselines on 18,400 held-out retrieval queries:
Baseline 1: Uniform sliding window. Keep the N most recent tokens. Oldest content is dropped.
Baseline 2: Uniform summarization. Use a general-purpose summarization model to compress each session to 1% of its token count with equal weighting across content.
Baseline 3: Keyword extraction. Retain sentences containing the top-50 TF-IDF keywords from the session.
Baseline 4: Embedding retrieval only. No compression — store full sessions and retrieve via embedding similarity. Memory cost is unbounded.
CBDP (our approach). Objective-aligned compression as described above.
Evaluation metric: recall fidelity = fraction of queries correctly answered by the retrieval system, where "correctly answered" is defined as: the retrieved memory contains sufficient information to answer the query accurately (evaluated by human annotators on a sample of 1,200 queries, then extrapolated via automated evaluation).
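The metric reduces to a simple ratio. In this sketch, `retrieve` stands in for the retrieval system under test and `sufficient` for the human or automated correctness judgment; both are assumed interfaces.

```python
# Recall fidelity as defined above: the fraction of queries whose retrieved
# memory contains sufficient information to answer them accurately.

def recall_fidelity(queries, retrieve, sufficient):
    """`retrieve(query)` returns memory entries; `sufficient(query, entries)`
    returns True when those entries can answer the query accurately."""
    if not queries:
        return 0.0
    answered = sum(1 for q in queries if sufficient(q, retrieve(q)))
    return answered / len(queries)
```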
Results by Query Category
| System | Commitment Recall | Performance Recall | Learning Recall | Failure Recall | Overall |
| --- | --- | --- | --- | --- | --- |
| Sliding window | 38.1% | 71.4% | 44.2% | 29.8% | 49.3% |
| Uniform summarization | 61.3% | 68.7% | 58.1% | 52.4% | 62.9% |
| Keyword extraction | 47.2% | 74.1% | 49.7% | 44.3% | 56.5% |
| Embedding retrieval | 79.4% | 88.3% | 76.2% | 71.1% | 80.2% |
| CBDP | 93.7% | 96.2% | 88.1% | 89.4% | 91.3% |
Results by Compression Ratio
We also evaluated each system at equal memory budget (1% of original session size):
| System | Compression Ratio | Recall at 1% Budget |
| --- | --- | --- |
| Sliding window | Variable (depends on session length) | 49.3% |
| Uniform summarization | 100:1 | 62.9% |
| Keyword extraction | ~50:1 (highly variable) | 56.5% |
| Embedding retrieval | 1:1 (no compression) | 80.2% |
| CBDP | 94:1 | 91.3% |
The key finding: CBDP achieves higher recall than embedding retrieval (which does no compression) while compressing to 94:1. This is counterintuitive but explicable: embedding retrieval retrieves by semantic similarity to the query, which is not the same as retrieving by query utility. A long narrative passage about background context may be semantically similar to a commitment query while containing no commitment information. CBDP structures the information as commitments, making retrieval both accurate and precise.
The Counterintuitive Result
The finding that surprised us most: more aggressive compression with objective alignment outperforms less aggressive compression without it.
Uniform summarization achieves 62.9% recall, and embedding retrieval, which performs no compression at all, reaches only 80.2%. CBDP at 94:1 compression (1.06% of original size) achieves 91.3% recall. You lose more by compressing naively, or by not compressing at all, than by compressing aggressively with the right objective.
The mechanism: naive compression preserves the most verbose content (background, pleasantries, reasoning chains), which is exactly the content that is least useful for behavioral memory queries. Objective-aligned compression drops this content aggressively and preserves commitment and performance data at near-verbatim fidelity. The 94:1 ratio is achieved almost entirely by dropping narrative content, not by compressing behavioral signals.
Compression Pipeline Performance
CBDP is designed to run asynchronously on session close without blocking the agent's next session. Performance metrics from production:
Median pipeline completion time: 4.2 seconds (sessions under 50K tokens)
P95 completion time: 18.4 seconds (sessions under 500K tokens)
Model cost per session: $0.003–0.009 (depending on session length and model pricing)
Pipeline failure rate: 0.08% (sessions where compression fails fall back to sliding window with alert)
Compression ratio range: 40:1 to 180:1 (mean 94:1), depending on content composition
The $0.003–0.009 per session cost defines the ROI threshold. For any task generating more than approximately $0.03–0.09 in value, compression cost is at most 10% of task value, a clearly positive ROI. For tasks below this threshold, agents can be configured to compress less frequently (e.g., every 3 sessions rather than every session).
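The rule of thumb above reduces to a small calculation; a sketch assuming a 10% cost ceiling and the worst-case per-session cost quoted in the text. The amortized-frequency rule is an illustration, not a documented Cortex setting.

```python
import math

# ROI sketch: compress every session while worst-case pipeline cost stays
# within the cost ceiling (10% of task value); otherwise amortize by
# compressing once every k sessions. Defaults are the figures from the text.

def should_compress_every_session(task_value, cost_per_session=0.009,
                                  max_cost_fraction=0.10):
    return cost_per_session <= max_cost_fraction * task_value

def suggested_frequency(task_value, cost_per_session=0.009,
                        max_cost_fraction=0.10):
    """Sessions per compression run so amortized cost meets the ceiling."""
    if should_compress_every_session(task_value, cost_per_session,
                                     max_cost_fraction):
        return 1
    # One run every k sessions: cost <= fraction * (k * task_value).
    return math.ceil(cost_per_session / (max_cost_fraction * task_value))
```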
Failure Modes and Mitigations
Commitment misclassification. The most consequential failure mode: an implicit commitment (e.g., "I'll handle that going forward") is classified as BACKGROUND rather than COMMITMENT and dropped. Mitigation: the commitment classifier is calibrated for high recall (low threshold), trading some precision for coverage. Disputed commitments are preserved at lower confidence.
Deduplication over-aggressiveness. Two similar-but-distinct LEARNING entries merged when they should be separate. Mitigation: merge threshold is set conservatively, and merged entries retain both original evidence chains for audit.
Cold memory latency. Writing to Cold memory (including cryptographic signing and registry updates) adds 800ms–2.4s to the distillation pipeline. For agents on high-frequency task schedules, this is acceptable. For agents completing tasks every few seconds, Cold writing is batched and processed asynchronously.
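The batched Cold-write mitigation can be sketched as below: the pipeline queues entries and flushes them in one signing round, keeping the 800ms–2.4s write off the per-task critical path. The flush threshold and the `write_batch` interface are illustrative assumptions.

```python
import queue
import threading

# Sketch of batched Cold writes for high-frequency agents: submit() is cheap
# unless it triggers a flush, and one write covers many entries.

class ColdWriteBatcher:
    def __init__(self, write_batch, flush_every=50):
        """`write_batch(entries)` performs signing + registry update for a batch."""
        self._q = queue.Queue()
        self._write_batch = write_batch
        self._flush_every = flush_every
        self._lock = threading.Lock()

    def submit(self, entry):
        """Called from the distillation pipeline; cheap unless a flush triggers."""
        self._q.put(entry)
        if self._q.qsize() >= self._flush_every:
            self.flush()

    def flush(self):
        with self._lock:
            batch = []
            while not self._q.empty():
                batch.append(self._q.get_nowait())
            if batch:
                self._write_batch(batch)  # one signing round for many entries
```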
Implications for Agent System Design
Session design. Agents should be designed to make commitments explicitly, not implicitly. "I will process this within 2 hours" is classified correctly as a COMMITMENT; "I'll get to this quickly" is classified as BACKGROUND and dropped. Explicit commitment language improves compression utility.
Memory configuration. Agents in high-stakes markets (escrow-gated, enterprise contracts) should configure Cold memory thresholds to be inclusive rather than exclusive — every pact-relevant behavioral event should go to Cold, even at slightly higher storage cost. The attestation value of a complete Cold record far exceeds the storage cost.
Cross-session context injection. At session start, Warm memory retrieval should use queries that anticipate the likely session content, not the session opening message. An agent that typically handles data analysis should pre-fetch data analysis commitments and performance data before the user even asks a question.
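A sketch of what anticipatory pre-fetch could look like: query Warm memory with the agent's typical workload profile, weighted toward the dominant query categories (commitments and performance). The profile shape, query phrasing, and `retrieve` interface are all assumptions, not a documented Cortex API.

```python
# Anticipatory context injection sketch: pre-fetch by workload profile rather
# than by the session's opening message. Interfaces here are assumed.

def prefetch_context(agent_profile, retrieve, k=5):
    """`retrieve(query, k)` returns the top-k Warm entries for a query string."""
    entries = []
    for task in agent_profile["typical_tasks"]:
        # The query distribution says commitments and performance dominate,
        # so those categories are fetched first.
        entries += retrieve(f"commitments regarding {task}", k)
        entries += retrieve(f"performance metrics for {task}", k)
    seen, deduped = set(), []
    for e in entries:  # de-duplicate while preserving retrieval order
        if e not in seen:
            seen.add(e)
            deduped.append(e)
    return deduped
```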
Conclusion
LLM-driven memory compression that optimizes for the wrong objective produces lower recall than no compression at all. The failure mode is not compression itself — it is misaligned optimization. CBDP demonstrates that objective-aligned compression at aggressive ratios outperforms naive approaches at low ratios, because the information being preserved is precisely the information that future queries need.
The 94:1 compression ratio is not the goal — it is the byproduct of aggressively dropping narrative content that has low query utility. The goal is 91.3% recall fidelity on the queries that matter for agent trust. That goal is achieved.
*Evaluation based on 18,400 retrieval queries from 4,200 agent sessions across 8 weeks. Query distribution measured on a separate 8-week baseline period before compression systems were deployed. Human annotation sample: 1,200 queries, 3 annotators per query, majority vote for correctness determination. Automated evaluation extrapolation validated against human annotation with 94.1% agreement rate. Compression pipeline performance metrics from production Armalo Cortex deployment, April 2026.*