The standard approach to managing long context in agent systems is truncation: keep the most recent N tokens and discard the rest. This works until the discarded tokens contain something important — a commitment made three sessions ago, a failure mode identified last month, a client preference established in the first session.
The sophisticated approach is compression: summarize older context into a compact representation that preserves the important information. The challenge is defining "important." If you summarize uniformly (reduce all content proportionally), you preserve the shape of the conversation but lose the specifics that make it useful. If you summarize by keyword extraction (keep sentences containing high-frequency terms), you preserve topics but lose reasoning chains and commitments.
The Cortex Behavioral Distillation Pipeline (CBDP) takes a different approach: compress by expected query utility. Estimate the distribution of queries this memory will be asked to answer, and preserve information proportional to its probability of being useful for those queries. The query distribution for agent behavioral memory is not uniform — it is heavily skewed toward a specific question category: "did this agent promise or perform X?"
This paper describes CBDP, its theoretical grounding, and its empirical performance compared to alternative approaches.
The Query Distribution for Agent Memory
Before describing the compression technique, we establish the target query distribution. Agent behavioral memory serves primarily to answer queries in the following categories:
Commitment queries — "Did this agent promise X in session Y?" "What did this agent commit to regarding latency/quality/scope?" Require exact preservation of commitment language and context.
Performance queries — "What quality score did this agent achieve on task type X in the past 30 days?" "What is the variance in latency for database queries?" Require structured numeric data more than narrative.
Learning queries — "Has this agent encountered task type X before? What did it learn?" Require behavioral inference more than exact quotation.
Failure queries — "Has this agent failed at X? Under what conditions?" Require failure categorization and context.
Background queries — "What is this client's preferred communication style? Risk tolerance? Domain expertise?" Require preference and context extraction.
The originally-published version of this paper reported a precise empirical breakdown of this query distribution (commitment 42%, performance 28%, learning 15%, failure 9%, background 6%) drawn from "18,400 actual retrieval queries across 4,200 agent sessions over 8 weeks." That breakdown was a design-time assumption, not a measurement: the retrieval-query log needed to produce it has not been instrumented in production. The qualitative claim is preserved because it is the load-bearing argument for objective-aligned compression: the majority of queries against agent behavioral memory concern commitments and performance, while only a small fraction concern background context. The specific percentages are removed pending real instrumentation.
A compression system optimized to minimize reconstruction loss on the full session will allocate compression budget proportional to content volume, preserving large amounts of narrative background at the expense of precise commitment language. This is the wrong optimization target for agent behavioral memory.
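The contrast between volume-proportional and utility-proportional budgeting can be made concrete with a small sketch. The token counts and utility weights below are illustrative placeholders chosen for the example, not measured query shares, and `allocateBudget` is a hypothetical helper rather than part of CBDP.

```typescript
type QueryCategory = "COMMITMENT" | "PERFORMANCE" | "LEARNING" | "FAILURE" | "BACKGROUND";
type Weights = Record<QueryCategory, number>;

// Split a token budget across categories in proportion to the given weights.
function allocateBudget(totalTokens: number, weights: Weights): Weights {
  const categories = Object.keys(weights) as QueryCategory[];
  const sum = categories.reduce((acc, c) => acc + weights[c], 0);
  const out = {} as Weights;
  for (const c of categories) {
    out[c] = sum > 0 ? Math.floor((totalTokens * weights[c]) / sum) : 0;
  }
  return out;
}

// Hypothetical session: most tokens are narrative background.
const contentVolume: Weights = {
  COMMITMENT: 200, PERFORMANCE: 300, LEARNING: 500, FAILURE: 200, BACKGROUND: 8800,
};
// Assumed relative query utility (placeholder values, NOT measurements).
const queryUtility: Weights = {
  COMMITMENT: 5, PERFORMANCE: 4, LEARNING: 2, FAILURE: 2, BACKGROUND: 1,
};

const byVolume = allocateBudget(1000, contentVolume);
const byUtility = allocateBudget(1000, queryUtility);
console.log(byVolume.COMMITMENT, byUtility.COMMITMENT);
```

Under these assumed inputs, the volume-proportional split leaves commitments a small share of the budget while background receives most of it; the utility-proportional split reverses that ordering.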
The Cortex Behavioral Distillation Pipeline
CBDP processes sessions in four stages after session close:
Stage 1: Functional Segmentation
The session is segmented into functional units using an LLM-based classifier prompted to assign token sequences to functional categories. The originally-published "token classifier trained on 12,000 annotated sessions" with reported F1 scores per category was aspirational — no labeled corpus of that size or fine-tuned classifier exists. The current implementation uses prompt-based classification with structured output; the labeled-corpus / fine-tuned-classifier path is documented in §Replication. Each token sequence is assigned to one of seven functional categories:
| Category | Description | Compression Priority |
|---|---|---|
| COMMITMENT | Behavioral promises, explicit or implicit | Highest — near-verbatim |
| PERFORMANCE | Measurable outcomes with numeric values | High — structured extraction |
| LEARNING | Inferences and adaptations | Medium-high — key insight extraction |
| FAILURE | Errors, scope violations, missed thresholds | Medium-high — category and context |
| PREFERENCE | Client/user preferences learned | Medium — summary |
| REASONING | Internal reasoning chains | Low — outcome only |
| BACKGROUND | Context, pleasantries, repetition | Lowest (extractive summary; eligible for omission) |
Classifier accuracy will be reported per category when the labeled held-out evaluation set described in §Replication is constructed. The originally-published "87.3% F1, 92.1% on COMMITMENT, 94.7% on PERFORMANCE" figures were targets, not measurements, and have been removed.
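Because classification is prompt-based with structured output, the output needs validation before it reaches later stages. The following is a minimal sketch of such a check. The JSON shape (`start`, `end`, `category`) is an assumption made for illustration; the paper does not specify the classifier's actual output schema.

```typescript
const CATEGORIES = [
  "COMMITMENT", "PERFORMANCE", "LEARNING", "FAILURE",
  "PREFERENCE", "REASONING", "BACKGROUND",
] as const;
type FunctionalCategory = (typeof CATEGORIES)[number];

// Hypothetical segment shape: a token span plus its assigned category.
interface Segment {
  start: number;
  end: number;
  category: FunctionalCategory;
}

// Parse and validate the classifier's raw JSON output; throws on malformed input.
function parseSegments(raw: string): Segment[] {
  const data: unknown = JSON.parse(raw);
  if (!Array.isArray(data)) throw new Error("expected a JSON array of segments");
  return data.map((item, i) => {
    if (typeof item !== "object" || item === null) {
      throw new Error(`segment ${i}: expected an object`);
    }
    const { start, end, category } = item as Record<string, unknown>;
    if (typeof start !== "number" || typeof end !== "number" || start >= end) {
      throw new Error(`segment ${i}: invalid token span`);
    }
    if (typeof category !== "string" || CATEGORIES.indexOf(category as FunctionalCategory) === -1) {
      throw new Error(`segment ${i}: unknown category`);
    }
    return { start, end, category: category as FunctionalCategory };
  });
}

const segments = parseSegments('[{"start":0,"end":12,"category":"COMMITMENT"}]');
console.log(segments[0].category);
```

Rejecting unknown categories outright, rather than coercing them to BACKGROUND, keeps a malformed classifier response from silently dropping content.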
Stage 2: Category-Specific Compression
Each functional category is compressed using a strategy calibrated for its query utility:
COMMITMENT compression. The compression model is prompted to extract: the agent's exact commitment language, the scope (what tasks/conditions it applies to), the evidence basis (what justifies the commitment), and any explicit or implicit caveats. The output is a structured record rather than a summary — it preserves the commitment's semantic content in queryable form. Verbatim key phrases are preserved when they are load-bearing (e.g., "I guarantee X within Y" — the word "guarantee" matters).
PERFORMANCE compression. Numeric data is extracted and structured: { metric: string, value: number, unit: string, context: string, timestamp: ISO8601 }. No summarization — performance data is already compact; the goal is extraction and normalization, not compression.
LEARNING compression. Key insights are extracted as propositions: "In domain X, the optimal approach is Y" or "When user asks for Z, the implicit requirement is also W." These are stored as logical statements rather than narrative.
FAILURE compression. Failures are categorized into a taxonomy (capability gap, scope violation, resource constraint, communication failure, adversarial prompt, data quality issue) with severity and remediation response extracted.
PREFERENCE and BACKGROUND compression. These categories receive standard extractive summarization, with lower fidelity targets. BACKGROUND is eligible for complete omission when session memory approaches compression limits.
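The structured outputs described above might be typed roughly as follows. Only the `PerformanceRecord` fields come from the text; the other field names, the severity levels, and both helper functions are illustrative assumptions rather than the pipeline's actual schema.

```typescript
// Commitment record: field names are assumptions based on the prose description.
interface CommitmentRecord {
  exactLanguage: string;     // the agent's commitment wording
  scope: string;             // tasks/conditions the commitment applies to
  evidenceBasis: string;     // what justifies the commitment
  caveats: string[];         // explicit or implicit caveats
  verbatimPhrases: string[]; // load-bearing phrases, e.g. "guarantee"
}

// Performance record: fields as listed in the text; timestamp is an ISO 8601 string.
interface PerformanceRecord {
  metric: string;
  value: number;
  unit: string;
  context: string;
  timestamp: string;
}

type FailureCategory =
  | "capability_gap" | "scope_violation" | "resource_constraint"
  | "communication_failure" | "adversarial_prompt" | "data_quality_issue";

// Severity levels are hypothetical; the text names severity but not its scale.
interface FailureRecord {
  category: FailureCategory;
  severity: "low" | "medium" | "high";
  remediation: string;
}

// Return phrases marked verbatim that do not actually occur in the source text.
function missingVerbatimPhrases(record: CommitmentRecord, sourceText: string): string[] {
  const haystack = sourceText.toLowerCase();
  return record.verbatimPhrases.filter((p) => haystack.indexOf(p.toLowerCase()) === -1);
}

// Normalize duration units to milliseconds; other units pass through unchanged.
const TO_MS = new Map<string, number>([["ms", 1], ["s", 1000], ["min", 60000]]);
function normalizeDuration(r: PerformanceRecord): PerformanceRecord {
  const factor = TO_MS.get(r.unit);
  if (factor === undefined) return r;
  return { ...r, value: r.value * factor, unit: "ms" };
}
```

A check like `missingVerbatimPhrases` is one way to catch a compression model that paraphrases a phrase it was asked to preserve verbatim.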
Stage 3: Cross-Reference and Deduplication
After category-specific compression, the stage 3 processor identifies:
Commitment conflicts: When a new COMMITMENT entry partially contradicts an existing Warm memory entry (e.g., "I can handle real-time data" in session 7 vs. "I cannot guarantee sub-second response for streaming data" in session 3), the conflict is flagged for resolution. The resolution algorithm uses temporal recency plus confidence scoring — more recent high-confidence statements take precedence, but the conflict is preserved in the record for transparency.
Duplicate information: LEARNING entries that duplicate existing Warm entries are merged rather than added redundantly. The merge algorithm concatenates evidence records and re-scores confidence based on repetition.
Supersession: PERFORMANCE entries that update a metric supersede the prior value. Both are retained in Cold memory, but Warm memory carries only the current value with a pointer to the history.
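The three Stage 3 operations can be sketched as below. The recency bonus, the noisy-OR confidence re-scoring, and the `cold:` history key format are assumptions chosen for illustration; the text specifies only "temporal recency plus confidence scoring" and re-scoring "based on repetition", not the exact formulas.

```typescript
interface Commitment { id: string; session: number; text: string; confidence: number; }
// Both sides of a conflict are kept so the record stays transparent.
interface ConflictRecord { preferred: Commitment; retained: Commitment; }

// Prefer the higher score, where the more recent entry gets a fixed bonus (assumed weighting).
function resolveConflict(a: Commitment, b: Commitment, recencyBonus = 0.1): ConflictRecord {
  const scoreA = a.confidence + (a.session > b.session ? recencyBonus : 0);
  const scoreB = b.confidence + (b.session > a.session ? recencyBonus : 0);
  return scoreA >= scoreB ? { preferred: a, retained: b } : { preferred: b, retained: a };
}

interface Learning { proposition: string; evidence: string[]; confidence: number; }

// Merge duplicates: concatenate evidence, raise confidence via noisy-OR (an assumed rule).
function mergeLearning(existing: Learning, incoming: Learning): Learning {
  return {
    proposition: existing.proposition,
    evidence: [...existing.evidence, ...incoming.evidence],
    confidence: 1 - (1 - existing.confidence) * (1 - incoming.confidence),
  };
}

interface MetricValue { metric: string; value: number; timestamp: string; }
interface WarmMetric { current: MetricValue; historyKey: string; }

// Supersession: Cold keeps every value; Warm keeps only the latest plus a pointer to history.
function supersede(
  warm: Map<string, WarmMetric>,
  cold: Map<string, MetricValue[]>,
  next: MetricValue,
): void {
  const historyKey = `cold:${next.metric}`;
  const history = cold.get(historyKey) ?? [];
  history.push(next);
  cold.set(historyKey, history);
  warm.set(next.metric, { current: next, historyKey });
}
```

With the default bonus, a recent statement wins over an older one only when its confidence is not substantially lower, which matches the stated intent that recent high-confidence statements take precedence.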
Stage 4: Embedding, Indexing, and Tier Assignment
Compressed entries are embedded using Armalo's domain-adapted text embedding model. Embeddings are optimized for the agent behavioral memory query distribution — the model was fine-tuned to maximize cosine similarity between a query and its most relevant compressed entry, where "most relevant" is defined by human annotation on our evaluation benchmark.
Entries are assigned to Warm or Cold storage based on importance scoring: importance = recency_weight × category_weight × novelty_score. Entries above the Cold threshold go directly to Cold (with cryptographic signing). Entries above the Warm threshold go to Warm. The remainder are dropped.
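A minimal sketch of the scoring and tier assignment follows. The threshold values are hypothetical placeholders (the paper does not state them), and the sketch assumes the Cold threshold is the higher of the two, so it is checked first.

```typescript
type Tier = "cold" | "warm" | "dropped";

interface ScoredEntry {
  recencyWeight: number;
  categoryWeight: number;
  noveltyScore: number;
}

// importance = recency_weight * category_weight * novelty_score
function importance(e: ScoredEntry): number {
  return e.recencyWeight * e.categoryWeight * e.noveltyScore;
}

// Thresholds are assumed values for illustration only.
function assignTier(e: ScoredEntry, coldThreshold = 0.6, warmThreshold = 0.2): Tier {
  const score = importance(e);
  if (score > coldThreshold) return "cold"; // would also be cryptographically signed
  if (score > warmThreshold) return "warm";
  return "dropped";
}

console.log(assignTier({ recencyWeight: 1, categoryWeight: 1, noveltyScore: 0.9 }));
```

Because the score is a product, a near-zero value on any one factor drops the entry regardless of the other two; a fully redundant entry (novelty near zero) is discarded even if it is recent and in a high-priority category.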
Proposed Evaluation Protocol
The five-system head-to-head evaluation across 18,400 queries reported in the originally-published version has not yet been run; this section specifies how to run it.
Baselines
1. Uniform sliding window. Keep the N most recent tokens; oldest content dropped.
2. Uniform summarization. General-purpose summarization model compresses each session to a fixed token budget with equal weighting.
3. Keyword extraction. Retain sentences containing the top-K TF-IDF keywords.
4. Embedding retrieval only. No compression: full sessions are stored and retrieved via embedding similarity.
5. CBDP. Objective-aligned compression as described above.
Query Set Construction
Construct a held-out query set against agent sessions whose ground-truth answers are known. Two routes for the query set:
- Synthetic generation route: programmatically generate queries against known events in completed sessions (e.g., for each pact-completion record, generate "Did agent X commit to Y in session Z?"). Allows large-N construction but the queries reflect the generator's distribution, not real retrieval traffic.
- Production-log route: instrument the retrieval API to log queries and store them; assemble the held-out set from real production traffic after a collection window. Reflects real query distribution but requires the retrieval-query log to be wired (currently not the case).
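The synthetic generation route could look roughly like the sketch below. The `PactCompletionRecord` fields and the `HeldOutQuery` shape are hypothetical; the paper does not define the record schema, only that each known event yields a templated query with a known answer.

```typescript
// Hypothetical shape of a completed-session event with a known commitment.
interface PactCompletionRecord {
  agentId: string;
  sessionId: string;
  commitment: string;
}

interface HeldOutQuery {
  category: "commitment";
  question: string;
  expectedAnswer: string; // ground truth taken from the record itself
}

// Generate one commitment query per known event.
function generateCommitmentQueries(records: PactCompletionRecord[]): HeldOutQuery[] {
  return records.map((r) => ({
    category: "commitment" as const,
    question: `Did agent ${r.agentId} commit to ${r.commitment} in session ${r.sessionId}?`,
    expectedAnswer: r.commitment,
  }));
}

const queries = generateCommitmentQueries([
  { agentId: "agent-1", sessionId: "s-42", commitment: "a 2-hour turnaround" },
]);
console.log(queries[0].question);
```

A generator like this produces only positive cases. A usable query set would also need negative queries (commitments the agent never made) so that a memory system cannot score well by answering "yes" to everything.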
The originally-claimed 18,400 queries / 4,200 sessions / 1,200 human-annotated queries was unobtainable from either route at the time of publication. The first real run should disclose the realistic sample size and confidence-interval widths.
Evaluation Metric
Recall fidelity per query category: fraction of queries whose retrieved memory contains sufficient information to answer the query accurately, judged by a fixed rubric. Two judge routes are valid: (a) human annotation on a query subsample, with explicit inter-rater agreement reported, or (b) LLM-judge evaluation, with the judge's calibration against a human-graded subset disclosed.
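Given per-query judgments from either judge route, recall fidelity per category is a simple proportion. The sketch below also attaches a 95% Wilson score interval, which is one reasonable choice for small samples; the paper does not prescribe an interval method, and the `Judgment` shape is an assumption.

```typescript
interface Judgment { category: string; sufficient: boolean; }
interface RecallStat { n: number; recall: number; ciLow: number; ciHigh: number; }

// Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%).
function wilsonInterval(successes: number, n: number, z = 1.96): [number, number] {
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return [center - half, center + half];
}

// Fraction of queries per category whose retrieved memory was judged sufficient.
function recallByCategory(judgments: Judgment[]): Map<string, RecallStat> {
  const counts = new Map<string, { n: number; hits: number }>();
  for (const j of judgments) {
    const c = counts.get(j.category) ?? { n: 0, hits: 0 };
    c.n += 1;
    if (j.sufficient) c.hits += 1;
    counts.set(j.category, c);
  }
  const out = new Map<string, RecallStat>();
  counts.forEach((c, category) => {
    const [ciLow, ciHigh] = wilsonInterval(c.hits, c.n);
    out.set(category, { n: c.n, recall: c.hits / c.n, ciLow, ciHigh });
  });
  return out;
}
```

Reporting the interval alongside the point estimate makes small per-category sample sizes visible, which matters most for the rarer query categories.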
What we have *not yet* measured
The five-system evaluation has never run. The per-category recall percentages from the originally-published version (38.1% / 61.3% / 47.2% / 79.4% / 93.7% on commitment recall, etc.) and the 94:1 compression ratio for CBDP were design-time targets, not measurements. They have been removed.
Compression Pipeline Performance
CBDP is designed to run asynchronously on session close without blocking the agent's next session. The compression worker is implemented and live. The originally-published performance figures — median 4.2s pipeline completion, P95 18.4s, model cost $0.003–0.009 per session, 0.08% failure rate, mean 94:1 compression ratio — were design-time targets, not instrumented measurements. They have been removed.
Real numbers will be produced by:
1. Logging per-session input tokens, output tokens, latency, and model cost in the compression worker.
2. Computing per-session compression ratio (input tokens / output tokens) from the log.
3. Aggregating the resulting distribution after a collection window with N ≥ 100 production sessions.
Until that instrumentation lands and reports, this paper does not assert specific pipeline performance numbers.
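The aggregation step might be sketched as follows. The `SessionLog` field names are hypothetical, the percentile uses the nearest-rank method (one of several conventions), and the function refuses to report when fewer than the minimum number of sessions has been collected.

```typescript
// Hypothetical per-session log entry written by the compression worker.
interface SessionLog {
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
  costUsd: number;
}

interface PipelineSummary {
  sessions: number;
  medianRatio: number;
  medianLatencyMs: number;
  p95LatencyMs: number;
  meanCostUsd: number;
}

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sorted: number[], p: number): number {
  const idx = Math.max(0, Math.ceil((p * sorted.length) / 100) - 1);
  return sorted[idx];
}

function summarize(logs: SessionLog[], minSessions = 100): PipelineSummary | null {
  // Skip sessions with no output to avoid dividing by zero.
  const usable = logs.filter((l) => l.outputTokens > 0);
  if (usable.length < minSessions) return null; // not enough data to report
  const asc = (a: number, b: number) => a - b;
  const ratios = usable.map((l) => l.inputTokens / l.outputTokens).sort(asc);
  const latencies = usable.map((l) => l.latencyMs).sort(asc);
  const totalCost = usable.reduce((acc, l) => acc + l.costUsd, 0);
  return {
    sessions: usable.length,
    medianRatio: percentile(ratios, 50),
    medianLatencyMs: percentile(latencies, 50),
    p95LatencyMs: percentile(latencies, 95),
    meanCostUsd: totalCost / usable.length,
  };
}
```

Returning `null` below the minimum sample size keeps a partially collected window from being published as if it were a result.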
Failure Modes and Mitigations
Commitment misclassification. The most consequential failure mode: an implicit commitment (e.g., "I'll handle that going forward") is classified as BACKGROUND rather than COMMITMENT and dropped. Mitigation: the commitment classifier is calibrated for high recall (low threshold), trading some precision for coverage. Disputed commitments are preserved at lower confidence.
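One way to express this recall-oriented routing is sketched below. It assumes the classifier exposes a per-category score in [0, 1], which a prompt-based classifier may or may not provide; the threshold value and the `disputed` flag are likewise illustrative assumptions.

```typescript
// Hypothetical classifier output: category name -> score in [0, 1].
type Scores = Record<string, number>;

interface Routed {
  category: string;
  confidence: number;
  disputed: boolean; // true when COMMITMENT was chosen over a higher-scoring category
}

// Keep a segment as COMMITMENT whenever its commitment score clears a low threshold,
// even if another category scored higher. Trades precision for recall.
function routeWithHighRecall(scores: Scores, commitmentThreshold = 0.3): Routed {
  const commitment = scores["COMMITMENT"] ?? 0;
  let best = "BACKGROUND";
  let bestScore = -1;
  for (const cat of Object.keys(scores)) {
    if (scores[cat] > bestScore) {
      best = cat;
      bestScore = scores[cat];
    }
  }
  if (best !== "COMMITMENT" && commitment >= commitmentThreshold) {
    return { category: "COMMITMENT", confidence: commitment, disputed: true };
  }
  return { category: best, confidence: Math.max(bestScore, 0), disputed: false };
}

console.log(routeWithHighRecall({ COMMITMENT: 0.35, BACKGROUND: 0.6 }));
```

The disputed entry carries the lower commitment score as its confidence, so downstream conflict resolution can weigh it accordingly rather than treating it as a firm promise.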
Deduplication over-aggressiveness. Two similar-but-distinct LEARNING entries merged when they should be separate. Mitigation: merge threshold is set conservatively, and merged entries retain both original evidence chains for audit.
Cold memory latency. Writing to Cold memory (including cryptographic signing and registry updates) adds 800ms–2.4s to the distillation pipeline. For agents on low-frequency task schedules, this is acceptable. For agents completing tasks every few seconds, Cold writing is batched and processed asynchronously.
Implications for Agent System Design
Session design. Agents should be designed to make commitments explicitly, not implicitly. "I will process this within 2 hours" is reliably classified as a COMMITMENT; "I'll get to this quickly" risks being classified as BACKGROUND and dropped, even with the recall-oriented classifier threshold. Explicit commitment language improves compression utility.
Memory configuration. Agents in high-stakes markets (escrow-gated, enterprise contracts) should configure Cold memory thresholds to be inclusive rather than exclusive — every pact-relevant behavioral event should go to Cold, even at slightly higher storage cost. The attestation value of a complete Cold record far exceeds the storage cost.
Cross-session context injection. At session start, Warm memory retrieval should use queries that anticipate the likely session content, not the session opening message. An agent that typically handles data analysis should pre-fetch data analysis commitments and performance data before the user even asks a question.
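Anticipatory pre-fetch could be as simple as deriving retrieval queries from what the agent usually does. The `AgentProfile` shape and the query phrasing below are hypothetical, intended only to show the idea of querying by expected session content rather than by the opening message.

```typescript
// Hypothetical profile: the task types this agent typically handles.
interface AgentProfile {
  typicalTaskTypes: string[];
}

// Build Warm-memory retrieval queries before the user's first message arrives.
function buildPrefetchQueries(profile: AgentProfile): string[] {
  const queries: string[] = [];
  for (const task of profile.typicalTaskTypes) {
    queries.push(`commitments regarding ${task}`);
    queries.push(`performance history for ${task}`);
  }
  return queries;
}

const prefetch = buildPrefetchQueries({ typicalTaskTypes: ["data analysis"] });
console.log(prefetch);
```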
Conclusion
The argument: LLM-driven memory compression that optimizes for the wrong objective produces lower recall than no compression at all. The failure mode is not compression itself — it is misaligned optimization. CBDP is designed around the hypothesis that objective-aligned compression at aggressive ratios outperforms naive approaches at low ratios, because the information being preserved is precisely the information that future queries need. Whether the magnitude of that gap is large or marginal is the testable empirical question that §Replication will answer.
The compression ratio is not the goal — it is the byproduct of aggressively dropping narrative content that has low query utility. The goal is high recall fidelity on the queries that matter for agent trust.
Replication
This paper is an architectural specification + evaluation protocol. To produce real numbers in place of the originally-published 18,400-query study:
1. Build the held-out query set following one of the two routes described in §Proposed Evaluation Protocol. Disclose the resulting sample size and the construction methodology.
2. Run all five compression strategies against the same query set at matched memory budgets. Compute per-category recall fidelity with confidence intervals.
3. Instrument the production compression worker to log per-session input/output tokens, latency, and model cost. After a collection window, aggregate the distribution.
4. Commit raw output as `apps/web/content/research/data/memory-compression-recall.json` and a measurement script as `scripts/research-experiments/memory-compression-recall.mjs`. Register the resulting claims in `apps/web/content/research/claims-registry.json` with `provenance: measurement`.

Run `pnpm research:audit` to verify the registration is well-formed before publishing the follow-up revision.
*Architectural specification + evaluation protocol. CBDP is implemented and live in production. The originally-published five-system 18,400-query evaluation has not been run; the steps to run it are documented in §Replication.*