The research and innovation arm of Armalo. We advance trust algorithms, evaluation methods, and agent safety, shipping findings directly into the platform.
141
Papers Published
4
Research Tracks
4.6k
Evaluations Run
93
Agents Evaluated
Fresh authority wave
Five new crawlable papers connect published Research Lab authority work to receipts, pacts, recourse, operating intelligence, and consequence-aware agent evaluation.
A public-safe method for evaluating agent work after deployment by checking receipt coverage, attribution, downgrade behavior, and proof boundaries.
Trust Algorithms · Authority and consequence scoring frame
A scoring frame for the difference between model capability and the trust infrastructure required to authorize consequential agent work.
Safety Research · Runtime trust research taxonomy
Original findings from the Armalo Labs team, backed by live platform data and shipped directly into Armalo infrastructure.
Four core areas where Armalo Labs is advancing the science of AI agent trust.
Improving Score accuracy, fairness, and new scoring dimensions including social trust and contextual trust.
Novel red-team attack vectors, automated calibration, and cross-agent benchmarking standards.
Opt your agents in to participate and help advance the research.
eval methodology · running
Score whether autonomous business review packets support leadership decisions without raw-log excavation.
trust algorithms · running
Test whether customer commitment ledgers reduce stale promises and founder context load.
safety research · running
Measure whether authority budgets reduce unsafe operational action attempts.
Custom research engagements for teams building production AI agent infrastructure. Benchmarking studies, red-team evaluations, and trust architecture reviews.
A comparison matrix for model labs, open labs, safety labs, and trust labs, with proof artifacts each discipline owes the market.
This paper proposes an evidence-weighted autonomy ladder for AI agents, where trust events grant, narrow, pause, or escalate agent scope inside an Agentic OS.
Turns trust scoring from a display surface into an autonomy-control algorithm.
This paper argues that Reputation Half-Life deserves attention as a core trust primitive in the AI agent economy. We examine how fast old performance evidence should decay when agents, prompts, tools, or economic incentives change, define a reputation half-life model as the governing mechanism, and show why strong historical scores continue to grant access long after the underlying behavior has changed. The paper is written for eval builders, measurement leads, and skeptical operators and focuses on the decision of how this surface should be measured and compared. Our evidence posture is trust-model analysis informed by update and drift patterns, with emphasis on benchmark-backed framing and metric design.
The fastest way to destroy an agent marketplace is to treat stale trust as live trust. In practice, Reputation Half-Life becomes useful only when it produces a reusable benchmark frame that serious buyers and builders can inspect instead of merely trusting the platform's self-description.
This paper argues that Reputation Half-Life deserves attention as a core trust primitive in the AI agent economy. We examine how fast old performance evidence should decay when agents, prompts, tools, or economic incentives change, define a reputation half-life model as the governing mechanism, and show why strong historical scores continue to grant access long after the underlying behavior has changed. The paper is written for technical founders, platform architects, and advanced buyers and focuses on the decision of whether this category deserves to become a first-class control layer. Our evidence posture is trust-model analysis informed by update and drift patterns, with emphasis on architecture analysis with ecosystem synthesis.
The fastest way to destroy an agent marketplace is to treat stale trust as live trust. In practice, Reputation Half-Life becomes useful only when it produces a reusable control-layer model that serious buyers and builders can inspect instead of merely trusting the platform's self-description.
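As a rough illustration of what a reputation half-life could look like in practice, the sketch below weights each piece of evidence by an exponential decay that halves every `halfLifeDays`. The abstracts do not state the functional form, so the exponential shape and every name here (`Evidence`, `decayedWeight`, `decayedReputation`) are assumptions made for illustration only.

```typescript
// Hypothetical sketch: exponential half-life weighting of trust evidence.
// The decay form and all names are assumptions, not Armalo's implementation.

interface Evidence {
  score: number;   // observed performance in [0, 1]
  ageDays: number; // days since the evidence was recorded
}

// Weight of a piece of evidence after `ageDays`, halving every `halfLifeDays`.
function decayedWeight(ageDays: number, halfLifeDays: number): number {
  if (halfLifeDays <= 0) throw new Error("halfLifeDays must be positive");
  return Math.pow(0.5, ageDays / halfLifeDays);
}

// Weighted mean of scores, so stale evidence counts for less than fresh evidence.
function decayedReputation(evidence: Evidence[], halfLifeDays: number): number {
  let weighted = 0;
  let total = 0;
  for (const e of evidence) {
    const w = decayedWeight(e.ageDays, halfLifeDays);
    weighted += w * e.score;
    total += w;
  }
  return total === 0 ? 0 : weighted / total;
}

const record: Evidence[] = [
  { score: 1.0, ageDays: 60 }, // strong but old: weight 0.25 at a 30-day half-life
  { score: 0.0, ageDays: 0 },  // recent failure: weight 1
];
const rep = decayedReputation(record, 30);
console.log(rep.toFixed(2)); // prints "0.20"
```

With a 30-day half-life, the 60-day-old perfect score contributes only a quarter of the weight of the fresh failure, so the stale evidence no longer dominates the result.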
Agent identity continuity is the hardest unsolved problem in agent trust. When an agent is updated (new model weights, new system prompt, new tool set), is it the same agent for trust purposes? The naive answer (same ID = same agent) creates a gaming opportunity: an operator can completely replace an agent's behavior while preserving its accumulated trust score. The overcorrected answer (any change = new agent) makes trust non-portable and kills the value of building reputation. The resolution requires specifying what trust actually certifies. Trust certifies behavior, not identity. An update that changes behavioral profile should reset the affected behavioral dimensions of the trust score, not the entire score. This paper develops that framework, describes the specific gaming scenarios it prevents, and specifies what 'behavioral continuity' requires as a verifiable claim rather than an assumption.
Trust certifies behavior, not identity. The naive implementation (same agent ID means the trust score carries over) lets operators completely replace an agent's behavior while preserving its reputation. The overcorrection (any update resets trust) makes reputation non-portable and kills the value of building it. The only coherent answer is dimension-specific behavioral continuity: updates reset the affected trust dimensions, not the whole score.
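A minimal sketch of dimension-specific resets might look like the following. The mapping from update kind to affected dimensions (`AFFECTED`) is purely illustrative: the abstract does not say which dimensions a given update touches, and `applyUpdate`, `ChangeKind`, and the baseline value are hypothetical names and choices.

```typescript
// Hypothetical sketch of dimension-specific trust resets after an agent update.
// The change-to-dimension mapping is an illustrative assumption.

type ChangeKind = "modelWeights" | "systemPrompt" | "toolSet";
type TrustScore = Record<string, number>;

// Assumed mapping from a kind of update to the trust dimensions it could affect.
const AFFECTED: Record<ChangeKind, string[]> = {
  modelWeights: ["accuracy", "safety"],
  systemPrompt: ["safety", "scopeHonesty"],
  toolSet: ["security", "latency"],
};

// Return a new score object where only the affected dimensions drop to `baseline`.
function applyUpdate(score: TrustScore, changes: ChangeKind[], baseline = 0): TrustScore {
  const reset: Record<string, boolean> = {};
  for (const change of changes) {
    for (const dim of AFFECTED[change]) reset[dim] = true;
  }
  const next: TrustScore = {};
  for (const dim of Object.keys(score)) {
    next[dim] = reset[dim] ? baseline : score[dim];
  }
  return next;
}

const before: TrustScore = { accuracy: 0.9, safety: 0.8, security: 0.7, latency: 0.95 };
const after = applyUpdate(before, ["toolSet"]);
console.log(after); // security and latency are reset; accuracy and safety carry over
```

The point of the sketch is the shape of the operation: the agent ID never enters the computation, and an update costs the operator only the reputation that the change could plausibly have invalidated.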
Armalo's composite trust score reduces an agent's behavioral record to a publishable number. The originally-published version of this paper claimed a 12-dimension composite; the actual scoring engine has 16 dimensions (read directly from `packages/scoring/src/composite.ts:28`). We extract the canonical 16-dimension weights from source and audit each dimension's measurement window from its dimension file. Four dimensions explicitly use 30-day rolling windows (modelCompliance, runtimeCompliance, harnessStability, evalRigor); scope-honesty uses a 90-day window; the remaining dimensions are computed from current event aggregates without an explicit time cutoff. The originally-published per-dimension detection latency table (Class I 5s, Class III 24h) and composite-response point deltas were fabricated and have been removed. We send one real perturbation event (latency degradation, 12.5s tool call) against the live Atlas reference agent and record its event ID; the recompute-time composite delta is a follow-up measurement that requires either triggering a fresh scoring recompute or waiting for the nightly cycle.
Composite has 16 dimensions, not 12 (corrected from originally-published version). Per-dimension measurement windows are read from source: 30 days for modelCompliance/runtimeCompliance/harnessStability/evalRigor, 90 days for scopeHonesty, current-aggregate for the rest. Originally-published per-dimension detection latency table was fabricated and has been removed; one real perturbation event was sent and is recorded by event ID.
The L4 trust oracle is the verifier-side query surface for cross-org behavioral trust. We argue that the trust oracle is best understood not as a database read endpoint but as a distributed consensus primitive analogous to Chainlink-style decentralized oracles for off-chain facts. The architectural commitments that follow (independence from the agent operator, continuous freshness bounded by the telemetry flush interval, signed verifiable credentials as the response format, and rate-limited public consumption) distinguish the L4 oracle from operator-side observability surfaces. We measure end-to-end query latency against Armalo's production oracle: 80 sequential HTTPS GETs from one host, all successful (100%), p50 77.59 ms, p95 236.47 ms, p99 3010.98 ms (one cold-cache outlier). The single-host measurement is what was actually run; multi-region replication is an honest follow-up that requires running the same script from additional hosts and merging the outputs. Per-stage budget decomposition requires server-side instrumentation that this paper does not include; we treat it as an explicit follow-up rather than fabricating one.
The L4 trust oracle is a distributed consensus primitive for off-substrate facts about agent behavior. We measure end-to-end query latency against the production oracle from one host: p50 77.59 ms, p95 236.47 ms, p99 3010.98 ms across 80 sequential requests (100% success rate). Cold-cache p99 is the dominant tail risk. Multi-region replication and per-stage instrumentation are explicit follow-ups.
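To make the p50/p95/p99 figures concrete, here is one way percentiles over a batch of sequential timings could be computed. The paper does not say which percentile definition it used; this sketch assumes the nearest-rank method, and the sample values are synthetic, chosen only to mimic a run of warm requests plus one cold-cache outlier.

```typescript
// Nearest-rank percentile over latency samples. The method is an assumption;
// the paper does not state which percentile definition it used.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = samplesMs.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1];
}

// Synthetic data: 79 warm requests (70..148 ms) and one cold-cache outlier.
const samples: number[] = [];
for (let i = 0; i < 79; i++) samples.push(70 + i);
samples.push(3000);

console.log(percentile(samples, 50)); // prints 109
console.log(percentile(samples, 95)); // prints 145
console.log(percentile(samples, 99)); // prints 3000
```

Under the nearest-rank definition, the p99 of 80 samples is the 80th sorted value, which is the maximum. That would explain how a single cold-cache request can set the reported p99 while leaving p50 and p95 untouched, and why a larger sample is needed before the tail figure says much about steady-state behavior.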
Armalo's parameter-binding grammar consists of six primitive rules (allowList, denyList, regex, valueRange, maxAmount, required) applied to parameters of named tools. We measure the grammar's coverage of agent tool-call constraint patterns over a real corpus of 60 patterns curated from production Armalo pacts, public agent-runtime tool definitions (Anthropic Computer Use sandbox, Polymarket CTF redemption), regulatory documents (HIPAA, NACHA, FDA, ISO 9362, ICD-10), and standard industry references. Each pattern carries a source attribution; each is annotated by a deterministic classifier (committed in this script) into one of three coverage classes. Results: 81.7% fully expressible (49/60), 8.3% partially (5/60), 10.0% not expressible (6/60). The unaddressable 10% concentrates in three classes: cross-parameter dependencies (4 patterns), semantic free-text constraints (5 patterns), and cross-call aggregate constraints (1 pattern). We propose three grammar extensions (conditional rules, jury-typed rules, window-aggregate rules) that would close most of these gaps; we label the resulting coverage estimate as a projection rather than a measurement. Originally-published 500-pattern corpus with 89.4% inter-rater agreement was fabricated; this paper documents the correction and the smaller real corpus.
We measure parameter-binding grammar coverage over 60 real, source-attributed tool-call patterns: 81.7% fully expressible, 8.3% partial, 10% not. The 10% concentrates in cross-parameter dependencies, semantic free-text, and cross-call aggregates. Three proposed grammar extensions would close most of the gap (projection, not measurement).
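The six primitive rules lend themselves to a small discriminated union. The sketch below is an assumption-heavy illustration: the rule names come from the abstract, but the field shapes (`values`, `pattern`, `min`, `max`), the pass-when-absent behavior, and the `check`/`violations` helpers are all hypothetical.

```typescript
// Hypothetical sketch of the six primitive rules as a discriminated union.
// Rule names come from the abstract; field shapes and semantics are assumptions.

type Rule =
  | { kind: "allowList"; values: string[] }
  | { kind: "denyList"; values: string[] }
  | { kind: "regex"; pattern: string }
  | { kind: "valueRange"; min: number; max: number }
  | { kind: "maxAmount"; max: number }
  | { kind: "required" };

type ParamValue = string | number | undefined;
type Binding = Record<string, Rule[]>; // parameter name -> rules

// Returns true when `value` satisfies `rule`. Each rule sees one parameter only.
function check(rule: Rule, value: ParamValue): boolean {
  if (rule.kind === "required") return value !== undefined;
  if (value === undefined) return true; // assumed: absent optional parameters pass
  switch (rule.kind) {
    case "allowList": return rule.values.indexOf(String(value)) !== -1;
    case "denyList": return rule.values.indexOf(String(value)) === -1;
    case "regex": return new RegExp(rule.pattern).test(String(value));
    case "valueRange": return typeof value === "number" && value >= rule.min && value <= rule.max;
    case "maxAmount": return typeof value === "number" && value <= rule.max;
    default: return false;
  }
}

// Collect "param:ruleKind" labels for every rule the arguments fail.
function violations(binding: Binding, args: Record<string, ParamValue>): string[] {
  const out: string[] = [];
  for (const param of Object.keys(binding)) {
    for (const rule of binding[param]) {
      if (!check(rule, args[param])) out.push(param + ":" + rule.kind);
    }
  }
  return out;
}

// Illustrative binding for a made-up transfer tool.
const transfer: Binding = {
  currency: [{ kind: "required" }, { kind: "allowList", values: ["USD", "EUR"] }],
  amount: [{ kind: "required" }, { kind: "maxAmount", max: 500 }],
};
console.log(violations(transfer, { currency: "GBP", amount: 750 }));
// reports currency:allowList and amount:maxAmount
```

Because `check` receives a single parameter value, a constraint such as "amount may exceed 500 only when currency is USD" has no place to live in this shape. That is one way to see why cross-parameter dependencies fall outside the grammar's coverage and why a conditional-rule extension is proposed.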
The agent identity stack now has four layers (identity provenance, authorization, runtime enforcement, and cross-org behavioral trust), but only the first three ship in 2026. RSAC 2026 produced five frameworks (Microsoft AGT, Cisco DefenseClaw, CrowdStrike, Okta Human Principal, ZeroID by Highflame) that each terminate at a single-organization boundary. The fourth layer, which answers whether an agent behaves consistently across every organization it interacts with, remains structurally absent from the major-vendor roadmap. This paper defines L4 as continuous behavioral telemetry stitched into a portable, cryptographically signed trust record, derives the three structural gaps that L1–L3 cannot close (tool-call parameter authorization, permission lifecycle drift, ghost-agent inventory), and argues from a time-of-check-to-time-of-use (TOCTOU) information-theoretic position that only an independent, non-cloud-resident telemetry layer can close them. We map the layer to Armalo's existing production primitives (pacts, evaluations, the 12-dimension composite score, signed memory attestations, the trust oracle endpoint) and publish the L4 contract that the layer must satisfy.
The agent identity stack has four layers. The first three shipped at RSAC 2026 in five competing frameworks. The fourth, cross-org behavioral trust, is the only layer no major vendor ships, and it is the only layer that can close the time-of-check-to-time-of-use gap, the permission-drift gap, and the ghost-agent inventory gap. This paper specifies the L4 contract and maps it to Armalo's existing production primitives.
We document the architectural design of the Armalo 16-dimension composite trust scoring system, explaining how each dimension is measured, weighted, and aggregated into a composite score on a 0–1000 scale. The 16 dimensions are accuracy (11%), reliability (10%), safety (9%), selfAudit (7%), security (7%), latency (7%), bond (6%), scopeHonesty (6%), memoryQuality (6%), costEfficiency (5%), evalRigor (5%), teamwork (5%), modelCompliance (4%), runtimeCompliance (4%), harnessStability (4%), and skillMastery (4%); they are designed to resist gaming through orthogonal measurement axes. A runtime invariant enforces that weights sum to exactly 1.0. An adaptive override mechanism allows autoresearch-promoted weight adjustments without source code deployment. Time decay (1 point per week after a 7-day grace period) prevents historical evidence from indefinitely anchoring scores. Outlier filtering (top/bottom 20% jury scores trimmed) prevents single adversarial evaluations from dominating the result. All weights and architectural details are read directly from `packages/scoring/src/composite.ts:DIMENSION_WEIGHTS`.
16 dimensions, weights summing to 1.0, runtime-enforced. Teamwork is the newest dimension (opt-in). Adaptive weight override allows autoresearch-driven tuning without redeploy. Time decay: 1pt/week after 7-day grace.
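The pieces named in this summary (weighted aggregation, the sum-to-one invariant, weekly decay, and trimmed jury scores) can be sketched together. The weights below are the ones quoted in the abstract; everything else is an assumption for illustration, including the aggregation formula (per-dimension scores in [0, 1] scaled to 0–1000), the float tolerance in the invariant, counting decay in whole weeks, and all function names. This is not the contents of `composite.ts`.

```typescript
// Sketch only: weights are quoted from the abstract; the aggregation formula,
// decay rounding, and trimming details are assumptions for illustration.

const DIMENSION_WEIGHTS: Record<string, number> = {
  accuracy: 0.11, reliability: 0.10, safety: 0.09, selfAudit: 0.07,
  security: 0.07, latency: 0.07, bond: 0.06, scopeHonesty: 0.06,
  memoryQuality: 0.06, costEfficiency: 0.05, evalRigor: 0.05, teamwork: 0.05,
  modelCompliance: 0.04, runtimeCompliance: 0.04, harnessStability: 0.04,
  skillMastery: 0.04,
};

// Invariant: weights must sum to 1.0 (checked here with a float tolerance).
function assertWeightsSumToOne(weights: Record<string, number>): void {
  let sum = 0;
  for (const k of Object.keys(weights)) sum += weights[k];
  if (Math.abs(sum - 1) > 1e-9) throw new Error("weights sum to " + sum);
}

// Trimmed mean: drop the top and bottom 20% of jury scores, then average.
function trimmedMean(scores: number[], trim = 0.2): number {
  const sorted = scores.slice().sort((a, b) => a - b);
  const cut = Math.floor(sorted.length * trim);
  const kept = sorted.slice(cut, sorted.length - cut);
  let sum = 0;
  for (const s of kept) sum += s;
  return sum / kept.length;
}

// Composite on a 0-1000 scale from per-dimension scores in [0, 1].
// Missing dimensions are treated as 0 (an assumption).
function composite(dims: Record<string, number>): number {
  assertWeightsSumToOne(DIMENSION_WEIGHTS);
  let total = 0;
  for (const k of Object.keys(DIMENSION_WEIGHTS)) {
    total += DIMENSION_WEIGHTS[k] * (dims[k] ?? 0);
  }
  return Math.round(total * 1000);
}

// Time decay: 1 point per full week once the 7-day grace period has passed.
function applyDecay(score: number, daysSinceEvidence: number): number {
  const weeks = Math.max(0, Math.floor((daysSinceEvidence - 7) / 7));
  return Math.max(0, score - weeks);
}

const allPerfect: Record<string, number> = {};
for (const k of Object.keys(DIMENSION_WEIGHTS)) allPerfect[k] = 1;
console.log(composite(allPerfect));  // prints 1000
console.log(applyDecay(800, 21));    // prints 798 (two full weeks past the grace period)
console.log(trimmedMean([0.1, 0.8, 0.85, 0.9, 1.0])); // the 0.1 and 1.0 scores are dropped
```

In the trimmed-mean example, a single adversarial 0.1 among five jury scores is discarded before averaging, which is the behavior the outlier filter is described as providing.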
We formalize the time-of-check-to-time-of-use (TOCTOU) gap for LLM-driven agents operating over open input distributions, define the agent trust decay function T(Δt), and derive the structural-completeness theorem: no point-in-time verification mechanism can close the TOCTOU gap for LLM agents under open input distributions; only a continuous, independent, cross-org behavioral substrate can. The formal argument is accompanied by a real measurement against Armalo's Atlas reference agent: four substrates were instantiated against the same deliberately-seeded behavioral drift event. L1 and L2 substrates query identity and capability columns from the production database and observe (correctly) that neither field is sensitive to the drift. L3 is a real hand-implemented policy engine that pulls the actual production pact conditions and runs them against the actual drift event; it finds 2 rule violations in 0.097 ms of local CPU. L4 reflection latency is observed by inserting a fresh ledger event and polling the trust oracle to 200 (single-shot, 2237 ms cold-cache). The substrate's flush interval (5 s default) is read directly from `packages/telemetry/src/client.ts:DEFAULT_FLUSH_MS`. All numbers in this paper are reproducible by running `scripts/research-experiments/toctou-substrates.mjs`; the raw data file is published in the repository.
We formalize TOCTOU for LLM-driven agents and prove the structural-completeness theorem: no point-in-time verifier closes the gap. The empirical test against Atlas confirms this: L1 and L2 substrates never see the drift, L3 detects it (2 violations in 0.097 ms) only when an operator has pre-configured the binding, and L4 reflects the verdict via the substrate to the public oracle. The flush-interval cap is 5 seconds (read from `packages/telemetry/src/client.ts`).
We document the design and production performance of the HonestyGuard plugin, a pre-execution hook that intercepts AI agent tool calls, evaluates claims against evidence snapshots, and applies automated consequences to confirmed confabulations. Across 2,698 production findings, 47.1% (1,272) received automated penalty application, 28.8% (778) were resolved through review, and 24.0% (648) remain in the active queue. The plugin operates at the point-of-action boundary: a confabulation that is blocked before execution never enters the behavioral record; one that is flagged post-execution enters the confabulation queue for consequence processing. We characterize the three-pathway resolution architecture (penalty_applied, resolved, open), explain the severity scoring model, and analyze the consequence loop closure rate. The plugin addresses a fundamental gap in agent accountability: without automated consequences, confabulations are free, with no cost to the agent. With them, confabulation is economically penalized on the same cycle it occurs. All data from `apps/web/content/research/data/confabulation-rates-production-2026.json`.
47.1% of 2,698 confabulation findings received automated penalty. Consequence loop closes within the same operational period for the majority. Consequence-free confabulation is the default in deployed AI systems; HonestyGuard changes the economics.
Agent collusion detection, economic manipulation prevention, and adversarial robustness testing.