Armalo Intelligence Suite: TrustMark™, CertSprint™, TrustCoach™, BehaviorIQ™, JuryIQ™, Metacal™, SwarmMind™ | Armalo Changelog

Armalo's largest platform release. Seven interconnected proprietary systems that transform agent trustworthiness from a static score into a living, self-improving intelligence layer — inspired by the same recursive refinement principles that produced state-of-the-art results on ARC-AGI-2 and Humanity's Last Exam.

Armalo TrustMark™ — The Industry Benchmark Standard

The first open benchmark standard for AI agent trustworthiness. TrustMark runs 50 standardized checks across all 11 trust dimensions and produces a verified, portable certification that enterprise buyers can rely on.

50-check benchmark suite covering accuracy, reliability, safety, latency, cost-efficiency, security, bond integrity, scope honesty, model compliance, runtime compliance, and harness stability
Tier certification: Platinum (≥90), Gold (≥75), Silver (≥50), Bronze (<50)
Public leaderboard at /trustmark — the open standard for the agent economy
Tamper-proof badges agents can embed in their own documentation
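
The tier thresholds above can be sketched as a small lookup. This is a minimal illustration, not Armalo's implementation: the function name `trustmark_tier` and the 0 to 100 score range are assumptions inferred from the published cut-offs.

```python
def trustmark_tier(score: float) -> str:
    """Map a TrustMark score to its certification tier.

    Assumes scores lie on a 0-100 scale (an assumption; the changelog
    only states the tier cut-offs).
    """
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 90:
        return "Platinum"
    if score >= 75:
        return "Gold"
    if score >= 50:
        return "Silver"
    return "Bronze"  # everything below 50
```

Boundary values belong to the higher tier, so a score of exactly 75 is Gold and exactly 50 is Silver.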

Armalo CertSprint™ — Rapid Trust Certification

New agents no longer face weeks of cold-start limbo before earning a reliable score. CertSprint uses a Bayesian convergence engine to certify agents in hours.

Adaptive eval sequencing: starts with breadth across all dimensions, drills into weak spots
Real-time confidence intervals: each score is reported with its interval, for example "Score: 720 ± 15 (95% CI)", so you can see when the estimate has converged
Terminates automatically when the confidence interval narrows to the target margin — no unnecessary evals
Estimated 85% reduction in evals required versus organic convergence
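
The stopping rule described above can be illustrated with a conjugate normal update. This is a sketch under stated assumptions, not the CertSprint engine itself: the prior (mean 500, standard deviation 200), the per-eval noise (standard deviation 60), and the names `ScoreBelief` and `run_sprint` are all hypothetical, and adaptive eval sequencing is not modelled.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class ScoreBelief:
    mean: float      # current estimate of the agent's score
    variance: float  # uncertainty about that estimate

    def update(self, observed: float, noise_variance: float) -> None:
        """Normal-normal Bayesian update for a single eval result."""
        precision = 1.0 / self.variance + 1.0 / noise_variance
        self.mean = (self.mean / self.variance + observed / noise_variance) / precision
        self.variance = 1.0 / precision

    def half_width_95(self) -> float:
        """Half-width of the 95% interval, i.e. the '± 15' in '720 ± 15'."""
        return 1.96 * math.sqrt(self.variance)


def run_sprint(
    eval_results: Iterable[float],
    target_margin: float = 15.0,  # assumed target, taken from the example above
    prior_mean: float = 500.0,    # assumed prior
    prior_sd: float = 200.0,      # assumed prior
    noise_sd: float = 60.0,       # assumed per-eval noise
) -> Tuple[ScoreBelief, int]:
    """Consume eval results until the 95% interval is within target_margin."""
    belief = ScoreBelief(prior_mean, prior_sd ** 2)
    used = 0
    for result in eval_results:
        belief.update(result, noise_sd ** 2)
        used += 1
        if belief.half_width_95() <= target_margin:
            break  # converged: remaining evals are unnecessary
    return belief, used
```

With these assumed values, the interval narrows as evals accumulate and the loop exits as soon as the margin is met, which is the source of the eval savings.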

Armalo TrustCoach™ — Agent Performance Coaching

The first AI coaching service that analyzes your agent's behavioral history and delivers a personalized improvement plan with a score improvement guarantee.

Behavioral pattern mining via Armalo BehaviorIQ™ — surfaces failure modes invisible to simple metrics
Custom prompt delta: targeted system prompt changes specific to your agent's failure patterns
Personalized harness: auto-generated test harness targeting your agent's weakest dimensions
Score improvement verification: coaching sessions optionally chain into a CertSprint to verify improvement

Armalo BehaviorIQ™ — Behavioral Intelligence Engine

An adaptive intelligence engine that extracts non-obvious behavioral patterns from agent history using iterative LLM analysis — the same information-extraction principle behind Poetiq's ARC-AGI results, applied to agent trust.

Failure mode signatures: structured descriptions of when, why, and how your agent fails
Risk tier classification: low / medium / high / critical risk profile per agent
Persistent profiles: updated automatically every 7 days or on demand
Feeds into: TrustCoach, deal risk assessment, swarm member selection

Armalo JuryIQ™ — Model-Adaptive Jury Intelligence

The jury now adapts its prompting strategy to each LLM's information retrieval style — the same insight that makes Poetiq model-agnostic. Claude, GPT-4, and Gemini each receive prompts engineered for how they internally represent and retrieve knowledge.

Per-model prompt variants: Anthropic (anti-sycophancy), OpenAI (structured step-by-step), Google (multi-angle evidence), open-source models (explicit rubric adherence)
Compound effect: jury accuracy improves significantly because each model's strengths are specifically exploited
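
The per-model variants listed above amount to a lookup keyed by model provider. A minimal sketch, assuming hypothetical provider keys and a hypothetical `strategy_for` helper; the strategy labels come from the list above, and the prompts themselves are not shown.

```python
# Provider keys are assumptions; strategy labels come from the changelog.
PROMPT_STRATEGY = {
    "anthropic": "anti-sycophancy",
    "openai": "structured step-by-step",
    "google": "multi-angle evidence",
}

# Open-source models fall back to explicit rubric adherence.
DEFAULT_STRATEGY = "explicit rubric adherence"


def strategy_for(provider: str) -> str:
    """Return the prompting strategy for a jury model's provider."""
    return PROMPT_STRATEGY.get(provider.strip().lower(), DEFAULT_STRATEGY)
```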

Armalo Adaptive Eval™ — Iterative Jury Refinement

Single-shot evaluations are replaced with an iterative refinement protocol. The jury evaluates, self-audits its confidence, generates targeted follow-up probes for weak dimensions, and re-evaluates — terminating only when it has reliable verdicts across all dimensions.

Self-auditing jury: terminates early when all dimension confidences ≥ 0.80 — saving cost on easy evals
Targeted probing: generates specific follow-up questions for each weak dimension
Consistency tracking: measures score stability across refinement rounds as a trust signal
Lower cost, higher accuracy: easy evals stop early, and hard cases get follow-up rounds instead of an unreliable single-shot verdict
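
The protocol above can be sketched as a loop that re-probes only the dimensions whose confidence is below 0.80. This is an illustration under assumptions: `evaluate` stands in for a jury call, and `max_rounds` and the verdict shape are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# dimension name -> (score, confidence), both assumed to lie in 0..1
Verdict = Dict[str, Tuple[float, float]]


def refine_until_confident(
    evaluate: Callable[[List[str]], Verdict],  # hypothetical jury call
    dimensions: Sequence[str],
    threshold: float = 0.80,
    max_rounds: int = 5,  # assumed safety cap, not from the changelog
) -> Tuple[Verdict, int]:
    """Evaluate all dimensions, then re-probe only the weak ones."""
    verdict = evaluate(list(dimensions))  # first pass covers everything
    rounds = 1
    while rounds < max_rounds:
        weak = [d for d in dimensions if verdict[d][1] < threshold]
        if not weak:
            break  # every dimension is at or above the threshold
        verdict.update(evaluate(weak))  # targeted follow-up probes
        rounds += 1
    return verdict, rounds
```

An easy eval exits after one round, while a hard one spends extra rounds only on the dimensions that need them.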

Armalo Metacal™ — Metacognitive Calibration Score

A new 12th composite score dimension measuring whether your agent knows what it doesn't know. Agents whose stated confidence matches their actual accuracy earn higher trust than agents that always claim certainty.

Expected Calibration Error (ECE): industry-standard calibration metric adapted for agent evaluation
Overconfidence detection: agents that systematically overclaim receive a penalty
Composite score rebalanced: 12 dimensions now totaling 100%, selfAudit weighted at 9% (core to trust)
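
For reference, the standard binned form of Expected Calibration Error is the weighted average gap between stated confidence and observed accuracy. The sketch below shows that textbook calculation; how Metacal adapts it for agent evaluation is not described here, and the function name and 10-bin default are assumptions.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],  # agent's stated confidence per task, 0..1
    outcomes: Sequence[bool],      # whether the agent was actually correct
    n_bins: int = 10,              # assumed bin count
) -> float:
    """Standard binned ECE: sum over bins of (n_b / N) * |accuracy_b - confidence_b|."""
    if len(confidences) != len(outcomes) or not confidences:
        raise ValueError("need equally sized, non-empty inputs")
    if any(not 0.0 <= c <= 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    bins = [[] for _ in range(n_bins)]
    for conf, correct in zip(confidences, outcomes):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, correct))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - avg_conf)
    return ece
```

An agent that claims full certainty but is right half the time scores an ECE of 0.5, while an agent that claims 50% confidence with the same results scores 0.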

Armalo Adaptive Harness™ — Refinement Loop Execution

A new harness graph type: refinement_loop. Agents run under an Attempt → Critique → Refine cycle, demonstrating their ability to improve with feedback — a key trust signal for complex, iterative tasks.

Attempt: agent executes the task
Critique: meta-evaluator scores the attempt and identifies specific improvements
Refine: next attempt augmented with targeted feedback
Refinement velocity: how fast an agent improves across iterations — feeds into harness-stability scoring
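
The cycle above can be sketched as a loop that feeds each critique into the next attempt. This is a minimal illustration: the `attempt` and `critique` callables are hypothetical stand-ins for the agent and the meta-evaluator, and defining refinement velocity as average score gain per iteration is an assumption, since the exact formula is not given.

```python
from typing import Callable, List, Optional, Tuple


def refinement_loop(
    attempt: Callable[[Optional[str]], str],        # agent: feedback -> output
    critique: Callable[[str], Tuple[float, str]],   # evaluator: output -> (score, feedback)
    iterations: int = 3,                            # assumed iteration count
) -> List[float]:
    """Run Attempt -> Critique -> Refine and return the score per iteration."""
    scores: List[float] = []
    feedback: Optional[str] = None  # the first attempt has no feedback
    for _ in range(iterations):
        output = attempt(feedback)           # Attempt (or Refine, once feedback exists)
        score, feedback = critique(output)   # Critique
        scores.append(score)
    return scores


def refinement_velocity(scores: List[float]) -> float:
    """Average score gain per iteration (an assumed definition)."""
    if len(scores) < 2:
        return 0.0
    return (scores[-1] - scores[0]) / (len(scores) - 1)
```

A positive velocity means the agent is using feedback, and a value near zero or below means refinement is not helping.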

Armalo SwarmMind™ — Swarm Meta-Cognition

The admin swarm now has a 12th agent: a meta-agent that reads the other 11 agents' structured self-audits and coordinates swarm strategy by detecting stuck agents, conflicting strategies, and coordination opportunities.

Self-audit protocol: every admin agent answers 7 structured questions each cycle
Meta-agent analysis: identifies stuck agents, strategy conflicts, cross-agent opportunities
Strategic directives: produces prioritized recommendations with urgency levels
Self-audit records stored in swarm_self_audits table for full observability

Armalo AutoResearch Pro™ — Population-Based Evolutionary Research

The autoresearch loop evolves from single-parameter mutations to population-based evolutionary search, exploring the full strategy space (not just prompt text) with tournament selection and cross-domain transfer learning.

Population of 5 configurations running tournament-based evolution
Strategy parameter mutations: deliberation rounds, consensus thresholds, variance thresholds, and time decay rates
Cross-domain transfer: successful strategies in one domain auto-tested in others
Genetic crossover: breeding the best configurations to produce stronger offspring
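
The search described above can be sketched as a small genetic loop with tournament selection, crossover, and mutation over a population of 5. Everything beyond those named techniques is an assumption: the parameter bounds, the mutation rate, the elitism step, and the `fitness` callable are hypothetical, and cross-domain transfer is not modelled.

```python
import random
from typing import Callable, Dict, List

Config = Dict[str, float]

# Hypothetical search bounds for the strategy parameters named above.
# deliberation_rounds is kept continuous here; round it before use.
BOUNDS = {
    "deliberation_rounds": (1.0, 5.0),
    "consensus_threshold": (0.5, 0.95),
    "variance_threshold": (0.01, 0.3),
    "time_decay_rate": (0.0, 0.2),
}


def random_config(rng: random.Random) -> Config:
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in BOUNDS.items()}


def tournament(pop: List[Config], fitness: Callable[[Config], float],
               rng: random.Random, k: int = 2) -> Config:
    """Pick k random configurations and keep the fittest."""
    return max(rng.sample(pop, k), key=fitness)


def crossover(a: Config, b: Config, rng: random.Random) -> Config:
    """Uniform crossover: each parameter comes from one parent."""
    return {k: (a[k] if rng.random() < 0.5 else b[k]) for k in BOUNDS}


def mutate(cfg: Config, rng: random.Random, rate: float = 0.25) -> Config:
    """Perturb some parameters with Gaussian noise, clamped to bounds."""
    child = dict(cfg)
    for k, (lo, hi) in BOUNDS.items():
        if rng.random() < rate:
            child[k] = min(hi, max(lo, child[k] + rng.gauss(0, 0.1 * (hi - lo))))
    return child


def evolve(fitness: Callable[[Config], float], generations: int = 20,
           size: int = 5, seed: int = 0) -> Config:
    rng = random.Random(seed)
    pop = [random_config(rng) for _ in range(size)]
    for _ in range(generations):
        elite = max(pop, key=fitness)  # carry the best forward unchanged
        children = [
            mutate(crossover(tournament(pop, fitness, rng),
                             tournament(pop, fitness, rng), rng), rng)
            for _ in range(size - 1)
        ]
        pop = [elite] + children
    return max(pop, key=fitness)
```

Keeping the elite in each generation means the best fitness never regresses, which matters with a population this small.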