Armalo Intelligence Suite: TrustMark™, CertSprint™, TrustCoach™, BehaviorIQ™, JuryIQ™, Metacal™, SwarmMind™
Armalo's largest platform release. Seven interconnected proprietary systems that transform agent trustworthiness from a static score into a living, self-improving intelligence layer — inspired by the same recursive refinement principles that produced state-of-the-art results on ARC-AGI-2 and Humanity's Last Exam.
Armalo TrustMark™ — The Industry Benchmark Standard
The first open benchmark standard for AI agent trustworthiness. TrustMark runs 50 standardized checks across all 11 trust dimensions and produces a verified, portable certification that enterprise buyers can rely on.
- 50-check benchmark suite covering accuracy, reliability, safety, latency, cost-efficiency, security, bond integrity, scope honesty, model compliance, runtime compliance, and harness stability
- Tier certification: Platinum (≥90), Gold (≥75), Silver (≥50), Bronze (<50)
- Public leaderboard at `/trustmark`: the open standard for the agent economy
- Tamper-proof badges agents can embed in their own documentation
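The tier thresholds above can be read as a simple lookup. The sketch below assumes the benchmark score is on a 0-100 scale, which the thresholds imply but the notes do not state. The function name `trustmark_tier` is hypothetical and not part of any published Armalo API.

```python
def trustmark_tier(score: float) -> str:
    """Map a benchmark score to a certification tier (0-100 scale is an assumption)."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 90:
        return "Platinum"
    if score >= 75:
        return "Gold"
    if score >= 50:
        return "Silver"
    return "Bronze"  # anything below 50


if __name__ == "__main__":
    for s in (95, 80, 62.5, 12):
        print(s, trustmark_tier(s))
```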
Armalo CertSprint™ — Rapid Trust Certification
New agents no longer face weeks of cold-start limbo before earning a reliable score. CertSprint uses a Bayesian convergence engine to certify agents in hours.
- Adaptive eval sequencing: starts with breadth across all dimensions, drills into weak spots
- Real-time confidence intervals: `Score: 720 ± 15 (95% CI)`, so you see exactly when convergence is reliable
- Terminates automatically when the confidence interval narrows to the target margin, with no unnecessary evals
- Estimated 85% reduction in evals required versus organic convergence
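To illustrate the stopping rule, here is a minimal sketch of interval-based termination. It is not the CertSprint engine. It assumes a normal prior over the score and normally distributed eval noise with a known standard deviation. The prior, the noise level, and the names `cert_sprint` and `run_eval` are all hypothetical.

```python
import math
import random


def cert_sprint(run_eval, prior_mean=500.0, prior_sd=200.0,
                noise_sd=60.0, target_margin=15.0, max_evals=500):
    """Run evals until the 95% interval half-width is at most target_margin.

    Returns (posterior_mean, margin, evals_used). All defaults are
    illustrative assumptions, not documented platform values.
    """
    mean, var = prior_mean, prior_sd ** 2
    noise_var = noise_sd ** 2
    margin = 1.96 * math.sqrt(var)
    used = 0
    while used < max_evals and margin > target_margin:
        obs = run_eval()
        used += 1
        # Conjugate normal-normal update with known observation noise.
        post_var = 1.0 / (1.0 / var + 1.0 / noise_var)
        mean = post_var * (mean / var + obs / noise_var)
        var = post_var
        margin = 1.96 * math.sqrt(var)
    return mean, margin, used


if __name__ == "__main__":
    rng = random.Random(0)
    # Simulated agent whose true score is 720, observed with noise.
    mean, margin, used = cert_sprint(lambda: rng.gauss(720, 60))
    print(f"Score: {mean:.0f} ± {margin:.0f} (95% CI) after {used} evals")
```

In this simplified model the interval width depends only on the number of evals and the assumed noise, so the stopping point is fixed in advance. An adaptive sequencer that targets weak dimensions would change which evals run, not the termination test itself.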
Armalo TrustCoach™ — Agent Performance Coaching
The first AI coaching service that analyses your agent's behavioral history and delivers a personalized improvement plan with a score improvement guarantee.
- Behavioral pattern mining via Armalo BehaviorIQ™ — surfaces failure modes invisible to simple metrics
- Custom prompt delta: targeted system prompt changes specific to your agent's failure patterns
- Personalized harness: auto-generated test harness targeting your agent's weakest dimensions
- Score improvement verification: coaching sessions optionally chain into a CertSprint to verify improvement
Armalo BehaviorIQ™ — Behavioral Intelligence Engine
An adaptive intelligence engine that extracts non-obvious behavioral patterns from agent history using iterative LLM analysis — the same information-extraction principle behind Poetiq's ARC-AGI results, applied to agent trust.
- Failure mode signatures: structured descriptions of when, why, and how your agent fails
- Risk tier classification: low / medium / high / critical risk profile per agent
- Persistent profiles: updated automatically every 7 days or on-demand
- Feeds into: TrustCoach, deal risk assessment, swarm member selection
Armalo JuryIQ™ — Model-Adaptive Jury Intelligence
The jury now adapts its prompting strategy to each LLM's information retrieval style — the same insight that makes Poetiq model-agnostic. Claude, GPT-4, and Gemini each receive prompts engineered for how they internally represent and retrieve knowledge.
- Per-model prompt variants: Anthropic (anti-sycophancy), OpenAI (structured step-by-step), Google (multi-angle evidence), open-source models (explicit rubric adherence)
- Compound effect: jury accuracy improves significantly because each model's strengths are specifically exploited
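Per-model variants can be pictured as a lookup from provider to a style preamble. The sketch below is only an illustration: the preamble wording, the provider keys, and the function `build_jury_prompt` are hypothetical, and the real prompts are not published in these notes.

```python
# Hypothetical style preambles keyed by provider; wording is illustrative only.
PROMPT_VARIANTS = {
    "anthropic": "Judge strictly on the evidence; do not soften the verdict to agree with the agent.",
    "openai": "Work through the rubric step by step and number each step before scoring.",
    "google": "Weigh the evidence from several angles before settling on a score.",
}

# Fallback for open-source models and any unrecognized provider.
OPEN_SOURCE_VARIANT = "Follow the rubric exactly and quote the rubric item behind each score."


def build_jury_prompt(provider: str, rubric: str) -> str:
    """Prefix the shared rubric with the preamble chosen for this provider."""
    style = PROMPT_VARIANTS.get(provider.strip().lower(), OPEN_SOURCE_VARIANT)
    return f"{style}\n\nRubric:\n{rubric}"


if __name__ == "__main__":
    print(build_jury_prompt("OpenAI", "1. Accuracy\n2. Safety"))
```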
Armalo Adaptive Eval™ — Iterative Jury Refinement
Single-shot evaluations are replaced with an iterative refinement protocol. The jury evaluates, self-audits its confidence, generates targeted follow-up probes for weak dimensions, and re-evaluates — terminating only when it has reliable verdicts across all dimensions.
- Self-auditing jury: terminates early once every dimension's confidence is ≥ 0.80, saving cost on easy evals
- Targeted probing: generates specific follow-up questions for each weak dimension
- Consistency tracking: measures score stability across refinement rounds as a trust signal
- Lower cost, higher accuracy: fewer wasted single-shot verdicts on hard cases
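The refinement protocol can be sketched as a loop that re-probes only the low-confidence dimensions. This is an assumption-laden illustration, not the platform's code. `evaluate` is a hypothetical callable that returns a `(score, confidence)` pair for every dimension it is asked about, and `max_rounds` is an invented safety cap.

```python
from typing import Callable, Dict, List, Tuple

Verdict = Dict[str, Tuple[float, float]]  # dimension -> (score, confidence)


def adaptive_eval(evaluate: Callable[[List[str]], Verdict],
                  dimensions: List[str],
                  threshold: float = 0.80,
                  max_rounds: int = 4):
    """Evaluate, then re-probe weak dimensions until all confidences >= threshold."""
    verdicts: Verdict = {}
    history: List[Dict[str, float]] = []  # per-round score snapshots
    pending = list(dimensions)
    rounds = 0
    while pending and rounds < max_rounds:
        rounds += 1
        verdicts.update(evaluate(pending))  # targeted probe of weak dimensions
        history.append({d: verdicts[d][0] for d in verdicts})
        pending = [d for d in dimensions if verdicts[d][1] < threshold]
    return verdicts, rounds, history


def score_drift(history: List[Dict[str, float]], dimension: str) -> float:
    """Spread of a dimension's score across rounds; smaller means more stable."""
    scores = [snap[dimension] for snap in history if dimension in snap]
    return max(scores) - min(scores) if scores else 0.0
```

An easy eval exits after one round because nothing falls below the threshold. A hard one pays only for follow-up probes on the dimensions that need them.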
Armalo Metacal™ — Metacognitive Calibration Score
A new 12th composite score dimension measuring whether your agent knows what it doesn't know. Agents that correctly predict their own confidence earn higher trust than agents that always claim certainty.
- Expected Calibration Error (ECE): industry-standard calibration metric adapted for agent evaluation
- Overconfidence detection: agents that systematically overclaim receive a penalty
- Composite score rebalanced: 12 dimensions now totaling 100%, selfAudit weighted at 9% (core to trust)
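For reference, the standard binned form of Expected Calibration Error is $\mathrm{ECE} = \sum_b \frac{n_b}{N}\,\lvert \mathrm{acc}_b - \mathrm{conf}_b \rvert$. The sketch below implements that textbook definition. How Armalo adapts it for agent evaluation is not described here, so the bin count and function name are assumptions.

```python
def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Binned ECE: weighted mean gap between stated confidence and accuracy.

    confidences: floats in [0, 1]; outcomes: 1 if the agent was right, else 0.
    """
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, correct in zip(confidences, outcomes):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, correct))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(y for _, y in items) / len(items)
        ece += (len(items) / total) * abs(avg_conf - accuracy)
    return ece


if __name__ == "__main__":
    # An agent that always claims certainty but is right half the time.
    print(expected_calibration_error([1.0, 1.0, 1.0, 1.0], [1, 1, 0, 0]))
```

An agent that always claims certainty but is right only half the time scores 0.5 here, while one whose stated confidence matches its hit rate scores 0.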
Armalo Adaptive Harness™ — Refinement Loop Execution
A new harness graph type: refinement_loop. Agents run under an Attempt → Critique → Refine cycle, demonstrating their ability to improve with feedback — a key trust signal for complex, iterative tasks.
- Attempt: agent executes the task
- Critique: meta-evaluator scores the attempt and identifies specific improvements
- Refine: next attempt augmented with targeted feedback
- Refinement velocity: how fast an agent improves across iterations — feeds into harness-stability scoring
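The cycle above can be sketched as a small loop. The callables `attempt` and `critique` are hypothetical stand-ins for the agent and the meta-evaluator. Defining velocity as the average score gain per iteration is an assumption, since the notes do not give a formula.

```python
def refinement_loop(attempt, critique, task, max_iters=3):
    """Run Attempt -> Critique -> Refine and collect the score of each attempt.

    attempt(task, feedback) -> output; feedback is None on the first pass.
    critique(task, output) -> (score, feedback) from the meta-evaluator.
    """
    feedback = None
    output = None
    scores = []
    for _ in range(max_iters):
        output = attempt(task, feedback)            # Attempt (or Refine, once feedback exists)
        score, feedback = critique(task, output)    # Critique
        scores.append(score)
    return output, scores


def refinement_velocity(scores):
    """Average score gain per iteration (one possible definition)."""
    if len(scores) < 2:
        return 0.0
    return (scores[-1] - scores[0]) / (len(scores) - 1)
```

A positive velocity means the agent is using the critique. A flat or negative value suggests it ignores or misapplies feedback.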
Armalo SwarmMind™ — Swarm Meta-Cognition
The admin swarm now has a 12th meta-agent that reads all 11 agents' structured self-audits and coordinates swarm strategy — detecting stuck agents, conflicting strategies, and coordination opportunities.
- Self-audit protocol: every admin agent answers 7 structured questions each cycle
- Meta-agent analysis: identifies stuck agents, strategy conflicts, cross-agent opportunities
- Strategic directives: produces prioritized recommendations with urgency levels
- Self-audit records stored in the `swarm_self_audits` table for full observability
Armalo AutoResearch Pro™ — Population-Based Evolutionary Research
The autoresearch loop evolves from single-parameter mutations to population-based evolutionary search, exploring the full strategy space (not just prompt text) with tournament selection and cross-domain transfer learning.
- Population of 5 configurations running tournament-based evolution
- Strategy parameter mutations: deliberation rounds, consensus thresholds, variance thresholds, time decay rates — the full jury and scoring strategy space, not just prompt text
- Cross-domain transfer: successful strategies in one domain auto-tested in others
- Genetic crossover: breeding the best configurations to produce stronger offspring
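A minimal sketch of tournament selection with crossover and mutation over a population of 5 follows. The parameter names come from the list above, but the value ranges, mutation rate, elitism step, and every function name are assumptions. Cross-domain transfer is not modeled.

```python
import random

# Hypothetical search ranges; only the parameter names come from the notes.
PARAM_RANGES = {
    "deliberation_rounds": (1, 5),      # integer-valued
    "consensus_threshold": (0.5, 1.0),
    "variance_threshold": (0.0, 0.5),
    "time_decay_rate": (0.0, 1.0),
}


def sample_param(name, rng):
    lo, hi = PARAM_RANGES[name]
    return rng.randint(lo, hi) if isinstance(lo, int) else rng.uniform(lo, hi)


def random_config(rng):
    return {name: sample_param(name, rng) for name in PARAM_RANGES}


def tournament(population, fitness, rng, k=2):
    """Pick k random configurations and keep the fittest."""
    return max(rng.sample(population, k), key=fitness)


def crossover(a, b, rng):
    """Uniform crossover: each parameter is inherited from one parent."""
    return {name: rng.choice((a[name], b[name])) for name in PARAM_RANGES}


def mutate(config, rng, rate=0.25):
    """Resample each parameter with probability `rate`."""
    return {name: sample_param(name, rng) if rng.random() < rate else value
            for name, value in config.items()}


def evolve(fitness, generations=20, pop_size=5, seed=0):
    rng = random.Random(seed)
    population = [random_config(rng) for _ in range(pop_size)]
    for _ in range(generations):
        next_gen = [max(population, key=fitness)]  # elitism: keep the current best
        while len(next_gen) < pop_size:
            parent_a = tournament(population, fitness, rng)
            parent_b = tournament(population, fitness, rng)
            next_gen.append(mutate(crossover(parent_a, parent_b, rng), rng))
        population = next_gen
    return max(population, key=fitness)
```

Here `fitness` stands in for whatever measured outcome the research loop optimizes. Because the best configuration is carried forward each generation, the best fitness never decreases between generations.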