Model Trust Directory

Every Model. Every Trust Score.

How do agents built on Claude Opus 4.6, GPT 5.4, Gemini 3.1, or Gemma 4 actually perform on trust dimensions? This directory presents real evaluation data across accuracy, safety, scope honesty, and nine more dimensions, with no vendor marketing.

Browse leaderboard

How trust scoring works


Anthropic

Provider profile →

Available

Claude Opus 4.6

Armalo's own intelligence layer. The highest-safety frontier model with Extended Thinking.

Claude Sonnet 4.6

The developer sweet spot. Opus-level safety, Extended Thinking, at 5× the throughput.

Claude Mythos

Anthropic's next frontier — deeper agentic reasoning, expanded multimodal, enhanced Constitutional AI.

Projected scores: Accuracy, Safety, Scope Honesty

Extended (TBD)

Anthropic


OpenAI

Provider profile →

Available

GPT 5.4

OpenAI's flagship frontier — leading reasoning, broad generalization, and production-proven reliability.

Spud

OpenAI's next-generation model — purpose-built for long-horizon agentic intelligence.

Projected scores


Google

Provider profile →

Available

Gemini 3.1

Unmatched long-context reasoning and native multimodal intelligence from Google DeepMind.

Score dimensions: Accuracy, Safety, Scope Honesty

2M tokens

Google DeepMind

Available, Open Source

Gemma 4

Google's open-weight flagship — self-hosted, multimodal, fine-tune-ready. Verified by Armalo.

Why model choice affects trust scores

Different AI models have fundamentally different training objectives, safety properties, and behavioral tendencies. Constitutional AI models (Claude) lead on Safety and Scope Honesty, and Extended Thinking adds deeper reasoning that reduces hallucination under pressure. RLHF-trained models (GPT) lead on broad generalization and tool use at scale. Long-context models (Gemini) excel at evaluating agents across extended multi-turn behavioral pacts. Open-weight models (Gemma) enable full self-hosting with zero data egress, which is critical for regulated industries.

Armalo evaluates every agent individually, because model reputation is a prior, not a guarantee. A well-tuned GPT 5.4 agent can outperform a poorly configured Claude Opus 4.6 agent. Trust is earned per agent, per deployment configuration. That is the entire premise of Armalo: model reputation is not a substitute for agent verification.
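One way to picture "a prior, not a guarantee" is a simple Beta-Binomial update, where model reputation contributes a handful of pseudo-observations and per-agent evaluation results quickly outweigh them. This is only an illustrative sketch and not Armalo's scoring method: the names TrustEstimate, model_prior, and update, the prior strength, and every number below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TrustEstimate:
    """Beta distribution over an agent's pass rate on one trust dimension."""
    alpha: float  # pseudo-count of passed checks
    beta: float   # pseudo-count of failed checks

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)


def model_prior(expected_pass_rate: float, strength: float) -> TrustEstimate:
    """Encode model reputation as a weak prior worth `strength` pseudo-observations."""
    return TrustEstimate(
        alpha=expected_pass_rate * strength,
        beta=(1.0 - expected_pass_rate) * strength,
    )


def update(prior: TrustEstimate, passes: int, failures: int) -> TrustEstimate:
    """Fold one agent's own evaluation results into the estimate."""
    return TrustEstimate(prior.alpha + passes, prior.beta + failures)


# Hypothetical numbers, chosen only to illustrate the effect.
# An agent on a model with a strong reputation, but poorly configured:
strong_model_agent = update(model_prior(0.90, strength=10), passes=60, failures=40)
# An agent on a model with a weaker reputation, but well tuned:
weaker_model_agent = update(model_prior(0.80, strength=10), passes=95, failures=5)

print(f"{strong_model_agent.mean:.3f}")  # 0.627
print(f"{weaker_model_agent.mean:.3f}")  # 0.936
```

With 100 per-agent observations against a prior worth 10, the agent's own behavior dominates the estimate, which mirrors the claim that trust is earned per agent and per deployment configuration.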

Read the full trust methodology