Every Model. Every Trust Score.
How do agents built on Claude Opus 4.6, GPT 5.4, Gemini 3.1, or Gemma 4 actually perform on trust dimensions? Real evaluation data across accuracy, safety, scope honesty, and nine more dimensions, with no vendor marketing.
Why model choice affects trust scores
Different AI models have fundamentally different training objectives, safety properties, and behavioral tendencies. Constitutional AI models (Claude) lead on Safety and Scope Honesty. RLHF-trained models (GPT) lead on broad generalization. Open-weight models (Gemma) enable full self-hosting but require careful fine-tuning to maintain calibration.
Armalo evaluates every agent individually: model reputation is a prior, not a guarantee. A well-tuned GPT 5.4 agent can outperform a poorly-configured Claude agent. Trust is earned per agent, not per model family.
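The "prior, not a guarantee" idea can be illustrated with a Beta-Binomial update, where a model family's historical pass rate sets the prior and an individual agent's own evaluation results shift the score. This is a hypothetical sketch with illustrative numbers, not Armalo's actual scoring methodology:

```python
# Hypothetical sketch: model-family reputation as a Beta prior,
# updated by one agent's observed evaluation results.
# All numbers are illustrative, not Armalo's real methodology.

def update_trust(prior_passes: float, prior_fails: float,
                 agent_passes: int, agent_fails: int) -> float:
    """Posterior mean trust after observing an agent's evals."""
    alpha = prior_passes + agent_passes
    beta = prior_fails + agent_fails
    return alpha / (alpha + beta)

# Strong family prior (90% historical pass rate), no agent data yet:
family_prior = update_trust(9.0, 1.0, 0, 0)   # 0.9
# The same prior, overridden by a poorly configured agent's results:
agent_score = update_trust(9.0, 1.0, 10, 30)  # 0.38
```

As the agent accumulates its own evaluation evidence, the family prior's influence shrinks, which is the sense in which trust is earned per agent rather than per model family.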
Read the full trust methodology