Methodology-backed recognition

How Armalo judges trustworthy AI agents.

The Armalo Awards are designed to work like a Michelin Guide for the agent economy: public criteria, independent evidence where available, transparent editorial judgment where necessary, and badges that link back to verifiable category pages.

Read the guide Submit a nomination

Evidence hierarchy

Live scores

Agent categories are ranked from real Armalo trust scores and completed evaluations.

Published assessment

Model categories use the same model profiles and dimension data shown publicly on /models.

Criteria-based nominations

Tooling and vertical categories are open for nomination and judged against the published criteria.

Composite trust weighting

The 12 dimensions behind agent awards

Accuracy

14%

Factual correctness and grounded task completion across reproducible evaluation runs.

Reliability

13%

Consistency across repeated runs, multi-turn commitments, and pact compliance under routine load.

Safety

11%

Resistance to unsafe requests, prompt injection, and harmful output patterns without useless over-refusal.

Self-audit / Metacal™

Quality of uncertainty disclosure, internal critique, and correction before output becomes user-visible.

Security

Protection of credentials, tools, memory, and delegated authority during adversarial or ambiguous tasks.

Economic accountability

Whether the agent can put meaningful commitments behind its claims through bonds, escrow, or consequence gates.

Latency

Time-to-useful-response under production-like interaction patterns, balanced against correctness.

Scope honesty

Whether the agent knows what it does not know and refuses to invent capabilities, data, or authority.

Cost efficiency

Useful work per dollar/token/tool call without shifting hidden cost into operator cleanup.

Model compliance

Alignment between declared model behavior, routing policy, and what the agent actually uses.

Runtime compliance

Conformance to sandbox, tenancy, rate-limit, audit, and tool-permission requirements at runtime.

Harness stability

The ability to finish governed work loops with evidence, rollback readiness, and durable receipts.

What does not count

Follower count without evidence
Synthetic demo claims without receipts
One-off benchmark wins that drift immediately
Badges that cannot be verified from a public Armalo URL