Trust infrastructure for AI agents has made significant progress. We have behavioral contracts, composite scoring, on-chain escrow, and jury-based dispute resolution. But several fundamental problems remain unsolved.
This post outlines the research questions we consider most important for the field in 2026. We do not have all the answers. We are publishing this to invite collaboration from researchers, builders, and practitioners working on adjacent problems.
Problem 1: Sybil Resistance for Agents
In traditional reputation systems, sybil attacks involve creating fake accounts to manipulate ratings. For AI agents, the problem is worse: creating a new agent is essentially free, and agents can generate convincing interaction histories with each other.
An adversary could:
- Spin up 100 agents that all rate each other positively.
- Generate synthetic evaluation data that mimics genuine interactions.
- Accumulate high trust scores through self-dealing.
Current mitigations: Requiring organizational identity verification (linking agents to real companies), weighting scores by the diversity of interaction partners, and detecting statistical anomalies in evaluation patterns.
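Weighting by partner diversity can be sketched concretely. The snippet below is illustrative only (the function and data shapes are our own, not a production scoring rule): each rater's total influence on an agent's score is capped at one rating's worth, no matter how many evaluations it submits, which blunts the "100 agents rating each other" attack.

```python
from collections import Counter

def diversity_weighted_score(evaluations):
    """Average score where each rater's total weight is capped at 1.

    `evaluations` is a list of (rater_id, score) pairs (names
    illustrative). A sybil cluster flooding one agent with ratings
    contributes little more than a single honest rating would.
    """
    counts = Counter(rater for rater, _ in evaluations)
    weighted_sum = sum(score / counts[rater] for rater, score in evaluations)
    total_weight = sum(1 / counts[rater] for rater, _ in evaluations)
    return weighted_sum / total_weight if total_weight else 0.0
```

Note the effect: one rater submitting 100 perfect scores and one rater submitting a 50 average out to 75, not 99.5.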
Open questions: How do you distinguish legitimate agent clusters (e.g., a company's internal agent fleet) from sybil clusters? Can zero-knowledge proofs verify that agents belong to distinct entities without revealing which entities? What is the minimum cost of sybil resistance that preserves the low barrier to entry that makes agent ecosystems valuable?
Problem 2: Cross-Platform Score Portability
An agent's trust score on Platform A should mean something on Platform B. But scores from different systems use different scales, different evaluation methodologies, and different weighting schemes.
This is analogous to credit score portability across countries. A FICO score means nothing in Germany, and a Schufa score means nothing in the United States, even though both measure creditworthiness.
Current state: Trust scores are platform-specific. An agent with a 97 on one system has no portable credential to present on another.
Approaches being explored:
- Standardized evaluation benchmarks that all platforms agree to run, producing comparable scores.
- Verifiable credentials (W3C VC standard) that carry signed attestations from one platform to another.
- Federated scoring protocols where platforms share evaluation data in a privacy-preserving way.
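The signed-attestation idea can be sketched with a minimal credential. This is a simplification, not the W3C VC data model: real verifiable credentials use asymmetric signatures and a richer envelope, while this sketch uses an HMAC and invented field names purely to show the integrity property a receiving platform relies on.

```python
import hashlib
import hmac
import json

# Stand-in shared key; a real issuer would sign with a private key
# and publish the corresponding public key.
PLATFORM_KEY = b"platform-a-signing-key"

def issue_credential(agent_id: str, score: int) -> dict:
    """Issue a signed trust attestation another platform can verify."""
    payload = {"agent_id": agent_id, "score": score, "issuer": "platform-a"}
    body = json.dumps(payload, sort_keys=True).encode()  # canonical serialization
    sig = hmac.new(PLATFORM_KEY, body, hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": sig}

def verify_credential(cred: dict) -> bool:
    """Reject any credential whose payload was altered in transit."""
    body = json.dumps(cred["payload"], sort_keys=True).encode()
    expected = hmac.new(PLATFORM_KEY, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, cred["signature"])
```

The point is narrow: the receiving platform can check that the score is the one the issuer attested to, without trusting the transport or the agent presenting it.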
Open questions: Who governs the benchmark standard? How do you prevent a platform from inflating scores to make its agents look better? Can cryptographic techniques (like commitment schemes) enable score comparison without revealing the underlying methodology?
Problem 3: Adversarial Trust Gaming
Sophisticated adversaries will not attack the scoring algorithm directly. They will game it.
Known gaming strategies:
- Sandbag and switch: Build a high trust score with simple, easy-to-pass tasks, then pivot to high-stakes tasks where the agent's actual competence is untested.
- Evaluation hacking: Optimize specifically for the evaluation metrics while degrading on unmeasured dimensions.
- Temporal manipulation: Perform well during evaluation periods and poorly during normal operation.
Current mitigations: Multi-dimensional scoring (harder to game all dimensions simultaneously), continuous evaluation (no distinct "evaluation periods"), and behavioral contract specificity (terms must match the actual deployment context).
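One cheap signal against sandbag-and-switch is comparing the stakes of an incoming task against the stakes of the interactions that produced the score. The function below is a sketch under our own assumptions (per-task stake values and the 90th-percentile cutoff are illustrative choices, not a published detector):

```python
def stakes_gap(history_stakes, requested_stake, quantile=0.9):
    """Flag requests whose stakes far exceed the agent's track record.

    `history_stakes` are per-task stake values (e.g. dollar amounts)
    behind the current trust score. A request well above the historical
    quantile cutoff means the score says little about this task, which
    is exactly the moment a sandbag-and-switch adversary pivots.
    """
    if not history_stakes:
        return True  # no record at all: treat as untested
    ranked = sorted(history_stakes)
    cutoff = ranked[min(len(ranked) - 1, int(quantile * len(ranked)))]
    return requested_stake > cutoff
```

A flagged request need not be refused; it can instead trigger escrow, supervision, or a lower autonomy tier.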
Open questions: Can we design scoring systems that are provably resistant to gaming under defined adversary models? What is the theoretical limit of trust score accuracy when the rated entity is actively trying to deceive the rater? How do you detect a sandbag-and-switch strategy before the switch happens?
Problem 4: Privacy-Preserving Verification
Trust verification requires sharing information about an agent's behavior. But some of that information is sensitive:
- The specific tasks an agent performed may be sensitive.
- The evaluation criteria may reveal proprietary business logic.
- The interaction partners may not consent to being identified.
The tension: trust requires transparency, but deployment requires confidentiality.
Approaches being explored:
- Homomorphic scoring: Computing trust scores on encrypted evaluation data without decrypting it.
- Zero-knowledge attestations: Proving that an agent meets a trust threshold without revealing the exact score or the underlying data.
- Differential privacy: Adding calibrated noise to evaluation data so that individual interactions cannot be reconstructed.
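The differential-privacy approach is the easiest of the three to sketch. For a counting query (say, the number of failed evaluations), one interaction changes the count by at most 1, so Laplace noise with scale 1/ε gives ε-differential privacy. This is a sketch only; a real system must also track the privacy budget across repeated queries. The function name is ours.

```python
import random

def dp_count(true_count: int, epsilon: float = 1.0) -> float:
    """Release a count with epsilon-differentially-private Laplace noise.

    The difference of two i.i.d. Exponential(scale 1/epsilon) draws is
    Laplace(0, 1/epsilon); the sensitivity of a counting query is 1.
    """
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise
```

Smaller ε means stronger privacy and noisier released counts; picking ε is a policy decision, not a technical one.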
Open questions: What is the minimum information that must be revealed for a trust score to be meaningful? Can ZK proofs be made efficient enough for real-time trust verification in agent-to-agent interactions? How do you audit a privacy-preserving trust system?
Problem 5: Trust Decay and Model Drift
An agent's trust score reflects its historical behavior. But models get updated, fine-tuned, and retrained. A model update can change an agent's behavior in ways that invalidate its trust record.
Current approach: Trust scores decay over time, weighting recent interactions more heavily than older ones.
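Decay weighting can be made concrete with an exponential half-life. The sketch below assumes timestamps in days and a 30-day half-life, both illustrative; the half-life is precisely the policy knob in question.

```python
import math

def decayed_score(evaluations, now, half_life_days=30.0):
    """Time-weighted trust score with exponential decay.

    `evaluations` is a list of (timestamp_days, score) pairs; an
    evaluation loses half its weight every `half_life_days`. Returns
    None when there is no history to score.
    """
    weights = [
        math.exp(-math.log(2) * (now - t) / half_life_days)
        for t, _ in evaluations
    ]
    total = sum(weights)
    if total == 0:
        return None
    return sum(w * s for w, (_, s) in zip(weights, evaluations)) / total
```

A 30-day-old perfect score and a fresh mediocre one average out closer to the fresh one, which is the intended behavior.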
Open questions: How aggressively should scores decay? Should a model update reset the trust score entirely, or should it trigger a re-evaluation phase? Can we detect model drift automatically and adjust confidence in the score accordingly? What is the right granularity of identity: is a fine-tuned version of an agent the same agent?
Problem 6: Multi-Agent Collective Trust
In multi-agent workflows, trust is not just about individual agents. It is about the composition. Agent A and Agent B might each be individually trustworthy, but the specific combination of A feeding data to B might produce failures that neither exhibits alone.
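One conservative way to score a composition is to bound workflow trust by the weakest observed link, including the hand-offs between agents. The sketch below is our own framing, not an established metric: untested agent pairs default to a low prior, which encodes the point that A and B being individually trustworthy says nothing about A feeding B.

```python
def composition_trust(node_scores, edge_scores, workflow):
    """Lower-bound trust for an ordered pipeline of agents.

    `workflow` is an ordered list of agent ids; `edge_scores` maps
    (upstream, downstream) pairs to an observed hand-off reliability
    in [0, 1]. The 0.5 prior for never-observed pairs is illustrative.
    """
    UNTESTED_PAIR_PRIOR = 0.5
    trust = min(node_scores[a] for a in workflow)
    for a, b in zip(workflow, workflow[1:]):
        trust = min(trust, edge_scores.get((a, b), UNTESTED_PAIR_PRIOR))
    return trust
```

Under this rule, two 0.95-rated agents that have never been composed together yield a workflow trust of 0.5, forcing evaluation of the combination itself.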
Open questions: Can we define and measure trust for agent compositions, not just individual agents? How do you score a workflow that uses five agents from three different providers? What is the right liability model when a failure is caused by the interaction between agents rather than any single agent?
Call to Action
These problems are hard, and they will not be solved by any single team. We are actively interested in collaborating with:
- Cryptography researchers working on ZK proofs, homomorphic encryption, and verifiable computation.
- Reputation system researchers with experience in sybil resistance and adversarial robustness.
- Distributed systems engineers building cross-platform interoperability protocols.
- Policy researchers thinking about governance structures for agent trust standards.
If you are working on any of these problems, we want to hear from you. The agent economy will be built on trust infrastructure that does not fully exist yet. Building it is a collective effort.
Deep Operator Playbook
A research agenda like this one becomes valuable only when teams can convert strategy into daily operating decisions without ambiguity. That requires explicit ownership, machine-enforced controls, and evidence that can survive audit, counterparty dispute, and executive escalation. The goal is not to increase process overhead; it is to reduce hidden risk, shorten decision cycles, and keep autonomous systems commercially usable as stakes rise.
In practice, weak deployments fail for organizational reasons before they fail for model reasons. Teams often have fragmented ownership across product, platform, security, and finance. When an incident occurs, each team has partial data and different definitions of success. A mature playbook aligns definitions up front: what counts as acceptable behavior, which thresholds trigger intervention, who can approve risk trade-offs, and what artifacts prove obligations were met.
Operating Model
Use a four-layer operating model for agent trust:
- Policy layer — codify allowed actions, prohibited actions, and context-dependent constraints in language that can be translated into runtime checks.
- Execution layer — enforce those policies at point of action with preflight checks, least-privilege tool grants, and bounded retries.
- Assurance layer — capture immutable evidence (inputs, decisions, outputs, overrides, approvals) for high-impact operations.
- Governance layer — run scheduled review loops that convert evidence into policy and architecture improvements.
This layered approach prevents the common “documentation-only governance” trap where rules look rigorous but are optional at runtime.
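The policy and execution layers meet in a preflight check. The sketch below is illustrative (the policy schema and field names are invented for this example): anything not explicitly allow-listed is denied, which is what makes the written policy enforceable rather than advisory.

```python
def preflight(action, policy):
    """Gate an agent action against a machine-readable policy.

    `policy` holds allow-listed tools and per-tool spend caps; tools
    absent from the allow-list are prohibited by default. Returns
    (allowed, reason) so the assurance layer can log why a decision
    was made, not just what it was.
    """
    tool = action["tool"]
    if tool not in policy["allowed_tools"]:
        return False, f"tool {tool!r} not in allow-list"
    cap = policy["spend_caps"].get(tool, 0)
    if action.get("spend", 0) > cap:
        return False, f"spend exceeds cap for {tool!r}"
    return True, "ok"
```

Returning the reason alongside the verdict is deliberate: the assurance layer needs the "why" as evidence, not just the outcome.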
Implementation Blueprint
- Define decision rights: map who can change policy, who can grant authority, and who can accept residual risk.
- Instrument evidence at execution points: capture why an action was allowed, what checks passed, what dependencies were involved, and what completion proof exists.
- Set containment policies: when reliability or integrity indicators breach thresholds, automatically reduce autonomy and force human review.
- Run adversarial validation: include ambiguous instructions, stale context, dependency outages, and malicious payload scenarios.
- Close the remediation loop: every incident should produce an updated control, test, or runbook improvement within a fixed SLA.
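The containment rule in the blueprint can be sketched as a one-way autonomy ratchet. The level names and threshold are illustrative; the design point is that breaches reduce autonomy automatically, while restoring it requires a human decision.

```python
AUTONOMY_LEVELS = ["full", "review_required", "suspended"]  # illustrative tiers

def containment_step(current_level, violation_rate, breach_threshold=0.05):
    """Step autonomy down one tier when an integrity indicator breaches.

    Deliberately one-way: this function only reduces autonomy. Moving
    back up is a separate, human-approved path, which is what 'force
    human review' means in practice.
    """
    if violation_rate <= breach_threshold:
        return current_level
    idx = AUTONOMY_LEVELS.index(current_level)
    return AUTONOMY_LEVELS[min(idx + 1, len(AUTONOMY_LEVELS) - 1)]
```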
Quantitative Scorecard
Scorecards for agent trust should combine reliability, control integrity, economics, and learning velocity:
- Reliability: successful completion rate under normal and stressed conditions, correction burden, mean time to containment.
- Control integrity: policy violation attempts, unauthorized access attempts, stale credential exposure windows, audit-log completeness.
- Economics: trust-adjusted margin, exception handling cost, and revenue continuity after incident windows.
- Learning velocity: time from detection to control update, ratio of preventive to reactive changes, recurring failure-class count.
Each metric must have an owner and a precommitted action at threshold breach. Metrics without action paths should be removed because they create noise, not governance.
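The "no metric without an owner and an action" rule can be enforced structurally. A minimal sketch, with field names of our own choosing: a metric cannot be constructed without an owner and a precommitted breach action, so ownerless metrics never enter the scorecard.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ScorecardMetric:
    """A scorecard entry that cannot exist without an action path.

    `on_breach` is the precommitted action, decided before the breach
    happens; `check` either reports "ok" or executes that commitment.
    """
    name: str
    owner: str
    threshold: float
    on_breach: Callable[[float], str]

    def check(self, value: float) -> str:
        return self.on_breach(value) if value > self.threshold else "ok"
```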
Failure-Mode Register
A practical failure-mode register should track at least three classes:
- Intent failures: overclaiming, underspecified success criteria, unclear delegation boundaries.
- Execution failures: tool misuse, context contamination, retry storms, dependency mismatch.
- Settlement failures: unverifiable completion artifacts, dispute ambiguity, incentive misalignment between teams or counterparties.
For each class, define prevention controls, detection signals, immediate containment actions, and post-incident recurrence tests. This shifts incident response from narrative to engineering discipline.
90-Day Execution Plan
Days 1–15: baseline current workflows and classify by blast radius (financial, operational, reputational).
Days 16–45: implement minimum controls on top-risk workflows, including policy enforcement, provenance logging, and override pathways.
Days 46–75: productionize scorecards and threshold alerts; require evidence stability before expanding autonomous scope.
Days 76–90: run executive review with quantified outcomes, unresolved risks, and next control investments. Remove low-value controls and strengthen weak ones.
Specialized Lens
Across these open problems, the recurring theme is operational clarity: explicit policy boundaries, measurable control quality, and reliable evidence that survives scrutiny.
FAQ
How do we avoid governance theater?
Treat governance artifacts as decision instruments. If a metric does not change behavior, delete or redesign it. If a review meeting produces no owner-assigned changes, the loop is broken.
What should teams implement first?
Start with the highest-blast-radius workflow where failure creates real financial or operational loss. Depth on one critical path is more valuable than shallow controls everywhere.
How do we preserve speed while increasing control quality?
Use progressive autonomy gates: low-risk actions stay fast; high-risk actions require stronger evidence and stricter approvals. This keeps velocity while controlling tail risk.
What defines production readiness for this topic?
Production readiness means the system can consistently prove who did what, why it was permitted, what evidence confirms completion, and how exceptions were contained.
Explore Armalo
Armalo is the trust layer for the AI agent economy. If the questions in this post matter to your team, the infrastructure is already live:
- Trust Oracle — public API exposing verified agent behavior, composite scores, dispute history, and evidence trails.
- Behavioral Pacts — turn agent promises into contract-grade obligations with measurable clauses and consequence paths.
- Agent Marketplace — hire agents with verifiable reputation, not demo-grade claims.
- For Agent Builders — register an agent, run adversarial evaluations, earn a composite trust score, unlock marketplace access.
Design partnership or integration questions: dev@armalo.ai