Formal Verification of AI Agent Trust Policies: From Logic to Production Enforcement
How TLA+, Alloy, Coq, and model checking techniques can verify AI agent trust policies before deployment—and where formal methods reach their limits with LLM-based systems.
The software industry learned an expensive lesson from concurrent systems in the 1990s: testing cannot exhaustively verify correctness for systems with large state spaces. Deadlocks, race conditions, and subtle protocol errors consistently evaded unit tests, integration tests, and even careful code review, only to surface as production incidents of the kind that turned Therac-25, Ariane 5, and Boeing's software failures into industry cautionary tales. The solution was formal verification: mathematically proving that a system satisfies a specification across all possible states and transitions, not just the ones your test suite happened to exercise.
AI agent trust policies face an analogous challenge. A trust policy specifies when an agent may act, what resources it may access, how trust scores gate capabilities, how violations trigger penalties, and how multiple policies compose without contradiction. These policies operate over complex state spaces—agent trust scores, capability tiers, pact terms, evaluation histories, temporal constraints—with subtle composition rules that are easy to get wrong. A policy that looks correct in isolation may interact with another policy in unexpected ways. A policy that correctly handles the common case may allow dangerous edge cases that an adversary specifically engineers.
Formal verification offers a rigorous path to trust policy correctness. TLA+ can verify that trust gating protocols satisfy liveness and safety properties across all reachable states. Alloy can check that policy compositions are consistent and contradiction-free. Coq can produce machine-checked proofs that policy implementations satisfy their formal specifications. These tools have been used to verify AWS IAM, the Raft consensus algorithm, TLS protocol implementations, and operating system kernels. They can be applied to AI agent trust policies too—but with important limitations that practitioners must understand.
This post covers the practical application of formal methods to AI agent trust policies: what you can verify, what you cannot, how verification integrates with operational systems like Armalo, and where the boundaries of formal verification lie when LLM-based systems introduce inherent non-determinism.
The Trust Policy Verification Problem
Why Testing Fails for Trust Policies
Consider a trust gating policy with the following specification: an agent may access enterprise APIs if its composite trust score exceeds 750 and its security dimension score exceeds 80 and it has completed at least 100 successful interactions and no active pact violations exist. If any condition is violated, the agent is restricted to read-only operations. If the security dimension score falls below 60, the agent is fully suspended pending re-evaluation.
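For reference, here is a direct Python transcription of that prose policy (the function and field names are illustrative, not Armalo's API). It looks complete, and it is exactly this kind of encoding that hides the questions below.
# Naive, illustrative encoding of the prose policy above; names are hypothetical.
from dataclasses import dataclass

@dataclass
class AgentState:
    composite_score: int          # 0-1000
    security_score: int           # 0-100, the security dimension
    successful_interactions: int
    active_pact_violations: int

def access_level(agent: AgentState) -> str:
    """Return 'enterprise', 'read_only', or 'suspended' per the prose policy."""
    if agent.security_score < 60:
        return "suspended"        # full suspension pending re-evaluation
    if (agent.composite_score > 750
            and agent.security_score > 80
            and agent.successful_interactions >= 100
            and agent.active_pact_violations == 0):
        return "enterprise"       # full enterprise API access
    return "read_only"            # any failed condition: restricted to read-only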
This policy seems straightforward. But now consider these questions:
What happens at boundaries? When a score is exactly 750, is the agent permitted or restricted? When a score atomically transitions from 751 to 749 (because two evaluations complete simultaneously), does the agent complete in-flight requests or get cut off mid-transaction?
What happens under concurrent updates? If an agent's pact violation is being resolved while simultaneously its trust score is being updated by a completed evaluation, which state wins? Can the policy observe an intermediate state where the violation is cleared but the score hasn't yet reflected the corrective evaluation?
What happens with composed policies? If an enterprise deployment applies both Armalo's trust gating policy and its own internal compliance policy, do the two policies compose without contradiction? Can a situation arise where one policy permits an action that another policy simultaneously prohibits?
What about temporal properties? If an agent's trust score drops below the threshold, should the policy immediately suspend all in-flight requests? Or should it allow a grace period? If the score recovers within the grace period, does the suspension ever actually take effect? This is a liveness question: should the policy guarantee that properly-trusted agents always eventually get to act, or only that improperly-trusted agents always eventually get stopped?
Testing can verify specific scenarios—the common case, the boundary cases you thought to test, the failure modes you anticipated. But testing cannot exhaustively cover the cross product of all possible trust scores, evaluation timings, concurrent operations, and policy compositions. For safety-critical trust policies, you want more: a proof that the policy is correct across all reachable states.
Formal Specification: The Foundation
Formal verification begins with a formal specification—a mathematically precise description of what the policy should do. This is harder than it sounds, because natural language is inherently ambiguous and informal descriptions of policies are routinely inconsistent.
Consider the specification for a graduated capability unlocking system:
Natural language: "Agents at Trust Tier 3 may access sensitive APIs. Agents at Tier 4 may additionally manage other agents. Agents at Tier 5 may access financial APIs."
Questions the natural language leaves open: What exactly is "sensitive APIs"? Does Tier 4 inherit all Tier 3 permissions? If an agent's tier drops from 4 to 2, do they immediately lose Tier 3 permissions? What if they have in-flight requests using those permissions?
A formal specification must answer these questions precisely. Here is a partial TLA+ specification for the tier capability system:
---------------------------- MODULE AgentTrustTiers ----------------------------
EXTENDS Integers, Sequences, FiniteSets
CONSTANTS
Agents, (* Set of all agent identifiers *)
Capabilities, (* Set of all capability identifiers *)
MaxTier (* Maximum trust tier (e.g., 5) *)
VARIABLES
tier, (* tier[a] = current trust tier of agent a *)
granted, (* granted[a] = set of capabilities currently granted to agent a *)
in_flight (* in_flight[a] = set of capabilities being used in current requests *)
(* Capability inheritance: each tier inherits capabilities from lower tiers *)
RECURSIVE TierCapabilities(_)
TierCapabilities(t) ==
CASE t = 0 -> {}
[] t = 1 -> {"read_public", "submit_evals"}
[] t = 2 -> TierCapabilities(1) \cup {"access_basic_apis"}
[] t = 3 -> TierCapabilities(2) \cup {"access_sensitive_apis"}
[] t = 4 -> TierCapabilities(3) \cup {"manage_agents"}
[] t = 5 -> TierCapabilities(4) \cup {"access_financial_apis"}
[] OTHER -> {}
(* Type invariants *)
TypeInvariant ==
/\ tier \in [Agents -> 0..MaxTier]
/\ granted \in [Agents -> SUBSET Capabilities]
/\ in_flight \in [Agents -> SUBSET Capabilities]
(* Safety: granted capabilities must be consistent with tier *)
CapabilityConsistency ==
\A a \in Agents : granted[a] \subseteq TierCapabilities(tier[a])
(* Safety: agents cannot use capabilities they don't have *)
InFlightSafety ==
\A a \in Agents : in_flight[a] \subseteq granted[a]
(* Action: Promote agent to higher tier *)
Promote(a, new_tier) ==
/\ new_tier > tier[a]
/\ new_tier <= MaxTier
/\ tier' = [tier EXCEPT![a] = new_tier]
/\ granted' = [granted EXCEPT![a] = TierCapabilities(new_tier)]
/\ UNCHANGED in_flight
(* Action: Demote agent to lower tier - with in-flight constraint *)
Demote(a, new_tier) ==
/\ new_tier < tier[a]
/\ new_tier >= 0
/\ in_flight[a] \subseteq TierCapabilities(new_tier) (* Safe demotion: no in-flight violations *)
/\ tier' = [tier EXCEPT![a] = new_tier]
/\ granted' = [granted EXCEPT![a] = TierCapabilities(new_tier)]
/\ UNCHANGED in_flight
(* Alternative: Force demote with in-flight termination *)
ForceDemote(a, new_tier) ==
/\ new_tier < tier[a]
/\ new_tier >= 0
/\ tier' = [tier EXCEPT![a] = new_tier]
/\ granted' = [granted EXCEPT![a] = TierCapabilities(new_tier)]
/\ in_flight' = [in_flight EXCEPT![a] = in_flight[a] \cap TierCapabilities(new_tier)]
(* Invariants to check (this partial module leaves Init and Next to be filled in) *)
Invariants == TypeInvariant /\ CapabilityConsistency /\ InFlightSafety
=================================================================================
This specification makes the ambiguities explicit and forces resolution:
- Capabilities are inherited (TierCapabilities is cumulative)
- Safe demotion requires no in-flight capability violations
- Force demotion terminates in-flight requests that exceed new permissions
Once an Init predicate and a Next relation over the Promote, Demote, and ForceDemote actions are added, TLC (the TLA+ model checker) can exhaustively verify that these invariants hold across all reachable states.
TLA+ for Trust Gating Protocol Verification
Specifying the Armalo Trust Gating Protocol
A complete trust gating protocol involves more than static tier assignment. It must handle: score updates arriving from evaluations, pact violation detection, grace periods, concurrent requests, re-evaluation workflows, and recovery paths. Here is a TLA+ specification for a trust gating protocol:
---------------------------- MODULE TrustGating ----------------------------
EXTENDS Integers, Sequences, TLC
CONSTANTS
Agents,
MAX_SCORE, (* e.g., 1000 *)
GATE_THRESHOLD, (* e.g., 750 *)
SUSPEND_THRESHOLD, (* e.g., 500 *)
GRACE_PERIOD_TICKS (* e.g., 5 ticks before forced suspension *)
VARIABLES
score, (* score[a] = current composite trust score *)
pact_violations,(* pact_violations[a] = number of active pact violations *)
status, (* status[a] in {"active", "restricted", "suspended", "evaluating"} *)
grace_ticks, (* grace_ticks[a] = ticks remaining in grace period *)
pending_eval (* pending_eval[a] = whether re-evaluation is queued *)
Init ==
/\ score \in [Agents -> GATE_THRESHOLD..MAX_SCORE]
/\ pact_violations = [a \in Agents |-> 0]
/\ status = [a \in Agents |-> "active"]
/\ grace_ticks = [a \in Agents |-> 0]
/\ pending_eval = [a \in Agents |-> FALSE]
(* Score update from completed evaluation *)
ScoreUpdate(a, new_score) ==
/\ score' = [score EXCEPT![a] = new_score]
/\ IF new_score >= GATE_THRESHOLD /\ pact_violations[a] = 0
THEN /\ status' = [status EXCEPT![a] = "active"]
/\ grace_ticks' = [grace_ticks EXCEPT![a] = 0]
ELSE IF new_score < SUSPEND_THRESHOLD
THEN /\ status' = [status EXCEPT![a] = "suspended"]
/\ grace_ticks' = [grace_ticks EXCEPT![a] = 0]
ELSE /\ status' = [status EXCEPT![a] = "restricted"]
/\ grace_ticks' = [grace_ticks EXCEPT![a] = GRACE_PERIOD_TICKS]
/\ UNCHANGED <<pact_violations, pending_eval>>
(* Pact violation detected *)
ViolationDetected(a) ==
/\ pact_violations' = [pact_violations EXCEPT![a] = pact_violations[a] + 1]
/\ status' = [status EXCEPT![a] = "restricted"]
/\ pending_eval' = [pending_eval EXCEPT![a] = TRUE]
/\ UNCHANGED <<score, grace_ticks>>
(* Grace period tick *)
GraceTick(a) ==
/\ grace_ticks[a] > 0
/\ grace_ticks' = [grace_ticks EXCEPT![a] = grace_ticks[a] - 1]
/\ IF grace_ticks[a] = 1 /\ status[a] = "restricted"
THEN status' = [status EXCEPT![a] = "suspended"]
ELSE UNCHANGED status
/\ UNCHANGED <<score, pact_violations, pending_eval>>
(* SAFETY PROPERTIES *)
(* No suspended agent should ever be in "active" status with a below-threshold score *)
NoUnsafeModePromotion ==
\A a \in Agents :
score[a] < GATE_THRESHOLD => status[a] \in {"restricted", "suspended", "evaluating"}
(* No agent with violations should be active *)
NoActiveWithViolations ==
\A a \in Agents :
pact_violations[a] > 0 => status[a] \in {"restricted", "suspended", "evaluating"}
(* LIVENESS PROPERTIES *)
(* An agent with above-threshold score and no violations eventually becomes active *)
EventualPromotion ==
\A a \in Agents :
(score[a] >= GATE_THRESHOLD /\ pact_violations[a] = 0) ~> (status[a] = "active")
(* An agent in suspended status eventually gets re-evaluated *)
EventualReEvaluation ==
\A a \in Agents :
status[a] = "suspended" ~> pending_eval[a] = TRUE
(* Combined specification *)
vars == <<score, pact_violations, status, grace_ticks, pending_eval>>
Next ==
  \E a \in Agents :
    \/ \E s \in 0..MAX_SCORE : ScoreUpdate(a, s)
    \/ ViolationDetected(a)
    \/ GraceTick(a)
Spec == Init /\ [][Next]_vars /\ WF_vars(Next)
THEOREM Spec => []NoUnsafeModePromotion
THEOREM Spec => []NoActiveWithViolations
THEOREM Spec => EventualPromotion
THEOREM Spec => EventualReEvaluation
=================================================================================
Running TLC on this specification with small model parameters (e.g., 3 agents, scores 0-1000 in increments of 250) can verify that the safety properties hold across all reachable states. Any state that violates a property produces a counterexample trace showing exactly how to reach the violation.
Common Violations TLC Finds
When organizations apply TLA+ to trust gating protocols, TLC commonly finds:
Missing violation recovery paths: The specification handles violation detection and suspension, but has no transition for violation resolution. Agents can enter "restricted" status but the model checker finds no path back to "active" even when violations are cleared. The fix requires an explicit ViolationResolved action.
Grace period and score race condition: An agent's score rises above the threshold during its grace period. The grace period timer then fires and suspends the agent, even though the agent's score now warrants active status. The fix requires the grace period firing to check current score.
Concurrent score update inconsistency: Two evaluations complete simultaneously and produce conflicting score updates. The model lacks a policy for which update wins. TLC finds states where the final score depends on arbitrary ordering. The fix requires explicit conflict resolution (e.g., take the more recent evaluation, or take the lower score for safety); a small sketch of such a resolution rule follows this list.
Policy composition deadlock: When two policies are composed, one policy's restriction action sets status to "restricted" while another policy's response to "restricted" immediately sets status to "evaluating"—which the first policy then treats as a trigger for re-restriction. The composed system oscillates. The fix requires explicit priority ordering for policy actions.
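To illustrate the third fix, here is a minimal conflict-resolution sketch in Python (a hypothetical helper, not Armalo's implementation): prefer the more recent evaluation, and break exact ties by taking the lower score so that concurrency can never inflate trust.
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class EvaluationResult:
    evaluation_id: str
    completed_at: datetime
    score: int                    # composite score, 0-1000

def resolve_conflicting_updates(a: EvaluationResult, b: EvaluationResult) -> EvaluationResult:
    """Deterministic merge for two evaluations that complete 'simultaneously'."""
    if a.completed_at != b.completed_at:
        return a if a.completed_at > b.completed_at else b
    # Exact timestamp tie: take the lower (more conservative) score for safety.
    return a if a.score <= b.score else b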
Alloy for Policy Consistency and Composition
Alloy's Model: Relational Logic over Finite Structures
While TLA+ is ideal for verifying temporal protocols, Alloy is better suited for checking structural consistency of policy specifications. Alloy uses relational logic over finite structures, allowing it to automatically find counterexamples to policy properties within bounded model sizes.
Alloy is particularly useful for trust policy verification because:
- It can check whether two policies are consistent (no agent configuration exists where one policy grants what the other prohibits)
- It can verify that policy compositions don't create contradictions
- It can find minimal examples of policy violations
- It handles the relational structures natural to trust systems (agent-capability relationships, score-tier mappings, temporal constraints)
Here is an Alloy model for checking consistency between a trust gating policy and a data access control policy:
module AgentTrustPolicies
-- Core domain
sig Agent {
trust_score: Int,
tier: Int,
violations: Int,
capabilities: set Capability
}
sig Capability {}
one sig ReadPublic, WriteData, AccessSensitiveAPI, ManageAgents, AccessFinancial extends Capability {}
-- Trust gating policy: permitted capabilities based on score and violations
pred TrustGatingPolicy[a: Agent] {
-- No violations required for elevated access
a.violations = 0
-- Score-based capability grant
a.trust_score >= 900 implies
(ReadPublic + WriteData + AccessSensitiveAPI + ManageAgents + AccessFinancial) in a.capabilities
a.trust_score >= 750 and a.trust_score < 900 implies
(ReadPublic + WriteData + AccessSensitiveAPI) in a.capabilities
a.trust_score >= 500 and a.trust_score < 750 implies
(ReadPublic + WriteData) in a.capabilities
a.trust_score < 500 implies
ReadPublic in a.capabilities and
WriteData not in a.capabilities
}
-- Data access control policy: additional restrictions on data operations
pred DataAccessPolicy[a: Agent] {
-- Financial APIs require no violations AND explicit financial certification
AccessFinancial in a.capabilities implies
a.violations = 0 and a.tier >= 5
-- Sensitive API access requires security score above threshold
-- (modeled here as trust_score >= 800 as proxy)
AccessSensitiveAPI in a.capabilities implies
a.trust_score >= 800
-- Write access requires at least Tier 2
WriteData in a.capabilities implies
a.tier >= 2
}
-- Check: are the two policies consistent?
-- Find cases where TrustGatingPolicy grants capabilities that DataAccessPolicy prohibits
check PolicyConsistency {
all a: Agent |
TrustGatingPolicy[a] implies DataAccessPolicy[a]
} for 5 Agent, 11 Int -- 11-bit Int so trust scores up to 1000 are representable
-- This will find counterexamples where:
-- 1. An agent with trust_score=850 gets AccessSensitiveAPI from TrustGatingPolicy
-- but DataAccessPolicy also requires trust_score >= 800 (consistent here)
-- 2. An agent with trust_score=900 gets AccessFinancial from TrustGatingPolicy
-- but DataAccessPolicy requires tier >= 5 -- INCONSISTENCY if tier < 5!
-- The inconsistency: TrustGatingPolicy grants AccessFinancial at score >= 900
-- but DataAccessPolicy requires tier >= 5. An agent can have score=950, tier=4
-- (passed TrustGating) but DataAccessPolicy refuses AccessFinancial.
-- Result: contradictory capability grants in composed system.
-- Fix: Add tier constraint to TrustGatingPolicy's financial access grant
pred TrustGatingPolicyV2[a: Agent] {
a.violations = 0
a.trust_score >= 900 and a.tier >= 5 implies
(ReadPublic + WriteData + AccessSensitiveAPI + ManageAgents + AccessFinancial) in a.capabilities
a.trust_score >= 900 and a.tier = 4 implies
(ReadPublic + WriteData + AccessSensitiveAPI + ManageAgents) in a.capabilities
--... rest of policy
}
Running Alloy on the first version will find the inconsistency: TrustGatingPolicy grants AccessFinancial to any agent with score ≥ 900 and zero violations, but DataAccessPolicy additionally requires tier >= 5. An agent with score=950, tier=4, violations=0 satisfies TrustGatingPolicy for AccessFinancial but fails DataAccessPolicy. The composed system is contradictory.
This is exactly the kind of subtle inconsistency that testing misses (you'd need a test specifically for score=950, tier=4 agents) but Alloy finds automatically.
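Teams without Alloy in their toolchain can approximate the same bounded check by brute force. The sketch below (thresholds copied from the Alloy model, everything else illustrative) enumerates a small grid of agent states and reports configurations where the trust gating policy grants financial access that the data access policy refuses; it surfaces the score=950, tier=4 contradiction immediately, though without Alloy's minimal counterexamples or symmetry reduction.
from itertools import product

def gating_grants_financial(score: int, violations: int) -> bool:
    # Mirrors TrustGatingPolicy: financial access at score >= 900 with no violations
    return violations == 0 and score >= 900

def data_access_allows_financial(tier: int, violations: int) -> bool:
    # Mirrors DataAccessPolicy: financial access requires no violations and tier >= 5
    return violations == 0 and tier >= 5

def find_financial_contradictions() -> list[tuple[int, int, int]]:
    """Enumerate a bounded state grid; return states where the two policies disagree."""
    scores, tiers, violations = range(0, 1001, 50), range(0, 6), range(0, 3)
    return [
        (s, t, v)
        for s, t, v in product(scores, tiers, violations)
        if gating_grants_financial(s, v) and not data_access_allows_financial(t, v)
    ]

for score, tier, viols in find_financial_contradictions():
    print(f"contradiction: score={score}, tier={tier}, violations={viols}")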
Verifying Policy Composition for Multi-Tenant Deployments
In enterprise deployments, organizations often apply their own trust policies on top of Armalo's base policies. Alloy can verify that these composed policies satisfy required properties:
-- Enterprise extension policy
pred EnterpriseCompliancePolicy[a: Agent] {
-- All agents must have completed compliance training (modeled as tier >= 1)
a.tier >= 1
-- PII-handling capabilities require additional certification
AccessSensitiveAPI in a.capabilities implies
a.tier >= 3
-- Financial access is entirely prohibited under this enterprise policy
-- (the enterprise treats every marketplace agent as external, so the ban is unconditional)
AccessFinancial not in a.capabilities
}
-- Verify the composed policy is satisfiable: can any agent configuration
-- satisfy all three policies at once? Satisfiability is asked with a run command;
-- check searches for counterexamples to an assertion, which is the wrong tool here.
pred ComposedPolicySatisfiable {
some a: Agent |
TrustGatingPolicyV2[a] and DataAccessPolicy[a] and EnterpriseCompliancePolicy[a]
}
run ComposedPolicySatisfiable for 3 Agent, 11 Int
-- The composed policy with EnterpriseCompliancePolicy blocks AccessFinancial entirely.
-- TrustGatingPolicyV2 grants it at score >= 900, tier >= 5,
-- but EnterpriseCompliancePolicy prohibits it.
-- No agent that satisfies all three policies can hold AccessFinancial;
-- an agent can satisfy the composition only without that capability.
-- Alloy reports a satisfying instance and makes the always-blocked capability explicit.
This matters because enterprise security teams sometimes layer policies that collectively prohibit operations that individual policies permit. Alloy finds these contradictions before they surface as confusing runtime behavior.
Coq for Verified Policy Implementations
From Specification to Verified Implementation
TLA+ and Alloy verify that a specification is correct. But specifications can diverge from implementations. Coq (and its cousin Lean) allow you to write the specification, the implementation, and a machine-checked proof that the implementation satisfies the specification—all in one formal system.
This is the highest level of verification assurance: not just "the design is correct" but "this specific code is provably correct relative to this specification."
Here is a simplified Coq proof sketch for a trust score gating function:
(* Coq proof sketch for trust score gating correctness *)
Require Import Coq.Arith.Arith.
Require Import Coq.Bool.Bool.
Require Import Coq.Lists.List.
(* Agent trust state *)
Record AgentState := {
score : nat;
violations : nat;
tier : nat
}.
(* Capability type *)
Inductive Capability :=
| ReadPublic
| WriteData
| AccessSensitiveAPI
| ManageAgents
| AccessFinancial.
(* The formal specification: which capabilities are permitted *)
Definition specPermits (a : AgentState) (c : Capability) : Prop :=
match c with
| ReadPublic => True (* always permitted *)
| WriteData => a.(score) >= 500 /\ a.(violations) = 0
| AccessSensitiveAPI => a.(score) >= 750 /\ a.(violations) = 0
| ManageAgents => a.(score) >= 900 /\ a.(violations) = 0 /\ a.(tier) >= 4
| AccessFinancial => a.(score) >= 900 /\ a.(violations) = 0 /\ a.(tier) >= 5
end.
(* The implementation: a decision function *)
Definition implPermits (a : AgentState) (c : Capability) : bool :=
match c with
| ReadPublic => true
| WriteData => (500 <=? a.(score)) && (a.(violations) =? 0)
| AccessSensitiveAPI => (750 <=? a.(score)) && (a.(violations) =? 0)
| ManageAgents => (900 <=? a.(score)) && (a.(violations) =? 0) && (4 <=? a.(tier))
| AccessFinancial => (900 <=? a.(score)) && (a.(violations) =? 0) && (5 <=? a.(tier))
end.
(* Correctness theorem: implementation matches specification *)
Theorem implCorrect :
forall (a : AgentState) (c : Capability),
implPermits a c = true <-> specPermits a c.
Proof.
intros a c.
destruct c; simpl.
- (* ReadPublic: trivially true *)
split; intro H; [constructor | reflexivity].
- (* WriteData: unfold boolean arithmetic *)
split.
+ intro H.
apply andb_true_iff in H as [Hscore Hviol].
apply Nat.leb_le in Hscore.
apply Nat.eqb_eq in Hviol.
exact (conj Hscore Hviol).
+ intro H.
destruct H as [Hscore Hviol].
apply andb_true_iff.
split.
* apply Nat.leb_le. exact Hscore.
* apply Nat.eqb_eq. exact Hviol.
(* Similar proofs for remaining capabilities... *)
- admit. (* AccessSensitiveAPI: same shape as the WriteData case *)
- admit. (* ManageAgents: adds the tier conjunct *)
- admit. (* AccessFinancial: adds the tier conjunct *)
Admitted. (* a proof containing admits must close with Admitted rather than Qed *)
Once the three admitted cases are filled in (they follow the same pattern as the WriteData case), this theorem guarantees that implPermits exactly matches specPermits for all agent states and all capability types. Any implementation bug (an off-by-one in a threshold, a missing conjunction, an incorrect tier requirement) would cause the proof to fail, catching the error at specification time rather than in production.
The Extraction Problem: Coq to Production Code
Coq can generate certified code in OCaml, Haskell, or Scheme via extraction. But production AI agent trust systems are typically written in TypeScript, Python, or Go. The gap between Coq's verified implementation and production code requires a manual translation step, which reintroduces the possibility of translation errors.
Several approaches address this:
Certified compilation: Use Coq to verify a core policy kernel in OCaml, then call this kernel via FFI from production services. Operationally heavier than a native implementation, but it provides the strongest guarantees.
Test-suite extraction: Use the Coq proof to automatically generate a comprehensive test suite that property-tests the production implementation. The test suite covers all the edge cases that the proof required handling.
Bounded verification: Use Coq to verify the policy logic in the abstract, then use property-based testing (QuickCheck, Hypothesis) to verify that the production implementation matches the verified abstract on all generated test cases.
Reference implementation: Maintain a Coq-extracted reference implementation alongside the production implementation. Run both implementations on all decisions during canary periods and alert on any divergence.
Armalo's trust scoring system uses a variant of the last approach: the core scoring formula is maintained in a validated reference implementation, and production evaluations periodically run both implementations to detect drift.
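A minimal sketch of that dual-run pattern (the function signatures and alerting hook here are hypothetical, not Armalo's internals): every gating decision is computed by both implementations, the production answer is served, and any divergence is logged for investigation rather than silently reconciled.
import logging
from typing import Callable

logger = logging.getLogger("trust.divergence")

def gated_decision(
    agent_state: dict,
    capability: str,
    production_impl: Callable[[dict, str], bool],
    reference_impl: Callable[[dict, str], bool],
) -> bool:
    """Run the production and verified-reference implementations side by side."""
    prod = production_impl(agent_state, capability)
    ref = reference_impl(agent_state, capability)
    if prod != ref:
        # Divergence from the verified reference is a correctness signal, not an
        # operational error: record the full context and alert the policy owners.
        logger.error(
            "policy divergence: capability=%s production=%s reference=%s state=%s",
            capability, prod, ref, agent_state,
        )
    return prod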
Model Checking with SPIN and LTL
Linear Temporal Logic for Behavioral Properties
Temporal logic extends propositional logic with operators for reasoning about time:
- G φ (Globally φ): φ holds at all future states
- F φ (Finally φ): φ holds at some future state
- X φ (neXt φ): φ holds at the next state
- φ U ψ (φ Until ψ): φ holds continuously until ψ becomes true
For AI agent trust policies, temporal logic can express behavioral requirements that static specification cannot:
(* An agent that falls below the suspension threshold eventually gets suspended *)
G (score < SUSPEND_THRESHOLD => F (status = suspended))
(* A suspended agent with recovered score eventually gets re-activated *)
G ((status = suspended /\ score >= GATE_THRESHOLD) => F (status = active))
(* An agent never bypasses evaluation when transitioning from suspended to active:
   once suspended, it stays non-active until it enters evaluation *)
G ((status = suspended) => ((status /= active) U (status = evaluating)))
(* Trust scores never jump by more than 200 points in a single update
   (a step property over consecutive states; in TLA+ this is written with primed variables) *)
G (|score' - score| <= 200)
SPIN is a model checker specifically designed for LTL verification. You specify the system in Promela (a concurrent language) and SPIN exhaustively verifies that LTL properties hold across all reachable states.
Here is a Promela model fragment for trust score update handling:
/* Trust score gating model in Promela */
#define MAX_SCORE 1000
#define GATE_THRESHOLD 750
#define SUSPEND_THRESHOLD 500
#define NUM_AGENTS 3
short score[NUM_AGENTS]; /* scores reach 1000, so byte (max 255) would overflow */
byte status[NUM_AGENTS] = 2; /* 0=active, 1=restricted, 2=suspended; start suspended until init runs */
byte violations[NUM_AGENTS];
/* Score update process */
proctype ScoreUpdater(byte agent_id) {
  short new_score;
  do
  :: atomic {              /* score can drop: update score and status in one step */
       score[agent_id] > 50 ->
       new_score = score[agent_id] - 50;
       score[agent_id] = new_score;
       if
       :: new_score >= GATE_THRESHOLD && violations[agent_id] == 0 ->
          status[agent_id] = 0 /* active */
       :: new_score < SUSPEND_THRESHOLD ->
          status[agent_id] = 2 /* suspended */
       :: else ->
          status[agent_id] = 1 /* restricted */
       fi
     }
  :: atomic {              /* score can rise: update score and status in one step */
       score[agent_id] <= 950 ->
       new_score = score[agent_id] + 50;
       score[agent_id] = new_score;
       if
       :: new_score >= GATE_THRESHOLD && violations[agent_id] == 0 ->
          status[agent_id] = 0 /* active */
       :: else -> skip
       fi
     }
  od
}
/* Initialize every agent at the gate threshold and spawn its score updater */
init {
  byte i = 0;
  atomic {
    do
    :: i < NUM_AGENTS ->
       score[i] = GATE_THRESHOLD;
       status[i] = 0;
       violations[i] = 0;
       run ScoreUpdater(i);
       i++
    :: else -> break
    od
  }
}
/* LTL property: no agent remains active with below-threshold score */
ltl SafetyProperty {
[] (status[0] == 0 -> score[0] >= GATE_THRESHOLD)
}
/* LTL property: below-threshold agents eventually get restricted or suspended */
ltl LivenessProperty {
[] (score[0] < GATE_THRESHOLD -> <> (status[0] != 0))
}
Running SPIN on this model will verify both properties hold or produce a counterexample trace demonstrating a violation. For a 3-agent model with scores in the range 0-1000 (in steps of 50), SPIN might explore hundreds of thousands of states, confirming the properties exhaustively.
Where Formal Verification Reaches Its Limits
The LLM Non-Determinism Problem
The central challenge for applying formal verification to AI agent trust policies is that the agents being governed by these policies are non-deterministic. Unlike a database transaction or a network protocol, an LLM's behavior cannot be completely specified by a finite automaton or a deterministic transition system.
Given the same input, an LLM may produce different outputs on different invocations. Its behavior is sensitive to temperature settings, random seeds, model versions, and subtle differences in prompt formatting. You cannot write a complete formal specification of what an LLM will do—only what it should not do.
This means formal verification can address trust policies (the governance layer around agents) but cannot directly verify trust behaviors (what agents actually do). The distinction is important:
What formal verification can verify:
- That the trust gating protocol correctly permits or denies agent actions based on trust scores
- That capability grants are consistent with tier specifications
- That policy compositions don't create contradictions
- That score update algorithms produce correct results given specific inputs
- That audit log integrity guarantees hold
- That the evaluation framework correctly computes composite scores from individual dimension scores
What formal verification cannot verify:
- That an LLM agent will behave honestly in a given context
- That an agent's outputs will satisfy a content policy
- That an agent will correctly interpret instructions
- That an agent won't find creative ways to circumvent behavioral pacts
This is not a failure of formal verification—it's a fundamental limitation of what can be formally specified for stochastic systems. The response is to be clear about what you're verifying: you're verifying the governance infrastructure around agents, not the agents themselves.
State Space Explosion in Production Systems
Production trust systems have state spaces vastly larger than what model checkers can exhaustively explore. Armalo's trust scoring system considers 12 dimensions with continuous scores, thousands of agents, millions of evaluation events, and complex temporal dependencies. TLA+'s state space for a realistic model would be intractably large.
Practical approaches to manage state space explosion:
Abstraction: Replace concrete score values with abstract categories (LOW, MEDIUM, HIGH, CRITICAL). Replace continuous time with discrete ticks. Replace specific agent identities with representative agent types. The verified abstract model provides guarantees about the real system to the extent that the abstraction faithfully represents it; a small code sketch of this approach appears after this list of strategies.
Compositional verification: Verify each policy component independently, then verify that the composition of independently-verified components satisfies global properties. This works when components have well-defined interfaces and limited interactions—not always true for tightly-coupled trust systems.
Bounded model checking: Verify that properties hold for all execution traces up to a fixed length (e.g., all traces of at most 20 steps). This doesn't provide unbounded correctness guarantees but often catches bugs that manifest within typical execution lengths.
Parameterized model families: Verify properties for small models (e.g., 3 agents, 5 score levels) and argue by structural induction that the property holds for arbitrary model sizes. This requires careful analysis of whether the small model captures all the structural properties of larger models.
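To make the abstraction strategy concrete, here is a miniature model check in Python: concrete 0-1000 scores are collapsed into four abstract bands, and a gating decision function is checked exhaustively over the resulting 16-state space. The band names and decision rules are illustrative, not Armalo's.
from itertools import product

BANDS = ["CRITICAL", "LOW", "MEDIUM", "HIGH"]     # abstract score categories
STATUSES = ["active", "restricted", "suspended"]

def abstract_status(band: str, violations: int) -> str:
    """Gating decision over abstract score bands instead of raw 0-1000 scores."""
    if band == "CRITICAL":
        return "suspended"
    if violations > 0 or band in ("LOW", "MEDIUM"):
        return "restricted"
    return "active"                                # HIGH band, no violations

def check_safety_invariant() -> None:
    """Exhaustively verify: nobody is 'active' unless band is HIGH with zero violations."""
    for band, violations in product(BANDS, range(0, 4)):
        status = abstract_status(band, violations)
        assert status in STATUSES
        assert not (status == "active" and (band != "HIGH" or violations > 0)), \
            f"invariant violated at band={band}, violations={violations}"

check_safety_invariant()                           # 16 abstract states cover the policy surface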
The Specification Gap: Formalizing Informal Requirements
A subtler limitation is the difficulty of translating informal requirements into formal specifications. "Agents should behave safely" is not a formal specification. Even more precise informal requirements—"agents should not access data outside their authorized scope"—require careful formalization that may itself be incomplete or incorrect.
The process of formalization often reveals that informal requirements were incomplete or ambiguous. This is valuable—it forces precision that was previously hidden. But it also means that the formal specification may not capture all the properties that stakeholders actually care about.
For AI agent trust policies, common specification gaps include:
Implicit liveness requirements: Stakeholders assume that legitimate agents will eventually get to act, but this liveness requirement is often not stated explicitly. If it's not in the specification, the model checker won't verify it.
Missing threat model: Specifications often don't account for adversarial behavior—agents or evaluators that actively try to manipulate trust scores. A specification verified under the assumption of honest participants may be violated by adversarial participants.
Cross-cutting concerns: Properties like "all agent actions are auditable" cut across many parts of the system. They're easy to forget in a focused specification of the trust gating protocol.
Temporal consistency: Requirements may conflict across time horizons—a policy that maximizes short-term security may create perverse incentives that undermine long-term trust development.
Integrating Formal Verification with Operational Trust Systems
The Verification Hierarchy
In practice, formal verification integrates with operational trust systems through a hierarchy:
Level 1: Core algorithm verification — Verify that the trust scoring formula produces correct results. This is the most tractable target: the formula is deterministic, can be expressed as a pure function, and has clear correctness criteria. Armalo's composite score formula (12 dimensions with specified weights, time decay, and jury outlier trimming) can be verified to within floating-point rounding.
Level 2: Protocol verification — Verify that the trust gating protocol maintains safety and liveness properties across all reachable states. This is the TLA+ target described above. Tractable for protocol logic, but requires abstraction away from concrete score values.
Level 3: Policy composition verification — Verify that enterprise policy extensions don't create contradictions with base platform policies. This is the Alloy target described above. Important for multi-tenant deployments.
Level 4: Implementation verification — Verify that production code implements verified specifications. This is the Coq target, requiring either extracted code or comprehensive property-based testing suites generated from verified specifications.
Runtime Policy Enforcement as Executable Specification
One practical pattern bridges formal verification and production systems: treating the formal specification as an executable policy enforcer. Tools like Open Policy Agent (OPA) implement policies in a language (Rego) that is amenable to formal analysis.
OPA policy for trust gating:
package armalo.trust.gating
# Default: deny
default permit_action = false
# Permit action if agent passes all trust gates
permit_action {
agent_passes_score_gate
agent_has_no_violations
agent_has_required_tier
}
agent_passes_score_gate {
input.agent.composite_score >= data.config.gate_threshold
}
agent_has_no_violations {
count(input.agent.active_violations) == 0
}
agent_has_required_tier {
required_tier := data.capability_tier_requirements[input.requested_capability]
input.agent.tier >= required_tier
}
# Compute capability permission details for audit
capability_details := {
"composite_score": input.agent.composite_score,
"required_score": data.config.gate_threshold,
"score_gap": data.config.gate_threshold - input.agent.composite_score,
"violations": count(input.agent.active_violations),
"tier": input.agent.tier,
"required_tier": data.capability_tier_requirements[input.requested_capability],
"decision": permit_action,
"decision_timestamp": time.now_ns()
}
OPA policies are:
- Declarative: express what should be true, not how to compute it
- Testable: OPA has built-in support for unit testing policies
- Formally analyzable: Rego's declarative semantics make policies amenable to static analysis, and tooling such as Fregot supports systematic evaluation and debugging of policy behavior
- Auditable: all decisions produce structured audit logs with complete reasoning
The OPA policy serves simultaneously as the executable enforcement mechanism and as a formal specification that can be analyzed for consistency and completeness.
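In a deployment, services typically evaluate this policy by querying an OPA sidecar over its REST data API. A minimal client sketch follows; the endpoint path is derived from the package name above, while the host, port, and input field names are assumptions about your deployment rather than a fixed Armalo contract.
import requests

OPA_URL = "http://localhost:8181/v1/data/armalo/trust/gating/permit_action"

def is_permitted(agent: dict, capability: str) -> bool:
    """Ask the OPA sidecar whether this agent may exercise the requested capability."""
    payload = {
        "input": {
            "agent": agent,                       # composite_score, active_violations, tier
            "requested_capability": capability,
        }
    }
    resp = requests.post(OPA_URL, json=payload, timeout=2)
    resp.raise_for_status()
    # OPA returns {"result": <value>}; treat an undefined decision as a denial.
    return resp.json().get("result") is True

# Example:
# is_permitted({"composite_score": 810, "active_violations": [], "tier": 3},
#              "access_sensitive_apis")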
Continuous Verification in CI/CD
Formal verification fits naturally into continuous integration pipelines for trust policy changes:
#.github/workflows/policy-verify.yml
name: Trust Policy Formal Verification
on:
pull_request:
paths:
- 'policies/**'
- 'trust-gating/**'
jobs:
tla-check:
name: TLA+ Model Checking
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Install TLC
run: |
wget https://github.com/tlaplus/tlaplus/releases/download/v1.8.0/tla2tools.jar
echo "TLC_JAR=$(pwd)/tla2tools.jar" >> $GITHUB_ENV
- name: Run TLC on trust gating model
run: |
java -jar $TLC_JAR -config policies/TrustGating.cfg policies/TrustGating.tla
alloy-check:
name: Alloy Consistency Check
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run Alloy CLI
run: |
# assumes the Alloy 6 CLI jar (alloy6.jar) is vendored in the repo or restored by a cache step
java -jar alloy6.jar -cmd check policies/PolicyComposition.als
opa-test:
name: OPA Policy Tests
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Install OPA
run: curl -L -o opa https://openpolicyagent.org/downloads/latest/opa_linux_amd64_static
- name: Run OPA unit tests
run: opa test policies/
- name: Verify coverage
run: opa test --coverage --threshold 95 policies/
property-test:
name: Property-Based Testing Against Formal Spec
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run Hypothesis tests
run: |
pip install hypothesis
python -m pytest tests/test_trust_gating_properties.py -v
This pipeline verifies trust policy changes at multiple levels before they reach production: TLA+ checks protocol properties, Alloy checks policy composition consistency, OPA tests verify the executable policy, and property-based tests verify the production implementation against the formal specification.
Practical Recommendations for Trust Policy Verification
Where to Start
Organizations approaching trust policy verification for the first time should start with the highest-leverage, most tractable target: the core trust scoring formula.
The composite score formula is deterministic, expressible as a pure function, and has clear correctness criteria (dimension weights sum to 100%, scores stay within 0-1000, time decay is monotonically applied). Property-based testing can verify this formula exhaustively against hand-crafted reference implementations in a few days of work.
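As a sketch of what that looks like with Hypothesis (the scoring function and its equal weights are stand-ins, not Armalo's published formula), generate arbitrary dimension scores and assert the structural correctness criteria directly:
from hypothesis import given, strategies as st

WEIGHTS = [1.0 / 12] * 12        # hypothetical equal weights over the 12 dimensions

def composite_score(dimension_scores: list[float]) -> float:
    """Reference composite: weighted sum of 12 dimension scores, clamped to 0-1000."""
    assert abs(sum(WEIGHTS) - 1.0) < 1e-9          # weights sum to 100%
    raw = sum(w * s for w, s in zip(WEIGHTS, dimension_scores))
    return min(1000.0, max(0.0, raw))              # clamp against rounding drift

dimension_lists = st.lists(
    st.floats(min_value=0, max_value=1000, allow_nan=False), min_size=12, max_size=12
)

@given(dimension_lists)
def test_composite_score_stays_in_range(dims):
    assert 0 <= composite_score(dims) <= 1000

@given(dimension_lists, st.integers(min_value=0, max_value=11))
def test_composite_score_is_monotone_in_each_dimension(dims, i):
    bumped = dims.copy()
    bumped[i] = min(1000.0, bumped[i] + 10)
    assert composite_score(bumped) >= composite_score(dims)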
Once the core formula is verified, move to protocol verification with TLA+. Start with a simplified model—two agents, three status values, five score levels—and verify basic safety properties. Expand the model complexity as you gain confidence in the specification.
For organizations with enterprise policy extensions, Alloy-based composition checking is immediately valuable. Run Alloy on any proposed policy extension before deployment to catch contradictions automatically.
The Trust Oracle as Verified Kernel
Armalo's trust oracle API provides a natural integration point for verified trust policies. The oracle's decision—"this agent is trusted at level X for capability Y"—can be produced by a verified policy kernel that has been formally specified and checked.
Integrating verification with Armalo's trust oracle:
interface TrustOracleDecision {
agentDid: string;
requestedCapability: string;
decision: 'permit' | 'restrict' | 'deny';
compositeScore: number;
tier: number;
activeViolations: number;
decisionBasis: {
policyVersion: string;
policyHash: string; // Hash of verified policy specification
verificationArtifacts: { // Proof artifacts from verification
tlcModelCheckHash: string;
alloyCheckHash: string;
propertyTestCoverage: number;
};
};
timestamp: string;
signature: string; // Ed25519 over decision + basis
}
By including the policy hash and verification artifacts in the trust oracle's response, downstream systems can verify not just the decision but that the decision was made by a verified policy.
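A downstream consumer can make that check explicit. The sketch below (Python, using PyNaCl) rejects any decision that does not cite a policy version whose verification artifacts have been reviewed; the pinned hash set, key distribution, and the assumption that the oracle signs a canonical JSON serialization of the decision minus its signature field are all assumptions to confirm against the actual API.
import json
from nacl.signing import VerifyKey
from nacl.exceptions import BadSignatureError

# Policy versions whose TLC / Alloy / property-test artifacts we have reviewed (placeholder value).
VERIFIED_POLICY_HASHES = {"sha256:3f9a..."}

def accept_decision(decision: dict, oracle_pubkey: bytes) -> bool:
    """Accept a trust oracle decision only if it cites a verified policy and its
    Ed25519 signature over the decision payload checks out."""
    if decision["decisionBasis"]["policyHash"] not in VERIFIED_POLICY_HASHES:
        return False
    # Assumed canonicalization: sorted-key JSON of the decision without the signature field.
    signed_payload = json.dumps(
        {k: v for k, v in decision.items() if k != "signature"},
        sort_keys=True, separators=(",", ":"),
    ).encode()
    try:
        VerifyKey(oracle_pubkey).verify(signed_payload, bytes.fromhex(decision["signature"]))
    except BadSignatureError:
        return False
    return True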
Documentation and Audit Trail
Formal verification artifacts should be maintained as first-class documentation:
- Specification documents: TLA+/Alloy/Coq specifications in version control alongside production code
- Verification reports: TLC/Alloy/Coq output logs showing which properties were verified
- Assumption documentation: Explicit documentation of what was abstracted away and why
- Coverage reports: For property-based tests, coverage reports showing which regions of the state space were exercised
- Change analysis: When policies change, a documented analysis of which verified properties may be affected
For EU AI Act compliance (Article 9, risk management systems; Annex IV, technical documentation), formal verification artifacts can serve as evidence that the trust management system was systematically analyzed for correctness—a stronger claim than "we tested it."
Conclusion
Formal verification offers AI agent trust policy designers something that testing cannot: mathematical proof that policies are correct across all reachable states, not just the states that test suites happen to explore. TLA+ and SPIN verify temporal protocol properties—safety and liveness across all state transitions. Alloy verifies structural consistency—that policies don't contradict each other when composed. Coq provides the highest assurance—machine-checked proofs that implementations match formal specifications.
These techniques have real limitations for AI agent systems. They cannot verify LLM behavior directly, because LLMs are non-deterministic and their behavior space is infinite. State space explosion limits exhaustive verification to abstract models of production systems. And the specification gap means that formal verification is only as good as the formal specification—which may itself be incomplete.
But these limitations don't eliminate the value of formal verification. They define where it's applicable: at the governance infrastructure level, verifying that the policies and protocols that govern agents are internally consistent, compositionally sound, and correctly implemented. For trust gating protocols, capability tier systems, and score computation algorithms, formal verification provides assurances that behavioral testing alone cannot achieve.
For AI agent deployments where security, compliance, and trust are not negotiable—financial services, healthcare, critical infrastructure—formal verification of trust policies should be part of the standard engineering toolkit. The cost is significant: formal methods expertise is scarce, specification effort is substantial, and the tooling is less accessible than unit testing frameworks. But for systems where the cost of a policy error could be measured in data breaches, compliance violations, or financial losses, the investment is justified.
Armalo's platform is designed with this verification stack in mind: the trust oracle enforces formally-specified policies, the composite scoring formula is designed for mathematical specification, and the behavioral pact system provides the structured behavioral commitments that formal specifications can reason about. As formal verification tooling matures and becomes more accessible, the gap between "we designed the policy correctly" and "we proved the policy is correct" will narrow—and AI agent governance will be stronger for it.
Armalo provides the trust infrastructure for AI agent deployments, including formally-specifiable behavioral pacts, adversarial evaluation frameworks, and verifiable composite trust scoring. Organizations building production AI agent systems can query Armalo's trust oracle API at /api/v1/trust/ to access agent trust decisions backed by verified evaluation evidence.