A procurement officer at a Fortune 1000 enterprise evaluating an AI agent vendor in 2026 faces a familiar checklist: SOC 2 Type II report. ISO 27001 certification. SOC 2 for the cloud provider hosting the agent's infrastructure. Information security management documentation. Data Processing Agreement under GDPR. Business Associate Agreement under HIPAA if applicable. Vendor financial stability disclosures.
All of this is appropriate. All of it is insufficient. None of it addresses the question the procurement officer actually has: does the agent itself behave as claimed?
This is the procurement gap. The existing IT procurement frameworks were designed for infrastructure (SOC 2 audits the cloud provider's controls) and for software vendors (ISO 27001 audits the development organization's information security program). They do not audit the behavioral characteristics of an AI agent: does it pass evals at the rate it claims? Has it accumulated a bond that scales with its transaction values? Does its pact specify performance commitments that are enforceable? Is its audit trail sufficient to reconstruct a transaction post-hoc when something goes wrong?
This paper closes the gap. We specify a 10-point procurement framework that maps behavioral-trust requirements to verifiable artifacts. We run the framework against Armalo's live platform metrics to demonstrate its operability. We compare with existing IT procurement frameworks and identify the structural reasons they fail to address behavioral trust. We argue that enterprise AI agent procurement is a new category — not a subset of software procurement — and that procurement officers, procurement councils, and standards bodies should treat it as such.
Why the Question Is Underdiscussed
Three reasons keep behavioral procurement frameworks off the procurement officer's desk.
The first is the inherited assumption that AI agents are software products and therefore subject to software-procurement standards. This assumption is plausible at first glance — agents are deployed via SaaS APIs, hosted on cloud infrastructure, and integrated through familiar mechanisms — but it is wrong at the level of what is being procured. Procuring software is buying access to a deterministic artifact whose behavior is bounded by its specification; procuring an AI agent is buying access to a stochastic actor whose behavior must be characterized empirically. The procurement frameworks for the first do not address the verification needs of the second.
The second is the lack of an empirical reference standard. SOC 2 has a published trust services criteria document (TSP 100). ISO 27001 has an Annex A control catalog. FedRAMP has a Security Assessment Framework. AI agent procurement has no equivalent. The Cloud Security Alliance, NIST, and various industry consortia have begun work on AI-specific procurement guidance, but as of 2026 there is no canonical framework. Procurement officers therefore default to the nearest pre-existing standard (SOC 2 + ISO 27001) and supplement with ad-hoc behavioral questions, which produces inconsistent and incomplete due diligence.
The third is vendor-side resistance. AI agent vendors have weak incentives to volunteer behavioral disclosures: the disclosures expose performance variance, incident history, and structural limitations that a vendor would rather negotiate selectively than publish to all procurement officers. The disclosure equilibrium (analyzed in our companion paper on Strategic Agent Transparency) leans toward concealment in the absence of buyer-side coercion.
We argue the procurement gap is closeable now. The artifacts exist: eval results, pacts, bonds, audit logs, and federation credentials are all generated by mature trust platforms. What is missing is the procurement-officer-side question set that maps requirements to artifacts. This paper provides that question set.
Related Work
Six frameworks inform the behavioral procurement model.
SOC 2 (AICPA, Service Organization Control 2, since 2010). A controls-and-evidence framework focused on five trust service criteria: security, availability, processing integrity, confidentiality, and privacy. A Type II report attests that the controls have operated effectively over a sustained period (typically 6-12 months). SOC 2 is the dominant infrastructure-procurement standard in B2B SaaS. Its structural strength is the emphasis on operating effectiveness over time, not just point-in-time existence. Its limitation for AI agent procurement: the trust service criteria do not address agent behavior, only the infrastructure protecting that behavior.
ISO 27001 (ISO, Information Security Management Systems, since 2005, updated 2022). A management-system standard requiring documented information security management, risk assessment, and continuous improvement. ISO 27001 certification requires third-party audit against the Annex A control catalog. The structural strength is the management-system focus: ISO 27001 audits the organization's capability to maintain security, not just the current state. The limitation for AI agent procurement: same as SOC 2. The behavioral dimension is not addressed.
FedRAMP (Federal Risk and Authorization Management Program, US, since 2011). A standardized approach to security assessment, authorization, and continuous monitoring for cloud products used by US federal agencies. FedRAMP uses NIST 800-53 controls as its baseline. The structural strength is the continuous monitoring requirement: vendors must maintain authorization through ongoing assessment, not just at initial authorization. The limitation: FedRAMP is infrastructure-focused; behavioral trust is out of scope.
HITRUST CSF (Health Information Trust Alliance Common Security Framework). A control framework specifically for healthcare. Incorporates HIPAA, NIST, ISO 27001, PCI-DSS, and others. The structural strength is the integration of multiple standards under one assessment. The limitation: domain-specific to healthcare; the AI agent behavioral dimension is partly addressed in HITRUST AI v1.0 (2024) but the framework is in early adoption.
NIST AI Risk Management Framework (NIST, AI RMF 1.0, 2023). The most relevant pre-existing framework for AI procurement. Defines four functions (Govern, Map, Measure, Manage) and seven characteristics of trustworthy AI (validity and reliability, safety, security, accountability and transparency, explainability and interpretability, privacy, fairness with mitigation of harmful bias). The structural strength is the explicit focus on AI-specific risks. The limitation: NIST AI RMF is a risk-management framework, not a procurement framework. It does not specify the artifacts a buyer should require from a vendor to verify the characteristics.
Bayesian persuasion (Kamenica and Gentzkow 2011), unraveling theorem (Milgrom 1981, Grossman and Hart 1980). The economic theory of voluntary disclosure. The unraveling theorem predicts that, in equilibrium, agents with above-average quality disclose, and the disclosure threshold ratchets down until only the lowest-quality agents conceal. The implication for procurement: requiring full disclosure forces the equilibrium to high-quality. Buyer-side coercion is the structural lever that produces market-wide behavioral transparency.
The Framework
We specify ten procurement requirements, each mapped to a verifiable artifact and a structural property of agent behavior.
Requirement 1: Trust Artifact Freshness
Trust artifacts (tier credential, score history, bond balance, eval results) must be current.
Verification: Trust artifacts must have been refreshed within a freshness window appropriate to the transaction value. For high-value transactions ($10K+), require artifacts ≤7 days old. For medium-value transactions, ≤30 days. For low-value transactions, ≤90 days. Verify via the artifact's timestamp and the issuer's signature freshness.
Structural property: Agent behavior drifts over time. A platinum credential from 18 months ago is weak evidence about current behavior. Freshness windows force the issuing platform (Armalo or equivalent) to revalidate.
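The freshness check above can be automated directly from the artifact's timestamp. A minimal sketch follows; the $10K high-value threshold comes from the text, while the $1K medium/low boundary and the function shape are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

# Freshness windows keyed to transaction value (Requirement 1).
# The $10K high-value threshold is from the text; the $1K medium/low
# boundary is an illustrative assumption.
FRESHNESS_WINDOWS = [
    (10_000, timedelta(days=7)),   # high-value: artifacts <= 7 days old
    (1_000,  timedelta(days=30)),  # medium-value: <= 30 days
    (0,      timedelta(days=90)),  # low-value: <= 90 days
]

def is_fresh(artifact_timestamp, transaction_value, now=None):
    """True if a trust artifact is within the freshness window for this value."""
    now = now or datetime.now(timezone.utc)
    for value_floor, window in FRESHNESS_WINDOWS:
        if transaction_value >= value_floor:
            return now - artifact_timestamp <= window
    return False
```

In practice the same check would also verify the issuer's signature over the timestamp, per the verification step.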
Requirement 2: Eval Coverage Versus Use Case
The agent's eval history must cover the categories of work the buyer intends to procure.
Verification: Request the agent's eval coverage report — a categorized summary of evals passed by category (e.g., reliability, safety, accuracy on specific tasks, prompt-injection resistance, scope-adherence). Compare against a use-case-specific coverage requirement that the buyer establishes (e.g., for a healthcare summarization agent: minimum 10 evals each in clinical-accuracy, PHI-handling, and refusal-on-ambiguity, all at ≥80% pass).
Structural property: An agent that scores platinum on average may have zero evals in the buyer's specific use case. Coverage-by-category is the question that prevents misleading averages.
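The coverage-by-category comparison above reduces to a gap check between the vendor's report and the buyer's requirement. A sketch, assuming an illustrative report format (the text does not fix one):

```python
# Coverage-versus-use-case check (Requirement 2). The report and
# requirement shapes are illustrative assumptions, not a platform API.
def coverage_gaps(coverage_report, requirements):
    """Return categories where the agent's eval history misses the buyer's bar.

    coverage_report: {category: {"evals": int, "pass_rate": float}}
    requirements:    {category: {"min_evals": int, "min_pass_rate": float}}
    """
    gaps = []
    for category, req in requirements.items():
        have = coverage_report.get(category, {"evals": 0, "pass_rate": 0.0})
        if have["evals"] < req["min_evals"] or have["pass_rate"] < req["min_pass_rate"]:
            gaps.append(category)
    return gaps
```

For the healthcare summarization example, the buyer would pass minimums of 10 evals and an 80% pass rate for clinical-accuracy, PHI-handling, and refusal-on-ambiguity; any category returned is a disqualifying gap or a remediation item.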
Requirement 3: Bond Size Versus Transaction Value
The agent's bond must be proportionate to the expected transaction value.
Verification: Bond ratio = transaction value / bond size. Require bond ratio ≤ 1.5 for low-stakes work, ≤ 1.0 for medium-stakes, and ≤ 0.5 for high-stakes. An agent expected to handle $5,000 transactions should therefore hold a bond of at least $10,000 (high-stakes), $5,000 (medium-stakes), or $3,333 (low-stakes). Below these floors, the economics tip toward defection on individual transactions.
Structural property: The bond is the skin-in-the-game guarantee. Without proportionate bonds, the agent's incentive to honor the transaction is weak.
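The ratio rule is simple enough to compute inline during due diligence. A sketch, with the three stakes thresholds taken from the verification step and the helper shape assumed:

```python
# Bond-ratio floors from Requirement 3 (ratio = transaction value / bond size).
# The three stakes thresholds are from the text; the helper shape is a sketch.
MAX_RATIO = {"low": 1.5, "medium": 1.0, "high": 0.5}

def required_bond(transaction_value, stakes):
    """Minimum bond implied by the maximum allowed ratio for this stakes level."""
    return transaction_value / MAX_RATIO[stakes]

def bond_sufficient(transaction_value, bond_size, stakes):
    """True if the posted bond meets the ratio floor for this stakes level."""
    return bond_size > 0 and transaction_value / bond_size <= MAX_RATIO[stakes]
```

Note that the same $5,000 bond that satisfies a medium-stakes transaction fails the high-stakes floor, which is why the bond requirement must be re-checked whenever the expected transaction value changes.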
Requirement 4: Jury Panel Diversity
The agent's eval results must have been adjudicated by jury panels with sufficient diversity.
Verification: Request the jury composition for the eval suite — specifically, the number of distinct LLM providers (OpenAI, Anthropic, Google, etc.) represented on the panel and the consensus rate across the panel. Require minimum panel diversity (≥3 providers) and report the consensus rate so the buyer can assess judgment reliability.
Structural property: Single-provider jury panels exhibit correlated errors and shared blind spots. Multi-provider panels with stated consensus rates give the buyer empirical confidence in the eval verdict.
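The panel-diversity and consensus checks above can be summarized per eval. A sketch, assuming a per-verdict record format that the text does not specify:

```python
# Jury-panel diversity summary (Requirement 4): count distinct providers
# and report the panel's consensus rate. The verdict record format is assumed.
def panel_summary(verdicts, min_providers=3):
    """verdicts: [{"provider": str, "verdict": str}, ...] for one eval."""
    providers = {v["provider"] for v in verdicts}
    consensus = 0.0
    if verdicts:
        counts = {}
        for v in verdicts:
            counts[v["verdict"]] = counts.get(v["verdict"], 0) + 1
        consensus = max(counts.values()) / len(verdicts)
    return {
        "distinct_providers": len(providers),
        "meets_diversity": len(providers) >= min_providers,
        "consensus_rate": round(consensus, 3),
    }
```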
Requirement 5: Attestation Source Diversity
The agent's reputation must rest on attestations from multiple distinct counterparties, not from concentrated sources.
Verification: Request the attestation-source histogram — the count of distinct counterparties contributing to the agent's reputation. A platinum agent with 50 attestations from 50 distinct counterparties is structurally different from a platinum agent with 50 attestations from 5 counterparties.
Structural property: Concentrated attestations are vulnerable to collusion (analyzed in the Collusion Topology research). Diverse attestations are robust evidence of cross-context performance.
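The attestation-source histogram reduces to a counting exercise once the counterparty identifiers are available. A sketch; any acceptance thresholds on the outputs are a buyer-side choice, not part of the text:

```python
from collections import Counter

# Attestation-source histogram (Requirement 5). Distinct-source count plus
# the top source's share as a simple concentration signal.
def attestation_diversity(sources):
    """sources: one counterparty identifier per attestation."""
    histogram = Counter(sources)
    total = sum(histogram.values())
    top_share = max(histogram.values()) / total if total else 0.0
    return {"total": total,
            "distinct": len(histogram),
            "top_source_share": round(top_share, 3)}
```

The two platinum agents from the verification step are immediately distinguishable: 50 attestations from 50 counterparties yields a top-source share of 2%, while 50 from 5 counterparties yields 20%.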
Requirement 6: Historical Incident Response Time
Where the agent has failed historically, the response must have been fast and documented.
Verification: Request the incident log — the list of severe failure events (pact breach, jury-flagged misbehavior, escrow dispute) and the time-to-response for each. Require maximum time-to-response thresholds appropriate to use case (e.g., ≤4 hours for high-stakes, ≤48 hours for medium-stakes).
Structural property: Past response time is the best available predictor of future response time; agents that responded promptly to past failures are the most likely to respond promptly to future ones.
Requirement 7: Pact Specificity
The agent's pact (the behavioral commitments it has published) must specify enforceable performance criteria, not vague aspirations.
Verification: Read the pact. Score it on specificity dimensions: are performance commitments quantitative (latency thresholds, accuracy bounds, refusal-on-ambiguity protocols)? Are escalation paths specified? Is the dispute-resolution process referenced? Vague commitments ("we will do our best to provide accurate results") are not enforceable; specific commitments ("median latency ≤500ms, accuracy on benchmark X ≥85%, escalation to human review on ambiguity score ≥0.6") are.
Structural property: Enforceability of pacts is the foundation of trust. A vague pact provides no recourse on failure.
Requirement 8: Audit Trail Completeness
The agent's audit trail must be sufficient to reconstruct any transaction post-hoc.
Verification: Request a sample audit trail for a representative transaction. Verify that the trail includes: input hash, model identity, output hash, timestamp, eval references, jury verdicts (if applicable), bond status, escrow status. Require that audit entries are signed and time-stamped immutably (e.g., anchored to a public chain or to a tamper-evident log).
Structural property: Audit trails are how disputes are resolved. Without complete audit trails, the procurement officer cannot remediate failures or recover damages.
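The completeness verification above is a field-presence check on the sample entry. A sketch; the field names follow the list in the verification step, but the flat-dict entry format and the "signature" key are assumptions:

```python
# Field-completeness check on a sample audit entry (Requirement 8).
# Field names follow the list in the text; the flat-dict entry format
# and the "signature" key are assumptions.
REQUIRED_FIELDS = {
    "input_hash", "model_identity", "output_hash", "timestamp",
    "eval_refs", "bond_status", "escrow_status", "signature",
}

def missing_fields(audit_entry):
    """Return the required fields absent from a sample audit-trail entry, sorted."""
    return sorted(REQUIRED_FIELDS - set(audit_entry))
```

A real check would additionally verify the signature and the immutability anchor, not just the presence of the fields.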
Requirement 9: Federation and Portability
The agent's trust credentials must be portable to other platforms.
Verification: Verify that the agent's credentials are issued as W3C Verifiable Credentials anchored to EAS attestations (or equivalent cryptographic credentials). Verify that the credential can be presented to alternate verifiers and recognized at acceptable thresholds (P_threshold ≥ 0.70 in the framework specified in our companion paper on Federated Trust).
Structural property: Vendor lock-in to a single trust platform is a procurement risk. Portable credentials preserve buyer optionality.
Requirement 10: Vendor Lock-In Cost
The cost of migrating away from the agent or its hosting platform must be quantifiable and bounded.
Verification: Request the lock-in inventory: data exports, audit trail exports, eval history exports, bond return procedures, transition assistance. Require structured exports in standard formats (JSON, CSV, signed PDF) rather than vendor-proprietary formats.
Structural property: Procurement officers measure risk over the full vendor lifecycle, including exit. Bounded lock-in cost is the structural property that prevents future hostage situations.
Live Calibration Against Armalo
We run the 10-point framework against Armalo's live platform metrics and score each criterion.
Requirement 1 (Freshness): Tier credentials are recomputed on every eval; the typical credential is ≤7 days old at the time of presentation. Bond balances are checked in real-time. Score: 9/10. (Minor gap: not all derived artifacts have explicit freshness timestamps in the API response; this is an addressable display issue.)
Requirement 2 (Eval coverage versus use case): 1,240 evals across 132 agents, with 8,060 eval_checks. Coverage is categorized internally, but a public coverage-versus-use-case report is not yet exposed. Buyers can request coverage detail via the agents API but the report format is not standardized. Score: 6/10. (Gap: standardize the coverage-by-category report as a first-class artifact.)
Requirement 3 (Bond size versus transaction value): Bonds at platinum tier range $1,000-$2,000 USDC. Median observed transaction value on the platform is in the $500-$2,000 range; the bond ratio is approximately 0.5-2.0 across recent transactions, with a long right tail. The ratio is acceptable for low-to-medium-stakes work but insufficient for high-stakes ($10K+). Score: 6/10. (Gap: high-stakes transactions require bond top-ups; the platform should surface the required bond as the transaction value crosses thresholds.)
Requirement 4 (Jury panel diversity): 7,063 jury_judgments with 43.2% consensus rate and mean panel variance 1,753.6. The platform's jury infrastructure spans multiple providers; the consensus rate and variance are published. Score: 8/10. (Gap: per-eval panel composition is not directly visible to buyers; a per-eval jury-detail endpoint would close this.)
Requirement 5 (Attestation source diversity): Of 132 agents, 28 distinct organizations. Per-agent attestation diversity varies; some platinum agents have concentrated attestations and others diverse ones. The platform does not currently publish per-agent attestation histograms. Score: 5/10. (Gap: expose per-agent attestation-source histogram via the public agent API.)
Requirement 6 (Historical incident response time): Audit log has 86,405 entries. Incident events are logged, but a standardized incident-response-time report is not yet a first-class artifact. Time-to-response for known incidents has been in the ≤4-hour range, but this is not the published metric. Score: 5/10. (Gap: introduce per-agent incident-history report with response-time metrics.)
Requirement 7 (Pact specificity): 71 pacts across the platform. Pact templates encourage specificity but adoption varies; a substantial fraction of pacts include quantitative performance commitments, but a non-trivial fraction are vague. Score: 6/10. (Gap: introduce a pact-quality score that buyers can use to filter agents; agents with vague pacts receive a lower score, which incentivizes pact tightening.)
Requirement 8 (Audit trail completeness): 86,405 audit_log entries. The audit trail is comprehensive for mutating operations; it includes actor, action, resource, and timestamps. Cryptographic anchoring to public chains is partial (EAS-anchored for some attestations, not all audit events). Score: 7/10. (Gap: extend EAS anchoring to all audit events; provide audit-trail export in signed format.)
Requirement 9 (Federation and portability): The protocol described in our companion paper is specified but not yet broadly deployed. EAS attestations are in use for Proof-of-Satisfaction; the W3C VC envelope and federated-recognition framework are in implementation. Score: 5/10. (Gap: ship the federation protocol with a launch partner.)
Requirement 10 (Vendor lock-in cost): Data exports are available; audit trail exports are available; the agent's organization can leave the platform with their data. Bond return procedures are documented but not all automated. Score: 7/10. (Gap: automate bond return on exit; provide standardized exit-assistance docs.)
Aggregate Armalo score: 64/100, or 6.4/10. A usable baseline. Improvements clustered around three areas: (a) exposing per-agent diversity and incident metrics as first-class artifacts, (b) tightening pact specificity, (c) shipping the federation protocol. None of the gaps are structural; all are operational and addressable in a 3-6 month engineering cycle.
The exercise demonstrates the framework's practical use: it converts qualitative judgments about agent vendors into quantitative scores that procurement officers can require, compare across vendors, and re-evaluate over time.
Sensitivity Analysis
Three sensitivity dimensions reshape the procurement framework.
Use-case sensitivity. The framework weights are use-case-dependent. A low-stakes content generation use case may de-prioritize bond size (Requirement 3) and emphasize eval coverage (Requirement 2). A high-stakes financial use case requires the opposite. Procurement officers should publish their use-case-specific weights as part of the RFP, so vendors compete on the dimensions that matter to the buyer.
Maturity sensitivity. Early-stage AI agent vendors will score lower across many dimensions simply because their platforms are younger; not because their agents are worse. The framework should distinguish structural deficiencies (e.g., no pact at all) from maturity gaps (e.g., audit trail not yet EAS-anchored). Procurement officers should require remediation plans for maturity gaps rather than disqualifying on them.
Regulatory sensitivity. Domain-specific regulations (HIPAA for healthcare, GDPR for EU personal data, SOX for financial reporting) add requirements beyond the 10-point framework. Procurement officers should overlay the framework on the regulatory baseline rather than treat them as substitutes.
Adversarial Adaptation
Three adversarial responses procurement officers should anticipate.
Adaptation 1: Selective disclosure. Vendors will disclose strong dimensions and conceal weak ones. Procurement officers should require complete disclosure across all ten dimensions, not selective disclosure. The unraveling theorem predicts that, once buyer-side coercion is in place, even concealing vendors will be forced to disclose because concealment itself becomes informative.
Adaptation 2: Metric optimization without behavior change. Vendors may optimize their numbers (eval pass rates, bond sizes, attestation counts) without changing underlying behavior. The framework's defense is cross-dimension validation: an agent that has optimized its eval pass rate without improving real-world behavior will show degraded scores on jury consensus, incident response, and bond ratio. The 10-point framework is robust because it samples behavior from multiple angles.
Adaptation 3: Captured certification. Vendors may seek certification from auditors with lax standards. Defense: procurement officers should require evidence beyond certifications — direct artifact verification (read the pact, query the audit trail, inspect the jury panel composition). Certifications are necessary but not sufficient.
Cross-Platform Comparison Framework
Five reference frameworks compared.
SOC 2 (B2B SaaS infrastructure). Strong on operating effectiveness over time, weak on behavioral characteristics. SOC 2 audits whether the vendor's claimed controls operated effectively; it does not audit whether the AI agent's claimed behavior occurred. Procurement officers should require SOC 2 for the agent's infrastructure and the behavioral framework for the agent itself.
ISO 27001 (general information security management). Strong on management-system maturity, weak on per-product behavioral verification. Similar to SOC 2: necessary for infrastructure-trust, insufficient for behavioral-trust.
FedRAMP (US federal cloud authorization). Strong on continuous monitoring, weak on AI-specific risks. FedRAMP's continuous-monitoring approach is the structural model for behavioral procurement: ongoing assessment, not point-in-time. The behavioral framework should adopt the FedRAMP continuous-monitoring posture.
HITRUST CSF (healthcare). Strong on cross-standard integration, in early adoption for AI-specific extensions. HITRUST AI v1.0 (2024) is the first attempt at a behavioral-trust framework for healthcare AI; the 10-point framework specified here can complement HITRUST AI for healthcare-specific procurement.
NIST AI RMF (US, voluntary). Strong on risk-management structure, weak on procurement-officer-facing artifact requirements. NIST AI RMF specifies risks to manage; the 10-point framework specifies artifacts to require. The two are complementary: AI RMF is the vendor-side risk management framework, the 10-point is the buyer-side procurement framework.
Implications
Six implications follow.
1. AI agent procurement is a new category. The framework specifies what enterprise procurement officers should require beyond SOC 2 and ISO 27001. The category is the behavioral dimension; the artifacts are the eval, pact, bond, jury, attestation, audit, federation, and lock-in records.
2. Procurement is the activation lever for behavioral trust. Vendors disclose what buyers require. The 10-point framework converts buyer expectations into specific RFP requirements, which forces vendors to surface the artifacts. The unraveling theorem predicts that, once 10-20% of procurement officers require the framework, all vendors will disclose.
3. The framework should be scored, not just checked. A binary check (artifact present or absent) loses information. A 1-10 score on each dimension preserves comparability across vendors and tracks improvement over time. The framework should produce a 100-point aggregate score for direct comparison.
4. Use-case weighting is essential. The framework's weights vary by use case. Procurement officers should publish their use-case-specific weights in the RFP. Vendors compete on the dimensions that matter to the buyer.
5. The framework is computable from public artifacts. Each of the 10 requirements maps to verifiable artifacts that the trust platform (Armalo or equivalent) can expose via API. Procurement officers can automate the assessment with a procurement-evaluation script that scores vendors directly from the artifacts.
6. Continuous procurement, not point-in-time. The behavioral characteristics drift over time. Procurement should be continuous: vendors are re-scored quarterly or on material events (incidents, eval drift, bond changes). Procurement contracts should specify the re-scoring cadence and the consequences of score degradation.
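Implications 3 through 5 combine naturally into a small scorer: per-dimension 1-10 scores, buyer-published use-case weights, and a 100-point aggregate. A sketch, where the dimension names are shorthand for Requirements 1-10 and everything else is an illustrative assumption:

```python
# Weighted 100-point aggregate over the ten 1-10 requirement scores
# (implications 3-5). Dimension names are shorthand for Requirements 1-10.
DIMENSIONS = [
    "freshness", "eval_coverage", "bond_ratio", "jury_diversity",
    "attestation_diversity", "incident_response", "pact_specificity",
    "audit_completeness", "federation", "lock_in",
]

def aggregate_score(scores, weights=None):
    """Weighted aggregate, normalized so equal weights with all-10s yield 100."""
    weights = weights or {d: 1.0 for d in DIMENSIONS}
    total_weight = sum(weights[d] for d in DIMENSIONS)
    weighted = sum(scores[d] * weights[d] for d in DIMENSIONS)
    return round(10 * weighted / total_weight, 1)
```

With equal weights and the per-requirement scores from the Armalo calibration section (9, 6, 6, 8, 5, 5, 6, 7, 5, 7), this reproduces the 64/100 aggregate; publishing non-uniform weights in the RFP shifts the ranking toward the dimensions that matter to the buyer.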
Limitations and Open Questions
Weight calibration across use cases. The framework's weights are use-case-specific and currently lack empirical calibration. We expect industry coalitions (procurement councils, sector-specific working groups) to converge on use-case templates over the next 2-3 years.
Scoring rubric subjectivity. Several dimensions (especially pact specificity, Requirement 7) require human judgment in the scoring. Automated scoring of pact quality is a research problem; current state of the art is keyword-matching plus LLM-evaluated specificity rubrics, with reliability in the 70-85% range.
Vendor cooperation. Some vendors will refuse to expose the underlying artifacts. The procurement officer's response should be to require the artifacts or de-list the vendor. Vendor-side resistance is the structural obstacle; buyer-side coercion is the response.
Cross-platform comparability. The framework assumes the artifacts are produced consistently across trust platforms. Without standardization, an Armalo eval is not directly comparable to a competitor's eval. Federation and ontology mappings (described in our companion papers) are the cross-platform-comparability infrastructure.
Audit cost. Comprehensive procurement evaluation is expensive — perhaps 20-40 hours of procurement-officer time per vendor in the first cycle, declining as automation improves. Smaller buyers may not have the bandwidth. Standardized procurement tooling (the procurement-evaluation script we describe) reduces the per-vendor cost over time.
Conclusion
The procurement gap for AI agents is real, large, and immediately closeable. The existing IT procurement frameworks (SOC 2, ISO 27001, FedRAMP) address infrastructure trust adequately but do not address behavioral trust. The 10-point framework specified here closes the gap: trust artifact freshness, eval coverage versus use case, bond size versus transaction value, jury panel diversity, attestation source diversity, historical incident response time, pact specificity, audit trail completeness, federation and portability, and vendor lock-in cost. Each requirement maps to a verifiable artifact that mature trust platforms already produce.
Run against Armalo's live platform metrics, the framework yields a 6.4/10 baseline score — usable but improvable, with the gaps concentrated in operational artifact exposure rather than structural defects. The exercise demonstrates the framework's value: it converts qualitative procurement intuition into quantitative comparison across vendors, repeatable over time and across use cases.
The activation lever is procurement officers themselves. The disclosure equilibrium tips toward transparency once buyers require it in RFPs. Industry coalitions, procurement councils, and standards bodies should converge on the framework in the next 2-3 years; individual procurement officers can adopt it now.
AI agent procurement is a new category. It deserves a procurement framework of its own. This paper specifies it.