AI Agent Registry Trust Scoring: Verifying Third-Party Agent Packages Before Deployment
Agent marketplaces and registries need trust scoring at the package level. This guide covers publisher identity verification, behavioral evaluation, security scan history, vulnerability disclosure records, user telemetry, scoring algorithm design, registry governance, and comparisons with npm audit and PyPI Safety.
When npm's audit feature shipped with npm 6 in 2018, it transformed how developers thought about package security. For the first time, running npm audit gave developers immediate visibility into known vulnerabilities in their dependency trees — severity ratings, CVE identifiers, affected versions, and (when available) remediation paths. Because audits ran automatically on install, adoption was near-universal within months. The concept was simple: the registry has information about packages that individual developers lack, and surfacing that information at the point of installation creates a leverage point for supply chain security at scale.
The AI agent ecosystem is at the equivalent of npm circa 2017 — a rapidly growing registry ecosystem with essentially no trust scoring infrastructure. Agent marketplaces and registries are multiplying: platform-specific stores (ChatGPT Actions, Copilot Extensions, Claude Tools), open-source registries (LangChain Hub, Semantic Kernel Skills), and enterprise-focused marketplaces. Collectively, they host thousands of agent packages — tool definitions, skill bundles, workflow templates, and pre-built agent configurations — with essentially no systematic security evaluation, no publisher identity verification stronger than email, and no mechanism for users to understand the security posture of what they are deploying.
The consequences of this gap are predictable: poorly-secured agents are deployed into production environments with high-privilege access, supply chain attacks targeting registry-distributed agents are largely undetectable, and organizations have no consistent framework for comparing the security posture of competing agent packages.
This document provides a comprehensive treatment of trust scoring for AI agent registries — what dimensions should be scored, how to design the scoring algorithm, how existing package registry security compares, and what governance frameworks are necessary for trust scoring to be credible and useful.
TL;DR
- AI agent registries need trust scoring that goes beyond CVE scanning to address AI-specific risks: behavioral reliability, prompt injection resistance, data access scope, and supply chain provenance.
- Publisher identity verification in AI agent registries is significantly weaker than in mature software registries — most require only email verification, compared to npm's requirement for 2FA on high-download packages.
- Trust scoring dimensions for agent packages: publisher identity verification, behavioral evaluation results, security scan history, update frequency and vulnerability disclosure record, user telemetry signals, and supply chain integrity.
- Scoring algorithms must be transparent, independently verifiable, and resistant to gaming through sybil attacks and fake review manipulation.
- Governance frameworks for registry trust scoring require: independent evaluation bodies, transparent scoring methodologies, appeal processes, and periodic re-scoring.
- Armalo's trust oracle provides a registry-compatible trust scoring API that can be embedded in any agent registry or marketplace, enabling consistent trust signal distribution across the fragmented registry ecosystem.
The Current State of AI Agent Registry Security
To understand what a trust scoring system for AI agent registries should look like, it is useful to first characterize the current state of security in existing registries.
ChatGPT Actions (OpenAI GPT Store)
OpenAI's GPT Store — the primary distribution platform for GPT-4-based agents and tool integrations — has the following security infrastructure as of 2026:
Publisher Verification: Publishers are required to have an OpenAI account in good standing. Account creation requires email verification and credit card registration (for billing purposes). There is no technical security assessment of publishers as a condition of listing.
Content Review: OpenAI has a review process for GPT submissions that screens for policy violations (harmful content, privacy policy requirements, etc.). The review does not include: security testing of the tool's code, verification of data handling claims, or assessment of prompt injection resistance.
User Ratings: The Store includes a user rating system. User ratings are not a reliable security signal — users rate usability and usefulness, not security posture, and ratings can be inflated through coordinated campaigns.
Vulnerability Disclosure: There is no published vulnerability disclosure process for GPT Actions. If a security researcher discovers a vulnerability in a published GPT, the disclosure path is unclear.
HuggingFace Hub
HuggingFace Hub is the dominant distribution platform for open-source models and datasets. Its security infrastructure:
Malware Scanning: HuggingFace has implemented automated scanning of uploaded model files for known malware patterns, primarily targeting pickle-format models. This scanning was introduced after public demonstrations showed that pickle deserialization can execute arbitrary code at model load time.
Access Controls: Models can be gated (requiring user agreement to terms) or public. Private organizations can host models in private namespaces. There is no publisher identity verification beyond email confirmation.
Safety Evaluation: Some models include safety evaluation results in their model cards, but this is entirely voluntary and unverified — anyone can write anything in a model card.
Community Flags: Users can flag models for policy violations. This is a community moderation mechanism, not a security evaluation.
npm Registry (Comparison Baseline)
The npm registry — the most security-mature package registry at scale — provides a useful comparison:
Publisher 2FA: Beginning in 2022, npm began enforcing 2FA for maintainers of its highest-impact packages — initially the top 100, expanding to the top 500 by downloads. This significantly reduces the risk of account takeover for high-impact packages.
Automated Malware Scanning: npm's security team runs automated scanning for known malware patterns, obfuscated code, and suspicious install hooks.
npm Audit: Vulnerability intelligence from multiple databases is aggregated and surfaced through npm audit, providing automated vulnerability scanning at install time.
High-Impact Package Monitoring: Widely-deployed packages receive enhanced monitoring and faster security response from npm's security team.
OSSF Scorecard Integration: npm packages can display OpenSSF (Open Source Security Foundation) Scorecard scores — automated assessment of security best practices in source repositories.
Security Policy Support: The ecosystem convention of a SECURITY.md file in a package's source repository (surfaced by GitHub) helps security researchers identify the correct disclosure path for package vulnerabilities.
The contrast is stark: npm has substantially more security infrastructure than any AI agent registry, and even npm has significant gaps. The AI agent registry ecosystem is starting from a much lower baseline.
Trust Scoring Dimensions for AI Agent Packages
A comprehensive trust scoring system for AI agent packages must address both traditional software supply chain risks and AI-specific risks. The following six dimensions capture the complete trust picture.
Dimension 1: Publisher Identity Verification (Weight: 15%)
Publisher identity verification establishes that the entity publishing an agent package is who they claim to be and has accountability for what they publish.
Level 0 — No Verification (Score: 0-2): Any account can publish under any publisher name. No identity verification beyond email.
Level 1 — Email Verification (Score: 3-4): Publisher's email domain matches claimed organization. DNS verification of claimed domain.
Level 2 — Organization Verification (Score: 5-6): Organization identity verified through one of: legal entity registration confirmation, LinkedIn company page verification, active GitHub organization with sustained history, or DUNS number matching.
Level 3 — Key Material Control (Score: 7-8): Publisher demonstrates control of published artifacts through code signing. Either: Sigstore keyless signing with verified OIDC identity (e.g., GitHub Actions identity matching the publisher's GitHub organization) or traditional code signing certificate issued to the organization.
Level 4 — Extended Verification (Score: 9-10): Extended verification equivalent to CA/Browser Forum's EV certificate requirements: verified legal entity name, verified physical address, verified telephone, verified authorization of certificate request. This level provides the highest assurance of publisher identity and accountability.
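These levels map mechanically onto the dimension score. A minimal sketch in Python, assuming the midpoint of each level's score band (the enum names and midpoints are illustrative, not a normative API):

from enum import IntEnum

class VerificationLevel(IntEnum):
    NONE = 0           # email-only account
    EMAIL_DOMAIN = 1   # DNS-verified email domain
    ORGANIZATION = 2   # verified legal entity or organization presence
    KEY_CONTROL = 3    # signed artifacts (Sigstore or code-signing cert)
    EXTENDED = 4       # EV-equivalent identity verification

# Midpoint of each level's score band from the rubric above.
LEVEL_SCORES = {0: 1.0, 1: 3.5, 2: 5.5, 3: 7.5, 4: 9.5}

def publisher_identity_score(level: VerificationLevel) -> float:
    """Map a verification level to its 0-10 dimension score."""
    return LEVEL_SCORES[int(level)]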
Dimension 2: Behavioral Evaluation Results (Weight: 30%)
Behavioral evaluation is the most important and most AI-specific dimension. It assesses how the agent package actually behaves, not just what it claims to do.
Behavioral Evaluation Categories:
Reliability Testing (Sub-score 1): Does the agent package perform its stated function reliably? Test on a battery of representative inputs appropriate to the package's declared domain. Measure: success rate, error rate, error handling quality.
Prompt Injection Resistance (Sub-score 2): Is the agent package resistant to prompt injection attacks? Test with a standardized suite of injection attempts: direct injection, indirect injection via tool outputs, nested injection through multi-hop tool chains.
Data Access Scope Adherence (Sub-score 3): Does the agent package access only the data it declares? Test with instrumented environments that can observe all data access attempts.
Behavioral Consistency (Sub-score 4): Does the agent package behave consistently across repeated calls? High variance in outputs given identical inputs may indicate non-determinism, stochastic behavior, or external dependency on untrusted state.
Instruction Following Precision (Sub-score 5): Does the agent package follow instructions precisely, or does it exhibit scope creep (doing more than asked) or scope limitation (doing less than asked)?
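These five sub-scores roll up into the dimension score. A minimal sketch, assuming equal sub-score weighting (a production rubric might weight prompt injection resistance more heavily):

def behavioral_evaluation_score(reliability: float,
                                injection_resistance: float,
                                data_scope_adherence: float,
                                consistency: float,
                                instruction_precision: float) -> float:
    """Combine the five behavioral sub-scores (each 0-10) into one
    0-10 dimension score, using equal weights for this sketch."""
    subscores = (reliability, injection_resistance, data_scope_adherence,
                 consistency, instruction_precision)
    return sum(subscores) / len(subscores)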
Evaluation Cadence:
Initial evaluation before listing, with re-evaluation triggered by:
- Package version update
- 90-day elapsed time since last evaluation
- User behavioral anomaly reports
- Supply chain change (new model version, dependency update)
Dimension 3: Security Scan History (Weight: 20%)
Security scanning history documents the results of automated security analysis applied to the package over time.
Static Analysis Results: Automated static analysis of package code (or declared API endpoints) for common vulnerability patterns: injection vulnerabilities, authentication bypasses, insecure cryptography, hardcoded credentials.
Dependency Vulnerability History: CVE history for the package's runtime dependencies. Track:
- Number of known CVEs in current version's dependencies
- Severity distribution (critical, high, medium, low)
- Time to remediation after CVE disclosure (important signal for vendor responsiveness)
- Whether dependency updates are automated or manual
Container/Infrastructure Scan Results: For packages that include containerized components, container image scan results from tools like Trivy or Grype.
History vs. Point-in-Time: A package with zero current CVEs but a history of slow CVE remediation (multiple critical CVEs sitting for >90 days before patching) should score lower than a package with comparable CVE counts but faster remediation — the remediation history predicts future behavior.
Scoring for Security Scan History:
| Condition | Score Impact |
|---|---|
| Zero critical CVEs in current version | Baseline |
| High CVEs present, unpatched >30 days | -2 points |
| Critical CVEs present, unpatched >7 days | -4 points |
| Critical CVEs patched within 24 hours historically | +1 point |
| Automated dependency updates with Dependabot/Renovate | +1 point |
| SBOM available and current | +1 point |
| No SBOM available | -1 point |
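The table translates directly into code. A sketch, assuming a baseline of 8.0 and clamping to the 0-10 range (both assumptions of this illustration):

def security_scan_history_score(high_cve_unpatched_30d: bool,
                                critical_cve_unpatched_7d: bool,
                                fast_critical_patch_history: bool,
                                automated_dep_updates: bool,
                                sbom_available: bool) -> float:
    """Apply the score impacts from the table above to an assumed
    baseline of 8.0, clamped to the 0-10 range."""
    score = 8.0
    if high_cve_unpatched_30d:
        score -= 2     # high CVEs unpatched >30 days
    if critical_cve_unpatched_7d:
        score -= 4     # critical CVEs unpatched >7 days
    if fast_critical_patch_history:
        score += 1     # criticals historically patched within 24 hours
    if automated_dep_updates:
        score += 1     # Dependabot/Renovate in use
    score += 1 if sbom_available else -1
    return max(0.0, min(10.0, score))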
Dimension 4: Update Frequency and Vulnerability Disclosure Record (Weight: 15%)
The pattern of updates and security disclosures reveals the vendor's security maturity and responsiveness.
Update Frequency Analysis:
Active maintenance is a prerequisite for secure packages. A package with no updates in 12+ months is likely not receiving security patches. However, extremely high update frequency (many releases per day) may indicate instability or obfuscated changes. The optimal pattern is: regular scheduled releases (weekly or monthly) plus rapid security patches when needed.
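One way to operationalize this pattern is a simple cadence classifier; the staleness and churn thresholds below are illustrative assumptions, not a normative standard:

from datetime import datetime, timedelta

def update_cadence_signal(release_dates: list[datetime]) -> str:
    """Classify release cadence per the pattern above. Thresholds
    (12 months stale, >~3 releases/day churn) are assumptions."""
    now = datetime.now()
    if not release_dates or now - max(release_dates) > timedelta(days=365):
        return "stale"       # likely not receiving security patches
    last_month = [d for d in release_dates if now - d < timedelta(days=30)]
    if len(last_month) > 90:
        return "churning"    # possible instability or obfuscated changes
    return "healthy"         # regular releases plus room for rapid patches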
Vulnerability Disclosure Program Quality:
Does the vendor have a published vulnerability disclosure program (VDP)? What does it include?
- Contact information for security disclosures
- Response time commitments (SLA for acknowledging reports)
- Remediation time commitments
- Scope definition (what is in/out of scope for disclosure)
- Researcher acknowledgment policy
- Bug bounty program (optional, significant positive signal)
Historical Disclosure Response:
For packages with publicly disclosed vulnerabilities:
- Time from discovery/disclosure to patch publication
- Quality of post-patch communication (did the vendor accurately describe the vulnerability and impact?)
- Did the vendor credit the researcher?
- Was there any evidence of attempting to conceal vulnerabilities?
Dimension 5: User Telemetry Signals (Weight: 10%)
Aggregated telemetry from deployments of the agent package provides signals about real-world behavioral reliability and security that synthetic evaluation cannot fully capture.
Behavioral Anomaly Reports: Aggregate reports from deploying organizations of unexpected behaviors — behaviors that deviate from documented functionality. High anomaly report rates indicate either inconsistent behavior or undocumented functionality (potentially including malicious functionality).
Error Rate Telemetry: With user consent, aggregate error rates and types from deployments. Unexpectedly high error rates on specific input categories may indicate behavioral anomalies.
Version Adoption Rate: How quickly do existing users upgrade to new versions? Slow adoption of security updates is a signal that the update experience is problematic (breakage, migration complexity), which may discourage timely security patching.
Geographic Usage Pattern: Unusual geographic concentration of usage may indicate that a package is primarily used in contexts with lower security scrutiny.
Anti-Gaming Measures for Telemetry Signals:
Telemetry signals are susceptible to manipulation. Mitigation measures:
- Only count telemetry from verified, non-sybil deploying organizations
- Weight telemetry by deploying organization's trust score
- Apply statistical anomaly detection to flag unusual telemetry patterns
- Conduct random audits of reported telemetry data
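A sketch of the second measure — weighting telemetry by the reporting organization's trust score — assuming each report record carries org_trust, is_verified_org, and anomalous fields populated upstream by sybil screening:

def weighted_anomaly_rate(reports: list[dict]) -> float:
    """Compute an anomaly rate where each verified organization's
    report is weighted by that organization's own trust score (0-10)."""
    credible = [r for r in reports if r["is_verified_org"]]
    total = sum(r["org_trust"] for r in credible)
    if total == 0:
        return 0.0
    flagged = sum(r["org_trust"] for r in credible if r["anomalous"])
    return flagged / total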
Dimension 6: Supply Chain Integrity (Weight: 10%)
Supply chain integrity captures the security of the process by which the package was built and distributed.
SLSA Level: What SLSA level does the package's build pipeline achieve? Higher SLSA levels indicate stronger guarantees about the provenance of the package.
Artifact Signing: Are release artifacts (package files, container images) cryptographically signed? Are signatures verifiable through a public transparency log (Sigstore Rekor)?
SBOM Availability: Does the package include a current, complete SBOM? Is it machine-readable (CycloneDX or SPDX)?
Reproducible Builds: Can the package be reproducibly built from source? Reproducible builds allow independent verification that the distributed package matches the published source.
Training Data Provenance (for AI model packages): For packages that include AI model components, is training data provenance documented with cryptographic integrity verification?
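As with the other dimensions, these checks can be folded into a 0-10 sub-score; the point allocation below is an assumption of this sketch, not a published standard:

def supply_chain_integrity_score(slsa_level: int,
                                 artifacts_signed: bool,
                                 transparency_logged: bool,
                                 sbom_machine_readable: bool,
                                 reproducible_build: bool) -> float:
    """Score supply chain integrity 0-10. slsa_level follows SLSA v1.0
    (0-3); the per-check point values are illustrative assumptions."""
    score = slsa_level * 1.5                      # up to 4.5 for SLSA Build L3
    score += 1.5 if artifacts_signed else 0.0
    score += 1.0 if transparency_logged else 0.0  # e.g., Sigstore Rekor
    score += 1.5 if sbom_machine_readable else 0.0
    score += 1.5 if reproducible_build else 0.0
    return min(10.0, score)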
Scoring Algorithm Design
Combining six dimensions into a single trust score requires a carefully designed algorithm that is transparent, manipulation-resistant, and interpretable.
Weighted Composite Score
The base trust score is a weighted composite of dimension scores:
Trust Score = (0.15 × Publisher Identity) +
(0.30 × Behavioral Evaluation) +
(0.20 × Security Scan History) +
(0.15 × Update/Disclosure Record) +
(0.10 × User Telemetry) +
(0.10 × Supply Chain Integrity)
Each dimension is scored 0–10. The composite score ranges 0–10.
Non-Linear Penalties for Critical Failures
Certain conditions should have non-linear, disproportionate impact on the overall score:
Hard Blocks (Score set to 0 regardless of other dimensions):
- Confirmed malware or deliberate malicious behavior
- Confirmed data exfiltration
- Actively exploited critical vulnerability (unpatched)
- Revoked signing certificate or key compromise
Critical Penalties (-30% reduction to final score):
- Unpatched critical CVE for >30 days
- No behavioral evaluation performed
- No publisher identity verification beyond email
- Evidence of manipulation of reported telemetry
Significant Penalties (-15% reduction):
- No SBOM available
- No vulnerability disclosure program
- Last update >12 months ago
- High dependency CVE count
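Putting the composite weights and penalty tiers together; multiplicative stacking of repeated penalties is an assumption of this sketch:

def composite_trust_score(dimensions: dict[str, float],
                          hard_block: bool = False,
                          critical_penalties: int = 0,
                          significant_penalties: int = 0) -> float:
    """Weighted composite of the six dimension scores (each 0-10),
    with hard blocks and tiered penalties applied per the rules above."""
    if hard_block:
        return 0.0   # malware, exfiltration, exploited CVE, key compromise
    weights = {
        "publisher_identity": 0.15,
        "behavioral_evaluation": 0.30,
        "security_scan_history": 0.20,
        "update_disclosure_record": 0.15,
        "user_telemetry": 0.10,
        "supply_chain_integrity": 0.10,
    }
    score = sum(w * dimensions[k] for k, w in weights.items())
    score *= 0.70 ** critical_penalties       # -30% per critical penalty
    score *= 0.85 ** significant_penalties    # -15% per significant penalty
    return score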
Time Decay for Historical Signals
Security posture changes over time. Historical signals should decay to prevent old data from dominating current assessments:
from datetime import datetime

def time_weighted_score(score: float, assessment_date: datetime,
                        half_life_days: int = 90) -> float:
    """
    Apply exponential time decay to a historical security assessment score.
    A score assessed exactly half_life_days ago is discounted by 50%;
    older assessments decay further.
    """
    days_elapsed = (datetime.now() - assessment_date).days
    decay_factor = 0.5 ** (days_elapsed / half_life_days)
    return score * decay_factor
Older evaluations are weighted less heavily; the most recent comprehensive evaluation dominates the score.
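For example, a comprehensive assessment from exactly one half-life ago contributes half its original value:

from datetime import datetime, timedelta

ninety_days_ago = datetime.now() - timedelta(days=90)
print(time_weighted_score(9.0, ninety_days_ago))   # ~4.5 (one half-life)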
Confidence Intervals and Score Uncertainty
A trust score of 7.5 based on a comprehensive evaluation a week ago is qualitatively different from a trust score of 7.5 based on a minimal evaluation six months ago. Trust scores should be presented with confidence intervals that reflect:
- Age of the most recent comprehensive evaluation
- Number of evaluations performed
- Consistency of scores across evaluations
- Sample sizes for telemetry-based scores
A score with a narrow confidence interval is more actionable than a score with a wide interval.
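A crude way to attach such a band; the widening and narrowing constants below are illustrative assumptions, not a calibrated uncertainty model:

def score_confidence_interval(score: float,
                              days_since_full_eval: int,
                              n_evaluations: int) -> tuple[float, float]:
    """Return a (low, high) band around a trust score: staleness widens
    the band, repeated evaluations narrow it."""
    half_width = 0.5 + 0.01 * days_since_full_eval
    half_width /= max(1, n_evaluations) ** 0.5
    return (max(0.0, score - half_width), min(10.0, score + half_width))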
Registry Governance Frameworks
A trust scoring system is only credible if it is governed appropriately — with transparency, independence, and accountability. Without governance, scores can be gamed, conflicts of interest can corrupt assessments, and organizations will not trust scores enough to act on them.
Independence Requirements
The entity that scores packages should be independent from both the registry operator and the packages being scored. This is analogous to the independence requirements for financial auditors:
Conflicts of interest to avoid:
- A registry operator scoring packages from paying vendors more generously
- An assessment organization scoring packages from companies that are also clients for other services
- Individual evaluators scoring packages from their former employers more favorably
Independence structures:
- Separate legal entity for trust assessment vs. registry operation
- Disclosed funding sources for the assessment organization
- Required disclosures when assessors have any relationship with the package publisher
- Rotation of evaluators to prevent relationships from developing
Transparency Requirements
For trust scores to be credible, the scoring methodology must be transparent:
Published methodology: The full scoring methodology — dimensions, weights, sub-scores, penalties, and time decay functions — must be publicly documented.
Score explanations: Every trust score must come with a human-readable explanation of why the package received that score — specifically what factors contributed to each dimension score.
Appeal process: Package publishers must have a mechanism to challenge scores they believe are incorrect. The appeal process must be documented, time-bounded, and handled by parties independent from the initial assessors.
Methodology updates: Changes to the scoring methodology must be announced in advance, with a transition period. Packages should not see their scores change dramatically due to methodology changes without understanding why.
Re-scoring Triggers
Trust scores must be kept current. Automatic re-scoring should be triggered by:
- Package version update (within 24 hours)
- New CVE affecting the package or its dependencies (within 48 hours)
- User behavioral anomaly report (within 7 days if credible)
- 90 days elapsed since last full evaluation
- Supply chain change (new signing key, new build pipeline, new model base)
- Confirmed security incident at the publisher organization
Notification Requirements
Publishers should be notified:
- When their package receives its initial trust score
- When their score changes by more than 0.5 points
- When their score drops below deployment threshold values (e.g., 6.0 for standard deployment, 8.0 for high-security environments — see the gate sketch after this list)
- When a CVE affecting their package is discovered
- When a security incident is reported affecting their package
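On the consuming side, those same thresholds make a natural deployment gate. A minimal sketch using the example values above:

def deployment_gate(trust_score: float, environment: str) -> bool:
    """Gate deployment on trust score, using the example thresholds
    above: 6.0 for standard, 8.0 for high-security environments."""
    thresholds = {"standard": 6.0, "high-security": 8.0}
    return trust_score >= thresholds[environment]

assert deployment_gate(8.3, "standard")
assert not deployment_gate(7.2, "high-security")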
Comparison: npm audit, PyPI Safety, RubyGems Advisory DB
Understanding how existing package registry security systems work — and where they fall short for AI — informs the design of AI-specific trust scoring.
npm audit
npm audit aggregates vulnerability intelligence from the GitHub Advisory Database, which incorporates:
- npm's original security advisory database (absorbed after GitHub acquired npm)
- GitHub-reviewed community security advisories
- NVD (National Vulnerability Database) records
npm audit provides: CVE IDs, severity scores (CVSS), affected version ranges, patched versions, and (when available) exploit information.
What it does well: Comprehensive CVE coverage, automated at install time, severity scoring, fix recommendation.
What it misses for AI agents: Behavioral evaluation, publisher identity verification, supply chain integrity beyond code dependencies, telemetry-based signals, AI-specific risks (prompt injection, data exfiltration, behavioral inconsistency).
Score mapping: npm's severity model (info/low/moderate/high/critical) maps partially to AI agent risk but does not capture AI-specific dimensions.
PyPI Safety (PyPI Advisory Database)
The Python Packaging Advisory Database (pypa/advisory-database) — consumed by tools such as pip-audit — and the third-party safety tool provide functionality similar to npm audit for Python packages.
What it does well: Automated vulnerability detection, tooling that slots into standard pip workflows (pip-audit, safety), OpenSSF Scorecard integration for source repository assessment.
What it misses for AI agents: Same gaps as npm audit, with additional concern that the advisory database is smaller than npm's and may have slower coverage of vulnerabilities in AI-specific packages.
RubyGems Advisory Database
The Ruby ecosystem maintains an advisory database (ruby-advisory-db) similar to npm's and PyPI's, plus the bundler-audit tool for checking Gemfile.lock files against advisories.
Relevant design lesson: ruby-advisory-db is maintained as a Git repository with human-reviewed pull requests for each advisory entry. This transparency-first design (all advisories are publicly visible as Git commits) provides auditability that centralized databases lack.
For AI agent trust scoring, adopting a transparency-first design — where all scoring decisions are recorded in a publicly auditable log — would provide similar accountability.
What AI Agent Trust Scoring Needs That Traditional Systems Lack
| Capability | npm audit | PyPI Safety | AI Agent Trust Score |
|---|---|---|---|
| CVE/vulnerability scanning | Yes | Yes | Required |
| Behavioral evaluation | No | No | Required |
| Publisher identity verification | Partial | Minimal | Required |
| Prompt injection resistance | No | No | Required |
| Supply chain provenance | Partial (via OSSF) | Partial | Required |
| Training data provenance | No | No | Required for model packages |
| User telemetry signals | No | No | Valuable |
| AI-specific risk scoring | No | No | Required |
How Armalo Powers AI Agent Registry Trust Scoring
Armalo's trust oracle is purpose-built to provide the registry-level trust scoring that AI agent registries need but currently lack. The trust oracle exposes APIs that registry operators can integrate to surface Armalo's comprehensive trust scores natively within their platforms.
Trust Oracle Registry Integration
For registry operators:
# Get trust score for a specific agent package
GET /api/v1/trust/registry-package?
registry=langchain-hub&
package_id=company/enterprise-search-tool@v2.1.0
# Response includes:
{
"trustScore": 8.3,
"confidence": 0.91,
"dimensions": {
"publisherIdentity": 8.5,
"behavioralEvaluation": 8.8,
"securityScanHistory": 7.9,
"updateDisclosureRecord": 8.1,
"userTelemetry": 7.5,
"supplyChainIntegrity": 9.0
},
"scoreExplanation": "...",
"lastEvaluationDate": "2026-05-01T00:00:00Z",
"nextScheduledEvaluation": "2026-07-31T00:00:00Z",
"deploymentGuidance": {
"recommendedFor": ["standard", "enhanced-monitoring"],
"notRecommendedFor": ["high-security-isolated"],
"conditions": []
}
}
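A registry backend might consume this endpoint as follows; the base URL, auth header, and error handling are assumptions of this sketch rather than documented client behavior:

import requests

ORACLE_BASE = "https://oracle.armalo.example"   # assumed base URL

def fetch_trust_score(registry: str, package_id: str) -> dict:
    """Fetch a package's trust score from the trust oracle API."""
    resp = requests.get(
        f"{ORACLE_BASE}/api/v1/trust/registry-package",
        params={"registry": registry, "package_id": package_id},
        headers={"Authorization": "Bearer <API_KEY>"},  # assumed auth scheme
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

score = fetch_trust_score("langchain-hub",
                          "company/enterprise-search-tool@v2.1.0")
if score["trustScore"] < 6.0:
    print("Below standard deployment threshold:", score["scoreExplanation"])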
Behavioral Pacts as Registry Trust Infrastructure
Package publishers who register on Armalo and commit their packages to behavioral pacts gain a verified trust advantage in the registry ecosystem. A behavioral pact provides:
- Explicit commitments about what the package does and does not do
- Cryptographic binding of the commitments to the publisher's identity
- Ongoing monitoring that verifies pact adherence
- Public record of the pact that consumers can inspect
Packages with active, monitored behavioral pacts receive a "pact-verified" trust indicator in Armalo-integrated registries — a trust signal that no CVE scanner can provide.
The Network Effect of Cross-Registry Trust
A key advantage of a centralized trust oracle like Armalo's is the network effect: trust signals generated from one registry's usage inform scores in all registries. A package that exhibits behavioral anomalies in one deployment context generates a signal that affects its trust score across all registries where it appears. This cross-registry intelligence is not possible with siloed registry-specific scoring systems.
Conclusion: The Registry Security Infrastructure AI Needs
The AI agent registry ecosystem is at an inflection point. The number of agent packages is growing exponentially. The regulatory environment — EU AI Act, US state AI legislation, emerging sector-specific AI requirements — is creating compliance pressure for verifiable security assurance. And the consequences of deploying an insecure agent package are increasing as agents gain access to more sensitive systems and take more consequential actions.
The registry security infrastructure that the AI ecosystem needs is not technically complex — it builds on well-understood principles from traditional software supply chain security, extended for AI-specific risks. What is required is organizational will: registry operators who invest in trust scoring infrastructure, package publishers who participate in behavioral evaluation, and consuming organizations that use trust scores as deployment decision gates rather than advisory information.
The organizations that build or adopt this infrastructure first will define the standards that others follow. Those standards — what dimensions are scored, how scores are computed, what governance ensures credibility — will shape the security posture of the entire AI agent ecosystem for years to come.
The time to build this infrastructure is before the first major registry-level AI supply chain incident, not after it. npm audit, PyPI Safety, and the RubyGems advisory database were all built reactively — in response to incidents that had already damaged the ecosystem. The AI agent ecosystem has the opportunity to build this infrastructure proactively. The question is whether the ecosystem will take it.
Build trust into your agents
Register an agent, define behavioral pacts, and earn verifiable trust scores that unlock marketplace access.
Based in Singapore? See our MAS AI governance compliance resources →