Certificate-Based Identity for AI Agents: mTLS Rotation Patterns in Multi-Agent Systems
Mutual TLS is the gold standard for agent-to-agent authentication. This guide covers certificate lifecycle management with cert-manager, SPIFFE/SPIRE workload identity, short-lived SVIDs, certificate pinning trade-offs, and service mesh integration for mTLS in multi-agent architectures.
API keys can be stolen and replayed. OAuth tokens can be intercepted. JWTs can be forged if signing keys are compromised. Mutual TLS (mTLS) is the most robust authentication mechanism available for AI agent communication because it provides bidirectional cryptographic proof of identity that cannot be replayed, cannot be forged without the corresponding private key, and is protected by the TLS session's forward secrecy properties.
But mTLS's security properties come with operational complexity that scales with the number of agents, the frequency of agent deployments, and the duration of agent sessions. A 1,000-agent fleet with 90-day certificate lifetimes means hundreds of certificate renewals per month, each requiring coordination between the certificate authority, the agent runtime, and any services that pin the agent's certificate. The SPIFFE/SPIRE workload identity framework exists precisely to solve this operational problem — by replacing long-lived certificates with continuously auto-renewed short-lived credentials, it eliminates the rotation problem for mTLS in production agent systems.
This guide covers the complete architecture for certificate-based identity in multi-agent systems, from the conceptual foundations of mTLS to the operational details of SPIFFE/SPIRE deployment, service mesh integration, and the specific rotation patterns that work for AI agent workloads.
TL;DR
- Mutual TLS provides the strongest available authentication for agent-to-agent communication: both parties prove identity cryptographically, replay attacks are prevented by TLS session binding, and forward secrecy ensures past sessions can't be decrypted if keys are later compromised.
- SPIFFE/SPIRE is the recommended implementation for production agent systems: it continuously issues short-lived SVIDs (SPIFFE Verifiable Identity Documents) to agents based on their workload identity, eliminating manual rotation entirely.
- cert-manager (Kubernetes) automates certificate lifecycle management for non-SPIFFE environments, including automatic renewal before expiry and integration with external CAs (Let's Encrypt, Vault, AWS PCA).
- Certificate pinning in agent-to-agent communication creates operational brittleness that typically outweighs the security benefits — prefer CA-based trust with short-lived certificates over certificate pinning.
- Service meshes (Istio, Linkerd) can provide mTLS transparently to agents without requiring any code changes — the sidecar proxy handles certificate management and mTLS termination.
- Armalo's trust scoring includes certificate hygiene metrics derived from agents' declared SPIFFE workload identities, rewarding agents that use short-lived certificates over those with multi-month certificate lifetimes.
Why mTLS Provides Superior Identity for AI Agents
Standard TLS (one-way) authenticates the server to the client but not the client to the server. The client verifies the server's certificate, but the server doesn't verify the client's identity. In one-way TLS, the only thing preventing a malicious client from connecting to your agent API is application-level authentication (API keys, OAuth tokens, etc.).
Mutual TLS adds client authentication at the transport layer: the client presents a certificate during the TLS handshake, and the server validates it against a trusted CA. This provides security guarantees that application-layer authentication cannot match:
Binding to private key: The client's certificate is only useful to an entity that possesses the corresponding private key. Unlike API keys (which are simply secret strings that anyone can use), mTLS certificates require the private key to prove possession. Stolen certificates without the private key are useless.
TLS session binding: The authentication proof is bound to the specific TLS session. A captured mTLS authentication packet cannot be replayed in a different session — the TLS handshake prevents it.
Transport encryption: mTLS provides confidentiality (encrypted data) and integrity (tamper detection) at the transport layer, independent of any application-level encryption.
No credential transmission: In password or API key authentication, the credential is transmitted over the network (even over TLS, the credential value traverses the network). In mTLS, the private key never leaves the client — the authentication proof is a signature, not the key itself.
For AI agent systems where agents communicate over networks that may traverse multiple cloud regions, third-party services, or shared infrastructure, mTLS provides defense-in-depth that application-layer authentication cannot replicate.
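As a minimal sketch of transport-layer client authentication, the Go standard library can be told to abort the handshake unless the caller presents a certificate that chains to the fleet's CA. The function name and the file names below are hypothetical placeholders, not paths from any particular deployment.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"os"
)

// loadServerConfig builds a TLS config that requires every client to present
// a certificate chaining to a CA in clientCAFile. All paths are illustrative.
func loadServerConfig(certFile, keyFile, clientCAFile string) (*tls.Config, error) {
	serverCert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return nil, fmt.Errorf("load server key pair: %w", err)
	}
	caPEM, err := os.ReadFile(clientCAFile)
	if err != nil {
		return nil, fmt.Errorf("read client CA bundle: %w", err)
	}
	clientCAs := x509.NewCertPool()
	if !clientCAs.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("client CA bundle contains no PEM certificates")
	}
	return &tls.Config{
		Certificates: []tls.Certificate{serverCert},
		ClientCAs:    clientCAs,
		// Abort the handshake unless the client proves possession of the key
		// for a certificate issued by one of the trusted CAs.
		ClientAuth: tls.RequireAndVerifyClientCert,
		MinVersion: tls.VersionTLS12,
	}, nil
}

func main() {
	cfg, err := loadServerConfig("server.crt", "server.key", "agents-ca.pem")
	if err != nil {
		fmt.Println("cannot start mTLS listener:", err)
		return
	}
	fmt.Println("client certificates required:", cfg.ClientAuth == tls.RequireAndVerifyClientCert)
}
```

With this setting, a caller that holds only a copied certificate (and not its private key) cannot complete the handshake, which is the property the section above describes.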
SPIFFE/SPIRE: Eliminating the mTLS Rotation Problem
The traditional mTLS operational problem is certificate lifecycle management at scale. For a 1,000-agent fleet:
- How does each agent get its initial certificate?
- How are certificates renewed before expiry?
- How is the renewal coordinated across deployments?
- How are compromised certificates revoked?
- How does the CA trust chain remain consistent across cloud regions?
SPIFFE (Secure Production Identity Framework For Everyone) and its reference implementation SPIRE (SPIFFE Runtime Environment) address all of these problems with a design philosophy: solve the rotation problem by making certificates so short-lived that rotation is continuous and automatic, not a discrete event.
SPIFFE Identity Architecture
SPIFFE defines a URI-based identity format called a SPIFFE ID:
spiffe://<trust-domain>/<path>
For AI agents:
spiffe://agents.example.com/ns/production/sa/invoice-processor
spiffe://agents.example.com/cluster/us-west-2/workload/research-agent/instance/a1b2c3
Every agent has a SPIFFE ID that encodes its namespace, service account, cluster, or other workload identity attributes. This ID is embedded in the agent's SVID (SPIFFE Verifiable Identity Document) — an X.509 certificate with the SPIFFE ID in the Subject Alternative Name (URI SAN) field.
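To make the ID format concrete, here is a small sketch that splits a SPIFFE ID into trust domain and path using only the Go standard library. parseSPIFFEID is a hypothetical helper that checks the basic shape only; production code would more likely rely on the go-spiffe library's spiffeid package, which applies the full validation rules of the specification.

```go
package main

import (
	"fmt"
	"net/url"
)

// parseSPIFFEID splits a SPIFFE ID into its trust domain and path.
// It checks only the basic shape (scheme, bare host, no query or fragment);
// the SPIFFE specification has further character-set rules not enforced here.
func parseSPIFFEID(raw string) (trustDomain, path string, err error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", "", fmt.Errorf("parse %q: %w", raw, err)
	}
	if u.Scheme != "spiffe" {
		return "", "", fmt.Errorf("%q: scheme must be spiffe", raw)
	}
	if u.Hostname() == "" || u.User != nil || u.Port() != "" {
		return "", "", fmt.Errorf("%q: trust domain must be a bare host name", raw)
	}
	if u.RawQuery != "" || u.Fragment != "" {
		return "", "", fmt.Errorf("%q: query and fragment are not allowed", raw)
	}
	return u.Hostname(), u.Path, nil
}

func main() {
	td, path, err := parseSPIFFEID("spiffe://agents.example.com/ns/production/sa/invoice-processor")
	if err != nil {
		fmt.Println("invalid SPIFFE ID:", err)
		return
	}
	fmt.Println("trust domain:", td)
	fmt.Println("path:", path)
}
```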
SPIRE Architecture for Agent Systems
SPIRE consists of two components:
SPIRE Server: The certificate authority for the trust domain. It stores the trust bundle (CA certificates) and registration entries, attests SPIRE agents, and signs the SVIDs that agents request on behalf of workloads. Typically deployed as a stateful service in the control plane.
SPIRE Agent: A daemon running on each compute node (EC2 instance, Kubernetes node) that serves the SPIFFE Workload API to local workloads. The SPIRE agent handles workload attestation (verifying that a requesting workload is actually the workload it claims to be), SVID delivery (obtaining SVIDs signed by the SPIRE server and handing them to attested workloads), and SVID renewal (automatically renewing SVIDs before they expire).
Workload Attestation for AI Agents
Before issuing an SVID, the SPIRE agent must verify the workload's identity. Attestation plugins verify identity using platform-specific mechanisms:
Kubernetes attestation: The SPIRE agent queries the local kubelet to identify the pod behind a Workload API call and to derive selectors such as namespace, service account, and pod labels. This leverages Kubernetes' own identity management.
AWS attestation: For agents running on EC2 instances, the SPIRE agent proves which instance it runs on by presenting the EC2 instance identity document (a cryptographically signed JSON document served by the EC2 metadata service) to the SPIRE server. This attests the node rather than an individual workload; the resulting selectors can cover account ID, region, instance ID, and IAM role.
Docker workload attestation: For containerized agents, the SPIRE agent uses the Docker daemon's API to verify container identities.
SVID Lifetime Configuration for Agent Systems
SPIRE SVIDs have a configurable TTL. The key design decision is how short to make the TTL:
Very short-lived (1-hour SVIDs):
- Pros: If an SVID is compromised, it's invalid within 1 hour without any explicit revocation
- Cons: Renewal happens roughly every 30 minutes (SPIRE rotates an SVID at about 50% of its TTL by default); may cause connection interruptions if agents don't implement seamless SVID rotation
Short-lived (24-hour SVIDs):
- Pros: Practical balance between security and operational overhead
- Cons: A compromised SVID is valid for up to 24 hours before auto-expiry
Daily rotation (recommended for most AI agent deployments):
# SPIRE Server configuration
server {
    trust_domain = "agents.example.com"
    data_dir = "/opt/spire/data/server"
    default_jwt_svid_ttl = "1h" # JWT SVIDs expire in 1 hour
    ca_ttl = "24h" # Lifetime of the signing CA; an SVID cannot outlive its CA, so raise this if SVIDs must live a full 24 hours
    ca_subject {
        country = ["US"]
        organization = ["Example Corp AI Platform"]
        common_name = "agents.example.com"
    }
}
# Registration entry for invoice processor agents
# (illustrative notation: entries are created through the SPIRE Server API or
# the spire-server entry create command, not in the server configuration file)
registration_entry {
    spiffe_id = "spiffe://agents.example.com/ns/production/sa/invoice-processor"
    parent_id = "spiffe://agents.example.com/spire/agent/aws_iid/123456789012/us-west-2/i-abcdef123456"
    ttl = 86400 # 24-hour SVID TTL
    selectors = [
        "k8s:ns:production",
        "k8s:sa:invoice-processor",
        "k8s:pod-label:app:invoice-processor"
    ]
}
Consuming SVIDs in Agent Code
SPIRE provides the Workload API: a gRPC API, served by the SPIRE agent over a Unix domain socket, that workloads use to obtain their SVIDs. The SPIFFE Workload API specification defines the protocol; the go-spiffe library provides a Go client for it:
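// Using the go-spiffe v2 Workload API client
package agentidentity

import (
	"context"
	"crypto/tls"
	"fmt"

	"github.com/spiffe/go-spiffe/v2/spiffeid"
	"github.com/spiffe/go-spiffe/v2/spiffetls/tlsconfig"
	"github.com/spiffe/go-spiffe/v2/workloadapi"
)

func createMTLSDialer(ctx context.Context) (*tls.Config, error) {
	// Connect to the SPIRE agent's Workload API
	source, err := workloadapi.NewX509Source(
		ctx,
		workloadapi.WithClientOptions(
			workloadapi.WithAddr("unix:///run/spire/sockets/agent.sock"),
		),
	)
	if err != nil {
		return nil, fmt.Errorf("failed to create X509 source: %w", err)
	}
	// The source keeps itself up to date with renewed SVIDs and bundles.
	// No explicit renewal code required; it stays open for the life of the
	// returned config, so close it only on process shutdown.

	// Only accept servers that present the orchestrator's SPIFFE ID
	serverID := spiffeid.RequireFromString("spiffe://agents.example.com/ns/production/sa/orchestrator")
	tlsConfig := tlsconfig.MTLSClientConfig(source, source, tlsconfig.AuthorizeID(serverID))
	return tlsConfig, nil
}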
The go-spiffe library handles SVID renewal automatically: the X509Source stays subscribed to the Workload API, and when the SPIRE agent delivers a renewed SVID the source starts serving it to the TLS configuration. Active connections are not interrupted; the new certificate is used for new connections.
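The "new certificate for new connections" behavior can be illustrated with the standard library alone. The sketch below is not how go-spiffe is implemented; certStore and the file names are hypothetical, and the example simply assumes a certificate and key that some rotation process rewrites on disk. The TLS stack calls GetClientCertificate once per handshake, so swapping the stored certificate affects only connections opened afterwards.

```go
package main

import (
	"crypto/tls"
	"errors"
	"fmt"
	"sync"
)

// certStore holds the certificate to present on the next handshake.
type certStore struct {
	mu   sync.RWMutex
	cert *tls.Certificate
}

// Reload reads a renewed key pair from disk and swaps it in atomically.
func (s *certStore) Reload(certFile, keyFile string) error {
	cert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return fmt.Errorf("reload key pair: %w", err)
	}
	s.mu.Lock()
	s.cert = &cert
	s.mu.Unlock()
	return nil
}

// GetClientCertificate is called by crypto/tls for every new handshake, so
// established connections keep the certificate they were opened with.
func (s *certStore) GetClientCertificate(*tls.CertificateRequestInfo) (*tls.Certificate, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	if s.cert == nil {
		return nil, errors.New("no certificate loaded yet")
	}
	return s.cert, nil
}

func main() {
	store := &certStore{}
	cfg := &tls.Config{GetClientCertificate: store.GetClientCertificate}
	if err := store.Reload("agent.crt", "agent.key"); err != nil {
		fmt.Println("initial load failed:", err)
	}
	fmt.Println("per-handshake callback installed:", cfg.GetClientCertificate != nil)
}
```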
Trust Bundle Management and Federation
When AI agents communicate across trust domains (e.g., your platform's agents communicating with a partner organization's agents), SPIFFE trust bundle federation handles the cross-domain trust:
- Each trust domain exports its trust bundle (CA certificates) to a well-known federation endpoint
- SPIRE servers subscribe to partner trust bundles
- Agents in Domain A can establish mTLS with agents in Domain B using the federated trust bundle
For multi-cloud agent deployments, configure separate SPIRE trust domains per cloud region with federation:
spiffe://agents-us-west.example.com ← US West SPIRE domain
spiffe://agents-eu-west.example.com ← EU West SPIRE domain
spiffe://agents-ap-southeast.example.com ← AP Southeast SPIRE domain
Agents across regions trust each other's SVIDs via federated trust bundles, without requiring a single global CA.
cert-manager for Non-SPIFFE Environments
For Kubernetes environments that aren't using SPIFFE/SPIRE, cert-manager provides automated certificate lifecycle management for agent workloads.
Certificate Resources for Agent Deployments
cert-manager manages Kubernetes Certificate resources, which specify the desired certificate properties and the issuer that should sign them:
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: invoice-processor-mtls-cert
  namespace: production
spec:
  secretName: invoice-processor-mtls-cert-tls
  # Certificate validity
  duration: 24h # 24-hour certificate lifetime
  renewBefore: 8h # Renew 8 hours before expiry
  subject:
    organizations: ["Example Corp AI Platform"]
  commonName: invoice-processor.production.svc.cluster.local
  dnsNames:
    - invoice-processor.production.svc.cluster.local
    - invoice-processor.production.svc
  # URI SAN for SPIFFE-compatible identity (even without full SPIRE)
  uris:
    - "spiffe://agents.example.com/ns/production/sa/invoice-processor"
  # The agent acts as both TLS server and TLS client in mTLS
  usages:
    - server auth
    - client auth
  issuerRef:
    name: platform-ca-issuer
    kind: ClusterIssuer
    group: cert-manager.io
  privateKey:
    rotationPolicy: Always # Generate new key on every renewal
    size: 2048
    algorithm: RSA
With renewBefore: 8h on a 24-hour certificate, cert-manager renews the certificate 8 hours before expiry, providing an 8-hour buffer for renewal failures before the certificate becomes invalid.
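The timing arithmetic can be checked with a few lines of Go. renewalOffset is a hypothetical helper, not part of cert-manager; it only reproduces the subtraction described above, using the same duration strings that appear in the Certificate resource.

```go
package main

import (
	"fmt"
	"time"
)

// renewalOffset returns how long after issuance renewal begins, given the
// duration and renewBefore values of a Certificate resource.
func renewalOffset(duration, renewBefore string) (time.Duration, error) {
	lifetime, err := time.ParseDuration(duration)
	if err != nil {
		return 0, fmt.Errorf("duration: %w", err)
	}
	buffer, err := time.ParseDuration(renewBefore)
	if err != nil {
		return 0, fmt.Errorf("renewBefore: %w", err)
	}
	if buffer <= 0 || buffer >= lifetime {
		return 0, fmt.Errorf("renewBefore (%s) must be positive and shorter than duration (%s)", buffer, lifetime)
	}
	return lifetime - buffer, nil
}

func main() {
	offset, err := renewalOffset("24h", "8h")
	if err != nil {
		fmt.Println("invalid settings:", err)
		return
	}
	fmt.Printf("renewal starts %.0f hours after issuance\n", offset.Hours())
}
```

For the values in the example resource, renewal starts 16 hours after issuance, leaving the final 8 hours as the retry buffer.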
Automatic Private Key Rotation
The privateKey.rotationPolicy: Always setting causes cert-manager to generate a new private key on every certificate renewal. This is the recommended setting for agent certificates: reusing one private key across multiple certificate generations means a single key compromise affects every certificate issued for that key, not just the current one.
With Always rotation, each certificate generation uses a fresh key pair. A stolen key is useful only until the one certificate bound to it expires, because every later certificate is issued for a different key.
Integrating cert-manager with External CAs
For production agent deployments, the certificate authority should be a proper PKI infrastructure, not a self-signed CA generated by cert-manager. cert-manager supports external CA issuers:
HashiCorp Vault PKI: cert-manager's Vault issuer uses Vault's PKI secrets engine to sign certificates. This is the recommended integration for organizations already using Vault for secret management.
AWS Private Certificate Authority: The aws-privateca-issuer plugin for cert-manager uses AWS Private CA to sign certificates. For AWS-native agent deployments, this provides a fully managed CA with hardware security modules (HSMs) for key storage.
Let's Encrypt (for external-facing agents): cert-manager natively supports ACME (RFC 8555) for obtaining publicly trusted certificates. Not typically used for internal agent communication, but relevant for agents that expose public-facing APIs.
Certificate Pinning Trade-offs
Certificate pinning is the practice of hardcoding a specific certificate or public key fingerprint in the client, so the client only accepts that specific certificate from the server (not any certificate signed by a trusted CA). It provides protection against CA compromise — even if an attacker convinces a CA to issue a fraudulent certificate, the client rejects it because the fingerprint doesn't match.
For AI agent systems, certificate pinning creates operational problems that typically outweigh the security benefits:
Rotation brittleness: Every time the server's certificate is rotated (even legitimately), the pinned fingerprint in all clients is invalid. This requires a coordinated deployment: update all client configurations with the new fingerprint before rotating the server certificate.
Emergency rotation blocking: If a server's certificate is compromised and must be emergency-rotated, certificate pinning prevents the rotation from taking effect until all clients are updated. Ironically, pinning slows down exactly the scenario it should protect against.
Operational complexity at scale: For a 1,000-agent fleet, updating certificate pins requires a coordinated rollout across all agents — a significant operational burden for what is essentially a defense against a relatively rare attack (CA compromise).
Recommendation for AI agent systems: Use CA-based trust with short-lived certificates (24-hour SVIDs via SPIFFE/SPIRE, or 24-hour cert-manager certificates) rather than certificate pinning. The short certificate lifetime provides strong security guarantees without the operational overhead of pin management. The only exception: for very high-security inter-agent communication where CA compromise is a plausible threat model, implement certificate transparency logging and pin to the SPKI fingerprint (not the certificate) to allow certificate renewal without pin updates. An SPKI pin survives renewal only when the key pair is reused, so it cannot be combined with per-renewal key rotation (rotationPolicy: Always).
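The difference between pinning a certificate and pinning its public key can be shown with a short sketch. The helpers newKey, selfSigned, and spkiPin are hypothetical names introduced for illustration, and the certificates are throwaway self-signed ones; the point is only that the SPKI hash is unchanged when a certificate is reissued for the same key.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/sha256"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/base64"
	"fmt"
	"math/big"
	"time"
)

// newKey generates a throwaway P-256 key for the demonstration.
func newKey() (*ecdsa.PrivateKey, error) {
	return ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
}

// selfSigned issues a short-lived self-signed certificate for key.
func selfSigned(key *ecdsa.PrivateKey, serial int64) (*x509.Certificate, error) {
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(serial),
		Subject:      pkix.Name{CommonName: "demo-agent"},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(24 * time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// spkiPin returns the base64 SHA-256 hash of the certificate's
// SubjectPublicKeyInfo, which depends on the key but not on the certificate.
func spkiPin(cert *x509.Certificate) string {
	sum := sha256.Sum256(cert.RawSubjectPublicKeyInfo)
	return base64.StdEncoding.EncodeToString(sum[:])
}

func main() {
	key, err := newKey()
	if err != nil {
		fmt.Println("key generation failed:", err)
		return
	}
	first, err := selfSigned(key, 1)
	if err != nil {
		fmt.Println("issuance failed:", err)
		return
	}
	renewed, err := selfSigned(key, 2)
	if err != nil {
		fmt.Println("renewal failed:", err)
		return
	}
	fmt.Println("pin unchanged after renewal with the same key:", spkiPin(first) == spkiPin(renewed))
}
```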
Service Mesh Integration: Transparent mTLS
Service meshes (Istio, Linkerd) provide mTLS transparently to agent workloads — agents don't need any mTLS code in their application layer.
Istio mTLS for AI Agent Systems
Istio's sidecar proxy (Envoy) intercepts all inbound and outbound network traffic from agent pods. When two pods communicate, their Envoy sidecars establish an mTLS session using certificates obtained from Istio's certificate authority (istiod).
From the agent's perspective, network calls look exactly like plaintext HTTP — the mTLS is transparent. No certificate management code in the agent; no mTLS configuration required.
Istio PeerAuthentication policy enforces mTLS at the namespace level:
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: enforce-mtls
  namespace: production
spec:
  mtls:
    mode: STRICT # Reject all non-mTLS connections
With STRICT mode, any agent that attempts to connect without mTLS (e.g., a misconfigured agent, an attacker attempting to bypass authentication) is rejected by the receiving agent's Envoy sidecar.
Linkerd mTLS for Lightweight Agent Deployments
Linkerd provides similar transparent mTLS with lower resource overhead than Istio. Linkerd's proxies use short-lived certificates (24-hour default) that are issued by Linkerd's own identity service and rotated automatically; the longer-lived issuer certificate and trust anchor can be rotated with cert-manager.
For resource-constrained agent deployments (where Envoy's overhead is significant), Linkerd's lighter-weight proxy may be preferable. The trade-off: less configuration flexibility and fewer observability features compared to Istio.
Certificate Rotation Ceremonies for Multi-Agent Systems
When mTLS certificates need to rotate (either on schedule or in response to compromise), the rotation ceremony must be coordinated to avoid breaking active mTLS connections.
Scheduled Rotation Ceremony
- Pre-rotation verification: Confirm all agents are running and certificate expiry dates are as expected. Identify any agents with certificates expiring within the rotation window.
- New certificate provisioning: Issue new certificates for all agents. In SPIFFE/SPIRE, this is automatic. In cert-manager environments, trigger renewal via the cert-manager API or wait for automatic renewal.
- Trust bundle update: If the CA certificate is rotating (not just leaf certificates), update the trust bundle on all agents before retiring the old CA certificate. Both old and new CA certificates must be trusted simultaneously during the transition period.
- Active connection drain: New connections use new certificates; old connections complete with old certificates. In Kubernetes, this happens naturally as pods are rolled.
- Old CA retirement: After all agents have transitioned to new certificates (verified via metrics on certificate age distribution), retire the old CA certificate from the trust bundle.
Emergency Rotation Ceremony
For compromised certificates:
- Immediate revocation: Revoke the compromised certificate at the CA (add it to the CRL, or in SPIFFE, remove the registration entry so the SVID will not be renewed when it expires).
- OCSP/CRL propagation: OCSP responses and CRL updates must propagate to all agents checking revocation status. Typical propagation latency: 5-30 minutes for widely distributed agents.
- Force re-attestation: For SPIRE deployments, make the affected SPIRE agent re-attest immediately (for example, by evicting it with spire-server agent evict) so that it fetches new SVIDs rather than waiting for expiry.
- Incident investigation: Determine how the private key was compromised. The audit trail of SVID issuance events in SPIRE and certificate usage events in the service mesh provides the forensic evidence needed.
Armalo's Certificate-Based Identity for Trust Verification
Armalo registers agents' SPIFFE IDs or certificate Subject Distinguished Names in their behavioral pacts. When another agent or external platform queries Armalo's trust oracle about an agent, the response includes:
- The agent's registered SPIFFE ID (if using SPIFFE/SPIRE)
- The certificate authority trust chain for the agent's certificates
- The maximum certificate lifetime the agent uses (agents using ≤24-hour certificates receive higher security scores)
- Whether the agent's certificate rotation has been verified (Armalo's adversarial evaluations include mTLS certificate validity testing)
Enterprise buyers can verify an agent's certificate-based identity independently by checking the agent's SVID against the trust bundle published by the agent's SPIRE trust domain. This provides external, cryptographically verifiable identity verification that doesn't require trusting Armalo's database — the verification is purely cryptographic.
The trust oracle's mTLS metadata feeds the security dimension of Armalo's composite trust score. Agents using SPIFFE/SPIRE with ≤24-hour certificate lifetimes score at the top of the security dimension for certificate management. Agents using long-lived (90-day+) certificates without automated rotation receive security dimension penalties.
mTLS in Multi-Cloud and Hybrid Agent Deployments
Single-cluster mTLS is operationally straightforward — a single SPIRE trust domain or a single service mesh CA governs all certificates. Multi-cloud and hybrid deployments introduce federation challenges that require explicit architectural solutions.
The Multi-Cloud Trust Federation Problem
Consider an agent fleet split across AWS EKS and GCP GKE. The AWS-side agents have SPIFFE IDs issued by a SPIRE server running in AWS. The GCP-side agents have SPIFFE IDs issued by a SPIRE server running in GCP. An AWS agent communicating with a GCP agent must be able to verify the GCP agent's certificate — but the GCP certificate was signed by a different CA (the GCP SPIRE server's CA) than the AWS trust anchor.
SPIFFE defines a federated trust mechanism: each trust domain publishes its trust bundle (root certificates) at a well-known endpoint. Trust domains that need to communicate with each other fetch and cache each other's trust bundles.
Implementing SPIFFE federation:
# SPIRE Server configuration for AWS trust domain
server {
    # ... trust_domain, data_dir, and other server settings ...
    federation {
        bundle_endpoint {
            address = "0.0.0.0"
            port = 8443
        }
        # Federate with GCP trust domain
        federates_with "gcp.agents.example.com" {
            bundle_endpoint_url = "https://spire.gcp.example.com:8443"
            bundle_endpoint_profile "https_web" {}
        }
    }
}
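With federation configured, AWS agents can verify GCP agent certificates (and vice versa) by validating each SVID against the bundle of the trust domain named in its SPIFFE ID. No manual certificate distribution is needed; trust bundles are fetched and refreshed automatically. A workload receives a federated bundle only if its registration entry is marked as federating with that trust domain.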
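The verification rule for federated peers (pick the root set by the peer's trust domain, never by convenience) can be sketched as follows. bundleSet and its methods are hypothetical, the pools are left empty for brevity, and the AWS trust domain name aws.agents.example.com is an assumption; only gcp.agents.example.com appears in the configuration above. A real agent would fill the pools from the bundles delivered over the Workload API.

```go
package main

import (
	"crypto/x509"
	"fmt"
	"net/url"
)

// bundleSet maps a trust domain name to the root CAs published in its bundle.
type bundleSet map[string]*x509.CertPool

// newBundleSet creates empty pools for the given trust domains. A real agent
// would populate them from the local and federated bundles it receives.
func newBundleSet(trustDomains ...string) bundleSet {
	set := make(bundleSet, len(trustDomains))
	for _, td := range trustDomains {
		set[td] = x509.NewCertPool()
	}
	return set
}

// rootsFor returns the roots a peer's certificate must chain to, chosen by
// the trust domain in the peer's SPIFFE ID. Unknown domains are rejected.
func (b bundleSet) rootsFor(peerSPIFFEID string) (*x509.CertPool, error) {
	u, err := url.Parse(peerSPIFFEID)
	if err != nil || u.Scheme != "spiffe" || u.Host == "" {
		return nil, fmt.Errorf("%q is not a SPIFFE ID", peerSPIFFEID)
	}
	roots, ok := b[u.Host]
	if !ok {
		return nil, fmt.Errorf("trust domain %q is not federated", u.Host)
	}
	return roots, nil
}

func main() {
	bundles := newBundleSet("aws.agents.example.com", "gcp.agents.example.com")
	_, err := bundles.rootsFor("spiffe://gcp.agents.example.com/agent/research")
	fmt.Println("GCP peer can be verified:", err == nil)
	_, err = bundles.rootsFor("spiffe://unknown.example.org/agent/research")
	fmt.Println("unknown domain rejected:", err != nil)
}
```

Keeping one pool per trust domain means a CA from one domain can never vouch for an identity in another, which is the isolation that federation is meant to preserve.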
Hybrid On-Premises and Cloud Agent Federation
For organizations with on-premises agent infrastructure that needs to communicate with cloud-hosted agents, SPIFFE federation extends to on-premises SPIRE servers. The challenge: the on-premises SPIRE server must be reachable from the cloud for trust bundle fetching. If the on-premises server is behind a corporate firewall, this requires either:
- Outbound-only federation: The on-premises SPIRE server fetches the cloud trust bundle, but the cloud SPIRE server can't fetch the on-premises bundle. This creates asymmetric trust: on-premises agents can verify cloud agents, but not vice versa. Useful for on-premises agents that consume cloud services (verification in one direction).
- Mutual federation via managed ingress: The on-premises environment exposes the SPIRE bundle endpoint through a controlled inbound path (API gateway, reverse proxy with IP allowlisting). Both trust domains can fetch each other's bundles.
- Bundle replication: The on-premises bundle is manually replicated to the cloud environment (or via a reconciliation job). Less dynamic than automatic federation but workable when firewall rules prevent the cloud from reaching the on-premises endpoint.
Certificate Namespace Isolation in Multi-Tenant Agent Deployments
In multi-tenant environments where different tenants' agents share the same Kubernetes cluster (or the same SPIRE trust domain), certificate namespace isolation prevents one tenant's agents from impersonating another tenant's agents.
SPIFFE ID structure for multi-tenant isolation:
# Correct — tenant-namespaced SPIFFE IDs
spiffe://agents.example.com/tenant/acme-corp/agent/invoice-processor
spiffe://agents.example.com/tenant/globex/agent/invoice-processor
# Incorrect — no tenant namespace (allows impersonation)
spiffe://agents.example.com/agent/invoice-processor
When SPIFFE IDs include tenant context, the receiving agent can enforce tenant isolation at the mTLS layer: an acme-corp agent calling a globex agent's API presents a certificate with acme-corp in its SPIFFE ID. The globex service's authorization policy can reject this call (wrong tenant) before processing any request body.
This provides defense-in-depth for multi-tenant isolation — the application layer's tenant isolation checks are backed by the mTLS layer's cryptographic identity, not just JWT claims or request headers that could be spoofed.
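A minimal sketch of the tenant check, assuming the /tenant/<name>/agent/<agent> path layout shown above. tenantOf and authorizeTenant are hypothetical helpers; in a real service the peer ID must be taken from the URI SAN of the certificate verified during the handshake, never from a request header.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// tenantOf extracts the tenant name from a SPIFFE ID whose path starts with
// /tenant/<name>/. IDs without a tenant segment are rejected outright.
func tenantOf(spiffeID string) (string, error) {
	u, err := url.Parse(spiffeID)
	if err != nil || u.Scheme != "spiffe" {
		return "", fmt.Errorf("%q is not a SPIFFE ID", spiffeID)
	}
	segments := strings.Split(strings.TrimPrefix(u.Path, "/"), "/")
	if len(segments) < 2 || segments[0] != "tenant" || segments[1] == "" {
		return "", fmt.Errorf("%q carries no tenant segment", spiffeID)
	}
	return segments[1], nil
}

// authorizeTenant allows a call only when the peer belongs to the tenant
// that owns the service handling the request.
func authorizeTenant(peerSPIFFEID, servingTenant string) error {
	tenant, err := tenantOf(peerSPIFFEID)
	if err != nil {
		return err
	}
	if tenant != servingTenant {
		return fmt.Errorf("tenant %q may not call services of tenant %q", tenant, servingTenant)
	}
	return nil
}

func main() {
	peer := "spiffe://agents.example.com/tenant/acme-corp/agent/invoice-processor"
	if err := authorizeTenant(peer, "globex"); err != nil {
		fmt.Println("rejected:", err)
	}
}
```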
Certificate Revocation: CRL vs. OCSP vs. SPIFFE's Rotation-Based Model
Traditional PKI uses Certificate Revocation Lists (CRLs) or Online Certificate Status Protocol (OCSP) to propagate certificate revocation. Both have operational limitations that make them problematic for high-volume AI agent deployments.
CRL Limitations for Agent Certificates
CRLs are published periodically (typically hourly to daily). An agent that checks a cached CRL may be using one that is hours old, meaning a revoked certificate continues to be accepted until the CRL is refreshed. For a 24-hour certificate, a 1-hour CRL update lag leaves a revoked certificate usable for up to 1/24th of its total lifetime; for a 1-hour certificate, the same lag can cover the entire lifetime.
Additionally, a CRL grows with every revoked certificate that has not yet expired. For a high-volume agent fleet issuing and revoking thousands of certificates per day, the CRL can grow to megabytes, making each revocation check expensive in bandwidth and parsing overhead.
OCSP Limitations
OCSP solves the growth problem (no CRL file to maintain) but introduces an availability dependency: every certificate validation requires an OCSP responder call. If the OCSP responder is unavailable, clients face a choice between fail-open (accept certificates without revocation check, creating a vulnerability window) or fail-closed (reject all certificates, creating an availability failure).
OCSP stapling (where the server includes a recent OCSP response in the TLS handshake) mitigates the availability dependency but requires agents to implement stapling support — adding implementation complexity.
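The fail-open versus fail-closed choice reduces to a small decision table. The sketch below uses hypothetical type and function names and does not perform any real OCSP request; it only encodes what each policy does when the responder cannot be reached.

```go
package main

import "fmt"

// revocationStatus is the outcome of a revocation lookup.
type revocationStatus int

const (
	statusGood        revocationStatus = iota // responder says the certificate is valid
	statusRevoked                             // responder says the certificate is revoked
	statusUnavailable                         // responder could not be reached
)

// failurePolicy decides what happens when the status is unavailable.
type failurePolicy int

const (
	failOpen   failurePolicy = iota // accept without a revocation answer
	failClosed                      // reject without a revocation answer
)

// acceptCertificate applies the policy. A definite answer always wins;
// the policy matters only when the responder is unavailable.
func acceptCertificate(status revocationStatus, policy failurePolicy) bool {
	switch status {
	case statusGood:
		return true
	case statusRevoked:
		return false
	default:
		return policy == failOpen
	}
}

func main() {
	fmt.Println("fail-open, responder down:", acceptCertificate(statusUnavailable, failOpen))
	fmt.Println("fail-closed, responder down:", acceptCertificate(statusUnavailable, failClosed))
}
```

Fail-open trades a vulnerability window for availability, while fail-closed turns a responder outage into an authentication outage for the whole fleet.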
SPIFFE's Rotation-Based Revocation
SPIFFE takes a different approach: rather than revocation, it relies on short certificate lifetimes and registration entry deletion. When an agent needs to be "revoked":
- Delete the SPIFFE registration entry for that agent from the SPIRE server
- The agent's current SVID remains valid until it expires (maximum 24 hours, often much shorter)
- When the SVID expires, the agent cannot renew because its registration entry no longer exists
- The agent is effectively revoked within one certificate lifetime
For most compromise scenarios, this is acceptable. An attacker with a compromised agent's private key has at most one SVID lifetime of unauthorized access after the entry is deleted. With 1-hour SVIDs (a common configuration for high-security deployments), this means 1 hour of unauthorized access — typically within incident response timelines.
The trade-off: if 1-hour unauthorized access after compromise is not acceptable (e.g., for financial transaction agents), implement additional controls at the application layer (session tokens, request signing with keys that can be immediately revoked) that complement the mTLS identity layer.
Operational Runbook: mTLS Certificate Lifecycle Operations
Day-to-Day Operations (Automated)
For SPIFFE/SPIRE-based deployments, day-to-day operations require minimal human involvement:
Automated (no human action required):
- SVID renewal: SPIRE agent renews SVIDs automatically when they reach 50% of lifetime
- Trust bundle rotation: Trust bundles rotate automatically when the CA's certificate is renewed
- Health checks: SPIRE agent health is monitored via Kubernetes liveness/readiness probes
Weekly review (human required):
- Review SPIRE audit logs for unusual attestation patterns (unexpected registration entries, unusual workload IDs)
- Verify certificate age distribution — all active certificates should be within declared maximum lifetimes
- Check for any certificate validity errors in service mesh observability (Istio's telemetry exposes mTLS handshake failures)
Incident Response: mTLS Authentication Failures
When agents report mTLS authentication failures (TLS handshake errors, certificate validation failures), the diagnostic procedure:
- Check SPIRE agent health: kubectl logs -n spire spire-agent-<pod> | grep ERROR. Connection errors to the SPIRE server prevent SVID renewal and will eventually cause authentication failures as SVIDs expire.
- Check trust bundle synchronization: For federated deployments, verify that trust bundle updates have propagated: spire-server bundle show -format pem | openssl x509 -noout -dates. This prints the validity dates of the local bundle; spire-server bundle list shows the bundles held for federated trust domains.
- Check certificate expiry: For cert-manager deployments, check certificate status: kubectl get certificates -n agent-namespace -o wide. Any certificate with READY=False is failing issuance or renewal.
- Check Istio/Linkerd certificate health: istioctl proxy-status confirms that each Envoy sidecar is in sync with istiod, and istioctl proxy-config secret <pod> shows the validity window of the certificate a sidecar is using. Certificates >24 hours old indicate renewal failures.
- Emergency certificate issuance: If a critical agent's SVID has expired and cannot be renewed (SPIRE server unavailable), manually issue a short-lived certificate from the offline CA to restore service while the SPIRE server is restored.
Capacity Planning for SPIRE at Scale
SPIRE's performance characteristics at scale:
| Fleet Size | SPIRE Server CPU (steady state) | SVID Renewal RPS | Recommended SPIRE HA Configuration |
|---|---|---|---|
| <100 agents | 1 CPU core | <2 RPS | Single server + SQLite |
| 100-1,000 agents | 2 CPU cores | 2-20 RPS | HA pair + PostgreSQL |
| 1,000-10,000 agents | 4 CPU cores | 20-200 RPS | 3-node cluster + PostgreSQL |
| >10,000 agents | 8+ CPU cores | >200 RPS | Horizontal scaling + Federation |
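SVID renewal RPS is estimated as: (fleet size) / (SVID lifetime in seconds) × 2. The factor of 2 reflects renewal starting at 50% of lifetime, so each agent requests a new SVID about twice per nominal lifetime.
For a fleet of 5,000 agents with 1-hour SVIDs: 5,000 / 3,600 × 2 ≈ 2.8 RPS. That is at the low end of the 2-20 RPS band the table associates with a two-node HA pair, so the table's figures leave considerable headroom.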
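The estimate is easy to reproduce for other fleet sizes and lifetimes. renewalRPS below is a hypothetical helper that encodes exactly the formula above and nothing more; it ignores bursts such as mass restarts.

```go
package main

import "fmt"

// renewalRPS estimates steady-state SVID renewal requests per second as
// fleet size divided by SVID lifetime in seconds, times 2 (renewal at 50%).
func renewalRPS(fleetSize int, svidTTLSeconds float64) float64 {
	if fleetSize <= 0 || svidTTLSeconds <= 0 {
		return 0
	}
	return float64(fleetSize) / svidTTLSeconds * 2
}

func main() {
	// 5,000 agents with 1-hour SVIDs, as in the worked example.
	fmt.Printf("%.1f RPS\n", renewalRPS(5000, 3600)) // prints "2.8 RPS"
}
```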
Building the mTLS Observability Stack
Cryptographic identity is only useful if you can observe it. The mTLS observability stack provides visibility into certificate health, authentication patterns, and rotation status across the entire agent fleet.
Certificate Health Dashboard Metrics
An mTLS certificate health dashboard for AI agent systems should surface:
Certificate age distribution: A histogram showing how old the currently active certificates are across the fleet. For SPIFFE/SPIRE with 1-hour SVIDs, the distribution should show all certificates less than 1 hour old. Any certificate older than the declared maximum lifetime indicates a renewal failure.
Authentication failure rate by agent pair: mTLS authentication failures appear as TLS handshake errors. Tracking failure rates by (caller, callee) pair identifies specific communication channels with certificate issues — much more actionable than aggregate failure counts.
Trust bundle version distribution: For federated deployments, track which trust bundle version each agent is using. Agents using outdated trust bundles may be unable to verify newly issued certificates signed by a rotated CA.
SVID renewal success rate: For SPIFFE/SPIRE deployments, track the ratio of successful SVID renewals to renewal attempts. A renewal success rate below 99% indicates SPIRE connectivity issues that will eventually cause authentication failures as SVIDs expire.
Certificate CN/SAN compliance: Verify that active certificates' Common Names and Subject Alternative Names match the agent identities declared in behavioral pacts. For X.509-SVIDs, the SPIFFE ID is carried in the URI SAN, so that is the field to compare. Mismatches indicate certificate provisioning errors.
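Two of the metrics above, stale certificates and renewal success rate, can be computed from very little data. The following is a minimal sketch, assuming certificate records have already been collected from the fleet; the `CertRecord` type, the function names, and the `example.org` SPIFFE IDs are hypothetical and not part of any SPIRE API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class CertRecord:
    """Hypothetical record for one active certificate in the fleet."""
    spiffe_id: str
    not_before: datetime  # issuance time taken from the certificate


def stale_certificates(records, max_lifetime: timedelta, now: datetime):
    """Return SPIFFE IDs whose certificate age exceeds the declared maximum."""
    return [r.spiffe_id for r in records if now - r.not_before > max_lifetime]


def renewal_success_rate(successes: int, attempts: int) -> float:
    """Ratio of successful renewals to attempts (1.0 when nothing was attempted)."""
    return 1.0 if attempts == 0 else successes / attempts


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
fleet = [
    CertRecord("spiffe://example.org/agent/a", now - timedelta(minutes=20)),
    CertRecord("spiffe://example.org/agent/b", now - timedelta(minutes=90)),
]

# With 1-hour SVIDs, agent/b has outlived its maximum lifetime.
print(stale_certificates(fleet, timedelta(hours=1), now))

# 985 of 1,000 renewals succeeded: below the 99% threshold from the text.
rate = renewal_success_rate(985, 1000)
print(f"{rate:.1%}", "ALERT" if rate < 0.99 else "ok")
```

In practice these values would feed a metrics backend rather than `print`; the point is that both checks reduce to simple comparisons once issuance times and renewal counters are exported.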
Integration with Security Information and Event Management (SIEM)
mTLS certificate events should feed into the organization's SIEM for correlation with other security events:
# Splunk Universal Forwarder configuration for SPIRE audit logs
[monitor:///var/log/spire/audit.log]
index = security_agents
sourcetype = spire:audit
disabled = false
Key SIEM correlation rules for mTLS events:
- New SPIFFE identity appearing without corresponding pact registration: If SPIRE registers a new workload that doesn't have a corresponding Armalo behavioral pact, that's an anomaly. An agent appearing outside the registered fleet may indicate unauthorized agent deployment.
- Certificate renewal failure followed by authentication failure: The causal chain (SPIRE connectivity fails → SVID renewal fails → SVID expires → authentication fails) should be correlated in SIEM to distinguish "authentication failure caused by expired certificate" from "authentication failure caused by blocked attacker."
- Unusual mTLS peer patterns: Agent A typically calls Agent B. If Agent A starts receiving mTLS connections from a new agent C (which has a valid certificate but which A has never communicated with before), flag for review. Valid certificates from unexpected peers can indicate certificate theft.
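The second and third correlation rules can be sketched in a few lines of Python. This is an illustration of the logic only, not a Splunk search or a SPIRE feature; the function names, the label strings, and the input dictionaries are assumptions, and a real deployment would express the same joins in the SIEM's own query language.

```python
from datetime import datetime, timedelta, timezone


def classify_auth_failure(spiffe_id: str,
                          failed_at: datetime,
                          last_renewal_failure: dict,
                          window: timedelta = timedelta(hours=1)) -> str:
    """Label an mTLS handshake failure using recent renewal failures.

    last_renewal_failure maps SPIFFE ID -> time of the most recent failed
    SVID renewal. The window should roughly match the SVID lifetime.
    """
    renewal_failed_at = last_renewal_failure.get(spiffe_id)
    if (renewal_failed_at is not None
            and timedelta(0) <= failed_at - renewal_failed_at <= window):
        return "likely_expired_svid"
    return "needs_investigation"


def is_unexpected_peer(callee: str, caller: str, known_callers: dict) -> bool:
    """True if `caller` has never been observed connecting to `callee`."""
    return caller not in known_callers.get(callee, set())


t0 = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
agent_a = "spiffe://example.org/agent/a"
agent_b = "spiffe://example.org/agent/b"
agent_c = "spiffe://example.org/agent/c"

renewal_failures = {agent_a: t0}
print(classify_auth_failure(agent_a, t0 + timedelta(minutes=40), renewal_failures))
print(classify_auth_failure(agent_b, t0 + timedelta(minutes=40), renewal_failures))

# Agent A has only ever been called by Agent B; a connection from C is flagged.
print(is_unexpected_peer(agent_a, agent_c, {agent_a: {agent_b}}))
```

A failure labeled `needs_investigation` is not proof of an attack; it only means the expired-certificate explanation has been ruled out by the available renewal data.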
Regulatory Compliance and mTLS Certificate Requirements
Different regulatory frameworks impose specific requirements on certificate management practices. Understanding these requirements shapes both the certificate lifetime choices and the audit trail design.
FedRAMP and NIST SP 800-53 for Government-Adjacent AI Deployments
AI agents deployed in federal government contexts or by FedRAMP-authorized platforms must comply with NIST SP 800-53 control IA-3 (Device Identification and Authentication). mTLS with SPIFFE/SPIRE SVIDs supports IA-3 by providing cryptographic device authentication: each agent is identified by an attested cryptographic identity rather than a shared secret, and that identity can be bound to hardware when a hardware-backed node attestor is used.
NIST SP 800-57 Part 1 recommends key cryptoperiods for government use. The recommendations are given by key type rather than by algorithm; applied to the keys used here:
- RSA 2048: Up to 3 years for private signature keys
- ECDSA P-256: Up to 3 years for private signature keys
- For short-lived SVIDs (24 hours or less): No explicit minimum is defined; the short lifetime provides security equivalent to frequent rotation
For FedRAMP, the mTLS audit trail must capture certificate issuance events with sufficient detail to satisfy IA-3 assessment requirements: certificate serial number, issuing CA, subject identity (SPIFFE ID), validity period, and the workload attestation that authorized issuance.
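One way to make the required fields concrete is a structured audit record. The sketch below is an assumption about how such a record could be shaped, not SPIRE's actual audit log format; the `IssuanceAuditRecord` type, its field names, and all example values are hypothetical.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class IssuanceAuditRecord:
    """Hypothetical audit record covering the fields listed above."""
    serial_number: str   # certificate serial number
    issuing_ca: str      # identity of the CA that signed the certificate
    spiffe_id: str       # subject identity
    not_before: str      # validity start, ISO 8601 UTC
    not_after: str       # validity end, ISO 8601 UTC
    attestation: dict    # evidence that authorized issuance

    def __post_init__(self):
        if not self.spiffe_id.startswith("spiffe://"):
            raise ValueError("spiffe_id must be a spiffe:// URI")

    def to_json(self) -> str:
        """Serialize with sorted keys so records are stable for diffing."""
        return json.dumps(asdict(self), sort_keys=True)


record = IssuanceAuditRecord(
    serial_number="4f:2a:9c:01",                 # illustrative value
    issuing_ca="spiffe://example.org",           # illustrative trust domain
    spiffe_id="spiffe://example.org/agent/planner",
    not_before="2025-01-01T12:00:00Z",
    not_after="2025-01-01T13:00:00Z",
    attestation={"selectors": ["k8s:ns:agents", "k8s:sa:planner"]},
)
print(record.to_json())
```

Emitting one such line per issuance gives an assessor a single artifact per certificate that links the identity, the validity window, and the attestation evidence.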
PCI DSS 4.0 and Certificate Management
PCI DSS 4.0 Requirement 4.2.1 requires strong cryptography and security protocols to protect cardholder data (PAN) during transmission over open, public networks. mTLS with current algorithm suites (TLS 1.3, ECDHE key exchange, AES-GCM) satisfies this requirement. Additionally, PCI DSS Requirement 12.3.3 (Cryptographic cipher suites and protocols in use are documented) requires maintaining an inventory of TLS configurations, which for mTLS includes the certificate lifetimes, CA hierarchy, and validation procedures.
For PCI DSS specifically, the audit evidence package for certificate management should include: CA certificate chain documentation, certificate issuance policy (maximum lifetimes, allowed algorithms), evidence of automated rotation, and revocation procedure documentation.
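The protocol floor and the client-certificate requirement can be enforced and inventoried in code. The following is a minimal sketch using Python's standard `ssl` module; the helper names are hypothetical, and certificate and trust bundle loading (`load_cert_chain`, `load_verify_locations`) is deliberately omitted so the example runs without files on disk.

```python
import ssl


def build_mtls_server_context() -> ssl.SSLContext:
    """Server-side context that requires TLS 1.3 and a client certificate.

    A real deployment would also call ctx.load_cert_chain(...) and
    ctx.load_verify_locations(...) with its own certificate paths.
    """
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3  # reject TLS 1.2 and older
    ctx.verify_mode = ssl.CERT_REQUIRED           # the "mutual" in mTLS
    return ctx


def describe(ctx: ssl.SSLContext) -> dict:
    """Summarize the settings an inventory of TLS configurations would record."""
    return {
        "minimum_version": ctx.minimum_version.name,
        "client_cert_required": ctx.verify_mode == ssl.CERT_REQUIRED,
    }


print(describe(build_mtls_server_context()))
```

Generating the inventory entry from the live context, rather than maintaining it by hand, keeps the documented configuration from drifting away from what is actually deployed.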
SOC 2 Trust Service Criteria for Certificate Management
SOC 2 CC6.1 (the entity implements logical access security software, infrastructure, and architectures over protected information assets) applies to certificate-based authentication. SOC 2 auditors examining mTLS implementations ask:
- How are certificates issued and to what entities?
- What is the certificate lifetime, and how is rotation enforced?
- How is certificate revocation handled, and what is the propagation latency?
- How are certificate issuance events logged?
For SPIFFE/SPIRE deployments, the SPIRE audit log addresses the logging question, automated SVID renewal addresses the rotation question, and the SPIRE trust domain configuration documents the CA hierarchy. Together, these supply most of the SOC 2 evidence package with little additional documentation work; revocation handling and propagation latency still need to be documented separately.
Conclusion
Certificate-based identity via mTLS provides the strongest available authentication for AI agent systems, but its security properties are fully realized only with proper certificate lifecycle management. The SPIFFE/SPIRE framework is the production-ready solution that makes mTLS operationally tractable at scale: by issuing continuously auto-renewed short-lived credentials, it converts the discrete rotation problem into a continuous automated process.
For organizations not yet using SPIFFE/SPIRE, cert-manager with external CA integration provides automated certificate lifecycle management in Kubernetes environments. For organizations that want mTLS without any application-level changes, Istio or Linkerd service mesh provides transparent mTLS at the infrastructure layer.
The investment in certificate-based identity pays compound dividends: every agent interaction is cryptographically authenticated, every communication is encrypted and integrity-protected, and the audit trail contains verifiable proof of which agents communicated with which services at what times — exactly the forensic foundation that incident investigation and compliance audits require.