What the EU AI Act Actually Requires for High-Risk Systems
The EU AI Act imposes obligations across the full system lifecycle. For high-risk AI systems, the requirements most relevant to agentic deployments are:
Article 9 (Risk Management System): A continuous, iterative risk management process with "testing of high-risk AI systems to identify the most appropriate risk management measures." This is not a one-time classification exercise; it is ongoing testing with documented results.
Article 12 (Record-Keeping): Automatic logging of events "while the high-risk AI systems are operating." For autonomous agents, this means every consequential operation must produce a log that enables post-deployment audit: not just operational telemetry, but behavioral records that connect outputs to the agent's specified constraints.
Article 17 (Quality Management): Documentation of "the metrics used to test" the AI system "as well as the technical description of the procedures for human oversight." For autonomous agents, "metrics used to test" means measurable behavioral criteria, not free-form descriptions of what the agent is supposed to do.
The Annex III Classification Trap
Many agentic systems fall under Annex III high-risk classification without their operators recognizing it. Annex III covers:
- AI systems used in employment decisions (CV screening, interview scoring, performance evaluation)
- AI systems used in access to essential services (credit scoring, insurance pricing, social benefits)
- AI systems used in law enforcement, border management, or the administration of justice
- AI systems used in critical infrastructure management
If your agents operate in any of these contexts, even as one component of a larger workflow, they likely fall within scope. And unlike general-purpose AI systems, specialized agents are harder to argue out of high-risk classification because their narrow scope typically increases their impact on a specific consequential decision.
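A first-pass triage over an agent inventory can be sketched in TypeScript as follows. The `AgentProfile` shape, the context tags, and the `annexIIIAreas` helper are hypothetical names invented for this illustration; the mapping only mirrors the example areas listed above and is not a substitute for a legal reading of Annex III.

```typescript
// Hypothetical triage sketch: area names and context tags are illustrative
// assumptions, not terms defined by the EU AI Act.
type AnnexIIIArea =
  | "employment"
  | "essential-services"
  | "law-enforcement-border-justice"
  | "critical-infrastructure";

interface AgentProfile {
  id: string;
  // Decision contexts the agent contributes to, even as one step of a larger workflow.
  contexts: string[];
}

const AREA_BY_CONTEXT: Record<string, AnnexIIIArea> = {
  "cv-screening": "employment",
  "interview-scoring": "employment",
  "performance-evaluation": "employment",
  "credit-scoring": "essential-services",
  "insurance-pricing": "essential-services",
  "social-benefits": "essential-services",
  "law-enforcement": "law-enforcement-border-justice",
  "border-management": "law-enforcement-border-justice",
  "administration-of-justice": "law-enforcement-border-justice",
  "critical-infrastructure": "critical-infrastructure",
};

// Returns the distinct Annex III areas an agent appears to touch.
// An empty result means "no match in this table", not "out of scope".
function annexIIIAreas(agent: AgentProfile): AnnexIIIArea[] {
  const areas: AnnexIIIArea[] = [];
  for (const context of agent.contexts) {
    if (!Object.prototype.hasOwnProperty.call(AREA_BY_CONTEXT, context)) continue;
    const area = AREA_BY_CONTEXT[context];
    if (area !== undefined && areas.indexOf(area) === -1) areas.push(area);
  }
  return areas;
}
```

Any agent that returns a non-empty list is a candidate for the high-risk analysis; agents that return an empty list still need a human review, because the tag table is only as complete as the inventory behind it.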
The Behavioral Record Gap
The gap between what most agentic deployments currently have and what the EU AI Act requires is not primarily a documentation gap. It is a behavioral record gap.
| What Most Teams Have | What the Act Requires |
|---|---|
| Operational logs (inputs, outputs, latency) | Behavioral records tied to measurable specifications |
| Self-generated eval results | Third-party attested performance history |
| Static risk classification in docs | Evidence of ongoing testing with documented metrics |
| Model version tracking | Behavioral impact assessment across version changes |
| Guardrail pass/fail logs | Dimensional performance scores (accuracy, safety, scope adherence) |
| Internal monitoring dashboards | Auditable records accessible to regulatory authorities |
The difference between the left column and the right column is not just format. It is provenance. A behavioral record that was produced by the same infrastructure running the agent has lower evidentiary weight than one produced and signed by a third party. For a regulatory investigation, this distinction matters.
What "Behavioral Receipt" Means in the Compliance Context
A behavioral receipt is the document that proves a specific AI system, on a specific task, under verifiable conditions, performed within its stated specification.
For the EU AI Act, a behavioral receipt needs:
1. Timestamp: when the evaluation was performed
2. Specification reference: which behavioral pact or specification was evaluated against
3. Evaluation method: how the assessment was conducted (deterministic checks + multi-LLM jury)
4. Result: dimensional scores across accuracy, safety, latency, scope adherence
5. Third-party attestation: signature from an entity other than the system under evaluation
6. Immutability: the record cannot be altered after the fact
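The operational telemetry that most agentic pipelines produce covers items 1 and 4 (timestamp and result). Items 2, 3, 5, and 6 (specification reference, evaluation method, third-party attestation, and immutability) require a separate behavioral verification layer.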
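To make the gap concrete, here is a minimal TypeScript sketch of a receipt shape and a completeness check. The `BehavioralReceipt` interface, its field names, and `missingReceiptFields` are assumptions made for illustration, not a schema defined by the Act or by any particular product. The check only looks for the presence of each element: it does not verify a signature, and a stored digest only makes tampering detectable, so real immutability would still need append-only storage.

```typescript
// Illustrative shape only: field names are assumptions mirroring the six elements above.
interface BehavioralReceipt {
  timestamp?: string;               // when the evaluation was performed (ISO 8601)
  specificationRef?: string;        // pact or specification evaluated against
  evaluationMethod?: string;        // e.g. "deterministic checks + multi-LLM jury"
  scores?: Record<string, number>;  // dimensional results
  attestation?: { signer: string; signature: string };
  contentHash?: string;             // digest recorded for tamper-evidence
}

// Lists which of the six elements are absent from a receipt.
// `systemUnderTest` identifies the evaluated system, so a self-signed
// attestation is reported as missing third-party attestation.
function missingReceiptFields(receipt: BehavioralReceipt, systemUnderTest: string): string[] {
  const missing: string[] = [];
  if (!receipt.timestamp) missing.push("timestamp");
  if (!receipt.specificationRef) missing.push("specification reference");
  if (!receipt.evaluationMethod) missing.push("evaluation method");
  if (!receipt.scores || Object.keys(receipt.scores).length === 0) missing.push("result");
  if (!receipt.attestation || receipt.attestation.signer === systemUnderTest) {
    missing.push("third-party attestation");
  }
  if (!receipt.contentHash) missing.push("immutability");
  return missing;
}
```

A record built from plain operational telemetry (a timestamp plus scores) comes back from this check missing the specification reference, evaluation method, third-party attestation, and immutability elements.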
A 30/60/90 Day Plan to Close the Gap Before August 2026
Days 1-30: Classify and scope
- Conduct an Annex III analysis for every agent in production: does it touch employment, essential services, critical infrastructure, law enforcement, border management, or the administration of justice?
- Identify which agents require high-risk compliance documentation
- For each agent in scope, define measurable behavioral specifications (the pact conditions that map to Article 9 testing metrics)
- Begin collecting behavioral baselines; the first eval timestamps create the history that August will require
Days 31-60: Wire in behavioral verification
- Connect each in-scope agent to a third-party behavioral evaluation pipeline
- Run initial evals across all required dimensions: accuracy, safety, latency, scope adherence
- Establish score thresholds that define compliant vs. non-compliant behavior
- Set up score decay monitoring: a 10-point drop in 7 days triggers an immediate re-evaluation
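The score decay rule in the last step can be sketched as a small check over an agent's score history. This is one possible reading of the rule, under stated assumptions: scores are on a 0-100 scale, the drop is measured from the highest score in the 7 days before the latest evaluation, and `ScorePoint` and `needsReevaluation` are hypothetical names.

```typescript
// Assumed shape: one composite score (0-100) per evaluation run.
interface ScorePoint {
  at: Date;
  score: number;
}

const WINDOW_MS = 7 * 24 * 60 * 60 * 1000; // 7-day lookback
const DROP_THRESHOLD = 10;                 // points

// True when the latest score sits at least DROP_THRESHOLD points below
// the highest score recorded in the 7 days leading up to it.
function needsReevaluation(history: ScorePoint[]): boolean {
  if (history.length < 2) return false;
  const sorted = history.slice().sort((a, b) => a.at.getTime() - b.at.getTime());
  const latest = sorted[sorted.length - 1];
  if (!latest) return false;
  const windowStart = latest.at.getTime() - WINDOW_MS;
  let peak = latest.score;
  for (const point of sorted) {
    if (point.at.getTime() >= windowStart && point.score > peak) {
      peak = point.score;
    }
  }
  return peak - latest.score >= DROP_THRESHOLD;
}
```

Measuring from the window's peak rather than from the previous run catches a slow slide spread over several evals as well as a single sharp drop.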
Days 61-90: Build the audit package
- Generate behavioral records with third-party attestations for each in-scope agent
- Map each record to specific EU AI Act article requirements
- Establish ongoing monitoring cadence (weekly automated evals, monthly audit package generation)
- Prepare incident response procedure for score anomalies detected post-August
```typescript
import { ArmaloClient, runEval } from '@armalo/core';

const armalo = new ArmaloClient({ apiKey: process.env.ARMALO_API_KEY! });

// Article 9 compliance: automated testing with documented metrics
async function runComplianceEval(agentId: string, pactId: string) {
  const eval_ = await runEval(armalo, {
    agentId,
    pactId,
    name: `eu-ai-act-compliance-${new Date().toISOString().slice(0, 10)}`,
    checks: [
      { type: 'accuracy', severity: 'critical', threshold: 0.90 },
      { type: 'safety', severity: 'critical', threshold: 0.95 },
      { type: 'scope-adherence', severity: 'critical', threshold: 0.88 },
      { type: 'latency', severity: 'minor', maxMs: 5000 },
    ],
  });

  // Article 12: log the event with full context
  console.log(`[EU-AI-ACT-LOG] eval_id=${eval_.id} agent=${agentId} status=${eval_.status}`);
  return eval_;
}
```
The Enforcement Reality
The EU AI Act does not require perfection. It requires demonstrable, ongoing effort to identify risks and manage them, backed by records that prove the effort was made.
An agent with a 91% accuracy score, a verified safety record, and a clean 90-day behavioral history is in a fundamentally different compliance posture than an agent with the same self-reported performance but no third-party records. The first agent has evidence. The second one has claims.
When an auditor arrives (and post-August, some will), the question is not "did your agent perform well?" The question is "can you prove it?"
Start building your behavioral record before August 2026 at armalo.ai.
Frequently Asked Questions
Does every AI agent fall under EU AI Act high-risk classification?
No. The EU AI Act's high-risk classification is specific to the contexts in Annex III and Annex I. General-purpose chatbots, internal automation agents, and productivity tools generally do not qualify. Agents making consequential decisions in employment, financial services, critical infrastructure, healthcare, and law enforcement typically do.
What is the difference between operational logs and behavioral records for compliance purposes?
Operational logs capture what the system did: inputs, outputs, latency, errors. Behavioral records capture whether what the system did complied with its stated specification, assessed by a defined method with documented metrics. The EU AI Act's Article 12 requires the latter: not just telemetry, but records that support post-market monitoring.
When is it too late to start building behavioral records?
If your agent has no behavioral history before August 2, 2026, you cannot retroactively create one. Compliance documentation requires evidence of behavior over time; a single eval conducted the week before the enforcement deadline does not satisfy the "ongoing" requirement. Starting now means every eval run between now and August is part of your compliance evidence package.
Do the fines apply to EU-based operators only?
No. The EU AI Act applies to any operator placing a high-risk AI system into service in the EU market, regardless of where the operator is headquartered. US, UK, and APAC companies deploying agents that EU users interact with in high-risk contexts are within scope.
Armalo AI provides behavioral verification infrastructure for EU AI Act compliance and beyond: third-party attested evals, composite trust scores, and audit-ready behavioral records. At armalo.ai.