Why Your AI Agent Needs a Pact, Not Just a System Prompt
System prompts are instructions an agent interprets. Pacts are contracts the runtime enforces. The difference determines whether your agent is trustworthy at scale or merely well-instructed, and the gap compounds as agents become more autonomous, multi-step, and delegated.
You have an agent in production. It has a system prompt. It mostly does what you want.
That combination (deployed agent, written instructions, acceptable behavior rate) is how most engineering teams currently define "agent governance." It's not wrong. It's just that the failure mode it accepts is one where things go badly in exactly the cases where you can least afford it: high-stakes actions, long runs, delegated subtasks, novel inputs.
The system prompt was designed for a specific operational context. That context no longer describes how agents are actually used. The mismatch has consequences.
The System Prompt's Original Design Assumption
System prompts were designed for supervised chatbots. The original mental model was: human writes instructions, model follows instructions, human reviews output, human catches mistakes. The architecture assumed that every generation would be seen by a human before any consequential action was taken.
That assumption worked. At low autonomy, the human-in-the-loop is the enforcement layer. The system prompt only needs to shape behavior well enough that the human reviewer's job is easy. You can afford probabilistic compliance because you have a deterministic backstop.
That backstop is gone in agentic systems. A coding agent that runs for forty minutes, makes twenty tool calls, and writes to a database is operating between human checkpoints. A research agent that browses fifty pages and drafts a legal brief will not have a human reviewing each step. A customer support agent handling six hundred conversations in parallel has no per-interaction human reviewer.
The system prompt was never hardened for this operating mode because the operating mode didn't exist when the system prompt was designed. Nobody went back and reconsidered whether written instructions alone are a sufficient enforcement mechanism once the human reviewer was removed from the loop. The industry moved from chatbots to agents, carried the system prompt with it, and is now running in production with a control mechanism that was designed for a fundamentally different threat model.
Three Failure Categories the System Prompt Cannot Catch
There are categories of failure that prompting cannot address, not because prompt engineering is done poorly but because the mechanism itself is structurally inadequate.
Instruction Drift
A model entering a long agentic run begins with high fidelity to its system prompt. Midway through (after tool calls, after context accumulation, after a user message that introduced ambiguous framing) the weighting of system prompt instructions relative to in-context evidence has shifted. This is not adversarial. The user is not trying to jailbreak the agent. The model is not malfunctioning. It is doing exactly what language models do: generating tokens conditioned on the full context, where the full context has become dominated by recent observations rather than the initial instructions.
Instruction drift is probabilistic and context-dependent. You cannot measure it reliably in advance. You cannot prevent it by making the system prompt longer, because a longer prompt gets proportionally more diluted by a longer context window. You cannot prevent it by repeating instructions, because the repetition itself becomes part of the noisy context. The fundamental problem is that system prompt instructions are tokens, not enforcement logic. At inference time, they compete with every other token in the context window.
A constraint enforced outside the model's inference loop, evaluated against every tool call independently of what the model has decided, does not drift. It doesn't know how long the context is or what unusual user inputs appeared in the middle of the run.
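One way to picture why an external constraint does not drift is the minimal TypeScript sketch below, in which the guard is a pure function of the tool call. The `ToolCall` shape, the `guard` function, and the `/etc/` rule are assumptions made for illustration, not the API of any particular SDK.

```typescript
// Hypothetical shapes for illustration only.
type ToolCall = { tool: string; args: Record<string, string> };
type Verdict = "allow" | "block";

// The guard receives the call and nothing else: no transcript, no step
// counter, no model output. Nothing accumulated during the run can reach it.
function guard(call: ToolCall): Verdict {
  const path = call.args["path"] ?? "";
  if (call.tool === "file_write" && path.startsWith("/etc/")) {
    return "block";
  }
  return "allow";
}

const call: ToolCall = { tool: "file_write", args: { path: "/etc/hosts" } };
const atStepOne = guard(call);
const atStepFiveHundred = guard(call); // identical input, identical verdict
```

Because the function has no access to the context window, the verdict at step five hundred cannot differ from the verdict at step one for the same call.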
Process vs. Result Violations
Consider an agent instructed not to access production databases except through approved read-only endpoints. The agent completes its task correctly. The output is accurate. The final state is what the user wanted.
Midway through the run, the agent called a direct database endpoint it wasn't authorized to use. The output was fine because the data happened to match what the read-only endpoint would have returned. The violation was real (unauthorized access, audit trail incomplete, compliance boundary crossed), and the correct final output hides it entirely.
System prompts cannot catch this because system prompts operate on the model's understanding of what it's doing, not on what it actually does. Output validation operates on the result, not the process. The only mechanism that catches process violations is per-action enforcement: every tool call evaluated against a constraint set before execution, regardless of what the model has decided about appropriateness.
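The gap between output validation and per-action enforcement can be sketched in a few lines. Everything here is hypothetical: the `Action` shape, the endpoint URLs, and the allowlist are placeholders chosen to mirror the example above.

```typescript
// Hypothetical action shape and allowlist, for illustration only.
type Action = { tool: string; endpoint: string };

const APPROVED_ENDPOINTS = new Set<string>([
  "https://api.internal.example/readonly/orders",
]);

// Output validation looks only at the final answer.
function validateOutput(output: string, expected: string): boolean {
  return output === expected;
}

// Per-action enforcement looks at each call before it would execute.
function isPermitted(action: Action): boolean {
  return action.tool !== "db_query" || APPROVED_ENDPOINTS.has(action.endpoint);
}

const actions: Action[] = [
  { tool: "db_query", endpoint: "https://api.internal.example/readonly/orders" },
  { tool: "db_query", endpoint: "postgres://db.internal.example:5432/orders" },
];
const finalOutput = "42 open orders";

const outputLooksFine = validateOutput(finalOutput, "42 open orders"); // true
const violations = actions.filter((a) => !isPermitted(a)); // the direct DB call
```

The output check passes while the action check reports one violation, which is the situation the paragraph above describes.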
This matters especially in financial, healthcare, and legal contexts where the regulatory requirement is on the process, not just the outcome. The HIPAA violation is the unauthorized access, not whether the patient's data was ultimately harmed.
Self-Assessment Bias
"Check your own work" is the most common mitigation pattern in deployed agents. Ask the model to verify its output before returning it. Ask the model to confirm it followed the constraints. Ask the model to rate its confidence.
This is structurally wrong. The model that made the error is the same model being asked to evaluate whether an error was made. Worse, the contextual state that produced the error is still fully present when the self-evaluation runs. Self-assessment error rates are correlated with the original task error rates precisely because they share the same generative process.
The cases where the model is most likely to confabulate a citation are the cases where the model is also most likely to evaluate that confabulated citation as plausible. The cases where the model is most likely to interpret a constraint loosely are the cases where the model is also most likely to affirm that its interpretation was appropriate. Self-assessment bias is not a failure of the self-assessment prompt. It is a structural property of asking a generative model to ground its outputs in a context it produced.
The Formal Definition: Pact vs. System Prompt
A system prompt is a natural language instruction set injected into a model's context window. Its enforcement is mediated entirely by the model's inference process. Whether it is followed depends on the model's interpretation at each generation step, the context that surrounds it, and the probabilistic nature of token prediction. A system prompt is a behavioral nudge, not a behavioral constraint.
A behavioral pact is a machine-readable contract specifying explicit commitments about agent behavior, evaluated by a runtime layer that operates independently of model inference and before any action executes. Its enforcement does not require the model to understand, remember, or interpret it. The pact clauses execute before the model's tool calls execute. The model cannot override them by generating tokens that express intent to violate them.
The distinction is architectural. A system prompt lives inside the model's reasoning. A pact lives outside it, in the execution environment. When the runtime intercepts a tool call and evaluates it against the pact, the model doesn't know this is happening. The model cannot decide the pact doesn't apply here. The model cannot interpret its constraints as inapplicable to an edge case it encounters mid-run.
Concretely:
- Pre-run validation: before inference begins, the request is evaluated against the pact to verify it's within scope.
- Per-action enforcement: every tool call, external HTTP request, file write, and downstream LLM invocation is evaluated against pact clauses in real time. Hard enforcement blocks the action. Soft enforcement logs and continues.
- Post-run receipts: a structured record of every action taken, every clause evaluated, and the outcome of each evaluation is generated and cryptographically bound to the run.
This is not a wrapper around a system prompt. It's a different layer of the architecture. Both can coexist. Neither replaces the other. The system prompt shapes what the model tries to do. The pact constrains what the runtime allows it to do.
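The three phases can be sketched as a small runtime loop. This is an illustrative sketch under stated assumptions, not the armalo-agent implementation: `runWithPact`, the `Clause` shape, and the clause ids are invented here, `plannedCalls` stands in for the tool calls a model would emit during a run, and the cryptographic binding of the receipt is left out.

```typescript
// Hypothetical types; not the API of any real SDK.
type ToolCall = { tool: string; target: string };
type Clause = { id: string; mode: "hard" | "soft"; violates: (c: ToolCall) => boolean };
type Outcome = "pass" | "blocked" | "logged";
type ReceiptEntry = { call: ToolCall; clauseId: string; outcome: Outcome };
type Receipt = { request: string; inScope: boolean; entries: ReceiptEntry[] };

function runWithPact(
  request: string,
  plannedCalls: ToolCall[],
  clauses: Clause[],
  inScope: (request: string) => boolean,
  execute: (c: ToolCall) => void,
): Receipt {
  const receipt: Receipt = { request, inScope: inScope(request), entries: [] };
  // 1. Pre-run validation: stop before any work if the request is out of scope.
  if (!receipt.inScope) return receipt;

  for (const call of plannedCalls) {
    let blocked = false;
    // 2. Per-action enforcement: evaluate every clause before execution.
    for (const clause of clauses) {
      const violated = clause.violates(call);
      const outcome: Outcome = !violated ? "pass" : clause.mode === "hard" ? "blocked" : "logged";
      if (outcome === "blocked") blocked = true;
      receipt.entries.push({ call, clauseId: clause.id, outcome });
    }
    if (!blocked) execute(call); // soft violations are logged and continue
  }
  // 3. Post-run receipt: the structured record travels with the result.
  return receipt;
}

const executed: string[] = [];
const receipt = runWithPact(
  "summarize open tickets",
  [
    { tool: "http_request", target: "https://tickets.example/api" },
    { tool: "file_write", target: "/etc/hosts" },
    { tool: "file_write", target: "/tmp/summary.txt" },
  ],
  [
    { id: "no-etc-writes", mode: "hard", violates: (c) => c.tool === "file_write" && c.target.startsWith("/etc/") },
    { id: "warn-external-http", mode: "soft", violates: (c) => c.tool === "http_request" },
  ],
  (req) => req.includes("tickets"),
  (c) => executed.push(c.target),
);
```

In this run the `/etc/hosts` write is blocked, the HTTP request is logged and allowed, and the receipt holds one entry per call per clause.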
Enterprise Compliance Is Not Optional Once You Understand the Scope
Article 9 of the EU AI Act requires providers of high-risk AI systems to establish, implement, document, and maintain a risk management system that runs as a continuous process across the system's lifecycle and is subject to regular systematic review. "We wrote a system prompt that says not to do bad things" does not constitute a documented, reviewable risk management procedure.
SOC 2 Type II compliance for AI systems requires auditable evidence that your stated controls are actually operating. Evidence means records. Records mean structured, retrievable data about what your system did, when it did it, and what constraints were evaluated. A system prompt is not evidence. A post-run receipt containing every tool call and its pact evaluation outcome is evidence.
ISO/IEC 42001, the international standard for AI management systems, requires documented behavioral controls with measurable effectiveness. Measurable effectiveness requires that you can produce metrics about how often controls were triggered, how often they blocked actions, and what patterns of attempted violations look like across a population of runs. You cannot produce any of these metrics from a system prompt. You can produce all of them from a pact audit log.
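As a rough illustration of what "measurable" looks like in practice, the sketch below aggregates per-clause counts from an audit log. The `Evaluation` row shape, the clause ids, and the sample rows are all assumptions; a real log format would differ.

```typescript
// Assumed shape of one audit-log row; field names are illustrative.
type Evaluation = {
  runId: string;
  clauseId: string;
  outcome: "pass" | "logged" | "blocked";
};
type ClauseStats = { evaluated: number; triggered: number; blocked: number };

// Aggregate per-clause counts across a population of runs.
function clauseStats(log: Evaluation[]): Record<string, ClauseStats> {
  const stats: Record<string, ClauseStats> = {};
  for (const row of log) {
    const s = stats[row.clauseId] ?? { evaluated: 0, triggered: 0, blocked: 0 };
    s.evaluated += 1;
    if (row.outcome !== "pass") s.triggered += 1; // soft or hard trigger
    if (row.outcome === "blocked") s.blocked += 1; // hard enforcement only
    stats[row.clauseId] = s;
  }
  return stats;
}

const sampleLog: Evaluation[] = [
  { runId: "run-1", clauseId: "no-prod-db", outcome: "pass" },
  { runId: "run-1", clauseId: "no-prod-db", outcome: "blocked" },
  { runId: "run-2", clauseId: "no-prod-db", outcome: "pass" },
  { runId: "run-2", clauseId: "warn-external-http", outcome: "logged" },
];
const summary = clauseStats(sampleLog);
// summary["no-prod-db"] is { evaluated: 3, triggered: 1, blocked: 1 }
```

Trigger rate and block rate per clause then fall out as simple ratios of these counts.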
The pattern is consistent: every compliance framework that has attempted to grapple seriously with AI system behavior has converged on the requirement for runtime-enforced, auditable, documented controls. The system prompt architecture cannot satisfy any of these requirements because it produces no structured evidence and makes enforcement unmeasurable.
Teams that wait for a compliance audit to discover this will be in the position of needing to retroactively justify a governance approach that regulators will not accept.
The Delegation Trust Problem
This is the most underexplored failure mode in multi-agent systems.
Your orchestrating agent has a system prompt. It has been carefully crafted. It specifies what the agent should and should not do. The agent runs, decides it needs help with a subtask, and delegates that subtask to a downstream specialist agent: a code reviewer, a data analyst, a third-party agent API, a service agent with its own model and configuration.
Your system prompt does not transfer. The downstream agent has its own system prompt, its own instruction set, its own model, its own interpretation of what it's allowed to do. The orchestrating agent cannot verify what the subordinate will do before it does it. The orchestrating agent cannot verify what the subordinate actually did after it does it, except by trusting the result it returns.
This is trust without verification. At every delegation boundary in a multi-agent system, the orchestrating agent is blind.
The fix is not to write system prompts that tell your orchestrating agent to "verify that subordinate agents behaved appropriately." You cannot prompt your way into having evidence. The fix is that subordinate agents produce run receipts (structured records of every action they took and every constraint they evaluated), and those receipts are returned to the orchestrating agent as part of the response.
A pact-enforced run receipt from a downstream agent is the only mechanism that gives the orchestrating agent verifiable information about what the subordinate did. "The subordinate returned a result that looks correct" is not the same as "the subordinate accessed only the data it was authorized to access, made only the calls it was permitted to make, and completed the task within the scope it was assigned." The receipt is the difference between those two statements.
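A hedged sketch of what the orchestrator might do with such a receipt: compare every executed action against the scope it delegated. The `RunReceipt` and `Delegation` shapes are hypothetical, and the sketch assumes the receipt's authenticity (for example a signature check) has been verified separately.

```typescript
// Hypothetical receipt and delegation shapes, for illustration only.
type ReceiptAction = { tool: string; resource: string; outcome: "allowed" | "blocked" };
type RunReceipt = { taskId: string; actions: ReceiptAction[] };
type Delegation = { taskId: string; allowedTools: string[]; allowedPrefixes: string[] };

// Return every executed action that falls outside the delegated scope.
function outOfScope(delegation: Delegation, receipt: RunReceipt): ReceiptAction[] {
  // A receipt for a different task proves nothing about this one.
  if (receipt.taskId !== delegation.taskId) return receipt.actions;
  return receipt.actions.filter((a) => {
    if (a.outcome === "blocked") return false; // never executed
    const toolOk = delegation.allowedTools.includes(a.tool);
    const resourceOk = delegation.allowedPrefixes.some((p) => a.resource.startsWith(p));
    return !(toolOk && resourceOk);
  });
}

const delegation: Delegation = {
  taskId: "review-42",
  allowedTools: ["file_read"],
  allowedPrefixes: ["/repo/src/"],
};
const subordinateReceipt: RunReceipt = {
  taskId: "review-42",
  actions: [
    { tool: "file_read", resource: "/repo/src/index.ts", outcome: "allowed" },
    { tool: "file_read", resource: "/repo/.env", outcome: "allowed" },
    { tool: "http_request", resource: "https://example.com", outcome: "blocked" },
  ],
};
const findings = outOfScope(delegation, subordinateReceipt);
// findings holds the single /repo/.env read: the returned result may look
// correct, but the receipt shows the subordinate stepped outside its scope.
```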
As multi-agent systems get more complex (hierarchies of agents, agent-to-agent APIs, third-party agents embedded in orchestration flows), the delegation trust problem compounds. Each boundary without a receipt is a blind spot. Each blind spot is a liability in any subsequent forensic investigation.
Three Scenarios Where a System Prompt Failed and a Pact Would Have Caught It
These are not hypotheticals. Variations of each have occurred across production deployments.
Scenario 1: The coding agent and the production database
A team deploys a coding agent with the instruction "avoid writes to production." The agent is working on a migration script. Two hours into the run, it has accumulated extensive context about the codebase, the staging environment configuration, and the structure of several environments. It needs to verify the schema of a table. It identifies a connection string in the configuration it has been reading. The connection string points to the production replica, which has a hostname that reads as staging-replica-prod: not production in the agent's interpretation, given the word "staging" in the name and its accumulated context about the environment's structure.
The agent queries production. The model's interpretation of its constraint ("avoid writes to production") was that reads were not writes, and it interpreted the database as non-production based on a naming convention it had inferred from context.
A pact with a clause DENY tool:database_query where connection_string matches /prod/ does not care about the model's interpretation. It evaluates the clause against the actual value of the connection string parameter. It blocks the call before it executes. The model's reasoning about production vs. staging is irrelevant.
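The clause in this scenario reduces to a pattern test on the actual parameter value. The sketch below assumes a hypothetical `DbQuery` shape and a made-up connection string built around the hostname from the scenario.

```typescript
// Hypothetical call shape; the connection strings are invented examples.
type DbQuery = { tool: "database_query"; connectionString: string; sql: string };

// The clause from the scenario: deny any database query whose connection
// string matches /prod/, whether the statement reads or writes.
const DENY_PATTERN = /prod/;

function isDenied(call: DbQuery): boolean {
  return DENY_PATTERN.test(call.connectionString);
}

const schemaCheck: DbQuery = {
  tool: "database_query",
  connectionString: "postgres://staging-replica-prod.internal.example:5432/app",
  sql: "SELECT column_name FROM information_schema.columns WHERE table_name = 'orders'",
};
const denied = isDenied(schemaCheck); // true: "prod" appears in the hostname
```

One design consequence worth noting: a bare substring pattern is deliberately blunt, so it would also deny a host such as `product-catalog-dev`. A team adopting this approach would probably anchor the pattern to its own naming scheme.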
Scenario 2: The research agent and the legal brief
A research agent is asked to find supporting precedents for a motion. It has been given a large set of real documents to work from. Near the end of a long research run, the context has been dominated by summaries of real cases, partial quotes, and synthesis work. It needs one more case to support a specific argument. It generates a citation: a plausible case name, a plausible court, a plausible date, a plausible holding. None of it corresponds to a real case.
The final brief is submitted. The fabricated citation goes undetected because it is stylistically indistinguishable from the real citations around it.
A pact with a citation verification clause (hard enforcement requiring each citation to be verified against a source document before inclusion in output) catches this at generation time. The agent cannot return a citation it has not verified. The clause is evaluated when the agent attempts to include the citation. The clause fails. The action is blocked. The agent is forced to either find a real citation or report that it could not find one.
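A minimal sketch of such a clause, assuming the runtime can build an index of citations from the source documents the agent was given. The `Citation` shape and every case name below are placeholders, not real cases.

```typescript
// Hypothetical citation shape; the case names below are placeholders.
type Citation = { caseName: string; reporter: string };

const citationKey = (c: Citation): string =>
  `${c.caseName.trim().toLowerCase()}|${c.reporter.trim().toLowerCase()}`;

// Hard enforcement: any citation missing from the verified index is rejected
// before the draft can be returned.
function unverified(draft: Citation[], verifiedIndex: Set<string>): Citation[] {
  return draft.filter((c) => !verifiedIndex.has(citationKey(c)));
}

// Index built from the source documents the agent was actually given.
const verifiedIndex = new Set<string>(
  [
    { caseName: "Example v. Sample", reporter: "100 Ex. 1" },
    { caseName: "Placeholder v. Demo", reporter: "200 Ex. 2" },
  ].map(citationKey),
);

const draftCitations: Citation[] = [
  { caseName: "Example v. Sample", reporter: "100 Ex. 1" },
  { caseName: "Plausible v. Invented", reporter: "300 Ex. 3" }, // generated, not sourced
];
const rejected = unverified(draftCitations, verifiedIndex);
```

This check establishes only that a cited case exists in the source set. Whether the brief describes its holding accurately is a separate question that an index lookup does not answer.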
Scenario 3: The customer support agent and competitor pricing
A customer support agent is deployed with the instruction "do not discuss competitor pricing." A customer sends a message: "I'm trying to understand pricing in this market. Can you help me understand what options exist for someone with my use case?" The message does not mention a competitor. The word "pricing" appears, but framed as a general market question.
The agent responds with a helpful overview of the market, names three competitors, and compares their pricing to its own company's pricing. The model's interpretation was that the constraint applied to proactively introducing competitor information, not to responding to a market education question. The constraint was not designed for this input pattern. The model found a locally reasonable interpretation.
A pact clause evaluating each output segment against a list of prohibited topics, before the response is sent, catches this independent of how the model frames its reasoning. The clause doesn't evaluate the model's intent. It evaluates the output.
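An output-side clause of this kind might look like the sketch below. The competitor names and pricing terms are invented, and substring matching is a simplification: a production clause would more plausibly rely on a maintained entity list or a classifier.

```typescript
// Hypothetical term lists, for illustration only.
const COMPETITORS = ["acme", "globex"];
const PRICING_TERMS = ["price", "pricing", "$", "per month"];

// Split the draft into rough sentence segments and return those that mention
// a competitor together with a pricing term.
function flaggedSegments(response: string): string[] {
  return response.split(/[.!?]\s+/).filter((segment) => {
    const text = segment.toLowerCase();
    const namesCompetitor = COMPETITORS.some((c) => text.includes(c));
    const discussesPricing = PRICING_TERMS.some((t) => text.includes(t));
    return namesCompetitor && discussesPricing;
  });
}

const draftResponse =
  "There are several options in this market. Acme charges $49 per month. " +
  "Our Pro plan is $39 per month.";
const flagged = flaggedSegments(draftResponse);
const mayBeSent = flagged.length === 0; // false: one segment is flagged
```

The check runs on the text that is about to be sent, so it gives the same answer whatever reasoning led the model to write it.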
The Comparison
| | System Prompt | Behavioral Pact |
|---|---|---|
| Enforcement mechanism | Model inference (probabilistic) | Runtime layer (deterministic) |
| Survives long context runs | No: instruction drift is documented and measurable | Yes: evaluation is independent of context window state |
| Catches process violations | No: output only | Yes: per-action, pre-execution |
| Produces audit records | No | Yes: structured post-run receipts |
| Works at delegation boundaries | No: does not transfer to downstream agents | Yes: receipts are returned and verifiable |
| Satisfies EU AI Act Article 9 | No | Yes |
| Satisfies SOC 2 Type II evidence requirements | No | Yes |
| Can be independently reviewed | No: only readable, not evaluable | Yes: clauses are machine-readable and testable |
| Self-assessment dependency | Yes: relies on model to evaluate own compliance | No: external evaluation |
| Retrofittable to existing agents | N/A | Yes: two lines of SDK integration |
The Retrofit Problem
The SDK makes the mechanical integration straightforward. Wrapping an existing agent in a pact and specifying which pact template to use is two lines. The technical cost of retrofit is low.
The organizational cost of not having it when you need it is not low.
The pattern that plays out: an agent incident occurs. Someone (a customer, an auditor, a regulator, a security team) asks what the agent did during the run in question. The engineering team checks the logs. Application logs record that functions were called and that they succeeded or failed. They do not record what tool calls the agent made with what parameters in what order. They do not record which constraints were evaluated. They do not record which actions were blocked, which warnings were surfaced, and which actions proceeded without restriction.
The forensic reconstruction of what the agent did is partially possible from fragmented log sources. It takes hours. The reconstruction is incomplete because some tool calls passed through without logging. The conclusion (what the agent actually did during the run) is probabilistic because the evidence is incomplete.
The pact audit log would have contained a deterministic, timestamped, structured record of every action and every constraint evaluation. It would have taken minutes to read. It would have been complete.
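To make the contrast concrete, here is a sketch of the query an engineer would run against a structured audit log to rebuild one run. The `AuditRow` fields and sample rows are assumptions, and timestamps are taken to be ISO 8601 strings in UTC so that plain string comparison orders them correctly.

```typescript
// Assumed audit-log row; field names and values are illustrative.
type AuditRow = { runId: string; at: string; tool: string; clauseId: string; outcome: string };

// Rebuild the ordered timeline of one run from the structured log.
function timeline(log: AuditRow[], runId: string): string[] {
  return log
    .filter((row) => row.runId === runId) // filter copies, so the log is not mutated
    .sort((a, b) => (a.at < b.at ? -1 : a.at > b.at ? 1 : 0))
    .map((row) => `${row.at} ${row.tool} ${row.clauseId}=${row.outcome}`);
}

const auditLog: AuditRow[] = [
  { runId: "run-7", at: "2024-05-01T10:02:00Z", tool: "file_write", clauseId: "no-etc-writes", outcome: "blocked" },
  { runId: "run-8", at: "2024-05-01T10:01:30Z", tool: "http_request", clauseId: "warn-external-http", outcome: "logged" },
  { runId: "run-7", at: "2024-05-01T10:01:00Z", tool: "db_query", clauseId: "no-prod-db", outcome: "pass" },
];
const run7 = timeline(auditLog, "run-7");
```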
The architectural cost of retrofitting is low. The organizational cost is that you are almost always retrofitting in response to an incident, under time pressure, while someone is waiting for answers you don't have. The right time to build accountability infrastructure is before you need to explain what your system did. After the incident, the conversation is about inadequate controls, not engineering efficiency.
On the Common Objections
"This adds complexity."
Forensically reconstructing what an agent did from fragmented application logs is complexity. Doing that reconstruction while a customer is waiting for an explanation is compressed complexity. The pact adds a configuration step before deployment. The alternative adds an investigation step after every incident.
"My agent is internal. Compliance doesn't apply to me."
When your internal agent has write access to production data, accountability applies regardless of the audience. "Internal" describes who can see the agent, not what it can affect. An internal coding agent that writes to a production database has the same blast radius as an external one. The fact that the incident doesn't create a regulatory obligation doesn't mean it doesn't create an operational one. You still have to explain to your team what happened and how to prevent recurrence. The pact is not a compliance artifact. It is an engineering artifact with compliance-adjacent properties.
"Prompt engineering is sufficient."
This conflates the goal (reliable, auditable agent behavior) with the mechanism (shaping model outputs through natural language). The goal is achievable. The mechanism is not the right tool for achieving it.
Prompt engineering optimizes the probability that the model does what you intend. It does not give you enforcement. It does not give you records. It does not give you consistency across long runs with novel inputs. It does not transfer across delegation boundaries. The more capable you become at prompt engineering, the higher you can push the probability of correct behavior. You cannot push it to certainty, and certainty is what compliance, accountability, and enterprise trust require.
Structural constraints enforce. Natural language instructions advise. You need both. Advice alone, however well crafted, is not sufficient.
Getting Started
The armalo-agent SDK is open source. Drop-in integration with OpenAI, Anthropic, LangGraph, LangChain, and CrewAI. Start with the SAFETY_DEFAULTS pact template and add domain-specific clauses from there.
```typescript
import { ArmaloAgent, PactBuilder } from "armalo-agent";

// Start from the default template, then add domain-specific clauses.
const pact = new PactBuilder()
  .from("SAFETY_DEFAULTS")
  .deny("tool:file_write", { pathMatches: /\/etc\// })
  .deny("tool:http_request", { urlMatches: /prod\.internal/ })
  .require("citation_verified", { on: "output:contains_url" })
  .build();

const agent = new ArmaloAgent({
  model: yourExistingModel, // placeholder for the model you already use
  pact,
  receipts: true, // structured post-run receipts
});

const result = await agent.run(userInput); // userInput: placeholder for the request
// result.receipt contains every action, every clause evaluation, every outcome
```
Available pact templates: SAFETY_DEFAULTS, RESEARCH_PACT, CODING_PACT, CUSTOMER_SUPPORT_PACT. All templates are composable and extensible with custom clauses.
Repository and full documentation: github.com/fongryan/armalo-agent