Pact-As-Code: Treating Behavior Constraints Like Infra-As-Code, With Diffs And Reviews
Behavioral pacts deserve the same engineering rigor as infrastructure: version control, diffs, code review, and CI validation. This is the practice playbook.
TL;DR
Behavioral pacts are typically authored once, pasted into a registry, and edited in a web form when something breaks. That treatment is wrong. Pacts are contracts that govern an agent's economic accountability. They deserve the same engineering rigor as the infrastructure the agent runs on: version control, structured diffs, peer code review, automated validation in CI, semantic versioning, and changelog discipline. This post is the engineering-practice essay for Pact-As-Code. We define the repo structure, the diff conventions, the code-review checklist, the CI validation suite, the rollout strategy, and the relationship between pact versions and trust-layer scoring. Reader artifact: a Pact-As-Code Repo Template that any team can adopt as the starting point for their own pact engineering practice.
Intro
The pact registry for one of the agents in our jury was, until recently, a JSON blob in a Postgres row. The blob had been edited 47 times over six months. Each edit was a click in a web form. Each edit was timestamped. None were attributed. None had a description of why the change was made. There was no diff between any two versions, only the current state of the blob. There was no review by anyone other than the editor. There was no test that the new pact was even consistent with itself, let alone with the agent's runtime constraints. There was no way to roll back to a known-good version, only the option to manually re-paste a remembered earlier blob.
The agent's pact was, in other words, treated like a configuration value rather than like code. It was treated the way many production systems treated database connection strings in 2008: as a string in a settings panel, edited in production, with the change applied immediately and permanently. The industry, painfully, learned that database connection strings should be in version control, reviewed, deployed through pipelines, and rolled back through tooling. We learned this through outages.
We are about to learn the same lesson about behavioral pacts. We will learn it through scoring failures, through compliance audits that cannot reconstruct what the pact was at the time of an incident, through silent regressions where one team's web-form edit broke another team's downstream agent that was relying on the previous behavior, and through the slow, grinding realization that an agent's pact is the most important artifact in its entire stack and we are managing it with less rigor than we manage a feature flag.
The alternative, the practice this post sets out, is to treat pacts as code. Pacts live in a git repository. Changes are diffs. Diffs are reviewed by humans. Reviews check for a defined set of failure modes. CI runs automated validators that catch the failure modes the humans miss. The validated diff is merged. The merge triggers a deployment pipeline that updates the registry, increments the pact version, and notifies downstream consumers. The deployment is observable. The rollback is one revert away. The history is reconstructable.
This is not new engineering practice. It is the practice every infrastructure team has been using for a decade. What is new is applying it to a different artifact. The artifact is the pact, and the consequence of getting it wrong is not a service outage but a slow erosion of an agent's trust profile, a quiet drift in its scoring, and a population of counterparties who are operating under assumptions that the pact silently invalidated.
Why Pacts Are Code, Not Configuration
The distinction between code and configuration is not always clean, but it matters here. Configuration is a value that the runtime reads, where the value's correctness is bounded by the runtime's behavior. A database connection string is configuration: as long as the string is well-formed and the database accepts it, the runtime works. The string itself does not encode logic; it points to a runtime that contains the logic.
Code is an artifact whose content encodes behavior. The behavior is determined by the artifact, not by the runtime. A function definition is code. A SQL query is code. A regex is code. A Terraform module is code. A Kubernetes manifest is code. The runtime executes the artifact; it does not interpret a value into a behavior. Code requires reasoning about the artifact itself, not just about whether it points somewhere valid.
Pacts are unambiguously code. A pact's predicates encode behavioral constraints that are evaluated against the agent's behavior. The constraints have logical structure: predicate scope, evidence requirements, penalty conditions, renewal triggers. Two pacts that look superficially similar can differ in subtle, consequential ways: a predicate that requires a citation versus one that requires a citation containing a specific URL pattern; a penalty that applies on first violation versus one that applies after three; a renewal triggered by score change versus one triggered by time. The differences are logic. The logic determines behavior. The artifact is code.
Treating it as configuration produces predictable failures. The most common is the silent semantic change: a pact edit that looks like a wording cleanup but actually changes scope. "The agent will respond within 30 seconds" becomes "The agent will respond within 30 seconds during business hours," a change that drops the agent's effective coverage from 24/7 to roughly a third of that. As configuration, the change is a string edit. As code, the change is a scope reduction that should require review, a changelog entry, a deprecation notice for any counterparty depending on the prior 24/7 coverage, and a versioned bump.
The second common failure is the unintended interaction. Pact predicates compose. Adding a predicate that interacts with an existing one (say, a new escalation rule that conflicts with an existing latency promise) produces a behavior the editor did not anticipate. As configuration, no one notices until production behavior changes. As code, the interaction is caught by an automated validator that exercises the joint predicate set against a test corpus.
The third common failure is the lost history. As configuration, the pact registry usually preserves only the current state. When an audit asks "what was this agent's pact on March 14th," the answer is often "we cannot reconstruct it." As code in version control, the answer is "git show ".
The Repo Structure
A Pact-As-Code repository has a small, opinionated structure. The opinionated part is what enables tooling to work consistently across teams. The small part is what keeps the practice approachable.
The top-level layout has six directories. Each directory has a single, narrow purpose, and the directories together compose into a complete pact engineering surface.
The pacts/ directory holds one file per agent pact, named by the agent's stable identifier. The file format is YAML or TOML; both are auditable in diff, both have widespread tool support, and both are forgiving of human authors. The file content is structured: metadata header (agent identifier, owner, reviewers, current version, supersedes), predicate list (each predicate with subject, predicate body, evidence requirements, penalty conditions, scope, and a stable predicate identifier), tier mapping (which predicates are required for Bronze, Silver, Gold, Platinum), and dependencies (skills, MCP servers, runtime profile, model provider).
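The sections listed above can be sketched as a type for a parsed pact file. This is a minimal sketch, not the registry's schema: the field names (agentId, supersedes, tierMapping, mcpServers, and the rest) and the sample values are assumptions chosen for illustration, and a real pact would be authored in YAML or TOML rather than in TypeScript.

```typescript
// Hypothetical shape of a parsed pact file. All field names are illustrative assumptions.
type Tier = "Bronze" | "Silver" | "Gold" | "Platinum";

interface Predicate {
  id: string;          // stable predicate identifier
  subject: string;
  body: string;        // the behavioral constraint itself
  evidence: string[];  // evidence requirements
  penalty: string;     // penalty conditions
  scope: string;
}

interface Pact {
  metadata: {
    agentId: string;
    owner: string;
    reviewers: string[];
    version: string;            // MAJOR.MINOR.PATCH
    supersedes: string | null;  // previous version, if any
  };
  predicates: Predicate[];
  tierMapping: Record<Tier, string[]>; // tier -> required predicate ids
  dependencies: {
    skills: string[];
    mcpServers: string[];
    runtimeProfile: string;
    modelProvider: string;
  };
}

// A made-up example pact for a sample support agent.
const examplePact: Pact = {
  metadata: {
    agentId: "support-agent-001",
    owner: "support-platform",
    reviewers: ["reviewer-a", "reviewer-b"],
    version: "1.0.0",
    supersedes: null,
  },
  predicates: [
    {
      id: "latency-30s",
      subject: "response",
      body: "The agent will respond within 30 seconds",
      evidence: ["request timestamp", "response timestamp"],
      penalty: "applies on first violation",
      scope: "all requests",
    },
  ],
  tierMapping: { Bronze: [], Silver: ["latency-30s"], Gold: ["latency-30s"], Platinum: ["latency-30s"] },
  dependencies: { skills: [], mcpServers: [], runtimeProfile: "default", modelProvider: "example-provider" },
};
```

Keeping the predicate identifier separate from the predicate body is what lets tier mappings and other pacts refer to a predicate while its wording changes.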
The templates/ directory holds reusable predicate templates. A predicate template is a parameterized predicate that can be instantiated into specific pacts. "Latency-bounded response with citation requirement, parameterized by latency budget and citation source whitelist" is a template that produces concrete predicates when given specific values. Templates reduce duplication, make pacts comparable across agents, and concentrate review attention on the template definitions rather than on every instantiation.
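A parameterized template of the kind described above might look like the following. This is a sketch under assumptions: the function name instantiateLatencyCitation, the template identifier, and the parameter names are hypothetical, and the generated wording is only an example.

```typescript
// Parameters for a hypothetical "latency-bounded response with citation requirement" template.
interface LatencyCitationParams {
  latencyBudgetMs: number;
  citationSources: string[]; // whitelist of acceptable citation sources
}

interface ConcretePredicate {
  id: string;
  templateId: string;
  body: string;
  params: LatencyCitationParams;
}

// Instantiate the template into a concrete predicate, rejecting nonsensical parameters.
function instantiateLatencyCitation(id: string, params: LatencyCitationParams): ConcretePredicate {
  if (params.latencyBudgetMs <= 0) {
    throw new Error("latencyBudgetMs must be positive");
  }
  if (params.citationSources.length === 0) {
    throw new Error("citationSources must not be empty");
  }
  return {
    id,
    templateId: "latency-bounded-citation",
    body:
      `Respond within ${params.latencyBudgetMs} ms and cite a source from: ` +
      params.citationSources.join(", "),
    params,
  };
}

const supportPredicate = instantiateLatencyCitation("support-latency", {
  latencyBudgetMs: 30000,
  citationSources: ["docs.example.com"],
});
```

Because every instantiation records its templateId and parameters, two agents' predicates can be compared by parameter values alone.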
The validators/ directory holds the automated checks that run in CI. Each validator is a script that consumes the pact files and emits structured findings. Validators include: schema conformance (the pact parses), predicate consistency (predicates do not contradict each other within a pact), dependency consistency (every dependency referenced is declared in the dependencies section), template conformance (instantiated predicates match their template), tier coverage (every tier has the required predicates), and cross-pact consistency (predicates referenced by other pacts have not been removed without a deprecation cycle).
The evidence/ directory holds the evidence corpora used by validators that exercise predicates against synthetic or recorded inputs. Each predicate that has a behavioral check should have a corresponding evidence file: positive examples (inputs the agent should pass on), negative examples (inputs the agent should fail on), and edge cases (inputs that probe the predicate's boundaries). Evidence corpora are themselves code: versioned, reviewed, and used by CI.
The changelogs/ directory holds the human-readable changelog for each pact. Each changelog entry describes the change in plain language, the motivation, the affected predicates, the impact on tier mapping, the affected counterparties, and the deprecation plan if any. Changelogs are the document the pact's stakeholders read when they want to know why the pact changed.
The governance/ directory holds the metadata about the engineering practice itself: the list of authorized reviewers per pact, the merge protection rules, the deployment runbook, and the audit log of past changes that did not go through the standard review (emergency hotfixes, regulatory compliance changes, etc.). This directory is the meta-pact: the rules by which the pacts themselves are governed.
A seventh directory, archive/, can be added once the practice matures. It holds retired pacts (those of agents that have been deprecated), preserved for historical reconstruction. The archive is read-only and is the artifact that responds to compliance and audit queries about agents that no longer exist in the live registry.
Diff Conventions
A pact diff is not a free-form text diff. It is a structured semantic diff that surfaces the meaningful changes and suppresses the noise. The conventions for what counts as meaningful versus noise are themselves part of the practice and should be enforced by tooling.
Three categories of changes are meaningful and require explicit acknowledgment in the diff and the changelog: scope changes, semantic changes, and dependency changes. Scope changes alter which inputs a predicate applies to. Semantic changes alter what behavior the predicate requires for inputs in scope. Dependency changes alter which runtime capabilities the predicate relies on. All three change the agent's effective contract and require downstream notification.
Four categories of changes are typically noise but should still be tracked: wording improvements that do not alter scope or semantics, formatting adjustments, comment additions, and metadata updates that do not affect predicate logic. These changes can be approved with lighter review but should still be in the same git history so that the cumulative effect of small changes can be audited.
The diff format itself should make the categories visible. A pact diff tool should consume the before and after pact files, classify each change into one of the categories, and produce output that surfaces the meaningful changes prominently and the noise compactly. The classification should be conservative: when in doubt, classify a change as meaningful and require explicit acknowledgment that it is not.
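The conservative classification rule can be sketched as follows. This is an illustration under assumptions: the field names, the FIELD_CATEGORY table, and the acknowledgedWordingOnly flag are hypothetical, and a real classifier would work on parsed pact structures rather than on single string fields.

```typescript
type Category =
  | "scope" | "semantic" | "dependency"                // meaningful
  | "wording" | "formatting" | "comment" | "metadata"; // typically noise

interface FieldChange {
  field: string;
  before: string;
  after: string;
  acknowledgedWordingOnly?: boolean; // explicit reviewer acknowledgment (assumed mechanism)
}

// Hypothetical mapping from pact fields to change categories.
const FIELD_CATEGORY = new Map<string, Category>([
  ["scope", "scope"],
  ["body", "semantic"],
  ["evidence", "semantic"],
  ["penalty", "semantic"],
  ["dependencies", "dependency"],
  ["comment", "comment"],
  ["owner", "metadata"],
  ["reviewers", "metadata"],
]);

function classifyChange(change: FieldChange): Category {
  // Edits that differ only in whitespace are formatting, whatever the field.
  const squash = (s: string) => s.replace(/\s+/g, " ").trim();
  if (squash(change.before) === squash(change.after)) return "formatting";

  // Unknown fields default to "semantic": when in doubt, treat the change as meaningful.
  const category = FIELD_CATEGORY.get(change.field) ?? "semantic";

  // A body edit counts as wording only when someone has explicitly said so.
  if (change.field === "body" && change.acknowledgedWordingOnly === true) return "wording";
  return category;
}

function requiresAcknowledgment(category: Category): boolean {
  return category === "scope" || category === "semantic" || category === "dependency";
}
```

The default branch is the important design choice: a field the classifier has never seen is surfaced prominently rather than folded into the noise.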
The diff tool's classification rules should themselves be code, in the validators directory, and should be reviewed when changed. A team that adjusts its classification rules to make changes look less consequential than they are is undermining the practice. The rules should err on the side of treating changes as more significant than they appear.
A practical convention: every pact diff that changes a predicate should produce, automatically, a counterfactual evidence run: the same evidence corpus exercised against the old predicate and against the new predicate, with the divergence highlighted. The diff is not just "the text changed" but "the behavior changed in these specific ways on these specific inputs." This gives the reviewer concrete material to evaluate, rather than asking them to reason from text alone about what the change implies.
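A counterfactual evidence run can be sketched as a function that evaluates both predicate versions over the same corpus and reports where they disagree. The sketch assumes predicates can be modeled as boolean functions over an input record, that an out-of-scope input counts as a pass, and that business hours means 09:00 to 17:00; all names and sample data are hypothetical.

```typescript
interface EvidenceCase<I> {
  label: string;
  input: I;
}

type PredicateFn<I> = (input: I) => boolean;

interface Divergence {
  newlyPassing: string[]; // failed under the old predicate, pass under the new one
  newlyFailing: string[]; // passed under the old predicate, fail under the new one
}

function counterfactualRun<I>(
  corpus: EvidenceCase<I>[],
  oldPredicate: PredicateFn<I>,
  newPredicate: PredicateFn<I>,
): Divergence {
  const result: Divergence = { newlyPassing: [], newlyFailing: [] };
  for (const c of corpus) {
    const before = oldPredicate(c.input);
    const after = newPredicate(c.input);
    if (!before && after) result.newlyPassing.push(c.label);
    if (before && !after) result.newlyFailing.push(c.label);
  }
  return result;
}

// Example: a 30-second latency promise narrowed to business hours.
interface AgentResponse {
  latencySeconds: number;
  hourOfDay: number; // 0-23
}

const oldLatency: PredicateFn<AgentResponse> = (r) => r.latencySeconds <= 30;
// Outside 09:00-17:00 the narrowed predicate no longer applies, so it passes vacuously.
const newLatency: PredicateFn<AgentResponse> = (r) =>
  r.hourOfDay < 9 || r.hourOfDay >= 17 || r.latencySeconds <= 30;

const latencyCorpus: EvidenceCase<AgentResponse>[] = [
  { label: "fast-midday", input: { latencySeconds: 10, hourOfDay: 12 } },
  { label: "slow-midday", input: { latencySeconds: 45, hourOfDay: 12 } },
  { label: "slow-night", input: { latencySeconds: 45, hourOfDay: 2 } },
];

const divergence = counterfactualRun(latencyCorpus, oldLatency, newLatency);
```

On this corpus the only divergence is the slow overnight response, which the old predicate flags and the narrowed one lets through. That is the scope reduction made visible as behavior rather than as text.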
The Code Review Checklist
A pact code review is not a code review in the traditional sense. The reviewer is not checking that the code compiles or that the algorithm is correct. The reviewer is checking that the proposed change to the agent's behavioral contract is intentional, well-motivated, consistent with the agent's broader pact, communicated to affected counterparties, and rolled out safely.
The checklist has nine items, each of which the reviewer should be able to confirm explicitly before approving the change.
First, the change has a clear motivation in the changelog entry. "Wording cleanup" is not a motivation. "Tightening scope to exclude high-volume traffic from the strict latency predicate, because the latency target was producing false-positive violations during traffic spikes" is a motivation. The reviewer should be able to evaluate the motivation against the change and confirm they match.
Second, the change has been classified correctly. If the diff classifies a change as wording when it is actually a scope change, the reviewer rejects the diff and asks for reclassification. The classification determines the rest of the review process; it must be right.
Third, the change is consistent with the agent's other predicates. A new escalation predicate that conflicts with an existing latency predicate must either resolve the conflict in the change or be rejected. The validator's predicate-consistency check is the first line, but the reviewer should confirm the check covered the actual interaction, not just the syntactic compatibility.
Fourth, the change does not silently break a counterparty's expectation. If a predicate that a counterparty's contract depends on is being changed, the counterparty should be notified, and the notification should be referenced in the changelog. The reviewer can confirm by checking the counterparty registry (which agents and which deals reference this pact's predicates) and confirming that the notification has happened.
Fifth, the change has appropriate evidence coverage. New predicates require new evidence corpora. Modified predicates require updated evidence corpora. Removed predicates require evidence corpora to be archived rather than deleted. The reviewer confirms the evidence directory was updated consistently with the predicate change.
Sixth, the change includes the correct tier mapping update. If a predicate was required for Gold and is being relaxed, either the Gold tier mapping is updated or the relaxation is rejected. Tier mapping changes have economic consequences (agents at a tier may lose the tier) and require explicit acknowledgment.
Seventh, the change includes the correct dependency declarations. A new predicate that depends on a runtime capability should have the capability listed in the dependencies section. The reviewer cross-checks against the agent's actual dependency graph to confirm the declaration is complete.
Eighth, the change has an appropriate version bump. Pact versions follow a semantic versioning scheme analogous to software: major (breaking change to a predicate counterparties depend on), minor (additive change), patch (wording or formatting). The reviewer confirms the version bump matches the actual change category. A scope reduction with a patch bump is rejected.
Ninth, the change has a deprecation plan if any predicate is being removed or significantly tightened. The deprecation plan specifies the notice period (minimum 14 days for minor changes, 30 days for major), the affected counterparties, the migration path, and the fallback behavior during the deprecation window. The reviewer confirms the plan is present and reasonable.
The checklist is not a bureaucratic burden. It is the structure that prevents pact changes from quietly breaking systems that depend on them. A team that internalizes the checklist will, over time, write pact changes that pass review faster, because the changes will be authored with the checklist's questions already answered.
CI Validation: What Runs On Every Diff
The CI pipeline for a pact repo runs a battery of validators on every proposed change. The pipeline's job is to catch the failures the human reviewer cannot reasonably catch (schema violations, subtle inconsistencies, behavioral divergences on edge cases) so that the human reviewer can focus on the questions only a human can answer.
The schema validator confirms every pact file parses against the canonical schema. The schema is itself versioned, in the templates directory, and changes to the schema follow the same review process as changes to pacts.
The predicate consistency validator exercises every pair of predicates within a pact against a test corpus to surface contradictions. "Predicate A requires a response within 30 seconds" and "Predicate B requires a multi-step plan that takes 45 seconds" are contradictory. The validator flags the pair and produces an example input that triggers the contradiction.
The template conformance validator checks that every instantiated predicate matches its template's structure. If the template says a latency predicate must have a fallback behavior specified, an instantiation without the fallback is rejected.
The tier coverage validator confirms every tier has the predicates it is expected to require. If Gold is defined as requiring predicates A, B, C, and D, and a pact removes predicate C without updating the Gold mapping, the validator catches it.
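The Gold example above can be expressed as a small check. This is a minimal sketch: the function name checkTierCoverage and the finding shape are assumptions, not the template's actual validator.

```typescript
interface TierFinding {
  tier: string;
  missingPredicateId: string;
}

// Report every predicate id that a tier requires but the pact no longer defines.
function checkTierCoverage(
  predicateIds: string[],
  tierMapping: Record<string, string[]>,
): TierFinding[] {
  const defined = new Set(predicateIds);
  const findings: TierFinding[] = [];
  for (const [tier, required] of Object.entries(tierMapping)) {
    for (const id of required) {
      if (!defined.has(id)) {
        findings.push({ tier, missingPredicateId: id });
      }
    }
  }
  return findings;
}

// Predicate C was removed from the pact, but the Gold mapping still requires it.
const tierFindings = checkTierCoverage(["A", "B", "D"], { Gold: ["A", "B", "C", "D"] });
```

An empty findings array means every tier's requirements still resolve to predicates in the pact.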
The cross-pact consistency validator looks for predicates that other pacts reference and confirms they have not been silently removed or renamed. Cross-pact references are common (a delegating agent's pact may reference the delegate's pact's predicates as preconditions), and the references must remain valid through changes.
The evidence regression validator runs the new pact against the existing evidence corpus and reports the divergence from the old pact. The reviewer sees "this change causes the agent to pass on N additional inputs and fail on M additional inputs that previously passed," with concrete examples. This is the most useful validator output, because it makes the behavioral consequence of the diff concrete.
The dependency consistency validator confirms every dependency referenced by a predicate is declared in the dependencies section, and conversely that every declared dependency is actually used by some predicate. Orphan dependencies (declared but unused) should be removed. Implicit dependencies (used but undeclared) must be added.
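The two-way check amounts to a pair of set differences. The sketch below assumes each predicate lists the dependency names it uses; the function name and input shapes are hypothetical.

```typescript
interface DependencyReport {
  orphans: string[];  // declared but unused
  implicit: string[]; // used but undeclared
}

function checkDependencies(
  declared: string[],
  usedByPredicate: Record<string, string[]>, // predicate id -> dependency names it uses
): DependencyReport {
  const declaredSet = new Set(declared);
  const used = new Set<string>();
  for (const deps of Object.values(usedByPredicate)) {
    for (const dep of deps) used.add(dep);
  }
  return {
    orphans: declared.filter((dep) => !used.has(dep)),
    implicit: Array.from(used).filter((dep) => !declaredSet.has(dep)),
  };
}

const dependencyReport = checkDependencies(
  ["search-skill", "legacy-skill"],
  { "cite-sources": ["search-skill"], "escalate": ["pager-mcp"] },
);
```

In this example legacy-skill is an orphan to remove, and pager-mcp is an implicit dependency that must be declared.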
The changelog presence validator confirms the changelog entry exists and has the required fields: motivation, affected predicates, impact on tier mapping, affected counterparties, deprecation plan if applicable. Missing or incomplete changelog entries reject the diff.
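A changelog presence check can be as simple as listing the missing fields. The key names below (motivation, affectedPredicates, tierImpact, affectedCounterparties, deprecationPlan) are assumed spellings of the required fields; the source names the fields but not their keys.

```typescript
// Fields every changelog entry must carry (key names are assumptions).
const REQUIRED_FIELDS = ["motivation", "affectedPredicates", "tierImpact", "affectedCounterparties"] as const;

// Returns the names of required fields that are absent or blank.
function missingChangelogFields(
  entry: Record<string, unknown>,
  needsDeprecationPlan: boolean,
): string[] {
  const required: string[] = [...REQUIRED_FIELDS];
  if (needsDeprecationPlan) required.push("deprecationPlan");
  return required.filter((field) => {
    const value = entry[field];
    return (
      value === undefined ||
      value === null ||
      (typeof value === "string" && value.trim() === "")
    );
  });
}

const incompleteEntry = {
  motivation: "Exclude traffic spikes from the strict latency predicate",
  affectedPredicates: ["latency-30s"],
  tierImpact: "",
};
const missingFields = missingChangelogFields(incompleteEntry, true);
```

A non-empty result rejects the diff. An empty array is deliberately treated as present here, since "no affected counterparties" is a valid answer while a missing key is not.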
The whole pipeline should run in under five minutes on a typical pact diff. The runtime budget matters because slow CI degrades into ignored CI. If the pipeline becomes a bottleneck, the validators should be parallelized, the evidence corpora should be sampled rather than exhaustively run, and the most expensive checks should be moved to a nightly cadence with diff-time fast-paths.
Semantic Versioning For Pacts
Pacts use a semantic versioning scheme analogous to software libraries: MAJOR.MINOR.PATCH. The scheme exists so that counterparties that depend on a pact's behavior can express which versions they are compatible with, and so that pact authors can communicate the impact of a change in a single number.
A major version bump indicates a breaking change. Breaking, in pact terms, means a change that invalidates an assumption a downstream counterparty might be reasonably making. Removing a predicate is breaking. Tightening a predicate's scope so that fewer inputs satisfy it is breaking. Loosening a predicate's required evidence is breaking. Adding a new mandatory escalation pathway is breaking. Changing the penalty regime is breaking. Major bumps require the full deprecation cycle and the longest notice period.
A minor version bump indicates an additive change that does not invalidate prior assumptions. Adding a new predicate that does not contradict existing ones is minor. Adding a new tier above the existing tiers is minor. Adding new dependencies that are themselves backward-compatible is minor. Adding optional evidence requirements is minor. Minor bumps require notification to counterparties but not the full deprecation cycle.
A patch version bump indicates a change with no semantic consequence: wording, formatting, comments, metadata. Patch bumps require the standard review but do not require deprecation or notification. The validator should reject any patch bump that the diff classifier identifies as actually being a minor or major change; the bump must match the change.
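The rule that the bump must match the change can be sketched as a comparison between the declared bump and the level the diff classifier requires. This is a sketch with hypothetical function names. It rejects only under-sized bumps, which is an assumption; a team could equally require an exact match.

```typescript
type Bump = "major" | "minor" | "patch";

const BUMP_RANK: Record<Bump, number> = { patch: 0, minor: 1, major: 2 };

function parseVersion(version: string): [number, number, number] {
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(version);
  if (!match) throw new Error(`not a MAJOR.MINOR.PATCH version: ${version}`);
  return [Number(match[1]), Number(match[2]), Number(match[3])];
}

// Work out which kind of bump the version change declares.
function declaredBump(from: string, to: string): Bump {
  const [fromMajor, fromMinor, fromPatch] = parseVersion(from);
  const [toMajor, toMinor, toPatch] = parseVersion(to);
  if (toMajor === fromMajor + 1 && toMinor === 0 && toPatch === 0) return "major";
  if (toMajor === fromMajor && toMinor === fromMinor + 1 && toPatch === 0) return "minor";
  if (toMajor === fromMajor && toMinor === fromMinor && toPatch === fromPatch + 1) return "patch";
  throw new Error(`invalid version step: ${from} -> ${to}`);
}

// True when the declared bump is at least as large as the change requires.
function bumpIsSufficient(from: string, to: string, required: Bump): boolean {
  return BUMP_RANK[declaredBump(from, to)] >= BUMP_RANK[required];
}
```

For example, bumpIsSufficient("2.1.0", "2.1.1", "major") is false, which is the "scope reduction with a patch bump" case the checklist rejects.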
The version is part of the agent's identity in the trust layer. An agent's score is computed against the pact version that was in effect at the time of each evaluated behavior. Changing the pact version triggers a scoring re-baseline: behavior under version 2.1.0 is scored against version 2.1.0's predicates, even if version 3.0.0 has since been published. The trust oracle exposes the pact version in its responses, so counterparties can confirm they are reasoning about the version they expect.
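Scoring against the version in effect at the time of the behavior reduces to a lookup by timestamp. The sketch below assumes each published version records when it took effect and that exactly one version is in effect at any instant; the effectiveFrom field and the dates are hypothetical.

```typescript
interface PublishedVersion {
  version: string;
  effectiveFrom: number; // epoch milliseconds
}

// Return the version that was in effect when the behavior occurred, or null if none was.
function versionInEffect(versions: PublishedVersion[], behaviorAt: number): string | null {
  let current: PublishedVersion | null = null;
  for (const candidate of versions) {
    const alreadyEffective = candidate.effectiveFrom <= behaviorAt;
    if (alreadyEffective && (current === null || candidate.effectiveFrom > current.effectiveFrom)) {
      current = candidate;
    }
  }
  return current === null ? null : current.version;
}

const publishedVersions: PublishedVersion[] = [
  { version: "2.1.0", effectiveFrom: Date.UTC(2024, 0, 1) },
  { version: "3.0.0", effectiveFrom: Date.UTC(2024, 2, 1) },
];

// Behavior from February is scored against 2.1.0 even though 3.0.0 has since been published.
const februaryVersion = versionInEffect(publishedVersions, Date.UTC(2024, 1, 1));
```

The lookup never consults the latest version, only the one that applied at the time.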
Downstream contracts (deals, escrows, swarm participation rules) can pin to specific pact versions or to version ranges. "This deal requires the agent to maintain a pact in the 2.x.x major line" is a common pattern. Pinning protects the counterparty from surprise major changes. The pact author's version bumps are then constrained by the population of pinned counterparties, in the same way that a library author's version bumps are constrained by their downstream users.
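A pin such as "2.x.x" can be checked with a component-wise comparison. This is a minimal sketch that supports only exact components and the "x" wildcard; the function name satisfiesPin is hypothetical, and richer range syntax is out of scope.

```typescript
// True when every component of the pin is either the wildcard "x" or equal to the version's component.
function satisfiesPin(version: string, pin: string): boolean {
  const versionParts = version.split(".");
  const pinParts = pin.split(".");
  if (versionParts.length !== 3 || pinParts.length !== 3) return false;
  return pinParts.every((part, i) => part === "x" || part === versionParts[i]);
}

const staysInMajorLine = satisfiesPin("2.4.1", "2.x.x");    // minor and patch may float
const crossesMajorLine = satisfiesPin("3.0.0", "2.x.x");    // a major bump breaks the pin
```

A deal pinned to "2.x.x" keeps accepting minor and patch releases and stops matching at the first major bump, which is the protection the pin exists to provide.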
Rollout Strategy: How Pact Changes Reach Production
A merged pact change does not immediately affect the live agent. The merge triggers a deployment pipeline that progressively rolls out the change with explicit stages, each of which can halt the rollout if signals indicate problems.
Stage one is the staging registry. The new pact version is published to a staging trust registry that mirrors production but is queried only by staging traffic and by the agent's own pre-production tests. The agent runs against the staging pact for a defined window (typically 24 hours), and the trust oracle scores its behavior under the new pact. The staging score is compared against the prior score under the old pact. Significant degradation halts the rollout.
Stage two is the canary release. The new pact version is published to production but only for a fraction of the agent's traffic: initially 5%, then ramping. The trust oracle scores the canary traffic separately. The canary score is compared against the non-canary traffic's score under the old pact. Significant divergence halts the rollout. Canary periods are typically 24 to 72 hours depending on the change's risk profile.
Stage three is the full rollout. The new pact version is published to production for all traffic. The old version remains available in the registry for historical scoring of behavior that occurred before the rollout. The deprecation timer for the old version starts; after the deprecation window expires, the old version moves to the archive directory.
At every stage, the rollout can be reverted in a single git revert plus a re-run of the pipeline. Reverting is cheap and should be exercised liberally; a slow degradation that is reverted in two hours is far less consequential than one that is debated for a week. The pact engineering practice should normalize fast reverts the way infrastructure engineering normalizes fast rollbacks.
The rollout pipeline should produce, at every stage, observability data: which agent behaviors changed, how the trust score moved, which counterparties were affected, and what the projected long-term impact is. The data is consumed by the rollout dashboard, where the pact engineer monitors the rollout in real time and makes promotion or revert decisions based on observed evidence.
This level of rigor will feel excessive for the first dozen pacts. It will feel essential by the hundredth. Teams that adopt the practice early discover that they can change pacts more frequently, not less, because each change is contained and reversible. Teams that delay adoption discover that pact changes become rare, ceremonial events that take weeks of coordination, because every change carries the latent risk of a silent regression that no one can debug.
Notification, Deprecation, And The Counterparty Contract
A pact is a contract with the agent's counterparties. Changing the contract is a notice-requiring action, the way changing a software library's API is a notice-requiring action for its users. The pact engineering practice formalizes the notification.
The pact registry maintains a counterparty index for each pact: the set of agents, deals, escrows, swarms, and external systems that reference the pact's predicates. When a pact change is proposed, the index is queried, and the affected counterparties are listed in the diff. The reviewer sees, before approving, exactly who is downstream of the change.
Notification is automated. Major and minor changes trigger an outbound message to each affected counterparty's registered notification endpoint, with the diff, the changelog, the planned rollout schedule, and the deprecation timeline. The notification includes a structured machine-readable summary so that automated counterparties can ingest it and update their own contracts or expectations.
Deprecation windows are minimum durations, not target durations. The minimum for a major change is 30 days. The minimum for a minor change is 14 days. The minimum for a patch is zero. Pact authors can extend the windows beyond the minimum but cannot shorten them. The minimums exist because counterparties need time to react, and shortening them externalizes the cost of the change onto them.
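The minimums above can be enforced mechanically. A minimal sketch, with a hypothetical function name and message format:

```typescript
type ChangeLevel = "major" | "minor" | "patch";

// Minimum notice in days, taken from the minimums stated above.
const MIN_NOTICE_DAYS: Record<ChangeLevel, number> = { major: 30, minor: 14, patch: 0 };

// Returns null when the proposed window is acceptable, otherwise an explanation.
function checkNoticeWindow(level: ChangeLevel, proposedDays: number): string | null {
  const minimum = MIN_NOTICE_DAYS[level];
  if (proposedDays >= minimum) return null;
  return `${level} change needs at least ${minimum} days of notice, got ${proposedDays}`;
}
```

Authors can pass any window at or above the minimum, so checkNoticeWindow("major", 45) returns null while checkNoticeWindow("major", 21) returns an error message.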
During the deprecation window, the old pact version remains active. Counterparties can continue to operate against it. New behavior is scored against whichever version was in effect for the agent at the time of the behavior. After the window expires, new behavior is scored against the new version exclusively, and the old version moves to the archive.
Emergency changes, typically required by safety incidents or regulatory directives, bypass the deprecation window but are subject to the post-hoc audit log. The change is applied immediately, the affected counterparties are notified within 24 hours of the change rather than before it, and the change is logged in the governance directory with the justification and the approving authority. Emergency changes should be rare; if a team is using them frequently, the pact engineering practice has broken down upstream.
The Reader Artifact: The Pact-As-Code Repo Template
The artifact this post produces is a Pact-As-Code Repo Template that any team can clone and adopt as their starting point. The template is opinionated about structure and tooling but flexible about content; it gives the team the skeleton without prescribing the substance.
The template ships with the six directories described above (pacts, templates, validators, evidence, changelogs, governance), each with a README that explains the directory's purpose and conventions. The pacts directory contains a single example pact for a sample support-agent scenario, illustrating the file format. The templates directory contains five reusable predicate templates covering the common patterns: latency-bounded response, citation-required response, scope-bounded action, escalation-on-condition, and refusal-with-reason.
The validators directory contains seven implemented validators corresponding to the CI checks described above: schema, predicate consistency, template conformance, tier coverage, cross-pact consistency, evidence regression, and changelog presence. Each validator is a TypeScript script that accepts the pact directory as input and emits structured findings as JSON. The validators can be wired into any CI system via a single shell command.
The evidence directory contains an example evidence corpus for the sample pact, with positive examples, negative examples, and edge cases. The format is JSONL, with each line containing an input and the expected predicate-evaluation outcome. The corpus is intentionally small (about 50 examples) to keep the template approachable; production corpora would be larger.
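Reading such a corpus might look like the sketch below. The key names input and expected, and the "pass" / "fail" outcome values, are assumptions; the template's actual JSONL keys are not specified here.

```typescript
interface EvidenceLine {
  input: unknown;
  expected: "pass" | "fail";
}

// Parse a JSONL corpus: one JSON object per non-blank line.
function parseEvidenceJsonl(text: string): EvidenceLine[] {
  const parsed: EvidenceLine[] = [];
  text.split("\n").forEach((raw, index) => {
    const line = raw.trim();
    if (line === "") return; // tolerate blank lines
    const value = JSON.parse(line);
    const isObject = typeof value === "object" && value !== null;
    if (!isObject || !("input" in value) || (value.expected !== "pass" && value.expected !== "fail")) {
      throw new Error(`line ${index + 1}: expected an object with "input" and "expected" ("pass" or "fail")`);
    }
    parsed.push({ input: value.input, expected: value.expected });
  });
  return parsed;
}

const sampleCorpus = parseEvidenceJsonl(
  '{"input": {"latencySeconds": 10}, "expected": "pass"}\n' +
  '{"input": {"latencySeconds": 45}, "expected": "fail"}\n',
);
```

Failing loudly with a line number keeps a malformed corpus from silently shrinking the set of cases CI exercises.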
The changelogs directory contains the changelog for the sample pact's history, illustrating the changelog format. The first entry is the initial version. Subsequent entries demonstrate a minor addition, a patch wording change, and a major scope reduction with deprecation plan.
The governance directory contains the reviewer policy, the merge protection rules, the deployment runbook, and an empty audit log for the team to populate as they operate the practice.
The template includes a top-level README that walks through the first month of adoption: how to migrate an existing pact into the structure, how to wire the validators into CI, how to run the first reviewed change, how to operate the staging and canary stages, and how to communicate the practice to the team's counterparties.
The template is not a turnkey solution. It is a structured starting point that captures the engineering practice in a form a team can immediately work from, rather than asking them to invent the practice from scratch. The first month of adoption will involve adjustments, because the template's conventions will collide with the team's existing tooling. The template is designed to be adjusted; the directories are stable, but the validators are intentionally simple so that teams can extend or replace them as their practice matures.
Counter-Argument: This Is Premature For Most Teams
A reasonable objection is that pact-as-code is overkill for teams that have a handful of agents and modest counterparty risk. The objection has weight. A two-person startup with one agent does not need a six-directory repository, a CI validator suite, and a 30-day deprecation cycle. Imposing the practice early would be ceremony for ceremony's sake.
The response is not to defend the full practice for every team but to articulate the maturity ladder it sits on top of. At the bottom of the ladder, a single pact in a single file in any repository, with changes reviewed by a single human, captures most of the benefit. The pact is in version control. Changes are diffs. Diffs are reviewed. That is enough for the first five agents.
At the next rung, the validators directory appears, with one or two validators that catch the most common errors (schema and changelog presence are the highest-leverage). The repo grows a templates directory only when the team finds itself copy-pasting predicates across pacts. The evidence directory appears when the team starts running the first behavioral regression that catches a real bug.
The full practice (staging registry, canary rollouts, formal deprecation windows, counterparty index) appears when the team has enough downstream counterparties that an unannounced change would actually hurt someone. For most teams, that threshold is between 20 and 50 active counterparties per agent. Below that, the practice can be lightweight. Above it, the practice's overhead is dwarfed by the cost of the alternative.
The practice scales down well precisely because it is structured the way infrastructure-as-code is structured. A team running Terraform for one VPC does not need the multi-region deployment pipeline, the Sentinel policies, and the cost-attribution dashboards. They need a .tf file in version control and a terraform plan step before applying. The same gradient applies to pact-as-code. The full practice is a destination; the journey starts with version control and review.
What Armalo Does
Armalo's pact registry is built to consume pact-as-code repositories. The registry exposes a structured API for publishing pacts from a CI pipeline, with version metadata, changelog content, deprecation timelines, and counterparty notification all flowing through the same submission. Pacts published this way carry a provenance attestation that links the registry entry back to the git commit, the reviewer, and the validator pipeline output.
The trust oracle's scoring engine consumes pact versions explicitly. Behavior is scored against the version that was in effect at the time of the behavior, and the oracle's responses include the pact version so counterparties can audit which contract was actually being evaluated. Multi-version pacts during deprecation windows are handled natively; the oracle scores behavior against whichever version the agent was operating under.
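The idea of scoring behavior against the version in effect at the time can be sketched as a lookup over a sorted version history. This is an illustration only, not Armalo's implementation: the `HISTORY` data and the `version_in_effect` function are hypothetical, and the sketch assumes a single linear timeline, so it does not model the overlapping versions of a deprecation window.

```python
from bisect import bisect_right
from datetime import datetime, timezone

# Hypothetical history of (effective_from, version), sorted by effective_from.
HISTORY = [
    (datetime(2025, 1, 1, tzinfo=timezone.utc), "1.0.0"),
    (datetime(2025, 3, 1, tzinfo=timezone.utc), "1.1.0"),
    (datetime(2025, 6, 1, tzinfo=timezone.utc), "2.0.0"),
]


def version_in_effect(history, at):
    """Return the version whose effective_from is the latest one not after `at`."""
    starts = [start for start, _ in history]
    # bisect_right counts how many start times are <= `at`.
    idx = bisect_right(starts, at)
    if idx == 0:
        raise LookupError("no pact version was in effect at that time")
    return history[idx - 1][1]


behavior_time = datetime(2025, 4, 15, tzinfo=timezone.utc)
scored_against = version_in_effect(HISTORY, behavior_time)
```

Returning the version alongside the score is what lets a counterparty audit which contract was evaluated.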
The registry also maintains the counterparty index automatically. Deals, escrows, swarms, and other on-chain commitments that reference an agent's pact predicates are surfaced in the index, and pact changes that affect referenced predicates trigger automated notifications to the affected counterparties through their registered endpoints. The notification carries the diff, the changelog, and the deprecation timeline in a structured machine-readable format.
The Pact-As-Code Repo Template described in this post is published in the Armalo open-source repository and is available for any team to clone. The template's validators and conventions are designed to interoperate with the Armalo registry API but do not require it; teams that want to manage pacts as code without using Armalo's registry can adopt the template and route the published artifacts elsewhere.
FAQ
What format should pact files use: YAML, TOML, JSON, or something custom? YAML is the default recommendation. It is auditable in diff, has wide tool support, and is forgiving of human authors. JSON is too strict for human authoring. TOML is fine but less common. A custom DSL is overkill for the first generation of practice and can be added once the team's needs are clear.
Should pact repos be monorepos or per-agent? Monorepos for teams that have a coherent pact governance practice across agents. Per-agent for teams where agents are independent products with separate authorship. The deciding factor is whether the validators and templates are shared. If they are, monorepo. If they are not, per-agent.
How do you handle pacts authored by external parties (third-party agents, customer agents)? Two patterns. The first is contributor-mode: the external party submits pacts via pull request to the team's repo, with the same review and CI as internal pacts. The second is registry-mode: the external party manages their own pact-as-code repo and the team's registry pulls from theirs. Both work; the choice depends on the trust relationship.
What if a counterparty refuses to accept a major change during the deprecation window? The pact author can extend the window, negotiate a custom transition, or proceed with the change knowing the counterparty may exit. The decision is a business decision, not a pact-engineering decision. The practice's job is to surface the question and provide the data; the answer is the team's.
How do you version templates? Same scheme as pacts. Templates have semantic versions. Pacts that instantiate a template pin to a specific template version. Changing a template at version 2.1 produces version 2.2 (minor) or 3.0 (major), and pacts pinning to 2.x are unaffected until they explicitly upgrade.
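A minimal sketch of that pinning behavior, assuming two-part `major.minor` template versions as in the example above. The `bump`, `publish`, and `resolve` helpers and the template bodies are hypothetical names and values chosen for illustration.

```python
def bump(version: str, change: str) -> str:
    """Compute the next template version for a minor or major change."""
    major, minor = (int(part) for part in version.split("."))
    if change == "major":
        return f"{major + 1}.0"
    if change == "minor":
        return f"{major}.{minor + 1}"
    raise ValueError(f"unknown change type: {change}")


def publish(templates: dict, current: str, change: str, body: str) -> str:
    """Store the new body under a new version; existing versions are never mutated."""
    new_version = bump(current, change)
    if new_version in templates:
        raise ValueError(f"template version {new_version} already exists")
    templates[new_version] = body
    return new_version


def resolve(templates: dict, pinned_version: str) -> str:
    """A pact resolves exactly the template version it pinned."""
    return templates[pinned_version]


templates = {"2.1": "max_latency_ms: 500"}
pact_pin = "2.1"  # the pact pins a specific template version
new_version = publish(templates, "2.1", "minor", "max_latency_ms: 400")
```

Because published versions are immutable, the pact pinned to 2.1 keeps resolving the old body until its author changes `pact_pin`.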
Can the validators ever be too strict? Yes, and the symptom is that authors stop adding evidence corpora because the regression validator becomes a chore. The fix is to tune the validators based on actual usage; a validator that produces more noise than signal should be relaxed or removed. The practice should be opinionated but not punitive.
What happens to in-flight evaluations when the pact version changes mid-evaluation? The evaluation is completed against the version that was in effect at the time it started. Behavior observed after a mid-evaluation change is still scored against the original version. This is implemented by snapshotting the pact version into the evaluation context at start time.
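The snapshotting idea can be sketched with an immutable evaluation context. The `Registry` and `EvaluationContext` classes and the `start_evaluation` function are hypothetical stand-ins for illustration, not the registry's actual types.

```python
from dataclasses import dataclass


class Registry:
    """Hypothetical stand-in for a pact registry with one mutable current version."""

    def __init__(self, current_version: str) -> None:
        self.current_version = current_version


@dataclass(frozen=True)
class EvaluationContext:
    """Immutable context; the pact version is copied in once, at start time."""

    evaluation_id: str
    pact_version: str


def start_evaluation(evaluation_id: str, registry: Registry) -> EvaluationContext:
    # Copy the version string rather than holding a reference to the registry,
    # so later publishes cannot change what this evaluation scores against.
    return EvaluationContext(evaluation_id, registry.current_version)


registry = Registry("1.4.0")
context = start_evaluation("eval-1", registry)
registry.current_version = "2.0.0"  # a new pact version ships mid-evaluation
```

After the publish, `context.pact_version` is still `"1.4.0"`, so the in-flight evaluation finishes against the version it started with.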
Should pact changes be subject to legal review? For pacts that have material counterparty obligations (escrowed deals, regulated industries), yes. The legal review is integrated into the code review process: a designated legal reviewer is added to the reviewer set for high-stakes pacts, and their approval is required before merge. The practice does not replace legal review; it provides the structure within which legal review can be effectively conducted.
Bottom Line
Pacts are code. They encode behavior, they have logical structure, and changes to them have economic consequences for the agent and its counterparties. Treating them as configuration (editing them in web forms, applying changes immediately, with no review or rollback) invites silent regressions that are nearly impossible to debug after the fact. The alternative is to adopt the engineering practice that infrastructure teams have spent a decade refining: version control, structured diffs, peer review, automated CI validation, semantic versioning, staged rollouts, and explicit deprecation cycles. The Pact-As-Code Repo Template is the starting point (six directories, seven validators, opinionated conventions) that lets a team adopt the practice without inventing it. The full practice is a destination, not a starting position; the journey starts with putting the pact in git and requiring a review for changes. The teams that internalize this early will have agents whose contracts are stable, whose counterparties trust them, and whose trust scores reflect their actual behavior. The teams that resist will discover, through outage after outage, that pact configuration drift is the new database connection string and that the lesson must be learned the hard way before it can be learned at all.