Pact Drift Telemetry: Building The Dashboard That Tells You An Agent Is Changing
Drift detection catches it. Drift telemetry shows it. The dashboard that tells you an agent's behavior is silently changing, and the four charts that matter most.
TL;DR
Drift detection identifies that an agent's behavior has changed. Drift telemetry is what you stare at every morning to see it changing in real time. Most teams have neither, because both require infrastructure no one builds until after the first painful drift incident. This post is the dashboard essay: the four charts every pact drift dashboard should put on the wall, the underlying telemetry pipeline that feeds them, the alert design that turns drift into action without becoming noise, and the human review process that runs alongside the dashboard. Reader artifact: a Pact Drift Dashboard Spec covering the exact panel layout, query semantics, alert thresholds, and operational protocol for a working drift dashboard, ready to implement against any trust-layer telemetry source.
Intro
The agent that taught us drift telemetry was a customer-onboarding assistant. It had been deployed for nine weeks, scored consistently in the high 80s on its composite, and had received exactly zero pact-violation alerts. The product team loved it. The trust team had moved on to other agents. The dashboard for it was a single tile that said "Composite: 87. All systems normal."
The first sign of trouble was a customer complaint that the agent had recommended a competitor's product when asked about pricing. A spot check by the support team turned up two more recent transcripts where the agent had recommended competitors. A wider audit found that the rate of competitor recommendations had been climbing for six weeks, from 0.2% of pricing inquiries to 4.3%, and would have continued climbing if the customer had not complained.
Nothing in the trust system had flagged it. The composite score had stayed in the high 80s the whole time. The pact had a predicate forbidding competitor recommendations, but the predicate was scored against a sampled subset of conversations, and the sampling had not caught the climb because the climb was concentrated in pricing inquiries, which were under-represented in the sample. The drift was real, the drift was steady, and the drift was invisible.
The debug took four days. The cause turned out to be a slow-rolling change in the agent's retrieval index: newly indexed documents had begun to include comparative pricing tables that mentioned competitors, and the agent's retrieval scoring had been promoting those documents into context more often as their similarity weights increased. The agent itself had not changed. The retrieval substrate had drifted, and the agent's behavior had drifted with it. The drift was a derivative effect of an upstream system change that was not, by anyone's definition, a pact change.
The team built a drift dashboard the following week. Within a month, the dashboard caught three more drift incidents, none customer-facing and all caught early. The pattern was the same in each: a slow-rolling change in some upstream substrate, a derivative effect on the agent's behavior, and a metric that had been climbing or falling for weeks below any alert threshold but visible immediately when plotted on the right chart.
This is the case for drift telemetry. Drift detection, the statistical machinery that decides whether an agent's behavior has shifted, is a backend capability. Drift telemetry is the user interface that makes drift visible to humans, every day, before the statistical detector trips. The detector tells you the drift exists. The telemetry tells you what it looks like, where it started, and what it is doing right now. The two are complementary, and most teams have neither.
This post specifies the dashboard. We will cover the four core charts, the supporting panels, the alert design, the underlying telemetry pipeline, and the human protocol that runs around the dashboard. The output is a Pact Drift Dashboard Spec that any team can implement against their existing telemetry, regardless of the specific tools or trust-layer they use.
What Drift Telemetry Is And Is Not
Drift telemetry is a continuously-updated view of how an agent's behavior is shifting over time, across the dimensions that matter for its pact compliance. It is not a substitute for drift detection (the statistical decision that drift has occurred), but it is the surface where humans observe drift before, during, and after the detector trips.
The distinction matters because drift detection alone produces a binary signal: drift or no drift. Telemetry produces a continuous signal: how much, how fast, in which direction, on which inputs. The continuous signal is what humans need to make decisions. A binary alert that says "drift detected" without the continuous signal sends the human into an investigation with no context. A continuous signal without the alert leaves the human responsible for noticing changes, which they will not do reliably.
The combined approach is to use detection for alerting and telemetry for context. The detector trips at a defined statistical threshold. The alert names the dimension that tripped and the magnitude of the trip. The telemetry dashboard, opened in response to the alert, provides everything else: the time series leading up to the trip, the breakdown by input class, the comparison to historical baseline, the relationship to other dimensions that did not trip but may be related.
Drift telemetry is also not the trust-layer composite score. The composite is a summary metric designed for outside observers: counterparties, dealmakers, marketplace consumers. The composite is necessarily lossy; it compresses twelve dimensions into one number. The drift dashboard is for inside observers: the agent's owners, the platform team, the pact engineer. They need the dimensions disaggregated, the time series at high resolution, and the input breakdowns that the composite hides.
The dashboard's audience determines its design. The agent's developer needs the dashboard to debug. The platform engineer needs it to plan capacity and runtime changes. The pact engineer needs it to decide when to revise a predicate. The trust officer needs it to surface incidents to leadership. Each audience uses the same charts but reads them differently, and the dashboard should be designed to support all four readings without forcing a different dashboard for each.
The Four Core Charts
A pact drift dashboard has four core charts that should be on every implementation. Other charts are useful additions, and we will cover the supporting panels in the next section, but these four are non-negotiable. They are the four primary dimensions along which agent behavior drifts most consequentially in production.
Chart One: Rolling Distribution Divergence. This chart plots the divergence between the agent's recent output distribution and a reference baseline distribution, computed continuously over a rolling window. The metric is typically Jensen-Shannon divergence or KL divergence, computed across the agent's response embeddings, response length distribution, sentiment distribution, or any other feature that characterizes its output. The reference baseline is either a fixed historical period (e.g., the agent's first month of stable operation) or a sliding window from N weeks ago.
The x-axis is time. The y-axis is divergence value. The expected pattern is a flat line near zero with bounded variance. A rising trend line indicates the agent's output distribution is shifting away from baseline, regardless of whether any individual response is anomalous. A sustained shift above a defined threshold indicates the agent is now consistently producing outputs that differ from its historical norm.
This chart is the most important because it catches drift in dimensions the pact does not explicitly enumerate. Pacts are written about specific behaviors; distribution divergence captures behaviors the pact authors did not anticipate but that nonetheless represent meaningful change.
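As a rough illustration of the metric behind Chart One, the sketch below computes Jensen-Shannon divergence between a baseline and a recent window, using response length bucketed into a histogram as the feature. The function names, the bucket width, and the sample token counts are assumptions made for the example; a real implementation would more likely work over embeddings or several features at once.

```python
import math
from collections import Counter

def histogram(values, bucket_width):
    """Bucket integer values and return {bucket: probability}."""
    counts = Counter(v // bucket_width for v in values)
    total = sum(counts.values())
    return {bucket: count / total for bucket, count in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2, so the result lies in [0, 1])."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # KL(a || m); m[k] > 0 wherever a[k] > 0, so the ratio is defined
        return sum(pa * math.log2(pa / m[k]) for k, pa in a.items() if pa > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)

# Hypothetical response lengths in tokens for two windows
baseline = histogram([120, 130, 125, 140, 118, 135], bucket_width=50)
recent = histogram([120, 260, 310, 140, 290, 135], bucket_width=50)
divergence = js_divergence(baseline, recent)
```

Plotting one such value per rolling window gives the time series described above. Jensen-Shannon is symmetric and bounded, which makes a fixed y-axis and a fixed threshold easier to reason about than with raw KL divergence.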
Chart Two: Refusal Rate Trend. This chart plots the agent's refusal rate over time, broken down by refusal category: safety refusals, capability refusals, scope refusals, policy refusals, and other. The x-axis is time. The y-axis is refusal rate as a fraction of all responses. Each category is a separate line.
The expected pattern depends on the agent's stable refusal profile. A support agent might refuse 2% of requests for capability reasons and 0.3% for safety reasons, with both lines flat. A change in either line, up or down, is a signal. An upward shift in capability refusals may indicate the runtime is degrading, the model is becoming more conservative, or input distribution is shifting. A downward shift in safety refusals may indicate the model has become less aligned with the safety predicate, or the input distribution has shifted away from safety-triggering inputs.
This chart is critical because refusals are the most semantically meaningful single signal an agent emits. A refusal is the agent declaring that it will not do something. A change in refusal patterns is a change in the agent's effective contract.
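A minimal sketch of the series behind Chart Two, assuming each response has already been labeled with a window and either a refusal category or None by an upstream classifier. The event shape and the window labels are hypothetical; only the five category names come from the chart description.

```python
from collections import Counter, defaultdict

REFUSAL_CATEGORIES = ("safety", "capability", "scope", "policy", "other")

def refusal_rates(events):
    """events: iterable of (window, category_or_None).

    Returns {window: {category: refusals / all responses in that window}}.
    """
    totals = Counter()
    refusals = defaultdict(Counter)
    for window, category in events:
        totals[window] += 1
        if category is not None:
            refusals[window][category] += 1
    return {
        window: {c: refusals[window][c] / totals[window] for c in REFUSAL_CATEGORIES}
        for window in totals
    }

# Hypothetical labeled responses for two hourly windows
events = [
    ("09:00", None), ("09:00", "capability"), ("09:00", None), ("09:00", None),
    ("10:00", "safety"), ("10:00", None),
]
rates = refusal_rates(events)
```

The denominator is all responses in the window, not all refusals, so each line on the chart can move independently of the others.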
Chart Three: Response Length Percentiles. This chart plots the agent's response length distribution over time, with multiple percentiles overlaid: p50, p75, p90, p95, p99. The x-axis is time. The y-axis is response length in tokens. Each percentile is a separate line.
The expected pattern is roughly stable percentiles with bounded movement. A rising p99 with a stable p50 indicates the agent is producing increasingly verbose outputs in a tail of cases: possibly hallucination spirals, possibly increased citation density, possibly degraded summarization. A falling p50 with a stable p99 indicates the agent is becoming more terse on average: possibly degraded helpfulness, possibly more efficient response patterns.
This chart is valuable because response length is a cheap proxy for several harder-to-measure dimensions. Verbosity tends to correlate with hallucination, hedging, or context overflow. Terseness tends to correlate with capability gaps or model conservatism. Watching the percentiles separately is more informative than watching the mean, because the mean hides the tail behavior that matters most.
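To make the point about the mean concrete, here is a small sketch that computes the Chart Three percentiles for one window using the nearest-rank method. The nearest-rank choice and the sample data are assumptions; any consistent percentile definition works as long as it does not change between windows.

```python
import math

def percentile(ordered, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def length_percentiles(token_counts, pcts=(50, 75, 90, 95, 99)):
    """Return {"p50": ..., "p75": ..., ...} for one window of response lengths."""
    ordered = sorted(token_counts)
    return {f"p{p}": percentile(ordered, p) for p in pcts}

# Hypothetical window: 98 typical responses plus a two-response verbose tail
window = [100] * 98 + [900, 2000]
stats = length_percentiles(window)
mean_length = sum(window) / len(window)
```

In this made-up window the median stays at 100 tokens while p99 jumps to 900, and the mean lands at 127, which looks like a mild shift and says nothing about where it came from.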
Chart Four: Scope-Violation Count. This chart plots the count of scope violations per unit time, broken down by violation type and severity. The x-axis is time. The y-axis is violation count per hour or per day depending on the agent's traffic. Each violation type is a separate line; severity is encoded as marker size or color saturation.
The expected pattern is a low baseline with occasional spikes. A rising baseline indicates the agent is increasingly producing outputs outside its declared scope: recommending things outside its product domain, taking actions outside its permission scope, citing sources outside its sanctioned reference set. A spike that persists indicates a specific driver (an upstream change, a new input class, a substrate degradation) that needs investigation.
This chart is critical because scope violations are the most actionable form of pact violation. They map directly to specific predicates in the pact and can usually be traced to a specific cause. Watching them in real time, broken down by type, is the closest thing to watching the agent's pact compliance erode in front of you.
These four charts together cover the dominant modes of drift. Other charts are useful for specific agents (a tool-calling agent might have a tool-call frequency chart, a multi-turn agent might have a turn-count distribution chart), but the four above generalize across agent types and should be the starting point for any drift dashboard.
Supporting Panels: Context, Comparison, And Causation
The four core charts surface drift. The supporting panels help the human understand what they are looking at and what to do about it. A drift dashboard without supporting panels produces alerts; a dashboard with them produces decisions.
The context panel sits at the top of the dashboard and shows the agent's identity, current pact version, current composite score, and a timeline of recent pact changes overlaid on the time axis used by the core charts. The pact-version timeline is essential: when a chart shows drift starting on a specific date, the first question is always "did the pact change on that date?" The timeline answers immediately, without requiring the human to context-switch to the pact registry.
The comparison panel sits beside the core charts and shows the same four charts, but for a comparison cohort: a sibling agent in the same fleet, the agent's own historical baseline, or the platform-wide median. The comparison is what distinguishes drift from baseline noise. A 5% increase in refusal rate is alarming if no other agent shows the same trend; it is uninteresting if every agent on the platform shows the same trend, because the cause is then platform-level rather than agent-level.
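The comparison logic can be sketched as a small classifier that looks at the agent's shift next to the cohort's median shift. The function name, the minimum-shift cutoff, and the ratio used to call a trend platform-level are all illustrative assumptions, not calibrated values.

```python
from statistics import median

def classify_drift(agent_delta, cohort_deltas, min_shift=0.02, ratio=0.5):
    """Label a metric shift as 'none', 'agent-level', or 'platform-level'.

    agent_delta: change in the metric for the agent under review.
    cohort_deltas: the same change measured on the comparison cohort.
    min_shift and ratio are hypothetical tuning knobs.
    """
    if abs(agent_delta) < min_shift:
        return "none"
    cohort_median = median(cohort_deltas)
    same_direction = cohort_median * agent_delta > 0
    if same_direction and abs(cohort_median) >= ratio * abs(agent_delta):
        return "platform-level"
    return "agent-level"

# A 5-point rise in refusal rate, seen alone and then seen fleet-wide
alone = classify_drift(0.05, [0.0, 0.001, -0.002])
fleet_wide = classify_drift(0.05, [0.04, 0.05, 0.06])
```

The label is only a hint for where to look first; the side-by-side charts remain the evidence.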
The input-class breakdown panel shows the same metrics decomposed by input class: request type, customer segment, conversation length, time of day, traffic source. The breakdown is what turns "refusal rate is up" into "refusal rate is up for refund requests during business hours from enterprise customers," which is investigable, where the former is not. The breakdown should be filterable so the human can drill into a specific class and see the chart with that filter applied.
The runtime telemetry panel shows the agent's substrate metrics: model provider latency, tool call success rate, MCP server response times, runtime sandbox resource utilization. When drift appears in the core charts, the runtime panel is the second place to look. Drift driven by substrate changes will show simultaneous shifts in the runtime metrics. Drift driven by agent or input changes will show the runtime metrics flat.
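One way to drive the breakdown is to compute the per-class change in a rate and sort by the largest increase, so the class where drift is concentrated surfaces first. The data layout and the class names below are assumptions for illustration.

```python
def rate_shift_by_class(baseline, recent):
    """baseline, recent: {input_class: (refusals, total_responses)}.

    Returns [(input_class, recent_rate - baseline_rate)], largest increase first.
    Classes with no traffic in either period are skipped.
    """
    shifts = []
    for input_class, (recent_hits, recent_total) in recent.items():
        base_hits, base_total = baseline.get(input_class, (0, 0))
        if recent_total == 0 or base_total == 0:
            continue
        shift = recent_hits / recent_total - base_hits / base_total
        shifts.append((input_class, shift))
    return sorted(shifts, key=lambda item: item[1], reverse=True)

# Hypothetical counts: the overall rise is entirely in refund requests
baseline = {"refund": (4, 200), "pricing": (1, 100), "shipping": (6, 300)}
recent = {"refund": (20, 200), "pricing": (1, 100), "shipping": (6, 300)}
ranked = rate_shift_by_class(baseline, recent)
```

Computing rates per class, rather than sampling across all traffic, is also what keeps a low-volume class such as pricing inquiries from being drowned out by the rest.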
The causation panel is the most ambitious supporting element. It is a narrative-style panel that lists candidate causes for any observed drift, ranked by likelihood, with links to the supporting evidence. Candidate causes are generated by a heuristic that correlates drift signals with timestamped events: pact changes, runtime changes, model provider releases, skill updates, input distribution shifts. The panel cannot be perfect, but it can shorten the time-to-hypothesis significantly by surfacing the obvious candidates without requiring the human to reconstruct them from scratch.
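A deliberately simple version of that heuristic is sketched below: it keeps timestamped events that precede the drift onset within a lookback window and orders them by proximity. Using temporal proximity as a stand-in for likelihood is our assumption for the example, as are the event kinds, descriptions, and dates.

```python
from datetime import datetime, timedelta

def rank_candidate_causes(drift_onset, events, lookback=timedelta(hours=24)):
    """events: list of (timestamp, kind, description).

    Returns (gap, kind, description) for events that happened before the
    onset and within `lookback`, closest to the onset first.
    """
    candidates = []
    for timestamp, kind, description in events:
        gap = drift_onset - timestamp
        if timedelta(0) <= gap <= lookback:
            candidates.append((gap, kind, description))
    return sorted(candidates, key=lambda candidate: candidate[0])

# Hypothetical event log around a drift that started at 14:00
onset = datetime(2024, 5, 2, 14, 0)
events = [
    (datetime(2024, 5, 2, 13, 30), "runtime_change", "retrieval index rebuilt"),
    (datetime(2024, 5, 1, 20, 0), "skill_update", "pricing skill updated"),
    (datetime(2024, 4, 20, 9, 0), "pact_change", "new pact version published"),
    (datetime(2024, 5, 2, 16, 0), "model_release", "provider model update"),
]
ranked = rank_candidate_causes(onset, events)
```

Events after the onset and events far outside the window drop out, which is most of what a human would do by hand in the first ten minutes of an investigation.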
The alert state panel shows the current alert status: which alerts are active, which have been acknowledged, which have been resolved, and which are pending review. The panel is the bridge between the dashboard and the operational protocol: alerts that fire create work, work has owners, and the panel is where the owners track the work to closure.
A drift dashboard with the four core charts and these six supporting panels fits comfortably on a single 1080p screen if laid out efficiently. It does not need to be fancy. Plain charts, plain tables, plain text. The value is in the data and the layout, not in the aesthetics.
Alert Design: From Drift To Action Without Becoming Noise
Alerts on a drift dashboard are easy to design badly. The naive approach, alerting on any chart that crosses any threshold, produces a flood of alerts that the team learns to ignore. The disciplined approach, alerting only on changes that are statistically significant, persistent, and actionable, produces a small number of alerts the team treats as real.
The alert design has four properties that should hold for every alert.
First, the alert is statistically defensible. The threshold that triggers the alert is calibrated against the historical baseline so that the false-positive rate is below a defined budget, typically one false positive per agent per month or less. Calibration uses historical data: simulate alerts at various thresholds against the past 90 days of telemetry and choose the threshold that produces the target false-positive rate. The threshold should be re-calibrated quarterly as the agent's baseline shifts.
Second, the alert is persistent. A single anomalous data point should not trigger an alert; many real shifts produce a string of consecutive anomalous points, while many noise events produce a single one. A typical persistence rule is "the metric must exceed the threshold for at least three consecutive observation windows before alerting," where the window is sized to the agent's traffic: five minutes for a high-traffic agent, an hour for a low-traffic one.
Third, the alert is actionable. The alert payload includes the chart that tripped, the magnitude of the trip, the input-class breakdown that shows where the drift is concentrated, and links to the runtime telemetry and the recent pact change history. An alert that says "refusal rate up" is not actionable. An alert that says "refusal rate up 2.3 percentage points over baseline, concentrated in refund requests, started 4 hours ago, no pact change in the window, runtime metrics show MCP knowledge-base tool latency p99 doubled at the same time" is actionable.
Fourth, the alert has an owner. Every alert routes to a specific human or team based on the agent's owner registry. The routing is deterministic. The alert is acknowledged by the owner, who either resolves it (drift was investigated and addressed), suppresses it (drift is real but expected, for example a planned change), or escalates it (drift requires more investigation than the owner can do alone).
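The first two properties combine naturally: replay a stretch of history through the persistence rule at several candidate thresholds and keep the lowest one that stays within the false-positive budget. The sketch below assumes the replayed history is known to be drift-free, so every simulated alert counts as a false positive; the function names and sample values are illustrative.

```python
def count_alerts(values, threshold, required=3):
    """Count alerts under the persistence rule.

    An alert fires once when the metric has exceeded the threshold for
    `required` consecutive windows; a longer run does not fire again.
    """
    alerts, run = 0, 0
    for value in values:
        run = run + 1 if value > threshold else 0
        if run == required:
            alerts += 1
    return alerts

def calibrate_threshold(stable_history, candidates, fp_budget, required=3):
    """Return the lowest candidate threshold whose simulated alert count
    on drift-free history is within the budget, or None if none qualifies."""
    for threshold in sorted(candidates):
        if count_alerts(stable_history, threshold, required) <= fp_budget:
            return threshold
    return None

# Hypothetical per-window metric values from a period with no real drift
history = [0.1, 0.5, 0.5, 0.5, 0.5, 0.1, 0.5, 0.5, 0.1, 0.7, 0.7, 0.7]
chosen = calibrate_threshold(history, candidates=[0.3, 0.6, 0.8], fp_budget=1)
```

Choosing the lowest qualifying threshold keeps sensitivity as high as the budget allows. With 90 days of history and a budget of one false positive per month, `fp_budget` would be 3.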
Alerts that are not acknowledged within a defined SLA escalate automatically. Alerts that are acknowledged but not resolved within a longer SLA escalate too. The SLAs should be tuned to the team's bandwidth; a drift alert is rarely a five-alarm fire, but it is rarely safe to ignore for a week either. A typical pattern is acknowledge within 4 hours, resolve within 72 hours, escalate after.
Suppression is a feature, not a bug. Suppressed alerts are not deleted; they are marked with the suppressing owner, the suppression reason, and the suppression duration. After the duration expires, the alert re-fires if the drift is still present. Suppression with a documented reason is a legitimate response to drift the team has decided to accept; suppression without a reason is a smell that the team is treating drift as noise.
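The acknowledgement and resolution SLAs and the suppression expiry can be expressed as one small state check that runs on a schedule. This is a sketch under stated assumptions: the field names and action labels are hypothetical, and measuring the 72-hour resolution SLA from the time the alert fired (rather than from acknowledgement) is our reading of the pattern above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

ACK_SLA = timedelta(hours=4)       # acknowledge within 4 hours
RESOLVE_SLA = timedelta(hours=72)  # resolve within 72 hours of firing (assumed)

@dataclass
class Alert:
    fired_at: datetime
    acknowledged_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None
    suppressed_until: Optional[datetime] = None
    suppression_reason: str = ""

def next_action(alert, now, drift_still_present):
    """Decide what the alerting system should do with this alert right now."""
    if alert.resolved_at is not None:
        return "closed"
    if alert.suppressed_until is not None:
        if now < alert.suppressed_until:
            return "suppressed"
        # Suppression has expired: re-fire only if the drift is still there
        return "refire" if drift_still_present else "closed"
    if alert.acknowledged_at is None:
        return "escalate" if now - alert.fired_at > ACK_SLA else "await_ack"
    return "escalate" if now - alert.fired_at > RESOLVE_SLA else "in_progress"

fired = datetime(2024, 5, 2, 8, 0)
unacked = Alert(fired_at=fired)
suppressed = Alert(
    fired_at=fired,
    acknowledged_at=fired + timedelta(hours=1),
    suppressed_until=fired + timedelta(hours=48),
    suppression_reason="planned retrieval index migration",
)
```

Keeping the suppression reason and expiry on the alert record itself is what makes an undocumented suppression easy to spot in review.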
The alert taxonomy should be small. Three or four alert types is enough: distribution drift, refusal drift, length drift, scope drift. Adding more types tends to dilute attention. The four core charts each generate one alert type, and each alert type carries the chart's specific metadata.
The Telemetry Pipeline Behind The Dashboard
A drift dashboard is only as good as the telemetry pipeline feeding it. The pipeline has five stages, each of which has design decisions that shape the dashboard's resolution, latency, and reliability.
Stage one is event capture. Every agent response produces a structured event: the input, the output, the predicates evaluated, the violations recorded, the timing, the runtime context. Capture happens inline with the response or via a fire-and-forget background path. The capture format should be stable across agent versions; changes to the capture format invalidate historical comparisons and break the dashboard's baseline.
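A capture event along these lines might look like the record below, serialized as one JSON line per response. The field names are hypothetical, and the explicit `schema_version` field is our addition to illustrate one way of keeping format changes visible to downstream stages.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass(frozen=True)
class ResponseEvent:
    agent_id: str
    pact_version: str
    timestamp: str                 # ISO 8601, UTC
    input_text: str
    output_text: str
    predicates_evaluated: tuple
    violations: tuple
    latency_ms: int
    runtime_context: dict = field(default_factory=dict)
    schema_version: int = 1        # bump on any change to the capture format

def to_json_line(event):
    """Serialize one event as a single JSON line with stable key order."""
    return json.dumps(asdict(event), sort_keys=True)

event = ResponseEvent(
    agent_id="onboarding-assistant",
    pact_version="v3",
    timestamp="2024-05-02T09:05:00Z",
    input_text="How does your pricing compare?",
    output_text="Here is our current pricing...",
    predicates_evaluated=("no_competitor_recommendation", "stay_in_scope"),
    violations=(),
    latency_ms=840,
    runtime_context={"model_provider": "example-provider"},
)
line = to_json_line(event)
```

Recording the pact version on every event also lets the context panel's timeline be rebuilt from the telemetry alone.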
Stage two is enrichment. The raw event is enriched with derived features: response embeddings, response length, sentiment, refusal classification, scope-violation classification, input-class assignment. Enrichment is computationally expensive and should be done asynchronously. Late enrichment, minutes to hours after the event, is acceptable for the dashboard, which is not a real-time alerting system but a near-real-time observability surface.
Stage three is aggregation. Enriched events are aggregated into time-windowed statistics: distribution divergence over the last hour, refusal rate per category per hour, length percentiles per hour, scope violation count per hour. The aggregation cadence should be matched to the dashboard's refresh cadence; a dashboard that refreshes every minute is over-engineered if the underlying aggregation runs every fifteen.
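As a small example of the aggregation stage, the sketch below rolls enriched events up into scope-violation counts per hour and per violation type, which is the series Chart Four plots. The violation type names and timestamps are made up for the example.

```python
from collections import Counter
from datetime import datetime

def hour_bucket(timestamp):
    """Truncate a datetime to the start of its hour."""
    return timestamp.replace(minute=0, second=0, microsecond=0)

def violations_per_hour(events):
    """events: iterable of (timestamp, violation_type).

    Returns a Counter keyed by (hour_start, violation_type).
    """
    return Counter((hour_bucket(ts), vtype) for ts, vtype in events)

# Hypothetical enriched events
events = [
    (datetime(2024, 5, 2, 9, 5), "out_of_domain"),
    (datetime(2024, 5, 2, 9, 40), "out_of_domain"),
    (datetime(2024, 5, 2, 10, 1), "unsanctioned_source"),
]
counts = violations_per_hour(events)
```

Changing the window size only means changing the truncation in `hour_bucket`, which is where the trade-off between bursty drift and noisy charts gets decided.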
Stage four is storage. Aggregated statistics are stored in a time-series database optimized for the dashboard's query patterns: many concurrent reads of recent windows, occasional bulk reads for historical baselines, and infrequent backfills when enrichment changes. The storage should retain at least 90 days of full-resolution data and at least a year of downsampled data for long-baseline comparisons.
Stage five is query and visualization. The dashboard issues queries against the time-series store and renders the results. The query layer should be cacheable so that ten humans opening the dashboard simultaneously do not multiply the query load. The visualization layer should be lightweight (server-rendered if possible) so the dashboard loads in under three seconds and updates in under a second.
The pipeline's design choices have downstream consequences. Sampling at capture time saves cost but introduces bias that can hide drift in low-volume input classes. Asynchronous enrichment introduces latency that delays the dashboard's view of drift by minutes to hours. Coarse aggregation windows hide bursty drift; fine windows produce noisy charts. The defaults (full capture, asynchronous enrichment, hourly aggregation, 90-day retention) work for most agents and can be tuned as the team's needs sharpen.
A neglected aspect of the pipeline is its own observability. The pipeline should report, on a meta-dashboard, its own health: capture lag, enrichment lag, aggregation lag, query latency, storage utilization. A drift dashboard that silently stops receiving data is more dangerous than no dashboard at all, because the human looking at flat lines may conclude the agent is stable when in fact the pipeline has failed. Pipeline health should be checked daily and paged on degradation.
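A basic staleness check is enough to catch the silent-failure case. The sketch below compares the timestamp of the latest record seen at each stage against a per-stage lag budget; the stage names follow the pipeline above, while the budgets and timestamps are illustrative assumptions.

```python
from datetime import datetime, timedelta

def stale_stages(last_seen, max_lag, now):
    """Return the sorted names of stages whose latest record is older
    than that stage's lag budget.

    last_seen: {stage: datetime of the most recent record}
    max_lag:   {stage: timedelta budget}
    """
    return sorted(
        stage for stage, timestamp in last_seen.items()
        if now - timestamp > max_lag[stage]
    )

now = datetime(2024, 5, 2, 12, 0)
last_seen = {
    "capture": now - timedelta(minutes=2),
    "enrichment": now - timedelta(hours=5),
    "aggregation": now - timedelta(minutes=50),
}
max_lag = {
    "capture": timedelta(minutes=10),
    "enrichment": timedelta(hours=2),
    "aggregation": timedelta(hours=2),
}
stale = stale_stages(last_seen, max_lag, now)
```

Showing the result as a banner on the drift dashboard itself, not only on the meta-dashboard, keeps anyone from reading flat lines as stability.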
The Human Protocol That Runs Around The Dashboard
A dashboard without a human protocol is a wall decoration. The protocol is the operational practice that turns dashboard observation into action. It has three components: the daily review, the weekly cross-cohort review, and the incident response.
The daily review is a five-to-ten-minute session where the agent's owner opens the dashboard, scans the four core charts, and notes any visible drift. The review is asynchronous β there is no meeting β but it should happen at a consistent time so that drift between reviews is bounded. The owner records the day's observations in a brief log: "refusal rate up 0.4pp on enterprise tier, watching" or "all metrics nominal." The log accumulates and becomes the team's record of what the agent looked like over time, which is invaluable when investigating a later incident.
The weekly cross-cohort review is a longer session, typically 30 minutes, where the platform team reviews the dashboards for all agents in a cohort side by side. The review surfaces drift patterns that span multiple agents (likely substrate-driven), identifies agents with concerning trends that the owner has not flagged, and decides on platform-level interventions when warranted. The cross-cohort view is where drift caused by upstream changes (model provider updates, runtime configuration changes, skill updates) is most often caught, because the same cause produces simultaneous drift across many agents.
Incident response is the protocol that activates when an alert fires. The protocol has five steps: acknowledge the alert (within the SLA), open the dashboard and identify the drift, generate hypotheses for the cause, investigate the leading hypothesis, and either resolve, suppress, or escalate. The investigation step is where the supporting panels earn their keep: the input-class breakdown, the runtime telemetry, the causation panel, the comparison cohort. A well-equipped human can move from acknowledgment to resolution in under an hour for routine drift; a poorly-equipped human can spend days.
The protocol should produce, for every incident, a brief post-incident note: what the drift was, what caused it, what was done, and what telemetry or alert improvements would have caught it sooner. The notes accumulate and inform the dashboard's evolution. A team that regularly produces post-incident notes will, over time, have a dashboard that catches its agents' specific drift patterns better than any off-the-shelf solution.
The protocol is intentionally lightweight. The total time investment for an agent's owner is roughly an hour per week: five minutes daily, half an hour for the cross-cohort review, occasional incident response. Heavier protocols collapse under the team's other demands; lighter protocols miss real drift. The hour-per-week target should be maintainable indefinitely.
Anti-Patterns: Dashboards That Look Useful But Are Not
Several dashboard designs look useful at first but degrade into clutter or false confidence. Recognizing them early is part of the practice.
The single-number dashboard shows the composite score and nothing else. It is comforting because it is simple. It is dangerous because the composite is lossy and slow-moving; an agent can drift significantly along one dimension without moving the composite enough to alert. The composite is a summary metric for outside observers, not an operational metric for inside observers.
The everything-on-one-page dashboard shows every metric the team can think of, in twenty small charts. It is comprehensive in theory and unread in practice. The human eye cannot scan twenty charts daily; the dashboard becomes wallpaper. The four-core-plus-supporting layout is bounded precisely so it remains readable.
The ML-anomaly-only dashboard relies entirely on machine-detected anomalies and does not show the underlying time series. It is efficient in theory and brittle in practice. The detector misses subtle drift that humans can see; the human cannot validate the detector without seeing what it sees. Detection and visualization are complementary; replacing visualization with detection alone surrenders the human's pattern-recognition ability.
The per-incident dashboard is opened only when an incident is suspected. It is never used proactively. The team learns about drift from customer complaints rather than from the dashboard. The proactive daily review is what turns the dashboard from a forensic tool into a preventive one.
The never-tuned dashboard uses default thresholds and never recalibrates against the agent's actual baseline. False positives flood; the team learns to ignore alerts. The quarterly recalibration cycle is what keeps alert quality high.
The Reader Artifact: The Pact Drift Dashboard Spec
The artifact this post produces is a Pact Drift Dashboard Spec: a structured specification that captures the dashboard's panels, queries, alerts, and operational protocol in enough detail that a team can implement it directly against their telemetry stack.
The spec has six sections.
Section one is the panel layout. It specifies the dashboard's grid (typically a four-column, three-row layout), the placement of each chart and supporting panel, the screen size targeted (1080p as the floor), and the color and typography conventions. The layout is reproducible; two teams implementing the spec should produce visually similar dashboards.
Section two is the metric definitions. For each chart, the spec defines the metric in precise mathematical terms: distribution divergence as Jensen-Shannon over response embeddings using a specific embedding model and a specific reference cohort; refusal rate as the count of responses classified by a specific refusal classifier divided by the total response count; length percentiles as the empirical distribution of response token counts; scope violations as the count of responses classified by a specific scope-violation classifier. The definitions remove ambiguity that would otherwise cause two implementations to diverge.
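As an illustration of the level of precision intended, the first two definitions might be written as follows. The notation is ours: P_w and Q are the recent-window and reference distributions over some discretized feature, R_w is the set of responses in window w, and cls is the refusal classifier.

```latex
\mathrm{JSD}(P_w \,\|\, Q) = \tfrac{1}{2}\, D_{\mathrm{KL}}(P_w \,\|\, M) + \tfrac{1}{2}\, D_{\mathrm{KL}}(Q \,\|\, M),
\qquad M = \tfrac{1}{2}\,(P_w + Q)

D_{\mathrm{KL}}(P \,\|\, M) = \sum_{i} P(i)\, \log_2 \frac{P(i)}{M(i)}

\mathrm{refusal\_rate}_c(w) = \frac{\lvert \{\, r \in R_w : \mathrm{cls}(r) = c \,\} \rvert}{\lvert R_w \rvert}
```

With the base-2 logarithm the divergence is bounded between 0 and 1, which gives every implementation the same y-axis range.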
Section three is the query patterns. For each chart, the spec provides the query against a generic time-series store: the aggregation window, the look-back range, the filters, the grouping. Teams adapt the queries to their specific store, but the structure transfers cleanly across most modern time-series databases.
Section four is the alert definitions. For each alert type, the spec defines the metric, the threshold (as a function of the historical baseline), the persistence requirement, the alert payload schema, the routing rule, and the SLA. The thresholds are starting points; the spec prescribes the calibration procedure that adapts them to the team's specific agents.
Section five is the telemetry pipeline requirements. The spec describes the five pipeline stages (capture, enrichment, aggregation, storage, query) and the throughput, latency, and retention requirements for each. The requirements are expressed as performance budgets, not implementations; the team chooses tools that meet the budgets.
Section six is the human protocol. The spec defines the daily review, the weekly cross-cohort review, the incident response steps, the post-incident note format, and the suppression and escalation rules. The protocol is the operational layer that completes the dashboard.
The spec is approximately 25 pages of structured content, designed to be readable in a single sitting and implementable in two to four weeks by a small team. It is intentionally tool-agnostic; teams using Grafana, Datadog, Honeycomb, or in-house dashboarding can adopt the spec without changing tools. The value is in the structure, not in the toolchain.
Counter-Argument: This Is Over-Engineered For Most Teams
A reasonable objection is that most teams cannot justify a five-stage telemetry pipeline, a four-chart dashboard, an alert design discipline, and a weekly review protocol for what may be one or two production agents. The objection has weight. Drift telemetry has fixed costs, and small fleets do not amortize them well.
The response is twofold.
First, the spec scales down. A team with one production agent does not need ninety days of full-resolution telemetry; a thirty-day window suffices. They do not need a five-stage pipeline; a single ETL job that processes captured events nightly and produces a static dashboard is enough. They do not need cross-cohort comparisons; the comparison can be against the agent's own historical baseline. The full spec is the destination; the journey starts with the four core charts on a static daily-refreshed dashboard, which a single engineer can implement in a few days.
Second, the cost of skipping the practice is asymmetric. A team without drift telemetry will discover drift via customer complaints, regulatory inquiries, or scoring failures that surface weeks after the drift began. The cost of the recovery (debugging from cold, reconstructing the timeline, communicating with affected counterparties, rebuilding trust) is typically far higher than the ongoing cost of the dashboard would have been. The asymmetry favors building the dashboard early, even in lightweight form.
There is a deeper version of the objection: that drift, in the steady state, is rare, and the dashboard sits idle most of the time. This is partly true. A well-managed agent on a stable substrate may produce no actionable drift for months. The dashboard's value during those months is not zero, however. The dashboard provides the baseline against which future drift will be measured. It provides the visibility that catches the rare drift quickly. And it provides the team's calibration of what "normal" looks like for their agents β calibration that is impossible to recover after the fact when drift has already obscured it.
The practice's value is in the steady-state observability it provides, not in the volume of drift it catches. Teams should adopt it because the alternative β flying blind on agent behavior β has costs that compound silently and surface catastrophically.
What Armalo Does
Armalo's trust layer publishes a per-agent telemetry feed that includes the raw signals the dashboard's core charts consume: response embeddings, refusal classifications, length distributions, scope-violation classifications, and the runtime context for each response. The feed is delivered via webhook or pull API and is the input to a team's drift dashboard, regardless of which dashboarding tool the team uses.
The trust oracle's scoring engine internally computes drift signals across all twelve dimensions of the composite score and surfaces the signals through a dedicated drift API. The signals include rolling distribution divergence, dimension-specific drift indicators, and historical baselines. Teams that prefer not to build their own pipeline can consume the API directly and render the dashboard from Armalo's pre-computed signals.
The pact registry tracks pact version changes with timestamps, which the dashboard's context panel uses to overlay the pact-version timeline on the core charts. When drift appears at the same time as a pact change, the overlay makes the relationship visible immediately, shortening the human's hypothesis-generation time.
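The correlation the overlay makes visible can also be sketched as a lookup: given a drift onset time, list the pact versions that took effect shortly before it. The `(timestamp, version)` tuple shape and the 48-hour lookback are assumptions for illustration, not the pact registry's actual schema or a recommended threshold.

```python
from datetime import datetime, timedelta


def pact_changes_near(drift_onset, pact_changes, lookback=timedelta(hours=48)):
    """Return pact versions that took effect within `lookback` before drift onset.

    `pact_changes` is a list of (changed_at, version) tuples. Changes that
    happened after the onset are excluded, since they cannot have caused it.
    """
    return [
        version
        for changed_at, version in sorted(pact_changes)
        if timedelta(0) <= drift_onset - changed_at <= lookback
    ]
```

An empty result does not rule out a pact-related cause; it only says no version change landed inside the chosen lookback.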
The certification tier system is itself drift-aware: an agent's tier is held during a defined grace period after drift is detected, giving the agent's owner time to investigate and respond before the tier drops. The grace period is configurable per pact and is the operational counterpart to the dashboard's alert SLAs.
FAQ
How often should the dashboard refresh? Every 5 to 15 minutes for the core charts. More frequent refreshes consume pipeline capacity without producing useful new information; less frequent refreshes can miss bursty drift. The supporting panels can refresh less often.
Should drift telemetry be public or private? Private by default. The dashboard contains operational details that are not appropriate for outside observers. Counterparties should see the trust oracle's composite and tier; they should not see the agent's daily refusal-rate trend. Public derivative views, such as "this agent's composite has been within 2 points of its baseline for 30 days", can be exposed for external consumption.
What if the agent's traffic is too low to produce useful percentiles? Aggregate over longer windows. A low-traffic agent may need daily percentiles rather than hourly. The dashboard should support window length as a parameter so the team can tune it to their traffic.
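One way to treat window length as a parameter is to try progressively longer windows until enough samples are available. The sketch below does that for response-length percentiles; the candidate windows and the `min_samples` value of 200 are illustrative assumptions, not thresholds from the spec.

```python
from statistics import quantiles


def length_percentiles(hourly_lengths, window_hours, min_samples=200):
    """Compute p50/p95 over the most recent `window_hours` hourly buckets.

    `hourly_lengths` is a list of per-hour lists of response lengths, oldest
    first. Returns None when the window holds too few samples, so the caller
    can retry with a longer window.
    """
    window = [n for bucket in hourly_lengths[-window_hours:] for n in bucket]
    if len(window) < max(min_samples, 2):
        return None
    # quantiles(n=20) returns 19 cut points: index 9 is p50, index 18 is p95.
    cuts = quantiles(window, n=20, method="inclusive")
    return {"p50": cuts[9], "p95": cuts[18], "samples": len(window)}


def pick_window(hourly_lengths, candidates=(1, 6, 24, 168), min_samples=200):
    """Return (hours, percentiles) for the shortest window with enough data."""
    for hours in candidates:
        result = length_percentiles(hourly_lengths, hours, min_samples)
        if result is not None:
            return hours, result
    return None
```

An agent averaging ten responses an hour would land on the 24-hour window here, which matches the daily-percentile suggestion for low-traffic agents.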
How do you handle agents that have been deployed for less than a baseline period? Use the platform's median agent as the baseline temporarily. The agent's own baseline forms over its first weeks of operation; until then, comparison to similar agents is the next-best reference.
Should drift telemetry be retained indefinitely? No. Full-resolution retention for 90 days, downsampled retention for a year, archival summaries for longer. Indefinite full-resolution retention is expensive and not useful; the historical signal value diminishes after a quarter.
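The three retention tiers can be expressed as a simple age check. The sketch below uses the 90-day and one-year boundaries from the answer above; the tier names and the function itself are hypothetical.

```python
from datetime import date

FULL_RESOLUTION_DAYS = 90   # full-resolution retention window
DOWNSAMPLED_DAYS = 365      # downsampled retention window ("a year")


def retention_tier(record_date, today):
    """Classify a telemetry record by age into a retention tier."""
    age_days = (today - record_date).days
    if age_days <= FULL_RESOLUTION_DAYS:
        return "full"
    if age_days <= DOWNSAMPLED_DAYS:
        return "downsampled"
    return "archival_summary"
```

A scheduled compaction job could use this to decide which records to downsample or roll up on each run.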
Can the dashboard catch adversarial inputs that try to hide drift? Partially. The dashboard catches drift in the agent's overall behavior, regardless of whether the drift was induced by adversarial inputs. It does not directly detect adversarial inputs; that is the job of a separate evaluation pipeline. The two together form a complete picture; alone, neither does.
What is the most common failure mode of a drift dashboard? Not being looked at. Beautiful dashboards that no one reads catch nothing. The daily review protocol exists to ensure the dashboard is read; the team's discipline determines whether the protocol holds.
Should the dashboard be open to the agent's developer or only to the platform team? Open to both. The developer needs visibility to debug; the platform team needs visibility to govern. Read access for both, write access (alert acknowledgment, suppression) gated by ownership.
Bottom Line
Drift is invisible until it is catastrophic. The agent that taught us drift telemetry had been drifting for six weeks before a customer complaint surfaced it; the team's existing tooling had not caught it because the existing tooling was not built to look for it. A drift telemetry dashboard, built around the four core charts (distribution divergence, refusal rate, response length percentiles, scope violations) and the supporting panels for context, comparison, and causation, makes drift visible daily. The alert design discipline turns drift into action without becoming noise. The human protocol (daily review, weekly cross-cohort review, incident response with post-incident notes) turns observation into resolution. The Pact Drift Dashboard Spec captures the design in implementable detail. The practice scales down for small teams and scales up for large ones. The fixed cost of building it is repaid the first time drift is caught early; the ongoing cost is small, and the alternative, flying blind on production agent behavior, is the cost the industry has been paying without recognizing it.