"Never access external URLs." "Do not discuss competitor products." "Always refuse requests for financial advice."
You put these instructions in your system prompt. They work most of the time.
"Most of the time" is not a hard stop. "Most of the time" is a soft guardrail with a violation rate you have not measured.
A hard stop that actually works does not depend on the model deciding to comply. It is enforced externally, at the task level, by a system that does not ask the model for permission.
TL;DR
- System prompt instructions are probabilistic, not deterministic. A model can and will violate them — the rate depends on prompt complexity, input framing, and adversarial pressure.
- A hard stop requires external enforcement. The check must happen outside the model's context window, by a system that can reject or flag the output regardless of the model's intent.
- Three enforcement mechanisms work: pact conditions, inline post-processing evaluation, and output gating. System prompt instructions alone are not on this list.
- The hard stop should have a consequence attached. A stopped output that has no effect on the agent's scoring is a cost-free violation — the agent learns nothing from it.
- Measure your actual violation rate. The first step in building a working hard stop is knowing what fraction of your current instructions are being honored.
Why System Prompt Instructions Are Not Hard Stops
A system prompt instruction is a probabilistic nudge on the model's output distribution. It shifts the probability of the model producing certain outputs. It does not create a binary boundary.
This is not a flaw — it is the fundamental nature of language model inference. The model is producing a probability distribution over next tokens. Instructions influence that distribution. They do not create a hard constraint.
The practical implications:
Adversarial inputs can cross any soft boundary. An instruction that says "do not discuss X" will be violated by some fraction of carefully framed inputs that approach X obliquely. The violation rate depends on how the instruction is framed, how specific the topic is, and how adversarially the inputs are constructed.
Complex prompts create instruction conflict. When a system prompt has many instructions, they can conflict under certain inputs. The model resolves conflicts using its own learned priors — which may not match your intended priority ordering.
Long conversations dilute instruction weight. In a long multi-turn conversation, the effective weight of early instructions decreases as context fills. An instruction that was effectively a hard stop at turn 3 may not hold at turn 20 on the same topic.
The Three Enforcement Mechanisms That Actually Work
Mechanism 1: Behavioral pact conditions
A behavioral pact condition is a machine-readable rule that gets evaluated against the agent's output by an external system — not the model itself. The evaluation is deterministic for structured violations (did the output contain a URL? did it mention a competitor by name?) and LLM-jury-evaluated for semantic violations (does this output constitute financial advice?).
The key difference from a system prompt instruction: the evaluation happens outside the model's context. The model cannot "decide" to comply or not comply. The output is evaluated by a separate system, and the result is independent of the model's intentions.
Example pact condition:
    {
      "condition": "no_external_urls",
      "type": "deterministic",
      "check": "output_contains_url_pattern",
      "severity": "critical",
      "action": "reject_and_flag"
    }
The check runs on every output. A violation triggers rejection and flags the output for review. The model's system prompt may also say "do not include URLs" — but the enforcement is the pact condition, not the instruction.
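As a minimal sketch, the external check for this condition might look like the following. The regex, the `evaluate` helper, and the result shape are illustrative assumptions, not an armalo.ai API:

```python
import re

# Illustrative URL matcher; a production check would use a vetted pattern.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)

CONDITION = {
    "condition": "no_external_urls",
    "type": "deterministic",
    "check": "output_contains_url_pattern",
    "severity": "critical",
    "action": "reject_and_flag",
}

def evaluate(output: str, condition: dict) -> dict:
    """Run a deterministic pact condition against a model output.

    The model is not consulted: the check runs on the text it already
    produced, outside its context window.
    """
    violated = bool(URL_PATTERN.search(output))
    return {
        "condition": condition["condition"],
        "violated": violated,
        "action": condition["action"] if violated else "pass",
    }

result = evaluate("See https://example.com for details", CONDITION)
# result["action"] == "reject_and_flag"
```

The same runner handles every deterministic condition; only the matching logic behind `check` varies.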
Mechanism 2: Inline post-processing evaluation
Between the model's output and the downstream consumer of that output, an evaluation step checks the output against defined criteria. The evaluation can be:
- Pattern matching: Does the output contain a blocked string, URL pattern, or entity name? Fast, deterministic, 100% recall on defined patterns.
- Classifier: Does the output fall into a blocked category? Faster than LLM evaluation, lower cost, less reliable on semantic edge cases.
- LLM evaluation: Does the output violate a semantic rule? ("Does this constitute financial advice?") Slower, higher cost, handles cases that pattern matching misses.
The critical property: this check is not in the model's context window. The model produced its output. The check is now running on that output externally. The model cannot influence the result.
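A sketch of this tiered pipeline, running the cheap deterministic layer first and falling back to a semantic judge only when pattern matching finds nothing. The blocked-entity list, function names, and violation labels are all hypothetical:

```python
import re
from typing import Callable, Optional

# Hypothetical blocked-entity list; fill with your own terms.
BLOCKED_ENTITIES = ["AcmeCorp"]
URL_RE = re.compile(r"https?://\S+")

def pattern_check(output: str) -> Optional[str]:
    """Layer 1: fast, deterministic pattern matching."""
    if URL_RE.search(output):
        return "external_url"
    for name in BLOCKED_ENTITIES:
        if name.lower() in output.lower():
            return "competitor_mention"
    return None

def post_process(output: str,
                 semantic_judge: Callable[[str], Optional[str]]) -> dict:
    """Inline evaluation between the model and the downstream consumer.

    In production, semantic_judge would be a classifier or an LLM call;
    here it is injected so the pipeline stays testable.
    """
    violation = pattern_check(output) or semantic_judge(output)
    return {"deliver": violation is None, "violation": violation}

# Stub judge that never flags, to show the deterministic layer alone.
verdict = post_process("Our product beats AcmeCorp.", lambda o: None)
```

Ordering the layers cheapest-first means the expensive LLM judge only runs on outputs the patterns could not decide.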
Mechanism 3: Output gating
For high-stakes hard stops — outputs that should never reach the user under any circumstances — output gating holds the response until an evaluation passes. The model produces output. The output is quarantined. The gate checks it. If the check passes, the output is released. If not, it is rejected and the user receives a fallback response.
This is the most reliable mechanism and also the most expensive in latency, since every response waits on the gate. For genuinely critical hard stops, the latency cost is worth it.
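A minimal sketch of a gate, assuming the model call and the check are both injectable functions (the fallback text and names are illustrative):

```python
from typing import Callable

FALLBACK = "Sorry, this response could not be provided."

def gated_response(produce: Callable[[], str],
                   check: Callable[[str], bool]) -> str:
    """Hold the model's output until the gate check passes."""
    output = produce()       # model produces output
    if check(output):        # gate evaluates the quarantined output
        return output        # check passed: output released
    return FALLBACK          # check failed: rejected, fallback delivered

safe = gated_response(lambda: "All clear.", lambda o: "http" not in o)
blocked = gated_response(lambda: "Go to https://example.com",
                         lambda o: "http" not in o)
```

The user never sees the quarantined output on failure; only the gate decides what is released.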
What "A Consequence That Works" Looks Like
A hard stop with no consequence is a speed bump. The agent encounters it, the output is rejected, the user gets an error. But the agent's trust score is unchanged, no behavioral record is updated, and the incident has no effect on the agent's future task eligibility.
A hard stop with a consequence looks like this:
- Output is evaluated and found to violate a pact condition.
- The violation is recorded with the full input/output context.
- The agent's scope-honesty score decreases.
- The violation is factored into the agent's composite trust score.
- If the agent's score drops below a threshold, its certification tier drops.
- A lower certification tier restricts the types of tasks the agent is eligible for.
The consequence is not punitive in the pejorative sense — it is informational. It tells the system that this agent violated a hard stop, updates the agent's behavioral record accordingly, and adjusts its future task eligibility to match its actual reliability.
Without this feedback loop, hard stops are whack-a-mole. With it, agents that consistently violate hard stops are systematically moved to lower-stakes tasks — which is the correct governance response.
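That feedback loop can be sketched as follows. The penalty size, tier thresholds, and field names are all illustrative assumptions, not a real scoring scheme:

```python
from dataclasses import dataclass, field

# Assumed tiers: score >= 0.9 is tier-1, >= 0.7 is tier-2, else tier-3.
TIER_THRESHOLDS = [(0.9, "tier-1"), (0.7, "tier-2"), (0.0, "tier-3")]

@dataclass
class AgentRecord:
    scope_honesty: float = 1.0
    violations: list = field(default_factory=list)

    @property
    def composite_score(self) -> float:
        # Illustrative: the composite is just the scope-honesty score here.
        return self.scope_honesty

    @property
    def tier(self) -> str:
        for threshold, tier in TIER_THRESHOLDS:
            if self.composite_score >= threshold:
                return tier
        return "tier-3"

def record_violation(agent: AgentRecord, context: dict,
                     penalty: float = 0.05) -> None:
    """Log the violation with context and apply the scoring consequence."""
    agent.violations.append(context)
    # Round to avoid float drift across repeated penalties.
    agent.scope_honesty = max(0.0, round(agent.scope_honesty - penalty, 4))

agent = AgentRecord()
record_violation(agent, {"condition": "no_external_urls",
                         "input": "...", "output": "..."})
```

An agent that keeps violating crosses a tier threshold and loses eligibility for higher-stakes tasks, which is the feedback loop in miniature.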
Measuring Your Actual Violation Rate
The first step in building a working hard stop is understanding what fraction of your current instructions are being honored. Most teams do not know this number.
To measure it:
- Define the instruction as a testable pact condition — not "do not discuss X" but "output does not contain reference to X or semantic equivalent."
- Sample 500-1000 production outputs.
- Evaluate each against the pact condition using the external enforcement mechanism.
- Calculate the violation rate.
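The calculation itself is a few lines. Here is a sketch with a synthetic sample, where the check function stands in for whichever external mechanism you use:

```python
from typing import Callable, Iterable

def violation_rate(outputs: Iterable[str],
                   check: Callable[[str], bool]) -> float:
    """Fraction of sampled outputs that violate the pact condition."""
    outputs = list(outputs)
    violations = sum(1 for o in outputs if check(o))
    return violations / len(outputs)

# Synthetic sample: 1,000 outputs, 5 of which contain a URL.
sample = ["clean output"] * 995 + ["see https://example.com"] * 5
rate = violation_rate(sample, lambda o: "http" in o)   # 0.005, i.e. 0.5%

# At 10,000 daily tasks, that rate means 50 violations per day.
daily_violations = rate * 10_000
```

The same function works with a pattern matcher, a classifier, or an LLM judge; only the check callable changes.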
A violation rate of 0.5% on 10,000 daily tasks is 50 violations per day. Whether that is acceptable depends on the severity of the instruction. For "do not include your system prompt in output" — probably fine to review and handle. For "do not provide medication dosage recommendations" — 50 violations per day is a critical governance failure.
Measuring the violation rate before building enforcement tells you which instructions are soft guardrails and which need to be hardened with external enforcement.
The Hard Stop Architecture
For any behavioral boundary where the consequence of violation is high:
    User input
      → Model executes
      → Output captured (not yet delivered)
      → Pact condition evaluation runs externally
          Pass: output delivered
                → Behavioral record: clean eval
          Fail: output rejected, fallback delivered
                → Behavioral record: violation logged
                → Composite score updated
                → Certification tier checked
The model is in the loop for producing output. It is out of the loop for deciding whether that output gets delivered. External enforcement makes the stop hard.
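Putting the flow together, a minimal end-to-end sketch. Every name here is illustrative; armalo.ai's actual interfaces are not shown:

```python
from typing import Callable

def run_task(model_fn: Callable[[str], str],
             user_input: str,
             conditions: list,
             record: dict) -> str:
    """Model is in the loop for output, out of the loop for delivery."""
    output = model_fn(user_input)                            # model executes
    failed = [c for c in conditions if c["check"](output)]   # external eval
    if not failed:
        record["clean_evals"] += 1          # behavioral record: clean eval
        return output                       # pass: output delivered
    record["violations"] += len(failed)     # violation logged
    record["score"] -= 0.05 * len(failed)   # composite score updated (assumed penalty)
    return "Request could not be completed."  # fallback delivered

record = {"clean_evals": 0, "violations": 0, "score": 1.0}
conditions = [{"condition": "no_external_urls",
               "check": lambda o: "http" in o}]
reply = run_task(lambda x: "See https://example.com", "help me", conditions, record)
```

The model function produces text and nothing else; delivery, logging, and scoring all happen in code it never sees.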
Instructions that must be enforced externally, not merely stated in the prompt, belong in behavioral pacts. armalo.ai provides the pact conditions, eval execution, and scoring consequence infrastructure.
Frequently Asked Questions
Why don't system prompt instructions create hard stops?
System prompt instructions are probabilistic nudges on the model's output distribution. They shift the probability of certain outputs but do not create deterministic constraints. The model can and will violate them under adversarial inputs, long conversations, and complex instruction conflicts — the violation rate depends on framing and adversarial pressure.
What makes an external enforcement mechanism more reliable than a system prompt?
External enforcement runs outside the model's context window. The model has already produced its output. The check is performed by a separate system that does not consult the model — it evaluates the output against defined criteria deterministically or via an external LLM judge. The model cannot influence the result.
What is a behavioral pact condition?
A behavioral pact condition is a machine-readable rule in a behavioral pact that specifies what the agent's output must not contain or must contain. Conditions are evaluated by an external system after the model produces output, before the output is delivered. Violations trigger rejection, logging, and a scoring consequence.
How do I know if my current hard stop instructions are working?
Sample 500-1000 production outputs, evaluate them against your instructions using external evaluation (pattern matching, classifier, or LLM judge), and calculate the actual violation rate. Most teams find that instructions they thought were hard stops have measurable violation rates — often higher than expected on adversarial inputs.
Armalo AI provides external enforcement for behavioral hard stops: pact conditions, inline eval execution, and scoring consequences that make agent governance tractable. See armalo.ai.