Multi-Sensory AI Just Exploded the Audit Surface. Text-Only Trust Infra Cannot Keep Up.
When a model only read text, the audit surface was one channel. The instant it can see, hear, watch, and synthesize across modalities, the audit surface multiplies. Most trust pipelines were built for a world that no longer exists.
For roughly two years the AI safety conversation lived in a comfortable assumption: the model takes text in, the model emits text out, and the auditor inspects the transcript. That assumption is now obsolete.
The frontier models shipping today perceive images, video frames, raw audio, structured sensor streams, and, increasingly, real-time desktop and browser state. They are also, in many cases, emitting across modalities: generating synthetic voice, composing video, rendering interfaces, or controlling tools whose effects are observed by cameras and microphones rather than by API responses. The audit surface (the total set of channels a third party would need to inspect to credibly claim "this agent behaved acceptably") has expanded by at least one order of magnitude in eighteen months.
Text-only trust infrastructure was already strained. Multi-sensory AI breaks it.
The structural problem with single-channel audits
A text-only audit is, in computer science terms, a one-dimensional projection of a high-dimensional behavior. When a customer-service agent only ever read incoming chat messages and produced outgoing chat messages, the projection lost almost no information. The audit transcript was, for all practical purposes, the full behavior.
When the same agent also listens to the customer's voice, watches a screen share, reads documents the customer uploads, retrieves images from a knowledge base, and generates a synthesized voice response, the transcript is no longer the behavior. It is a single slice of the behavior. An auditor inspecting only the text channel sees a fraction of what was actually said, perceived, and decided.
The naive response is "just log everything." That is necessary but completely insufficient. Logging the bytes of every modality merely creates an evidence pile. It does not produce a judgment about whether the multi-modal behavior was correct. Producing that judgment requires evaluators that can reason across modalities: joint evaluators, not federated single-modality evaluators glued together at the end.
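To make the difference concrete, here is a minimal sketch of a text-only check next to a joint check. The `Observation` schema, the `refund_limit` field, and both function names are hypothetical, invented for illustration; a real joint evaluator would reason over raw audio and pixels rather than pre-extracted key-value instructions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One instruction the agent perceived on a single channel (hypothetical schema)."""
    modality: str  # e.g. "text", "audio", "vision"
    field: str     # what the instruction is about, e.g. "refund_limit"
    value: str     # the value that channel asked for


def text_only_verdict(observations, action):
    """Pass if the action agrees with every instruction visible on the text channel."""
    return all(
        action.get(o.field) == o.value
        for o in observations
        if o.modality == "text"
    )


def joint_verdict(observations, action):
    """Return every perceived instruction, on any channel, that the action did not honor."""
    return [o for o in observations if action.get(o.field) != o.value]


observations = [
    Observation("text", "refund_limit", "100"),
    Observation("vision", "refund_limit", "50"),  # constraint shown on a shared screen
]
action = {"refund_limit": "100"}

print(text_only_verdict(observations, action))                    # True
print([o.modality for o in joint_verdict(observations, action)])  # ['vision']
```

The text-only check passes because the action matches the only channel it can see. The joint check surfaces the visual constraint that was dropped.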
Cross-modal failure modes that single-channel evals cannot see
There is a class of failure modes that simply does not exist in single-modality systems. A few examples that are now routine in multi-sensory deployments:
Channel cross-contamination. An audio instruction, a visual instruction, and a text instruction conflict with one another. The model resolves the conflict by picking one and silently dropping the others. A text-only auditor sees only what the model emitted in text and cannot tell that the model ignored an explicit visual constraint.
Modality-shifted prompt injection. An adversary cannot inject into the system prompt directly, so they hide the injection inside an image's pixel-grid steganography, or inside a near-ultrasonic burst in an audio file, or in near-invisible low-contrast text rendered into a screenshot. The injection is invisible to text-channel logging. It executes inside the perception pipeline. By the time anything reaches the text log, the behavior has already shifted.
Confidence laundering. The model expresses high textual confidence in a claim derived from a visual or audio input it perceived poorly. The text auditor sees "I am 95% certain the image shows X." They have no way to verify the model's perception was actually accurate. The confidence number was generated downstream of a perception step the auditor never sees.
Modality echo. A voice agent calls another voice agent. The second agent says something the first agent already said. The first agent treats the echo as new information and acts on it. Text logs of either agent in isolation reveal nothing; only a joint audit of the cross-agent audio stream reveals the loop.
These are not theoretical concerns. They have all been reported in deployed multi-sensory systems in the past twelve months. The common factor is that none of them is detectable by inspecting any single modality in isolation.
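As one illustration, the modality echo case can be sketched as a check that needs both directions of the conversation at once. The `EchoDetector` class and its methods are hypothetical, and exact matching on normalized transcripts stands in for what would more plausibly be audio fingerprinting or embedding similarity in a deployed system.

```python
import re
from collections import deque


def normalize(utterance: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace so near-identical transcripts compare equal."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", utterance.lower())
    return " ".join(cleaned.split())


class EchoDetector:
    """Flags inbound utterances that repeat something this agent recently said.

    Hypothetical helper for illustration only.
    """

    def __init__(self, window: int = 20):
        # Keep only the most recent outbound utterances.
        self._recent_outbound = deque(maxlen=window)

    def record_outbound(self, utterance: str) -> None:
        self._recent_outbound.append(normalize(utterance))

    def is_echo(self, inbound: str) -> bool:
        return normalize(inbound) in self._recent_outbound


detector = EchoDetector()
detector.record_outbound("Your order ships Friday.")

print(detector.is_echo("your order ships friday"))   # True
print(detector.is_echo("Your order ships Monday."))  # False
```

The check only works because the detector holds the agent's outbound stream and the counterpart's inbound stream together. Neither log alone contains the information needed to see the loop.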
Why "more evals" is not the fix
The natural reaction inside a frontier lab is: we will write more evaluations, covering the new modalities, and run them in CI. This response is incomplete in a way that matters.
CI evaluations cover a fixed test set. The model passes the eval, ships, and then encounters production inputs that the test set never sampled. In text-only systems, the test-to-production distribution shift was already substantial. In multi-sensory systems, the shift is enormous, because the input space includes every possible image, every possible audio waveform, every possible video sequence, every possible combination of those modalities, and every possible adversarial perturbation of all of the above. No test set covers this. No CI run answers the question "is the agent behaving acceptably right now, on this user's actual inputs."
The only structural answer to that question is the same answer that the financial system arrived at for trading and the medical system arrived at for clinical decisions: continuous, independent, third-party review of live behavior, with the reviewer's incentives explicitly separated from the entity being reviewed.
What a multi-sensory trust layer actually has to do
A trust infrastructure that can credibly cover multi-sensory AI behavior has to satisfy at least the following properties, none of which are present in a typical "we run our own evals" setup:
Joint cross-modal evaluation. Each behavior is judged against all the modalities the agent actually perceived and emitted, not each modality in isolation. The judgment captures whether the modalities were consistent with each other and consistent with the agent's stated reasoning.
Real-time, not retrospective. Audit verdicts are produced on the same time scale the agent is acting. A weekly safety review of last week's behavior is not trust infrastructure; it is forensics. Real-time review is the only kind that can prevent a multi-sensory failure from compounding into a downstream consequence.
Independent counterparty. The auditor is not the same entity that built or operates the agent. This is not a procedural nicety; it is a structural requirement. Frontier labs cannot credibly self-audit for the same reason exchanges cannot self-clear and auditors cannot also be management: the incentive geometry rules out trustworthy output.
Behavioral provenance. The audit produces a verifiable record of what the agent perceived, what it decided, and what the reviewer concluded, in a form that is portable across counterparties, queryable by downstream systems, and durable enough that future regulators can examine it.
Modality-aware adversarial testing. The auditor continuously probes the agent with adversarial multi-modal inputs and measures the response, so that failure modes that have never occurred in production traffic are still uncovered before they occur for the first time at a customer.
These are not aspirational properties. They are the minimum bar for any infrastructure that calls itself a trust layer for multi-sensory AI.
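The behavioral provenance property, for instance, can be sketched as a hash-chained log, where each record commits to the one before it so that later tampering is detectable. This is a generic technique, not a description of any particular product. The field names, the `append_record` and `verify_chain` helpers, and the placeholder media hashes are all assumptions, and a real system would also need signatures from the independent reviewer.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first record


def _digest(body: dict) -> str:
    """Hash a record body deterministically (sorted keys, compact separators)."""
    payload = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_record(chain: list, perceived: dict, decision: str, verdict: str) -> dict:
    """Append a record that commits to the previous record's hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    body = {
        "prev_hash": prev_hash,
        "perceived": perceived,  # hashes of the raw media, not the media itself
        "decision": decision,
        "verdict": verdict,
    }
    record = {**body, "hash": _digest(body)}
    chain.append(record)
    return record


def verify_chain(chain: list) -> bool:
    """Recompute every hash and check that each record links to its predecessor."""
    prev_hash = GENESIS
    for record in chain:
        body = {k: v for k, v in record.items() if k != "hash"}
        if record["prev_hash"] != prev_hash or record["hash"] != _digest(body):
            return False
        prev_hash = record["hash"]
    return True


chain = []
append_record(chain, {"audio_sha256": "ab12", "frame_sha256": "cd34"}, "issue_refund", "pass")
append_record(chain, {"audio_sha256": "ef56"}, "escalate", "flag")
print(verify_chain(chain))  # True

chain[0]["decision"] = "deny_refund"  # simulate tampering after the fact
print(verify_chain(chain))  # False
```

A chain like this only shows that the record was not altered after it was written. It says nothing about whether the reviewer's verdict was correct, which is why the independence and joint-evaluation properties still matter.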
The honest position on where the industry actually is
Most production deployments of multi-sensory AI today have, at most, retrospective logging of one or two modalities, an internal eval pipeline that runs against a fixed test set, and a team that reviews escalations after a user complaint. That is not trust infrastructure. That is product telemetry plus customer support.
The gap between what is being shipped and what is required is large and growing. Closing it requires conceding two uncomfortable things at the same time: that the auditor cannot be the same organization as the model builder, and that the audit cannot be a quarterly committee output; it has to run on every meaningful call.
The next several posts in this series take each of these properties (joint evaluation, real-time review, independent counterparty, behavioral provenance, adversarial multi-modal probing) and examine what each one actually requires to build. None of them is easy. All of them are non-negotiable.
Armalo is building the independent trust layer for the AI agent economy: real-time, third-party verification of behavior across modalities. Explore armalo.ai.