You Cannot Evaluate What You Cannot Reproduce: The Multi-Modal Eval Crisis Nobody Talks About
Text-only evals were already lossy. With audio, video, and sensor streams in the input, deterministic replay is effectively dead. Without replay there is no eval. Without eval there is no trust.
There is a quiet crisis underneath the multi-modal AI wave. Most of the industry has not noticed it yet because the leaderboards still publish numbers and the demos still feel magical. The crisis is this: the evaluation methodologies that justified text-only models cannot be applied honestly to multi-sensory models. The numbers being published are increasingly meaningless, and the trust infrastructure that depends on those numbers is being built on sand.
The root cause is reproducibility.
Why text models could be evaluated at all
The reason text-only LLM evaluations have any credibility is that text is a discrete, lossless, low-bandwidth modality. You can write down the input, write down the output, save them in a file, share them with anyone in the world, and they will be evaluating bit-identical artifacts. The evaluation is reproducible because the substrate is reproducible.
Even within text, reproducibility was imperfect: tokenizer differences, sampling temperature, and system fingerprint drift all introduced variance. But those problems were tractable, well understood, and controllable with disciplined engineering practice.
The minute you add image, audio, or video to either the input or the output, this property collapses.
The four ways multi-modal eval reproducibility breaks
Input encoding ambiguity. When you pass an image to a multi-modal model, you are not passing pixels; you are passing a sequence of encoder embeddings derived from those pixels. Different versions of the encoder produce different embeddings from the same pixels. Even within a version, image preprocessing choices (resize, crop, color space, JPEG re-encoding) differ between harnesses and change what the encoder sees. Two evaluators with the same model checkpoint and the same source image can feed materially different inputs to the model and report different results, both of them claiming to evaluate "the same image."
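A minimal sketch of the effect, using only the standard library. The 4x4 grayscale "image" and the two downsampling functions are illustrative assumptions, not any real model's preprocessing; the point is only that one source can yield two different model-visible inputs.

```python
import hashlib

def downsample_nearest(img, factor):
    """Keep the top-left pixel of every factor x factor block."""
    return [row[::factor] for row in img[::factor]]

def downsample_box(img, factor):
    """Average every factor x factor block (integer floor division)."""
    out = []
    for y in range(0, len(img), factor):
        out_row = []
        for x in range(0, len(img[0]), factor):
            block = [img[y + dy][x + dx]
                     for dy in range(factor) for dx in range(factor)]
            out_row.append(sum(block) // len(block))
        out.append(out_row)
    return out

def digest(img):
    """SHA-256 over the raw pixel bytes, row by row."""
    return hashlib.sha256(bytes(p for row in img for p in row)).hexdigest()

# Hypothetical source image shared by two evaluators.
source = [
    [0, 255, 0, 255],
    [255, 0, 255, 0],
    [10, 20, 30, 40],
    [50, 60, 70, 80],
]

input_a = downsample_nearest(source, 2)  # evaluator A's pipeline
input_b = downsample_box(source, 2)      # evaluator B's pipeline

same_model_input = digest(input_a) == digest(input_b)
print("A sees:", input_a)
print("B sees:", input_b)
print("identical model-visible input:", same_model_input)
```

Both evaluators can truthfully say they used the same source file, yet the digests of what reached the model disagree.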
Audio sampling and codec lossiness. Audio is worse. Sample rates differ, codecs differ, normalization differs, voice activity detection thresholds differ. An audio clip uploaded to two different evaluation harnesses arrives at the model as two different waveforms. "Reproduce my audio eval" is a sentence that, when taken seriously, requires shipping not just the audio file but the entire preprocessing pipeline, the codec implementation, the sample-rate converter, and the bit-exact encoder build.
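As a rough illustration, the sketch below resamples one synthetic tone from 44.1 kHz to 16 kHz with two naive resamplers. Both functions are simplified stand-ins written for this example (production resamplers use proper filtering), and the tone parameters are arbitrary assumptions.

```python
import math

def make_tone(freq_hz, rate_hz, n_samples):
    """Generate a sine wave as a list of floats in [-1, 1]."""
    return [math.sin(2 * math.pi * freq_hz * i / rate_hz)
            for i in range(n_samples)]

def resample_nearest(samples, src_rate, dst_rate):
    """Pick the nearest source sample for each output position."""
    n_out = len(samples) * dst_rate // src_rate
    out = []
    for i in range(n_out):
        idx = min(len(samples) - 1, round(i * src_rate / dst_rate))
        out.append(samples[idx])
    return out

def resample_linear(samples, src_rate, dst_rate):
    """Linearly interpolate between the two neighbouring source samples."""
    n_out = len(samples) * dst_rate // src_rate
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# 10 ms of a 1 kHz tone at 44.1 kHz, sent through two harnesses.
clip = make_tone(1000, 44100, 441)
wave_a = resample_nearest(clip, 44100, 16000)
wave_b = resample_linear(clip, 44100, 16000)

max_diff = max(abs(a - b) for a, b in zip(wave_a, wave_b))
print("samples per harness:", len(wave_a), len(wave_b))
print("largest per-sample difference:", max_diff)
```

The two waveforms have the same length and the same nominal sample rate, but they are not the same signal, which is why the resampler itself has to be part of the evidence.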
Video temporal sampling. Video adds a fourth dimension. Which frames did the encoder sample? At what stride? Did it use uniform sampling, scene-change sampling, or adaptive sampling? Different multi-modal models make different choices, and even the same model may make different choices on different invocations under load. The "same video" produces different model-visible inputs across runs.
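To make this concrete, here is a small sketch of two plausible frame-selection policies applied to one hypothetical 300-frame clip. The function names, the frame budget of 8, and the stride of 30 are assumptions for illustration and do not describe any particular model.

```python
def uniform_indices(total_frames, n):
    """Pick n frames at the centre of n equal-length segments."""
    return [int((i + 0.5) * total_frames / n) for i in range(n)]

def stride_indices(total_frames, stride, n):
    """Pick the first n frames at a fixed stride from frame 0."""
    return list(range(0, total_frames, stride))[:n]

TOTAL_FRAMES = 300  # hypothetical clip length
frames_a = uniform_indices(TOTAL_FRAMES, 8)
frames_b = stride_indices(TOTAL_FRAMES, 30, 8)
shared = set(frames_a) & set(frames_b)

print("policy A frames:", frames_a)
print("policy B frames:", frames_b)
print("frames seen by both:", sorted(shared))
```

With these settings the two policies do not share a single frame, so any replay has to record the selected frame indices along with the video.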
Generated output that cannot be diffed. When the model emits an image or audio clip rather than text, the evaluator faces a new problem: how do you score whether two generated images are equivalent? Bit-level equality is the wrong question, because the model can produce slightly different outputs from run to run even at temperature zero. Perceptual similarity is closer to the right question, but every perceptual metric is itself a model with its own failure modes and its own non-determinism. You end up evaluating models with models, in a regress that erodes confidence in every layer.
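The gap between bit equality and a similarity score can be sketched with PSNR, a simple signal-level metric used here only as a stand-in for a real perceptual metric. The pixel values and the 40 dB threshold are hypothetical; choosing that threshold is exactly the kind of judgment call the paragraph above describes.

```python
import math

def psnr(a, b, max_value=255):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    if mse == 0:
        return math.inf  # identical inputs
    return 10 * math.log10(max_value ** 2 / mse)

# Two hypothetical outputs that differ by one intensity level in three pixels.
reference = [10, 50, 90, 130, 170, 210]
candidate = [11, 50, 89, 130, 171, 210]

THRESHOLD_DB = 40.0  # assumed cut-off, not a standard value

bit_identical = reference == candidate
score = psnr(reference, candidate)
equivalent = score >= THRESHOLD_DB

print("bit-identical:", bit_identical)
print("PSNR (dB):", round(score, 2))
print("treated as equivalent:", equivalent)
```

The outputs fail an exact diff but pass the thresholded check, and a different threshold or metric could flip that verdict.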
Why this breaks trust infrastructure specifically
A trust layer for AI behavior depends on being able to say, after the fact, "we evaluated this behavior against this contract and it either passed or failed." That sentence presupposes that the evaluation can be re-run, audited, contested, and verified by a third party. Reproducibility is not a nice-to-have; it is the substrate on which the entire trust apparatus rests.
When the evaluation cannot be reproduced, the consequences are concrete:
- A counterparty cannot independently verify a vendor's claim that "our agent passed eval X with score Y." They have to take the vendor's word for it. That is not third-party trust; that is brand trust.
- A regulator cannot replay an incident to determine cause. The incident is unreplayable by construction; the inputs no longer exist in the exact form the model originally saw.
- A buyer cannot compare two vendors' claims on equal footing. Vendor A and Vendor B ran their evals on different preprocessing pipelines and the numbers cannot be compared.
- An insurance underwriter cannot price model risk. The actuarial calculation assumes the evidence base is reproducible. It is not.
The eval crisis is therefore not an eval problem; it is a trust-infrastructure problem masquerading as an eval problem.
What the real fix looks like
The honest fix has three layers. None of them are popular because all of them are expensive.
Capture the model-visible inputs, not the source inputs. The evidence saved for any multi-modal evaluation has to be the exact bytes the model perceived after preprocessing (the encoder embeddings, the resampled audio waveform, the sampled video frames), not the source files. This eliminates encoder/codec/sampling drift between original behavior and replay. It also vastly increases storage cost; an honest multi-modal trust layer is fundamentally more expensive than a text-only one.
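One way to picture this layer is an evidence record keyed to the post-preprocessing bytes. The sketch below is an assumption-laden illustration: `EvidenceRecord`, `capture`, `verify`, and every field name are hypothetical, and a real system would store the bytes themselves alongside the digest.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class EvidenceRecord:
    call_id: str
    modality: str          # e.g. "image", "audio", "video"
    pipeline_version: str  # identifies the preprocessing build
    sha256: str            # digest of the bytes the model actually received
    n_bytes: int

def capture(call_id, modality, pipeline_version, model_visible: bytes):
    """Build a record from the post-preprocessing bytes, not the source file."""
    return EvidenceRecord(
        call_id=call_id,
        modality=modality,
        pipeline_version=pipeline_version,
        sha256=hashlib.sha256(model_visible).hexdigest(),
        n_bytes=len(model_visible),
    )

def verify(record: EvidenceRecord, replay_bytes: bytes) -> bool:
    """True only if a replay feeds the model byte-identical input."""
    return (len(replay_bytes) == record.n_bytes
            and hashlib.sha256(replay_bytes).hexdigest() == record.sha256)

# Hypothetical preprocessed pixel bytes from the original call.
original_input = bytes([127, 127, 35, 55])
record = capture("call-0001", "image", "prep-v1", original_input)

# A replay through a different pipeline yields different bytes.
drifted_input = bytes([0, 0, 10, 30])

replay_ok = verify(record, original_input)
replay_drifted = verify(record, drifted_input)
record_json = json.dumps(asdict(record), sort_keys=True)

print(record_json)
print("faithful replay verifies:", replay_ok)
print("drifted replay verifies:", replay_drifted)
```

A third party holding the record and the stored bytes can confirm that a replay used the same model-visible input without trusting the original evaluator's pipeline.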
Make the evaluator deterministic where possible and probabilistic where not, transparently. Some evaluations can be made deterministic with care (exact-match on text output, structural diff on tool calls). Others are irreducibly probabilistic (perceptual similarity, voice-quality scoring). Both are acceptable, but they must be labeled. A trust layer that mixes deterministic and probabilistic verdicts without disclosing which is which is misleading by construction.
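A sketch of what labeling could look like in practice, assuming a hypothetical `Verdict` type. The check names, scores, and thresholds are made up for illustration; the design point is that the verdict kind travels with every result and the summary never blends the two kinds.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class VerdictKind(Enum):
    DETERMINISTIC = "deterministic"
    PROBABILISTIC = "probabilistic"

@dataclass(frozen=True)
class Verdict:
    check: str
    passed: bool
    kind: VerdictKind
    score: Optional[float] = None      # only set for probabilistic checks
    threshold: Optional[float] = None  # disclosed alongside the score

def exact_match(check, expected, actual):
    """Deterministic: the same inputs always give the same verdict."""
    return Verdict(check, expected == actual, VerdictKind.DETERMINISTIC)

def thresholded(check, score, threshold):
    """Probabilistic: depends on a scoring model and a chosen cut-off."""
    return Verdict(check, score >= threshold,
                   VerdictKind.PROBABILISTIC, score, threshold)

def summarize(verdicts):
    """Count (passed, total) per kind so the two are never mixed silently."""
    summary = {}
    for v in verdicts:
        passed, total = summary.get(v.kind.value, (0, 0))
        summary[v.kind.value] = (passed + int(v.passed), total + 1)
    return summary

verdicts = [
    exact_match("tool_call", {"name": "search"}, {"name": "search"}),
    exact_match("text_output", "refund approved", "refund denied"),
    thresholded("image_similarity", 0.91, 0.85),  # hypothetical metric score
]
summary = summarize(verdicts)
print(summary)
```

A reader of the summary can see at a glance how much of the result rests on exact checks and how much on a scoring model.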
Run evaluations on live traffic continuously, not on test sets in CI. Once reproducibility is achieved on the model-visible input artifacts, the evaluation should be performed on every real production call, not on a frozen benchmark. The point of the eval is no longer to grade the model on a homework set; it is to produce a per-call verdict that contributes to a continuously updated trust score. Live continuous evaluation is the only thing that actually answers the question buyers and regulators are now asking.
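The article does not specify how per-call verdicts roll up into a score. As one possible illustration, the sketch below uses an exponential moving average; the class name, the `alpha` smoothing factor, and the starting value are all assumptions.

```python
class RollingTrustScore:
    """Exponential moving average over per-call pass/fail verdicts."""

    def __init__(self, alpha=0.1, initial=0.5):
        if not 0 < alpha <= 1:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha    # weight given to the newest verdict
        self.score = initial  # assumed neutral starting point
        self.calls = 0

    def update(self, passed: bool) -> float:
        """Fold one production-call verdict into the running score."""
        outcome = 1.0 if passed else 0.0
        self.score = (1 - self.alpha) * self.score + self.alpha * outcome
        self.calls += 1
        return self.score

# Three hypothetical live-call verdicts: pass, fail, pass.
tracker = RollingTrustScore(alpha=0.5, initial=0.5)
history = [tracker.update(passed) for passed in (True, False, True)]
print("score after each call:", history)
```

A larger `alpha` makes the score react faster to recent failures; a smaller one smooths over noise. Which trade-off is right depends on the traffic and the contract being enforced.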
The uncomfortable structural conclusion
If you accept that multi-modal evaluation requires capturing model-visible inputs, mixing deterministic and probabilistic verdicts transparently, and running continuously on live traffic, then you have also accepted that the entity running the evaluation cannot be the same entity producing the model. The conflict of interest is too sharp: the evaluator and the evaluatee share the same loss function. You also cannot run this infrastructure as an internal team at a company that ships the model, because that team will, sooner or later, be told to soften a verdict before a release. This is not cynicism about people; it is realism about incentives.
What multi-modal AI requires, therefore, is an independent counterparty layer that captures the model-visible artifacts, runs the joint cross-modal evaluation, produces the verdict, and stores the evidence under a different organization's roof than the model builder's.
That is what serious trust infrastructure looks like in a multi-sensory world. Anything less is theater dressed up in eval numbers.
Armalo runs continuous, third-party verification of agent behavior with reproducible, model-visible evidence retention. See how at armalo.ai.