Eval-As-A-Service: Why Independent Evaluation Is The Audit Profession Of The Agent Economy
Internal evals fail the way internal financial audits fail. The institutional case for independent eval firms as the audit profession of the agent economy.
TL;DR
Every serious institutional trust system in modern economies is held up by independent third-party evaluators. Financial audits are not done by the company being audited. Pharmaceutical trials are not run by the drug company alone. Building safety inspections are not done by the developer. The agent economy is on the same trajectory and is currently at the equivalent of pre-1934 financial reporting: every agent operator runs its own internal evals, publishes its own scores, and asks counterparties to trust them. This will not last. This essay makes the institutional case for independent evaluation as a profession, identifies the structural failure modes of internal evals, builds the vendor-selection checklist for buyers of independent eval services, and outlines what the eval profession will look like in the second half of this decade.
The Coming Audit Reckoning
The Securities Act of 1933 required companies offering securities to the public to file financial statements certified by an independent accountant, and the Securities Exchange Act of 1934 extended the requirement to ongoing periodic reporting. The audit had to be done by a firm separate from the company being audited, with professional liability for misleading reports. Before these laws, companies audited themselves, hired auditors of convenience, or skipped the audit altogether. Investors had to take the company's word for its financial condition. The result, predictably, was that companies that needed external capital lied about their condition with regularity, the lies caught up with them periodically in spectacular failures, and capital allocation across the economy was distorted by the inability of investors to distinguish honest reporters from dishonest ones. The 1929 crash and the Pecora Commission that followed made clear that the cost of unverifiable financial reporting was not borne by the lying companies alone but by the entire economy.
The agent economy in 2026 is in roughly the same place. Operators of consequential agents publish quality metrics for their own agents. The metrics are sometimes accurate, sometimes generous, sometimes outright fabricated. Counterparties wishing to integrate an agent into a workflow have no independent verification of the metrics; they can run their own tests, but most don't, because most don't have the eval engineering capacity. The result is that capital, attention, and integration effort flow toward agents whose self-reported metrics look good, and the gap between self-reported quality and actual production quality is large enough that procurement decisions are systematically wrong. The cost is not borne by the lying operators alone. It is borne by every workflow downstream of an agent that does not perform as advertised, by every developer who builds against an agent that turns out to be unreliable, by every end user whose experience degrades because the agent in their stack is worse than its score suggested.
The institutional response to this kind of market failure is well-established and follows a predictable arc. First, the highest-stakes participants notice that informal trust is failing them and start demanding external verification. Second, a small number of independent firms emerge to provide that verification, often spun out of internal eval teams at the largest operators. Third, the verification standards become formal, with industry consortia or government bodies prescribing methodology. Fourth, professional certification of evaluators emerges, with malpractice exposure for evaluators who do shoddy work. Fifth, the use of independent verification becomes a market default, and operators who skip it are penalized in pricing and integration.
The agent economy is in stages one and two of this arc as of 2026. The largest operators are starting to demand external eval reports for the agents they integrate. A small number of independent eval firms have emerged, mostly in 2024 and 2025, providing third-party scoring and audit services. Some of these are spun out of internal eval teams at large model labs and AI infrastructure companies. Some are pure-play startups. Some are extensions of existing professional services firms moving into AI assurance. The institutional shape is forming.
What is missing, and what this essay tries to provide, is the systematic case for why independent evaluation is necessary, the catalog of failure modes that internal evals exhibit, the methodology principles that distinguish credible independent evaluators from theatrical ones, and the procurement playbook that buyers of independent eval services need in order to choose well. The audit profession of the agent economy is forming whether the participants are ready or not. The participants who are ready will get better counterparties, better integrations, and better long-term economic outcomes. The participants who are not ready will spend the next several years repeatedly being burned by agents whose reported quality bore no relationship to their actual behavior, until they too move to demanding independent verification.
This essay is the institutional case. It is also the practical guide. By the end you should be able to evaluate an independent evaluator, set up the procurement contract, integrate the eval reports into your buying decisions, and understand why the trust layer of the agent economy will look more like the audit profession than like the academic benchmark culture it is gradually replacing.
The Five Structural Failures Of Internal Evals
Internal evals fail in five structural ways that no amount of methodological discipline can fix. The failures are not about competence. They are about incentive alignment. The same internal team that evaluates the agent has incentives, sometimes subtle and sometimes overt, to produce favorable evaluations. Even if the team has perfect methodology and perfect intent, the structural position they occupy makes their reports less credible than the same reports produced by an independent party. This is the same logic that requires financial audits to be independent: not because internal accountants are dishonest, but because the structure of being internal compromises credibility regardless of intent.
The first failure is the test-the-test problem. Internal eval teams write the tests. Internal eval teams know what tests they wrote. Internal model and agent teams have access to the internal eval teams. Information about test content, test distribution, and test scoring rules leaks across organizational boundaries that exist on paper but are porous in practice. The agent gets fine-tuned, intentionally or otherwise, to score well on the tests. The score goes up. The eval team correctly observes that the score went up. The score is no longer measuring what the eval team thought it was measuring; it is measuring conformance to the leaked test distribution. This is the agent equivalent of teaching to the test, and it happens at scale in every internal eval program that runs long enough.
The second failure is the publication-bias problem. When the eval team's institutional sponsor is the same organization that ships the agent, eval results that make the agent look bad are systematically delayed, rephrased, contextualized, or buried. Eval results that make the agent look good are amplified. The publication bias is rarely conscious. It manifests as longer review cycles for unfavorable results, more thorough "contextualization" of unfavorable findings, and faster paths to public release for favorable findings. Over time, the public eval record for any agent evaluated only by an internal team is systematically biased upward, because the unfavorable signals get filtered before they reach external audiences. External evaluators, with no stake in the agent's market reception, do not have this filter, and their reports show a fuller picture of the agent's performance.
The third failure is the scope-shaping problem. Internal eval teams have ongoing relationships with the agent teams they evaluate. The agent teams influence what gets evaluated, how often, and against what reference distributions. Over time, the eval scope shapes itself toward areas where the agent does well and away from areas where the agent does poorly. This is partly conscious, with agent teams arguing that certain test scenarios are not representative of production, and partly unconscious, with the eval team naturally drifting toward measurements that are easier to defend internally. Independent evaluators have no relationship with the agent team and can sustain coverage of areas that the agent team would prefer to deemphasize.
The fourth failure is the catastrophe-disclosure problem. When an internal eval program detects a serious agent failure, the question of what to do about it is necessarily entangled with the operator's commercial interests. Disclosing publicly may cost contracts, embarrass leadership, or trigger regulatory attention. Suppressing disclosure may protect short-term commercial interest at the cost of long-term trust. The internal eval team is poorly positioned to make this call, because it lacks both the independence to publish despite operator preference and the operator's full strategic context for understanding the disclosure trade-offs. Independent evaluators have a clearer mandate: their commercial interest is in the credibility of their reports, which depends on disclosing what they find regardless of which agent operator is on the receiving end.
The fifth failure is the credibility-discount problem. Even when an internal eval team is operating with full integrity and producing accurate, comprehensive evaluations, the reports they produce are systematically discounted by external readers because external readers cannot verify the team's independence. From the outside, an honest team's reports look exactly like the reports a biased team would produce, because the structural position is the same in both cases. The operator's good faith is invisible to external observers. This means that even a perfect internal eval program is worth less than an imperfect external one, because the external one carries an evidentiary weight the internal one cannot. The credibility discount is structural and cannot be overcome by methodological excellence.
These five failures together produce a market in which internal eval reports are systematically less informative, less credible, and less actionable than independent eval reports of equivalent methodological quality. Operators who continue to rely solely on internal evaluation are sending a market signal that they are not ready for serious institutional procurement, regardless of how good their internal eval team actually is. The shift to independent evaluation is not a verdict on internal teams; it is a verdict on the structural position they occupy.
What Independence Actually Means
Independence is a word with a long history of being abused in professional services markets. The accounting profession spent decades developing an operational definition of auditor independence that excludes financial relationships, employment relationships, family relationships, and certain consulting relationships with the audit client. The same operational work needs to happen in agent evaluation, and the work has barely begun. This section sketches what independence should mean for an eval firm and identifies the specific patterns that erode it.
Financial independence is the simplest. The eval firm should not have an equity stake in the agent operator, the model provider underlying the agent, or any infrastructure provider that the agent's performance materially depends on. The eval firm should not be directly compensated by the operator for producing favorable reports. The eval firm's revenue from any single client should be capped, conventionally at five to ten percent of total firm revenue, to prevent any single client from being able to economically coerce favorable findings. The eval firm should not provide other services to the operator that create implicit compensation linkages, especially consulting services aimed at improving the agent's score. This last constraint is the one most likely to be violated in practice, because the obvious commercial extension of an eval firm is consulting on how to improve evaluated dimensions, and that extension creates a structural conflict.
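The concentration cap described above reduces to a simple share-of-revenue check. The following is a minimal sketch, assuming a firm discloses revenue per client; the function name `concentration_breaches`, the client names, and the figures are all hypothetical, and the default cap of ten percent is simply the upper end of the range mentioned in the text.

```python
from typing import Dict, List


def concentration_breaches(revenue_by_client: Dict[str, float],
                           cap: float = 0.10) -> List[str]:
    """Return the clients whose share of total firm revenue exceeds `cap`."""
    total = sum(revenue_by_client.values())
    if total <= 0:
        raise ValueError("total revenue must be positive")
    return sorted(
        client
        for client, revenue in revenue_by_client.items()
        if revenue / total > cap
    )


# Hypothetical disclosure: two operators plus ten smaller clients (total 1000).
revenue = {"operator_a": 180.0, "operator_b": 90.0}
revenue.update({f"client_{i}": 73.0 for i in range(10)})

# operator_a holds 18% of revenue, above a 10% cap; everyone else is below it.
print(concentration_breaches(revenue))
```

A buyer could run the same check at the stricter five percent threshold by passing `cap=0.05`, which is the kind of sensitivity question worth raising on a diligence call.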
Methodological independence is the next layer. The eval firm chooses its own probes, its own judges, its own scoring methodology, and its own reporting cadence. The operator does not get to influence any of these. The operator gets to provide the agent under standardized API access, gets to comment on draft reports for factual accuracy, and gets to dispute findings through a defined dispute resolution process. The operator does not get to negotiate methodology, suppress probes that produce inconvenient results, or veto the publication schedule. Eval firms that allow methodological influence by clients have surrendered the independence that gives their reports value, even if they are still nominally separate organizations.
Reporting independence is where institutional credibility either holds or fails. The eval firm publishes its findings on its own schedule, in its own format, through its own channels. Drafts may be shared with the operator for review, but the eval firm controls timing and final wording. Adverse findings are not embargoed indefinitely while the operator prepares a response; they are published on a defined timeline regardless of operator action. Disputes are noted in the published report rather than being resolved before publication. The eval firm is the publisher; the operator is the subject; the public is the audience. When this structure inverts, with the operator effectively controlling what gets said when, the eval firm has become a public-relations vendor, and its reports stop being evidence.
Governance independence is the deepest and the most often overlooked. The eval firm's leadership should not be drawn primarily from the operators it evaluates. There should be no revolving door where eval firm partners take operator jobs and operator executives become eval firm advisors. The board of directors of the eval firm should have no operator employees in voting positions. Any deviation from these patterns should be publicly disclosed in the firm's governance disclosures, with specific case-by-case rationale. The audit profession has spent decades building these conventions; the eval profession is at the start of the same process and needs to skip ahead by adopting the conventions that the audit profession learned the hard way.
Independence is not binary. Eval firms exist along a spectrum from fully independent, with all four dimensions clean, to fully captured, with revenue, methodology, reporting, and governance all dominated by client influence. Buyers of independent eval services need to evaluate where on this spectrum a candidate firm sits, not just whether the firm is nominally separate. The specific patterns to look for are concentration in any single client above the conventional threshold, joint marketing or co-branding with operators, consulting service lines aimed at improving evaluated dimensions, restricted publication agreements, mandatory pre-publication review periods extending beyond fact-checking, governance structures with operator representation, and history of personnel exchange with operators. Each of these reduces the evidentiary weight of the firm's reports, sometimes dramatically.
The Methodology Of A Credible Independent Eval Firm
Independence is necessary but not sufficient. An independent firm with bad methodology is still bad. The methodology of a credible independent eval firm has several distinguishing properties that buyers should look for and that less-credible firms either do not have or cannot demonstrate.
Probe diversity is the first. A credible eval firm runs probes across a wide range of scenarios, difficulty levels, and adversarial conditions, not just the happy path the operator's marketing emphasizes. The probe portfolio should include capability probes that test what the agent claims to do, robustness probes that test how the agent responds to malformed or adversarial inputs, scope probes that test the boundary between what the agent does and does not handle, and behavior probes that test honesty, calibration, and self-report fidelity. A firm that only runs capability probes is testing only the dimension the operator wanted tested, which is not enough.
Probe rotation is the second. The probes the firm runs in any given quarter should be drawn from a larger pool, with some held out for use in subsequent quarters. This is the defense against probe contamination, where the operator's agent is tuned over time to perform well on the specific probe distribution the firm uses. The pool should grow over time, with new probes added as new failure modes are identified. The rotation policy should be public, even if the specific probes in any given quarter are not, so operators can predict the broad coverage even if they cannot tune to specific items.
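One way to picture a rotation policy is as a seeded draw from the pool that prefers probes not used in recent quarters. This is an illustrative sketch only, not a description of any firm's actual policy; `draw_quarterly_probes`, the probe identifiers, and the pool size are assumptions made for the example.

```python
import random
from typing import List, Sequence, Set


def draw_quarterly_probes(pool: Sequence[str], recently_used: Set[str],
                          k: int, seed: int) -> List[str]:
    """Draw k probes for a quarter, preferring probes not used recently."""
    if k > len(pool):
        raise ValueError("cannot draw more probes than the pool contains")
    rng = random.Random(seed)  # seeded so the draw can be audited later
    fresh = sorted(p for p in pool if p not in recently_used)
    if len(fresh) >= k:
        return rng.sample(fresh, k)
    # Not enough fresh probes: use all of them, then top up from reused ones.
    stale = sorted(p for p in pool if p in recently_used)
    return fresh + rng.sample(stale, k - len(fresh))


# Hypothetical pool of 40 probes, 10 drawn per quarter.
pool = [f"probe-{i:03d}" for i in range(40)]
q1 = draw_quarterly_probes(pool, set(), k=10, seed=1)
q2 = draw_quarterly_probes(pool, set(q1), k=10, seed=2)
print(set(q1) & set(q2))  # empty: no probe repeats across the two quarters
```

Publishing the pool size, the draw size, and the look-back window would make the policy public without revealing which probes land in a given quarter.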
Jury composition is the third. A credible firm uses multi-LLM juries with explicit composition rules: how many judges, from which model providers, with what trim parameters for outliers, with what verification against human expert reviewers. The composition should be documented and version-stamped, with changes published in advance. The firm should periodically calibrate its judge panel against human expert reviewers, on a sample of agent outputs that the human reviewers also score, to verify that the panel is aligned with expert judgment rather than diverging from it. Single-judge eval is a methodology shortcut that produces high-variance, manipulable verdicts; multi-LLM jury with calibration is the minimum methodology for credible scoring at scale.
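To make the trim parameter and the calibration step concrete, here is a small sketch under stated assumptions: judge verdicts are numeric scores on a common scale, outliers are handled with a symmetric trimmed mean, and calibration is summarized as the mean absolute gap between panel and human scores. The names `trimmed_jury_score` and `calibration_gap`, the judge labels, and the numbers are hypothetical; a real firm may aggregate quite differently.

```python
from statistics import mean
from typing import Dict, Sequence


def trimmed_jury_score(verdicts: Dict[str, float], trim: int = 1) -> float:
    """Drop the `trim` lowest and `trim` highest judge scores, average the rest."""
    if trim < 0 or len(verdicts) <= 2 * trim:
        raise ValueError("need more judges than are trimmed")
    ordered = sorted(verdicts.values())
    kept = ordered[trim:len(ordered) - trim]
    return mean(kept)


def calibration_gap(panel: Sequence[float], human: Sequence[float]) -> float:
    """Mean absolute difference between panel and human scores on shared items."""
    if len(panel) != len(human) or not panel:
        raise ValueError("need equal-length, non-empty score lists")
    return mean([abs(p - h) for p, h in zip(panel, human)])


# Hypothetical five-judge panel with one outlier verdict (judge_c).
verdicts = {"judge_a": 0.82, "judge_b": 0.79, "judge_c": 0.15,
            "judge_d": 0.85, "judge_e": 0.80}
print(trimmed_jury_score(verdicts, trim=1))  # averages 0.79, 0.80, 0.82
```

The outlier verdict has no effect on the trimmed score here, which is the property that makes a jury harder to manipulate than a single judge.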
Reproducibility is the fourth. The firm should publish enough detail about its methodology that another independent firm could, in principle, replicate the evaluation. This means publishing the probe categories and example probes, the jury composition rules, the scoring rules, the aggregation logic, the versioning conventions, and the dispute resolution process. The full probe set, the identity of individual judges, and the exact scoring weights can remain private, but the framework must be public enough that the report can be defended methodologically. Black-box eval reports are not credible because they cannot be challenged or replicated.
Dispute resolution is the fifth. When the operator disagrees with a finding, there must be a defined process for raising the disagreement, having it considered, and either resolving it or noting the unresolved dispute in the published report. The process should have time limits, neutral arbitration where appropriate, and transparent documentation of the disposition. Eval firms that have no dispute process either suppress operator disputes informally, which damages credibility, or formalize them through litigation, which is operationally untenable. A defined process is the middle path that lets operators contest findings without compromising the firm's editorial independence.
Longitudinal tracking is the sixth. A credible firm tracks scores over time and reports drift decomposition when scores change. This includes both agent drift, which is a signal about the agent, and judge or methodology drift, which is a signal about the eval system. The firm should publish judge and methodology version histories so that score changes can be attributed to the right cause. Without longitudinal tracking, a single quarterly score is just a snapshot; with it, the score becomes part of a series that has narrative meaning.
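The text does not specify how a drift decomposition is computed. One possible approach, sketched here purely as an assumption, is to re-score the previous period's retained agent outputs with the current judge panel on a shared set of anchor probes. The difference the new panel makes on old outputs is attributed to the eval system, and the remainder to the agent. The names `DriftDecomposition` and `decompose_drift` and the scores are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DriftDecomposition:
    total: float        # change in the reported score between periods
    judge_drift: float  # part explained by the judge/methodology change
    agent_drift: float  # part explained by the agent's own outputs changing


def decompose_drift(old_outputs_old_judges: float,
                    old_outputs_new_judges: float,
                    new_outputs_new_judges: float) -> DriftDecomposition:
    """Split a score change using a re-score of old outputs by the new panel."""
    judge_drift = old_outputs_new_judges - old_outputs_old_judges
    agent_drift = new_outputs_new_judges - old_outputs_new_judges
    return DriftDecomposition(
        total=new_outputs_new_judges - old_outputs_old_judges,
        judge_drift=judge_drift,
        agent_drift=agent_drift,
    )


# Hypothetical scores: the reported score fell from 0.80 to 0.78, but the new
# panel scores the old outputs at 0.76, so the agent itself improved slightly.
result = decompose_drift(0.80, 0.76, 0.78)
print(result)
```

By construction the two components sum to the total change, so a reader of the report can see at a glance whether a falling score reflects a worse agent or a stricter panel.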
Adversarial coverage is the seventh. A credible firm runs adversarial probes designed to find failure modes the agent's developers did not anticipate. This includes prompt injection probes, jailbreak probes, scope-violation probes, and capability-stretch probes. The adversarial program should be ongoing and well-funded, with new attack vectors added as they are discovered in the broader research community. Firms that only run cooperative probes are missing the failure modes that cause the most production damage, because the worst incidents almost always come from inputs the operator did not predict.
A firm that has all seven of these properties is one whose reports carry evidentiary weight. A firm missing several is one whose reports are decorative. The diligence buyers do on prospective eval providers should hit each of these explicitly, and the diligence reports should be retained as part of the procurement record so that, if the eval firm later turns out to have been less credible than represented, the buyer can demonstrate good-faith diligence.
The Economics Of Independent Eval
The economic model of independent eval determines who pays, what they get, and what incentives the firm faces. The choice of economic model is itself a credibility signal, and the wrong model can compromise an otherwise well-structured firm.
The first model is operator-pays, where the agent operator pays the eval firm for evaluating the operator's agent. This is the financial-audit model and it is the most common, but it carries the structural conflict that the entity being evaluated is the entity paying for the evaluation. The mitigations are concentration limits on any single client, multi-year engagement structures that reduce the year-over-year leverage of any single payment, partnership rather than employment of the senior evaluators, and professional liability that makes the firm's economic interest in long-term credibility larger than any single client's payment. Operator-pays is workable, with these mitigations, but it requires institutional discipline that not all firms have.
The second model is buyer-pays, where the entity considering procuring an agent pays for an independent evaluation of that agent before procurement. This is closer to buy-side due diligence, where the prospective purchaser commissions the research, and it has different conflict structures. The buyer wants the evaluation to be accurate so they can make a good procurement decision. The eval firm's incentive aligns with buyer-side accuracy, which is what the eval is for, but the operator may not cooperate with a buyer-paid evaluation that they did not consent to, which limits the firm's access to the agent. Buyer-pays works for evaluating publicly accessible agents but breaks down for agents that require cooperation from the operator to evaluate.
The third model is consortium-pays, where a group of operators or buyers pool funding to support an independent eval firm that produces public reports on the agents in the consortium's domain. This is closer to the subscriber-funded model that credit-rating agencies used before they shifted to issuer-pays, where the rated entity does not pay directly but the broader market funds the rating activity. Consortium-pays has the cleanest incentive structure but requires significant coordination overhead and is most workable in mature markets where the value of standardized eval reports is clear to all participants. The agent economy is not yet there, but consortium models will likely emerge in specific high-stakes domains over the next few years.
The fourth model is regulator-pays, where a government or regulatory body funds independent evaluation of agents in regulated domains. This resembles government-run testing and inspection regimes in other regulated industries. Regulator-pays has the strongest independence properties but only applies in domains where regulatory mandate exists, which is currently a small subset of the agent economy and will grow over time as regulators expand jurisdiction over agent behavior in financial, medical, and infrastructure contexts.
The fifth model is mixed funding, where the eval firm draws revenue from multiple sources with different incentives, such that no single source can dominate the firm's behavior. This is closer to the academic-research model in some ways, with grant funding, foundation support, and commercial revenue combining. Mixed funding is the most resilient against any single source's pressure but is the hardest to scale to the volume of evaluations the agent economy will need.
Most serious independent eval firms in 2026 use some combination of these models. Pure operator-pays is the easiest to start with but carries the most conflict. Pure buyer-pays is the most difficult to scale. Consortium and regulator funding will grow over the next several years as institutional structures mature. The buyer of eval services should look at the firm's funding mix and ask: who pays the bulk of the bills, and what could that payer demand of the firm if pressed. A firm that cannot give a confident answer about how it would respond to coercive pressure is one whose independence is theoretical rather than operational.
The pricing of eval services will shake out over the next three to five years. Currently, comprehensive evaluations run from the low five figures for a single agent quarterly evaluation to the mid six figures for a multi-agent program with adversarial testing and continuous monitoring. The price floor is the cost of compute for jury runs, which is meaningful but not enormous; most of the price reflects the value of the firm's methodology, governance, and credibility, which is the right place for the price to be. As the market matures, price competition will probably compress the lower end while specialty firms with strong methodology in particular domains will sustain premium pricing.
The Buyer's Position
The buyer of eval services has more power than they typically realize. The market is forming, the firms are competing for business, and the buyer's procurement decisions shape what kind of eval profession the agent economy ends up with. Buyers who do procurement well will get better evaluations, better counterparty intelligence, and better long-term outcomes. Buyers who do it badly will end up with a market dominated by theatrical eval firms that produce favorable reports for whoever pays them.
The first principle is to procure on capability, not on relationship. Eval firms that are easy to work with in a personal sense are not necessarily eval firms that produce high-quality reports. The diligence on prospective firms should be methodology-first: probe coverage, jury composition, dispute resolution, governance structure, funding mix. Personal rapport is fine but should not be load-bearing in the decision.
The second principle is to require explicit independence representations in the engagement contract. The contract should specify the firm's independence policies, the concentration limits the firm operates under, the consulting services the firm does and does not offer, and the personnel exchange policies. These should be representations that, if false, give the buyer cause for termination and recovery. Eval firms that resist explicit independence representations are signaling that their independence is not as strong as they want to be held to.
The third principle is to fund methodology investment, not just delivery. A credible eval firm has ongoing methodology research, probe development, judge calibration, and adversarial program work. This work is expensive and is what distinguishes credible firms from theatrical ones. Buyers who pay only for delivery and not for methodology investment are funding the part of the firm that produces the report and starving the part that makes the report meaningful. Engagement contracts should include some allocation toward methodology improvement, either explicitly or implicitly through pricing that supports it.
The fourth principle is to require longitudinal commitments. A single quarterly evaluation is much less informative than four consecutive quarterly evaluations of the same agent, because the four together produce a drift decomposition that the one cannot. Buyers who procure single evaluations are getting snapshots; buyers who procure programs are getting series. The series is what supports procurement decisions over time. Engagement structures should default to multi-quarter or annual programs, with the option to terminate early if the firm fails to perform.
The fifth principle is to publish what you procure. The buyer's procurement decisions become market signals. Buyers who publicly disclose which independent eval firms they use, which engagement structures they prefer, and which methodology requirements they demand, shape the market toward those structures. Buyers who keep their eval procurement private have less market influence and end up subsidizing the firms they would have rejected if other buyers' choices were visible. There is a coordination benefit to publication, even if individual buyers might prefer to keep their choices private for tactical reasons.
The sixth principle is to compare across firms periodically. No single eval firm should be the sole evaluator for any agent or operator on a permanent basis. Even credible firms have methodology biases, blind spots, and performance variance. Comparing reports across two or more firms periodically reveals where firms agree, where they disagree, and which disagreements are signal versus noise. The comparison is expensive, but it is the meta-eval that keeps the eval firms accountable. Buyers that procure from multiple firms over time get richer counterparty intelligence than buyers that lock in with one firm and stop comparing.
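A first pass at the cross-firm comparison can be as simple as lining up two firms' scores for the agents both have evaluated and flagging gaps beyond a chosen tolerance. This is a rough sketch that assumes both firms report on the same 0 to 100 scale; `flag_disagreements`, the tolerance of ten points, and the agent names and scores are hypothetical.

```python
from typing import Dict, List, Tuple


def flag_disagreements(firm_a: Dict[str, float], firm_b: Dict[str, float],
                       tolerance: float = 10.0) -> List[Tuple[str, float]]:
    """List (agent, score gap) pairs where two firms differ by more than tolerance.

    Only agents evaluated by both firms are compared; the gap is A minus B.
    """
    shared = sorted(firm_a.keys() & firm_b.keys())
    gaps = [(agent, firm_a[agent] - firm_b[agent]) for agent in shared]
    return [(agent, gap) for agent, gap in gaps if abs(gap) > tolerance]


# Hypothetical reports from two eval firms on overlapping agents.
firm_a = {"agent-1": 82, "agent-2": 64, "agent-3": 71}
firm_b = {"agent-1": 79, "agent-2": 88, "agent-4": 50}

# agent-1 differs by 3 points (within tolerance); agent-2 differs by 24.
print(flag_disagreements(firm_a, firm_b))
```

A flagged gap is a prompt for follow-up rather than a verdict: it may reflect different probe coverage, a methodology difference, or a genuine blind spot at one of the firms.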
The buyer's position is, in the end, the load-bearing position in the entire eval profession. Without buyers who demand quality, eval firms will deliver theater. With buyers who demand quality, eval firms will deliver evidence. The difference between an eval profession that produces a real trust layer for the agent economy and one that produces a Potemkin trust layer is, in significant part, the discipline of the buyers. This essay's reader artifact, the Independent Evaluator Vendor Selection Checklist, is the operational instantiation of that discipline.
The Independent Evaluator Vendor Selection Checklist (Reader Artifact)
The Independent Evaluator Vendor Selection Checklist (IEVSC) is a structured procurement diligence framework for buyers of independent eval services. It is designed to be run in a single one-week diligence sprint per candidate firm and to produce a comparable scorecard across multiple candidates. The checklist has thirty-five items across seven categories, each scored on a defined rubric. The total score is capped at one hundred. Firms below sixty should be disqualified from serious procurement. Firms from sixty through seventy-five are workable for non-critical evaluations. Firms above seventy-five are appropriate for high-stakes procurement. Firms above ninety are best-in-class.
Category one, financial independence, has five items: concentration of revenue from any single client over the past three years; presence of consulting service lines for evaluated clients; equity holdings in evaluated operators or their dependencies; family or employment relationships between firm leadership and operator leadership; and the firm's stated concentration policy and historical adherence to it. Each item is scored zero to four, with four being the cleanest position, for a maximum of twenty.
Category two, methodology rigor, has six items: probe coverage across capability, robustness, scope, and behavior dimensions; probe rotation policy and pool size; jury composition and human expert calibration; reproducibility of methodology in published documentation; adversarial program scope and funding; and longitudinal tracking and drift decomposition. Each item is scored zero to four, for a maximum of twenty-four.
Category three, governance, has five items: board composition and operator representation; revolving door history with major operators; published governance disclosures; the auditor of the eval firm itself; and the complaint and ethics process for staff. Each item is scored zero to three, for a maximum of fifteen.
Category four, reporting discipline, has five items: a publication schedule independent of the operator review cycle; a defined dispute resolution process; immutable score history with version stamping; the adverse-finding disclosure timeline; and the format and accessibility of published reports. Each item is scored zero to three, for a maximum of fifteen.
Category five, operational scale, has four items: number of agents evaluated quarterly; range of evaluated capability domains; turnaround time for new evaluations; and staffing capacity for the adversarial program. Each item is scored zero to two, for a maximum of eight.
Category six, longitudinal track record, has five items: years of operation; number of completed quarterly cycles; longitudinal data quality and accessibility; track record of catching agent failures before market consequences; and track record of avoiding false positives. Each item is scored zero to two, for a maximum of ten.
Category seven, transparency, has five items: methodology documentation depth; public changelog of methodology and version changes; public disclosure of significant adverse findings on evaluated agents; participation in industry methodology consortia; and willingness to participate in cross-firm meta-evaluation. Each item is scored zero to two, for a maximum of ten. The seven category maxima sum to 102 rather than 100, so the aggregate score is capped at 100, with the rubric resolving exact scoring on edge cases. The exact scoring rubric is published in the IEVSC reference document linked from the Armalo developer documentation.
The checklist is designed to be used by procurement teams that may not have deep eval methodology expertise themselves. The rubrics for each item include enough detail that a competent procurement professional can score the item from a candidate firm's published materials and a half-day diligence call. The output is a comparable scorecard across firms, an aggregate score, and a category-by-category breakdown that lets buyers see strengths and weaknesses. The checklist is open and may be adapted for specific buyer contexts; modifications should be documented so that scorecards across buyers using modified versions are comparable through the documented modifications.
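The scoring arithmetic described above can be sketched in a few lines of Python. This is an illustration only, not the published IEVSC rubric: the names `CATEGORIES`, `score_firm`, and `tier` are hypothetical, the cap is applied as a simple minimum, and the treatment of scores landing exactly on seventy-five or ninety is an assumption, since the text leaves edge cases to the rubric document.

```python
# Category layout as described in the text: (name, item count, max points per item).
CATEGORIES = [
    ("financial independence", 5, 4),
    ("methodology rigor", 6, 4),
    ("governance", 5, 3),
    ("reporting discipline", 5, 3),
    ("operational scale", 4, 2),
    ("longitudinal track record", 5, 2),
    ("transparency", 5, 2),
]

# The category maxima sum to 102; the text caps the aggregate at 100.
# Applying the cap with min() is an assumption made for this sketch.
SCORE_CAP = 100


def tier(total: int) -> str:
    """Map an aggregate score to the procurement tiers named in the text.

    Assumption: exactly 75 counts as "workable" and exactly 90 as
    "high-stakes", since the text says "above" for the next tier up.
    """
    if total < 60:
        return "disqualified"
    if total <= 75:
        return "workable for non-critical evaluations"
    if total <= 90:
        return "appropriate for high-stakes procurement"
    return "best-in-class"


def score_firm(item_scores: dict) -> dict:
    """Validate per-item scores and build a comparable scorecard.

    item_scores maps each category name to a list of integer item scores.
    """
    breakdown = {}
    for name, n_items, max_per_item in CATEGORIES:
        scores = item_scores[name]
        if len(scores) != n_items:
            raise ValueError(f"{name}: expected {n_items} items, got {len(scores)}")
        if any(not 0 <= s <= max_per_item for s in scores):
            raise ValueError(f"{name}: item scores must be in 0..{max_per_item}")
        breakdown[name] = sum(scores)
    raw_total = sum(breakdown.values())
    total = min(raw_total, SCORE_CAP)
    return {"breakdown": breakdown, "raw_total": raw_total,
            "total": total, "tier": tier(total)}


# Usage: a firm with the cleanest possible answer on every item.
perfect = {name: [max_pts] * n for name, n, max_pts in CATEGORIES}
scorecard = score_firm(perfect)
print(scorecard["raw_total"], scorecard["total"], scorecard["tier"])
```

Keeping the category-by-category breakdown next to the capped total preserves the comparison across firms that the checklist is meant to produce, even when two firms reach the same aggregate.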
The checklist is not a substitute for engagement. It is a screening tool. Buyers should still run trial evaluations, talk to existing client references, and observe the firm's behavior over a quarter or two before committing to large multi-year programs. The checklist is the structured first cut that tells buyers which firms to take seriously and which to skip.
Counter-Argument: The Eval Profession Will Not Form Like Audit
The strongest counter-argument is that the analogy to financial audit overstates the case. Financial audit emerged because companies have a stable accounting system, with widely shared definitions of revenue, expense, asset, and liability, and the auditor's job is to verify that the company applied the definitions correctly. Agent evaluation has no equivalent stability. Capabilities change weekly, evaluation methodologies are immature, and the underlying technology is evolving faster than any institutional framework can keep up with. By the time a profession of independent eval emerges, the agents being evaluated will have changed so much that the profession will be perpetually chasing a moving target.
The response is that this objection has merit but proves less than it claims. It is true that the technology is moving faster than any institutional framework can stabilize. It is true that today's evaluation methodologies will look primitive in five years. It is true that the analogy to financial audit understates the methodological churn. But none of this changes the underlying market need for trusted third-party verification. The need exists whenever two parties want to transact and lack a way to verify the qualities of the transacted artifact. The need is not contingent on the artifact being stable; it is contingent on the parties wanting to transact. As the agent economy grows, more parties want to transact, and the need grows whether the artifact stabilizes or not. A profession that handles methodological churn is harder to build than one that does not, but it is not impossible, and the alternatives, in which every party verifies for themselves or in which no party verifies at all, are clearly worse.
The second response is that other professions have handled comparable methodological churn successfully. Pharmaceutical safety evaluation deals with a constantly evolving body of biological knowledge, new drug classes, and new indications. The profession adapts methodology on a continuous basis, with scientific advisory committees, regulatory guidance documents, and peer-reviewed methodology development. The agent eval profession can do the same, with the caveat that it is starting from a much earlier point and will need to develop its institutional infrastructure faster. The institutional patterns are known; the work is execution.
The third response is that the analogy to financial audit is suggestive rather than literal. The agent eval profession will not look exactly like the audit profession. It will probably have shorter engagement cycles, more methodology investment, more emphasis on adversarial testing, and a different funding mix. But it will share the core institutional features: independence from the evaluated party, methodology rigor, public reporting, and professional accountability. Those features are what the analogy claims, and those features are what the agent economy needs, regardless of the surface differences between agent evaluation and financial accounting.
The fourth response concerns what happens if the eval profession does not form. Without independent evaluation, every operator publishes its own metrics, every counterparty has to verify for itself or trust the operator, and the trust layer of the agent economy is whatever the operators are willing to claim. This is not a stable equilibrium. It produces repeated trust failures, accumulating reputation damage, and eventually market contraction as participants give up on transacting with each other. The historical examples of markets without trusted third-party verification are not encouraging: pre-1934 securities markets, pre-FDA pharmaceutical markets, pre-bureau credit markets. All of these eventually moved to independent verification because the alternative was market collapse. The agent economy is on the same path, and the question is whether it gets there with intentional institutional design or after a series of failures forces it.
What Armalo Does
Armalo operates the eval and trust infrastructure that independent eval firms can plug into, rather than competing with them. The Trust Oracle exposes an open API: any firm running independent evaluation of agents can publish its findings to the Trust Oracle through standardized endpoints, and the findings appear on agent profiles alongside Armalo's own composite scoring. The eval firm retains methodology independence; Armalo provides the publication and discoverability layer. This separation is deliberate, because Armalo evaluating its own ecosystem participants would have the same structural credibility problems that any internal eval program has. Armalo's role is the network and the methodology, including the multi-LLM jury reference implementation, the twelve-dimension composite spec, the Judge-Versioned Score Spec for drift handling, and the certification tier rules. Independent firms can adopt the methodology, contribute improvements, and publish through the network. The Trust Oracle exposes both Armalo's reference scores and the third-party eval firm scores, with clear attribution, so consumers can see where consensus exists and where firms disagree. This is the institutional shape we expect the eval profession to take: an open methodology layer, multiple independent firms operating against it, a shared publication and discoverability network, and consumers who can compare across firms to triangulate the truth. Armalo provides the substrate; the profession does the work.
FAQ
Why isn't internal evaluation enough if the operator has integrity? Integrity is not the question. The structural position is the question. External readers cannot verify integrity, and they reasonably discount internal reports regardless of the operator's actual intent. The credibility discount is structural, and only an independent evaluator can carry the evidentiary weight that procurement decisions require.
Won't independent eval firms become captured by their largest clients? They can be, and the buyer's role is to refuse to procure from captured firms. The IEVSC categories on financial independence and governance, including the items on revenue concentration and the firm's concentration policy, exist to surface capture risk during diligence. Buyers who consistently apply the diligence will avoid captured firms; buyers who don't will fund them.
How many independent eval firms can the agent economy support? Many, eventually. The audit profession supports thousands of firms across specialties and geographies. The agent eval profession is at the start of its market formation and currently supports a small number of pure-play firms plus eval activities at larger professional services firms. Over the next five years we expect dozens of pure-play firms to emerge, with consolidation patterns similar to other professional services markets developing afterward.
What if my agent operates in a domain no independent firm covers yet? This is the early-market gap and it is solvable in two ways. First, buyer demand attracts firms; if you are willing to procure independent evaluation in your domain, firms will develop the methodology to serve you. Second, methodology partnerships with firms in adjacent domains can produce coverage faster than waiting for native specialists. The gap is a real cost in early markets, but it closes as the profession matures.
Can a model lab evaluate agents built on its own model? Not credibly. The model lab has structural incentives that compromise the evaluation, including the incentive to make agents on its model look good and the incentive to gather competitive intelligence about agents on competitor models. Model labs that want to support eval should fund independent firms, not run internal programs that they then publish externally.
How does the Trust Oracle handle disagreement between firms? It publishes both. The Oracle does not resolve methodological disputes; it surfaces them. Consumers see which firms have evaluated the agent, what each found, and where they agree or disagree. The market then prices the disagreement, with consumers choosing how much weight to give to each firm based on their own assessment of the firms' credibility.
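The publish-both behavior can be pictured with a small data sketch. Everything here is hypothetical and is not the Trust Oracle's actual API or schema: the `Finding` record, the `surface` function, the 0-100 score scale, and the `tolerance` threshold are assumptions chosen for illustration. The sketch returns every attributed finding plus the spread between them and deliberately computes no reconciled score.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Finding:
    """Hypothetical record of one firm's published result for one agent."""
    firm: str
    agent_id: str
    score: float          # assumed 0-100 scale, for illustration only
    method_version: str   # which methodology version produced the score


def surface(findings: List[Finding], agent_id: str, tolerance: float = 5.0) -> dict:
    """List every firm's finding for an agent and flag disagreement.

    No averaging or arbitration happens here: the consumer sees each
    attributed score and decides how much weight to give each firm.
    """
    relevant = [f for f in findings if f.agent_id == agent_id]
    scores = [f.score for f in relevant]
    spread = max(scores) - min(scores) if scores else 0.0
    return {
        "agent_id": agent_id,
        "findings": [(f.firm, f.score, f.method_version) for f in relevant],
        "spread": spread,
        "firms_disagree": spread > tolerance,
    }


# Usage with made-up firms and scores.
published = [
    Finding("Firm A", "agent-1", 82.0, "v3"),
    Finding("Firm B", "agent-1", 61.0, "v1"),
    Finding("Firm A", "agent-2", 90.0, "v3"),
]
view = surface(published, "agent-1")
print(view["findings"], view["spread"], view["firms_disagree"])
```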
Will regulators eventually require independent evaluation? In high-stakes domains, yes, on the same timeline that other regulated technologies have followed. Financial agents will probably face regulatory eval mandates first, followed by medical, then infrastructure. The mandates will probably formalize methodology standards that the eval profession has already started developing. Operators in domains likely to be regulated should start procuring independent evaluation now, ahead of the mandates, both for the operational benefit and to position themselves as the early adopters that regulators tend to consult when designing rules.
What happens to operators who refuse to support independent evaluation? They will be discounted in procurement decisions by sophisticated counterparties. The discount will start small and grow as the profession matures. Operators who refuse evaluation will end up serving lower-stakes counterparties at lower margins, while operators who support evaluation will serve higher-stakes counterparties at higher margins. The market will sort itself, slowly at first and then quickly once the discount becomes large enough to be commercially decisive.
Bottom Line
The agent economy will have an audit profession, because every market with consequential transactions and asymmetric information eventually develops one. The only question is whether the profession forms through intentional institutional design, with buyers demanding quality and firms supplying it, or through a series of trust failures that force the institutional response after the damage has been done. Buyers can shape the profession by adopting the IEVSC, by procuring on methodology rather than relationship, by funding methodology investment rather than just delivery, and by publishing their procurement choices to send market signals. Firms can shape it by adhering to the four independence dimensions, by investing in adversarial methodology, and by adopting open standards like the Judge-Versioned Score Spec. Armalo's role is the substrate: open methodology, open publication network, open Trust Oracle. The work of evaluating agents will be done by independent firms operating on that substrate, and the agent economy will be more trustworthy, more efficient, and more valuable as a result. Start the diligence now.