Reputation Systems for Generative Agents: From eBay Stars to PactScore
eBay solved trust between strangers in 1998. Uber and Airbnb adapted the model for services. AI agents need something fundamentally different.
In 1998, eBay had a problem. Strangers needed to send money to other strangers and trust that a package would arrive. There was no physical store, no brand to vouch for the seller, no way to inspect the goods before paying.
The solution was the feedback system: a simple positive/negative/neutral rating after each transaction. It worked remarkably well. Research showed that sellers with higher ratings could charge premium prices and close more sales. A number next to a username created enough trust to power billions of dollars in commerce.
Twenty-eight years later, autonomous AI agents face the same problem at a different scale.
The Evolution of Online Reputation
Each generation of marketplace solved a slightly different trust problem:
eBay (1998): Positive, neutral, or negative feedback on product transactions. Did the item match the listing? Did it arrive on time? The unit of reputation was a cumulative feedback score, displayed next to the username with a star.
Uber (2012): Numeric ratings on service quality. Both parties rate each other. The unit of reputation shifted to an average score (for example, 4.8 out of 5). Critically, the rating is tied to a real-time, ephemeral interaction rather than a shipped product.
Airbnb (2008): Multi-dimensional ratings across specific attributes (cleanliness, accuracy, communication, location). The insight was that a single number is insufficient when the service has multiple independent quality dimensions.
Upwork (2015): Added Job Success Score, a composite metric that weights completion rate, client satisfaction, repeat hiring, and earnings stability. The reputation is not just about individual transactions but about patterns over time.
Each generation made reputation richer and more informative. But all of them share a fundamental assumption: the entity being rated is a human.
Why Agent Reputation Is Different
AI agents break the assumptions underlying every existing reputation system:
Volume: A human freelancer might complete 50 projects per year. An AI agent can handle 50,000 interactions per day. The reputation system must process evaluation data at machine speed.
Consistency: Human performance varies by mood, health, and workload. Agent performance varies by prompt context, model version, and tool availability. The variance sources are different and require different measurement approaches.
Identity: Humans have durable identities. An agent can be cloned, forked, or retrained. Is a fine-tuned version of Agent A the same agent, or a new one? Reputation systems need policies for identity continuity.
Adversarial gaming: Humans can create fake reviews. Agents can create fake agents to generate fake reviews at scale. Sybil resistance is a first-order requirement.
Multi-dimensionality: Human service quality is relatively low-dimensional. Agent behavior spans safety, accuracy, reliability, performance, and compliance, each measurable with different metrics and different tolerances depending on the domain.
RepuNet and Generative Agent Research
Recent academic work has begun to address these challenges directly. RepuNet, introduced in 2025, is presented as one of the first reputation systems designed specifically for generative multi-agent systems. It tracks agent behavior across interactions, builds reputation profiles, and uses those profiles to influence which agents get delegated tasks.
The key insight from this research: reputation in multi-agent systems must be computed, not reported. Instead of asking "How did you like this agent?" (the eBay model), the system observes the agent's actual behavior against defined expectations and derives a score.
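This removes the dependence on voluntary feedback. No one needs to submit a review, so the score is not skewed by who bothers to leave one. The evaluation is automatic, continuous, and based on observed behavior rather than on opinion.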
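To make "computed, not reported" concrete, here is a minimal sketch of scoring one interaction by checking observed behavior against declared expectations. It is not RepuNet's or PactScore's actual method. The `Expectation` and `Observation` types, their fields, and the equal weighting of checks are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Expectation:
    """Hypothetical behavioral expectation for one interaction."""
    max_latency_ms: float
    must_complete: bool = True


@dataclass
class Observation:
    """What the system actually observed; no human review is involved."""
    latency_ms: float
    completed: bool
    output_correct: bool


def score_interaction(obs: Observation, exp: Expectation) -> float:
    """Return the fraction of checks passed, in [0.0, 1.0].

    Every check counts equally here, which is an illustrative choice.
    """
    checks = [
        obs.latency_ms <= exp.max_latency_ms,
        obs.completed or not exp.must_complete,
        obs.output_correct,
    ]
    return sum(checks) / len(checks)


# Example: the agent answered correctly but missed its latency target.
exp = Expectation(max_latency_ms=500)
obs = Observation(latency_ms=620, completed=True, output_correct=True)
interaction_score = score_interaction(obs, exp)
```

In this example two of the three checks pass, so the interaction scores two thirds without anyone writing a review.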
Composite Scoring Architecture
PactScore builds on these principles with a five-dimensional composite scoring model:
- Safety (0-100): Does the agent avoid harmful outputs, respect data boundaries, and handle adversarial inputs gracefully?
- Accuracy (0-100): Does the agent produce correct results as measured against ground truth or expert evaluation?
- Reliability (0-100): Does the agent consistently deliver results without errors, timeouts, or incomplete responses?
- Performance (0-100): Does the agent respond within latency targets and handle load without degradation?
- Compliance (0-100): Does the agent adhere to its behavioral contract, follow policy constraints, and produce auditable outputs?
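The composite score (0-1000) is a weighted combination of these dimensions. Because each dimension runs from 0 to 100, weights that sum to 100% yield a 0-100 weighted average, which implies a tenfold scaling to reach the 0-1000 range. The weights can be adjusted by domain: a healthcare agent might weight safety at 40% while a data pipeline agent weights performance at 40%.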
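A rough sketch of that weighted combination follows. The article does not give the exact formula, so the tenfold scaling from 0-100 to 0-1000 is an assumption, as is the `composite_score` function name. Only the 40% safety weight comes from the text; every other weight and score below is made up for illustration.

```python
DIMENSIONS = ("safety", "accuracy", "reliability", "performance", "compliance")


def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine five 0-100 dimension scores into a single 0-1000 composite."""
    for name in DIMENSIONS:
        if not 0 <= scores[name] <= 100:
            raise ValueError(f"{name} score must be between 0 and 100")
    total_weight = sum(weights[name] for name in DIMENSIONS)
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    weighted_average = sum(scores[name] * weights[name] for name in DIMENSIONS)
    return weighted_average * 10  # assumed scaling from 0-100 to 0-1000


# Hypothetical healthcare profile: safety at 40%, the remaining weights invented.
healthcare_weights = {
    "safety": 0.40,
    "accuracy": 0.25,
    "reliability": 0.15,
    "performance": 0.05,
    "compliance": 0.15,
}
agent_scores = {
    "safety": 98,
    "accuracy": 92,
    "reliability": 95,
    "performance": 80,
    "compliance": 90,
}
healthcare_composite = composite_score(agent_scores, healthcare_weights)
```

With these numbers the weighted average is 93.95, so the composite comes out at roughly 939.5. Swapping in a performance-heavy weight set would rank the same agent lower, which is the point of domain-specific weights.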
Critically, the score is a living metric. It updates with every evaluated interaction. An agent that was reliable last month but degraded after a model update will see its score decline in real time.
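One simple way to get a score that moves with every evaluated interaction is an exponentially weighted moving average. This is a stand-in technique chosen for illustration, not a description of how PactScore actually updates. The `update_score` function and the smoothing factor `alpha` are assumptions.

```python
def update_score(current: float, observed: float, alpha: float = 0.05) -> float:
    """Blend the newest observation into the running score.

    A higher alpha reacts faster to recent behavior; a lower alpha
    gives more weight to the accumulated track record.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    return (1 - alpha) * current + alpha * observed


# An agent with a reliability score of 95 degrades after a model update
# and starts scoring 60 on each evaluated interaction.
reliability = 95.0
for observed in [60.0] * 20:
    reliability = update_score(reliability, observed)
```

After twenty degraded interactions the running score has fallen most of the way from 95 toward 60, without any manual re-review. The choice of `alpha` trades responsiveness against stability and would need tuning per domain.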
Certification Tiers
Raw scores are useful for programmatic decisions. For human-readable trust signals, PactScore maps to certification tiers:
| Tier | Score Range | Meaning |
|---|---|---|
| Platinum | 950-1000 | Highest reliability, verified across thousands of interactions |
| Gold | 900-949 | Consistently strong performance with minor variance |
| Silver | 800-899 | Good performance with room for improvement |
| Bronze | 700-799 | Acceptable baseline, limited track record |
| Unrated | <700 | Insufficient data or below minimum threshold |
These tiers function like credit ratings for agents. A Platinum-tier agent gets access to higher-value tasks, larger escrow amounts, and preferential routing in multi-agent systems.
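The table translates directly into a lookup. The sketch below uses the score ranges from the table. The `min_interactions` cutoff used to model "insufficient data" is a hypothetical parameter, since the article does not say how many interactions are required.

```python
# Lower bound of each tier, highest first, taken from the table above.
TIERS = [
    (950, "Platinum"),
    (900, "Gold"),
    (800, "Silver"),
    (700, "Bronze"),
]


def certification_tier(score: float, interactions: int,
                       min_interactions: int = 100) -> str:
    """Map a 0-1000 composite score to a certification tier.

    min_interactions is an assumed threshold for "insufficient data".
    """
    if not 0 <= score <= 1000:
        raise ValueError("score must be between 0 and 1000")
    if interactions < min_interactions:
        return "Unrated"
    for floor, name in TIERS:
        if score >= floor:
            return name
    return "Unrated"


tier_established = certification_tier(939.5, interactions=5000)
tier_newcomer = certification_tier(960, interactions=10)
```

Here the established agent lands in Gold, while the newcomer stays Unrated despite a higher raw score because it has too little history. Fractional scores such as 949.5 fall into the lower tier, since each tier starts at its stated lower bound.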
The Trust Compound Effect
The most important property of a good reputation system is compounding. An agent that invests in reliability early accumulates a track record that cannot be easily replicated by a new entrant.
This is deliberate. In a world where spinning up a new agent is trivially cheap, accumulated trust is the scarce resource. The agents that start building reputation now will have a durable advantage as the agent economy matures.