Evaluations
Submit evaluation runs to measure your agent against pact conditions. Each eval triggers automated checks and updates the agent's composite trust score.
How Evals Work
When you submit an eval, Armalo runs the following pipeline asynchronously via Inngest:
1. Your eval is created with status `processing`.
2. Armalo invokes the `agentEndpoint` in your eval config with the configured test inputs.
3. Each check type runs against the agent response: deterministic checks resolve inline, while jury checks are dispatched to the multi-model LLM jury.
4. Once all checks complete, the eval is marked `completed` or `failed`.
5. The composite trust score and dimensional breakdown for the agent are recomputed and persisted.
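The status decision in step 4 can be pictured as a small pure function. This is a sketch of one plausible semantics, not Armalo's actual implementation: here an eval is marked failed only when a check could not run at all, while a check that runs and returns `passed: false` still yields a completed eval.

```typescript
// Illustrative sketch only: the CheckOutcome shape and finalStatus logic are
// assumptions about the pipeline's semantics, not Armalo's implementation.
type CheckOutcome = { type: string; passed: boolean; error?: string };
type EvalStatus = 'processing' | 'completed' | 'failed';

function finalStatus(checks: CheckOutcome[]): EvalStatus {
  // Assumed semantics: "failed" means a check errored out; a check that ran
  // and returned passed: false still completes the eval.
  return checks.some((c) => c.error !== undefined) ? 'failed' : 'completed';
}
```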
Submit an Eval
POST to /api/v1/evals with your evals:write API key. Set autoStart: true to begin execution immediately after creation.
curl -X POST https://www.armalo.ai/api/v1/evals \
  -H "X-Pact-Key: pk_live_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "agentId": "YOUR_AGENT_ID",
    "name": "prod-smoke-test-1",
    "evalType": "custom",
    "pactId": "YOUR_PACT_ID",
    "autoStart": true,
    "config": {
      "agentEndpoint": "https://your-agent.example.com/invoke",
      "checks": [
        { "type": "latency", "threshold": 2000 },
        { "type": "safety" },
        { "type": "format" }
      ],
      "timeout": 30000
    }
  }'

Example response:

{
  "id": "eval_def456",
  "agentId": "YOUR_AGENT_ID",
  "pactId": "YOUR_PACT_ID",
  "status": "processing",
  "createdAt": "2026-03-16T10:00:00.000Z"
}

Eval Types
The evalType field controls which default check suite is used when no explicit config.checks are provided.
| Type | Description |
|---|---|
| custom | You specify checks explicitly in config. Maximum flexibility. |
| accuracy | Runs jury-based accuracy checks. Requires a pactId with accuracy conditions. |
| safety | Runs safety and prohibited content checks only. |
| performance | Runs latency and cost-efficiency checks only. |
| compliance | Runs all conditions from the linked pact. |
| pact-verification | Full pact verification pass — equivalent to calling /pacts/:id/verify. |
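One way to picture the evalType defaults is a mapping from type to a check suite. The suites below are assumptions read off the table above, not Armalo's authoritative defaults.

```typescript
// Hypothetical mapping from evalType to a default check suite when
// config.checks is omitted. The concrete suites are assumptions drawn from
// the table above, not Armalo's authoritative defaults.
type Check = { type: string; threshold?: number };

function defaultChecks(evalType: string): Check[] {
  switch (evalType) {
    case 'safety':
      return [{ type: 'safety' }, { type: 'prohibited_topics' }];
    case 'performance':
      return [{ type: 'latency', threshold: 2000 }]; // threshold value assumed
    case 'custom':
      return []; // custom expects explicit config.checks
    default:
      return []; // remaining types (accuracy, compliance, ...) omitted for brevity
  }
}
```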
Check Types
Each entry in config.checks specifies a check type and optional threshold.
| Type | Description |
|---|---|
| latency | Measures wall-clock response time in ms. Set threshold to the maximum acceptable value. |
| safety | Detects harmful, toxic, or policy-violating content in the agent response. |
| format | Validates that the output conforms to an expected format (JSON, Markdown, etc.). |
| output_length | Checks that the response length falls within min/max token bounds. |
| prohibited_topics | Rejects responses that mention topics in a configured blocklist. |
| required_topics | Fails responses that do not cover every topic in a required list. |
| red-team | Submits adversarial prompts via the adversarial agent and checks that the response upholds the pact. |
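As a hedged example, a config.checks array mixing several check types might look like the following. Only `type` and `threshold` are documented above, so the `min`, `max`, and `topics` field names here are assumptions.

```typescript
// Example config.checks array. Only `type` and `threshold` are documented;
// the min/max and topics field names below are assumptions.
const checks = [
  { type: 'latency', threshold: 2000 },                        // fail above 2000 ms
  { type: 'output_length', min: 10, max: 512 },                // assumed field names
  { type: 'prohibited_topics', topics: ['financial-advice'] }, // assumed field name
];
```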
Get Eval Results
Poll GET /api/v1/evals/:evalId until status is completed or failed. Alternatively, register a webhook for the eval.completed event to avoid polling.
curl https://www.armalo.ai/api/v1/evals/eval_def456 \
  -H "X-Pact-Key: pk_live_your_key_here"
{
"id": "eval_def456",
"agentId": "YOUR_AGENT_ID",
"pactId": "YOUR_PACT_ID",
"status": "completed",
"passed": true,
"checks": [
{ "type": "latency", "passed": true, "actual": 1240, "threshold": 2000 },
{ "type": "safety", "passed": true, "details": "No harmful content detected" },
{ "type": "format", "passed": true, "details": "Valid JSON response" }
],
"createdAt": "2026-03-16T10:00:00.000Z",
"completedAt": "2026-03-16T10:00:03.200Z"
}

SDK
The @armalo/core SDK provides a typed createEval method and a higher-level runEval helper that creates the eval and polls until completion.
import { ArmaloClient, runEval } from '@armalo/core';
const client = new ArmaloClient({ apiKey: process.env.ARMALO_API_KEY });
// Low-level: create and poll manually
const eval_ = await client.createEval({
  agentId: 'YOUR_AGENT_ID',
  name: 'prod-smoke-test-1',
  evalType: 'custom',
  pactId: 'YOUR_PACT_ID',
  autoStart: true,
  config: {
    agentEndpoint: 'https://your-agent.example.com/invoke',
    checks: [
      { type: 'latency', threshold: 2000 },
      { type: 'safety' },
    ],
    timeout: 30000,
  },
});
// High-level: create + poll in one call
const result = await runEval(client, {
  agentId: 'YOUR_AGENT_ID',
  name: 'prod-smoke-test-1',
  agentEndpoint: 'https://your-agent.example.com/invoke',
  checks: [{ type: 'latency', threshold: 2000 }, { type: 'safety' }],
});
console.log(result.passed); // true
console.log(result.checks); // [{ type: 'latency', passed: true, ... }, ...]
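For the webhook alternative mentioned above (the eval.completed event), a minimal handler might look like the sketch below. The event envelope shape (`event`/`data`) is an assumption; the `data` fields mirror the GET /api/v1/evals response.

```typescript
// Hypothetical eval.completed webhook payload. The envelope shape is an
// assumption; the data fields mirror the GET /api/v1/evals/:evalId response.
type EvalCompletedEvent = {
  event: string;
  data: { id: string; status: 'completed' | 'failed'; passed: boolean };
};

// Route an incoming event; returns a summary string so the logic stays
// testable without standing up an HTTP server.
function handleWebhook(body: EvalCompletedEvent): string {
  if (body.event !== 'eval.completed') return 'ignored';
  const { id, passed } = body.data;
  return passed ? `eval ${id} passed` : `eval ${id} failed`;
}
```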