Agent Evaluation & Benchmarking
How to measure agent performance — standard benchmarks (SWE-bench, AgentBench, WebArena), custom evaluation dimensions, trajectory scoring, and building an eval harness.
Why Evaluate Agents?
Single-prompt outputs are easy to evaluate — compare the response to an expected answer. Agents are harder. An agent may take 5 steps instead of 3, call the right tool with the wrong arguments, reach the right answer through a flawed path, or fail silently on step 4 of 7. Evaluating agents means evaluating the trajectory, not just the final output.
Without evaluation, agent improvements are guesswork. You change the system prompt and hope it helps. With evaluation, you measure task success rate, tool call accuracy, trajectory efficiency, and cost per task — and iterate with data.
Standard Benchmarks
Three benchmarks dominate agent evaluation in 2026:
Major Benchmarks
| Benchmark | Metric | Best | Human |
|---|---|---|---|
| SWE-bench | % Resolved (pass@1) | 65% (mini-SWE-agent) | 100% |
| AgentBench | Task success rate | ~50% (GPT-4) | ~90% |
| WebArena | End-to-end success | 14.4% (GPT-4) | 78.2% |
SWE-bench Variants
| Variant | Instances | Purpose |
|---|---|---|
| SWE-bench Verified | 500 | Human-filtered, unambiguous. Standard for results |
| SWE-bench Lite | 300 | Faster, cheaper evaluation loop |
| SWE-bench Full | 2,294 | Complete set, most comprehensive |
| SWE-bench Multilingual | 300 | 9 programming languages beyond Python |
What the Benchmarks Reveal
- Coding agents are ahead: SWE-bench Verified scores have risen from 12% (2024) to 65% (2025) as models and agent frameworks improved.
- Web agents are behind: WebArena at 14.4% shows that visual/navigation tasks remain hard. Agents struggle with multi-page workflows, form filling, and interpreting visual layouts.
- The human gap persists: No agent reaches human-level performance on any benchmark. The gap is widest on tasks requiring common sense and visual understanding.
Custom Evaluation Dimensions
Standard benchmarks test general capability. Custom evaluation tests your agent on your tasks.
| Dimension | What It Measures | How to Measure |
|---|---|---|
| Task success rate | % of tasks the agent completes correctly | Define pass/fail per task. Automated checker or human review. |
| Tool call accuracy | % of tool calls with correct function + valid args | Validate args against expected schema. Count type errors, missing required fields. |
| Trajectory efficiency | Steps taken vs optimal path | Compare actual steps to minimum necessary steps. Ratio > 3 indicates inefficiency. |
| Trajectory quality | Reasoning quality of the full path (not just the answer) | LLM-as-judge grading the complete trace against a rubric. |
| Cost per task | Total API cost for a completed task | (Input tokens × input price) + (output tokens × output price), summed over every model call in the task. Track by task type. |
| Reliability | Consistency across repeated runs | Run same task 10x. Low variance = reliable. High variance = prompt/model issue. |
| Safety compliance | Rate of policy violations in agent actions | Check outputs against safety rules. Log violations per 100 tasks. |
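Three of these dimensions reduce to simple arithmetic. The sketch below is one possible way to compute them; the function names and the token prices in the example are assumptions for illustration, not values taken from any provider's price list.

```python
from statistics import mean, pstdev

def cost_per_task(input_tokens, output_tokens, input_price_per_1k, output_price_per_1k):
    """Dollar cost of one task; input and output tokens are priced separately."""
    return (input_tokens / 1000) * input_price_per_1k + (output_tokens / 1000) * output_price_per_1k

def efficiency_ratio(actual_steps, optimal_steps):
    """Steps taken divided by the minimum necessary steps (1.0 is ideal)."""
    if optimal_steps <= 0:
        raise ValueError("optimal_steps must be positive")
    return actual_steps / optimal_steps

def reliability(run_scores):
    """Mean and population standard deviation of scores from repeated runs."""
    return {"mean": mean(run_scores), "stdev": pstdev(run_scores)}

# Hypothetical numbers, for illustration only.
print(cost_per_task(12_000, 1_500, input_price_per_1k=0.0025, output_price_per_1k=0.01))
print(efficiency_ratio(actual_steps=7, optimal_steps=5))
print(reliability([1, 1, 0, 1, 1, 1, 0, 1, 1, 1]))  # ten runs of one task, 1 = pass
```

A standard deviation near zero across the ten runs suggests the task is stable; a large one suggests the outcome depends on sampling noise rather than on the agent's design.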
Building an Eval Harness
A minimal eval harness in three steps:
import json
from your_agent import run_agent  # placeholder module: pass run_agent as `agent` below

def evaluate_agent(agent, test_cases, grader):
    """Run the agent on each test case, grade the output, and aggregate."""
    results = []
    for tc in test_cases:
        # Step 1: Run the agent
        try:
            output = agent(tc["input"])
        except Exception as e:
            # Record zero steps and cost so the aggregation below never hits a missing key
            results.append({"task": tc["id"], "success": False, "error": str(e),
                            "steps": 0, "cost": 0})
            continue
        # Step 2: Grade the output
        grade = grader(output, tc["expected"])
        results.append({
            "task": tc["id"],
            "success": grade["pass"],
            "score": grade.get("score", 0),
            "steps": output.get("steps", 0),
            "cost": output.get("cost", 0),
            "trajectory": grade.get("reasoning", "")  # grader's rationale, if it returns one
        })
    # Step 3: Aggregate (guard against an empty test suite)
    total = len(results)
    passed = sum(1 for r in results if r["success"])
    return {
        "total": total,
        "passed": passed,
        "success_rate": passed / total if total else 0.0,
        "avg_steps": sum(r["steps"] for r in results) / total if total else 0.0,
        "avg_cost": sum(r["cost"] for r in results) / total if total else 0.0,
        "details": results
    }
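The harness above reads `id`, `input`, and `expected` from each test case, and `steps` and `cost` from the agent's output. A minimal sketch of that data contract follows, assuming test cases are stored one JSON object per line; `load_test_cases`, `stub_agent`, and the sample tasks are hypothetical names and data used only for illustration.

```python
import json

REQUIRED_CASE_KEYS = {"id", "input", "expected"}

def load_test_cases(jsonl_text):
    """Parse one JSON object per line and check the keys the harness reads."""
    cases = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        case = json.loads(line)
        missing = REQUIRED_CASE_KEYS - set(case)
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        cases.append(case)
    return cases

def stub_agent(task_input):
    """Hypothetical stand-in agent returning the dict shape the harness expects."""
    return {"answer": task_input.upper(), "steps": 1, "cost": 0.0}

SAMPLE = '''
{"id": "t1", "input": "paris", "expected": "PARIS"}
{"id": "t2", "input": "rome", "expected": "ROME"}
'''

cases = load_test_cases(SAMPLE)
outputs = [stub_agent(c["input"]) for c in cases]
print(len(cases), outputs[0])
```

Validating the suite at load time keeps a malformed test case from being reported as an agent failure.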
Grader Functions
The grader determines pass/fail. Different tasks need different graders.
def exact_match_grader(output, expected):
    """For tasks with deterministic output (classification, extraction)."""
    # Compare case-insensitively, ignoring surrounding whitespace
    matched = output["answer"].strip().lower() == expected.strip().lower()
    return {"pass": matched, "score": 1.0 if matched else 0.0}

import openai  # requires openai>=1.0 and OPENAI_API_KEY in the environment

def llm_judge_grader(output, expected, model="gpt-4o"):
    """For open-ended tasks (writing, reasoning, analysis).
    Use a different model than the agent to avoid self-evaluation bias."""
    prompt = f"""Grade this agent output against the expected answer.
Expected: {expected}
Agent output: {output['answer']}
Rate on: accuracy (0-4), completeness (0-3), conciseness (0-3).
Return JSON: {{"accuracy": int, "completeness": int, "conciseness": int}}"""
    response = openai.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"}
    )
    grade = json.loads(response.choices[0].message.content)
    # Pass/fail is computed here from the rubric total (max 10), not by the judge
    total = sum(grade[k] for k in ["accuracy", "completeness", "conciseness"])
    grade["pass"] = total >= 7  # Threshold for passing
    grade["score"] = total / 10.0
    return grade

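def tool_accuracy_grader(output, expected_calls):
    """For tasks where tool call correctness matters more than final output.
    Calls are compared position by position; arguments not listed in the
    expected call are ignored."""
    actual = [(c["name"], c["args"]) for c in output.get("tool_calls", [])]
    expected = [(c["name"], c["args"]) for c in expected_calls]
    # zip stops at the shorter list, so missing calls count as incorrect
    correct = sum(
        1 for (a_name, a_args), (e_name, e_args) in zip(actual, expected)
        if a_name == e_name and all(a_args.get(k) == v for k, v in e_args.items())
    )
    if not expected:
        # Nothing was expected: keep pass and score consistent with each other
        return {"pass": True, "score": 1.0}
    return {"pass": correct == len(expected), "score": correct / len(expected)}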
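The tool accuracy grader above compares calls position by position, so a correct call made in a different order, or an extra call inserted early, shifts every later comparison. One possible alternative, sketched below, is an order-insensitive match that reports precision and recall. This is a swapped-in technique, not the grader above: `tool_call_prf` is a hypothetical name, and unlike the grader above it requires the arguments to match exactly.

```python
import json

def _key(call):
    """Canonical, hashable form of a tool call: name plus sorted JSON args."""
    return (call["name"], json.dumps(call["args"], sort_keys=True))

def tool_call_prf(actual_calls, expected_calls):
    """Order-insensitive precision, recall, and F1 over tool calls."""
    remaining = [_key(c) for c in expected_calls]
    matched = 0
    for call in actual_calls:
        k = _key(call)
        if k in remaining:
            remaining.remove(k)  # each expected call can be matched only once
            matched += 1
    precision = matched / len(actual_calls) if actual_calls else 0.0
    recall = matched / len(expected_calls) if expected_calls else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

expected = [
    {"name": "search", "args": {"q": "Tokyo population 2024"}},
    {"name": "search", "args": {"q": "Japan population 2024"}},
]
actual = [
    {"name": "search", "args": {"q": "Japan population 2024"}},
    {"name": "search", "args": {"q": "Japan area"}},
    {"name": "search", "args": {"q": "Tokyo population 2024"}},
]
result = tool_call_prf(actual, expected)
print(result)
```

Recall answers "did the agent make every call it needed?", while precision drops when the agent makes unnecessary calls, which the positional grader does not penalize directly.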
Trajectory Scoring
The most important evaluation dimension that benchmarks ignore: did the agent take a good path?
Task: "What's the population of Tokyo and what % of Japan is that?"
Bad trajectory (3/10):
1. Search "Tokyo population"
2. Search "Japan area" ← wrong
3. Search "Japan population"
4. Calculate (wrong numbers)
5. Search "Tokyo population 2024"
6. Calculate (now correct)
7. Return answer
Good trajectory (8/10):
1. Search "Tokyo population 2024"
2. Search "Japan population 2024"
3. Calculate 37M / 124M = 29.8%
4. Verify with second source
5. Return answer with sources
The bad trajectory reaches the right answer, but it takes 7 steps instead of 5, runs one irrelevant search ("Japan area"), and omits the date qualifier on its first pass. A grader that only checks the final answer would score both trajectories the same. A trajectory scorer penalizes the wasted steps and the corrections.
def score_trajectory(trajectory, optimal_steps):
    """Score a trajectory from 0 to 10 on efficiency, correctness, and safety."""
    score = 10.0
    # Efficiency penalty: 0.5 per step beyond optimal, capped at 3 points
    extra_steps = len(trajectory["steps"]) - optimal_steps
    if extra_steps > 0:
        score -= min(extra_steps * 0.5, 3.0)
    # Error penalty: 1 point per failed tool call or correction, capped at 4 points
    errors = sum(1 for step in trajectory["steps"] if step.get("error") or step.get("correction"))
    score -= min(errors * 1.0, 4.0)
    # Safety bonus: no harmful actions
    if not trajectory.get("safety_violations"):
        score += 1.0
    # Clamp so the bonus cannot push a clean run above 10
    return max(0.0, min(score, 10.0))
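The scorer above expects a trajectory dict with a `steps` list whose entries may carry `error` or `correction` flags, plus an optional `safety_violations` list. A minimal sketch of that shape follows, using the 7-step Tokyo trajectory as data. Which steps are flagged as errors or corrections is an assumption made for illustration, and `summarize_trajectory` is a hypothetical helper that only counts the features the scorer reads.

```python
def summarize_trajectory(trajectory, optimal_steps):
    """Count the features a trajectory scorer reads, without assigning a score."""
    steps = trajectory["steps"]
    return {
        "steps": len(steps),
        "extra_steps": max(0, len(steps) - optimal_steps),
        "errors": sum(1 for s in steps if s.get("error")),
        "corrections": sum(1 for s in steps if s.get("correction")),
        "safety_violations": len(trajectory.get("safety_violations") or []),
    }

# Hypothetical encoding of the 7-step trajectory; the flags are illustrative labels.
bad = {
    "steps": [
        {"action": 'search "Tokyo population"'},
        {"action": 'search "Japan area"', "error": True},
        {"action": 'search "Japan population"'},
        {"action": "calculate", "error": True},
        {"action": 'search "Tokyo population 2024"', "correction": True},
        {"action": "calculate"},
        {"action": "return answer"},
    ],
    "safety_violations": [],
}

summary = summarize_trajectory(bad, optimal_steps=5)
print(summary)
```

Logging these counts next to the score makes it easier to see which penalty drove a low number.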
The Eval Loop
Evaluation is not a one-time event. It's a loop:
1. Run agent on test suite → collect results
2. Identify failure patterns (which tasks fail? why?)
3. Fix the agent (system prompt, tools, model, memory)
4. Re-run test suite → compare to baseline
5. If improvement → deploy. If not → go to step 3.
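Step 4 of the loop can be sketched as a per-task diff between two runs. The example below assumes each run is a list of records with `task` and `success` keys, as in the harness's `details` list; `compare_to_baseline` and the sample data are hypothetical.

```python
def compare_to_baseline(baseline, candidate):
    """Diff per-task pass/fail between two runs, keyed by task id."""
    base = {r["task"]: r["success"] for r in baseline}
    cand = {r["task"]: r["success"] for r in candidate}
    shared = sorted(base.keys() & cand.keys())

    def rate(run):
        return sum(run.values()) / len(run) if run else 0.0

    return {
        # Tasks that passed before and fail now
        "regressions": [t for t in shared if base[t] and not cand[t]],
        # Tasks that failed before and pass now
        "fixes": [t for t in shared if not base[t] and cand[t]],
        "delta": rate(cand) - rate(base),
    }

baseline = [
    {"task": "t1", "success": True},
    {"task": "t2", "success": False},
    {"task": "t3", "success": True},
]
candidate = [
    {"task": "t1", "success": True},
    {"task": "t2", "success": True},
    {"task": "t3", "success": False},
]
report = compare_to_baseline(baseline, candidate)
print(report)
```

In this sample the overall success rate is unchanged, yet one task regressed and another was fixed, which is why a per-task diff is worth keeping alongside the aggregate number.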
Track results over time:
| Date | Success Rate | Avg Steps | Avg Cost | Top Failure Mode |
|---|---|---|---|---|
| Jun 1 | 62% | 5.2 | $0.043 | Tool args wrong (18 cases) |
| Jun 8 | 68% | 4.8 | $0.041 | Missing web search (12 cases) |
| Jun 15 | 73% | 4.2 | $0.038 | Hallucinated data (9 cases) |
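A table like this can be built by appending one row per eval run to a CSV log. The sketch below assumes the summary dict has the `success_rate`, `avg_steps`, and `avg_cost` keys returned by the harness; `append_run` is a hypothetical helper, and the example writes to an in-memory buffer where a real setup would open a file in `a+` mode.

```python
import csv
import io

FIELDS = ["date", "success_rate", "avg_steps", "avg_cost", "top_failure_mode"]

def append_run(log_file, run_date, summary, top_failure_mode):
    """Append one run to a CSV log, writing the header if the log is empty."""
    log_file.seek(0, io.SEEK_END)
    writer = csv.DictWriter(log_file, fieldnames=FIELDS)
    if log_file.tell() == 0:
        writer.writeheader()
    writer.writerow({
        "date": run_date,
        "success_rate": round(summary["success_rate"], 3),
        "avg_steps": round(summary["avg_steps"], 2),
        "avg_cost": round(summary["avg_cost"], 4),
        "top_failure_mode": top_failure_mode,
    })

log = io.StringIO()  # stand-in for an open file
append_run(log, "Jun 1", {"success_rate": 0.62, "avg_steps": 5.2, "avg_cost": 0.043},
           "Tool args wrong")
append_run(log, "Jun 8", {"success_rate": 0.68, "avg_steps": 4.8, "avg_cost": 0.041},
           "Missing web search")
print(log.getvalue())
```

Recording the top failure mode with each run keeps the reason for a change in success rate attached to the number itself.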
Note:
Start small, scale up. A 10-task eval suite run after every change catches 80% of regressions. A 100-task suite run weekly catches subtle failures. Save the 500+ task benchmark for monthly releases. The key is running evals regularly, not exhaustively.
Key Takeaway
The gap between "my agent works" and "my agent works reliably" is an eval harness. Build one before you need it. Three grader patterns cover most cases: exact match for deterministic outputs, LLM-as-judge for open-ended outputs (use a different model than the agent), and tool accuracy for agentic workflows where the path matters more than the destination. Track success rate, trajectory efficiency, and cost per task — these three numbers tell you more than any benchmark score.