Why Evaluate Agents?

Single-prompt outputs are easy to evaluate — compare the response to an expected answer. Agents are harder. An agent may take 5 steps instead of 3, call the right tool with the wrong arguments, reach the right answer through a flawed path, or fail silently on step 4 of 7. Evaluating agents means evaluating the trajectory, not just the final output.

Without evaluation, agent improvements are guesswork. You change the system prompt and hope it helps. With evaluation, you measure task success rate, tool call accuracy, trajectory efficiency, and cost per task — and iterate with data.

Standard Benchmarks

Three benchmarks dominate agent evaluation in 2026:

Major Benchmarks

SWE-bench Verified

500 real GitHub issues. Agent must locate the bug in a repository, write a patch, and verify it passes tests. The gold standard for coding agent evaluation.

Values: % Resolved (pass@1). Best: 65% (mini-SWE-agent). Human: 100%.

AgentBench

8 environments: operating system, database, knowledge graph, digital card game, lateral thinking puzzles, house-holding, web shopping, and web browsing. Tests reasoning and decision-making.

Values: Task success rate. Best: ~50% (GPT-4). Human: ~90%.

WebArena

Realistic web tasks across 4 domains: e-commerce, social forums, software development (GitLab), and content management. Agents navigate and interact with functional websites.

Values: End-to-end success. Best: 14.4% (GPT-4). Human: 78.2%.

SWE-bench Variants

Variant	Instances	Purpose
SWE-bench Verified	500	Human-filtered, unambiguous. Standard for results
SWE-bench Lite	300	Faster, cheaper evaluation loop
SWE-bench Full	2,294	Complete set, most comprehensive
SWE-bench Multilingual	300	9 programming languages beyond Python

What the Benchmarks Reveal

Coding agents are ahead: SWE-bench Verified scores have risen from 12% (2024) to 65% (2025) as models and agent frameworks improved.
Web agents are behind: WebArena at 14.4% shows that visual/navigation tasks remain hard. Agents struggle with multi-page workflows, form filling, and interpreting visual layouts.
The human gap persists: No agent reaches human-level performance on any of these benchmarks. The gap is widest on tasks requiring common sense and visual understanding.

Custom Evaluation Dimensions

Standard benchmarks test general capability. Custom evaluation tests your agent on your tasks.

Dimension	What It Measures	How to Measure
Task success rate	% of tasks the agent completes correctly	Define pass/fail per task. Automated checker or human review.
Tool call accuracy	% of tool calls with correct function + valid args	Validate args against expected schema. Count type errors, missing required fields.
Trajectory efficiency	Steps taken vs optimal path	Divide actual steps by the minimum necessary steps. A ratio above 3 indicates inefficiency.
Trajectory quality	Reasoning quality of the full path (not just the answer)	LLM-as-judge grading the complete trace against a rubric.
Cost per task	Total API cost for a completed task	(Input tokens × input price) + (output tokens × output price), summed over every model call in the task. Track by task type.
Reliability	Consistency across repeated runs	Run same task 10x. Low variance = reliable. High variance = prompt/model issue.
Safety compliance	Rate of policy violations in agent actions	Check outputs against safety rules. Log violations per 100 tasks.
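
Two of these dimensions, cost per task and reliability, reduce to simple arithmetic. The sketch below is illustrative only: the `PRICING` table, the model name, and both function names are assumptions, and the prices are placeholders rather than any provider's real rates.

```python
from statistics import mean, pstdev

# Hypothetical prices in USD per 1M tokens; substitute your provider's actual rates.
PRICING = {"example-model": {"input": 2.50, "output": 10.00}}

def cost_per_task(input_tokens, output_tokens, model="example-model"):
    """Cost = input tokens x input price + output tokens x output price."""
    price = PRICING[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000

def reliability(outcomes):
    """Summarize repeated runs of one task; outcomes is a list of booleans."""
    scores = [1.0 if ok else 0.0 for ok in outcomes]
    return {"pass_rate": mean(scores), "stdev": pstdev(scores)}

# One task that used 12,000 input tokens and 1,500 output tokens
print(cost_per_task(12_000, 1_500))  # 0.045

# The same task run 10 times, passing 8 of them
print(reliability([True] * 8 + [False] * 2))
```

Pricing input and output tokens separately matters because output tokens are usually billed at a higher rate, so a single blended price would misstate the cost of verbose agents.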

Building an Eval Harness

A minimal eval harness in three steps:

import json
from your_agent import run_agent  # placeholder module: swap in your agent's entry point

def evaluate_agent(agent, test_cases, grader):
    """Run agent on test cases and grade results."""
    results = []

    for tc in test_cases:
        # Step 1: Run the agent (the callable passed in, e.g. run_agent)
        try:
            output = agent(tc["input"])
        except Exception as e:
            results.append({"task": tc["id"], "success": False, "error": str(e)})
            continue

        # Step 2: Grade the output
        grade = grader(output, tc["expected"])
        results.append({
            "task": tc["id"],
            "success": grade["pass"],
            "score": grade.get("score", 0),
            "steps": output.get("steps", 0),
            "cost": output.get("cost", 0),
            "trajectory": grade.get("reasoning", "")  # grader's explanation, if any
        })

    # Step 3: Aggregate. Crashed runs have no steps or cost,
    # so the averages cover completed runs only.
    completed = [r for r in results if "error" not in r]
    passed = sum(1 for r in results if r["success"])
    return {
        "total": len(results),
        "passed": passed,
        "success_rate": passed / len(results) if results else 0.0,
        "avg_steps": sum(r["steps"] for r in completed) / len(completed) if completed else 0.0,
        "avg_cost": sum(r["cost"] for r in completed) / len(completed) if completed else 0.0,
        "details": results
    }

# Usage: evaluate_agent(run_agent, test_cases, exact_match_grader)
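
The harness above expects every test case to carry `id`, `input`, and `expected` keys. One way to store such cases is one JSON object per line; the sketch below assumes that layout (the source does not prescribe a file format), and `load_test_cases` and `REQUIRED_KEYS` are hypothetical names.

```python
import json

REQUIRED_KEYS = {"id", "input", "expected"}

def load_test_cases(lines):
    """Parse JSON Lines into test-case dicts, rejecting incomplete records."""
    cases = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        case = json.loads(line)
        missing = REQUIRED_KEYS - set(case)
        if missing:
            raise ValueError(f"line {lineno}: missing keys {sorted(missing)}")
        cases.append(case)
    return cases

# Illustrative records; in practice `lines` would be an open file object
sample = [
    '{"id": "t1", "input": "What is 2 + 2?", "expected": "4"}',
    '',
    '{"id": "t2", "input": "Capital of France?", "expected": "Paris"}',
]
test_cases = load_test_cases(sample)
print(len(test_cases))  # 2
```

Failing at load time on a malformed case keeps a bad record from surfacing later as a confusing `KeyError` in the middle of an eval run.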

Grader Functions

The grader determines pass/fail. Different tasks need different graders.

def exact_match_grader(output, expected):
    """For tasks with deterministic output (classification, extraction)."""
    # Compare once, ignoring case and surrounding whitespace
    matched = output["answer"].strip().lower() == expected.strip().lower()
    return {"pass": matched, "score": 1.0 if matched else 0.0}

import openai  # requires openai>=1.0; reads OPENAI_API_KEY from the environment

def llm_judge_grader(output, expected, model="gpt-4o"):
    """For open-ended tasks (writing, reasoning, analysis).
    Use a different model than the agent to avoid self-evaluation bias."""
    prompt = f"""Grade this agent output against the expected answer.
Expected: {expected}
Agent output: {output["answer"]}

Rate on: accuracy (0-4), completeness (0-3), conciseness (0-3).
Return JSON: {{"accuracy": int, "completeness": int, "conciseness": int}}"""

    response = openai.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"}
    )
    grade = json.loads(response.choices[0].message.content)
    total = sum(grade[k] for k in ["accuracy", "completeness", "conciseness"])
    # Pass/fail is decided here, not by the judge, so the threshold stays under your control
    grade["pass"] = total >= 7  # Threshold for passing (out of 10)
    grade["score"] = total / 10.0
    return grade

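def tool_accuracy_grader(output, expected_calls):
    """For tasks where tool call correctness matters more than final output."""
    actual_calls = [(call["name"], call["args"]) for call in output.get("tool_calls", [])]
    expected = [(ec["name"], ec["args"]) for ec in expected_calls]

    # Compare calls position by position; every expected argument must match
    correct = sum(1 for (a_name, a_args), (e_name, e_args) in zip(actual_calls, expected)
                  if a_name == e_name and all(a_args.get(k) == v for k, v in e_args.items()))

    # zip() stops at the shorter list, so check lengths to catch extra or missing calls
    same_length = len(actual_calls) == len(expected)
    return {
        "pass": same_length and correct == len(expected),
        "score": correct / len(expected) if expected else float(same_length)
    }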
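
Tool accuracy can also be checked without a list of expected calls by validating each call's arguments against a schema, counting missing required fields and type errors. The sketch below is one possible shape for that check; the schema format, `SCHEMAS`, and `validate_tool_call` are assumptions made for illustration, not part of any particular framework.

```python
def validate_tool_call(call, schemas):
    """Return a list of problems found in one tool call (empty list = valid)."""
    schema = schemas.get(call["name"])
    if schema is None:
        return [f"unknown tool: {call['name']}"]

    problems = []
    args = call.get("args", {})
    for field in schema["required"]:
        if field not in args:
            problems.append(f"missing required field: {field}")
    for field, value in args.items():
        expected_type = schema["types"].get(field)
        if expected_type is None:
            problems.append(f"unexpected field: {field}")
        elif not isinstance(value, expected_type):
            problems.append(f"type error: {field} should be {expected_type.__name__}")
    return problems

# Hypothetical schema for a single search tool
SCHEMAS = {
    "search": {"required": ["query"], "types": {"query": str, "max_results": int}},
}

calls = [
    {"name": "search", "args": {"query": "Tokyo population 2024"}},
    {"name": "search", "args": {"max_results": "5"}},  # no query, and a string where an int belongs
]
valid = sum(1 for c in calls if not validate_tool_call(c, SCHEMAS))
print(valid / len(calls))  # 0.5
```

Returning the list of problems rather than a bare boolean makes it easy to tally failure modes, such as how many calls failed on missing fields versus wrong types.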

Trajectory Scoring

The most important evaluation dimension that benchmarks ignore: did the agent take a good path?

Task: "What's the population of Tokyo and what % of Japan is that?"

Bad trajectory (3/10):               Good trajectory (8/10):
1. Search "Tokyo population"        1. Search "Tokyo population 2024"
2. Search "Japan area" ← wrong     2. Search "Japan population 2024"
3. Search "Japan population"        3. Calculate 37M / 124M = 29.8%
4. Calculate (wrong numbers)        4. Verify with second source
5. Search "Tokyo population 2024"   5. Return answer with sources
6. Calculate (now correct)
7. Return answer

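The bad trajectory reaches the right answer, but it takes 7 steps instead of 5, runs an irrelevant search ("Japan area"), and omits the date qualifier on its first search, which forces a repeated search and a recalculation. A grader that only checks the final answer would score both trajectories the same. A trajectory scorer penalizes the wasted steps and the errors.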

def score_trajectory(trajectory, optimal_steps):
    """Score a trajectory on efficiency, correctness, and safety."""
    score = 10.0

    # Efficiency penalty: each step beyond optimal
    extra_steps = len(trajectory["steps"]) - optimal_steps
    if extra_steps > 0:
        score -= min(extra_steps * 0.5, 3.0)

    # Error penalty: each failed tool call or correction
    errors = sum(1 for step in trajectory["steps"] if step.get("error") or step.get("correction"))
    score -= min(errors * 1.0, 4.0)

    # Safety bonus: no harmful actions
    if not trajectory.get("safety_violations"):
        score += 1.0

    # Clamp to the 0-10 scale; without the cap, a clean run would score 11
    return max(0.0, min(score, 10.0))
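
Two simpler signals can sit alongside a composite score like the one above: the ratio of actual to optimal steps, and the number of repeated tool calls. The sketch below assumes each step is a dict with hypothetical `tool` and `args` keys; adapt the key names to whatever your agent framework logs.

```python
import json

def efficiency_ratio(trajectory, optimal_steps):
    """Actual steps divided by the minimum necessary steps (1.0 is ideal)."""
    if optimal_steps <= 0:
        raise ValueError("optimal_steps must be positive")
    return len(trajectory["steps"]) / optimal_steps

def count_redundant_calls(trajectory):
    """Count steps that repeat an earlier tool call with identical arguments."""
    seen = set()
    redundant = 0
    for step in trajectory["steps"]:
        # Serialize args with sorted keys so dict ordering does not matter
        key = (step.get("tool"), json.dumps(step.get("args", {}), sort_keys=True))
        if key in seen:
            redundant += 1
        seen.add(key)
    return redundant

# Illustrative trajectory: the third step repeats the first search
trajectory = {"steps": [
    {"tool": "search", "args": {"query": "Tokyo population"}},
    {"tool": "search", "args": {"query": "Japan population"}},
    {"tool": "search", "args": {"query": "Tokyo population"}},
    {"tool": "calculate", "args": {"expression": "37 / 124"}},
]}
print(round(efficiency_ratio(trajectory, 3), 2))  # 1.33
print(count_redundant_calls(trajectory))          # 1
```

Exact-duplicate detection is deliberately conservative: near-duplicates such as "Tokyo population" and "Tokyo population 2024" are not counted, so this is a lower bound on wasted work.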

The Eval Loop

Evaluation is not a one-time event. It's a loop:

1. Run agent on test suite → collect results
2. Identify failure patterns (which tasks fail? why?)
3. Fix the agent (system prompt, tools, model, memory)
4. Re-run test suite → compare to baseline
5. If improvement → deploy. If not → go to step 3.
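
Step 4 of the loop, comparing a re-run to the baseline, can be sketched as a per-task diff. This assumes both runs are lists of records with `task` and `success` keys, like the `details` list the harness returns; `compare_to_baseline` is a hypothetical helper name.

```python
def compare_to_baseline(baseline, candidate):
    """Compare two runs task by task; each run is a list of {"task", "success"} records."""
    base = {r["task"]: r["success"] for r in baseline}
    cand = {r["task"]: r["success"] for r in candidate}
    shared = sorted(base.keys() & cand.keys())
    if not shared:
        raise ValueError("no tasks in common")

    base_rate = sum(base[t] for t in shared) / len(shared)
    cand_rate = sum(cand[t] for t in shared) / len(shared)
    return {
        "baseline_rate": base_rate,
        "candidate_rate": cand_rate,
        # Tasks that passed before and fail now
        "regressions": [t for t in shared if base[t] and not cand[t]],
        # Tasks that failed before and pass now
        "fixes": [t for t in shared if not base[t] and cand[t]],
        "improved": cand_rate > base_rate,
    }

baseline = [
    {"task": "t1", "success": True},
    {"task": "t2", "success": False},
    {"task": "t3", "success": True},
]
candidate = [
    {"task": "t1", "success": True},
    {"task": "t2", "success": True},
    {"task": "t3", "success": False},
]
report = compare_to_baseline(baseline, candidate)
print(report["regressions"], report["fixes"], report["improved"])  # ['t3'] ['t2'] False
```

In this example the success rate is unchanged, yet one task regressed and another was fixed, which is why the per-task lists are worth reporting alongside the headline rate.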

Track results over time:

Date	Success Rate	Avg Steps	Avg Cost	Top Failure Mode
Jun 1	62%	5.2	$0.043	Tool args wrong (18 cases)
Jun 8	68%	4.8	$0.041	Missing web search (12 cases)
Jun 15	73%	4.2	$0.038	Hallucinated data (9 cases)
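
A log like the table above can be kept as a CSV file that gains one row per eval run. The sketch below writes to an in-memory stream so it runs anywhere; the column names, the `append_run` helper, and the year in the example date are all assumptions for illustration.

```python
import csv
import io
from datetime import date

FIELDS = ["date", "success_rate", "avg_steps", "avg_cost", "top_failure_mode"]

def append_run(stream, summary, top_failure_mode, run_date=None, write_header=False):
    """Append one eval run to a CSV stream, formatted like the tracking table."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    if write_header:
        writer.writeheader()
    writer.writerow({
        "date": (run_date or date.today()).isoformat(),
        "success_rate": f"{summary['success_rate']:.0%}",
        "avg_steps": f"{summary['avg_steps']:.1f}",
        "avg_cost": f"${summary['avg_cost']:.3f}",
        "top_failure_mode": top_failure_mode,
    })

# In practice `log` would be a file opened with open(path, "a", newline="")
log = io.StringIO()
append_run(
    log,
    {"success_rate": 0.62, "avg_steps": 5.2, "avg_cost": 0.043},
    "Tool args wrong (18 cases)",
    run_date=date(2025, 6, 1),  # arbitrary year, chosen only for the example
    write_header=True,
)
print(log.getvalue())
```

The values are formatted for readability here; if you plan to chart the history, storing raw numbers and formatting them at display time is the safer choice.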

Note:

Start small, scale up. A 10-task eval suite run after every change catches 80% of regressions. A 100-task suite run weekly catches subtle failures. Save the 500+ task benchmark for monthly releases. The key is running evals regularly, not exhaustively.

Key Takeaway

The gap between "my agent works" and "my agent works reliably" is an eval harness. Build one before you need it. Three grader patterns cover most cases: exact match for deterministic outputs, LLM-as-judge for open-ended outputs (use a different model than the agent), and tool accuracy for agentic workflows where the path matters more than the destination. Track success rate, trajectory efficiency, and cost per task — these three numbers tell you more than any benchmark score.
