Agent Evaluation & Benchmarking

How to measure agent performance — standard benchmarks (SWE-bench, AgentBench, WebArena), custom evaluation dimensions, trajectory scoring, and building an eval harness.

June 12, 2026
agent-evaluation, benchmarks, swe-bench, agentbench, webarena, trajectory-scoring, llm-as-judge

Why Evaluate Agents?

Single-prompt outputs are easy to evaluate — compare the response to an expected answer. Agents are harder. An agent may take 5 steps instead of 3, call the right tool with the wrong arguments, reach the right answer through a flawed path, or fail silently on step 4 of 7. Evaluating agents means evaluating the trajectory, not just the final output.

Without evaluation, agent improvements are guesswork. You change the system prompt and hope it helps. With evaluation, you measure task success rate, tool call accuracy, trajectory efficiency, and cost per task — and iterate with data.

Standard Benchmarks

Three benchmarks dominate agent evaluation in 2026:

Major Benchmarks

SWE-bench Verified
500 real GitHub issues. Agent must locate the bug in a repository, write a patch, and verify it passes tests. The gold standard for coding agent evaluation.

Values: % Resolved (pass@1). Best: 65% (mini-SWE-agent). Human: 100%.

AgentBench
8 environments: operating system, database, web shopping, code, card games, knowledge graph, lateral thinking puzzles, and house-holding tasks. Tests reasoning + decision-making.

Values: Task success rate. Best: ~50% (GPT-4). Human: ~90%.

WebArena
Realistic web tasks across 4 domains: e-commerce, social forums, software development (GitLab), and content management. Agents navigate and interact with functional websites.

Values: End-to-end success. Best: 14.4% (GPT-4). Human: 78.2%.

SWE-bench Variants

Variant                  Instances  Purpose
SWE-bench Verified       500        Human-filtered, unambiguous. Standard for results
SWE-bench Lite           300        Faster, cheaper evaluation loop
SWE-bench Full           2,294      Complete set, most comprehensive
SWE-bench Multilingual   300        9 programming languages beyond Python

What the Benchmarks Reveal

  • Coding agents are ahead: SWE-bench Verified scores have risen from 12% (2024) to 65% (2025) as models and agent frameworks improved.
  • Web agents are behind: WebArena at 14.4% shows that visual/navigation tasks remain hard. Agents struggle with multi-page workflows, form filling, and interpreting visual layouts.
  • The human gap persists: No agent reaches human-level performance on any benchmark. The gap is widest on tasks requiring common sense and visual understanding.
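
To make the human gap concrete, the short sketch below subtracts each benchmark's best agent score from its human baseline, using only the figures quoted above. The dictionary layout and the `human_gap` name are illustrative, and the AgentBench values are the approximate ones given in this article.

```python
# Best agent score vs. human baseline, in percent, as quoted in this article.
# AgentBench figures are approximate ("~50%" and "~90%").
BENCHMARKS = {
    "SWE-bench Verified": {"best": 65.0, "human": 100.0},
    "AgentBench": {"best": 50.0, "human": 90.0},
    "WebArena": {"best": 14.4, "human": 78.2},
}

def human_gap(scores):
    """Return the gap to the human baseline in percentage points."""
    return round(scores["human"] - scores["best"], 1)

gaps = {name: human_gap(s) for name, s in BENCHMARKS.items()}

# Print benchmarks from widest gap to narrowest.
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {gap} points behind the human baseline")
```

On these numbers the web benchmark has the widest gap, which matches the observation that visual and navigation tasks remain the hardest.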

Custom Evaluation Dimensions

Standard benchmarks test general capability. Custom evaluation tests your agent on your tasks.

Dimension              What It Measures                                          How to Measure
Task success rate      % of tasks the agent completes correctly                  Define pass/fail per task. Automated checker or human review.
Tool call accuracy     % of tool calls with correct function + valid args        Validate args against expected schema. Count type errors, missing required fields.
Trajectory efficiency  Steps taken vs optimal path                               Compare actual steps to minimum necessary steps. Ratio > 3 indicates inefficiency.
Trajectory quality     Reasoning quality of the full path (not just the answer)  LLM-as-judge grading the complete trace against a rubric.
Cost per task          Total API cost for a completed task                       Sum input + output tokens × model pricing. Track by task type.
Reliability            Consistency across repeated runs                          Run same task 10x. Low variance = reliable. High variance = prompt/model issue.
Safety compliance      Rate of policy violations in agent actions                Check outputs against safety rules. Log violations per 100 tasks.
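
Three of these dimensions (trajectory efficiency, cost per task, and reliability) reduce to simple arithmetic. The sketch below shows one possible way to compute them. The function names, the per-1,000-token pricing convention, and every sample number are assumptions for illustration, not real model prices or measurements.

```python
from statistics import mean, pstdev

def trajectory_efficiency(actual_steps, optimal_steps):
    """Ratio of steps taken to the minimum necessary; 1.0 is optimal."""
    if optimal_steps <= 0:
        raise ValueError("optimal_steps must be positive")
    return actual_steps / optimal_steps

def cost_per_task(input_tokens, output_tokens, price_in_per_1k, price_out_per_1k):
    """API cost in dollars for one task, given per-1,000-token prices."""
    return (input_tokens / 1000) * price_in_per_1k + (output_tokens / 1000) * price_out_per_1k

def reliability(run_scores):
    """Mean and population standard deviation of scores from repeated runs."""
    return {"mean": mean(run_scores), "stdev": pstdev(run_scores)}

# Hypothetical numbers for illustration only.
ratio = trajectory_efficiency(actual_steps=7, optimal_steps=5)
cost = cost_per_task(12_000, 1_500, price_in_per_1k=0.0025, price_out_per_1k=0.01)
stats = reliability([1, 1, 0, 1, 1, 1, 0, 1, 1, 1])  # 10 runs, 1 = pass

print(f"efficiency ratio: {ratio:.2f}")
print(f"cost per task: ${cost:.3f}")
print(f"pass rate: {stats['mean']:.0%}, stdev: {stats['stdev']:.2f}")
```

Input and output tokens are priced separately here because most providers charge different rates for each; if yours does not, pass the same price twice.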

Building an Eval Harness

A minimal eval harness in three steps:

import json
from your_agent import run_agent

def evaluate_agent(agent, test_cases, grader):
    """Run agent on test cases and grade results.

    `agent` is a callable such as run_agent: it takes a task input and
    returns a dict with "answer" and, optionally, "steps" and "cost".
    """
    results = []

    for tc in test_cases:
        # Step 1: Run the agent; a crash counts as a failed task
        try:
            output = agent(tc["input"])
        except Exception as e:
            # Include steps/cost so the aggregation below never hits a missing key
            results.append({"task": tc["id"], "success": False, "score": 0,
                            "steps": 0, "cost": 0, "error": str(e)})
            continue

        # Step 2: Grade the output
        grade = grader(output, tc["expected"])
        results.append({
            "task": tc["id"],
            "success": grade["pass"],
            "score": grade.get("score", 0),
            "steps": output.get("steps", 0),
            "cost": output.get("cost", 0),
            "trajectory": grade.get("reasoning", "")  # grader's explanation, if it returns one
        })

    # Step 3: Aggregate (guard against an empty test suite)
    total = len(results)
    passed = sum(1 for r in results if r["success"])
    return {
        "total": total,
        "passed": passed,
        "success_rate": passed / total if total else 0.0,
        "avg_steps": sum(r["steps"] for r in results) / total if total else 0.0,
        "avg_cost": sum(r["cost"] for r in results) / total if total else 0.0,
        "details": results
    }
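
The harness expects each test case to be a dict with "id", "input", and "expected", and each agent output to be a dict it can read "steps" and "cost" from. The sketch below shows one way to store a suite as JSONL and load it with basic validation. The names `load_test_cases`, `stub_agent`, and `suite.jsonl` are hypothetical, and the stub only stands in for a real agent so the data shapes are visible.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = {"id", "input", "expected"}

def load_test_cases(path):
    """Load test cases from a JSONL file, one JSON object per line."""
    cases = []
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    for line_no, line in enumerate(lines, 1):
        if not line.strip():
            continue  # skip blank lines
        case = json.loads(line)
        missing = REQUIRED_KEYS - set(case)
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        cases.append(case)
    return cases

def stub_agent(task_input):
    """Stand-in agent returning the output shape the harness reads."""
    return {"answer": task_input.upper(), "steps": 1, "cost": 0.0}

# Write two hypothetical cases to a temporary file and load them back.
with tempfile.TemporaryDirectory() as tmp:
    suite = Path(tmp) / "suite.jsonl"
    suite.write_text(
        '{"id": "t1", "input": "ok", "expected": "OK"}\n'
        '{"id": "t2", "input": "no", "expected": "yes"}\n',
        encoding="utf-8",
    )
    test_cases = load_test_cases(suite)

# Run the stub on every case to show the per-task output dicts.
outputs = {tc["id"]: stub_agent(tc["input"]) for tc in test_cases}
print(outputs)
```

Keeping the suite in a plain file makes it easy to version alongside the agent code, so a change in results can be traced to a change in either.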

Grader Functions

The grader determines pass/fail. Different tasks need different graders.

import json
import openai  # assumes the openai package (v1 or later) and OPENAI_API_KEY in the environment

def exact_match_grader(output, expected):
    """For tasks with deterministic output (classification, extraction)."""
    matched = output["answer"].strip().lower() == expected.strip().lower()
    return {"pass": matched, "score": 1.0 if matched else 0.0}

def llm_judge_grader(output, expected, model="gpt-4o"):
    """For open-ended tasks (writing, reasoning, analysis).
    Use a different model than the agent to avoid self-evaluation bias."""
    prompt = f"""Grade this agent output against the expected answer.
Expected: {expected}
Agent output: {output["answer"]}

Rate on: accuracy (0-4), completeness (0-3), conciseness (0-3).
Return JSON: {{"accuracy": int, "completeness": int, "conciseness": int, "pass": bool}}"""

    response = openai.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"}
    )
    grade = json.loads(response.choices[0].message.content)
    total = sum(grade[k] for k in ["accuracy", "completeness", "conciseness"])
    grade["pass"] = total >= 7  # Fixed threshold; overrides the judge's own "pass"
    grade["score"] = total / 10.0
    return grade

def tool_accuracy_grader(output, expected_calls):
    """For tasks where tool call correctness matters more than final output.
    Calls are compared in order. Extra calls beyond the expected ones are
    ignored here; trajectory scoring is the place to penalize them."""
    actual = [(tc["name"], tc["args"]) for tc in output.get("tool_calls", [])]
    expected = [(ec["name"], ec["args"]) for ec in expected_calls]

    # A call is correct if the name matches and every expected arg has the expected value
    correct = sum(
        1 for (a_name, a_args), (e_name, e_args) in zip(actual, expected)
        if a_name == e_name and all(a_args.get(k) == v for k, v in e_args.items())
    )
    return {
        "pass": correct == len(expected),
        # With no expected calls there is nothing to get wrong, so score 1.0
        "score": correct / len(expected) if expected else 1.0
    }
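
The tool call accuracy dimension also calls for validating arguments against a schema, counting type errors and missing required fields. A minimal sketch of that check follows, assuming a hypothetical schema format (argument name mapped to an expected type and a required flag) and a hypothetical `search` tool. A production setup would more likely validate against the JSON Schema the tool is registered with.

```python
def validate_tool_call(call, schema):
    """Return a list of problems found in one tool call's arguments.

    `schema` maps argument name -> (expected_type, required_flag).
    """
    problems = []
    args = call.get("args", {})
    for name, (expected_type, required) in schema.items():
        if name not in args:
            if required:
                problems.append(f"missing required field: {name}")
            continue
        if not isinstance(args[name], expected_type):
            problems.append(f"type error: {name} should be {expected_type.__name__}")
    # Flag arguments the schema does not know about.
    for name in args:
        if name not in schema:
            problems.append(f"unexpected field: {name}")
    return problems

# Hypothetical schema for a hypothetical `search` tool.
SEARCH_SCHEMA = {"query": (str, True), "max_results": (int, False)}

good_call = {"name": "search", "args": {"query": "Tokyo population 2024"}}
bad_call = {"name": "search", "args": {"max_results": "5", "lang": "en"}}

good_problems = validate_tool_call(good_call, SEARCH_SCHEMA)
bad_problems = validate_tool_call(bad_call, SEARCH_SCHEMA)
print(good_problems)
print(bad_problems)
```

Returning a list of problems rather than a single boolean lets the harness count error types across a run, which is what surfaces patterns such as "tool args wrong" as a top failure mode.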

Trajectory Scoring

The most important evaluation dimension that benchmarks ignore: did the agent take a good path?

Task: "What's the population of Tokyo and what % of Japan is that?"

Bad trajectory (3/10):               Good trajectory (8/10):
1. Search "Tokyo population"        1. Search "Tokyo population 2024"
2. Search "Japan area" ← wrong     2. Search "Japan population 2024"
3. Search "Japan population"        3. Calculate 37M / 124M = 29.8%
4. Calculate (wrong numbers)        4. Verify with second source
5. Search "Tokyo population 2024"   5. Return answer with sources
6. Calculate (now correct)
7. Return answer

The bad trajectory reaches the right answer but takes 7 steps instead of 5, runs one irrelevant search, and omits the date qualifier until step 5. A grader that only checks the final answer would score both trajectories the same. A trajectory scorer penalizes the wasted steps and the corrections.

def score_trajectory(trajectory, optimal_steps):
    """Score a trajectory from 0 to 10 on efficiency, correctness, and safety."""
    score = 10.0

    # Efficiency penalty: 0.5 per step beyond optimal, capped at 3 points
    extra_steps = len(trajectory["steps"]) - optimal_steps
    if extra_steps > 0:
        score -= min(extra_steps * 0.5, 3.0)

    # Error penalty: 1 point per failed tool call or correction, capped at 4 points
    errors = sum(1 for step in trajectory["steps"] if step.get("error") or step.get("correction"))
    score -= min(errors * 1.0, 4.0)

    # Safety bonus: no harmful actions (offsets up to 1 point of penalties)
    if not trajectory.get("safety_violations"):
        score += 1.0

    # Clamp so the bonus cannot push the score above the 10-point scale
    return max(0.0, min(score, 10.0))

The Eval Loop

Evaluation is not a one-time event. It's a loop:

1. Run agent on test suite → collect results
2. Identify failure patterns (which tasks fail? why?)
3. Fix the agent (system prompt, tools, model, memory)
4. Re-run test suite → compare to baseline
5. If improvement → deploy. If not → go to step 3.
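
Steps 4 and 5 of the loop can be sketched as a comparison of per-task results between a baseline run and a candidate run. The function name `compare_to_baseline`, the `min_gain` parameter, and the rule of blocking deployment on any regression are assumptions made for this example; a team may reasonably choose a looser rule.

```python
def compare_to_baseline(baseline, candidate, min_gain=0.0):
    """Compare two runs' per-task results and summarize the change.

    Each argument maps task id -> bool (True if the task passed).
    Only tasks present in both runs are compared.
    """
    shared = baseline.keys() & candidate.keys()
    if not shared:
        raise ValueError("runs share no task ids")
    regressions = sorted(t for t in shared if baseline[t] and not candidate[t])
    fixes = sorted(t for t in shared if not baseline[t] and candidate[t])
    base_rate = sum(baseline[t] for t in shared) / len(shared)
    cand_rate = sum(candidate[t] for t in shared) / len(shared)
    return {
        "baseline_rate": base_rate,
        "candidate_rate": cand_rate,
        "regressions": regressions,
        "fixes": fixes,
        # Deploy only if the rate improved and nothing that passed now fails.
        "deploy": cand_rate - base_rate > min_gain and not regressions,
    }

# Hypothetical results for a four-task suite.
baseline = {"t1": True, "t2": False, "t3": True, "t4": False}
candidate = {"t1": True, "t2": True, "t3": True, "t4": False}
report = compare_to_baseline(baseline, candidate)
print(report)
```

Listing regressions and fixes by task id, rather than reporting only the two rates, shows when a change fixed some tasks while breaking others even though the overall rate barely moved.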

Track results over time:

Date     Success Rate  Avg Steps  Avg Cost  Top Failure Mode
Jun 1    62%           5.2        $0.043    Tool args wrong (18 cases)
Jun 8    68%           4.8        $0.041    Missing web search (12 cases)
Jun 15   73%           4.2        $0.038    Hallucinated data (9 cases)
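
One row of such a table can be produced from per-task results roughly as follows. This is a sketch under assumptions: the "failure_mode" field is a hypothetical label attached to each failed task during triage, and the sample results are invented for illustration.

```python
import csv
import io
from collections import Counter

def summarize_run(date, results):
    """Collapse per-task results into one row of the tracking table."""
    n = len(results)
    failures = Counter(r["failure_mode"] for r in results if not r["success"])
    top = failures.most_common(1)
    return {
        "date": date,
        "success_rate": sum(r["success"] for r in results) / n if n else 0.0,
        "avg_steps": sum(r["steps"] for r in results) / n if n else 0.0,
        "avg_cost": sum(r["cost"] for r in results) / n if n else 0.0,
        "top_failure_mode": f"{top[0][0]} ({top[0][1]} cases)" if top else "",
    }

# Hypothetical per-task results; "failure_mode" is a label added during triage.
results = [
    {"success": True, "steps": 4, "cost": 0.03, "failure_mode": None},
    {"success": False, "steps": 6, "cost": 0.05, "failure_mode": "Tool args wrong"},
    {"success": False, "steps": 5, "cost": 0.04, "failure_mode": "Tool args wrong"},
    {"success": False, "steps": 5, "cost": 0.04, "failure_mode": "Hallucinated data"},
]
row = summarize_run("Jun 1", results)

# Write the row as CSV (an in-memory buffer here; a file in practice).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```

Appending one such row per run gives a history that can be plotted or diffed, and the top failure mode column points at what to fix next.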

Note:

Start small, scale up. A 10-task eval suite run after every change catches 80% of regressions. A 100-task suite run weekly catches subtle failures. Save the 500+ task benchmark for monthly releases. The key is running evals regularly, not exhaustively.

Key Takeaway

The gap between "my agent works" and "my agent works reliably" is an eval harness. Build one before you need it. Three grader patterns cover most cases: exact match for deterministic outputs, LLM-as-judge for open-ended outputs (use a different model than the agent), and tool accuracy for agentic workflows where the path matters more than the destination. Track success rate, trajectory efficiency, and cost per task — these three numbers tell you more than any benchmark score.