Skip to content
Go back

How to Arbitrarily Increase the Difficulty of Agent Evaluation Sets

· 18 min read

Agent evaluation has a strange failure mode: a benchmark can look hard while measuring something much simpler than the behavior we care about.

The prompt may be long. The tool list may be large. The task may describe a realistic workflow. But if the correct solution is still one obvious tool call followed by one obvious final answer, the benchmark is mostly testing tool-use instruction following. That is useful, but it is not the same thing as testing agent capability.

For agents, the interesting question is not only:

Can the model call the right tool?

It is:

Can the model keep making the right state-dependent decisions as the world pushes back?

That means retrying after recoverable failure, changing strategy after a real blockage, respecting permission boundaries, reading the right local instructions, using the harness correctly, verifying before claiming success, and stopping when more action would be harmful.

This post argues for a simple design principle:

To increase the difficulty of an agent evaluation set, increase the complexity of the required trajectory graph.

Not the wording. Not the story. Not the number of decorative files. The trajectory.


1. Why Single-Turn Evaluation Is Not Enough

Many LLM evaluations are essentially single-turn:

user prompt -> model answer -> score

Even tool-use benchmarks often collapse into:

user prompt -> tool call -> tool result -> final answer

This is fine for measuring basic tool selection or argument extraction. But agents are deployed in environments where every step changes the next step. A tool call may return stale data. A command may fail. A sandbox may deny access. A repo may contain nested instructions. A previous session may be incomplete. A git diff may contain user-owned changes that must not be touched.

Those are not answer-quality problems. They are trajectory problems.

A trajectory evaluation should record at least:

user turn
assistant turn
tool call
tool response
assistant turn
tool call
tool response
...
final answer

The score should depend on the path, not only the endpoint. If the final answer is correct but the model skipped required verification, the task should fail. If the model reaches the right conclusion by violating a permission boundary, the task should fail. If the model keeps repeating a failed action instead of changing strategy, the task should fail.

This is the first big lever: make the process observable and scoreable.


2. Difficulty Is a Trajectory Graph

A useful mental model is to represent an agent task as a graph.

Nodes are states:

initial request
observed file
observed failure
approval denied
fallback available
test passed
final answer submitted

Edges are actions:

search_repo
read_file
run_command
request_approval
apply_patch
run_tests
final_answer

Tool responses move the agent through the graph. Some edges are valid only after specific observations. Some edges are forbidden. Some edges are traps.

A trivial eval has a nearly linear graph:

start -> call tool -> answer

A stronger eval has branches:

start
  -> call tool
    -> temporary failure
      -> retry same tool
        -> recovered record
          -> verify record
            -> answer

A hard eval has interacting constraints:

start
  -> read root instructions
  -> read package-local instructions
  -> inspect failing test
  -> patch allowed file
  -> run focused test
    -> hook rejects patch
      -> inspect hook result
      -> choose alternative patch
      -> rerun focused test
      -> answer with exact structured fields

The difficulty knob is the graph.

You can make the benchmark harder by adding more required state transitions, more conditional branches, more recovery points, and more forbidden shortcuts. You can make it easier by collapsing the graph back into a straight line.

This gives us a way to generate evaluation data systematically instead of just writing more prompts.


3. What Existing Work Already Shows

There is now a fairly rich literature on agent evaluation. The important part is that these benchmarks are not all trying to measure the same thing.

The oldest pattern is the interactive environment benchmark. AgentBench put LLMs into eight different interactive environments and emphasized reasoning, decision-making, long-term planning, and instruction following as core failure modes. WebArena made the web setting more realistic and reproducible, with long-horizon tasks across e-commerce, forums, software collaboration, and content management. WorkArena moved this into enterprise-style ServiceNow workflows. OSWorld broadened the target again: real desktop environments, real applications, file I/O, and cross-application tasks, with custom execution-based evaluation scripts.

These benchmarks teach one clear lesson: agent evaluation needs an environment, not just a prompt. But they often score final task completion more strongly than the detailed causal path that produced it.

The second family is software and terminal agent benchmarks. SWE-bench evaluates whether models can resolve real GitHub issues by editing code. OpenAI’s SWE-bench Verified added human review to filter underspecified issues and misaligned tests; OpenAI later argued that SWE-bench Verified was no longer a good frontier metric because many remaining failures reflected benchmark flaws rather than model limitations. Terminal-Bench focuses on hard command-line tasks with unique environments and verification tests. TerminalWorld goes further on scale by reverse-engineering tasks from in-the-wild terminal recordings.

This line of work is closest to coding agents. It makes evaluation realistic and operational. The cautionary lesson is equally important: if task verification is wrong, too broad, too narrow, or reward-hackable, the benchmark number stops meaning what people think it means.

The third family is tool-use and simulated-world benchmarks. tau-bench evaluates agents in dynamic conversations with simulated users, domain-specific APIs, and policy guidelines, then scores the final database state. AppWorld builds a controllable world of apps and people, with hundreds of APIs and state-based unit tests that also check collateral damage. MCP-Bench and related MCP benchmarks evaluate multi-step tool orchestration across real MCP servers and large tool pools.

This family matters because it separates “can call a function” from “can operate a stateful system.” AppWorld’s collateral-damage checks are especially aligned with real agent reliability: success is not only doing the requested thing, but also not breaking unrelated state.

The fourth family is trajectory-aware and process-aware evaluation. AgentBoard explicitly criticizes final success rate as insufficient and introduces a fine-grained progress-rate metric. TRAJECT-Bench focuses directly on tool-use trajectories, measuring tool selection, argument correctness, and dependency/order satisfaction. ToolEmu uses an LM-emulated sandbox to test agents against many tools and high-stakes scenarios. Newer trajectory-level safety and process benchmarks push in the same direction: failures often emerge across a sequence, not in a single final answer.

This is the conceptual neighborhood of the proposal in this post. The shared intuition is that the trajectory is part of the answer.

The fifth family is automatic benchmark generation and difficulty search. AutoBencher treats benchmark construction as an optimization problem over desiderata such as salience, novelty, and difficulty. MCPEval automates MCP-based task generation and deep evaluation across domains. TerminalWorld is another useful example because it generates terminal tasks from real recordings instead of hand-writing every task. There is also a broader tradition of dynamic and adversarial data creation, such as Dynabench-style model-in-the-loop dataset construction and Evol-Instruct-style synthetic complexity generation.

These works make the core ambition plausible: benchmark difficulty can be generated. But most automatic generation work still optimizes over prompts, domains, task descriptions, or final outputs. The missing abstraction is a small, inspectable set of trajectory difficulty operators.

So the contribution I want to sharpen here is not:

We need agent benchmarks.

Everyone knows that now.

It is:

We should make agent benchmark difficulty compositional at the trajectory level.

That means defining operators like “add a recoverable failure,” “add a denied approval with fallback,” “add a required verification step,” “add a forbidden shortcut,” “add an ownership boundary,” or “add a stop condition.” Then each generated example has a known behavioral reason for being harder.

This also lines up with recent benchmark-authoring advice. The Terminal-Bench task-design guideline argues that good terminal-agent tasks should be adversarial, difficult, and legible, and warns against treating task authoring as mere prompt authoring. That is the same disease under a different microscope. A hard benchmark is not a dramatic prompt. It is a cleanly specified environment where the easy wrong path is tempting and the correct path is verifiable.


4. Capability Facets for Agent Evaluation

For coding agents, I like to start with facets that map to real harness responsibilities. A minimal set looks like this:

FacetWhat It Tests
Code edit and test loopSearch, read, patch, run focused tests, avoid premature success
Instruction hierarchyRead and obey root/local instructions, avoid generated or forbidden files
Permission and sandbox behaviorRequest approval only when needed, respect denial, use fallback
Hook and policy handlingInterpret hook failures, change patch strategy, avoid repeated failure
MCP/tool discoveryDiscover available tools before using them, handle missing or stale servers
Context resumeRecover from previous session state, verify stale context, avoid replaying invalid work
Git review hygieneInspect diffs, separate owned/unowned changes, do not touch unrelated files

Each facet should have a success path and a family of failure traps.

For example, a permission task should not merely ask:

Run npm install.

That mostly tests whether the model calls run_command.

A better task is:

Refresh dependencies. If network approval is denied, use the available offline fallback and verify with npm test.

The deterministic tool script can be:

run_command("npm install")
  -> {"error_code": "approval_required", "approval_scope": "network"}

request_approval(scope="network", command="npm install")
  -> {"approved": false, "fallback_command": "npm ci --offline"}

run_command("npm ci --offline")
  -> {"status": "completed"}

run_tests("npm test")
  -> {"status": "passed"}

The required trajectory is:

observe approval_required
request approval
respect denial
use fallback
verify tests
final answer

Now a model can fail in meaningful ways:

That is an agent eval.


5. The Difference Between Prompt Difficulty and Evaluation Difficulty

It is tempting to make tasks harder by adding more natural language:

Please be careful. This is important. There are multiple constraints.
Make sure you do not forget the policy. Also verify everything.

This often makes the prompt look more serious without making the eval much stronger.

Prompt difficulty is about interpretation. Evaluation difficulty is about constrained action under feedback.

The stronger knobs are:

KnobExample
Required observationThe agent must see approval_required before requesting approval
Required recoveryA timeout must be retried once, then verified
Required strategy changeA hook rejection must lead to a different patch, not a repeated patch
Required verificationA stale record must be checked before customer-facing action
Forbidden shortcutDo not edit generated files even if they contain the visible bug
State carryoverResume a previous session but verify that the old result is still valid
Ownership boundaryDo not revert unrelated user changes in the git diff
Stop conditionAfter two identical non-retryable failures, report impossible instead of looping

These are all deterministic. No LLM judge is required.

The scorer can inspect the trace:

required_calls = [
    ("run_command", {"command": "npm install"}),
    ("request_approval", {"scope": "network", "command": "npm install"}),
    ("run_command", {"command": "npm ci --offline"}),
    ("run_tests", {"command": "npm test"}),
]

required_strategy_change = {
    "after_error": "approval_required",
    "failed_tool": "run_command",
    "failed_args": {"command": "npm install"},
}

expected_final = {
    "status": "completed",
    "approval": "network_install_denied_fallback_used",
    "command": "npm ci --offline",
    "tests": "npm test",
    "resolution": "approval_respected",
}

The final answer can match perfectly and the task can still fail if the trajectory is wrong. That is exactly what we want.


6. Harnessed vs. Unharnessed Performance

Agent evaluations should report both raw model performance and harnessed performance.

Raw model performance answers:

What can the model do by itself?

Harnessed performance answers:

What can the model do inside a realistic agent runtime?

Both matter.

A minimal harness might repair obvious argument aliases, route natural-language tool requests into tool calls, canonicalize final answer fields from the trace, and expose deterministic error messages for wrong arguments.

This is not cheating. Real coding agents have harnesses. Codex, Claude Code, and similar systems are not just language models emitting text into the void. They have shells, patch tools, sandbox policies, approval flows, file readers, MCP discovery, diff awareness, and structured event loops.

But the harness should not erase the benchmark. If a minimal harness makes the score jump to 1.0, the eval is too easy. The tasks are probably measuring formatting and argument spelling, not durable agent behavior.

A good eval should remain meaningfully hard with a reasonable harness.

In one small coding-agent suite I built, the target was that every capability facet should stay below 0.4 success for a strong model with a minimal harness. That forced the examples to test things the harness could not trivially repair:

That is the right kind of hardness.


7. A Generator for Arbitrarily Hard Agent Tasks

The scalable approach is to define task templates and difficulty operators.

A task template contains:

class AgentTaskTemplate:
    facet: str
    initial_user_message: str
    tools: list[ToolDefinition]
    scripted_tool_responses: list[ToolRule]
    required_calls: list[CallExpectation]
    forbidden_calls: list[CallExpectation]
    required_recoveries: list[RecoveryExpectation]
    required_verifications: list[CallExpectation]
    expected_final: dict
    oracle_actions: list[OracleAction]
    max_turns: int

Then we apply difficulty operators.

Operator 1: Add a recoverable failure

Before:

lookup_order -> record

After:

lookup_order -> temporary_timeout
lookup_order -> recovered_unverified_record
verify_order_record -> verified_record

This tests persistence plus verification.

Operator 2: Add an argument derivation

Before:

get_policy(order_id)

After:

lookup_order -> reconciliation_token="retry-window-014"
get_policy(order_id, recovery_window="014")

The agent must derive a compact argument from observed state.

Operator 3: Add a policy boundary

Before:

issue_refund -> success

After:

get_policy -> cash_refund_forbidden
offer_store_credit -> success
issue_refund -> forbidden_call

This tests whether the model can complete the user goal without violating policy.

Operator 4: Add a hook rejection

Before:

apply_patch(A) -> applied
run_tests -> passed

After:

apply_patch(A) -> applied
run_tests -> failed
run_hook -> rejected_secret_leak
apply_patch(A) -> repeated_failure_trap
apply_patch(B) -> applied
run_tests -> passed

This tests strategy change, not just retry.

Operator 5: Add ownership constraints

Before:

git_diff -> one changed file
edit_file -> success

After:

git_diff -> user-owned file + agent-owned file
edit_user_owned_file -> forbidden
edit_agent_owned_file -> success

This tests whether the agent respects worktree boundaries.

Operator 6: Add a stop condition

Before:

search -> no results
answer unavailable

After:

search(query A) -> no results
search(query B) -> no results
search(query B) -> repeated_no_result_trap
answer unavailable with evidence

This tests under-persistence and over-persistence at the same time.

Once these operators exist, difficulty becomes composable.

L1 = direct success
L2 = one recoverable failure
L3 = recoverable failure + derived argument
L4 = derived argument + policy boundary + verification
L5 = stale context + hook rejection + ownership boundary + tight turn budget

You can now generate 20 variants per facet without hand-authoring every detail.

The trick is to vary the graph, not just the nouns.

Bad variation:

src/cache.py -> src/payment.py
pytest tests/test_cache.py -> pytest tests/test_payment.py

Good variation:

happy path
transient failure
non-retryable failure
denied approval with fallback
stale previous-session state
hook rejection requiring alternative patch
dirty worktree requiring ownership separation

The model sees a different decision problem, not just a costume change.


8. Deterministic Scoring Beats LLM Judging for This Class of Eval

LLM judges are useful for fuzzy output quality. But many agent capabilities are not fuzzy.

Did the agent call the required tool?

Did it use the correct arguments?

Did it observe the failure before attempting recovery?

Did it verify before final answer?

Did it call a forbidden tool?

Did it repeat the same failed action too many times?

These should be scored with code.

A trace event can be represented as:

{
  "kind": "tool_result",
  "turn": 3,
  "role": "tool",
  "tool_call_id": "call_123",
  "tool": "run_tests",
  "arguments": {"command": "npm test"},
  "result": {"status": "passed"}
}

Then scoring is simple:

def has_required_call(trace, tool, args):
    return any(
        event.kind == "tool_call"
        and event.data["tool"] == tool
        and args_contain(event.data["arguments"], args)
        for event in trace
    )

def has_strategy_change_after_error(trace, tool, error_code):
    error_index = first_error_result(trace, tool, error_code)
    if error_index is None:
        return False

    failed_args = trace[error_index].data["arguments"]
    for event in trace[error_index + 1:]:
        if event.kind != "tool_call":
            continue
        if event.data["tool"] != tool:
            continue
        if event.data["arguments"] != failed_args:
            return True
    return False

No judge. No vibes. No “the answer seems reasonable.”

For agent capability, this determinism is a feature. It makes failures inspectable. It also makes the dataset easier to improve, because every failed run tells you exactly which trajectory property was missing.


9. How to Know If the Eval Is Too Easy

Some warning signs:

  1. The final answer score is high but trajectory failures are rare.
  2. A harness improves performance to near-perfect.
  3. Most failures are JSON formatting, spelling, or alias errors.
  4. The model can pass without reading intermediate tool results.
  5. The same action sequence works for most tasks in a category.
  6. There are no forbidden calls.
  7. There is no meaningful distinction between retry, fallback, and stop.

If you see this, add trajectory constraints.

A useful target for a new hard eval is:

strong model + minimal harness < 40% success per facet
oracle agent = 100%
deterministic scorer = 100%

The oracle requirement matters. If your scripted oracle cannot get 100%, the task or scorer is broken. If a strong model gets 95%, the task is probably not testing the capability you think it is testing.

The goal is not to make impossible tasks. The goal is to make tasks where success requires the intended capability.


10. A Practical Recipe

Here is the workflow I would use to build a serious agent-capability eval:

Step 1: Define facets

Start with 5-10 capabilities that correspond to real agent runtime responsibilities:

tool selection
state tracking
recovery
verification
permission handling
instruction hierarchy
context resume
git/workspace hygiene
stop conditions

Step 2: Write 5 gold tasks per facet

Hand-author these carefully. They become the seed set and the quality bar.

Each task needs:

user prompt
tool specs
scripted tool responses
expected final answer
required trajectory
forbidden trajectory
oracle trajectory

Step 3: Add difficulty operators

Turn each seed into variants:

add transient failure
add stale record
add missing argument
add denied approval
add forbidden shortcut
add hook rejection
add dirty worktree
add stop condition

Step 4: Split by graph pattern

Do not put the same trajectory graph in train and test with only renamed files. Split by operator combinations:

train: retry + verify
dev: retry + derived argument + verify
test: retry + derived argument + permission denial + verify

This measures generalization over agent behavior, not memorization of surface forms.

Step 5: Run raw and harnessed

Report both:

raw model score
minimal harness score
full product harness score
oracle score

The gap between raw and harnessed is itself informative. It tells you which parts are model capability and which parts are runtime scaffolding.

Step 6: Inspect failures, then raise the floor

Every time a facet becomes too easy, add a new operator or compose two existing ones.

That is how difficulty can increase arbitrarily while staying grounded.


11. The Core Idea

Agent evaluation should not be a pile of tricky prompts. It should be a set of small deterministic worlds.

In those worlds, the model has to:

The benchmark designer controls the world by controlling the trajectory graph.

That is the cleanest way I know to make agent evaluations harder on purpose.

When the eval is designed this way, “difficulty” stops being a mood. It becomes an engineering parameter.


Share this post on:

Next Post
The Mercor Breach: What 4TB of Stolen Data Reveals About How Frontier AI Labs Actually Train Models