Tag: Evaluation
All the articles with the tag "Evaluation".
- How to Arbitrarily Increase the Difficulty of Agent Evaluation Sets
18 min read
A practical framework for making agent benchmarks harder in a controlled way: treat difficulty as trajectory-graph complexity, not prompt wording. Covers deterministic scoring, capability facets, harness effects, and systematic data generation.