
Tool Selection Optimization for LLM Agents at Scale


The tool selection problem is deceptively simple: given a user query and a set of available tools, pick the right ones. With 5 tools, this is trivial. With 500 tools, it becomes a critical bottleneck that determines whether your agent system works at all.

This post covers the technical approaches to tool selection optimization—from semantic retrieval to learned routing—with a focus on what actually works in production.


1. The Scale Problem

| # Tools | Selection Accuracy | Approach |
|---|---|---|
| 5-10 | 90-95% | All in context |
| 20-50 | 80-90% | Good descriptions critical |
| 100-200 | 60-80% | Retrieval necessary |
| 500+ | 40-60% | Multi-stage selection |

Why it matters: Tool definitions cost 200-500 tokens each. At 500 tools, that’s 200K+ tokens per request—3-5 seconds of inference time and ~45K USD/month at scale. Optimized selection (10 tools) drops this to 400ms and ~900 USD/month.

Bloomberg’s finding: Optimizing tool selection reduced unnecessary tool calls by 70% while maintaining task success rates.


2. Retrieval-Based Tool Selection

2.1 Hybrid Retrieval

Pure semantic retrieval fails on vocabulary mismatch (user says “cancel” but tool says “refund”). Combine dense + sparse:

from collections import defaultdict
from typing import List

class HybridToolRetriever:
    def __init__(self, tools, embedding_model, alpha=0.7):
        self.tools_by_name = {t.name: t for t in tools}
        self.semantic = SemanticRetriever(tools, embedding_model)
        self.keyword = BM25Retriever(tools)
        self.alpha = alpha  # Weight on the semantic ranking; 1 - alpha goes to BM25

    def retrieve(self, query: str, k: int = 10) -> List[Tool]:
        # Weighted Reciprocal Rank Fusion: works on ranks, so no score normalization needed
        tool_scores = defaultdict(float)
        for rank, tool in enumerate(self.semantic.retrieve(query, k * 2), start=1):
            tool_scores[tool.name] += self.alpha / (60 + rank)
        for rank, tool in enumerate(self.keyword.retrieve(query, k * 2), start=1):
            tool_scores[tool.name] += (1 - self.alpha) / (60 + rank)

        sorted_tools = sorted(tool_scores.items(), key=lambda x: -x[1])
        return [self.tools_by_name[name] for name, _ in sorted_tools[:k]]
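
To make the fusion step concrete, here is a minimal, self-contained sketch of plain (unweighted) Reciprocal Rank Fusion over two toy rankings. The `rrf_fuse` helper and the tool names are hypothetical, and ranks start at 1, which is the usual RRF convention; the constant 60 matches the retriever above.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[List[str]], k: int = 10, c: int = 60) -> List[str]:
    """Fuse several ranked lists of tool names with Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, name in enumerate(ranking, start=1):
            scores[name] += 1.0 / (c + rank)
    # Sort by fused score (descending); break ties by name for determinism.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ordered[:k]]


# Toy rankings: the semantic list misses "cancel_order", the keyword list finds it.
semantic = ["refund_order", "cancel_subscription", "get_order"]
keyword = ["cancel_subscription", "cancel_order", "refund_order"]
fused = rrf_fuse([semantic, keyword], k=3)
print(fused)
```

Tools that both retrievers agree on rise to the top, while a tool found by only one retriever can still make the cut.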

2.2 Conversation-Aware Retrieval

A query like “now delete it” only makes sense with conversation history. Two established approaches:

1. Query Rewriting (TREC CAsT standard)

def rewrite_query(query: str, history: List[Turn]) -> str:
    """LLM rewrites ambiguous query to be self-contained"""
    prompt = f"Given conversation:\n{format_history(history)}\n\nRewrite '{query}' to be self-contained."
    return llm.generate(prompt)  # "delete it" → "delete user John Smith"

2. History Concatenation (ConvDR approach)

def retrieve_with_history(query: str, history: List[Turn], k: int = 10):
    """Concatenate recent history with query before encoding"""
    context = "\n".join([t.content for t in history[-3:]])  # Last 3 turns
    combined_text = f"{context}\n\nCurrent: {query}"
    return semantic_search(combined_text, k)

References: TREC CAsT benchmark, ConvDR (arXiv:2104.13650). Query rewriting is more accurate but adds latency; concatenation is simpler.

2.3 Embedding Model Selection

| Model | Cost/1M | Notes |
|---|---|---|
| Voyage 3 Large | $0.18 | Top on Agentset tool retrieval benchmark |
| text-embedding-3-large | $0.13 | Balanced accuracy/cost |
| BGE-M3 | $0.01 | Self-hosted, budget option |

Note on Gemini Embedding: Leads MTEB general benchmarks but underperforms on tool-specific retrieval (Agentset leaderboard). General embedding quality ≠ tool retrieval quality.

Key insight from “Retrieval Models Aren’t Tool-Savvy” (ACL 2025): Standard embeddings achieve <35% completeness@10 on tool retrieval. Fine-tuning with hard negatives (semantically similar but functionally different tools) significantly improves this.
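
As a rough illustration of what "hard negatives" means here, the sketch below picks the tools whose descriptions look most like the correct tool's description. It is an assumption-heavy toy: token overlap (Jaccard) stands in for embedding similarity, and the tool catalog and `mine_hard_negatives` helper are hypothetical, not taken from the paper.

```python
from typing import Dict, List


def jaccard(a: str, b: str) -> float:
    """Token-set overlap; a stand-in for embedding similarity in this sketch."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def mine_hard_negatives(positive: str, tools: Dict[str, str], n: int = 2) -> List[str]:
    """Return the n tools most similar to the positive tool, excluding it."""
    target = tools[positive]
    others = [(name, jaccard(target, desc))
              for name, desc in tools.items() if name != positive]
    others.sort(key=lambda item: (-item[1], item[0]))
    return [name for name, _ in others[:n]]


# Hypothetical catalog: name -> description.
TOOLS = {
    "refund_order": "issue a refund for an order",
    "cancel_order": "cancel an order before shipping",
    "get_order": "fetch details for an order",
    "send_email": "send an email message to a user",
}
negatives = mine_hard_negatives("refund_order", TOOLS, n=2)
print(negatives)
```

Each (query, positive tool, hard negatives) triple then becomes one contrastive training example for the embedding model.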


3. Tool Description Optimization

3.1 Bloomberg’s Context Optimization (ACL 2025)

Key finding: Jointly optimizing agent instructions AND tool descriptions reduced tool calls by 70% (StableToolBench) and 47% (RestBench) while maintaining pass rates.

Benchmarks tested:

| Benchmark | Description | Tools |
|---|---|---|
| StableToolBench | Stability benchmark for tool-augmented LLMs | 16,000+ real APIs |
| RestBench | RESTful API evaluation | REST APIs |

The insight: Incomplete descriptions force LLMs to make exploratory calls. The optimized version adds:

When NOT to use: 
- If you already have the information from previous tool calls
- For structured data queries (use database_query instead)

Note: Each call costs ~200ms. Batch multiple intents into one call.

3.2 Description Template

{
  "name": "weather_forecast",
  "description": "Get weather forecast for a location. Returns temperature, precipitation %, wind, conditions.",
  "when_to_use": "Weather, temperature, rain, outdoor planning queries",
  "when_not_to_use": "Historical data (use weather_history), air quality (use air_quality_api)",
  "parameters": {
    "location": {"type": "string", "description": "City name or coordinates"},
    "days": {"type": "integer", "default": 7, "description": "Forecast days (1-14)"}
  },
  "example_queries": ["What's the weather in Tokyo?", "Will it rain this weekend?"]
}

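Key elements: a description that states what the tool returns, explicit when_to_use and when_not_to_use fields that name the alternative tool, parameter descriptions with defaults and valid ranges, and example_queries.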

3.3 Tool Use Examples (Anthropic)

JSON Schema defines structure but can’t express usage patterns: date formats, ID conventions, or parameter correlations.

Anthropic’s solution: Provide input_examples directly in tool definitions:

{
  "name": "create_ticket",
  "input_schema": { /* ... */ },
  "input_examples": [
    {
      "title": "Login page returns 500 error",
      "priority": "critical",
      "labels": ["bug", "authentication", "production"],
      "reporter": {"id": "USR-12345", "name": "Jane Smith"},
      "due_date": "2024-11-06"
    },
    {
      "title": "Add dark mode support",
      "labels": ["feature-request", "ui"]
    },
    {
      "title": "Update API documentation"
    }
  ]
}

From three examples, Claude learns: date format (YYYY-MM-DD), ID conventions (USR-XXXXX), and when to include optional parameters.

Result: Parameter accuracy improved from 72% to 90% on complex parameter handling.

Best practices:

3.4 On-Demand Tool Discovery (Anthropic Tool Search Tool)

Instead of loading all tool definitions upfront, discover tools on-demand:

| Approach | Token Cost | Tools Available |
|---|---|---|
| Traditional | ~72K tokens (50+ MCP tools) | All loaded |
| Tool Search Tool | ~8.7K tokens | Full library, on-demand |

Implementation: Mark tools with defer_loading: true:

{
  "tools": [
    {"type": "tool_search_tool_regex_20251119", "name": "tool_search_tool"},
    {
      "name": "github.createPullRequest",
      "description": "Create a pull request",
      "input_schema": {...},
      "defer_loading": true
    }
  ]
}

When Claude needs GitHub capabilities, it searches and only loads github.createPullRequest—not all 50+ tools from Slack, Jira, and Google Drive.

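Results (internal testing): 85% token reduction, with tool selection accuracy improving from 49% to 74% (figures from Anthropic's Advanced Tool Use post, listed in the references).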

When to use: >10 tools, >10K tokens in definitions, MCP-powered systems with multiple servers.


4. Tool Set Management

4.1 Tool Merging (ToolScope)

Large tool sets often have redundancy: search_users vs find_users vs lookup_users. Each redundant tool consumes tokens, creates ambiguity, and increases hallucination risk.

class ToolMerger:
    def merge(self, tools: List[Tool], similarity_threshold=0.85) -> List[Tool]:
        embeddings = embed_tools(tools)
        clusters = self._cluster(embeddings, similarity_threshold)
        
        merged = []
        for cluster in clusters:
            if len(cluster) == 1:
                merged.append(tools[cluster[0]])
            else:
                # Merge into single tool with aliases
                primary = max([tools[i] for i in cluster], key=lambda t: len(t.description))
                aliases = [tools[i].name for i in cluster if tools[i].name != primary.name]
                primary.description += f"\n\nAliases: {', '.join(aliases)}"
                merged.append(primary)
        return merged

Result: 30-40% tool count reduction, improved selection accuracy.
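
The `_cluster` step is not shown above. One simple way it might work is greedy threshold clustering on cosine similarity, sketched below with toy 2-D vectors in place of real embeddings. The `greedy_cluster` helper is hypothetical and is not claimed to be ToolScope's method.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_cluster(embeddings: List[Sequence[float]],
                   threshold: float = 0.85) -> List[List[int]]:
    """Assign each item to the first cluster whose seed is similar enough."""
    clusters: List[List[int]] = []
    for i, emb in enumerate(embeddings):
        for cluster in clusters:
            # Compare against the cluster's first member (its seed).
            if cosine(emb, embeddings[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])  # No close seed: start a new cluster
    return clusters


# Toy vectors: items 0, 1, and 3 point the same way; item 2 is orthogonal.
vectors = [(1.0, 0.0), (0.98, 0.2), (0.0, 1.0), (0.95, 0.1)]
clusters = greedy_cluster(vectors)
print(clusters)
```

Greedy clustering is order-dependent, so a production version may prefer agglomerative clustering, but the output shape (lists of tool indices) is what `merge` expects.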

4.2 Multi-Provider Abstraction

When multiple providers offer the same capability (weather: OpenWeatherMap/AccuWeather/WeatherAPI), expose a single virtual tool to the LLM:

class VirtualToolRouter:
    def get_virtual_tools(self) -> List[Tool]:
        # LLM sees ONE tool per capability
        return [
            Tool(name="weather_forecast", 
                 description="Get weather forecast for any location"),
            Tool(name="web_search",
                 description="Search the web for current information"),
        ]
    
    def execute(self, tool_name: str, params: dict, strategy: str = "smart"):
        providers = self.providers[tool_name]

        if strategy == "cheapest":
            provider = min(providers, key=lambda p: p.cost)
        elif strategy == "reliable":
            provider = max(providers, key=lambda p: p.reliability)
        else:
            # "smart" (default): placeholder policy, prefer reliability and break ties on cost
            provider = max(providers, key=lambda p: (p.reliability, -p.cost))

        return self._call_with_fallback(provider, providers, params)

If user preference matters, add an optional provider parameter with default="auto".

Key principle: The LLM thinks in capabilities (“I need weather data”), not implementations.

4.3 Large API Surfaces (Stripe, AWS, Salesforce)

100+ endpoints per service → can’t fit in context. Solutions:

| Approach | Example | Token Overhead |
|---|---|---|
| Hierarchical | Domain → operation | ~25 tools max |
| Intent-based | “Charge customer” → [get_customer, create_payment_intent, confirm_payment] | ~10 intents |
| CRUD abstraction | manage_customer(operation="create\|read\|update\|delete") | ~15 resources |
| Dynamic retrieval | Embed API docs, retrieve k=5 per query | k retrieved |

5. Learned Tool Selection

5.1 AutoTool: Fine-Tuning for Selection (arXiv:2512.13278)

Trains Qwen3-8B and Qwen2.5-VL-7B using SFT + RL. Single model does both planning and tool selection.

Phase 1: Supervised learning on selection rationales (why tool X, not tool Y)

Phase 2: RL refinement with reward:

reward = (
    task_success * 0.5 +
    tool_efficiency * 0.2 +      # Fewer tools = better
    no_hallucination * 0.2 +     # No invented tools
    correct_parameters * 0.1
)

5.2 Graph-Based Selection (arXiv:2511.14650)

Model tool co-occurrence as a graph:

How it combines with hybrid retrieval:

| Turn | Method | Why |
|---|---|---|
| First turn | Hybrid (BM25 + embedding) | No history, need query understanding |
| Subsequent turns | Graph transitions | Co-occurrence patterns dominate |

class HybridGraphSelector:
    def select(self, query: str, tool_history: List[str], k: int = 5):
        if not tool_history:
            # First turn: pure retrieval
            return self.hybrid_retriever.retrieve(query, k)
        
        # Subsequent: graph candidates, reranked by query relevance
        graph_candidates = self.graph.get_likely_next(tool_history[-1], k * 2)
        return self.rerank_by_query(graph_candidates, query, k)

Key insight: Tool selection has inertia—certain combinations appear together. Graph handles “what usually comes next,” retrieval handles “what does the query need.”
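
How the graph itself gets built is not shown above. A minimal sketch, assuming first-order transition counts mined from logged tool-call trajectories, might look like this. The class name and the trajectories are hypothetical; `get_likely_next` mirrors the method name used by the selector.

```python
from collections import Counter, defaultdict
from typing import Dict, List


class ToolTransitionGraph:
    """First-order transition counts between consecutive tool calls."""

    def __init__(self) -> None:
        self.transitions: Dict[str, Counter] = defaultdict(Counter)

    def fit(self, trajectories: List[List[str]]) -> None:
        for trajectory in trajectories:
            # Count each (current tool -> next tool) pair in the log.
            for current, nxt in zip(trajectory, trajectory[1:]):
                self.transitions[current][nxt] += 1

    def get_likely_next(self, tool: str, k: int = 5) -> List[str]:
        # Unseen tools have no outgoing edges and yield an empty list.
        counts = self.transitions.get(tool, Counter())
        return [name for name, _ in counts.most_common(k)]


graph = ToolTransitionGraph()
graph.fit([
    ["search_users", "get_user", "delete_user"],
    ["search_users", "get_user", "update_user"],
    ["search_users", "get_user", "delete_user"],
    ["search_users", "list_orders"],
])
print(graph.get_likely_next("get_user", k=2))
```

Because the counts come from real trajectories, the graph can be refreshed offline without touching the retriever.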

5.3 Constrained Decoding (Manus Approach)

Not retrieval, not training—inference-time control. Keep all tools in context, mask logits during generation.

| Technique | How It Works | Library |
|---|---|---|
| Logit masking | Set disallowed tokens to -inf | Manus |
| Grammar-constrained | Force output to match CFG/regex | Outlines, LMQL, Guidance |
| JSON schema | Constrain to valid structure | OpenAI JSON mode, vLLM |

Why not just remove tools?

| Approach | Problem |
|---|---|
| Dynamically add/remove tools | Invalidates KV-cache |
| Retrieval-based filtering | Might filter out needed tools |
| Logit masking | Tools stay in context, output constrained |

Implementation pattern (Manus):

# State machine controls which tool prefixes are allowed
allowed_prefixes = {
    "idle": ["browser_", "shell_", "search_"],
    "browsing": ["browser_"],  # Only browser tools while browsing
    "responding": [],          # No tools - must respond to user
}

def mask_logits(logits, state, tool_token_ids):
    # tool_token_ids maps tool name -> id of the token that starts that tool's name
    allowed = {t for t in tool_token_ids if any(t.startswith(p) for p in allowed_prefixes[state])}
    for tool, token_id in tool_token_ids.items():
        if tool not in allowed:
            logits[:, token_id] = float('-inf')  # Disallowed tool can never be sampled
    return logits

When to use: Tool availability depends on dynamic state. Avoids KV-cache invalidation from changing tool lists.


6. Multi-Stage Selection (1000+ tools)

Category → Retrieve: Fast classifier (~5ms) picks category, then retrieve within subset.

Query → Category Classifier (<10ms, light model) → [Data|Web|Code|Doc] → Retrieve within category

Multi-stage pipeline:

| Stage | Method | Output | Latency |
|---|---|---|---|
| 1. Coarse | Embedding retrieval | k=100 | ~20ms |
| 2. Rerank | Cross-encoder | k=30 | ~50ms |
| 3. Filter | LLM confirmation | k=10 | ~100ms |

Trigger LLM stage only if top scores are ambiguous (gap < 0.1).
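
The gating rule is small enough to sketch directly. This assumes reranker scores on a comparable scale (for example 0 to 1); the function name and example scores are hypothetical.

```python
from typing import List, Tuple


def needs_llm_confirmation(scored: List[Tuple[str, float]],
                           min_gap: float = 0.1) -> bool:
    """True when the top two reranker scores are too close to call."""
    if len(scored) < 2:
        return False  # Nothing to disambiguate
    top, runner_up = sorted((score for _, score in scored), reverse=True)[:2]
    return (top - runner_up) < min_gap


clear = [("weather_forecast", 0.92), ("air_quality_api", 0.61)]
ambiguous = [("search_users", 0.81), ("find_users", 0.78)]
print(needs_llm_confirmation(clear), needs_llm_confirmation(ambiguous))
```

Only the ambiguous case pays the ~100ms LLM stage; clear winners skip it.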


7. Error Recovery

7.1 Research Approaches

| Framework | Method | Result |
|---|---|---|
| PALADIN (arXiv:2509.25238) | Train on 50K recovery-annotated trajectories | 95.2% recovery on unseen APIs |
| Structured Reflection (arXiv:2509.18847) | Diagnose failure → propose corrective action | Improved multi-turn success |
| STAR (arXiv:2503.06060) | Foundation model + knowledge graph | 78% recovery success rate |
| Toolken+ (arXiv:2410.12004) | Add “Reject” option: model can decline to use tools | Reduces false tool calls |

Key insight from PALADIN: Expose agents to tool failures during training (timeouts, API exceptions, inconsistent outputs) with expert recovery demonstrations.

7.2 Production Patterns

Fallback chain: Learned selection → Retrieval → Category defaults → Universal tools (search, calculator, code_executor).
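
A fallback chain like this can be expressed as an ordered list of selector functions, each tried in turn. The sketch below is one possible shape under that assumption; the three stand-in stages are hypothetical placeholders for the learned model, the retriever, and category defaults.

```python
from typing import Callable, List, Optional, Sequence

Selector = Callable[[str], Optional[List[str]]]

UNIVERSAL_TOOLS = ["search", "calculator", "code_executor"]


def select_with_fallback(query: str, selectors: Sequence[Selector]) -> List[str]:
    """Try each selector in order; fall back to universal tools if all fail."""
    for selector in selectors:
        try:
            tools = selector(query)
        except Exception:
            continue  # A failing stage should not take down selection
        if tools:
            return tools
    return list(UNIVERSAL_TOOLS)


# Stand-in stages; real ones would wrap the learned model, retriever, etc.
def learned(query: str) -> Optional[List[str]]:
    raise RuntimeError("model unavailable")


def retrieval(query: str) -> Optional[List[str]]:
    return ["weather_forecast"] if "weather" in query else []


def category_defaults(query: str) -> Optional[List[str]]:
    return None


chain = [learned, retrieval, category_defaults]
print(select_with_fallback("weather in Tokyo", chain))
print(select_with_fallback("tell me a joke", chain))
```

In production you would likely log which stage answered, since a rising share of universal-tool fallbacks is an early sign that retrieval quality is slipping.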

Retry with exclusion: On tool failure, exclude from next selection. Pre-map alternatives by capability:

CAPABILITY_TOOLS = {
    "web_search": ["tavily", "serper", "google_search"],
    "weather": ["openweathermap", "weatherapi", "accuweather"],
    "code_execution": ["e2b_sandbox", "modal_sandbox", "local_docker"],
}

from typing import Optional, Set

def select_with_recovery(query: str, failed_tools: Optional[Set[str]] = None):
    failed = failed_tools or set()
    # Drop tools that already failed during this task
    candidates = [t for t in retrieve_tools(query) if t.name not in failed]

    if not candidates:
        # Fallback to capability-based alternatives, still skipping failed tools
        capability = infer_capability(query)
        candidates = [Tool(name=t) for t in CAPABILITY_TOOLS.get(capability, []) if t not in failed]

    return candidates

Structured reflection (self-correction):

def reflect_on_failure(query: str, tool: str, error: str, history: List[Turn]):
    """LLM diagnoses failure and proposes recovery action"""
    prompt = f"""
    Query: {query}
    Tool called: {tool}
    Error: {error}
    Previous steps: {format_history(history)}
    
    Diagnose what went wrong and propose the next action:
    1. Was this the wrong tool? → Suggest alternative
    2. Wrong parameters? → Suggest correction
    3. API unavailable? → Suggest fallback
    """
    return llm.generate(prompt)

Retrieval failure (vocabulary mismatch):

7.3 The “Reject” Option (Toolken+)

Allow the model to select no tool at all:

tools_with_reject = tools + [Tool(
    name="NO_TOOL",
    description="Use when the query can be answered directly without tools, or no tool is appropriate"
)]

This reduces false tool calls when the LLM is uncertain.


8. Programmatic Tool Calling (Anthropic)

Instead of sequential tool calls with each result entering context, Claude writes code that orchestrates tools.

8.1 The Problem with Sequential Calls

Example: “Which team members exceeded their Q3 travel budget?”

| Traditional | Programmatic |
|---|---|
| 20+ API round-trips | 1 code block |
| 2,000+ expense items in context | Only final result in context |
| ~200KB context consumed | ~1KB context consumed |

8.2 How It Works

Claude writes Python that calls tools; intermediate results stay in sandbox:

team = await get_team_members("engineering")
expenses = await asyncio.gather(*[
    get_expenses(m["id"], "Q3") for m in team
])

exceeded = []
for member, exp in zip(team, expenses):
    total = sum(e["amount"] for e in exp)
    if total > budget[member["level"]]["travel_limit"]:
        exceeded.append({"name": member["name"], "spent": total})

print(json.dumps(exceeded))  # Only this enters Claude's context

Implementation: Mark tools with allowed_callers:

{
  "tools": [
    {"type": "code_execution_20250825", "name": "code_execution"},
    {
      "name": "get_expenses",
      "allowed_callers": ["code_execution_20250825"]
    }
  ]
}

Results: 37% token reduction (figure from Anthropic's Advanced Tool Use post, listed in the references).

When to use: Processing large datasets, 3+ dependent tool calls, filtering/transforming results before Claude sees them.

8.3 Filesystem-Based Tool Discovery (MCP “Code Mode”)

An even more aggressive approach: present MCP tools as a filesystem of code APIs:

servers/
├── google-drive/
│   ├── getDocument.ts
│   └── index.ts
├── salesforce/
│   ├── updateRecord.ts
│   └── index.ts
└── slack/
    ├── sendMessage.ts
    └── index.ts

The agent navigates the filesystem, reading only the .ts files it needs:

// ./servers/google-drive/getDocument.ts
export async function getDocument(input: {documentId: string}): Promise<{content: string}> {
  return callMCPTool('google_drive__get_document', input);
}

Result: Token usage dropped from 150,000 → 2,000 tokens (98.7% reduction).

Progressive disclosure: Add a search_tools function with detail levels:
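
A minimal sketch of what such a function could look like, written in Python for brevity. The registry contents and the three detail levels ("name", "summary", "full") are assumptions chosen for illustration, not a documented API.

```python
from typing import Any, Dict, List

# Hypothetical registry; in practice this would be built from the servers/ tree.
REGISTRY: List[Dict[str, Any]] = [
    {"name": "google_drive__get_document",
     "description": "Fetch a document from Google Drive",
     "input_schema": {"documentId": "string"}},
    {"name": "slack__send_message",
     "description": "Send a message to a Slack channel",
     "input_schema": {"channel": "string", "text": "string"}},
]


def search_tools(query: str, detail: str = "name") -> List[Any]:
    """Return matching tools at the requested level of detail."""
    q = query.lower()
    matches = [t for t in REGISTRY
               if q in t["name"].lower() or q in t["description"].lower()]
    if detail == "name":
        return [t["name"] for t in matches]
    if detail == "summary":
        return [{"name": t["name"], "description": t["description"]} for t in matches]
    if detail == "full":
        return matches  # Includes input_schema
    raise ValueError("detail must be 'name', 'summary', or 'full'")


print(search_tools("slack"))
print(search_tools("document", detail="summary"))
```

The agent can start with names only and request full schemas just for the one or two tools it decides to call.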

8.4 Privacy-Preserving Operations

Intermediate data stays in the sandbox. For sensitive workloads, tokenize PII before it reaches the model:

// What the agent sees (if it logs the data):
[
  { email: '[EMAIL_1]', phone: '[PHONE_1]', name: '[NAME_1]' },
  { email: '[EMAIL_2]', phone: '[PHONE_2]', name: '[NAME_2]' }
]
// Real data flows between tools, never through the model
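
The tokenization step might be implemented along these lines. This sketch (in Python, with hypothetical helper names) handles emails only, with a deliberately simple regex; names and phone numbers would need their own detectors.

```python
import re
from typing import Dict, List, Tuple

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def tokenize_emails(rows: List[Dict[str, str]]) -> Tuple[List[Dict[str, str]], Dict[str, str]]:
    """Replace emails with placeholders; return masked rows and the lookup table."""
    mapping: Dict[str, str] = {}  # placeholder -> real value (stays in the sandbox)
    reverse: Dict[str, str] = {}  # real value -> placeholder, so repeats reuse one token

    def _mask(match: "re.Match") -> str:
        value = match.group(0)
        if value not in reverse:
            placeholder = f"[EMAIL_{len(reverse) + 1}]"
            reverse[value] = placeholder
            mapping[placeholder] = value
        return reverse[value]

    masked = [{key: EMAIL_RE.sub(_mask, val) for key, val in row.items()} for row in rows]
    return masked, mapping


def detokenize(text: str, mapping: Dict[str, str]) -> str:
    """Restore real values just before a tool call leaves the sandbox."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text


rows = [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Bob", "email": "bob@example.org"},
    {"name": "Ada (work)", "email": "ada@example.com"},
]
masked, mapping = tokenize_emails(rows)
print(masked)
```

The model only ever sees `masked`; `mapping` is held by the harness and applied with `detokenize` when a downstream tool needs the real address.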

8.5 Skills Accumulation

Agents can persist reusable functions:

// ./skills/save-sheet-as-csv.ts
export async function saveSheetAsCsv(sheetId: string) {
  const data = await gdrive.getSheet({ sheetId });
  const csv = data.map(row => row.join(',')).join('\n');
  await fs.writeFile(`./workspace/sheet-${sheetId}.csv`, csv);
  return `./workspace/sheet-${sheetId}.csv`;
}

Over time, agents build a growing toolbox of higher-level capabilities. Add a SKILL.md file to create structured skills that models can reference.

Reference: Cloudflare “Code Mode” published similar findings.


9. Infrastructure Optimizations

9.1 KV Cache Stability

The problem: Dynamic retrieval changes which tools are in context each turn → invalidates KV cache → recomputes all previous tokens.

Solutions:

| Approach | How | Savings |
|---|---|---|
| Static tool set | Keep all tools in context, use masking | 100% cache hit |
| Ordered insertion | Always insert tools in same order | Partial cache hit |
| Tool prefix caching | Separate tool definitions from conversation | ~50% savings |
| Deferred loading | Anthropic’s defer_loading: true | 85% reduction + cache preserved |

Manus insight: This is why they keep all tools in context and use logit masking—KV cache stability matters more than context length at their scale.

Anthropic insight: Tool Search Tool doesn’t break prompt caching because deferred tools are excluded from the initial prompt entirely.
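
"Ordered insertion" mostly comes down to serializing the tool block deterministically. A small sketch, assuming tools are plain dicts with a `name` key; the helper names are hypothetical, and whether a given provider's cache keys on exact prefix bytes is something to confirm in its documentation.

```python
import hashlib
import json
from typing import Any, Dict, List


def canonical_tool_block(tools: List[Dict[str, Any]]) -> str:
    """Serialize tools in a fixed order with a fixed key order."""
    ordered = sorted(tools, key=lambda t: t["name"])
    return json.dumps(ordered, sort_keys=True, separators=(",", ":"))


def prefix_fingerprint(tools: List[Dict[str, Any]]) -> str:
    """Hash of the serialized block; equal hashes mean an identical prefix."""
    return hashlib.sha256(canonical_tool_block(tools).encode("utf-8")).hexdigest()


a = [
    {"name": "web_search", "description": "Search the web"},
    {"description": "Get weather forecast", "name": "weather_forecast"},
]
b = list(reversed(a))  # Same tools, different arrival order
same = prefix_fingerprint(a) == prefix_fingerprint(b)
print(same)
```

Logging the fingerprint per request is a cheap way to spot accidental prefix churn, such as a registry that returns tools in nondeterministic order.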

9.2 Prompt/Prefix Caching

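Both Anthropic and OpenAI offer prompt caching: when the prompt prefix (tool definitions plus system prompt) is identical across requests, it is served from cache.

# OpenAI: caching is automatic for a stable prompt prefix, tools included
# Anthropic: put cache_control on the last tool definition to cache the whole tool list

tools_with_cache = {
    "tools": [
        ...,  # Same tools in the same order = cache hit
        {"name": "last_tool", "input_schema": {...},  # placeholder for the final tool entry
         "cache_control": {"type": "ephemeral"}},     # Anthropic
    ]
}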

Impact:

9.3 Embedding Precomputation

Don’t compute tool embeddings at query time:

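import numpy as np

class CachedToolRetriever:
    def __init__(self, tools, embedding_model):
        self.tools = tools
        self.embedding_model = embedding_model  # Needed again at query time
        # Precompute once and L2-normalize so a dot product equals cosine similarity
        emb = np.array([embedding_model.encode(t.description) for t in tools], dtype=np.float32)
        self.tool_embeddings = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        # Optional: quantize for faster search (quantize_to_int8 is an assumed helper)
        self.tool_embeddings_int8 = quantize_to_int8(self.tool_embeddings)

    def retrieve(self, query: str, k: int = 10):
        query_emb = np.asarray(self.embedding_model.encode(query), dtype=np.float32)  # Only this at runtime
        scores = self.tool_embeddings @ (query_emb / np.linalg.norm(query_emb))
        top = np.argsort(-scores)[:k]  # Indices of the k highest-scoring tools
        return [self.tools[i] for i in top]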

Storage: ~3KB per tool (768-dim float32) → 3MB for 1000 tools. Negligible.
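
The storage estimate is simple arithmetic, sketched here so it is easy to rerun for other dimensions or for int8 quantization. The function name is hypothetical.

```python
def embedding_bytes(num_tools: int, dim: int = 768, bytes_per_value: int = 4) -> int:
    """Raw storage for dense embeddings (float32 = 4 bytes per value)."""
    return num_tools * dim * bytes_per_value


per_tool = embedding_bytes(1)                           # 768 * 4 bytes
float32_total = embedding_bytes(1000)                   # 1000 tools at float32
int8_total = embedding_bytes(1000, bytes_per_value=1)   # Same index quantized to int8

print(per_tool, float32_total, int8_total)
```

At 768 dimensions, int8 quantization cuts the index to a quarter of the float32 size, which only starts to matter at catalog sizes far beyond 1000 tools.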

9.4 Query Result Caching

For high-traffic systems, cache (query_hash → selected_tools):

from cachetools import TTLCache  # Third-party: pip install cachetools

class CachedToolSelector:
    def __init__(self, selector, cache_ttl=3600):
        self.selector = selector
        # Bounded cache whose entries also expire after cache_ttl seconds
        self.cache = TTLCache(maxsize=10000, ttl=cache_ttl)

    def select(self, query: str, context_hash: str = None):
        # Tuple key avoids accidental collisions from string concatenation
        cache_key = (query, context_hash)

        if cache_key in self.cache:
            return self.cache[cache_key]

        result = self.selector.select(query)
        self.cache[cache_key] = result
        return result

Hit rates: 20-40% for chatbots (users ask similar things), <5% for agents (unique tasks).

9.5 Index Sharding (1000+ tools)

For very large tool sets, shard the embedding index by category:

Query → Category Classifier → Shard[category].search(query)  (only the relevant shard is loaded)

Benefit: Memory footprint scales with active categories, not total tools.

9.6 Hardware Considerations

| Component | CPU | GPU | When to Use GPU |
|---|---|---|---|
| BM25 | ✓ Fast | N/A | Never (string ops) |
| Embedding encode | Slow | ✓ Fast | >100 queries/sec |
| Similarity search | ✓ OK | ✓ Faster | >10K tools |
| Cross-encoder rerank | Slow | ✓ Fast | Always if available |

Rule of thumb: GPU for neural components, CPU for keyword search.


10. Production Recommendations

| Tool Count | Approach | Key Investment |
|---|---|---|
| 5-50 | All in context | Description quality |
| 50-200 | Semantic retrieval + reranking | Embedding fine-tuning |
| 200-1000 | Hierarchical routing | Category classifier, A/B testing |
| 1000+ | Learned selection + graph | Dedicated model, continuous learning |

11. Summary

The 80/20 of tool selection:

  1. Tool descriptions (40%): Clear “when to use”, “when not to use”, examples
  2. Retrieval quality (30%): Hybrid retrieval beats pure semantic
  3. Context awareness (20%): Use conversation history
  4. Learned components (10%): RL fine-tuning for edge cases

Common mistakes:

The best tool selection system is one where you rarely think about it because it just works.


References

Tool Selection Optimization

  1. ToolScope - arXiv:2510.20036 - Tool merging and context-aware filtering
  2. AutoTool - arXiv:2512.13278 - Dynamic tool selection via RL
  3. AutoTool (Graph) - arXiv:2511.14650 - Historical trajectory modeling
  4. ToolBrain - arXiv:2510.00023 - RL framework for tool use training
  5. Bloomberg ACL 2025 - Context optimization for tool calling (StableToolBench, RestBench)

Tool Retrieval

  1. “Retrieval Models Aren’t Tool-Savvy” - ACL 2025 Findings - Shows <35% completeness@10
  2. HYRR: Hybrid Retrieval - arXiv:2212.10528 - Combining BM25 with neural retrieval

Conversational Retrieval

  1. TREC CAsT - Conversational Assistance Track - Benchmark for conversational search
  2. ConvDR - arXiv:2104.13650 - Few-shot conversational dense retrieval with history encoding

Constrained Decoding

  1. Manus - Context Engineering for AI Agents
  2. Outlines - github.com/outlines-dev/outlines - Grammar-constrained generation

Error Recovery

  1. PALADIN - arXiv:2509.25238 - Self-correcting agents with 95.2% recovery on unseen APIs
  2. Structured Reflection - arXiv:2509.18847 - Diagnose failures, propose corrective actions
  3. STAR - arXiv:2503.06060 - Foundation model + knowledge graph for recovery (78% success)
  4. Toolken+ - arXiv:2410.12004 - “Reject” option to reduce false tool calls

Production Features

  1. Anthropic Advanced Tool Use - anthropic.com/engineering/advanced-tool-use - Tool Search Tool (85% token reduction, 49%→74% accuracy), Programmatic Tool Calling (37% token reduction), Tool Use Examples (72%→90% parameter accuracy)
  2. Anthropic Code Execution with MCP - anthropic.com/engineering/code-execution-with-mcp - Filesystem-based tool discovery (98.7% token reduction), privacy-preserving operations, skills accumulation
  3. Cloudflare Code Mode - blog.cloudflare.com/code-mode - Similar findings on code-based MCP tool orchestration

Code examples are synthesized implementations illustrating practical patterns.

