The tool selection problem is deceptively simple: given a user query and a set of available tools, pick the right ones. With 5 tools, this is trivial. With 500 tools, it becomes a critical bottleneck that determines whether your agent system works at all.
This post covers the technical approaches to tool selection optimization—from semantic retrieval to learned routing—with a focus on what actually works in production.
1. The Scale Problem
| # Tools | Selection Accuracy | Approach |
|---|---|---|
| 5-10 | 90-95% | All in context |
| 20-50 | 80-90% | Good descriptions critical |
| 100-200 | 60-80% | Retrieval necessary |
| 500+ | 40-60% | Multi-stage selection |
Why it matters: Tool definitions cost 200-500 tokens each. At 500 tools, that’s 200K+ tokens per request—3-5 seconds of inference time and ~45K USD/month at scale. Optimized selection (10 tools) drops this to 400ms and ~900 USD/month.
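For a rough sanity check on those numbers, here is a back-of-the-envelope sketch; the per-token price and request volume are illustrative assumptions, chosen close to but not identical to the figures above:

# Back-of-the-envelope: tool-definition overhead per request and per month.
# Assumptions: 350 tokens/tool, $3 per 1M input tokens, 100K requests/month.
def tool_context_cost(n_tools, tokens_per_tool=350, price_per_1m_usd=3.0, requests_per_month=100_000):
    tokens_per_request = n_tools * tokens_per_tool
    monthly_usd = tokens_per_request / 1_000_000 * price_per_1m_usd * requests_per_month
    return tokens_per_request, monthly_usd

print(tool_context_cost(500))  # (175000, 52500.0): ~175K tokens/request, ~$52K/month
print(tool_context_cost(10))   # (3500, 1050.0): ~3.5K tokens/request, ~$1K/month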
Bloomberg’s finding: Optimizing tool selection reduced unnecessary tool calls by 70% while maintaining task success rates.
2. Retrieval-Based Tool Selection
2.1 Hybrid Retrieval
Pure semantic retrieval fails on vocabulary mismatch (user says “cancel” but tool says “refund”). Combine dense + sparse:
from collections import defaultdict
from typing import List

class HybridToolRetriever:
    def __init__(self, tools: List[Tool], embedding_model, alpha: float = 0.7):
        self.semantic = SemanticRetriever(tools, embedding_model)  # dense
        self.keyword = BM25Retriever(tools)                        # sparse
        self.alpha = alpha                  # weight on the dense retriever
        self._by_name = {t.name: t for t in tools}

    def retrieve(self, query: str, k: int = 10) -> List[Tool]:
        # Weighted Reciprocal Rank Fusion: works on ranks, so no score normalization needed
        tool_scores = defaultdict(float)
        for rank, tool in enumerate(self.semantic.retrieve(query, k * 2)):
            tool_scores[tool.name] += self.alpha / (60 + rank)
        for rank, tool in enumerate(self.keyword.retrieve(query, k * 2)):
            tool_scores[tool.name] += (1 - self.alpha) / (60 + rank)
        sorted_tools = sorted(tool_scores.items(), key=lambda x: -x[1])
        return [self._by_name[name] for name, _ in sorted_tools[:k]]
2.2 Conversation-Aware Retrieval
A query like “now delete it” only makes sense with conversation history. Two established approaches:
1. Query Rewriting (TREC CAsT standard)
def rewrite_query(query: str, history: List[Turn]) -> str:
    """LLM rewrites an ambiguous query to be self-contained."""
    prompt = f"Given conversation:\n{format_history(history)}\n\nRewrite '{query}' to be self-contained."
    return llm.generate(prompt)  # "delete it" → "delete user John Smith"
2. History Concatenation (ConvDR approach)
def retrieve_with_history(query: str, history: List[Turn], k: int = 10):
    """Concatenate recent history with the query before encoding."""
    context = "\n".join(t.content for t in history[-3:])  # last 3 turns
    combined_text = f"{context}\n\nCurrent: {query}"
    return semantic_search(combined_text, k)
References: TREC CAsT benchmark, ConvDR (arXiv:2104.13650). Query rewriting is more accurate but adds latency; concatenation is simpler.
2.3 Embedding Model Selection
| Model | Cost per 1M tokens | Notes |
|---|---|---|
| Voyage 3 Large | $0.18 | Top on Agentset tool retrieval benchmark |
| text-embedding-3-large | $0.13 | Balanced accuracy/cost |
| BGE-M3 | $0.01 | Self-hosted, budget option |
Note on Gemini Embedding: Leads MTEB general benchmarks but underperforms on tool-specific retrieval (Agentset leaderboard). General embedding quality ≠ tool retrieval quality.
Key insight from “Retrieval Models Aren’t Tool-Savvy” (ACL 2025): Standard embeddings achieve <35% completeness@10 on tool retrieval. Fine-tuning with hard negatives (semantically similar but functionally different tools) significantly improves this.
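A minimal sketch of hard-negative fine-tuning with sentence-transformers, assuming you can mine (query, correct tool, similar-but-wrong tool) triplets from your own logs; the base model and example data are placeholders, and this is not the exact recipe from the paper:

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# (query, matching tool description, hard-negative tool description) triplets.
# Hard negatives = tools that look similar but do something functionally different.
train_examples = [
    InputExample(texts=[
        "cancel my last order",
        "refund_order: Issue a refund for a completed order.",
        "order_status: Look up the shipping status of an order.",  # similar wording, wrong function
    ]),
    # ... more triplets mined from production logs ...
]

model = SentenceTransformer("BAAI/bge-base-en-v1.5")  # placeholder base model
loader = DataLoader(train_examples, shuffle=True, batch_size=32)
loss = losses.MultipleNegativesRankingLoss(model)     # accepts (anchor, positive, negative) triplets
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)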
3. Tool Description Optimization
3.1 Bloomberg’s Context Optimization (ACL 2025)
Key finding: Jointly optimizing agent instructions AND tool descriptions reduced tool calls by 70% (StableToolBench) and 47% (RestBench) while maintaining pass rates.
Benchmarks tested:
| Benchmark | Description | Tools |
|---|---|---|
| StableToolBench | Stability benchmark for tool-augmented LLMs | 16,000+ real APIs |
| RestBench | RESTful API evaluation | REST APIs |
The insight: Incomplete descriptions force LLMs to make exploratory calls. The optimized version adds:
When NOT to use:
- If you already have the information from previous tool calls
- For structured data queries (use database_query instead)
Note: Each call costs ~200ms. Batch multiple intents into one call.
3.2 Description Template
{
  "name": "weather_forecast",
  "description": "Get weather forecast for a location. Returns temperature, precipitation %, wind, conditions.",
  "when_to_use": "Weather, temperature, rain, outdoor planning queries",
  "when_not_to_use": "Historical data (use weather_history), air quality (use air_quality_api)",
  "parameters": {
    "location": {"type": "string", "description": "City name or coordinates"},
    "days": {"type": "integer", "default": 7, "description": "Forecast days (1-14)"}
  },
  "example_queries": ["What's the weather in Tokyo?", "Will it rain this weekend?"]
}
Key elements:
- when_to_use — triggers selection
- when_not_to_use — prevents over-selection
- example_queries — improves retrieval matching
4. Tool Set Management
4.1 Tool Merging (ToolScope)
Large tool sets often have redundancy: search_users vs find_users vs lookup_users. Each redundant tool consumes tokens, creates ambiguity, and increases hallucination risk.
class ToolMerger:
    def merge(self, tools: List[Tool], similarity_threshold: float = 0.85) -> List[Tool]:
        embeddings = embed_tools(tools)
        clusters = self._cluster(embeddings, similarity_threshold)
        merged = []
        for cluster in clusters:
            if len(cluster) == 1:
                merged.append(tools[cluster[0]])
            else:
                # Merge into a single tool: keep the richest description, record the rest as aliases
                primary = max((tools[i] for i in cluster), key=lambda t: len(t.description))
                aliases = [tools[i].name for i in cluster if tools[i].name != primary.name]
                primary.description += f"\n\nAliases: {', '.join(aliases)}"
                merged.append(primary)
        return merged
Result: 30-40% tool count reduction, improved selection accuracy.
4.2 Multi-Provider Abstraction
When multiple providers offer the same capability (weather: OpenWeatherMap/AccuWeather/WeatherAPI), expose a single virtual tool to the LLM:
class VirtualToolRouter:
    def get_virtual_tools(self) -> List[Tool]:
        # The LLM sees ONE tool per capability
        return [
            Tool(name="weather_forecast",
                 description="Get weather forecast for any location"),
            Tool(name="web_search",
                 description="Search the web for current information"),
        ]

    def execute(self, tool_name: str, params: dict, strategy: str = "smart"):
        providers = self.providers[tool_name]
        if strategy == "cheapest":
            provider = min(providers, key=lambda p: p.cost)
        elif strategy == "reliable":
            provider = max(providers, key=lambda p: p.reliability)
        else:  # "smart" default: balance reliability against cost
            provider = max(providers, key=lambda p: p.reliability / (1 + p.cost))
        return self._call_with_fallback(provider, providers, params)
If user preference matters, add optional provider parameter with default="auto".
Key principle: LLM thinks in capabilities (“I need weather data”), not implementations.
4.3 Large API Surfaces (Stripe, AWS, Salesforce)
100+ endpoints per service → can’t fit in context. Solutions:
| Approach | Example | Token Overhead |
|---|---|---|
| Hierarchical | Domain → operation | ~25 tools max |
| Intent-based | "Charge customer" → [get_customer, create_payment_intent, confirm_payment] | ~10 intents |
| CRUD abstraction | manage_customer(operation="create|read|update|delete") | ~15 resources |
| Dynamic retrieval | Embed API docs, retrieve k=5 per query | k retrieved |
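As an illustration of the CRUD-abstraction row, a single virtual tool with an operation enum can stand in for dozens of endpoints; the schema and the provider bindings below are hypothetical:

# Hypothetical single CRUD tool replacing dozens of per-endpoint tools.
manage_customer = {
    "name": "manage_customer",
    "description": "Create, read, update, or delete a customer record.",
    "parameters": {
        "operation": {"type": "string", "enum": ["create", "read", "update", "delete"]},
        "customer_id": {"type": "string", "description": "Required for read/update/delete"},
        "fields": {"type": "object", "description": "Field values for create/update"},
    },
}

def execute_manage_customer(params: dict):
    # One dispatcher maps the abstract operation onto the provider's real endpoints.
    dispatch = {
        "create": create_customer,   # hypothetical provider bindings
        "read": get_customer,
        "update": update_customer,
        "delete": delete_customer,
    }
    return dispatch[params["operation"]](params)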
5. Learned Tool Selection
5.1 AutoTool: Fine-Tuning for Selection (arXiv:2512.13278)
Trains Qwen3-8B and Qwen2.5-VL-7B using SFT + RL. Single model does both planning and tool selection.
Phase 1: Supervised learning on selection rationales (why tool X, not tool Y)
Phase 2: RL refinement with reward:
reward = (
    task_success * 0.5 +
    tool_efficiency * 0.2 +     # fewer tool calls is better
    no_hallucination * 0.2 +    # no invented tools
    correct_parameters * 0.1
)
5.2 Graph-Based Selection (arXiv:2511.14650)
Model tool co-occurrence as a graph:
- Nodes = Tools
- Edges = P(tool_j | tool_i) estimated from historical trajectories
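A sketch of how such a transition graph might be built from logged trajectories (the trajectory format, a list of tool-name sequences, is an assumption):

from collections import defaultdict

def build_transition_graph(trajectories):
    """Estimate P(tool_j | tool_i) from logged tool-call sequences,
    e.g. trajectories = [["search_users", "delete_user"], ...]."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in trajectories:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
        for prev, nxts in counts.items()
    }

def get_likely_next(graph, last_tool, k):
    # Candidate tools ranked by transition probability from the last executed tool
    probs = graph.get(last_tool, {})
    return sorted(probs, key=probs.get, reverse=True)[:k]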
How it combines with hybrid retrieval:
| Turn | Method | Why |
|---|---|---|
| First turn | Hybrid (BM25 + embedding) | No history, need query understanding |
| Subsequent turns | Graph transitions | Co-occurrence patterns dominate |
class HybridGraphSelector:
    def select(self, query: str, tool_history: List[str], k: int = 5):
        if not tool_history:
            # First turn: pure retrieval
            return self.hybrid_retriever.retrieve(query, k)
        # Subsequent turns: graph candidates, reranked by query relevance
        graph_candidates = self.graph.get_likely_next(tool_history[-1], k * 2)
        return self.rerank_by_query(graph_candidates, query, k)
Key insight: Tool selection has inertia—certain combinations appear together. Graph handles “what usually comes next,” retrieval handles “what does the query need.”
5.3 Constrained Decoding (Manus Approach)
Not retrieval, not training—inference-time control. Keep all tools in context, mask logits during generation.
| Technique | How It Works | Library |
|---|---|---|
| Logit masking | Set disallowed tokens to -inf | Manus |
| Grammar-constrained | Force output to match CFG/regex | Outlines, LMQL, Guidance |
| JSON schema | Constrain to valid structure | OpenAI JSON mode, vLLM |
Why not just remove tools?
| Approach | Trade-off |
|---|---|
| Dynamically add/remove tools | Invalidates KV-cache |
| Retrieval-based filtering | Might filter out needed tools |
| Logit masking | Tools stay in context, output constrained |
Implementation pattern (Manus):
# State machine controls which tool prefixes are allowed
allowed_prefixes = {
    "idle": ["browser_", "shell_", "search_"],
    "browsing": ["browser_"],   # only browser tools while browsing
    "responding": [],           # no tools - must respond to the user
}

def mask_logits(logits, state, tool_token_ids):
    # tool_token_ids: {tool_name: token_id}; disallowed tools get -inf logits
    allowed = {name for name in tool_token_ids
               if any(name.startswith(p) for p in allowed_prefixes[state])}
    for tool, token_id in tool_token_ids.items():
        if tool not in allowed:
            logits[:, token_id] = float("-inf")
    return logits
When to use: Tool availability depends on dynamic state. Avoids KV-cache invalidation from changing tool lists.
6. Multi-Stage Selection (1000+ tools)
Category → Retrieve: A lightweight classifier (<10ms) picks the category, then retrieval runs only within that subset.
Query → Category Classifier → [Data|Web|Code|Doc] → Retrieve within category
↑
<10ms, light model
Multi-stage pipeline:
| Stage | Method | Output | Latency |
|---|---|---|---|
| 1. Coarse | Embedding retrieval | k=100 | ~20ms |
| 2. Rerank | Cross-encoder | k=30 | ~50ms |
| 3. Filter | LLM confirmation | k=10 | ~100ms |
Trigger LLM stage only if top scores are ambiguous (gap < 0.1).
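Putting the stages together, a sketch of the pipeline; the retriever, cross-encoder reranker, and LLM filter are assumed interfaces, and the 0.1 gap implements the trigger rule above:

def multi_stage_select(query: str, retriever, reranker, llm_filter, k_final: int = 10):
    # Stage 1: cheap embedding retrieval over the full tool index
    coarse = retriever.retrieve(query, k=100)

    # Stage 2: cross-encoder rerank of the coarse candidates
    scored = sorted(
        ((reranker.score(query, t.description), t) for t in coarse),
        key=lambda x: -x[0],
    )[:30]

    # Stage 3: only pay for the LLM filter when the top scores are ambiguous
    top_scores = [s for s, _ in scored[:2]]
    if len(top_scores) == 2 and (top_scores[0] - top_scores[1]) < 0.1:
        return llm_filter.confirm(query, [t for _, t in scored], k=k_final)
    return [t for _, t in scored[:k_final]]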
7. Error Recovery
7.1 Research Approaches
| Framework | Method | Result |
|---|---|---|
| PALADIN (arXiv:2509.25238) | Train on 50K recovery-annotated trajectories | 95.2% recovery on unseen APIs |
| Structured Reflection (arXiv:2509.18847) | Diagnose failure → propose corrective action | Improved multi-turn success |
| STAR (arXiv:2503.06060) | Foundation model + knowledge graph | 78% recovery success rate |
| Toolken+ (arXiv:2410.12004) | Add “Reject” option—model can decline to use tools | Reduces false tool calls |
Key insight from PALADIN: Expose agents to tool failures during training (timeouts, API exceptions, inconsistent outputs) with expert recovery demonstrations.
7.2 Production Patterns
Fallback chain: Learned selection → Retrieval → Category defaults → Universal tools (search, calculator, code_executor).
Retry with exclusion: On tool failure, exclude from next selection. Pre-map alternatives by capability:
from typing import Optional, Set

CAPABILITY_TOOLS = {
    "web_search": ["tavily", "serper", "google_search"],
    "weather": ["openweathermap", "weatherapi", "accuweather"],
    "code_execution": ["e2b_sandbox", "modal_sandbox", "local_docker"],
}

def select_with_recovery(query: str, failed_tools: Optional[Set[str]] = None):
    candidates = retrieve_tools(query)
    if failed_tools:
        candidates = [t for t in candidates if t.name not in failed_tools]
    if not candidates:
        # Fall back to capability-based alternatives
        capability = infer_capability(query)
        candidates = [Tool(name=t) for t in CAPABILITY_TOOLS.get(capability, [])]
    return candidates
Structured reflection (self-correction):
def reflect_on_failure(query: str, tool: str, error: str, history: List[Turn]):
    """LLM diagnoses the failure and proposes a recovery action."""
    prompt = f"""
    Query: {query}
    Tool called: {tool}
    Error: {error}
    Previous steps: {format_history(history)}

    Diagnose what went wrong and propose the next action:
    1. Was this the wrong tool? → Suggest an alternative
    2. Wrong parameters? → Suggest a correction
    3. API unavailable? → Suggest a fallback
    """
    return llm.generate(prompt)
Retrieval failure (vocabulary mismatch):
- Detect via low scores (top score < 0.5)
- Reformulate with synonyms: “cancel” → “refund”, “terminate”
- Build alias index: user terms → tool terms
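A minimal alias-expansion sketch, assuming an alias table maintained by hand or mined from failed retrievals, and a retriever whose results expose a relevance score:

# User vocabulary -> tool vocabulary (illustrative entries)
ALIASES = {
    "cancel": ["refund", "terminate", "void"],
    "fix": ["repair", "update", "patch"],
}

def expand_query(query: str) -> str:
    # Append tool-side synonyms so BM25/embeddings can bridge the vocabulary gap
    extra = [syn for word in query.lower().split() for syn in ALIASES.get(word, [])]
    return f"{query} ({' '.join(extra)})" if extra else query

def retrieve_with_reformulation(query: str, retriever, min_score: float = 0.5, k: int = 10):
    results = retriever.retrieve(query, k)
    if not results or results[0].score < min_score:   # low-confidence retrieval detected
        results = retriever.retrieve(expand_query(query), k)
    return results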
7.3 The “Reject” Option (Toolken+)
Allow model to not select any tool:
tools_with_reject = tools + [Tool(
    name="NO_TOOL",
    description="Use when the query can be answered directly without tools, or no tool is appropriate",
)]
This reduces false tool calls when the LLM is uncertain.
8. Infrastructure Optimizations
8.1 KV Cache Stability
The problem: Dynamic retrieval changes which tools are in context each turn → invalidates KV cache → recomputes all previous tokens.
Solutions:
| Approach | How | Savings |
|---|---|---|
| Static tool set | Keep all tools in context, use masking | 100% cache hit |
| Ordered insertion | Always insert tools in same order | Partial cache hit |
| Tool prefix caching | Separate tool definitions from conversation | ~50% savings |
Manus insight: This is why they keep all tools in context and use logit masking—KV cache stability matters more than context length at their scale.
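For the "ordered insertion" row in the table above, the idea reduces to serializing the tool list deterministically, so identical tool sets always produce byte-identical prompt prefixes; a sketch, with the Tool attributes assumed:

import json

def stable_tool_prefix(tools) -> str:
    # Sort by name and serialize with a fixed key order so the same tool set
    # always yields the same prompt prefix - and therefore the same KV-cache entries.
    canonical = [
        {"name": t.name, "description": t.description, "parameters": t.parameters}
        for t in sorted(tools, key=lambda t: t.name)
    ]
    return json.dumps(canonical, sort_keys=True, ensure_ascii=False)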
8.2 Prompt/Prefix Caching
Both Anthropic and OpenAI offer prompt caching—tool definitions in system prompt are cached across requests.
# OpenAI: prompt caching is automatic for stable prefixes, including tool definitions
# Anthropic: add cache_control to the LAST tool definition to cache the whole tool block
tools = [
    {"name": "weather_forecast", "description": "...", "input_schema": {...}},
    # ... more tools, always in the same order ...
    {"name": "web_search", "description": "...", "input_schema": {...},
     "cache_control": {"type": "ephemeral"}},  # marks the end of the cached prefix
]
Impact:
- First request: Full input token cost
- Subsequent: ~90% reduction for cached prefix
- Requires tool list stability—changing tools invalidates cache
8.3 Embedding Precomputation
Don’t compute tool embeddings at query time:
import numpy as np

class CachedToolRetriever:
    def __init__(self, tools, embedding_model):
        self.embedding_model = embedding_model
        # Precompute and store tool embeddings once, at startup
        self.tool_embeddings = np.array([
            embedding_model.encode(t.description) for t in tools
        ])
        # Optional: quantize for faster search
        self.tool_embeddings_int8 = quantize_to_int8(self.tool_embeddings)

    def retrieve(self, query: str, k: int = 10):
        query_emb = self.embedding_model.encode(query)  # only this runs at query time
        scores = cosine_similarity(query_emb, self.tool_embeddings)
        return top_k(scores, k)
Storage: ~3KB per tool (768-dim float32) → 3MB for 1000 tools. Negligible.
8.4 Query Result Caching
For high-traffic systems, cache (query_hash → selected_tools):
from cachetools import TTLCache

class CachedToolSelector:
    def __init__(self, selector, cache_ttl: int = 3600):
        self.selector = selector
        # TTLCache evicts by size AND age, so stale selections expire
        self.cache = TTLCache(maxsize=10_000, ttl=cache_ttl)

    def select(self, query: str, context_hash: str = None):
        cache_key = hash(query + str(context_hash))
        if cache_key in self.cache:
            return self.cache[cache_key]
        result = self.selector.select(query)
        self.cache[cache_key] = result
        return result
Hit rates: 20-40% for chatbots (users ask similar things), <5% for agents (unique tasks).
8.5 Index Sharding (1000+ tools)
For very large tool sets, shard the embedding index by category:
Query → Category Classifier → Shard[category].search(query)
↑
Only loads relevant shard
Benefit: Memory footprint scales with active categories, not total tools.
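A sketch of a sharded retriever, assuming one prebuilt embedding index per category, a lightweight query classifier, and search results that expose a score (all placeholders):

class ShardedToolRetriever:
    def __init__(self, category_classifier, shard_indexes):
        self.classifier = category_classifier    # query -> category label
        self.shards = shard_indexes              # {category: embedding index}

    def retrieve(self, query: str, k: int = 10):
        category = self.classifier.predict(query)
        shard = self.shards.get(category)
        if shard is None:
            # Unknown category: fall back to searching every shard
            return self._search_all(query, k)
        return shard.search(query, k)

    def _search_all(self, query: str, k: int):
        results = []
        for shard in self.shards.values():
            results.extend(shard.search(query, k))
        return sorted(results, key=lambda r: -r.score)[:k]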
8.6 Hardware Considerations
| Component | CPU | GPU | When to Use GPU |
|---|---|---|---|
| BM25 | ✓ Fast | N/A | Never (string ops) |
| Embedding encode | Slow | ✓ Fast | >100 queries/sec |
| Similarity search | ✓ OK | ✓ Faster | >10K tools |
| Cross-encoder rerank | Slow | ✓ Fast | Always if available |
Rule of thumb: GPU for neural components, CPU for keyword search.
9. Production Recommendations
| Tool Count | Approach | Key Investment |
|---|---|---|
| 5-50 | All in context | Description quality |
| 50-200 | Semantic retrieval + reranking | Embedding fine-tuning |
| 200-1000 | Hierarchical routing | Category classifier, A/B testing |
| 1000+ | Learned selection + graph | Dedicated model, continuous learning |
10. Summary
The 80/20 of tool selection:
- Tool descriptions (40%): Clear “when to use”, “when not to use”, examples
- Retrieval quality (30%): Hybrid retrieval beats pure semantic
- Context awareness (20%): Use conversation history
- Learned components (10%): RL fine-tuning for edge cases
Common mistakes:
- Over-engineering retrieval when descriptions are bad
- Ignoring conversation context
- No fallback strategy
- Not monitoring selection accuracy in production
The best tool selection system is one where you rarely think about it because it just works.
References
Tool Selection Optimization
- ToolScope - arXiv:2510.20036 - Tool merging and context-aware filtering
- AutoTool - arXiv:2512.13278 - Dynamic tool selection via RL
- AutoTool (Graph) - arXiv:2511.14650 - Historical trajectory modeling
- ToolBrain - arXiv:2510.00023 - RL framework for tool use training
- Bloomberg ACL 2025 - Context optimization for tool calling (StableToolBench, RestBench)
Tool Retrieval
- “Retrieval Models Aren’t Tool-Savvy” - ACL 2025 Findings - Shows <35% completeness@10
- HYRR: Hybrid Retrieval - arXiv:2212.10528 - Combining BM25 with neural retrieval
Conversational Retrieval
- TREC CAsT - Conversational Assistance Track - Benchmark for conversational search
- ConvDR - arXiv:2104.13650 - Few-shot conversational dense retrieval with history encoding
Constrained Decoding
- Manus - Context Engineering for AI Agents
- Outlines - github.com/outlines-dev/outlines - Grammar-constrained generation
Error Recovery
- PALADIN - arXiv:2509.25238 - Self-correcting agents with 95.2% recovery on unseen APIs
- Structured Reflection - arXiv:2509.18847 - Diagnose failures, propose corrective actions
- STAR - arXiv:2503.06060 - Foundation model + knowledge graph for recovery (78% success)
- Toolken+ - arXiv:2410.12004 - “Reject” option to reduce false tool calls
Code examples are synthesized implementations illustrating practical patterns.