
Tool Selection Optimization for LLM Agents at Scale

14 min read

The tool selection problem is deceptively simple: given a user query and a set of available tools, pick the right ones. With 5 tools, this is trivial. With 500 tools, it becomes a critical bottleneck that determines whether your agent system works at all.

This post covers the technical approaches to tool selection optimization—from semantic retrieval to learned routing—with a focus on what actually works in production.


1. The Scale Problem

| # Tools | Selection Accuracy | Approach |
|---|---|---|
| 5-10 | 90-95% | All in context |
| 20-50 | 80-90% | Good descriptions critical |
| 100-200 | 60-80% | Retrieval necessary |
| 500+ | 40-60% | Multi-stage selection |

Why it matters: Tool definitions cost 200-500 tokens each. At 500 tools, that’s 200K+ tokens per request—3-5 seconds of inference time and ~45K USD/month at scale. Optimized selection (10 tools) drops this to 400ms and ~900 USD/month.
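
One combination of assumptions that reproduces those figures: roughly 400 tokens per tool definition, about 75K requests per month, and an illustrative $3 per 1M input tokens. All three constants are assumptions for the sketch, not measured values:

# Back-of-the-envelope cost model; every constant below is an illustrative assumption.
TOKENS_PER_TOOL = 400          # typical tool definition size
REQUESTS_PER_MONTH = 75_000    # assumed traffic
USD_PER_1M_INPUT_TOKENS = 3.0  # assumed input-token price

def monthly_tool_overhead_usd(num_tools: int) -> float:
    tokens_per_request = num_tools * TOKENS_PER_TOOL
    return tokens_per_request * REQUESTS_PER_MONTH * USD_PER_1M_INPUT_TOKENS / 1_000_000

print(monthly_tool_overhead_usd(500))  # ~45,000 USD/month
print(monthly_tool_overhead_usd(10))   # ~900 USD/month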

Bloomberg’s finding: Optimizing tool selection reduced unnecessary tool calls by 70% while maintaining task success rates.


2. Retrieval-Based Tool Selection

2.1 Hybrid Retrieval

Pure semantic retrieval fails on vocabulary mismatch (user says “cancel” but tool says “refund”). Combine dense + sparse:

from collections import defaultdict
from typing import List

class HybridToolRetriever:
    def __init__(self, tools, embedding_model, alpha=0.7):
        self.semantic = SemanticRetriever(tools, embedding_model)
        self.keyword = BM25Retriever(tools)
        self.alpha = alpha
    
    def retrieve(self, query: str, k: int = 10) -> List[Tool]:
        # Reciprocal Rank Fusion - no normalization needed
        tool_scores = defaultdict(float)
        for rank, tool in enumerate(self.semantic.retrieve(query, k*2)):
            tool_scores[tool.name] += 1 / (60 + rank)
        for rank, tool in enumerate(self.keyword.retrieve(query, k*2)):
            tool_scores[tool.name] += 1 / (60 + rank)
        
        sorted_tools = sorted(tool_scores.items(), key=lambda x: -x[1])
        return [self.get_tool(name) for name, _ in sorted_tools[:k]]

2.2 Conversation-Aware Retrieval

A query like “now delete it” only makes sense with conversation history. Two established approaches:

1. Query Rewriting (TREC CAsT standard)

def rewrite_query(query: str, history: List[Turn]) -> str:
    """LLM rewrites ambiguous query to be self-contained"""
    prompt = f"Given conversation:\n{format_history(history)}\n\nRewrite '{query}' to be self-contained."
    return llm.generate(prompt)  # "delete it" → "delete user John Smith"

2. History Concatenation (ConvDR approach)

def retrieve_with_history(query: str, history: List[Turn], k: int = 10):
    """Concatenate recent history with query before encoding"""
    context = "\n".join([t.content for t in history[-3:]])  # Last 3 turns
    combined_text = f"{context}\n\nCurrent: {query}"
    return semantic_search(combined_text, k)

References: TREC CAsT benchmark, ConvDR (arXiv:2104.13650). Query rewriting is more accurate but adds latency; concatenation is simpler.

2.3 Embedding Model Selection

| Model | Cost / 1M tokens | Notes |
|---|---|---|
| Voyage 3 Large | $0.18 | Top on Agentset tool retrieval benchmark |
| text-embedding-3-large | $0.13 | Balanced accuracy/cost |
| BGE-M3 | $0.01 | Self-hosted, budget option |

Note on Gemini Embedding: Leads MTEB general benchmarks but underperforms on tool-specific retrieval (Agentset leaderboard). General embedding quality ≠ tool retrieval quality.

Key insight from “Retrieval Models Aren’t Tool-Savvy” (ACL 2025): Standard embeddings achieve <35% completeness@10 on tool retrieval. Fine-tuning with hard negatives (semantically similar but functionally different tools) significantly improves this.
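
A minimal fine-tuning sketch with sentence-transformers, assuming you can mine (query, correct tool, hard-negative tool) triplets from logs; the base model and hyperparameters are placeholders:

from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("BAAI/bge-base-en-v1.5")  # placeholder base model

train_examples = [
    InputExample(texts=[
        "cancel my order from yesterday",                      # query
        "refund_order: issue a refund for an existing order",  # correct tool (positive)
        "cancel_meeting: remove an event from the calendar",   # hard negative: similar words, wrong function
    ]),
    # ... more mined triplets
]

loader = DataLoader(train_examples, shuffle=True, batch_size=16)
loss = losses.MultipleNegativesRankingLoss(model)  # third text acts as a hard negative
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)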


3. Tool Description Optimization

3.1 Bloomberg’s Context Optimization (ACL 2025)

Key finding: Jointly optimizing agent instructions AND tool descriptions reduced tool calls by 70% (StableToolBench) and 47% (RestBench) while maintaining pass rates.

Benchmarks tested:

| Benchmark | Description | Tools |
|---|---|---|
| StableToolBench | Stability benchmark for tool-augmented LLMs | 16,000+ real APIs |
| RestBench | RESTful API evaluation | REST APIs |

The insight: Incomplete descriptions force LLMs to make exploratory calls. The optimized version adds:

When NOT to use: 
- If you already have the information from previous tool calls
- For structured data queries (use database_query instead)

Note: Each call costs ~200ms. Batch multiple intents into one call.

3.2 Description Template

{
  "name": "weather_forecast",
  "description": "Get weather forecast for a location. Returns temperature, precipitation %, wind, conditions.",
  "when_to_use": "Weather, temperature, rain, outdoor planning queries",
  "when_not_to_use": "Historical data (use weather_history), air quality (use air_quality_api)",
  "parameters": {
    "location": {"type": "string", "description": "City name or coordinates"},
    "days": {"type": "integer", "default": 7, "description": "Forecast days (1-14)"}
  },
  "example_queries": ["What's the weather in Tokyo?", "Will it rain this weekend?"]
}

Key elements:

- A one-line description of what the tool returns
- Explicit when_to_use and when_not_to_use guidance that points to alternative tools
- Parameter descriptions with defaults and valid ranges
- A few example queries the tool should match


4. Tool Set Management

4.1 Tool Merging (ToolScope)

Large tool sets often have redundancy: search_users vs find_users vs lookup_users. Each redundant tool consumes tokens, creates ambiguity, and increases hallucination risk.

from typing import List

class ToolMerger:
    """Collapse near-duplicate tools; embed_tools and _cluster are assumed helpers."""
    def merge(self, tools: List[Tool], similarity_threshold=0.85) -> List[Tool]:
        embeddings = embed_tools(tools)
        clusters = self._cluster(embeddings, similarity_threshold)
        
        merged = []
        for cluster in clusters:
            if len(cluster) == 1:
                merged.append(tools[cluster[0]])
            else:
                # Merge into single tool with aliases
                primary = max([tools[i] for i in cluster], key=lambda t: len(t.description))
                aliases = [tools[i].name for i in cluster if tools[i].name != primary.name]
                primary.description += f"\n\nAliases: {', '.join(aliases)}"
                merged.append(primary)
        return merged

Result: 30-40% tool count reduction, improved selection accuracy.

4.2 Multi-Provider Abstraction

When multiple providers offer the same capability (weather: OpenWeatherMap/AccuWeather/WeatherAPI), expose a single virtual tool to the LLM:

class VirtualToolRouter:
    def get_virtual_tools(self) -> List[Tool]:
        # LLM sees ONE tool per capability
        return [
            Tool(name="weather_forecast", 
                 description="Get weather forecast for any location"),
            Tool(name="web_search",
                 description="Search the web for current information"),
        ]
    
    def execute(self, tool_name: str, params: dict, strategy: str = "smart"):
        providers = self.providers[tool_name]
        
        if strategy == "cheapest":
            provider = min(providers, key=lambda p: p.cost)
        elif strategy == "reliable":
            provider = max(providers, key=lambda p: p.reliability)
        else:
            # "smart" (the default): balance reliability against cost; the exact policy is up to you
            provider = max(providers, key=lambda p: p.reliability / max(p.cost, 1e-9))
        
        return self._call_with_fallback(provider, providers, params)

If user preference matters, add optional provider parameter with default="auto".

Key principle: LLM thinks in capabilities (“I need weather data”), not implementations.

4.3 Large API Surfaces (Stripe, AWS, Salesforce)

100+ endpoints per service → can’t fit in context. Solutions:

| Approach | Example | Token Overhead |
|---|---|---|
| Hierarchical | Domain → operation | ~25 tools max |
| Intent-based | "Charge customer" → [get_customer, create_payment_intent, confirm_payment] | ~10 intents |
| CRUD abstraction | manage_customer(operation="create\|read\|update\|delete") | ~15 resources |
| Dynamic retrieval | Embed API docs, retrieve k=5 per query | k retrieved |
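
As one concrete example of the CRUD abstraction row above, a single virtual tool can wrap an entire resource. The schema below is a hypothetical sketch in the style of the template from Section 3.2:

manage_customer = {
    "name": "manage_customer",
    "description": "Create, read, update, or delete a customer record.",
    "parameters": {
        "operation": {"type": "string", "enum": ["create", "read", "update", "delete"]},
        "customer_id": {"type": "string", "description": "Required for read/update/delete"},
        "fields": {"type": "object", "description": "Customer fields for create/update"},
    },
}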

5. Learned Tool Selection

5.1 AutoTool: Fine-Tuning for Selection (arXiv:2512.13278)

Trains Qwen3-8B and Qwen2.5-VL-7B using SFT + RL. Single model does both planning and tool selection.

Phase 1: Supervised learning on selection rationales (why tool X, not tool Y)
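
A hypothetical training record for this phase (the format is an assumption; the key point is that the target contains the rationale, not just the tool name):

sft_example = {
    "query": "Will it rain in Tokyo this weekend?",
    "candidate_tools": ["weather_forecast", "weather_history", "web_search"],
    "target": (
        "weather_forecast: the query asks about future conditions; "
        "weather_history only covers past data and web_search adds no value here."
    ),
}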

Phase 2: RL refinement with reward:

reward = (
    task_success * 0.5 +
    tool_efficiency * 0.2 +      # Fewer tools = better
    no_hallucination * 0.2 +     # No invented tools
    correct_parameters * 0.1
)

5.2 Graph-Based Selection (arXiv:2511.14650)

Model tool co-occurrence as a graph: nodes are tools, and edge weights count how often one tool follows another in successful trajectories, so selection can favor the most likely transition from the last tool used.
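
A minimal sketch of such a graph, assuming successful runs are logged as ordered lists of tool names; get_likely_next matches the interface used by the selector below:

from collections import defaultdict
from typing import Dict, List

class ToolCooccurrenceGraph:
    """Transition counts between consecutive tool calls in successful trajectories."""
    def __init__(self):
        self.transitions: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    
    def add_trajectory(self, tool_sequence: List[str]) -> None:
        for prev, nxt in zip(tool_sequence, tool_sequence[1:]):
            self.transitions[prev][nxt] += 1
    
    def get_likely_next(self, tool: str, k: int = 10) -> List[str]:
        counts = self.transitions.get(tool, {})
        return [name for name, _ in sorted(counts.items(), key=lambda x: -x[1])[:k]]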

How it combines with hybrid retrieval:

| Turn | Method | Why |
|---|---|---|
| First turn | Hybrid (BM25 + embedding) | No history, need query understanding |
| Subsequent turns | Graph transitions | Co-occurrence patterns dominate |

class HybridGraphSelector:
    def select(self, query: str, tool_history: List[str], k: int = 5):
        if not tool_history:
            # First turn: pure retrieval
            return self.hybrid_retriever.retrieve(query, k)
        
        # Subsequent: graph candidates, reranked by query relevance
        graph_candidates = self.graph.get_likely_next(tool_history[-1], k * 2)
        return self.rerank_by_query(graph_candidates, query, k)

Key insight: Tool selection has inertia—certain combinations appear together. Graph handles “what usually comes next,” retrieval handles “what does the query need.”

5.3 Constrained Decoding (Manus Approach)

Not retrieval, not training—inference-time control. Keep all tools in context, mask logits during generation.

| Technique | How It Works | Library |
|---|---|---|
| Logit masking | Set disallowed tokens to -inf | Manus |
| Grammar-constrained | Force output to match CFG/regex | Outlines, LMQL, Guidance |
| JSON schema | Constrain to valid structure | OpenAI JSON mode, vLLM |

Why not just remove tools?

| Approach | Outcome |
|---|---|
| Dynamically add/remove tools | Invalidates KV-cache |
| Retrieval-based filtering | Might filter out needed tools |
| Logit masking | Tools stay in context, output constrained |

Implementation pattern (Manus):

# State machine controls which tool prefixes are allowed
allowed_prefixes = {
    "idle": ["browser_", "shell_", "search_"],
    "browsing": ["browser_"],  # Only browser tools while browsing
    "responding": [],          # No tools - must respond to user
}

def mask_logits(logits, state, tool_token_ids):
    allowed = [t for t in tool_token_ids if any(t.startswith(p) for p in allowed_prefixes[state])]
    for tool, token_id in tool_token_ids.items():
        if tool not in allowed:
            logits[:, token_id] = float('-inf')
    return logits

When to use: Tool availability depends on dynamic state. Avoids KV-cache invalidation from changing tool lists.


6. Multi-Stage Selection (1000+ tools)

Category → Retrieve: Fast classifier (~5ms) picks category, then retrieve within subset.

Query → Category Classifier → [Data|Web|Code|Doc] → Retrieve within category
        (<10ms, light model)

Multi-stage pipeline:

| Stage | Method | Output | Latency |
|---|---|---|---|
| 1. Coarse | Embedding retrieval | k=100 | ~20ms |
| 2. Rerank | Cross-encoder | k=30 | ~50ms |
| 3. Filter | LLM confirmation | k=10 | ~100ms |

Trigger LLM stage only if top scores are ambiguous (gap < 0.1).
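
A sketch of that gating, assuming the reranker returns candidates sorted by a .score attribute and that llm_confirm_tools is a hypothetical confirmation call:

def maybe_llm_filter(reranked, query: str, gap_threshold: float = 0.1, k: int = 10):
    """Run the LLM confirmation stage only when cross-encoder scores are ambiguous."""
    top = reranked[:k]
    # Clear winner: skip the slow, expensive LLM stage entirely.
    if len(reranked) < 2 or reranked[0].score - reranked[1].score >= gap_threshold:
        return top
    return llm_confirm_tools(query, top)  # hypothetical LLM confirmation helper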


7. Error Recovery

7.1 Research Approaches

| Framework | Method | Result |
|---|---|---|
| PALADIN (arXiv:2509.25238) | Train on 50K recovery-annotated trajectories | 95.2% recovery on unseen APIs |
| Structured Reflection (arXiv:2509.18847) | Diagnose failure → propose corrective action | Improved multi-turn success |
| STAR (arXiv:2503.06060) | Foundation model + knowledge graph | 78% recovery success rate |
| Toolken+ (arXiv:2410.12004) | Add "Reject" option: model can decline to use tools | Reduces false tool calls |

Key insight from PALADIN: Expose agents to tool failures during training (timeouts, API exceptions, inconsistent outputs) with expert recovery demonstrations.

7.2 Production Patterns

Fallback chain: Learned selection → Retrieval → Category defaults → Universal tools (search, calculator, code_executor).
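
A sketch of that chain; learned_selector, hybrid_retriever, and category_defaults are assumed callables wired up elsewhere:

UNIVERSAL_TOOLS = ["web_search", "calculator", "code_executor"]

def select_with_fallbacks(query: str, k: int = 10):
    """Try each selection strategy in order; fall back when one returns nothing."""
    for strategy in (learned_selector, hybrid_retriever, category_defaults):
        candidates = strategy(query, k)
        if candidates:
            return candidates
    return UNIVERSAL_TOOLS  # last resort: generic tools that can handle most queries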

Retry with exclusion: On tool failure, exclude from next selection. Pre-map alternatives by capability:

from typing import Set

CAPABILITY_TOOLS = {
    "web_search": ["tavily", "serper", "google_search"],
    "weather": ["openweathermap", "weatherapi", "accuweather"],
    "code_execution": ["e2b_sandbox", "modal_sandbox", "local_docker"],
}

def select_with_recovery(query: str, failed_tools: Set[str] = None):
    candidates = retrieve_tools(query)
    if failed_tools:
        candidates = [t for t in candidates if t.name not in failed_tools]
    
    if not candidates:
        # Fallback to capability-based alternatives
        capability = infer_capability(query)
        candidates = [Tool(name=t) for t in CAPABILITY_TOOLS.get(capability, [])]
    
    return candidates

Structured reflection (self-correction):

def reflect_on_failure(query: str, tool: str, error: str, history: List[Turn]):
    """LLM diagnoses failure and proposes recovery action"""
    prompt = f"""
    Query: {query}
    Tool called: {tool}
    Error: {error}
    Previous steps: {format_history(history)}
    
    Diagnose what went wrong and propose the next action:
    1. Was this the wrong tool? → Suggest alternative
    2. Wrong parameters? → Suggest correction
    3. API unavailable? → Suggest fallback
    """
    return llm.generate(prompt)

Retrieval failure (vocabulary mismatch): if retrieval returns nothing relevant, fall back to the keyword (BM25) side of the hybrid retriever or rewrite the query (Section 2.2) before retrying.

7.3 The “Reject” Option (Toolken+)

Allow model to not select any tool:

tools_with_reject = tools + [Tool(
    name="NO_TOOL",
    description="Use when the query can be answered directly without tools, or no tool is appropriate"
)]

This reduces false tool calls when the LLM is uncertain.


8. Infrastructure Optimizations

8.1 KV Cache Stability

The problem: Dynamic retrieval changes which tools are in context each turn → invalidates KV cache → recomputes all previous tokens.

Solutions:

| Approach | How | Savings |
|---|---|---|
| Static tool set | Keep all tools in context, use masking | 100% cache hit |
| Ordered insertion | Always insert tools in same order | Partial cache hit |
| Tool prefix caching | Separate tool definitions from conversation | ~50% savings |
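
For the ordered-insertion row, the key is deterministic serialization. A minimal sketch, assuming each tool exposes a to_prompt() serializer (an assumption, not a real API):

def stable_tool_prefix(tools) -> str:
    """Serialize tool definitions in a fixed order so the prompt prefix is byte-identical
    across requests, keeping the KV cache and provider prompt caches warm."""
    ordered = sorted(tools, key=lambda t: t.name)  # never depend on retrieval order
    return "\n\n".join(t.to_prompt() for t in ordered)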

Manus insight: This is why they keep all tools in context and use logit masking—KV cache stability matters more than context length at their scale.

8.2 Prompt/Prefix Caching

Both Anthropic and OpenAI offer prompt caching—tool definitions in system prompt are cached across requests.

# OpenAI: prompt caching is automatic when the request prefix (including tools) is identical
# Anthropic: add cache_control to the last tool definition to set a cache breakpoint

tools = [
    # ... earlier tool definitions, always in the same order ...
    {
        "name": "weather_forecast",
        "description": "Get weather forecast for a location",
        "input_schema": {"type": "object", "properties": {"location": {"type": "string"}}},
        "cache_control": {"type": "ephemeral"},  # Anthropic: caches everything up to this tool
    },
]

Impact:

8.3 Embedding Precomputation

Don’t compute tool embeddings at query time:

import numpy as np

class CachedToolRetriever:
    def __init__(self, tools, embedding_model):
        self.embedding_model = embedding_model
        # Precompute and store tool embeddings once, at startup
        self.tool_embeddings = np.array([
            embedding_model.encode(t.description) for t in tools
        ])
        # Optional: quantize for faster search (quantize_to_int8 is an assumed helper)
        self.tool_embeddings_int8 = quantize_to_int8(self.tool_embeddings)
    
    def retrieve(self, query: str, k: int = 10):
        query_emb = self.embedding_model.encode(query)  # Only this runs at query time
        scores = cosine_similarity(query_emb, self.tool_embeddings)
        return top_k(scores, k)

Storage: ~3KB per tool (768-dim float32) → 3MB for 1000 tools. Negligible.

8.4 Query Result Caching

For high-traffic systems, cache (query_hash → selected_tools):

from cachetools import TTLCache

class CachedToolSelector:
    def __init__(self, selector, cache_ttl=3600):
        self.selector = selector
        # TTL cache so stale selections expire even on hot keys
        self.cache = TTLCache(maxsize=10000, ttl=cache_ttl)
    
    def select(self, query: str, context_hash: str = None):
        cache_key = hash(query + str(context_hash))
        
        if cache_key in self.cache:
            return self.cache[cache_key]
        
        result = self.selector.select(query)
        self.cache[cache_key] = result
        return result

Hit rates: 20-40% for chatbots (users ask similar things), <5% for agents (unique tasks).

8.5 Index Sharding (1000+ tools)

For very large tool sets, shard the embedding index by category:

Query → Category Classifier → Shard[category].search(query)
        (only loads the relevant shard)

Benefit: Memory footprint scales with active categories, not total tools.
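
A minimal sketch, assuming one vector index per category and injected embed / classify_category helpers (all names here are placeholders):

class ShardedToolIndex:
    def __init__(self, shards, embed, classify_category):
        self.shards = shards                         # e.g. {"data": data_index, "web": web_index, ...}
        self.embed = embed                           # query embedding function
        self.classify_category = classify_category   # light classifier from Section 6
    
    def search(self, query: str, k: int = 10):
        category = self.classify_category(query)
        query_emb = self.embed(query)
        return self.shards[category].search(query_emb, k)  # only one shard is loaded and queried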

8.6 Hardware Considerations

| Component | CPU | GPU | When to Use GPU |
|---|---|---|---|
| BM25 | ✓ Fast | N/A | Never (string ops) |
| Embedding encode | Slow | ✓ Fast | >100 queries/sec |
| Similarity search | ✓ OK | ✓ Faster | >10K tools |
| Cross-encoder rerank | Slow | ✓ Fast | Always if available |

Rule of thumb: GPU for neural components, CPU for keyword search.


9. Production Recommendations

| Tool Count | Approach | Key Investment |
|---|---|---|
| 5-50 | All in context | Description quality |
| 50-200 | Semantic retrieval + reranking | Embedding fine-tuning |
| 200-1000 | Hierarchical routing | Category classifier, A/B testing |
| 1000+ | Learned selection + graph | Dedicated model, continuous learning |

10. Summary

The 80/20 of tool selection:

  1. Tool descriptions (40%): Clear “when to use”, “when not to use”, examples
  2. Retrieval quality (30%): Hybrid retrieval beats pure semantic
  3. Context awareness (20%): Use conversation history
  4. Learned components (10%): RL fine-tuning for edge cases

Common mistakes:

  1. Incomplete descriptions that force exploratory tool calls (Section 3)
  2. Redundant, overlapping tools that dilute selection accuracy (Section 4.1)
  3. Changing the tool list every turn and invalidating the KV cache (Section 8.1)
  4. No "reject" option, so the model calls a tool even when none fits (Section 7.3)

The best tool selection system is one where you rarely think about it because it just works.


References

Tool Selection Optimization

  1. ToolScope - arXiv:2510.20036 - Tool merging and context-aware filtering
  2. AutoTool - arXiv:2512.13278 - Dynamic tool selection via RL
  3. AutoTool (Graph) - arXiv:2511.14650 - Historical trajectory modeling
  4. ToolBrain - arXiv:2510.00023 - RL framework for tool use training
  5. Bloomberg ACL 2025 - Context optimization for tool calling (StableToolBench, RestBench)

Tool Retrieval

  1. “Retrieval Models Aren’t Tool-Savvy” - ACL 2025 Findings - Shows <35% completeness@10
  2. HYRR: Hybrid Retrieval - arXiv:2212.10528 - Combining BM25 with neural retrieval

Conversational Retrieval

  1. TREC CAsT - Conversational Assistance Track - Benchmark for conversational search
  2. ConvDR - arXiv:2104.13650 - Few-shot conversational dense retrieval with history encoding

Constrained Decoding

  1. Manus - Context Engineering for AI Agents
  2. Outlines - github.com/outlines-dev/outlines - Grammar-constrained generation

Error Recovery

  1. PALADIN - arXiv:2509.25238 - Self-correcting agents with 95.2% recovery on unseen APIs
  2. Structured Reflection - arXiv:2509.18847 - Diagnose failures, propose corrective actions
  3. STAR - arXiv:2503.06060 - Foundation model + knowledge graph for recovery (78% success)
  4. Toolken+ - arXiv:2410.12004 - “Reject” option to reduce false tool calls

Code examples are synthesized implementations illustrating practical patterns.

