The tool selection problem is deceptively simple: given a user query and a set of available tools, pick the right ones. With 5 tools, this is trivial. With 500 tools, it becomes a critical bottleneck that determines whether your agent system works at all.
This post covers the technical approaches to tool selection optimization—from semantic retrieval to learned routing—with a focus on what actually works in production.
1. The Scale Problem
| # Tools | Selection Accuracy | Approach |
|---|---|---|
| 5-10 | 90-95% | All in context |
| 20-50 | 80-90% | Good descriptions critical |
| 100-200 | 60-80% | Retrieval necessary |
| 500+ | 40-60% | Multi-stage selection |
Why it matters: Tool definitions cost 200-500 tokens each. At 500 tools, that’s 200K+ tokens per request—3-5 seconds of inference time and ~45K USD/month at scale. Optimized selection (10 tools) drops this to 400ms and ~900 USD/month.
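For a rough sanity check on those numbers, here is a back-of-the-envelope sketch; the per-token price and request volume are illustrative assumptions, chosen close to but not identical to the figures above:

# Back-of-the-envelope: tool-definition overhead per request and per month.
# Assumptions: 350 tokens/tool, $3 per 1M input tokens, 100K requests/month.
def tool_context_cost(n_tools, tokens_per_tool=350, price_per_1m_usd=3.0, requests_per_month=100_000):
    tokens_per_request = n_tools * tokens_per_tool
    monthly_usd = tokens_per_request / 1_000_000 * price_per_1m_usd * requests_per_month
    return tokens_per_request, monthly_usd

print(tool_context_cost(500))  # (175000, 52500.0): ~175K tokens/request, ~$52K/month
print(tool_context_cost(10))   # (3500, 1050.0): ~3.5K tokens/request, ~$1K/month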
Bloomberg’s finding: Optimizing tool selection reduced unnecessary tool calls by 70% while maintaining task success rates.
2. Retrieval-Based Tool Selection
2.1 Hybrid Retrieval
Pure semantic retrieval fails on vocabulary mismatch (user says “cancel” but tool says “refund”). Combine dense + sparse:
from collections import defaultdict
from typing import List

class HybridToolRetriever:
    def __init__(self, tools: List[Tool], embedding_model, alpha: float = 0.7):
        self.semantic = SemanticRetriever(tools, embedding_model)  # dense
        self.keyword = BM25Retriever(tools)                        # sparse
        self.alpha = alpha                  # weight on the dense retriever
        self._by_name = {t.name: t for t in tools}

    def retrieve(self, query: str, k: int = 10) -> List[Tool]:
        # Weighted Reciprocal Rank Fusion: works on ranks, so no score normalization needed
        tool_scores = defaultdict(float)
        for rank, tool in enumerate(self.semantic.retrieve(query, k * 2)):
            tool_scores[tool.name] += self.alpha / (60 + rank)
        for rank, tool in enumerate(self.keyword.retrieve(query, k * 2)):
            tool_scores[tool.name] += (1 - self.alpha) / (60 + rank)
        sorted_tools = sorted(tool_scores.items(), key=lambda x: -x[1])
        return [self._by_name[name] for name, _ in sorted_tools[:k]]
2.2 Conversation-Aware Retrieval
A query like “now delete it” only makes sense with conversation history. Two established approaches:
1. Query Rewriting (TREC CAsT standard)
def rewrite_query(query: str, history: List[Turn]) -> str:
    """LLM rewrites an ambiguous query to be self-contained."""
    prompt = f"Given conversation:\n{format_history(history)}\n\nRewrite '{query}' to be self-contained."
    return llm.generate(prompt)  # "delete it" → "delete user John Smith"
2. History Concatenation (ConvDR approach)
def retrieve_with_history(query: str, history: List[Turn], k: int = 10):
    """Concatenate recent history with the query before encoding."""
    context = "\n".join(t.content for t in history[-3:])  # last 3 turns
    combined_text = f"{context}\n\nCurrent: {query}"
    return semantic_search(combined_text, k)
References: TREC CAsT benchmark, ConvDR (arXiv:2104.13650). Query rewriting is more accurate but adds latency; concatenation is simpler.
2.3 Embedding Model Selection
| Model | Cost per 1M tokens | Notes |
|---|---|---|
| Voyage 3 Large | $0.18 | Top on Agentset tool retrieval benchmark |
| text-embedding-3-large | $0.13 | Balanced accuracy/cost |
| BGE-M3 | $0.01 | Self-hosted, budget option |
Note on Gemini Embedding: Leads MTEB general benchmarks but underperforms on tool-specific retrieval (Agentset leaderboard). General embedding quality ≠ tool retrieval quality.
Key insight from “Retrieval Models Aren’t Tool-Savvy” (ACL 2025): Standard embeddings achieve <35% completeness@10 on tool retrieval. Fine-tuning with hard negatives (semantically similar but functionally different tools) significantly improves this.
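A minimal sketch of hard-negative fine-tuning with sentence-transformers, assuming you can mine (query, correct tool, similar-but-wrong tool) triplets from your own logs; the base model and example data are placeholders, and this is not the exact recipe from the paper:

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# (query, matching tool description, hard-negative tool description) triplets.
# Hard negatives = tools that look similar but do something functionally different.
train_examples = [
    InputExample(texts=[
        "cancel my last order",
        "refund_order: Issue a refund for a completed order.",
        "order_status: Look up the shipping status of an order.",  # similar wording, wrong function
    ]),
    # ... more triplets mined from production logs ...
]

model = SentenceTransformer("BAAI/bge-base-en-v1.5")  # placeholder base model
loader = DataLoader(train_examples, shuffle=True, batch_size=32)
loss = losses.MultipleNegativesRankingLoss(model)     # accepts (anchor, positive, negative) triplets
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)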
3. Tool Description Optimization
3.1 Bloomberg’s Context Optimization (ACL 2025)
Key finding: Jointly optimizing agent instructions AND tool descriptions reduced tool calls by 70% (StableToolBench) and 47% (RestBench) while maintaining pass rates.
Benchmarks tested:
| Benchmark | Description | Tools |
|---|---|---|
| StableToolBench | Stability benchmark for tool-augmented LLMs | 16,000+ real APIs |
| RestBench | RESTful API evaluation | REST APIs |
The insight: Incomplete descriptions force LLMs to make exploratory calls. The optimized version adds:
When NOT to use:
- If you already have the information from previous tool calls
- For structured data queries (use database_query instead)
Note: Each call costs ~200ms. Batch multiple intents into one call.
3.2 Description Template
{
  "name": "weather_forecast",
  "description": "Get weather forecast for a location. Returns temperature, precipitation %, wind, conditions.",
  "when_to_use": "Weather, temperature, rain, outdoor planning queries",
  "when_not_to_use": "Historical data (use weather_history), air quality (use air_quality_api)",
  "parameters": {
    "location": {"type": "string", "description": "City name or coordinates"},
    "days": {"type": "integer", "default": 7, "description": "Forecast days (1-14)"}
  },
  "example_queries": ["What's the weather in Tokyo?", "Will it rain this weekend?"]
}
Key elements:
- when_to_use — triggers selection
- when_not_to_use — prevents over-selection
- example_queries — improves retrieval matching
4. Tool Set Management
4.1 Tool Merging (ToolScope)
Large tool sets often have redundancy: search_users vs find_users vs lookup_users. Each redundant tool consumes tokens, creates ambiguity, and increases hallucination risk.
class ToolMerger:
    def merge(self, tools: List[Tool], similarity_threshold: float = 0.85) -> List[Tool]:
        embeddings = embed_tools(tools)
        clusters = self._cluster(embeddings, similarity_threshold)
        merged = []
        for cluster in clusters:
            if len(cluster) == 1:
                merged.append(tools[cluster[0]])
            else:
                # Merge into a single tool: keep the richest description, record the rest as aliases
                primary = max((tools[i] for i in cluster), key=lambda t: len(t.description))
                aliases = [tools[i].name for i in cluster if tools[i].name != primary.name]
                primary.description += f"\n\nAliases: {', '.join(aliases)}"
                merged.append(primary)
        return merged
Result: 30-40% tool count reduction, improved selection accuracy.
4.2 Multi-Provider Abstraction
When multiple providers offer the same capability (weather: OpenWeatherMap/AccuWeather/WeatherAPI), expose a single virtual tool to the LLM:
class VirtualToolRouter:
    def get_virtual_tools(self) -> List[Tool]:
        # The LLM sees ONE tool per capability
        return [
            Tool(name="weather_forecast",
                 description="Get weather forecast for any location"),
            Tool(name="web_search",
                 description="Search the web for current information"),
        ]

    def execute(self, tool_name: str, params: dict, strategy: str = "smart"):
        providers = self.providers[tool_name]
        if strategy == "cheapest":
            provider = min(providers, key=lambda p: p.cost)
        elif strategy == "reliable":
            provider = max(providers, key=lambda p: p.reliability)
        else:  # "smart" default: balance reliability against cost
            provider = max(providers, key=lambda p: p.reliability / (1 + p.cost))
        return self._call_with_fallback(provider, providers, params)
If user preference matters, add optional provider parameter with default="auto".
Key principle: LLM thinks in capabilities (“I need weather data”), not implementations.
4.3 Large API Surfaces (Stripe, AWS, Salesforce)
100+ endpoints per service → can’t fit in context. Solutions:
| Approach | Example | Token Overhead |
|---|---|---|
| Hierarchical | Domain → operation | ~25 tools max |
| Intent-based | "Charge customer" → [get_customer, create_payment_intent, confirm_payment] | ~10 intents |
| CRUD abstraction | manage_customer(operation="create|read|update|delete") | ~15 resources |
| Dynamic retrieval | Embed API docs, retrieve k=5 per query | k retrieved |
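As an illustration of the CRUD-abstraction row, a single virtual tool with an operation enum can stand in for dozens of endpoints; the schema and the provider bindings below are hypothetical:

# Hypothetical single CRUD tool replacing dozens of per-endpoint tools.
manage_customer = {
    "name": "manage_customer",
    "description": "Create, read, update, or delete a customer record.",
    "parameters": {
        "operation": {"type": "string", "enum": ["create", "read", "update", "delete"]},
        "customer_id": {"type": "string", "description": "Required for read/update/delete"},
        "fields": {"type": "object", "description": "Field values for create/update"},
    },
}

def execute_manage_customer(params: dict):
    # One dispatcher maps the abstract operation onto the provider's real endpoints.
    dispatch = {
        "create": create_customer,   # hypothetical provider bindings
        "read": get_customer,
        "update": update_customer,
        "delete": delete_customer,
    }
    return dispatch[params["operation"]](params)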
5. Learned Tool Selection
5.1 AutoTool: Fine-Tuning for Selection (arXiv:2512.13278)
Trains Qwen3-8B and Qwen2.5-VL-7B using SFT + RL. Single model does both planning and tool selection.
Phase 1: Supervised learning on selection rationales (why tool X, not tool Y)
Phase 2: RL refinement with reward:
reward = (
    task_success * 0.5 +
    tool_efficiency * 0.2 +     # fewer tool calls is better
    no_hallucination * 0.2 +    # no invented tools
    correct_parameters * 0.1
)
5.2 Graph-Based Selection (arXiv:2511.14650)
Model tool co-occurrence as a graph:
- Nodes = Tools
- Edges = P(tool_j | tool_i) estimated from historical trajectories
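A sketch of how such a transition graph might be built from logged trajectories (the trajectory format, a list of tool-name sequences, is an assumption):

from collections import defaultdict

def build_transition_graph(trajectories):
    """Estimate P(tool_j | tool_i) from logged tool-call sequences,
    e.g. trajectories = [["search_users", "delete_user"], ...]."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in trajectories:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
        for prev, nxts in counts.items()
    }

def get_likely_next(graph, last_tool, k):
    # Candidate tools ranked by transition probability from the last executed tool
    probs = graph.get(last_tool, {})
    return sorted(probs, key=probs.get, reverse=True)[:k]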
How it combines with hybrid retrieval:
| Turn | Method | Why |
|---|---|---|
| First turn | Hybrid (BM25 + embedding) | No history, need query understanding |
| Subsequent turns | Graph transitions | Co-occurrence patterns dominate |
class HybridGraphSelector:
    def select(self, query: str, tool_history: List[str], k: int = 5):
        if not tool_history:
            # First turn: pure retrieval
            return self.hybrid_retriever.retrieve(query, k)
        # Subsequent turns: graph candidates, reranked by query relevance
        graph_candidates = self.graph.get_likely_next(tool_history[-1], k * 2)
        return self.rerank_by_query(graph_candidates, query, k)
Key insight: Tool selection has inertia—certain combinations appear together. Graph handles “what usually comes next,” retrieval handles “what does the query need.”
5.3 Constrained Decoding (Manus Approach)
Not retrieval, not training—inference-time control. Keep all tools in context, mask logits during generation.
| Technique | How It Works | Library |
|---|---|---|
| Logit masking | Set disallowed tokens to -inf | Manus |
| Grammar-constrained | Force output to match CFG/regex | Outlines, LMQL, Guidance |
| JSON schema | Constrain to valid structure | OpenAI JSON mode, vLLM |
Why not just remove tools?
| Approach | Trade-off |
|---|---|
| Dynamically add/remove tools | Invalidates KV-cache |
| Retrieval-based filtering | Might filter out needed tools |
| Logit masking | Tools stay in context, output constrained |
Implementation pattern (Manus):
# State machine controls which tool prefixes are allowed
allowed_prefixes = {
    "idle": ["browser_", "shell_", "search_"],
    "browsing": ["browser_"],   # only browser tools while browsing
    "responding": [],           # no tools - must respond to the user
}

def mask_logits(logits, state, tool_token_ids):
    # tool_token_ids: {tool_name: token_id}; disallowed tools get -inf logits
    allowed = {name for name in tool_token_ids
               if any(name.startswith(p) for p in allowed_prefixes[state])}
    for tool, token_id in tool_token_ids.items():
        if tool not in allowed:
            logits[:, token_id] = float("-inf")
    return logits
When to use: Tool availability depends on dynamic state. Avoids KV-cache invalidation from changing tool lists.
6. Multi-Stage Selection (1000+ tools)
Category → Retrieve: A lightweight classifier (<10ms) picks the category, then retrieval runs only within that subset.
Query → Category Classifier → [Data|Web|Code|Doc] → Retrieve within category
↑
<10ms, light model
Multi-stage pipeline:
| Stage | Method | Output | Latency |
|---|---|---|---|
| 1. Coarse | Embedding retrieval | k=100 | ~20ms |
| 2. Rerank | Cross-encoder | k=30 | ~50ms |
| 3. Filter | LLM confirmation | k=10 | ~100ms |
Trigger LLM stage only if top scores are ambiguous (gap < 0.1).
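Putting the stages together, a sketch of the pipeline; the retriever, cross-encoder reranker, and LLM filter are assumed interfaces, and the 0.1 gap implements the trigger rule above:

def multi_stage_select(query: str, retriever, reranker, llm_filter, k_final: int = 10):
    # Stage 1: cheap embedding retrieval over the full tool index
    coarse = retriever.retrieve(query, k=100)

    # Stage 2: cross-encoder rerank of the coarse candidates
    scored = sorted(
        ((reranker.score(query, t.description), t) for t in coarse),
        key=lambda x: -x[0],
    )[:30]

    # Stage 3: only pay for the LLM filter when the top scores are ambiguous
    top_scores = [s for s, _ in scored[:2]]
    if len(top_scores) == 2 and (top_scores[0] - top_scores[1]) < 0.1:
        return llm_filter.confirm(query, [t for _, t in scored], k=k_final)
    return [t for _, t in scored[:k_final]]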
7. Error Recovery
7.1 Research Approaches
| Framework | Method | Result |
|---|---|---|
| PALADIN (arXiv:2509.25238) | Train on 50K recovery-annotated trajectories | 95.2% recovery on unseen APIs |
| Structured Reflection (arXiv:2509.18847) | Diagnose failure → propose corrective action | Improved multi-turn success |
| STAR (arXiv:2503.06060) | Foundation model + knowledge graph | 78% recovery success rate |
| Toolken+ (arXiv:2410.12004) | Add “Reject” option—model can decline to use tools | Reduces false tool calls |
Key insight from PALADIN: Expose agents to tool failures during training (timeouts, API exceptions, inconsistent outputs) with expert recovery demonstrations.
7.2 Production Patterns
Fallback chain: Learned selection → Retrieval → Category defaults → Universal tools (search, calculator, code_executor).
Retry with exclusion: On tool failure, exclude from next selection. Pre-map alternatives by capability:
from typing import Optional, Set

CAPABILITY_TOOLS = {
    "web_search": ["tavily", "serper", "google_search"],
    "weather": ["openweathermap", "weatherapi", "accuweather"],
    "code_execution": ["e2b_sandbox", "modal_sandbox", "local_docker"],
}

def select_with_recovery(query: str, failed_tools: Optional[Set[str]] = None):
    candidates = retrieve_tools(query)
    if failed_tools:
        candidates = [t for t in candidates if t.name not in failed_tools]
    if not candidates:
        # Fall back to capability-based alternatives
        capability = infer_capability(query)
        candidates = [Tool(name=t) for t in CAPABILITY_TOOLS.get(capability, [])]
    return candidates
Structured reflection (self-correction):
def reflect_on_failure(query: str, tool: str, error: str, history: List[Turn]):
    """LLM diagnoses the failure and proposes a recovery action."""
    prompt = f"""
    Query: {query}
    Tool called: {tool}
    Error: {error}
    Previous steps: {format_history(history)}

    Diagnose what went wrong and propose the next action:
    1. Was this the wrong tool? → Suggest an alternative
    2. Wrong parameters? → Suggest a correction
    3. API unavailable? → Suggest a fallback
    """
    return llm.generate(prompt)
Retrieval failure (vocabulary mismatch):
- Detect via low scores (top score < 0.5)
- Reformulate with synonyms: “cancel” → “refund”, “terminate”
- Build alias index: user terms → tool terms
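A minimal alias-expansion sketch, assuming an alias table maintained by hand or mined from failed retrievals, and a retriever whose results expose a relevance score:

# User vocabulary -> tool vocabulary (illustrative entries)
ALIASES = {
    "cancel": ["refund", "terminate", "void"],
    "fix": ["repair", "update", "patch"],
}

def expand_query(query: str) -> str:
    # Append tool-side synonyms so BM25/embeddings can bridge the vocabulary gap
    extra = [syn for word in query.lower().split() for syn in ALIASES.get(word, [])]
    return f"{query} ({' '.join(extra)})" if extra else query

def retrieve_with_reformulation(query: str, retriever, min_score: float = 0.5, k: int = 10):
    results = retriever.retrieve(query, k)
    if not results or results[0].score < min_score:   # low-confidence retrieval detected
        results = retriever.retrieve(expand_query(query), k)
    return results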
7.3 The “Reject” Option (Toolken+)
Allow model to not select any tool:
tools_with_reject = tools + [Tool(
    name="NO_TOOL",
    description="Use when the query can be answered directly without tools, or no tool is appropriate",
)]
This reduces false tool calls when the LLM is uncertain.
8. Infrastructure Optimizations
8.1 KV Cache Stability
The problem: Dynamic retrieval changes which tools are in context each turn → invalidates KV cache → recomputes all previous tokens.
Solutions:
| Approach | How | Savings |
|---|---|---|
| Static tool set | Keep all tools in context, use masking | 100% cache hit |
| Ordered insertion | Always insert tools in same order | Partial cache hit |
| Tool prefix caching | Separate tool definitions from conversation | ~50% savings |
Manus insight: This is why they keep all tools in context and use logit masking—KV cache stability matters more than context length at their scale.
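For the "ordered insertion" row in the table above, the idea reduces to serializing the tool list deterministically, so identical tool sets always produce byte-identical prompt prefixes; a sketch, with the Tool attributes assumed:

import json

def stable_tool_prefix(tools) -> str:
    # Sort by name and serialize with a fixed key order so the same tool set
    # always yields the same prompt prefix - and therefore the same KV-cache entries.
    canonical = [
        {"name": t.name, "description": t.description, "parameters": t.parameters}
        for t in sorted(tools, key=lambda t: t.name)
    ]
    return json.dumps(canonical, sort_keys=True, ensure_ascii=False)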
8.2 Prompt/Prefix Caching
Both Anthropic and OpenAI offer prompt caching—tool definitions in system prompt are cached across requests.
# OpenAI: prompt caching is automatic for stable prefixes, including tool definitions
# Anthropic: add cache_control to the LAST tool definition to cache the whole tool block
tools = [
    {"name": "weather_forecast", "description": "...", "input_schema": {...}},
    # ... more tools, always in the same order ...
    {"name": "web_search", "description": "...", "input_schema": {...},
     "cache_control": {"type": "ephemeral"}},  # marks the end of the cached prefix
]
Impact:
- First request: Full input token cost
- Subsequent: ~90% reduction for cached prefix
- Requires tool list stability—changing tools invalidates cache
8.3 Embedding Precomputation
Don’t compute tool embeddings at query time:
import numpy as np

class CachedToolRetriever:
    def __init__(self, tools, embedding_model):
        self.embedding_model = embedding_model
        # Precompute and store tool embeddings once, at startup
        self.tool_embeddings = np.array([
            embedding_model.encode(t.description) for t in tools
        ])
        # Optional: quantize for faster search
        self.tool_embeddings_int8 = quantize_to_int8(self.tool_embeddings)

    def retrieve(self, query: str, k: int = 10):
        query_emb = self.embedding_model.encode(query)  # only this runs at query time
        scores = cosine_similarity(query_emb, self.tool_embeddings)
        return top_k(scores, k)
Storage: ~3KB per tool (768-dim float32) → 3MB for 1000 tools. Negligible.
8.4 Query Result Caching
For high-traffic systems, cache (query_hash → selected_tools):
from cachetools import TTLCache

class CachedToolSelector:
    def __init__(self, selector, cache_ttl: int = 3600):
        self.selector = selector
        # TTLCache evicts by size AND age, so stale selections expire
        self.cache = TTLCache(maxsize=10_000, ttl=cache_ttl)

    def select(self, query: str, context_hash: str = None):
        cache_key = hash(query + str(context_hash))
        if cache_key in self.cache:
            return self.cache[cache_key]
        result = self.selector.select(query)
        self.cache[cache_key] = result
        return result
Hit rates: 20-40% for chatbots (users ask similar things), <5% for agents (unique tasks).
8.5 Index Sharding (1000+ tools)
For very large tool sets, shard the embedding index by category:
Query → Category Classifier → Shard[category].search(query)
↑
Only loads relevant shard
Benefit: Memory footprint scales with active categories, not total tools.
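A sketch of a sharded retriever, assuming one prebuilt embedding index per category, a lightweight query classifier, and search results that expose a score (all placeholders):

class ShardedToolRetriever:
    def __init__(self, category_classifier, shard_indexes):
        self.classifier = category_classifier    # query -> category label
        self.shards = shard_indexes              # {category: embedding index}

    def retrieve(self, query: str, k: int = 10):
        category = self.classifier.predict(query)
        shard = self.shards.get(category)
        if shard is None:
            # Unknown category: fall back to searching every shard
            return self._search_all(query, k)
        return shard.search(query, k)

    def _search_all(self, query: str, k: int):
        results = []
        for shard in self.shards.values():
            results.extend(shard.search(query, k))
        return sorted(results, key=lambda r: -r.score)[:k]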
8.6 Hardware Considerations
| Component | CPU | GPU | When to Use GPU |
|---|---|---|---|
| BM25 | ✓ Fast | N/A | Never (string ops) |
| Embedding encode | Slow | ✓ Fast | >100 queries/sec |
| Similarity search | ✓ OK | ✓ Faster | >10K tools |
| Cross-encoder rerank | Slow | ✓ Fast | Always if available |
Rule of thumb: GPU for neural components, CPU for keyword search.
9. Production Recommendations
| Tool Count | Approach | Key Investment |
|---|---|---|
| 5-50 | All in context | Description quality |
| 50-200 | Semantic retrieval + reranking | Embedding fine-tuning |
| 200-1000 | Hierarchical routing | Category classifier, A/B testing |
| 1000+ | Learned selection + graph | Dedicated model, continuous learning |
10. Summary
The 80/20 of tool selection:
- Tool descriptions (40%): Clear “when to use”, “when not to use”, examples
- Retrieval quality (30%): Hybrid retrieval beats pure semantic
- Context awareness (20%): Use conversation history
- Learned components (10%): RL fine-tuning for edge cases
Common mistakes:
- Over-engineering retrieval when descriptions are bad
- Ignoring conversation context
- No fallback strategy
- Not monitoring selection accuracy in production
The best tool selection system is one where you rarely think about it because it just works.
References
Tool Selection Optimization
- ToolScope - arXiv:2510.20036 - Tool merging and context-aware filtering
- AutoTool - arXiv:2512.13278 - Dynamic tool selection via RL
- AutoTool (Graph) - arXiv:2511.14650 - Historical trajectory modeling
- ToolBrain - arXiv:2510.00023 - RL framework for tool use training
- Bloomberg ACL 2025 - Context optimization for tool calling (StableToolBench, RestBench)
Tool Retrieval
- “Retrieval Models Aren’t Tool-Savvy” - ACL 2025 Findings - Shows <35% completeness@10
- HYRR: Hybrid Retrieval - arXiv:2212.10528 - Combining BM25 with neural retrieval
Conversational Retrieval
- TREC CAsT - Conversational Assistance Track - Benchmark for conversational search
- ConvDR - arXiv:2104.13650 - Few-shot conversational dense retrieval with history encoding
Constrained Decoding
- Manus - Context Engineering for AI Agents
- Outlines - github.com/outlines-dev/outlines - Grammar-constrained generation
Error Recovery
- PALADIN - arXiv:2509.25238 - Self-correcting agents with 95.2% recovery on unseen APIs
- Structured Reflection - arXiv:2509.18847 - Diagnose failures, propose corrective actions
- STAR - arXiv:2503.06060 - Foundation model + knowledge graph for recovery (78% success)
- Toolken+ - arXiv:2410.12004 - “Reject” option to reduce false tool calls
Code examples are synthesized implementations illustrating practical patterns.