Give an agent too many tools and it picks the wrong one

To choose a tool, the agent reads all of them. With a hundred near-identical options, the right one is buried among distractors, and selection accuracy collapses.

Balagei G Nagarajan


Short answer. Your agent picks the wrong tool because it chooses by reading every tool definition in its context, and a large, overlapping toolset buries the right one among distractors while bloating the prompt. Stop showing the model every tool. Index your tools and retrieve only the few relevant to each task. In the RAG-MCP stress test, doing so more than tripled selection accuracy and cut prompt tokens by over half.

[Figure: Facing a wall of near-identical tools, the agent hesitates and reaches for the wrong one. The right tool is lost in the crowd.]

Key facts.

  • More tools, worse choices: in a controlled stress test, retrieving only the relevant tools more than tripled tool-selection accuracy, from 13.62% to 43.13%, while cutting prompt tokens by over 50% (RAG-MCP, arXiv:2505.03275, 2025).
  • The degradation is sharp: the same test varied the pool from 1 to over 1,100 tools and found accuracy and task success held up below about 30 tools but fell off steeply past roughly 100, as distractors and semantic overlap overwhelmed selection (RAG-MCP, 2025).
  • The cost is double: every tool definition you add fills the context window, so a sprawling toolset both confuses the model and inflates the token bill of every single call (RAG-MCP, 2025).

Why does adding tools make the agent worse?

Because every tool other than the right one is a distractor the model has to rule out. To choose a tool, the agent reads all the tool definitions in its context and picks one. With a handful of tools that is easy. With a hundred, many of them similar, the right choice is buried among near-duplicates, and the model, which cannot reliably filter relevant from irrelevant, picks a plausible wrong one. The RAG-MCP work frames it bluntly: tool sprawl does not just add options, it poisons the decision by flooding the context with distractors. Two things compound it: semantic overlap, where several tools sound like they could do the job, and prompt bloat, where the sheer volume of definitions crowds out the actual task. More tools do not mean more capability. Past a point they mean less.

How fast does it degrade?

Faster than most teams expect. The RAG-MCP stress test scaled the tool pool from one to over a thousand and measured selection accuracy at each step. Below roughly thirty tools, the agent held up. Past about a hundred, accuracy and task success fell off sharply, with failures dominating at the high end as distractors and overlap won. The headline numbers make the size of the problem concrete: showing the model every tool gave 13.62% selection accuracy, while retrieving only the relevant ones first reached 43.13%, more than three times better, on the same tasks. The lesson is that a large tool registry is not a neutral convenience. It actively caps how well the agent can choose.
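The "more than three times" figure follows directly from the two accuracy numbers quoted above. A quick check, using only those two reported values:

```python
# Figures quoted above from the RAG-MCP stress test (arXiv:2505.03275).
baseline_accuracy = 13.62   # % correct when the model sees every tool
retrieval_accuracy = 43.13  # % correct when only retrieved tools are shown

ratio = retrieval_accuracy / baseline_accuracy
gain_points = retrieval_accuracy - baseline_accuracy

print(f"relative improvement: {ratio:.2f}x")                  # relative improvement: 3.17x
print(f"absolute gain: {gain_points:.2f} percentage points")  # absolute gain: 29.51 percentage points
```

The relative gain is large, but the absolute figure matters too: even with retrieval, the agent in this stress test picked the right tool less than half the time.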

[Figure: Selection accuracy as a heatmap against tool count: green and reliable under ~30 tools, sliding to red past ~100 as distractors take over.]

The fix is retrieval, not a bigger prompt

Stop showing the model every tool. The reliable fix is to treat tools the way you treat documents in RAG: index their descriptions, and at query time retrieve only the handful relevant to the current task, then show the model just those. RAG-MCP does exactly this, and in its stress test it more than tripled selection accuracy while cutting prompt tokens by over half, because the model now chooses among a few good candidates instead of hundreds of distractors. This scales: the registry can hold thousands of tools, but the model only ever sees the few that matter for the request. It is the same move that fixes context bloat everywhere: retrieve what is relevant instead of stuffing everything into the prompt.
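To make the idea concrete, here is a minimal sketch of per-query tool retrieval. It is not the RAG-MCP implementation: it swaps in a toy bag-of-words cosine similarity where a real system would use a semantic embedding model, and the tool names, descriptions, and the `retrieve_tools` function are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical registry: names and descriptions are invented for illustration.
TOOLS = {
    "get_weather": "Get the current weather forecast for a city",
    "send_email": "Send an email message to a recipient",
    "search_flights": "Search available airline flights between two airports",
    "create_invoice": "Create a billing invoice for a customer",
    "convert_currency": "Convert an amount of money from one currency to another",
}

def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words: a toy stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

# Index the descriptions once, up front.
INDEX = {name: tokenize(desc) for name, desc in TOOLS.items()}

def retrieve_tools(query: str, k: int = 2) -> list[str]:
    """Return up to k tool names whose descriptions best match the query."""
    q = tokenize(query)
    scored = [(cosine(q, vec), name) for name, vec in INDEX.items()]
    scored = [(score, name) for score, name in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]

# Only this shortlist, not the whole registry, would go into the prompt.
shortlist = retrieve_tools("what is the weather forecast in Paris", k=2)
print(shortlist)  # ['get_weather']
```

Word overlap is a crude signal: common words such as "to" or "an" can pull in unrelated tools, which is one reason to use embeddings in practice. The shape of the pipeline stays the same either way: index once, score per query, pass only the top few definitions to the model.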

How to keep the toolset sane

| Practice | What it does |
| --- | --- |
| Retrieve tools per query | Show the model only the few relevant tools, not the whole registry |
| Keep the per-call set small | Aim for a handful of tools in context, well under the degradation point |
| Remove near-duplicates | Merge or namespace tools that overlap, so the model is not guessing between them |
| Group hierarchically | Pick a category first, then a tool within it, instead of one flat list |
| Name and describe clearly | Distinct names and crisp descriptions make the right tool easier to find |
| Measure selection accuracy | Track how often the agent picks the right tool as you add more |
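One of these practices, removing near-duplicates, can be roughed out in a few lines. The sketch below flags pairs of tools whose descriptions share most of their words, using Jaccard similarity. The registry, the `near_duplicates` helper, and the 0.6 threshold are assumptions for illustration, and a real audit would likely compare embeddings rather than raw word sets.

```python
import re
from itertools import combinations

# Hypothetical registry with one deliberate overlap.
TOOLS = {
    "search_docs": "Search internal documents by keyword",
    "find_documents": "Search internal documents by keyword or phrase",
    "send_email": "Send an email message to a recipient",
}

def words(text: str) -> set[str]:
    """Set of lowercase words in a description."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a: set[str], b: set[str]) -> float:
    """Shared words divided by total distinct words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def near_duplicates(tools: dict[str, str], threshold: float = 0.6):
    """Return (name_a, name_b, score) for every pair at or above the threshold."""
    flagged = []
    for (name_a, desc_a), (name_b, desc_b) in combinations(tools.items(), 2):
        score = jaccard(words(desc_a), words(desc_b))
        if score >= threshold:
            flagged.append((name_a, name_b, round(score, 2)))
    return flagged

overlaps = near_duplicates(TOOLS)
print(overlaps)  # [('search_docs', 'find_documents', 0.71)]
```

Each flagged pair is a candidate for merging into one tool or for sharper, more distinct descriptions.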

The pattern is that an agent chooses a tool by reading them all, so every tool you add makes the choice harder and the prompt heavier, until accuracy collapses under the weight of its own toolbox. Retrieve the few relevant tools per task instead of showing the whole registry, keep the per-call set small, and remove overlap. None of that requires a bigger model, which gets just as confused by a hundred near-identical tools. It requires a retrieval layer that hands the model only the tools that matter, which is what VibeModel builds as the Pattern Intelligence Layer.

Frequently asked questions

How many tools is too many?
In the RAG-MCP stress test, agents held up below about 30 tools and degraded sharply past roughly 100. Treat a few dozen as a soft ceiling for what the model sees at once, and retrieve a smaller relevant subset per task.
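The thresholds above come from one study, so it is worth measuring where your own agent starts to slip. Below is a minimal harness sketch, assuming you have labelled cases of the form (task, correct tool). The `naive_selector` is a deliberately crude stand-in for your agent, not an LLM, and every name here is hypothetical; its scores only show the harness working and say nothing about real models.

```python
from typing import Callable, Sequence

Case = tuple[str, str]  # (task text, name of the correct tool)
Selector = Callable[[str, Sequence[str]], str]

def selection_accuracy(
    select_tool: Selector, cases: Sequence[Case], tool_pool: Sequence[str]
) -> float:
    """Fraction of cases where the selector picks the expected tool."""
    if not cases:
        return 0.0
    correct = sum(
        1 for task, expected in cases if select_tool(task, tool_pool) == expected
    )
    return correct / len(cases)

def naive_selector(task: str, tool_pool: Sequence[str]) -> str:
    """Stand-in for the agent: first tool whose name shares a word with the task."""
    task_words = set(task.lower().split())
    for name in tool_pool:
        if task_words & set(name.split("_")):
            return name
    return tool_pool[0]

# Hypothetical labelled cases and tool pools.
CASES = [
    ("check the weather in Oslo", "get_weather"),
    ("send the report by email", "send_email"),
]
SMALL_POOL = ["get_weather", "send_email"]
LARGE_POOL = ["send_report", "weather_alerts"] + SMALL_POOL  # adds two distractors

small_acc = selection_accuracy(naive_selector, CASES, SMALL_POOL)
large_acc = selection_accuracy(naive_selector, CASES, LARGE_POOL)
print(f"small pool: {small_acc:.0%}, large pool: {large_acc:.0%}")
```

To use it for real, replace `naive_selector` with a function that calls your agent, then rerun the same cases at several pool sizes and watch where accuracy starts to fall.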

Isn't a bigger context window the answer?
No. A bigger window lets you cram in more tool definitions, which adds more distractors and more prompt bloat. The fix is to show fewer, relevant tools, not to make room for all of them.

What's the single best fix?
Tool retrieval: index your tool descriptions and fetch only the ones relevant to the current request, then show the model just those. That alone more than tripled selection accuracy and halved prompt tokens in the study.

