I’ve been building an AI agent that connects to about 15 MCP servers (database, file system, Slack, GitHub, etc.), and I’m running into a problem that Perplexity’s CTO called out at Ask 2026: the tool descriptions alone consume a large chunk of the context window.
In my case, with 15 servers averaging maybe 4-5 tools each, the tool schemas and descriptions add up to around 12k tokens before the user even says anything. On a 128k context model that’s manageable, but it still means less room for actual conversation history and retrieved documents.
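To put rough numbers on that, here is a back-of-the-envelope calculation using only the figures above. The variable and function names are my own, and the 12k figure is an estimate, so treat the results as approximate.

```python
# Figures taken from the post; all of them are rough estimates.
SERVERS = 15
SCHEMA_TOKENS = 12_000      # tool schemas + descriptions
CONTEXT_WINDOW = 128_000    # model context size


def overhead_fraction(schema_tokens: int, context_window: int) -> float:
    """Share of the context window used by tool definitions."""
    return schema_tokens / context_window


def tokens_per_tool(schema_tokens: int, servers: int, tools_per_server: int) -> float:
    """Average tokens spent describing a single tool."""
    return schema_tokens / (servers * tools_per_server)


fraction = overhead_fraction(SCHEMA_TOKENS, CONTEXT_WINDOW)
per_tool_low = tokens_per_tool(SCHEMA_TOKENS, SERVERS, 5)   # 75 tools total
per_tool_high = tokens_per_tool(SCHEMA_TOKENS, SERVERS, 4)  # 60 tools total

print(f"Overhead: {fraction:.1%} of the window")
print(f"Per tool: {per_tool_low:.0f}-{per_tool_high:.0f} tokens")
```

So the fixed cost is a bit under a tenth of the window, at roughly 160 to 200 tokens per tool.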
I’ve tried a few things:
- Lazy loading: only registering tools from servers that are relevant to the current conversation. Works okay but requires a routing layer that adds latency.
- Trimming descriptions: stripping tool descriptions down to the bare minimum. But then the model makes more mistakes picking the right tool.
- Two-stage approach: first call picks which MCP servers are relevant, second call loads only those tools. Doubles the API calls though.
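For reference, the lazy-loading idea can be sketched roughly as below. This is not tied to any MCP SDK: `Tool`, `Server`, `select_servers`, and the keyword sets are all hypothetical names I made up for illustration, and plain keyword overlap stands in for whatever the routing layer actually does (embeddings, a small classifier, or a first model call in the two-stage variant).

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    approx_tokens: int  # estimated size of schema + description


@dataclass(frozen=True)
class Server:
    name: str
    keywords: frozenset  # hypothetical routing hints, maintained by hand
    tools: tuple


def select_servers(message, servers, max_servers=3):
    """Rank servers by keyword overlap with the message; drop non-matches."""
    words = set(re.findall(r"[a-z0-9]+", message.lower()))
    scored = [(len(words & s.keywords), s) for s in servers]
    relevant = [pair for pair in scored if pair[0] > 0]
    # sort() is stable, so ties keep the original server order.
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in relevant[:max_servers]]


def tools_for(message, servers, max_servers=3):
    """Return only the tools that would be registered for this turn."""
    chosen = select_servers(message, servers, max_servers)
    return [tool for s in chosen for tool in s.tools]


SERVERS = [
    Server("github", frozenset({"github", "pr", "issue", "repo", "commit"}),
           (Tool("create_pull_request", 180), Tool("list_issues", 160))),
    Server("slack", frozenset({"slack", "channel", "dm"}),
           (Tool("post_message", 170),)),
    Server("database", frozenset({"sql", "query", "table", "database"}),
           (Tool("run_query", 200), Tool("describe_table", 150))),
]

message = "Open a PR for the repo and post in the Slack channel"
selected = [s.name for s in select_servers(message, SERVERS)]
loaded = tools_for(message, SERVERS)
loaded_tokens = sum(t.approx_tokens for t in loaded)
all_tokens = sum(t.approx_tokens for s in SERVERS for t in s.tools)

print(selected, loaded_tokens, "of", all_tokens, "tokens")
```

A local filter like this avoids the second API call, but it inherits the weakness of trimmed descriptions: if the keywords miss, the right tool never reaches the model. A fallback that loads everything when nothing matches would probably be needed in practice.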
Has anyone found a good middle ground here? I’ve heard some teams are moving back to traditional API calls for their most-used integrations and only using MCP for dynamic tool discovery. That feels like it defeats the purpose though.
The MCP 2026 roadmap mentions a metadata format for registries to discover server capabilities without a live connection, which might help eventually. But what are people doing right now?
Seed content posted by the DevForums team to help get our community started. Have a better answer? Jump in!