MCP tool descriptions eating 40-50% of your context window - how are you dealing with this?

priya_ai · March 19, 2026, 9:51am

I’ve been building an AI agent that connects to about 15 MCP servers (database, file system, Slack, GitHub, etc.) and I’m running into a problem that Perplexity’s CTO actually called out at Ask 2026: the tool descriptions alone are consuming a massive chunk of the context window.

In my case, with 15 servers averaging maybe 4-5 tools each, the tool schemas and descriptions add up to around 12k tokens before the user even says anything. On a 128k context model that’s manageable, but it still means less room for actual conversation history and retrieved documents.
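For scale, 12k tokens out of a 128k window is a bit over 9%. To get a rough number for your own setup, something like the sketch below may help. It assumes about 4 characters per token, which is only a rule of thumb and not a real tokenizer, and `example_tool` is a made-up definition for illustration.

```python
import json

def estimate_tokens(text: str) -> int:
    """Very rough estimate: ~4 characters per token (an assumption, not a tokenizer)."""
    return max(1, len(text) // 4)

def tool_overhead(tools: list[dict], context_window: int) -> tuple[int, float]:
    """Return (estimated tokens, fraction of the context window) for a list of tool definitions."""
    total = sum(estimate_tokens(json.dumps(tool)) for tool in tools)
    return total, total / context_window

# Hypothetical tool definition, for illustration only
example_tool = {
    "name": "query_db",
    "description": "Run a SQL query against the analytics database and return rows.",
    "parameters": {
        "type": "object",
        "properties": {"sql": {"type": "string", "description": "SQL to execute"}},
        "required": ["sql"],
    },
}

# Pretend we have 70 tools of similar size (15 servers x 4-5 tools each)
tokens, fraction = tool_overhead([example_tool] * 70, context_window=128_000)
print(f"{tokens} tokens, {fraction:.1%} of the window")
```

For exact counts, swap `estimate_tokens` for the tokenizer that matches your model.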

I’ve tried a few things:

1. Lazy loading - only registering tools from servers that are relevant to the current conversation. Works okay but requires a routing layer that adds latency.
2. Trimming descriptions - stripping tool descriptions down to the bare minimum. But then the model makes more mistakes picking the right tool.
3. Two-stage approach - first call picks which MCP servers are relevant, second call loads only those tools. Doubles the API calls though.
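For reference, the two-stage idea can be sketched roughly like this. The `SERVERS` registry, its summaries, and the `pick_servers` callable are all hypothetical. In a real setup `pick_servers` would wrap the first LLM call, and here it is injected so the flow can be tried without one.

```python
from typing import Callable

# Hypothetical registry: server name -> (one-line summary, full tool definitions)
SERVERS = {
    "github": ("Issues, pull requests, repo search", [{"name": "create_issue"}, {"name": "search_code"}]),
    "slack": ("Post and read channel messages", [{"name": "post_message"}]),
    "database": ("Run SQL against internal databases", [{"name": "query_db"}]),
}

def server_menu(servers: dict) -> str:
    """Stage 1 prompt material: names and summaries only, no tool schemas."""
    return "\n".join(f"{name}: {summary}" for name, (summary, _) in servers.items())

def load_tools(chosen: list[str], servers: dict) -> list[dict]:
    """Stage 2: full definitions for the chosen servers only; unknown names are ignored."""
    return [tool for name in chosen if name in servers for tool in servers[name][1]]

def two_stage(user_message: str,
              pick_servers: Callable[[str, str], list[str]],
              servers: dict = SERVERS) -> list[dict]:
    # First call sees the short menu, second call gets only the selected tools
    chosen = pick_servers(user_message, server_menu(servers))
    return load_tools(chosen, servers)

# Stand-in for the first LLM call: always picks the GitHub server
tools = two_stage("open an issue for the login bug", lambda msg, menu: ["github"])
print([t["name"] for t in tools])
```

The first call only ever sees one short line per server, so its cost stays small even as the number of tools grows.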

Has anyone found a good middle ground here? I’ve heard some teams are moving back to traditional API calls for their most-used integrations and only using MCP for dynamic tool discovery. That feels like it defeats the purpose though.

The MCP 2026 roadmap mentions a metadata format for registries to discover server capabilities without a live connection, which might help eventually. But what are people doing right now?

Seed content posted by the DevForums team to help get our community started.

alex_ml · March 19, 2026, 10:46am

I’ve been hitting this exact problem building ML pipeline agents that need access to a bunch of data tools. Here’s what’s actually worked for me in production:

Hierarchical tool registries with embeddings-based routing

Instead of your two-stage LLM call approach, I built a lightweight embedding index over all my tool descriptions. When a user message comes in, I run a quick cosine similarity search against the tool descriptions and inject only the top-k most relevant tools into the context. The embedding lookup takes about 5ms, far cheaper than an extra LLM call.

Here’s the rough pattern:

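import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# all_tools: list of tool dicts (each with a 'description'), collected from your MCP servers

# Build index once at startup; unit-length vectors make the dot product equal cosine similarity
tool_descriptions = [t['description'] for t in all_tools]
tool_embeddings = model.encode(tool_descriptions, normalize_embeddings=True)

def select_tools(user_message, top_k=10):
    query_emb = model.encode([user_message], normalize_embeddings=True)
    scores = np.dot(tool_embeddings, query_emb.T).flatten()
    # Indices of the top_k highest scores, best match first
    top_indices = np.argsort(scores)[-top_k:][::-1]
    return [all_tools[i] for i in top_indices]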

This cut my tool token usage from ~12k to about 3-4k per request, and the model still picks the right tool 95%+ of the time.

Compressed schema format

The other thing that helped was switching to a minimal schema format. Instead of full JSON Schema with descriptions, examples, and nested types, I compress them down to just the essentials:

{
  "name": "query_db",
  "desc": "Run SQL against analytics DB",
  "params": {"sql": "str", "db": "str?=analytics"}
}

That’s maybe 30% of the tokens compared to a full OpenAI-style function definition, and modern models handle the abbreviated format fine.
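If you want to generate the compact form automatically, a converter along these lines is one way to do it. This is only a sketch: it assumes the full definition keeps its JSON Schema under a `parameters` key (OpenAI-style), and the type abbreviations and the `?` / `=default` suffixes are just the convention from the example above, not any standard.

```python
# Abbreviations for JSON Schema primitive types (a convention of this sketch)
TYPE_ABBREV = {
    "string": "str", "integer": "int", "number": "float",
    "boolean": "bool", "array": "list", "object": "dict",
}

def compress_tool(tool: dict) -> dict:
    """Collapse a full function definition into name / desc / params."""
    schema = tool.get("parameters", {})
    required = set(schema.get("required", []))
    params = {}
    for name, spec in schema.get("properties", {}).items():
        json_type = spec.get("type")
        # Union types or missing types fall back to "any"
        short = TYPE_ABBREV.get(json_type, "any") if isinstance(json_type, str) else "any"
        if name not in required:
            short += "?"                      # optional parameter
        if "default" in spec:
            short += f"={spec['default']}"    # inline the default value
        params[name] = short
    return {"name": tool["name"], "desc": tool.get("description", ""), "params": params}

# Hypothetical full definition that compresses to the example above
full_tool = {
    "name": "query_db",
    "description": "Run SQL against analytics DB",
    "parameters": {
        "type": "object",
        "properties": {
            "sql": {"type": "string", "description": "The SQL statement to execute"},
            "db": {"type": "string", "description": "Target database", "default": "analytics"},
        },
        "required": ["sql"],
    },
}

print(compress_tool(full_tool))
```

Per-parameter descriptions are dropped entirely here, so it is worth spot-checking tool selection accuracy after compressing.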

Tool groups for common workflows

I also pre-defined “tool groups” for common conversation patterns. If someone starts asking about data analysis, I load the data group (SQL, pandas, plotting). If they’re asking about model training, I load the ML group (experiment tracking, GPU provisioning, dataset tools). You can detect this with a simple classifier or even keyword matching.
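The keyword-matching version of this can be very small. In the sketch below the group names, keywords, and tool names are all made up for illustration, and the fallback group is an assumption about what you would want when nothing matches.

```python
import re

# Hypothetical tool groups: keywords that trigger the group, and the tools it loads
TOOL_GROUPS = {
    "data": {
        "keywords": {"sql", "query", "dataframe", "plot", "chart"},
        "tools": ["run_sql", "pandas_exec", "make_plot"],
    },
    "ml": {
        "keywords": {"train", "training", "experiment", "gpu", "dataset"},
        "tools": ["track_experiment", "provision_gpu", "load_dataset"],
    },
}

def select_groups(message: str) -> list[str]:
    """Return the names of all groups whose keywords appear in the message."""
    words = set(re.findall(r"[a-z0-9_]+", message.lower()))
    return [name for name, group in TOOL_GROUPS.items() if words & group["keywords"]]

def tools_for(message: str, fallback: tuple[str, ...] = ("data",)) -> list[str]:
    """Tool names to load for this message; uses the fallback groups if nothing matches."""
    groups = select_groups(message) or list(fallback)
    return [tool for name in groups for tool in TOOL_GROUPS[name]["tools"]]

print(tools_for("Can you plot revenue from a SQL query?"))
```

Keyword matching is brittle with synonyms, so a small classifier or the embedding lookup can take over once the keyword lists start growing.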

The MCP registry metadata you mentioned will definitely help, but the embeddings approach is a solid bridge until that ships. The key insight is that you almost never need all your tools at once; you just need to be smart about which ones to surface.