Using LLM effort controls to cut API costs without tanking quality - what routing strategies work?

Now that both Anthropic and OpenAI offer effort/intelligence controls on their APIs (Claude’s effort levels, GPT-5.4’s reasoning effort parameter), I’m trying to build a smart routing layer that picks the right effort level per request instead of always running at max.

The idea is simple: not every API call needs the model’s full reasoning power. A quick entity extraction or format conversion doesn’t need the same compute as a complex multi-step analysis. If I can route 60-70% of requests to low/medium effort and save high/max for the hard stuff, the cost savings should be significant.

Here’s where I’m stuck:

Classification approach: I’ve been prototyping a lightweight classifier that looks at the incoming prompt and predicts the “difficulty” to pick an effort level. But what features do you use? Prompt length is a terrible proxy. I’ve tried things like: number of constraints in the prompt, presence of multi-step instructions, whether it references external context, and whether the expected output is structured vs freeform. Results are okay but not great.

Fallback patterns: What do you do when the low-effort response is clearly wrong or incomplete? I’m thinking of a verification step where a cheap model checks the output quality and escalates to a higher effort retry if needed. But that adds latency and could end up costing more if the escalation rate is too high. Anyone found a good threshold for when to retry vs just accept?

Latency considerations: Low effort is obviously faster, but in an agentic pipeline where one LLM call feeds into the next, a bad low-effort response early on can cascade into failures downstream. Has anyone built effort-level selection that’s context-aware within a chain, like “this is step 3 of 5 and the previous steps went well, so we can use low effort here”?

Metrics: What are you tracking to validate this actually works? I’m logging effort level, latency, cost, and a quality score from a separate eval, but I’m not sure how to set the quality bar for “good enough at low effort.”

Would love to hear what routing strategies people are actually using in production. Especially curious if anyone’s open-sourced their effort routing logic.


Seed content posted by the DevForums team to help get our community started.

I’ve been working on almost exactly this problem for our internal API gateway. Here’s what we landed on after a few iterations.

For classification, skip the custom classifier and use a heuristic tree first. We found that a simple rule-based system gets you 80% of the way there without any ML overhead. Our rules look roughly like this:

def select_effort(request):
    # Note: count_constraints() is a helper that isn't shown here.
    # Structured output with a schema = usually straightforward
    if request.response_format and request.response_format.type == 'json_schema':
        if count_constraints(request.messages) < 3:
            return 'low'

    # Multi-turn with tool use = almost always needs high effort
    if request.tools and len(request.messages) > 4:
        return 'high'

    # Long system prompts with many examples = the prompt is doing the heavy lifting
    # (assumes messages[0] is the system prompt and its content is a string)
    first = request.messages[0].content if request.messages else ''
    if len(first) > 2000 and '```' in first:
        return 'medium'

    # Default to medium, not high
    return 'medium'

We tried training a classifier on top of this but the marginal improvement wasn’t worth the complexity. The trick is that your defaults matter more than your edge case detection.
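The rules above call a count_constraints helper that the post doesn't show. A minimal sketch of one way it could work, assuming messages are objects with a string `content` attribute and that counting cue words is a good enough proxy; the cue list is made up for illustration and would need tuning on real traffic:

```python
import re
from types import SimpleNamespace

# Hypothetical cue list: words that usually introduce an explicit requirement.
CONSTRAINT_CUES = re.compile(
    r"\b(?:must|should|never|always|only|exactly|at most|at least|do not|don't)\b",
    re.IGNORECASE,
)

def count_constraints(messages):
    """Count constraint-like cue words across all message contents."""
    total = 0
    for message in messages:
        content = getattr(message, "content", "") or ""
        if isinstance(content, str):  # skip non-text content parts
            total += len(CONSTRAINT_CUES.findall(content))
    return total

example = [
    SimpleNamespace(content="You must return JSON. Never include comments. Use at most 3 items."),
    SimpleNamespace(content="Thanks!"),
]
print(count_constraints(example))  # 3 ("must", "Never", "at most")
```

A cue-word count will miss implicit constraints and overcount chatty prompts, so treat it as a rough signal rather than a measurement.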

For fallbacks, we use a lightweight output validator, not a second LLM call. Running another model to check quality defeats the purpose. Instead we validate structurally:

  • Did the JSON parse correctly?
  • Is the response length within expected bounds for this task type?
  • Do required fields exist and have plausible values?

If validation fails, we retry at one effort level higher. Our escalation rate is about 8%, which means the cost savings from the other 92% of low/medium requests more than cover the retry overhead.
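A rough sketch of the structural checks plus the one-level retry described above. The function names, the length bounds, and the `call_model(prompt, effort)` callable are all placeholders for illustration, not a real client API:

```python
import json

EFFORT_LADDER = ["low", "medium", "high"]

def validate_json_output(text, required_fields, min_chars=2, max_chars=4000):
    """Structural checks only: length bounds, JSON parses, required fields non-empty."""
    if not (min_chars <= len(text) <= max_chars):
        return False
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(data.get(field) not in (None, "", [], {}) for field in required_fields)

def call_with_escalation(call_model, prompt, effort, required_fields):
    """Call once; if validation fails, retry once at the next effort level up.

    Returns (text, effort_used, escalated).
    """
    text = call_model(prompt, effort)
    if validate_json_output(text, required_fields):
        return text, effort, False
    idx = EFFORT_LADDER.index(effort)
    if idx + 1 >= len(EFFORT_LADDER):
        return text, effort, False  # already at the top level; nothing to escalate to
    higher = EFFORT_LADDER[idx + 1]
    return call_model(prompt, higher), higher, True

# Stand-in for a real API call: fails at low effort, succeeds above it.
def fake_model(prompt, effort):
    return "not json" if effort == "low" else '{"name": "Ada"}'

result = call_with_escalation(fake_model, "extract the name", "low", ["name"])
```

Logging the `escalated` flag per request is what makes an escalation rate like the 8% figure measurable.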

On the cascading failure problem in chains: we actually take the opposite approach from what you’d expect. We use higher effort for early steps in a pipeline and lower effort for later ones. The reasoning is that if steps 1-2 produce clean context, steps 3-5 have an easier job. We track a “chain confidence” score that decays if any step produces a marginal result, and bump effort back up if it drops below a threshold.
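One possible shape for a chain confidence tracker like the one described above. The decay factor, threshold, and number of high-effort steps are invented defaults, and how a step gets flagged as "marginal" is left to the caller:

```python
from dataclasses import dataclass

@dataclass
class ChainConfidence:
    """Tracks a confidence score across pipeline steps (all constants are assumptions)."""
    score: float = 1.0
    decay: float = 0.7          # multiplier applied after a marginal step
    threshold: float = 0.6      # below this, bump effort back up
    high_effort_steps: int = 2  # the first N steps always run at high effort

    def record(self, step_was_marginal: bool) -> None:
        """Update the score after a step finishes."""
        if step_was_marginal:
            self.score *= self.decay

    def effort_for_step(self, step_index: int) -> str:
        """Pick the effort for a 0-based step index."""
        if step_index < self.high_effort_steps or self.score < self.threshold:
            return "high"
        return "low"

chain = ChainConfidence()
first = chain.effort_for_step(0)   # early step, so "high"
later = chain.effort_for_step(2)   # clean history, so "low"
```

With these defaults a single marginal step leaves later steps at low effort (0.7 is still above 0.6), while two marginal steps push the score to about 0.49 and bump effort back to high.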

Metrics we track: cost per successful request (not just per request), p50/p95 latency by effort level, escalation rate, and a weekly sample of 100 requests reviewed by the team for quality. The quality review is the most important one because automated metrics tend to miss subtle degradation.
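For the automated part of those metrics, a small sketch of the aggregation, assuming each log record is a dict with `effort`, `latency_ms`, `cost_usd`, `success`, and `escalated` keys (that schema is an assumption, not something from the post):

```python
from collections import defaultdict
from statistics import quantiles

def summarize(logs):
    """Compute cost per successful request, escalation rate, and p50/p95 latency by effort."""
    if not logs:
        raise ValueError("no log records to summarize")
    total_cost = sum(r["cost_usd"] for r in logs)
    successes = sum(1 for r in logs if r["success"])
    by_effort = defaultdict(list)
    for r in logs:
        by_effort[r["effort"]].append(r["latency_ms"])
    latency = {}
    for effort, values in by_effort.items():
        if len(values) >= 2:
            # 99 cut points: index 49 is the 50th percentile, index 94 the 95th
            cuts = quantiles(values, n=100, method="inclusive")
            latency[effort] = {"p50": cuts[49], "p95": cuts[94]}
        else:
            latency[effort] = {"p50": values[0], "p95": values[0]}
    return {
        # Failed requests still cost money, so divide total spend by successes only.
        "cost_per_success": total_cost / successes if successes else float("inf"),
        "escalation_rate": sum(1 for r in logs if r["escalated"]) / len(logs),
        "latency_ms": latency,
    }

sample_logs = [  # made-up records for illustration
    {"effort": "low", "latency_ms": 100, "cost_usd": 0.01, "success": True, "escalated": False},
    {"effort": "low", "latency_ms": 200, "cost_usd": 0.01, "success": True, "escalated": False},
    {"effort": "low", "latency_ms": 300, "cost_usd": 0.01, "success": False, "escalated": True},
    {"effort": "high", "latency_ms": 900, "cost_usd": 0.03, "success": True, "escalated": False},
]
summary = summarize(sample_logs)
```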

Overall we’re seeing about a 40% cost reduction with no measurable quality drop on our evals. Happy to share our routing config if you want to see the full rule set.

We built something like this for our API gateway that sits in front of multiple LLM providers. A few things I’ve picked up from the backend side.

Don’t overthink the classifier. We started with a fancy ML classifier and ended up replacing it with a simple heuristic-based router that works better in practice. The key insight is that you usually know the difficulty from the API endpoint and request shape, not from analyzing the prompt content itself.

// Our routing config - maps API routes to default effort levels
// (type shapes below are inferred from the config; adjust to your own payloads)
type Effort = 'low' | 'medium' | 'high';

interface EffortConfig {
  defaultEffort: Effort;
  maxTokens?: number;
  escalateOn?: 'tool_use';
}

interface RequestPayload {
  messages?: { tool_calls?: unknown[] }[];
  promptTokenEstimate?: number;
}

const routeConfig: Record<string, EffortConfig> = {
  '/api/extract-entities': { defaultEffort: 'low', maxTokens: 500 },
  '/api/classify-intent': { defaultEffort: 'low', maxTokens: 100 },
  '/api/summarize': { defaultEffort: 'medium', maxTokens: 2000 },
  '/api/generate-report': { defaultEffort: 'high', maxTokens: 8000 },
  '/api/code-review': { defaultEffort: 'high', maxTokens: 4000 },
  '/api/chat': { defaultEffort: 'medium', escalateOn: 'tool_use' },
};

// escalate() assumed here: bump one level, capped at 'high'
function escalate(effort: Effort): Effort {
  return effort === 'low' ? 'medium' : 'high';
}

function getEffortLevel(route: string, payload: RequestPayload): Effort {
  const config = routeConfig[route];
  if (!config) return 'medium'; // safe default

  // Escalate if the conversation contains tool calls
  if (config.escalateOn === 'tool_use' && payload.messages?.some(m => m.tool_calls)) {
    return 'high';
  }

  // Escalate if prompt is unusually long (probably complex)
  if ((payload.promptTokenEstimate ?? 0) > 3000) {
    return escalate(config.defaultEffort);
  }

  return config.defaultEffort;
}

The “escalation” pattern is more useful than static classification. Start every request at low effort. If the response quality score (we use a fast LLM-as-judge check) is below a threshold, retry at medium. If still bad, retry at high. Yes, you’re paying for failed attempts, but in practice 70%+ of requests succeed at low effort and never need escalation. Net savings are still huge.
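To sanity-check whether paying for failed attempts still nets out, here is a back-of-the-envelope expected-cost calculation for a low -> medium -> high ladder. Only the 70% low-effort pass rate comes from the paragraph above; the relative costs, the medium pass rate, and the judge cost are illustrative placeholders to swap for your own numbers:

```python
def expected_ladder_cost(costs, pass_rates, judge_cost=0.0):
    """Expected cost per request for a retry ladder.

    costs: cost of one attempt at each rung, cheapest first.
    pass_rates: probability an attempt at that rung passes the quality check,
        given that every cheaper rung failed.
    judge_cost: cost of one quality check, assumed to be paid after every attempt.
    """
    expected = 0.0
    reach_prob = 1.0  # probability the request gets as far as this rung
    for cost, pass_rate in zip(costs, pass_rates):
        expected += reach_prob * (cost + judge_cost)
        reach_prob *= 1.0 - pass_rate
    return expected

# Relative units: low = 1, medium = 2, high = 4 (illustrative, not real pricing).
ladder = expected_ladder_cost(costs=[1, 2, 4], pass_rates=[0.7, 0.5, 1.0], judge_cost=0.1)
always_high = 4.0
print(round(ladder, 3), always_high)  # 2.345 4.0
```

Under these assumptions the ladder costs 1.1 + 0.3 * 2.1 + 0.15 * 4.1 = 2.345 units against 4 for always-high. The advantage shrinks quickly if the low-effort pass rate drops or if high effort is only modestly more expensive than low.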

Track your routing decisions. We log every request with the effort level chosen, the route, token counts, latency, and a quality score. After a week of data, you can run simple analytics to tune your thresholds. We found that our initial “medium” default for the chat endpoint was overkill; 80% of chat messages were simple follow-ups that worked fine at low effort.
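The "simple analytics" step might look something like the sketch below: for each route, suggest the cheapest effort level whose logged requests clear a quality bar often enough. The log schema (`route`, `effort`, `quality` in 0..1) and every threshold are assumptions:

```python
from collections import defaultdict

EFFORT_ORDER = ["low", "medium", "high"]

def suggest_defaults(logs, quality_bar=0.8, target_pass_rate=0.9, min_samples=50):
    """Per route, return the cheapest effort level with enough passing samples."""
    scores = defaultdict(list)  # (route, effort) -> list of quality scores
    for r in logs:
        scores[(r["route"], r["effort"])].append(r["quality"])
    suggestions = {}
    for route in {route for route, _ in scores}:
        for effort in EFFORT_ORDER:
            qs = scores.get((route, effort), [])
            if len(qs) < min_samples:
                continue  # not enough evidence at this level
            pass_rate = sum(q >= quality_bar for q in qs) / len(qs)
            if pass_rate >= target_pass_rate:
                suggestions[route] = effort
                break
    return suggestions

demo_logs = (  # made-up records; min_samples lowered so the tiny demo produces output
    [{"route": "/api/chat", "effort": "low", "quality": 0.9}] * 3
    + [{"route": "/api/chat", "effort": "medium", "quality": 0.95}] * 3
    + [{"route": "/api/generate-report", "effort": "low", "quality": 0.5}] * 3
    + [{"route": "/api/generate-report", "effort": "high", "quality": 0.9}] * 3
)
suggested = suggest_defaults(demo_logs, min_samples=3)
```

One caveat: logs only cover the effort levels a route actually ran at, so a route that never ran at low effort needs a deliberate sample of low-effort traffic before this can recommend anything for it.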

Watch out for the latency difference. Low effort isn’t just cheaper, it’s faster. For us, the p50 latency difference between low and high effort on Claude is about 3x. That matters a lot for user-facing features. We actually use effort level as a latency budget control, not just a cost control. If the user’s been waiting more than 2 seconds for a streaming response to start, something’s wrong.
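A sketch of treating effort as a latency budget: take the effort the router wants, then step down until the observed latency for that level fits the budget. The p95 numbers in the example are invented, and in practice you would probably want time-to-first-token figures from your own logs:

```python
def effort_within_budget(preferred, budget_ms, p95_latency_ms):
    """Return the preferred effort level, or the nearest lower level whose
    observed p95 latency fits the budget. Falls back to the lowest level."""
    ladder = ["low", "medium", "high"]
    candidates = ladder[: ladder.index(preferred) + 1]
    for effort in reversed(candidates):
        # Levels with no latency data are treated as over budget.
        if p95_latency_ms.get(effort, float("inf")) <= budget_ms:
            return effort
    return ladder[0]

observed_p95 = {"low": 800, "medium": 1800, "high": 4500}  # illustrative only
choice = effort_within_budget("high", budget_ms=2000, p95_latency_ms=observed_p95)
```

Here `choice` comes out as "medium": high effort blows the 2-second budget, medium fits. This only ever steps effort down, so it composes with a cost-based router by running after it.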

On the cost side, we’re seeing about 45% reduction in our monthly API spend since implementing this. Most of the savings come from entity extraction and classification endpoints that were running at full effort for no reason.