Now that both Anthropic and OpenAI offer effort/intelligence controls on their APIs (Claude’s effort levels, GPT-5.4’s reasoning effort parameter), I’m trying to build a smart routing layer that picks the right effort level per request instead of always running at max.
The idea is simple: not every API call needs the model’s full reasoning power. A quick entity extraction or format conversion doesn’t need the same compute as a complex multi-step analysis. If I can route 60-70% of requests to low/medium effort and save high/max for the hard stuff, the cost savings should be significant.
Here’s where I’m stuck:
Classification approach: I’ve been prototyping a lightweight classifier that looks at the incoming prompt and predicts the “difficulty” to pick an effort level. But what features do you use? Prompt length is a terrible proxy. I’ve tried things like: number of constraints in the prompt, presence of multi-step instructions, whether it references external context, and whether the expected output is structured vs freeform. Results are okay but not great.
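A minimal sketch of how the four features listed above could be pulled out of a prompt, assuming plain regex heuristics. The keyword lists, the score weights, and the `low`/`medium`/`high` cutoffs are illustrative assumptions, not a tested classifier. In practice the features would more likely feed a small trained model than a hand-written score.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword lists; tune these against your own traffic.
CONSTRAINT_RE = re.compile(
    r"\b(?:must|should|only|exactly|at least|at most|no more than|never|always)\b", re.I
)
STEP_RE = re.compile(
    r"(?:^\s*\d+[.)]\s)|\b(?:first|then|next|after that|finally|step \d+)\b", re.I | re.M
)
CONTEXT_RE = re.compile(
    r"\b(?:attached|provided|following|above|below)\s+(?:document|context|text|code|data|file)\b",
    re.I,
)
STRUCTURED_RE = re.compile(r"\b(?:json|csv|yaml|xml|table|schema)\b", re.I)


@dataclass
class PromptFeatures:
    constraints: int         # count of constraint-like keywords
    steps: int               # count of multi-step markers
    uses_context: bool       # prompt refers to external/attached context
    structured_output: bool  # prompt asks for a structured format


def extract_features(prompt: str) -> PromptFeatures:
    return PromptFeatures(
        constraints=len(CONSTRAINT_RE.findall(prompt)),
        steps=len(STEP_RE.findall(prompt)),
        uses_context=bool(CONTEXT_RE.search(prompt)),
        structured_output=bool(STRUCTURED_RE.search(prompt)),
    )


def pick_effort(f: PromptFeatures) -> str:
    # Assumed weighting: multi-step instructions and external context push
    # difficulty up; a structured output format nudges it down slightly.
    score = f.constraints + 2 * f.steps
    score += 2 if f.uses_context else 0
    score -= 1 if f.structured_output else 0
    if score <= 1:
        return "low"
    if score <= 5:
        return "medium"
    return "high"
```

With these assumed weights, `pick_effort(extract_features("Convert this list to JSON."))` lands on `low`, while a prompt with several sequenced steps plus hard constraints lands on `high`.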
Fallback patterns: What do you do when the low-effort response is clearly wrong or incomplete? I’m thinking of a verification step where a cheap model checks the output quality and escalates to a higher-effort retry if needed. But that adds latency and could end up costing more than always running at high effort if the escalation rate is too high. Has anyone found a good threshold for when to retry vs. just accept?
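One way to reason about the retry threshold is a break-even calculation. If every request pays for a low-effort call plus a verification call, and a fraction p of requests is escalated, the expected cost is c_low + c_verify + p * c_high, which beats always running high effort only while p < 1 - (c_low + c_verify) / c_high. The sketch below encodes that arithmetic and a bare-bones cascade. `call_model` and `verify` are hypothetical callables you would supply, and the 0.7 acceptance score is an arbitrary placeholder, not a recommended value.

```python
from typing import Callable, Tuple


def expected_cascade_cost(c_low: float, c_verify: float, c_high: float,
                          escalation_rate: float) -> float:
    """Expected per-request cost of low effort + verifier + occasional retry."""
    return c_low + c_verify + escalation_rate * c_high


def break_even_escalation_rate(c_low: float, c_verify: float, c_high: float) -> float:
    """Escalation rate above which the cascade costs more than always-high."""
    return 1.0 - (c_low + c_verify) / c_high


def answer_with_escalation(
    prompt: str,
    call_model: Callable[[str, str], str],   # hypothetical: (prompt, effort) -> text
    verify: Callable[[str, str], float],     # hypothetical: (prompt, draft) -> score in [0, 1]
    min_score: float = 0.7,                  # assumed acceptance bar
) -> Tuple[str, str]:
    """Try low effort first; retry at high effort if the verifier rejects the draft."""
    draft = call_model(prompt, "low")
    if verify(prompt, draft) >= min_score:
        return draft, "low"
    return call_model(prompt, "high"), "high"
```

For example, with made-up unit costs of 1.0 (low), 0.5 (verifier), and 6.0 (high), `break_even_escalation_rate(1.0, 0.5, 6.0)` gives 0.75, so the cascade only saves money while fewer than 75% of requests escalate. This ignores the added latency of the second call, which needs its own budget.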
Latency considerations: Low effort is obviously faster, but in an agentic pipeline where one LLM call feeds into the next, a bad low-effort response early on can cascade into failures downstream. Has anyone built effort-level selection that’s context-aware within a chain, like “this is step 3 of 5 and the previous steps went well, so we can use low effort here”?
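To make the "step 3 of 5" idea concrete, here is a rough sketch of a chain-aware selector. It assumes each completed step has a quality score in [0, 1] from some evaluator, and it bumps the effort level when many downstream steps depend on the current one or when an earlier step looked shaky. The function name, the three-level scale, and both thresholds are assumptions made for illustration.

```python
from typing import Sequence

EFFORTS = ["low", "medium", "high"]


def choose_step_effort(
    step_index: int,                 # 1-based position of the current step
    total_steps: int,
    prior_scores: Sequence[float],   # quality scores of the steps already run
    base: str = "low",
    shaky_below: float = 0.8,        # assumed bar for "previous steps went well"
    many_remaining: int = 3,         # assumed count of downstream steps that justifies a bump
) -> str:
    level = EFFORTS.index(base)

    # Early steps feed everything after them, so mistakes there cascade furthest.
    remaining = total_steps - step_index
    if remaining >= many_remaining:
        level += 1

    # If any earlier step scored poorly, spend more effort to avoid compounding it.
    if prior_scores and min(prior_scores) < shaky_below:
        level += 1

    return EFFORTS[min(level, len(EFFORTS) - 1)]
```

Under these assumptions, step 3 of 5 with prior scores of 0.9 and 0.95 stays at `low`, the same step with a prior score of 0.6 moves to `medium`, and step 1 of 5 starts at `medium` because four later steps depend on it.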
Metrics: What are you tracking to validate this actually works? I’m logging effort level, latency, cost, and a quality score from a separate eval, but I’m not sure how to set the quality bar for “good enough at low effort.”
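A small sketch of how those logged fields could be rolled up per effort level, assuming each log record is a dict with `effort`, `cost`, `latency_s`, and `quality` keys (the field names are assumptions). The "good enough" bar is left as a parameter. One option is to set it relative to what high effort scores on the same eval set rather than picking an absolute number.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Mapping


def summarize(records: Iterable[Mapping], quality_bar: float) -> Dict[str, dict]:
    """Group log records by effort level and report cost, latency, and pass rate."""
    by_effort = defaultdict(list)
    for r in records:
        by_effort[r["effort"]].append(r)

    summary = {}
    for effort, rows in by_effort.items():
        summary[effort] = {
            "n": len(rows),
            "mean_cost": mean(r["cost"] for r in rows),
            "mean_latency_s": mean(r["latency_s"] for r in rows),
            # Share of responses at this effort level that met the quality bar.
            "pass_rate": sum(r["quality"] >= quality_bar for r in rows) / len(rows),
        }
    return summary
```

Comparing `pass_rate` at low effort against high effort on the same request mix shows how much quality is actually being traded for the cost and latency difference.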
Would love to hear what routing strategies people are actually using in production. Especially curious if anyone’s open-sourced their effort routing logic.
Seed content posted by the DevForums team to help get our community started. Have a better answer? Jump in!