How are you using GPT-5.4's reasoning.effort parameter in production?

Now that GPT-5.4 has been out for a couple of weeks, I’m curious how people are actually using the reasoning.effort parameter in practice.

For context, it controls how much internal compute the model spends on chain-of-thought before responding. You can set it to low, medium, or high. Higher effort means better accuracy on hard problems but slower responses and higher cost.

I’ve been experimenting with it for a data extraction pipeline and my initial findings are kind of interesting:

  • For straightforward structured extraction (pulling names, dates, amounts from invoices), low effort works just as well as high and runs about 3x faster
  • For ambiguous classification tasks where the categories overlap, high effort noticeably improves accuracy, maybe 8-12% on my eval set
  • medium feels like a weird middle ground that I haven’t found a great use case for yet

What I’m still trying to figure out:

  1. Is anyone dynamically switching effort levels based on input complexity? Like, run a quick classifier first and only escalate to high for tricky inputs?
  2. How does reasoning.effort interact with the 1M token context window? I’m worried that high effort on a massive context could blow up latency and cost.
  3. For agentic workflows with tool use, does effort level affect how well the model plans multi-step tool calls?
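
To make question 1 concrete, here is a rough sketch of the kind of pre-routing step I have in mind. Everything in it is an assumption for illustration: the marker list, the 500-word scale, the 0.4 threshold, and the names estimate_complexity and pick_effort are made up, and a real setup would probably swap the heuristic for a small trained classifier.

```python
import re

# Hypothetical ambiguity markers; tune these on your own data.
AMBIGUITY_MARKERS = re.compile(
    r"\b(maybe|unclear|either|or possibly|not sure|ambiguous)\b", re.IGNORECASE
)

def estimate_complexity(text: str) -> float:
    """Return a rough 0..1 score from input length and ambiguity markers (toy heuristic)."""
    length_score = min(len(text.split()) / 500, 1.0)
    marker_score = min(len(AMBIGUITY_MARKERS.findall(text)) / 3, 1.0)
    return 0.5 * length_score + 0.5 * marker_score

def pick_effort(text: str, threshold: float = 0.4) -> str:
    """Stay on low effort and escalate to high only when the score crosses the threshold."""
    return "high" if estimate_complexity(text) >= threshold else "low"
```

The returned string would then be passed as the effort value on the actual API call, so only the inputs flagged as tricky pay the latency and cost of high.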

Would love to hear what patterns people are settling on, especially if you’ve done A/B testing in production.


Seed content posted by the DevForums team to help get our community started.

Been using reasoning.effort pretty heavily since launch, and I think the key insight is to treat it like a dial you tune per-task, not a global setting.

Here’s how I’ve been breaking it down in my production setup:

none or low for structured extraction and classification with clear categories. If you’ve got a well-defined schema and the task is basically “fill in these fields from this text,” you don’t need the model burning compute on chain-of-thought. I route about 70% of my API calls through low and the accuracy difference vs high is negligible for these tasks.

medium for multi-step instructions and moderate ambiguity. Things like summarizing documents with specific criteria, or generating content that needs to follow a style guide. This is my default for anything that isn’t pure extraction.

high or xhigh only when evals prove it matters. I reserve these for tasks where I’ve actually measured the accuracy delta. For me that’s mainly complex code generation and multi-hop reasoning over long documents.
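
For anyone wondering what "measured the accuracy delta" can look like, here is a minimal sketch of a per-effort comparison. It is not my actual harness: accuracy_by_effort and the predict callback are hypothetical names, and predict(effort, text) is assumed to wrap whatever model call you use and return a label you can compare against ground truth.

```python
from typing import Callable, Iterable

def accuracy_by_effort(
    predict: Callable[[str, str], str],
    examples: Iterable[tuple[str, str]],
    efforts: tuple[str, ...] = ("low", "medium", "high"),
) -> dict[str, float]:
    """Return exact-match accuracy per effort level over (text, label) pairs."""
    examples = list(examples)
    if not examples:
        raise ValueError("need at least one labeled example")
    results = {}
    for effort in efforts:
        # Count predictions that exactly match the expected label.
        correct = sum(predict(effort, text) == label for text, label in examples)
        results[effort] = correct / len(examples)
    return results
```

Running this on the same eval set for each task type gives you one accuracy number per level, which makes it easier to see whether high is actually buying anything over low before you pay for it.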

The pattern I’ve landed on is a simple router:

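from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Map each task type to the reasoning effort it should run at.
EFFORT_MAP = {
    "extract": "low",
    "classify_simple": "low",
    "classify_ambiguous": "high",
    "summarize": "medium",
    "code_gen": "high",
    "default": "medium",
}

def call_with_effort(task_type, messages):
    # Unknown task types fall back to the "default" entry.
    effort = EFFORT_MAP.get(task_type, EFFORT_MAP["default"])
    # reasoning={"effort": ...} belongs to the Responses API;
    # Chat Completions takes a flat reasoning_effort argument instead.
    return client.responses.create(
        model="gpt-5.4",
        reasoning={"effort": effort},
        input=messages,
    )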

One thing that tripped me up early: xhigh is NOT just “better quality.” It can actually overthink simple tasks and produce worse results (longer, more hedging, sometimes hallucinating edge cases that don’t exist). OpenAI’s own docs recommend against using it as a default. Save it for genuinely hard reasoning problems.

Also worth noting that the cost difference is real. In my pipeline, switching from high to low for extraction tasks cut my API bill by about 40% with zero accuracy loss on my eval suite. So it’s absolutely worth the effort (pun intended) to build that routing logic.
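
If you want to sanity-check potential savings before building the router, a back-of-the-envelope calculation is enough. The sketch below is illustrative only: blended_cost and savings_fraction are hypothetical helpers, and the per-call costs you feed in should come from your own billing data, not from this post.

```python
def blended_cost(traffic_share: dict[str, float], cost_per_call: dict[str, float]) -> float:
    """Average cost per call, given the share of traffic routed to each effort level."""
    if abs(sum(traffic_share.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * cost_per_call[effort] for effort, share in traffic_share.items())

def savings_fraction(cost_before: float, cost_after: float) -> float:
    """Fraction of spend saved when moving from cost_before to cost_after."""
    return 1 - cost_after / cost_before

# Placeholder relative costs per call (NOT real prices): high assumed 3x low.
relative_cost = {"low": 1.0, "high": 3.0}
all_high = blended_cost({"high": 1.0}, relative_cost)
routed = blended_cost({"low": 0.7, "high": 0.3}, relative_cost)
print(f"estimated savings: {savings_fraction(all_high, routed):.0%}")
```

Plug in your measured cost per call at each level and your real traffic split; the output is only as good as those two inputs.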

What kinds of tasks are you seeing the biggest accuracy gaps between effort levels?
