The One Config Flag That 5x’d Our Throughput
Key Takeaways
- Reasoning models (gpt-oss, OpenAI o1, and similar) emit chain-of-thought tokens that share the `max_tokens` budget with the user-visible output.
- At default settings, complex prompts can burn the entire budget on hidden reasoning, leaving the JSON response truncated and silently invalid.
- The `reasoning_effort: "low"` parameter caps reasoning to a handful of tokens, freeing the budget for the response your downstream parser actually needs.
- On a real production extraction pipeline, the change took aggregate throughput from 12 to 65 records per minute and eliminated half-formed outputs.
- Use `low` for structured extraction, classification, and any task where the answer shape matters more than the model’s deliberation.
A pipeline that was running at 12 completions per minute, half of them silently parse-failing, jumped to 65 per minute with clean parsing after changing a single request parameter. The flag is `reasoning_effort`, the model is OpenAI's gpt-oss family, and the reason most teams have not heard of it is that it is buried in the API docs behind the assumption that you already know what reasoning models do to your token budget.
What reasoning models do to your token budget
A reasoning model like gpt-oss-120B does not just produce a response. It produces a chain-of-thought trace before the response, which the model uses to plan and verify. That trace lives in `message.reasoning`, a separate field from the `message.content` your application reads.
Both fields share the `max_tokens` budget. If `max_tokens` is 1024 and the model spends 900 tokens deliberating, your `content` field gets the leftover 124 tokens. For free-form text that is sometimes acceptable. For a prompt that needs a JSON object back, 124 tokens is half a value and a missing closing brace.
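A minimal sketch of how to see the split, assuming an OpenAI-compatible server such as vLLM's (the exact name of the reasoning field varies by server and SDK version, so the `getattr` below is defensive; the endpoint, model name, and prompt are illustrative):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[{"role": "user", "content": "Return this job ad as JSON: ..."}],
    max_tokens=1024,
)

choice = resp.choices[0]
# Hidden deliberation and visible output draw from the same max_tokens pool.
reasoning = getattr(choice.message, "reasoning", None) or ""
content = choice.message.content or ""
print(choice.finish_reason)  # "length" means the shared budget ran out
print(len(reasoning), "reasoning chars vs", len(content), "content chars")
```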
The behaviour is silent. The HTTP response is 200. The token usage looks normal. The downstream JSON parser fails. If your pipeline does not log parse failures separately from API failures, you discover this only by noticing the throughput is wrong.
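If everything behind a 200 status is counted as success, this failure mode never surfaces. A small sketch of the separation that makes it visible, with a hypothetical helper name:

```python
import json
import logging

log = logging.getLogger("pipeline")

def parse_record(raw: str) -> dict | None:
    """Parse model output, counting JSON failures separately from API failures."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        # A 200 response with broken JSON is its own failure class. Without
        # this counter it hides inside apparently successful requests.
        log.warning("json_parse_failure: %s tail=%r", exc, raw[-80:])
        return None
```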
How we found it
We were rebuilding a structured extraction pipeline that processes UK job descriptions for Leap. 1500 prompt tokens in, 400-1000 tokens of structured JSON out, running at concurrency 256 on a single DGX Spark. On paper it should have been fast. In practice the runner was producing 12 completions per minute and roughly half the outputs were failing JSON parse on the client side.
vLLM logs were clean. HTTP responses were clean. Token usage looked normal. It took a while to spot the pattern: every parse failure was a complex job description, and every complex one was hitting the `max_tokens` ceiling on the reasoning trace before the model wrote any structured output at all.
reasoning_effort: "low" capped the chain-of-thought to a few tokens regardless of prompt complexity. Pipeline rate jumped to 65 completions per minute. JSON parse failures dropped to zero. Quality of extractions stayed the same or improved (less reasoning to confuse the model, more budget for clean output).
When to use it, when not to
reasoning_effort: "low" is the right setting when:
- Output shape matters more than model deliberation. Structured extraction, JSON-strict prompts, classification, tagging, summarisation to a fixed schema.
- Concurrency is high and per-request latency budget is tight.
- The prompt itself carries enough constraint that the model does not need to reason its way to the answer shape.
Keep the default (or set it to `medium` or `high`) when:
- The task is genuinely reasoning-heavy. Multi-step maths, chain-of-evidence answers, code refactoring proposals.
- You have plenty of headroom in `max_tokens` and want the model's full deliberation.
- You are evaluating a model's reasoning capability, not running it as a workhorse.
The mistake is leaving the default in place for tasks that do not need reasoning. The model will always use the headroom you give it. If the headroom is reasoning tokens that never reach your application, the model is burning your budget on work you cannot see.
What this signal means more generally
The `reasoning_effort` gotcha is a specific instance of a broader pattern. Reasoning models change the contract between `max_tokens` and what you actually receive. If a team upgrades from a non-reasoning model to a reasoning one without auditing token-budget assumptions, half the pipeline can silently regress, and the team will spend a week debugging a hardware or networking ghost that does not exist.
The fix is mechanical. Audit every API call. For each one, ask: is this task reasoning-shaped or output-shaped? If output-shaped, set `reasoning_effort: "low"`. If reasoning-shaped, raise `max_tokens` to fit a full chain-of-thought plus the response.
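A hedged sketch of what that audit can look like once written down in code. The task names and categories here are hypothetical, not from the original pipeline:

```python
from typing import Literal

TaskShape = Literal["output", "reasoning"]

# Hypothetical registry: classify each call site in the pipeline once.
CALL_SITES: dict[str, TaskShape] = {
    "extract_job_description": "output",   # strict JSON schema out
    "classify_seniority": "output",        # single label out
    "propose_refactor": "reasoning",       # genuinely multi-step
}

def params_for(task: str) -> dict:
    """Map a task's shape to request parameters, per the audit rule above."""
    if CALL_SITES[task] == "output":
        # Output-shaped: cap deliberation, keep a normal budget.
        return {"reasoning_effort": "low", "max_tokens": 1024}
    # Reasoning-shaped: leave effort up, budget for the full CoT plus answer.
    return {"reasoning_effort": "medium", "max_tokens": 4096}
```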
That single audit, done once per pipeline, catches a class of silent failures that monitoring will not flag.
Take the Next Step
If your team has switched to a reasoning model recently and the throughput numbers feel wrong, this is the first place to look. We help teams audit their inference stack for silent-regression patterns and set the right flags for the workload. Get in touch if you want a fresh pair of eyes on yours.