April 6, 2026 | Sovereign AI

The “Fast” Model Wasn’t Fast: A Real Pipeline, 30x Faster on a Bigger Model

Key Takeaways

  • A 9B “fast” model running at concurrency 4 hit 130 records/hour on the production pipeline. A 120B MoE reasoning model at concurrency 256 hit ~3,900 records/hour on the same hardware.
  • “Fast” model selection logic falls apart on inference hardware that scales with concurrency. The big model is faster wall-clock when you let it batch.
  • MoE architecture matters: gpt-oss-120B activates ~5B parameters per token despite being 120B total, so per-token bandwidth cost is low.
  • The bigger reasoning model also produced cleaner extractions: less duplication, more strategic skill inference, sharper requirement decomposition.
  • “This model is too slow” almost always means “this model is too slow single-stream”, which is rarely the actual constraint.

A production extraction pipeline was processing UK job descriptions at 130 records per hour. The model running it was a 9B "fast" model at concurrency 4. The 9B was the second choice: the pipeline started life on a 70B-class model, swapped down for "speed" when the 70B's per-request latency made the backlog feel infeasible. The 9B was supposed to be the pragmatic answer. Two months in, the throughput numbers said the pragmatic answer was wrong.

We rewrote the runner against a 120-billion-parameter mixture-of-experts reasoning model at concurrency 256 on the same DGX Spark. Throughput went to 3,900 records per hour. 30x faster, on a model with 13x more total parameters. The extractions were also better.

This is a story about what "fast" actually means in modern inference, and why the standard model-selection intuition reverses on capacity-and-concurrency hardware.

The pipeline that needed rewriting

The workload: 123,849 UK job descriptions, each needing structured JSON extraction. Title, skills (explicit and inferred), requirements, seniority, culture signals. Each extraction is a 1500-token prompt in, 400-1000 tokens of structured output. This powers Leap, our career-mobility platform.
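
To make the shape of the output concrete, here is a minimal sketch of the record each extraction produces. The field names are illustrative stand-ins based on the description above, not the production Leap schema:

```python
from dataclasses import dataclass, field

# Illustrative extraction record; field names are assumptions based on the fields
# described above (title, skills, requirements, seniority, culture signals).
@dataclass
class JobExtraction:
    title: str
    explicit_skills: list[str] = field(default_factory=list)  # skills named in the posting
    inferred_skills: list[str] = field(default_factory=list)  # skills implied by the duties
    requirements: list[str] = field(default_factory=list)     # one atomic requirement per item
    seniority: str = ""                                        # e.g. "Director", not just "senior"
    culture_signals: list[str] = field(default_factory=list)
```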

The original runner used a Llama-3.1-70B model in single-stream chat mode. The per-request latency was 7-9 seconds. With 124,000 records to process, the projected wall time was multiple weeks. Felt impossible. The team swapped the 70B out for the smallest-acceptable-quality alternative, a 9B Nemotron Nano hybrid Mamba-2 model running at concurrency 4. The intuition was: small model, faster per-request, more concurrency, faster overall.
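
The projection that drove the swap is easy to reproduce. A back-of-envelope sketch, using only the record count and the observed single-stream latency from the paragraph above:

```python
RECORDS = 123_849
SINGLE_STREAM_LATENCY_S = (7, 9)  # observed per-request latency on the 70B runner

# Single-stream wall clock: roughly 10-13 days of uninterrupted compute,
# before retries, failures, and restarts stretch it further.
for s in SINGLE_STREAM_LATENCY_S:
    print(f"{s}s/request -> {RECORDS * s / 86_400:.1f} days single-stream")
```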

That was wrong, in a way that took two months and 77,000 processed records to expose.

Why the 9B plateaued

At concurrency 4, the 9B was producing 130 records per hour. Per-stream throughput was respectable. Aggregate throughput plateaued because the model architecture cannot batch the way pure-attention transformer models can: Mamba state-space layers are sequential by design, and the scheduler cannot fan out their steps across many concurrent requests. Beyond a low concurrency, requests just queue. The aggregate number does not move.
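
The plateau is easy to see empirically: sweep concurrency against the serving endpoint and watch where aggregate throughput stops moving. A minimal sketch, assuming an OpenAI-compatible endpoint and the openai Python client (the post does not name the serving stack; the URL and token budget are placeholders):

```python
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # placeholder endpoint

async def extract(prompt: str, model: str) -> None:
    # One extraction request; output tokens capped at the workload's upper bound.
    await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1000,
    )

async def records_per_hour(prompts: list[str], model: str, concurrency: int) -> float:
    sem = asyncio.Semaphore(concurrency)

    async def bounded(p: str) -> None:
        async with sem:
            await extract(p, model)

    start = time.perf_counter()
    await asyncio.gather(*(bounded(p) for p in prompts))
    return len(prompts) / (time.perf_counter() - start) * 3600

async def sweep(prompts: list[str], model: str) -> None:
    # A model that batches well keeps climbing; an architecture that cannot plateaus early.
    for c in (1, 4, 16, 64, 256):
        rate = await records_per_hour(prompts, model, c)
        print(f"concurrency {c:>3}: {rate:,.0f} records/hour")
```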

This is not a Mamba bug. It is a Mamba design choice. The 9B is a fast single-user utility. It was never meant to be a batched-extraction engine. It got pressed into that role because somebody wanted a "fast" model.

The lesson: architecture-to-workload fit matters more than parameter count.

Why the 120B won

We swapped to gpt-oss-120B MXFP4 at concurrency 256. Two things were working in the bigger model's favour:

  1. Mixture-of-experts. The model has 128 experts, 4 active per token, so each output token only triggers weight reads for ~5 billion parameters, not 120 billion. The effective bytes-per-token at decode is around 2.5 GB, compared to ~25 GB for a dense 49B at the same 4-bit precision. The same memory bandwidth budget gets 10x more tokens because the MoE architecture matches the hardware's strength (see the back-of-envelope sketch after this list).

  2. Native 4-bit precision. The Spark's Blackwell GPU has tensor cores that execute MXFP4 natively. Running gpt-oss in its native quantisation format means no emulation, full hardware-rate inference.
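
Here is that bandwidth arithmetic as a tiny script. It counts weight traffic only and ignores KV-cache reads; the point is the ratio, not the absolute numbers:

```python
def decode_gb_per_token(active_params: float, bits_per_param: int) -> float:
    """Weight bytes (in GB) that must be read to produce one token; ignores KV-cache traffic."""
    return active_params * bits_per_param / 8 / 1e9

moe = decode_gb_per_token(5e9, 4)      # gpt-oss-120B: ~5B active params at MXFP4
dense = decode_gb_per_token(49e9, 4)   # a dense 49B at the same 4-bit precision
print(f"MoE ~{moe:.1f} GB/token, dense ~{dense:.1f} GB/token, ratio ~{dense/moe:.0f}x")

# At high concurrency the same weight read is shared across every in-flight request's
# token, which is why the bandwidth budget stretches even further when you batch.
```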

At concurrency 256, the pipeline rate was about 65 completions per minute, sustained. That is roughly 3,900 records per hour. The 45,000-record tail that would have taken another two weeks on the 9B (45,000 ÷ 130 ≈ 346 hours) got chewed through in well under a day (45,000 ÷ 3,900 ≈ 11.5 hours).

A subtle point worth pulling out: the apples-to-apples comparison is "120B at c=256 vs 9B at c=4", which is the 30x figure. Compare "120B at c=256 vs 9B at c=1" instead and the gap is roughly 120x, because the 9B's 130 records per hour was already spread across four streams. The "before" number was already concurrent, so the headline figure understates the jump from single-stream thinking to concurrency-first execution on this hardware.

And the bigger model was better

The intuition says a 9B model with all 9B parameters active should produce richer outputs than a 120B MoE that activates only 5B per token. Active parameters matter most for output quality, the conventional wisdom goes, so the 9B should compete on quality even if it loses on throughput.

The conventional wisdom did not hold here. We had two records that ran through both runners; the side-by-side comparison:

  • Explicit skills list: Nemotron 9B (~9B active, dense) produced an OK list with some duplication and noise; gpt-oss 120B (~5B active, MoE) was cleaner and consolidated.
  • Inferred skills: the 9B stayed literal ("encoding techniques"); the 120B was strategic ("accessibility awareness", "CRM forecasting", "cross-functional collaboration").
  • Requirement decomposition: the 9B left compound strings; the 120B atomised them into individual items.
  • Seniority labelling: the 9B was generic ("senior"); the 120B was specific ("Director").

On the fields the pipeline actually cares about (inferred capability, transferable skills, requirement atomicity) the 120B MoE was materially better despite having fewer active parameters per token. Expert specialisation plus reasoning training beat raw active-parameter count.

The general lesson

When a team says "this model is too slow", the next question to ask is: too slow at what concurrency?

Almost always the answer is "too slow single-stream". Almost always single-stream is not the actual constraint. The fix is rarely a smaller model. The fix is usually concurrency tuning, prefix caching, and (when it fits) a bigger MoE architecture that the hardware was designed to run.
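
What that tuning looks like depends on the serving stack, which this post does not name. As one illustration only, with vLLM's offline engine the relevant knobs would sit roughly here (model id and values are assumptions, not the production configuration):

```python
from vllm import LLM, SamplingParams

# Illustrative only: let the scheduler batch up to 256 sequences and reuse the shared
# extraction-prompt prefix across requests instead of recomputing it for every record.
llm = LLM(
    model="openai/gpt-oss-120b",   # assumed Hugging Face model id
    max_num_seqs=256,              # upper bound on concurrently scheduled sequences
    enable_prefix_caching=True,    # cache the KV state of the shared prompt prefix
)
params = SamplingParams(max_tokens=1000)
outputs = llm.generate(["<job description prompt here>"], params)
```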

We made the wrong trade for two months. The 70B-to-9B swap looked sensible on paper. In practice it dropped the pipeline into an architecture that could not scale and a model class that did not have the reasoning depth the workload needed. The right move was always to stay on the bigger model and fan out, not to shrink.

That is the concurrency-first thesis made concrete. The "fast" model was not fast. The "big" model was not slow. At the right concurrency, with the right flags, a smarter model wins on both axes simultaneously.

Take the Next Step

If your team has been swapping down to smaller models in the name of speed and the throughput numbers feel disappointing, the gap is almost certainly in the architecture-to-concurrency match. We help teams design the inference stack that fits their actual workload, not the one that fits the worst-case single-stream review. Get in touch if you want to put real numbers next to the question.