The Single-Stream Benchmark Trap: How to Actually Evaluate AI Inference Hardware
Key Takeaways
- Memory bandwidth on inference hardware is a budget that gets *spent* per weight-read pass, not per stream. Adding streams piggybacks on the same pass.
- A DGX Spark single-streaming a dense 70B-class model produces around 5-6 tokens per second. The same hardware at concurrency 256 hits 695 tokens per second aggregate.
- Reviews that report the single-stream number as “the hardware’s capability” are reporting a choice, not a ceiling.
- Real workloads (extraction pipelines, agent fan-out, multi-user chat services) are inherently concurrent. Aggregate throughput at that concurrency is the only measurement that matters.
- When buying inference hardware, benchmark at the concurrency you will actually use, not the concurrency the review used.
Pick any local AI hardware review. Almost without exception, the reviewer fires up the box, asks one question of one model on one stream, measures tokens per second, and writes the verdict. That number becomes the headline. Other reviewers pick it up. The buyer reads it. The buyer concludes the hardware is overpriced. The buyer is wrong, and the buyer's wrongness has nothing to do with the hardware.
The actual measured throughput of the same box on the same model, fed 256 concurrent streams instead of one, is roughly 120x higher. The single-stream review is measuring a worst case that no production workload runs in.
What single-stream is actually measuring
Run a single inference request on a modern GPU and what you are measuring is one path through the model's weights. The GPU has to read every weight required to produce one output token, write the token, and repeat. The memory bandwidth between the weights and the compute units is the bottleneck. For a 70B-class model quantised to roughly 25 GB, that is roughly 25 GB of weight reads per output token. On a Spark with 273 GB/s of bandwidth, that is your ceiling: about 11 tokens per second per stream, before any other overhead.
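The arithmetic behind that ceiling is worth making explicit. A minimal sketch using only the figures quoted above (273 GB/s of bandwidth, roughly 25 GB of weight reads per output token); everything else is plain division:

```python
# Bandwidth-bound ceiling for single-stream decoding.
# Assumes every output token requires one full read of the weights,
# and ignores KV-cache reads, activations, and scheduling overhead.

memory_bandwidth_gb_s = 273.0    # DGX Spark figure quoted above
weight_read_gb_per_token = 25.0  # weight bytes read per output token

ceiling_tokens_per_s = memory_bandwidth_gb_s / weight_read_gb_per_token
print(f"single-stream ceiling: {ceiling_tokens_per_s:.1f} tok/s")  # ~10.9
```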
This is not a hardware limitation in the sense most reviewers imply. It is what the physics gives you when you run the model in the dumbest possible mode.
The thing reviewers miss is what happens when you stop doing that.
What concurrency does
Run two inference streams in parallel against the same model. The GPU reads the weights once. Both streams use that read. The bandwidth budget has not changed, but you have produced output for two streams instead of one. Aggregate throughput nearly doubles. Per-stream throughput stays close to the single-stream number until you saturate something else.
Run sixteen streams. Same weight read, sixteen outputs per pass. Aggregate throughput is now 14x the single-stream rate or better.
Run 256 streams. The KV cache becomes the new ceiling. Aggregate throughput peaks at ~120x single-stream on a Spark running gpt-oss-120B MXFP4. Above that, the scheduler thrashes and throughput collapses. You have found the actual capacity of the hardware.
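The shape of that curve can be written down as a deliberately crude model: the weight read is shared across the batch, so the bandwidth-bound aggregate grows roughly linearly with batch size until something else (KV cache, compute, scheduler) caps it. A minimal sketch; the bandwidth and weight figures are the ones quoted above, and the cap is an illustrative stand-in for whatever saturates first, not a measurement:

```python
# Toy model of bandwidth-bound batched decoding. One weight-read pass
# per decode step serves every stream in the batch, so the aggregate
# rate scales with batch size until another resource becomes the ceiling.

def aggregate_tokens_per_s(batch_size: int,
                           bandwidth_gb_s: float = 273.0,
                           weights_gb: float = 25.0,
                           other_ceiling_tok_s: float = 700.0) -> float:
    """Aggregate tokens/s under a shared-weight-read model.

    other_ceiling_tok_s stands in for whatever saturates first at high
    concurrency (KV cache, compute, scheduler); the value is illustrative.
    """
    bandwidth_bound = batch_size * bandwidth_gb_s / weights_gb
    return min(bandwidth_bound, other_ceiling_tok_s)

for b in (1, 2, 16, 256):
    print(b, round(aggregate_tokens_per_s(b), 1))
```

Real curves bend earlier and less cleanly than this, but the overall shape (roughly linear until a cap) is exactly what a single-stream benchmark cannot show.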
This is not a Spark-specific story. It is true of any modern inference accelerator. The shape of the throughput curve shifts depending on the model and the quantisation, but single-stream is always the worst case and always the wrong number to put on a slide.
Three workloads that prove the point
Where concurrency-first thinking matters in practice:
Batched extraction pipelines. You have 100,000 documents to run through a structured extraction prompt. Single-stream the run and you wait weeks; run it at concurrency 128 and you finish in a day on the same hardware (a minimal dispatch sketch follows these examples). We rebuilt our own UK job-description extraction pipeline this way: the old runner ran at concurrency 4 on a fast small model and took 24 days for 77k records. The new runner, at concurrency 256 on a much larger reasoning model, finishes the same volume in under a day, and the extractions are better.
Agent workflows with parallel tool calls. A research agent fans out to 12 sources, fetches and reads each, then synthesises. Each sub-task runs as fast as it would single-stream because the others are using the same weight-read passes. The agent's wall time is dominated by the slowest sub-task, not by 12x the single-stream latency.
Multi-user chat services. 20 users typing into a chat product simultaneously. Each gets their own stream. Each gets the same per-stream throughput as if they were the only user. The 21st user does not slow down the first 20 (until the box is genuinely full).
None of those workloads are visible in a single-stream benchmark. None of them are exotic. Most teams running an AI feature have at least one of them in production already.
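For the extraction case in particular, the dispatch pattern is short enough to sketch. A minimal version, assuming an OpenAI-compatible chat-completions endpoint on a local server; the URL, model name, and prompt text are placeholders, and retries and output parsing are left out:

```python
import asyncio
import httpx

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # assumed local server
MODEL = "your-model"                                     # placeholder
CONCURRENCY = 256

async def extract_one(client: httpx.AsyncClient, sem: asyncio.Semaphore,
                      document: str) -> str:
    # The semaphore bounds in-flight requests to the target concurrency;
    # the server batches whatever arrives together onto shared weight reads.
    async with sem:
        resp = await client.post(ENDPOINT, json={
            "model": MODEL,
            "messages": [{"role": "user",
                          "content": f"Extract the fields as JSON:\n\n{document}"}],
        }, timeout=300)
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]

async def run(documents: list[str]) -> list[str]:
    sem = asyncio.Semaphore(CONCURRENCY)
    async with httpx.AsyncClient() as client:
        return await asyncio.gather(*(extract_one(client, sem, d)
                                      for d in documents))

# results = asyncio.run(run(documents))
```

The same bounded-fan-out shape covers the agent and multi-user cases: the server sees a pile of concurrent streams and shares each weight-read pass across all of them.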
Why teams keep getting this wrong
Three reasons, all easy to spot once you know to look:
- Reviews drive purchase decisions, and reviews use the laziest benchmark. Reviewers measure what they can quickly run from a chat interface. That is single-stream. The result is a decade of consumer-shaped reviews about hardware designed for capacity loads.
- Vendor benchmarks rarely correct the narrative. Vendors quote peak-aggregate numbers in white papers most buyers never read, while the consumer-facing pages quote the single-stream number anyway, because that is what the buyer asked for. Nobody is incentivised to fix the framing.
- Buyers are not measuring their own workload. A buyer evaluating inference hardware should run their actual workload against the box. Not a chat completion. Not a "hello world". The exact dispatch pattern they will run in production. Most do not, and the result is a purchase decision driven by someone else's wrong benchmark.
The fix is mechanical. Pick the workload you will actually run. Hit the box with that workload at the concurrency you will actually use. Measure aggregate throughput, not single-stream. Adjust the buy decision based on those numbers.
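A minimal shape for that measurement: fire a representative set of prompts at the box at each concurrency level, count completion tokens, and divide by wall time. The endpoint, model name, and the usage field layout are assumptions about an OpenAI-compatible server like the one sketched earlier:

```python
import asyncio
import time
import httpx

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # assumed local server
MODEL = "your-model"                                     # placeholder

async def one_request(client: httpx.AsyncClient, prompt: str) -> int:
    resp = await client.post(ENDPOINT, json={
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }, timeout=600)
    resp.raise_for_status()
    # OpenAI-compatible servers report output token counts under "usage".
    return resp.json()["usage"]["completion_tokens"]

async def aggregate_throughput(prompts: list[str], concurrency: int) -> float:
    """Aggregate output tokens/s for `concurrency` simultaneous streams."""
    start = time.perf_counter()
    async with httpx.AsyncClient() as client:
        tokens = await asyncio.gather(*(one_request(client, p)
                                        for p in prompts[:concurrency]))
    return sum(tokens) / (time.perf_counter() - start)

async def sweep(prompts: list[str]) -> None:
    # Sweep past the concurrency you plan to run in production.
    for c in (1, 16, 64, 256):
        rate = await aggregate_throughput(prompts, c)
        print(f"concurrency {c:>3}: {rate:.0f} tok/s aggregate")

# asyncio.run(sweep(representative_prompts))
```

The number that should drive the buy decision is the aggregate rate at, and slightly above, the concurrency you intend to run, measured with prompts that look like yours.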
Take the Next Step
If your team is evaluating local AI hardware (or wondering whether the cloud spend is genuinely necessary), the single-stream review is the wrong evidence to base a decision on. We help teams design representative benchmarks against their real dispatch patterns and read the results clearly. Get in touch if you want a sanity check before signing the purchase order.