April 22, 2026 | Engineering

NVIDIA DGX Spark Concurrency Benchmark: 120x Throughput the Single-Stream Reviews Miss

Benchmarking the NVIDIA DGX Spark (GB10, 128GB) reveals 120x more throughput at concurrency 256 than single-stream reviews show. Full vLLM config and production workload results.

Concurrency race: single sequential stream far behind a dense bundle of parallel streams surging past the finish line

The online narrative

Search “DGX Spark review” and you’ll find a consistent refrain from people who bought one, ran a single chat completion, and wrote it off:

“£4,000 for 2.7 tokens per second on Llama 70B? Are they taking the piss?”

“It’s slower than my RTX 4090.”

“My laptop does Qwen 3 at 120 tok/s. Why would I pay this much for a desk appliance that decodes slower?”

The numbers they’re quoting aren’t wrong. Fire up an nvcr.io/nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:latest container with stock flags and ask it one question, and yes – you get ~5-6 tokens/sec. The LMSYS DGX Spark In-Depth Review measured Llama 3.1 70B FP8 at 2.7 tok/s decode. That’s the reality of single-stream dense-decode on a box with 273 GB/s of memory bandwidth.

But here’s what nobody running those single-stream benchmarks tells you:

The same £4,000 box, same model, same session, delivers 695 tokens per second of aggregate throughput when you feed it 256 concurrent streams. At that concurrency, the single-stream critique is off by a factor of 120×.

I spent an afternoon finding out what a DGX Spark actually does when you stop benchmarking it like a consumer GPU, and the answer reshaped how I think about this hardware category entirely.

What I’m benchmarking with

A single Nvidia DGX Spark – the 2 TB / 128 GB RAM variant with the GB10 Grace-Blackwell Superchip: 128 GB of unified LPDDR5x memory pooled between the 20-core Arm CPU and the Blackwell GPU, with 273 GB/s of aggregate bandwidth.

Two models served via nvcr.io/nvidia/vllm:26.03-py3 – Nvidia’s own vLLM base image for Blackwell:

  1. Nemotron Super 49B v1.5 NVFP4 – a Llama-70B-derived reasoning model, Neural-Architecture-Search-compressed to ~26B effective params, 4-bit NVFP4 quantised for Blackwell’s native 4-bit tensor cores. Served via Nvidia’s NIM, which uses vLLM internally.
  2. OpenAI gpt-oss-120B MXFP4 – OpenAI’s open-weight 120B model. Mixture-of-experts: 128 experts, 4 active per token, so only ~5B active parameters fire per token despite 120B total weights. MXFP4 quantised (similar to NVFP4 but OpenAI’s native format).

The workload is real: a character-extraction prompt from a larger text-processing pipeline I’m building, ~1500 prompt tokens in, 400 output tokens max, structured JSON output. Not a synthetic prompt-length benchmark – actual production shape.

The fundamental fact everyone misses

Memory bandwidth is a budget you spend. When you run one inference stream, you spend it on reading weights for one sequence. When you run two streams simultaneously, you spend the same bandwidth budget reading the same weights, and both streams get the result.

This is why batched throughput scales near-linearly with concurrency until you hit a different bottleneck – KV cache, compute, scheduling overhead. Every paper on LLM inference says this. It’s table stakes for anyone running production inference workloads.
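The arithmetic behind that claim fits in a few lines. A toy model – illustrative numbers only, assuming weight reads are the sole cost of a decode step – where one batched step reads the weights once and yields one token for every sequence in the batch:

```python
# Toy model of bandwidth-limited decode: a batch shares one weight-read
# pass per step, so aggregate tokens/s grows linearly with batch size
# until some other bottleneck (KV cache, compute, scheduling) takes over.

BANDWIDTH_GBPS = 273.0   # DGX Spark aggregate memory bandwidth
WEIGHT_READ_GB = 25.0    # ~49B params at 4 bits/param

def decode_ceiling(batch: int) -> float:
    """Aggregate tok/s if weight reads were the only cost."""
    steps_per_sec = BANDWIDTH_GBPS / WEIGHT_READ_GB  # one step = one token per sequence
    return steps_per_sec * batch                     # every sequence gets a token per step

for n in (1, 8, 64):
    print(n, round(decode_ceiling(n), 1))  # 1 -> 10.9, 8 -> 87.4, 64 -> 698.9
```

The real machine tracks this model closely at low concurrency and departs from it once KV cache or compute saturates – which is exactly the shape of the sweeps below.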

Yet review after review of the Spark treats it like a single-user desktop chatbot, measures its worst-case profile, and declares it underwhelming.

Let’s look at what actually happens when you stop doing that.

The 49B concurrency sweep

Here’s what Nemotron Super 49B v1.5 NVFP4 does on a single DGX Spark, across the full concurrency range, with the standard vLLM flags:

| Concurrency | Aggregate TPS | Per-seq TPS | Wall time (s) | Scaling factor |
|---|---|---|---|---|
| 1 | 5.79 | 5.79 | 69.1 | 1.00× |
| 2 | 11.43 | 5.72 | 70.0 | 1.97× |
| 4 | 22.79 | 5.70 | 70.2 | 3.94× |
| 8 | 45.41 | 5.68 | 70.5 | 7.84× |
| 16 | 86.47 | 5.41 | 74.0 | 14.9× |
| 32 | 161.90 | 5.08 | 79.1 | 28.0× |
| 48 | 227.71 | 4.77 | 84.3 | 39.3× |
| 64 | 289.00 | 4.56 | 88.6 | 49.9× |
| 96 | 390.41 | 4.11 | 98.4 | 67.4× |
| 128 | 490.68 | 3.92 | 104.4 | 84.7× |
| 160 | 547.89 | 3.54 | 116.8 | 94.6× |
| 192 | 607.82 | 3.27 | 126.4 | 105.0× |
| 224 | 651.61 | 3.07 | 137.5 | 112.5× |
| 256 | 695.11 | 2.85 | 147.3 | 120.1× ← peak |
| 320 | 514.69 ↓ | 2.37 | 248.7 | 88.9× (regressed) |

Look at that first column. The critique treats 5.79 tok/s as “what the hardware is capable of.” But scan down: at 8 concurrent streams the same hardware delivers 45 tok/s aggregate. At 32 streams, 162 tok/s. At 128 streams, 490 tok/s.

And the per-sequence column tells a separate story – it drops from 5.79 to 2.85 across the whole range. That’s what the single-stream reviewers are reporting, dressed up as a hardware ceiling, when it’s actually a choice about how to use the machine.

Put it another way: streams don’t “share” bandwidth so much as reuse it – each additional stream piggybacks on the same weight-read pass. Wall time grows from 69 s at c=1 to 147 s at c=256: you’ve processed 256× the work in roughly 2× the wall clock. That’s what near-linear scaling looks like.

The regression at c=320 is the interesting bit. That’s where the scheduler hit KV-cache saturation on the 49B – every active sequence needs memory to hold its attention state, and there’s a ceiling. Past c=256 the scheduler starts thrashing and throughput collapses. We found the wall. It’s at 120× single-stream for this model at this prompt shape.
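The sweep itself needs nothing exotic: fire N identical requests at the OpenAI-compatible endpoint, sum completion tokens, divide by wall time. A stdlib-only sketch – the URL and model name match the config later in this post, but treat both as assumptions to adjust for your setup:

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumed endpoint/model: a vLLM server on localhost:8000 serving gpt-oss.
URL = "http://localhost:8000/v1/completions"
MODEL = "openai/gpt-oss-120b"

def one_request(prompt: str, max_tokens: int = 400) -> int:
    """Fire a single completion and return the number of tokens generated."""
    body = json.dumps({"model": MODEL, "prompt": prompt,
                       "max_tokens": max_tokens}).encode()
    req = urllib.request.Request(URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=900) as resp:  # generous: per-seq is slow at high c
        return json.load(resp)["usage"]["completion_tokens"]

def aggregate_tps(total_tokens: int, wall_seconds: float) -> float:
    """The number every single-stream review leaves out."""
    return total_tokens / wall_seconds

def sweep(prompt: str, concurrency: int) -> float:
    """Launch `concurrency` identical requests; return aggregate tok/s."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        tokens = sum(pool.map(one_request, [prompt] * concurrency))
    return aggregate_tps(tokens, time.monotonic() - start)
```

Run `sweep(your_prompt, c)` for each concurrency level and you reproduce the table’s first column; per-seq TPS is just the aggregate divided by c.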

The gpt-oss 120B sweep (the plot twist)

Then I ran the same benchmark on openai/gpt-oss-120B via vLLM with MXFP4 quantisation – a model I shouldn’t even be able to fit on a consumer GPU, because the weights alone are 66 GB.

| Concurrency | Aggregate TPS | Per-seq TPS | Wall time (s) |
|---|---|---|---|
| 1 | 33.53 | 33.58 | 11.9 |
| 2 | 59.52 | 29.79 | 13.4 |
| 4 | 87.56 | 21.92 | 18.3 |
| 8 | 123.39 | 15.44 | 25.9 |
| 16 | 173.19 | 10.89 | 36.8 |
| 32 | 252.30 | 7.95 | 50.6 |
| 64 | 373.27 | 5.90 | 68.6 |
| 96 | 481.35 | 5.11 | 79.8 |
| 128 | 564.71 | 4.52 | 90.6 |
| 192 | 705.41 | 3.85 | 108.8 |
| 256 | 862.84 | 3.62 | 118.7 |
| 384 | 771.17 ↓ | 3.15 | 199.1 |

Three numbers jump out:

  1. Single-stream is 33.53 tok/s. That’s nearly a six-fold improvement over the 49B’s 5.79 on the same hardware. We haven’t added RAM, we haven’t overclocked anything – we’re just running a different model architecture.
  2. Peak aggregate is 863 tok/s. Higher than the 49B’s 695, and the stop-point is compute-bound, not memory-bound.
  3. At peak, GPU KV cache usage was 9%. Not 90. Nine. The model has so much KV headroom that in theory it could go further – it just hit a different wall (the mixture-of-experts routing overhead at very high concurrency).

The critics who quote 2.7 tok/s are describing a narrow, worst-case slice of what this hardware does. The Spark at 33 tok/s single-stream is already comparable to a mid-grade cloud API (Claude Sonnet sits at ~60-80 tok/s, Gemini Flash at ~100+). And at batched throughput, it’s serving close to an order of magnitude more tokens than any single-user RTX card ever will.

Why gpt-oss is dramatically faster than the 49B

Two things working together.

MoE architecture matches the memory-bandwidth profile

Dense models force you to read every weight for every token. For the 49B at 4-bit that’s ~25 GB of weight reads per output token – at the observed 5.79 tok/s, a single sequence’s weight reads already consume more than half of the 273 GB/s bandwidth budget. Scaling concurrency helps because those reads are shared, but the ceiling is hard.

Mixture-of-experts models activate only a small fraction of their parameters per token. gpt-oss has 128 experts, 4 active per token – so each token triggers reads for ~5B of params, not 120B. Effective bytes-per-token at decode drops from 25 GB to ~2.5 GB. The same bandwidth budget now gets you 10× the tokens on the same hardware.
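The dense-vs-MoE arithmetic is worth doing explicitly. A quick back-of-envelope using the rounded figures from this section:

```python
# Bytes of weights read per decode token, at ~4 bits (0.5 bytes) per param.
BYTES_PER_PARAM = 0.5

def weight_read_gb(active_params_b: float) -> float:
    """GB of weights touched per output token (params given in billions)."""
    return active_params_b * BYTES_PER_PARAM

dense_49b = weight_read_gb(49)   # every parameter read for every token
moe_gptoss = weight_read_gb(5)   # 4 of 128 experts -> ~5B active params

print(dense_49b, moe_gptoss, round(dense_49b / moe_gptoss, 1))
# ~25 GB vs ~2.5 GB per token: the same 273 GB/s budget buys ~10x the tokens.
```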

This is exactly why the DGX Spark exists. The machine has 128 GB of unified memory specifically so you can load the whole MoE model, then pay the low per-token cost of having only 4 experts fire at a time. A 24 GB consumer GPU literally cannot hold gpt-oss 120B at any quantisation, regardless of its nominal bandwidth.

The online critics comparing a 5090’s raw bandwidth to the Spark’s 273 GB/s are missing the architectural point. You can’t run this model on a 5090 at all. The comparison isn’t “same model, faster GPU” – it’s “this model vs. a model you could fit instead.”

MXFP4 is what Blackwell was built for

The Blackwell GPU in the Spark has tensor cores that execute 4-bit floating-point operations natively. Not via emulation – natively. When you quantise a model to MXFP4 or NVFP4, you halve the bytes-per-param (vs FP8) and the chip’s tensor cores process the lower-precision values at their native rate.

gpt-oss-120B was released in MXFP4 format from OpenAI directly. Running it on Blackwell isn’t a workaround – it’s using the format the model ships in, on the hardware it was designed to run on.

What this means in practice

The critique is true in a narrow sense: if your workload is “one person typing into a chat interface,” the DGX Spark’s single-stream decode will feel slower than your gaming GPU. But that workload makes no sense on this hardware. If that’s what you actually need, buy a 5090 and run a 14B model on it.

The workloads the Spark was designed for:

  • Batched extraction/transformation pipelines. Running 500 text documents through a structured extraction? Run 32 concurrently, finish in 1/30th the time.
  • Agent workflows with parallel tool calls. An agentic system that fans out to 16 sub-tasks concurrently? Each task runs as fast as single-stream. You pay no concurrency tax.
  • Multi-user chat services. 20 simultaneous users? Each still gets their 4-5 tok/s on the 49B, or 15 tok/s on gpt-oss – same as if they were the only user. The 21st user doesn’t slow down the first 20.
  • Code review / refactoring tools. Scan a codebase, issue parallel review requests against many files at once. The throughput ceiling is irrelevant. It’s what the aggregate lets you ship.

You can’t benchmark those workloads with a single-stream chat completion. The “DGX Spark is overpriced” narrative is what happens when you benchmark a capacity machine with throughput metrics designed for latency machines.
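The dispatch pattern those workloads share is trivial to write: fan out with a concurrency cap instead of looping sequentially. A minimal asyncio sketch with a stand-in `extract` coroutine – in a real pipeline it would call the vLLM endpoint:

```python
import asyncio

async def extract(doc: str) -> dict:
    """Stand-in for a real call to the vLLM endpoint."""
    await asyncio.sleep(0)          # the network round-trip would happen here
    return {"doc": doc, "fields": {}}

async def run_pipeline(docs, concurrency: int = 32):
    """Process all docs with at most `concurrency` requests in flight."""
    gate = asyncio.Semaphore(concurrency)

    async def bounded(doc):
        async with gate:            # blocks once `concurrency` are in flight
            return await extract(doc)

    return await asyncio.gather(*(bounded(d) for d in docs))

results = asyncio.run(run_pipeline([f"doc-{i}" for i in range(100)]))
print(len(results))  # 100, in input order
```

The semaphore is the whole trick: the server batches whatever arrives, and the client just has to keep enough requests in flight to fill the batch.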

The practical production config

What I’m actually running on my Spark now, after finding this out:

docker run -d --name gptoss-120b \
 --gpus all --shm-size=16g \
 --restart unless-stopped \
 -v /path/to/hf-cache:/hf-cache \
 -p 8000:8000 \
 -e HF_HUB_CACHE=/hf-cache \
 -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
 -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
 nvcr.io/nvidia/vllm:26.03-py3 \
 vllm serve openai/gpt-oss-120b \
 --host 0.0.0.0 --port 8000 \
 --max-model-len 65536 \
 --gpu-memory-utilization 0.65 \
 --enable-prefix-caching \
 --trust-remote-code

The full production stack I run on my DGX Spark:

| Port | Model | Role | Tier |
|---|---|---|---|
| 8000 | gpt-oss 120B MXFP4 | main reasoning / pipeline | heavy / medium |
| 8002 | Nemotron Nano 9B v2 | fast utility / simple tasks | light |

Total GPU memory footprint is around 85 GB, leaving 30+ GB of headroom for the OS, other containers, and the second brain’s background services. My extraction pipeline now runs in ~20 minutes where it used to take 3.5 hours.

A cautionary counter-example: the 9B

Concurrency-as-superpower isn’t automatic – it depends on the model and how its container is configured. Here’s what the same benchmark does on Nvidia’s DGX-Spark-specific Nemotron Nano 9B v2 NIM (nvcr.io/nim/nvidia/nvidia-nemotron-nano-9b-v2-dgx-spark:1.0.0-variant), which is a hybrid Mamba-2 + attention architecture at NVFP4:

| Concurrency | Aggregate TPS | Per-seq TPS | Avg per-seq wall (s) |
|---|---|---|---|
| 1 | 26.19 | 26.21 | 15 |
| 2 | 51.08 | 25.54 | 16 |
| 8 | 153.18 | 19.20 | 21 |
| 16 | 155.59 | 14.00 | 31 |
| 32 | 156.82 | 9.63 | 52 |
| 64 | 156.69 | 6.06 | 94 |
| 128 | 155.11 | 3.58 | 179 |
| 256 | 154.97 | 2.13 | 342 |
| 384 | 153.75 (24 timeouts) | 1.71 | 471 |

Single-stream this thing is fast – 26 tok/s solo on a 9B. But notice what happens from c=8 onwards: the aggregate number plateaus and stops growing. Per-seq wall time grows linearly while aggregate TPS stays flat. That’s requests queueing, not batching.
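You can spot that plateau programmatically. Scaling efficiency is aggregate TPS divided by (single-stream TPS × concurrency): near 1.0 means real batching, collapsing toward 1/c while aggregate stays flat means a queue. Fed the 9B numbers from the table above:

```python
def scaling_efficiency(single_tps: float, concurrency: int,
                       aggregate_tps: float) -> float:
    """1.0 = perfect linear scaling; near 1/c = requests are just queueing."""
    return aggregate_tps / (single_tps * concurrency)

# Nemotron Nano 9B sweep points from the table above.
nine_b = {1: 26.19, 8: 153.18, 32: 156.82, 256: 154.97}

for c, agg in nine_b.items():
    print(c, round(scaling_efficiency(26.19, c, agg), 2))
# 1 -> 1.0, 8 -> 0.73, 32 -> 0.19, 256 -> 0.02: efficiency collapses while
# aggregate TPS stays ~flat - the signature of a queue, not a batch.
```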

Why? Two reasons, both structural:

  1. Mamba-2 is sequential. Mamba state-space layers process tokens in a way that resists the batched-attention tricks that make transformer-only models scale on vLLM. The scheduler can’t fan out Mamba steps across the batch the same way it fans out pure attention.
  2. The NIM is latency-tuned, not throughput-tuned. Nvidia’s official DGX-Spark-variant for this model has a conservative max_num_seqs – it’s intended as a fast responsive utility, not a batched-extraction engine.

The takeaway for buyers: the Spark’s concurrency advantage is real but architecture-dependent. Transformer MoE models (gpt-oss) scale beautifully. Pure-transformer dense models (the 49B) scale well. Hybrid state-space models (Nemotron Nano Mamba-2) scale only up to a low plateau, then just queue.

Pick models to match the workload. Use the 9B for interactive single-user utility (26 tok/s, very snappy). Use gpt-oss 120B for batched throughput (860 TPS aggregate). Don’t assume a single benchmark tells you the hardware’s full profile.

The real test: a production pipeline, 30× faster and smarter

Theory is fine. Here’s what happened when I moved actual production work from the “fast” 9B to gpt-oss 120B.

The workload: structured extraction of LinkedIn job postings for Leap, my career-mobility platform. 123,849 real UK job descriptions. Each needs the same JSON schema extracted – title, skills, inferred capabilities, requirements, seniority, culture signals. Exactly the agentic extraction shape I described earlier.

The old pipeline, running since 12 March 2026: Nemotron Nano 9B at concurrency 4 on polaris:8002. It did 77,648 records in 24 days of wall clock before stalling on 5 April. That works out to 130 records/hour – about 2 per minute. “Fast” in single-stream terms, respectable in practice. It was the second choice: the pipeline started life on Llama 3.1 70B, but per-request latency made the 123k-record backlog feel impossible, so I swapped to the 9B purely for speed.

The new pipeline, rewritten to target gpt-oss 120B at concurrency 256: sustained ~65 completions per minute = ~3,900 records per hour. 30× faster wall-clock. The 45,000-record tail that would have taken another 15 days on the old runner gets chewed through in under a day.

And that 30× figure is apples-to-apples against a concurrency-4 baseline – not single-stream. Every tokens-per-second “review” of the Spark you read online is measuring a c=1 setup. If the old pipeline had been single-stream like those reviews, its rate would have been roughly a quarter of what I actually saw: ~32 records/hour. Against a c=1 baseline the same new runner is **120× faster**, not 30×. I’ve been understating the improvement because my “before” was already concurrent. The true single-stream-to-concurrency-first gap on this hardware is enormous.

Two real gotchas I hit getting there, both worth knowing about:

  1. Per-seq timeout trap. At c=256, per-seq decode is ~3-4 tps, so a 1024-token response takes 4-5 minutes on the wire. My initial 3-minute HTTP timeout on the client killed every request before vLLM could respond: the Node client logged zero completions, while vLLM logged 256 running requests and nothing visibly wrong. Bumping the client timeout to 15 minutes got completions arriving.
  2. Reasoning-mode token budget. gpt-oss is a reasoning model – chain-of-thought lives in message.reasoning, which shares the max_tokens budget with message.content. At max_tokens: 1024, complex job descriptions burned the budget on reasoning and truncated the JSON output, which then failed to parse client-side. Half the responses were silently dropping. Solution: reasoning_effort: "low" as a top-level request parameter drops chain-of-thought to a handful of tokens. Throughput went from 12/min (half silently parse-failing) to 65/min with clean parsing. That one flag was the single largest performance unlock of the night.
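Both gotchas reduce to client-side request shape. A sketch of what the fixed request looks like – field names follow the OpenAI-compatible chat API, with `reasoning_effort` as a top-level parameter as described above; verify the exact payload against your vLLM version:

```python
import json

def build_request(messages: list[dict]) -> dict:
    """Chat-completions payload tuned for high-concurrency extraction."""
    return {
        "model": "openai/gpt-oss-120b",
        "messages": messages,
        "max_tokens": 1024,
        "reasoning_effort": "low",  # keep chain-of-thought from eating the token budget
    }

REQUEST_TIMEOUT_S = 900  # 15 min: at c=256, ~3-4 tps per seq means a
                         # 1024-token response legitimately takes ~5 min

payload = build_request([{"role": "user",
                          "content": "Extract the characters as JSON."}])
print(json.dumps(payload)[:60])
```

Pass `REQUEST_TIMEOUT_S` to whatever HTTP client you use; the default timeouts of most clients are tuned for interactive latency, not high-concurrency decode.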

Here’s the part that shouldn’t work, according to the “active parameters matter most for output quality” intuition: the bigger model also produced better extractions. I had two records that went through both runners – same job description, different models. The side-by-side:

| Aspect | Nemotron 9B (~9B active, dense) | gpt-oss 120B (~5B active, MoE) |
|---|---|---|
| Explicit skills list | OK, some duplication / noise | Cleaner, consolidated |
| Inferred skills | Literal (“encoding techniques”) | Strategic (“accessibility awareness”, “CRM forecasting”, “cross-functional collaboration”) |
| Requirement decomposition | Compound strings | Atomised into individual items |
| Seniority labelling | Generic (“senior”) | Specific (“Director”) |

On the fields this pipeline actually cares about – inferred capability, transferable skills, requirement atomicity – the 120B MoE is materially better despite having fewer active parameters per token. Expert specialisation plus reasoning training beats raw active-param count on this workload.

The full scorecard: gpt-oss 120B at c=256 is ~30× faster and meaningfully better quality than Nemotron 9B at c=4 was on the same pipeline.

This is the concurrency-first thesis made concrete. The “fast” model wasn’t fast. The “big” model wasn’t slow. At the right concurrency, with the right flags, a smarter model outperforms a smaller “fast” model on both axes simultaneously. You pay no accuracy tax for the speedup – you gain accuracy.

The original Llama 70B → 9B swap was the wrong trade. I could have stayed on a bigger model the whole time and just fanned out instead of shrinking. A good reminder that “this model is too slow” almost always means “this model is too slow single-stream” – which is never the actual constraint on Spark.

What still doesn’t work on Blackwell

I’d be lying if I pretended the Spark has no rough edges. When I tried to benchmark Nemotron 3 Super 120B-A12B NVFP4 – theoretically an even better fit than gpt-oss thanks to NVFP4 – I hit two Blackwell kernel bugs in vLLM 26.03: a CUDA graph capture that crashed with illegal instruction, and a CUTLASS grouped-GEMM initialisation failure in FlashInfer’s MoE FP4 path. Both have workarounds that disable exactly the optimisations you want (eager mode, no FlashInfer MoE), crippling the benchmark.

Forum users on Nvidia’s DGX Spark board are reporting 65+ tok/s single-seq on that model with those optimisations working. The toolchain just isn’t fully cooked for all quantised-MoE combinations yet. That will improve in the next vLLM release, and I will retest then.

This is real-world Blackwell maturity in April 2026. Great for gpt-oss, Llama 3.1 family, and anything already native. Occasionally fiddly for cutting-edge quant+architecture combos. The trajectory is rapidly getting better.

The thing the reviews keep missing

Here’s what the Spark genuinely is, clear-eyed:

It’s 128 GB of unified memory on a compute substrate tuned specifically for Blackwell 4-bit MoE inference. You cannot replicate that with any consumer GPU, regardless of clock speed. A 5090 has more peak bandwidth per gigabyte, but it can only hold a fraction of any real modern model – the Spark will happily run a model 20× bigger, at batched throughputs the 5090 can never reach because the model doesn’t fit on it at all.

Comparing their tokens-per-second in chat is like comparing a desktop CPU’s single-thread performance to a server’s total throughput under fan-out load and declaring the desktop better. It’s not the same benchmark. It’s not even the same problem.

I was on the edge of returning mine after the first week of single-stream testing. I’m glad I didn’t. What I actually have is a personal capacity-and-concurrency box that lets me run a 120B-class reasoning model at near-cloud-API speed for interactive use, and at near-cloud-API aggregate throughput for batched work. At £4,000, for a dedicated-model-research appliance, that’s a genuinely good deal – but only once you stop asking it to be something it isn’t.

For anyone considering a Spark

Some actionable takeaways:

  1. Size your expectations to MoE models. gpt-oss 120B MXFP4 is the current flagship fit. Any MoE in the 20-130B range with a well-supported quantisation format on Blackwell will shine.
  2. Run vLLM, not just NIM. NIM is fine for quick-starts but its default flags aren’t always the best. vLLM’s --enable-prefix-caching is free throughput for any workload with repeated system prompts.
  3. Benchmark at concurrency you’ll actually use. Single-stream numbers tell you nothing about batched workloads. The only honest benchmark is one that matches your real dispatch pattern.
  4. Keep two models running. A small fast utility (9B-class) on one port and your heavy 120B-class on another. Route light tasks accordingly. Very small KV overhead.
  5. Stay close to the forums. NVIDIA’s DGX Spark forum has the current-state-of-the-art configs for each quant / model combination. The kernels are improving monthly.

The Spark gets better every month. The reviews get worse every month – because the software stack is now delivering, but nobody’s going back to re-run their launch-day benchmarks.


Benchmarks run 2026-04-21 on a single Nvidia DGX Spark, GB10 / 128 GB / 2 TB NVMe. Workload: structured character extraction, ~1.5K token prompts, 400 output tokens max. Both models served via nvcr.io/nvidia/vllm:26.03-py3 with standard flags. Full data, sweep scripts, and methodology: repository coming soon.