May 6, 2026 | Engineering

DGX Spark Optimisation, Four Iterations On: From Defaults to 296 Tokens Per Second Sustained

Four iterations of inference tuning on a single DGX Spark and gpt-oss-120b. Same hardware, same model: a 4.3x single-stream gain accumulated through flag tuning, an EAGLE3 negative result, a from-source vLLM build, and a one-line CUDA graph fix.

Key Takeaways

  • A single CLI flag, `--max-num-seqs 256`, recovers +22% aggregate throughput on gpt-oss-120b that the default vLLM image silently leaves on the floor.
  • EAGLE3 speculative decoding does not compose with concurrency-first inference. At saturation there is no idle compute for the draft model to fill.
  • Building vLLM from source via `spark-vllm-docker` unlocks FlashInfer attention with sinks, fastsafetensors loading, and the full CUTLASS MXFP4 layer coverage. Net 30% lift across the entire concurrency curve.
  • The canonical from-source recipe has a hidden c=5 and c=10 throughput dip caused by CUDA graph padding. One `--compilation-config` flag closes it, with a small c=256 prefill regression as the trade.
  • Same hardware, same model, same weights. End to end from naive defaults to current production: 4.3x single-stream and 2x sustained aggregate, all from tuning, build choices, and reading boot logs carefully.

A month ago I posted a benchmark of NVIDIA's DGX Spark hitting around 850 tokens/sec aggregate at c=256 on gpt-oss-120b MXFP4. The headline argument: stop measuring single-stream tokens-per-second the way every reviewer does, because it pattern-matches the wrong thing. Spark's superpower isn't latency. It's bandwidth shared across many concurrent streams. Pattern-match the strength and the device looks far stronger than the press makes out.

Since then I've put it through harder workloads – a research orchestrator that fans out a dozen sub-agents at a time, an enrichment pipeline that batches across the whole vault, a nightly compilation job that touches thousands of markdown files. Each one surfaced something worth writing about: a config gap, a negative result, or a structural quirk.

This is the consolidated story: four iterations of tuning on the same hardware, same model, same weights. End-to-end gain is ~4.3× single-stream and ~2× sustained aggregate versus naive defaults. Every step was tuning, build choices, and reading boot logs carefully. None of it required new tricks.

The hardware in question

For the avoidance of doubt, my box is an ASUS Ascent GX10, 2 TB variant. It's the OEM equivalent of NVIDIA's DGX Spark reference platform – same GB10 Grace-Blackwell Superchip, same 128 GB LPDDR5x unified memory pooled between CPU and GPU, same 273 GB/s aggregate memory bandwidth. CPU is the 20-core ARM (Cortex-X925 + A725) on the Grace side, Blackwell GPU on the other.

Throughout this article "Spark" refers to the platform family. The specific unit is the ASUS variant, but the silicon, drivers, and software stack are identical to anything else carrying the GB10 Superchip. Anyone running an NVIDIA-branded DGX Spark, ASUS GX10, or any other GB10 OEM box will see the same numbers from the same flags.

Iteration 1: a missing flag worth +22%

The April benchmark used this launch:

vllm serve openai/gpt-oss-120b \
  --gpu-memory-utilization 0.65 \
  --max-model-len 65536 \
  --enable-prefix-caching \
  --trust-remote-code

It runs. It serves. And it leaves performance on the floor.

The diagnosis came from production load. When I started dispatching realistic batches – 48 concurrent classification calls, 256 concurrent research questions – time-to-first-token blew up to 134 seconds. That's not a bandwidth problem, it's a queueing problem. vLLM's batch scheduler caps concurrent slots via --max-num-seqs, and the implicit default in the nvcr.io/nvidia/vllm:26.03-py3 image sat below 256. Requests beyond the cap queued instead of being batched – exactly the wrong shape for hardware whose superpower is multiplexing bandwidth across many streams.

Adding the flag explicitly:

--max-num-seqs 256

Re-ran the sweep on the same image, same hardware, same model. 1,042 tokens/sec aggregate at c=256. Apples-to-apples uplift over the April figure: +22% from one flag.
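
For reference, the full launch line with the fix folded in – everything else unchanged from April:

vllm serve openai/gpt-oss-120b \
  --gpu-memory-utilization 0.65 \
  --max-model-len 65536 \
  --enable-prefix-caching \
  --trust-remote-code \
  --max-num-seqs 256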

A methodology note worth being upfront about. My first re-measurement reported 1,145 tps. On review, the test prompts shared a long literal suffix that fed vLLM's prefix cache, inflating the headline by about 10%. I re-ran with prompts that share no token-aligned prefix: 1,042. Both numbers are valid – 1,145 is what production sees on prefix-shared workloads (system prompts, agent loops); 1,042 is the cleaner figure against the original article. Lead with the lower one.
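
If you want to guard your own benchmarks against the same trap, the cheapest fix is to give every prompt a unique opening so no two requests share a cacheable prefix. A minimal sketch – the prompt file names are placeholders:

# Prefix each benchmark prompt with a unique nonce so the prefix cache
# can never match across requests. prompts.txt holds one prompt per line.
while IFS= read -r prompt; do
  echo "[$(uuidgen)] $prompt"
done < prompts.txt > prompts-nocache.txt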

Iteration 2: EAGLE3 didn't help, and the why is more interesting than the result

With the new baseline established, the obvious next stop was speculative decoding. NVIDIA has been publishing EAGLE3 draft heads for gpt-oss-120b since August 2025 – small predictor models that propose tokens for the big model to verify, in theory cutting per-token latency by 40-60%.

Wired up the throughput-tuned draft via vLLM's speculative config:

--speculative-config '{
  "model": "nvidia/gpt-oss-120b-Eagle3-throughput",
  "method": "eagle3",
  "num_speculative_tokens": 1
}'

Same sweep. Result: ±2% across all concurrency levels. Within measurement noise.

The integration was working. vLLM's spec-decoding metrics showed mean acceptance length of 1.26-1.73 tokens and per-position acceptance rate 26-72% (averaging ~45%). The draft was making predictions, the target was accepting them, the speculation was happening. It just wasn't producing a speedup.
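
If you want to watch those numbers on your own server: vLLM exposes Prometheus counters on its /metrics endpoint, and the spec-decode ones are easy to isolate. Exact metric names vary across vLLM versions, and the port below assumes the default:

# Dump whatever spec-decode counters this build exports
curl -s http://localhost:8000/metrics | grep -i spec_decode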

Then the architectural argument clicked.

EAGLE3 wins when the GPU has idle cycles for the draft model to fill. It converts spare compute into lower latency by letting a small model run alongside the big one. On overprovisioned datacentre GPUs serving low-concurrency interactive traffic, the GPU sits half-idle while one user waits, and the draft uses that headroom productively.

Spark's architectural thesis is the opposite. It maximises concurrent utilisation. At c=256, bandwidth is fully booked. There's no idle slack for the draft to occupy. The draft cost gets paid on every token, but the speedup gets eaten by saturation. The two strategies are mutually exclusive at the limit.

There's a corroborating effect in the numbers. Per-stream acceptance rate drops from 0.73 at c=1 to 0.29 at c=256. At high concurrency each stream's draft gets less attention budget, so its predictions match worse, so the speculation pays off less often. Both effects compound.
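
A rough model makes the compounding concrete. With one speculative token per step and per-position acceptance rate α, each verify step emits 1 + α tokens on average, so

expected speedup ≤ (1 + α) / (1 + r),   where r = draft cost per step relative to the target's

At α = 0.73 (c=1) the ceiling is 1.73× before overhead; at α = 0.29 (c=256) it's 1.29× – and with bandwidth fully booked, r stops being negligible. A measured ±2% is consistent with r eating most of whatever α leaves on the table.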

The conclusion isn't "EAGLE3 is bad." It's "EAGLE3 and concurrency-first inference solve different problems and don't compose." On a Spark serving single-user interactive workloads, it might still pay. On a Spark running batch fan-out the way the device is actually optimised for, you're already at saturation and the draft can't help you.

Iteration 3: building from source unlocked the FlashInfer + CUTLASS path

Three optimisations were available in vLLM but unreachable in the prebuilt NGC image:

  • --attention-backend FLASHINFER rejected at startup with 'sink setting not supported' for gpt-oss attention sinks. The version in that image just isn't on the right code path.
  • --load-format fastsafetensors failed because the wheel wasn't installed.
  • --quantization mxfp4 --mxfp4-backend CUTLASS --mxfp4-layers moe,qkv,o,lm_head – the running container chose Marlin backend automatically because CUTLASS support needs a newer build.

NVIDIA's spark-vllm-docker repo has a build recipe (build-and-copy.sh --exp-mxfp4) that produces a vllm-node-mxfp4 image with all three available. 54 minutes of compile time later, a fresh container booted with the canonical recipe.
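
For concreteness, here's the shape of the launch line on the new image, folding the three previously-unreachable flags into the earlier tuning – a sketch of how they compose, not the canonical recipe verbatim:

vllm serve openai/gpt-oss-120b \
  --max-num-seqs 256 \
  --attention-backend FLASHINFER \
  --load-format fastsafetensors \
  --quantization mxfp4 --mxfp4-backend CUTLASS \
  --mxfp4-layers moe,qkv,o,lm_head \
  --gpu-memory-utilization 0.65 \
  --max-model-len 65536 \
  --enable-prefix-caching \
  --trust-remote-code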

Same hardware, same gpt-oss-120b MXFP4 weights, same llama-benchy 0.3.7, same prompt 2048 / gen 128 / 3 runs / --no-cache. Apples-to-apples uplift:

| c | Hand-tuned tg (tok/s) | From-source canonical tg (tok/s) | Δ |
|---:|---:|---:|---:|
| 1 | 34.8 | 58.3 | +67.6% |
| 2 | 61.4 | 79.3 | +29.1% |
| 4 | 89.0 | 103.7 | +16.5% |
| 16 | 165.5 | 184.6 | +11.6% |
| 64 | 201.8 | 219.5 | +8.8% |
| 128 | 234.5 | 259.0 | +10.5% |
| 256 | 259.5 | 295.8 | +14.0% |

Prefill aggregate climbs uniformly, +23% to +30% across the entire range. Single-stream jumps by two-thirds, which makes interactive Claude Code → brain calls noticeably snappier from the user's seat. And the c=256 sustained ceiling moves from 259.5 to 295.8 tps – the genuinely useful number for batch fan-out workloads.

That's the headline. But two concurrencies broke the trend.

Iteration 4: the cudagraph trap

Same sweep, same image, same flags:

| c | Hand-tuned tg (tok/s) | Canonical tg (tok/s) | Δ |
|---:|---:|---:|---:|
| 5 | 97.0 | 81.1 | −16.4% |
| 10 | 138.5 | 104.3 | −24.7% |

The from-source build is slower than the prebuilt image at exactly c=5 and c=10. Re-ran cold, re-ran warm. Both times the dip reproduced.

The cause showed up in the boot log. vLLM's compilation config lists every batch size it pre-captures CUDA graphs for. Vanilla canonical's auto-generated list is:

[1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, ...]

5 and 10 aren't in it. So at c=5 the model dispatches into graph size 8 with three slots padded full of dummy work. At c=10 it pads from 10 to 16 – six wasted slots. The padding overhead is exactly the dip.
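
The dispatch arithmetic is easy to sanity-check: the runtime picks the first captured size that fits the batch. A toy loop over the head of the list above reproduces both dips:

# Which captured CUDA graph a batch of c sequences dispatches into
# (capture list abbreviated from the boot log above)
sizes="1 2 4 8 16 24 32"
for c in 5 10; do
  for s in $sizes; do
    if [ "$s" -ge "$c" ]; then
      echo "c=$c -> graph size $s ($((s - c)) padded slots)"
      break
    fi
  done
done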

The hand-tuned image evidently uses a different (or no explicit) graph capture profile, so it scales linearly through those concurrencies. Demo concurrencies – 1, 2, 5, 10, the ones the Spark Arena leaderboard tests at – sit on the wrong side of the canonical recipe's optimisation choices.

The fix is one CLI flag:

--compilation-config '{
  "cudagraph_capture_sizes":
    [1, 2, 4, 5, 8, 10, 16, 24, 32, ..., 1024]
}'

Vanilla's 83 sizes plus 5 and 10. Two extra graphs to capture at boot, ~2-3% additional cold-start cost (most of cold-start is one-time inductor compilation, which is cached after first boot). Re-bench:

| c | Vanilla canonical tg (tok/s) | Patched canonical tg (tok/s) | Δ |
|---:|---:|---:|---:|
| 5 | 81.1 | 118.7 | +46.4% |
| 10 | 104.3 | 160.5 | +53.9% |

Dip closed. Patched canonical also picks up small wins at c=4, c=32, c=64, c=128 – sizes already in the vanilla list, a side-effect of the larger captured set affecting compile decisions.

The fix isn't free at peak batch – and that's worth being honest about

I expected the cudagraph fix to be neutral or slightly negative at high concurrencies. It's slightly worse than that:

| Metric | Vanilla c=256 | Patched c=256 | Δ |
|---|---:|---:|---:|
| pp aggregate | 6,629 ± 59 tps | 5,391 ± 24 tps | −18.7% |
| tg aggregate | 295.8 ± 1.5 tps | 285.8 ± 0.7 tps | −3.7% |

Verified across 8 runs. Reproducible, not noise. Likely cause: 85 captured graph sizes hold marginally more workspace memory than 83, squeezing KV cache headroom at peak batch – vLLM's scheduler responds with slightly less aggressive prefill batching.
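
The hypothesis is at least checkable: vLLM reports its KV cache allocation at boot, so comparing that line between the two configurations should show the squeeze directly. A sketch, with placeholder log paths (the exact wording of the line varies by version):

grep -i "kv cache" vanilla-boot.log patched-boot.log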

The user-visible token generation rate drops only 3.7% at c=256, which is what most workloads actually feel. Prefill drops 19%, which bites only when prefill dominates total latency – long inputs with short outputs. For a typical chunk in my enricher (~2k input, ~500 output), tg dominates total chunk time, so the end-to-end cost is closer to 5-10%.
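
Back-of-envelope for that chunk shape, assuming all 256 streams run concurrently and the aggregate rates divide evenly across them:

vanilla:  2048/(6,629/256) + 500/(295.8/256) ≈ 79 s + 433 s ≈ 512 s per chunk
patched:  2048/(5,391/256) + 500/(285.8/256) ≈ 97 s + 448 s ≈ 545 s per chunk   (≈ +6.5%)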

Net production decision: keep the cudagraph fix. The interactive c=1 win (+74% over hand-tuned) and the closed c=5/c=10 dip matter more for my workload mix than the c=256 prefill regression. Anyone optimising for sustained max-batch saturation should drop the cudagraph override and accept the dip at non-aligned concurrencies. The right answer depends on what shape your traffic actually has.

End-to-end ledger

Same hardware. Same model. Same weights.

| Stage | c=1 tg | c=10 tg | c=256 tg | c=256 peak |
|---|---:|---:|---:|---:|
| Default vLLM, no flags | ~14 | ~70 | ~700 | ~850 |
| `--max-num-seqs 256` fix | ~15 | ~103 | 1,042 | 1,140 |
| Hand-tuned image, all flags | 34.8 | 138.5 | 259.5 | 1,140 |
| Vanilla canonical from-source | 58.3 | 104.3 (dip) | 295.8 | 1,024 |
| Patched canonical (current prod) | 60.4 | 160.5 | 285.8 | 1,024 |

Single-stream gain: ~4.3×. Sustained aggregate gain at c=10: ~2.3×. At c=256 the sustained tg gain is ~1.4×, though the 1,140 tps peak from the earlier stages is the more striking comparison if you're benchmark-shopping.

Three takeaways for anyone running this hardware

  1. Set --max-num-seqs explicitly. Leaving it at the default costs an order of magnitude in TTFT under realistic concurrency, and the symptom (a blown-up TTFT) doesn't look like its cause (queueing).
  2. Build from source. The prebuilt NGC image is fine for getting started, but FlashInfer with sink support, fastsafetensors, and the full CUTLASS MXFP4 layer coverage all need spark-vllm-docker --exp-mxfp4. 54 minutes once buys you ~30% across the board.
  3. --compilation-config is the most underrated vLLM flag. It controls what CUDA graphs get captured at boot, which silently determines what concurrencies you'll be fast at. Default behaviour optimises for power-of-2 demo numbers and leaves gaps you only see when you bench non-aligned counts.

And one closing principle: optimisations don't compose by default. EAGLE3 doesn't compose with concurrency-first batching. The cudagraph fix doesn't compose with peak-batch saturation. The hand-tuned image's c=5/c=10 advantage doesn't compose with the from-source build's everywhere-else wins. Profile your actual traffic before locking a config in.

Sustained 1,000+ tokens/sec of LLM inference on a desk-side device is real. So is the ceiling at the limits, and so are the trade-offs in between. The most interesting findings are still the ones that contradict your prior – four iterations in, the negative result on EAGLE3 has taught me more about the device than any of the positive uplifts.


The full launch command, performance table, leaderboard comparisons, and trade-offs for this recipe live in the Engineering Recipes catalogue on this site, alongside any future configurations I publish. Next stop: NVIDIA's just-published Nemotron-3 Nano Omni 30B-3A multimodal – different architecture, different bottlenecks, probably different lessons.