openai/gpt-oss-120b on ASUS GX10 / DGX Spark (concurrency-first)
Concurrency-first recipe for openai/gpt-oss-120b on the GB10 Superchip. Beats the public Spark Arena leaderboard by +46% at c=5 and +54% at c=10 by adding non-aligned graph capture sizes to the canonical recipe. Trades ~19% c=256 prefill for big interactive and mid-band wins.
Why this recipe exists
The canonical spark-vllm-docker recipe for gpt-oss-120b is a great starting point, but it auto-generates CUDA graph capture sizes at [1, 2, 4, 8, 16, 24, 32, ..., 1024]. Five and ten are missing from that list. So at c=5 the model dispatches into the size-8 graph (3 of its 8 slots padded with dummy work), and at c=10 it pads up to 16 (6 of 16 slots wasted). The result is a 17-25% throughput dip at exactly the concurrencies the public Spark Arena leaderboard tests at.
This recipe overrides cudagraph_capture_sizes to include 5 and 10 alongside vanilla's 83 sizes. Two extra graphs to capture at boot, ~2-3% additional cold-start cost, and the dip closes. The full reasoning, methodology, and ledger of every iteration that got us here are in the linked write-up.
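To make the padding arithmetic concrete, the sketch below rounds a live batch size up to the next captured graph size the way the dispatcher does; it is an illustration of the dispatch behaviour only, not vLLM internals, and the capture list is truncated to its low end.

```bash
# Illustration only: round a batch size up to the next captured CUDA graph size
# and report how many of that graph's slots are padding.
for c in 4 5 8 10 16; do
  for size in 1 2 4 8 16 24 32; do          # low end of the vanilla capture list
    if [ "$size" -ge "$c" ]; then
      echo "c=$c -> graph size $size, $((size - c)) padded slots"
      break
    fi
  done
done
```

With 5 and 10 added to the capture list, those two concurrencies land on a graph exactly their size and the dummy work disappears.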
Reading the performance table
The table reports three throughput metrics. They measure different things on the same run.
- Prefill (pp): aggregate input-token ingest rate across all concurrent streams. How fast the system absorbs prompts.
- Generation (tg): aggregate output-token rate, sustained as a mean across the entire benchmark window.
- Peak (peak): highest single-second sample of the aggregate generation rate observed during the run.
A run that bursts to 1,024 tps in short windows but averages ~286 tps across the whole benchmark is a real, repeatable result. Both numbers are honest. Sustained mean is what most workloads actually feel; peak is what shows up in headline benchmarks of bursty traffic. The April baseline measurement on this hardware was ~850 tps peak. The hand-tuned image (the previous iteration) hit 1,140 tps peak. This patched canonical recipe trades a slightly lower peak (1,024) for a higher sustained mean (285.8 vs 259.5 tps tg at c=256), which is the trade I want for my workload mix.
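If you log the aggregate generation rate once per second during a run, both metrics fall out of the same samples. A minimal sketch, assuming a hypothetical samples.txt with one aggregate-tps value per line:

```bash
# Sustained mean = average of all per-second samples; peak = the single highest sample.
awk '{ sum += $1; n++; if ($1 > peak) peak = $1 }
     END { printf "sustained mean: %.1f tps   peak: %.1f tps\n", sum / n, peak }' samples.txt
```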
When to use this recipe
- Interactive workloads (c=1 to c=10): large win versus any default vLLM image. Single-stream throughput goes from ~14 tps (default) to 60.4 tps (this recipe).
- Batch fan-out (c=128 to c=256): wins at c=128 (+15.6% over hand-tuned baseline), small regression at c=256 prefill but tg holds.
- Spark Arena leaderboard concurrencies (c=5, c=10): closes the canonical recipe's dip, lifts above the published top.
When NOT to use this recipe
- Sustained max-batch saturation as your primary workload: the c=256 prefill regression matters more than the c=5/c=10 fix. Drop the `--compilation-config` override and use the vanilla canonical recipe.
- Quick experimentation on a fresh box: the from-source image build is a 54-minute commitment. If you just want to try gpt-oss-120b on Spark, start with `nvcr.io/nvidia/vllm:26.03-py3` plus `--max-num-seqs 256` and the basic optimisation flags (a quick-start sketch follows below). You will leave ~30% on the floor versus this recipe, but you can iterate in seconds.
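A quick-start sketch for that second case, using only flags that already appear in this recipe. The container name and the assumption that the NGC image accepts a `vllm serve ...` command override are mine; adjust to whatever the image actually ships.

```bash
# Hedged quick-start: NGC vLLM image, no from-source build, basic flags only.
docker run -d --name gptoss-120b-quick --runtime nvidia --gpus all --ipc host --shm-size 64g \
  -p 8000:8000 \
  nvcr.io/nvidia/vllm:26.03-py3 \
  vllm serve openai/gpt-oss-120b \
    --host 0.0.0.0 --port 8000 \
    --max-num-seqs 256 \
    --gpu-memory-utilization 0.70 \
    --enable-prefix-caching \
    --kv-cache-dtype fp8
```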
Performance
| Concurrency | Prefill (tps) | Generation (tps) | Peak (tps) |
|---|---|---|---|
| 1 | 5,115 | 60.4 | 60.4 |
| 2 | 5,317 | 80.3 | 84 |
| 4 | 5,935 | 107.2 | 121 |
| 5 | 6,297 | 118.7 | 128 |
| 8 | 6,518 | 141.9 | 173 |
| 10 | 6,721 | 160.5 | 192 |
| 16 | 6,846 | 181.2 | 272 |
| 32 | 6,732 | 202 | 384 |
| 64 | 6,998 | 229.2 | 568 |
| 128 | 7,034 | 271 | 768 |
| 256 | 5,391 | 285.8 | 1,024 |
vs Other Benchmarks
| Source | Concurrency | Theirs (tps) | Ours (tps) | Delta |
|---|---|---|---|---|
| Spark Arena (vanilla canonical recipe, no cudagraph fix) | 5 | 81.1 | 118.7 | +46.4% |
| Spark Arena (vanilla canonical recipe, no cudagraph fix) | 10 | 104.3 | 160.5 | +53.9% |
| Hand-tuned image (NGC vllm, --max-num-seqs 256) | 1 | 34.8 | 60.4 | +73.6% |
| Hand-tuned image (NGC vllm, --max-num-seqs 256) | 256 | 259.5 | 285.8 | +10.1% |
Key Flags
- `--max-num-seqs 256` - explicit batch slot cap, recovers +22% over the implicit default
- `--kv-cache-dtype fp8` - halves KV cache memory, frees concurrent slot headroom at high c
- `--quantization mxfp4 --mxfp4-backend CUTLASS` - explicit MXFP4 with the Blackwell-native CUTLASS path
- `--mxfp4-layers moe,qkv,o,lm_head` - extends MXFP4 quantisation beyond just the MoE layers
- `--attention-backend FLASHINFER` - sink-aware FlashInfer attention (requires the from-source build)
- `--load-format fastsafetensors` - parallel safetensors loader, faster cold boot
- `--reasoning-parser openai_gptoss` - separates CoT from response content (gpt-oss harmony format)
- `--tool-call-parser openai --enable-auto-tool-choice` - native gpt-oss tool calling
- `--compilation-config` with explicit `cudagraph_capture_sizes` including 5 and 10 - closes the c=5 / c=10 dip
- `VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1` - environment variable enabling the MoE-specific FlashInfer MXFP4 path
Trade-offs
- c=256 prefill regression: patched recipe loses ~19% on prefill aggregate at c=256 vs vanilla canonical (5,391 vs 6,629 tps). Token generation only drops ~3.7%. Acceptable for interactive-dominant workloads, less so for batch fan-out at peak saturation.
- First-ever cold boot is ~30 min for torch.compile / inductor compilation; subsequent boots from a warm cache take ~3-5 min (see the cache-mount sketch after this list).
- Image build cost: `spark-vllm-docker --exp-mxfp4` takes ~54 min on a Spark/GX10. One-off cost; the resulting image stays cached.
- Custom recipe: the cudagraph override is a one-line config, but it is not (yet) part of the official sparkrun-testing recipe registry. Submitting via `sparkrun arena benchmark` posts vanilla numbers, not the patched ones.
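For the cold-boot trade-off above, persisting the compile cache outside the container means only the first boot pays the ~30 minutes. A minimal sketch, assuming vLLM's default in-container cache location of ~/.cache/vllm (root user); the host path is an arbitrary example, and whether the image needs the serve command appended depends on its entrypoint.

```bash
# Keep the torch.compile / inductor cache on the host so restarts reuse it.
mkdir -p /data/vllm-compile-cache
docker run -d --name gptoss-120b --restart unless-stopped --runtime nvidia --gpus all \
  --ipc host --shm-size 64g -p 8000:8000 \
  -e VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1 \
  -v /data/vllm-compile-cache:/root/.cache/vllm \
  vllm-node-mxfp4 \
  vllm serve openai/gpt-oss-120b   # plus the full flag set from the launch command below
```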
Full Launch Command
vllm serve openai/gpt-oss-120b \
  --host 0.0.0.0 --port 8000 \
  --tool-call-parser openai \
  --reasoning-parser openai_gptoss \
  --enable-auto-tool-choice \
  --gpu-memory-utilization 0.70 \
  --enable-prefix-caching \
  --load-format fastsafetensors \
  --quantization mxfp4 \
  --mxfp4-backend CUTLASS \
  --mxfp4-layers moe,qkv,o,lm_head \
  --attention-backend FLASHINFER \
  --kv-cache-dtype fp8 \
  --max-num-seqs 256 \
  --max-num-batched-tokens 8192 \
  --compilation-config '{"cudagraph_capture_sizes":[1,2,4,5,8,10,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168,176,184,192,200,208,216,224,232,240,248,256,272,288,304,320,336,352,368,384,400,416,432,448,464,480,496,512,528,544,560,576,592,608,624,640,656,672,688,704,720,736,752,768,784,800,816,832,848,864,880,896,912,928,944,960,976,992,1008,1024]}'
# env: VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
# image: vllm-node-mxfp4 (built from spark-vllm-docker --exp-mxfp4)
# container: docker run -d --name gptoss-120b --restart unless-stopped --runtime nvidia --gpus all --ipc host --shm-size 64g -p 8000:8000 ...
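Once the container reports ready, a quick smoke test against vLLM's standard OpenAI-compatible endpoints confirms the server is up and the reasoning parser is splitting the output; the prompt below is just an example.

```bash
# List the served model, then run one chat completion.
curl -s http://localhost:8000/v1/models
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "openai/gpt-oss-120b",
        "messages": [{"role": "user", "content": "Reply with one short sentence."}],
        "max_tokens": 64
      }'
# With --reasoning-parser openai_gptoss active, the assistant message should carry the chain
# of thought in reasoning_content and the user-facing answer in content.
```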