openai/gpt-oss-120b on ASUS GX10 / DGX Spark (concurrency-first)
Concurrency-first recipe for openai/gpt-oss-120b on the GB10 Superchip. Beats the public Spark Arena leaderboard by +46% at c=5 and +54% at c=10 by adding non-aligned graph capture sizes to the canonical recipe. Trades ~19% c=256 prefill for big interactive and mid-band wins.
Why this recipe exists
The canonical spark-vllm-docker recipe for gpt-oss-120b is a great starting point, but it auto-generates CUDA graph capture sizes at [1, 2, 4, 8, 16, 24, 32, ..., 1024]. Five and ten are missing from that list. So at c=5 the model dispatches into the size-8 graph (3 of its 8 slots padded with dummy work), and at c=10 it pads up to 16 (6 of 16 slots wasted). The result is a 17-25% throughput dip at exactly the concurrencies the public Spark Arena leaderboard tests at.
This recipe overrides cudagraph_capture_sizes to include 5 and 10 alongside vanilla's 83 sizes. Two extra graphs to capture at boot, ~2-3% additional cold-start cost, and the dip closes. The full reasoning, methodology, and ledger of every iteration that got us here are in the linked write-up.
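To make the padding arithmetic concrete, the sketch below rounds a live batch size up to the next captured graph size the way the dispatcher does; it is an illustration of the dispatch behaviour only, not vLLM internals, and the capture list is truncated to its low end.

```bash
# Illustration only: round a batch size up to the next captured CUDA graph size
# and report how many of that graph's slots are padding.
for c in 4 5 8 10 16; do
  for size in 1 2 4 8 16 24 32; do          # low end of the vanilla capture list
    if [ "$size" -ge "$c" ]; then
      echo "c=$c -> graph size $size, $((size - c)) padded slots"
      break
    fi
  done
done
```

With 5 and 10 added to the capture list, those two concurrencies land on a graph exactly their size and the dummy work disappears.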
Reading the performance table
The table reports three throughput metrics. They measure different things on the same run.
- Prefill (pp): aggregate input-token ingest rate across all concurrent streams. How fast the system absorbs prompts.
- Generation (tg): aggregate output-token rate, sustained as a mean across the entire benchmark window.
- Peak (peak): highest single-second sample of the aggregate generation rate observed during the run.
A run that bursts to 1,024 tps in short windows but averages ~286 tps across the whole benchmark is a real, repeatable result. Both numbers are honest. Sustained mean is what most workloads actually feel; peak is what shows up in headline benchmarks of bursty traffic. The April baseline measurement on this hardware was ~850 tps peak. The hand-tuned image (the previous iteration) hit 1,140 tps peak. This patched canonical recipe trades a slightly lower peak (1,024) for a higher sustained mean (285.8 vs 259.5 tps tg at c=256), which is the trade I want for my workload mix.
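If you log the aggregate generation rate once per second during a run, both metrics fall out of the same samples. A minimal sketch, assuming a hypothetical samples.txt with one aggregate-tps value per line:

```bash
# Sustained mean = average of all per-second samples; peak = the single highest sample.
awk '{ sum += $1; n++; if ($1 > peak) peak = $1 }
     END { printf "sustained mean: %.1f tps   peak: %.1f tps\n", sum / n, peak }' samples.txt
```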
When to use this recipe
- Interactive workloads (c=1 to c=10): large win versus any default vLLM image. Single-stream throughput goes from ~14 tps (default) to 60.4 tps (this recipe).
- Batch fan-out (c=128 to c=256): wins at c=128 (+15.6% over hand-tuned baseline), small regression at c=256 prefill but tg holds.
- Spark Arena leaderboard concurrencies (c=5, c=10): closes the canonical recipe's dip, lifts above the published top.
When NOT to use this recipe
- Sustained max-batch saturation as your primary workload: the c=256 prefill regression matters more than the c=5/c=10 fix. Drop the `--compilation-config` override and use the vanilla canonical recipe.
- Quick experimentation on a fresh box: the from-source image build is a 54-minute commitment. If you just want to try gpt-oss-120b on Spark, start with `nvcr.io/nvidia/vllm:26.03-py3` plus `--max-num-seqs 256` and the basic optimisation flags (a quick-start sketch follows below). You will leave ~30% on the floor versus this recipe, but you can iterate in seconds.
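A quick-start sketch for that second case, using only flags that already appear in this recipe. The container name and the assumption that the NGC image accepts a `vllm serve ...` command override are mine; adjust to whatever the image actually ships.

```bash
# Hedged quick-start: NGC vLLM image, no from-source build, basic flags only.
docker run -d --name gptoss-120b-quick --runtime nvidia --gpus all --ipc host --shm-size 64g \
  -p 8000:8000 \
  nvcr.io/nvidia/vllm:26.03-py3 \
  vllm serve openai/gpt-oss-120b \
    --host 0.0.0.0 --port 8000 \
    --max-num-seqs 256 \
    --gpu-memory-utilization 0.70 \
    --enable-prefix-caching \
    --kv-cache-dtype fp8
```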
Performance
| Concurrency | Prefill (tps) | Generation (tps) | Peak (tps) |
|---|---|---|---|
| 1 | 5,115 | 60.4 | 60.4 |
| 2 | 5,317 | 80.3 | 84 |
| 4 | 5,935 | 107.2 | 121 |
| 5 | 6,297 | 118.7 | 128 |
| 8 | 6,518 | 141.9 | 173 |
| 10 | 6,721 | 160.5 | 192 |
| 16 | 6,846 | 181.2 | 272 |
| 32 | 6,732 | 202 | 384 |
| 64 | 6,998 | 229.2 | 568 |
| 128 | 7,034 | 271 | 768 |
| 256 | 5,391 | 285.8 | 1,024 |
vs Other Benchmarks
| Source | Concurrency | Theirs (tps) | Ours (tps) | Delta |
|---|---|---|---|---|
| Spark Arena (vanilla canonical recipe, no cudagraph fix) | 5 | 81.1 | 118.7 | +46.4% |
| Spark Arena (vanilla canonical recipe, no cudagraph fix) | 10 | 104.3 | 160.5 | +53.9% |
| Hand-tuned image (NGC vllm, --max-num-seqs 256) | 1 | 34.8 | 60.4 | +73.6% |
| Hand-tuned image (NGC vllm, --max-num-seqs 256) | 256 | 259.5 | 285.8 | +10.1% |
Key Flags
- `--max-num-seqs 256` - explicit batch slot cap, recovers +22% over the implicit default
- `--kv-cache-dtype fp8` - halves KV cache memory, frees concurrent slot headroom at high c
- `--quantization mxfp4 --mxfp4-backend CUTLASS` - explicit MXFP4 with the Blackwell-native CUTLASS path
- `--mxfp4-layers moe,qkv,o,lm_head` - extends MXFP4 quantisation beyond just the MoE layers
- `--attention-backend FLASHINFER` - sink-aware FlashInfer attention (requires the from-source build)
- `--load-format fastsafetensors` - parallel safetensors loader, faster cold boot
- `--reasoning-parser openai_gptoss` - separates CoT from response content (gpt-oss harmony format)
- `--tool-call-parser openai --enable-auto-tool-choice` - native gpt-oss tool calling
- `--compilation-config` with explicit `cudagraph_capture_sizes` including 5 and 10 - closes the c=5 / c=10 dip
- `VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1` - environment variable enabling the MoE-specific FlashInfer MXFP4 path
Trade-offs
- c=256 prefill regression: patched recipe loses ~19% on prefill aggregate at c=256 vs vanilla canonical (5,391 vs 6,629 tps). Token generation only drops ~3.7%. Acceptable for interactive-dominant workloads, less so for batch fan-out at peak saturation.
- First-ever cold boot is ~30 min for torch.compile / inductor compilation; subsequent boots from a warm cache take ~3-5 min (see the cache-mount sketch after this list).
- Image build cost: `spark-vllm-docker --exp-mxfp4` takes ~54 min on a Spark/GX10. One-off cost; the resulting image stays cached.
- Custom recipe: the cudagraph override is a one-line config, but it is not (yet) part of the official sparkrun-testing recipe registry. Submitting via `sparkrun arena benchmark` posts vanilla numbers, not the patched ones.
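For the cold-boot trade-off above, persisting the compile cache outside the container means only the first boot pays the ~30 minutes. A minimal sketch, assuming vLLM's default in-container cache location of ~/.cache/vllm (root user); the host path is an arbitrary example, and whether the image needs the serve command appended depends on its entrypoint.

```bash
# Keep the torch.compile / inductor cache on the host so restarts reuse it.
mkdir -p /data/vllm-compile-cache
docker run -d --name gptoss-120b --restart unless-stopped --runtime nvidia --gpus all \
  --ipc host --shm-size 64g -p 8000:8000 \
  -e VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1 \
  -v /data/vllm-compile-cache:/root/.cache/vllm \
  vllm-node-mxfp4 \
  vllm serve openai/gpt-oss-120b   # plus the full flag set from the launch command below
```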
Full Launch Command
vllm serve openai/gpt-oss-120b \
  --host 0.0.0.0 --port 8000 \
  --tool-call-parser openai \
  --reasoning-parser openai_gptoss \
  --enable-auto-tool-choice \
  --gpu-memory-utilization 0.70 \
  --enable-prefix-caching \
  --load-format fastsafetensors \
  --quantization mxfp4 \
  --mxfp4-backend CUTLASS \
  --mxfp4-layers moe,qkv,o,lm_head \
  --attention-backend FLASHINFER \
  --kv-cache-dtype fp8 \
  --max-num-seqs 256 \
  --max-num-batched-tokens 8192 \
  --compilation-config '{"cudagraph_capture_sizes":[1,2,4,5,8,10,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168,176,184,192,200,208,216,224,232,240,248,256,272,288,304,320,336,352,368,384,400,416,432,448,464,480,496,512,528,544,560,576,592,608,624,640,656,672,688,704,720,736,752,768,784,800,816,832,848,864,880,896,912,928,944,960,976,992,1008,1024]}'
# env: VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
# image: vllm-node-mxfp4 (built from spark-vllm-docker --exp-mxfp4)
# container: docker run -d --name gptoss-120b --restart unless-stopped --runtime nvidia --gpus all --ipc host --shm-size 64g -p 8000:8000 ...
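Once the container reports ready, a quick smoke test against vLLM's standard OpenAI-compatible endpoints confirms the server is up and the reasoning parser is splitting the output; the prompt below is just an example.

```bash
# List the served model, then run one chat completion.
curl -s http://localhost:8000/v1/models
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "openai/gpt-oss-120b",
        "messages": [{"role": "user", "content": "Reply with one short sentence."}],
        "max_tokens": 64
      }'
# With --reasoning-parser openai_gptoss active, the assistant message should carry the chain
# of thought in reasoning_content and the user-facing answer in content.
```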