A live catalogue of model and hardware combinations I've benchmarked, with the launch flags that produced the headline numbers and the trade-offs that came with them.
Most "model + hardware" posts on the internet are either marketing material or a single review with default flags. Neither tells you what the silicon is actually capable of when you tune it for the workload it's designed to run. Each recipe here is verified against a measurable benchmark, dated, and includes the trade-offs. The numbers are mine. The flags are yours to use, fork, or argue with.
Concurrency-first recipe for openai/gpt-oss-120b on the GB10 Superchip. Beats the public Spark Arena leaderboard by +46% at c=5 and +54% at c=10 by adding non-aligned graph capture sizes to the canonical recipe. Trades ~19% c=256 prefill for big interactive and mid-band wins.
The canonical spark-vllm-docker recipe for gpt-oss-120b is a great starting point, but it auto-generates CUDA graph capture sizes at [1, 2, 4, 8, 16, 24, 32, ..., 1024]. Five and ten are missing from that list, so at c=5 the model dispatches into graph size 8 (3 of its 8 slots padded with dummy work), and at c=10 it pads to 16 (6 of 16 wasted). The result is a 17-25% throughput dip at exactly the concurrencies the public Spark Arena leaderboard tests at.
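A quick way to see the padding arithmetic (illustrative only; the real dispatch happens inside vLLM's cudagraph runner):

```bash
# Vanilla's smallest capture sizes are 1, 2, 4, 8, 16; a batch pads up to the
# next captured size, and every padded slot is wasted work.
for c in 5 10; do
  for g in 1 2 4 8 16; do
    if [ "$c" -le "$g" ]; then
      echo "c=$c -> graph size $g: $((g - c)) of $g slots are padding"
      break
    fi
  done
done
```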
This recipe overrides cudagraph_capture_sizes to include 5 and 10 alongside vanilla's 83 sizes. Two extra graphs to capture at boot, ~2-3% additional cold-start cost, and the dip closes. The full reasoning, methodology, and ledger of every iteration that got us here are in the linked write-up.
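If you'd rather generate the list than paste it, this one-liner reproduces the exact JSON passed to `--compilation-config` below (the vanilla progression of 1, 2, 4, 8, 16, then 24..256 in steps of 8, then 272..1024 in steps of 16, is inferred from the generated list):

```bash
# Rebuild vanilla's 83 capture sizes, splice in 5 and 10, emit the JSON override.
sizes=$({ printf '%s\n' 1 2 4 5 8 10 16; seq 24 8 256; seq 272 16 1024; } | sort -n | paste -sd, -)
echo "{\"cudagraph_capture_sizes\":[$sizes]}"
```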
The table below reports three throughput metrics. They measure different things on the same run.

A model that bursts to 1,024 tps for short windows but averages 285 tps over a longer run is a real, repeatable result. Both numbers are honest. Sustained mean is what most workloads actually feel; peak is what shows up in headline benchmarks of bursty traffic. The April baseline measurement on this hardware was ~850 tps peak. The hand-tuned image (the previous iteration) hit 1,140 tps peak. This patched canonical recipe trades a slightly lower peak (1,024) for a higher sustained mean (+10% on tg vs hand-tuned at c=256, per the comparison table below), which is the trade I want for my workload mix.
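Both metrics fall out of the same per-window samples. A minimal sketch, assuming your benchmark harness logs one decode-throughput sample per line (`tg.log` is a hypothetical filename):

```bash
# Sustained mean vs peak from the same run's per-window throughput samples.
awk '{ sum += $1; if ($1 > peak) peak = $1 }
     END { printf "sustained mean: %.1f tps  peak: %.1f tps\n", sum / NR, peak }' tg.log
```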
If you'd rather skip the tuning, drop the `--compilation-config` override and use the vanilla canonical recipe: nvcr.io/nvidia/vllm:26.03-py3 plus `--max-num-seqs 256` and the basic optimisation flags. You will leave ~30% on the floor versus this recipe, but you can iterate in seconds.

| Concurrency | Prefill (tps) | Generation (tps) | Peak (tps) |
|---|---|---|---|
| 1 | 5,115 | 60.4 | 60.4 |
| 2 | 5,317 | 80.3 | 84 |
| 4 | 5,935 | 107.2 | 121 |
| 5 | 6,297 | 118.7 | 128 |
| 8 | 6,518 | 141.9 | 173 |
| 10 | 6,721 | 160.5 | 192 |
| 16 | 6,846 | 181.2 | 272 |
| 32 | 6,732 | 202 | 384 |
| 64 | 6,998 | 229.2 | 568 |
| 128 | 7,034 | 271 | 768 |
| 256 | 5,391 | 285.8 | 1,024 |
Against the public baselines:

| Source | Concurrency | Theirs (tps) | Ours (tps) | Delta |
|---|---|---|---|---|
| Spark Arena (vanilla canonical recipe, no cudagraph fix) | 5 | 81.1 | 118.7 | +46.4% |
| Spark Arena (vanilla canonical recipe, no cudagraph fix) | 10 | 104.3 | 160.5 | +53.9% |
| Hand-tuned image (NGC vllm, --max-num-seqs 256) | 1 | 34.8 | 60.4 | +73.6% |
| Hand-tuned image (NGC vllm, --max-num-seqs 256) | 256 | 259.5 | 285.8 | +10.1% |
Flag by flag:

- `--max-num-seqs 256` - explicit batch slot cap, recovers +22% over the implicit default
- `--kv-cache-dtype fp8` - halves KV cache memory, frees concurrent slot headroom at high c
- `--quantization mxfp4 --mxfp4-backend CUTLASS` - explicit MXFP4 with the Blackwell-native CUTLASS path
- `--mxfp4-layers moe,qkv,o,lm_head` - extends MXFP4 quantisation beyond just the MoE layers
- `--attention-backend FLASHINFER` - sink-aware FlashInfer attention (requires from-source build)
- `--load-format fastsafetensors` - parallel safetensors loader, faster cold boot
- `--reasoning-parser openai_gptoss` - separates CoT from response content (gpt-oss harmony format)
- `--tool-call-parser openai --enable-auto-tool-choice` - native gpt-oss tool calling
- `--compilation-config` with explicit `cudagraph_capture_sizes` including 5 and 10 - closes the c=5 / c=10 dip
- `VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1` - environment variable enabling the MoE-specific FlashInfer MXFP4 path

Two caveats: building with `spark-vllm-docker --exp-mxfp4` takes ~54 min on a Spark/GX10 (one-off cost; the resulting image stays cached), and the `sparkrun` arena benchmark posts vanilla numbers, not patched.

The full launch command:

```bash
# env: VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
# image: vllm-node-mxfp4 (built from spark-vllm-docker --exp-mxfp4)
# container: docker run -d --name gptoss-120b --restart unless-stopped --runtime nvidia --gpus all --ipc host --shm-size 64g -p 8000:8000 ...
vllm serve openai/gpt-oss-120b \
  --host 0.0.0.0 --port 8000 \
  --tool-call-parser openai \
  --reasoning-parser openai_gptoss \
  --enable-auto-tool-choice \
  --gpu-memory-utilization 0.70 \
  --enable-prefix-caching \
  --load-format fastsafetensors \
  --quantization mxfp4 \
  --mxfp4-backend CUTLASS \
  --mxfp4-layers moe,qkv,o,lm_head \
  --attention-backend FLASHINFER \
  --kv-cache-dtype fp8 \
  --max-num-batched-tokens 8192 \
  --compilation-config '{"cudagraph_capture_sizes":[1,2,4,5,8,10,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168,176,184,192,200,208,216,224,232,240,248,256,272,288,304,320,336,352,368,384,400,416,432,448,464,480,496,512,528,544,560,576,592,608,624,640,656,672,688,704,720,736,752,768,784,800,816,832,848,864,880,896,912,928,944,960,976,992,1008,1024]}'
```
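Once it's serving, a quick sanity check that the reasoning parser is actually splitting CoT from the answer. Field names follow vLLM's OpenAI-compatible API as I understand it; adjust the jq filter if your build differs:

```bash
# Smoke-test: the reasoning parser should put CoT in reasoning_content
# and the final answer in content.
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "openai/gpt-oss-120b",
       "messages": [{"role": "user", "content": "What is 17 * 23?"}],
       "max_tokens": 256}' |
  jq '.choices[0].message | {reasoning_content, content}'
```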