Engineering | Recipes

LLM Inference Recipes

A live catalogue of model and hardware combinations I've benchmarked, with the launch flags that produced the headline numbers and the trade-offs that came with them.

Most "model + hardware" posts on the internet are either marketing material or a single review with default flags. Neither tells you what the silicon is actually capable of when you tune it for the workload it's designed to run. Each recipe here is verified against a measurable benchmark, dated, and includes the trade-offs. The numbers are mine. The flags are yours to use, fork, or argue with.

openai/gpt-oss-120b on ASUS GX10 / DGX Spark (concurrency-first)

production

Concurrency-first recipe for openai/gpt-oss-120b on the GB10 Superchip. Beats the public Spark Arena leaderboard by +46% at c=5 and +54% at c=10 by adding non-aligned graph capture sizes to the canonical recipe. Trades ~19% c=256 prefill for big interactive and mid-band wins.

ASUS GX10 / NVIDIA DGX Spark (GB10 Superchip, 128 GB LPDDR5x, 273 GB/s) | MXFP4 (moe + qkv + o + lm_head) | verified 2026-05-06

Why this recipe exists

The canonical spark-vllm-docker recipe for gpt-oss-120b is a great starting point, but it auto-generates CUDA graph capture sizes at [1, 2, 4, 8, 16, 24, 32, ..., 1024]. Five and ten are missing from that list. So at c=5 the model dispatches into the size-8 graph (3 of its 8 slots padded with dummy work), and at c=10 it pads to 16 (6 of 16 slots wasted). The result is a 17-25% throughput dip at exactly the concurrencies the public Spark Arena leaderboard tests at.

This recipe overrides cudagraph_capture_sizes to include 5 and 10 alongside vanilla's 83 sizes. Two extra graphs to capture at boot, ~2-3% additional cold-start cost, and the dip closes. The full reasoning, methodology, and ledger of every iteration that got us here are in the linked write-up.
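To make the padding arithmetic concrete, here is a small Python sketch. It assumes the dispatch rule described above (a batch of n live sequences runs in the smallest captured graph whose size is at least n, with the leftover slots padded), and the vanilla capture-size list is abridged; this is an illustration, not vLLM's scheduler code.

# Illustration of the dispatch-and-pad behaviour described above (not vLLM internals).
import bisect

VANILLA = [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64]   # abridged vanilla capture sizes
PATCHED = sorted(set(VANILLA) | {5, 10})              # this recipe's two extra sizes

def graph_for(n, sizes):
    """Smallest captured graph size that can hold n live sequences."""
    return sizes[bisect.bisect_left(sizes, n)]

for c in (5, 10):
    for name, sizes in (("vanilla", VANILLA), ("patched", PATCHED)):
        g = graph_for(c, sizes)
        print(f"c={c:>2} {name}: graph size {g:>2}, {g - c} of {g} slots padded")

# c= 5 vanilla: graph size  8, 3 of 8 slots padded
# c= 5 patched: graph size  5, 0 of 5 slots padded
# c=10 vanilla: graph size 16, 6 of 16 slots padded
# c=10 patched: graph size 10, 0 of 10 slots padded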

Reading the performance table

The table reports three throughput metrics. They measure different things on the same run.

  • Prefill (pp): aggregate input-token ingest rate across all concurrent streams. How fast the system absorbs prompts.
  • Generation (tg): aggregate output-token rate, sustained as a mean across the entire benchmark window.
  • Peak (peak): highest single-second sample of the aggregate generation rate observed during the run.

A run that bursts to 1,024 tps in short windows but averages 285 tps across the whole benchmark is a real, repeatable result, and both numbers are honest. Sustained mean is what most workloads actually feel; peak is what shows up in headline benchmarks of bursty traffic (the sketch below shows how the two relate). The April baseline measurement on this hardware was ~850 tps peak. The hand-tuned image (the previous iteration) hit 1,140 tps peak. This patched canonical recipe trades a slightly lower peak (1,024 tps) for a higher sustained mean (~+10% on tg vs hand-tuned at c=256, per the comparison table below), which is the trade I want for my workload mix.
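A minimal sketch of how the two generation numbers relate, using made-up per-second samples of the aggregate output-token rate (the values are illustrative, not measurements):

# Illustrative per-second samples of the aggregate generation rate (not real data).
samples_tps = [180, 240, 310, 1024, 990, 640, 260, 220, 205, 190]

sustained_mean = sum(samples_tps) / len(samples_tps)   # the "Generation (tg)" column
peak = max(samples_tps)                                # the "Peak" column

print(f"sustained mean: {sustained_mean:.1f} tps")
print(f"peak 1-second sample: {peak} tps")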

When to use this recipe

  • Interactive workloads (c=1 to c=10): large win versus any default vLLM image. Single-stream throughput goes from ~14 tps (default) to 60.4 tps (this recipe).
  • Batch fan-out (c=128 to c=256): wins at c=128 (+15.6% over hand-tuned baseline), small regression at c=256 prefill but tg holds.
  • Spark Arena leaderboard concurrencies (c=5, c=10): closes the canonical recipe's dip, lifts above the published top.

When NOT to use this recipe

  • Sustained max-batch saturation as your primary workload: the c=256 prefill regression matters more than the c=5/c=10 fix. Drop the --compilation-config override and use the vanilla canonical recipe.
  • Quick experimentation on a fresh box: the from-source image build is a 54-minute commitment. If you just want to try gpt-oss-120b on Spark, start with nvcr.io/nvidia/vllm:26.03-py3 plus --max-num-seqs 256 and the basic optimisation flags. You will leave ~30% on the floor versus this recipe, but you can iterate in seconds.

Serving: vLLM via spark-vllm-docker (vllm-node-mxfp4 image, FlashInfer + CUTLASS)

Performance

Concurrency | Prefill (tps) | Generation (tps) | Peak (tps)
1           | 5,115         | 60.4             | 60.4
2           | 5,317         | 80.3             | 84
4           | 5,935         | 107.2            | 121
5           | 6,297         | 118.7            | 128
8           | 6,518         | 141.9            | 173
10          | 6,721         | 160.5            | 192
16          | 6,846         | 181.2            | 272
32          | 6,732         | 202              | 384
64          | 6,998         | 229.2            | 568
128         | 7,034         | 271              | 768
256         | 5,391         | 285.8            | 1,024
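If you want a rough sanity check of any single row, a minimal fixed-concurrency probe against the OpenAI-compatible endpoint the launch command below exposes on port 8000 looks something like this. It is not the harness behind these numbers; the prompt, token budget, and timing window are purely illustrative.

# Rough fixed-concurrency throughput probe (illustrative, not the benchmark harness).
import time
import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/v1/completions"
CONCURRENCY = 5
PROMPT = "Summarise the trade-offs of MXFP4 quantisation in three sentences."

def one_request(_):
    resp = requests.post(URL, json={
        "model": "openai/gpt-oss-120b",
        "prompt": PROMPT,
        "max_tokens": 256,
    }, timeout=600)
    resp.raise_for_status()
    return resp.json()["usage"]["completion_tokens"]

start = time.time()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    tokens = list(pool.map(one_request, range(CONCURRENCY)))
elapsed = time.time() - start

print(f"aggregate tg ~ {sum(tokens) / elapsed:.1f} tps at c={CONCURRENCY}")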

vs Other Benchmarks

Source                                                    | Concurrency | Theirs (tps) | Ours (tps) | Delta
Spark Arena (vanilla canonical recipe, no cudagraph fix)  | 5           | 81.1         | 118.7      | +46.4%
Spark Arena (vanilla canonical recipe, no cudagraph fix)  | 10          | 104.3        | 160.5      | +53.9%
Hand-tuned image (NGC vllm, --max-num-seqs 256)           | 1           | 34.8         | 60.4       | +73.6%
Hand-tuned image (NGC vllm, --max-num-seqs 256)           | 256         | 259.5        | 285.8      | +10.1%

Key Flags

  • --max-num-seqs 256 - explicit batch slot cap, recovers +22% over the implicit default
  • --kv-cache-dtype fp8 - halves KV cache memory, frees concurrent slot headroom at high c
  • --quantization mxfp4 --mxfp4-backend CUTLASS - explicit MXFP4 with the Blackwell-native CUTLASS path
  • --mxfp4-layers moe,qkv,o,lm_head - extends MXFP4 quantisation beyond just MoE layers
  • --attention-backend FLASHINFER - sink-aware FlashInfer attention (requires from-source build)
  • --load-format fastsafetensors - parallel safetensors loader, faster cold boot
  • --reasoning-parser openai_gptoss - separates CoT from response content (gpt-oss harmony format)
  • --tool-call-parser openai --enable-auto-tool-choice - native gpt-oss tool calling
  • --compilation-config with explicit cudagraph_capture_sizes including 5 and 10 - closes the c=5 / c=10 dip (a generator for the full list is sketched below)
  • VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1 - environment variable enabling the MoE-specific FlashInfer MXFP4 path
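The --compilation-config value in the full launch command below is just the vanilla capture-size list with 5 and 10 spliced in. A small generator saves hand-editing it; my reconstruction of the vanilla pattern (1, 2, 4, 8, then steps of 8 up to 256, then steps of 16 up to 1024) is read off the literal list below, so verify it against your build's defaults.

# Generate the patched cudagraph_capture_sizes list instead of hand-editing it.
import json

# Assumed vanilla pattern, reconstructed from the literal list in the launch command.
vanilla = [1, 2, 4, 8] + list(range(16, 257, 8)) + list(range(272, 1025, 16))
patched = sorted(set(vanilla) | {5, 10})   # splice in the two non-aligned sizes
assert len(vanilla) == 83 and len(patched) == 85

flag = json.dumps({"cudagraph_capture_sizes": patched}, separators=(",", ":"))
print(f"--compilation-config '{flag}'")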

Trade-offs

  • c=256 prefill regression: patched recipe loses ~19% on prefill aggregate at c=256 vs vanilla canonical (5,391 vs 6,629 tps). Token generation only drops ~3.7%. Acceptable for interactive-dominant workloads, less so for batch fan-out at peak saturation.
  • Cold-boot cost: the first-ever boot takes ~30 min for torch.compile / Inductor compilation. Subsequent boots from a warm cache take ~3-5 min.
  • Image build cost: spark-vllm-docker --exp-mxfp4 takes ~54 min on a Spark/GX10. One-off cost; the resulting image stays cached.
  • Custom recipe: the cudagraph override is a one-line config, but it is not (yet) part of the official sparkrun-testing recipe registry. Submitting via sparkrun arena benchmark posts vanilla numbers, not patched.

Full Launch Command

vllm serve openai/gpt-oss-120b \
  --host 0.0.0.0 --port 8000 \
  --tool-call-parser openai \
  --reasoning-parser openai_gptoss \
  --enable-auto-tool-choice \
  --gpu-memory-utilization 0.70 \
  --enable-prefix-caching \
  --load-format fastsafetensors \
  --quantization mxfp4 \
  --mxfp4-backend CUTLASS \
  --mxfp4-layers moe,qkv,o,lm_head \
  --attention-backend FLASHINFER \
  --kv-cache-dtype fp8 \
  --max-num-seqs 256 \
  --max-num-batched-tokens 8192 \
  --compilation-config '{"cudagraph_capture_sizes":[1,2,4,5,8,10,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168,176,184,192,200,208,216,224,232,240,248,256,272,288,304,320,336,352,368,384,400,416,432,448,464,480,496,512,528,544,560,576,592,608,624,640,656,672,688,704,720,736,752,768,784,800,816,832,848,864,880,896,912,928,944,960,976,992,1008,1024]}'

# env: VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
# image: vllm-node-mxfp4 (built from spark-vllm-docker --exp-mxfp4)
# container: docker run -d --name gptoss-120b --restart unless-stopped --runtime nvidia --gpus all --ipc host --shm-size 64g -p 8000:8000 ...
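
Once the server is up, a quick smoke test confirms the reasoning parser is separating the chain of thought from the final answer. The reasoning_content field name follows vLLM's reasoning-parser response format; treat the exact schema as an assumption to verify against your build.

# Smoke test: with --reasoning-parser openai_gptoss, CoT should arrive separately.
import requests

resp = requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "openai/gpt-oss-120b",
    "messages": [{"role": "user", "content": "What is 17 * 24?"}],
    "max_tokens": 512,
})
resp.raise_for_status()
msg = resp.json()["choices"][0]["message"]

print("reasoning:", msg.get("reasoning_content"))  # separated CoT (harmony analysis channel)
print("answer:   ", msg.get("content"))            # final response content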