A live catalogue of model and hardware combinations I've benchmarked, with the launch flags that produced the headline numbers and the trade-offs that came with them.
Most "model + hardware" posts on the internet are either marketing material or a single review with default flags. Neither tells you what the silicon is actually capable of when you tune it for the workload it's designed to run. Each recipe here is verified against a measurable benchmark, dated, and includes the trade-offs. The numbers are mine. The flags are yours to use, fork, or argue with.
Concurrency-first recipe for openai/gpt-oss-120b on the GB10 Superchip. Beats the public Spark Arena leaderboard by +46% at c=5 and +54% at c=10 by adding non-aligned graph capture sizes to the canonical recipe. Trades ~19% c=256 prefill for big interactive and mid-band wins.
The canonical spark-vllm-docker recipe for gpt-oss-120b is a great starting point, but it auto-generates CUDA graph capture sizes at [1, 2, 4, 8, 16, 24, 32, ..., 1024]. Five and ten are missing from that list, so at c=5 the model dispatches into graph size 8 (3 of its 8 slots padded with dummy work), and at c=10 it pads to 16 (6 of 16 wasted). The result is a 17-25% throughput dip at exactly the concurrencies the public Spark Arena leaderboard tests at.
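A quick way to see the padding arithmetic (illustrative only; the real dispatch happens inside vLLM's cudagraph runner):

```bash
# Vanilla's smallest capture sizes are 1, 2, 4, 8, 16; a batch pads up to the
# next captured size, and every padded slot is wasted work.
for c in 5 10; do
  for g in 1 2 4 8 16; do
    if [ "$c" -le "$g" ]; then
      echo "c=$c -> graph size $g: $((g - c)) of $g slots are padding"
      break
    fi
  done
done
```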
This recipe overrides cudagraph_capture_sizes to include 5 and 10 alongside vanilla's 83 sizes. Two extra graphs to capture at boot, ~2-3% additional cold-start cost, and the dip closes. The full reasoning, methodology, and ledger of every iteration that got us here are in the linked write-up.
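If you'd rather generate the list than paste it, this one-liner reproduces the exact JSON passed to `--compilation-config` below (the vanilla progression of 1, 2, 4, 8, 16, then 24..256 in steps of 8, then 272..1024 in steps of 16, is inferred from the generated list):

```bash
# Rebuild vanilla's 83 capture sizes, splice in 5 and 10, emit the JSON override.
sizes=$({ printf '%s\n' 1 2 4 5 8 10 16; seq 24 8 256; seq 272 16 1024; } | sort -n | paste -sd, -)
echo "{\"cudagraph_capture_sizes\":[$sizes]}"
```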
The table below reports three throughput metrics. They measure different things on the same run.

A model that bursts to 1,024 tps for short windows but averages 285 tps over a longer run is a real, repeatable result. Both numbers are honest. Sustained mean is what most workloads actually feel; peak is what shows up in headline benchmarks of bursty traffic. The April baseline measurement on this hardware was ~850 tps peak. The hand-tuned image (the previous iteration) hit 1,140 tps peak. This patched canonical recipe trades a slightly lower peak (1,024) for a higher sustained mean (+10% on tg vs hand-tuned at c=256, per the comparison table below), which is the trade I want for my workload mix.
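Both metrics fall out of the same per-window samples. A minimal sketch, assuming your benchmark harness logs one decode-throughput sample per line (`tg.log` is a hypothetical filename):

```bash
# Sustained mean vs peak from the same run's per-window throughput samples.
awk '{ sum += $1; if ($1 > peak) peak = $1 }
     END { printf "sustained mean: %.1f tps  peak: %.1f tps\n", sum / NR, peak }' tg.log
```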
If you'd rather skip the tuning, drop the `--compilation-config` override and use the vanilla canonical recipe: nvcr.io/nvidia/vllm:26.03-py3 plus `--max-num-seqs 256` and the basic optimisation flags. You will leave ~30% on the floor versus this recipe, but you can iterate in seconds.

| Concurrency | Prefill (tps) | Generation (tps) | Peak (tps) |
|---|---|---|---|
| 1 | 5,115 | 60.4 | 60.4 |
| 2 | 5,317 | 80.3 | 84 |
| 4 | 5,935 | 107.2 | 121 |
| 5 | 6,297 | 118.7 | 128 |
| 8 | 6,518 | 141.9 | 173 |
| 10 | 6,721 | 160.5 | 192 |
| 16 | 6,846 | 181.2 | 272 |
| 32 | 6,732 | 202 | 384 |
| 64 | 6,998 | 229.2 | 568 |
| 128 | 7,034 | 271 | 768 |
| 256 | 5,391 | 285.8 | 1,024 |
Against the public baselines:

| Source | Concurrency | Theirs (tps) | Ours (tps) | Delta |
|---|---|---|---|---|
| Spark Arena (vanilla canonical recipe, no cudagraph fix) | 5 | 81.1 | 118.7 | +46.4% |
| Spark Arena (vanilla canonical recipe, no cudagraph fix) | 10 | 104.3 | 160.5 | +53.9% |
| Hand-tuned image (NGC vllm, --max-num-seqs 256) | 1 | 34.8 | 60.4 | +73.6% |
| Hand-tuned image (NGC vllm, --max-num-seqs 256) | 256 | 259.5 | 285.8 | +10.1% |
Flag by flag:

- `--max-num-seqs 256` - explicit batch slot cap, recovers +22% over the implicit default
- `--kv-cache-dtype fp8` - halves KV cache memory, frees concurrent slot headroom at high c
- `--quantization mxfp4 --mxfp4-backend CUTLASS` - explicit MXFP4 with the Blackwell-native CUTLASS path
- `--mxfp4-layers moe,qkv,o,lm_head` - extends MXFP4 quantisation beyond just the MoE layers
- `--attention-backend FLASHINFER` - sink-aware FlashInfer attention (requires from-source build)
- `--load-format fastsafetensors` - parallel safetensors loader, faster cold boot
- `--reasoning-parser openai_gptoss` - separates CoT from response content (gpt-oss harmony format)
- `--tool-call-parser openai --enable-auto-tool-choice` - native gpt-oss tool calling
- `--compilation-config` with explicit `cudagraph_capture_sizes` including 5 and 10 - closes the c=5 / c=10 dip
- `VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1` - environment variable enabling the MoE-specific FlashInfer MXFP4 path

Two caveats: building with `spark-vllm-docker --exp-mxfp4` takes ~54 min on a Spark/GX10 (one-off cost; the resulting image stays cached), and the `sparkrun` arena benchmark posts vanilla numbers, not patched.

The full launch command:

```bash
# env: VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
# image: vllm-node-mxfp4 (built from spark-vllm-docker --exp-mxfp4)
# container: docker run -d --name gptoss-120b --restart unless-stopped --runtime nvidia --gpus all --ipc host --shm-size 64g -p 8000:8000 ...
vllm serve openai/gpt-oss-120b \
  --host 0.0.0.0 --port 8000 \
  --tool-call-parser openai \
  --reasoning-parser openai_gptoss \
  --enable-auto-tool-choice \
  --gpu-memory-utilization 0.70 \
  --enable-prefix-caching \
  --load-format fastsafetensors \
  --quantization mxfp4 \
  --mxfp4-backend CUTLASS \
  --mxfp4-layers moe,qkv,o,lm_head \
  --attention-backend FLASHINFER \
  --kv-cache-dtype fp8 \
  --max-num-batched-tokens 8192 \
  --compilation-config '{"cudagraph_capture_sizes":[1,2,4,5,8,10,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168,176,184,192,200,208,216,224,232,240,248,256,272,288,304,320,336,352,368,384,400,416,432,448,464,480,496,512,528,544,560,576,592,608,624,640,656,672,688,704,720,736,752,768,784,800,816,832,848,864,880,896,912,928,944,960,976,992,1008,1024]}'
```
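Once it's serving, a quick sanity check that the reasoning parser is actually splitting CoT from the answer. Field names follow vLLM's OpenAI-compatible API as I understand it; adjust the jq filter if your build differs:

```bash
# Smoke-test: the reasoning parser should put CoT in reasoning_content
# and the final answer in content.
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "openai/gpt-oss-120b",
       "messages": [{"role": "user", "content": "What is 17 * 23?"}],
       "max_tokens": 256}' |
  jq '.choices[0].message | {reasoning_content, content}'
```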