What You Can Actually Run on £4,000 of Local AI Hardware
Key Takeaways
- A single DGX Spark holds 128 GB of unified LPDDR5x memory, enough to keep a 120B-class MoE model resident at native 4-bit precision plus headroom.
- Two services, two ports: one heavy reasoning model on 8000, one fast utility model on 8002.
- The combined stack uses about 85 GB of GPU memory, leaving 30+ GB for OS, brain processes, and supporting containers.
- vLLM with `--enable-prefix-caching` is free throughput for any workload with repeated system prompts.
- For most teams, the on-prem capex pays for itself inside two months versus equivalent frontier API spend on extraction-shaped workloads.
A single NVIDIA DGX Spark, sat on a desk, runs inference for a 120-billion-parameter reasoning model, plus a smaller utility model, plus the team-context brain it powers, plus its own memory vault. Total bill of materials: about £4,000 retail, one-off. Typical monthly cloud LLM bill once it is in place: £0. This is the actual working stack we run at Dendro Logic, and the configuration is short enough to fit in a blog post.
Why two models, not one
The temptation when you buy capable AI hardware is to run one big model. The reality is that workloads come in two shapes, and they want different models.
Shape one: structured extraction, agent reasoning, complex synthesis. Wants a 100B+ reasoning model. Runs in the background, throughput-bound, latency-tolerant.
Shape two: interactive utility. Quick classifications, snappy single-shot responses, "what does this mean" lookups. Wants a 7-9B model that single-streams at 25-30 tokens per second so the user does not wait.
If you only have one big model, every quick utility query waits behind the heavy model's batched throughput pattern. If you only have a small model, your quality on the interesting work is mediocre. Two models, two ports, two roles is the answer. The Spark has the memory headroom to hold both.
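In practice the split is a one-line routing decision. Here is a minimal sketch, assuming both services expose the OpenAI-compatible API that vLLM (and NIM) speak; the tier names, the `api_key` placeholder, and the light model identifier are ours, not canonical:

```python
from openai import OpenAI

# One client per port; both endpoints speak the OpenAI-compatible API.
# No real key is needed for local serving, but the client requires one.
HEAVY = OpenAI(base_url="http://localhost:8000/v1", api_key="local")
LIGHT = OpenAI(base_url="http://localhost:8002/v1", api_key="local")

# Illustrative tier map: each tier owns an endpoint and a model id.
MODELS = {
    "heavy": (HEAVY, "openai/gpt-oss-120b"),          # extraction, synthesis, agents
    "light": (LIGHT, "nvidia/nemotron-nano-9b-v2"),   # snappy single-shot utility
}

def ask(tier: str, prompt: str) -> str:
    client, model = MODELS[tier]
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

Callers pick a tier, not a model; swapping either model later touches one dictionary entry rather than every call site.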
The actual stack
What runs on the Spark:
| Port | Model | Quantisation | Role | Tier |
|---|---|---|---|---|
| 8000 | OpenAI gpt-oss-120B | MXFP4 | Main reasoning, extraction, synthesis | Heavy / Medium |
| 8002 | Nemotron Nano 9B v2 | NVFP4 | Fast utility, simple tasks | Light |
GPU memory footprint at idle, with both models loaded: 85 GB out of 128. Headroom: 30+ GB for the OS, the brain process, supporting containers, and surge capacity under load.
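A quick way to confirm both models are resident is to query each server's `/v1/models` listing, which both vLLM and NIM expose as part of their OpenAI-compatible APIs. A small sketch using `requests`:

```python
import requests

# Both servers expose an OpenAI-compatible /v1/models endpoint.
for port, role in [(8000, "heavy"), (8002, "light")]:
    data = requests.get(f"http://localhost:{port}/v1/models", timeout=5).json()
    served = [m["id"] for m in data["data"]]
    print(f"port {port} ({role}): {served}")
```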
The docker config that holds the heavy lifting
The gpt-oss container, with the flags that actually matter:
```bash
docker run -d --name gptoss-120b \
  --gpus all --shm-size=16g \
  --restart unless-stopped \
  -v /path/to/hf-cache:/hf-cache \
  -p 8000:8000 \
  -e HF_HUB_CACHE=/hf-cache \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
  nvcr.io/nvidia/vllm:26.03-py3 \
  vllm serve openai/gpt-oss-120b \
    --host 0.0.0.0 --port 8000 \
    --max-model-len 65536 \
    --gpu-memory-utilization 0.65 \
    --enable-prefix-caching \
    --trust-remote-code
```
A few things worth pulling out:
- `--enable-prefix-caching` is the single most underrated vLLM flag. Any workload that reuses a system prompt (most agents, most extraction pipelines, most chat sessions) gets the prompt's KV cache warm on the first hit and free on every subsequent one. Free throughput.
- `--gpu-memory-utilization 0.65` leaves 35% of the Spark's GPU memory for the second model and OS. Tune this once when both models are running and forget it.
- `VLLM_FLASHINFER_MOE_BACKEND=throughput` picks the kernel that scales batched MoE work. The default is latency-tuned and caps your aggregate throughput well below what the hardware can do.
- `--max-model-len 65536` sets a 64K context. Plenty for most extraction and agent work. Lift it only when you have a documented need.
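You can see the prefix-caching effect for yourself with a rough timing sketch. This assumes the container above is up on localhost:8000; the oversized system prompt and the `openai` client usage are illustrative, and absolute numbers will vary with hardware and load:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

# A long, repeated system prompt is exactly what prefix caching rewards.
system = "You are an extraction assistant. Follow the schema strictly. " * 100

def timed(question: str) -> float:
    start = time.perf_counter()
    client.chat.completions.create(
        model="openai/gpt-oss-120b",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": question}],
        max_tokens=32,
    )
    return time.perf_counter() - start

print(f"cold prefix: {timed('First question?'):.2f}s")
# The system prompt's KV cache is reused on the second call.
print(f"warm prefix: {timed('Second question?'):.2f}s")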
The 9B utility container is smaller and uses NVIDIA's DGX Spark-specific NIM image. Its config is conservative, latency-tuned, and intentionally not pushed for batched throughput. It serves snappy single-stream queries to humans and agents that want a fast answer.
What this replaces
Before this stack landed, our extraction work hit a frontier API. Reasonable per-call cost, but the full back-catalogue extraction (about 124,000 records) at premium pricing would have run to the low thousands of pounds. The same work on the Spark cost the electricity, the one-off capex of the box, and a Sunday afternoon of `--max-num-seqs` tuning.
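The breakeven arithmetic is short enough to sanity-check in a few lines. The per-record price and monthly spend below are illustrative assumptions, not quoted rates:

```python
# Illustrative assumption: ~£0.02 per record at frontier API pricing.
records = 124_000
api_cost_per_record = 0.02   # assumed, not a quoted rate
capex = 4_000                # one-off hardware cost, GBP

backlog_cost = records * api_cost_per_record
print(f"one back-catalogue pass via API: £{backlog_cost:,.0f}")  # ~£2,480

# Months to pay back the box at a given steady monthly API spend (assumed).
monthly_api_spend = 2_000
print(f"payback: {capex / monthly_api_spend:.1f} months")
```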
For interactive use, our pre-commit code review and our day-to-day brain queries hit Gemini Flash via Vercel AI Gateway. Cheap, fast, and the right call for "human is at the terminal" tier work. The HEAVY tier endpoint is intentionally rate-limited at the application layer so a runaway script cannot generate a bill before anyone notices.
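The limiter itself is nothing exotic; any standard pattern works at the application layer. A minimal token-bucket sketch (not our production code) that would gate calls to the heavy endpoint:

```python
import time

class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

heavy_gate = TokenBucket(rate=0.5, capacity=5)  # ~30 requests/minute, small bursts

def guarded_heavy_call(prompt: str):
    if not heavy_gate.allow():
        raise RuntimeError("HEAVY tier rate limit hit; slow down or queue")
    # ... forward to the port-8000 endpoint here ...
```

A runaway script exhausts the bucket in seconds and starts failing loudly, which is exactly the behaviour you want before a bill accrues.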
The Spark handles every batched, async, or background workload at zero marginal cloud cost. That includes wiki compilation, daily session summarisation, the structured extraction pipeline that powers Leap, and any agent task that does not need a human in the loop.
When the maths works, when it does not
The maths works when:
- Your team's extraction or synthesis volume is enough to justify the £4k upfront.
- Your workload is async or tolerant of medium latency.
- You want sovereign data handling without a sovereignty surcharge from a hyperscaler.
- You can spare an afternoon to wire vLLM and a Sunday to tune flags.
The maths does not work when:
- Your only AI use case is a 5-message-a-day chatbot. Frontier API is cheaper.
- You need maximum quality on every request and budget is not a constraint. Frontier API is cleaner.
- You have no engineer who can tolerate `nvidia-smi` and a Dockerfile.
For most teams running modern AI workloads, the gap between "we pay an API for everything" and "we run the bulk on metal we own" is the difference between a recurring bill and a one-off capex. The hardware is there. The configuration fits in a blog post. The decision is whether your team is willing to do the boring infrastructure work that turns AI from a service into a tool.
Take the Next Step
If your AI bill has been climbing and your workload looks anything like extraction, synthesis, or async agent traffic, an on-prem MEDIUM tier is probably worth a quick costing exercise. We help teams scope, install, and tune the stack we run ourselves. Get in touch if you want to put real numbers next to the question.