HEAVY, MEDIUM, LIGHT: Three Tiers Beats One Big Model
Key Takeaways
- HEAVY tier: interactive sessions where a human is waiting and quality matters. Frontier models, gateway-routed, paid per token. Used sparingly.
- MEDIUM tier: background and async work. Local 120B-class reasoning models on hardware you control. No per-token bill.
- LIGHT tier: deterministic transforms, classification, normalisation, regex-shaped tasks. Small models or no model at all. Microsecond latency.
- Routing rule: the tier is chosen by workload, not by request size. The cost saving comes from what you stop sending to HEAVY.
- For Dendro Logic, this pattern means our pre-commit code review, our daily session re-summarisation, and our extraction pipelines all run at zero marginal cost.
Most teams running LLM workloads pay for one tier of intelligence on every single request. A pricing-tier lookup, a sentiment classification, an extraction task that should take 50 milliseconds: all of them hit the same frontier-model endpoint that charges premium rates per token and adds a network round trip on each call. The bill is high, the latency is uneven, and the model occasionally fabricates an answer to a task that should never have reached it.
The fix is not a cheaper model. The fix is routing. Three tiers, three different jobs, one rule for which tier handles which workload.
The default everyone falls into
A team adopts an LLM tool. The tool exposes one model. Every request goes through it. Within a quarter, the bill has earned its own line on the finance team's dashboard, and the engineering response is to negotiate volume pricing rather than to question whether half those requests should have hit the model at all.
The hidden assumption is that "LLM" is a single thing. It is not. Modern AI infrastructure has three tiers with very different economics:
- Frontier: Claude, GPT-4-class, Gemini Pro. Best quality. Paid per token. Latency measured in seconds. Use when humans are waiting and the answer matters.
- Self-hosted reasoning: gpt-oss-120B, Llama-class, Mixtral. High quality, free at the margin if you own the hardware. Latency depends on concurrency but can be very low at scale.
- Small or non-LLM: 7B-class, classifiers, regex, dictionary lookups. Near-zero cost, microsecond latency. Brittle outside their lane but unmatched within it.
The waste runs in both directions: tier-3 work sent to tier-1 is overpriced overkill, and tier-1 work sent to tier-3 fails quietly. A routing rule fixes both.
The rule
For every workload, ask one question: how does this fail, and who notices?
- HEAVY when failure is visible to a human in a session, the user is waiting, and the answer needs nuance, source-grounding, or freshly-reasoned synthesis. Pre-commit code review where the developer is at the terminal. Customer-facing chat. A senior engineer asking the brain a complex architecture question.
- MEDIUM when failure is a background task that can be retried, the work is happening async, and quality matters but the user is not literally watching. Daily session summarisation. Overnight extraction pipelines. Wiki article compilation. Research drafts.
- LIGHT when the task is deterministic enough that a small model or a regex can do the job correctly, and burning a frontier-model call on it would be embarrassing if anyone audited the bill. Tag normalisation. Email classification. Routing decisions. Pricing-tier lookups.
Three rules of thumb hold (a routing sketch in code follows the list):
- If you would not pay £0.05 for the answer, it does not belong on HEAVY.
- If a 7B-class model's accuracy is "fine" for the task, LIGHT is correct.
- If the task is async and quality-sensitive, MEDIUM is almost always the right home.
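To make the rule concrete, here is a minimal routing sketch in Python. The workload attributes, the Tier enum, and the route function are illustrative names rather than code we ship; the point is that the decision keys on workload shape, not request size.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    HEAVY = "heavy"    # frontier model via gateway, paid per token
    MEDIUM = "medium"  # self-hosted 120B-class reasoning model
    LIGHT = "light"    # small model, classifier, or regex


@dataclass
class Workload:
    human_waiting: bool        # is a person blocked on the answer right now?
    needs_nuance: bool         # synthesis, source-grounding, fresh reasoning
    deterministic: bool        # could a small model or regex do this correctly?
    small_model_is_fine: bool  # is 7B-class accuracy acceptable for this task?


def route(w: Workload) -> Tier:
    """Pick a tier by workload shape, not by request size."""
    # LIGHT: deterministic transforms where small-model accuracy is enough.
    if w.deterministic and w.small_model_is_fine:
        return Tier.LIGHT
    # HEAVY: a human is waiting and the answer needs frontier-level reasoning.
    if w.human_waiting and w.needs_nuance:
        return Tier.HEAVY
    # MEDIUM: async, quality-sensitive work lands here by default.
    return Tier.MEDIUM


# Examples from the lists above.
assert route(Workload(True, True, False, False)) == Tier.HEAVY    # pre-commit review
assert route(Workload(False, True, False, False)) == Tier.MEDIUM  # overnight extraction
assert route(Workload(False, False, True, True)) == Tier.LIGHT    # tag normalisation
```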
What this looks like in practice
We run all three tiers across Dendro Logic's stack. The split lands like this (a configuration sketch follows the table):
| Tier | Model | Where it runs | Workloads |
|---|---|---|---|
| HEAVY | Gemini Flash via Vercel AI Gateway | API hop | Pre-commit code review, interactive memory_ask queries, customer-facing chat in CoreThread and Leap |
| MEDIUM | gpt-oss-120B MXFP4 | Polaris (a single Nvidia Spark sitting on the desk, fully on-prem) | Daily session summarisation, extraction pipelines, wiki compilation, research drafts |
| LIGHT | Embedding models, classifiers, regex | Inline in the brain process | Tag normalisation, scope routing, hallucination-source lookup, decay weighting |
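Wired into code, that table is a small piece of configuration. This is a sketch, assuming the MEDIUM tier is exposed behind an OpenAI-compatible endpoint; the endpoint URLs are placeholders, and the model names come from the table above.

```python
# Tier map as configuration. Models come from the table above; the endpoint
# URLs are placeholders, and the OpenAI-compatible serving assumption for
# MEDIUM is illustrative rather than a description of the exact setup.
TIERS = {
    "HEAVY": {
        "model": "gemini-flash",                     # via Vercel AI Gateway
        "endpoint": "https://gateway.example/v1",    # placeholder gateway URL
        "marginal_cost": "per token",
    },
    "MEDIUM": {
        "model": "gpt-oss-120b",                     # MXFP4, on-prem on Polaris
        "endpoint": "http://polaris.local:8000/v1",  # placeholder local URL
        "marginal_cost": "electricity",
    },
    "LIGHT": {
        "model": None,                               # classifier / regex, inline
        "endpoint": None,
        "marginal_cost": "negligible",
    },
}
```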
The MEDIUM tier is the one most teams skip. They have a frontier API and a small model and assume the gap between them is a quality gap. It is mostly a deployment gap. For most teams, a 120B reasoning model on hardware they own is the best fit for the bulk of their async work, and the running cost is the electricity bill.
Our extraction pipeline ran 123,849 records through MEDIUM at concurrency 256 and cost zero pounds in API charges. The same pipeline on a frontier API would have run into the thousands.
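The shape of that run is a short script. This is a sketch, assuming the model is served behind an OpenAI-compatible endpoint (vLLM or similar) on the local box; the URL, model name string, and prompt helper are placeholders.

```python
import asyncio
from openai import AsyncOpenAI

# Assumes gpt-oss-120B is served locally behind an OpenAI-compatible API;
# the base_url and model string are placeholders, not the exact setup.
client = AsyncOpenAI(base_url="http://polaris.local:8000/v1", api_key="unused")
semaphore = asyncio.Semaphore(256)  # concurrency 256, as in the run above


async def extract(record: dict) -> str:
    async with semaphore:
        response = await client.chat.completions.create(
            model="gpt-oss-120b",
            messages=[{"role": "user", "content": f"Extract the fields from: {record}"}],
        )
        return response.choices[0].message.content


async def run(records: list[dict]) -> list[str]:
    # Fan every record out to the local endpoint; the only bill is electricity.
    return await asyncio.gather(*(extract(r) for r in records))
```

Run it under asyncio.run; the semaphore caps in-flight requests at 256, which makes the hardware, not the invoice, the limit.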
The two failure modes of skipping the routing rule
Failure mode 1: HEAVY everywhere. The bill is the loudest signal. The quieter signal is latency drift: every microservice waits an extra 800ms for an API hop on a task that needed neither the network round trip nor frontier-level deliberation. The system feels slower without anyone being able to point to a specific cause.
Failure mode 2: LIGHT everywhere. The bill stays small. Quality fails silently. Classification accuracy at 89% feels acceptable until the 11% of misrouted tickets show up as customer complaints two months later. LIGHT is correct only when the task tolerates the accuracy LIGHT actually delivers.
The right move is not "use the cheaper model". The right move is to look at every workload and ask which tier it actually needs. The audit is mechanical and pays back fast.
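A sketch of what "mechanical" means here: tally the request log by workload and tier, compare against where the rule says each workload belongs, and total the overspend. The log shape, cost figures, and workload names below are placeholders, not benchmarks.

```python
from collections import Counter

# Hypothetical figures: HEAVY spend per call, MEDIUM and LIGHT free at the margin.
COST_PER_CALL = {"HEAVY": 0.05, "MEDIUM": 0.0, "LIGHT": 0.0}

# Where each workload should live, per the rule above (illustrative entries).
CORRECT_TIER = {
    "pre_commit_review": "HEAVY",
    "session_summarisation": "MEDIUM",
    "tag_normalisation": "LIGHT",
}


def audit(request_log: list[dict]) -> float:
    """Each log entry looks like {"workload": "tag_normalisation", "tier": "HEAVY"}."""
    wasted = 0.0
    calls = Counter((r["workload"], r["tier"]) for r in request_log)
    for (workload, tier), n in calls.most_common():
        should_be = CORRECT_TIER.get(workload, "MEDIUM")
        overspend = n * (COST_PER_CALL[tier] - COST_PER_CALL[should_be])
        if overspend > 0:
            print(f"{workload}: {n} calls on {tier}, belongs on {should_be}, ~£{overspend:.2f}")
            wasted += overspend
    return wasted
```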
Take the Next Step
If your team is running every AI request through a single frontier endpoint and the bill has been climbing, the gap is almost always in the workload-to-tier match. We help teams audit their inference stack, map workloads to the right tier, and stand up the MEDIUM-tier hardware that turns most of the spend into a one-off capital cost. Get in touch if you want to take that look.