The “Fast” Model Wasn’t Fast: A Real Pipeline, 30x Faster on a Bigger Model
Key Takeaways

- A 9B “fast” model running at concurrency 4 hit 130 records/hour on the production pipeline.
- A 120B MoE reasoning model at concurrency 256 hit ~3,900 records/hour on the same hardware.
- “Fast” model selection logic falls apart on inference hardware that scales with concurrency.
- The big model is faster wall-clock when you let it […]
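The arithmetic behind these numbers is worth making explicit: steady-state throughput is concurrency divided by per-record latency, so a slower-per-request model can still win on wall clock. A back-of-envelope sketch using the article's reported figures (the implied latencies are derived here, not stated in the source, and assume the pipeline keeps every slot saturated):

```python
def records_per_hour(concurrency: int, seconds_per_record: float) -> float:
    """Steady-state throughput when `concurrency` requests are always in flight."""
    return concurrency * 3600 / seconds_per_record


def implied_latency(concurrency: int, rph: float) -> float:
    """Invert the throughput formula to estimate per-record latency."""
    return concurrency * 3600 / rph


# Article's reported figures; latencies below are back-of-envelope estimates.
small = implied_latency(4, 130)     # 9B "fast" model at concurrency 4
big = implied_latency(256, 3900)    # 120B MoE model at concurrency 256

print(f"9B implied latency:   ~{small:.0f} s/record")
print(f"120B implied latency: ~{big:.0f} s/record")
print(f"Wall-clock speedup:    {3900 / 130:.0f}x")
```

Note that the big model's implied per-record latency is roughly double the small model's; the 30x wall-clock win comes entirely from the 64x jump in concurrency, which is the article's point.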