What Open Weights Would Actually Do to Your Monthly LLM Bill.
Two scenarios, one $60,000/month baseline. What a real switch to open-weights inference saves once the reasoning tail, tokenizer overhead, and eval work are factored into the model, and what it costs to get there.
A team spending $60,000/month on Claude Sonnet 4.6 (which compounds to a $955,000 annual bill at 5% monthly usage growth) can realistically expect:
- 55–72% savings (~$525K–$683K annually) from a wholesale swap to Kimi K2.5 on a hosted provider, with a Sonnet fallback for prompts that fail quality checks.
- 59–79% savings (~$563K–$754K annually) from three-tier per-prompt routing (DeepSeek V3.2 for simple prompts, Kimi K2.5 for the middle, Sonnet 4.6 for reasoning). The upside comes from shrinking the reasoning tail, which dominates the bill.
- Self-hosting does not pay below roughly $20,000/month on a single model. Hosted OSS providers are cheaper below that threshold.
Where you land in these ranges is almost entirely determined by how much of your production traffic can safely route off the frontier tier — which is a function of your eval infrastructure, not the model's price sheet. Post 1 in this series argued the line has been crossed; this post models the bill.
Last week we argued that open-source LLMs have reached frontier capability at 6–11x lower list-price inference cost than closed models, and that switching speed is now the real moat. The natural follow-up is concrete: if your team made the switch, what would your monthly LLM bill actually look like? We modeled a $60,000/month baseline on Claude Sonnet 4.6 that compounds to a $955,000 annual bill at 5% monthly usage growth. A wholesale swap to Kimi K2.5 with a reasoning-tail fallback saves 55–72%, roughly $525K–$683K per year. Three-tier per-prompt routing (DeepSeek V3.2 + Kimi K2.5 + Sonnet 4.6) saves 59–79%, roughly $563K–$754K per year. Where you land in these ranges is determined almost entirely by how much of your prompt distribution can safely route off the frontier tier, which is an eval-infrastructure question, not a model-price question. Getting there also costs 4–8 engineer-weeks of migration work plus ongoing ops.
The team: a $60,000/month SaaS running on Claude Sonnet 4.6
To make the math concrete, we use the same mid-size SaaS profile we introduced in The Hidden Cost of LLMflation: $60,000 per month in inference spend, 5% month-over-month usage growth, production traffic served entirely by Claude Sonnet 4.6. At Sonnet 4.6's pricing of $3 per million input tokens and $15 per million output tokens,1 that bill implies roughly 7.5 billion input tokens and 2.5 billion output tokens per month (a typical 3:1 input-to-output ratio for RAG and agent workloads).
One important note on the baseline: $60,000/month is not $720,000/year. With 5% monthly usage growth, the bill compounds. By month twelve the bill reaches $102,620/month, and the twelve-month total is $955,000. Every savings percentage in this post is measured against that $955K annual baseline, not a flat-line $720K projection.
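If you want to sanity-check the baseline yourself, the arithmetic fits in a few lines of Python. The prices, the 3:1 token ratio, and the 5% growth rate are the assumptions stated above; nothing else goes in:

```python
# Back out the token profile from the bill, then compound the baseline.
IN_PRICE, OUT_PRICE = 3.00, 15.00               # $/M tokens, Sonnet 4.6
MONTHLY_BILL, RATIO, GROWTH = 60_000.0, 3.0, 0.05

# bill = in_M * IN_PRICE + out_M * OUT_PRICE, with in_M = RATIO * out_M
out_m = MONTHLY_BILL / (RATIO * IN_PRICE + OUT_PRICE)   # million output tokens
in_m = RATIO * out_m
print(f"tokens/month: {in_m / 1000:.1f}B in, {out_m / 1000:.1f}B out")  # 7.5B / 2.5B

# 5% monthly growth compounds: month 12 and the twelve-month total.
month_12 = MONTHLY_BILL * (1 + GROWTH) ** 11
annual = MONTHLY_BILL * ((1 + GROWTH) ** 12 - 1) / GROWTH
print(f"month 12: ${month_12:,.0f}  annual: ${annual:,.0f}")  # $102,620 / $955,028
```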
The price sheet: what open weights list for today
Here is the pricing comparison that makes switching worth modeling. All prices are per million tokens, sampled as of April 2026:
| Model | Input ($/M) | Output ($/M) | Provider | License |
|---|---|---|---|---|
| Claude Opus 4.6 | $5.00 | $25.00 | Anthropic | Closed |
| Claude Sonnet 4.6 | $3.00 | $15.00 | Anthropic | Closed |
| GPT-5.4 | $2.50 | $15.00 | OpenAI | Closed |
| GPT-5.4 mini | $0.75 | $4.50 | OpenAI | Closed |
| GLM-5.1 | $0.95 | $3.15 | Z.ai | MIT (open) |
| Kimi K2.5 | $0.45 | $2.25 | DeepInfra | MIT (open) |
| DeepSeek V3.2 (chat) | $0.14 | $0.28 | DeepSeek | MIT (open) |
The headline ratio: Kimi K2.5 is 6.7x cheaper than Sonnet 4.6 on both input and output. Against Opus 4.6 the gap is 11.1x. DeepSeek V3.2 in chat mode is cheaper still: 21x less than Sonnet 4.6 on input, 54x less on output.2 On capability, Kimi K2.5 scores 76.8% on SWE-Bench Verified versus Sonnet 4.5's 77.2%, and GLM-5.1 leads the open-weights pack at 58.4% on SWE-Bench Pro.3 The capability gap is small. The price gap is large.
So the switch is worth modeling. But the price sheet does not transfer to your bill in full. Here is where it actually lands.
Scenario A: Wholesale swap with a fallback
The simplest move: swap Sonnet 4.6 for Kimi K2.5 on DeepInfra across production. Add a quality fallback rule: any prompt where the open-source model fails a quality check gets retried on Sonnet. No routing infrastructure, no eval-driven segmentation, no prompt re-tuning.
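A minimal sketch of that fallback rule, assuming both providers are reachable through OpenAI-compatible endpoints (Anthropic ships an OpenAI SDK compatibility layer; DeepInfra is OpenAI-compatible by default). The model identifiers and the quality predicate are illustrative placeholders, not a specific vendor contract:

```python
# Scenario A fallback rule: try the open-weights model first, retry on
# Sonnet only when the answer fails your quality check. Model ids and
# endpoints below are illustrative, not vendor-confirmed.
from openai import OpenAI

kimi = OpenAI(base_url="https://api.deepinfra.com/v1/openai", api_key="...")
sonnet = OpenAI(base_url="https://api.anthropic.com/v1/", api_key="...")

def passes_quality_check(text: str) -> bool:
    """Your rubric goes here: schema validity, groundedness, length, etc."""
    return bool(text.strip())  # placeholder predicate

def complete(messages: list[dict]) -> str:
    # First attempt on the open-weights tier.
    draft = kimi.chat.completions.create(
        model="moonshotai/Kimi-K2.5", messages=messages
    ).choices[0].message.content
    if passes_quality_check(draft):
        return draft
    # Fallback: the prompt re-runs on the frontier tier at frontier prices.
    return sonnet.chat.completions.create(
        model="claude-sonnet-4-6", messages=messages
    ).choices[0].message.content
```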
The friction the price sheet does not show: Kimi K2.5's tokenizer and chat-template format typically produce 10–15% more tokens per request than Anthropic's, and without eval work to identify which prompts route safely, the quality fallback rate tends to land at 20–30% of traffic in production. Here is the realistic midpoint (25% fallback) plus the range:
| Fallback rate to Sonnet | Monthly cost (Month 1) | Savings vs baseline | Annual captured |
|---|---|---|---|
| 15% (disciplined eval) | ~$17,100 | 72% | ~$683,000 |
| 25% (realistic midpoint) | ~$22,100 | 63% | ~$603,000 |
| 35% (conservative / agent-heavy) | ~$27,100 | 55% | ~$525,000 |
At the realistic midpoint, Scenario A saves roughly 63% of the annual bill, or $603,000 against $955,000. The 15% fallback rate at the top of the range is the best case for teams that have done meaningful eval work to identify the prompts that route cleanly. The 35% rate at the bottom reflects agent-heavy or reasoning-heavy workloads where more prompts need the frontier model for quality.
Notice where the cost concentrates. Even at 25% fallback, the Sonnet share accounts for ~68% of the total Scenario A bill despite being only a quarter of the traffic. Closed-model prices dominate whenever they stay in the path. The reasoning tail is where the remaining dollars live.
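Here is the Scenario A arithmetic as a sketch. It reproduces the table's rounded figures under the post's assumptions: the baseline token volumes, tokenizer overhead applied to the input side, and no cache effects:

```python
# Scenario A cost model. A `fallback` share of traffic is served by Sonnet;
# the rest stays on Kimi, whose tokenizer/chat-template overhead (~12.5%
# midpoint) inflates the input side of the bill.
IN_TOK, OUT_TOK = 7_500.0, 2_500.0          # million tokens per month

def scenario_a(fallback: float, overhead: float = 1.125) -> float:
    kimi = (1 - fallback) * (IN_TOK * overhead * 0.45 + OUT_TOK * 2.25)
    sonnet = fallback * (IN_TOK * 3.00 + OUT_TOK * 15.00)
    return kimi + sonnet

for fb in (0.15, 0.25, 0.35):
    m = scenario_a(fb)
    print(f"{fb:.0%} fallback: ${m:,.0f}/mo ({1 - m / 60_000:.0%} savings)")
# 15%: ~$17,009 (72%)   25%: ~$22,066 (63%)   35%: ~$27,124 (55%)
```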
Scenario B: Per-prompt routing across three tiers
The more sophisticated move, which requires real eval work: segment production traffic by complexity and route each prompt to the cheapest model that clears the quality bar for that specific request. A typical three-tier split (a minimal routing sketch follows the list):
- Simple prompts (classification, extraction, formatting, known-template summarization) → DeepSeek V3.2 chat at $0.14/$0.28.
- Mid-complexity prompts (most RAG, most agent steps, tool-call arguments, most customer-facing generation) → Kimi K2.5 at $0.45/$2.25.
- Reasoning-heavy prompts (multi-step analysis, complex extraction, high-stakes judgment) → Sonnet 4.6 at $3/$15.
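To make the mechanics concrete, here is a minimal version of that routing decision. Real deployments replace this heuristic with a small classifier trained and validated against eval data; the markers and thresholds here are illustrative only:

```python
# Illustrative per-prompt tier heuristic. The features and thresholds are
# placeholders; production routers are usually trained classifiers scored
# against replayed eval traffic.
def route(prompt: str, needs_tools: bool = False, steps_expected: int = 1) -> str:
    reasoning_markers = ("prove", "diagnose", "multi-step", "compare and decide")
    if steps_expected > 3 or any(m in prompt.lower() for m in reasoning_markers):
        return "claude-sonnet-4.6"      # reasoning tail: frontier tier
    if needs_tools or len(prompt) > 2_000:
        return "kimi-k2.5"              # mid tier: RAG, agent steps, generation
    return "deepseek-v3.2-chat"         # simple tier: classify/extract/format
```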
The critical variable is what percentage stays on Sonnet. Shrinking the reasoning tail from 25% to 15% is worth more than any model-price improvement you could negotiate. Here is the sensitivity on the reasoning-tail share, holding the simple tier at ~20% (DeepSeek) and the balance on Kimi:
| Reasoning tail (% on Sonnet) | Monthly cost (Month 1) | Savings vs baseline | Annual captured |
|---|---|---|---|
| 10% (high eval maturity) | ~$12,700 | 79% | ~$754,000 |
| 20% (realistic midpoint) | ~$18,400 | 69% | ~$661,000 |
| 30% (conservative / agent-heavy) | ~$24,400 | 59% | ~$563,000 |
At the realistic midpoint of 20% reasoning tail, Scenario B saves 69%, or $661,000 annually. The 10% best case (worth an additional $93,000 per year) is achievable, but requires mature eval infrastructure to continuously validate that prompts you have pushed off Sonnet have not regressed in quality.4
Every percentage point you can move from Sonnet to Kimi is worth roughly $500/month at this scale. That is what continuous evaluation buys you.
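The same cost model, extended to three tiers, shows where those numbers come from. It tracks the table's midpoint row to within a couple of percent and makes the per-point sensitivity explicit:

```python
# Scenario B: three-tier split with the simple tier held at 20% on DeepSeek,
# the reasoning tail on Sonnet, and the balance on Kimi. Same token volumes
# and input-side OSS overhead as Scenario A; the table's rounded figures
# differ from this sketch by a percent or two.
IN_TOK, OUT_TOK = 7_500.0, 2_500.0          # million tokens per month

def tier_cost(share, in_p, out_p, overhead=1.0):
    return share * (IN_TOK * overhead * in_p + OUT_TOK * out_p)

def scenario_b(tail: float, simple: float = 0.20, oh: float = 1.125) -> float:
    return (tier_cost(tail, 3.00, 15.00)                     # Sonnet tier
            + tier_cost(1 - tail - simple, 0.45, 2.25, oh)   # Kimi tier
            + tier_cost(simple, 0.14, 0.28, oh))             # DeepSeek tier

# Each point moved off Sonnet onto Kimi is worth roughly $500/month:
print(f"${scenario_b(0.20) - scenario_b(0.19):,.0f}")        # ≈ $506
```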
The twelve-month picture: scenarios at the realistic midpoint
Both scenarios compound with the 5% monthly usage growth assumption. Projected over twelve months, against the $955,000 baseline:
| Scenario (midpoint) | Month 1 | Month 6 | Month 12 | Annual total | Range of savings |
|---|---|---|---|---|---|
| Baseline (100% Sonnet 4.6) | $60,000 | $76,577 | $102,620 | $955,000 | — |
| A: Wholesale swap + fallback | $22,100 | $28,200 | $37,800 | $352,000 | 55–72% |
| B: Per-prompt routing (3-tier) | $18,400 | $23,500 | $31,500 | $294,000 | 59–79% |
At realistic midpoints, the delta between Scenarios A and B at this scale is roughly $58,000 per year. That is the direct financial value of investing in per-prompt routing with eval-driven segmentation versus a blanket wholesale swap.
Real teams are already hitting these ranges
The ranges above are not theoretical. Several companies have publicly reported savings at or above Scenario B, each in their own framing:
- Cloudflare moved its internal security-review agent (processing more than 7 billion tokens per day) from a mid-tier proprietary model to Kimi K2.5 on its Workers AI platform. Their reported result: "we cut costs by 77% simply by making the switch to Workers AI."5 That lands at the high end of Scenario B's range, at a workload scale far beyond our baseline.
- Sully.ai migrated medical-note generation from closed-source to self-hosted open-source on NVIDIA Blackwell with NVFP4 quantization. NVIDIA reports: "Sully.ai's inference costs dropped by 90%, representing a 10x reduction compared with the prior closed source implementation, while response times improved by 65%."6 That sits above the hosted-OSS Scenario B range, consistent with our model: self-hosting pushes realized savings higher once single-model throughput justifies it.
- Decagon, an AI customer-service SaaS, reported a 6x reduction in cost per query running open-source models on Together AI.7 That is roughly an 83% drop, just above the top of Scenario B's range, achieved on a hosted provider with no self-hosting.
- Airbnb's Brian Chesky said in late 2025 that the company "relies heavily" on Alibaba's Qwen open-weights model for its customer-service AI, calling Qwen "very good" and "fast and cheap," and noting that ChatGPT's integration abilities were "not quite ready" for Airbnb's needs.8
The savings range holds across workload types: coding agents (Cloudflare), medical notes (Sully.ai), customer-service automation (Decagon, Airbnb). The question is not whether the savings are real. The question is what share your own production prompt distribution makes available to you — which places you on one of the rows of the sensitivity tables above.
Where the savings leak
Four structural leaks separate the quoted 6.7x price-sheet gap from the realized 2–4x gap on the bill:
Tokenizer and chat-template overhead (10–15%). Open-source models tokenize less efficiently than Anthropic's on mixed prose and code, and their chat templates add more structural tokens. That overhead lands in the input bill with no offsetting benefit.
The reasoning tail. In both scenarios, the frontier-tier share dominates the bill. Every percentage point you can move from Sonnet to Kimi is worth roughly $500/month. Moving too aggressively costs quality. This is the tradeoff that continuous evaluation against your real prompt distribution is designed to resolve.
Cache economics are asymmetric. Anthropic's 90% prompt-cache discount is powerful for repetitive system prompts and stable RAG retrieval. DeepSeek's cache hits at $0.028/M are a 5x discount for workloads that hit them. Most other OSS providers have less aggressive caching. A migration that loses Anthropic's cache hits and does not gain equivalent OSS-side caching can leak 15–25% of the expected savings.9
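A blended-rate sketch makes the asymmetry concrete. The discounts are the documented ones (90% on Anthropic cached input, $0.028/M on DeepSeek cache hits); the 60% hit rate is an illustrative assumption:

```python
# Effective input price under caching: blend cache-hit and cache-miss rates.
# Discounts are the documented ones; the 60% hit rate is illustrative.
def effective_input_price(base: float, cached: float, hit_rate: float) -> float:
    return hit_rate * cached + (1 - hit_rate) * base

sonnet = effective_input_price(3.00, 0.30, hit_rate=0.60)      # $1.38/M
deepseek = effective_input_price(0.14, 0.028, hit_rate=0.60)   # $0.0728/M
kimi_uncached = 0.45                                           # no cache assumed
print(f"input gap after caching: {sonnet / kimi_uncached:.1f}x")  # ~3.1x, not 6.7x
```

At that hit rate, Sonnet's effective input price falls to $1.38/M, and the input-side gap to an uncached Kimi deployment shrinks from 6.7x to roughly 3x. That shrinkage is the mechanism behind the 15–25% leak.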
Provider variance. Hosted OSS providers have different latency, availability, and batching behavior. Cold starts, occasional 429s, and retry loops add real cost that price sheets do not show. This typically amounts to 3–8% overhead depending on the provider and workload shape.
What the migration actually costs
The savings above are against the baseline bill. None of them account for the engineering cost to capture them. A realistic open-weights migration to Scenario B looks like:
- Eval harness build (1–2 weeks). Replay 500–2,000 production prompts against candidate open-source models. Score against your own quality rubric, not a public benchmark; a replay sketch follows this list.
- Shadow-traffic validation (1–2 weeks). Send a held-out production slice through candidate models in parallel with Sonnet. Compare quality, latency, and failure modes before any real traffic shift.
- Routing and fallback logic (1–2 weeks). Build the per-prompt classifier or heuristic that decides which tier gets which request, plus a fallback path for prompts that regress.
- Ongoing quality monitoring (0.5–1 engineer-month per quarter). New models ship every few weeks. Each one is a re-eval against your traffic, a possible tier-promotion decision, and an SLO recheck.
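Here is a sketch of the replay harness from the first step, assuming an OpenAI-compatible endpoint and one JSON-encoded prompt per line; the endpoint and model id are illustrative, and the rubric function is the part you must supply:

```python
# Replay logged production prompts through a candidate model and score
# each answer with your own rubric. The pass rate approximates the share
# of traffic that routes cleanly off the frontier tier.
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.deepinfra.com/v1/openai", api_key="...")

def rubric_score(prompt: str, answer: str) -> float:
    """Your quality rubric: exact match, schema checks, LLM-as-judge, etc."""
    return 1.0 if answer.strip() else 0.0   # placeholder

def replay(path: str, model: str = "moonshotai/Kimi-K2.5") -> float:
    scores = []
    with open(path) as f:                   # one {"prompt": ...} per line
        for line in f:
            prompt = json.loads(line)["prompt"]
            answer = client.chat.completions.create(
                model=model, messages=[{"role": "user", "content": prompt}]
            ).choices[0].message.content
            scores.append(rubric_score(prompt, answer))
    return sum(scores) / len(scores)

print(f"cleanly routable share: {replay('prompts.jsonl'):.1%}")
```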
Total: roughly 4–8 engineer-weeks for the first model plus ongoing overhead of a fractional engineer. At typical loaded engineering cost ($350K–$500K per year per senior engineer), the first-year migration cost is likely in the $50K–$100K range, with $50K–$100K per year in ongoing ops. Against $600K–$700K in annual savings, the payback is still under two months — but it is not free, and teams that assume it is end up in Scenario A with a higher-than-expected fallback rate and a fraction of the potential upside.
When self-hosting actually starts to pay
At our $60,000/month baseline, self-hosting does not make sense. The fixed cost of reserved GPU capacity for Kimi K2.5 on 4× H200 at $2.80/GPU-hour runs to $8,064/month before operations overhead.10 Against the Kimi tier cost on DeepInfra in Scenario B (~$6,200/month at the midpoint), self-hosting loses money at this throughput.
The rough threshold: self-hosting on reserved H200 or B200 capacity typically starts to pay when a single open-weights model serves sustained workloads above $20,000/month on a hosted provider. Above that line, NVFP4 quantization on Blackwell plus vLLM or SGLang inference engines can deliver 2–4x cost improvement over hosted rates.11 This is precisely the stack Sully.ai used to hit the 90% inference-cost drop cited above.
Meanwhile, the hosted OSS rate is falling on its own. DeepInfra's effective Kimi K2.5 rate dropped from $0.20/M to $0.05/M over the last six months through Blackwell + NVFP4 quantization rollouts: a 4x reduction with no action required from customers.12 For teams below the $20K/month threshold, the fastest path is to stay on the hosted rate and let the provider's capex absorb the next optimization cycle.
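The break-even arithmetic, with the assumptions labeled: the node cost is the post's reserved-capacity figure, while the headroom and ops multipliers are our assumptions for peak/failover capacity and the fractional ops engineer:

```python
# Self-host break-even sketch. The 4x H200 reservation is the post's fixed
# cost; the headroom and ops multipliers are assumptions, labeled as such.
GPU_HOURLY, GPUS, HOURS = 2.80, 4, 24 * 30
node = GPU_HOURLY * GPUS * HOURS      # $8,064/month for one serving node
headroom = 1.6                        # assumed peak/failover capacity margin
ops = 1.5                             # assumed fractional-engineer ops load

break_even = node * headroom * ops    # ~$19,350/month in hosted spend
print(f"self-hosting starts to pay above ~${break_even:,.0f}/month hosted")
print(6_200 > break_even)             # False: Scenario B's Kimi tier stays hosted
```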
The Divyam view: the math is routing math, and routing math is eval math
The difference between Scenario A and Scenario B is not about which open-source model you pick. It is about whether your infrastructure decides per-prompt which tier gets each request, and whether you have the eval work to justify sending each prompt to the tier you send it to. That decision requires two things that most teams do not have in production today: a live evaluation of each candidate model against your actual production prompt distribution, and a routing layer that can act on it without rewriting your application.
Divyam.AI's Model Router does the routing. EvalMate does the continuous per-prompt evaluation. Together they shrink your reasoning tail — which, as the math above shows, is the single biggest lever on your realized savings. Our open-source divyam-llm-interop library handles the provider-format translation so that swapping Kimi K2.5 for GLM-5.1 next month is a configuration change, not an integration project.
The measurable outcome is not a single point estimate. It is moving your team from the conservative end of the Scenario B range to the high end — from ~59% captured savings to ~79% — without the engineering quarter that eval infrastructure and routing logic normally require.
What to do this week
If you are considering an open-weights switch, three actions worth taking before running any migration:
One: compute your actual token profile and prompt distribution. Pull the last thirty days of input and output tokens from your provider's billing export. Separately, sample 500–1,000 production prompts and classify them roughly by complexity: simple, mid, reasoning-heavy. The share of that last category is the single most important number for your savings estimate — more important than any vendor benchmark.
Two: shadow-traffic Kimi K2.5 or GLM-5.1 on 5% of your production workload for two weeks. Measure the quality delta against your own rubric. You will find roughly how much of your traffic routes cleanly and how much needs the Sonnet fallback. That number places you in one of the rows of the sensitivity tables above.
Three: compute your own scenario projection using your own reasoning-tail share; the sketch after this paragraph parametrizes the math in this post. The annual saving is almost certainly large enough to justify the engineering work, but knowing whether it is $300K or $700K before you start is worth the week of measurement.
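A compact version of that projection, under the same assumptions as the scenario models above (3:1 token ratio, input-side OSS overhead, April 2026 prices); swap in your own tail and simple shares:

```python
# Plug in your own numbers. `tail` and `simple` come from the prompt-
# classification exercise in step one.
def annual(monthly: float, growth: float = 0.05, months: int = 12) -> float:
    return monthly * ((1 + growth) ** months - 1) / growth

def routed_monthly(bill: float, tail: float, simple: float,
                   oh: float = 1.125) -> float:
    # Each tier's cost relative to an all-Sonnet bill, assuming the post's
    # 3:1 input:output token ratio and input-side OSS overhead `oh`.
    rel = lambda in_p, out_p, o=1.0: (7.5 * o * in_p + 2.5 * out_p) / 60.0
    mid = 1 - tail - simple
    return bill * (tail * rel(3.00, 15.00)           # Sonnet tier
                   + mid * rel(0.45, 2.25, oh)       # Kimi tier
                   + simple * rel(0.14, 0.28, oh))   # DeepSeek tier

saved = annual(60_000) - annual(routed_monthly(60_000, tail=0.20, simple=0.20))
print(f"annual savings: ${saved:,.0f}")   # ~$668K, near the post's $661K midpoint
```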
The 6–11x gap on the price sheet is real. The realized gap on your bill will be smaller, and it will still be the largest line-item optimization available to your engineering organization this year.
- Realistic open-weights savings are 55–79%, not the 11x on the price sheet. Scenario A (wholesale swap) lands 55–72% depending on fallback rate. Scenario B (per-prompt routing) lands 59–79% depending on reasoning-tail share.
- The reasoning tail dominates the bill. Every percentage point you can safely move from Sonnet to Kimi is worth roughly $500/month at this scale. Eval infrastructure, not model price, is what shrinks it.
- Scenario B is ~$58K/year better than Scenario A at midpoint. That is the direct financial value of eval-driven per-prompt routing over blanket wholesale swap.
- Migration costs 4–8 engineer-weeks for the first model plus ~0.5 engineer-month per quarter ongoing. Payback under two months, but it is not free.
- Self-hosting pays above ~$20K/month on a single-model tier. Below that, hosted OSS providers (DeepInfra, Together, Fireworks, Groq) beat most self-host break-evens because, at provider scale, fixed GPU cost amortizes to a rounding error per customer.
References
1. Anthropic, Claude Sonnet 4.6 pricing (2026). $3 per million input tokens, $15 per million output tokens; 1M-token context included at standard pricing. platform.claude.com
2. DeepSeek API pricing, V3.2 (2026). Chat mode: $0.14 per million input tokens, $0.28 per million output. Cache hits: $0.028 per million. Official API, MIT-licensed model weights. api-docs.deepseek.com
3. Artificial Analysis, Kimi K2.5 and GLM-5.1 benchmarks. Kimi K2.5 on SWE-Bench Verified: 76.8%. GLM-5.1 on SWE-Bench Pro: 58.4% (#1 globally, ahead of GPT-5.4 at 57.7% and Claude Opus 4.6 at 57.3%). artificialanalysis.ai
4. Divyam.AI, "The Hidden Cost of LLMflation" (2026). Per-prompt optimization captures 60% of LLMflation versus 25% for wholesale model migration. The $60K/month baseline with 5% growth reaches $955K annual. divyam.ai
5. Cloudflare, "Powering the agents: Workers AI now runs large models, starting with Kimi K2.5" (2026). Cloudflare moved its internal security-review agent (7B+ tokens/day) from a mid-tier proprietary model (projected $2.4M/year) to Kimi K2.5 on Workers AI: "we cut costs by 77% simply by making the switch to Workers AI." blog.cloudflare.com
6. NVIDIA Blog on Sully.ai, open-source inference on Blackwell (2026). On Sully.ai's migration to self-hosted OSS on Blackwell with NVFP4 via Baseten: "Sully.ai's inference costs dropped by 90%, representing a 10x reduction compared with the prior closed source implementation, while response times improved by 65%." blogs.nvidia.com
7. NVIDIA Blog on Decagon, open-source via Together AI (2026). NVIDIA reports Decagon's "cost per query… dropped by 6x" running open-source models on Together AI. The same post cites Latitude (4x improvement in cost per token on DeepInfra) and Sentient (25–50% better cost efficiency on Fireworks AI). blogs.nvidia.com
8. Brian Chesky, CEO of Airbnb (as reported by Yahoo Finance / SCMP, late 2025). On Airbnb's AI customer-service agent: the company "relies heavily" on Alibaba's Qwen model, calling it "very good" and "fast and cheap," and noting that ChatGPT's integration abilities were "not quite ready" for Airbnb's needs. finance.yahoo.com
9. Anthropic, prompt caching documentation (2026). Cached input tokens priced at 10% of the standard rate (90% discount). DeepSeek's analogous cache-hit pricing: $0.028/M, a 5x discount. docs.claude.com
10. Lambda Labs / CoreWeave H200 pricing (2026). H200 reserved at ~$2.80/GPU-hour on CoreWeave; on-demand at $4.50–$6.00 on AWS/GCP/Azure. 4× H200 for a large OSS model ≈ $8,064/month continuous. lambda.ai
11. NVIDIA Blog, "Inference for open-source models on Blackwell" (2026). NVFP4 quantization on Blackwell cut DeepInfra's effective inference cost from $0.20/M (Hopper) to $0.05/M (NVFP4), a 4x reduction on the same model weights. blogs.nvidia.com
12. DeepInfra / NVIDIA, Kimi K2.5 inference cost reduction (Q1 2026). Effective per-million-token rate fell from $0.20 (Hopper) → $0.10 (Blackwell) → $0.05 (NVFP4 on Blackwell), a 4x drop on the same model over two quarters. blogs.nvidia.com
This is the second post in our Open-Weights Moment series. Read the first: Open Source LLMs Just Caught Up: Why Your LLM Router Needs to Switch in a Day. Next in the series: how continuous evaluation turns this cost model into a per-prompt routing policy you can ship.