Open Source LLMs Just Caught Up: Why Your LLM Router Needs to Switch in a Day
Open-source LLMs (GLM-5.1, Kimi K2.5, DeepSeek V3.2, Qwen 3.5) have reached frontier capability at 10-17x lower inference cost than closed models. The companies that capture the savings are not the ones that picked the right model — they are the ones whose stack, powered by intelligent LLM routing, can switch in a day.
Earlier today, Flo Crivello, the founder of Lindy, posted a single observation that captured a turn the AI industry has been quietly making for months.
We've tested new OSS models the moment they're released for a while at Lindy. Inference is our #1 cost by a lot (more than payroll) — cutting it by 2-5x would be transformative.
Last year, OSS models were "not even close."
3 mos ago, "almost there." Came close to making Kimi…
— Flo Crivello (@Altimor) April 14, 2026
Two things are true in that post at the same time. Inference is now Lindy's largest line item, more than payroll. And the cheapest path to cutting it 2-5x runs through models that did not exist on most production stacks ninety days ago. The intersection of those two facts is what changes for every AI-native team this quarter.
Open caught up. Here are the receipts.
The benchmark gap is gone for the first time. Z.ai's GLM-5.1, released April 7, sits at the top of SWE-Bench Pro at 58.4, ahead of GPT-5.4 (57.7) and Claude Opus 4.6 (57.3) [1]. Moonshot's Kimi K2.5 scores 50.2% on Humanity's Last Exam with tools, ahead of GPT-5.2 (45.5%) and Claude 4.5 Opus (43.2%) [2]. DeepSeek V3.2 matches GPT-4o on MMLU at 94.2% and offers chat-mode pricing of $0.14 per million input tokens and $0.28 per million output [3]. MiniMax M2.5 holds the highest open-weight score on SWE-Bench Verified at 80.2% [4].
The cost gap is wider than the capability gap is narrow. Claude Opus 4.6 charges $5 per million input tokens and $25 per million output. Kimi K2.5 on DeepInfra is $0.45 input, $2.25 output. That is roughly 11x cheaper on both sides for the same coding workload, with comparable benchmarks [5]. GLM-5.1 lists at $0.95/$3.15 direct, or about $1.55 blended on Fireworks and Together. The pattern across the leaderboard is consistent: open-weight inference at near-frontier quality runs 10-17x cheaper than the closed equivalents [6].
This is not a forecast. It is the price sheet, today.
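The arithmetic behind that ratio is simple enough to sketch. Here is a minimal Python calculation using the list prices quoted above; the monthly traffic volumes are invented for illustration.

```python
# List prices quoted in this post, in dollars per million tokens.
PRICES = {
    "claude-opus-4.6": {"input": 5.00, "output": 25.00},
    "kimi-k2.5-deepinfra": {"input": 0.45, "output": 2.25},
}

def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Dollar cost for a month of traffic, measured in millions of tokens."""
    p = PRICES[model]
    return input_mtok * p["input"] + output_mtok * p["output"]

# Hypothetical coding workload: 2,000M input and 400M output tokens/month.
closed = monthly_cost("claude-opus-4.6", 2000, 400)      # $20,000
open_w = monthly_cost("kimi-k2.5-deepinfra", 2000, 400)  # $1,800
print(f"closed: ${closed:,.0f}  open: ${open_w:,.0f}  ratio: {closed / open_w:.1f}x")
```

On those assumed volumes the ratio comes out to about 11x, which is why the post calls it "roughly 11x on both sides": the input and output multiples happen to be nearly identical.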
Operators are voting with their inference bills.
Flo's tweet is one data point in a pattern that has been forming through 2025 and accelerating in early 2026.
Cursor. On March 19, Cursor shipped Composer 2, its default coding model for many users. Three days later, the company confirmed it was built on Kimi K2.5 as the base, with Cursor's own RL on top, served via Fireworks. Co-founder Aman Sanger: "We've evaluated a lot of base models on perplexity-based evals and Kimi K2.5 proved to be the strongest." [7]
Cloudflare. Internal developer tooling (OpenCode) and a security-review agent both moved to Kimi K2.5. Cloudflare's published numbers: 77% inference cost reduction versus the proprietary models they replaced, on workloads processing more than 7 billion tokens per day [8].
Airbnb. Brian Chesky, October 2025: "We're relying a lot on Alibaba's Qwen model. It's very good. It's also fast and cheap. We use OpenAI's latest models, but we typically don't use them that much in production because there are faster and cheaper models." [9]
Sully.ai reported a 90% inference cost reduction on medical note generation by switching from closed-source to open-source models on NVIDIA Blackwell with NVFP4 quantization [10]. Stripe reported 73% via a vLLM-backed open-source deployment [11]. Andreessen Horowitz partner Martin Casado, on the startups his firm sees pitching: "I'd say 80% chance they are using a Chinese open-source model." [12]
None of these teams switched because open-source is fashionable. They switched because the math broke through a threshold and stayed there.
The closed labs are responding with prices, not silence.
The clearest signal that something has changed is what the frontier labs did to their own price sheets in the last six weeks. Anthropic cut Claude Opus 4.6 by 67%, from $15/$75 (Opus 4.1) to $5/$25, and removed the long-context surcharge entirely [13]. OpenAI launched GPT-5.4 mini at $0.75/$4.50 and nano at $0.20/$1.25 in March, with the explicit positioning of matching the OSS-tier price point [14]. Anthropic also moved to a hybrid enterprise pricing model this morning, restructuring how it captures usage commitment up-front [15].
Frontier labs cutting prices 67% in a quarter is the loudest acknowledgment they could give that the open-source curve is now bounding their own.
Switching speed is the moat now.
Here is where the second half of Flo's tweet matters more than the first. He did not say Lindy switched to GLM-5.1. He said GLM-5.1 will likely be the default soon. The gap between "this model is better and cheaper" and "this model is in production" is where most teams lose the savings.
We have written before about Model Inertia and the $333,000-per-year cost of capturing only 25% of LLMflation. The open-weights moment makes both problems sharper.
A model switch is not a model-name change in your config. It is an eval rerun against your real prompt distribution, a prompt re-tune for the new model's quirks, a routing decision per request type, a provider format translation, and an SLO recheck before traffic shifts. Most teams take a quarter to do this end-to-end. Frontier closed-source labs ship every three to four weeks. Open-source labs are now shipping a frontier-class model roughly that often as well, at a tenth of the cost.
The bet that wins is no longer "Will OSS keep catching up?" It is "Can my stack adopt the next frontier-grade model in a day?"
That is a different infrastructure problem than the one most production AI teams have built for. It requires continuous evaluation against your own traffic, per-prompt routing rather than a blanket default, and a provider-agnostic interop layer so a switch does not mean a rewrite.
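As a sketch of what continuous evaluation with an automatic promotion decision might look like, here is a hypothetical gate. Every name, field, and threshold here is an assumption for illustration, not a real API: a candidate replaces the incumbent only if it holds the quality bar on your own eval set, meets the latency SLO, and costs less.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    model: str
    quality: float          # pass rate on your own prompt distribution, 0-1
    cost_per_mtok: float    # blended $/M tokens
    p95_latency_ms: float   # observed p95 latency in the eval run

def should_promote(candidate: EvalResult, incumbent: EvalResult,
                   quality_floor: float = 0.0,
                   latency_slo_ms: float = 2000) -> bool:
    """Promote the candidate only if it is at least as good (within the
    allowed floor), within the latency SLO, and strictly cheaper."""
    return (candidate.quality >= incumbent.quality - quality_floor
            and candidate.p95_latency_ms <= latency_slo_ms
            and candidate.cost_per_mtok < incumbent.cost_per_mtok)

incumbent = EvalResult("closed-frontier", quality=0.91, cost_per_mtok=9.0, p95_latency_ms=1400)
candidate = EvalResult("open-frontier", quality=0.92, cost_per_mtok=0.8, p95_latency_ms=1100)
print(should_promote(candidate, incumbent))  # True
```

The point of the sketch is the shape, not the numbers: when this check runs automatically against every new release, "adopt the next frontier-grade model" stops being a project and becomes a threshold crossing.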
The Divyam view: build the loop, not a default.
The teams capturing the open-weights moment have the same thing in common. They treat model selection as a decision their infrastructure makes per request, not a quarterly procurement choice they revisit when the bill gets uncomfortable.
Divyam.AI's Model Router was built for this. It auto-evaluates new models as they ship, runs them against the team's own production prompt distribution through EvalMate, and routes each request to the lowest-cost model that clears the quality bar for that request. When GLM-5.1 ships and beats Opus on the prompts that match your application, the router promotes it. When DeepSeek's next chat-mode price drop changes the math on your simple summarization traffic, the router re-routes. No migration sprint, no prompt re-tuning project, no vendor-lock rewrite.
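To make "lowest-cost model that clears the quality bar for that request" concrete, here is a toy per-request router. This is not Divyam's implementation; the model names, quality scores, and prices are invented.

```python
# Per request type: candidate models with (quality on that type, $/M tokens).
ROUTES = {
    "summarize": [("small-open", 0.93, 0.28),
                  ("mid-open", 0.95, 1.55),
                  ("closed-frontier", 0.96, 10.0)],
    "code":      [("small-open", 0.61, 0.28),
                  ("mid-open", 0.90, 1.55),
                  ("closed-frontier", 0.91, 10.0)],
}

def route(request_type: str, quality_bar: float) -> str:
    """Pick the cheapest model that clears the quality bar for this type."""
    eligible = [(cost, model) for model, quality, cost in ROUTES[request_type]
                if quality >= quality_bar]
    if not eligible:
        raise ValueError(f"no model clears {quality_bar} for {request_type!r}")
    return min(eligible)[1]  # min by cost

print(route("summarize", 0.90))  # small-open: the cheap model is good enough
print(route("code", 0.88))       # mid-open: code needs a stronger model
```

Notice that the two request types land on different models under the same policy. That is the entire argument against a blanket default: the right answer is a function of the request, and it changes every time a row in the table changes.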
The open foundation matters here too. Last week we open-sourced divyam-llm-interop, the Apache-2.0 Python library that handles request and response translation across providers and API generations. It is the layer that makes "switch the model" mechanical instead of heroic. Open-weights frontier models do not help if the integration cost of swapping them in is a week of engineering time per change.
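To illustrate the translation problem such a layer solves (this sketch is not the divyam-llm-interop API, just a miniature of the general problem), consider the same logical chat request in an OpenAI-style and an Anthropic-style wire format: system messages move out of the message list into a top-level field, and a token cap becomes mandatory.

```python
def to_anthropic_format(openai_request: dict) -> dict:
    """Translate an OpenAI-style chat request into an Anthropic-style one.
    Simplified: a real translation layer also covers tools, images,
    streaming, and response normalization in the other direction."""
    system_parts = [m["content"] for m in openai_request["messages"]
                    if m["role"] == "system"]
    return {
        "model": openai_request["model"],
        "system": "\n".join(system_parts),
        "messages": [m for m in openai_request["messages"]
                     if m["role"] != "system"],
        "max_tokens": openai_request.get("max_tokens", 1024),
    }

req = {"model": "glm-5.1", "max_tokens": 512,
       "messages": [{"role": "system", "content": "Be terse."},
                    {"role": "user", "content": "Summarize this diff."}]}
print(to_anthropic_format(req)["system"])  # Be terse.
```

Each such difference is trivial on its own; multiplied across providers, API generations, and features, it is exactly the "week of engineering time per change" the paragraph above describes.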
What to do this week
If your team has not yet absorbed the implication of Flo's tweet, three concrete actions are worth taking before this quarter closes.
One: pick your top three open-weights models and benchmark them against your real prompt distribution. Not MMLU. Not SWE-Bench. Your prompts, your quality bar, your latency budget. EvalMate makes this a day, not a quarter.
Two: identify which 30-60% of your prompts could safely route to a cheaper model with no quality loss. That share, not the headline model swap, is where the per-prompt savings actually live.
Three: measure your switching latency end-to-end. From "GLM-5.1 ships" to "GLM-5.1 serving traffic," what is the elapsed time? That number is your moat. The teams that win the next twelve months will not be the ones that picked the right model in April. They will be the ones whose infrastructure can pick the right model every week.
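The routing share from action two converts directly into blended savings. A back-of-envelope version, with illustrative numbers:

```python
def blended_savings(share_routable: float, cheap_ratio: float) -> float:
    """Fraction of total spend saved when `share_routable` of traffic moves
    to a model costing `cheap_ratio` of the default (0.1 = 10x cheaper)."""
    return share_routable * (1 - cheap_ratio)

# If 45% of prompts can route to a model ~11x cheaper (cost ratio ~0.09),
# the blended bill drops by about 41% with no change to the other 55%.
print(f"{blended_savings(0.45, 0.09):.0%}")
```

This is why the routable share matters more than the headline swap: even with the default model untouched, a mid-range routable share captures most of a full migration's savings.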
Inference cost more than payroll is the financial fact. Open weights at the frontier is the technical fact. Switching speed as the moat is what those two facts demand from your stack. The teams that build for it now will be running this year's quality at next year's prices, on this year's models, before their competitors finish a single migration.
- The benchmark gap is gone. GLM-5.1 leads SWE-Bench Pro. Kimi K2.5 leads Humanity's Last Exam with tools. DeepSeek V3.2 matches GPT-4o on MMLU. MiniMax M2.5 leads SWE-Bench Verified open-weights.
- The cost gap is 10-17x. Kimi K2.5 on DeepInfra is roughly 11x cheaper than Claude Opus 4.6 on both input and output, with comparable coding benchmarks. DeepSeek chat mode is $0.14/$0.28 per million tokens.
- Operators are already voting. Cursor (Composer 2 = Kimi K2.5 base), Cloudflare (77% cost cut), Airbnb (Qwen-first), Sully.ai (90% cut), Stripe (73% cut), Lindy (next), and 80% of US AI startup pitches per a16z.
- Closed labs are cutting prices, not denying the trend. Anthropic Opus 4.6 cut 67%. OpenAI launched GPT-5.4 mini and nano at OSS-tier price points.
- Switching speed is the new moat. The bet is no longer which model is best. It is how fast your stack can adopt the next frontier-grade model. A quarter of switching latency on a curve that ships every 3-4 weeks is not a tradeoff. It is technical debt.
References
1. Z.ai, "GLM-5.1: The Next Level of Open Source" (April 7, 2026). SWE-Bench Pro 58.4, AIME 2026 95.3, GPQA-Diamond 86.2. 754B-parameter MoE, MIT license, 8+ hours autonomous execution. Via venturebeat.com.
2. Moonshot AI, Kimi K2.5 release (January 27, 2026). Humanity's Last Exam (tools): 50.2%. SWE-Bench Verified: 76.8%. 1T total / 32B active parameters, modified MIT license. Via techcrunch.com.
3. DeepSeek API pricing, V3.2 (2026). Chat mode: $0.14 input / $0.28 output per million tokens. Cache hits: $0.028 per million. MMLU 94.2%, MIT license. Via api-docs.deepseek.com.
4. BenchLM, "Best Open Source LLM 2026". MiniMax M2.5: SWE-Bench Verified 80.2%, the highest open-weight score on the leaderboard. Via benchlm.ai.
5. Artificial Analysis, "Kimi K2.5 Provider Pricing". DeepInfra $0.45/$2.25, Together AI $0.50/$2.80, Fireworks $0.60/$2.50 per million tokens. Via artificialanalysis.ai.
6. BentoML, "Navigating Open-Source LLMs in 2026". Open-source vs. proprietary: ~17x cheaper at ~90% of capability. GPT-4-equivalent inference fell from $20/M tokens (late 2022) to ~$0.40/M (early 2026). Via bentoml.com.
7. TechCrunch, "Cursor admits its new coding model was built on top of Moonshot AI's Kimi" (March 22, 2026). Cursor's Composer 2 (default for many users) is built on Kimi K2.5 via Fireworks; co-founder Aman Sanger confirmed the foundation publicly. Via techcrunch.com.
8. Cloudflare blog, "Powering the agents: Workers AI now runs large models, starting with Kimi K2.5". Internal OpenCode and security-review agent moved to Kimi K2.5; 77% cost reduction vs. proprietary models on internal workloads, 7B+ tokens/day. Via blog.cloudflare.com.
9. Brian Chesky, CEO of Airbnb, Q3 2025 earnings remarks. On Airbnb's AI customer-service agent: "We're relying a lot on Alibaba's Qwen model. It's very good. It's also fast and cheap." Via finance.yahoo.com.
10. NVIDIA Blog, "Inference for open-source models on Blackwell". Sully.ai: 90% inference cost reduction on medical note generation, 65% faster response times, switching to OSS on NVFP4 + TensorRT-LLM + Dynamo. Via blogs.nvidia.com.
11. Programming Helper, "vLLM 2026 — Open-Source LLM Inference Engine". Stripe: 73% inference cost reduction after switching to a vLLM-backed open-source deployment. vLLM in production at Amazon Rufus, LinkedIn, Roblox, Meta. Via programming-helper.com.
12. Martin Casado, General Partner at Andreessen Horowitz (The Economist, late 2025). On startups pitching a16z: "I'd say 80% chance they are using a Chinese open-source model." Singled out Qwen, DeepSeek V3, Kimi K2. Via officechai.com.
13. Anthropic, Claude Opus 4.6 pricing (2026). Cut 67% from Opus 4.1: $15/$75 to $5/$25 per million input/output tokens. 1M-token context window included at standard pricing, no surcharge. Via platform.claude.com.
14. OpenAI, GPT-5.4 mini and nano launch (March 18, 2026). GPT-5.4 mini: $0.75/$4.50 per million tokens; GPT-5.4 nano: $0.20/$1.25. Mini is 70% cheaper than full GPT-5.4. Via nxcode.io.
15. NPI Financial, "Anthropic shifts Claude Enterprise to hybrid pricing model" (April 14, 2026). Lower seat fees, usage commitment up-front, loss of API discounts; a direct response to OpenAI's $100/mo Pro tier and the broader OSS price pressure. Via npifinancial.com.
This post is the first in our three-part series on the Open-Weights Moment. Next: the cost-economics deep-dive on why open weights are structurally cheaper.