Switching Models in a Day Is an Eval Problem, Not a Model Problem.
The bottleneck isn't picking a model. It's running a continuous evaluation against your real prompt distribution every time a new one ships. Here's what that takes — and why it's the lever that decides where your savings land.
Continuous LLM evaluation — not model choice — is the lever that decides where you land in the 55–78% savings range the previous post in this series modeled. The teams capturing the open-weights moment are the ones that can answer four questions on demand:
- Does this new model clear the quality bar for any of my prompt tiers? (Pre-deployment.)
- Which tier should this specific request go to? (Per-request routing.)
- Is my current model degrading on real production traffic? (Drift detection.)
- Did this prompt change break anything? (Regression.)
Public benchmarks like MMLU and SWE-Bench answer none of these. The eval rig that does is the actual product moat for any team that wants to switch models in a day instead of a quarter.
The first two posts in this series argued that open-weights LLMs have caught up at 6–11x lower price and modeled what that switch would actually do to a $60K/month bill. Both posts ended at the same place: the savings a team captures depend on how much of its prompt distribution can safely route off the frontier tier. That is an evaluation question, not a model question. This post explains what continuous LLM evaluation actually means in production: the four eval workloads a multi-model stack requires, why public benchmarks answer none of them, and why most teams stall before building the eval infrastructure that would let them switch models in a day instead of a quarter. The teams that have captured the open-weights savings — Cursor, Lindy, Cloudflare, Sully.ai — all share one thing: they had the eval rig before they had the new model.
Cursor switched to Kimi K2.5 because their eval rig was ready
The most consequential open-weights adoption event of 2026 was Cursor shipping Composer 2 on March 19, built on Kimi K2.5 as the foundation with Cursor's own RL on top. Three days later, co-founder Aman Sanger explained the decision: "We've evaluated a lot of base models on perplexity-based evals and Kimi K2.5 proved to be the strongest."1 Cursor did not switch because Kimi K2.5 was on the leaderboard. They switched because their internal evaluation, run against a corpus of code-completion prompts that look like Cursor's actual production traffic, found Kimi K2.5 was the best base for their downstream training.
The same story plays out at Lindy. Flo Crivello: "We've tested new OSS models the moment they're released for a while at Lindy."2 That is not a remark about curiosity. It is a remark about eval infrastructure. Lindy has the rig that lets them know, within hours of a release, whether GLM-5.1 should become the default. Most teams cannot make that decision in a quarter.
The pattern is consistent across teams that have publicly captured the open-weights savings. They had the eval rig before they had the new model. The eval rig is the moat.
Most "evals" are not the eval that lets you switch
When most teams say they "do evals," they mean one of three things. None of them are sufficient for routing decisions in production:
Public benchmarks. MMLU, SWE-Bench, GPQA, Humanity's Last Exam. These measure aggregate performance on standardized tasks. They are useful for triage — ruling out models that are obviously below frontier — but a model that scores 76.8% on SWE-Bench Verified might handle 95% of your prompts perfectly and fail catastrophically on the 5% that look unlike anything in the benchmark. Your prompt distribution is not the benchmark distribution. A team that decides routing based on public benchmark scores is making a decision with the wrong data.
One-off vendor A/B tests. Two weeks of shadow traffic on Kimi K2.5 versus Sonnet 4.6, a side-by-side score, a decision. The problem: the Kimi K2.5 you tested in February is not the Kimi K2.5 DeepInfra is serving in May, because the same model name covers a moving target of quantizations, batching configurations, and provider-side optimizations. A point-in-time comparison degrades the moment either model changes — which, on the frontier curve, is every few weeks.
Generic LLM-as-judge with no rubric. "Is this answer correct?" scored by GPT-5 across a sample of outputs. This works for narrow factual tasks but produces noise on most production workloads. A radiologist's draft note, a customer-support escalation, a tool-call argument for an agent — "correct" is the wrong question. The right question is whether the output meets the rubric the application's users actually care about, which has to be written down, validated, and re-validated.3
Each of these is a useful component of a real eval rig. None of them, alone, is the eval that lets you switch models in production.
What continuous evaluation actually means
Three properties separate continuous evaluation from one-off testing:
Use-case-specific. The criteria are derived from your actual application, not from a public benchmark. For Sully.ai's medical-note generation, the criteria include diagnostic accuracy, billing-code correctness, and adherence to a specific SOAP-note format. For Cloudflare's security-review agent, the criteria include vulnerability classification accuracy and false-positive rate on a held-out corpus of pull requests. The rubric is the application.
Continuous. The evaluation runs every time a new model ships, every time a prompt or rubric changes, every time production traffic patterns drift. Not as a project that is re-launched when someone notices a problem. As a daily or hourly background process that produces a fresh score every time the inputs change.
Quantified per prompt class. The output is not a single number. It is a per-tier score that routing logic can act on: "Kimi K2.5 clears the bar on simple and mid tiers; falls below the bar on the reasoning tier." That is the data shape that lets a router make per-request decisions, not a "model X is better than model Y" overall verdict.
The output of an eval rig is not a leaderboard rank. It is a per-prompt-class quality score that routing logic can act on.
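To make that data shape concrete, here is a minimal sketch in Python. The tier names, scores, and quality bars are illustrative, not any particular vendor's schema; the numbers mirror the example verdict above.

```python
# A minimal sketch of the shape an eval rig hands to routing logic.
# Tier names, models, and thresholds here are illustrative only.
from dataclasses import dataclass

@dataclass
class TierScore:
    model: str
    tier: str           # e.g. "simple", "mid", "reasoning"
    score: float        # rubric pass rate on this tier's eval sample, 0.0-1.0
    quality_bar: float  # minimum acceptable score for this tier

    def clears_bar(self) -> bool:
        return self.score >= self.quality_bar

# Pre-deployment eval output for a candidate model, one row per tier:
candidate_results = [
    TierScore("kimi-k2.5", "simple",    score=0.97, quality_bar=0.95),
    TierScore("kimi-k2.5", "mid",       score=0.93, quality_bar=0.90),
    TierScore("kimi-k2.5", "reasoning", score=0.81, quality_bar=0.90),
]

# The router acts per tier, not on an overall verdict:
eligible_tiers = {r.tier for r in candidate_results if r.clears_bar()}
print(eligible_tiers)  # {'simple', 'mid'} -- reasoning stays on the frontier model
```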
The four eval workloads of a multi-model stack
A production stack that switches models in a day, not a quarter, runs four distinct evaluation workloads continuously. Each answers a different question:
| Workload | Question it answers | Cadence |
|---|---|---|
| Pre-deployment | Does this new model clear the bar for any of my tiers? | Every time a candidate model ships |
| Per-request routing | Which tier should this specific prompt go to? | Every production request |
| Production drift | Is my chosen model degrading on real traffic? | Continuous, with alerting |
| Regression | Did this prompt or rubric change break anything? | Every prompt or rubric change |
None of these workloads is optional. Skip pre-deployment and you cannot adopt new models without a manual sprint. Skip per-request routing and your bill lands at Scenario A (the previous post's single-model swap) instead of Scenario B (per-prompt routing); the difference, on a $60K/month baseline, is roughly $58K/year. Skip drift detection and your captured savings degrade silently as model quality changes under you. Skip regression and a prompt update on Tuesday breaks production on Wednesday.
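As one concrete instance, drift detection can start as something very small: a rolling comparison of recent judge scores against a trailing baseline, with an alert threshold. The window sizes and threshold below are illustrative defaults, not recommendations.

```python
# A minimal drift check, assuming daily judge scores on a rolling production
# sample. Window sizes and the alert threshold are illustrative defaults.
from statistics import mean

def detect_drift(daily_scores: list[float],
                 baseline_window: int = 14,
                 recent_window: int = 3,
                 max_drop: float = 0.05) -> bool:
    """Alert when the recent mean score falls more than `max_drop`
    below the trailing baseline mean."""
    if len(daily_scores) < baseline_window + recent_window:
        return False  # not enough history yet
    baseline = mean(daily_scores[-(baseline_window + recent_window):-recent_window])
    recent = mean(daily_scores[-recent_window:])
    return (baseline - recent) > max_drop

# A provider-side change (new quantization, new batching config) shows up as a
# score drop on the same prompts, even though the model name never changed:
scores = [0.92] * 14 + [0.91, 0.85, 0.84]
print(detect_drift(scores))  # True -- page whoever owns the routing policy
```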
Why most teams stall
Building this from scratch is a quarter of engineering work, and the technical work is only half of it.
The technical work: sample collection (which prompts? how many? how stratified?), prompt classification by complexity, rubric development with domain experts, judge selection and calibration, scoring infrastructure, drift monitoring, dashboards, alert routing. Each of these is well-understood in isolation. Stitching them into a continuous loop that runs against every new model release is not.4
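Judge calibration, for instance, reduces to a measurable check: how often does the judge agree with your domain experts on outputs the experts have already labeled? A minimal sketch follows, with a deliberately toy stand-in judge so it runs; what counts as acceptable agreement is your call, and the figure in the comment is a common heuristic, not a standard.

```python
# A minimal judge-calibration check against expert-labeled outputs.
# `judge` is any callable returning pass/fail; in practice, an LLM-as-judge
# call with your rubric in the prompt.

def agreement_rate(judge, labeled_examples: list[tuple[str, bool]]) -> float:
    """Fraction of expert-labeled outputs where the judge agrees with the
    expert. Below roughly 0.85 agreement (a common heuristic, not a standard),
    judge scores are noise, not signal."""
    hits = sum(judge(output) == expert_label
               for output, expert_label in labeled_examples)
    return hits / len(labeled_examples)

# Toy stand-in judge; in production this is a rubric-anchored LLM call.
toy_judge = lambda output: "diagnosis" in output.lower()

labeled = [
    ("Diagnosis: otitis media. ICD-10 H66.90.", True),
    ("Patient seems fine I guess.", False),
    ("Assessment and plan documented, diagnosis coded.", True),
]
print(agreement_rate(toy_judge, labeled))  # 1.0 on this toy sample
```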
The organizational work is harder. Quality ownership is rarely clean: engineering owns the infrastructure, product owns the feature, domain experts own what "good" means, and on-call owns the consequences when the rubric was wrong. Eval rigs require all four to agree on a written rubric, which most teams have never done explicitly. Most teams stop at "we collected some prompts and ran them once," which produces a snapshot rather than a continuous signal. The team picks a model, ships it, and cannot tell when the next model would be a better choice. That gap is exactly the Model Inertia the open-weights moment is supposed to close.
What the closed loop actually looks like
The teams that have captured the open-weights savings are running a continuous loop that connects routing, evaluation, and production traffic. The shape, with a code sketch of the policy-update step after the list:
- Router directs each production request to the model selected for its prompt tier, with fallback rules for failures.
- Production responses, prompts, and metadata stream into the eval store as a continuously updated sample of real traffic.
- Eval workloads run against that sample on a schedule: pre-deployment when a candidate ships, drift detection daily, regression on every prompt change.
- The eval results update the routing policy: a tier promotion when a new model clears the bar, a tier demotion when a model degrades, a fallback rule when a regression is detected.
- Router picks up the new policy on the next request, and the loop repeats.
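Here is a minimal sketch of the policy-update step in that loop, consuming per-tier scores of the shape sketched earlier. All model names, scores, and bars are hypothetical, and a production version adds canary fractions, fallback chains, and human sign-off on tier promotions.

```python
# A minimal sketch of the step where eval results update the routing policy.
# Names, scores, and bars are hypothetical, not any vendor's actual schema.

Policy = dict[str, str]  # tier -> model currently serving it

def update_policy(policy: Policy,
                  eval_scores: dict[tuple[str, str], float],  # (model, tier) -> score
                  quality_bars: dict[str, float],             # tier -> minimum score
                  cost_rank: dict[str, int]) -> Policy:       # model -> cost rank, low = cheap
    """Route each tier to the cheapest model that clears its bar;
    keep the incumbent when nothing does."""
    new_policy = dict(policy)
    for tier, bar in quality_bars.items():
        passing = [m for (m, t), s in eval_scores.items() if t == tier and s >= bar]
        if passing:
            new_policy[tier] = min(passing, key=lambda m: cost_rank[m])
    return new_policy

policy = {"simple": "sonnet-4.6", "mid": "sonnet-4.6", "reasoning": "sonnet-4.6"}
scores = {
    ("kimi-k2.5", "simple"): 0.97, ("kimi-k2.5", "mid"): 0.93,
    ("kimi-k2.5", "reasoning"): 0.81,
    ("sonnet-4.6", "simple"): 0.98, ("sonnet-4.6", "mid"): 0.95,
    ("sonnet-4.6", "reasoning"): 0.94,
}
bars = {"simple": 0.95, "mid": 0.90, "reasoning": 0.90}
costs = {"kimi-k2.5": 1, "sonnet-4.6": 3}

print(update_policy(policy, scores, bars, costs))
# {'simple': 'kimi-k2.5', 'mid': 'kimi-k2.5', 'reasoning': 'sonnet-4.6'}
```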
The team running this loop does not run migrations. The system does. New OSS models flow in transparently, the bill drops as routing matures, and a frontier-tier release that beats the current best gets promoted within hours rather than across a quarter-long migration project.
EvalMate runs the loop
The eval rig described above is what Divyam.AI's EvalMate is built to operationalize. The product flow:
- Describe your use case in natural language. EvalMate generates a candidate quality rubric, validated against a sample of your real prompts.
- Pre-deployment evaluation runs automatically against any candidate model registered in your stack.
- Production traffic streams into the eval store via the divyam-llm-interop layer, which provides the unified request/response capture across providers.
- Drift, regression, and routing-decision evals run on the same sample on a schedule, with alerts for material changes.
- Results feed Divyam.AI's Model Router, which adjusts the per-prompt routing policy without an application code change.
The measurable outcome is the one this series has pointed at since post one: a team whose stack switches models in a day, captures most of the open-weights savings range, and does not run a migration sprint when the next frontier-grade model ships in three weeks.
What to do this week
If your team has not yet built an eval rig that can answer the four questions above, here are three actions worth taking before any model migration:
One: sample 500 production prompts and classify them by complexity. Simple (classification, formatting, template summarization), mid (most RAG, agent steps, customer-facing generation), reasoning-heavy (multi-step analysis, complex extraction, judgment calls). The share that lands in each tier is the most important number for your routing decision — more important than any vendor benchmark.
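A minimal sketch of that classification step, assuming the OpenAI SDK as the classifier client; the tier definitions, classifier prompt, model name, and inline sample are all illustrative, and classification is itself a simple-tier task, so a cheap model is fine here.

```python
# A minimal sketch of step one. SDK choice, model name, and prompt wording
# are assumptions; swap in whatever client your stack already uses.
import collections
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY in the environment
TIERS = ("simple", "mid", "reasoning")

CLASSIFIER_PROMPT = """Classify this LLM prompt's complexity as exactly one of:
simple (classification, formatting, template summarization),
mid (RAG, agent steps, customer-facing generation),
reasoning (multi-step analysis, complex extraction, judgment calls).
Reply with the single word only.

Prompt:
{prompt}"""

def classify(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any cheap model; the labeling task is simple-tier
        messages=[{"role": "user", "content": CLASSIFIER_PROMPT.format(prompt=prompt)}],
    )
    label = (resp.choices[0].message.content or "").strip().lower()
    return label if label in TIERS else "mid"  # default to mid on judge noise

# In practice: 500 prompts sampled from production logs, stratified by endpoint.
sample = ["Format this date as ISO 8601: March 19, 2026",
          "Given these six retrieved passages, answer the customer's question...",
          "Decide whether this claim should be escalated, and justify the call..."]
shares = collections.Counter(classify(p) for p in sample)
print({t: shares[t] / len(sample) for t in TIERS})
```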
Two: write down what "good" means for each tier. Not in a doc that lives in someone's notes. In a rubric the team agrees on, with examples of pass and fail outputs for each criterion. Anthropic's published evaluation guidance is a useful starting point.5 So is OpenAI's open-source Evals framework.6 Both are battle-tested by teams that have shipped production AI at scale.
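What "written down" can look like in practice: a rubric as structured data, one entry per criterion, with pass and fail exemplars attached. The criteria below echo the medical-note example from earlier in this post and are purely illustrative; yours come from your domain experts.

```python
# A sketch of a written-down rubric for one tier. Criteria, exemplars, and
# the pass bar are illustrative stand-ins for what your experts define.
RUBRIC_MID_TIER = {
    "use_case": "clinical visit-note generation",
    "criteria": [
        {
            "name": "soap_format",
            "description": "Note has Subjective, Objective, Assessment, Plan sections, in order.",
            "pass_example": "S: 6yo with ear pain... O: TM erythematous... A: otitis media... P: amoxicillin...",
            "fail_example": "Kid has an ear infection, gave antibiotics.",
        },
        {
            "name": "billing_code",
            "description": "Every diagnosis carries a valid ICD-10 code.",
            "pass_example": "A: Acute otitis media, right ear (H66.91).",
            "fail_example": "A: Acute otitis media.",
        },
    ],
    "pass_bar": "all criteria must pass",  # per-criterion results roll up to pass/fail
}
```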
Three: run a baseline evaluation of your current model against that rubric. Whatever the score is, that is your floor. Every candidate model gets compared against it. The first time you have to make a routing decision, you have a defensible answer to the question "would this be better?"
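A minimal sketch of the baseline run. The judge below is a deliberately naive stand-in so the code executes; in production it would be the rubric-anchored, calibrated LLM-as-judge described above, and the floor is computed per tier because per-tier is the granularity routing decisions need.

```python
# A minimal baseline run. `judge` is any calibrated pass/fail callable;
# the toy one here exists only to make the sketch runnable.
from typing import Callable

def baseline_floor(outputs_by_tier: dict[str, list[str]],
                   judge: Callable[[str], bool]) -> dict[str, float]:
    """Pass rate of the current model's outputs, per tier. This is the floor
    every candidate model must clear before it can take over a tier."""
    return {tier: sum(map(judge, outs)) / len(outs)
            for tier, outs in outputs_by_tier.items()}

toy_judge = lambda out: len(out) > 20  # stand-in; never ship a judge this naive

current_outputs = {
    "simple":    ["OK", "Invoice total: $1,240.00 (net 30, due May 12)."],
    "mid":       ["S: ... O: ... A: otitis media (H66.90) P: amoxicillin 10d."],
    "reasoning": ["Step 1: ... Step 2: ... Conclusion: escalate to underwriting."],
}
print(baseline_floor(current_outputs, toy_judge))
# {'simple': 0.5, 'mid': 1.0, 'reasoning': 1.0} -- your floor, per tier
```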
The open-weights moment is a savings opportunity. The eval rig is what converts the opportunity into the bill drop. The teams that win the next twelve months will not be the ones that picked the right model in April. They will be the ones whose evaluation infrastructure can pick the right model every week.
- The eval rig is the moat. Cursor switched to Kimi K2.5 because their perplexity-based eval was ready. Lindy tests new OSS models the moment they release because their eval rig lets them. The teams capturing the open-weights savings had the eval before they had the new model.
- Public benchmarks are not the eval that lets you switch. MMLU and SWE-Bench measure something else. Your prompt distribution is not the benchmark distribution. Production routing decisions made on public-benchmark scores are decisions made with the wrong data.
- Continuous evaluation has three properties — use-case specific, continuous over time, and quantified per prompt class. The output is not a leaderboard rank. It is a routing-actionable per-tier score.
- Four workloads must run continuously: pre-deployment, per-request routing, drift detection, regression. Skip any of them and a different failure mode shows up: stalled adoption, missed savings, silent quality drift, or production breakage from prompt changes.
- Most teams stall on the organizational work, not the technical work. Quality ownership is rarely clean. Eval rigs require engineering, product, and domain experts to agree on a written rubric — which most teams have never done explicitly. That is the bottleneck for capturing the savings the previous post in this series modeled.
References
1. TechCrunch, "Cursor admits its new coding model was built on top of Moonshot AI's Kimi" (March 22, 2026). Cursor co-founder Aman Sanger, on selecting Kimi K2.5 as the base for Composer 2: "We've evaluated a lot of base models on perplexity-based evals and Kimi K2.5 proved to be the strongest." Composer 2 ships with Cursor's own RL on top of Kimi K2.5, served via Fireworks. techcrunch.com
2. Flo Crivello, founder of Lindy (X, April 14, 2026). "We've tested new OSS models the moment they're released for a while at Lindy. Inference is our #1 cost by a lot (more than payroll)… I think we are right now crossing the line to 'at the frontier, for most use cases.' GLM-5.1 in particular is incredible and will likely be our default soon." x.com
3. Anthropic, "Reducing Latency and Cost with Prompt Caching and LLM-as-Judge." Anthropic's published guidance: LLM-as-judge calibration against expert-labeled examples is required for the judge to produce reliable production signals. Generic "is this correct?" rubrics produce noise on most domain-specific tasks. docs.claude.com
4. Divyam.AI, "What Open Weights Would Actually Do to Your Monthly LLM Bill" (2026). Per-prompt routing across three tiers (DeepSeek V3.2 + Kimi K2.5 + Sonnet 4.6) saves 65–78% versus a $60K/month Sonnet 4.6 baseline. Where in the range a team lands is determined by the share of production traffic that can safely route off the frontier tier — an eval question. divyam.ai
5. Anthropic, Claude Developer documentation: "Develop tests" (2026). Anthropic's published guidance on building production-grade LLM evaluations: define success criteria, develop holdout sets, design grading approaches (code-based, human, LLM-as-judge), and iterate. The framework that has shipped Claude through five major model versions. docs.claude.com
6. OpenAI, "Evals" framework (open source). OpenAI's open-source library for evaluating LLM applications. MIT-licensed, used internally by OpenAI for model release evaluations and externally by hundreds of teams. Supports custom rubrics, LLM-as-judge, and templated evaluation patterns. github.com/openai/evals
7. Divyam.AI, "Open Source LLMs Just Caught Up: Why Your LLM Router Needs to Switch in a Day" (post 1 of this series). Establishes that open-source LLMs have reached frontier capability at 6–11x lower list-price inference cost than closed models, and that switching speed is the new moat. divyam.ai
This is the third and final post in our Open-Weights Moment series. Read post 1: Open Source LLMs Just Caught Up: Why Your LLM Router Needs to Switch in a Day. Read post 2: What Open Weights Would Actually Do to Your Monthly LLM Bill.