Switching Models in a Day Is an Eval Problem, Not a Model Problem.
The bottleneck isn't picking a model. It's running a continuous evaluation against your real prompt distribution every time a new one ships. Here's what that takes — and why it's the lever that decides where your savings land.
Continuous LLM evaluation — not model choice — is the lever that decides where you land in the 55–78% savings range the previous post in this series modeled. The teams capturing the open-weights moment are the ones that can answer four questions on demand:
- Does this new model clear the quality bar for any of my prompt tiers? (Pre-deployment.)
- Which tier should this specific request go to? (Per-request routing.)
- Is my current model degrading on real production traffic? (Drift detection.)
- Did this prompt change break anything? (Regression.)
Public benchmarks like MMLU and SWE-Bench answer none of these. The eval rig that does is the actual product moat for any team that wants to switch models in a day instead of a quarter.
The first two posts in this series argued that open-weights LLMs have caught up at 6–11x lower price and modeled what that switch would actually do to a $60K/month bill. Both posts ended at the same place: the savings a team captures depend on how much of its prompt distribution can safely route off the frontier tier. That is an evaluation question, not a model question. This post explains what continuous LLM evaluation actually means in production: the four eval workloads a multi-model stack requires, why public benchmarks answer none of them, and why most teams stall before building the eval infrastructure that would let them switch models in a day instead of a quarter. The teams that have captured the open-weights savings — Cursor, Lindy, Cloudflare, Sully.ai — all share one thing: they had the eval rig before they had the new model.
Cursor switched to Kimi K2.5 because their eval rig was ready
The most consequential open-weights adoption event of 2026 was Cursor shipping Composer 2 on March 19, built on Kimi K2.5 as the foundation with Cursor's own RL on top. Three days later, co-founder Aman Sanger explained the decision: "We've evaluated a lot of base models on perplexity-based evals and Kimi K2.5 proved to be the strongest."1 Cursor did not switch because Kimi K2.5 was on the leaderboard. They switched because their internal evaluation, run against a corpus of code-completion prompts that look like Cursor's actual production traffic, found Kimi K2.5 was the best base for their downstream training.
The same story plays out at Lindy. Flo Crivello: "We've tested new OSS models the moment they're released for a while at Lindy."2 That is not a remark about curiosity. It is a remark about eval infrastructure. Lindy has the rig that lets them know, within hours of a release, whether GLM-5.1 should become the default. Most teams cannot make that decision in a quarter.
The pattern is consistent across teams that have publicly captured the open-weights savings. They had the eval rig before they had the new model. The eval rig is the moat.
Most "evals" are not the eval that lets you switch
When most teams say they "do evals," they mean one of three things. None of them are sufficient for routing decisions in production:
Public benchmarks. MMLU, SWE-Bench, GPQA, Humanity's Last Exam. These measure aggregate performance on standardized tasks. They are useful for triage — ruling out models that are obviously below frontier — but a model that scores 76.8% on SWE-Bench Verified might handle 95% of your prompts perfectly and fail catastrophically on the 5% that look unlike anything in the benchmark. Your prompt distribution is not the benchmark distribution. A team that decides routing based on public benchmark scores is making a decision with the wrong data.
One-off vendor A/B tests. Two weeks of shadow traffic on Kimi K2.5 versus Sonnet 4.6, a side-by-side score, a decision. The problem: the Kimi K2.5 you tested in February is not the Kimi K2.5 DeepInfra is serving in May, because the same model name covers a moving target of quantizations, batching configurations, and provider-side optimizations. A point-in-time comparison degrades the moment either model changes — which, on the frontier curve, is every few weeks.
Generic LLM-as-judge with no rubric. "Is this answer correct?" scored by GPT-5 across a sample of outputs. This works for narrow factual tasks but produces noise on most production workloads. A radiologist's draft note, a customer-support escalation, a tool-call argument for an agent — "correct" is the wrong question. The right question is whether the output meets the rubric the application's users actually care about, which has to be written down, validated, and re-validated.3
Each of these is a useful component of a real eval rig. None of them, alone, is the eval that lets you switch models in production.
What continuous evaluation actually means
Three properties separate continuous evaluation from one-off testing:
Use-case-specific. The criteria are derived from your actual application, not from a public benchmark. For Sully.ai's medical-note generation, the criteria include diagnostic accuracy, billing-code correctness, and adherence to a specific SOAP-note format. For Cloudflare's security-review agent, the criteria include vulnerability classification accuracy and false-positive rate on a held-out corpus of pull requests. The rubric is the application.
Continuous. The evaluation runs every time a new model ships, every time a prompt or rubric changes, every time production traffic patterns drift. Not as a project that is re-launched when someone notices a problem. As a daily or hourly background process that produces a fresh score every time the inputs change.
Quantified per prompt class. The output is not a single number. It is a per-tier score that routing logic can act on: "Kimi K2.5 clears the bar on simple and mid tiers; falls below the bar on the reasoning tier." That is the data shape that lets a router make per-request decisions, not a "model X is better than model Y" overall verdict.
The output of an eval rig is not a leaderboard rank. It is a per-prompt-class quality score that routing logic can act on.
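To make that data shape concrete, here is a minimal sketch in Python. The tier names, scores, and quality bars are illustrative, not any particular vendor's schema; the numbers mirror the example verdict above.

```python
# A minimal sketch of the shape an eval rig hands to routing logic.
# Tier names, models, and thresholds here are illustrative only.
from dataclasses import dataclass

@dataclass
class TierScore:
    model: str
    tier: str           # e.g. "simple", "mid", "reasoning"
    score: float        # rubric pass rate on this tier's eval sample, 0.0-1.0
    quality_bar: float  # minimum acceptable score for this tier

    def clears_bar(self) -> bool:
        return self.score >= self.quality_bar

# Pre-deployment eval output for a candidate model, one row per tier:
candidate_results = [
    TierScore("kimi-k2.5", "simple",    score=0.97, quality_bar=0.95),
    TierScore("kimi-k2.5", "mid",       score=0.93, quality_bar=0.90),
    TierScore("kimi-k2.5", "reasoning", score=0.81, quality_bar=0.90),
]

# The router acts per tier, not on an overall verdict:
eligible_tiers = {r.tier for r in candidate_results if r.clears_bar()}
print(eligible_tiers)  # {'simple', 'mid'} -- reasoning stays on the frontier model
```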
The four eval workloads of a multi-model stack
A production stack that switches models in a day, not a quarter, runs four distinct evaluation workloads continuously. Each answers a different question:
| Workload | Question it answers | Cadence |
|---|---|---|
| Pre-deployment | Does this new model clear the bar for any of my tiers? | Every time a candidate model ships |
| Per-request routing | Which tier should this specific prompt go to? | Every production request |
| Production drift | Is my chosen model degrading on real traffic? | Continuous, with alerting |
| Regression | Did this prompt or rubric change break anything? | Every prompt or rubric change |
None of these workloads is optional. Skip pre-deployment and you cannot adopt new models without a manual sprint. Skip per-request routing and your bill lands at Scenario A (the previous post's single-model swap) instead of Scenario B (per-prompt routing); the difference, on a $60K/month baseline, is roughly $58K/year. Skip drift detection and your captured savings degrade silently as model quality changes under you. Skip regression and a prompt update on Tuesday breaks production on Wednesday.
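As one concrete instance, drift detection can start as something very small: a rolling comparison of recent judge scores against a trailing baseline, with an alert threshold. The window sizes and threshold below are illustrative defaults, not recommendations.

```python
# A minimal drift check, assuming daily judge scores on a rolling production
# sample. Window sizes and the alert threshold are illustrative defaults.
from statistics import mean

def detect_drift(daily_scores: list[float],
                 baseline_window: int = 14,
                 recent_window: int = 3,
                 max_drop: float = 0.05) -> bool:
    """Alert when the recent mean score falls more than `max_drop`
    below the trailing baseline mean."""
    if len(daily_scores) < baseline_window + recent_window:
        return False  # not enough history yet
    baseline = mean(daily_scores[-(baseline_window + recent_window):-recent_window])
    recent = mean(daily_scores[-recent_window:])
    return (baseline - recent) > max_drop

# A provider-side change (new quantization, new batching config) shows up as a
# score drop on the same prompts, even though the model name never changed:
scores = [0.92] * 14 + [0.91, 0.85, 0.84]
print(detect_drift(scores))  # True -- page whoever owns the routing policy
```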
Why most teams stall
Building this from scratch is a quarter of engineering work, and the technical work is only half of it.
The technical work: sample collection (which prompts? how many? how stratified?), prompt classification by complexity, rubric development with domain experts, judge selection and calibration, scoring infrastructure, drift monitoring, dashboards, alert routing. Each of these is well-understood in isolation. Stitching them into a continuous loop that runs against every new model release is not.4
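Judge calibration, for instance, reduces to a measurable check: how often does the judge agree with your domain experts on outputs the experts have already labeled? A minimal sketch follows, with a deliberately toy stand-in judge so it runs; what counts as acceptable agreement is your call, and the figure in the comment is a common heuristic, not a standard.

```python
# A minimal judge-calibration check against expert-labeled outputs.
# `judge` is any callable returning pass/fail; in practice, an LLM-as-judge
# call with your rubric in the prompt.

def agreement_rate(judge, labeled_examples: list[tuple[str, bool]]) -> float:
    """Fraction of expert-labeled outputs where the judge agrees with the
    expert. Below roughly 0.85 agreement (a common heuristic, not a standard),
    judge scores are noise, not signal."""
    hits = sum(judge(output) == expert_label
               for output, expert_label in labeled_examples)
    return hits / len(labeled_examples)

# Toy stand-in judge; in production this is a rubric-anchored LLM call.
toy_judge = lambda output: "diagnosis" in output.lower()

labeled = [
    ("Diagnosis: otitis media. ICD-10 H66.90.", True),
    ("Patient seems fine I guess.", False),
    ("Assessment and plan documented, diagnosis coded.", True),
]
print(agreement_rate(toy_judge, labeled))  # 1.0 on this toy sample
```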
The organizational work is harder. Quality ownership is rarely clean: engineering owns the infrastructure, product owns the feature, domain experts own what "good" means, and on-call owns the consequences when the rubric was wrong. Eval rigs require all four to agree on a written rubric, which most teams have never done explicitly. Most teams stop at "we collected some prompts and ran them once," which produces a snapshot rather than a continuous signal. The team picks a model, ships it, and cannot tell when the next model would be a better choice. That gap is exactly the Model Inertia the open-weights moment is supposed to close.
What the closed loop actually looks like
The teams that have captured the open-weights savings are running a continuous loop that connects routing, evaluation, and production traffic. The shape, with a code sketch of the policy-update step after the list:
- Router directs each production request to the model selected for its prompt tier, with fallback rules for failures.
- Production responses, prompts, and metadata stream into the eval store as a continuously updated sample of real traffic.
- Eval workloads run against that sample on a schedule: pre-deployment when a candidate ships, drift detection daily, regression on every prompt change.
- The eval results update the routing policy: a tier promotion when a new model clears the bar, a tier demotion when a model degrades, a fallback rule when a regression is detected.
- Router picks up the new policy on the next request, and the loop repeats.
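Here is a minimal sketch of the policy-update step in that loop, consuming per-tier scores of the shape sketched earlier. All model names, scores, and bars are hypothetical, and a production version adds canary fractions, fallback chains, and human sign-off on tier promotions.

```python
# A minimal sketch of the step where eval results update the routing policy.
# Names, scores, and bars are hypothetical, not any vendor's actual schema.

Policy = dict[str, str]  # tier -> model currently serving it

def update_policy(policy: Policy,
                  eval_scores: dict[tuple[str, str], float],  # (model, tier) -> score
                  quality_bars: dict[str, float],             # tier -> minimum score
                  cost_rank: dict[str, int]) -> Policy:       # model -> cost rank, low = cheap
    """Route each tier to the cheapest model that clears its bar;
    keep the incumbent when nothing does."""
    new_policy = dict(policy)
    for tier, bar in quality_bars.items():
        passing = [m for (m, t), s in eval_scores.items() if t == tier and s >= bar]
        if passing:
            new_policy[tier] = min(passing, key=lambda m: cost_rank[m])
    return new_policy

policy = {"simple": "sonnet-4.6", "mid": "sonnet-4.6", "reasoning": "sonnet-4.6"}
scores = {
    ("kimi-k2.5", "simple"): 0.97, ("kimi-k2.5", "mid"): 0.93,
    ("kimi-k2.5", "reasoning"): 0.81,
    ("sonnet-4.6", "simple"): 0.98, ("sonnet-4.6", "mid"): 0.95,
    ("sonnet-4.6", "reasoning"): 0.94,
}
bars = {"simple": 0.95, "mid": 0.90, "reasoning": 0.90}
costs = {"kimi-k2.5": 1, "sonnet-4.6": 3}

print(update_policy(policy, scores, bars, costs))
# {'simple': 'kimi-k2.5', 'mid': 'kimi-k2.5', 'reasoning': 'sonnet-4.6'}
```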
The team running this loop does not run migrations. The system does. New OSS models flow in transparently, the bill drops as routing matures, and a frontier-tier release that beats the current best gets promoted within hours rather than across a quarter-long migration project.
EvalMate runs the loop
The eval rig described above is what Divyam.AI's EvalMate is built to operationalize. The product flow:
- Describe your use case in natural language. EvalMate generates a candidate quality rubric, validated against a sample of your real prompts.
- Pre-deployment evaluation runs automatically against any candidate model registered in your stack.
- Production traffic streams into the eval store via the divyam-llm-interop layer, which provides the unified request/response capture across providers.
- Drift, regression, and routing-decision evals run on the same sample on a schedule, with alerts for material changes.
- Results feed Divyam.AI's Model Router, which adjusts the per-prompt routing policy without an application code change.
The measurable outcome is the one this series has pointed at since post one: a team whose stack switches models in a day, captures most of the open-weights savings range, and does not run a migration sprint when the next frontier-grade model ships in three weeks.
What to do this week
If your team has not yet built an eval rig that can answer the four questions above, here are three actions worth taking before any model migration:
One: sample 500 production prompts and classify them by complexity. Simple (classification, formatting, template summarization), mid (most RAG, agent steps, customer-facing generation), reasoning-heavy (multi-step analysis, complex extraction, judgment calls). The share that lands in each tier is the most important number for your routing decision — more important than any vendor benchmark.
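A minimal sketch of that classification step, assuming the OpenAI SDK as the classifier client; the tier definitions, classifier prompt, model name, and inline sample are all illustrative, and classification is itself a simple-tier task, so a cheap model is fine here.

```python
# A minimal sketch of step one. SDK choice, model name, and prompt wording
# are assumptions; swap in whatever client your stack already uses.
import collections
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY in the environment
TIERS = ("simple", "mid", "reasoning")

CLASSIFIER_PROMPT = """Classify this LLM prompt's complexity as exactly one of:
simple (classification, formatting, template summarization),
mid (RAG, agent steps, customer-facing generation),
reasoning (multi-step analysis, complex extraction, judgment calls).
Reply with the single word only.

Prompt:
{prompt}"""

def classify(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any cheap model; the labeling task is simple-tier
        messages=[{"role": "user", "content": CLASSIFIER_PROMPT.format(prompt=prompt)}],
    )
    label = (resp.choices[0].message.content or "").strip().lower()
    return label if label in TIERS else "mid"  # default to mid on judge noise

# In practice: 500 prompts sampled from production logs, stratified by endpoint.
sample = ["Format this date as ISO 8601: March 19, 2026",
          "Given these six retrieved passages, answer the customer's question...",
          "Decide whether this claim should be escalated, and justify the call..."]
shares = collections.Counter(classify(p) for p in sample)
print({t: shares[t] / len(sample) for t in TIERS})
```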
Two: write down what "good" means for each tier. Not in a doc that lives in someone's notes. In a rubric the team agrees on, with examples of pass and fail outputs for each criterion. Anthropic's published evaluation guidance is a useful starting point.5 So is OpenAI's open-source Evals framework.6 Both are battle-tested by teams that have shipped production AI at scale.
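What "written down" can look like in practice: a rubric as structured data, one entry per criterion, with pass and fail exemplars attached. The criteria below echo the medical-note example from earlier in this post and are purely illustrative; yours come from your domain experts.

```python
# A sketch of a written-down rubric for one tier. Criteria, exemplars, and
# the pass bar are illustrative stand-ins for what your experts define.
RUBRIC_MID_TIER = {
    "use_case": "clinical visit-note generation",
    "criteria": [
        {
            "name": "soap_format",
            "description": "Note has Subjective, Objective, Assessment, Plan sections, in order.",
            "pass_example": "S: 6yo with ear pain... O: TM erythematous... A: otitis media... P: amoxicillin...",
            "fail_example": "Kid has an ear infection, gave antibiotics.",
        },
        {
            "name": "billing_code",
            "description": "Every diagnosis carries a valid ICD-10 code.",
            "pass_example": "A: Acute otitis media, right ear (H66.91).",
            "fail_example": "A: Acute otitis media.",
        },
    ],
    "pass_bar": "all criteria must pass",  # per-criterion results roll up to pass/fail
}
```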
Three: run a baseline evaluation of your current model against that rubric. Whatever the score is, that is your floor. Every candidate model gets compared against it. The first time you have to make a routing decision, you have a defensible answer to the question "would this be better?"
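A minimal sketch of the baseline run. The judge below is a deliberately naive stand-in so the code executes; in production it would be the rubric-anchored, calibrated LLM-as-judge described above, and the floor is computed per tier because per-tier is the granularity routing decisions need.

```python
# A minimal baseline run. `judge` is any calibrated pass/fail callable;
# the toy one here exists only to make the sketch runnable.
from typing import Callable

def baseline_floor(outputs_by_tier: dict[str, list[str]],
                   judge: Callable[[str], bool]) -> dict[str, float]:
    """Pass rate of the current model's outputs, per tier. This is the floor
    every candidate model must clear before it can take over a tier."""
    return {tier: sum(map(judge, outs)) / len(outs)
            for tier, outs in outputs_by_tier.items()}

toy_judge = lambda out: len(out) > 20  # stand-in; never ship a judge this naive

current_outputs = {
    "simple":    ["OK", "Invoice total: $1,240.00 (net 30, due May 12)."],
    "mid":       ["S: ... O: ... A: otitis media (H66.90) P: amoxicillin 10d."],
    "reasoning": ["Step 1: ... Step 2: ... Conclusion: escalate to underwriting."],
}
print(baseline_floor(current_outputs, toy_judge))
# {'simple': 0.5, 'mid': 1.0, 'reasoning': 1.0} -- your floor, per tier
```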
The open-weights moment is a savings opportunity. The eval rig is what converts the opportunity into the bill drop. The teams that win the next twelve months will not be the ones that picked the right model in April. They will be the ones whose evaluation infrastructure can pick the right model every week.
- The eval rig is the moat. Cursor switched to Kimi K2.5 because their perplexity-based eval was ready. Lindy tests new OSS models the moment they release because their eval rig lets them. The teams capturing the open-weights savings had the eval before they had the new model.
- Public benchmarks are not the eval that lets you switch. MMLU and SWE-Bench measure something else. Your prompt distribution is not the benchmark distribution. Production routing decisions made on public-benchmark scores are decisions made with the wrong data.
- Continuous evaluation has three properties — use-case specific, continuous over time, and quantified per prompt class. The output is not a leaderboard rank. It is a routing-actionable per-tier score.
- Four workloads must run continuously: pre-deployment, per-request routing, drift detection, regression. Skip any of them and a different failure mode shows up: stalled adoption, missed savings, silent quality drift, or production breakage from prompt changes.
- Most teams stall on the organizational work, not the technical work. Quality ownership is rarely clean. Eval rigs require engineering, product, and domain experts to agree on a written rubric — which most teams have never done explicitly. That is the bottleneck for capturing the savings the previous post in this series modeled.
References
1. TechCrunch, "Cursor admits its new coding model was built on top of Moonshot AI's Kimi" (March 22, 2026). Cursor co-founder Aman Sanger, on selecting Kimi K2.5 as the base for Composer 2: "We've evaluated a lot of base models on perplexity-based evals and Kimi K2.5 proved to be the strongest." Composer 2 ships with Cursor's own RL on top of Kimi K2.5, served via Fireworks. techcrunch.com
2. Flo Crivello, founder of Lindy (X, April 14, 2026). "We've tested new OSS models the moment they're released for a while at Lindy. Inference is our #1 cost by a lot (more than payroll)… I think we are right now crossing the line to 'at the frontier, for most use cases.' GLM-5.1 in particular is incredible and will likely be our default soon." x.com
3. Anthropic, "Reducing Latency and Cost with Prompt Caching and LLM-as-Judge." Anthropic's published guidance: LLM-as-judge calibration against expert-labeled examples is required for the judge to produce reliable production signals. Generic "is this correct?" rubrics produce noise on most domain-specific tasks. docs.claude.com
4. Divyam.AI, "What Open Weights Would Actually Do to Your Monthly LLM Bill" (2026). Per-prompt routing across three tiers (DeepSeek V3.2 + Kimi K2.5 + Sonnet 4.6) saves 65–78% versus a $60K/month Sonnet 4.6 baseline. Where in the range a team lands is determined by the share of production traffic that can safely route off the frontier tier — an eval question. divyam.ai
5. Anthropic, Claude Developer documentation: "Develop tests" (2026). Anthropic's published guidance on building production-grade LLM evaluations: define success criteria, develop holdout sets, design grading approaches (code-based, human, LLM-as-judge), and iterate. The framework that has shipped Claude through five major model versions. docs.claude.com
6. OpenAI, "Evals" framework (open source). OpenAI's open-source library for evaluating LLM applications. MIT-licensed, used internally by OpenAI for model release evaluations and externally by hundreds of teams. Supports custom rubrics, LLM-as-judge, and templated evaluation patterns. github.com/openai/evals
7. Divyam.AI, "Open Source LLMs Just Caught Up: Why Your LLM Router Needs to Switch in a Day" (post 1 of this series). Establishes that open-source LLMs have reached frontier capability at 6–11x lower list-price inference cost than closed models, and that switching speed is the new moat. divyam.ai
This is the third and final post in our Open-Weights Moment series. Read post 1: Open Source LLMs Just Caught Up: Why Your LLM Router Needs to Switch in a Day. Read post 2: What Open Weights Would Actually Do to Your Monthly LLM Bill.