What is the difference between LLM routing and a closed-loop AI system?

LLM routing selects which model handles each request (ACT). A closed-loop system also observes whether that routing decision produced the right outcome (OBSERVE), learns from production data which decisions were suboptimal (LEARN), and recalibrates routing weights or adopts new models without manual re-evaluation cycles (ADAPT). Most platforms cover ACT. Very few cover LEARN and ADAPT.

Can competitors like Martian, Not Diamond, or Portkey build a closed-loop system?

Eventually, perhaps, but it is not a quick addition. Building a closed loop requires two distinct engineering disciplines, production routing and production evaluation, tightly integrated so that quality signals govern routing decisions in real time. Most competitors have invested in only one of the two. Beyond the architecture, each customer's quality model must be trained on that customer's own production data, creating a data flywheel that begins compounding on day one.

Is Portkey an intelligent LLM router?

No. Portkey is an AI gateway. Its own documentation describes "automatic fallbacks and load balancing at the gateway layer", which is rule-based fault tolerance rather than intelligent per-prompt model selection. Portkey does not evaluate which model would produce the best-quality response for a given prompt. Its routing is policy-driven: if Model A fails or is rate-limited, route to Model B.


Divyam.AI: Analysis of competitive landscape

How does Divyam.AI compare to Microsoft Model Router?

On the MMLU-Pro benchmark, Divyam.AI achieved about 60% cost savings versus Microsoft Model Router's 35% when limited to the same model set, and 84% once Gemini models were added to the Divyam.AI routing pool, in both cases at comparable accuracy. In the separate NVIDIA comparison, at a similar cost-savings range, Divyam.AI had about a 0.2 percentage-point accuracy drop versus NVIDIA's 18.1 percentage-point drop. Beyond benchmarks, Microsoft Model Router uses fixed routing modes with no dynamic recalibration, is limited to its supported model set, and has no quality evaluation layer.

Why closing the loop is important

April 29, 2026 · 12 min read

Abstract

Production AI infrastructure today is largely open-loop. Platforms route requests or observe outcomes, but almost none learn from those observations and recalibrate routing autonomously. This paper examines the ACT → OBSERVE → LEARN → ADAPT framework across five leading platforms: Microsoft Azure Model Router, NVIDIA LLM Router, Martian, Not Diamond, and Portkey, and identifies precisely where each stops. Every competitive claim is drawn from primary sources: each company's own published documentation, official blog posts, or peer-reviewed research.

Benchmarks: MMLU-Pro (Divyam.AI, Dec 2025). Sources cited throughout and listed in full at the end of this paper.

The Production AI Problem

In 2026, the question facing most engineering teams is not whether to use large language models. It is how to keep using them efficiently as the landscape shifts beneath them. New frontier models ship every three to four weeks. Provider pricing drops roughly 10× per year. Open-source models have closed the capability gap with proprietary ones at 10 to 17× lower cost. Teams that adopted a model six months ago are now paying a meaningful premium for equivalent or worse quality.

IDC estimates that 88% of AI pilots never reach production.^[1] McKinsey's State of AI 2025 puts 65% of organizations in permanent pilot mode, unable to scale.^[2] The bottleneck is not model capability. It is the infrastructure required to define quality, measure it continuously, and act on what was measured.

The financial stakes are concrete. On a $60,000/month LLM baseline growing at 5% monthly, a team that manually switches models captures roughly 25% of available savings. Automated per-prompt routing captures roughly 60%. The gap between those two approaches is $333,000 per year on this single baseline. It is infrastructure drag, not model lag.
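The arithmetic behind that gap can be reproduced with a short sketch. It rests on assumptions that the text does not spell out: a 12-month horizon, the first month billed at the $60,000 baseline, and the 25% and 60% capture rates applied directly to total spend. The function names are illustrative only.

```python
def annual_spend(monthly_baseline: float, monthly_growth: float, months: int = 12) -> float:
    """Total spend over `months`, with the first month at the baseline (assumption)."""
    return sum(monthly_baseline * (1 + monthly_growth) ** m for m in range(months))


def savings_gap(total_spend: float, manual_capture: float, automated_capture: float) -> float:
    """Extra savings captured by automated routing compared with manual switching."""
    return total_spend * (automated_capture - manual_capture)


spend = annual_spend(60_000, 0.05)
gap = savings_gap(spend, manual_capture=0.25, automated_capture=0.60)

print(f"12-month spend:          ${spend:,.0f}")
print(f"Gap between approaches:  ${gap:,.0f}")
```

Under these assumptions the 12-month spend comes to roughly $955,000 and the gap to roughly $334,000, in line with the figure cited above. A different horizon or a different reading of "available savings" would shift the result.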

The Core Claim

There is only one way to capture the full value of a rapidly evolving model landscape: a system that evaluates models continuously, routes each request to the right model, learns from production-specific quality signals, and recalibrates as the frontier moves. Some platforms in this analysis route. Some provide gateways or observability. Only one closes the full loop.

A Framework for Evaluating Production AI Infrastructure

Any production AI stack performs at least one of four functions. We call this the ACT → OBSERVE → LEARN → ADAPT framework. Each stage is necessary. Each is harder to build than the one before it. Each stage that is missing requires a human engineer to substitute.

ACT is model selection: deciding which model handles each incoming request, or at minimum making multiple models available through a common access layer. The meaningful differences at this stage are in routing intelligence: whether decisions are made by fixed policies, broad categories, fallback rules, or per-prompt quality estimation trained on production data.

OBSERVE is measurement: tracking whether routing decisions produced the right outcomes. This means not just latency and cost, but output quality. Several platforms provide observability. Fewer provide structured quality evaluation. Surfacing metrics is not the same as evaluating whether a specific routing decision was correct.

LEARN is inference from production data: identifying which past routing decisions were suboptimal, for which prompt types, and updating the model of what each LLM does well. This requires connecting the observation layer back to the routing layer, an integration that is architecturally non-trivial and absent from most stacks.

ADAPT is autonomous recalibration: updating routing weights, adopting new models without manual evaluation cycles, and surfacing gaps in the evaluation framework before they cause user-visible regressions. ADAPT without LEARN is impossible. LEARN without OBSERVE is blind. The loop only becomes self-improving when all four stages are connected and running continuously.
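The four stages can be made concrete with a deliberately small sketch. It is not Divyam.AI's implementation or any vendor's API: the class, the model names, the costs, the quality floor, and the moving-average update are all assumptions chosen to show how an observation feeds back into the next routing decision.

```python
from dataclasses import dataclass, field


@dataclass
class ClosedLoopRouter:
    """Toy router that keeps a quality estimate per (prompt_type, model)."""
    costs: dict[str, float]               # hypothetical cost per request, by model
    quality_floor: float = 0.8            # minimum acceptable estimated quality
    learning_rate: float = 0.5            # weight given to each new observation
    estimates: dict[tuple[str, str], float] = field(default_factory=dict)
    pending: list[tuple[str, str, float]] = field(default_factory=list)

    def _estimate(self, prompt_type: str, model: str) -> float:
        # Optimistic prior: an untried model is assumed good until observed otherwise.
        return self.estimates.get((prompt_type, model), 1.0)

    def act(self, prompt_type: str) -> str:
        """ACT: cheapest model whose estimated quality clears the floor."""
        ok = [m for m in self.costs if self._estimate(prompt_type, m) >= self.quality_floor]
        if ok:
            return min(ok, key=lambda m: self.costs[m])
        # Nothing clears the floor: fall back to the best estimate.
        return max(self.costs, key=lambda m: self._estimate(prompt_type, m))

    def observe(self, prompt_type: str, model: str, quality: float) -> None:
        """OBSERVE: record the measured quality of one routed request."""
        self.pending.append((prompt_type, model, quality))

    def learn(self) -> None:
        """LEARN: fold observations into the estimates (exponential moving average)."""
        for prompt_type, model, quality in self.pending:
            old = self._estimate(prompt_type, model)
            self.estimates[(prompt_type, model)] = old + self.learning_rate * (quality - old)
        self.pending.clear()

    def adapt(self, model: str, cost: float) -> None:
        """ADAPT: admit a new model; act() then uses it wherever it qualifies."""
        self.costs[model] = cost


router = ClosedLoopRouter(costs={"small": 1.0, "large": 10.0})

first = router.act("legal_reasoning")            # "small": cheapest, not yet observed
router.observe("legal_reasoning", first, quality=0.3)
router.learn()                                   # estimate for "small" drops to 0.65
second = router.act("legal_reasoning")           # "large": "small" is now below the floor
router.adapt("medium", cost=3.0)
third = router.act("legal_reasoning")            # "medium": qualifies and is cheaper than "large"

print(first, second, third)
```

In this sketch a poor outcome on one prompt type changes routing only for that prompt type, so other traffic keeps using the cheaper model. Removing `learn()` leaves a router that still selects and still records, but never changes its behavior, which is the open-loop case.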

A useful analogy: a thermostat that displays the current temperature is ACT + OBSERVE. One that learns your schedule and pre-heats the room before you wake is LEARN + ADAPT. The difference is not a feature addition. It is whether the system improves without you.

The four stages of a self-improving AI stack. LEARN and ADAPT, identifying what to change and acting on it automatically, are the stages that every competitor in this analysis stops short of.

The Landscape: Where Each Competitor Stops

The following assessment draws exclusively from each company's own published documentation, official blog posts, and peer-reviewed research. Where benchmark data exists, it is cited.

Microsoft Azure Model Router: Stops at ACT (rule-based)

Microsoft's Model Router deploys as a single Azure AI Foundry endpoint.^[3] Administrators choose from three fixed routing modes: Balanced, Quality, or Cost, and select from a supported model set (currently 18, primarily Azure OpenAI). The system routes each incoming request according to the selected mode. It does not learn from whether those routing decisions produced good outcomes. Adding a new model requires explicit operator enrollment and redeployment. On the MMLU-Pro benchmark, Microsoft Model Router achieves approximately 35% cost savings at comparable accuracy, a meaningful starting point for teams committed to the Azure OpenAI ecosystem.^[4]

The ceiling is the constraint. Three fixed modes cannot capture the cost-quality tradeoff at per-prompt granularity. A request requiring three sentences of factual retrieval and a request requiring multi-step legal reasoning route through the same mode logic. In the MMLU-Pro comparison, Divyam.AI achieved about 60% cost savings versus Microsoft's 35% when limited to the same model set. After adding Gemini models to the Divyam.AI routing pool, Divyam.AI reached 84% cost savings at comparable accuracy versus Microsoft's 35%.^[4]

NVIDIA LLM Router v2: Stops at ACT (category-based, experimental)

NVIDIA's router is an experimental blueprint that classifies incoming requests into one of six predefined complexity categories, such as domain_knowledge, chit_chat, and hard_question, and routes based on category.^[5] An optional neural network trained on CLIP embeddings can incorporate historical usage patterns, but retraining is manual and performed via provided notebooks. The system does not proxy requests; it returns a model name recommendation only, leaving the actual routing integration to the developer.

The fundamental constraint is category-level granularity. On the MMLU-Pro benchmark, 89% of prompts were classified into the single category domain_knowledge, collapsing most of the router's decision space. In the cited cost-saving configuration, Divyam.AI and NVIDIA were in a similar savings range, about 84% and 82% respectively. The difference was quality: Divyam.AI's accuracy drop was about 0.2 percentage points, while NVIDIA's was 18.1 percentage points.^[4] The experimental label is accurate: this is a reference architecture, not a production system.
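A small sketch illustrates why category-level routing collapses when one category dominates. The category names come from the description above, but the category-to-model table, the model names, and the split of the remaining 11% of prompts are invented for illustration and are not NVIDIA's configuration.

```python
from collections import Counter

# Hypothetical lookup table: one model per category, whatever the prompt says.
CATEGORY_TO_MODEL = {
    "domain_knowledge": "model-large",
    "chit_chat": "model-small",
    "hard_question": "model-large",
}


def route(category: str) -> str:
    """Category-based routing: the decision depends only on the category label."""
    return CATEGORY_TO_MODEL[category]


def dominant_share(categories: list[str]) -> tuple[str, float]:
    """Return the most common category and the fraction of prompts it covers."""
    label, count = Counter(categories).most_common(1)[0]
    return label, count / len(categories)


# 89 of 100 prompts in one category, as reported for MMLU-Pro above.
# The 6 / 5 split of the remainder is made up.
sample = ["domain_knowledge"] * 89 + ["chit_chat"] * 6 + ["hard_question"] * 5

label, share = dominant_share(sample)
models_for_dominant = {route(c) for c in sample if c == label}

print(f"{share:.0%} of prompts fall in '{label}' and all route to {models_for_dominant}")
```

Every prompt inside the dominant category receives the same model, however easy or hard it is, so the router effectively makes a single decision for most of the traffic.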

Martian (WithMartian.com): Gateway and research tools

Martian describes itself as "an AI research lab focused on understanding machine intelligence" that makes selected internal tools externally available.^[6] Its current public product surface centers on three offerings: Gateway, ARES, and K-Steering.

The Gateway provides "unified access to 200+ AI models through a single API" and tracks "real-time usage, model performance, and request history."^[15] ARES is an "RL-first framework for training and evaluating LLM agents" that treats LLM requests as observations and responses as actions.^[16] K-Steering is a toolkit for steering model behavior at inference time by "modifying internal activations at specific layers without fine-tuning the base model."^[17]

In the ACT → OBSERVE → LEARN → ADAPT framework, Martian's public Gateway covers model access and usage visibility. ARES and K-Steering support agent training, evaluation, and inference-time behavior control. These are technically sophisticated capabilities, especially for agent research and controllable generation. The current public docs, however, do not describe a production routing system that evaluates live outcomes, learns which model should handle each customer prompt class, and autonomously recalibrates routing.

Not Diamond: Intelligent routing + prompt-level learning, stops before ADAPT

Not Diamond offers per-request intelligent routing recommendations with an RL-guided prompt optimization loop.^[7] The agentic prompt optimization is a genuine capability: it adapts prompts to different models with as few as three labeled examples,^[8] so it works with limited data. Not Diamond claims 30 to 90% cost savings and 10 to 100 ms routing latency.

The key distinction: Not Diamond's learning applies to prompt formulation, not to routing quality. The system does not evaluate whether its routing decisions produced good outcomes in production, does not detect when a model it routes to has degraded, and does not autonomously update routing strategy based on model performance changes. The routing decision itself is stateless: each request is routed without reference to whether prior routing decisions were correct.

Portkey: Gateway with rule-based fault tolerance, stops at OBSERVE

Portkey is an AI gateway, not an intelligent router. Its own documentation is explicit: the platform provides "automatic fallbacks and load balancing at the gateway layer" and ensures "provider outages get resolved before they reach your agents."^[9] This is rule-based fault tolerance. If Model A fails or is rate-limited, route to Model B. It is not intelligent per-prompt model selection based on quality. Portkey does not evaluate which model would produce the best response for a given prompt type.
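Rule-based fault tolerance of this kind can be sketched in a few lines. This is a generic illustration, not Portkey's API or configuration format: `ProviderError`, `call_with_fallback`, and the two stub models are hypothetical names.

```python
from typing import Callable


class ProviderError(Exception):
    """Raised when a provider call fails or is rate-limited (hypothetical)."""


def call_with_fallback(
    prompt: str, providers: list[tuple[str, Callable[[str], str]]]
) -> tuple[str, str]:
    """Try providers in a fixed order; return (name, response) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))


def model_a(prompt: str) -> str:
    # Stub standing in for a provider that is currently rate-limited.
    raise ProviderError("rate limited")


def model_b(prompt: str) -> str:
    # Stub standing in for a healthy provider.
    return f"model-b answer to: {prompt}"


used, reply = call_with_fallback(
    "Summarize this ticket.", [("model-a", model_a), ("model-b", model_b)]
)
print(used, "->", reply)
```

The order of providers is set by policy and the prompt never influences which model is tried first. That is the difference from per-prompt selection: the rule reacts to failures, not to the expected quality of each model on this request.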

Where Portkey excels is observability and governance: 40+ metrics, full agent traces including MCP calls, real-time anomaly detection, RBAC, budget controls, PII/PHI guardrails, and SSO.^[10] It makes the operational management of many concurrent agents genuinely tractable. But observability without automated response is a dashboard. When Portkey surfaces a quality regression, a human engineer must still diagnose, decide, and manually update the routing configuration. The observation does not close back into action.

Capability Comparison

● Automated capability · ◑ Partial or manual · ○ Not available
Microsoft is classified as a router because it selects models, even though its policy is fixed and does not learn from customer production outcomes. Martian's ARES and K-Steering are RL / steering tools, but the current public docs do not document production routing learning or autonomous routing recalibration. Portkey's routing is rule-based fault tolerance (fallbacks, load balancing), not intelligent per-prompt model selection. Sources: [3][4][5][6][7][9][15][16][17]

Why the Loop Compounds

An open-loop system performs at a fixed level. The cost and quality outcomes it produces in month one are roughly the same as in month twelve, absent manual intervention. A closed-loop system improves automatically. Each cycle produces better data, which produces better routing decisions, which produce better outcomes, which produce better data.

This is not a theoretical claim. It is observable in production deployments.

Metric	First Cycle	Annual (Compounded)
Cost savings	50%	75%
Quality gain	5%	20%

The mechanism is not complicated. First, the system builds a domain-specific quality definition from expert judgment and production context. Then it benchmarks candidate models against the actual workload, not a generic benchmark. As new requests, annotations, and models enter the system, the evaluation layer and router keep the cost-quality frontier current. The result is not a one-time model migration; it is a repeatable optimization process that reduces the engineering work required to keep production AI current.

The customer-specific nature of the quality model is the data flywheel. A generic routing model, trained on public benchmarks, cannot know that a travel-planning agent, a healthcare support bot, and a product-equality engine each define quality differently. Divyam.AI calibrates optimization to the customer's own task, data, constraints, and quality targets. That is why the router can compare frontier, smaller proprietary, and open-weight models on the workload that actually matters, then choose the best operating point for each request.

Can Competitors Build This?

This question deserves a direct answer, because it is the right question for anyone evaluating AI infrastructure for durability.

Competitors may eventually try to build this. But customers do not compete on eventual roadmaps. They compete on what their AI systems can deliver now: lower cost, stable quality, faster model adoption, and less engineering drag.

Two hard problems, one integration. Building a production routing engine is a substantial engineering effort. Building a production evaluation engine, one that achieves 92% agreement with human judgment at 100× lower cost than LLM-as-judge evaluation, is an equally substantial effort in a different discipline.^[11] It requires expertise in reward modeling, Item Response Theory-based skill estimation, human feedback pipeline design, and drift detection. Most companies have invested in one domain. Divyam.AI has invested in both, with the integration ensuring that quality signals govern routing decisions in real time. That integration is where the architectural complexity lives. It is not a feature that can be added in a sprint.

The data flywheel becomes a switching moat. Divyam.AI does not just route requests or score outputs. It learns the cost-quality curve for each customer's agents, constraints, and production traffic, then uses that intelligence to keep improving automatically while preserving quality. Replacing that loop with a generic router or standalone eval platform is not a like-for-like swap. Costs can rise immediately, quality can drop, and the agent-specific reward model has to be rebuilt from scratch. Switching providers means forfeiting the intelligence calibrated to your data and restarting the flywheel.

Time is the structural moat. The algorithms underlying this system are documented in academic literature. IRT has existed since 1968. Reward modeling is well understood. The moat is operational: a conservative estimate for a well-resourced team to reach comparable production-grade closed-loop quality is 18 months of engineering. By that point, customers using Divyam.AI will have completed multiple compounding cycles and will have quality models calibrated to their specific workloads. Every month of delay is another cycle of compounding that widens the gap.

Evidence from Production

Organization	Use Case	Cost Reduction	Quality Outcome
MakeMyTrip^[12]	Myra AI travel assistant	63%	Zero quality loss
PharmEasy^[13]	Easybot customer support agent	30%	95% chat closure rate improvement
Flash.co^[14]	Product equality engine	30%	15% quality uplift

On MMLU-Pro, the Microsoft comparison and the NVIDIA comparison measure different tradeoffs. Against Microsoft, Divyam.AI achieved about 60% savings versus Microsoft's 35% when limited to the same model set, and 84% savings after adding Gemini models, both at comparable accuracy. Against NVIDIA, the cited comparison is about accuracy preservation at a similar cost-savings range: Divyam.AI had about a 0.2 percentage-point accuracy drop, while NVIDIA had an 18.1 percentage-point drop.^[4]

The MakeMyTrip result is instructive for a specific reason. Myra is a complex multi-agent travel assistant, and optimizing it was not a simple "switch from Model A to Model B" exercise. The Query Planner alone involved six modules and roughly ten candidate models per module, including frontier, smaller, and open-source/open-weight options. That created about one million possible combinations to evaluate manually. Divyam.AI replaced that brute-force cycle with algorithmic search, intelligent benchmark sampling, and fine-grained prompt routing. The result was a 63% cost reduction with zero quality loss, deployed in MMT's cloud environment with a single-line integration and full auditability.^[12]
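The size of that search space follows from simple multiplication. The sketch below assumes exactly ten candidates in each of the six modules and independent choices per module; the text gives "roughly ten", so the real count is approximate.

```python
import math

# Assumed candidate count per module (the text says "roughly ten" for each of six modules).
candidates_per_module = [10, 10, 10, 10, 10, 10]

# One model is chosen per module, so the per-module counts multiply.
combinations = math.prod(candidates_per_module)

print(f"{len(candidates_per_module)} modules -> {combinations:,} configurations")
```

Each additional module or candidate multiplies the total, which is why exhaustive manual evaluation stops being practical and a guided search becomes necessary.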

The Structural Property

The question for any AI infrastructure investment is not "does it work today?" It is "does it keep finding the best cost-quality operating point as models, prices, prompts, and user behavior change?" Open-loop systems require repeated engineering cycles to stay competitive. Divyam.AI turns that work into an infrastructure loop: define quality, benchmark the live workload, route at request level, monitor outcomes, and keep recalibrating as better options appear.

Key Takeaways

Portkey is a gateway, not an intelligent router. Its own documentation describes "automatic fallbacks and load balancing at the gateway layer."^[9] That is rule-based fault tolerance. It does not perform quality-based per-prompt model selection.
Martian should be reassessed as a research-tool and gateway company, not simply an intelligent-router company. Its current public docs emphasize Gateway access, ARES for agent RL, and K-Steering for inference-time activation steering.^[15]^[16]^[17] Those tools are advanced, but they do not document autonomous routing recalibration from live production outcomes.
Microsoft and NVIDIA offer useful starting points, but the benchmark tradeoffs are specific. Microsoft achieved 35% savings at comparable accuracy, while Divyam.AI achieved about 60% on the same model set and 84% after adding Gemini models. NVIDIA reached a similar cost-savings range, but with a much larger accuracy drop: 18.1 percentage points versus Divyam.AI's 0.2 points in that configuration.^[4]
The compounding effect is real and measurable. This paper frames typical first-cycle gains at roughly 50% cost reduction and 5% quality improvement, compounding toward roughly 75% annual cost reduction and 20% annual quality improvement as the loop keeps running.
The moat is structural, not temporary. It requires two distinct engineering disciplines, tight integration between them, and customer-specific production data that begins accumulating on day one and makes switching increasingly costly over time.

Sources

[1] IDC research on AI pilot failure rates, cited in Divyam.AI production readiness analysis. Available at divyam.ai/whitepaper.
[2] McKinsey & Company. The State of AI 2025. McKinsey Global Institute, 2025.
[3] Microsoft. Model Router: Azure AI Foundry. Microsoft Learn, 2026. learn.microsoft.com
[4] Divyam.AI Research. LLM Router Comparison: Divyam.AI vs Microsoft Model Router vs NVIDIA LLM Router. MMLU-Pro benchmark, December 2025. divyam.ai/blog/divyam-router-vs-microsoft-nvidia
[5] NVIDIA. LLM Router v2: AI Blueprints. GitHub, 2025 (experimental). github.com/NVIDIA-AI-Blueprints/llm-router
[6] Martian. Martian Documentation. 2026. Direct quote: "an AI research lab focused on understanding machine intelligence." docs.withmartian.com
[7] Not Diamond. What is Not Diamond? Not Diamond Documentation, 2025. docs.notdiamond.ai
[8] Not Diamond. Pricing & FAQ. notdiamond.ai, 2025. "Works with as few as 3 data samples." notdiamond.ai/pricing
[9] Portkey. Agent Gateway. Portkey Blog, 2026. Direct quote: "Automatic fallbacks and load balancing at the gateway layer. Provider outages get resolved before they reach your agents." portkey.ai/blog/agent-gateway
[10] Portkey. Your First AI Agent Will Go Fine. Your Fiftieth Is Where Things Get Interesting. Portkey Blog, 2026. portkey.ai/blog
[11] Divyam.AI. The Divyam.AI Platform at a Glance. Divyam.AI Blog, April 2026. EvalMate rewards model: ~8B parameters, 92% agreement with human judgment, 100× cheaper than LLM-as-judge evaluation. divyam.ai/blog/divyam-platform-at-a-glance
[12] Divyam.AI. MakeMyTrip Case Study: 63% LLM Cost Savings, Zero Quality Loss. divyam.ai/customers/makemytrip
[13] Divyam.AI. PharmEasy Case Study: 95% Chat Closure Rate Improvement, 30% Cost Savings. divyam.ai/customers/pharmeasy
[14] Divyam.AI. Flash.co Case Study: 15% Quality Uplift, 30% Cost Savings. divyam.ai/customers/flash
[15] Martian. Gateway. Martian Documentation, 2026. Direct quotes: "unified access to 200+ AI models through a single API"; "real-time usage, model performance, and request history." docs.withmartian.com/gateway
[16] Martian. ARES. Martian Documentation, 2026. Direct quote: "RL-first framework for training and evaluating LLM agents." docs.withmartian.com/ares
[17] Martian. K-Steering Core Concepts. Martian Documentation, 2026. Direct quote: "modifying internal activations at specific layers without fine-tuning the base model." docs.withmartian.com/k-steering/core-concepts

For a deeper look at the platform architecture that makes this possible: The Divyam.AI Platform at a Glance →

For the full MMLU-Pro benchmark results: LLM Router Comparison: Divyam.AI vs. Microsoft vs. NVIDIA →

For the cost math behind the compounding returns: The Model Inertia Problem →