Divyam.AI: Analysis of competitive landscape
Why closing the loop is important
Production AI infrastructure today is largely open-loop. Platforms route requests or observe outcomes, but almost none learn from those observations and recalibrate routing autonomously. This paper applies the ACT → OBSERVE → LEARN → ADAPT framework to five leading platforms (Microsoft Azure Model Router, NVIDIA LLM Router, Martian, Not Diamond, and Portkey) and identifies precisely where each stops. Every competitive claim is drawn from primary sources: each company's own published documentation, official blog posts, or peer-reviewed research.
Benchmarks: MMLU-Pro (Divyam.AI, Dec 2025). Sources are cited throughout and listed in full at the end of this paper.
The Production AI Problem
In 2026, the question facing most engineering teams is not whether to use large language models. It is how to keep using them efficiently as the landscape shifts beneath them. New model releases, price changes, and open-weight alternatives arrive fast enough that yesterday's model choice can quickly become tomorrow's cost penalty. Sam Altman summarized the pricing dynamic bluntly: "The cost to use a given level of AI falls about 10x every 12 months."[18] Teams that adopted a model six months ago may already be paying a meaningful premium for equivalent or worse quality.
IDC/Lenovo research reported by CIO found that "88% of observed POCs don't make the cut to widescale deployment."[1] McKinsey's State of AI 2025 similarly found that "nearly two-thirds" of organizations had not yet begun scaling AI across the enterprise.[2] The bottleneck is not model capability. It is the infrastructure required to define quality, measure it continuously, and act on what was measured.
The financial stakes are concrete. On a $60,000/month LLM baseline growing at 5% monthly, a team that manually switches models captures roughly 25% of available savings. Automated per-prompt routing captures roughly 60%. The gap between those two approaches is roughly $333,000 per year on this single baseline. It is infrastructure drag, not model lag.
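The arithmetic behind that gap can be sketched in a few lines of Python. This is a back-of-envelope reconstruction, not the paper's own model: it assumes the first month costs $60,000, that spend compounds at 5% month over month for twelve months, and that both capture rates apply to total LLM spend. The constant and function names are illustrative.

```python
# Assumptions (not stated in the paper): month 1 costs $60,000, spend grows 5%
# month over month, and both capture rates are fractions of total LLM spend.
BASELINE_MONTHLY_USD = 60_000
MONTHLY_GROWTH = 0.05
MANUAL_CAPTURE = 0.25      # manual model switching
AUTOMATED_CAPTURE = 0.60   # automated per-prompt routing


def annual_spend(first_month: float, growth: float, months: int = 12) -> float:
    """Sum a geometrically growing monthly bill over `months` months."""
    return sum(first_month * (1 + growth) ** m for m in range(months))


total = annual_spend(BASELINE_MONTHLY_USD, MONTHLY_GROWTH)
gap = (AUTOMATED_CAPTURE - MANUAL_CAPTURE) * total

print(f"Annual spend without optimization: ${total:,.0f}")
print(f"Savings gap, automated vs manual:  ${gap:,.0f}")
```

Under these assumptions the annual spend is a little over $955,000 and the gap comes to roughly $334,000, in line with the figure quoted above. The small difference depends on rounding and on when the monthly growth is applied.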
There is only one way to capture the full value of a rapidly evolving model landscape: a system that evaluates models continuously, routes each request to the right model, learns from production-specific quality signals, and recalibrates as the frontier moves. Some platforms in this analysis route. Some provide gateways or observability. Only one closes the full loop.
A Framework for Evaluating Production AI Infrastructure
Any production AI stack performs at least one of four functions. We call this the ACT → OBSERVE → LEARN → ADAPT framework. Each stage is necessary. Each is harder to build than the one before it. Each stage that is missing requires a human engineer to substitute.
ACT is model selection: deciding which model handles each incoming request, or at minimum making multiple models available through a common access layer. The meaningful differences at this stage are in routing intelligence: whether decisions are made by fixed policies, broad categories, fallback rules, or per-prompt quality estimation trained on production data.
OBSERVE is measurement: tracking whether routing decisions produced the right outcomes. This means not just latency and cost, but output quality. Several platforms provide observability. Fewer provide structured quality evaluation. Surfacing metrics is not the same as evaluating whether a specific routing decision was correct.
LEARN is inference from production data: identifying which past routing decisions were suboptimal, for which prompt types, and updating the model of what each LLM does well. This requires connecting the observation layer back to the routing layer, an integration that is architecturally non-trivial and absent from most stacks.
ADAPT is autonomous recalibration: updating routing weights, adopting new models without manual evaluation cycles, and surfacing gaps in the evaluation framework before they cause user-visible regressions. ADAPT without LEARN is impossible. LEARN without OBSERVE is blind. The loop only becomes self-improving when all four stages are connected and running continuously.
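To make the four stages concrete, here is a minimal Python sketch of a router that connects them. It is an illustration only, not Divyam.AI's or any other vendor's implementation: the class name `ClosedLoopRouter`, the model names, the costs, the quality floor, and the exponential moving-average update are all assumptions chosen for brevity.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ClosedLoopRouter:
    """Toy closed-loop router. Every name and number here is hypothetical."""
    cost_per_call: dict[str, float]   # model name -> cost in USD
    quality_floor: float = 0.8        # minimum acceptable estimated quality
    learning_rate: float = 0.2        # weight given to each new observation
    # (prompt_class, model) -> running quality estimate in [0, 1].
    # Untried pairs start optimistic (1.0) so they get a chance to be used.
    quality: dict = field(default_factory=lambda: defaultdict(lambda: 1.0))
    traces: list = field(default_factory=list)

    def act(self, prompt_class: str) -> str:
        """ACT: pick the cheapest model whose estimated quality clears the floor."""
        eligible = [m for m in self.cost_per_call
                    if self.quality[(prompt_class, m)] >= self.quality_floor]
        if eligible:
            return min(eligible, key=self.cost_per_call.get)
        # Nothing clears the floor: fall back to the best estimated quality.
        return max(self.cost_per_call,
                   key=lambda m: self.quality[(prompt_class, m)])

    def observe(self, prompt_class: str, model: str, score: float) -> None:
        """OBSERVE: record the measured quality of one routing decision."""
        self.traces.append((prompt_class, model, score))

    def learn_and_adapt(self) -> None:
        """LEARN + ADAPT: fold traces into the estimates that act() reads."""
        for prompt_class, model, score in self.traces:
            key = (prompt_class, model)
            self.quality[key] += self.learning_rate * (score - self.quality[key])
        self.traces.clear()


router = ClosedLoopRouter(cost_per_call={"small": 0.001, "large": 0.02})
for _ in range(5):
    model = router.act("legal_summary")
    # Pretend an evaluator scored the small model poorly on this prompt class.
    score = 0.5 if model == "small" else 0.95
    router.observe("legal_summary", model, score)
    router.learn_and_adapt()

print(router.act("legal_summary"))  # shifts to the larger model after a few cycles
print(router.act("chitchat"))       # an unseen class still starts on the cheap model
```

The point of the sketch is the wiring rather than the policy: `act` reads the same estimates that `learn_and_adapt` writes, so no human has to edit a routing table when a model underperforms on one prompt class. Remove `observe` or `learn_and_adapt` and the router is open-loop again.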
A useful analogy: a thermostat that displays the current temperature is ACT + OBSERVE. One that learns your schedule and pre-heats the room before you wake is LEARN + ADAPT. The difference is not a feature addition. It is whether the system improves without you.
The Landscape: Where Each Competitor Stops
The following assessment draws exclusively from each company's own published documentation, official blog posts, and peer-reviewed research. Where benchmark data exists, it is cited.
Microsoft Azure Model Router: Stops at ACT (pre-trained router)
Microsoft's Model Router is packaged as a single Microsoft Foundry model deployment.[3] Microsoft describes it as a trained language model that analyzes each prompt in real time and selects an eligible underlying model. Administrators can choose among three routing modes, Balanced, Quality, or Cost, and can optionally restrict routing to a selected model subset. Each router version is associated with a fixed supported model set, and Microsoft notes that new base models are excluded from a custom subset until explicitly added. The public docs do not describe a customer-specific feedback loop that learns from whether routing decisions produced good outcomes in production. On the MMLU-Pro benchmark, Microsoft Model Router achieves approximately 35% cost savings at comparable accuracy, a meaningful starting point for teams committed to the Microsoft Foundry ecosystem.[4]
The ceiling is the constraint. Three deployment-level modes cannot capture every workload-specific cost-quality tradeoff at the granularity of a customer's own production prompt classes. In the MMLU-Pro comparison, Divyam.AI achieved about 60% cost savings versus Microsoft's 35% when limited to the same model set. After adding Gemini models to the Divyam.AI routing pool, Divyam.AI reached 84% cost savings at comparable accuracy versus Microsoft's 35%.[4]
There is a deeper gap than cost savings: Microsoft Model Router offers no production quality guarantee. Its docs do not describe any mechanism that measures quality before and after routing changes, or that compares pre-routing and post-routing output quality on a customer's actual workload. The router selects models, but whether that selection maintained or degraded quality in production is a question the customer must answer independently. No production engineering team can responsibly adopt infrastructure that cannot say: your quality will not drop. Divyam.AI is explicit on this point. Production quality is the first constraint, not an afterthought. Routing decisions are bounded by quality regression guardrails, and in many cases Divyam.AI's routing actively improves quality beyond the single-model baseline.
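A quality regression guardrail of the kind described here can be sketched as a simple before/after comparison. The function below is a hypothetical illustration, not Divyam.AI's mechanism: it assumes each output has already been scored in [0, 1] by some evaluator, and the name `passes_quality_guardrail` and the `max_drop` tolerance are assumptions.

```python
from statistics import mean


def passes_quality_guardrail(baseline_scores, routed_scores, max_drop=0.01):
    """Return True if mean routed quality is within `max_drop` of the baseline mean.

    baseline_scores: evaluator scores for the single-model baseline.
    routed_scores:   evaluator scores for the same workload after routing.
    """
    if not baseline_scores or not routed_scores:
        raise ValueError("both score lists must be non-empty")
    return mean(baseline_scores) - mean(routed_scores) <= max_drop


baseline = [0.9, 0.8, 1.0]
candidate = [0.9, 0.85, 0.95]   # same average quality at lower cost
degraded = [0.7, 0.8, 0.9]      # cheaper, but quality fell

print(passes_quality_guardrail(baseline, candidate))
print(passes_quality_guardrail(baseline, degraded))
```

A production check would likely compare per prompt class and use a statistical test rather than a raw difference of means, but the contract is the same: a routing change ships only if measured quality does not fall beyond an agreed tolerance.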
NVIDIA LLM Router v2: Stops at ACT (experimental recommendation service)
NVIDIA's LLM Router v2 is an experimental blueprint for model selection.[5] Its public README describes two routing strategies: intent-based routing using a small language model, or auto-routing using CLIP embeddings plus a trained neural network. The router analyzes text or multimodal prompts and returns the recommended model name. Unlike v1, v2 does not proxy requests to downstream LLMs; the application is responsible for calling the recommended model. Custom neural-network routing can be retrained through the provided notebooks, but the public docs do not describe autonomous production-outcome learning or automatic routing recalibration.
The fundamental constraint is operational maturity. In the cited cost-saving configuration, Divyam.AI and NVIDIA were in a similar savings range, about 84% and 82% respectively. The difference was quality: Divyam.AI's accuracy drop was about 0.2 percentage points, while NVIDIA's was 18.1 percentage points.[4] The experimental label matters: this is a developer blueprint and recommendation service, not a closed-loop production routing system.
The structural problem is that NVIDIA v2 offloads the hardest part of custom routing to the customer. Building a custom neural-network router requires assembling a representative training set, training the model, validating it, and deploying it. None of that is automated. NVIDIA provides a framework and notebooks; the engineering burden stays with the team. In practice, this means most teams using NVIDIA's router are running pre-trained, general-purpose routing logic that was never calibrated to their specific workload. Divyam.AI does not hand customers a framework. It handles training data assembly, training, validation, and deployment as a managed capability. Critically, it also selects training sets to cover the full distribution of prompt variations, which is what separates a well-calibrated router from a coarse one.
Martian (WithMartian.com): Gateway and research tools
Martian describes itself as "an AI research lab focused on understanding machine intelligence" that makes selected internal tools externally available.[6] Its current public product surface centers on three offerings: Gateway, ARES, and K-Steering.
The Gateway provides "unified access to 200+ AI models through a single API" and tracks "real-time usage, model performance, and request history."[15] ARES is an "RL-first framework for training and evaluating LLM agents" that treats LLM requests as observations and responses as actions.[16] K-Steering is a toolkit for steering model behavior at inference time by "modifying internal activations at specific layers without fine-tuning the base model."[17]
In the ACT → OBSERVE → LEARN → ADAPT framework, Martian's public Gateway covers model access and usage visibility. ARES and K-Steering support agent training, evaluation, and inference-time behavior control. These are technically sophisticated capabilities, especially for agent research and controllable generation. The current public docs, however, do not describe a production routing system that evaluates live outcomes, learns which model should handle each customer prompt class, and autonomously recalibrates routing.
Not Diamond: Intelligent routing + prompt optimization, stops before autonomous ADAPT
Not Diamond offers intelligent model routing and prompt optimization.[7] Its docs say the router selects the best LLM from candidate models for each query, and that teams can train custom routers on their own evaluation data. The prompt optimization product uses an agentic loop guided by reinforcement learning, can work with as few as three data samples, and is designed to generate model-specific optimized prompts.[8] Not Diamond also claims 30 to 90% cost savings and 10 to 100 ms routing latency.[8]
The key distinction is the operating loop. Not Diamond's public docs describe trained routers, custom router training jobs, and prompt optimization. They do not describe a closed-loop production system that continuously evaluates live routing outcomes, detects model or prompt drift from production traces, and autonomously recalibrates routing without an explicit retraining or configuration step.
The harder problem is who owns quality. Not Diamond's retraining capability is real, but the responsibility for training data curation, training execution, and quality validation is explicitly left to the customer. This is a significant burden in practice. A training set that is too small produces a routing model that is overfit to narrow patterns, one that makes confident decisions on edge cases it has never seen. A training set that is not representative misses entire prompt categories. Getting this right requires domain expertise, iteration, and ongoing maintenance as models and workloads evolve. Divyam.AI takes full ownership of this loop: training data is assembled automatically, coverage across prompt variation classes is verified before training runs, and the resulting router is deployed without a manual retraining cycle. The customer gets a router that improves; the engineering work stays off their plate.
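The coverage check mentioned above can be illustrated with a small Python sketch. It is a simplified stand-in, not the method any vendor uses: it assumes every prompt has already been assigned a class label, and the function name `coverage_gaps`, the class labels, and the `min_examples` threshold are hypothetical.

```python
from collections import Counter


def coverage_gaps(training_classes, production_classes, min_examples=20):
    """Return production prompt classes with fewer than `min_examples` training rows.

    training_classes:   one class label per training example.
    production_classes: class labels observed in production traffic.
    """
    train_counts = Counter(training_classes)
    return sorted(c for c in set(production_classes)
                  if train_counts[c] < min_examples)


training = ["billing"] * 30 + ["refund"] * 5
production = ["billing", "refund", "itinerary", "billing"]

# "refund" is under-sampled and "itinerary" is missing entirely.
print(coverage_gaps(training, production))
```

Running a check like this before each training job flags both failure modes named in the paragraph: classes with too few examples and classes that are absent from the training set altogether.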
Portkey: Gateway with config-based controls, stops at OBSERVE
Portkey is best understood as an AI gateway, not a quality-predictive per-prompt router. Its own documentation is explicit: the platform provides "automatic fallbacks and load balancing at the gateway layer" and ensures "provider outages get resolved before they reach your agents."[9] Portkey also documents conditional routing, fallbacks, load balancing, canary tests, and guardrails. These are valuable production controls. They are not the same as a router that predicts which model will produce the highest-quality answer for each prompt class and learns from production outcome quality.
Where Portkey excels is observability and governance: 40+ metrics, full agent traces including MCP calls, real-time anomaly detection, RBAC, budget controls, PII/PHI guardrails, and SSO.[10] It makes the operational management of many concurrent agents genuinely tractable. But observability without automated response is a dashboard. When Portkey surfaces a quality regression, a human engineer must still diagnose, decide, and manually update the routing configuration. The observation does not close back into action.
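The distinction between gateway controls and learned routing is easier to see in code. The sketch below is a generic fallback chain written for illustration; it is not Portkey's configuration format or API, and the function and provider names are hypothetical.

```python
def call_with_fallback(prompt, providers):
    """Try each (name, callable) pair in order and return the first success.

    The order is fixed configuration: nothing in this function looks at
    output quality or changes the order based on past results.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real gateway would catch narrower errors
            errors.append((name, repr(exc)))
    raise RuntimeError(f"all providers failed: {errors}")


def flaky(prompt):
    # Stand-in for a provider outage.
    raise TimeoutError("primary provider unavailable")


def stable(prompt):
    # Stand-in for a healthy provider.
    return f"echo: {prompt}"


name, reply = call_with_fallback("hello", [("primary", flaky), ("secondary", stable)])
print(name, reply)
```

A chain like this protects availability, which is valuable, but the decision of which provider comes first still lives in configuration that a person maintains.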
Capability Comparison
Microsoft is classified as a router because it selects models with a trained router and configurable modes, but its public docs do not describe customer-specific production-outcome learning. Martian's ARES and K-Steering are RL / steering tools, but the current public docs do not document production routing learning or autonomous routing recalibration. Portkey's routing is gateway configuration (fallbacks, load balancing, conditional rules), not quality-predictive per-prompt model selection. Sources: [3][4][5][6][7][9][15][16][17]
Why the Loop Compounds
An open-loop system performs at a fixed level. The cost and quality outcomes it produces in month one are roughly the same as in month twelve, absent manual intervention. A closed-loop system improves automatically. Each cycle produces better data, which produces better routing decisions, which produce better outcomes, which produce better data.
This is not a theoretical claim. It is observable in production deployments.
| Metric | First Cycle | Annual (Compounded) |
|---|---|---|
| Cost savings | 50% | 75% |
| Quality gain | 5% | 20% |
The mechanism is not complicated. First, the system builds a domain-specific quality definition from expert judgment and production context. Then it benchmarks candidate models against the actual workload, not a generic benchmark. As new requests, annotations, and models enter the system, the evaluation layer and router keep the cost-quality frontier current. The result is not a one-time model migration; it is a repeatable optimization process that reduces the engineering work required to keep production AI current.
The customer-specific nature of the quality model is the data flywheel. A generic routing model, trained on public benchmarks, cannot know that a travel-planning agent, a healthcare support bot, and a product-equality engine each define quality differently. Divyam.AI calibrates optimization to the customer's own task, data, constraints, and quality targets. That is why the router can compare frontier, smaller proprietary, and open-weight models on the workload that actually matters, then choose the best operating point for each request.
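One way to picture "the best operating point" is as a choice along a cost-quality Pareto frontier. The following Python sketch computes that frontier for a handful of made-up models. The model names, costs, and quality scores are invented for illustration, and the paper does not say that Divyam.AI uses this exact procedure.

```python
def pareto_frontier(models):
    """Return the models that no cheaper-or-equal model matches or beats on quality.

    models: list of (name, cost, quality) tuples.
    The result is ordered from cheapest to most expensive.
    """
    frontier = []
    # Sort by cost ascending; for equal cost, put the higher quality first.
    for name, cost, quality in sorted(models, key=lambda m: (m[1], -m[2])):
        # Keep a model only if it improves on the best quality seen so far.
        if not frontier or quality > frontier[-1][2]:
            frontier.append((name, cost, quality))
    return frontier


candidates = [
    ("frontier-xl", 10.0, 0.95),
    ("mid", 3.0, 0.90),
    ("small", 1.0, 0.80),
    ("legacy", 4.0, 0.85),   # costs more than "mid" but scores lower
]

for name, cost, quality in pareto_frontier(candidates):
    print(f"{name}: cost={cost}, quality={quality}")
```

Here `legacy` drops out because `mid` is both cheaper and better. The frontier itself depends on whose quality scores are used, which is why a workload-specific evaluation can produce a different frontier than a public benchmark.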
Why This Is Hard to Replicate
Durability in AI infrastructure does not come from having a router or an evaluation tool in isolation. It comes from one operating loop that can execute autonomously and repeatedly: experimenting with newer models, keeping experiments safely isolated from production, enforcing stringent production safeguards, providing resilient fallbacks for unforeseen errors, and monitoring the overall system for drift without adding manual engineering work.
The depth is in the closed loop. Divyam.AI benchmarks candidate models against the customer's own quality definition, then uses production traces to keep the routing frontier current as traffic, prompts, models, and prices change. In that loop, EvalMate's frugal selection module cuts experimentation cost by 20× while maintaining full quality guarantees.[11] That matters because broad experimentation across many modules and candidate models becomes economically practical, not just theoretically possible.
The system is designed to catch change early. Every routed request generates a structured trace: model chosen, latency, cost, and response. Those traces flow back into evaluation, where quality regressions can surface within hours and request-pattern drift triggers coverage-gap analysis. The goal is to detect customer, prompt, or model drift before it shows up as user-reported degradation.
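A structured trace and a simple regression check might look like the sketch below. The four fields named in the text (model, latency, cost, response) are taken from the paragraph above; the `quality` field, the class name `RouteTrace`, the function `regressed_models`, and the tolerance are assumptions added for illustration, with `quality` standing in for whatever score the evaluation layer attaches later.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RouteTrace:
    model: str          # model chosen by the router
    latency_ms: float
    cost_usd: float
    response: str
    quality: float      # hypothetical evaluator score in [0, 1]


def regressed_models(traces, baseline, tolerance=0.05):
    """Return models whose mean traced quality fell more than `tolerance`
    below their baseline quality. Models without a baseline are skipped."""
    by_model = {}
    for t in traces:
        by_model.setdefault(t.model, []).append(t.quality)
    return sorted(m for m, scores in by_model.items()
                  if m in baseline and baseline[m] - mean(scores) > tolerance)


recent = [
    RouteTrace("model-a", 120.0, 0.002, "...", 0.70),
    RouteTrace("model-a", 110.0, 0.002, "...", 0.72),
    RouteTrace("model-b", 300.0, 0.010, "...", 0.90),
]
baseline_quality = {"model-a": 0.90, "model-b": 0.90}

print(regressed_models(recent, baseline_quality))
```

Grouping the same traces by prompt class instead of by model would give a first approximation of the request-pattern drift check, since a class that appears in traces but not in the evaluation set is a coverage gap.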
Replication requires more than matching a feature list. Building a production router is one discipline. Building a production evaluation engine, one that achieves 92% agreement with human judgment at 100× lower cost than LLM-as-judge evaluation, is another.[11] The hard part is integrating both so that quality signals govern routing decisions automatically, while drift detection keeps the evaluation rubric aligned with what users are actually sending.
The calibrated loop becomes difficult to replace. Divyam.AI does not just route requests or score outputs. It learns the cost-quality curve for each customer's agents, constraints, and production traffic, then uses that intelligence to keep improving automatically while preserving quality. Replacing that loop with a generic router or standalone eval platform is not a like-for-like swap. Costs can rise, quality can fall, and the agent-specific reward model has to be rebuilt from scratch. Switching providers means forfeiting the intelligence calibrated to your data and restarting the flywheel.
Evidence from Production
| Organization | Use Case | Cost Reduction | Quality Outcome |
|---|---|---|---|
| MakeMyTrip[12] | Myra AI travel assistant | 63% | Zero quality loss |
| PharmEasy[13] | Easybot customer support agent | 30% | 95% chat closure rate improvement |
| Flash.co[14] | Product equality engine | 30% | 15% quality uplift |
On MMLU-Pro, the Microsoft comparison and NVIDIA comparison measure different tradeoffs. Against Microsoft, Divyam.AI achieved about 60% savings versus Microsoft's 35% when limited to the same model set, and 84% savings after adding Gemini models, both at comparable accuracy. Against NVIDIA, the cited comparison is about accuracy preservation at a similar cost-savings range: Divyam.AI had about a 0.2 percentage-point accuracy drop, while NVIDIA had an 18.1 percentage-point drop.[4]
The MakeMyTrip result is instructive for a specific reason. Myra is a complex multi-agent travel assistant, and optimizing it was not a simple "switch from Model A to Model B" exercise. The Query Planner alone involved six modules and roughly ten candidate models per module, including frontier, smaller, and open-source/open-weight options. That created about one million possible combinations to evaluate manually. Divyam.AI replaced that brute-force cycle with algorithmic search, intelligent benchmark sampling, and fine-grained prompt routing. The result was a 63% cost reduction with zero quality loss, deployed in MMT's cloud environment with a single-line integration and full auditability.[12]
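The "about one million" figure follows directly from the numbers in the paragraph, as this short Python check shows. It assumes exactly six modules and exactly ten candidates per module, with each module's model chosen independently; the text gives the candidate count only as "roughly ten".

```python
MODULES = 6
CANDIDATES_PER_MODULE = 10  # the paper says "roughly ten"; exact value assumed

# One model is chosen independently for each module, so the counts multiply.
combinations = CANDIDATES_PER_MODULE ** MODULES

print(f"{combinations:,} possible module-to-model assignments")
```

Because the count grows exponentially with the number of modules, adding a seventh module under the same assumptions would multiply the search space by ten again, which is why exhaustive manual evaluation does not scale.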
The question for any AI infrastructure investment is not "does it work today?" It is "does it keep finding the best cost-quality operating point as models, prices, prompts, and user behavior change?" Open-loop systems require repeated engineering cycles to stay competitive. Divyam.AI turns that work into an infrastructure loop: define quality, benchmark the live workload, route at request level, monitor outcomes, and keep recalibrating as better options appear.
Key Takeaways
- Portkey is a gateway, not a quality-predictive per-prompt router. Its own documentation describes "automatic fallbacks and load balancing at the gateway layer."[9] It also supports conditional routing, fallbacks, load balancing, and guardrails, but the public docs do not describe routing that learns from production outcome quality.
- Martian should be reassessed as a research-tool and gateway company, not simply an intelligent-router company. Its current public docs emphasize Gateway access, ARES for agent RL, and K-Steering for inference-time activation steering.[15][16][17] Those tools are advanced, but they do not document autonomous routing recalibration from live production outcomes.
- Microsoft and NVIDIA offer useful starting points, but the benchmark tradeoffs are specific. Microsoft achieved 35% savings at comparable accuracy, while Divyam.AI achieved about 60% on the same model set and 84% after adding Gemini models. NVIDIA reached a similar cost-savings range, but with a much larger accuracy drop: 18.1 percentage points versus Divyam.AI's 0.2 points in that configuration.[4] Microsoft is a pre-trained Foundry router; NVIDIA v2 is an experimental recommendation blueprint.
- The compounding effect is real and measurable. This paper frames typical first-cycle gains at roughly 50% cost reduction and 5% quality improvement, compounding toward roughly 75% annual cost reduction and 20% annual quality improvement as the loop keeps running.
- The moat is structural, not temporary. It requires two distinct engineering disciplines, tight integration between them, and customer-specific production data that begins accumulating on day one and makes switching increasingly costly over time.
Sources
1. CIO, citing IDC/Lenovo research. 88% of AI pilots fail to reach production, but that's not all on IT. Direct quote: "88% of observed POCs don't make the cut to widescale deployment." cio.com
2. McKinsey & Company. The State of AI 2025. Direct quote: "nearly two-thirds of respondents say their organizations have not yet begun scaling AI across the enterprise." mckinsey.com
3. Microsoft. Model Router for Microsoft Foundry. Microsoft Learn, 2026. Direct quotes: "trained language model"; "all packaged as a single model deployment"; "Supported options: Quality, Cost, Balanced"; "This set is fixed." learn.microsoft.com
4. Divyam.AI Research. LLM Router Comparison: Divyam.AI vs Microsoft Model Router vs NVIDIA LLM Router. MMLU-Pro benchmark, December 2025. divyam.ai/blog/divyam-router-vs-microsoft-nvidia
5. NVIDIA. LLM Router v2: AI Blueprints. GitHub, 2025 (experimental). Direct quotes: "LLM Router v2 (Experimental)"; "Classification only (returns model name)"; "does not proxy requests." github.com/NVIDIA-AI-Blueprints/llm-router
6. Martian. Martian Documentation. 2026. Direct quote: "an AI research lab focused on understanding machine intelligence." docs.withmartian.com
7. Not Diamond. What is Not Diamond? Not Diamond Documentation, 2025. Direct quotes: "intelligent AI model router and prompt optimization"; "train custom routers using your data." docs.notdiamond.ai
8. Not Diamond. Pricing & FAQ. notdiamond.ai, 2025. Direct quotes: "30-90% cost savings"; "10-100ms"; "as few as three data samples." notdiamond.ai/pricing
9. Portkey. Agent Gateway. Portkey Blog, 2026. Direct quote: "Automatic fallbacks and load balancing at the gateway layer. Provider outages get resolved before they reach your agents." portkey.ai/blog/agent-gateway
10. Portkey. Your First AI Agent Will Go Fine. Your Fiftieth Is Where Things Get Interesting. Portkey Blog, 2026. Direct quotes: "Full traces across every run, including every MCP call"; "50+ guardrails out of the box." portkey.ai/blog
11. Divyam.AI. The Divyam.AI Platform at a Glance. Divyam.AI Blog, April 2026. EvalMate rewards model: ~8B parameters, 92% agreement with human judgment, 100× cheaper than LLM-as-judge evaluation; frugal selection cuts experimentation cost by 20× while maintaining quality guarantees; production traces surface regressions and drift. divyam.ai/blog/divyam-platform-at-a-glance
12. Divyam.AI. MakeMyTrip Case Study: 63% LLM Cost Savings, Zero Quality Loss. divyam.ai/customers/makemytrip
13. Divyam.AI. PharmEasy Case Study: 95% Chat Closure Rate Improvement, 30% Cost Savings. divyam.ai/customers/pharmeasy
14. Divyam.AI. Flash.co Case Study: 15% Quality Uplift, 30% Cost Savings. divyam.ai/customers/flash
15. Martian. Gateway. Martian Documentation, 2026. Direct quotes: "unified access to 200+ AI models through a single API"; "real-time usage, model performance, and request history." docs.withmartian.com/gateway
16. Martian. ARES. Martian Documentation, 2026. Direct quote: "RL-first framework for training and evaluating LLM agents." docs.withmartian.com/ares
17. Martian. K-Steering Core Concepts. Martian Documentation, 2026. Direct quote: "modifying internal activations at specific layers without fine-tuning the base model." docs.withmartian.com/k-steering/core-concepts
18. Sam Altman. Three Observations. Direct quote: "The cost to use a given level of AI falls about 10x every 12 months." blog.samaltman.com
For a deeper look at the platform architecture that makes this possible: The Divyam.AI Platform at a Glance →
For the full MMLU-Pro benchmark results: LLM Router Comparison: Divyam.AI vs. Microsoft vs. NVIDIA →
For the cost math behind the compounding returns: The Model Inertia Problem →