Engineering

What Does It Take to Build a World-Class Agentic Application?

Why serious LLM products need more than prompts, gateways, and one-off evaluations

Sandeep Kohli, CEO & Co-Founder, Divyam.AI · 18 min read
Executive Summary

Building the first version of an LLM application is deceptively easy. Getting it to production, and keeping it there, is not. Most agentic applications stall not because of bad ideas, but because teams lack the infrastructure to improve quality continuously, manage costs intelligently, and absorb a fast-moving model landscape without losing control.

This paper argues that the central challenge in production AI is not generation. It is control. Teams that solve for control, through a rigorous quality framework and an intelligent optimization layer, are the ones that build products that keep getting better. Those that skip it are building on shifting ground.

Divyam.AI is built for this problem. We help teams define what quality means for their specific application, measure it continuously at the component level, and optimize every inference decision based on real production data. The result is an agentic system that compounds in value over time, rather than one that requires constant manual intervention to keep from drifting.

The uncomfortable truth about agentic software

Traditional software had one significant advantage: the infrastructure underneath it was relatively stable. You chose your database, message queue, cache, and monitoring stack. Those components evolved, but not at a pace that forced teams to revisit core cost, performance, or architecture decisions every few weeks. Once the foundational infrastructure was selected, application teams could mostly build on top of it.

Agentic software is fundamentally different. The application itself is evolving rapidly as workflows, prompts, and user journeys take shape. But that is only half the problem. The heaviest piece of infrastructure in the stack is the LLMs themselves, and they are also evolving, often faster than the application. New models arrive constantly. Existing models get cheaper. Capability jumps happen every month. Latency characteristics shift. The best model for a task today may not be the best model a month from now. [2]

In traditional software, the infrastructure is mostly stable while the application changes. In agentic software, both are moving at the same time. That is what makes writing production-grade agentic software fundamentally harder than anything most engineering teams have built before.

The aha moment is not the hard part

The first version of an LLM application is often easy to build. A small team can assemble prompts, connect a model, add a tool call or two, and produce something that feels genuinely capable. A customer support assistant answers questions. A copilot drafts content. An agent executes a workflow. That is the aha moment.

The problem begins when that first success is mistaken for a long-term operating model. Once a product gains traction, the questions change. It is no longer "Can we make an LLM do this?" It becomes: Can we improve it continuously without losing control? Can we keep quality high as usage scales? Can we detect regressions before customers do? Can we benefit from better models without destabilizing production? Can we lower cost without hurting the experience?

Prototype: "Can we make an LLM do this?"
  • Results look good enough
  • One model, static prompts
  • Quality checked by hand
  • Cost is a future problem

Production: "Can we keep improving without losing control?"
  • Quality must be measurable and consistent
  • Models change — system must adapt
  • Regressions caught before customers do
  • Cost optimized per request, continuously

88% of AI pilots never make it to production, according to IDC research. [1] Roughly 65% of organizations are stuck in pilot mode, unable to scale across the enterprise. [7]

This is the moment many CTOs realize that launching the first version was not the hard part. Building a system that can move forward safely and repeatedly is the hard part. That is the difference between an interesting demo and a world-class product. Production is where the real standard shows up.

The real problem is not generation. It is control.

A surprising number of LLM applications in production still operate on a fragile assumption: the answers looked good enough when we tried them. That may be sufficient for a pilot. It is not sufficient for a product that is supposed to scale, become central to the user experience, and represent the brand.

In production, you need a solid grip on quality. In agentic systems, that is harder than it sounds because quality is not a single number measured at the final answer. It is multi-dimensional and distributed across the system: correctness, completeness, retrieval quality, reasoning quality, tool use, policy compliance, tone, latency-aware behavior, and task completion can all matter at once, depending on the use case.

High-level signals like NPS or surveys are useful, but they are too delayed, coarse, and aggregated to manage production quality on their own. They tell you something feels worse. They do not tell you which class of request is regressing, which step in the workflow is weakening, or which agent or tool is introducing the error.

In agentic applications, you need visibility into the quality of the parts, not just the quality of the whole.

Figure: Quality is distributed — not just at the final output. A request flows from user input through retrieval (Agent 1), a tool call, reasoning (Agent 2), and response generation (Agent 3) to the final output, with a quality check after each stage. Each stage has its own quality dimension: retrieval accuracy, tool precision, reasoning quality, policy compliance, tone. Measuring only the final output tells you something went wrong — not where or why.

Measuring only the final output feels clean and convenient, but it rarely gives builders enough to act on. When quality slips, what matters is knowing where it slipped, which step failed, which component regressed, and what needs to be fixed.
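To make that concrete, here is a minimal sketch of component-level quality tracking. The stage names, scores, and the 0.8 threshold are illustrative assumptions for this article, not Divyam.AI's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class StageResult:
    stage: str      # e.g. "retrieval", "tool_call", "reasoning", "response"
    score: float    # 0.0-1.0 quality score from that stage's evaluator
    passed: bool

@dataclass
class TraceReport:
    stages: list = field(default_factory=list)

    def record(self, stage: str, score: float, threshold: float = 0.8):
        # Each stage gets its own score against its own quality bar.
        self.stages.append(StageResult(stage, score, score >= threshold))

    def first_failure(self):
        # Pinpoint *where* quality slipped, not just that it slipped.
        return next((s for s in self.stages if not s.passed), None)

# Illustrative trace: retrieval and tool use are fine, reasoning regressed.
report = TraceReport()
report.record("retrieval", 0.92)
report.record("tool_call", 0.88)
report.record("reasoning", 0.61)   # below the 0.8 bar
report.record("response", 0.85)

weak = report.first_failure()
print(weak.stage)  # -> reasoning
```

The end-to-end answer here might still look passable, yet the trace shows exactly which component needs attention.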

Before optimization, there must be a definition of quality

At Divyam.AI, we start from a simple principle: you cannot optimize what you have not defined. Before we help a customer optimize routing, model selection, or cost, we first help them define what quality means for their application.

That sounds obvious, but it is one of the hardest parts of building serious LLM systems. Quality is not generic. It is specific to the business, the domain, the user, the brand, and the risk profile. A legal assistant, a travel planner, a financial advisor, and a healthcare workflow agent all use LLMs, but they do not share the same definition of a good answer. Desired tone, tolerance for ambiguity, accuracy requirements, escalation behavior, compliance expectations, and style constraints will all differ. This is not noise. This is the product. This is what makes an application distinct.

That is why our quality layer is designed to help customers quantify quality in a way that is useful for production, not just for presentations. It does this at the agent level, not just at the end-product level. That distinction matters. If you only know the final answer is weaker, you are still guessing. If you know which stage or which agent has drifted, you can fix the system.

Turning vague quality into something measurable

Our quality layer surfaces the exact places where human judgment is still needed and routes those cases to domain experts. This keeps the product honest. Instead of assuming the system already knows what good looks like everywhere, it identifies where clarity is still missing. It also allows the customer's unique standards to enter the system. The style, judgment, brand values, and domain nuances that matter to the business do not stay trapped in people's heads. They get translated into a usable quality framework.

Behind the scenes, the system versions not only golden datasets but also evaluation runs and workflows. This makes quality work reproducible and comparable across prompt changes, model swaps, pipeline revisions, and agent reconfiguration. Teams can rerun the same workflow in a controlled way, see what changed, and understand what improved, what regressed, and which decisions moved the system forward.
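As a rough illustration of what versioned, comparable evaluation runs enable, the sketch below diffs two runs over the same golden dataset version. The case ids, scores, and tolerance are invented for the example:

```python
# Compare two versioned evaluation runs over the same golden dataset,
# so a prompt change or model swap can be judged per test case.
def diff_runs(baseline: dict, candidate: dict, tol: float = 0.02):
    """Both runs map test-case id -> score on the same golden dataset version."""
    improved, regressed = [], []
    for case_id, base_score in baseline.items():
        delta = candidate.get(case_id, 0.0) - base_score
        if delta > tol:
            improved.append(case_id)
        elif delta < -tol:
            regressed.append(case_id)
    return improved, regressed

run_v1 = {"case-01": 0.90, "case-02": 0.70, "case-03": 0.85}  # before the change
run_v2 = {"case-01": 0.91, "case-02": 0.82, "case-03": 0.60}  # after the change

improved, regressed = diff_runs(run_v1, run_v2)
print(improved, regressed)  # -> ['case-02'] ['case-03']
```

The aggregate score barely moved between these runs, but the per-case diff shows one genuine win and one real regression, which is the information a team actually needs before shipping.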

The next challenge is cost. Serious LLM evaluation is expensive. Judging a single response can require multiple model calls, rubric checks, or human review.5 That is where many teams lose discipline and fall back to spot checks and intuition. Divyam.AI addresses this by distilling human judgment into a reward model that approximates expert evaluation at a fraction of the cost. Quality checks can run continuously, at scale, without becoming financially prohibitive. In practice, teams get something they rarely have today: human-like quality assessment they can afford to run every day in production.
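The reward-model idea can be sketched in miniature. Below, a toy similarity-weighted lookup over expert-labeled embeddings stands in for a distilled reward model; the real system is far more sophisticated, and the embeddings and scores here are invented:

```python
import math

# Toy stand-in for a distilled reward model: expert-labeled examples are
# embedded once, and new responses are scored by similarity-weighted lookup
# instead of an expensive multi-call LLM judge on every request.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def cheap_reward(embedding, labeled):
    """labeled: list of (embedding, expert_score) pairs."""
    weights = [(cosine(embedding, e), s) for e, s in labeled]
    total = sum(w for w, _ in weights)
    return sum(w * s for w, s in weights) / total

# Two expert judgments: one strong response, one weak one.
expert_labels = [([1.0, 0.1], 0.95), ([0.1, 1.0], 0.30)]
print(round(cheap_reward([0.9, 0.2], expert_labels), 2))  # -> 0.79
```

The point of the sketch is the cost structure, not the method: once expert judgment is captured in a cheap scorer, every production response can be assessed without paying for a full judge pipeline each time.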

This is a core belief at Divyam.AI. If quality infrastructure is too expensive to run continuously, most teams will not run it continuously. When that happens, they are operating with only partial visibility into how their system is actually performing.

Quality cannot stay static if the product is evolving

No successful product stays still. Use cases expand, prompts change, workflows become richer, and customers begin using the product in ways the team did not originally anticipate. Their behavior drifts. Their expectations rise. If the quality framework does not evolve with the application, even a well-built evaluation setup will slowly become stale.

Divyam.AI is designed for that reality. The system continuously identifies gaps in coverage, highlights emerging behaviors that are not yet captured, surfaces new regions of the problem space where human input is required, and detects when customer behavior has drifted. It also performs ongoing regression checks so teams are not shipping based on anecdotes or isolated wins.
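One of these checks, coverage-gap detection, can be shown in simplified form: flag production requests whose category is not yet represented in the evaluation set. The topic labels below are hypothetical:

```python
# Flag production requests that fall outside current eval coverage,
# so new golden cases can be added before quality silently drifts.
def coverage_gaps(requests, covered_topics):
    return [r for r in requests if r["topic"] not in covered_topics]

covered = {"billing", "shipping", "returns"}
traffic = [
    {"id": 1, "topic": "billing"},
    {"id": 2, "topic": "warranty"},   # emerging behavior, not yet covered
]
print(coverage_gaps(traffic, covered))  # -> [{'id': 2, 'topic': 'warranty'}]
```

In practice the categorization would itself be learned rather than a simple label match, but the loop is the same: detect the gap, route it to a human, extend the golden dataset.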

As the product moves from V1 to V2 to V3, the quality system keeps pace. This is one of the biggest differences between a serious agentic platform and a stitched-together LLM application: the ability to keep evolving without losing control.

Want to see how Divyam.AI defines and measures quality for production AI?

Book a Demo

Quality alone is not enough. Infrastructure must evolve intelligently too.

Suppose you now have a strong handle on quality. You know what good means for the product. You can assess it continuously. You can detect regressions before they reach customers. That still leaves a second major problem: which model should handle which request?

Many teams think they have solved this by adding an LLM gateway or router. But most gateways are interoperability layers. They help connect to multiple providers, normalize APIs, and make switching easier. They do not answer the harder question. They do not decide, in a fine-grained way, which model is right for a particular request under a particular cost and quality target.

So teams fall back to rules: use this model for that use case, route this region to that provider, switch when latency gets too high, fall back under certain conditions. Rules may work temporarily, but they are too brittle, too coarse, and too difficult to maintain in an environment where models keep changing and customer behavior keeps shifting.
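A caricature of rule-based routing makes the maintenance burden visible. Every model name and condition below is a placeholder:

```python
# Typical hand-maintained routing rules, checked in order. Every new model,
# region, or price change means another manual edit to this list, and the
# rules say nothing about per-request quality or cost trade-offs.
RULES = [
    (lambda req: req["task"] == "summarize",      "small-model"),
    (lambda req: req["region"] == "eu",           "eu-hosted-model"),
    (lambda req: req["latency_budget_ms"] < 500,  "fast-model"),
]
DEFAULT = "flagship-model"

def route(req):
    for predicate, model in RULES:
        if predicate(req):
            return model
    return DEFAULT

print(route({"task": "summarize", "region": "us", "latency_budget_ms": 2000}))
# -> small-model
```

Each rule encodes a snapshot of yesterday's judgment. Nothing in this structure learns from production outcomes or adapts when a cheaper model becomes good enough for a given request class.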

This is why Divyam.AI is not a gateway company and not a basic router company. We build an autonomous optimization layer. A gateway helps you talk to models. A rules engine helps you encode manual choices. What serious production systems need is something else: a system that learns, evaluates, and improves its decisions continuously.

Divyam.AI continuously learns what works for your application

Figure: Divyam.AI platform architecture, in four quadrants. (1) EvalMate tracks customer drift and use-case evolution, stores and versions golden datasets, and works with domain experts to build a Reward Model (RM). (2) The LLM Experimentation Infrastructure evaluates candidate LLMs using the RM against logs of live production traffic and creates a deployment-specific leaderboard. (3) The production LLM application serves live requests and responses. (4) The trained selector sits in the request path and, for each request, selects the lowest-cost model among the candidates that meets the quality bar. Each quadrant feeds the next — EvalMate defines quality, Experimentation benchmarks models, Production generates traffic, and the Selector optimizes every request.

Once quality is measurable, we can do something more powerful than static routing. We run structured experiments on the latest models a customer chooses to evaluate. We assess those models not against generic public benchmarks, but against the customer's actual quality criteria and production behaviors. This creates a deployment-specific leaderboard.

That matters because the right model is not universal. It depends on the application, the customer's quality standard, the cost profile, the latency tolerance, and the real distribution of requests. We also ingest the customer's actual model rate card. This means Divyam.AI knows not just what each model can do, but what each model costs in the customer's reality.

From there, an important truth becomes visible. For many requests, several models may provide the right answer, but they do not all cost the same. As of March 2026, output token prices across production-grade models range from $0.28 to $30 per million tokens — a spread of over 100x for tasks where quality scores are comparable. [3][4] So the correct decision is not always "choose the most capable model," and not always "choose the cheapest." It is: for this request, choose the lowest-cost model that still clears the required quality bar. That requires more than a static rule. It requires fine-grained, learned intelligence.
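That decision rule can be sketched directly. The leaderboard scores, rate-card prices, model names, and fallback choice below are illustrative, not real benchmarks:

```python
# Choose the cheapest model that still clears the quality bar, using a
# deployment-specific leaderboard and the customer's own rate card.
LEADERBOARD = {        # quality score per model for this request class
    "flagship": 0.97,
    "mid-tier": 0.91,
    "budget":   0.74,
}
RATE_CARD = {          # dollars per million output tokens
    "flagship": 30.00,
    "mid-tier":  3.00,
    "budget":    0.28,
}

def select_model(quality_bar: float) -> str:
    eligible = [m for m, q in LEADERBOARD.items() if q >= quality_bar]
    if not eligible:
        return "flagship"  # assumed fallback: most capable model
    return min(eligible, key=RATE_CARD.get)

print(select_model(0.90))  # -> mid-tier (clears the bar at a tenth of the cost)
print(select_model(0.95))  # -> flagship (only it clears the stricter bar)
```

The interesting part in a real system is not this final comparison, which is trivial, but keeping the leaderboard honest per request class and per deployment as models, prices, and traffic all shift.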

Divyam.AI uses the results of those experiments to train a selector that can make these decisions in milliseconds, directly in the request path. That selector is what turns experimentation into production advantage. Without customer-specific quality, the selector is just optimizing for a vague notion of good. With quality defined properly, it becomes a production system that optimizes for what the customer actually values.

The result is compounding improvement, not one-time optimization

Figure: The compounding loop — each cycle makes the system smarter. Production traffic (real requests, real behavior) feeds experiments (new models tested on actual data); experiments feed a leaderboard (ranked by quality, cost, and latency); the leaderboard drives selector updates (routing decisions retrained); the updated selector delivers better routing (lower cost, higher quality), which shapes the next round of production traffic.

For customers, the benefits are immediate and structural. Because quality is clearly assessed, the system can optimize without causing silent regressions. Because the selector understands both model capability and model cost, it routes each request to the most effective model. Because the leaderboard is continuously refreshed, older models get phased out when they no longer earn traffic. Because the experimentation loop is automated, new models can be evaluated quickly and brought into production fast, often within hours of a launch.

That means Divyam.AI gets stronger as the LLM ecosystem gets better. A new model launch is not a disruption. A rate-card change is not chaos. A workload shift is not a fire drill. All of these become inputs into a system that continuously adapts. This is one of the deepest advantages of the Divyam.AI approach: it converts external model progress into internal customer advantage, automatically and repeatedly.

The model ecosystem is moving quickly. That should not be something customers fear. It should be something they benefit from systematically.

What remains yours

The application experience itself, the creativity, the workflow design, the differentiation, and the moments that make users say wow, all remain yours. That is where your company's product thinking lives. You decide what should be built, what users should love, and what makes the product distinct.

Divyam.AI takes over the two hardest moving parts underneath that experience: defining and continuously measuring quality, and continuously optimizing the LLM infrastructure layer. That gives your team room to focus on product and customer experience, while the system beneath it stays disciplined, adaptive, and production-ready.

This is what makes agentic software sustainable

Without the right platform, building agentic software can feel like building on shifting ground. Prompts evolve. Workflows evolve. Users evolve. Models evolve. The cost landscape evolves. Quality remains expensive and subjective unless you invest deeply in controlling it. That is why so many teams stall after the initial excitement.

With Divyam.AI, that dynamic changes. Quality becomes defined, measured, and continuously enforced. Infrastructure becomes adaptive instead of manually tuned. New model releases become opportunities instead of distractions. Cost improvements become systematic instead of accidental. The application can keep evolving without the team losing confidence in what is shipping.

That is the word we keep coming back to: control. Not in the sense of slowing innovation down. In the sense of being able to move fast without losing your grip on quality, cost, and product behavior.

Do you need Divyam.AI?

You can build all of this yourself. The larger point is that if you are betting your company on an LLM-powered application, you need the following capabilities in place:

  1. A real quality lifecycle. A way to detect what changed, where coverage is weak, when new test cases must be added, and what needs to be re-evaluated after a feature change, prompt update, model swap, or regression.
  2. Continuous quality checks that remain affordable. A way to measure quality regularly without making evaluation so expensive or operationally heavy that the team eventually stops doing it. [5]
  3. Optimal model selection for every query. A way to choose the right LLM for each request based on the real trade-off between quality, latency, and cost, not just broad benchmark reputation. [3]
  4. Safe adoption of new models, continuously. A way to bring in better models as they appear and retire older ones without destabilizing production.

If you believe these capabilities matter, you are already describing the problem Divyam.AI is built to solve. We take care of this operational layer so your team does not have to. That leaves you free to focus on what is uniquely yours: your domain, your product, your customer experience, and your business.

The future belongs to companies that can keep getting better

The winners in the LLM era will not just be the companies that launch first. They will be the companies that can keep improving in quality, in cost, in capability, and in customer experience, while the environment around them keeps changing. McKinsey's research consistently shows that a small group of high performers is pulling away, capturing disproportionate value through systematic approaches to AI deployment, while the majority remain stuck in pilot mode. [7][6]

That is a very different challenge from prototyping. It requires more than prompts, more than a router, more than benchmark screenshots, and more than one-off evaluations. It requires a platform that can define quality, learn from production, adapt continuously, and optimize inference decisions in real time.

That is what Divyam.AI is building. If your goal is to create an LLM or agentic application that keeps getting better as your product grows and the model ecosystem evolves, Divyam.AI gives you the foundation to do exactly that.


References

  1. IDC / CIO — "88% of AI Pilots Fail to Reach Production" (2025). IDC research, in partnership with Lenovo, found that 88% of observed AI proofs-of-concept do not make the cut to wide-scale deployment. For every 33 AI POCs a company launched, only four graduated to production. cio.com
  2. Stanford HAI — AI Index Report 2025. In 2024, U.S. institutions produced 40 notable AI models. Training compute for notable AI models is doubling approximately every five months, and dataset sizes for training LLMs are doubling every eight months. hai.stanford.edu
  3. CostGoat — LLM API Pricing Comparison (March 2026, updated continuously). Live comparison of 300+ models showing output token prices ranging from $0.28/million (DeepSeek V3.2) to $30/million (GPT-5), an over-100x spread, for models with comparable quality scores. costgoat.com
  4. Artificial Analysis — Model Intelligence vs. Price (2025, updated continuously). Live benchmarking across 60+ models confirms that higher-intelligence models do not follow a consistent price-quality curve. artificialanalysis.ai
  5. arXiv — "A Survey on LLM-as-a-Judge" (2024). A comprehensive survey addressing how reliable LLM-as-a-Judge systems can be built, covering costs, biases, and the challenges of using LLMs as scalable evaluators. arxiv.org/abs/2411.15594
  6. BCG — "AI Adoption in 2024: 74% of Companies Struggle to Achieve and Scale Value" (2024). BCG's research found that 74% of companies have yet to show any tangible value from their AI investments. bcg.com
  7. McKinsey — "The State of AI: How Organizations Are Rewiring to Capture Value" (2025). Nearly two-thirds of respondents say their organizations have not yet begun scaling AI across the enterprise. A small group of high performers (around 6%) are pulling away. mckinsey.com

Ready to Scale Your AI?

See how Divyam can help your team ship AI to production with confidence.

Book a Demo