How Flash.co achieved +15% quality uplift and −30% cost savings with Divyam.AI's evaluation and routing layer
Flash.co's AI product comparison engine was scaling fast. Across two pilot engagements, Divyam.AI Evalm8 automated the eval flywheel and Divyam.AI Router delivered compounding quality and cost gains with each upgrade.
Flash.co is building an AI-first shopping experience that does the research on behalf of its customers, helping them understand products better, compare options intelligently, make purchase decisions with greater confidence, and manage the post-purchase journey. Users paste a product URL; Flash finds it across every marketplace and surfaces the best price, with no manual searching required. The platform has processed 4.76M+ products and generated 12.56M+ insights.
The core AI challenge behind this seemingly simple experience is harder than it looks: given two product URLs, are they the same product? Differences in photography, descriptions, model variants, and marketplace-specific listing conventions mean that determining product equality requires genuine reasoning, not just text matching. This is the challenge Flash needs to solve to deliver trustworthy price comparisons to its customers. Divyam.AI's role is to ensure the evaluations measuring this capability are robust, comprehensive, and continuously improving.
[Screenshot: the Flash AI+ browser extension resolving a pasted Amazon product URL across marketplaces (Amazon ₹70,460, Flipkart ₹72,690, 4 stores found).]
The Flash AI product comparison engine
A user pastes any product URL. Flash finds the identical product across every marketplace and surfaces comparative prices. The AI decision at the heart of this experience (are these two listings the same product?) requires structured reasoning across product specifications, images, and descriptions. Getting that decision right is what determines whether the price comparison is trustworthy.
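To make the decision contract concrete, here is a purely illustrative Python stand-in (the names `EqualityVerdict` and `naive_equality_check` are ours, not Flash's). It shows why naive text matching is not enough: the same product, worded differently per marketplace, shares too few title tokens to match.

```python
from dataclasses import dataclass

@dataclass
class EqualityVerdict:
    reasoning: str
    same_product: bool

def _tokens(title: str) -> set:
    # Lowercased title tokens, stripped of surrounding punctuation.
    return {t.strip(".,()").lower() for t in title.split() if t.strip(".,()")}

def naive_equality_check(title_a: str, title_b: str) -> EqualityVerdict:
    # Jaccard overlap of title tokens: a text-matching stand-in for the
    # real judgment. It works when listings are worded almost identically
    # and fails when marketplaces describe the same product differently,
    # which is exactly why the task needs genuine reasoning.
    a, b = _tokens(title_a), _tokens(title_b)
    overlap = len(a & b) / max(len(a | b), 1)
    return EqualityVerdict(reasoning=f"token overlap {overlap:.2f}",
                           same_product=overlap > 0.6)
```

On near-identical titles this returns True; on the same appliance listed with marketplace-specific wording, the overlap drops below any sensible threshold and the naive check wrongly returns False.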
The Challenge
Two challenges: comprehensive evals and continuous model optimisation
Solving product equality at scale requires getting two things right in sequence, and both are genuinely hard without the right tooling.
Challenge 1: Writing comprehensive, domain-calibrated evals. A small golden dataset of labelled product pairs is a reasonable starting point, but it doesn't generalise. Off-the-shelf LLM judges aren't calibrated to domain-specific nuances: the subtle differences between how a wardrobe is listed on Amazon versus Flipkart require the kind of domain judgment that's difficult to capture in a manually authored prompt. Without a reliable eval, there's no confident way to know whether a change to the model or the prompt is actually an improvement. Building evals that are truly comprehensive is an iterative, labour-intensive process without an intelligent tool to assist.
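Measuring whether a change is an improvement reduces to one number: agreement between the eval judge and human labels on a golden set. A minimal sketch (the `judge` callable and the triple format are assumptions of ours, not Evalm8's interface):

```python
def eval_accuracy(judge, golden):
    # `judge` is any callable (url_a, url_b) -> bool; `golden` is a list
    # of (url_a, url_b, human_label) triples. The returned fraction is
    # the kind of quantity reported against human annotation.
    hits = sum(judge(a, b) == label for a, b, label in golden)
    return hits / len(golden)
```

Any candidate eval prompt, hand-authored or auto-evolved, can be scored the same way, which is what makes before/after comparisons trustworthy.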
Challenge 2: Keeping pace with the LLM landscape. Flash was running their product equality task on gemini-2.5-flash-lite, achieving 65.34% task accuracy. Constructing and maintaining a leaderboard across the fast-moving LLM landscape is significant engineering work that competes directly with building the actual product. New models launch regularly; the cost-quality frontier shifts constantly. Without automation, staying current is an ongoing manual investment.
These two challenges compound each other. Without a reliable eval, you can't confidently benchmark alternative models. And without continuous model optimisation, accuracy and cost gains remain on the table. Addressing them requires two things working together: a flywheel that keeps evals sharp, and a routing layer that continuously finds the optimal model.
The Solution
Two flywheels: one for evaluation, one for optimisation
Divyam.AI brought both products to bear on Flash's AI optimisation goals: Divyam.AI Evalm8 to sharpen the evaluation layer, and Divyam.AI Router to optimise the model layer. The two work in sequence: a sharp eval is what makes trustworthy model optimisation possible.
Step 1: Build a reliable eval with Divyam.AI Evalm8
Evalm8 begins from a minimal starting point: a two-line seed prompt describing the task (compare two product URLs and determine whether they are the same product). Against human annotation, that seed achieved 75% accuracy.
From there, Evalm8 took over. It automatically selected a diverse, coverage-complete set of product pairs for the developer to annotate. That annotation injected domain knowledge directly into the system: the kind of nuanced judgment that distinguishes same-product from similar-product. Evalm8 then automatically evolved the prompt from 2 lines to 40 lines, improving eval accuracy to 80% against human annotation. The developer's bandwidth was spent on annotation, not on prompt engineering.
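One way to picture the annotation-selection step is greedy farthest-point sampling over some pairwise dissimilarity. This is a generic diversity heuristic of our choosing, not Evalm8's documented algorithm:

```python
def select_diverse(items, k, distance):
    # Greedy farthest-point selection: repeatedly add the item farthest
    # from everything chosen so far, so the annotation batch covers the
    # space instead of clustering around easy, similar examples.
    chosen = [items[0]]
    while len(chosen) < min(k, len(items)):
        remaining = [p for p in items if p not in chosen]
        chosen.append(max(remaining,
                          key=lambda p: min(distance(p, c) for c in chosen)))
    return chosen
```

With product pairs embedded as vectors, `distance` would be something like cosine distance between embeddings; a 1-D toy input is enough to show the spread.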
How Evalm8 auto-evolves an eval: from a 2-line seed to a 40-line production prompt
Evalm8 starting seed (2 lines):

  Compare product specifications and other details from two E-commerce shopping website URLs.
  Determine whether the two URLs describe the same product.

Accuracy: 75%

Divyam.AI Evalm8 annotation + auto-evolution

Evolved prompt (excerpt of 40 lines):

  System message:

  Your input fields are:
  1. first_url (str): first E-commerce shopping website URL
  2. second_url (str): second E-commerce shopping website URL
  Your output fields are:
  1. reasoning (str):
  2. same_product (bool): whether the two URLs describe the same product
  …structured schema, examples, output format instructions
  Response:

Accuracy: 80%
Accuracy measured against human annotation. The developer annotated diverse product pairs selected automatically by Evalm8, with no manual prompt engineering required.
Step 2: Find the optimal model with the Leaderboard
With a reliable eval in place, Divyam.AI Router automatically constructed a leaderboard, benchmarking nine models against Flash's product equality task. The incumbent model, gemini-2.5-flash-lite, achieved 65.34% accuracy. The leaderboard immediately revealed a better option: switching to gemini-2.5-flash-lite-preview-09-2025 would deliver a +12.33% relative accuracy gain at zero additional cost.
The root cause was instruction-following: Flash's prompt expected a specific JSON schema, and the preview version adhered to it more reliably. This is exactly the kind of ongoing intelligence that an automated leaderboard provides, compounding value without additional engineering investment from the team.
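The instruction-following gap can be checked directly: count how often a model's raw response parses into exactly the expected JSON schema. A hedged sketch, using an illustrative two-field schema mirroring the eval's output fields:

```python
import json

# Illustrative schema mirroring the eval's output fields.
REQUIRED = {"reasoning": str, "same_product": bool}

def follows_schema(raw: str) -> bool:
    # Strict check: the response must parse as JSON, contain exactly the
    # required keys, and match the required types. Near-misses (prose
    # wrappers, missing fields) count as failures, which is how weak
    # schema adherence shows up as a task-accuracy gap.
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return (isinstance(obj, dict) and set(obj) == set(REQUIRED)
            and all(isinstance(obj[k], t) for k, t in REQUIRED.items()))
```

Running this over a batch of responses per model gives an adherence rate that explains accuracy differences like the one between the two flash-lite versions.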
Quality Scores: Product Equality Task

gemini-3.1-flash-lite-preview                  74.25%
moonshotai/kimi-k2-instruct-0905               74.25%
meta-llama/llama-4-scout-17b-16e-instruct      73.47%
gemini-2.5-flash-lite-preview-09-2025          73.40%  (recommended)
gpt-4o-mini                                    72.18%
gpt-4.1-nano                                   67.19%
gemini-2.5-flash-lite                          65.34%  (baseline)
openai/gpt-oss-20b                             56.42%
llama-3.1-8b-instant                           32.17%
Leaderboard constructed automatically by Divyam.AI Router. Switching from the baseline to the recommended model delivers a +12.33% relative accuracy gain at zero additional cost. Scores are on Flash's product equality task.
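Mechanically, a leaderboard like the one above is the eval run against each candidate model, ranked, with relative gain computed against the incumbent. A sketch (function names are ours):

```python
def relative_gain(candidate: float, baseline: float) -> float:
    # Relative (not absolute) accuracy gain, as quoted in this study:
    # 73.40 vs 65.34 works out to roughly +12.3%.
    return (candidate - baseline) / baseline * 100

def build_leaderboard(scores: dict, baseline_model: str):
    # `scores` maps model name -> task accuracy (%). Returns models
    # ranked best-first, each with its gain over the incumbent.
    base = scores[baseline_model]
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(m, s, relative_gain(s, base)) for m, s in ranked]
```

Rebuilding this ranking automatically whenever a new model launches is what keeps the leaderboard current without manual benchmarking work.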
Step 3: Go further with per-request routing
Switching to the best single model from the leaderboard is attractive but sub-optimal. Not all product comparison requests are equally hard. Some pairs are clearly identical; some require careful reasoning across subtle listing differences. A static model choice applies the same level of effort to every request.
Divyam.AI Router makes model selection at a finer granularity: per request, rather than per application. By routing each request to the model best suited to its difficulty and cost profile, the router unlocks gains that no single model can achieve.
Against the gemini-2.5-flash-lite baseline, the router delivered a +15.56% relative accuracy gain, higher than the +12.33% achievable by simply switching models, while simultaneously reducing cost by 11.01%. Better accuracy and lower cost, achieved together.
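A minimal picture of per-request routing: estimate request difficulty, then pick the cheapest model expected to clear an accuracy bar that rises with difficulty. All numbers and the linear bar are assumptions for illustration; the production router's policy is learned from the eval, not hand-coded like this:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    accuracy: float  # expected accuracy on this task (%)
    cost: float      # relative cost per request

def route(difficulty: float, candidates: list, base_bar: float = 70.0):
    # Harder requests demand a stronger model (assumed linear bar).
    bar = base_bar + 15 * difficulty
    eligible = [c for c in candidates if c.accuracy >= bar]
    if not eligible:
        # Nothing clears the bar: fall back to the most accurate model.
        return max(candidates, key=lambda c: c.accuracy)
    # Among qualifying models, take the cheapest; easy requests landing
    # on cheap models is where the aggregate cost savings come from.
    return min(eligible, key=lambda c: c.cost)
```

Easy pairs route to the cheap model and hard pairs to the strong one, which is how accuracy and cost can improve at the same time.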
How Divyam.AI Router distributed requests across 8 models
meta-llama/llama-4-scout-17b-16e-instruct took the largest share, followed by gpt-4.1-nano, gemini-2.5-flash-lite-preview-09-2025, openai/gpt-oss-20b, gemini-2.5-flash-lite, and other models.
The router selects per request rather than per application, routing simpler requests to lower-cost models and harder ones to higher-accuracy models. The mix is dynamic: it shifts automatically as new models are introduced or as the system recalibrates itself periodically.
The Results
Better evals. Better models. Better routing. All automated.
The Flash engagement demonstrates what becomes possible when the evaluation layer and the optimisation layer work together. Evalm8 provided the measurement foundation: a robust, domain-calibrated eval that evolved automatically as annotations grew. Router provided the optimisation: a leaderboard and a per-request routing layer that continuously finds the best cost-quality operating point.
Neither product alone would have delivered the full result. A better eval without better routing still leaves accuracy and cost gains on the table. Better routing without a reliable eval means optimising toward a metric you can't fully trust. Together, they form the closed loop that makes production AI systems continuously improve.
All results are from the pilot evaluation on real production data. Both the +15.56% accuracy improvement and the −11.01% cost reduction are measured relative to the original gemini-2.5-flash-lite baseline.
+15.56%   Accuracy improvement (router vs. baseline)
−11.01%   Cost reduction (router vs. baseline)
75%→80%   Eval accuracy improvement (Evalm8)
9         Models benchmarked in leaderboard
Before Divyam.AI
  Eval prompt: manual, team-authored
  Task accuracy: 65.34% on gemini-2.5-flash-lite
  Leaderboard and model selection: manual

With Divyam.AI
  Eval prompt: auto-generated, evolved by Evalm8 (+5pp eval accuracy)
  Task accuracy: +15.56% improvement via routing vs. baseline
  Leaderboard + per-request routing: automated (−11.01% cost)
Phase 2 Pilot
The second upgrade: a smarter router, better results
The Phase 1 pilot established the measurement foundation and demonstrated what intelligent routing could achieve. The Phase 2 pilot upgraded the router itself: a newer version of Divyam.AI Router was deployed against the same model pool, re-benchmarked against Flash's product equality eval, and re-optimised for the cost-quality frontier.
Against the same original gemini-2.5-flash-lite baseline, the Phase 2 router delivered a 15% quality uplift and 30% cost savings. The same models, routed more intelligently, produced a meaningfully better outcome on both dimensions.
This is the compounding dynamic at the core of Divyam.AI's value: the eval stays sharp, the router improves with each upgrade, and the gains compound without additional engineering investment from the Flash team.
Baseline: gemini-2.5-flash-lite, no routing (65.34% task accuracy)
Phase 1 Pilot: Divyam.AI Router, first model pool (+15.56% accuracy · −11% cost)
Phase 2 Pilot: upgraded Divyam.AI Router, same model pool (+15% quality · −30% cost)
All figures measured against the original gemini-2.5-flash-lite baseline on Flash's product equality task.
What's Next
A flywheel that keeps turning as the LLM landscape evolves
The LLM landscape is moving fast. New models launch regularly, cost-quality frontiers shift, and the model that was optimal last month may not be optimal today. For a team running a static model assignment, each of these shifts requires a new evaluation cycle, engineering work that competes with building the actual product.
With Divyam.AI's evaluation and routing layer in place, Flash's system is built to improve continuously. As the developer annotates more product pairs, the eval gets sharper. As new models enter the market, the leaderboard updates automatically. As the routing layer sees more requests, it makes better per-request decisions. The developer's bandwidth stays where it belongs: on Flash's product and user experience, not on the infrastructure beneath it.
See what Divyam.AI can do for your AI system
Join teams like Flash.co, PharmEasy, and MakeMyTrip that are improving AI quality and cutting costs with Divyam.AI.