How Flash.co achieved +15% quality uplift and −30% cost savings with Divyam.AI's evaluation and routing layer
Flash.co's AI product comparison engine was scaling fast. Across two pilot engagements, Divyam.AI Evalm8 automated the eval flywheel and Divyam.AI Router delivered compounding quality and cost gains with each upgrade.
Flash.co is building an AI-first shopping experience that does the research on behalf of its customers, helping them understand products better, compare options intelligently, make purchase decisions with greater confidence, and manage the post-purchase journey. Users paste a product URL; Flash finds it across every marketplace and surfaces the best price, with no manual searching required. The platform has processed 4.76M+ products and generated 12.56M+ insights.
The core AI challenge behind this seemingly simple experience is harder than it looks: given two product URLs, are they the same product? Differences in photography, descriptions, model variants, and marketplace-specific listing conventions mean that determining product equality requires genuine reasoning, not just text matching. This is the challenge Flash needs to solve to deliver trustworthy price comparisons to its customers. Divyam.AI's role is to ensure the evaluations measuring this capability are robust, comprehensive, and continuously improving.
[Screenshot: the Flash AI+ browser extension resolving a pasted Amazon product URL across marketplaces (Amazon ₹70,460, Flipkart ₹72,690, 4 stores found).]
The Flash AI product comparison engine
A user pastes any product URL. Flash finds the identical product across every marketplace and surfaces comparative prices. The AI decision at the heart of this experience (are these two listings the same product?) requires structured reasoning across product specifications, images, and descriptions. Getting that decision right is what determines whether the price comparison is trustworthy.
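To make the decision contract concrete, here is a purely illustrative Python stand-in (the names `EqualityVerdict` and `naive_equality_check` are ours, not Flash's). It shows why naive text matching is not enough: the same product, worded differently per marketplace, shares too few title tokens to match.

```python
from dataclasses import dataclass

@dataclass
class EqualityVerdict:
    reasoning: str
    same_product: bool

def _tokens(title: str) -> set:
    # Lowercased title tokens, stripped of surrounding punctuation.
    return {t.strip(".,()").lower() for t in title.split() if t.strip(".,()")}

def naive_equality_check(title_a: str, title_b: str) -> EqualityVerdict:
    # Jaccard overlap of title tokens: a text-matching stand-in for the
    # real judgment. It works when listings are worded almost identically
    # and fails when marketplaces describe the same product differently,
    # which is exactly why the task needs genuine reasoning.
    a, b = _tokens(title_a), _tokens(title_b)
    overlap = len(a & b) / max(len(a | b), 1)
    return EqualityVerdict(reasoning=f"token overlap {overlap:.2f}",
                           same_product=overlap > 0.6)
```

On near-identical titles this returns True; on the same appliance listed with marketplace-specific wording, the overlap drops below any sensible threshold and the naive check wrongly returns False.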
The Challenge
Two challenges: comprehensive evals and continuous model optimisation
Solving product equality at scale requires getting two things right in sequence, and both are genuinely hard without the right tooling.
Challenge 1: Writing comprehensive, domain-calibrated evals. A small golden dataset of labelled product pairs is a reasonable starting point, but it doesn't generalise. Off-the-shelf LLM judges aren't calibrated to domain-specific nuances: the subtle differences between how a wardrobe is listed on Amazon versus Flipkart require the kind of domain judgment that's difficult to capture in a manually authored prompt. Without a reliable eval, there's no confident way to know whether a change to the model or the prompt is actually an improvement. Building evals that are truly comprehensive is an iterative, labour-intensive process without an intelligent tool to assist.
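Measuring whether a change is an improvement reduces to one number: agreement between the eval judge and human labels on a golden set. A minimal sketch (the `judge` callable and the triple format are assumptions of ours, not Evalm8's interface):

```python
def eval_accuracy(judge, golden):
    # `judge` is any callable (url_a, url_b) -> bool; `golden` is a list
    # of (url_a, url_b, human_label) triples. The returned fraction is
    # the kind of quantity reported against human annotation.
    hits = sum(judge(a, b) == label for a, b, label in golden)
    return hits / len(golden)
```

Any candidate eval prompt, hand-authored or auto-evolved, can be scored the same way, which is what makes before/after comparisons trustworthy.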
Challenge 2: Keeping pace with the LLM landscape. Flash was running their product equality task on gemini-2.5-flash-lite, achieving 65.34% task accuracy. Constructing and maintaining a leaderboard across the fast-moving LLM landscape is significant engineering work that competes directly with building the actual product. New models launch regularly; the cost-quality frontier shifts constantly. Without automation, staying current is an ongoing manual investment.
These two challenges compound each other. Without a reliable eval, you can't confidently benchmark alternative models. And without continuous model optimisation, accuracy and cost gains remain on the table. Addressing them requires two things working together: a flywheel that keeps evals sharp, and a routing layer that continuously finds the optimal model.
The Solution
Two flywheels: one for evaluation, one for optimisation
Divyam.AI brought both products to bear on Flash's AI optimisation goals: Divyam.AI Evalm8 to sharpen the evaluation layer, and Divyam.AI Router to optimise the model layer. The two work in sequence: a sharp eval is what makes trustworthy model optimisation possible.
Step 1: Build a reliable eval with Divyam.AI Evalm8
Evalm8 begins from a minimal starting point: a two-line seed prompt describing the task (compare two product URLs and determine whether they are the same product). Against human annotation, that seed achieved 75% accuracy.
From there, Evalm8 took over. It automatically selected a diverse, coverage-complete set of product pairs for the developer to annotate. That annotation injected domain knowledge directly into the system: the kind of nuanced judgment that distinguishes same-product from similar-product. Evalm8 then automatically evolved the prompt from 2 lines to 40 lines, improving eval accuracy to 80% against human annotation. The developer's bandwidth was spent on annotation, not on prompt engineering.
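One way to picture the annotation-selection step is greedy farthest-point sampling over some pairwise dissimilarity. This is a generic diversity heuristic of our choosing, not Evalm8's documented algorithm:

```python
def select_diverse(items, k, distance):
    # Greedy farthest-point selection: repeatedly add the item farthest
    # from everything chosen so far, so the annotation batch covers the
    # space instead of clustering around easy, similar examples.
    chosen = [items[0]]
    while len(chosen) < min(k, len(items)):
        remaining = [p for p in items if p not in chosen]
        chosen.append(max(remaining,
                          key=lambda p: min(distance(p, c) for c in chosen)))
    return chosen
```

With product pairs embedded as vectors, `distance` would be something like cosine distance between embeddings; a 1-D toy input is enough to show the spread.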
How Evalm8 auto-evolves an eval: from a 2-line seed to a 40-line production prompt
Evalm8 starting seed (2 lines):

  Compare product specifications and other details from two E-commerce shopping website URLs.
  Determine whether the two URLs describe the same product.

Accuracy: 75%

Divyam.AI Evalm8 annotation + auto-evolution

Evolved prompt (excerpt of 40 lines):

  System message:

  Your input fields are:
  1. first_url (str): first E-commerce shopping website URL
  2. second_url (str): second E-commerce shopping website URL
  Your output fields are:
  1. reasoning (str):
  2. same_product (bool): whether the two URLs describe the same product
  …structured schema, examples, output format instructions
  Response:

Accuracy: 80%
Accuracy measured against human annotation. The developer annotated diverse product pairs selected automatically by Evalm8, with no manual prompt engineering required.
Step 2: Find the optimal model with the Leaderboard
With a reliable eval in place, Divyam.AI Router automatically constructed a leaderboard, benchmarking nine models against Flash's product equality task. The incumbent model, gemini-2.5-flash-lite, achieved 65.34% accuracy. The leaderboard immediately revealed a better option: switching to gemini-2.5-flash-lite-preview-09-2025 would deliver a +12.33% relative accuracy gain at zero additional cost.
The root cause was instruction-following: Flash's prompt expected a specific JSON schema, and the preview version adhered to it more reliably. This is exactly the kind of ongoing intelligence that an automated leaderboard provides, compounding value without additional engineering investment from the team.
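The instruction-following gap can be checked directly: count how often a model's raw response parses into exactly the expected JSON schema. A hedged sketch, using an illustrative two-field schema mirroring the eval's output fields:

```python
import json

# Illustrative schema mirroring the eval's output fields.
REQUIRED = {"reasoning": str, "same_product": bool}

def follows_schema(raw: str) -> bool:
    # Strict check: the response must parse as JSON, contain exactly the
    # required keys, and match the required types. Near-misses (prose
    # wrappers, missing fields) count as failures, which is how weak
    # schema adherence shows up as a task-accuracy gap.
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return (isinstance(obj, dict) and set(obj) == set(REQUIRED)
            and all(isinstance(obj[k], t) for k, t in REQUIRED.items()))
```

Running this over a batch of responses per model gives an adherence rate that explains accuracy differences like the one between the two flash-lite versions.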
Quality Scores: Product Equality Task

gemini-3.1-flash-lite-preview                  74.25%
moonshotai/kimi-k2-instruct-0905               74.25%
meta-llama/llama-4-scout-17b-16e-instruct      73.47%
gemini-2.5-flash-lite-preview-09-2025          73.40%  (recommended)
gpt-4o-mini                                    72.18%
gpt-4.1-nano                                   67.19%
gemini-2.5-flash-lite                          65.34%  (baseline)
openai/gpt-oss-20b                             56.42%
llama-3.1-8b-instant                           32.17%
Leaderboard constructed automatically by Divyam.AI Router. Switching from the baseline to the recommended model delivers a +12.33% relative accuracy gain at zero additional cost. Scores are on Flash's product equality task.
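Mechanically, a leaderboard like the one above is the eval run against each candidate model, ranked, with relative gain computed against the incumbent. A sketch (function names are ours):

```python
def relative_gain(candidate: float, baseline: float) -> float:
    # Relative (not absolute) accuracy gain, as quoted in this study:
    # 73.40 vs 65.34 works out to roughly +12.3%.
    return (candidate - baseline) / baseline * 100

def build_leaderboard(scores: dict, baseline_model: str):
    # `scores` maps model name -> task accuracy (%). Returns models
    # ranked best-first, each with its gain over the incumbent.
    base = scores[baseline_model]
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(m, s, relative_gain(s, base)) for m, s in ranked]
```

Rebuilding this ranking automatically whenever a new model launches is what keeps the leaderboard current without manual benchmarking work.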
Step 3: Go further with per-request routing
Switching to the best single model from the leaderboard is attractive but sub-optimal. Not all product comparison requests are equally hard. Some pairs are clearly identical; some require careful reasoning across subtle listing differences. A static model choice applies the same level of effort to every request.
Divyam.AI Router makes model selection at a finer granularity: per request, rather than per application. By routing each request to the model best suited to its difficulty and cost profile, the router unlocks gains that no single model can achieve.
Against the gemini-2.5-flash-lite baseline, the router delivered a +15.56% relative accuracy gain, higher than the +12.33% achievable by simply switching models, while simultaneously reducing cost by 11.01%. Better accuracy and lower cost, achieved together.
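A minimal picture of per-request routing: estimate request difficulty, then pick the cheapest model expected to clear an accuracy bar that rises with difficulty. All numbers and the linear bar are assumptions for illustration; the production router's policy is learned from the eval, not hand-coded like this:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    accuracy: float  # expected accuracy on this task (%)
    cost: float      # relative cost per request

def route(difficulty: float, candidates: list, base_bar: float = 70.0):
    # Harder requests demand a stronger model (assumed linear bar).
    bar = base_bar + 15 * difficulty
    eligible = [c for c in candidates if c.accuracy >= bar]
    if not eligible:
        # Nothing clears the bar: fall back to the most accurate model.
        return max(candidates, key=lambda c: c.accuracy)
    # Among qualifying models, take the cheapest; easy requests landing
    # on cheap models is where the aggregate cost savings come from.
    return min(eligible, key=lambda c: c.cost)
```

Easy pairs route to the cheap model and hard pairs to the strong one, which is how accuracy and cost can improve at the same time.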
How Divyam.AI Router distributed requests across 8 models
meta-llama/llama-4-scout-17b-16e-instruct took the largest share, followed by gpt-4.1-nano, gemini-2.5-flash-lite-preview-09-2025, openai/gpt-oss-20b, gemini-2.5-flash-lite, and other models.
The router selects per request rather than per application, routing simpler requests to lower-cost models and harder ones to higher-accuracy models. The mix is dynamic: it shifts automatically as new models are introduced or as the system recalibrates itself periodically.
The Results
Better evals. Better models. Better routing. All automated.
The Flash engagement demonstrates what becomes possible when the evaluation layer and the optimisation layer work together. Evalm8 provided the measurement foundation: a robust, domain-calibrated eval that evolved automatically as annotations grew. Router provided the optimisation: a leaderboard and a per-request routing layer that continuously finds the best cost-quality operating point.
Neither product alone would have delivered the full result. A better eval without better routing still leaves accuracy and cost gains on the table. Better routing without a reliable eval means optimising toward a metric you can't fully trust. Together, they form the closed loop that makes production AI systems continuously improve.
All results are from the pilot evaluation on real production data. Both the +15.56% accuracy improvement and the −11.01% cost reduction are measured relative to the original gemini-2.5-flash-lite baseline.
+15.56%   Accuracy improvement (router vs. baseline)
−11.01%   Cost reduction (router vs. baseline)
75%→80%   Eval accuracy improvement (Evalm8)
9         Models benchmarked in leaderboard
Before Divyam.AI
  Eval prompt: manual, team-authored
  Task accuracy: 65.34% on gemini-2.5-flash-lite
  Leaderboard and model selection: manual

With Divyam.AI
  Eval prompt: auto-generated, evolved by Evalm8 (+5pp eval accuracy)
  Task accuracy: +15.56% improvement via routing vs. baseline
  Leaderboard + per-request routing: automated (−11.01% cost)
Phase 2 Pilot
The second upgrade: a smarter router, better results
The Phase 1 pilot established the measurement foundation and demonstrated what intelligent routing could achieve. The Phase 2 pilot upgraded the router itself: a newer version of Divyam.AI Router was deployed against the same model pool, re-benchmarked against Flash's product equality eval, and re-optimised for the cost-quality frontier.
Against the same original gemini-2.5-flash-lite baseline, the Phase 2 router delivered a 15% quality uplift and 30% cost savings. The same models, routed more intelligently, produced a meaningfully better outcome on both dimensions.
This is the compounding dynamic at the core of Divyam.AI's value: the eval stays sharp, the router improves with each upgrade, and the gains compound without additional engineering investment from the Flash team.
Baseline: gemini-2.5-flash-lite, no routing (65.34% task accuracy)
Phase 1 Pilot: Divyam.AI Router, first model pool (+15.56% accuracy · −11% cost)
Phase 2 Pilot: upgraded Divyam.AI Router, same model pool (+15% quality · −30% cost)
All figures measured against the original gemini-2.5-flash-lite baseline on Flash's product equality task.
What's Next
A flywheel that keeps turning as the LLM landscape evolves
The LLM landscape is moving fast. New models launch regularly, cost-quality frontiers shift, and the model that was optimal last month may not be optimal today. For a team running a static model assignment, each of these shifts requires a new evaluation cycle, engineering work that competes with building the actual product.
With Divyam.AI's evaluation and routing layer in place, Flash's system is built to improve continuously. As the developer annotates more product pairs, the eval gets sharper. As new models enter the market, the leaderboard updates automatically. As the routing layer sees more requests, it makes better per-request decisions. The developer's bandwidth stays where it belongs: on Flash's product and user experience, not on the infrastructure beneath it.
See what Divyam.AI can do for your AI system
Join teams like Flash.co, PharmEasy, and MakeMyTrip that are improving AI quality and cutting costs with Divyam.AI.