Why is this in your inbox? Build with AI is a series by Hexa Media where two partners or founders deep dive into useful, practical ways they’ve learned to use or build with AI within their startup. Prefer to skip future episode drops? Unsubscribe from future Build with AI notifications here.
Hexa Product Lead Pierre sits down with Louis Pinsard, CTO & co-founder of Dialog — an AI shopping assistant handling hundreds of thousands of conversations per month across 200+ ecommerce stores — to explore how they scaled from a basic RAG setup to the infrastructure that supports that volume today.
They started with a very basic RAG setup, which was enough to land their first client. But pretty quickly they hit the real problems: product catalogs that explode in size and break your retrieval, a growing volume of users with increasingly diverse queries, keeping latency low through all of it, and building evaluations that give you objective criteria to actually know whether your agent is getting better or worse.
Louis walks through the full technical evolution, from chunking strategies and retrieval tricks to the evaluation systems they wish they’d built sooner.
You can watch the full conversation above, or read the recap below.
1) The first version was literally a LangChain tutorial in production
Dialog started in early 2024, not long after GPT-4’s release. The team needed to move fast, so they did the most basic thing possible: a textbook RAG setup. User sends a message, you search for semantically close documents, stuff them into the LLM’s context, generate a response. A few lines of code, shipped to production.
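That textbook loop fits in a few lines. The sketch below is illustrative, not Dialog’s actual code: a toy bag-of-words similarity stands in for a real embedding model, and the LLM call is left as a plain function parameter.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding" -- a real setup would call an embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=2):
    # Search for the semantically closest documents.
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def answer(query, docs, llm):
    # Stuff the top documents into the prompt and generate a response.
    context = "\n".join(retrieve(query, docs))
    return llm(f"Context:\n{context}\n\nQuestion: {query}")

docs = [
    "Our moisturizer suits dry and sensitive skin.",
    "Shipping takes 3-5 business days within the EU.",
    "The sunscreen is SPF 50 and water resistant.",
]
```

That really is the whole first version: retrieve, stuff, generate.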
It worked well enough to get about ten Shopify design partners on board. There was no real evaluation system — Louis would test a handful of queries manually and go with his gut. The models were also expensive, but the bet was that prices would keep dropping. That bet has held up.
2) Not every message deserves an expensive LLM call
Pretty quickly, they noticed people typing “hello” or sending off-topic messages, and the system would dutifully burn through tokens on zero-value interactions. So the first meaningful addition was a routing classifier that decides what kind of message it is before anything else happens. Is it store-related? About a specific product? A policy question? Or just noise?
This became one of the most important components in the whole system. The core idea hasn’t changed since: not all queries should be treated the same way, and sorting that out first saves money and improves quality simultaneously.
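To make the idea concrete, here is a minimal stand-in for that routing step. The labels and keyword lists are invented for illustration; a production router like Dialog’s would more plausibly use a small LLM or trained classifier, but the shape of the decision is the same.

```python
# Hypothetical route labels and trigger words, for illustration only.
ROUTES = {
    "product": ["ingredient", "size", "color", "spf", "moisturizer"],
    "policy": ["refund", "return", "shipping", "delivery", "warranty"],
}

def route(message: str) -> str:
    """Cheap first pass: decide what kind of message this is
    before spending tokens on retrieval and generation."""
    text = message.lower()
    has_keyword = any(w in text for words in ROUTES.values() for w in words)
    if len(text.split()) <= 2 and not has_keyword:
        return "smalltalk"        # "hello", "thanks" -> canned reply, no LLM call
    for label, keywords in ROUTES.items():
        if any(w in text for w in keywords):
            return label
    return "store_general"        # fall through to the full RAG pipeline
```

Only messages that survive this gate reach the expensive part of the pipeline; "smalltalk" gets a canned response for free.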
3) Re-ranking fixed the hallucination problem
Hallucinations came from two places: the retrieval step failing to surface the right documents, or surfacing too many and the LLM getting lost in an oversized context.
The fix was a re-ranking step. Instead of retrieving 10 documents and hoping for the best, you first cast a wide net — about 100 candidates — then run them through a specialized model whose only job is to pick the 10 most relevant ones for the given question. Tighter context, better results, fewer hallucinations.
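The two-stage shape looks like this. A simple token-overlap score stands in for the specialized re-ranking model (in practice a cross-encoder that scores each query–document pair jointly); the function names are illustrative.

```python
def overlap_score(query: str, doc: str) -> float:
    # Stand-in for a cross-encoder re-ranker, which would score
    # (query, document) pairs with a small specialized model.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0

def rerank(query, candidates, k=10):
    # Stage 2: narrow the ~100 wide-net candidates down to the
    # k most relevant documents; only these reach the LLM's context.
    return sorted(candidates, key=lambda d: overlap_score(query, d), reverse=True)[:k]

# Stage 1 (the wide net) would be a cheap vector search returning
# ~100 candidates; rerank() then picks the survivors.
```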
4) The user’s question is almost never good enough to search with directly
If someone is on a skincare product page and asks “is this good for my skin type?” — that query has almost no useful semantic content. No product name, no ingredient, nothing for retrieval to latch onto.
One technique that worked well early on is HyDE — Hypothetical Document Embedding. You ask the LLM to generate a plausible answer without context, then use that hypothetical answer as your search query. Even a hallucinated answer will be semantically closer to the real documents than the vague original question was.
They also enrich queries with page context, conversation history, and keyword extraction. The rephrasing step turned out to be one of the highest-leverage improvements in the whole pipeline.
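A sketch of that rephrasing step, combining page context with a HyDE-style hypothetical answer. The prompt wording and function names are assumptions, and a hard-coded fake LLM stands in for the model that would actually write the guess.

```python
def hypothetical_answer(question: str, page_context: str, llm) -> str:
    """HyDE: ask the model to answer *without* retrieved documents,
    then search with that answer -- even a hallucinated answer lands
    semantically closer to the real documents than the vague question."""
    prompt = (
        f"The user is viewing: {page_context}\n"
        f"Question: {question}\n"
        "Write a short, plausible answer, even if you have to guess."
    )
    return llm(prompt)

def enriched_query(question, page_context, llm):
    # Fold page context, the original question, and the hypothetical
    # answer into one retrieval query with real semantic content.
    return f"{page_context} {question} {hypothetical_answer(question, page_context, llm)}"

# Stand-in LLM so the sketch runs end to end.
fake_llm = lambda prompt: "This moisturizer with hyaluronic acid suits dry skin."
q = enriched_query("is this good for my skin type?", "Hydra Moisturizer product page", fake_llm)
```

The original question “is this good for my skin type?” had nothing for retrieval to latch onto; the enriched query mentions the product and concrete attributes.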
5) How you chunk your product catalog matters more than you’d think
A product has a title, a description (sometimes enormous), variants, collections, metadata — and if you naively chop all of that into fixed-size chunks, you get fragments where the end of a description bleeds into collection data. That’s meaningless to a retrieval system.
Dialog splits along logical boundaries first — description separate from title separate from variants — then applies size-based chunking within those sections. They also attach metadata like the product name to every chunk so fragments can always be traced back to their source. This becomes critical when you need to filter by price, skin type, or other attributes.
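The strategy can be sketched as follows. Field names, chunk size, and the attached metadata are illustrative assumptions, not Dialog’s schema.

```python
def chunk_product(product: dict, max_len: int = 200) -> list[dict]:
    """Split along logical boundaries first (title / description / variants),
    then size-chunk within each section, attaching metadata to every chunk
    so fragments can always be traced back and filtered later."""
    chunks = []
    for section in ("title", "description", "variants"):
        text = product.get(section, "")
        if isinstance(text, list):
            text = "; ".join(text)
        # Size-based chunking happens *within* a section only, so a chunk
        # never bleeds from the end of a description into variant data.
        for i in range(0, len(text), max_len):
            chunks.append({
                "text": text[i:i + max_len],
                "section": section,
                "product": product["title"],                  # traceability
                "attributes": product.get("attributes", {}),  # price, skin type, ...
            })
    return chunks

product = {
    "title": "Hydra Moisturizer",
    "description": "A rich cream for dry skin. " * 20,   # oversized description
    "variants": ["50ml", "100ml"],
    "attributes": {"skin_type": "dry", "price": 29},
}
```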
6) The filter trick that made structured catalogs useful
Many ecommerce catalogs already have structured filter systems — skin type, brand, price range. Dialog realized they could extract filters from the user’s natural language query before even hitting semantic search.
So “what’s a good moisturizer for dry skin?” gets “dry skin” extracted as a filter, the search is restricted to matching products, and semantic retrieval runs within that smaller subset. It’s adding determinism into a fuzzy process — and it works because the catalog structure already exists, you just need to tap into it.
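The deterministic pre-pass can be sketched like this; the filter vocabulary and attribute names are invented for the example, and simple substring matching stands in for whatever extraction the real system uses.

```python
# Illustrative filter vocabulary -- in practice this comes from the
# store's existing structured filter system.
KNOWN_FILTERS = {"skin_type": ["dry", "oily", "sensitive"], "brand": ["acme", "glow"]}

def extract_filters(query: str) -> dict:
    # Deterministic first pass: pull structured filters out of the
    # natural-language query before any semantic search runs.
    text = query.lower()
    return {attr: v for attr, values in KNOWN_FILTERS.items()
            for v in values if v in text}

def filtered_search(query, products, score):
    filters = extract_filters(query)
    # Restrict the candidate set to matching products, then run
    # semantic retrieval inside that smaller subset.
    subset = [p for p in products
              if all(p["attributes"].get(a) == v for a, v in filters.items())]
    return sorted(subset, key=lambda p: score(query, p["text"]), reverse=True)

products = [
    {"text": "rich moisturizer cream", "attributes": {"skin_type": "dry"}},
    {"text": "mattifying moisturizer gel", "attributes": {"skin_type": "oily"}},
]
```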
7) They flew blind on evaluation for the first six months
For the first six months, improvements were measured by vibes. Louis would test manually, look at logs, ask clients to try things. That worked while the wins were obvious. But eventually changes got subtler and the risk of breaking something elsewhere got real.
The classifier was easy to evaluate — known input, known correct category, score it. RAG evaluation is much harder. You need query-document pairs that represent ground truth, and building that dataset requires genuine domain expertise. They use a mix of LLM-as-judge and human review, but Louis is clear this is still evolving.
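The contrast shows up in the metrics themselves: the classifier is scored with plain accuracy, while retrieval needs something like recall@k over those ground-truth query–document pairs. A minimal sketch (the metric choice here is a common one, not necessarily what Dialog uses):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of ground-truth documents that appear in the top-k results."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0

def evaluate(pipeline, dataset, k=10):
    # dataset: list of (query, set_of_relevant_doc_ids) pairs --
    # the ground truth that takes genuine domain expertise to build.
    scores = [recall_at_k(pipeline(query), relevant, k)
              for query, relevant in dataset]
    return sum(scores) / len(scores)
```

Run against a fixed dataset before and after each change, a number like this turns “feels better” into an objective regression check.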
8) What Louis would tell someone building an agent from scratch today
Three things.
Observability early — log the inputs and outputs of every component from day one. You’ll always be glad you have the data later, even if you don’t know exactly how you’ll use it yet.
Don’t jump to vector databases. Classical text search — BM25, Elasticsearch, even grep-style approaches — works better than people think, especially with structured data like product catalogs. Louis points to Claude Code as an example: it searches codebases through clever grep queries, not embeddings. Start simple, add semantic search when you’re genuinely hitting limits.
And use Python if the AI system is the core of your product. The ecosystem, tooling, and talent pool are all there. It sounds obvious, but it’s not always the default choice.
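On the second point: classical search is simpler to start with than it sounds. A from-scratch BM25 fits in a few dozen lines (this sketch is for illustration; in practice you would reach for Elasticsearch or an off-the-shelf library rather than rolling your own).

```python
import math
from collections import Counter

class BM25:
    """Minimal BM25 ranking over a small corpus -- classical text search
    that often beats vector search on structured data like catalogs."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = [d.lower().split() for d in docs]
        self.k1, self.b = k1, b
        self.N = len(self.docs)
        self.avgdl = sum(len(d) for d in self.docs) / self.N
        # Document frequency: how many docs contain each term.
        self.df = Counter(t for d in self.docs for t in set(d))

    def score(self, query, i):
        doc, tf = self.docs[i], Counter(self.docs[i])
        s = 0.0
        for t in query.lower().split():
            if self.df[t] == 0:
                continue
            idf = math.log(1 + (self.N - self.df[t] + 0.5) / (self.df[t] + 0.5))
            denom = tf[t] + self.k1 * (1 - self.b + self.b * len(doc) / self.avgdl)
            s += idf * tf[t] * (self.k1 + 1) / denom
        return s

    def search(self, query, k=3):
        ranked = sorted(range(self.N), key=lambda i: self.score(query, i), reverse=True)
        return ranked[:k]
```

Term-frequency scoring like this is exact on product names and SKUs, which is precisely where embeddings tend to blur.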
If you’re building AI agents on top of messy real-world data, the full conversation is worth your time. Louis gets into the details in a way that’s rare for a CTO of a company at this stage.
Subscribe to receive future episodes of Build with AI.