AdmerTech

Building an AI shopping assistant with RAG: a practical, production-ready guide

Daniel Okafor · 12 min read

Why a chatbot will not cut it

The 2023 wave of "AI shopping assistants" was almost entirely chatbots glued to a system prompt. They hallucinated prices, misquoted policies, and offered discounts they had no authority to give.

A shopping assistant in 2026 has to be three things at once: a search engine, a salesperson and a customer service rep — over your real data, with strict guardrails. That is a different system.

This is how we build them.

The architecture in one paragraph

A production assistant is a small orchestration layer with four components: a retrieval system over your catalog and policies, an LLM that reasons and writes, a constrained tool layer for actions like cart, account and returns, and an observability and eval layer that watches everything. None of these are optional.

1. Indexing your catalog the right way

The retrieval layer is where most assistants quietly fail. Common mistakes:

  • Embedding raw product JSON.
  • Embedding 1,200-word product detail pages (PDPs) as a single chunk each.
  • Mixing variants and master products in the same index.
  • Forgetting to re-embed when prices, stock or descriptions change.

Better:

  • Build canonical product cards with title, key attributes, stock, price, and a short LLM-written summary.
  • Index per-variant for size, color and material so retrieval can match exact intent.
  • Keep a separate index for policies, FAQs and help content.
  • Refresh embeddings on a webhook from Shopify, not a nightly cron.
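
A minimal sketch of the canonical-card idea above, assuming a simplified product record. The field names (`variants`, `inventory`, `summary` and so on) are illustrative, not a Shopify schema, and `build_cards` and `needs_reembedding` are hypothetical helpers. Each variant gets its own card, and a content hash lets a webhook handler re-embed only the cards whose text actually changed.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ProductCard:
    variant_id: str
    text: str          # the string that gets embedded
    content_hash: str  # changes whenever any field in the card changes


def build_cards(product: dict) -> list[ProductCard]:
    """Build one card per variant from a hypothetical product record."""
    cards = []
    for variant in product["variants"]:
        availability = "in stock" if variant["inventory"] > 0 else "out of stock"
        text = "\n".join([
            f"Title: {product['title']}",
            f"Variant: {variant['size']}, {variant['color']}, {variant['material']}",
            f"Price: {variant['price']} {product['currency']}",
            f"Availability: {availability}",
            f"Summary: {product['summary']}",
        ])
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        cards.append(ProductCard(variant["id"], text, digest))
    return cards


def needs_reembedding(card: ProductCard, stored_hashes: dict[str, str]) -> bool:
    """True when the card is new or its text differs from what was last embedded."""
    return stored_hashes.get(card.variant_id) != card.content_hash


# Illustrative record; the field names are assumptions, not a Shopify schema.
product = {
    "title": "Trail Jacket",
    "currency": "USD",
    "summary": "Lightweight waterproof shell for day hikes.",
    "variants": [
        {"id": "v1", "size": "M", "color": "green", "material": "nylon",
         "price": "129.00", "inventory": 4},
        {"id": "v2", "size": "L", "color": "green", "material": "nylon",
         "price": "129.00", "inventory": 0},
    ],
}
cards = build_cards(product)
```

Because price and stock are part of the card text, a price or inventory update changes the hash, which is what makes the webhook-driven refresh cheap: unchanged variants are skipped.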

The signal you care about is recall on real shopper queries — not benchmark numbers from a paper.
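
Recall on real queries can be measured with very little code. The sketch below assumes you have a set of shopper queries where a human has marked the correct variant ids. The ids and the `runs` structure are made up for illustration.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top k retrieved ids."""
    if not relevant:
        raise ValueError("each labelled query needs at least one relevant id")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over (retrieved, relevant) pairs, one pair per query."""
    total = sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs)
    return total / len(runs)


# Hypothetical labelled queries: retrieved ids in rank order, plus the ids
# a human marked as correct answers for that query.
runs = [
    (["v1", "v7", "v3"], {"v1", "v3"}),  # both relevant ids are in the top 3
    (["v9", "v2", "v4"], {"v5"}),        # the relevant id was missed
]
score = mean_recall_at_k(runs, k=3)
```

Tracking this number per query category (exact-match, fuzzy intent, policy questions) tends to be more useful than a single average.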

2. Hybrid retrieval beats pure semantic

Pure vector search is great for fuzzy intent and bad for exact-match queries, which is half of e-commerce. A query like "iPhone 17 Pro Max case clear" has to return that exact product, not a "vibes match."

The pattern:

  • BM25 keyword retrieval plus vector retrieval, fused with reciprocal rank fusion.
  • A small reranker over the top 50 results.
  • Filters from the LLM's tool calls — price ranges, stock status, brand, size.
  • Hard constraints from rules — never recommend competitor products on this PDP, never recommend out-of-stock items, never recommend over-21 products to unauthenticated visitors.
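
The fusion step in the first bullet is small enough to show in full. This is a sketch of reciprocal rank fusion over two ranked id lists. The constant `k = 60` is a commonly used default rather than something tuned for any particular catalog, and the product ids are invented.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked id lists: each list contributes 1 / (k + rank) per id."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so output is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Invented results for the query "iPhone 17 Pro Max case clear".
bm25_hits = ["case-clear", "case-black", "screen-protector"]
vector_hits = ["case-leather", "case-clear", "case-black"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Items that appear in both lists rise to the top, so the exact keyword match wins here even though the vector search ranked a different case first. The fused list is what the reranker would then see.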

This is the boring engineering that separates a working assistant from a flashy demo.
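
The hard constraints in the list above are best applied in plain code after retrieval, where the model cannot talk its way around them. A minimal sketch, assuming each candidate carries the few fields the rules need. The `Candidate` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    variant_id: str
    brand: str
    in_stock: bool
    age_restricted: bool  # for example, over-21 products


def apply_hard_constraints(
    candidates: list[Candidate],
    *,
    blocked_brands: frozenset[str],
    visitor_authenticated: bool,
) -> list[Candidate]:
    """Drop candidates that break a rule; keep the original ranking order."""
    allowed = []
    for candidate in candidates:
        if not candidate.in_stock:
            continue  # never recommend out-of-stock items
        if candidate.brand in blocked_brands:
            continue  # never recommend competitor products on this page
        if candidate.age_restricted and not visitor_authenticated:
            continue  # never show restricted products to anonymous visitors
        allowed.append(candidate)
    return allowed


candidates = [
    Candidate("v1", "HouseBrand", True, False),
    Candidate("v2", "RivalCo", True, False),
    Candidate("v3", "HouseBrand", False, False),
    Candidate("v4", "HouseBrand", True, True),
]
visible = apply_hard_constraints(
    candidates,
    blocked_brands=frozenset({"RivalCo"}),
    visitor_authenticated=False,
)
```

Running the filter after reranking, and before anything reaches the LLM, means a blocked product never appears in the model's context at all.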

3. Tool use for actions

A salesperson who cannot ring up a sale is not useful. The tool layer is where the assistant earns its keep.

Useful tools, scoped narrowly:

  • search_products(query, filters) — backed by your hybrid search.
  • get_product(id) — full details, returning canonical pricing and stock.
  • get_order(order_id) for authenticated users.
  • start_return(order_id, items, reason).
  • add_to_cart(variant_id, qty).
  • handoff_to_human(reason).

Each tool runs server-side, behind auth checks the LLM has no way to bypass. The LLM requests a tool call; the backend decides whether it runs.
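
A sketch of that split, assuming the model's tool call arrives as a name plus a dict of arguments. `Session`, `ToolDenied`, `dispatch` and the in-memory `ORDERS` store are hypothetical stand-ins for your own session and order services. The session comes from the server, never from the model.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class Session:
    user_id: Optional[str]  # None for anonymous visitors


class ToolDenied(Exception):
    """Raised when the backend refuses to run a requested tool call."""


# Stand-in for an order service; a real one would query your backend.
ORDERS = {"1001": {"owner": "u1", "status": "shipped"}}


def get_order(session: Session, order_id: str) -> dict:
    order = ORDERS.get(order_id)
    # Same error for "missing" and "not yours" so order ids cannot be probed.
    if session.user_id is None or order is None or order["owner"] != session.user_id:
        raise ToolDenied("order not available for this session")
    return {"order_id": order_id, "status": order["status"]}


def add_to_cart(session: Session, variant_id: str, qty: int) -> dict:
    if not isinstance(qty, int) or not 1 <= qty <= 10:
        raise ToolDenied("quantity must be an integer between 1 and 10")
    return {"variant_id": variant_id, "qty": qty}


TOOLS: dict[str, Callable[..., dict]] = {
    "get_order": get_order,
    "add_to_cart": add_to_cart,
}


def dispatch(session: Session, name: str, arguments: dict[str, Any]) -> dict:
    """Run a tool the model asked for, but only if the backend allows it."""
    handler = TOOLS.get(name)
    if handler is None:
        return {"ok": False, "error": f"unknown tool: {name}"}
    try:
        return {"ok": True, "result": handler(session, **arguments)}
    except ToolDenied as exc:
        return {"ok": False, "error": str(exc)}
    except TypeError:
        # Missing or unexpected argument names in the model's call.
        return {"ok": False, "error": "invalid arguments"}
```

For example, `dispatch(Session("u2"), "get_order", {"order_id": "1001"})` returns an error payload instead of the order, because the order belongs to a different user. The model can read that error and respond accordingly, but it has no path to the data.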

4. Guardrails that actually protect revenue

A sloppy assistant can lose more money than it earns. The non-negotiables:

  • A list of topics the model refuses — pricing it cannot verify, medical claims, regulatory advice.
  • Hard caps on discount language. The only discounts the model can mention are the ones returned by a get_promotions tool.
  • A persona that stays on-brand under adversarial prompting.
  • A response template that distinguishes "from the catalog" content from generative copy.
  • Every output run through a fast safety classifier before it returns.

Guardrails are not disclaimers. "I am an AI and may make mistakes" does not stop the assistant from offering 50 percent off your hero product.
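
One way to enforce the discount cap is a post-generation check that compares any discount in the draft reply against what get_promotions returned. The sketch below is deliberately narrow: it only catches phrasing like "20% off" or "20 percent off", and a real guardrail would need broader patterns or a classifier. The function name and regex are illustrative assumptions.

```python
import re

# Matches "20% off", "20 % off" and "20 percent off", case-insensitively.
PERCENT_OFF = re.compile(r"(\d{1,3})\s*(?:%|percent)\s*off", re.IGNORECASE)


def unauthorized_discounts(draft: str, allowed_percentages: set[int]) -> list[int]:
    """Return discount percentages in the draft that no promotion backs up."""
    mentioned = {int(value) for value in PERCENT_OFF.findall(draft)}
    return sorted(mentioned - allowed_percentages)


# Percentages assumed to come from the get_promotions tool for this session.
allowed = {10}
draft = "You can take 50 percent off today, or 10% off with code WELCOME."
violations = unauthorized_discounts(draft, allowed)
```

When `violations` is non-empty, the safe options are to regenerate the reply or fall back to a templated answer, rather than sending the draft and hoping.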

5. Evals before you launch

You will not get this right on prompts alone. You need an offline eval suite before you ship and a live one after.

Build:

  • 100 to 300 real user queries with expected answers and expected tool calls.
  • Adversarial probes — prompt injection, jailbreaks, "ignore previous instructions."
  • Policy tests for returns, sizing, shipping and exchanges.
  • Catalog accuracy tests for price, stock and attributes.
  • A regression harness that runs on every prompt or model change.

If your eval pass rate is under 90 percent on the catalog and policy categories, you are not ready for production traffic.
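
The 90 percent gate is easy to wire into a regression harness. A minimal sketch, assuming each eval case is recorded as a (category, passed) pair. The category names `catalog` and `policy` mirror the text, and everything else is illustrative.

```python
from collections import defaultdict


def pass_rates(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Pass rate per eval category from (category, passed) pairs."""
    totals: dict[str, int] = defaultdict(int)
    passed: dict[str, int] = defaultdict(int)
    for category, ok in results:
        totals[category] += 1
        passed[category] += int(ok)
    return {category: passed[category] / totals[category] for category in totals}


def ready_for_production(
    results: list[tuple[str, bool]],
    gated: tuple[str, ...] = ("catalog", "policy"),
    threshold: float = 0.90,
) -> bool:
    """True only if every gated category meets the threshold.

    A gated category with no results counts as failing.
    """
    rates = pass_rates(results)
    return all(rates.get(category, 0.0) >= threshold for category in gated)


# Invented results: 9 of 10 catalog cases and 19 of 20 policy cases pass.
results = (
    [("catalog", True)] * 9 + [("catalog", False)]
    + [("policy", True)] * 19 + [("policy", False)]
    + [("adversarial", True)] * 5
)
ok_to_ship = ready_for_production(results)
```

Failing the build when `ready_for_production` returns False keeps a prompt tweak or a model swap from quietly lowering catalog accuracy.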

6. Observability and cost controls

Once it is live:

  • Log every prompt, tool call, retrieval set and final response.
  • Sample 5 to 10 percent for human review weekly.
  • Watch cost per session and cost per converted session, not just total spend.
  • Route the easy 60 percent of queries to a smaller, cheaper model and reserve the flagship for the complex 10 percent. Let eval results decide where the queries in between go.
  • Cache aggressively at the retrieval layer instead of trying to cache LLM responses.

Stores that get this right run shopping assistants for low single digits per converted session — comfortably below the cost of one chat with a human agent.
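
The two cost metrics can be computed straight from session logs. The sketch assumes each logged session records its total LLM spend and whether it ended in a purchase, and it attributes all spend to converted sessions when computing cost per converted session. The `SessionLog` fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SessionLog:
    session_id: str
    llm_cost: float   # total model spend for the session, in your currency
    converted: bool   # True if the session ended in a purchase


def cost_metrics(logs: list[SessionLog]) -> dict[str, Optional[float]]:
    """Total spend, cost per session, and cost per converted session."""
    total = sum(log.llm_cost for log in logs)
    converted = sum(1 for log in logs if log.converted)
    return {
        "total_spend": total,
        "cost_per_session": total / len(logs) if logs else None,
        # All spend, including non-converting sessions, divided by conversions.
        "cost_per_converted_session": total / converted if converted else None,
    }


# Invented sample: four sessions, two of which converted.
logs = [
    SessionLog("s1", 0.50, True),
    SessionLog("s2", 0.25, False),
    SessionLog("s3", 0.25, False),
    SessionLog("s4", 1.00, True),
]
metrics = cost_metrics(logs)
```

Cost per converted session moves when either spend or conversion rate changes, which is why it is a better alarm than total spend alone.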

The stack we use most often

For a Shopify or Next.js commerce store, a typical stack:

  • Embeddings: a strong open-weight or hosted model. Pick one and stick with it, because switching models means re-embedding the whole index.
  • Vector DB: pgvector for under 5 million vectors, Pinecone or Weaviate above that.
  • Reranker: a cross-encoder hosted on a GPU node, or a hosted reranker.
  • LLM: a flagship model for reasoning, a smaller one for routing and rewrites.
  • Orchestration: the Vercel AI SDK, LangGraph, or a custom orchestrator. The patterns matter more than the framework.

Pitfalls we keep seeing

  • Skipping the eval harness. Always the same outcome.
  • Letting the LLM call tools without server-side auth.
  • Indexing once, never refreshing.
  • Optimizing for "smartness" instead of conversion.
  • Treating the assistant as a feature, not a product. It needs an owner with metrics.

A real shopping assistant is a small system, not a clever prompt. Ship it that way and it earns its keep within a quarter.
