The gap between demo and production
The best AI demo is usually a lie: the model answered well once, on a well-chosen input, with no adversarial traffic. Production is the opposite on all three counts.
Start from a measurable job
Before any code: what is the one job this feature does? What does success look like **in numbers**? Deflection rate. Time saved. Close rate. If you can’t express it as a metric, don’t ship it.
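One way to force the question is to write the metric down before the feature exists. A minimal sketch; the metric name, baseline, and target here are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class SuccessMetric:
    """The one number this feature is accountable for."""
    name: str        # e.g. "ticket_deflection_rate" (hypothetical)
    baseline: float  # what the current system achieves without the feature
    target: float    # what "ship it" means, agreed before any code

    def shipped(self, observed: float) -> bool:
        return observed >= self.target

# Hypothetical example: a support bot must beat a 12% baseline.
deflection = SuccessMetric("ticket_deflection_rate", baseline=0.12, target=0.20)
assert not deflection.shipped(0.15)  # 15% is progress, not shipping
```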
Build the eval harness first
Before touching prompts, write 50–100 real inputs with expected outputs. Grade them automatically. Every change to the system is a PR against this harness.
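A harness doesn't need a framework to start. A sketch of the shape, assuming a JSONL case file and a `call_model` stub you'd wire to your real model:

```python
import json

def call_model(prompt: str) -> str:
    """Stub: wire this to your actual model call."""
    raise NotImplementedError

def grade(expected: str, actual: str) -> bool:
    # Start with exact match; reach for fuzzier graders only when it fails you.
    return expected.strip().lower() == actual.strip().lower()

def run_eval(path: str = "eval_cases.jsonl") -> float:
    """One case per line: {"input": ..., "expected": ...}. Returns the pass rate."""
    with open(path) as f:
        cases = [json.loads(line) for line in f]
    passed = sum(grade(c["expected"], call_model(c["input"])) for c in cases)
    return passed / len(cases)
```

Run it in CI: a prompt tweak that drops the pass rate should fail the build the same way a broken unit test does.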
Use the smallest retrieval you can get away with
Most LLM wins come from retrieval and structure, not fine-tuning. Start with curated context, not the whole warehouse.
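"Smallest retrieval" can literally mean keyword overlap over a hand-curated snippet set. A sketch under that assumption; no vector database, no embeddings:

```python
def score(query: str, doc: str) -> int:
    # Naive keyword overlap: often enough when the corpus is small and curated.
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Return the k best-matching curated snippets to put in the prompt."""
    return sorted(corpus, key=lambda d: score(query, d), reverse=True)[:k]

# A few dozen vetted snippets beat dumping the warehouse into the context.
corpus = [
    "Refunds are processed within 5 business days.",
    "Enterprise plans include SSO and audit logs.",
]
print(retrieve("how long do refunds take?", corpus, k=1))
```

Upgrade to embeddings or a real index only after this stops winning on the eval harness.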
Add guardrails, not disclaimers
“AI may make mistakes” is not a guardrail. Constrain the action space: what tools can it call, what does it refuse, what gets human review?
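Constraining the action space can start as an explicit allowlist plus a human-review queue for anything risky. A sketch; the tool names below are hypothetical:

```python
ALLOWED_TOOLS = {"search_kb", "draft_reply"}  # hypothetical: runs unattended
NEEDS_REVIEW = {"issue_refund"}               # hypothetical: human signs off first

def route_tool_call(tool: str, args: dict) -> str:
    """Everything not explicitly allowed is refused, not disclaimed."""
    if tool in ALLOWED_TOOLS:
        return f"executed {tool}"           # dispatch to the real tool here
    if tool in NEEDS_REVIEW:
        return f"queued {tool} for review"  # nothing runs until a human approves
    return f"refused {tool}: not in the action space"

assert route_tool_call("delete_account", {}).startswith("refused")
```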
Instrument everything
Log prompts, outputs, tool calls, user edits, and outcomes. This is your dataset for the next release.
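A sketch of what "everything" can look like as one structured record per interaction; the field names and JSONL file are assumptions, not a standard:

```python
import json
import time
import uuid

def log_interaction(prompt: str, output: str, tool_calls: list,
                    user_edit: str | None, outcome: str) -> None:
    """Append one structured record per interaction. This file is tomorrow's
    eval set and, eventually, a fine-tuning corpus."""
    record = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "prompt": prompt,
        "output": output,
        "tool_calls": tool_calls,
        "user_edit": user_edit,  # what the human changed before accepting
        "outcome": outcome,      # e.g. "accepted", "edited", "discarded"
    }
    with open("interactions.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
```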
Ship small. Measure. Compound.
The teams winning with AI treat it like any other feature: incremental, measured, and owned by a real product manager.