The gap between demo and production
The best AI demo is usually a lie: the model answered well once, on a well-chosen input, with no adversarial traffic. Production is the opposite on all three counts.
Start from a measurable job
Before any code: what is the one job this feature does? What does success look like **in numbers**? Deflection rate. Time saved. Close rate. If you can’t express it as a metric, don’t ship it.
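One way to force the question is to write the metric down before the feature exists. A minimal sketch; the metric name, baseline, and target here are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class SuccessMetric:
    """The one number this feature is accountable for."""
    name: str        # e.g. "ticket_deflection_rate" (hypothetical)
    baseline: float  # what the current system achieves without the feature
    target: float    # what "ship it" means, agreed before any code

    def shipped(self, observed: float) -> bool:
        return observed >= self.target

# Hypothetical example: a support bot must beat a 12% baseline.
deflection = SuccessMetric("ticket_deflection_rate", baseline=0.12, target=0.20)
assert not deflection.shipped(0.15)  # 15% is progress, not shipping
```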
Build the eval harness first
Before touching prompts, write 50–100 real inputs with expected outputs. Grade them automatically. Every change to the system is a PR against this harness.
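A harness doesn't need a framework to start. A sketch of the shape, assuming a JSONL case file and a `call_model` stub you'd wire to your real model:

```python
import json

def call_model(prompt: str) -> str:
    """Stub: wire this to your actual model call."""
    raise NotImplementedError

def grade(expected: str, actual: str) -> bool:
    # Start with exact match; reach for fuzzier graders only when it fails you.
    return expected.strip().lower() == actual.strip().lower()

def run_eval(path: str = "eval_cases.jsonl") -> float:
    """One case per line: {"input": ..., "expected": ...}. Returns the pass rate."""
    with open(path) as f:
        cases = [json.loads(line) for line in f]
    passed = sum(grade(c["expected"], call_model(c["input"])) for c in cases)
    return passed / len(cases)
```

Run it in CI: a prompt tweak that drops the pass rate should fail the build the same way a broken unit test does.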
Use the smallest retrieval you can get away with
Most LLM wins come from retrieval and structure, not fine-tuning. Start with curated context, not the whole warehouse.
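"Smallest retrieval" can literally mean keyword overlap over a hand-curated snippet set. A sketch under that assumption; no vector database, no embeddings:

```python
def score(query: str, doc: str) -> int:
    # Naive keyword overlap: often enough when the corpus is small and curated.
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Return the k best-matching curated snippets to put in the prompt."""
    return sorted(corpus, key=lambda d: score(query, d), reverse=True)[:k]

# A few dozen vetted snippets beat dumping the warehouse into the context.
corpus = [
    "Refunds are processed within 5 business days.",
    "Enterprise plans include SSO and audit logs.",
]
print(retrieve("how long do refunds take?", corpus, k=1))
```

Upgrade to embeddings or a real index only after this stops winning on the eval harness.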
Add guardrails, not disclaimers
“AI may make mistakes” is not a guardrail. Constrain the action space: what tools can it call, what does it refuse, what gets human review?
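Constraining the action space can start as an explicit allowlist plus a human-review queue for anything risky. A sketch; the tool names below are hypothetical:

```python
ALLOWED_TOOLS = {"search_kb", "draft_reply"}  # hypothetical: runs unattended
NEEDS_REVIEW = {"issue_refund"}               # hypothetical: human signs off first

def route_tool_call(tool: str, args: dict) -> str:
    """Everything not explicitly allowed is refused, not disclaimed."""
    if tool in ALLOWED_TOOLS:
        return f"executed {tool}"           # dispatch to the real tool here
    if tool in NEEDS_REVIEW:
        return f"queued {tool} for review"  # nothing runs until a human approves
    return f"refused {tool}: not in the action space"

assert route_tool_call("delete_account", {}).startswith("refused")
```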
Instrument everything
Log prompts, outputs, tool calls, user edits, and outcomes. This is your dataset for the next release.
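A sketch of what "everything" can look like as one structured record per interaction; the field names and JSONL file are assumptions, not a standard:

```python
import json
import time
import uuid

def log_interaction(prompt: str, output: str, tool_calls: list,
                    user_edit: str | None, outcome: str) -> None:
    """Append one structured record per interaction. This file is tomorrow's
    eval set and, eventually, a fine-tuning corpus."""
    record = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "prompt": prompt,
        "output": output,
        "tool_calls": tool_calls,
        "user_edit": user_edit,  # what the human changed before accepting
        "outcome": outcome,      # e.g. "accepted", "edited", "discarded"
    }
    with open("interactions.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
```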
Ship small. Measure. Compound.
The teams winning with AI treat it like any other feature: incremental, measured, and owned by a real product manager.