How to evaluate an LLM in production is the most-skipped engineering question in AI. A production AI feature without an eval suite is a science project: when it breaks you cannot tell why, and when it gets better you cannot tell why either.
Eval pipeline we build into every project
- 01 Build the eval dataset from real (anonymized) production data. Aim for 100-500 examples covering common cases, edge cases, and known failure modes.
- 02 Define metrics that match the user-facing outcome: accuracy, faithfulness, citation correctness. Never rely on BLEU/ROUGE in isolation.
- 03 Run the eval in CI on every prompt or model change. Regressions block the merge.
- 04 Pipe the eval results into a dashboard so non-engineers can see them.
- 05 Re-curate the eval set quarterly. Production data shifts; the eval set should too.
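Steps 01 to 03 can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular tool: `EvalExample`, `exact_match`, `run_eval`, `find_regressions`, the tag names, and the 0.02 tolerance are all hypothetical, and the "model" is a stub standing in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalExample:
    prompt: str
    expected: str
    tag: str  # hypothetical buckets: "common", "edge", "known_failure"


def exact_match(output: str, expected: str) -> float:
    """Placeholder metric; swap in whatever matches the user-facing outcome."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    model: Callable[[str], str],
    examples: list[EvalExample],
    metric: Callable[[str, str], float] = exact_match,
) -> dict[str, float]:
    """Return the mean score overall and per tag."""
    buckets: dict[str, list[float]] = {}
    for ex in examples:
        score = metric(model(ex.prompt), ex.expected)
        for key in ("overall", ex.tag):
            buckets.setdefault(key, []).append(score)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


def find_regressions(
    current: dict[str, float],
    baseline: dict[str, float],
    tolerance: float = 0.02,  # assumed noise margin, tune per project
) -> list[str]:
    """List every baseline key whose score dropped by more than the tolerance."""
    return sorted(
        key
        for key, base in baseline.items()
        if current.get(key, 0.0) < base - tolerance
    )


# Toy data standing in for an anonymized production sample.
examples = [
    EvalExample("2+2?", "4", "common"),
    EvalExample("Capital of France?", "Paris", "edge"),
]


def stub_model(prompt: str) -> str:
    # Stand-in for the real LLM call.
    return {"2+2?": "4"}.get(prompt, "unknown")


scores = run_eval(stub_model, examples)
regressions = find_regressions(scores, baseline={"overall": 0.5, "edge": 0.5})
exit_code = 1 if regressions else 0  # a CI job would exit with this code
```

In a CI job, the last line would typically become `sys.exit(exit_code)` so that a non-zero code fails the check and blocks the merge. Reporting scores per tag as well as overall keeps a regression on edge cases from hiding behind a stable average.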
Tools we use
- LangSmith: hosted traces + eval datasets, UI for non-engineers.
- Braintrust: CI-native eval, programmable.
- Promptfoo: CLI eval suite that blocks PRs.
- Helicone: cost + latency observability, complements eval.
- Custom harnesses in Python or TypeScript when the metric is domain-specific.
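As an illustration of what a domain-specific metric in a custom harness might look like, here is a small Python sketch of a citation-correctness score. It assumes answers cite sources inline as bracketed IDs such as `[doc-1]`; that format, the `CITATION_PATTERN` regex, and the `citation_correctness` function are hypothetical and not part of any of the tools listed above.

```python
import re

# Assumed citation format: bracketed IDs such as [doc-1]. Adjust to your own.
CITATION_PATTERN = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def citation_correctness(answer: str, retrieved_ids: set[str]) -> float:
    """Fraction of cited IDs that were actually among the retrieved sources.

    An answer with no citations scores 0.0 here; that is a design choice,
    and some teams may prefer to track uncited answers separately.
    """
    cited = CITATION_PATTERN.findall(answer)
    if not cited:
        return 0.0
    valid = sum(1 for cid in cited if cid in retrieved_ids)
    return valid / len(cited)


score = citation_correctness(
    "Refunds take 5 days [doc-1]. A fee applies [doc-9].",
    retrieved_ids={"doc-1", "doc-2"},
)
# One of the two cited IDs was retrieved, so score is 0.5.
```

A function with this shape, taking a model output plus reference data and returning a float, can be averaged over the eval set like any other metric.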