AI cost optimization for production is mostly about discipline, not exotic infrastructure. Seven moves get us 40-70% savings on most builds without quality loss.
The seven moves
- 01Prompt caching. Anthropic and OpenAI both support it; we use it on every system prompt that doesn't change per request.
- 02Smaller models for sub-tasks. Use Haiku / GPT-4o-mini for classification, summarization, structured extraction; reserve Sonnet / GPT-4o for the reasoning step.
- 03Structured outputs over free text. Schemas force shorter completions and cleaner downstream consumption.
- 04Retrieval before generation. If the answer is in 200 documents, retrieve the relevant 5 and stop sending the entire corpus.
- 05Cost ceiling per request and per cohort. Alerts when a cohort drifts.
- 06Streaming with early-stop. Don't pay for the rest of the completion if the user navigated away.
- 07On-device or smaller-class models for high-volume sync paths. Whisper on-device, small classifiers locally.
Realistic savings, by move
| Move | Typical savings |
|---|---|
| Prompt caching on system prompts | 30-60% |
| Smaller model for sub-tasks | 20-50% |
| Structured outputs | 10-25% (also reliability) |
| Retrieval first, generation second | 30-70% |
| Cost ceilings + alerts | Prevents 5-10× incident spikes |