RAG vs fine-tuning is the wrong frame. The right frame is a sequence: start with prompting, add RAG when the knowledge is too large or changes too often to fit in a prompt, and fine-tune last, only when the first two are exhausted.
When RAG wins
- Knowledge updates frequently (docs, tickets, policies, product specs).
- You need citations or auditability for compliance.
- Knowledge is large (>10k documents) and doesn't fit in a context window.
- Multi-tenant: each customer has their own data, and it must never be mixed with another customer's.
- You want one base model serving many use cases via different retrieval indices.
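
The multi-tenant and "one model, many indices" points can be sketched as a store that keeps a separate index per tenant and only ever searches the caller's own. This is a minimal illustration, not a production retriever: `TenantIndexes` is a hypothetical name, and plain word overlap stands in for whatever embedding or keyword search you would actually use.

```python
from collections import defaultdict


class TenantIndexes:
    """Toy per-tenant store (hypothetical): one isolated document list per tenant."""

    def __init__(self):
        self._docs = defaultdict(list)  # tenant_id -> [(doc_id, text), ...]

    def add(self, tenant_id, doc_id, text):
        self._docs[tenant_id].append((doc_id, text))

    def search(self, tenant_id, query, k=3):
        """Return up to k doc ids from this tenant only, ranked by word overlap."""
        terms = set(query.lower().split())
        scored = []
        # .get() avoids creating an empty entry for unknown tenants.
        for doc_id, text in self._docs.get(tenant_id, []):
            overlap = len(terms & set(text.lower().split()))
            if overlap:
                scored.append((overlap, doc_id))
        # Highest overlap first; ties broken by doc id for stable output.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [doc_id for _, doc_id in scored[:k]]
```

Because the tenant id selects the index before any scoring happens, a query can never surface another customer's documents, and the same base model can sit behind every index.
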
When fine-tuning wins
- You need a specific output format / style / persona the prompt cannot reliably enforce.
- Inference latency or unit cost is a hard constraint, and a smaller fine-tuned model beats a large frontier model on your task.
- The task is structured extraction with consistent schemas and you have 1,000+ labeled examples.
- You need to teach a specialized vocabulary or domain language the base model genuinely doesn't know.
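
One way to make the ordering concrete is a small checklist function. The sketch below is a deliberate simplification: the `Requirements` fields and the `recommend` helper are hypothetical, and applying the 1,000-example threshold as a gate for all fine-tuning (rather than only for structured extraction) is an assumption made for brevity.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    fits_in_prompt: bool           # knowledge fits comfortably in the context window
    knowledge_changes_often: bool  # docs, tickets, policies, product specs
    needs_citations: bool          # auditability or compliance
    per_tenant_data: bool          # customer data must stay separated
    prompt_enforces_format: bool   # prompting alone reliably yields the format/style
    latency_or_cost_bound: bool    # hard latency or unit-cost constraint
    labeled_examples: int          # labeled training examples available


def recommend(req):
    """Return the techniques to try, in order. Prompting always comes first."""
    steps = ["prompting"]
    needs_rag = (
        not req.fits_in_prompt
        or req.knowledge_changes_often
        or req.needs_citations
        or req.per_tenant_data
    )
    if needs_rag:
        steps.append("rag")
    wants_fine_tune = (not req.prompt_enforces_format) or req.latency_or_cost_bound
    # Assumption: without roughly 1,000 labeled examples, fine-tuning is not yet viable.
    if wants_fine_tune and req.labeled_examples >= 1000:
        steps.append("fine-tuning")
    return steps
```

The function does not model "exhausting" an earlier step; in practice you would only move down the list after evaluating the previous option on your own task.
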
When you do both
A common production pattern combines them: fine-tune a small model for the structured output or extraction task, and use RAG to supply the knowledge it reasons over, so that the fine-tuned model consumes the retrieved context.
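
A rough sketch of that wiring follows. Everything here is assumed for illustration: `retrieve` and `generate` are placeholder callables standing in for your retrieval layer and your fine-tuned model, and the prompt layout and JSON output shape are made up rather than taken from any particular library.

```python
import json


def build_prompt(question, passages):
    """Number the retrieved passages so the model can cite them by index."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer as JSON."


def answer(question, retrieve, generate):
    """Retrieve context, pass it to the fine-tuned model, parse its structured output.

    retrieve: callable(question) -> list of passage strings (hypothetical)
    generate: callable(prompt) -> JSON string from the fine-tuned model (hypothetical)
    """
    passages = retrieve(question)
    raw = generate(build_prompt(question, passages))
    return json.loads(raw)  # raises ValueError if the model breaks the schema


# Stand-ins so the sketch runs without any model or index.
def fake_retrieve(question):
    return ["Refunds are issued within 14 days."]


def fake_generate(prompt):
    return json.dumps({"answer": "14 days", "sources": [1]})
```

Keeping retrieval and generation behind plain callables means the index can be refreshed and the model retrained on separate schedules.
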
Cost comparison, roughly
| Approach | Setup | Per-query cost | Maintenance |
|---|---|---|---|
| Prompting only | Low | Higher (more tokens per query) | Low |
| RAG | Medium (indexing + retrieval infra) | Tokens + retrieval | Medium (index refresh, eval) |
| Fine-tuning | High (training data + runs) | Lower (fewer tokens per query) | High (retraining cadence) |
| RAG + fine-tune | Highest | Lowest at scale | Highest |
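
The trade-off in the table is setup cost against per-query cost, so the useful number is the query volume at which the higher-setup option catches up. The helpers below are a minimal sketch of that arithmetic; the function names are hypothetical, costs are in whatever unit you choose (integer cents avoid rounding surprises), and maintenance cost is ignored.

```python
import math


def total_cost(setup, per_query, queries):
    """Total spend after a given number of queries."""
    return setup + per_query * queries


def break_even_queries(setup_low, per_query_low, setup_high, per_query_high):
    """Smallest query count at which the high-setup option costs no more.

    Returns None if the high-setup option never catches up, i.e. its
    per-query cost is not lower.
    """
    saving = per_query_low - per_query_high
    if saving <= 0:
        return None
    return max(0, math.ceil((setup_high - setup_low) / saving))
```

Plug in your own measured numbers: if the expected volume sits well below the break-even point, the cheaper-to-set-up option is the better choice.
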