How to deploy an LLM on-premise in 2026: pick an open-weights model (Llama 3 70B is the current default), serve it with vLLM behind a private OpenAI-compatible API, and front it with auth + cost telemetry + an eval loop. Sounds simple. The operating discipline is the hard part.
The minimal stack
- Hardware: 8× H100 80GB or 8× A100 80GB for Llama 3 70B at production throughput.
- Inference server: vLLM (default) or TensorRT-LLM (highest throughput).
- API gateway: OpenAI-compatible endpoint so client SDKs work unchanged.
- Auth + RBAC: API keys per tenant / per service.
- Observability: Prometheus + Grafana for latency / GPU util / errors.
- Cost telemetry: token-level accounting per tenant.
- Eval loop: regression suite on a refreshed dataset.
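To make the cost-telemetry item concrete, here is a minimal sketch of per-tenant token accounting. It assumes the gateway can see each response body and that responses carry the `usage` object with `prompt_tokens` and `completion_tokens`, as OpenAI-compatible chat completion responses normally do. `TokenLedger`, `TenantUsage`, and the tenant names are hypothetical, and a real deployment would persist the counters rather than keep them in memory.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TenantUsage:
    """Running token totals for one tenant."""
    prompt_tokens: int = 0
    completion_tokens: int = 0
    requests: int = 0

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens


class TokenLedger:
    """In-memory per-tenant ledger (hypothetical helper, not part of any library)."""

    def __init__(self) -> None:
        self._usage = defaultdict(TenantUsage)

    def record(self, tenant_id: str, response: dict) -> None:
        # Read the `usage` object from an OpenAI-compatible response body.
        # Missing fields count as zero so a malformed response cannot crash the gateway.
        usage = response.get("usage") or {}
        entry = self._usage[tenant_id]
        entry.prompt_tokens += int(usage.get("prompt_tokens", 0))
        entry.completion_tokens += int(usage.get("completion_tokens", 0))
        entry.requests += 1

    def report(self) -> dict:
        # Return a plain dict so callers cannot create tenants by accident.
        return dict(self._usage)


# Illustrative responses; the tenant IDs and token counts are made up.
ledger = TokenLedger()
ledger.record("team-search", {"usage": {"prompt_tokens": 120, "completion_tokens": 30}})
ledger.record("team-search", {"usage": {"prompt_tokens": 80, "completion_tokens": 20}})
ledger.record("team-support", {"usage": {"prompt_tokens": 50, "completion_tokens": 10}})

for tenant, usage in sorted(ledger.report().items()):
    print(tenant, usage.requests, usage.total_tokens)
```

Keeping prompt and completion tokens separate is a deliberate choice here: the two often have different cost profiles, so a chargeback model can weight them differently later.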
Quantization to fit smaller hardware
Don't have 8 H100s? Run Llama 3 70B with 4-bit AWQ or GPTQ quantization on 2× A100 80GB. The quality drop is usually small (roughly 1-2 percentage points on most evals). For smaller models (Llama 3 8B, Mistral 7B), a single L40S or even an L4 is enough.
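The arithmetic behind that sizing can be sketched as follows. This is a back-of-the-envelope estimate, assuming a nominal 70 billion parameters and counting only the weights; quantization metadata (scales, zero points), activations, and the KV cache all need memory on top, so read the result as a lower bound. The function names are illustrative.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (10^9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9


def headroom_gb(weights_gb: float, gpu_count: int, gpu_mem_gb: float = 80.0) -> float:
    """Memory left for KV cache and activations after loading the weights."""
    return gpu_count * gpu_mem_gb - weights_gb


PARAMS_70B = 70e9  # nominal parameter count (assumption)

fp16_gb = weight_memory_gb(PARAMS_70B, 16)  # 16-bit weights
int4_gb = weight_memory_gb(PARAMS_70B, 4)   # 4-bit AWQ/GPTQ weights

print(f"FP16 weights:  {fp16_gb:.0f} GB")
print(f"4-bit weights: {int4_gb:.0f} GB")
print(f"Headroom on 2x 80GB at FP16:  {headroom_gb(fp16_gb, 2):.0f} GB")
print(f"Headroom on 2x 80GB at 4-bit: {headroom_gb(int4_gb, 2):.0f} GB")
```

At 16 bits the weights alone nearly fill two 80GB cards, leaving little room for the KV cache; at 4 bits most of the memory is free for it, which is why the two-GPU setup becomes workable.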
What it costs
| Setup | Cost (monthly amortized unless noted) |
|---|---|
| Single H100 node, self-hosted | €4k - €7k |
| Single H100 node, rented (Lambda, CoreWeave) | €8k - €15k |
| 8× A100 cluster on-prem | €12k - €20k |
| Build + ops engagement (one-time) | €40k - €120k |
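To compare these monthly figures against per-token API pricing, one option is to convert them into a cost per million tokens. The sketch below does that for the self-hosted H100 row. The throughput and utilization values are placeholders, not benchmarks: substitute numbers measured on your own hardware and traffic. `cost_per_million_tokens` is an illustrative helper, and a 30-day month is assumed.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month


def cost_per_million_tokens(
    monthly_cost_eur: float,
    tokens_per_second: float,
    utilization: float,
) -> float:
    """Amortized cost in EUR per 1M tokens served.

    tokens_per_second: sustained throughput while the node is busy.
    utilization: fraction of the month the node is actually serving (0 < u <= 1).
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_month = tokens_per_second * utilization * SECONDS_PER_MONTH
    return monthly_cost_eur / tokens_per_month * 1_000_000


# Cost range from the table row "Single H100 node, self-hosted".
LOW_EUR, HIGH_EUR = 4_000, 7_000

# Placeholder workload numbers (assumptions, not measurements).
ASSUMED_TOKENS_PER_SECOND = 1_000
ASSUMED_UTILIZATION = 0.5

low = cost_per_million_tokens(LOW_EUR, ASSUMED_TOKENS_PER_SECOND, ASSUMED_UTILIZATION)
high = cost_per_million_tokens(HIGH_EUR, ASSUMED_TOKENS_PER_SECOND, ASSUMED_UTILIZATION)
print(f"EUR per 1M tokens: {low:.2f} to {high:.2f}")
```

The result scales inversely with utilization: halving utilization doubles the cost per token, so idle capacity is usually the largest lever on unit cost.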