Claude vs GPT-4o vs open-weights , the honest answer for production is: it depends on the task, the data, the deploy target, and the unit economics. The brand is the last variable, not the first.
Our defaults
- Claude Sonnet , multi-step reasoning, tool use, agents with state.
- GPT-4o / o-series , structured extraction at scale, sync calls with strict schemas.
- Gemini , long context (1M tokens) or vision-heavy work.
- Llama 3 70B / Mistral / Qwen 2.5 , when data residency or unit cost demands it.
When to pick open-weights
- Data cannot leave your VPC / region (GDPR, HIPAA, financial regulators).
- Inference volume so high that token pricing exceeds amortized GPU cost.
- You need a smaller, faster model specifically tuned for your domain.
- Your customer is procurement-sensitive about US cloud LLM vendors.
When to pick closed-source frontier
- You need state-of-the-art reasoning on hard prompts (legal, research, code).
- You want fast iteration without GPU operations overhead.
- Your volume is moderate (most B2B SaaS features).
- You need the best image / vision / audio modality available.
The benchmark trap
Public benchmarks don't predict performance on your data. We benchmark every shortlisted model on the customer's actual eval set in discovery week before recommending a default. The differences are often surprising.