# Comparing 7 Open-Source Models for RAG on City Data
Founder, Zeever.ca
We ran 7 open-source models through the same benchmark suite and scored them with our LLM-as-judge. Every model got identical retrieved context, so the only variable was the generation step. We wanted the best answer quality at the lowest cost for a system grounded in city government content.
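For context, here is a minimal sketch of what that harness looks like: one OpenAI-compatible client pointed at Fireworks, the same pre-retrieved chunks passed to every candidate, and nothing else varying between runs. The model IDs, prompt wording, and helper names below are illustrative assumptions, not our exact production code.

```python
# Minimal benchmark-loop sketch. Model IDs, prompts, and helper names are
# illustrative assumptions; only the structure mirrors the setup described here.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",  # Fireworks' OpenAI-compatible endpoint
    api_key="YOUR_FIREWORKS_API_KEY",
)

# Hypothetical Fireworks model identifiers, one per candidate.
CANDIDATES = [
    "accounts/fireworks/models/gpt-oss-120b",
    "accounts/fireworks/models/kimi-k2p5",
    # ...the remaining five candidates
]

def answer(model: str, question: str, chunks: list[str]) -> str:
    """Generate an answer from pre-retrieved chunks; retrieval is never re-run per model."""
    context = "\n\n".join(chunks)
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Answer using only the provided city sources and cite them."},
            {"role": "user",
             "content": f"Sources:\n{context}\n\nQuestion: {question}"},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content
```

Holding retrieval fixed like this is what lets the table below isolate generation quality.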
## Results
| Model | Relevance | Groundedness | Citation quality | Latency | Cost (all prompts) |
|---|---|---|---|---|---|
| GPT-oss 120B (selected) | 0.94 | 1.00 | 0.83 | 5.8s | $0.006 |
| Kimi K2.5 | 0.88 | 0.96 | 0.96 | 6.3s | $0.041 |
| GLM-5 | 0.86 | 0.96 | 0.92 | 9.5s | $0.031 |
| DeepSeek v3.2 | 0.83 | 0.96 | 0.92 | 8.4s | $0.035 |
| Mixtral 8x22B | 0.82 | 0.92 | 0.71 | 3.9s | $0.022 |
| GLM-4.7 | 0.78 | 0.92 | 0.86 | 17.9s | $0.010 |
| DeepSeek v3.1 | 0.75 | 0.88 | 0.75 | 4.0s | $0.024 |
## The surprise winner
GPT-oss 120B came out on top for relevance (0.94), hit perfect groundedness (1.00), and was the cheapest option by a wide margin: $0.006 for the full benchmark run, roughly 7 times cheaper than the next-best model on relevance. We made it our default.
Kimi K2.5 had the best citation quality (0.96) but cost about 7 times more. DeepSeek v3.1, the model we originally started with, finished last on judge relevance (0.75).
## Methodology
All models were accessed through Fireworks.ai over the same OpenAI-compatible API. Chunks were retrieved once and reused for every model, so score differences come purely from the generation step. The judge (GPT-oss 120B) evaluated every answer, including its own; we recognize the risk of self-evaluation bias and plan to test cross-model judging in future work.
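As a rough illustration, the judging step looks something like the sketch below. The rubric wording, JSON keys, and model ID are assumptions for illustration; our actual judge prompt is longer and more detailed.

```python
# Simplified LLM-as-judge sketch. The rubric text, JSON keys, and model ID
# are illustrative assumptions, not our exact judge prompt.
import json
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",  # same OpenAI-compatible endpoint as above
    api_key="YOUR_FIREWORKS_API_KEY",
)

JUDGE_MODEL = "accounts/fireworks/models/gpt-oss-120b"  # hypothetical Fireworks ID

RUBRIC = (
    "Score the answer on three 0-1 scales and return JSON with keys "
    "'relevance', 'groundedness', and 'citation'. Groundedness means every "
    "claim is supported by the sources; citation means sources are cited correctly."
)

def judge(question: str, chunks: list[str], answer_text: str) -> dict:
    """Score one model's answer against the same retrieved chunks it was given."""
    context = "\n\n".join(chunks)
    resp = client.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user",
             "content": f"Sources:\n{context}\n\nQuestion: {question}\n\nAnswer:\n{answer_text}"},
        ],
        temperature=0,
        # JSON mode; drop this if the endpoint you use doesn't support it.
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)
```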