
Comparing 7 Open-Source Models for RAG on City Data


Colin Smillie

Founder, Zeever.ca

We ran 7 open-source models through the same benchmark suite and scored their answers with our LLM-as-judge. Every model received identical retrieved context, so the only variable was the generation step. We wanted the best answer quality at the lowest cost for a system grounded in city government content.
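For context, here is a minimal sketch of what that scoring step can look like, assuming a standard OpenAI-compatible client pointed at Fireworks and a simple JSON rubric. The metric names mirror the table below, but the prompt wording and model identifier are illustrative placeholders, not our production harness.

```python
import json
from openai import OpenAI  # standard OpenAI SDK pointed at an OpenAI-compatible endpoint

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",  # Fireworks' OpenAI-compatible API
    api_key="FIREWORKS_API_KEY",
)

JUDGE_MODEL = "accounts/fireworks/models/gpt-oss-120b"  # illustrative identifier for the judge


def judge_answer(question: str, context: str, answer: str) -> dict:
    """Score one generated answer on relevance, groundedness, and citation quality (0.0-1.0)."""
    prompt = (
        "Score the ANSWER against the QUESTION and the retrieved CONTEXT.\n"
        'Return JSON: {"relevance": x, "groundedness": x, "citation": x}, each 0.0-1.0.\n\n'
        f"QUESTION:\n{question}\n\nCONTEXT:\n{context}\n\nANSWER:\n{answer}"
    )
    resp = client.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)
```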

Results

| Model | Relevance | Groundedness | Citation | Latency | Cost (full benchmark) |
| --- | --- | --- | --- | --- | --- |
| GPT-oss 120B (selected) | 0.94 | 1.00 | 0.83 | 5.8 s | $0.006 |
| Kimi K2.5 | 0.88 | 0.96 | 0.96 | 6.3 s | $0.041 |
| GLM-5 | 0.86 | 0.96 | 0.92 | 9.5 s | $0.031 |
| DeepSeek v3.2 | 0.83 | 0.96 | 0.92 | 8.4 s | $0.035 |
| Mixtral 8x22B | 0.82 | 0.92 | 0.71 | 3.9 s | $0.022 |
| GLM-4.7 | 0.78 | 0.92 | 0.86 | 17.9 s | $0.010 |
| DeepSeek v3.1 | 0.75 | 0.88 | 0.75 | 4.0 s | $0.024 |

The surprise winner

GPT-oss 120B came out on top for relevance (0.94), hit perfect groundedness (1.00), and was the cheapest option by a wide margin: $0.006 for the full benchmark run, roughly 7 times less than the runner-up. We made it our default.

Kimi K2.5 had the best citation quality (0.96) but cost 7 times more. DeepSeek v3.1, the model we started with, finished last on judge relevance.

Methodology

All models were accessed through Fireworks.ai via the same OpenAI-compatible API. Chunks were retrieved once and reused for every model, so score differences come purely from the generation step. The judge (GPT-oss 120B) evaluated all answers, including its own. We acknowledge the risk of self-evaluation bias and plan to test cross-model judging in future work.
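To make the setup concrete, here is a minimal sketch of the comparison loop under those constraints. The model identifiers, the retrieve() helper, and the sample question are placeholders; the structure is the point: retrieve once, then generate with every candidate model from the same chunks.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",  # every model behind the same OpenAI-compatible API
    api_key="FIREWORKS_API_KEY",
)

CANDIDATE_MODELS = [  # illustrative identifiers, not exact Fireworks model names
    "accounts/fireworks/models/gpt-oss-120b",
    "accounts/fireworks/models/kimi-k2p5",
    "accounts/fireworks/models/deepseek-v3p1",
]

BENCHMARK_QUESTIONS = ["When is yard waste collected?"]  # placeholder benchmark prompt


def retrieve(question: str) -> str:
    # Placeholder for the shared retrieval step; in the real system the same
    # top-k chunks of city content are returned regardless of which model answers.
    return "...retrieved city content chunks..."


def answer_with(model: str, question: str, context: str) -> str:
    """Generate a grounded answer from pre-retrieved context, so generation is the only variable."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer only from the provided city content and cite your sources."},
            {"role": "user", "content": f"CONTEXT:\n{context}\n\nQUESTION:\n{question}"},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content


answers: dict[str, list[str]] = {}
for question in BENCHMARK_QUESTIONS:
    context = retrieve(question)  # retrieval happens once per question
    for model in CANDIDATE_MODELS:
        answers.setdefault(model, []).append(answer_with(model, question, context))
```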