
How We Built an LLM-as-Judge to Replace Keyword Scoring


Colin Smillie

Founder, Zeever.ca

Our first evaluation system was simple: check whether specific keywords show up in the answer. If we expected the word "fine" and the model said "penalty," it scored zero. The answering system was performing better than the numbers suggested.

The clearest example was the question "What happens if I build without a permit in Toronto?" The answer covered penalties, fees, stop-work consequences, and legal risks in detail. The keyword scorer gave it 0.00 because the wording didn't match the expected signals ("enforcement," "stop work," "fine," "order"). A good answer, scored as a total failure.
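A minimal sketch of how a keyword scorer like this fails (the function name and exact matching rule are illustrative, not our production code):

```python
def keyword_score(answer: str, expected_keywords: list[str]) -> float:
    """Fraction of expected keywords that appear verbatim in the answer."""
    answer_lower = answer.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in answer_lower)
    return hits / len(expected_keywords) if expected_keywords else 0.0

# A thorough answer phrased with synonyms still scores zero:
answer = "You may face penalties, remediation fees, and legal action."
print(keyword_score(answer, ["enforcement", "stop work", "fine", "order"]))  # → 0.0
```

Substring matching has no notion of "penalty" and "fine" being the same concept, which is exactly the failure mode above.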

The fix: let a model judge the model

We built an LLM-as-judge that scores each answer on four dimensions, each rated 0.0 to 1.0:

  • Relevance: Does the answer directly address the question?
  • Completeness: Does it cover the expected topics (semantically, not by keyword)?
  • Groundedness: Is it supported by evidence, not hallucinated?
  • Citation quality: Are sources cited and relevant?

The prompt tells the judge explicitly that expected topics are concepts, not exact words. It runs at temperature=0 so results are deterministic. The judge is the same open-source model we use for answer generation (GPT-oss 120B via Fireworks.ai).
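A sketch of how such a judge request can be assembled, assuming an OpenAI-compatible chat endpoint (which Fireworks.ai provides); the model id, prompt wording, and payload field names here are assumptions for illustration, not our exact production code:

```python
import json

# Illustrative judge instructions; the real prompt is more detailed.
JUDGE_PROMPT = (
    "You are an evaluation judge. Rate the answer on four dimensions, each "
    "0.0 to 1.0: relevance, completeness, groundedness, citation_quality. "
    "Expected topics are concepts, not exact words: 'penalty' can satisfy "
    "the topic 'fine'. Return only JSON with those four keys."
)

def build_judge_request(question: str, answer: str,
                        expected_topics: list[str],
                        evidence: list[str]) -> dict:
    """Build a deterministic chat-completion payload for the judge model."""
    return {
        "model": "accounts/fireworks/models/gpt-oss-120b",  # assumed model id
        "temperature": 0,  # deterministic scoring
        "messages": [
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": json.dumps({
                "question": question,
                "answer": answer,
                "expected_topics": expected_topics,
                "evidence": evidence,
            })},
        ],
    }
```

The returned dict is what gets POSTed to the chat-completions endpoint; parsing the model's JSON reply yields the four scores.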

Results

Metric             Keyword Score   LLM Judge Score
Relevance          0.60            0.88
Groundedness       n/a             1.00
Citation quality   0.62            0.75

100% groundedness means every claim in every answer was supported by retrieved evidence from Toronto.ca. No hallucination was detected.

Cost

Each judge call costs about $0.0002 at Fireworks pricing. A full 100-question run with judging comes in under $0.02. We run both keyword and judge scoring together so we can see how they compare.
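The arithmetic behind that figure, using the approximate numbers above:

```python
cost_per_call = 0.0002   # USD per judge call, approximate Fireworks pricing
calls = 100              # one judge call per question in a full run
print(f"${cost_per_call * calls:.2f} per full run")  # → $0.02 per full run
```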