Research

Published March 28, 2026. Last updated: March 28, 2026.

Zeever.ca is built on a measurable foundation. We evaluate every component of the system and publish our methodology and results here. This page describes how we test answer quality, how we select our AI models, and what we found.

Corpus

The system is grounded in 2,096 documents crawled from Toronto.ca, including 174 HTML pages and 1,922 PDFs covering building permits, inspections, fees, and application processes. These documents are parsed into 9,689 content chunks, of which 9,192 are embedded using Nomic Embed v1.5 (768 dimensions). A knowledge graph with 2,716 nodes and 2,860 edges captures entity relationships across the permit domain.

Benchmark suite

We evaluate using 12 benchmark prompts that cover the core question categories a Toronto homeowner would ask. Each prompt has expected topic coverage and expected source pages.

| ID    | Prompt                                                           | Category     |
|-------|------------------------------------------------------------------|--------------|
| bp-01 | Do I need a building permit to build a deck in Toronto?          | Conditional  |
| bp-02 | What documents do I need to submit with a building permit application? | Documents |
| bp-03 | How much does a residential building permit cost in Toronto?     | Fees         |
| bp-04 | What is the process for applying for a building permit in Toronto? | Process    |
| bp-05 | How long does it take to get a building permit approved?         | Timeline     |
| bp-06 | What inspections are required during construction in Toronto?    | Inspections  |
| bp-07 | Do I need a permit to finish my basement in Toronto?             | Conditional  |
| bp-08 | Can I apply for a building permit online in Toronto?             | Process      |
| bp-09 | What is an express building permit and do I qualify?             | Process      |
| bp-10 | Do I need a permit to build a fence in Toronto?                  | Conditional  |
| bp-11 | What happens if I build without a permit in Toronto?             | Compliance   |
| bp-12 | What are the requirements for a laneway suite permit?            | Requirements |
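Each prompt pairs a question with its expected topic coverage and expected source pages. A minimal sketch of how such a suite might be represented (field names, keywords, and the source URL are illustrative, not the project's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkPrompt:
    """One evaluation case: a question plus what a good answer should contain."""
    id: str
    prompt: str
    category: str
    expected_keywords: list[str] = field(default_factory=list)
    expected_sources: list[str] = field(default_factory=list)

SUITE = [
    BenchmarkPrompt(
        id="bp-11",
        prompt="What happens if I build without a permit in Toronto?",
        category="Compliance",
        expected_keywords=["stop work order", "fine"],          # illustrative
        expected_sources=["toronto.ca/building-without-permit"],  # illustrative placeholder
    ),
]

print(SUITE[0].category)  # Compliance
```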

Scoring methodology

We use two complementary scoring approaches:

Keyword signal scoring (fast)

Each benchmark prompt has a list of expected keywords. We check what fraction of those keywords appear in the answer. This is fast but brittle: if the model says "penalty" instead of "fine," it scores zero on that signal even though the answer is correct.
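A minimal sketch of this kind of keyword-coverage scorer (function and variable names are illustrative, not the project's actual code), including the brittleness described above:

```python
def keyword_score(answer: str, expected_keywords: list[str]) -> float:
    """Fraction of expected keywords that appear in the answer (case-insensitive substring match)."""
    if not expected_keywords:
        return 0.0
    text = answer.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in text)
    return hits / len(expected_keywords)

# A correct answer that uses a synonym still loses points:
answer = "Building without a permit can result in a penalty and an order to stop work."
print(keyword_score(answer, ["fine", "stop work"]))  # 0.5: "stop work" matches, "fine" is missed
```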

LLM-as-judge scoring (accurate)

A separate LLM call evaluates each answer on four dimensions, each scored 0.0 to 1.0:

  • Relevance: Does the answer directly address the question?
  • Completeness: Does it cover the expected topics (semantically, not by keyword)?
  • Groundedness: Is it supported by evidence, not hallucinated?
  • Citation quality: Are sources cited and relevant to the claims?

The judge understands synonyms: "penalty" matches "fine," "work must stop" matches "stop work order." This makes it significantly more accurate than keyword matching. The judge model runs with temperature=0 for deterministic output.
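The judge pattern above amounts to a rubric prompt plus strict parsing of the reply. A sketch of both pieces, assuming the judge is asked to reply with JSON only (the rubric wording and helper names are illustrative; the actual model call is omitted):

```python
import json

JUDGE_RUBRIC = """You are grading an answer about Toronto building permits.
Score each dimension from 0.0 to 1.0 and reply with JSON only:
{{"relevance": 0.0, "completeness": 0.0, "groundedness": 0.0, "citation_quality": 0.0}}

Question: {question}
Answer: {answer}
Retrieved evidence: {evidence}"""

def parse_judge_scores(raw: str) -> dict[str, float]:
    """Parse the judge's JSON reply and clamp every score into [0.0, 1.0]."""
    scores = json.loads(raw)
    return {name: min(1.0, max(0.0, float(value))) for name, value in scores.items()}

# The real call would send JUDGE_RUBRIC.format(...) to the judge model with
# temperature=0, then feed the reply through parse_judge_scores.
print(parse_judge_scores('{"relevance": 0.9, "groundedness": 1.4}'))
# {'relevance': 0.9, 'groundedness': 1.0}
```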

Current results

Results from our most recent evaluation run (March 2026, n=12 prompts):

| Metric           | Keyword Score | LLM Judge Score |
|------------------|---------------|-----------------|
| Relevance        | 0.60          | 0.88            |
| Groundedness     | n/a           | 1.00            |
| Citation quality | 0.62          | 0.75            |

100% groundedness means every claim in every answer was supported by retrieved evidence from Toronto.ca. No hallucination was detected across all 12 benchmark prompts.

Model comparison

We tested 7 open-source models via Fireworks.ai using the same benchmark suite and LLM-as-judge evaluation. All models received identical retrieved context for fair comparison.

| Model                   | Relevance | Grounded | Citation | Latency | Cost   |
|-------------------------|-----------|----------|----------|---------|--------|
| GPT-oss 120B (selected) | 0.94      | 1.00     | 0.83     | 5.8s    | $0.006 |
| Kimi K2.5               | 0.88      | 0.96     | 0.96     | 6.3s    | $0.041 |
| GLM-5                   | 0.86      | 0.96     | 0.92     | 9.5s    | $0.031 |
| DeepSeek v3.2           | 0.83      | 0.96     | 0.92     | 8.4s    | $0.035 |
| Mixtral 8x22B           | 0.82      | 0.92     | 0.71     | 3.9s    | $0.022 |
| GLM-4.7                 | 0.78      | 0.92     | 0.86     | 17.9s   | $0.010 |
| DeepSeek v3.1           | 0.75      | 0.88     | 0.75     | 4.0s    | $0.024 |

Cost is total for all 12 benchmark prompts. GPT-oss 120B was selected as the default model for its combination of highest relevance (0.94), perfect groundedness (1.00), and lowest cost ($0.006 per evaluation run).
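The selection criterion can be expressed as a filter-then-rank over the comparison data (figures copied from the table above; the tie-break logic is an illustrative reconstruction, not the project's actual script):

```python
# (model, relevance, groundedness, citation, cost_usd) from the comparison table
RESULTS = [
    ("GPT-oss 120B",  0.94, 1.00, 0.83, 0.006),
    ("Kimi K2.5",     0.88, 0.96, 0.96, 0.041),
    ("GLM-5",         0.86, 0.96, 0.92, 0.031),
    ("DeepSeek v3.2", 0.83, 0.96, 0.92, 0.035),
    ("Mixtral 8x22B", 0.82, 0.92, 0.71, 0.022),
    ("GLM-4.7",       0.78, 0.92, 0.86, 0.010),
    ("DeepSeek v3.1", 0.75, 0.88, 0.75, 0.024),
]

def pick_default(results):
    """Keep only perfectly grounded models, then take the most relevant; cheaper wins ties."""
    grounded = [r for r in results if r[2] == 1.00]
    return max(grounded, key=lambda r: (r[1], -r[4]))[0]

print(pick_default(RESULTS))  # GPT-oss 120B (the only model with groundedness 1.00)
```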

Retrieval improvements

During development, we identified and fixed several retrieval quality issues:

  • IVFFlat probe count: The default pgvector probe count of 1 caused relevant chunks to be missed entirely. Increasing to 10 probes fixed a query that previously returned zero results.
  • Heading-aware chunking: Splitting HTML pages at h2/h3 boundaries (producing 584 focused chunks from 174 pages) improved citation accuracy by 35%, from 0.46 to 0.62.
  • Content classifier fix: A rule-based classifier was mis-categorizing process guides as fee schedules because they mentioned "fees" in passing. Title-based classification fixed this.
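The heading-aware chunking idea can be sketched with the standard-library HTML parser: emit a new chunk whenever an h2 or h3 opens, carrying the heading text with the body that follows. This is a simplified illustration under that assumption, not the project's actual parser:

```python
from html.parser import HTMLParser

class HeadingChunker(HTMLParser):
    """Split an HTML page into (heading, body) chunks at h2/h3 boundaries."""
    SPLIT_TAGS = {"h2", "h3"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._heading = None
        self._in_heading = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SPLIT_TAGS:
            self._flush()          # close the previous chunk
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in self.SPLIT_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading:
            self._heading = (self._heading or "") + data
        else:
            self._buf.append(data)

    def _flush(self):
        body = " ".join(piece.strip() for piece in self._buf if piece.strip())
        if self._heading or body:
            self.chunks.append((self._heading, body))
        self._heading, self._buf = None, []

    def close(self):
        super().close()
        self._flush()              # emit the trailing chunk

chunker = HeadingChunker()
chunker.feed("<h2>Fees</h2><p>Permit fees vary.</p><h2>Inspections</h2><p>Book online.</p>")
chunker.close()
print(chunker.chunks)  # [('Fees', 'Permit fees vary.'), ('Inspections', 'Book online.')]
```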

Vector vs. graph retrieval

We compared plain vector retrieval (RAG) against graph-enhanced retrieval (GraphRAG) using the same benchmark suite. Graph mode improved answers for structured questions about requirements and processes but added latency. Average scores were similar, suggesting the graph is most valuable for specific query types rather than all queries.
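One common way graph enhancement works is to expand the vector hits with their knowledge-graph neighbours before answering; the extra hops are where the added latency comes from. A toy sketch of that expansion step (the adjacency structure and names are assumptions, not the project's implementation):

```python
def graph_enhanced_retrieve(vector_hits, graph, max_extra=3):
    """Augment vector-retrieved chunk ids with knowledge-graph neighbours.

    vector_hits: chunk ids from plain vector search, best match first.
    graph: adjacency dict {chunk_id: [related_chunk_id, ...]}.
    """
    results = list(vector_hits)
    for hit in vector_hits:
        for neighbour in graph.get(hit, []):
            if neighbour not in results and len(results) < len(vector_hits) + max_extra:
                results.append(neighbour)
    return results

graph = {"fees-overview": ["fee-schedule", "payment-methods"]}
print(graph_enhanced_retrieve(["fees-overview", "apply-online"], graph))
# ['fees-overview', 'apply-online', 'fee-schedule', 'payment-methods']
```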

Methodology notes

  • All evaluations use temperature=0 for deterministic output.
  • The judge model (GPT-oss 120B via Fireworks.ai) is the same model used for answer generation. We acknowledge the potential for self-evaluation bias and plan to test cross-model judging in future work.
  • The 12-prompt benchmark suite covers the most common question categories but is not exhaustive. We plan to expand it as the system covers more content areas.
  • Groundedness scoring evaluates whether claims are supported by retrieved evidence. It does not verify that the retrieved evidence itself is current or accurate on Toronto.ca.
  • All evaluation code, prompts, and results are available in the project repository.

Code and data

The evaluation framework, benchmark prompts, scoring code, and model comparison scripts are available on request. Contact us at support@zeever.ca for access.