Research
Published March 28, 2026. Last updated: March 28, 2026.
Zeever.ca is built on a measurable foundation. We evaluate every component of the system and publish our methodology and results here. This page describes how we test answer quality, how we select our AI models, and what we found.
Corpus
The system is grounded in 2,096 documents crawled from Toronto.ca, including 174 HTML pages and 1,922 PDFs covering building permits, inspections, fees, and application processes. These documents are parsed into 9,689 content chunks, of which 9,192 are embedded using Nomic Embed v1.5 (768 dimensions). A knowledge graph with 2,716 nodes and 2,860 edges captures entity relationships across the permit domain.
Benchmark suite
We evaluate using 12 benchmark prompts that cover the core question categories a Toronto homeowner would ask. Each prompt has expected topic coverage and expected source pages.
| ID | Prompt | Category |
|---|---|---|
| bp-01 | Do I need a building permit to build a deck in Toronto? | Conditional |
| bp-02 | What documents do I need to submit with a building permit application? | Documents |
| bp-03 | How much does a residential building permit cost in Toronto? | Fees |
| bp-04 | What is the process for applying for a building permit in Toronto? | Process |
| bp-05 | How long does it take to get a building permit approved? | Timeline |
| bp-06 | What inspections are required during construction in Toronto? | Inspections |
| bp-07 | Do I need a permit to finish my basement in Toronto? | Conditional |
| bp-08 | Can I apply for a building permit online in Toronto? | Process |
| bp-09 | What is an express building permit and do I qualify? | Process |
| bp-10 | Do I need a permit to build a fence in Toronto? | Conditional |
| bp-11 | What happens if I build without a permit in Toronto? | Compliance |
| bp-12 | What are the requirements for a laneway suite permit? | Requirements |
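Internally, each benchmark entry pairs a prompt with its expected topic coverage and source pages. A minimal sketch of that structure (the field names and example values here are illustrative, not the actual suite definition):

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkPrompt:
    """One entry in the benchmark suite (field names are illustrative)."""
    id: str
    prompt: str
    category: str
    expected_keywords: list = field(default_factory=list)
    expected_sources: list = field(default_factory=list)

deck = BenchmarkPrompt(
    id="bp-01",
    prompt="Do I need a building permit to build a deck in Toronto?",
    category="Conditional",
    expected_keywords=["permit", "deck", "height"],       # hypothetical keywords
    expected_sources=["toronto.ca/building-permits"],     # hypothetical source page
)
```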
Scoring methodology
We use two complementary scoring approaches:
Keyword signal scoring (fast)
Each benchmark prompt has a list of expected keywords. We check what fraction of those keywords appear in the answer. This is fast but brittle: if the model says "penalty" instead of "fine," it scores zero on that signal even though the answer is correct.
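The keyword check reduces to a single coverage fraction. A minimal sketch, assuming case-insensitive substring matching (function name is ours, not the evaluation code's):

```python
def keyword_score(answer: str, expected_keywords: list[str]) -> float:
    """Fraction of expected keywords appearing (case-insensitively) in the answer."""
    answer_lower = answer.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in answer_lower)
    return hits / len(expected_keywords) if expected_keywords else 0.0

# The brittleness described above: a correct answer phrased with synonyms scores zero.
answer = "Building without a permit can result in a penalty and an order to stop work."
keyword_score(answer, ["fine", "stop work order"])  # → 0.0, despite being correct
```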
LLM-as-judge scoring (accurate)
A separate LLM call evaluates each answer on four dimensions, each scored 0.0 to 1.0:
- Relevance: Does the answer directly address the question?
- Completeness: Does it cover the expected topics (semantically, not by keyword)?
- Groundedness: Is it supported by evidence, not hallucinated?
- Citation quality: Are sources cited and relevant to the claims?
The judge understands synonyms: "penalty" matches "fine," "work must stop" matches "stop work order." This makes it significantly more accurate than keyword matching. The judge model runs with temperature=0 for deterministic output.
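In practice the judge is prompted to return structured scores, which the harness then parses. A sketch of the parsing side, assuming the judge replies with a JSON object keyed by the four dimensions (the JSON shape and function name are our assumptions, not the actual harness API):

```python
import json

JUDGE_DIMENSIONS = ("relevance", "completeness", "groundedness", "citation_quality")

def parse_judge_scores(raw: str) -> dict[str, float]:
    """Parse the judge's JSON reply into four 0.0-1.0 dimension scores,
    clamping any out-of-range value defensively."""
    data = json.loads(raw)
    return {dim: min(1.0, max(0.0, float(data[dim]))) for dim in JUDGE_DIMENSIONS}

reply = '{"relevance": 0.9, "completeness": 0.85, "groundedness": 1.0, "citation_quality": 0.75}'
scores = parse_judge_scores(reply)
```

Clamping matters because even at temperature=0 a judge model can occasionally emit a value outside the instructed range.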
Current results
Results from our most recent evaluation run (March 2026, n=12 prompts):
| Metric | Keyword Score | LLM Judge Score |
|---|---|---|
| Relevance | 0.60 | 0.88 |
| Groundedness | n/a | 1.00 |
| Citation quality | 0.62 | 0.75 |
100% groundedness means every claim in every answer was supported by retrieved evidence from Toronto.ca. No hallucination was detected across all 12 benchmark prompts.
Model comparison
We tested 7 open-source models via Fireworks.ai using the same benchmark suite and LLM-as-judge evaluation. All models received identical retrieved context for fair comparison.
| Model | Relevance | Grounded | Citation | Latency | Cost |
|---|---|---|---|---|---|
| GPT-oss 120B (selected) | 0.94 | 1.00 | 0.83 | 5.8s | $0.006 |
| Kimi K2.5 | 0.88 | 0.96 | 0.96 | 6.3s | $0.041 |
| GLM-5 | 0.86 | 0.96 | 0.92 | 9.5s | $0.031 |
| DeepSeek v3.2 | 0.83 | 0.96 | 0.92 | 8.4s | $0.035 |
| Mixtral 8x22B | 0.82 | 0.92 | 0.71 | 3.9s | $0.022 |
| GLM-4.7 | 0.78 | 0.92 | 0.86 | 17.9s | $0.010 |
| DeepSeek v3.1 | 0.75 | 0.88 | 0.75 | 4.0s | $0.024 |
Cost is total for all 12 benchmark prompts. GPT-oss 120B was selected as the default model for its combination of highest relevance (0.94), perfect groundedness (1.00), and lowest cost ($0.006 per evaluation run).
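The selection rationale can be expressed as a simple ranking: prefer perfect groundedness, then highest relevance, then lowest cost. A sketch of that logic using a subset of the table above (this illustrates the rationale, not the actual selection script):

```python
def select_default_model(results: list[dict]) -> dict:
    """Rank models by (perfect groundedness, relevance desc, cost asc)."""
    return max(results, key=lambda r: (r["grounded"] == 1.0, r["relevance"], -r["cost"]))

results = [
    {"model": "GPT-oss 120B", "relevance": 0.94, "grounded": 1.00, "cost": 0.006},
    {"model": "Kimi K2.5", "relevance": 0.88, "grounded": 0.96, "cost": 0.041},
    {"model": "Mixtral 8x22B", "relevance": 0.82, "grounded": 0.92, "cost": 0.022},
]
select_default_model(results)["model"]  # → "GPT-oss 120B"
```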
Retrieval improvements
During development, we identified and fixed several retrieval quality issues:
- IVFFlat probe count: The default pgvector probe count of 1 caused relevant chunks to be missed entirely. Increasing to 10 probes fixed a query that previously returned zero results.
- Heading-aware chunking: Splitting HTML pages at h2/h3 boundaries (producing 584 focused chunks from 174 pages) improved citation accuracy by 35%, from 0.46 to 0.62.
- Content classifier fix: A rule-based classifier was mis-categorizing process guides as fee schedules because they mentioned "fees" in passing. Title-based classification fixed this.
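The heading-aware chunking change splits each HTML page before every h2/h3 element so that a chunk carries one focused topic. A simplified sketch using a lookahead split (the production pipeline uses a real HTML parser, not a regex):

```python
import re

def chunk_by_headings(html: str) -> list[str]:
    """Split an HTML page into chunks at <h2>/<h3> boundaries (simplified sketch)."""
    parts = re.split(r"(?=<h[23][ >])", html)
    return [p.strip() for p in parts if p.strip()]

page = "<h2>Fees</h2><p>Fee details...</p><h3>Residential</h3><p>Residential details...</p>"
chunk_by_headings(page)
# → ['<h2>Fees</h2><p>Fee details...</p>', '<h3>Residential</h3><p>Residential details...</p>']
```

Because the split is a zero-width lookahead, each heading stays attached to the content that follows it, which is what makes the resulting citations more precise.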
Vector vs. graph retrieval
We compared plain vector retrieval (RAG) against graph-enhanced retrieval (GraphRAG) using the same benchmark suite. Graph mode improved answers for structured questions about requirements and processes but added latency. Average scores were similar, suggesting the graph is most valuable for specific query types rather than all queries.
Methodology notes
- All evaluations use temperature=0 for deterministic output.
- The judge model (GPT-oss 120B via Fireworks.ai) is the same model used for answer generation. We acknowledge the potential for self-evaluation bias and plan to test cross-model judging in future work.
- The 12-prompt benchmark suite covers the most common question categories but is not exhaustive. We plan to expand it as the system covers more content areas.
- Groundedness scoring evaluates whether claims are supported by retrieved evidence. It does not verify that the retrieved evidence itself is current or accurate on Toronto.ca.
- All evaluation code, prompts, and results are available in the project repository.
Code and data
The evaluation framework, benchmark prompts, scoring code, and model comparison scripts are available on request. Contact us at support@zeever.ca for access.