How we build, evaluate, and improve Zeever.ca. We publish our methodology, results, and what we learned along the way.
May 2, 2026
Mapping Canadian AI Compute: Why We Built the Zeever Compute Index
A verified inventory of 39 Canadian GPU providers, normalized to H100-equivalent USD/GPU·hr against a $7.50 ceiling, with a sovereignty taxonomy that makes the procurement trade-off legible.
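One way such a normalization might be sketched: convert each provider's listed rate into H100-equivalent USD/GPU·hr using a rough throughput ratio, then compare against the ceiling. The performance ratios below are illustrative assumptions, not the index's actual figures.

```python
CEILING = 7.50  # USD per H100-equivalent GPU-hour, per the index methodology

# Rough relative throughput vs. an H100. Illustrative numbers only;
# the real index may use different ratios or benchmarks.
PERF_VS_H100 = {"H100": 1.00, "A100": 0.55, "L40S": 0.40}

def h100_equiv_rate(gpu: str, usd_per_gpu_hr: float) -> float:
    """Express a listed GPU rate as H100-equivalent USD/GPU·hr."""
    return usd_per_gpu_hr / PERF_VS_H100[gpu]

def within_ceiling(gpu: str, usd_per_gpu_hr: float) -> bool:
    """True if the normalized rate clears the $7.50 ceiling."""
    return h100_equiv_rate(gpu, usd_per_gpu_hr) <= CEILING
```

An A100 listed at $2.20/GPU·hr normalizes to $4.00 H100-equivalent under these assumed ratios, comfortably under the ceiling.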
April 23, 2026
The Shift to Agent-First AI: What Model Deprecations Tell Us
AI inference platforms are deprecating chat-first models and replacing them with agent-first MoE architectures. What this shift means for production AI.
April 17, 2026
Running Ollama on an Old Dell XPS With an NVIDIA 3070
We tested local inference with Ollama on a Dell XPS with an RTX 3070 against Fireworks, Together.ai, and OVHcloud. The old desktop held its own.
April 3, 2026
Switching Inference Providers: A 24-Hour Latency Test
We ran a 24-hour latency test across Fireworks, Together, and OVHcloud. Together.ai was 1.7x faster. Here is why we switched.
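A test like this boils down to repeatedly timing an identical request against each provider and comparing percentiles. A minimal sketch of such a probe, with the provider call stubbed out (the real test would issue the same inference request to Fireworks, Together, and OVHcloud on a fixed schedule):

```python
import time
import statistics

def probe(request_fn, samples: int) -> dict:
    """Time `request_fn` repeatedly; return p50/p95 latency in seconds.

    `request_fn` is a stand-in for one provider's inference call.
    """
    latencies = []
    for _ in range(samples):
        t0 = time.perf_counter()
        request_fn()  # in the real test: send one inference request
        latencies.append(time.perf_counter() - t0)
    return {
        "p50": statistics.median(latencies),
        # 19 cut points at n=20; index 18 is the 95th percentile boundary
        "p95": statistics.quantiles(latencies, n=20)[18],
    }
```

Running one probe per provider per interval over 24 hours yields the per-provider percentiles behind a comparison like "1.7x faster".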
April 2, 2026
The Missing Layer: AI Inference in Canada
Canada has GPU hosting but no easy way to test and prototype open-source models. That gap is pushing Canadian companies toward US-hosted black boxes.
March 31, 2026
100 Questions Across Toronto.ca: Building a Comprehensive Benchmark
We expanded our evaluation from 12 building permit questions to 100 prompts across 16 categories covering all of Toronto.ca.
March 30, 2026
Scaling a RAG Pipeline from 174 Pages to 35,000 Documents
Our database hit 25GB, the server crashed from OOM, and the web admin went dark during crawls. Here is how we fixed all three.
March 28, 2026
How We Built an LLM-as-Judge to Replace Keyword Scoring
Keyword matching scored our best answer at 0.00. The LLM judge scored it 1.00. Here is how we built a semantic evaluation system.
March 28, 2026
Comparing 7 Open-Source Models for RAG on City Data
We tested 7 open-source LLMs on Toronto city services questions. The cheapest model won.
March 27, 2026
Fixing Vector Search: Probes, Chunking, and Classification
Three retrieval fixes that improved citation accuracy by 35% and fixed queries that returned zero results.
March 27, 2026
Vector RAG vs GraphRAG on Toronto City Data
We compared plain vector retrieval against graph-enhanced retrieval. Graph mode helped on some queries and hurt on others.
Methodology notes
- All evaluations use temperature=0 for deterministic output.
- The judge model (GPT-oss 120B via Fireworks.ai) is the same model used for answer generation. We acknowledge the potential for self-evaluation bias.
- The 100-prompt benchmark suite covers 16 categories across all sections of Toronto.ca.
- Groundedness scoring evaluates whether claims are supported by retrieved evidence. It does not verify that the evidence itself is current on Toronto.ca.
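The per-prompt judge scores described above have to be rolled up across the 16 categories to produce benchmark numbers. A minimal sketch of that aggregation step, assuming each judged prompt yields a record with a `category` label and a `groundedness` score in [0, 1] (field names are assumptions, not the framework's real schema):

```python
from collections import defaultdict

def aggregate(results: list[dict]) -> dict[str, float]:
    """Average per-prompt groundedness scores by category.

    `results` items look like {"category": str, "groundedness": float}.
    """
    by_category = defaultdict(list)
    for r in results:
        score = r["groundedness"]
        if not 0.0 <= score <= 1.0:  # judge scores are bounded in [0, 1]
            raise ValueError(f"score out of range: {score}")
        by_category[r["category"]].append(score)
    return {cat: sum(s) / len(s) for cat, s in by_category.items()}
```

With temperature=0 at both generation and judging, re-running the suite should reproduce the same per-category averages.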
Code and data
The evaluation framework, benchmark prompts, scoring code, and model comparison scripts are available on request. Contact us at info [at] zeever [dot] ca.