Research

Published March 28, 2026. Last updated: March 28, 2026.

Zeever.ca is built on a measurable foundation. We evaluate every component of the system and publish our methodology and results here. This page describes how we test answer quality, how we select our AI models, and what we found.

Corpus

The system is grounded in 2,096 documents crawled from Toronto.ca, including 174 HTML pages and 1,922 PDFs covering building permits, inspections, fees, and application processes. These documents are parsed into 9,689 content chunks, of which 9,192 are embedded using Nomic Embed v1.5 (768 dimensions). A knowledge graph with 2,716 nodes and 2,860 edges captures entity relationships across the permit domain.

Benchmark suite

We evaluate using 12 benchmark prompts that cover the core question categories a Toronto homeowner would ask. Each prompt has expected topic coverage and expected source pages.

| ID    | Prompt                                                           | Category     |
|-------|------------------------------------------------------------------|--------------|
| bp-01 | Do I need a building permit to build a deck in Toronto?          | Conditional  |
| bp-02 | What documents do I need to submit with a building permit application? | Documents |
| bp-03 | How much does a residential building permit cost in Toronto?     | Fees         |
| bp-04 | What is the process for applying for a building permit in Toronto? | Process    |
| bp-05 | How long does it take to get a building permit approved?         | Timeline     |
| bp-06 | What inspections are required during construction in Toronto?    | Inspections  |
| bp-07 | Do I need a permit to finish my basement in Toronto?             | Conditional  |
| bp-08 | Can I apply for a building permit online in Toronto?             | Process      |
| bp-09 | What is an express building permit and do I qualify?             | Process      |
| bp-10 | Do I need a permit to build a fence in Toronto?                  | Conditional  |
| bp-11 | What happens if I build without a permit in Toronto?             | Compliance   |
| bp-12 | What are the requirements for a laneway suite permit?            | Requirements |
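Each prompt pairs a question with its expected topic coverage and expected source pages. A minimal sketch of how such a suite might be represented (field names, keywords, and the source URL are illustrative, not the project's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkPrompt:
    """One evaluation case: a question plus what a good answer should contain."""
    id: str
    prompt: str
    category: str
    expected_keywords: list[str] = field(default_factory=list)
    expected_sources: list[str] = field(default_factory=list)

SUITE = [
    BenchmarkPrompt(
        id="bp-11",
        prompt="What happens if I build without a permit in Toronto?",
        category="Compliance",
        expected_keywords=["stop work order", "fine"],          # illustrative
        expected_sources=["toronto.ca/building-without-permit"],  # illustrative placeholder
    ),
]

print(SUITE[0].category)  # Compliance
```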

Scoring methodology

We use two complementary scoring approaches:

Keyword signal scoring (fast)

Each benchmark prompt has a list of expected keywords. We check what fraction of those keywords appear in the answer. This is fast but brittle: if the model says "penalty" instead of "fine," it scores zero on that signal even though the answer is correct.
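A minimal sketch of this kind of keyword-coverage scorer (function and variable names are illustrative, not the project's actual code), including the brittleness described above:

```python
def keyword_score(answer: str, expected_keywords: list[str]) -> float:
    """Fraction of expected keywords that appear in the answer (case-insensitive substring match)."""
    if not expected_keywords:
        return 0.0
    text = answer.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in text)
    return hits / len(expected_keywords)

# A correct answer that uses a synonym still loses points:
answer = "Building without a permit can result in a penalty and an order to stop work."
print(keyword_score(answer, ["fine", "stop work"]))  # 0.5: "stop work" matches, "fine" is missed
```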

LLM-as-judge scoring (accurate)

A separate LLM call evaluates each answer on four dimensions, each scored 0.0 to 1.0:

  • Relevance: Does the answer directly address the question?
  • Completeness: Does it cover the expected topics (semantically, not by keyword)?
  • Groundedness: Is it supported by evidence, not hallucinated?
  • Citation quality: Are sources cited and relevant to the claims?

The judge understands synonyms: "penalty" matches "fine," "work must stop" matches "stop work order." This makes it significantly more accurate than keyword matching. The judge model runs with temperature=0 for deterministic output.
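The judge pattern above amounts to a rubric prompt plus strict parsing of the reply. A sketch of both pieces, assuming the judge is asked to reply with JSON only (the rubric wording and helper names are illustrative; the actual model call is omitted):

```python
import json

JUDGE_RUBRIC = """You are grading an answer about Toronto building permits.
Score each dimension from 0.0 to 1.0 and reply with JSON only:
{{"relevance": 0.0, "completeness": 0.0, "groundedness": 0.0, "citation_quality": 0.0}}

Question: {question}
Answer: {answer}
Retrieved evidence: {evidence}"""

def parse_judge_scores(raw: str) -> dict[str, float]:
    """Parse the judge's JSON reply and clamp every score into [0.0, 1.0]."""
    scores = json.loads(raw)
    return {name: min(1.0, max(0.0, float(value))) for name, value in scores.items()}

# The real call would send JUDGE_RUBRIC.format(...) to the judge model with
# temperature=0, then feed the reply through parse_judge_scores.
print(parse_judge_scores('{"relevance": 0.9, "groundedness": 1.4}'))
# {'relevance': 0.9, 'groundedness': 1.0}
```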

Current results

Results from our most recent evaluation run (March 2026, n=12 prompts):

| Metric           | Keyword Score | LLM Judge Score |
|------------------|---------------|-----------------|
| Relevance        | 0.60          | 0.88            |
| Groundedness     | n/a           | 1.00            |
| Citation quality | 0.62          | 0.75            |

100% groundedness means every claim in every answer was supported by retrieved evidence from Toronto.ca. No hallucination was detected across all 12 benchmark prompts.

Model comparison

We tested 7 open-source models via Fireworks.ai using the same benchmark suite and LLM-as-judge evaluation. All models received identical retrieved context for fair comparison.

| Model                   | Relevance | Grounded | Citation | Latency | Cost   |
|-------------------------|-----------|----------|----------|---------|--------|
| GPT-oss 120B (selected) | 0.94      | 1.00     | 0.83     | 5.8s    | $0.006 |
| Kimi K2.5               | 0.88      | 0.96     | 0.96     | 6.3s    | $0.041 |
| GLM-5                   | 0.86      | 0.96     | 0.92     | 9.5s    | $0.031 |
| DeepSeek v3.2           | 0.83      | 0.96     | 0.92     | 8.4s    | $0.035 |
| Mixtral 8x22B           | 0.82      | 0.92     | 0.71     | 3.9s    | $0.022 |
| GLM-4.7                 | 0.78      | 0.92     | 0.86     | 17.9s   | $0.010 |
| DeepSeek v3.1           | 0.75      | 0.88     | 0.75     | 4.0s    | $0.024 |

Cost is total for all 12 benchmark prompts. GPT-oss 120B was selected as the default model for its combination of highest relevance (0.94), perfect groundedness (1.00), and lowest cost ($0.006 per evaluation run).
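The selection criterion can be expressed as a filter-then-rank over the comparison data (figures copied from the table above; the tie-break logic is an illustrative reconstruction, not the project's actual script):

```python
# (model, relevance, groundedness, citation, cost_usd) from the comparison table
RESULTS = [
    ("GPT-oss 120B",  0.94, 1.00, 0.83, 0.006),
    ("Kimi K2.5",     0.88, 0.96, 0.96, 0.041),
    ("GLM-5",         0.86, 0.96, 0.92, 0.031),
    ("DeepSeek v3.2", 0.83, 0.96, 0.92, 0.035),
    ("Mixtral 8x22B", 0.82, 0.92, 0.71, 0.022),
    ("GLM-4.7",       0.78, 0.92, 0.86, 0.010),
    ("DeepSeek v3.1", 0.75, 0.88, 0.75, 0.024),
]

def pick_default(results):
    """Keep only perfectly grounded models, then take the most relevant; cheaper wins ties."""
    grounded = [r for r in results if r[2] == 1.00]
    return max(grounded, key=lambda r: (r[1], -r[4]))[0]

print(pick_default(RESULTS))  # GPT-oss 120B (the only model with groundedness 1.00)
```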

Retrieval improvements

During development, we identified and fixed several retrieval quality issues:

  • IVFFlat probe count: The default pgvector probe count of 1 caused relevant chunks to be missed entirely. Increasing to 10 probes fixed a query that previously returned zero results.
  • Heading-aware chunking: Splitting HTML pages at h2/h3 boundaries (producing 584 focused chunks from 174 pages) improved citation accuracy by 35%, from 0.46 to 0.62.
  • Content classifier fix: A rule-based classifier was mis-categorizing process guides as fee schedules because they mentioned "fees" in passing. Title-based classification fixed this.
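The heading-aware chunking idea can be sketched with the standard-library HTML parser: emit a new chunk whenever an h2 or h3 opens, carrying the heading text with the body that follows. This is a simplified illustration under that assumption, not the project's actual parser:

```python
from html.parser import HTMLParser

class HeadingChunker(HTMLParser):
    """Split an HTML page into (heading, body) chunks at h2/h3 boundaries."""
    SPLIT_TAGS = {"h2", "h3"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._heading = None
        self._in_heading = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SPLIT_TAGS:
            self._flush()          # close the previous chunk
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in self.SPLIT_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading:
            self._heading = (self._heading or "") + data
        else:
            self._buf.append(data)

    def _flush(self):
        body = " ".join(piece.strip() for piece in self._buf if piece.strip())
        if self._heading or body:
            self.chunks.append((self._heading, body))
        self._heading, self._buf = None, []

    def close(self):
        super().close()
        self._flush()              # emit the trailing chunk

chunker = HeadingChunker()
chunker.feed("<h2>Fees</h2><p>Permit fees vary.</p><h2>Inspections</h2><p>Book online.</p>")
chunker.close()
print(chunker.chunks)  # [('Fees', 'Permit fees vary.'), ('Inspections', 'Book online.')]
```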

Vector vs. graph retrieval

We compared plain vector retrieval (RAG) against graph-enhanced retrieval (GraphRAG) using the same benchmark suite. Graph mode improved answers for structured questions about requirements and processes but added latency. Average scores were similar, suggesting the graph is most valuable for specific query types rather than all queries.
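One common way graph enhancement works is to expand the vector hits with their knowledge-graph neighbours before answering; the extra hops are where the added latency comes from. A toy sketch of that expansion step (the adjacency structure and names are assumptions, not the project's implementation):

```python
def graph_enhanced_retrieve(vector_hits, graph, max_extra=3):
    """Augment vector-retrieved chunk ids with knowledge-graph neighbours.

    vector_hits: chunk ids from plain vector search, best match first.
    graph: adjacency dict {chunk_id: [related_chunk_id, ...]}.
    """
    results = list(vector_hits)
    for hit in vector_hits:
        for neighbour in graph.get(hit, []):
            if neighbour not in results and len(results) < len(vector_hits) + max_extra:
                results.append(neighbour)
    return results

graph = {"fees-overview": ["fee-schedule", "payment-methods"]}
print(graph_enhanced_retrieve(["fees-overview", "apply-online"], graph))
# ['fees-overview', 'apply-online', 'fee-schedule', 'payment-methods']
```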

Methodology notes

  • All evaluations use temperature=0 for deterministic output.
  • The judge model (GPT-oss 120B via Fireworks.ai) is the same model used for answer generation. We acknowledge the potential for self-evaluation bias and plan to test cross-model judging in future work.
  • The 12-prompt benchmark suite covers the most common question categories but is not exhaustive. We plan to expand it as the system covers more content areas.
  • Groundedness scoring evaluates whether claims are supported by retrieved evidence. It does not verify that the retrieved evidence itself is current or accurate on Toronto.ca.
  • All evaluation code, prompts, and results are available in the project repository.

Code and data

The evaluation framework, benchmark prompts, scoring code, and model comparison scripts are available on request. Contact us at support@zeever.ca for access.