100 Questions Across Toronto.ca: Building a Comprehensive Benchmark

Colin Smillie

Founder, Zeever.ca

We started with 12 questions, all about building permits. That worked when building permits were all we covered. As we added property taxes, recycling, parking, health, recreation, and city government, we needed a benchmark that could tell us whether the system actually worked across the full breadth of Toronto.ca.

Coverage

Category                 Questions
Building permits         12
Sign permits             4
Tree & ravine permits    4
Zoning                   4
Property tax             8
Utility bills            4
Business licences        6
Recycling & garbage      8
Streets & parking        8
Tickets & fines          6
Water & environment      6
Grants & incentives      4
Housing & shelter        6
Health & community       6
Parks & recreation       8
City government          6
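
As a quick sanity check, the per-category counts above can be summed to confirm they add up to the 100 questions in the title. This is only a sketch: the `COVERAGE` dictionary is a hypothetical name introduced here, with values copied from the table, and is not taken from the benchmark's own code.

```python
# Question counts per category, copied from the coverage table.
# COVERAGE is a hypothetical name used only for this illustration.
COVERAGE = {
    "Building permits": 12,
    "Sign permits": 4,
    "Tree & ravine permits": 4,
    "Zoning": 4,
    "Property tax": 8,
    "Utility bills": 4,
    "Business licences": 6,
    "Recycling & garbage": 8,
    "Streets & parking": 8,
    "Tickets & fines": 6,
    "Water & environment": 6,
    "Grants & incentives": 4,
    "Housing & shelter": 6,
    "Health & community": 6,
    "Parks & recreation": 8,
    "City government": 6,
}

total = sum(COVERAGE.values())
print(f"{len(COVERAGE)} categories, {total} questions")
# prints "16 categories, 100 questions"
```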

Question types

The benchmark covers seven question patterns that real users ask:

  • Conditional: "Do I need a permit for...?"
  • Process: "How do I apply for...?"
  • Cost: "How much does... cost?"
  • Location: "Where can I find...?"
  • Requirements: "What documents do I need?"
  • Eligibility: "Am I eligible for...?"
  • Compliance: "What happens if I don't...?"
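
One way to keep these patterns visible in the data is to tag every benchmark question with its category and type. The sketch below assumes a structure of our own invention: `QuestionType`, `BenchmarkQuestion`, `missing_types`, and the sample prompts are all hypothetical and are not the benchmark's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class QuestionType(Enum):
    """The seven question patterns listed above."""
    CONDITIONAL = "conditional"
    PROCESS = "process"
    COST = "cost"
    LOCATION = "location"
    REQUIREMENTS = "requirements"
    ELIGIBILITY = "eligibility"
    COMPLIANCE = "compliance"


@dataclass(frozen=True)
class BenchmarkQuestion:
    """Hypothetical record for one benchmark question."""
    category: str
    question_type: QuestionType
    prompt: str


def missing_types(questions):
    """Return the question types that no question in the list covers."""
    seen = {q.question_type for q in questions}
    return set(QuestionType) - seen


# Illustrative prompts only; these are not taken from the real benchmark.
sample = [
    BenchmarkQuestion("Building permits", QuestionType.CONDITIONAL,
                      "Do I need a permit for a backyard deck?"),
    BenchmarkQuestion("Property tax", QuestionType.COST,
                      "How much does a late payment cost?"),
]

gaps = missing_types(sample)
print(sorted(t.value for t in gaps))
```

A check like `missing_types` makes it easy to spot a category whose questions all follow the same pattern.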

How to run it

The benchmark runs against the live API. Each prompt is sent as a query, the answer is scored by keyword matching (fast) and, when the --judge flag is set, by our LLM judge (more accurate), and the results print as a report.

# Fast keyword scoring (~5 min)
uv run python scripts/eval.py --label my-run

# With LLM judge (~15 min, ~$0.02)
uv run python scripts/eval.py --label my-run --judge
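
To make the fast path concrete, here is a minimal sketch of what keyword scoring could look like. It assumes each question carries a list of expected keywords and that the score is the fraction found in the answer; the function name `keyword_score`, that scoring rule, and the sample answer are assumptions for illustration, not the actual logic in scripts/eval.py.

```python
def keyword_score(answer: str, expected_keywords: list[str]) -> float:
    """Fraction of expected keywords found in the answer (case-insensitive).

    Assumed scoring rule for illustration; the real script may differ.
    """
    if not expected_keywords:
        return 0.0
    text = answer.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in text)
    return hits / len(expected_keywords)


# Illustrative answer and keywords, not real benchmark data.
answer = "Yes, a building permit is required. You can apply online."
score = keyword_score(answer, ["building permit", "apply", "inspection"])
print(f"keyword score: {score:.2f}")
```

Substring matching like this is cheap but blunt: it rewards an answer for mentioning a term even when the surrounding claim is wrong, which is the kind of gap the LLM judge is meant to cover.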