100 Questions Across Toronto.ca: Building a Comprehensive Benchmark
Colin Smillie
Founder, Zeever.ca
We started with 12 questions, all about building permits. That worked when building permits were all we covered. As we added property taxes, recycling, parking, health, recreation, and city government, we needed a benchmark that could tell us whether the system actually worked across the full breadth of Toronto.ca.
Coverage
| Category | Questions |
|---|---|
| Building permits | 12 |
| Sign permits | 4 |
| Tree & ravine permits | 4 |
| Zoning | 4 |
| Property tax | 8 |
| Utility bills | 4 |
| Business licences | 6 |
| Recycling & garbage | 8 |
| Streets & parking | 8 |
| Tickets & fines | 6 |
| Water & environment | 6 |
| Grants & incentives | 4 |
| Housing & shelter | 6 |
| Health & community | 6 |
| Parks & recreation | 8 |
| City government | 6 |
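As a sanity check, the category counts above should sum to exactly 100. A quick sketch (category names and counts copied from the table):

```python
# Category counts from the coverage table; verify they total 100 questions.
counts = {
    "Building permits": 12, "Sign permits": 4, "Tree & ravine permits": 4,
    "Zoning": 4, "Property tax": 8, "Utility bills": 4,
    "Business licences": 6, "Recycling & garbage": 8, "Streets & parking": 8,
    "Tickets & fines": 6, "Water & environment": 6, "Grants & incentives": 4,
    "Housing & shelter": 6, "Health & community": 6, "Parks & recreation": 8,
    "City government": 6,
}
print(sum(counts.values()))  # → 100
```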
Question types
The benchmark covers seven question patterns that real users ask:
- Conditional: "Do I need a permit for...?"
- Process: "How do I apply for...?"
- Cost: "How much does... cost?"
- Location: "Where can I find...?"
- Requirements: "What documents do I need?"
- Eligibility: "Am I eligible for...?"
- Compliance: "What happens if I don't...?"
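One way to picture how a category, a question pattern, and a scorable prompt fit together is a small record per question. This is an illustrative sketch, not the project's actual schema; the field names and the example keywords are assumptions.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkQuestion:
    category: str        # e.g. "Building permits", one of the 16 categories
    qtype: str           # one of the seven patterns above, e.g. "conditional"
    prompt: str          # the question sent to the live API
    keywords: list[str]  # terms the fast keyword scorer checks for

# A hypothetical entry pairing a "conditional" pattern with a permits prompt.
q = BenchmarkQuestion(
    category="Building permits",
    qtype="conditional",
    prompt="Do I need a permit to build a backyard deck?",
    keywords=["permit", "deck"],  # illustrative only
)
```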
How to run it
The benchmark runs against the live API. Each prompt goes in as a query, the answer gets scored by both keyword matching (fast) and our LLM judge (accurate), and the results print as a report.
```sh
# Fast keyword scoring (~5 min)
uv run python scripts/eval.py --label my-run

# With LLM judge (~15 min, ~$0.02)
uv run python scripts/eval.py --label my-run --judge
```
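The fast keyword pass can be sketched in a few lines: score an answer by the fraction of expected keywords it contains. This is a minimal illustration of the idea, not the project's actual scorer; the function name and threshold behaviour are assumptions.

```python
def keyword_score(answer: str, keywords: list[str]) -> float:
    """Fraction of expected keywords present in the answer (case-insensitive)."""
    answer_lower = answer.lower()
    hits = sum(1 for kw in keywords if kw.lower() in answer_lower)
    return hits / len(keywords) if keywords else 0.0

score = keyword_score(
    "Yes, you need a building permit for most backyard decks.",
    ["permit", "deck"],
)
print(score)  # → 1.0: both keywords appear in the answer
```

The LLM judge covers what this pass misses: paraphrased answers that are correct without containing the exact expected terms.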