Fixing Vector Search: IVFFlat Probes, Heading-Aware Chunking, and Content Classification

Colin Smillie

Founder, Zeever.ca

Almost every answer-quality problem we found came down to retrieval, not generation. The model was doing a fine job with the chunks it received. The issue was that it kept getting the wrong chunks, or none at all.

Fix 1: IVFFlat probe count

Our pgvector index used 20 lists (clusters of similar vectors). By default, pgvector checks just 1 list per search. With thousands of vectors across 20 lists, the best match could easily sit in a list we never looked at.

The query "What happens if I build without a permit?" came back empty. The chunk was there, embedded and ready, but the index skipped right past it. Setting ivfflat.probes = 10 fixed it on the spot. That chunk showed up as the top result at 0.83 similarity.
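Why a single probe misses is easier to see in miniature. The sketch below simulates the IVFFlat layout with toy 2-D points (the names and data are illustrative, not our actual schema): vectors are bucketed into 20 lists by nearest centroid, and a search only scans the lists it probes.

```python
import math, random

random.seed(0)

# Toy 2-D "embeddings" partitioned into 20 IVFFlat-style lists, mirroring
# our index. Real embeddings are high-dimensional, but the geometry of
# the failure is the same.
N_LISTS = 20
points = [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(500)]
centroids = random.sample(points, N_LISTS)
lists = {i: [] for i in range(N_LISTS)}
for p in points:
    nearest = min(range(N_LISTS), key=lambda i: math.dist(p, centroids[i]))
    lists[nearest].append(p)

def search(query, probes):
    # IVFFlat visits only the `probes` lists whose centroids are closest
    # to the query, then scans those lists exhaustively.
    probed = sorted(range(N_LISTS), key=lambda i: math.dist(query, centroids[i]))[:probes]
    candidates = [p for i in probed for p in lists[i]]
    return min(candidates, key=lambda p: math.dist(query, p))

query = (5.0, 5.0)
true_best = min(points, key=lambda p: math.dist(query, p))
# With probes=1, true_best can sit in a list we never visit; raising
# probes widens the candidate set until recall recovers.
```

In Postgres itself the fix is one session setting, `SET ivfflat.probes = 10;`, traded off against search latency since each extra probe scans another list.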

Fix 2: Heading-aware chunking

The HTML parser extracted text with single newlines between elements. The chunker split on double newlines. Since there were none, whole pages ended up as single chunks. The "When Do I Need a Building Permit" page (1,052 words) was one chunk. Its embedding was a fuzzy average of decks, fences, basements, and renovations all at once.

We rewrote the parser to insert section breaks at h2/h3 headings and taught the chunker to split there first. 174 HTML pages turned into 584 focused chunks. Citation accuracy went from 0.46 to 0.62, a 35% improvement.
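The core of the change fits in a few lines. This is a regex sketch of the idea, not our production parser (which walks the DOM): force a hard section break at every h2/h3 so the chunker never has to guess where topics change.

```python
import re

def chunk_html(html):
    """Split extracted HTML into one chunk per h2/h3 section.

    A simplified sketch: turn each section heading into an explicit
    break, strip the remaining tags, and split on the breaks.
    """
    # Each h2/h3 opening tag becomes a double newline -- the break the
    # old parser never emitted.
    html = re.sub(r"<h[23]\b[^>]*>", "\n\n", html)
    # Strip the remaining tags, then normalize whitespace per section.
    text = re.sub(r"<[^>]+>", " ", html)
    sections = re.split(r"\n{2,}", text)
    return [" ".join(s.split()) for s in sections if s.strip()]

page = (
    "<h1>Permits</h1><p>Intro.</p>"
    "<h2>Decks</h2><p>Deck rules.</p>"
    "<h2>Fences</h2><p>Fence rules.</p>"
)
# → three focused chunks instead of one fuzzy page-sized chunk
```

Each chunk starts with its heading text, so the heading's keywords land in the same embedding as the content they describe.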

Fix 3: Content classifier

A rule-based classifier sorted pages into fee_schedule, guide, regulation, or page. It scanned the title and body for keywords, checking "fee" first. The "Building Permit Review Streams" page (which has the actual review timelines) mentioned "All fees have been paid" in passing as a completeness requirement. The classifier filed it under fee_schedule. When people asked about timelines, it never came up.

We fixed this by giving the title priority. "Review Streams" in the title maps to guide because "review" is a guide keyword. The text scan only runs when the title is ambiguous.
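A minimal version of the two-pass rule looks like this. The keyword lists are illustrative, not our full rule set: the point is the ordering, where a title match is decisive and the body is only scanned as a fallback.

```python
# Keyword rules, checked in priority order ("fee" first, as before).
# Illustrative subset; the real classifier has more keywords per type.
RULES = [
    ("fee_schedule", ["fee"]),
    ("guide", ["review", "guide", "how to"]),
    ("regulation", ["by-law", "regulation", "zoning"]),
]

def classify(title, body):
    # Pass 1: the title alone. A title match is decisive, so
    # "Building Permit Review Streams" lands on guide even though its
    # body mentions fees in passing.
    title = title.lower()
    for label, keywords in RULES:
        if any(k in title for k in keywords):
            return label
    # Pass 2: the title was ambiguous, so fall back to the body text.
    body = body.lower()
    for label, keywords in RULES:
        if any(k in body for k in keywords):
            return label
    return "page"
```

Under the old body-first logic, the stray "All fees have been paid" would have matched "fee" and filed the page as fee_schedule; with title priority it classifies as guide.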

Takeaway

Better chunking had more impact than any prompt engineering. In a RAG system, the retriever is the bottleneck, not the LLM.