AI · RAG

RAG Service Co-Pilot for dealership technicians

Technicians describe problems in plain English, but the answers they need are buried in DTC codes and part numbers across PDFs. I built a RAG engine that handles both plain-language questions and exact identifiers, with hybrid retrieval, parent-document context, and citations on every answer. Frontline accuracy +40%; $2.1M saved.

Role: Founding Product Manager
Company: AI Growth Vector
Timeline: 2024 — 2025
Industry: Automotive · Service

+40%

frontline first-fix accuracy

$2.1M

annualized savings

−25%

hallucination rate via LLM-as-judge

Hybrid

semantic + BM25 + rerank

Context

Dealership technicians describe vehicle issues in everyday language ("brake squeal at low speed"), but the manuals, DTC codes, and part SKUs they need are buried across PDFs and structured systems. Generic vector search either missed the exact code or returned a fragment without the parent section's context.

The problem

Pure semantic search captured meaning but lost exact identifiers (DTC codes, SKUs). Pure keyword search hit identifiers but missed intent. And small chunks gave the LLM enough to sound confident but not enough to actually be right.

My PM decisions

Chose hybrid retrieval: semantic vector search for intent, BM25 for exact matches like DTC codes and part numbers, then reciprocal rank fusion (RRF) to merge the two ranked lists.
Adopted parent-document retrieval: search small chunks, return the full surrounding section so the LLM has grounded context.
Required every answer to ship with source citations. Without that, technicians won't trust it and the work can't be audited.
Stood up LLM-as-judge observability: a secondary model scored faithfulness, accuracy, safety, coverage, and hallucination rate on a sampled stream.
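
To make the hybrid retrieval decision concrete, here is a minimal sketch of reciprocal rank fusion in plain Python. It is an illustration, not the production code: the function name, the section IDs, and the constant k=60 (a commonly used default) are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranked list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default, not a tuned value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for a query such as "P0420 rattle on cold start".
semantic_hits = ["sec-exhaust-noise", "sec-catalyst-p0420", "sec-heat-shield"]
bm25_hits = ["sec-catalyst-p0420", "sec-o2-sensor", "sec-exhaust-noise"]

fused = reciprocal_rank_fusion([semantic_hits, bm25_hits])
print(fused)
```

The section that both retrievers rank highly rises to the top, while a section found by only one retriever still stays in the list. That is the behavior the hybrid design relies on: an exact DTC match from BM25 is not drowned out by semantically similar text.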

The system

Ingestion + structuring: parse PDFs into sections with year/model metadata.
Hybrid retrieval: semantic + BM25, merged with reciprocal rank fusion.
Parent retrieval: small-chunk match, return the full section as context.
Grounded LLM answer generation with inline citations.
Agent action layer (parts lookup via SQL/DMS, technician scheduling, customer messaging).
Observability dashboard tracking the five reliability metrics and a single reliability score.
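
The parent-retrieval and citation steps above can be sketched without any framework. This is a toy version under stated assumptions: the Section fields, the word-overlap scoring (a stand-in for the hybrid retriever), the chunk size, and the manual text are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    section_id: str
    source: str  # hypothetical citation label, e.g. file name and page
    text: str


def split_into_chunks(section, size=8):
    """Split a section into small word windows that remember their parent."""
    words = section.text.split()
    return [
        (section.section_id, " ".join(words[i:i + size]))
        for i in range(0, len(words), size)
    ]


def retrieve_parents(query, sections, top_k=1):
    """Match the query against small chunks, then return full parent sections."""
    by_id = {s.section_id: s for s in sections}
    query_terms = set(query.lower().split())
    scored = []
    for section in sections:
        for parent_id, chunk in split_into_chunks(section):
            # Toy relevance: number of shared terms. A real system would
            # score chunks with the hybrid retriever instead.
            overlap = len(query_terms & set(chunk.lower().split()))
            scored.append((overlap, parent_id))
    scored.sort(key=lambda pair: -pair[0])
    parents = []
    for overlap, parent_id in scored:
        if overlap > 0 and by_id[parent_id] not in parents:
            parents.append(by_id[parent_id])
        if len(parents) == top_k:
            break
    return parents


def format_context(parents):
    """Prefix each section with its source so the answer can cite it."""
    return "\n\n".join(f"[{p.source}] {p.text}" for p in parents)


sections = [
    Section("brakes-4.2", "brake_manual.pdf p.41",
            "Low speed brake squeal is often caused by glazed pads. "
            "Inspect pad thickness and rotor surface before replacing hardware. "
            "Torque caliper bolts to specification after service."),
    Section("exhaust-7.1", "exhaust_manual.pdf p.12",
            "Code P0420 indicates catalyst efficiency below threshold. "
            "Check the downstream oxygen sensor before replacing the catalytic converter."),
]

parents = retrieve_parents("brake squeal at low speed", sections)
context = format_context(parents)
print(context)
```

Only the first eight-word chunk matches the query, but the LLM receives the whole section, including the torque step that the matching chunk alone would have dropped, along with a source label it can cite inline.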

Outcome

+40% improvement in frontline first-fix accuracy.
$2.1M in annualized savings from reduced rework and faster diagnoses.
−25% hallucination rate after wiring LLM-as-judge into the release gate.
Technicians actually used it, because every answer cited its source.

Reflections

The interesting product question wasn't "which retrieval is best." It was "what does a technician trust enough to act on?" The answer turned out to be citations plus the parent section of the manual, not a confident one-liner. Once trust was earned, adoption stopped being a change-management problem.

The other lesson: ship the LLM-as-judge harness before you ship the model. Without a standing reliability score, every regression conversation becomes a vibes argument. With it, the release gate is a number.
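
As an illustration of "the release gate is a number," here is a minimal sketch of collapsing judge scores into one reliability score. The five metric names come from the system described above; the unweighted average, the 0-to-1 scale, the 0.85 threshold, and the sample scores are assumptions made for this example, not the formula actually used.

```python
from statistics import mean

METRICS = ("faithfulness", "accuracy", "safety", "coverage", "hallucination_rate")


def reliability_score(judgments):
    """Collapse per-answer judge scores (each assumed in [0, 1]) into one number."""
    per_metric = {m: mean(j[m] for j in judgments) for m in METRICS}
    # Hallucination rate is "lower is better", so invert it before averaging.
    per_metric["hallucination_rate"] = 1.0 - per_metric["hallucination_rate"]
    return mean(per_metric.values())


def passes_release_gate(judgments, threshold=0.85):
    """Hypothetical gate: block the release if the score drops below threshold."""
    return reliability_score(judgments) >= threshold


# Hypothetical judge output for two sampled answers from a candidate build.
candidate = [
    {"faithfulness": 1.0, "accuracy": 0.9, "safety": 1.0,
     "coverage": 0.8, "hallucination_rate": 0.1},
    {"faithfulness": 0.8, "accuracy": 0.9, "safety": 1.0,
     "coverage": 0.8, "hallucination_rate": 0.1},
]
# A regressed build: less faithful, hallucinating more often.
regressed = [
    {"faithfulness": 0.5, "accuracy": 0.9, "safety": 1.0,
     "coverage": 0.8, "hallucination_rate": 0.6},
]

print(reliability_score(candidate), passes_release_gate(candidate))
print(reliability_score(regressed), passes_release_gate(regressed))
```

In practice the weights and threshold would be a product decision (safety, for instance, might be a hard floor rather than one term in an average), but even this simple form turns a regression debate into a comparison of two numbers.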

Stack

LangChain · BM25 · Vector DB · ParentDocumentRetriever · LLM-as-Judge · OpenAI GPT-4o