Glossary
What is Retrieval-Augmented Generation?
Also known as: RAG
Retrieval-Augmented Generation (RAG) is a technique that grounds a large language model in your own data. At query time, a retrieval step finds the most relevant documents (or document chunks) from a corpus you control — typically using a vector database — and includes them in the LLM prompt. This can dramatically reduce hallucinations and lets the model answer questions about content the base model has never seen, without the cost of retraining.
How RAG works
A RAG system has three phases: indexing, retrieval, and generation. During indexing, your documents are split into chunks (paragraphs, sections, or sliding windows of text), each chunk is converted to a vector embedding, and the vectors are stored in a vector database (Pinecone, Weaviate, pgvector, Chroma).
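The indexing phase can be sketched in a few lines of Python. Everything here is illustrative: `chunk_text` does naive character-window chunking, `embed` is a toy hashed bag-of-words vector standing in for a real embedding model, and a plain list stands in for the vector database.

```python
import math
import zlib

def chunk_text(text, size=200, overlap=50):
    """Split text into overlapping sliding-window chunks."""
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
    return chunks

def embed(text, dim=256):
    """Toy embedding: hashed bag-of-words, normalized to unit length.
    A real pipeline calls an embedding model here instead."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[zlib.crc32(word.encode()) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

document = "Our firm handles workers' compensation cases statewide. " * 20
# In production the (chunk, vector) pairs go into a vector database;
# a plain list works for a sketch.
index = [(c, embed(c)) for c in chunk_text(document)]
```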
At query time, the user’s question is also converted to a vector. The vector database finds the chunks whose vectors are closest to the query vector — semantic similarity, not keyword match. Those chunks are pulled into the LLM prompt as context.
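Retrieval itself is a nearest-neighbor search over those vectors. A minimal self-contained sketch, with a toy word-count embedding standing in for a real model — a vector database performs the same ranking at scale using approximate nearest-neighbor indexes:

```python
import math

def embed(text):
    """Toy embedding: unit-normalized word counts (stand-in for a real model)."""
    counts = {}
    for w in text.lower().split():
        counts[w] = counts.get(w, 0) + 1
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {w: c / norm for w, c in counts.items()}

def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(v * b.get(w, 0.0) for w, v in a.items())

def retrieve(query, index, k=3):
    """Return the k chunks whose embeddings are closest to the query's."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

corpus = [
    "Our office hours are 9am to 5pm, Monday through Friday.",
    "We speak English and Spanish.",
    "Consultations are free for personal injury cases.",
]
index = [(c, embed(c)) for c in corpus]
top = retrieve("what are your office hours", index, k=1)
```

Note that the office-hours chunk ranks first even though the query isn't a verbatim match — with real embeddings, a paraphrase like "when are you open" would rank it first too.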
During generation, the LLM is asked to answer the question using the retrieved chunks. A well-tuned system also instructs the model to cite sources or say "I don’t know" if the chunks don’t contain the answer.
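The generation step is largely prompt assembly. A sketch — the instruction wording is illustrative, not a recommended template:

```python
def build_prompt(question, chunks):
    """Assemble a grounded prompt: retrieved chunks as numbered context,
    plus instructions to cite sources and admit when the context
    doesn't contain the answer."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using ONLY the context below. "
        "Cite chunk numbers like [1]. If the context does not contain "
        "the answer, say \"I don't know.\"\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

chunks = ["We speak English and Spanish.", "Office hours are 9am to 5pm."]
prompt = build_prompt("What languages do you speak?", chunks)
# `prompt` is then sent to the LLM's chat/completion API
```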
Why RAG matters
RAG addresses two of the largest practical limitations of LLMs: stale training data (LLMs only know what was in their training set) and hallucination (LLMs fabricate plausible-sounding information when uncertain). By retrieving fresh, authoritative content at query time and instructing the model to ground answers in that content, RAG produces answers that are current and traceable.
For law firms specifically, RAG is useful for: searching firm-specific case files, surfacing precedent from a firm’s own past matters, answering caller questions about firm-specific policies (fees, languages spoken, office hours), and ensuring an AI agent quotes the firm’s actual marketing copy rather than inventing claims.
RAG and AI voice agents
In voice agents, RAG is what lets the agent answer caller questions using firm-specific information. "What languages do you speak?" "What are your office hours?" "Do you handle workers’ compensation cases in Florida?" — these are RAG queries against the firm’s own documents and configuration. The alternative — hard-coding all this in the system prompt — doesn’t scale beyond a few facts.
When RAG is the wrong tool
RAG is overkill for tasks where the answer is structured and exact (like "what is the user’s account balance" — that’s a database query). It’s also a poor fit for open-ended conversational tasks that depend on broad world knowledge your corpus won’t contain. Modern systems often combine RAG (for firm-specific facts) with the LLM’s baseline knowledge (for general world knowledge), letting the model use whichever is appropriate per turn.
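A hypothetical per-turn router illustrates the combination: exact structured facts go to a database lookup, firm-specific questions go to RAG, and everything else falls through to the model's general knowledge. The keyword heuristic is deliberately naive — a real system would use an intent classifier or let the LLM itself choose a tool:

```python
def route(question):
    """Pick a data source for one conversational turn (illustrative heuristic)."""
    q = question.lower()
    if "balance" in q or "invoice" in q:
        return "database"  # structured, exact answers belong in a DB query
    if any(kw in q for kw in ("hours", "fees", "languages", "cases")):
        return "rag"       # firm-specific facts live in the corpus
    return "llm"           # general world knowledge: use the model directly
```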