Building RAG from scratch

RAG.

Vector database.

Embeddings.

Semantic search.

Chunking.

If AI Twitter and every product landing page this year had a drinking game, these words would end it fast. Most explanations either wave hands at "the model searches your data" or dive straight into linear algebra. So here's the plain version.

Retrieval-augmented generation solves one problem: a language model only knows what it was trained on. It has never seen your company's internal wiki, last week's PDF report, or your collection of personal notes. A RAG system closes that gap: before the model sees a question, relevant pieces of text are pulled from a document collection and handed to the model alongside the question. The model can then answer from the text in front of it, rather than guessing at what you want to hear.

Fundamentally, four steps make it work: break documents into chunks, turn each chunk into something searchable, find the ones that match a question, and feed them to the model. The middle two steps, making chunks searchable and finding the matches, can be done a few different ways. Let's dive deeper.

First, how does a word become a number? (or: what are vector embeddings?)

A computer can check whether two sentences contain the same characters, but it can't directly compare their semantic meanings. Under the hood, it can only compare numbers. So text has to become numbers, and not just any numbers: ones that capture meaning.

Imagine plotting every word in the dictionary on a giant map, where similar words end up near each other. "Dog" and "puppy" sit close together. "Dog" and "stapler" sit much farther apart. "King" and "queen" sit near each other, but shifted in a consistent direction: "king" will be closer to "man", whereas "queen" will be closer to "woman." That map is what an embedding model builds, and LLMs represent text internally in a similar way.

The same trick extends from single words to whole sentences and paragraphs, producing a list of numbers called an embedding.

flowchart LR
    A["text: a word or sentence"] --> B[Embedding Model]
    B --> C["a list of numbers<br/>[0.12, -0.87, 0.33, ...]"]
    C --> D["plotted on the meaning map"]

Texts with similar meanings end up with similar vectors, even if they don't share a single word. For example, "my car wouldn't start" and "the vehicle failed to turn on" land close together, because the embedding captures meaning, not spelling. That's the whole trick behind semantic, vector-based search: compare lists of numbers, not words. Let's pivot back to RAG.
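To make "close together on the map" concrete, here is a minimal sketch of cosine similarity, one common way to compare two embeddings. The three-number vectors are made up for illustration and do not come from any real model; actual embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-written toy vectors (hypothetical, not real model output).
car_wont_start = [0.9, 0.1, 0.2]
vehicle_failed = [0.8, 0.2, 0.3]
stapler_review = [0.1, 0.9, -0.4]

similar = cosine_similarity(car_wont_start, vehicle_failed)
different = cosine_similarity(car_wont_start, stapler_review)
print(similar > different)
# prints True
```

When every vector is scaled to length 1, both norms equal 1 and cosine similarity reduces to a plain dot product, which is why normalized embeddings can be compared with a single multiplication.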

Two ways to think about retrieval

Vector retrieval uses the embedding idea above. Every chunk of a document gets converted into a list of numbers ahead of time and stored. A question gets converted the same way, and whichever stored chunks land closest on the map get pulled out with the highest priority and are passed to the model.

Vectorless retrieval skips embeddings. There are two flavors: old-school keyword matching, and letting the model navigate a table of contents for the document (e.g. PageIndex). Let's explore both ideas.

Building the chunking step

Vector search and keyword search both need documents broken into small pieces first; there's no way to point at a specific paragraph inside one giant embedding. The simplest approach is to put every chunk_size words into a separate chunk. When processing PDFs (often the output of OCR), another option is to split by page, which comes up again later.

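def chunk_text(text, chunk_size=200):
    # Split on whitespace, then group every chunk_size words into one chunk.
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunks.append(" ".join(words[start:end]))
        start = end  # advance to the next window; without this the loop never ends
    return chunks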
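For the page-based alternative, a minimal sketch might look like the following. It assumes the extracted text marks page breaks with a form feed character (`\f`), which some PDF-to-text tools emit. Both that separator and the `chunk_by_page` name are assumptions, so adjust them to whatever the extraction step actually produces.

```python
def chunk_by_page(text, page_separator="\f"):
    """Split extracted text into one chunk per page, skipping blank pages."""
    pages = text.split(page_separator)
    # Collapse runs of whitespace left over from extraction.
    cleaned = (" ".join(page.split()) for page in pages)
    # Drop pages that are empty after cleaning.
    return [page for page in cleaned if page]

sample = "Page one text.\fPage two\n  text.\f\f"
print(chunk_by_page(sample))
# prints ['Page one text.', 'Page two text.']
```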

Building vector retrieval

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed(texts):
    return model.encode(texts, normalize_embeddings=True)

class VectorStore:
    def __init__(self):
        self.texts = []
        self.vectors = None

    def add(self, texts):
        new_vectors = embed(texts)
        self.texts.extend(texts)
        self.vectors = new_vectors if self.vectors is None else np.vstack([self.vectors, new_vectors])

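    def search(self, query, top_k=3):
        if self.vectors is None:
            return []  # nothing has been added yet
        query_vector = embed([query])[0]
        # Vectors are normalized, so the dot product equals cosine similarity.
        scores = self.vectors @ query_vector
        top = np.argsort(scores)[::-1][:top_k]  # indices of the highest scores first
        return [self.texts[i] for i in top]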

Each chunk gets mapped to its list of numbers and stored. A search compares the question's numbers against every stored chunk and returns whichever are closest.

Building vectorless retrieval

Keyword search (BM25)

BM25 counts how often the words in a question appear in each chunk, weighted so rare words matter more than common ones. No neural network, no embeddings, just word frequency math. It shines when questions use the same terminology as the source: product codes, names, and exact phrases that embeddings tend to blur.

import numpy as np
from rank_bm25 import BM25Okapi

class KeywordStore:
    def __init__(self):
        self.texts = []
        self.bm25 = None

    def add(self, texts):
        self.texts.extend(texts)
        # BM25 statistics depend on the whole corpus, so rebuild the index over all texts.
        tokenized = [t.lower().split() for t in self.texts]
        self.bm25 = BM25Okapi(tokenized)

    def search(self, query, top_k=3):
        scores = self.bm25.get_scores(query.lower().split())
        top = np.argsort(scores)[::-1][:top_k]
        return [self.texts[i] for i in top]

Same add / search shape as the vector store, with no embeddings underneath. It still needs chunks going in.
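To see what "rare words matter more" means in practice, here is a minimal from-scratch sketch of a common BM25 formulation. It is for illustration only: the `bm25_scores` name, the `k1` and `b` defaults, and the "+1" IDF variant are choices made here, and a library such as rank_bm25 may compute slightly different numbers.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        # Longer-than-average documents are penalized slightly.
        length_norm = 1 - b + b * len(tokens) / avg_len
        score = 0.0
        for term in query.lower().split():
            tf = term_counts[term]
            if tf == 0:
                continue
            # Rare terms (low document frequency) get a larger IDF weight.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

docs = ["the red car", "the blue car", "the zx-81 manual"]
scores = bm25_scores("the zx-81", docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(docs[best])
# prints the zx-81 manual
```

Every document contains "the", so it adds the same small amount to each score. Only one document contains "zx-81", so that term decides the ranking.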

Reasoning over structure (PageIndex)

BM25 and vector search both chop a document into arbitrary pieces first, then score each piece alone. PageIndex skips that. It builds a tree from a document's actual structure, its sections and sub-sections, the same way a table of contents already does. Each node in that tree gets a short summary. Then an LLM reasons over the tree: given this question, which branch would contain the answer?

A tiny example, built from a financial report:

Section	Title	Summary
1	Overview	Company summary and fiscal year highlights
2	Financial Results	Revenue, expenses, and margin breakdown
2.1	Revenue	Quarterly revenue by region
2.2	Expenses	Operating cost breakdown
3.C	Appendix C: Definitions	Glossary of financial terms used in the report

A question like "what drove the expense increase" points the model straight at 2.2. A question that references "the definition in Appendix C" gets routed there directly, something a similarity score has no way of doing, since the phrase "Appendix C" rarely looks similar to the appendix itself.

The code below is a stand-in (the real thing involves API calls), but the shape holds:

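class PageIndexStore:
    def __init__(self, document):
        # build_toc_tree is a placeholder: sections and summaries, no chunking
        self.tree = build_toc_tree(document)

    def search(self, query, top_k=1):
        # reason_over_tree is a placeholder for the LLM call: the model reads
        # titles/summaries and picks a branch. top_k is accepted only so the
        # signature matches the other stores.
        node = reason_over_tree(self.tree, query)
        return [node.text]  # a list of texts, like the other stores return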

No chunk_text call anywhere, and that's the point. This tends to shine on long, structured documents (contracts, filings, manuals) where the answer lives in one specific section.
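To make the tree walk concrete, here is a runnable sketch built from the table above. It is not how PageIndex itself works: `Node`, `find_section`, and `naive_choose` are hypothetical names, and `naive_choose` swaps the LLM call for crude word matching so the example runs offline. In a real system that function would send the titles and summaries to a model and ask it to pick a branch.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    section: str
    title: str
    summary: str
    children: list = field(default_factory=list)

def naive_choose(query, nodes):
    """Placeholder for the LLM call: pick the node whose title and summary
    contain the most query words (simple substring match)."""
    words = [w for w in query.lower().split() if len(w) > 3]

    def overlap(node):
        haystack = f"{node.title} {node.summary}".lower()
        return sum(word in haystack for word in words)

    return max(nodes, key=overlap)

def find_section(roots, query, choose=naive_choose):
    """Walk down the tree, choosing one branch per level until reaching a leaf."""
    node = choose(query, roots)
    while node.children:
        node = choose(query, node.children)
    return node

tree = [
    Node("1", "Overview", "Company summary and fiscal year highlights"),
    Node("2", "Financial Results", "Revenue, expenses, and margin breakdown", [
        Node("2.1", "Revenue", "Quarterly revenue by region"),
        Node("2.2", "Expenses", "Operating cost breakdown"),
    ]),
    Node("3.C", "Appendix C: Definitions", "Glossary of financial terms used in the report"),
]

print(find_section(tree, "what drove the expense increase").section)
# prints 2.2
```

The walk only ever looks at titles and summaries, never at the full text of a section, which keeps each decision small enough to hand to a model in a single prompt.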

The generation step

However the chunks got found, the last step is the same: hand them to the model as context, then ask the question.

import anthropic

client = anthropic.Anthropic()

def answer_question(query, store, top_k=3):
    chunks = store.search(query, top_k=top_k)
    context = "\n\n".join(chunks)

    prompt = f"""Answer the question using only the context below.
If the context doesn't contain the answer, say so.

Context:
{context}

Question: {query}"""

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

Swap in a VectorStore, a KeywordStore, or a PageIndexStore; the function doesn't much care which. All of them hand back text.

Hybrid: vector first, then enrich

A useful pipeline layers all three rather than picking one:

Vector search first, over the whole collection, for broad semantic recall. This narrows things down to a handful of candidate pages.
Pull in PageIndex summaries for those candidate pages, giving the model the surrounding structural context, not just the raw chunk.
Run keyword/BM25/fuzzy matching alongside it to catch exact terms vector search tends to blur: company IDs, ticket numbers, product codes.

class HybridStore:
    def __init__(self, document):
        self.vector_store = VectorStore()
        self.keyword_store = KeywordStore()
        self.page_index = PageIndexStore(document)

    def add(self, texts):
        self.vector_store.add(texts)
        self.keyword_store.add(texts)

    def search(self, query, top_k=3):
        pages = self.vector_store.search(query, top_k=top_k)          # broad semantic recall
        # summary_for is a hypothetical helper (not defined above) that maps a page to its section summary
        summaries = [self.page_index.summary_for(p) for p in pages]   # structural context per page
        exact_hits = self.keyword_store.search(query, top_k=top_k)    # catches IDs, codes, exact terms

        return pages + summaries + exact_hits

Vector search decides which pages are in play, PageIndex explains where those pages sit in the document, and keyword/fuzzy matching makes sure a literal string like a company ID doesn't get lost in translation. Three narrow tools, each catching what the others miss.
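One practical wrinkle: vector search and keyword search can return the same chunk, and plain list concatenation would put it into the context twice. A minimal sketch of an order-preserving merge follows; `merge_results` is a hypothetical helper, not part of any library.

```python
def merge_results(*result_lists):
    """Concatenate result lists in order, keeping only the first copy of each text."""
    seen = set()
    merged = []
    for results in result_lists:
        for text in results:
            if text not in seen:
                seen.add(text)
                merged.append(text)
    return merged

pages = ["chunk A", "chunk B"]
summaries = ["summary of section 2"]
exact_hits = ["chunk B", "chunk C"]
print(merge_results(pages, summaries, exact_hits))
# prints ['chunk A', 'chunk B', 'summary of section 2', 'chunk C']
```

Keeping the first occurrence means whichever retriever is listed first sets the order, so the argument order doubles as a rough priority.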

Where this tends to go next

The core loop (chunk, find, feed to the model) covers a surprising amount of ground on its own. From here, common next moves include a re-ranking pass over a wider candidate set and attaching metadata like source or date so results can be filtered before a search even runs.

None of that requires starting over; it all sits on top of the pieces built above.
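As one illustration of the metadata idea, here is a small sketch of filtering chunks by source and date before any search runs. The `Chunk` fields, the `filter_chunks` helper, and the sample data are all made up for this example; a real system would decide its own metadata schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Chunk:
    text: str
    source: str
    published: date

def filter_chunks(chunks, source=None, since=None):
    """Keep chunks matching the given source and published on or after `since`."""
    kept = []
    for chunk in chunks:
        if source is not None and chunk.source != source:
            continue
        if since is not None and chunk.published < since:
            continue
        kept.append(chunk)
    return kept

# Made-up sample data for illustration.
chunks = [
    Chunk("Q1 revenue rose.", "report.pdf", date(2024, 4, 1)),
    Chunk("Old onboarding notes.", "wiki", date(2021, 6, 15)),
    Chunk("Q2 expenses fell.", "report.pdf", date(2024, 7, 1)),
]
recent_report = filter_chunks(chunks, source="report.pdf", since=date(2024, 6, 1))
print([c.text for c in recent_report])
# prints ['Q2 expenses fell.']
```

Only the surviving chunks' text would then be passed to a store's add method, so the search itself never sees the filtered-out material.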