How RAG actually works (in 5 paragraphs, no jargon)
Retrieval-Augmented Generation explained without the academic-paper voice. What happens between "user asks" and "model answers."
RAG stands for Retrieval-Augmented Generation. The name sounds academic; the idea is simple. Instead of expecting a language model to know your specific docs, you have the system look them up at the moment of the question and hand the relevant parts to the model. The model doesn't memorize your docs; it reads the relevant parts, every time, fresh.
Step 1: chunking
When you upload a document, the system splits it into smaller windows — typically 300-800 tokens each, with a bit of overlap so context isn't lost at the seams. A 50-page PDF becomes ~150 chunks. Each chunk is a self-contained snippet of meaning: a paragraph, a section, a code block.
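To make the overlap idea concrete, here is a minimal sketch of a sliding-window chunker. It is an illustration, not the code any particular product runs: the function name `chunk_tokens`, its default sizes, and the use of whitespace splitting in place of a real tokenizer are all assumptions made for the example.

```python
def chunk_tokens(tokens, size=500, overlap=50):
    """Split a token list into windows of `size` tokens.

    Each window repeats the last `overlap` tokens of the previous one,
    so a sentence that straddles a boundary appears whole in at least one chunk.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Whitespace splitting stands in for a real tokenizer (an assumption).
words = "the quick brown fox jumps over the lazy dog again".split()
chunks = chunk_tokens(words, size=4, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

With a window of 4 and an overlap of 1, the last word of each chunk is also the first word of the next one. Real systems usually also try to break on paragraph or section boundaries rather than at a fixed token count.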
Step 2: embedding
Each chunk gets passed through an embedding model — a small specialized model whose only job is to turn a piece of text into a long list of numbers (a vector). At Ashh.ai we use nomic-embed-text, which produces a 768-dimensional vector per chunk. Two chunks that mean similar things end up with vectors that point in similar directions. Two chunks about unrelated topics point in very different directions.
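A small sketch of what "point in similar directions" means in numbers. The three vectors below are hand-made 3-dimensional stand-ins, not real output from nomic-embed-text (which would have 768 numbers each), and the variable names are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 = same direction, near 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors chosen by hand for illustration (an assumption, not model output).
cancel_plan = [0.9, 0.1, 0.0]
stop_billing = [0.8, 0.2, 0.1]
office_hours = [0.0, 0.1, 0.9]

similar = cosine_similarity(cancel_plan, stop_billing)
unrelated = cosine_similarity(cancel_plan, office_hours)
print(f"similar topics:   {similar:.2f}")
print(f"unrelated topics: {unrelated:.2f}")
```

The two billing-related vectors score close to 1, while the unrelated pair scores close to 0.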
These vectors get stored in a vector database alongside the original text. We use Postgres with the pgvector extension and HNSW indexing — same engine that runs the rest of the app, just one more column.
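For readers who want to picture the storage side, here is a hedged sketch of what a pgvector setup could look like. The table name `chunks` and its columns are hypothetical, not the actual Ashh.ai schema, and the `%s` placeholders assume a psycopg-style driver. Nothing here connects to a database; the SQL is kept as plain strings.

```python
# Hypothetical schema: one row per chunk, text and vector side by side.
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
    id        bigserial PRIMARY KEY,
    content   text NOT NULL,
    embedding vector(768) NOT NULL  -- one 768-dimensional vector per chunk
);

-- HNSW index built for cosine distance
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);
"""

# Placeholder style assumes a psycopg-like driver (an assumption).
INSERT_SQL = "INSERT INTO chunks (content, embedding) VALUES (%s, %s::vector);"


def to_pgvector_literal(vec):
    """Format a list of floats in pgvector's text form, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(str(float(x)) for x in vec) + "]"


print(to_pgvector_literal([1.0, 2.5]))
```

The vector lives in an ordinary column next to the text it was computed from, which is what "just one more column" amounts to.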
Step 3: retrieval
When a user asks a question, the question gets embedded too (same model, same 768 dimensions). The vector database compares the question's vector against your stored chunk vectors and returns the top-K most similar ones, typically K=4 to K=10. Similarity here means cosine similarity: the smaller the angle between two vectors, the closer their meaning. The HNSW index makes this an approximate search, so the database finds close matches without scanning every row.
The retrieved chunks are the closest semantic match for the question. Not keyword match — semantic. "How do I cancel my subscription?" pulls up the chunk about subscription management even if it uses different words ("end your plan," "stop billing").
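The retrieval step can be sketched in a few lines of plain Python. This is a brute-force version that scores every stored chunk, which a real vector database avoids by using an index. The vectors are hand-made 3-dimensional stand-ins for real embeddings, and the names `normalize` and `top_k` are made up for the example.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to length 1, so a dot product equals cosine similarity."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]


def top_k(query_vec, stored, k=4):
    """Return the k best (score, text) pairs, best first.

    `stored` is a list of (text, unit_vector) pairs, normalized once at index time.
    """
    q = normalize(query_vec)
    scored = ((sum(a * b for a, b in zip(q, vec)), text) for text, vec in stored)
    return heapq.nlargest(k, scored)


# Toy data: the vectors are invented for illustration, not model output.
stored = [
    ("To end your plan, open Billing and choose Stop billing.", normalize([0.9, 0.1, 0.0])),
    ("Our office is open 9 to 5 on weekdays.", normalize([0.0, 0.1, 0.9])),
    ("Invoices are emailed on the first of each month.", normalize([0.6, 0.6, 0.1])),
]

# Pretend this is the embedding of "How do I cancel my subscription?"
query = [0.8, 0.2, 0.1]
results = top_k(query, stored, k=2)
for score, text in results:
    print(f"{score:.2f}  {text}")
```

The "end your plan" chunk ranks first even though it shares no keywords with the question, because in this toy setup its vector points in nearly the same direction as the query's.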
Step 4: generation
The retrieved chunks get prepended to the user's question as context, and the whole thing is sent to the language model with a prompt like: "Using only the context below, answer the question. If the context doesn't contain the answer, say so." The model generates a reply grounded in the retrieved chunks.
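Prompt assembly is mostly string formatting. Here is a minimal sketch, assuming a hypothetical helper called `build_prompt`; the instruction wording comes from the example prompt above, while the "Context:" and "Question:" labels and the chunk numbering are additions for the illustration.

```python
def build_prompt(question, chunks):
    """Put the retrieved chunks ahead of the question, under a grounding instruction."""
    # Numbering the chunks is an illustrative choice that makes it easy
    # to trace an answer back to the chunks it was given.
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Using only the context below, answer the question. "
        "If the context doesn't contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


retrieved = [
    "To end your plan, open Billing and choose Stop billing.",
    "Invoices are emailed on the first of each month.",
]
prompt = build_prompt("How do I cancel my subscription?", retrieved)
print(prompt)
```

The resulting string is what gets sent to the language model; how it is sent depends on the model API you use.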
That's it. The model doesn't need to know your docs in advance. The docs change? Re-embed the changed chunks; everything stays current. Privacy? Your docs live in your database, not in the model's weights. Provenance? You know exactly which chunks were used to generate each answer.
What it's not good at
RAG is excellent at "find the relevant snippet and answer from it" questions. It struggles with questions that require synthesizing information across many chunks at once ("what are all the security features mentioned anywhere in our docs?") because top-K retrieval returns only K chunks and can miss the rest. It also can't answer questions about things that aren't in your docs, by design.
For most chatbot use cases (customer support, internal Q&A, product help), RAG is the right architecture. For deep multi-document analysis, you may need agentic retrieval, where the model plans queries, retrieves iteratively, and refines its search; that is a topic for another post.