Difficulty: Intermediate
Time Investment: 2-3 hours
Prerequisites: Basic understanding of vector embeddings, databases
LLMs have three fundamental limitations:

- Their knowledge is frozen at a training cut-off date.
- They have no access to your private, domain-specific data.
- When they lack the answer, they tend to hallucinate a plausible-sounding one.

RAG addresses all three by giving the LLM access to external, up-to-date, domain-specific knowledge at query time.
Understanding RAG architecture helps you evaluate vendors and tools, and also design custom solutions when needed.
Scenario: User asks, “What’s our vacation policy for remote employees?”
Without RAG: the LLM has never seen your internal handbook, so it either admits it doesn't know or invents a plausible-sounding policy.
With RAG: the system retrieves the relevant handbook section and the LLM answers grounded in that text.
Formula:
Final Response = LLM(Original Query + Retrieved Context)
```mermaid
graph LR
    A[Documents] --> B[Chunking]
    B --> C[Vector Embedding Model]
    C --> D[Vector Database]
    E[Document 1] -.-> A
    F[Document 2] -.-> A
    G[Document N] -.-> A
```
Step 1: Chunking Break large documents into smaller, semantically meaningful pieces.
Why chunking? LLM context windows are limited, so only a few passages can be injected per query; smaller chunks also produce more focused embeddings, which makes similarity search more precise.
Example:
Original Document (5000 words)
↓
Chunk 1: Introduction (500 words)
Chunk 2: Architecture Overview (600 words)
Chunk 3: Security Considerations (700 words)
...
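As a rough illustration, here is a minimal Python sketch of this kind of splitting: blank-line paragraphs with a word-count cap. The 500-word cap and the sample text are illustrative assumptions, not recommendations.

```python
# Minimal chunking sketch: split on blank lines, then cap each chunk's size.
# The 500-word cap is an illustrative assumption, not a recommendation.

def chunk_document(text: str, max_words: int = 500) -> list[str]:
    chunks = []
    for paragraph in text.split("\n\n"):
        words = paragraph.split()
        # Split any paragraph longer than the cap into fixed-size pieces.
        for start in range(0, len(words), max_words):
            piece = " ".join(words[start:start + max_words])
            if piece:
                chunks.append(piece)
    return chunks

document = "Introduction...\n\nArchitecture Overview...\n\nSecurity Considerations..."
print(chunk_document(document))  # Three chunks, one per section
```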
Step 2: Embedding Convert each chunk into a numerical vector (e.g., 1536-dimensional array).
Key insight: Documents with similar semantic meaning have vectors that are numerically “close” in high-dimensional space.
Example:
Text: "The API endpoint returns JSON"
Vector: [0.12, -0.45, 0.78, ..., 0.23] (1536 numbers)
Text: "JSON responses from REST API"
Vector: [0.14, -0.43, 0.76, ..., 0.21] (similar numbers!)
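A small sketch of this insight, assuming the locally runnable sentence-transformers library mentioned below; the model name "all-MiniLM-L6-v2" is an example choice and produces 384-dimensional vectors rather than 1536.

```python
# Sketch: embed two related sentences and check that their vectors are close.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model, 384 dimensions
vec_a, vec_b = model.encode([
    "The API endpoint returns JSON",
    "JSON responses from REST API",
])

cosine = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print(f"Cosine similarity: {cosine:.2f}")  # Semantically similar -> close to 1.0
```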
Popular Embedding Models (Always check for newer models):
- text-embedding-3-large (best quality, higher cost)
- embed-v3 (good balance)
- sentence-transformers (free, runs locally)

Step 3: Indexing Store vectors in a specialised database optimised for similarity search.
```mermaid
graph TD
    A[User Query] --> B[Embed Query]
    B --> C[Similarity Search]
    C --> D[Vector Database]
    D --> E[Top K Chunks]
    E --> F[Augment Original Prompt]
    F --> G[LLM]
    G --> H[Final Answer]
```
Step 1: Embed the Query Use the same embedding model to convert the user’s question into a vector.
Example:
Query: "How do I deploy the API?"
Vector: [0.13, -0.44, 0.77, ..., 0.22]
Step 2: Similarity Search Find the top K chunks whose vectors are closest to the query vector.
Common similarity measures: cosine similarity or Euclidean distance
Example:
Query vector: [0.13, -0.44, ...]
Top matches:
1. "Deployment Guide" (similarity: 0.92)
2. "API Configuration" (similarity: 0.87)
3. "CI/CD Pipeline" (similarity: 0.81)
Step 3: Augment the Prompt Inject retrieved chunks into the LLM prompt.
Example:
System: You are a helpful assistant. Use the following context to answer questions.
Context:
[Chunk 1: Deployment Guide text...]
[Chunk 2: API Configuration text...]
User: How do I deploy the API?
Step 4: Generate Response The LLM now has relevant context and can answer accurately without hallucinating.
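A minimal sketch of this final step, assuming the OpenAI Python SDK with an API key in the environment; the model name and the placeholder context are illustrative assumptions.

```python
# Sketch of the generation step with retrieved context injected into the prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
context = "[Chunk 1: Deployment Guide text...]\n[Chunk 2: API Configuration text...]"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[
        {"role": "system",
         "content": "You are a helpful assistant. Use the following context "
                    f"to answer questions.\n\nContext:\n{context}"},
        {"role": "user", "content": "How do I deploy the API?"},
    ],
)
print(response.choices[0].message.content)
```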
| Strategy | Chunk Size | When to Use | Trade-offs |
|---|---|---|---|
| Fixed-size | 500-1000 tokens | Simple docs (articles, blogs) | May split mid-sentence; fast to implement |
| Semantic | Variable (by paragraph/section) | Structured docs (manuals, wikis) | Preserves meaning; harder to implement |
| Sliding window | Overlapping chunks (e.g., 1000 tokens, 200 overlap) | Dense technical docs | Avoids context loss; higher storage cost |
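A minimal sketch of the sliding-window strategy from the table above, using word counts as a stand-in for tokens; swap in a real tokenizer for production use.

```python
# Sliding-window chunking: fixed-size windows with overlap so that sentences
# near a boundary appear in two chunks.

def sliding_window_chunks(text: str, window: int = 1000, overlap: int = 200) -> list[str]:
    words = text.split()
    step = window - overlap
    return [
        " ".join(words[i:i + window])
        for i in range(0, max(len(words) - overlap, 1), step)
    ]

doc = "word " * 2500
print(len(sliding_window_chunks(doc)))  # 2500 words -> 3 overlapping chunks
```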
Architect’s Decision:
| Database | Type | When to Use | Trade-offs |
|---|---|---|---|
| Pinecone | Managed, cloud-native | Production, scale quickly | ✅ Easy, ❌ Vendor lock-in, cost |
| Weaviate | Open-source, self-hosted | Need control, hybrid search | ✅ Flexible, ❌ Operational overhead |
| pgvector | Postgres extension | Already using Postgres | ✅ Simple stack, ❌ Less optimised for scale |
| Chroma | In-memory, local | Prototyping, small datasets | ✅ Fast setup, ❌ Not production-ready |
Architect’s Decision:
Approach 1: Naive Retrieval (Top-K). Embed the query and return the K most similar chunks, with no further processing.
Approach 2: Hybrid Search (Vector + Keyword). Combine vector similarity with keyword scoring (e.g., BM25) so exact terms such as product names still match; a minimal fusion sketch follows this list.
Approach 3: Reranking. Retrieve a larger candidate set, then re-score it with a stronger model (often a cross-encoder) and keep only the best few.
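The fusion sketch referenced under Approach 2: a plain-Python reciprocal rank fusion (RRF) that merges a keyword ranking and a vector ranking. The document ids and the k=60 constant are illustrative.

```python
# Hybrid search via reciprocal rank fusion (RRF): merge a keyword ranking and
# a vector ranking into one list. Each input is a list of ids, best first.

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["deploy-guide", "api-config", "changelog"]
vector_hits = ["api-config", "deploy-guide", "ci-cd-pipeline"]
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
```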
Architect’s Decision:
RAG infrastructure costs vary significantly by scale and approach. Understanding the cost model helps justify investment and choose appropriate solutions.
£ (Prototype/Small Scale)
££ (Production/Medium Scale)
£££ (Enterprise/Large Scale)
1. Embedding API Calls
2. Vector Storage & Compute
3. Retrieval Latency
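A back-of-envelope sketch for cost driver 1 (embedding API calls); every number in it is a placeholder assumption to be replaced with your provider's current pricing and your own corpus statistics.

```python
# Back-of-envelope embedding cost estimate. All figures are placeholders.

num_documents = 10_000
avg_tokens_per_document = 1_000
price_per_million_tokens = 0.10  # placeholder price in £, not a real quote

total_tokens = num_documents * avg_tokens_per_document
one_off_indexing_cost = total_tokens / 1_000_000 * price_per_million_tokens

queries_per_month = 50_000
avg_tokens_per_query = 20
monthly_query_cost = queries_per_month * avg_tokens_per_query / 1_000_000 * price_per_million_tokens

print(f"One-off indexing: £{one_off_indexing_cost:.2f}, monthly queries: £{monthly_query_cost:.2f}")
```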
| Database | Setup Cost | Ongoing Cost | Operational Overhead | Lock-in Risk |
|---|---|---|---|---|
| Chroma | £ | £ | Low (local dev) | None |
| Pinecone | ££ | ££ | Lowest (managed) | High |
| Weaviate | ££ | £-£££ | Medium-High | Medium |
| pgvector | £ | £ | Medium | Low |
Decision framework:
When does RAG pay for itself?
RAG systems typically justify investment when:
Key metrics to track:
Other benefits:
Setup (using Python + OpenAI):
```python
# 1. Install dependencies
# pip install openai chromadb

import chromadb

# 2. Load documents (replace with real document contents)
docs = [
    "Your company handbook\n\nVacation policy section...\n\nRemote work section...",
    "API documentation\n\nDeployment guide section...",
]

# 3. Chunk documents (simple paragraph splitting, flattened into one list)
chunks = [chunk for doc in docs for chunk in doc.split("\n\n")]

# 4. Embed and store in Chroma (uses Chroma's built-in default embedding model)
client = chromadb.Client()
collection = client.create_collection("my_docs")
collection.add(documents=chunks, ids=[f"chunk-{i}" for i in range(len(chunks))])

# 5. Query
query = "What's the vacation policy?"
results = collection.query(query_texts=[query], n_results=3)

# 6. Augment prompt (query results are nested: one list per query)
context = "\n".join(results["documents"][0])
prompt = f"Context:\n{context}\n\nQuestion: {query}"

# 7. Send to LLM
# (use the OpenAI API or similar)
```
Observe:
Setup: Take a 10-page technical doc
Task:
Query: “How do I configure authentication?”
Observe:
Problem: Using model-a to embed documents, model-b to embed queries
Result: Similarity search fails (vectors aren’t comparable)
Solution: Always use the same embedding model for indexing and querying
Problem: Retrieving chunks from old/deprecated docs
Solution: Add metadata (e.g., version, last_updated) and filter before similarity search
Example:
```python
# Only search chunks tagged with the current version; this assumes the chunks
# were added with metadatas=[{"version": ...}, ...] at indexing time.
collection.query(
    query_texts=["How to deploy?"],
    where={"version": "2.0"},  # Only search current version
    n_results=5
)
```
Problem: Assuming retrieval always works
Solution: Log retrieved chunks; manually review to ensure relevance
Metric to track: Precision@K (% of retrieved chunks that are actually relevant)
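A minimal sketch of Precision@K for a single query, assuming a human-labelled set of relevant chunk ids; all ids here are hypothetical.

```python
# Precision@K: fraction of the top-K retrieved chunks that are actually relevant.

def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    top_k = retrieved_ids[:k]
    return sum(1 for chunk_id in top_k if chunk_id in relevant_ids) / k

retrieved = ["deploy-guide", "changelog", "api-config", "faq", "ci-cd"]
relevant = {"deploy-guide", "api-config", "ci-cd"}
print(precision_at_k(retrieved, relevant, k=5))  # 3 of 5 relevant -> 0.6
```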
Problem: Using RAG when the LLM already knows the answer
Solution: Use RAG for domain-specific, proprietary, or recent data. For general knowledge, skip retrieval.
Problem: User queries are short; documents are long. Vectors may not match well.
Solution: Have the LLM write a hypothetical answer to the query, embed that answer, and search with it (the HyDE technique: Hypothetical Document Embeddings)
Example:
Query: "How to deploy?"
Hypothetical Answer: "To deploy the API, first configure the environment variables..."
Embed hypothetical answer → search → retrieve actual deployment guide
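A sketch of the same idea in code, assuming the OpenAI SDK and the Chroma collection populated in the worked example above; the model name is an illustrative choice.

```python
# Hypothetical-answer retrieval: search with an LLM-drafted answer, not the raw query.
import chromadb
from openai import OpenAI

openai_client = OpenAI()
collection = chromadb.Client().get_or_create_collection("my_docs")  # from the worked example
query = "How to deploy?"

# 1. Ask the LLM to draft a plausible answer (it may be partly wrong; that's fine).
draft = openai_client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user",
               "content": f"Write a short, plausible answer to: {query}"}],
).choices[0].message.content

# 2. Search with the draft answer instead of the short query.
results = collection.query(query_texts=[draft], n_results=3)
```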
Problem: A single query may not capture all relevant docs
Solution: Generate multiple variations of the query, retrieve for each, combine results
Example:
Original: "How to deploy the API?"
Variations:
- "API deployment steps"
- "Production deployment guide"
- "Deploy REST API"
Problem: Small chunks lack context; large chunks hurt precision
Solution: Store small chunks for retrieval, but return the full parent document to the LLM
Example:
Retrieve: "Step 3: Configure database connection"
Return to LLM: Entire "Deployment Guide" (includes context from steps 1-10)
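A sketch of this small-to-big pattern using Chroma metadata and a plain dict as the parent store; all names and texts are illustrative.

```python
# Parent-document retrieval: index small chunks tagged with a parent_id,
# then hand the whole parent document to the LLM.
import chromadb

parent_docs = {
    "deployment-guide": "Full Deployment Guide text, steps 1-10 ...",
}

client = chromadb.Client()
small_chunks = client.create_collection("small_chunks")
small_chunks.add(
    documents=["Step 3: Configure database connection"],
    metadatas=[{"parent_id": "deployment-guide"}],
    ids=["deployment-guide-003"],
)

result = small_chunks.query(query_texts=["How do I configure the database?"], n_results=1)
parent_id = result["metadatas"][0][0]["parent_id"]
context_for_llm = parent_docs[parent_id]  # full parent doc, not just the small chunk
```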
RAG is the bridge between LLMs and your data. When designing RAG systems, focus on:
When to use RAG:
When NOT to use RAG:
Start with a prototype (Chroma + OpenAI embeddings), test with real user queries, then scale.