RAG Patterns: Retrieval-Augmented Generation

Master RAG patterns from naive to agentic. Learn retrieval strategies, prompt design, and when to use each approach for AI-powered Q&A systems.

May 4, 2026
Tags: rag, retrieval, vector-search, knowledge-base, prompt-engineering

RAG Patterns

Retrieval-Augmented Generation (RAG) combines information retrieval with LLMs to produce accurate, grounded answers. The pattern you choose — naive, advanced, or agentic — depends on your data complexity and quality requirements.

Naive RAG

The simplest pattern: retrieve documents, concatenate with the question, generate an answer.

Retrieved documents:
[doc_1] Django is a high-level Python web framework that encourages rapid development...
[doc_2] Flask is a micro-framework for Python with a minimal core...
[doc_3] FastAPI is a modern Python web framework based on Starlette...

Question: What are the best Python web frameworks for 2026?

Answer using only the retrieved documents. If the documents don't
contain enough information, say so explicitly. Cite your sources.
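
As a rough sketch of how the prompt above might be assembled in code: `build_naive_prompt` is a hypothetical helper, and it assumes the retriever hands back plain strings.

```python
def build_naive_prompt(question: str, docs: list[str]) -> str:
    """Concatenate retrieved documents and the question into one prompt."""
    # Number the documents so the model can cite them as [doc_1], [doc_2], ...
    doc_lines = [f"[doc_{i}] {text}" for i, text in enumerate(docs, start=1)]
    return (
        "Retrieved documents:\n"
        + "\n".join(doc_lines)
        + f"\n\nQuestion: {question}\n\n"
        + "Answer using only the retrieved documents. If the documents don't\n"
        + "contain enough information, say so explicitly. Cite your sources."
    )


prompt = build_naive_prompt(
    "What are the best Python web frameworks for 2026?",
    ["Django is a high-level Python web framework.", "Flask is a micro-framework."],
)
```

The resulting string is what gets sent to the model; nothing here checks whether the retrieved documents are actually relevant.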

Limitations: No query rewriting, no reranking, and no recovery when retrieval misses. If retrieval fails, the answer fails.

When naive is sufficient: Simple FAQ, small document sets (under 100 docs), demo applications, internal tools where 80% accuracy is acceptable.

Advanced RAG

Add pre-retrieval and post-retrieval steps to improve quality.

Pre-retrieval — Query rewriting:

Original query: "How do I set it up?"
Context: User was just reading about authentication in Django REST Framework 3.14
Rewritten: "How do I set up authentication in Django REST Framework 3.14?"

Rewrite the user's question into a clear, standalone search query.
Instructions:
- Expand abbreviations and acronyms
- Resolve pronouns ("it", "that", "this") by referencing conversation context
- Add domain-specific terms that improve matching
- Output only the rewritten query

Original: {user_query}
Conversation context: {context}
Search query:
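
One way to wire this template into code is sketched below. The `llm` argument is a hypothetical callable (prompt in, text out) standing in for whatever model client you use, and the fallback to the original query is an added assumption, not part of the template.

```python
from typing import Callable

REWRITE_TEMPLATE = """Rewrite the user's question into a clear, standalone search query.
Instructions:
- Expand abbreviations and acronyms
- Resolve pronouns ("it", "that", "this") by referencing conversation context
- Add domain-specific terms that improve matching
- Output only the rewritten query

Original: {user_query}
Conversation context: {context}
Search query:"""


def rewrite_query(user_query: str, context: str, llm: Callable[[str], str]) -> str:
    """Fill the template and return the model's rewritten query."""
    prompt = REWRITE_TEMPLATE.format(user_query=user_query, context=context)
    rewritten = llm(prompt).strip()
    # Assumption: fall back to the original query if the model returns nothing.
    return rewritten or user_query


# Stub model for illustration only; a real model would produce the rewrite.
def fake_llm(prompt: str) -> str:
    return "  How do I set up Django REST Framework?\n"


result = rewrite_query("How do I set it up?", "Reading about Django REST Framework", fake_llm)
```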

Chunking strategies — how you split documents matters:

| Strategy | Method | Best For |
| --- | --- | --- |
| Fixed-size | Split every N characters | Simple content, logs |
| Semantic | Split at topic boundaries | Articles, documentation |
| Recursive | Split by paragraph → sentence → word | Mixed content |
| Hierarchical | Chunk + parent document reference | Long documents needing full-context answers |

Chunk by section headings, not by character count.
Each chunk should be a self-contained unit of meaning.
Include the document title and section path in each chunk's metadata.
Target size: 500-1000 tokens per chunk.
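
The heading-based guidance above could look roughly like this. The sketch assumes the source text uses Markdown-style `#` headings; `chunk_by_headings` is a hypothetical helper, and the size estimate counts whitespace-separated words rather than real tokens.

```python
import re


def chunk_by_headings(title: str, text: str) -> list[dict]:
    """Split text at Markdown-style headings, tracking the section path."""
    chunks: list[dict] = []
    path: list[str] = []   # current heading stack, e.g. ["Setup", "Auth"]
    body: list[str] = []

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            chunks.append({
                "title": title,
                "section_path": " > ".join(path),
                "text": content,
                # Rough size estimate: words, not model tokens.
                "approx_tokens": len(content.split()),
            })
        body.clear()

    for line in text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            flush()  # close the previous section before starting a new one
            level = len(match.group(1))
            del path[level - 1:]   # drop headings at the same or deeper level
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks


doc = "# Setup\nInstall the package.\n## Auth\nCreate a token.\n# Usage\nCall the API."
chunks = chunk_by_headings("Guide", doc)
```

Sections that exceed the target size would still need a second pass (for example, splitting by paragraph); that step is left out here.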

Hybrid search: Combine vector (semantic) and keyword (BM25) retrieval for better coverage.

Search using both methods:
1. Vector search — finds semantically similar chunks
2. Keyword search — finds exact term matches

Merge results from both, deduplicate, and rerank by combined score.
Weight: 0.6 vector + 0.4 keyword (adjust based on your data).
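
A minimal sketch of the merge step follows. It assumes each retriever returns a dict of document id to raw score, and it uses min-max normalization to put the two score scales on the same footing; that normalization is an assumption of this sketch, not something the weighting above prescribes.

```python
def normalize(scores: dict[str, float]) -> dict[str, float]:
    """Min-max scale scores to [0, 1] so the two retrievers are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_merge(vector_scores, keyword_scores, w_vector=0.6, w_keyword=0.4, top_k=5):
    """Merge two result sets by weighted score; dict keys deduplicate doc ids."""
    vec, kw = normalize(vector_scores), normalize(keyword_scores)
    combined = {
        doc_id: w_vector * vec.get(doc_id, 0.0) + w_keyword * kw.get(doc_id, 0.0)
        for doc_id in set(vec) | set(kw)
    }
    # Highest combined score first; tie-break on id so the order is deterministic.
    ranked = sorted(combined.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


vector_hits = {"a": 0.9, "b": 0.5, "c": 0.1}   # e.g. cosine similarities
keyword_hits = {"b": 12.0, "d": 4.0}           # e.g. BM25 scores
ranked = hybrid_merge(vector_hits, keyword_hits)
```

Here document `b` ranks first because both retrievers found it, even though `a` had the best vector score. One known drawback of min-max scaling is that the lowest-scoring hit in each list contributes zero.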

Post-retrieval — Reranking:

Rank these documents by relevance to the query. Keep only the top 3.

Query: "Python async web frameworks"
Documents:
1. [doc_A: "Introduction to Python"] — relevance: low
2. [doc_B: "FastAPI async handlers"] — relevance: high
3. [doc_C: "Django ORM tutorial"] — relevance: medium
4. [doc_D: "AIOHTTP vs FastAPI"] — relevance: high
5. [doc_E: "Python packaging guide"] — relevance: low

Kept: doc_B, doc_D, doc_C
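
The selection step in this example can be sketched as a stable sort over relevance labels. The numeric mapping in `RELEVANCE` is an assumption for illustration; in practice the labels or scores would come from a reranking model.

```python
RELEVANCE = {"high": 2, "medium": 1, "low": 0}


def keep_top_k(judged: list[tuple[str, str]], k: int = 3) -> list[str]:
    """Sort (doc_id, label) pairs by relevance and keep the top k ids."""
    # sorted() is stable, so equally relevant docs keep their retrieval order.
    ordered = sorted(judged, key=lambda pair: RELEVANCE[pair[1]], reverse=True)
    return [doc_id for doc_id, _ in ordered[:k]]


kept = keep_top_k([
    ("doc_A", "low"),
    ("doc_B", "high"),
    ("doc_C", "medium"),
    ("doc_D", "high"),
    ("doc_E", "low"),
])
```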

HyDE (Hypothetical Document Embeddings): Generate a hypothetical ideal document from the query, then use that to search.

Given the question, first generate a hypothetical document that
would perfectly answer it. Then use that document's embedding
to search for real documents.

Question: {question}
Hypothetical document:
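
The flow can be sketched with stand-in components. `generate` and `embed` below are hypothetical callables for your LLM and embedding model; the toy versions exist only to make the example runnable.

```python
import math
from typing import Callable


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hyde_search(question: str, corpus: list[str],
                generate: Callable[[str], str],
                embed: Callable[[str], list[float]],
                top_k: int = 2) -> list[str]:
    """Embed a generated hypothetical answer, not the question, then search."""
    hypothetical = generate(f"Question: {question}\nHypothetical document:")
    query_vec = embed(hypothetical)
    scored = [(cosine(query_vec, embed(doc)), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


# Toy stand-ins: a bag-of-words "embedding" and a canned "generation".
VOCAB = ["async", "fastapi", "django", "orm"]


def toy_embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def toy_generate(prompt: str) -> str:
    return "fastapi supports async handlers"


corpus = ["django orm tutorial", "fastapi async handlers", "python packaging guide"]
results = hyde_search("Which framework handles concurrency?", corpus,
                      toy_generate, toy_embed, top_k=1)
```

The question shares no vocabulary with any document, yet the hypothetical answer does, which is the gap HyDE is meant to bridge.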

Best for: Production Q&A, documentation search, any system requiring high precision.

Agentic RAG

The agent decides when and what to retrieve, using tools dynamically. It can refine searches, try different approaches, and know when it has enough information.

You are a research assistant with access to a knowledge base.
Retrieve information only when needed.

Available tools:
- search_knowledge_base(query: string) → [documents]

Rules:
1. First, try to answer from your own knowledge
2. If unsure, search the knowledge base
3. If search results are insufficient, refine your search
4. Cite sources for any retrieved information
5. Say "I couldn't find information on that" only after 2 search attempts

Remember: you can search multiple times with different queries.
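
Rules 2 through 5 amount to a bounded search-and-refine loop. The sketch below assumes hypothetical `search` and `refine` callables and treats an empty result list as "insufficient"; a real agent would let the model judge sufficiency and choose the refinement.

```python
from typing import Callable


def agentic_answer(question: str,
                   search: Callable[[str], list[str]],
                   refine: Callable[[str, int], str],
                   max_attempts: int = 2) -> dict:
    """Search, refine on empty results, and give up only after max_attempts."""
    query = question
    tried: list[str] = []
    for attempt in range(1, max_attempts + 1):
        tried.append(query)
        docs = search(query)
        if docs:
            # Return the sources so the caller can cite them.
            return {"status": "found", "sources": docs, "queries": tried}
        query = refine(question, attempt)
    return {"status": "not_found",
            "message": "I couldn't find information on that",
            "queries": tried}


# Toy knowledge base and refinement step, for illustration only.
def toy_search(query: str) -> list[str]:
    kb = {"api rate limit": ["[doc_7] The API allows 100 requests per minute."]}
    return kb.get(query.lower(), [])


def toy_refine(question: str, attempt: int) -> str:
    return "API rate limit"


outcome = agentic_answer("How many calls can I make?", toy_search, toy_refine)
missing = agentic_answer("Unknown topic", lambda q: [], toy_refine)
```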

Self-querying retrieval: The model extracts structured filters from natural language.

From the user's question, extract:
- Search query (the core information need)
- Filters (date range, category, author, version)
- Sort order (relevance, date, popularity)

Question: "Show me the latest articles about React Server Components from 2025"
→ Query: "React Server Components"
→ Filters: {year: 2025}
→ Sort: by date descending
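
Once the model has produced the structured form, applying it is ordinary filtering and sorting. In this sketch `StructuredQuery` and the document fields (`title`, `year`, `date`) are hypothetical, and the substring match stands in for a real search call.

```python
from dataclasses import dataclass, field


@dataclass
class StructuredQuery:
    query: str
    filters: dict = field(default_factory=dict)
    sort_by: str = "relevance"   # or "date"
    descending: bool = True


def apply_structured_query(docs: list[dict], sq: StructuredQuery) -> list[dict]:
    """Filter on exact metadata matches, then match the query text and sort."""
    hits = [
        d for d in docs
        if all(d.get(key) == value for key, value in sq.filters.items())
        and sq.query.lower() in d["title"].lower()   # stand-in for real search
    ]
    if sq.sort_by == "date":
        # ISO 8601 date strings sort correctly as plain strings.
        hits.sort(key=lambda d: d["date"], reverse=sq.descending)
    return hits


docs = [
    {"title": "React Server Components in depth", "year": 2025, "date": "2025-03-01"},
    {"title": "React Server Components intro", "year": 2024, "date": "2024-06-10"},
    {"title": "React Server Components and caching", "year": 2025, "date": "2025-09-15"},
    {"title": "Vue 3 migration", "year": 2025, "date": "2025-01-20"},
]
sq = StructuredQuery("React Server Components", {"year": 2025}, sort_by="date")
latest = apply_structured_query(docs, sq)
```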

When retrieval is insufficient, the agent should adapt:

Your first search returned low-quality results. Try these strategies:
1. Simplify the query (remove jargon)
2. Use synonyms for key terms
3. Split a complex query into multiple specific searches
4. If still failing, admit the gap rather than fabricating

Best for: Complex research, ambiguous queries, scenarios where the optimal retrieval strategy isn't known upfront.

Multi-Hop RAG

Chain multiple retrievals where each result informs the next query. Essential for questions that require connecting information across documents.

Question: "Which company developed the framework used by Instagram's backend?"

Hop 1: "What framework does Instagram use?"
→ Result: "Instagram uses Django"

Hop 2: "Which company developed Django?"
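→ Result: "Django was first developed at the Lawrence Journal-World newspaper and is now maintained by the Django Software Foundation"

Answer: "Instagram uses Django, which was first developed at the Lawrence Journal-World newspaper and is now maintained by the Django Software Foundation."

You need to answer a question that may require multiple searches.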
Break your approach into hops:

Hop 1: Search for the initial answer
Hop 2: Use the result to formulate the next search
Hop 3+: Continue until you can answer confidently

Set a maximum of 5 hops. If you can't answer after 5 hops,
report what you found and what's still missing.

Current hop: {hop_number}
Previous findings: {findings}
Next search query:
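
The hop loop with its cap can be sketched as follows. `next_query` is a hypothetical planner (in practice, the model filling in the template above) that returns `None` once it can answer; `search` is a stand-in for the retriever.

```python
from typing import Callable, Optional


def multi_hop(question: str,
              next_query: Callable[[str, list[str]], Optional[str]],
              search: Callable[[str], str],
              max_hops: int = 5) -> dict:
    """Run up to max_hops searches, feeding earlier findings into each query."""
    findings: list[str] = []
    for hop in range(1, max_hops + 1):
        query = next_query(question, findings)
        if query is None:   # planner signals it can answer confidently
            return {"answered": True, "hops": hop - 1, "findings": findings}
        findings.append(search(query))
    # Cap reached: report what was found rather than guessing.
    return {"answered": False, "hops": max_hops, "findings": findings}


# Toy knowledge base and planner, for illustration only.
KB = {
    "What framework does Instagram use?": "Instagram uses Django",
    "Which company developed Django?":
        "Django was first developed at the Lawrence Journal-World newspaper",
}


def toy_planner(question: str, findings: list[str]) -> Optional[str]:
    if not findings:
        return "What framework does Instagram use?"
    if len(findings) == 1:
        return "Which company developed Django?"
    return None


result = multi_hop("Which company developed the framework used by Instagram's backend?",
                   toy_planner, KB.get)
stalled = multi_hop("q", lambda q, f: "again", lambda q: "nothing")
```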

Best for: Multi-step reasoning, entity linking, questions requiring synthesis across documents.

Handling Common RAG Failures

| Failure | Symptom | Fix |
| --- | --- | --- |
| Irrelevant retrieval | Answer uses unrelated docs | Improve chunking, add reranking step |
| Contradictory sources | Answer contains conflicting statements | Flag contradictions in output: "Source A says X, Source B says Y" |
| Outdated information | Answers reference old versions | Include date metadata, add freshness check in prompt |
| Missing information | Answer fabricates details | Tighten "only use retrieved docs" instruction, add refusal language |
| Too many documents | Oversized prompt, truncated output | Cap retrieved chunks at 3-5, use reranking |

Citation & Attribution Strategies

Inline citation: Cite within the answer text.

Django is a high-level Python framework [1]. Flask is better for microservices [2].

[1] Django Documentation, https://docs.djangoproject.com
[2] Flask Documentation, https://flask.palletsprojects.com
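
If the generation step returns claims paired with source ids, the numbered format above can be produced mechanically. This is a sketch under that assumption; `render_with_citations` and the claim/source structures are hypothetical.

```python
def render_with_citations(claims: list[tuple[str, str]], sources: dict[str, str]) -> str:
    """Number sources in order of first use and append a reference list."""
    numbers: dict[str, int] = {}
    sentences = []
    for text, source_id in claims:
        # Reuse the same number when a source is cited more than once.
        n = numbers.setdefault(source_id, len(numbers) + 1)
        sentences.append(f"{text} [{n}].")
    refs = [f"[{n}] {sources[sid]}" for sid, n in numbers.items()]
    return " ".join(sentences) + "\n\n" + "\n".join(refs)


answer = render_with_citations(
    [("Django is a high-level Python framework", "django"),
     ("Flask is better for microservices", "flask"),
     ("Django encourages rapid development", "django")],
    {"django": "Django Documentation, https://docs.djangoproject.com",
     "flask": "Flask Documentation, https://flask.palletsprojects.com"},
)
```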

When sources conflict:

Source A states the API rate limit is 100 requests per minute.
Source B states it is 1000 requests per minute.

This appears to be a version difference. Source A is for v2.0,
Source B is for v3.0. Please verify which version you are using.

Evaluating RAG Quality

Retrieval metrics:

  • Hit rate — Did we retrieve at least one relevant document?
  • Mean Reciprocal Rank (MRR) — How high was the first relevant result?
  • Normalized Discounted Cumulative Gain (NDCG) — Overall ranking quality
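
For reference, the three retrieval metrics can be computed in a few lines. This sketch uses the common definitions (binary relevance for hit rate and MRR, graded relevance with a log2 discount for NDCG); the input shapes are assumptions.

```python
import math


def hit_rate(results: list[list[str]], relevant: list[set[str]]) -> float:
    """Fraction of queries with at least one relevant document retrieved."""
    hits = sum(1 for docs, rel in zip(results, relevant) if rel & set(docs))
    return hits / len(results)


def mrr(results: list[list[str]], relevant: list[set[str]]) -> float:
    """Mean of 1/rank of the first relevant document (0 if none retrieved)."""
    total = 0.0
    for docs, rel in zip(results, relevant):
        total += next((1.0 / i for i, d in enumerate(docs, 1) if d in rel), 0.0)
    return total / len(results)


def ndcg(docs: list[str], gains: dict[str, float]) -> float:
    """NDCG for one query; `gains` maps doc id to graded relevance."""
    dcg = sum(gains.get(d, 0.0) / math.log2(i + 1) for i, d in enumerate(docs, 1))
    ideal = sorted(gains.values(), reverse=True)[:len(docs)]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, 1))
    return dcg / idcg if idcg else 0.0


results = [["a", "b", "c"], ["x", "y", "z"]]   # ranked ids per query
relevant = [{"b"}, {"q"}]                      # ground-truth ids per query
```

With this data, one of two queries retrieves a relevant document, and it appears at rank 2, so hit rate is 0.5 and MRR is 0.25.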

Generation metrics:

  • Faithfulness — Does the answer stick to retrieved documents?
  • Answer relevance — Does the answer address the question?
  • Completeness — Does the answer cover all aspects?

Pattern Selection

| Pattern | Retrieval Quality | Latency | Complexity | Best For |
| --- | --- | --- | --- | --- |
| Naive RAG | Low | Low | None | Simple FAQ, demos |
| Advanced RAG | Medium-High | Medium | Low-Medium | Production Q&A, docs |
| Agentic RAG | High | High | Medium | Complex research |
| Multi-Hop RAG | High | High | Medium | Entity linking, synthesis |

Best Practices

  • Chunk strategically — Split by section boundaries, not character count. Each chunk should be self-contained.
  • Include metadata — Store source, date, version, and section path alongside each chunk.
  • Set relevance thresholds — Don't retrieve low-scoring documents; they add noise.
  • Handle empty results — Tell the user when nothing relevant exists. Never fabricate.
  • Cite sources — Always attribute information to its source document.
  • Test your retrieval — Measure hit rate on a held-out set of questions before deploying.
  • Version your documents — Multiple versions of the same doc cause confusion; use version metadata to retrieve the right one.
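
Two of these practices, relevance thresholds and empty-result handling, fit in a short sketch. The `min_score` of 0.5 and the cap of 5 chunks are placeholder values to tune on your own data, and `generate` is a hypothetical callable for the answer step.

```python
from typing import Callable


def select_context(scored_docs: list[tuple[str, float]],
                   min_score: float = 0.5, max_chunks: int = 5) -> list[str]:
    """Drop low-scoring chunks and cap how many reach the prompt."""
    kept = [(doc, s) for doc, s in scored_docs if s >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in kept[:max_chunks]]


def answer_or_refuse(scored_docs: list[tuple[str, float]],
                     generate: Callable[[list[str]], str]) -> str:
    context = select_context(scored_docs)
    if not context:
        # Nothing cleared the threshold: say so instead of guessing.
        return "I couldn't find anything relevant in the knowledge base."
    return generate(context)


hits = [("chunk about auth", 0.82), ("chunk about billing", 0.31), ("chunk about tokens", 0.64)]
context = select_context(hits)
refusal = answer_or_refuse([("off-topic chunk", 0.2)], lambda ctx: "unused")
```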