RAG Types and Strategies: A Practical Guide to Retrieval-Augmented Generation


Have you ever asked ChatGPT something and received an incorrect answer? While LLMs can now search the web, they cannot access private documents, internal wikis, or company data. Retrieval-Augmented Generation (RAG) addresses this limitation by enabling AI systems to search through proprietary documents before generating responses.

The team has built multiple RAG systems, some highly successful, others less so. This guide captures key learnings from those implementations.

What is RAG anyway?

RAG stands for Retrieval-Augmented Generation

You have documents. The user asks a question. You find the relevant parts. You give those parts to an LLM. The LLM answers using that info. 

The process can be broken down into three steps:

  1. Turn your documents into vectors
  2. Search for matching chunks
  3. Feed matches to the LLM

The basic version is straightforward to build. However, making it perform well in production requires deeper consideration.

The Basic RAG Pipeline

The simplest RAG pipeline involves embedding documents, retrieving the closest match, and generating a response using a large language model.

				
					from sentence_transformers import SentenceTransformer
from openai import OpenAI
import numpy as np
 
# Load embedding model
embedder = SentenceTransformer('all-MiniLM-L6-v2')
 
# Your documents (normally you'd load these from files)
documents = [
   "Our return policy allows returns within 30 days.",
   "Shipping takes 3-5 business days.",
   "Contact support at help@example.com"
]
 
# Turn documents into vectors (normalize for cosine similarity)
doc_embeddings = embedder.encode(documents, normalize_embeddings=True)
 
# Setup LLM client once
client = OpenAI()
 
def basic_rag(question):
    # Turn question into vector
    q_embedding = embedder.encode([question], normalize_embeddings=True)[0]

    # Find most similar document (dot product = cosine sim when normalized)
    similarities = np.dot(doc_embeddings, q_embedding)
    best_idx = np.argmax(similarities)
    context = documents[best_idx]

    # Ask LLM with context
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer based on this: {context}"},
            {"role": "user", "content": question}
        ]
    )

    return response.choices[0].message.content
 
# Try it
answer = basic_rag("How long do I have to return something?")
print(answer)
				
			

This approach works but has limitations. It does not scale well and may retrieve irrelevant results.

Raw NumPy is fine for a demo. In production, though, use a vector database like Chroma, Qdrant, or Pinecone: they handle indexing, scaling, and fast search. You don’t want to np.dot over a million vectors on every query.

What if the question needs info from multiple documents? What if the best match isn’t actually relevant? What if your chunks are too big or too small?

That’s where the other techniques come in.

Vector Databases: Moving Beyond NumPy

The basic approach compares a query against every document, which becomes inefficient at scale.

A vector database solves this. It stores your embeddings and builds an index. When you search, it finds the approximate nearest neighbors without checking every single vector. Millions of vectors, milliseconds to search.

Popular options include: 

  • Chroma (simple, local)
  • Qdrant (fast, self-hosted)
  • Pinecone (managed, cloud) 

They all do the same core thing: handle indexing, scaling, and efficient retrieval. Choose according to your setup.

Here’s the basic RAG pipeline rewritten with Chroma:

				
					import chromadb
from openai import OpenAI
 
# Create a collection (Chroma handles embedding automatically)
chroma = chromadb.Client()
collection = chroma.create_collection(
   name="docs",
   metadata={"hnsw:space": "cosine"}
)
 
# Your documents
documents = [
   "Our return policy allows returns within 30 days.",
   "Shipping takes 3-5 business days.",
   "Contact support at help@example.com"
]
 
# Add documents (Chroma embeds them for you)
collection.add(
   documents=documents,
   ids=[f"doc_{i}" for i in range(len(documents))]
)
 
# Setup LLM client once
client = OpenAI()
 
def basic_rag(question):
    # Search for relevant documents
    results = collection.query(query_texts=[question], n_results=2)
    context = "\n".join(results["documents"][0])

    # Ask LLM with context
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer based on this:\n{context}"},
            {"role": "user", "content": question}
        ]
    )
    return response.choices[0].message.content
 
# Try it
answer = basic_rag("How long do I have to return something?")
print(answer)
				
			

This is much less code, with no manual embedding and no NumPy. And it scales.

Chroma uses its own default embedding model. You can swap it for any model you want. For production, use chromadb.PersistentClient() so your data survives restarts.
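
A minimal sketch of that persistent setup (the path and collection name here are placeholders, not fixed values):

import chromadb

# Data is written to disk at this path and survives restarts
chroma = chromadb.PersistentClient(path="./chroma_db")
collection = chroma.get_or_create_collection(
    name="docs",
    metadata={"hnsw:space": "cosine"}
)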

This is still a toy example. But the pattern is the same one you’d use with a million documents. The vector database handles the hard part.

Chunking: Why Splitting Documents Matters

Most people just split every 500 characters, which is a common mistake. This often breaks sentences and reduces embedding quality.

You end up with chunks like this:

“…the policy states that customers may return items within 30 d”

And another chunk:

“…ays of purchase. No questions asked. Our shipping…”

The first chunk got cut mid-sentence and the embedding is garbage.

Better approach: Structure-based Chunking

A better approach is structure-based chunking, splitting by headings and paragraphs:

				
					from docling.document_converter import DocumentConverter
from docling_core.transforms.chunker import HierarchicalChunker
 
# Convert PDF to structured document
converter = DocumentConverter()
doc = converter.convert("manual.pdf").document
 
# Chunk based on headers and paragraphs
chunker = HierarchicalChunker()
 
chunks = list(chunker.chunk(doc))
 
for chunk in chunks:
   print(f"Section: {chunk.meta.headings}")
   print(f"Text: {chunk.text[:100]}...")
   print()
				
			

This keeps sentences together. It respects headings. Your embeddings make more sense.

Watch out for token limits. all-MiniLM-L6-v2 maxes out at 256 tokens. Anything longer gets silently cut off. If your chunks are bigger, pick an embedding model with a larger context window. Or make your chunks shorter.
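
One rough way to check, reusing the embedder and the docling chunks from the examples above (treat the counts as approximate, since special tokens also count toward the limit):

from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('all-MiniLM-L6-v2')
print(embedder.max_seq_length)  # 256 for this model

# Flag chunks the embedder would silently truncate
for chunk in chunks:
    n_tokens = len(embedder.tokenizer.tokenize(chunk.text))
    if n_tokens > embedder.max_seq_length:
        print(f"Chunk too long ({n_tokens} tokens): {chunk.text[:80]}...")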

Semantic chunking is another option. You look at how similar each sentence is to the next. When similarity drops, that’s a natural break point.
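
Here’s a minimal sketch of that idea. It assumes you’ve already split the text into sentences (with nltk, spacy, or a simple regex), and the 0.5 threshold is just a starting point to tune:

from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_chunks(sentences, threshold=0.5):
    # Group consecutive sentences; start a new chunk when similarity drops
    embeddings = embedder.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]

    for i in range(1, len(sentences)):
        similarity = float(np.dot(embeddings[i - 1], embeddings[i]))
        if similarity < threshold:  # similarity drop = natural break point
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])

    chunks.append(" ".join(current))
    return chunks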

Re-ranking: The Highest Impact Improvement

Re-ranking is one of the most effective enhancements to a basic RAG pipeline.

Vector search is fast but sloppy. You get chunks that are sort of related. But a re-ranker evaluates each chunk against the query and scores its actual relevance.

				
					from sentence_transformers import CrossEncoder
 
# Load re-ranker model
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
 
def search_with_rerank(query, chunks, top_k=5):
 
   # First: get 20 candidates with vector search
   candidates = vector_search(query, chunks, k=20)
 
   # Second: re-rank to find best 5
   pairs = [[query, chunk["text"]] for chunk in candidates]
   scores = reranker.predict(pairs)
 
   # Sort by score
   ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
 
   return [chunk for chunk, score in ranked[:top_k]]
 
# Use it
results = search_with_rerank("What's the refund policy?", all_chunks)
				
			

In practice, adding re-ranking significantly improves accuracy with minimal latency overhead.

Query Expansion: Improving Poor Queries

Users often phrase questions differently than how information is stored in documents.

Someone might ask, “How do I get my money back?” when your docs say “refund policy” and “return process.” 

Query expansion rewrites the question before searching:

				
					client = OpenAI()
 
def expand_query(query):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Rewrite this search query to be more detailed.
            Add relevant terms that might appear in documentation.
            Original: {query}
            Return only the expanded query."""
        }]
    )

    return response.choices[0].message.content
 
# "how do I get my money back" becomes
# "refund policy return process money back reimbursement timeline"

				
			

Multi-query is similar. You generate several versions of the question. Search with all of them. Combine results.

				
					def multi_query_search(query, num_variants=3):
 
   # Generate query variants
   variants = generate_query_variants(query, num_variants)
   all_results = []
 
   for variant in variants:
       results = vector_search(variant, k=5)
       all_results.extend(results)
 
   # Remove duplicates, combine scores
   return deduplicate_and_rank(all_results)
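
generate_query_variants and deduplicate_and_rank are placeholders above. A minimal version of the variant generation, reusing the OpenAI client from the expansion example, might look like this:

def generate_query_variants(query, num_variants=3):
    # Ask the LLM for alternative phrasings of the same question
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Rewrite this question in {num_variants} different ways, "
                       f"one per line, using different wording.\nQuestion: {query}"
        }]
    )
    lines = response.choices[0].message.content.strip().split("\n")
    return [line.strip() for line in lines if line.strip()][:num_variants]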

				
			

Contextual Retrieval: Restoring Meaning

When documents are split into chunks, they may lose context. For example, 

 “The maximum amount is $500 per transaction.” 

Maximum amount of what? Which policy? The chunk lost all context when you split it from the original document. 

Contextual retrieval fixes this. It works with any large language model: during indexing, you ask the model to write a brief intro for each chunk:

				
					def add_context_to_chunk(chunk_text, full_doc):
 
   prompt = f"""Write 1-2 sentences explaining what this chunk is about
   and where it comes from in the document.
   Document preview: {full_doc[:2000]}
   Chunk: {chunk_text}
   Return only the context description."""
 
   context = llm_generate(prompt)
 
   return f"{context}\n---\n{chunk_text}"
 
# Now the chunk becomes:
# "This section describes transaction limits from the Payment Terms policy."
# ---
# "The maximum amount is $500 per transaction."

				
			

The embedding now captures what the chunk is about. Retrieval gets better.

Yes, it means an LLM call for every chunk during indexing, which is slow and costly. Worth it when your chunks lose meaning on their own. Not every dataset needs this.


Agentic RAG: Dynamic Decision-Making

Different queries require different retrieval strategies. Agentic RAG allows the model to choose between tools such as:

  • Semantic search
  • Full document retrieval
  • SQL queries

For instance, “What’s the return policy?” needs a semantic search, while “Summarize the employee handbook” needs the full document. Similarly, “How many orders last month?” needs a database query.

Agentic RAG gives the LLM tools. It picks which one to use.

				
					import json
 
client = OpenAI()
 
tools = [
    {
        "type": "function",
        "function": {
            "name": "semantic_search",
            "description": "Search documents by meaning. Good for factual questions.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_full_document",
            "description": "Get entire document. Good for summaries.",
            "parameters": {
                "type": "object",
                "properties": {"doc_name": {"type": "string"}},
                "required": ["doc_name"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "sql_query",
            "description": "Query the database. Good for numbers and stats.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"]
            }
        }
    }
]
 
# Map tool names to your own functions (implement these based on your setup)
tool_functions = {
   "semantic_search": lambda args: vector_search(args["query"], k=5),
   "get_full_document": lambda args: load_document(args["doc_name"]),
   "sql_query": lambda args: run_sql(args["query"]),
}
 
def agentic_rag(user_question):
    messages = [{"role": "user", "content": user_question}]
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools,
        tool_choice="auto"
    )
    msg = response.choices[0].message

    # If the model picked a tool, execute it and feed results back
    if msg.tool_calls:
        messages.append(msg)

        for tool_call in msg.tool_calls:
            fn_name = tool_call.function.name
            fn_args = json.loads(tool_call.function.arguments)
            result = tool_functions[fn_name](fn_args)

            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(result)
            })

        # Let the model generate a final answer using the tool results
        final = client.chat.completions.create(
            model="gpt-4o",
            messages=messages
        )

        return final.choices[0].message.content

    return msg.content
				
			

The key part most tutorials skip: you have to execute the tool and send results back. The LLM picks the tool. Your code runs it. Then the LLM uses the output to answer.

This approach increases flexibility but requires careful tool design and orchestration.

Knowledge Graphs: Handling Relationships

Vector search finds text that looks similar. But it can’t reason about relationships.

“Who reports to the CEO?” needs an understanding of organizational structure. Not just documents that mention the CEO.

Knowledge graphs store entities and connections:

				
					from neo4j import GraphDatabase
 
driver = GraphDatabase.driver("bolt://localhost:7687")
 
# Store relationships
with driver.session() as session:
   session.run("""
   	MERGE (company:Company {name: 'Acme'})
   	MERGE (ceo:Person {name: 'Jane Smith'})
   	MERGE (manager:Person {name: 'Bob Jones'})
   	MERGE (ceo)-[:CEO_OF]->(company)
   	MERGE (manager)-[:REPORTS_TO]->(ceo)
   """)
 
# Query relationships
def find_reports(person_name):
    with driver.session() as session:
        result = session.run("""
            MATCH (p:Person {name: $name})<-[:REPORTS_TO]-(report)
            RETURN report.name
        """, name=person_name)

        return [record["report.name"] for record in result]
 
# Who reports to Jane Smith?
reports = find_reports("Jane Smith")  # Returns ["Bob Jones"]

				
			

Building the graph is the hard part. You need to extract entities and relationships from documents, usually with the help of an LLM, which is slow and expensive.
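
A rough sketch of that extraction step, reusing the llm_generate placeholder and the Neo4j driver from above (no deduplication or schema handling, so treat it as illustrative only):

import json

def extract_triples(chunk_text):
    # Ask the LLM for (subject, relation, object) triples as JSON
    prompt = f"""Extract entities and relationships from this text as a JSON array
    of objects with "subject", "relation", and "object" keys.
    Text: {chunk_text}
    Return only the JSON array."""
    return json.loads(llm_generate(prompt))

def store_triples(triples):
    # Relationship types can't be parameterized in Cypher, so the relation
    # name is stored as a property on a generic edge
    with driver.session() as session:
        for t in triples:
            session.run("""
                MERGE (a:Entity {name: $subject})
                MERGE (b:Entity {name: $object})
                MERGE (a)-[:RELATED {type: $relation}]->(b)
            """, subject=t["subject"], object=t["object"], relation=t["relation"])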

Use knowledge graphs only for highly connected data: organizational charts, legal documents with cross-references, research papers with citations.

Self-RAG: Adding Quality Control

Sometimes the retrieved chunks don’t actually answer the question. The LLM hallucinates anyway.

Self-RAG adds checks at two points. 

  • First, grade each chunk for relevance. Throw out the bad ones. 
  • Second, after the LLM answers, check if the answer is actually grounded in the chunks.
				
def grade_chunks(query, chunks):
    """Score each chunk individually. Keep only relevant ones."""
    relevant = []

    for chunk in chunks:
        prompt = f"""Is this chunk relevant to the question?
        Question: {query}
        Chunk: {chunk['text'][:300]}
        Answer only "yes" or "no"."""
        verdict = llm_generate(prompt).strip().lower()

        if verdict.startswith("yes"):
            relevant.append(chunk)

    return relevant

def check_groundedness(query, chunks, answer):
    """Check if the answer is supported by the chunks."""
    context = "\n".join([c['text'][:300] for c in chunks])
    prompt = f"""Is this answer fully supported by the provided context?
    No outside information should be used.
    Context: {context}
    Question: {query}
    Answer: {answer}
    Answer only "supported" or "not supported"."""

    return llm_generate(prompt).strip().lower().startswith("supported")

def self_rag(query):
    # Step 1: Retrieve
    chunks = vector_search(query, k=10)

    # Step 2: Grade relevance per chunk
    relevant = grade_chunks(query, chunks)

    # If not enough relevant chunks, retry with expanded query
    if len(relevant) < 2:
        better_query = expand_query(query)
        chunks = vector_search(better_query, k=10)
        relevant = grade_chunks(better_query, chunks)

    # Step 3: Generate answer
    answer = generate_answer(query, relevant)

    # Step 4: Check groundedness
    if not check_groundedness(query, relevant, answer):
        answer += "\n\nNote: This answer may not be fully supported by the available documents."

    return answer
				
			

More LLM calls. More latency. But you catch two failure modes: bad retrieval and bad generation. Worth it when wrong answers have real consequences. Medical info. Legal questions. Financial advice.

Fusion RAG: Combining Multiple Retrieval Methods

In some cases, vector search alone isn’t sufficient.

The problem: different retrieval methods have different strengths. Vector search (dense retrieval) understands meaning. BM25 (sparse retrieval) finds exact keyword matches. Dense works great for natural questions. Sparse handles technical terms better.

Fusion RAG runs multiple retrieval methods at once. Then it combines the results using Reciprocal Rank Fusion (RRF).

Why RRF? You can’t just average the scores. Different methods use different scoring scales. A score of 0.8 from vector search means something different than 0.8 from BM25.

RRF sidesteps this by using ranks instead of scores. If a document appears near the top in multiple methods, it gets boosted.

				
					from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import numpy as np
 
class FusionRAG:

    def __init__(self, chunks):
        self.chunks = chunks
        self.texts = [c["text"] for c in chunks]

        # Setup vector search
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
        self.embeddings = self.embedder.encode(self.texts, normalize_embeddings=True)

        # Setup BM25 (basic tokenizer; swap in nltk or spacy for production)
        tokenized = [text.lower().split() for text in self.texts]
        self.bm25 = BM25Okapi(tokenized)

    def vector_search(self, query, k=20):
        q_emb = self.embedder.encode([query], normalize_embeddings=True)[0]
        scores = np.dot(self.embeddings, q_emb)
        top_indices = np.argsort(scores)[::-1][:k]

        return [(idx, scores[idx]) for idx in top_indices]

    def keyword_search(self, query, k=20):
        tokens = query.lower().split()
        scores = self.bm25.get_scores(tokens)
        top_indices = np.argsort(scores)[::-1][:k]

        return [(idx, scores[idx]) for idx in top_indices]

    def reciprocal_rank_fusion(self, result_lists, k=60):
        """
        Combine multiple ranked lists using RRF.
        k=60 is the standard constant from the original paper.
        """
        scores = {}

        for result_list in result_lists:
            for rank, (doc_idx, _) in enumerate(result_list):
                if doc_idx not in scores:
                    scores[doc_idx] = 0

                # RRF formula: 1 / (k + rank + 1) since rank is 0-indexed
                scores[doc_idx] += 1 / (k + rank + 1)

        # Sort by combined score
        ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)

        return ranked

    def search(self, query, top_k=5):
        # Run both retrieval methods
        vector_results = self.vector_search(query, k=20)
        keyword_results = self.keyword_search(query, k=20)

        # Fuse results
        fused = self.reciprocal_rank_fusion([vector_results, keyword_results])

        # Return top chunks
        return [self.chunks[idx] for idx, score in fused[:top_k]]


# Use it
fusion = FusionRAG(all_chunks)
results = fusion.search("error code E-4502")
				
			

The beauty of RRF: It works even when you add more retrieval methods. You can throw in a third or fourth method, and RRF handles the combination.

				
def search_with_three_methods(self, query, top_k=5):
    vector_results = self.vector_search(query, k=20)
    keyword_results = self.keyword_search(query, k=20)

    # Add a third method, maybe a fine-tuned model
    specialized_results = self.specialized_search(query, k=20)

    fused = self.reciprocal_rank_fusion([
        vector_results,
        keyword_results,
        specialized_results
    ])

    return [self.chunks[idx] for idx, score in fused[:top_k]]
				
			

When does fusion help most? Mixed query types. Some users ask natural questions (“how do I return something”). Others use technical terms (“RMA process policy section 4.2”). Fusion handles both.

You can observe 10-15% recall improvements just from adding BM25 alongside vector search. Not huge. But sometimes that’s the difference between finding the right answer and not.

What Works in Practice

If you are starting a new RAG project:

Always: Re-ranking. Biggest improvement for the effort.

Always: Structure-based chunking. Split on headings and paragraphs. Stop splitting at arbitrary character counts.

Usually: Fusion RAG. Combining vector and keyword search catches things that either would miss alone.

Sometimes: Contextual retrieval. When chunks lose meaning without the surrounding document.

Rarely: Knowledge graphs. Only when the data really needs it.

Most production systems use 3-4 of these together. The trick is figuring out which combination fits your data.

Start simple, and add complexity only when you hit specific problems. Don’t over-engineer from day one.

Wrapping Up

RAG is not a single technique but a collection of strategies.

Basic RAG is easy to implement, but achieving high performance requires iteration, experimentation, and evaluation.

Key techniques include:

  • Vector databases to store and search embeddings at scale
  • Structure-based chunking to split documents properly
  • Re-ranking to improve retrieval quality
  • Query expansion to handle bad questions
  • Contextual retrieval to preserve chunk context
  • Agentic RAG for flexible retrieval
  • Knowledge graphs for connected data
  • Self-RAG for quality control
  • Fusion RAG to combine multiple retrieval methods 

The goal is not to use every technique, but to choose the right combination for the problem at hand.

Building reliable RAG systems is just one part of delivering real business value with LLMs. It requires the right architecture, tooling, and production expertise.

As an LLM development company, Xcelore specializes in designing and deploying end-to-end LLM solutions, from intelligent retrieval systems to scalable AI applications. Partner with Xcelore to build production-ready AI systems that work. 
