Retrieval-Augmented Generation (RAG) is a powerful framework that brings together large language models (LLMs) and external knowledge to create more accurate, context-aware answers. But even with RAG, developers often encounter one frustrating problem: hallucinations — answers that sound plausible but are completely wrong or made up.
In this blog, we’re going beyond installation and setup. We’ll walk you through a real-world case where a RAG system was hallucinating facts and how we debugged and resolved it.
The Problem: “My RAG System Is Still Hallucinating!”
Imagine this scenario:
You provide your chatbot with a comprehensive dataset of your company’s product inventory. You then ask:
“Do we have any eco-friendly office chairs in stock?”
The chatbot responds with a detailed description of an “EcoComfort ErgoChair,” highlighting its sustainable materials and ergonomic design. However, upon checking your inventory, you realize that such a product doesn’t exist in your catalog.
Despite integrating FAISS, OpenAI embeddings, and LangChain into your Retrieval-Augmented Generation (RAG) system, you’re still receiving fabricated responses—commonly known as “hallucinations.”
Diagnosis: Why Was Our RAG System Hallucinating?
RAG Setup Recap:
- Document: A 25-page legal contract
- Embedding: text-embedding-ada-002 from OpenAI
- Vector Store: FAISS
- Chunking: 1000 tokens, no overlap
- LLM: GPT-3.5 via LangChain’s RetrievalQA
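For reference, here is a minimal sketch of that baseline, assuming the contract has already been loaded into a list of LangChain Documents called raw_docs (e.g. via a PDF loader); package names follow the current split-package layout:
from langchain_text_splitters import TokenTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import RetrievalQA
# Naive chunking: fixed 1000-token windows with no overlap, so cuts land mid-sentence
splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = splitter.split_documents(raw_docs)
# Embed the chunks and index them in FAISS
vectorstore = FAISS.from_documents(docs, OpenAIEmbeddings(model="text-embedding-ada-002"))
# Plain similarity search plus LangChain's default prompt
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-3.5-turbo"),
    retriever=vectorstore.as_retriever(),
)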
Result: The model sounded smart but often made things up, especially when asked “why,” “what if,” or “compare” questions.
Root Causes Identified
1. Chunking Strategy Was Too Naive
- Long paragraphs were being chopped mid-sentence.
- Retrieved chunks lost their surrounding context, and the LLM filled in the gaps on its own.
Fix: Switch to RecursiveCharacterTextSplitter, which splits on a prioritized list of separators (paragraph breaks, then newlines, then spaces) before falling back to individual characters, so chunks tend to end on natural boundaries.
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
docs = splitter.split_documents(raw_docs)
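Note that chunk_size and chunk_overlap above are measured in characters (not tokens), and the 100-character overlap means text near a chunk boundary usually appears in both neighboring chunks, so a clause is less likely to be stranded without its context.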
2. Retrieval Was Returning Irrelevant Chunks
- The FAISS retriever sometimes returned chunks that matched the question’s keywords but not its actual meaning, and the top results were often near-duplicates of each other.
Fix: Used MMR (Maximal Marginal Relevance) retriever to diversify retrieved content:
retriever = vectorstore.as_retriever(search_type="mmr")
This increased answer accuracy by over 30% in our internal tests.
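MMR first fetches a larger candidate set by similarity and then keeps a subset that is both relevant to the query and different from one another. If you need tighter control, the retriever accepts a few knobs (the values below are illustrative):
# Illustrative values: fetch 20 candidates, keep the 4 that best balance
# relevance and diversity (lambda_mult closer to 1 favors pure relevance)
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 20, "lambda_mult": 0.5},
)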
3. The Prompt Was Too Generic
- LangChain’s default prompt didn’t ground the LLM enough in the context.
Fix: Customized the prompt to explicitly instruct the model to “only use information from the provided context.”
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="""
Use ONLY the following context to answer the question.
If the answer is not in the context, just say "I don’t know" — do not make it up.
Context:
{context}
Question: {question}
""",
)
qa = RetrievalQA.from_llm(
    llm=ChatOpenAI(),
    retriever=retriever,
    prompt=prompt,
)
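With the custom prompt wired in, a quick sanity check (the question is just an example; invoke is the current way to call a LangChain chain):
# One of the "what if" questions that previously triggered hallucinations
result = qa.invoke({"query": "What happens if either party terminates the contract early?"})
print(result["result"])  # should now cite the contract or answer "I don't know"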
Results After Fixes
| Metric | Before | After |
|---|---|---|
| Accuracy | Medium | High |
| Hallucinations | Frequent | Rare |
| User Trust | Low | High |
| Latency | Baseline | Slightly higher, but acceptable |
Other Tips to Prevent Hallucination
- Reduce the number of retrieved docs: a few highly relevant chunks often work better than many marginal ones (see the sketch after this list).
- Use a re-ranker such as Cohere Rerank or ColBERT after retrieval.
- Add “grounding confidence” to the UI: show the user where the answer came from (also in the sketch below).
- Use the stuff chain_type only for very short documents; prefer map_reduce for longer ones (also in the sketch below).
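A minimal sketch combining the retrieval-count, source-display, and chain_type tips (the k value and the sample question are illustrative, and the re-ranking step is omitted since it requires a separate API key):
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
# Fewer, more relevant chunks: keep only the top 2
tight_retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
# map_reduce processes each retrieved chunk separately before composing the
# final answer; return_source_documents exposes the chunks that were used
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-3.5-turbo"),
    chain_type="map_reduce",
    retriever=tight_retriever,
    return_source_documents=True,
)
result = qa.invoke({"query": "What is the notice period for termination?"})
print(result["result"])
for doc in result["source_documents"]:
    print(doc.metadata)  # surface these in the UI as grounding evidence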
Summary
| Issue | Solution |
|---|---|
| Mid-sentence chunks | Use RecursiveCharacterTextSplitter |
| Keyword-based retrieval | Use MMR for better chunk diversity |
| LLM hallucination | Ground it with a strict custom prompt |
| Low-relevance results | Reduce the k value (e.g., top 2 docs only) |
Takeaway
RAG systems don’t automatically guarantee factual answers. The way you chunk, retrieve, and prompt determines the outcome. By identifying where hallucinations creep in and applying surgical fixes, we transformed a shaky RAG prototype into a reliable AI assistant.