RAG Best Practices
Optimizing RAG Retrieval for Unstructured Documents
Which RAG optimizations from research actually work, and how enumerating combinations of methods helps with fine-tuning
RAG (Retrieval Augmented Generation) has been, and still is, the most effective way to integrate LLMs with external context. Today, enterprise AI SaaS companies (including many of our customers) are building context-aware AI products by ingesting their customers’ internal documents from file storage integrations like Google Drive and SharePoint as context for RAG.
RAG is actually fairly simple to understand and is built on tested methods and research: give your LLM external data by querying a vector database with vector search, a practice that was in use long before LLMs and generative AI came onto the scene. If you want a quick overview of the basic components of RAG, read this article and come back to this one.

However, as you may know if you've tried to take a RAG application to production, LLMs are not perfect, and even with RAG, response accuracy is always a challenge. This is why, while it's easy to spin up a RAG chatbot POC, optimizing it for production is quite challenging. What has been really interesting is the variety of practices that can modify this basic workflow to produce more relevant answers. Not only have companies been optimizing RAG workflows, academic research has also gone into this problem space.
In this article, we explore how the RAG optimization methods discussed in academic research apply in a B2B SaaS context, and extend that research by performing a grid search to examine interaction effects.
High Level Summary
From reading as many papers on “RAG Best Practices” as we could find, we identified:
Vector Retrieval (dense vs hybrid)
Reranking
Summarization
as the prevailing RAG optimization methods that researchers have found most additive to RAG performance.
To answer whether these methods apply to more relevant B2B SaaS use cases, we put together 3 knowledge bases for different B2B SaaS use cases, such as RAG over technical documentation. Each knowledge base had ~100 documents, which allowed us to generate 200-300 test cases (prompts with expected answers) to test combinations of RAG optimization methods against.
What we found was that hybrid retrieval with summarization and without reranking saw the best performance in both average answer relevancy and irrelevant answer frequency. These settings achieved:
2.5% lift in average answer relevancy
51% decrease in irrelevant answers
The results of this study demonstrate that the studied optimization methods have noticeable impacts on performance. They also imply that grid searching is a necessary component of comprehensively examining how settings interact with one another. Case in point: while reranking would have been expected to be optimal, we found that it wasn’t necessary with our given settings.
Literature Review
Researching the work already done in academia, we read through several papers focused on RAG best practices. The paper we referenced most frequently was “Searching for Best Practices in Retrieval-Augmented Generation” by researchers at Fudan University, which was one of the most cited and took a very holistic approach, exploring several RAG workflow methods: query classification, chunking, embedding models, vector databases, retrieval methods, reranking, repacking, and summarization.
What they found was that hybrid search, reranking, reverse repacking, and summarization achieved the highest performance when evaluated across various general datasets (don’t worry if you don’t understand these methods yet; we’ll go over them below).
Other papers we referenced in our background research are “Enhancing Retrieval-Augmented Generation: A Study of Best Practices” from a team at the University of Tübingen in Germany and “Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization” from researchers at Tarbiat Modares University in Tehran. The common thread we found across these papers was that they also identified reranking and summarization as effective RAG optimization methods.
Explained: Research-based Best Practices
Distilling the results from the academic literature, we identified hybrid retrieval, reranking, and summarization as the main overarching strategies for optimizing RAG retrieval. These are the three methods we evaluated in this article across a few metrics and datasets relevant to B2B SaaS. But first, let’s give some more context on each of these strategies.
Hybrid Retrieval Strategy
There are two main ways to retrieve vector embeddings. Dense vector retrieval uses dense vectors, which specialize in capturing the semantic meaning of a text. In contrast, sparse vectors capture keywords: each dimension in the embedding reflects the importance of a specific word in that document.
Hybrid retrieval uses both dense and sparse vectors to search, thus taking advantage of semantic as well as keyword search.
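As a minimal sketch (not our exact implementation - that is described in the appendix), hybrid retrieval can be thought of as blending a dense (semantic) score and a sparse (keyword) score for each candidate chunk, for example with a weighted sum:

```python
# Illustrative hybrid scoring: blend a dense (semantic) similarity score with a
# sparse (keyword) score for each candidate chunk. The alpha weighting and the
# precomputed per-chunk scores are assumptions for this sketch, not our exact setup.

def hybrid_score(dense_score: float, sparse_score: float, alpha: float = 0.5) -> float:
    # alpha = 1.0 -> pure semantic search; alpha = 0.0 -> pure keyword search
    return alpha * dense_score + (1 - alpha) * sparse_score

def hybrid_retrieve(chunks: list[dict], top_k: int = 5, alpha: float = 0.5) -> list[dict]:
    # Each chunk carries a dense score (e.g. cosine similarity from a dense index)
    # and a sparse score (e.g. BM25 from a keyword index) for the current query.
    scored = sorted(
        chunks,
        key=lambda c: hybrid_score(c["dense_score"], c["sparse_score"], alpha),
        reverse=True,
    )
    return scored[:top_k]
```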

Reranking
Reranking is an optional step in the RAG workflow: after chunks are retrieved and ranked by a vector similarity metric (cosine distance, dot product, and other measures of how semantically similar two vectors are), a model evaluates each retrieved chunk for relevancy and re-ranks them before they are given to the LLM for answer synthesis.
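As a rough sketch of this step, here is reranking with an open-source cross-encoder from sentence-transformers; this model is only a stand-in for the cohere-rerank-3.5 model we actually used (see the appendix):

```python
# Illustrative reranking step with a cross-encoder (a stand-in for cohere-rerank-3.5).
from sentence_transformers import CrossEncoder

def rerank(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    # A cross-encoder scores each (query, chunk) pair jointly, which typically gives a
    # better relevancy ordering than the similarity metric used during initial retrieval.
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]
```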

Summarization
Summarization is what it sounds like: compressing prompts or context into shorter text using an intermediate model to cut noise and better focus on relevant information. This is especially useful for the “lost in the middle” problem, where details in the middle of a longer text can be ignored by an LLM, similar to how people tend to remember the beginning and/or end of something over the middle (primacy and recency bias).
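As a rough sketch, context compression with Microsoft’s LLMLingua library (we used its LongLLMLingua variant; see the appendix for our actual settings) looks roughly like the following - treat the exact arguments as illustrative:

```python
# Illustrative context compression with LLMLingua. The chunks and question below
# are placeholders; see the appendix for the settings we actually used
# (gpt-2 as the small model, ~500 target tokens).
from llmlingua import PromptCompressor

retrieved_chunks = ["<retrieved chunk 1>", "<retrieved chunk 2>"]  # placeholder context
user_prompt = "How do I configure billing alerts?"                 # placeholder question

compressor = PromptCompressor(model_name="gpt2")  # small model that scores token importance

compressed = compressor.compress_prompt(
    context=retrieved_chunks,
    question=user_prompt,   # keeps tokens relevant to the question
    target_token=500,       # compress the context to roughly 500 tokens
)
print(compressed["compressed_prompt"])
```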

Methodology
In this study, we wanted to evaluate whether these overarching insights from the academic literature apply to more SaaS-relevant use cases. Additionally, we wanted to identify the most impactful combination of settings for RAG response relevancy, using a grid search analysis where we tested combinations of retrieval strategy, reranking, and summarization settings.
Our methodology:
Created 3 different knowledge bases representative of SaaS use cases, using web crawling to gather ~100 documents per knowledge base
SoCalGas policies - documents on billing, payment, services, etc.
simulates internal policy documentation
Paragon documentation - technical SDK and product documentation
simulates technical documentation
Y-Combinator Directory - Pages on different startups and their solutions
simulates internal research on technical topics
Synthetically generated ~5 prompts per document (totaling 700+ prompts, aka test cases) using that document’s content as context. This ensures RAG is necessary to answer our prompts
For each combination of RAG methods (retrieval strategy, reranking, summarization), generated responses for each of our test case prompts
Evaluated each test case for answer relevancy, faithfulness, and contextual relevancy
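To make the grid search concrete, here is a minimal sketch of the evaluation loop. The run_rag_pipeline and evaluate_response helpers and the test_cases list are hypothetical stand-ins for our actual retrieval/generation pipeline, DeepEval-based scoring, and synthetic prompt/expected-answer pairs:

```python
# Sketch of the grid search over RAG settings:
# 2 retrieval strategies x 2 reranking options x 2 summarization options = 8 runs.
from itertools import product

retrieval_strategies = ["dense", "hybrid"]
reranking_options = [False, True]
summarization_options = [False, True]

results = []
for retrieval, rerank, summarize in product(
    retrieval_strategies, reranking_options, summarization_options
):
    for test_case in test_cases:  # placeholder: the synthetic prompt/expected-answer pairs
        # Placeholder: run retrieval (plus optional rerank/summarization) and answer synthesis
        answer, retrieved_context = run_rag_pipeline(
            prompt=test_case["prompt"],
            retrieval=retrieval,
            rerank=rerank,
            summarize=summarize,
        )
        # Placeholder: score answer relevancy, faithfulness, and contextual relevancy
        scores = evaluate_response(test_case, answer, retrieved_context)
        results.append(
            {"retrieval": retrieval, "rerank": rerank, "summarize": summarize, **scores}
        )
```

Averaging the scores in results per settings combination gives the per-combination metrics discussed in the Results section.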

Here are a few samples of our fully run test cases:
We have the settings combination, prompt input, expected output generated from the synthetic prompt process, and the actual output generated from the RAG workflow.

For each test case, we also have the expected context that answers the prompt and the actual context retrieved by our RAG workflow. Lastly, we have our evaluation metrics with generated reasoning for each metric.

*For even more detail on our methodology, an appendix can be found at the bottom.
Results
Overall Results
Interestingly, the results deviated slightly from the research papers we referenced. The best performing combination of settings in terms of average answer relevancy - the most comprehensive metric of RAG performance - was hybrid search with summarization and without reranking.
*Green highlighting is for the best performing combination, whereas the gray highlighting denotes the base implementation of RAG (generally just using dense vector retrieval without additional settings)

Comparing this with a base implementation of RAG, our optimized settings introduced a 2.5% lift in average answer relevancy.
Another metric we wanted to add is a measurement of “irrelevant answers” (whenever answer relevancy is below 0.5). In many RAG applications, an irrelevant answer hurts more than a good answer helps. Given that, we may want to trade off some average performance if a combination of settings reduces these misses. What we saw is that hybrid search with summarization and without reranking cut irrelevant answer frequency by 51% (from 33 irrelevant responses to 16 out of 780 test cases), in addition to having the highest average answer relevancy.
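For reference, this is roughly how the irrelevant-answer frequency can be computed from per-test-case scores; the results list and answer_relevancy field here follow the hypothetical grid search sketch from the Methodology section:

```python
# Irrelevant-answer frequency: share of test cases whose answer relevancy falls below 0.5.
IRRELEVANT_THRESHOLD = 0.5

def irrelevant_rate(results: list[dict], retrieval: str, rerank: bool, summarize: bool) -> float:
    subset = [
        r for r in results
        if (r["retrieval"], r["rerank"], r["summarize"]) == (retrieval, rerank, summarize)
    ]
    misses = sum(1 for r in subset if r["answer_relevancy"] < IRRELEVANT_THRESHOLD)
    return misses / len(subset)
```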

Additional Metrics
Looking at faithfulness and contextual relevancy, we actually see the dense retrieval, reranking, and summarization combination achieve the highest scores, signaling that dense vector search has the highest impact on faithfulness and contextual relevancy, and that reranking and summarization have a positive impact. Hybrid search still uses dense retrieval, which helps explain why its answer relevancy is not greatly hurt by slightly less relevant context or slightly lower faithfulness to the grounding context.


Further breakdown by source
Breaking this down further by knowledge base, we see that there is one knowledge base where hybrid retrieval with summarization and without reranking is not the optimal setting.

With the SoCalGas knowledge base, dense retrieval saw the highest answer relevancy. Our hypothesis is that these documents were the most disparate and lowest quality. This suggests that less comprehensive knowledge bases may have different optimal settings, although more experiments are needed to confirm this hypothesis.
Implications
From our research, we saw the best RAG performance both in answer relevancy and frequency of irrelevant answers using:
Hybrid retrieval
Context summarization using LLMLingua methodology
Without reranking
This mostly aligns with the academic research, as hybrid retrieval and summarization have been shown to be efficacious across both general datasets and domain-specific ones like medical knowledge.
Seeing a slightly higher score without reranking is a non-intuitive observation, as we assumed reranking would only help an LLM synthesize relevant answers. A possible takeaway is that the original ranking metric (dot product for hybrid retrieval) was sufficient for ranking documents, making reranking with an intermediate model unnecessary, at least with our settings.
Importance of testing combinations (grid searching)
What this study also demonstrated is the importance of grid searching and investigating interaction effects. Experimenting with settings individually is of course less compute- and time-intensive. Grid searching adds complexity combinatorially, where each additional setting drastically increases time and compute - in our study we evaluated 2 retrieval strategies, 2 reranking settings (with and without), and 2 summarization settings (with and without), requiring eight total evaluation runs (2x2x2=8). Experimenting with additional settings, like forward vs. reverse vs. sides repacking or different top-K values, would increase the number of evaluation runs even more.
However, grid searching’s brute-force approach comprehensively identifies interaction effects. Not using reranking was an unintuitive result, but for our temperature, top-K, LLM choice, and other static settings, leaving out reranking proved to be the optimal choice. This would have been difficult to identify without a grid search analysis.
Wrapping Up
In our study, we evaluated RAG performance using settings recommended by popular academic papers. What we found was fairly consistent with the academic research, in that hybrid retrieval and summarization were optimal; however, reranking did not prove to be as efficacious when using our SaaS-relevant knowledge bases.
Hybrid retrieval with summarization and without reranking not only had the highest average answer relevancy (+2.5% over baseline), but also the lowest frequency of irrelevant answers (~51% reduction from baseline). While the average answer relevancy lift may not look impressive at first glance, RAG retrieval over unstructured documents with base settings is already fairly performant. Every incremental improvement will be small, but important for building trust and adoption for your AI product.
If you’re interested in educational content on building production-ready AI products, we encourage you to subscribe to our monthly newsletter: Inference. If you’re interested in building file storage integrations, we encourage you to learn about how Paragon can help with this use case, build your own RAG knowledge chatbot with our tutorial, or reach out to our team if you’d like to talk about how Paragon can help with integrations for your AI product.
Appendix:
Methodology Detail:
To provide more context on our RAG workflow methods, we wanted to share our implementations of hybrid retrieval, reranking, and summarization.
Retrieval strategy
For both dense and hybrid vector retrieval, we used a default setting of top_k = 5
Our implementation of hybrid search came from Pinecone’s guide on hybrid search:
Indexed two separate vector indices: a dense index and a sparse index
For each prompt, retrieve top K chunks using dense retrieval and top K chunks using sparse retrieval
Deduplicate chunks if there’s an overlap
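A minimal sketch of this retrieve-and-deduplicate flow, assuming dense_index and sparse_index clients whose query method returns matches that each carry a unique chunk id (the exact client API will vary by vector database):

```python
# Two-index hybrid retrieval: query a dense index and a sparse index separately,
# then merge and deduplicate the results by chunk id.
def hybrid_retrieve(dense_query_vector, sparse_query_vector, dense_index, sparse_index, top_k=5):
    dense_matches = dense_index.query(vector=dense_query_vector, top_k=top_k)
    sparse_matches = sparse_index.query(vector=sparse_query_vector, top_k=top_k)

    # Deduplicate: if a chunk was retrieved by both indices, keep it only once.
    merged = {}
    for match in dense_matches + sparse_matches:
        merged.setdefault(match["id"], match)
    return list(merged.values())
```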
Reranking
Used cohere-rerank-3.5 to rerank text from retrieved chunks
Summarization
Used LongLLMLingua, a Python library from Microsoft that uses small models to condense long prompts and context to mitigate the “lost in the middle” problem
Used gpt-2 as our small model to generate summaries, with a target size of 500 tokens
Decided on LLMLingua for summarization, as Recomp (a summarization model discussed in one of the papers) requires additional training on the knowledge base
Some additional settings we kept consistent across iterations:
LLM = gpt-4o-mini
Temperature = 0.25
Chunk size = 512; overlap = 20
We also used DeepEval’s framework for evaluating answer relevancy, faithfulness, and contextual relevancy (a minimal example is sketched below). Their library has been a huge help.
Evaluation LLM = gpt-4o-mini
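For illustration, scoring a single test case with DeepEval looks roughly like this; the metric classes are DeepEval’s, while the inputs are placeholder values standing in for one of our generated test cases:

```python
# Score one RAG test case with DeepEval's answer relevancy, faithfulness,
# and contextual relevancy metrics (placeholder inputs).
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualRelevancyMetric,
)

test_case = LLMTestCase(
    input="<prompt from the synthetic test set>",
    actual_output="<answer generated by the RAG workflow>",
    expected_output="<expected answer from the synthetic test set>",
    retrieval_context=["<retrieved chunk 1>", "<retrieved chunk 2>"],
)

for metric in (
    AnswerRelevancyMetric(model="gpt-4o-mini"),
    FaithfulnessMetric(model="gpt-4o-mini"),
    ContextualRelevancyMetric(model="gpt-4o-mini"),
):
    metric.measure(test_case)
    print(type(metric).__name__, metric.score, metric.reason)
```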