Introduction
The performance of a Retrieval-Augmented Generation (RAG) pipeline depends heavily on the quality of its retrieval step, which can be refined through a variety of techniques and advanced strategies. Methods like query expansion, query transformations, and query construction each play a distinct role in refining the search process, broadening the scope of search queries and improving overall result quality.
In addition to core methods, strategies such as reranking (with the Cohere Reranker), recursive retrieval, and small-to-big retrieval further enhance the retrieval process.
Together, these techniques create a comprehensive and efficient approach to information retrieval, ensuring that searches are wide-ranging, highly relevant, and accurate.
Querying in LlamaIndex
As mentioned in a previous lesson, the process of querying an index in LlamaIndex is structured around several key components.
- Retrievers: These classes are designed to retrieve a set of nodes from an index based on a given query. Retrievers source the relevant data from the index.
- Query Engine: It is the central class that processes a query and returns a response object. Query Engine leverages the retrievers and the response synthesizer modules to curate the final output.
- Query Transform: It is a class that enhances a raw query string with various transformations to improve the retrieval efficiency. It can be used in conjunction with a Retriever and a Query Engine.
Incorporating the above components can lead to the development of an effective retrieval engine, complementing the functionality of any RAG-based application. However, the relevance of search results can noticeably improve with more advanced techniques like query construction, query expansion, and query transformations.
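To make the distinction between these components concrete, here is a minimal sketch of how they are typically used. It assumes an already-built index (such as the vector index constructed later in this lesson) and is meant only as an illustration.
import os
os.environ['OPENAI_API_KEY'] = '<YOUR_OPENAI_API_KEY>'

# Assumes `index` is an existing LlamaIndex index (e.g., a VectorStoreIndex).

# A retriever only fetches the relevant nodes for a query.
retriever = index.as_retriever(similarity_top_k=5)
nodes = retriever.retrieve("What does Paul Graham do?")

# A query engine combines a retriever with a response synthesizer
# to produce a final natural-language answer.
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("What does Paul Graham do?")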
Query Construction
Query construction in RAG converts user queries to a format that aligns with various data sources. This process involves transforming questions into vector formats for unstructured data, facilitating their comparison with vector representations of source documents to identify the most relevant ones. It also applies to structured data, such as databases where queries are formatted in a compatible language like SQL, enabling effective data retrieval.
The core idea is to answer user queries by leveraging the inherent structure of the data. For instance, a query like "movies about aliens in the year 1980" combines a semantic component like "aliens" (which will get better results if retrieved through vector storage) with a structured component like "year == 1980". The process involves translating a natural language query into the query language of a specific database, such as SQL for relational databases or Cypher for graph databases.
The right approach to query construction depends on the specific use case. The first category covers vector stores with metadata filtering: the MetadataFilter classes, together with an auto-retriever, translate natural language into a structured query. This involves defining the data source, interpreting the user query, extracting logical conditions, and forming the structured request. The other approach is Text-to-SQL for relational databases. Converting natural language into SQL requests poses challenges such as hallucination (inventing fictitious tables or fields) and user errors (misspellings or irregularities); these are addressed by providing the LLM with an accurate database description and using few-shot examples to guide query generation.
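As a minimal sketch of the Text-to-SQL approach, LlamaIndex provides the NLSQLTableQueryEngine, which translates a natural-language question into a SQL query against a described database. The snippet below builds a small, hypothetical in-memory SQLite table purely for illustration; the table and its contents are made up.
from sqlalchemy import create_engine, text
from llama_index import SQLDatabase
from llama_index.indices.struct_store.sql_query import NLSQLTableQueryEngine

# Hypothetical toy table used only to illustrate the flow.
engine = create_engine("sqlite:///:memory:")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE movies (title TEXT, year INTEGER)"))
    conn.execute(text("INSERT INTO movies VALUES ('Alien Harvest', 1980), ('City Lights Redux', 1995)"))

# The SQLDatabase wrapper gives the LLM an accurate description of the schema.
sql_database = SQLDatabase(engine, include_tables=["movies"])
sql_query_engine = NLSQLTableQueryEngine(sql_database=sql_database, tables=["movies"])

response = sql_query_engine.query("Which movies were released in 1980?")
print(response)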
Query construction improves RAG answer quality by inferring logical filter conditions directly from user questions, so the text chunks retrieved and passed to the LLM are already refined before final answer synthesis.
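For the metadata-filtering side, LlamaIndex exposes filter classes that can be attached to a retriever. The sketch below assumes an existing vector index (like the one built later in this lesson), a vector store that supports metadata filtering, and a hypothetical year metadata field on the nodes; it illustrates the mechanism rather than a complete auto-retriever setup.
from llama_index.vector_stores.types import ExactMatchFilter, MetadataFilters

# Structured part of the query: year == 1980 (assumes nodes store a "year" metadata field).
filters = MetadataFilters(filters=[ExactMatchFilter(key="year", value=1980)])

# The semantic part of the query ("movies about aliens") is handled by vector similarity search.
retriever = vector_index.as_retriever(filters=filters, similarity_top_k=5)
nodes = retriever.retrieve("movies about aliens")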
Query Expansion
Query expansion works by extending the original query with additional terms or phrases that are related or synonymous.
For instance, if the original query is too narrow or uses specific terminology, query expansion can include broader or more commonly used terms relevant to the topic. Suppose the original query is "climate change effects." Query expansion would involve adding related terms or synonyms to this query, such as "global warming impact," "environmental consequences," or "temperature rise implications."
One way to do this in LlamaIndex is through the synonym_expand_policy of the KnowledgeGraphRAGRetriever class. The effectiveness of query expansion is usually enhanced when it is combined with the Query Transform class.
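As a simple illustration of the idea (independent of the KnowledgeGraphRAGRetriever), an LLM can be prompted to propose related phrasings that are then issued as additional search queries. This is a minimal sketch of the concept, not the exact mechanism behind synonym_expand_policy, and it assumes the OpenAI API key is already configured.
from llama_index.llms import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")

query = "climate change effects"
prompt = (
    "Suggest three alternative search queries that are synonyms or closely "
    f"related phrasings of: '{query}'. Return one per line."
)

# Each suggested phrasing can be run as an extra retrieval query,
# and the retrieved results merged with those of the original query.
expanded_queries = str(llm.complete(prompt)).strip().split("\n")
print(expanded_queries)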
Query Transformation
Query transformations modify the original query to make it more effective in retrieving relevant information. Transformations can include changes in the query's structure, the use of synonyms, or the inclusion of contextual information.
Consider a user query like "What were Microsoft's revenues in 2021?" To enhance this query through transformations, the original query could be modified to be more like “Microsoft revenues 2021”, which is more optimized for search engines and vector DBs.
In LlamaIndex, such rewrites are handled by the Query Transform classes introduced earlier, which can be used in conjunction with a retriever or a query engine.
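A common way to apply a transformation is to wrap a query engine with a query transform. The sketch below uses HyDEQueryTransform, which rewrites the query into a hypothetical answer document before retrieval; it assumes an existing vector index such as the one built later in this lesson and is meant only as an illustration of the wiring.
from llama_index.indices.query.query_transform import HyDEQueryTransform
from llama_index.query_engine import TransformQueryEngine

# Wrap an existing query engine so every incoming query is transformed first.
base_query_engine = vector_index.as_query_engine(similarity_top_k=5)
hyde_transform = HyDEQueryTransform(include_original=True)
hyde_query_engine = TransformQueryEngine(base_query_engine, query_transform=hyde_transform)

response = hyde_query_engine.query("What were Microsoft's revenues in 2021?")
print(response)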
Query Engine
A Query engine is a sophisticated interface designed to interact with data through natural language queries. It's a system that processes queries and delivers responses. As mentioned in previous lessons, multiple query engines can be combined for enhanced functionality, catering to complex data interrogation needs.
For a more interactive experience resembling a back-and-forth conversation, a Chat Engine can be used in scenarios requiring multiple queries and responses, providing a more dynamic and engaging interaction with data.
A basic usage of query engines is to call the .as_query_engine()
method on the created Index. This section will include a step-by-step example of creating indexes from text files and utilizing query engines to interact with the dataset.
The first step is installing the required packages using the Python package manager (pip), followed by setting the API key environment variables.
pip install -q llama-index==0.9.14.post3 deeplake==3.8.8 openai==1.3.8 cohere==4.37
import os
os.environ['OPENAI_API_KEY'] = '<YOUR_OPENAI_API_KEY>'
os.environ['ACTIVELOOP_TOKEN'] = '<YOUR_ACTIVELOOP_KEY>'
The next step is downloading the text file that serves as our source document. This file is a compilation of all the essays Paul Graham wrote on his blog, merged into a single text file. You have the option to download the file from the provided URL, or you can execute these commands in your terminal to create a directory and store the file.
mkdir -p './paul_graham/'
wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O './paul_graham/paul_graham_essay.txt'
Now, use the SimpleDirectoryReader
within the LlamaIndex framework to read all files from a specified directory. This class will automatically cycle through the files, reading them as Document
objects.
from llama_index import SimpleDirectoryReader
# load documents
documents = SimpleDirectoryReader("./paul_graham").load_data()
We can now employ the ServiceContext
to divide the lengthy single document into several smaller chunks with some overlap. Following this, we can proceed to create the nodes out of the generated documents.
from llama_index import ServiceContext
service_context = ServiceContext.from_defaults(chunk_size=512, chunk_overlap=64)
node_parser = service_context.node_parser
nodes = node_parser.get_nodes_from_documents(documents)
The nodes must be stored in a vector store database to enable easy access. The DeepLakeVectorStore
class can create an empty dataset when given a path. You can use genai360
to access the processed dataset or alter the organization ID to your Activeloop username to store the data in your workspace.
from llama_index.vector_stores import DeepLakeVectorStore
my_activeloop_org_id = "genai360"
my_activeloop_dataset_name = "LlamaIndex_paulgraham_essays"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"
# Create an index over the documents
vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=False)
Your Deep Lake dataset has been successfully created!
The new database will be wrapped as a StorageContext
object, which accepts nodes to provide the necessary context for establishing relationships if needed. Finally, the VectorStoreIndex
takes in the nodes along with links to the database and uploads the data to the cloud. Essentially, it constructs the index and generates embeddings for each segment.
from llama_index.storage.storage_context import StorageContext
from llama_index import VectorStoreIndex
storage_context = StorageContext.from_defaults(vector_store=vector_store)
storage_context.docstore.add_documents(nodes)
vector_index = VectorStoreIndex(nodes, storage_context=storage_context)
Uploading data to deeplake dataset.
100%|██████████| 40/40 [00:00<00:00, 40.60it/s]
Dataset(path='hub://genai360/LlamaIndex_paulgraham_essays', tensors=['text', 'metadata', 'embedding', 'id'])
tensor htype shape dtype compression
------- ------- ------- ------- -------
text text (40, 1) str None
metadata json (40, 1) str None
embedding embedding (40, 1536) float32 None
id text (40, 1) str None
The created index serves as the basis for defining the query engine. We initiate a query engine by using the vector index object and executing the .as_query_engine()
method. The following code sets the streaming
flag to True, which reduces idle waiting time for the end user (more details on this will follow). Additionally, it employs the similarity_top_k
flag to specify the number of source documents it can consult to respond to each query.
query_engine = vector_index.as_query_engine(streaming=True, similarity_top_k=10)
The final step involves utilizing the .query()
method to engage with the source data. We can pose questions and receive answers. As mentioned, the query engine employs retrievers and a response synthesizer to formulate an answer.
streaming_response = query_engine.query(
    "What does Paul Graham do?",
)
streaming_response.print_response_stream()
Paul Graham is an artist and entrepreneur. He is passionate about creating paintings that can stand the test of time. He has also co-founded Y Combinator, a startup accelerator, and is actively involved in the startup ecosystem. While he has a background in computer science and has worked on software development projects, his primary focus is on his artistic pursuits and supporting startups.
The query engine can be configured into a streaming mode, providing a real-time response stream to enhance continuity and interactivity. This feature is beneficial in reducing idle time for end users. It allows users to view each word as generated, meaning they don't have to wait for the model to produce the entire text. To observe the impact of this feature, use the print_response_stream
method on the response object of the query engine.
Sub Question Query Engine
Sub Question Query Engine, a more sophisticated querying method, can be employed to address the challenge of responding to complex queries. This engine can generate several sub-questions from the user's main question, answer each separately, and then compile the responses to construct the final answer. First, we must modify the previous query engine by removing the streaming flag, which conflicts with this technique.
query_engine = vector_index.as_query_engine(similarity_top_k=10)
We register the created query_engine
as a tool by employing the QueryEngineTool
class and compose metadata (a description) for it. This informs the framework about the tool's function and enables it to select the most suitable tool for a given task, especially when multiple tools are available. Then, the tools we declared and the previously defined service context are used to initialize the SubQuestionQueryEngine
object.
from llama_index.tools import QueryEngineTool, ToolMetadata
from llama_index.query_engine import SubQuestionQueryEngine
query_engine_tools = [
    QueryEngineTool(
        query_engine=query_engine,
        metadata=ToolMetadata(
            name="pg_essay",
            description="Paul Graham essay on What I Worked On",
        ),
    ),
]

query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools,
    service_context=service_context,
    use_async=True,
)
The setup is ready to ask a question using the same query
method. As observed, it formulates three questions, each responding to a part of the query, and attempts to find their answers individually. A response synthesizer then processes these answers to create the final output.
response = query_engine.query(
    "How was Paul Graham's life different before, during, and after YC?"
)

print(">>> The final response:\n", response)
Generated 3 sub questions.
[pg_essay] Q: What did Paul Graham work on before YC?
[pg_essay] Q: What did Paul Graham work on during YC?
[pg_essay] Q: What did Paul Graham work on after YC?
[pg_essay] A: During YC, Paul Graham worked on writing essays and working on YC itself.
[pg_essay] A: Before YC, Paul Graham worked on a variety of projects. He wrote essays, worked on YC's internal software in Arc, and also worked on a new version of Arc. Additionally, he started Hacker News, which was originally meant to be a news aggregator for startup founders.
[pg_essay] A: After Y Combinator (YC), Paul Graham worked on various projects. He focused on writing essays and also worked on a programming language called Arc. However, he gradually reduced his work on Arc due to time constraints and the infrastructure dependency on it. Additionally, he engaged in painting for a period of time. Later, he worked on a new version of Arc called Bel, which he worked on intensively and found satisfying. He also continued writing essays and exploring other potential projects.
>>> The final response:
Paul Graham's life was different before, during, and after YC. Before YC, he worked on a variety of projects including writing essays, developing YC's internal software in Arc, and creating Hacker News. During YC, his focus shifted to writing essays and working on YC itself. After YC, he continued writing essays but also worked on various projects such as developing the programming language Arc and later its new version called Bel. He also explored other potential projects and engaged in painting for a period of time. Overall, his work and interests evolved throughout these different phases of his life.
Custom Retriever Engine
As you might have noticed, the choice of retriever and its parameters (e.g., the number of returned documents) influences the quality and relevance of the outcomes generated by the QueryEngine
. LlamaIndex supports the creation of custom retrievers. Custom retrievers are a combination of different retriever styles, creating more nuanced retrieval strategies that adapt to distinct individual queries. The RetrieverQueryEngine
operates with a designated retriever, which is specified at the time of its initialization; this choice significantly impacts the quality of the query results. Two main retriever types can be used with the RetrieverQueryEngine:
- VectorIndexRetriever fetches the top-k nodes that are most similar to the query. It focuses on relevance and similarity, ensuring the results closely align with the query's intent (this is the approach we used in the previous subsections). Use case: ideal for situations where precision and relevance to the specific query are paramount, like detailed research or topic-specific inquiries.
- SummaryIndexRetriever retrieves all nodes related to the query without prioritizing their relevance. It is less concerned with aligning closely to the specific context of the question and more about providing a broad overview. Use case: useful in scenarios where a comprehensive sweep of information is needed, regardless of direct relevance to the specific terms of the query, like exploratory searches or general overviews.
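As a brief sketch of wiring an explicitly configured retriever into a query engine, the RetrieverQueryEngine can be built from a VectorIndexRetriever. The snippet reuses the vector_index and service_context defined earlier in this lesson; swapping in a different retriever changes the retrieval behavior without touching the rest of the pipeline.
from llama_index.retrievers import VectorIndexRetriever
from llama_index.query_engine import RetrieverQueryEngine

# Configure the retriever explicitly instead of relying on as_query_engine defaults.
retriever = VectorIndexRetriever(index=vector_index, similarity_top_k=10)

custom_query_engine = RetrieverQueryEngine.from_args(
    retriever,
    service_context=service_context,
)

response = custom_query_engine.query("What does Paul Graham do?")
print(response)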
Reranking
While any retrieval mechanism capable of extracting multiple chunks from a large document can be efficient to an extent, there is always a likelihood that it will select some irrelevant candidates among the results. Reranking is re-evaluating and re-ordering search results to present the most relevant options. By eliminating the chunks with lower scores, the final context given to the LLM boosts overall efficiency as the LLM gets more concentrated information.
The Cohere Reranker improves the performance of retrieving close content. While the semantic search component is already highly capable of retrieving relevant documents, the Rerank endpoint boosts the quality of the search results, especially for complex and domain-specific queries. It sorts the search results according to their relevance to the query. It is important to note that Rerank is not a replacement for a search engine but a supplementary tool for sorting search results in the most effective way possible for the user.
The process begins with grouping documents into batches, after which the reranking model evaluates each batch and assigns relevance scores. The final step aggregates the most relevant documents from all batches to form the final retrieval response. This method ensures that the most pertinent information is highlighted and becomes the focal point of the search outcomes.
The necessary dependencies have already been installed; the only remaining step is to obtain your API key from Cohere.com and substitute it for the placeholder provided.
import cohere
import os
os.environ['COHERE_API_KEY'] = "<YOUR_COHERE_API_KEY>"
# Get your cohere API key on: www.cohere.com
co = cohere.Client(os.environ['COHERE_API_KEY'])
# Example query and passages
query = "What is the capital of the United States?"
documents = [
    "Carson City is the capital city of the American state of Nevada. At the 2010 United States Census, Carson City had a population of 55,274.",
    "The Commonwealth of the Northern Mariana Islands is a group of islands in the Pacific Ocean that are a political division controlled by the United States. Its capital is Saipan.",
    "Charlotte Amalie is the capital and largest city of the United States Virgin Islands. It has about 20,000 people. The city is on the island of Saint Thomas.",
    "Washington, D.C. (also known as simply Washington or D.C., and officially as the District of Columbia) is the capital of the United States. It is a federal district.",
    "Capital punishment (the death penalty) has existed in the United States since before the United States was a country. As of 2017, capital punishment is legal in 30 of the 50 states.",
    "North Dakota is a state in the United States. 672,591 people lived in North Dakota in the year 2010. The capital and seat of government is Bismarck.",
]
We define a rerank request by passing both the query and the documents. We also set the top_n argument to 3, specifically instructing the endpoint to return the three highest-scored candidates. In this case, the model employed for reranking is rerank-english-v2.0.
# Change top_n to change the number of results returned.
# If top_n is not passed, all results will be returned.
results = co.rerank(query=query, documents=documents, top_n=3, model='rerank-english-v2.0')

for idx, r in enumerate(results):
    print(f"Document Rank: {idx + 1}, Document Index: {r.index}")
    print(f"Document: {r.document['text']}")
    print(f"Relevance Score: {r.relevance_score:.2f}")
    print("\n")
Document Rank: 1, Document Index: 3
Document: Washington, D.C. (also known as simply Washington or D.C., and officially as the District of Columbia) is the capital of the United States. It is a federal district. The President of the USA and many major national government offices are in the territory. This makes it the political center of the United States of America.
Relevance Score: 0.99
Document Rank: 2, Document Index: 1
Document: The Commonwealth of the Northern Mariana Islands is a group of islands in the Pacific Ocean that are a political division controlled by the United States. Its capital is Saipan.
Relevance Score: 0.30
Document Rank: 3, Document Index: 5
Document: Capital punishment (the death penalty) has existed in the United States since before the United States was a country. As of 2017, capital punishment is legal in 30 of the 50 states. The federal government (including the United States military) also uses capital punishment.
Relevance Score: 0.27
This can be accomplished using LlamaIndex in conjunction with Cohere Rerank. The rerank object can be integrated into a query engine, allowing it to manage the reranking process seamlessly in the background. We will reuse the same vector index defined earlier to avoid repeating code. The CohereRerank class initializes a reranker by taking in the API key and the number of documents to return after scoring.
import os
from llama_index.postprocessor.cohere_rerank import CohereRerank
cohere_rerank = CohereRerank(api_key=os.environ['COHERE_API_KEY'], top_n=2)
Now, we can employ the same as_query_engine method and use the node_postprocessors argument to incorporate the reranker object. The retriever initially selects the top 10 documents based on semantic similarity, and the reranker then reduces this number to 2.
query_engine = vector_index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[cohere_rerank],
)

response = query_engine.query(
    "What did Sam Altman do in this essay?",
)
print(response)
Sam Altman was asked if he wanted to be the president of Y Combinator (YC) and initially said no. However, after persistent persuasion, he eventually agreed to take over as president starting with the winter 2014 batch.
The reranking process in search systems offers numerous advantages, including practicality, enhanced performance, simplicity, and integration capabilities. It allows for augmenting existing systems without requiring complete overhauls, making it a cost-effective solution for improving search functionality. Reranking elevates search systems, which is particularly useful for complex, domain-specific queries in embedding-based systems.
The Cohere Rerank has proven to be effective in improving search quality across various embeddings, making it a reliable option for enhancing search results.
Advanced Retrievals
An alternative method for retrieving relevant documents involves using document summaries instead of extracting fragmented snippets or brief text chunks to respond to queries. This technique ensures that the answers reflect the entire context or topic being examined, offering a more thorough grasp of the subject.
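One way to experiment with this idea in LlamaIndex is the DocumentSummaryIndex, which builds a summary per document at index time and retrieves whole documents through those summaries. Below is a minimal sketch, reusing the documents and service_context from earlier in the lesson; it is an illustration rather than a tuned setup.
from llama_index import get_response_synthesizer
from llama_index.indices.document_summary import DocumentSummaryIndex

# Summaries are generated with a tree-summarize synthesizer when the index is built.
response_synthesizer = get_response_synthesizer(response_mode="tree_summarize")

doc_summary_index = DocumentSummaryIndex.from_documents(
    documents,
    service_context=service_context,
    response_synthesizer=response_synthesizer,
)

summary_query_engine = doc_summary_index.as_query_engine()
print(summary_query_engine.query("What does Paul Graham do?"))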
Recursive Retrieval
The recursive retrieval method is particularly effective for documents with a hierarchical structure, where nodes form relationships and connections with one another. According to Jerry Liu, founder of LlamaIndex, this is evident in cases like a PDF, which may contain "sub-data" such as tables and diagrams, alongside references to other documents. The technique can precisely navigate through the graph of connected nodes to locate information, and it is versatile enough to be applied in various scenarios, such as with node references, document agents, or even the query engine. For practical applications, including processing a PDF file and utilizing data from tables, you can refer to the tutorials in the LlamaIndex documentation.
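The exact setup depends on the document structure, but a minimal sketch of the mechanism looks like this: an IndexNode in a top-level index points, via its index_id, to another query engine, and the RecursiveRetriever follows that reference whenever the node is retrieved. The summary text and identifiers below are made up for illustration, and the example reuses vector_index and service_context from earlier.
from llama_index import VectorStoreIndex
from llama_index.schema import IndexNode
from llama_index.retrievers import RecursiveRetriever
from llama_index.query_engine import RetrieverQueryEngine

# A query engine over the underlying data (here, the essay index built earlier).
essay_engine = vector_index.as_query_engine(similarity_top_k=5)

# A top-level node whose index_id points at that query engine.
summary_node = IndexNode(
    text="Paul Graham's essays: what he worked on, Y Combinator, Arc, and painting.",
    index_id="pg_essays",
)
top_level_index = VectorStoreIndex([summary_node], service_context=service_context)

# When the top-level retriever returns the IndexNode, the recursive retriever
# follows the reference and queries the mapped engine instead of returning the node.
recursive_retriever = RecursiveRetriever(
    "root",
    retriever_dict={"root": top_level_index.as_retriever(similarity_top_k=1)},
    query_engine_dict={"pg_essays": essay_engine},
    verbose=True,
)

recursive_query_engine = RetrieverQueryEngine.from_args(
    recursive_retriever, service_context=service_context
)
print(recursive_query_engine.query("What did Paul Graham work on at YC?"))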
Small-to-Big Retrieval
The small-to-big retrieval approach is a strategic method for information search, starting with concise, focused sentences to pinpoint the most relevant section of content with a question. It then passes a longer text to the model, allowing for a broader understanding of the context preceding and following the targeted area. This technique is particularly useful in situations where the initial query may not encompass the entirety of relevant information or where the data's relationships are intricate and multi-layered.
The LlamaIndex framework implements this idea through the Sentence Window Retrieval technique, which uses the SentenceWindowNodeParser class to break documents down into individual sentences, one per node. Each node also stores a "window" of the sentences surrounding the main node sentence (a configurable number of sentences before and after). During retrieval, the single sentences initially retrieved are substituted with their respective windows, including the adjacent sentences, through the MetadataReplacementPostProcessor. This substitution ensures that the Large Language Model receives a comprehensive view of the context surrounding each sentence.
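Here is a minimal sketch of that setup, reusing the documents loaded earlier in the lesson and assuming the class and parameter names of the LlamaIndex version used here; treat it as an illustration rather than a tuned configuration.
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index.postprocessor import MetadataReplacementPostProcessor

# Parse documents into one sentence per node, storing the surrounding
# sentences in a "window" metadata field.
sentence_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
sentence_context = ServiceContext.from_defaults(node_parser=sentence_parser)
sentence_index = VectorStoreIndex.from_documents(
    documents, service_context=sentence_context
)

# At query time, each retrieved sentence is replaced by its window of context
# before being passed to the LLM.
window_query_engine = sentence_index.as_query_engine(
    similarity_top_k=5,
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)
print(window_query_engine.query("What does Paul Graham do?"))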
You can follow a hands-on tutorial for implementing this technique in the LlamaIndex documentation.
Conclusion
Effective information retrieval involves mastering techniques such as query expansion, query transformations, and query construction, coupled with advanced strategies like reranking, recursive retrieval, and small-to-big retrieval. Together, these techniques enhance the search process by increasing accuracy and broadening the range of results. By incorporating these methods, information retrieval systems become more proficient in providing precise results, essential for improving the performance of RAG-based applications.
>> Notebook.
RESOURCES:
- Cohere Rerank notebook
- Recursive retrieval
- LlamaIndex notebook