Fine-tuning vs RAG; Introduction to Activeloop’s Deep Memory

Introduction

In this lesson, we will explore optimization techniques that maximize large language model performance. We will learn about the appropriate use of prompt engineering, retrieval-augmented generation (RAG), and fine-tuning, distinguishing what each method contributes and the specific challenges it presents.

A significant portion of the lesson will be dedicated to the limitations of RAG systems in real-world applications, chiefly maintaining high retrieval accuracy and ensuring accurate responses from LLMs. Much of our discussion will center on Activeloop's Deep Memory, a technique designed to improve the precision of embedding retrieval for user queries.

We will also perform a detailed comparison of empirical data, analyzing the differences in retrieval recall rates between systems employing Deep Memory and those that do not.

Overview of RAG Enhancement Techniques

Expanding on the discussion surrounding fine-tuning, retrieval-augmented generation, and prompt engineering, it's essential to understand each approach's distinct strengths, weaknesses, and most suitable applications.

Prompt engineering

Prompt engineering is often the first step in enhancing the performance of an LLM for specific tasks, and it alone can be sufficient, especially for simpler or well-defined tasks. Techniques like few-shot prompting, which provides a small number of task-specific examples to guide the LLM, can notably improve task performance. Chain-of-Thought (CoT) prompting can also improve reasoning capabilities and encourage the model to generate more detailed, step-by-step responses.

Combining Few-shot with RAG—using a tailored dataset of examples to retrieve the most relevant information for each query—can be more effective.
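As a quick illustration, here is a sketch of a few-shot prompt with worked, step-by-step examples sent through the OpenAI chat API; the task, the example questions, and the model name are placeholders you would adapt to your own use case.

from openai import OpenAI

client = OpenAI()

# A few-shot prompt: two worked examples guide the model, and the
# "Let's think step by step" cue encourages chain-of-thought reasoning.
few_shot_messages = [
    {"role": "system", "content": "You answer arithmetic word problems. Show your reasoning, then give the final answer."},
    {"role": "user", "content": "A box holds 12 eggs. How many eggs are in 3 boxes?"},
    {"role": "assistant", "content": "Each box holds 12 eggs, so 3 boxes hold 3 * 12 = 36 eggs. Answer: 36"},
    {"role": "user", "content": "A train travels 60 km per hour for 2 hours. How far does it go?"},
    {"role": "assistant", "content": "Distance is speed times time, so 60 * 2 = 120 km. Answer: 120 km"},
    {"role": "user", "content": "Let's think step by step. A shop sells pens in packs of 5. How many pens are in 7 packs?"},
]

response = client.chat.completions.create(model="gpt-3.5-turbo", messages=few_shot_messages)
print(response.choices[0].message.content)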

Fine-tuning

Fine-tuning enhances an LLM's capabilities in the following areas:

  1. Modifying the structure or tone of responses.
  2. Teaching the model to follow complex instructions.

For example, fine-tuning enables models to perform tasks like extracting JSON-formatted data from text, translating natural language into SQL queries, or adopting a specific writing style.
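As a concrete illustration, a supervised fine-tuning dataset is usually a collection of prompt/response pairs. The sketch below writes a couple of hypothetical natural-language-to-SQL examples in a chat-style JSONL format; the exact schema depends on the fine-tuning framework or provider you use.

import json

# Hypothetical examples for a natural-language-to-SQL fine-tuning dataset.
examples = [
    {
        "messages": [
            {"role": "system", "content": "Translate the user's request into a SQL query."},
            {"role": "user", "content": "List the names of all customers from France."},
            {"role": "assistant", "content": "SELECT name FROM customers WHERE country = 'France';"},
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "Translate the user's request into a SQL query."},
            {"role": "user", "content": "Count the orders placed in 2023."},
            {"role": "assistant", "content": "SELECT COUNT(*) FROM orders WHERE YEAR(order_date) = 2023;"},
        ]
    },
]

# Write one JSON object per line (JSONL), a common format for fine-tuning data.
with open("finetune_data.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")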

Fine-tuning demands a large, high-quality, task-specific dataset for effective training, but you can start with a small dataset and a short training run to check whether the method works for your task.

Fine-tuning is less effective in adapting to new, rapidly changing data or unfamiliar queries beyond the training dataset. It's also not the best choice for incorporating new information into the model. Alternative methods, such as Retrieval-Augmented Generation, are more suitable.

Retrieval-Augmented Generation

RAG specializes in incorporating external knowledge, enabling the model to access current and varied information.

Real-Time Updates: It is more adept at dealing with evolving datasets and can provide more up-to-date responses.

Complexity in Integration: Setting up a RAG system is more complex than basic prompting, requiring extra components like a Vector Database and retrieval algorithms.

Data Management: Managing and updating the external data sources is crucial for maintaining the accuracy and relevance of its outputs.

Retrieval Accuracy: Ensuring precise embedding retrieval is crucial in RAG systems to guarantee reliable and comprehensive responses to user queries. To address this, we will demonstrate how Activeloop's Deep Memory can greatly increase the recall of embedding retrieval.
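To make these components concrete, here is a minimal retrieve-then-generate sketch. It assumes a vector store object with a similarity_search method (like the Deep Lake store set up later in this lesson) and an OpenAI client; the prompt wording and k value are illustrative choices, not a prescribed setup.

# Minimal RAG loop: retrieve relevant chunks, then ask the LLM to answer from them.
def rag_answer(db, client, question, k=3):
    # 1. Retrieval: fetch the top-k most similar documents from the vector store.
    docs = db.similarity_search(query=question, k=k)
    context = "\n\n".join(doc.page_content for doc in docs)

    # 2. Generation: ground the LLM's answer in the retrieved context.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content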

RAG + Fine-tuning

Fine-tuning and RAG are not mutually exclusive techniques. Fine-tuning brings the advantage of customizing models for a specific style or format, which can be useful when applying LLMs to specialized domains such as medicine, finance, or law that require a highly specialized tone of writing.

When combined with RAG, the model becomes adept in its specialized area and gains access to a vast range of external information. The resulting model provides accurate responses in the niche area.

Implementing both methods can demand considerable resources for setup and ongoing upkeep, since it involves repeated fine-tuning runs on top of the data-handling requirements inherent to RAG.


Enhanced RAG with Deep Memory

Deep Memory is a method developed by Activeloop to boost the accuracy of embedding retrieval for RAG systems built on the Deep Lake vector store.

Central to its functionality is an embedding transformation process. Deep Memory trains a model that transforms embeddings into a space optimized for your use case. This reconfiguration significantly improves vector search accuracy.

Deep Memory is effective in situations where query reformulation, query transformation, or document re-ranking would add latency and token usage: it boosts retrieval accuracy without negatively impacting the system's performance.

The figure below shows the recall performance for different algorithms compared to Deep Memory.

Recall@1: This measures whether the top result (i.e., the first result) returned by the retrieval system is relevant to the query.

Recall@10: This metric assesses whether the relevant document is within the top 10 results returned by the retrieval system.
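Here is a small sketch of how these metrics can be computed from retrieval results; the document ids below are made up for illustration.

def recall_at_k(retrieved_ids, relevant_id, k):
    # 1 if the relevant document appears in the top-k results, else 0.
    return int(relevant_id in retrieved_ids[:k])

def average_recall_at_k(all_retrieved, all_relevant, k):
    # Mean of the per-query recall@k values.
    scores = [recall_at_k(r, rel, k) for r, rel in zip(all_retrieved, all_relevant)]
    return sum(scores) / len(scores)

# Example: the relevant doc is ranked 3rd, so recall@1 = 0 and recall@10 = 1.
print(recall_at_k(["d7", "d2", "d5"], "d5", 1))   # 0
print(recall_at_k(["d7", "d2", "d5"], "d5", 10))  # 1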

Comparison to Lexical search

BM25 is considered a state-of-the-art approach for "lexical search," which ranks documents based on the explicit presence of the query's terms in the documents. It's particularly effective for applications where the relevance of documents depends heavily on the presence of specific terms, such as in traditional search engines. However, BM25 does not account for the semantic relationships between words, which is where more advanced techniques like vector search with neural embeddings and semantic search come into play.
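For illustration, the snippet below runs lexical search over a tiny corpus with the open-source rank_bm25 package (one of several BM25 implementations); the corpus and query are toy examples.

from rank_bm25 import BM25Okapi  # pip install rank_bm25

corpus = [
    "The tokenizer encodes text into token ids.",
    "Vector databases store embeddings for semantic search.",
    "BM25 ranks documents by term frequency and inverse document frequency.",
]

# BM25 works on token lists; a simple whitespace split is enough for a demo.
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "how does bm25 rank documents".split()
print(bm25.get_scores(query))             # one lexical-match score per document
print(bm25.get_top_n(query, corpus, n=1)) # best-matching document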

Overview of Deep Memory

In the figure above, we see the Inference and Training workflow:

  1. Embeddings: Vector representation of a text sentence or set of words. We can create them using embedding models such as OpenAI’s text-embedding-ada-002 or open-source models.
  2. Deep Memory Training: A dataset of query and context pairs trains the Deep Memory model. This training process runs on the Deep Lake service, which provides the computational resources and infrastructure for handling the training.
  3. Deep Memory Inference: The model enters the inference phase after training, which transforms query embeddings. We can use the Tensor Query Language (TQL) when running an inference/querying in the Vector Store.
  4. Transformed Embeddings: The result of the inference process is a set of transformed embeddings optimized for a specific use case. This optimization means that the embeddings are now in a more conducive space for returning accurate results.
  5. Vector Search: These optimized embeddings are used in a vector search with standard similarity measures (e.g., cosine similarity), leveraging the refined embeddings to retrieve the most relevant data points for a given query.
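To build intuition for steps 2-5, here is a purely conceptual sketch, not Activeloop's actual implementation: the linear transform below is only a stand-in for whatever model Deep Memory learns from query/context pairs, and the embeddings are random toy data.

import numpy as np

rng = np.random.default_rng(0)

# Toy data: 4 document embeddings and 1 query embedding of dimension 8.
doc_embeddings = rng.normal(size=(4, 8))
query_embedding = rng.normal(size=(8,))

# Stand-in for a trained Deep Memory model: here just a random linear map.
# In reality the transformation is learned from query/context pairs.
transform = rng.normal(size=(8, 8))
transformed_query = transform @ query_embedding

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Standard vector search: rank documents by similarity to the transformed query.
scores = [cosine_similarity(transformed_query, d) for d in doc_embeddings]
print(np.argsort(scores)[::-1])  # document indices, most similar first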

Step by Step - Training a Deep Memory Model

Moving forward in our lesson, let's implement Deep Memory within our experimental workflow to see firsthand how it impacts retrieval recall.

You can follow along with this Colab notebook.

  1. Install the required libraries
!pip3 install deeplake langchain openai tiktoken
  2. Set your ACTIVELOOP_TOKEN and OPENAI_API_KEY
import os, getpass
os.environ['ACTIVELOOP_TOKEN'] = getpass.getpass()
os.environ['OPENAI_API_KEY'] = getpass.getpass()
  3. Module import and Vector Store setup. By importing OpenAIEmbeddings, we are choosing text-embedding-ada-002 as our embedding model.

# Import all the modules and dataset
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores.deeplake import DeepLake

dataset_path = "hub://genai360/LlamaIndex_paulgraham_essay_managed"
openai_embeddings = OpenAIEmbeddings()

db = DeepLake(
    dataset_path=dataset_path,
    embedding=openai_embeddings,
    runtime={"tensor_db": True},
    read_only=True,
)
  4. Fetch our docs and ids from the vector store.
# Fetch dataset docs and ids 
docs = db.vectorstore.dataset.text.data(fetch_chunks=True, aslist=True)['value']
ids = db.vectorstore.dataset.id.data(fetch_chunks=True, aslist=True)['value']
print(len(docs))
  5. Generate a synthetic training dataset.

We need labeled data (query and document_id pairs) to train a Deep Memory model. Sometimes, it can be difficult to get labeled data when you are starting from scratch. This tutorial generates queries/questions using gpt-3.5-turbo from our existing documents.

from openai import OpenAI
client = OpenAI()

def generate_question(text):
    try:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo-1106",
            messages=[
                {"role": "system", "content": "You are a world class expert for generating questions based on provided context. \
                        You make sure the question can be answered by the text."},
                {
                    "role": "user",
                    "content": text,
                },
            ],
        )
        return response.choices[0].message.content
    except Exception:
        # Fall back to a sentinel value the caller can detect and skip.
        return "No question generated"
import random
from tqdm import tqdm

def generate_queries(docs: list[str], ids: list[str], n: int):

    questions = []
    relevances = []
    pbar = tqdm(total=n)
    while len(questions) < n:
        # 1. randomly draw a piece of text and relevance id
        r = random.randint(0, len(docs)-1)
        text, label = docs[r], ids[r]

        # 2. generate a question for that text and assign its relevance id
        generated_qs = [generate_question(text)]
        if generated_qs == ["No question generated"]:
            print("No question generated")
            continue

        questions.extend(generated_qs)
        relevances.extend([[(label, 1)] for _ in generated_qs])
        pbar.update(len(generated_qs))

    pbar.close()
    return questions[:n], relevances[:n]

5.1 Launch the query generation process with a desired size of 40 queries/questions.

questions, relevances = generate_queries(docs, ids, n=40)
print(len(questions)) #40
print(questions[0])

By running the cells above, you will have a list of generated questions and their associated relevance labels (document ids).

  6. Launch Deep Memory Training
# Train deep memory
job_id = db.vectorstore.deep_memory.train(
    queries=questions,
    relevance=relevances,
)
Starting DeepMemory training job
Your Deep Lake dataset has been successfully created!

Preparing training data for DeepMemory: Creating 20 embeddings in 1 batches of size 20:: 100%|██████████| 1/1 [06:36<00:00, 396.77s/it]
DeepMemory training job started. Job ID: 657b3083d528b0fd224173c6

# During training you can check the status of the training run
db.vectorstore.deep_memory.status(job_id="657b3083d528b0fd224173c6")
--------------------------------------------------------------
|                  657b3083d528b0fd224173c6                  |
--------------------------------------------------------------
| status                     | completed                     |
--------------------------------------------------------------
| progress                   | eta: 0.9 seconds              |
|                            | recall@10: 60.00% (+25.00%)   |
--------------------------------------------------------------
| results                    | recall@10: 60.00% (+25.00%)   |
--------------------------------------------------------------
Output

We see a 25% increase in recall@10 after training the Deep Memory model.

  7. Run a Deep Memory-enabled inference by setting deep_memory=True.
# Define your question
query = "What is the role of the 'encode' method in tokenizers?"

# Run a similarity search with Deep Memory enabled
search_results = db.similarity_search(query=query, deep_memory=True, k=1)

# Print the search results
print(search_results)

# For comparison, the same search can be run without Deep Memory
baseline_results = db.similarity_search(query=query, deep_memory=False, k=3)
# Search results: 
[Document(page_content='A tokenizer is in charge of preparing the inputs for a model. The library contains tokenizers for all the models. Most\nof the tokenizers are available in two flavors: a full python implementation and a “Fast” implementation based on the\nRust library 🤗 Tokenizers. The “Fast” implementations allows:\na significant speed-up in particular when doing batched tokenization and additional methods to map between the original string (character and words) and the token space (e.g. getting the\nindex of the token comprising a given character or the span of characters corresponding to a given token).\nThe base classes PreTrainedTokenizer and PreTrainedTokenizerFast\nimplement the common methods for encoding string inputs in model inputs (see below) and instantiating/saving python and\n“Fast” tokeni', metadata={'Unnamed: 0': 16245, 'title': 'Tokenizer', 'url': 'https://huggingface.co/docs/transformers/main/en/main_classes/tokenizer#tokenizer', 'source': 'hf_transformers'})]
  8. Now, let's run a quantitative evaluation on another set of synthetically generated test queries.
# Generate validation queries
validation_questions, validation_relevances = generate_queries(docs, ids, n=100)

# Launch the evaluation function
recalls = db.vectorstore.deep_memory.evaluate(
    queries=validation_questions,
    relevance=validation_relevances,
    embedding_function=openai_embeddings.embed_documents,
)
Code
Embedding queries took 0.82 seconds
---- Evaluating without Deep Memory ----
Recall@1:	  27.0%
Recall@3:	  42.0%
Recall@5:	  42.0%
Recall@10:	  50.0%
Recall@50:	  67.0%
Recall@100:	  72.0%
---- Evaluating with Deep Memory ----
Recall@1:	  32.0%
Recall@3:	  45.0%
Recall@5:	  48.0%
Recall@10:	  55.0%
Recall@50:	  69.0%
Recall@100:	  73.0%
Output

Even with our new test dataset, we observe higher recall values when Deep Memory is enabled. Comparing these results with those obtained on the training dataset also highlights how much the quality and representativeness of the query-context dataset influence the final gains.

Conclusion

In this lesson, we explored optimization techniques for large language models, covering prompt engineering as a first way to improve LLM performance, fine-tuning for customizing model behavior, and Retrieval-Augmented Generation (RAG) for integrating external, up-to-date knowledge.

We also discussed combining fine-tuning with RAG for complex, domain-specific applications, an approach that requires considerable resources. A significant focus was on Activeloop's Deep Memory, which we integrated into a RAG system to enhance embedding retrieval accuracy. Deep Memory outperformed traditional approaches such as BM25 lexical search and plain vector search with cosine similarity, as we demonstrated through the higher recall values obtained. It also avoids the additional latency and token usage that query reformulation or transformation would introduce.

This approach addresses key embedding retrieval challenges and signals a promising future for increasingly capable and versatile LLMs.

RESOURCES

  • Colab with the lesson code
  • A Survey of Techniques for Maximizing LLM Performance from OpenAI
  • Deep Memory Blog Post
  • Deep Memory Tutorial
  • Llama-index and Deep Memory