Practical Project Overview: Using TorchTune and Deep Lake for RAFT

In this guide we will look at what the RAFT technique is, how Llama3 can be fine-tuned with it, and how the Deep Lake Vector Store can help us. The results presented here were obtained by combining several technologies, including Torchtune, RAFT, and Deep Lake.

What is Torchtune

Torchtune is a native PyTorch library designed for easy fine-tuning of large language models (LLMs). Developed by the PyTorch team, Torchtune follows PyTorch’s core principles, offering modular building blocks and adaptable training recipes for fine-tuning popular LLMs on both professional and consumer-grade GPUs.

This comprehensive tool supports the entire tuning process, from dataset acquisition and model preparation to custom training with various model architectures, parameter-efficient fine-tuning (PEFT) methods, and more. With Torchtune, monitoring training progress and performance metrics is intuitive, and you can also train the model in a quantized manner by changing just a few parameters.

The Need for Torchtune

Over the past year, interest in open LLMs has increased exponentially and fine-tuning these advanced models has become essential to tailor them to specific applications. However, the customization process, from dataset and model selection to quantization, evaluation, and inference, can be complicated and challenging, especially given the significant memory requirements of these models on GPUs.

Torchtune was crafted with several guiding principles:

  1. Easy Extensibility: New techniques emerge frequently, and every fine-tuning use case is unique. Torchtune’s modular approach allows for easy composition of components and flexible training loops, minimizing abstraction and enabling fine-tuning tailored to specific needs. Each recipe is self-contained, comprising less than 600 lines of code, promoting readability and hackability.
  2. Democratizing Fine-tuning: Torchtune is designed to be accessible to users of all skill levels. Whether cloning and modifying configurations or diving into code, users can easily leverage Torchtune. Moreover, Torchtune’s memory-efficient recipes have been validated on machines with single 24GB gaming GPUs, demonstrating its versatility.
  3. Interoperability: Embracing the thriving open-source LLM ecosystem, Torchtune promotes interoperability with a diverse range of offerings. This flexibility empowers users to dictate how they train and utilize their fine-tuned models.

This library makes it possible to fine-tune many types of open-source LLMs; in our case we chose to experiment with the new Meta Llama3 model. To fine-tune this model we first selected a text file to use in the training phase and transformed it so that it followed the RAFT format.

What is the RAFT Technique

RAFT stands for Retrieval Augmented Fine-Tuning and is a powerful technique designed to optimize the performance of large language models (LLMs) in domain-specific contexts, particularly within retrieval-augmented generation (RAG) scenarios. RAFT, a refinement of the RAT (Retriever Aware Training) approach, leverages fine-tuning to enhance model proficiency in open-book settings where the model accesses external documents to generate responses.

Understanding RAFT

The RAFT methodology centers on training LLMs to ignore extraneous information retrieved from documents, focusing only on the segments pertinent to answering a specific question. By identifying and quoting relevant sections from the retrieved documents, RAFT simplifies the model’s reasoning process. This technique significantly improves the model’s capability in domain-specific RAG applications, as demonstrated across diverse datasets like PubMed, HotpotQA, and Gorilla.

Key Features of RAFT

Training Strategy: RAFT trains the model to recognize those parts of retrieved documents that contribute to answering questions, promoting accurate and focused responses.

Domain-Specific RAG: RAFT specializes in domain-specific open-book scenarios, refining model capabilities within predefined domains like enterprise documents or news repositories.

Enhanced Reasoning: RAFT promotes a chain-of-thought-style response, sharpening the LLM’s ability to draw logical connections between retrieved information and questions.

To illustrate RAFT’s effectiveness, consider the analogy of preparing an LLM for an exam:

Closed-Book Exam: Comparable to scenarios where LLMs rely solely on pre-training knowledge for responses, similar to chatbot interactions.

Open-Book Exam: Represents scenarios where LLMs access external information (documents) to supplement responses, relying on retrievers to fetch relevant data.

RAFT (Domain-Specific Open-Book): Extends beyond general open-book settings, preparing LLMs for domain-specific tasks by training them on relevant documents tailored to specific use cases.


Implementing RAFT

In RAFT, the training data comprises questions (Q), relevant oracle documents (D*), and distractor documents (D1, …, Dk-1), optimizing the LLM’s ability to generate accurate answers from retrieved materials. By emphasizing domain-specific training, RAFT ensures that LLMs excel in retrieving and synthesizing information within designated domains.

RAFT introduces a novel approach to preparing fine-tuning data tailored for domain-specific open-book scenarios, aligning seamlessly with in-domain Retrieval Augmented Generation (RAG). The technique structures each data point as a question (Q), a collection of documents (Dk), and a corresponding chain-of-thought style answer (A*) derived from one of the documents (D*).

Classical Supervised Finetuning:

Consider the supervised fine-tuning (SFT) setting for a question-answer dataset: a set of question (Q) and corresponding answer (A) pairs is derived from a dataset or is already available. In the classical SFT setting, the model is trained to improve its ability to answer the questions based on knowledge obtained either during pre-training or during the SFT phase. The trained model can also be used at test time in the Retrieval Augmented Generation (RAG) setting, where additional documents are introduced in the prompt to help the model answer the question. This can be represented as follows:

  • Train: Q → A
  • Zero-shot Inference: Q → A
  • RAG Inference: Q + D → A

Supervised Finetuning In RAFT:

In RAFT two types of documents are distinguished:

  • Oracle Documents (D*): These documents contain information essential for deducing the answer to the question.
  • Distractor Documents (Di): Documents that do not contain answer-relevant information.

The oracle document, which can consist of multiple documents (as observed in HotpotQA), is retained for a fraction P % of the questions (qi) in the dataset, together with k-1 distractor documents (d1, …, dk-1). For the remaining (1 - P) % of the questions, only distractor documents (d1, …, dk) are included. This strategic selection of training data encourages the model to focus on domain-specific knowledge, enhancing its ability to generate accurate answers from the provided documents and questions.

By training LLMs with RAFT, the authors of the paper observed improved performance on Retrieval Augmented Generation (RAG) tasks within the specified domain. The removal of oracle documents in some instances of the training data compels the model to memorize and effectively utilize domain-specific knowledge, refining its adaptability and responsiveness.

Data Configuration for RAFT Training

The training data for RAFT is structured as follows:

  • For P % of the data: Q + D* + D1 + D2 + … + Dk => A*
  • For (1 - P) % of the data: Q + D1 + D2 + … + Dk => A*
This configuration enables the model to learn to generate accurate answers based on the provided documents, adapting seamlessly to the nuances of domain-specific RAG applications.
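
To make this sampling logic concrete, here is a minimal Python sketch of how such training points could be assembled. It is an illustration only, not the code used by RAFTDatasetPack: the function name, field names, and the default values of P and k are assumptions made for this example.

    import random

    def build_raft_example(question, oracle_doc, distractor_pool, cot_answer,
                           p_oracle=0.8, k=4):
        """Assemble one RAFT-style training point (illustrative sketch)."""
        if random.random() < p_oracle:
            # P % of the data: the oracle document plus k-1 distractors
            context = [oracle_doc] + random.sample(distractor_pool, k - 1)
        else:
            # (1 - P) % of the data: k distractors only, so the model must rely on learned knowledge
            context = random.sample(distractor_pool, k)
        random.shuffle(context)  # avoid a fixed position for the oracle document
        return {"question": question, "context": context, "cot_answer": cot_answer}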

Example Training Data

An illustrative example of training data within the RAFT dataset comprises questions, context, instructions, and final Chain-of-Thought (CoT) answers. To maintain context fidelity and prevent model hallucination, direct quotes from the context are denoted using ##begin_quote## and ##end_quote## markers. This efficient approach ensures that the model remains grounded in the provided contexts, enhancing the accuracy and relevance of generated responses.
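
For illustration only, a single training point following this scheme might look roughly like the Python dictionary below. The field names and content are invented for this example and merely mirror the structure described above; the exact schema produced by RAFTDatasetPack may differ.

    raft_example = {
        "question": "Which enzyme does the described inhibitor target?",
        "context": "<oracle passage> ... <distractor passage 1> ... <distractor passage 2>",
        "cot_answer": (
            "The context states ##begin_quote##the compound selectively binds enzyme X##end_quote##, "
            "so the inhibitor targets enzyme X."
        ),
    }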


Prepare the RAFT dataset

To generate our RAFT dataset we relied on RAFTDatasetPack, a LlamaIndex pack that takes a text document as input and, using GPT-4, generates a file in the format described above, which will then be used to train the LLM:

    from llama_index.packs.raft_dataset import RAFTDatasetPack

    # RAFTDatasetPack uses an OpenAI model (GPT-4 by default) to build the RAFT dataset
    raft_dataset = RAFTDatasetPack("<your_txt_document>")
    dataset = raft_dataset.run()

    # Persist the dataset to disk and export it as a JSONL file
    output_path = "raft_dataset"
    dataset.save_to_disk(output_path)
    dataset.to_json(output_path + "/RAFT_standard_document.jsonl")

The task described uses GPT-4 for dataset creation, and this can have some unpleasant side effects:

Time Investment: Creating datasets using GPT-4 may take a considerable amount of time, especially if the dataset size is large.

Cost Implications: Utilizing GPT-4 for dataset creation can incur significant costs.

The larger the file we pass to it, the more time and money this task will require.

Now that we have created our dataset we can upload it to Activeloop as a Deep Lake dataset; this will be useful when we carry out the training.

Upload the RAFT Dataset to Deep Lake

A Deep Lake dataset is a type of dataset provided by the Deep Lake package, which stores data as compressed chunked arrays that can be stored anywhere and later streamed to deep learning models. It allows for efficient storage and retrieval of data for machine learning tasks. Deep Lake datasets can be created or loaded using functions like deeplake.load() or deeplake.dataset(). These datasets are designed to support deep learning workflows by providing a structured and efficient way to manage and access data for training machine learning models.
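
As a rough sketch of what the upload step can look like with the Deep Lake API (the dataset path and tensor names below are assumptions chosen for this example; adapt them to the fields present in your RAFT JSONL file):

    import json
    import deeplake

    # Create an empty Deep Lake dataset in your Activeloop organization (placeholder path)
    ds = deeplake.empty("hub://<your_org>/raft_biomedical_dataset")

    with ds:
        ds.create_tensor("question", htype="text")
        ds.create_tensor("context", htype="text")
        ds.create_tensor("cot_answer", htype="text")

        # Stream the RAFT samples generated earlier into the dataset
        with open("raft_dataset/RAFT_standard_document.jsonl") as f:
            for line in f:
                sample = json.loads(line)
                ds.append({
                    "question": sample["question"],
                    "context": sample["context"],
                    "cot_answer": sample["cot_answer"],
                })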

How to Train an LLM Using Torchtune with the RAFT Technique and a Deep Lake Dataset

To be able to train with RAFT and the Deep Lake dataloader, you need to download the repository and install the requirements:

    !git clone -b feature/raft-fine-tuning https://github.com/efenocchi/torchtune.git
    %cd torchtune
    !pip install -e .

Now you can specify your dataset by replacing the one already present in the torchtune/datasets/_raft.py file. In this first phase there are some limitations in the use of the Deep Lake dataloader within Torchtune, which will be removed shortly, so make sure that:

  • the dataset is in public mode
  • if you want to fine-tune a specific dataset, you change the dataset path in torchtune/datasets/_raft.py (see the sketch after this list)
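
As a purely hypothetical illustration of this change (the actual variable name and structure inside _raft.py may differ, so check the file itself):

    # torchtune/datasets/_raft.py (hypothetical excerpt)
    DEEPLAKE_DATASET_PATH = "hub://<your_org>/<your_public_raft_dataset>"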

To continue, all we have to do is download the Llama3 files directly from Hugging Face. Please note that before you can access these files you must accept the Meta license terms on the Hugging Face model page.

    import os
    llama3_original_checkpoints_folder = "llama3"
    os.makedirs(llama3_original_checkpoints_folder, exist_ok=True)
    lora_finetune_output_checkpoints_folder = "lora_finetune_output"
    os.makedirs(lora_finetune_output_checkpoints_folder, exist_ok=True)
    !tune download meta-llama/Meta-Llama-3-8B --output-dir llama3 --hf-token <YOUR_HF_TOKEN>

In the torchtune/recipes/configs/llama3/8B_lora_single_device_deep_lake_raft.yaml file, replace /tmp/Meta-Llama-3-8B/original with llama3/original and /tmp/Meta-Llama-3-8B/ with lora_finetune_output, so that the config we execute during the training phase points to the correct folders.
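
If you prefer to script these substitutions instead of editing the file by hand, a small Python helper like the following can do it (note that the more specific path must be replaced first so the second substitution does not clobber it):

    from pathlib import Path

    cfg = Path("recipes/configs/llama3/8B_lora_single_device_deep_lake_raft.yaml")
    text = cfg.read_text()
    # Replace the more specific path first
    text = text.replace("/tmp/Meta-Llama-3-8B/original", "llama3/original")
    text = text.replace("/tmp/Meta-Llama-3-8B/", "lora_finetune_output")
    cfg.write_text(text)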

Now let’s make sure we have the necessary resources for the training phase (an A100 GPU was used in this guide) and proceed with the training command:

    !tune run lora_finetune_single_device --config recipes/configs/llama3/8B_lora_single_device_deep_lake_raft.yaml

Since Torchtune performs excellently during the training phase but not as well during the testing phase, we decided to convert the PyTorch weights to the standard Hugging Face format and upload them to our space.

Make sure you are in the root project folder (not inside torchtune) and install the following packages:

    %cd ..
    !pip install git+https://github.com/huggingface/transformers
    !git clone https://github.com/huggingface/transformers
    !pip install tiktoken blobfile
    !pip install accelerate

Move the tokenizer, the fine-tuned Llama3 model checkpoint, and the params.json file into the weights folder:

    weights_folder = "weights"
    os.makedirs(weights_folder, exist_ok=True)
    !cp llama3/meta_model_0.pt weights/consolidated.00.pth
    !cp llama3/original/params.json weights
    !cp llama3/original/tokenizer.model weights

Now we can transform the weights into the standard used by Hugging Face and load them into our space:

    !python transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py \
        --input_dir torchtune/weights \
        --model_size 8B \
        --output_dir hf_weights \
        --llama_version 3

We upload the weights to our space, choosing a suitable name:

    from transformers import AutoTokenizer, AutoModelForCausalLM
    tokenizer = AutoTokenizer.from_pretrained("hf_weights")
    model = AutoModelForCausalLM.from_pretrained("hf_weights")
    hf_repository_name = "llama3_RAFT"
    tokenizer.push_to_hub(hf_repository_name)
    model.push_to_hub(hf_repository_name)
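
Once pushed, the fine-tuned model can be loaded back from the Hub like any other transformers checkpoint; the repository id below is a placeholder, so replace it with your own username and the name you chose:

    from transformers import AutoTokenizer, AutoModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("<your_hf_username>/llama3_RAFT")
    model = AutoModelForCausalLM.from_pretrained("<your_hf_username>/llama3_RAFT")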

Generate Question-Answer Pairs

This step generates a series of question-answer pairs based on a chosen dataset. The purpose is to provide a set of questions to the fine-tuned model and compare its generated answers against the ground truth obtained in this phase.

The process involves the following steps:

  • Selecting specific chunks or samples from the dataset.
  • Using a language model to create questions based on each selected chunk.
  • Storing the generated question-answer pairs for later comparison.

By having this set of ground-truth questions and answers, you can evaluate the performance of the fine-tuned model by comparing its responses with the expected answers from this dataset.

We load our previously saved dataset from Deep Lake; in our case, the chosen dataset is deep_memory_biomedical_dataset:
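
A minimal sketch of this step, assuming the dataset lives in your Activeloop organization (the organization name is a placeholder):

    import deeplake

    # Load the previously saved dataset from Activeloop
    ds = deeplake.load("hub://<your_org>/deep_memory_biomedical_dataset")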

We generate the question-answer pairs based on a chosen dataset:
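
One way to implement this step is sketched below; it is not the exact code used in the guide. It samples a few text chunks (placeholders here) and asks an OpenAI model to write a question and answer for each:

    from openai import OpenAI

    client = OpenAI()

    def generate_qa_pair(chunk: str) -> str:
        # Ask the model for one question answerable only from the chunk, plus its answer
        prompt = (
            "Write one question that can be answered using only the text below, "
            "followed by the correct answer.\n\nText:\n" + chunk
        )
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    # Chunks would normally be sampled from the loaded Deep Lake dataset
    sampled_chunks = ["<text chunk 1>", "<text chunk 2>"]
    qa_pairs = [generate_qa_pair(chunk) for chunk in sampled_chunks]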

We can see which question-answer pairs have been generated:
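
For example, by printing them (continuing the sketch above):

    for i, pair in enumerate(qa_pairs):
        print(f"--- pair {i} ---")
        print(pair)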

Extract Context from Deep Lake Vector Store Based on the Generated Question

We need to extract the relevant context from the Deep Lake Vector Store that corresponds to the question generated in the previous step. This involves fetching the specific dataset or information stored in the Deep Lake Vector Store that is most pertinent to the generated question. By retrieving this context, we can analyze and compare it with the generated question-answer pairs to assess the accuracy and relevance of the responses provided by the system.
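
A sketch of this retrieval step with the Deep Lake VectorStore API is shown below. The dataset path and embedding model are placeholder assumptions, and the deep_memory flag is only meaningful if Deep Memory has been enabled for the dataset:

    from deeplake.core.vectorstore import VectorStore
    from openai import OpenAI

    openai_client = OpenAI()

    def embedding_function(texts):
        # Embed one or more strings with an OpenAI embedding model (placeholder choice)
        if isinstance(texts, str):
            texts = [texts]
        response = openai_client.embeddings.create(model="text-embedding-ada-002", input=texts)
        return [item.embedding for item in response.data]

    vector_store = VectorStore(path="hub://<your_org>/deep_memory_biomedical_dataset")

    def retrieve_context(question: str, k: int = 4, use_deep_memory: bool = False) -> str:
        # Fetch the k chunks most relevant to the generated question
        results = vector_store.search(
            embedding_data=question,
            embedding_function=embedding_function,
            k=k,
            deep_memory=use_deep_memory,
        )
        return "\n".join(results["text"])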

Evaluate the Models

In this phase, we will conduct an evaluation of the Llama3 model and its fine-tuned variant using the RAFT technique. The evaluation will compare the performance of these models when provided with context retrieved from the Deep Lake Vector Store, both with and without leveraging the advanced Deep Memory feature.

Key Components:

Llama3 Model: This is the base model used for generating responses to questions. It serves as the foundation for our evaluation, providing a benchmark for performance comparison.

Llama3 Fine-Tuned with RAFT: We will assess the effectiveness of fine-tuning the Llama3 model using the RAFT technique. This fine-tuning approach aims to enhance the model’s ability to generate accurate responses by incorporating retrieval-based learning.
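
As a rough illustration of how this comparison can be run (a sketch: the prompt format, generation settings, and repository id are assumptions, not the exact evaluation code), each model is asked the generated questions together with the retrieved context, and its answers are compared against the ground truth:

    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    model_id = "<your_hf_username>/llama3_RAFT"  # or "meta-llama/Meta-Llama-3-8B" for the base model
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

    def answer(question: str, context: str) -> str:
        prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        output = model.generate(**inputs, max_new_tokens=256)
        # Decode only the newly generated tokens
        return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

    # Compare the model's answer with the ground truth generated earlier
    prediction = answer("<generated question>", "<retrieved context>")
    print(prediction)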