Introduction

Multimodal RAG enhances the capabilities of language models by integrating information from diverse data modalities, such as text, images, and audio. This approach enables models to generate more accurate and contextually rich responses by accessing a broader spectrum of information sources.

Key Components of Multimodal RAG:

Multimodal Retrieval Mechanism: This component retrieves relevant information across various modalities. For instance, when responding to a query, the system can fetch pertinent text documents, images, or audio clips, providing a comprehensive context for the generation process.
Generative Model Integration: The retrieved multimodal information is then integrated into the generative model, allowing it to produce responses that are informed by the diverse data sources. This integration ensures that the generated content is not only relevant but also enriched with insights from multiple modalities.

Example application:

Medical Domain: In healthcare, Multimodal RAG systems can combine textual medical records with diagnostic images to provide more accurate and comprehensive analyses. For example, the MMed-RAG system enhances the factuality of medical vision-language models by introducing a domain-aware retrieval mechanism and adaptive context selection.

Challenges and Considerations:

Alignment Between Modalities: Ensuring that information from different modalities aligns correctly is crucial. Misalignment can lead to inaccuracies in the generated responses.
Model Complexity: Integrating multiple modalities increases the complexity of the model, requiring sophisticated architectures to handle diverse data types effectively.
Data Availability: Access to high-quality, multimodal datasets is essential for training effective Multimodal RAG systems.

In summary, Multimodal RAG represents a significant advancement in the field of artificial intelligence, enabling models to generate more informed and contextually rich responses by leveraging information from various data modalities. This approach holds promise for numerous applications, from healthcare to open-domain question answering, by providing a more holistic understanding of complex queries.

Code example

The core of this chapter will be focus on practical code example that has been adapted from:

Thanks to our friends at Llamaindex! ♥️

Set up

Do we need to talk about this? Probably not!

%pip install llama-index-multi-modal-llms-openai
%pip install llama_index ftfy regex tqdm
%pip install git+https://github.com/openai/CLIP.git
%pip install torch torchvision
%pip install matplotlib scikit-image
%pip install wikipedia

import os

OPENAI_API_KEY = "sk-"
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

Collecting and Downloading Images from Wikipedia

This section of the code demonstrates how to extract and download images from Wikipedia pages. It selects a set of Wikipedia article titles, retrieves their associated images, and saves them locally while keeping track of metadata for each image. The metadata includes details like the file name, file path, and a unique identifier (UUID) for each image. Here's the step-by-step breakdown:

Set Up Image Storage: A directory named mixed_wiki is created to store the downloaded images. Each image is assigned a unique UUID for identification.
Retrieve Images: For each Wikipedia page specified in wiki_titles, the code fetches a list of image URLs. Only .jpg and .png files are considered to ensure compatibility with the image processing pipeline. Feel free to experiment with different wiki pages!
Download and Save: Valid image files are downloaded and saved locally. The code also updates a dictionary, image_metadata_dict, to store metadata for each image, including its UUID, file name, and path.
Image Limit per Page: To prevent downloading excessive images, a maximum of 30 images is downloaded from each Wikipedia page.
Error Handling: If a Wikipedia page lacks images or an error occurs, it prints a message and moves on to the next page.

import wikipedia
import urllib.request
from pathlib import Path


image_path = Path("mixed_wiki")
image_uuid = 0
# image_metadata_dict stores images metadata including image uuid, filename and path
image_metadata_dict = {}
MAX_IMAGES_PER_WIKI = 30

wiki_titles = [
    "Vincent van Gogh",
    "San Francisco",
    "Batman",
    "iPhone",
    "Tesla Model S",
    "BTS band",
]

# create folder for images only
if not image_path.exists():
    Path.mkdir(image_path)


# Download images for wiki pages
# Assing UUID for each image
for title in wiki_titles:
    images_per_wiki = 0
    print(title)
    try:
        page_py = wikipedia.page(title)
        list_img_urls = page_py.images
        for url in list_img_urls:
            if url.endswith(".jpg") or url.endswith(".png"):
                image_uuid += 1
                image_file_name = title + "_" + url.split("/")[-1]

                # img_path could be s3 path pointing to the raw image file in the future
                image_metadata_dict[image_uuid] = {
                    "filename": image_file_name,
                    "img_path": "./" + str(image_path / f"{image_uuid}.jpg"),
                }
                urllib.request.urlretrieve(
                    url, image_path / f"{image_uuid}.jpg"
                )
                images_per_wiki += 1
                # Limit the number of images downloaded per wiki page to 15
                if images_per_wiki > MAX_IMAGES_PER_WIKI:
                    break
    except:
        print(str(Exception("No images found for Wikipedia page: ")) + title)
        continue

Visualizing Downloaded Images

This section processes the downloaded images and displays a sample to verify the collection process. It uses PIL for image manipulation and matplotlib for visualization. Here's the breakdown:

Collect Image Paths: The code scans the mixed_wiki directory to create a list of file paths for all downloaded images. Each file's full path is appended to the image_paths list.
Define Plotting Function: The plot_images function:

Opens image files using the PIL.Image module.
Uses a 3x3 grid to display up to 9 images at a time.
Removes axis ticks for a cleaner display.

Display Images: The function is invoked with the collected image_paths, showcasing a visual summary of the first 9 images.

from PIL import Image
import matplotlib.pyplot as plt
import os

image_paths = []
for img_path in os.listdir("./mixed_wiki"):
    image_paths.append(str(os.path.join("./mixed_wiki", img_path)))


def plot_images(image_paths):
    images_shown = 0
    plt.figure(figsize=(16, 9))
    for img_path in image_paths:
        if os.path.isfile(img_path):
            image = Image.open(img_path)

            plt.subplot(3, 3, images_shown + 1)
            plt.imshow(image)
            plt.xticks([])
            plt.yticks([])

            images_shown += 1
            if images_shown >= 9:
                break


plot_images(image_paths)

Examples of the downloaded images:

As you can see, the dataset is not perfect - The Mercedes is not a Tesla (maybe it is a Batmobile then?).

Building a Multimodal Index

This block initializes a multimodal retrieval system without reliance on specific vector database platforms, showcasing a the easiest possibl approach to combining text and image embeddings.

Document Loading: Text data from the mixed_wiki directory is loaded using SimpleDirectoryReader.
Index Creation: A MultiModalVectorStoreIndex is built directly from the documents, enabling seamless retrieval across modalities.

Pro tip: If you run the notebook multiple times, the images will get stored repeatedly and you will have many duplicates (uhm, my friend told me that, definitely didn’t happen to me 🤨)

from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.core import  StorageContext, SimpleDirectoryReader

# Create the MultiModal index
documents = SimpleDirectoryReader("./mixed_wiki/").load_data()
index = MultiModalVectorStoreIndex.from_documents(
    documents
)

Displaying an Input Image

This simple cell visualizes a single image from the downloaded dataset, which will later serve as an input for multimodal retrieval.

Input Image Selection: The input_image variable specifies the image to be displayed. In this case, it points to "./mixed_wiki/2.jpg".
Visualization: The plot_images function is used to display the selected image.

Feel free to change the input_image path to any other image from the mixed_wiki directory to explore the dataset visually. This flexibility allows experimentation with different inputs for retrieval.

input_image = "./mixed_wiki/2.jpg"
plot_images([input_image])

Example output

Image-to-Image Retrieval

In this cell, you’ll retrieve images similar to your selected input image using the multimodal retrieval system.

Retriever Engine: The retriever_engine is created from the multimodal index and configured to return the top 4 most similar images (image_similarity_top_k=4).
Image-to-Image Retrieval: The image_to_image_retrieve method retrieves images from the index based on their similarity to the input image ("./mixed_wiki/2.jpg").
Post-Processing:

The metadata (file_path) of each retrieved image is extracted and stored in the retrieved_images list.
The first retrieved image, which is the input image itself (due to the highest similarity score), is removed from the results.

Visualization: The plot_images function is used to display the remaining retrieved images.

Feel free to modify the input_image in the previous cell to test retrieval for different images and explore how the system finds similar images.

# generate Text retrieval results
retriever_engine = index.as_retriever(image_similarity_top_k=4)
# retrieve more information from the GPT4V response
retrieval_results = retriever_engine.image_to_image_retrieve(
    "./mixed_wiki/2.jpg"
)
retrieved_images = []
for res in retrieval_results:
    retrieved_images.append(res.node.metadata["file_path"])

# Remove the first retrieved image as it is the input image
# since the input image will gethe highest similarity score
plot_images(retrieved_images[1:])

Example output:

You can see that similar images were retrieved (Gogh).

Using GPT-4 Vision for Multimodal Analysis

In this cell, you integrate OpenAI's multimodal model, GPT-4 Vision, to analyze the retrieved images and generate a contextual response.

Image Preparation:

The input_image is wrapped in an ImageDocument object for compatibility with the model.
Retrieved images (excluding the input image) are also converted into ImageDocument objects and appended to the list.

Model Initialization:

The OpenAIMultiModal object is set up with GPT-4 Vision (gpt-4o) and your OpenAI API key.
Parameters like max_new_tokens control the length of the generated response.

Generating a Response:

A prompt is provided to the model, asking it to analyze the input image in relation to the retrieved images.
The model generates a detailed response based on the visual content and the task.

This cell demonstrates the seamless combination of multimodal retrieval and generative capabilities. Experiment by modifying the input images or prompts to explore the flexibility of this system for multimodal tasks.

from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core import SimpleDirectoryReader
from llama_index.core.schema import ImageDocument

# put your local directore here
image_documents = [ImageDocument(image_path=input_image)]

for res_img in retrieved_images[1:]:
    image_documents.append(ImageDocument(image_path=res_img))


openai_mm_llm = OpenAIMultiModal(
    model="gpt-4o", api_key=OPENAI_API_KEY, max_new_tokens=1500
)
response = openai_mm_llm.complete(
    prompt="Given the first image as the base image, what the other images correspond to?",
    image_documents=image_documents,
)

print(response)

Example output:

“The first image is "The Sower" by Vincent van Gogh. The other images correspond to:

Second Image: "Olive Trees" by Vincent van Gogh.
Third Image: "The Painter of Sunflowers" by Paul Gauguin, depicting Vincent van Gogh.
Fourth Image: "Wheat Field with Cypresses" by Vincent van Gogh.

These paintings are all related to Van Gogh's style and themes.”

Image Query with GPT-4 Vision and Custom Template

In this cell, you combine image retrieval, custom prompting, and GPT-4 Vision to answer a specific query related to the retrieved images.

Custom Prompt Template:

A PromptTemplate object is created to define a query format.
The qa_tmpl_str template specifies how the query and retrieved images are presented to the model for response synthesis.

Model Initialization:

The OpenAIMultiModal object is initialized with GPT-4 Vision (gpt-4o) for generating multimodal responses.

Query Engine Setup:

The as_query_engine method from the index creates a query engine that integrates retrieval with GPT-4 Vision.
The image_qa_template is used to format the interaction for the model.

Image Query and Response:

The image_query method retrieves the top-k relevant images for the input_image and formats them alongside the query string using the custom template.
GPT-4 Vision processes this information and generates a response.

This cell demonstrates the integration of multimodal retrieval with flexible query answering, allowing you to synthesize relationships and insights about visual data. Feel free to modify the query string or experiment with different images to explore the versatility of this system.

from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core import PromptTemplate

qa_tmpl_str = (
    "Given the images provided, "
    "answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)

qa_tmpl = PromptTemplate(qa_tmpl_str)


openai_mm_llm = OpenAIMultiModal(
    model="gpt-4o", api_key=OPENAI_API_KEY, max_new_tokens=1500
)

query_engine = index.as_query_engine(
    llm=openai_mm_llm, image_qa_template=qa_tmpl
)

query_str = "Who is the author of these images, from which year are they and what is their meaning?"
response = query_engine.image_query("./mixed_wiki/2.jpg", query_str)
print(response)

Example output:

The images are paintings by Vincent van Gogh.

The Sower (1888): This painting depicts a sower in a field with a large sun in the background. It symbolizes the cycle of life and the connection between humanity and nature.
Olive Trees (1889): This painting shows a grove of olive trees with swirling skies. It reflects van Gogh's fascination with nature and his emotional response to the landscape around him.

Both paintings are characterized by van Gogh's distinctive use of color and expressive brushwork.

Other Modalities in Multimodal RAG

Multimodal RAG can extend beyond text and images to include audio, video, and other formats. The core principle remains the same: creating high-quality embeddings for each modality and ensuring the LLM can process them effectively. Audio is often transcribed to text before embedding, while video is treated as a series of frames for visual embeddings. With the right embeddings and LLM capabilities, the process is adaptable to virtually any modality.

So what?

Multimodal RAG is an exciting and rapidly evolving field with immense potential. While traditional RAG is powerful, many organizations have vast amounts of data spanning multiple modalities, such as text, images, audio, and more. Combining these diverse data types into a unified pipeline elevates your RAG system, unlocking capabilities far beyond what text-based retrieval alone can achieve. However, this complexity makes choosing the right vector database even more critical than for traditional RAG implementations. A robust solution for handling multimodal data can significantly enhance both performance and scalability.

When it comes to multimodal RAG, Deeplake stands out as a leading solution. Its true multi-modality enables seamless integration of diverse data types, including raw data, metadata, and embeddings, all within a single system. Combined with its cost-efficient scalability and high-speed retrieval capabilities, Deeplake simplifies building powerful multimodal pipelines. At Activeloop, we’re dedicated to creating cutting-edge multimodal retrieval systems—learn more about Deeplake’s latest advancements here: Davit Buniatyan Deep Lake 4.0: The Fastest Multi-Modal AI Search on Data Lakes

Conclusion

In this chapter, we explored the power of Multimodal RAG to enhance generative models by integrating multiple data modalities. By combining embeddings from text, images, and other formats, multimodal RAG systems provide enriched, contextually relevant responses that go beyond single-modality approaches.

Through a hands-on example, we demonstrated how to build a multimodal retrieval pipeline using images, create a flexible index, and integrate GPT-4 Vision to analyze retrieved content. The key takeaway is that success in multimodal RAG depends on high-quality embeddings and a generative model capable of synthesizing insights from diverse data types. This approach opens up endless possibilities for applications, from answering complex visual-textual queries to creating domain-specific retrieval systems.

The next chapter is called “Other notable techniques” - the place for techniques that did not deserve a dedicated chapter. But the chapter still contains some gems, including my favourite - Contextual retrieval by Anthropic!

Multimodal RAG: Retrieving Images & More

Introduction

Key Components of Multimodal RAG:

Example application:

Challenges and Considerations:

Code example

Set up

Collecting and Downloading Images from Wikipedia

Visualizing Downloaded Images

Examples of the downloaded images:

Building a Multimodal Index

Displaying an Input Image

Example output

Image-to-Image Retrieval

Using GPT-4 Vision for Multimodal Analysis

Image Query with GPT-4 Vision and Custom Template

Other Modalities in Multimodal RAG

So what?

Conclusion