Restaurant Insights: Multimodal RAG with Deep Lake

Introduction

In this practical project, we dive into the details of Multimodal RAG in Deep Lake, leveraging its key advantages like true multi-modality and cost-efficient scalability. This chapter builds on the Hybrid RAG project, using the same context of restaurants and dataset. However, if you missed the previous project, no worries—this works perfectly as a standalone exploration. Let’s jump into comparing burgers using image embeddings and discover how Deep Lake's advanced capabilities make it an ideal tool for seamless integration of visual and textual data!

Jupyter: Google Colab

To set up for image embedding generation, we start by importing necessary libraries.

  1. Set Device:
    • We set device to the GPU if one is available, otherwise we fall back to the CPU, ensuring compatibility across hardware.
  2. Load CLIP Model:
    • We load the CLIP model (ViT-B/32) together with its preprocessing pipeline using clip.load(). This model is optimized for multi-modal tasks and is moved to the selected device.

This setup allows us to efficiently process images for embedding, supporting multi-modal applications like image-text similarity.

The following image illustrates the CLIP (Contrastive Language-Image Pretraining) model's structure, which aligns text and images in a shared embedding space, enabling cross-modal understanding.

[Image: CLIP architecture, with text and image encoders projecting into a shared embedding space]

import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
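
Because CLIP places text and images in a shared embedding space, we can sanity-check the setup by scoring a single image against a few text prompts. The snippet below is a minimal, optional sketch: the local file burger.jpg is a hypothetical placeholder and is not part of the dataset used in this project.

from PIL import Image

# Hypothetical local image, used only to illustrate cross-modal scoring.
image_input = preprocess(Image.open("burger.jpg")).unsqueeze(0).to(device)
text_input = clip.tokenize(["a photo of a burger", "a photo of a salad"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_input)

# L2-normalize both embeddings, then compare with a dot product (cosine similarity).
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
print(image_features @ text_features.T)  # the matching caption should score highest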

Create the embedding function for images

To prepare images for embedding generation, we define a transformation pipeline and a function to process images in batches.

  1. Define Transformations (tform):
    • The transformation pipeline includes:
      • Resize: Scales images to 224x224 pixels.
      • ToTensor: Converts images to tensor format.
      • Lambda: Replicates grayscale images across three channels to match the RGB format.
      • Normalize: Standardizes pixel values using common RGB means and standard deviations.
  2. Define embedding_function_images:
    • This function generates embeddings for a list of images.
    • If images is a single filename, it is wrapped in a list.
    • Batch Processing: Images are processed in batches (default size 4), with the transformations applied to each image before the batch is moved to the device.
    • Embedding Creation: The model encodes each batch into embeddings, which are accumulated in the embeddings list and returned as a single list.

This function supports efficient, batched embedding generation, useful for multi-modal tasks like image-based search.

from torchvision import transforms

tform = transforms.Compose([
    transforms.Resize((224,224)), 
    transforms.ToTensor(),
    transforms.Lambda(lambda x: torch.cat([x, x, x], dim=0) if x.shape[0] == 1 else x),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def embedding_function_images(images, model=model, transform=tform, batch_size=4):
    """Creates a list of embeddings based on a list of images. Images are processed in batches."""

    if isinstance(images, str):
        images = [images]

    # Process the embeddings in batches, but return everything as a single list
    embeddings = []
    for i in range(0, len(images), batch_size):
        batch = torch.stack([transform(item) for item in images[i:i+batch_size]])
        batch = batch.to(device)
        with torch.no_grad():
            embeddings += model.encode_image(batch).cpu().numpy().tolist()

    return embeddings
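
As a quick sanity check, we can call the function on a single dummy image and confirm that each embedding has 512 dimensions, the output size of ViT-B/32. This is only an illustrative sketch; the blank white image is not part of the restaurant dataset.

from PIL import Image

# Dummy white image, used only to verify the embedding pipeline end to end.
dummy_image = Image.new("RGB", (224, 224), (255, 255, 255))
sample_embeddings = embedding_function_images([dummy_image])
print(len(sample_embeddings), len(sample_embeddings[0]))  # expected: 1 512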

Create a new dataset to save the images

We now set up a dataset for restaurant images and their embeddings. First we open the scraped restaurant dataset (the same one used in the Hybrid RAG project) in read-only mode; then we create an image dataset with an embedding column for 512-dimensional image embeddings, a restaurant_name column for the names, and an image column that stores images in UInt8 format. After defining the structure, vector_search_images.commit() saves it, making the dataset ready to store data for multi-modal search over images and metadata.

import deeplake
scraped_data = deeplake.open_read_only("al://activeloop/restaurant_dataset_complete")

This code extracts restaurant details from scraped_data into separate lists:

  1. Initialize Lists : restaurant_name and images are initialized to store respective data for each restaurant.
  2. Populate Lists : For each entry (el) in scraped_data, the code appends:
    • el['restaurant_name'] to restaurant_name
    • el['images']['urls'] to images.

After running, each list holds a specific field from all restaurants, ready for further processing.

restaurant_name = []
images = []
for el in scraped_data:
    restaurant_name.append(el['restaurant_name'])
    images.append(el['images']['urls'])

from deeplake import types

org_id = "<your_org_id>"  # placeholder: your Activeloop organization name, as in the Hybrid RAG project

image_dataset_name = "restaurant_dataset_with_images"
vector_search_images = deeplake.create(f"al://{org_id}/{image_dataset_name}")

vector_search_images.add_column(name="embedding", dtype=types.Embedding(512))
vector_search_images.add_column(name="restaurant_name", dtype=types.Text())
vector_search_images.add_column(name="image", dtype=types.Image(dtype=types.UInt8()))

vector_search_images.commit()

Convert the URLs into images

We retrieve the images for each restaurant from the URLs in scraped_data and store them in restaurants_images. For each restaurant, we request every image URL, keep only successful responses (status code 200), and convert them to PIL images, discarding any that are not in RGB mode. If a restaurant ends up with no usable images, we add a plain white placeholder image so that every restaurant has at least one entry. The result is a list of lists, with each sublist containing the images for one restaurant.

#!pip install requests

import requests
from PIL import Image
from io import BytesIO

restaurants_images = []
for urls in images:
    pil_images = []
    for url in urls:
        response = requests.get(url)
        if response.status_code == 200:
            image = Image.open(BytesIO(response.content))
            if image.mode == "RGB":
                pil_images.append(image)
    if len(pil_images) == 0:
        pil_images.append(Image.new("RGB", (224, 224), (255, 255, 255)))
    restaurants_images.append(pil_images)
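
Network hiccups can make the loop above hang or raise an exception on a bad URL. The helper below is an optional, more defensive sketch (not part of the original flow) that adds a timeout and skips any URL that fails to download or decode:

def fetch_restaurant_images(urls, timeout=10):
    """Download URLs into RGB PIL images, skipping any that fail."""
    pil_images = []
    for url in urls:
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            image = Image.open(BytesIO(response.content))
            if image.mode == "RGB":
                pil_images.append(image)
        except (requests.RequestException, OSError):
            continue  # skip unreachable URLs and unreadable image data
    # Fall back to a blank placeholder so every restaurant has at least one image.
    return pil_images or [Image.new("RGB", (224, 224), (255, 255, 255))]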

We populate vector_search_images with restaurant image data and embeddings. For each restaurant in scraped_data, we retrieve its name and images, create embeddings for the images, and convert them to UInt8 arrays. Then, we append the restaurant names, images, and embeddings to the dataset and save with vector_search_images.commit().

import numpy as np

for sd, rest_images in zip(scraped_data, restaurants_images):
    # One name per image, so each dataset row carries its restaurant name
    names = [sd["restaurant_name"]] * len(rest_images)
    embeddings = embedding_function_images(rest_images, model=model, transform=tform, batch_size=4)
    vector_search_images.append({
        "restaurant_name": names,
        "image": [np.array(img).astype(np.uint8) for img in rest_images],
        "embedding": embeddings,
    })

vector_search_images.commit()
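
Since each row stores exactly one image, the number of rows in the new dataset should match the total number of downloaded images. A minimal check, assuming the variables above are still in scope:

total_images = sum(len(imgs) for imgs in restaurants_images)
print(len(vector_search_images), total_images)  # the two counts should match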

Search similar images

If you want direct access to the images and the embeddings, you can copy the Activeloop dataset.

deeplake.copy("al://activeloop/restaurant_dataset_images_v4", f"al://{org_id}/{image_dataset_name}")
vector_search_images = deeplake.open(f"al://{org_id}/{image_dataset_name}")

Alternatively, you can load the dataset you just created.

vector_search_images = deeplake.open(f"al://{org_id}/{image_dataset_name}")
vector_search_images

query = "https://www.moltofood.it/wp-content/uploads/2024/09/Hamburger.jpg"

image_query = requests.get(query)
image_query_pil = Image.open(BytesIO(image_query.content))

Performing a similar image search based on a specific image

image_query_pil

Output:

[Image: the query photo, a hamburger]

We generate an embedding for the query image, image_query_pil, by calling embedding_function_images([image_query_pil])[0]. This embedding is then converted into a comma-separated string, query_embedding_string, so it can be inlined in the query. The query, tql, retrieves entries from the dataset by computing the cosine similarity between embedding and query_embedding_string, ranks the results by similarity score in descending order, and limits the output to the top 6 most similar images.

query_embedding = embedding_function_images([image_query_pil])[0]
query_embedding_string = ",".join([str(item) for item in query_embedding])

tql = f"""
    SELECT * 
    FROM (
        SELECT *, cosine_similarity(embedding, ARRAY[{query_embedding_string}]) AS score 
        FROM (
            SELECT *, ROW_NUMBER() AS row_id
        )
    ) 
    ORDER BY score DESC 
    LIMIT 6
"""
similar_images_result = vector_search_images.query(tql)
similar_images_result

Output: Dataset(columns=(embedding,restaurant_name,image,row_id,score), length=6)
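
Each row of the result exposes the selected columns by name, which is the same access pattern the plotting helper below relies on. As a quick text-only peek at the matches (a minimal sketch):

for el in similar_images_result:
    print(el["restaurant_name"], f"{el['score']:.4f}")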

Show similar images and their respective restaurants

The show_images function displays a grid of similar images, along with restaurant names and similarity scores. It defines a grid with 3 columns and calculates the required number of rows based on the number of images. A figure with subplots is created, where each image is displayed in a cell with its restaurant name and similarity score shown as the title, and axes turned off for a cleaner look. Any extra cells, if present, are hidden to avoid empty spaces. Finally, plt.tight_layout() arranges the grid, and plt.show() displays the images in a well-organized layout, highlighting the most similar images along with their metadata.

import matplotlib.pyplot as plt
from PIL import Image
import numpy as np

def show_images(similar_images: list[dict]):
    # Define the number of rows and columns for the grid
    num_columns = 3
    num_rows = (len(similar_images) + num_columns - 1) // num_columns  # Calculate the required number of rows

    # Create the grid
    fig, axes = plt.subplots(num_rows, num_columns, figsize=(15, 5 * num_rows))
    axes = axes.flatten()  # Flatten for easier access to cells

    for idx, el in enumerate(similar_images):
        img = Image.fromarray(el["image"])
        axes[idx].imshow(img)
        axes[idx].set_title(f"Restaurant: {el['restaurant_name']}, Similarity: {el['score']:.4f}")
        axes[idx].axis('off')  # Turn off axes for a cleaner look

    # Remove empty axes if the number of images doesn't fill the grid
    for ax in axes[len(similar_images):]:
        ax.axis('off')

    plt.tight_layout()
    plt.show()

show_images(similar_images_result)

Output:

[Image: a grid of the six most similar images, each labeled with its restaurant name and similarity score]

Conclusion

Through the delicious lens of burger images, this project showcased the power of multimodal RAG systems. By leveraging Deep Lake's capabilities, we explored embedding generation, dataset creation, and similarity-based search, demonstrating how to seamlessly integrate visual and textual data for versatile and practical applications.