Multi-modal AI Search Across Restaurants with ColPali

In this final stage, the system uses an end-to-end neural search approach built around the MaxSim operator, as implemented in ColPali, to improve multi-modal retrieval. MaxSim lets the system compare different types of data, such as text and images, and find the most relevant matches. This helps retrieve results that are contextually accurate and meaningful, which is especially useful for complex applications, like scientific and medical research, where a deep understanding of the content is essential.


Recent advancements in Visual Language Models (VLMs), as highlighted in the ColPali paper, demonstrate that VLMs can achieve recall rates on document retrieval benchmarks comparable to those of traditional OCR pipelines. End-to-end learning approaches are positioned to surpass OCR-based methods significantly. However, representing documents as a bag of embeddings demands 30 times more storage than single embeddings. Deep Lake’s format, which inherently supports n-dimensional arrays, enables this storage-intensive approach, and the 4.0 query engine introduces MaxSim operations.

With Deep Lake 4.0’s 10x increase in storage efficiency, we can allocate some of these savings to store PDFs as 'bags of embeddings' processed at high speeds. While this approach requires 30 times more storage than single embeddings, it allows us to capture richer document representations, bypassing OCR-based, manual feature engineering pipelines. This trade-off facilitates seamless integration within VLM/LLM frameworks, leading to more accurate and genuinely multimodal responses.
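
To make the storage trade-off concrete, here is a minimal back-of-envelope sketch comparing a single-vector page embedding with a multi-vector "bag of embeddings" representation. The shapes are illustrative assumptions, not the exact dimensions used by ColPali or Deep Lake; with these numbers the multi-vector form costs roughly 32 times more storage, in line with the figure above.

import numpy as np

# Illustrative shapes only (assumptions, not exact model dimensions)
single_embedding = np.zeros(1024, dtype=np.float32)          # one vector per page
bag_of_embeddings = np.zeros((256, 128), dtype=np.float32)   # one vector per patch/token

print(single_embedding.nbytes)                                # 4096 bytes
print(bag_of_embeddings.nbytes)                               # 131072 bytes
print(bag_of_embeddings.nbytes / single_embedding.nbytes)     # 32.0x more storage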

Unlike CLIP, which primarily focuses on aligning visual and text representations, ColPali leverages advanced Vision Language Model (VLM) capabilities to deeply understand both textual and visual content. This allows ColPali to capture rich document structures—like tables, figures, and layouts—directly from images without needing extensive preprocessing steps like OCR or document segmentation. ColPali also utilizes a late interaction mechanism, which significantly improves retrieval accuracy by enabling more detailed matching between query elements and document content. These features make ColPali faster, more accurate, and especially effective for visually rich document retrieval, surpassing CLIP's capabilities in these areas.

For more details, see the ColPali paper.
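
Before downloading the model, here is a minimal PyTorch sketch of the MaxSim late-interaction score described above, under illustrative assumptions about the shapes: for every query-token embedding we take its best dot-product match over all document-patch embeddings and sum those maxima. Later in this tutorial, the same scoring is delegated to the maxsim function in Deep Lake's TQL query engine.

import torch

# Illustrative shapes: 20 query tokens, 1024 document patches, 128-dim embeddings
Q = torch.randn(20, 128)     # query-token embeddings
D = torch.randn(1024, 128)   # document-patch embeddings (the "bag of embeddings")

sim = Q @ D.T                                # (20, 1024) token-to-patch similarities
maxsim_score = sim.max(dim=1).values.sum()   # best patch per query token, summed
print(maxsim_score.item())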

Download the ColPali model

We initialize the ColPali model and its processor to handle images efficiently. The model version is set to "vidore/colpali-v1.2", specifying the desired ColPali release. The model is loaded using ColPali.from_pretrained(), with torch_dtype=torch.bfloat16 for optimized memory use and "cuda:0" as the device, or "mps" for Apple Silicon devices. After loading, we set the model to evaluation mode with .eval() to prepare it for inference tasks. The ColPaliProcessor is also initialized to handle preprocessing of images and texts, enabling seamless input preparation for the model. This setup readies ColPali for high-performance image and document processing.

!pip install colpali-engine accelerate
[Figure: ColPali architecture, showing offline document encoding (left) and online query processing with MaxSim-based late interaction (right)]

The image above illustrates the architecture of ColPali, a vision-language model designed specifically for efficient document retrieval using both visual and textual cues. Here's an overview of how it works and how it performs this task efficiently:

  1. Offline Document Encoding:
    • On the left side, we see the offline processing pipeline, where a document is fed into ColPali’s Vision Language Model (VLM).
    • Each document undergoes encoding through a vision encoder (to handle images and visual content) and a language model (for textual understanding). These two modules generate multi-dimensional embeddings representing both visual and textual aspects of the document.
    • The embeddings are stored in a pre-indexed format, making them ready for fast retrieval during the online phase.
  2. Online Query Processing:
    • On the right side, in the online section, user queries (such as “What are ViTs?”) are processed through the language model to create a query embedding.
    • ColPali uses a late interaction mechanism, where each part of the query embedding is compared with document embeddings through a MaxSim operation to find the most similar regions in the document’s visual and textual content.
  3. Similarity Scoring:
    • ColPali calculates a similarity score based on the MaxSim results, which identifies the most relevant documents or document sections matching the query.
    • This approach allows ColPali to capture fine-grained matches, even within complex document structures.

The ColPali model improves on traditional document retrieval methods by incorporating both vision and language models, making it effective for visually rich documents (such as those with tables, images, or infographics). Additionally, its late interaction mechanism enables fast and accurate retrieval, optimizing the model for low-latency performance even in large-scale applications.

import torch
from PIL import Image

from colpali_engine.models import ColPali, ColPaliProcessor

model_name = "vidore/colpali-v1.2"

model = ColPali.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # or "mps" if on Apple Silicon
).eval()

processor = ColPaliProcessor.from_pretrained(model_name)

We load the FigQA dataset with Deep Lake, opening a read-only copy hosted in the activeloop organization (derived from the "FigQA" subset of the "futurehouse/lab-bench" dataset). This dataset contains figure data tailored for question-answering tasks, making it an ideal input format for the ColPali model. We then extract the figure images as PIL images and the question strings into Python lists. ColPali's strength in handling structured and visual data enables effective extraction of answers and insights from these figures, enhancing overall performance on complex, figure-based queries.

import deeplake

figQA_dataset = "figQA_dataset"
fig_qa = deeplake.open_read_only(f"al://activeloop/{figQA_dataset}")

# Extract PIL images and question strings from the dataset
figure_images = [Image.fromarray(el["image"]) for el in fig_qa]
questions = [el["question"] for el in fig_qa]

Create a new dataset to store the ColPali embeddings

We create a Deep Lake dataset named "figQA_colpali" for ColPali's figure-based question answering. Stored in the vector_search_images variable, it includes an embedding column for 2D float arrays, a question column for text, and an image column for the figure images. After defining the structure, vector_search_images.commit() saves the setup, optimizing it for ColPali's multi-modal retrieval in figure QA tasks.

from deeplake import types

late_interaction_dataset_name = "figQA_colpali"
# org_id: your Activeloop organization ID, defined earlier in this tutorial
vector_search_images = deeplake.create(f"al://{org_id}/{late_interaction_dataset_name}")

vector_search_images.add_column(name="embedding", dtype=types.Array(types.Float32(),dimensions=2))
vector_search_images.add_column(name="question", dtype=types.Text())
vector_search_images.add_column(name="image", dtype=types.Image(dtype=types.UInt8()))

vector_search_images.commit()

Save the data in the dataset

We batch-process and store ColPali embeddings for figure-based question answering. Using a batch_size of 8, we iterate over figure_images in batches, run each batch through the processor and the ColPali model with gradients disabled, and move the resulting embeddings to the CPU. The embeddings are converted to nested lists and appended, together with the questions and images, to vector_search_images. Finally, vector_search_images.commit() saves everything for efficient retrieval.

import numpy as np

batch_size = 8

matrix_embeddings: list[torch.Tensor] = []

for i in range(0, len(figure_images), batch_size):
    batch = figure_images[i:i + batch_size]  # Take batch_size images at a time
    batch_images = processor.process_images(batch).to(model.device)
    with torch.no_grad():
        embeddings = model(**batch_images)
        matrix_embeddings.extend(list(torch.unbind(embeddings.to("cpu"))))

# Convert embeddings to list format
matrix_embeddings_list = [embedding.tolist() for embedding in matrix_embeddings]

# Append question, images, and embeddings to the dataset
vector_search_images.append({
    "question": questions,
    "image": [np.array(img).astype(np.uint8) for img in figure_images],
    "embedding": matrix_embeddings_list
})

# Commit the additions to the dataset
vector_search_images.commit()

Chat with images

We define two queries and process them with processor.process_queries, sending the batch to the model's device. Embeddings are generated with gradients disabled and converted to a list format, stored in query_embeddings.

queries = [
    "At Time (ms) = 0, the membrane potential modeled by n^6 is at -70 ms. If the axis of this graph was extended to t = infinity, what Membrane Voltage would the line modeled by n^6 eventually reach?",
    "Percent frequency distribution of fiber lengths in cortex and spinal cord by diameter"
]

batch_queries = processor.process_queries(queries).to(model.device)
with torch.no_grad():
    query_embeddings = model(**batch_queries)
query_embeddings = query_embeddings.tolist()

Retrieve the most similar images

For each embedding in query_embeddings, we format it as a nested array string for querying. The innermost lists (q_substrs) are converted to ARRAY[...] format and then combined into a single string, q_str. This formatted string is used in a TQL query on vector_search_images, computing the maxsim similarity between q_str and each stored embedding. The query returns the top n_res results (here, 1), ordered by similarity score. This loop performs a similarity search for each query embedding.

colpali_results = []
n_res = 1

for el in query_embeddings:
    # Convert each sublist of embeddings into a formatted TQL array string
    q_substrs = [f"ARRAY[{','.join(str(x) for x in sq)}]" for sq in el]
    q_str = f"ARRAY[{','.join(q_substrs)}]"
    
    # Construct a formatted TQL query
    tql_colpali = f"""
        SELECT *, maxsim({q_str}, embedding) AS score 
        ORDER BY maxsim({q_str}, embedding) DESC 
        LIMIT {n_res}
    """
    
    # Execute the query and append the results
    colpali_results.append(vector_search_images.query(tql_colpali))

For each query, this code converts the retrieved image data back to an image with Image.fromarray(el["image"]) and plots it in a matplotlib grid, titling each subplot with the query text and its similarity score. This visually presents each query's closest matches alongside their similarity scores.

import matplotlib.pyplot as plt
    
num_columns = n_res
num_rows = len(colpali_results)

fig, axes = plt.subplots(num_rows, num_columns, figsize=(15, 5 * num_rows))
axes = axes.flatten()  # Flatten for easier access to cells

idx_plot = 0
for res, query in zip(colpali_results, queries):
    for el in res: 
        img = Image.fromarray(el["image"])
        axes[idx_plot].imshow(img)
        axes[idx_plot].set_title(f"Query: {query}, Similarity: {el['score']:.4f}")
        axes[idx_plot].axis('off')  # Turn off axes for a cleaner look
        idx_plot += 1
for ax in axes[idx_plot:]:
    ax.axis('off')

plt.tight_layout()
plt.show()

VQA: Visual Question Answering

The following function, generate_VQA, creates a visual question-answering (VQA) system that takes an image and a question, then analyzes the image to provide an answer based on visual cues.

  1. Convert Image to Base64: The image is read and encoded to a base64 string, allowing it to be embedded in the API request.
  2. System Prompt: A structured prompt instructs the model to analyze the image, focusing on visual details that can answer the question.
  3. Send the Request: The function calls the OpenAI chat completions API with the gpt-4o-mini model, passing the system prompt and the base64-encoded image as message content. The model is asked to respond in JSON format, specifically returning an answer field with insights based on the image.
  4. Parse the Response: If the call succeeds, the function parses the JSON content and returns the answer field; otherwise, it returns False.

This approach enables an AI-powered visual analysis of images to generate contextually relevant answers.

import json

def generate_VQA(base64_image: str, question: str):
    # `client` is the OpenAI client (e.g., client = OpenAI()) initialized earlier in this series

    system_prompt = f"""You are a visual language model specialized in analyzing images. Below is an image provided by the user along with a question. Analyze the image carefully, paying attention to details relevant to the question. Construct a clear and informative answer that directly addresses the user's question, based on visual cues.

    The output must be in JSON format with the following structure:
    {{
        "answer": "The answer to the question based on visual analysis."
    }}

    Here is the question: {question}
    """

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": system_prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                    },
                ],
            }
        ],
        response_format={"type": "json_object"},
    )

    try:
        response = response.choices[0].message.content
        response = json.loads(response)
        answer = response["answer"]
        return answer
    except Exception as e:
        print(f"Error: {e}")
        return False

This code sets question to the first item in queries, converts the first retrieved image for that query (colpali_results[0]["image"][0]) back to a PIL image, and saves it as "image.jpg".

question = queries[0]
output_image = "image.jpg"
img = Image.fromarray(colpali_results[0]["image"][0])
img.save(output_image)

The following code opens "image.jpg" in binary mode, encodes it to a base64 string, and passes it with question to the generate_VQA function, which returns an answer based on the image.

import base64

with open(output_image, "rb") as image_file:
    base64_image = base64.b64encode(image_file.read()).decode('utf-8')

answer = generate_VQA(base64_image, question)
answer

Expected output: 'As time approaches infinity, the voltage modeled by n^6 will eventually stabilize at the equilibrium potential for potassium (EK), which is represented at approximately -90 mV on the graph.'

Conclusion


You've now gained a solid understanding of multi-modal data processing, advanced retrieval techniques, and hybrid search methods using state-of-the-art models like ColPali. With these skills, you're equipped to tackle complex, real-world applications that require deep insights from both text and image data.

Keep experimenting, stay curious, and continue building innovative solutions—this is just the beginning of what’s possible in the field of AI-driven search and information retrieval.