Most of the time, classic RAG systems fail to properly encode highly unstructured data such as graphs, plots, and graphics in documents. In real life, most corporate documents look very different from the datasets used to train image-to-text models and contain hard-to-parse multipage tables, custom visuals, and so on.
Moreover, converting documents to text first is expensive and time-consuming. For example, you might need to fine-tune an OCR model on your specific domain data.
So, it makes sense to think about ways to simplify the pipeline. One straightforward solution, taking advantage of the latest models, is to simply convert each page to an image and feed those images as context to a multimodal language model; virtually all providers offer LLMs with vision capabilities.
However, a document might have hundreds of pages, and blindly prepending all of them to the query slows down generation and raises costs. Imagine a 100-page document: for each query `q` we make, the model needs to compute attention scores over all the pages every single time. Recall that in Vision Language Models (VLMs) images are tokenized into small patches, and during attention computation each image patch attends to the others, similar to text. The main difference is that text is sequential by design while images are not, so each patch must attend to all the other patches.
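To get a feel for the cost, here is a back-of-the-envelope calculation, assuming each page is encoded into a `32x32` grid of patches (the setup discussed later), i.e. 1024 image tokens per page:

```python
# Rough token count when naively prepending every page to each query.
pages = 100
tokens_per_page = 32 * 32             # 1024 patches per page (assumed)
image_tokens = pages * tokens_per_page
print(image_tokens)                   # 102400 image tokens re-attended on every single query
```

That is over 100k image tokens reprocessed for each query, before even counting the text.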
A "simple" fix is to cache the model's weight till the query to disk, using KV cache. This allows to only compute the attention's part of the query tokens with the images tokens. This indeed saves costs and improves performances and can be done with images or text as well, it is usually referred as "prompt caching".
There is one major drawback: models have a limited context size. If we have too many pages, we simply cannot fit them all.
We denote a query as `q` and a document page as `d`.
What if, similar to a normal RAG system, we first retrieve the `d`s that likely contain the answer to a `q`, and then inject those images into a VLM to get an answer? This is the idea behind ColPali (read the ColPali: Efficient Document Retrieval with Vision Language Models paper for more).
If you are familiar with ColBERT, it is almost the same thing but applied directly to images.
ColPali is built upon Google's VLM PaliGemma, a "small" 3B model with very good performance. Let's take a bottom-up approach and explain how PaliGemma works first.
Usually, a VLM is composed of an image encoder and a normal LM. An image is first processed by the image encoder and then projected into the space of the LLM's token embeddings so the model can "understand" it. The next figure, from the paper, explains the idea.
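Conceptually, the glue between the two parts is just a linear projection from the image encoder's output space into the LM's embedding space. A minimal PyTorch sketch (module names and dimensions are illustrative, not PaliGemma's actual code):

```python
import torch
from torch import nn

class TinyVLM(nn.Module):
    def __init__(self, image_encoder: nn.Module, language_model: nn.Module,
                 vision_dim: int = 1152, text_dim: int = 2048):
        super().__init__()
        self.image_encoder = image_encoder                 # e.g. a SigLIP-style ViT
        self.projector = nn.Linear(vision_dim, text_dim)   # maps patch features into the LM's token space
        self.language_model = language_model               # e.g. a Gemma-style decoder

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor):
        patch_feats = self.image_encoder(pixel_values)     # (batch, n_patches, vision_dim)
        image_embeds = self.projector(patch_feats)         # (batch, n_patches, text_dim)
        # Image tokens are prepended to the text tokens and the LM runs over both.
        return self.language_model(torch.cat([image_embeds, text_embeds], dim=1))
```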
The image encoder is SigLIP, a variant of the famous CLIP, which is a self-supervised model trained with a contrastive loss. The main difference from CLIP is a smarter and more efficient loss function. In practice, SigLIP divides an image into a `32x32` grid of patches.
ColPali's main idea is to use PaliGemma to obtain both image and text representations and then compute a score between each `q` and the `d`s, which we can use to retrieve the most relevant document pages in the same way ColBERT does with text. The next figure, from the paper, shows exactly this.
From left to right, we can see a document page split into patches and processed by the VLM, PaliGemma, to obtain `Nd` vector representations, one for each tokenized patch in `d`, of `128` dimensions each. From right to left, a query `q`, `what are VIs?`, is processed by the same VLM to obtain `Nq` vector representations of the tokenized text.
We now have two sets of vectors, one for the page and one for the text, and we need a score that tells us "how relevant is this page for this query". This is done using the MaxSim operation which, in a nutshell, computes a similarity score between each query token and all of the page's patches and takes the maximum; we do this for every query token and then sum up the results, obtaining a single scalar. This allows each token of the text query to be compared to each tokenized patch of the document page.
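In code, the MaxSim late-interaction score is only a few lines; here is a sketch assuming the embeddings are already L2-normalized, with `q_emb` and `d_emb` as placeholder tensors:

```python
import torch

def maxsim_score(q_emb: torch.Tensor, d_emb: torch.Tensor) -> torch.Tensor:
    """MaxSim between one query (Nq, 128) and one page (Nd, 128)."""
    sim = q_emb @ d_emb.T                    # (Nq, Nd) token-to-patch similarities
    best_per_token = sim.max(dim=1).values   # best matching patch for each query token
    return best_per_token.sum()              # single scalar score for the page
```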
If we do this for all the pages in a document, we obtain a score between the text query and **each document page**, so we can then fetch the `top-k` pages to feed into our VLM and get the final answer.
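Using the `maxsim_score` sketch above, retrieval over a whole document reduces to scoring every page and keeping the best ones (names are illustrative):

```python
import torch

def top_k_pages(q_emb: torch.Tensor, page_embs: list[torch.Tensor], k: int = 3) -> torch.Tensor:
    # One MaxSim score per page, then keep the indices of the k best pages.
    scores = torch.stack([maxsim_score(q_emb, d_emb) for d_emb in page_embs])
    return scores.topk(min(k, len(page_embs))).indices
```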
This has two main benefits: we do not need to rely on text preprocessing and complicated OCR pipelines, and we can store the embedded document pages in a database, such as Deep Lake, even in lower precision to speed up retrieval. If we do so, the online part only requires embedding the text query with the VLM and then performing the similarity scoring.
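Putting it together, the offline/online split might look like the following sketch, where `embed_page` and `embed_query` are hypothetical wrappers around the VLM (not an actual ColPali API):

```python
import torch

# Offline: embed every page once and store the vectors in lower precision.
page_store = [embed_page(img).to(torch.float16) for img in page_images]

# Online: embed only the short text query, then score it against the stored pages.
q_emb = embed_query("what does the architecture figure show?")
best_pages = top_k_pages(q_emb, [d.float() for d in page_store], k=3)
```

The selected pages, as images, are then passed to the VLM together with the question to produce the final answer.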