In this lesson, we will explore the application of the Retrieval-augmented Generation (RAG) method to a company's financial information contained in a PDF document. The process includes extracting critical data (text, tables, graphs, etc.) from the PDF file and saving it in a vector database such as Deep Lake for quick and efficient retrieval. A RAG-enabled bot can then access the stored information to respond to end-user queries.
This task requires diverse tools, including Unstructured.io for text/table extraction, OpenAI's GPT-4V for extracting information from graphs, and LlamaIndex for developing a bot with retrieval capabilities. As previously mentioned, data preprocessing plays a significant role in the RAG process. So, we start by pulling data from a PDF document.
Extracting Data
Extracting textual data is relatively straightforward, but processing graphical elements such as line or bar charts is more challenging. GPT-4V, the latest OpenAI model with vision capabilities, is valuable here: we can feed it the report pages and ask it to describe each chart in detail, and these descriptions then complement the textual information. This lesson uses Tesla's Q3 2023 financial report as the source document, which can be downloaded with the wget command.
wget https://digitalassets.tesla.com/tesla-contents/image/upload/IR/TSLA-Q3-2023-Update-3.pdf
1. Text/Tables
The unstructured package is an effective tool for extracting information from PDF files. It requires two system dependencies, poppler and tesseract, which help render and OCR PDF documents. We suggest setting up these packages on Google Colab, which is freely available for students to execute and experiment with the code; we will briefly mention how to install them on other operating systems. Let's install the utilities and their dependencies using the following commands.
apt-get -qq install poppler-utils
apt-get -qq install tesseract-ocr
pip install -q unstructured[all-docs]==0.11.0 fastapi==0.103.2 kaleido==0.2.1 uvicorn==0.24.0.post1 typing-extensions==4.5.0 pydantic==1.10.13
Both poppler and tesseract are straightforward to install on Linux and macOS using package managers such as apt-get and brew. They are more involved to install on Windows; if you use Windows, you can follow these step-by-step guides: [Installing Poppler on Windows] and [Installing Tesseract on Windows].
Once all the necessary packages and dependencies are installed, the process is simple. We use the partition_pdf function, which extracts text and table data from the PDF and divides it into multiple chunks. We can customize the size of these chunks based on the number of characters.
from unstructured.partition.pdf import partition_pdf

raw_pdf_elements = partition_pdf(
    filename="./TSLA-Q3-2023-Update-3.pdf",
    # Use a layout model (YOLOX) to get bounding boxes (for tables) and find titles.
    # Titles are any sub-section of the document.
    infer_table_structure=True,
    # Post-processing to aggregate text once we have the titles.
    chunking_strategy="by_title",
    # Chunking parameters to aggregate text blocks:
    # max_characters is a hard maximum on chunk size,
    # new_after_n_chars attempts to start a new chunk after ~3800 characters,
    # combine_text_under_n_chars merges small sections into chunks of at least ~2000 characters.
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
)
The previous code identifies and extracts various elements from the PDF, which can be classified into CompositeElements (the textual content) and Tables. We use the Pydantic package to define a small data structure that stores the type and text of each element. The code below iterates through all extracted elements and keeps them in a list where each item is an instance of the Element type.
from pydantic import BaseModel
from typing import Any

# Define a simple data structure holding each element's type and raw text.
class Element(BaseModel):
    type: str
    text: Any

# Categorize the extracted elements by type.
categorized_elements = []
for element in raw_pdf_elements:
    if "unstructured.documents.elements.Table" in str(type(element)):
        categorized_elements.append(Element(type="table", text=str(element)))
    elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
        categorized_elements.append(Element(type="text", text=str(element)))
Creating the Element data structure enables convenient storage of this additional information, which is useful later for identifying the source of each answer, whether it is derived from text, tables, or figures.
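As a quick sanity check, one can count how many chunks of each type were extracted; the short snippet below is an optional sketch, and the exact counts depend on the PDF and the chunking parameters.

from collections import Counter

# Count the extracted chunks by their assigned type ("text" or "table").
print(Counter(el.type for el in categorized_elements))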
2. Graphs
The next step is gathering information from the charts to add context. The primary challenge is extracting images from the pages to feed into OpenAI's endpoint. A practical approach is to convert the PDF to images and pass each page to the model, inquiring if it detects any graphs. If it identifies one or more charts, the model can describe the data and the trends they represent. If no graphs are detected, the model will return an empty array as an indication.
The first step is installing the pdf2image package to convert the PDF into images. It also relies on the poppler tool, which we have already installed.
!pip install -q pdf2image==1.16.3
The code below uses the convert_from_path function, which takes the path of a PDF file. We can iterate over each page and save it as a PNG file using the .save() method. These images are saved in the ./pages directory. Additionally, we define the pages_png variable, which holds the file name of each image.
import os
from pdf2image import convert_from_path

# Create the output directory for the page images (no error if it already exists).
os.makedirs("./pages", exist_ok=True)

# Convert each PDF page into an image and save it as a PNG file.
images = convert_from_path('./TSLA-Q3-2023-Update-3.pdf')
for idx, image in enumerate(images):
    image.save(f"./pages/page-{idx}.png")

pages_png = [file for file in os.listdir("./pages") if file.endswith('.png')]
Before sending the image files to the OpenAI API, we need to define a few helper functions and variables. The headers variable contains the OpenAI API key, enabling the server to authenticate our requests. The payload carries configurations such as the model name, the maximum token limit, and the prompts. It instructs the model to describe the graphs and to generate responses in JSON format, addressing scenarios such as encountering multiple graphs on a single page or finding no graphs at all. We will add the images to the payload just before sending each request. Finally, the encode_image() function encodes the images in base64 format so they can be processed by OpenAI.
import base64

headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer " + str(os.environ["OPENAI_API_KEY"])
}

payload = {
    "model": "gpt-4-vision-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "You are an assistant that finds charts, graphs, or diagrams in an image and summarizes their information. There could be multiple diagrams in one image, so explain each one of them separately. Ignore tables."
                },
                {
                    "type": "text",
                    "text": 'The response must be a JSON in the following format {"graphs": [<chart_1>, <chart_2>, <chart_3>]} where <chart_1>, <chart_2>, and <chart_3> are placeholders that describe each graph found in the image. Do not append or add anything other than the JSON format response.'
                },
                {
                    "type": "text",
                    "text": 'If you cannot find a graph in the image, return an empty list JSON as follows: {"graphs": []}. Do not append or add anything other than the JSON format response. Do not use code "```" marks or the word json.'
                },
                {
                    "type": "text",
                    "text": "Look at the attached image and describe all the graphs inside it in JSON format. Ignore tables and be concise."
                }
            ]
        }
    ],
    "max_tokens": 1000
}

# Function to encode an image to a base64 string
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')
The remaining steps are: 1) looping through the images using the pages_png variable, 2) encoding each image in base64 format, 3) adding the image to the payload, and 4) sending the request to OpenAI and handling its response. We use the same Element data structure to store each image's type (graph) and text (the descriptions of the graphs).
import copy
import json
import requests
from tqdm import tqdm

graphs_description = []
for idx, page in tqdm(enumerate(pages_png)):
    # Get the base64 string of the page image.
    base64_image = encode_image(f"./pages/{page}")

    # Adjust the payload by attaching the encoded image.
    tmp_payload = copy.deepcopy(payload)
    tmp_payload['messages'][0]['content'].append({
        "type": "image_url",
        "image_url": {
            "url": f"data:image/png;base64,{base64_image}"
        }
    })

    try:
        response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=tmp_payload)
        response = response.json()
        graph_data = json.loads(response['choices'][0]['message']['content'])['graphs']
        desc = [f"{page}\n" + '\n'.join(f"{key}: {item[key]}" for key in item.keys()) for item in graph_data]
        graphs_description.extend(desc)
    except Exception:
        # Skip the page if the response cannot be decoded.
        print("skipping... error in decoding.")
        continue

graphs_description = [Element(type="graph", text=str(item)) for item in graphs_description]
Store on Deep Lake
This section uses the Deep Lake vector database to store the collected information along with its embeddings. Embedding vectors are numerical representations of pieces of text that capture their meaning, enabling similarity metrics such as cosine similarity to identify closely related documents. For instance, a prompt asking about a company's total revenue would have high cosine similarity with a stored document stating the revenue amount as X dollars.
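To make the idea concrete, here is a minimal, self-contained sketch of cosine similarity between two toy vectors. The numbers are illustrative only and are not real embeddings (OpenAI's Ada embeddings, used later in this lesson, have 1536 dimensions).

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity = dot product divided by the product of the vector norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" for a query and a document.
query_vec = np.array([0.10, 0.90, 0.20, 0.40])
doc_vec = np.array([0.12, 0.85, 0.25, 0.35])

print(cosine_similarity(query_vec, doc_vec))  # close to 1.0 => semantically similar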
The data preparation is complete with the extraction of all crucial information from the PDF. The next step involves combining the output from the previous sections, resulting in a list containing 41 entries.
all_docs = categorized_elements + graphs_description
print( len( all_docs ) )
41
Given that we are using LlamaIndex, we can use its integration with Deep Lake to create and store the dataset. Begin by installing the LlamaIndex and Deep Lake packages along with their dependencies.
!pip install -q llama_index==0.9.8 deeplake==3.8.8 cohere==4.37
Before using the libraries, it's essential to set the OPENAI_API_KEY and ACTIVELOOP_TOKEN environment variables. Remember to substitute the placeholder values with your actual keys from the respective platforms.
import os
os.environ["OPENAI_API_KEY"] = "<Your_OpenAI_Key>"
os.environ["ACTIVELOOP_TOKEN"] = "<Your_Activeloop_Key>"
The LlamaIndex integration provides the DeepLakeVectorStore class, which is designed to create a new dataset. Simply enter your organization ID, which by default is your Activeloop username, in the code provided below. This code will generate an empty dataset, ready to store documents.
from llama_index.vector_stores import DeepLakeVectorStore
# TODO: use your organization id here. (by default, org id is your username)
my_activeloop_org_id = "<YOUR-ACTIVELOOP-ORG-ID>"
my_activeloop_dataset_name = "tsla_q3"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"
vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=False)
Your Deep Lake dataset has been successfully created!
Next, we pass the created vector store to a StorageContext class. This class serves as a wrapper that creates storage from various data sources. In our case, we generate the storage from a vector database, which is done simply by passing the database instance to the .from_defaults() method.
from llama_index.storage.storage_context import StorageContext
storage_context = StorageContext.from_defaults(vector_store=vector_store)
To store our preprocessed data, we must transform it into LlamaIndex Documents for compatibility with the library. The LlamaIndex Document is an abstract class that acts as a wrapper for various data types, including text files, PDFs, and database outputs. This wrapper also makes it easy to store extra information alongside each sample; in our case, we include a metadata tag holding the data type (text, table, or graph), which could also be used to denote document relationships. This approach simplifies retrieving these details later.
As shown in the code below, you could employ built-in classes like SimpleDirectoryReader to automatically read files from a specified path, but here we proceed manually, looping through the list of all extracted information and assigning text and a category to each document.
from llama_index import Document

# Wrap every extracted element (text, table, and graph descriptions) in a Document.
documents = [Document(text=t.text, metadata={"category": t.type}) for t in all_docs]
Lastly, we can use the VectorStoreIndex class to generate embeddings for the documents and employ the database instance to store these values. By default, it uses OpenAI's Ada model to create the embeddings.
from llama_index import VectorStoreIndex
index = VectorStoreIndex.from_documents(
documents, storage_context=storage_context
)
Uploading data to deeplake dataset.
100%|██████████| 29/29 [00:00<00:00, 46.26it/s]
Dataset(path='hub://alafalaki/tsla_q3-nograph', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype        shape       dtype    compression
  -------    -------      -------     -------   -------
   text       text        (29, 1)       str       None
 metadata     json        (29, 1)       str       None
 embedding  embedding    (29, 1536)   float32     None
    id        text        (29, 1)       str       None
If you prefer not to create the dataset yourself, you can use our preprocessed version by replacing the dataset_path variable with the following: hub://genai360/tsla_q3.
Chatbot In Action
In this section, we use the created dataset as the retrieval source, providing the necessary context for the GPT-3.5-turbo model (the default choice for LlamaIndex) to answer the questions.
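As a side note, if you want to pin or swap the underlying LLM instead of relying on the default, LlamaIndex lets you pass a ServiceContext. The snippet below is an optional sketch based on the 0.9.x API installed earlier; the model name and temperature are illustrative choices, not requirements.

from llama_index import ServiceContext
from llama_index.llms import OpenAI

# Explicitly choose the LLM (and its temperature) instead of relying on defaults.
service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.0)
)

# It could then be passed when building the index, e.g.:
# index = VectorStoreIndex.from_vector_store(
#     vector_store, storage_context=storage_context, service_context=service_context
# )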
The DeepLakeVectorStore class also handles loading an existing dataset from the hub. The key distinction in the code below, compared to the previous sections, lies in the use of the .from_vector_store() method, which builds the index directly from the database rather than from in-memory documents.
from llama_index.vector_stores import DeepLakeVectorStore
from llama_index.storage.storage_context import StorageContext
from llama_index import VectorStoreIndex
vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=False)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_vector_store(
vector_store, storage_context=storage_context
)
We can now use the index's .as_query_engine() method to establish a query engine, which allows us to ask questions across the various data sources. The .query() method takes a prompt and searches for the most relevant data points within the database to construct an answer.
query_engine = index.as_query_engine()
response = query_engine.query(
"What are the trends in vehicle deliveries?",
)
The trends in vehicle deliveries show an increasing trend over the reported quarters.
As observed, the chatbot effectively utilized the graph descriptions we generated from the report: its answer reflects the report's bar chart of quarterly vehicle deliveries, which the model referenced to generate its response.
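To check which chunks an answer was grounded in, one option is to inspect the response's source nodes along with the category metadata we attached earlier. This is a minimal sketch, assuming the response object returned by the query above.

# Inspect the retrieved chunks, their category metadata, and similarity scores.
for node_with_score in response.source_nodes:
    category = node_with_score.node.metadata.get("category", "unknown")
    print("category:", category, "| score:", node_with_score.score)
    print(node_with_score.node.get_content()[:200], "\n---")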
Additionally, we conducted an experiment where we compiled the same dataset but excluded the graph descriptions; it is available at the hub://genai360/tsla_q3-nograph path. The purpose was to determine whether including the descriptions aids the chatbot's performance. Here is the response to the same query from the chatbot built without graph descriptions:
The trends in vehicle deliveries show a sequential decline in production volumes during the quarter, with September accounting for a smaller percentage of deliveries compared to the previous year. However, overall deliveries have been increasing year-over-year, with Model 3/Y deliveries showing a significant increase. Additionally, the total end of quarter operating lease vehicle count has been increasing steadily.
You'll observe that the chatbot now points to less relevant text segments. Although the answer is contextually similar, it does not provide the correct one: the graph shows an upward trend in deliveries, a detail that might not be mentioned explicitly in the report's text.
Conclusion
In this lesson, we explored the steps of developing a chatbot capable of using PDF files as a knowledge base to answer questions. We also employed the vision capability of GPT-4V to identify and describe the graphs on each page. Describing the charts and the trends they illustrate improves the chatbot's accuracy and provides additional context for its answers.