This notebook implements the PaperQA2 Workflow, designed to assist researchers through three primary phases:
- Paper Search: Locate relevant scientific articles.
- Gather Evidence: Rank document chunks to determine their relevance to the user's query.
- Generate Final Answers: Use the best-ranked evidence to create a comprehensive response.
This workflow draws inspiration from the modular architecture of PaperQA, which aims to reduce hallucinations and improve interpretability by grounding responses in retrieved scientific literature. The PaperQA approach enhances information retrieval and synthesis, offering researchers a systematic way to navigate and process scientific knowledge.
Source: PaperQA: Retrieval-Augmented Generative Agent for Scientific Research
Notebook PaperQA Workflow
Search
- The workflow begins with generating user queries. The generate_search_queries_phase() function uses a Large Language Model (LLM) to decompose the user query into multiple focused sub-queries. This mirrors the iterative and dynamic capabilities of PaperQA, which adapts the retrieval strategy based on the requirements of the query.
- PaperQA's first tool, search, retrieves relevant papers using keywords and queries generated from the initial question. This step keeps the research comprehensive, just as our workflow uses the generated queries to broaden the search scope.
- Vector Search: calculates cosine similarity between the query embedding and each document embedding and returns the top 5 matches by score.
- BM25 Search: ranks text by BM25 similarity to the query, also limited to the top 5.
- Each generated sub-query is run through both search methods, and the semantic_chunk() function then splits the retrieved text into smaller segments, akin to how PaperQA uses vector embeddings to explore multiple knowledge sources at once.
Evidence Gathering
- Once the search results are retrieved, the workflow moves into the evidence-gathering phase. The semantic_chunk() function breaks the retrieved articles down into smaller, meaningful segments. PaperQA uses maximal marginal relevance to enhance diversity among returned documents, minimizing redundancy and improving the quality of retrieved evidence.
- The gather_evidence_phase() function then uses ColBERT to re-rank these segments for relevance, similar to PaperQA's gather evidence tool, which integrates retrieval augmentation to score each chunk by its importance. This process prevents irrelevant context from interfering, keeping the focus on the most pertinent data.
Generate Final Answer
- In the final phase, the workflow generates an answer from the top-ranked evidence chunks. These are combined into a single context, which is then passed to an LLM via the generate_final_answer() function.
- The LLM synthesizes a response drawing from the evidence provided. References to the original articles are maintained, ensuring that the answer has clear provenance, as emphasized in PaperQA's approach to minimizing hallucination and ensuring verifiable answers.
- Synthesizing and citing relevant evidence offers a high level of reliability, much like PaperQA's goal of presenting answers comparable to those of human experts by ensuring that every claim is supported by a source.
PaperQA is designed to construct its final answer from retrieved evidence, following a map-reduce approach to synthesize information from multiple sources. This mirrors our workflow's approach of combining evidence before generating the final response, ensuring an overview that is both thorough and trustworthy.
Instructions for Use
- Set Up: Ensure you have set the OpenAI API key to enable the notebook to make requests to OpenAI.
- Run Cells Sequentially: Follow the notebook by running cells in order, starting with the environment setup and imports.
- Enter Your Query: At the prompt cell, enter the query about your research topic (e.g., "Impacts of gene editing on medicine").
- View Results: Examine the outputs at each stage, which include the search queries, evidence ranking, and final answer.
For each section below, you will find detailed explanations to help understand how each phase contributes to the overall goal of answering a research query.
Environment Setup
Install all the necessary dependencies and import the required libraries.
You need to provide your OpenAI API key in order for the notebook to generate search queries and answers. This setup allows you to leverage Deep Lake for dataset querying, OpenAI for question generation, and various other tools for processing text data.
- Ensure that the API key is correctly configured and that all installations complete successfully.
!pip install --quiet deeplake
!pip install --quiet llama-index llama-index-core transformers torch llama-index-embeddings-openai llama-index-llms-openai llama-index-postprocessor-colbert-rerank spacy openai langchain numpy pydantic
import os
import getpass
import json
import openai
import deeplake
import spacy
import llama_index
import langchain
import textwrap
from IPython.display import display, HTML
from langchain.docstore.document import Document
from deeplake import types
from google.colab import userdata
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.postprocessor.colbert_rerank import ColbertRerank
from openai import OpenAI
# Set up OpenAI API key and client
OPENAI_API_KEY = getpass.getpass('Enter your OpenAI API key (platform.openai.com): ')
# Validate that the API key exists
if not OPENAI_API_KEY:
raise ValueError("OpenAI API Key not found. Please enter a valid key when prompted.")
# Set environment variable and initialize OpenAI client
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY
openai.api_key = os.getenv("OPENAI_API_KEY")
ACTIVELOOP_TOKEN = getpass.getpass('Enter your Activeloop Token (activeloop.ai): ')
os.environ['ACTIVELOOP_TOKEN'] = ACTIVELOOP_TOKEN
# Initialize the OpenAI client
client = OpenAI()
Dataset Loading
In this section, we load the primary resource for the core operations of the PaperQA2 workflow: the Deep Lake dataset of scientific papers.
The dataset is loaded using the deeplake.open_read_only() function. It is a collection of scientific papers, originally located at hub://demo_v4/scientific_papers, which we have preloaded into our own dataset (al://genai360/scientific_papers_paperqa_with_embeddings) in preparation. This dataset will serve as the source of information for generating responses to user queries.
org_id = "genai360"
dataset_name = "scientific_papers_paperqa_with_embeddings"
ds = deeplake.open_read_only(f"al://{org_id}/{dataset_name}")
ds.summary()
Explore The Dataset with Hybrid Search
The PaperQA workflow in this notebook uses both vector search and BM25 search queries; ColBERT then re-ranks the results after semantic chunks have been created.
As an addition to your workflow, you could use the hybrid search method outlined in the section below to enhance the search phase.
In this stage, the system enhances its search capabilities by combining BM25 with Approximate Nearest Neighbors (ANN) for a hybrid search. This approach blends lexical search with semantic search, improving relevance by considering both keywords and semantic meaning.
We open the scientific_papers_paperqa_with_embeddings dataset to perform a hybrid search. First, we define a natural-language query, "what do you know about drones?", and generate its embedding using embedding_function(natural_query)[0]. We then convert this embedding into a comma-separated string, embedding_string, preparing it for use in the vector-based search. A separate keyword term, "ARGoS" (stored in query), is used for the BM25 text search.
def embedding_function(texts, model="text-embedding-3-large"):
if isinstance(texts, str):
texts = [texts]
texts = [t.replace("\n", " ") for t in texts]
return [data.embedding for data in openai.embeddings.create(input = texts, model=model).data]
Search for relevant papers using a specific sentence
We create two queries:
- Vector Search (tql_vs): calculates cosine similarity with embedding_string and returns the top 5 matches by score.
- BM25 Search (tql_bm25): ranks text by BM25 similarity to query, also limited to the top 5.
natural_query = "what do you know about drones?"
query = "ARGoS"
embed_query = embedding_function(natural_query)[0]
embedding_string = ",".join(str(c) for c in embed_query)
We then execute both queries, storing vector results in vs_results and BM25 results in bm25_results. This allows us to compare results from both search methods.
tql_vs = f"""
SELECT *
FROM (
SELECT *, cosine_similarity(embedding, ARRAY[{embedding_string}]) AS score
FROM (
SELECT *, ROW_NUMBER() AS row_id
)
)
ORDER BY score DESC
LIMIT 5
"""
tql_bm25 = f"""
SELECT *, BM25_SIMILARITY(text, '{query}') AS score
FROM (
SELECT *, ROW_NUMBER() AS row_id
)
ORDER BY BM25_SIMILARITY(text, '{query}') DESC
LIMIT 5
"""
vs_results = ds.query(tql_vs)
bm25_results = ds.query(tql_bm25)
Show the scores
for el_vs in vs_results:
print(f"vector search score: {el_vs['score']}")
for el_bm25 in bm25_results:
print(f"bm25 score: {el_bm25['score']}")
vector search score: 0.2837725281715393
vector search score: 0.24017328023910522
vector search score: 0.22901776432991028
vector search score: 0.22634504735469818
vector search score: 0.2239888608455658
bm25 score: 19.381977081298828
bm25 score: 19.296512603759766
bm25 score: 18.8994140625
bm25 score: 18.501781463623047
bm25 score: 18.068836212158203
First, we import the required libraries and define the data structures used for score fusion.
- Setup and Classes: We define a Document class using pydantic.BaseModel. Each Document has an id, a data dictionary, and an optional score for ranking.
- Softmax Function: The softmax function normalizes a list of scores (retrieved_score) using the softmax formula. Scores are exponentiated (capped at max_weight to avoid overflow) and then normalized to sum to 1, returning new_weights, a list of normalized scores.
import math
import numpy as np
from typing import Any, Dict, List, Optional
from pydantic import BaseModel
class Document(BaseModel):
id: str
data: Dict[str, Any]
score: Optional[float] = None
def softmax(retrieved_score: List[float], max_weight: int = 700) -> List[float]:
# Compute the exponentials
exp_scores = [math.exp(min(score, max_weight)) for score in retrieved_score]
# Compute the sum of the exponentials
sum_exp_scores = sum(exp_scores)
# Update the scores of the documents using softmax
new_weights = []
for score in exp_scores:
new_weights.append(score / sum_exp_scores)
return new_weights
Normalize the score
- Apply Softmax to Scores:
  - We extract score values from vs_results and bm25_results and apply softmax to them, storing the results in vss and bm25s. This step scales both sets of scores for easy comparison.
- Create Document Dictionaries:
  - We create dictionaries docs_vs and docs_bm25 to store documents from vs_results and bm25_results, respectively. For each result, we add the title and text along with the normalized score. Each document is keyed by its row_id.
This code standardizes scores and organizes results, allowing comparison across both vector and BM25 search methods.
vs_score = vs_results["score"]
bm_score = bm25_results["score"]
vss = softmax(vs_score)
bm25s = softmax(bm_score)
print(vss)
print(bm25s)
[0.2087589635344044, 0.1998527916854391, 0.1976357199697022, 0.19710820089464823, 0.1966443239158061] [0.31065925447580744, 0.28521183623674395, 0.19173872690847832, 0.12883094603697134, 0.08355923634199894]
docs_vs = {}
docs_bm25 = {}
for el, score in zip(vs_results, vss):
docs_vs[str(el["row_id"])] = Document(id=str(el["row_id"]), data={"text": el["text"], "title": el["title"]}, score=score)
for el, score in zip(bm25_results, bm25s):
docs_bm25[str(el["row_id"])] = Document(id=str(el["row_id"]), data={"text": el["text"], "title": el["title"]}, score=score)
We define weights for our hybrid search: VECTOR_WEIGHT and LEXICAL_WEIGHT are both set to 0.5, giving equal importance to vector-based and BM25 scores.
- Initialize Results Dictionary:
  - We create an empty dictionary, results, to store documents with their combined scores from both search methods.
- Combine Scores:
  - We iterate over the unique document IDs from docs_vs and docs_bm25. For each document:
    - We add it to results, defaulting to whichever version is available (vector or BM25).
    - We calculate a weighted score: vs_score from the vector results (if present in docs_vs) and bm_score from the BM25 results (if present in docs_bm25).
    - The final results[k].score is the sum of vs_score and bm_score.
This produces a fused score for each document in results, ready to rank in the hybrid search (a short ranking sketch follows after the fusion output below).
Fusion method
def fusion(docs_vs: Dict[str, Document], docs_bm25: Dict[str, Document]) -> Dict[str, Document]:
VECTOR_WEIGHT = 0.5
LEXICAL_WEIGHT = 0.5
results: Dict[str, Document] = {}
for k in set(docs_vs) | set(docs_bm25):
results[k] = docs_vs.get(k, None) or docs_bm25.get(k, None)
vs_score = VECTOR_WEIGHT * docs_vs[k].score if k in docs_vs else 0
bm_score = LEXICAL_WEIGHT * docs_bm25[k].score if k in docs_bm25 else 0
results[k].score = vs_score + bm_score
return results
results = fusion(docs_vs, docs_bm25)
for k, v in results.items():
print(f"text: {v.data['text']}, score: {v.score}")
text: The software would actually be able to recognize hedge- hogs from…
text: Yanai Elazar, Nora Kassner, Shauli Ravfogel, Abhi- lasha Ravichander…
text: (a) UNSW-NB15 layer speed up compared to accuracy loss (b) UNSW-NB15…
text: 8 Symbol Description ip0 Current race control flag ip1 The active path…
text: The Effect of Predictive Formal Modelling at Runtime on Performance…
…
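To turn these fused scores into a final hybrid ranking, you could sort the fused documents by their combined score. The following is a minimal sketch, reusing the results dictionary produced by the fusion step above:
# Sort the fused documents by combined score (highest first) to obtain the hybrid ranking
ranked_docs = sorted(results.values(), key=lambda doc: doc.score or 0.0, reverse=True)
for rank, doc in enumerate(ranked_docs, start=1):
    print(f"{rank}. {doc.data['title']} (fused score: {doc.score:.4f})")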
Initializing the ColBERT Reranker
ColBERT allows us to perform a contextual ranking of document chunks, ensuring that the retrieved results are both relevant and contextually appropriate. By reranking using an effective scoring model like ColBERT, we improve the quality of the evidence used to generate final answers.
- ColBERT (Contextualized Late Interaction over BERT) is used for reranking document chunks.
- The initialize_colbert_reranker() function sets up the reranker using the ColBERTv2.0 model.
- top_n=5: Specifies that we want to rerank the top 5 results.
- model and tokenizer: Specifies the pre-trained ColBERT model to use.
- keep_retrieval_score=True: Keeps the original retrieval score, which allows us to compare how well the model performs in retrieval versus reranking.
# Initialize the ColBERT reranker
def initialize_colbert_reranker():
return ColbertRerank(
top_n=5,
model="colbert-ir/colbertv2.0",
tokenizer="colbert-ir/colbertv2.0",
keep_retrieval_score=True,
)
# Create a global instance of the reranker to reuse
colbert_reranker = initialize_colbert_reranker()
Loading the spaCy Model for NLP Tasks
The spaCy model is used during the chunking phase, where large documents are broken into smaller, more manageable chunks. The chunks are later ranked and used for generating answers, making the NLP model crucial for ensuring that the segmentation of documents is done in a meaningful way.
# Load the spaCy model for semantic chunking
nlp = spacy.load('en_core_web_sm')
Verbose Helper Functions
These helper functions render the workflow's intermediate output in a readable format.
display(HTML('''
<script>
// Select all output areas in the current document
const observer = new MutationObserver(function(mutations) {
for (let mutation of mutations) {
if (mutation.target.nodeName === 'DIV' && mutation.target.className.includes('output-subarea')) {
let outputDiv = mutation.target;
outputDiv.parentNode.style.maxHeight = "none"; // Remove any max-height set by default
outputDiv.parentNode.style.height = "auto"; // Set to auto to accommodate full content height
}
}
});
// Observe changes in the entire document for added output nodes
observer.observe(document.documentElement, {
attributes: false,
childList: true,
subtree: true,
});
</script>
'''))
# Inject global CSS styles for both light and dark modes
display(HTML('''
<style>
/* Final Answer Styles */
.final-answer-box {
border: 2px solid #4CAF50;
padding: 15px;
margin: 20px;
border-radius: 10px;
background-color: var(--final-answer-bg, #f9f9f9);
color: var(--final-answer-text, #333);
font-family: Arial, sans-serif;
}
.final-answer-title {
color: #4CAF50;
}
.final-answer-header {
color: #333;
}
.final-answer-list {
list-style-type: disc;
padding-left: 20px;
}
.final-answer-reference {
margin-bottom: 10px;
}
/* Verbose Message Styles */
.verbose-box {
border: 1px solid #ddd;
padding: 15px;
margin: 10px 0;
border-radius: 8px;
background-color: var(--verbose-bg, #f9f9f9);
color: var(--verbose-text, #333);
font-family: Arial, sans-serif;
font-size: 14px;
}
.verbose-info { background-color: #f9f9f9; }
.verbose-success { background-color: #d4edda; }
.verbose-warning { background-color: #fff3cd; }
.verbose-error { background-color: #f8d7da; }
.verbose-progress { background-color: #d1ecf1; }
/* Phase Header Styles */
.phase-header-box {
border: 2px solid #333;
padding: 20px;
margin: 20px 0;
border-radius: 10px;
background-color: var(--phase-header-bg, #f1f1f1);
color: var(--phase-header-text, #333);
font-family: Arial, sans-serif;
}
.phase-header-title {
font-weight: bold;
margin-bottom: 10px;
}
.phase-header-list {
list-style-type: disc;
padding-left: 20px;
font-size: 14px;
}
/* Light Mode Variables */
@media (prefers-color-scheme: light) {
:root {
--final-answer-bg: #f9f9f9;
--final-answer-text: #333;
--verbose-bg: #f9f9f9;
--verbose-text: #333;
--phase-header-bg: #f1f1f1;
--phase-header-text: #333;
}
}
/* Dark Mode Variables */
@media (prefers-color-scheme: dark) {
:root {
--final-answer-bg: #2e2e2e;
--final-answer-text: #f9f9f9;
--verbose-bg: #333333;
--verbose-text: #f9f9f9;
--phase-header-bg: #1e1e1e;
--phase-header-text: #f9f9f9;
}
/* Adjust border colors for better contrast in dark mode */
.final-answer-box {
border: 2px solid #4CAF50;
}
.phase-header-box {
border: 2px solid #555;
}
}
</style>
'''))
def display_final_answer(response_json):
"""
Displays the final answer, evidence, and references in a formatted HTML box.
Args:
- response_json (str): The JSON string containing the final answer, evidence, and references.
"""
response = json.loads(response_json)
# Extract data
final_answer = response.get("final_answer", "No answer found.")
evidence = response.get("evidence", [])
pdf_references = response.get("pdf_references", [])
# Format the output
html_content = f"""
<div class='final-answer-box'>
<h2 class='final-answer-title'>Final Answer</h2>
<p>{final_answer}</p>
<h3 class='final-answer-header'>Evidence</h3>
<ul class='final-answer-list'>
"""
for ev in evidence:
html_content += f"<li>{ev}</li>"
html_content += "</ul>"
html_content += """
<h3 class='final-answer-header'>References</h3>
<ul class='final-answer-list'>
"""
for ref in pdf_references:
file_name = ref.get("file_name", "Unknown file")
citation = ref.get("citation", "No citation provided")
content = ref.get("content", "No content available")
html_content += f"""
<li class='final-answer-reference'>
<strong>File Name:</strong> {file_name}<br>
<strong>Citation:</strong> {citation}<br>
<p>{content}</p>
</li>
"""
html_content += "</ul></div>"
# Display using IPython display
display(HTML(html_content))
def render_verbose(step, message, level="info", color=None):
"""
A unified HTML rendering function to display different verbose items in a user-friendly manner.
Args:
- step (str): Title or main action (e.g., 'Search Execution', 'Final Answer Generation').
- message (str): Detailed message to display.
- level (str): Severity level ('info', 'success', 'warning', 'error', 'progress').
- color (str): Optional custom color.
"""
icon = {
"info": "ℹ️",
"success": "✅",
"warning": "⚠️",
"error": "❌",
"progress": "⏳"
}.get(level, "ℹ️")
# Determine the appropriate CSS class based on the level
css_class = f"verbose-{level}" if level in ["info", "success", "warning", "error", "progress"] else "verbose-info"
html_content = f"""
<div class='verbose-box {css_class}'>
<h4><span>{icon}</span> <strong>{step}</strong></h4>
<p>{message}</p>
</div>
"""
display(HTML(html_content))
def render_phase_header(phase_name, phase_description):
"""
Render a header for each phase of the PaperQA2 workflow.
Args:
- phase_name (str): The name of the phase (e.g., 'Phase 1: Paper Search').
- phase_description (str): Description of what happens in the current phase,
separated by newlines for each action.
"""
# Split the phase description into individual bullet points
bullet_points = phase_description.split("\n")
bullet_points_html = "".join(f"<li>{point.strip()}</li>" for point in bullet_points if point.strip())
html_content = f"""
<div class='phase-header-box'>
<h2 class='phase-header-title'>{phase_name}</h2>
<ul class='phase-header-list'>{bullet_points_html}</ul>
</div>
"""
display(HTML(html_content))
PaperQA Workflow Functions
Phase 1, Step 1 - Paper Search
Function: generate_search_queries_phase(user_query)
This function is the starting point of the workflow, responsible for generating search queries based on the user's input.
Objective: Transform the user's input query into well-structured search queries.
Technical Details:
- Utilizes a Large Language Model (LLM) to generate multiple sub-queries. This decomposition ensures that the scope of research is comprehensive, with both narrow and broad focus.
- The LLM constructs search terms in Tensor Query Language (TQL) format to search Deep Lake's corpus effectively.
- Generates a JSON response containing multiple search queries and their corresponding TQL syntax, which is optimized for relevance using BM25 ranking.
This approach helps extract the most relevant documents, ensuring a strong foundation for subsequent phases by covering various aspects of the user query comprehensively.
System Message
The Search System Prompt is foundational to the Search Phase of the Notebook PaperQA Workflow.
By instructing the LLM to decompose the user's query into multiple focused sub-queries using extracted keywords, it ensures that both BM25 and Vector Search mechanisms retrieve comprehensive and relevant papers.
This decomposition mirrors PaperQA's capability to adapt retrieval strategies dynamically, allowing the system to cover various aspects of the user's intent effectively. Additionally, the strict formatting into JSON with properly structured TQL queries facilitates seamless integration with the subsequent search operations, enhancing the overall efficiency and accuracy of evidence retrieval.
system_prompt_search = textwrap.dedent("""
You are an assistant that generates search queries for a scientific papers database based on relevant keywords extracted from the user's query.
Your task is to analyze the user's input query and determine if multiple search queries are necessary to gather comprehensive evidence. This may involve creating both narrow and broad searches, or using different phrasings to capture all relevant aspects of the user's intent.
Extract relevant keywords from the user's input query before creating search queries. **Only use keywords** for both the search term and TQL output.
Provide the output in **valid JSON format only**. The JSON should include:
- **'queries'**: A list of objects.
- Each object should contain:
- **'query'**: The search term composed of strictly the extracted keywords.
- **'tql'**: The corresponding TQL (Tensor Query Language) query formatted for Deep Lake.
Ensure the TQL query is correctly formatted for Deep Lake's Tensor Query Language:
- Always set **LIMIT to 5**.
- Use **ORDER BY BM25_SIMILARITY(text, 'search terms')** to rank results based on relevance.
- Note that the database only contains **text fields** to search.
# Steps
1. Analyze the user's query to understand its scope and identify if multiple aspects or angles need to be addressed.
2. Extract the most relevant keywords from the user's query.
3. If multiple queries are needed, formulate each search term to cover different aspects (e.g., narrow and broad searches, different phrasings).
4. Generate a corresponding TQL query for Deep Lake based on each search term, using the keywords only and incorporating the BM25 similarity ranking.
# Output Format
Provide the response in the following JSON structure:
```json
{
"queries": [
{
"query": "[Search term 1 derived from keywords]",
"tql": "SELECT *, BM25_SIMILARITY(text, '[Search term 1]') AS score FROM ( SELECT *, ROW_NUMBER() AS row_id ) ORDER BY BM25_SIMILARITY(text, '[Search term 1]') DESC LIMIT 5"
},
{
"query": "[Search term 2 derived from keywords]",
"tql": "SELECT *, BM25_SIMILARITY(text, '[Search term 2]') AS score FROM ( SELECT *, ROW_NUMBER() AS row_id ) ORDER BY BM25_SIMILARITY(text, '[Search term 2]') DESC LIMIT 5"
},
...
]
}
```
# Examples
**Valid Example with Single Query**:
```json
{
"queries": [
{
"query": "CRISPR gene editing",
"tql": "SELECT *, BM25_SIMILARITY(text, 'CRISPR gene editing') AS score FROM ( SELECT *, ROW_NUMBER() AS row_id ) ORDER BY BM25_SIMILARITY(text, 'CRISPR gene editing') DESC LIMIT 5"
}
]
}
```
**Valid Example with Multiple Queries**:
```json
{
"queries": [
{
"query": "CRISPR gene editing",
"tql": "SELECT *, BM25_SIMILARITY(text, 'CRISPR gene editing') AS score FROM ( SELECT *, ROW_NUMBER() AS row_id ) ORDER BY BM25_SIMILARITY(text, 'CRISPR gene editing') DESC LIMIT 5"
},
{
"query": "Applications of CRISPR",
"tql": "SELECT *, BM25_SIMILARITY(text, 'Applications of CRISPR') AS score FROM ( SELECT *, ROW_NUMBER() AS row_id ) ORDER BY BM25_SIMILARITY(text, 'Applications of CRISPR') DESC LIMIT 5"
},
{
"query": "Ethical implications of gene editing",
"tql": "SELECT *, BM25_SIMILARITY(text, 'Ethical implications of gene editing') AS score FROM ( SELECT *, ROW_NUMBER() AS row_id ) ORDER BY BM25_SIMILARITY(text, 'Ethical implications of gene editing') DESC LIMIT 5"
}
]
}
```
# Notes
- Focus on identifying and extracting concise and relevant keywords from the user's query.
- Avoid unnecessary phrases such as "application of", "for and against", etc., unless they contribute to a distinct search aspect.
- When the user's query encompasses multiple facets or perspectives, generate separate search queries for each aspect to ensure comprehensive coverage.
- Ensure extracted keywords are used clearly and avoid redundant or overly specific terms. Always use keywords strictly for query generation.
""")
def generate_search_queries_phase(user_query, system_prompt_search=system_prompt_search):
"""
Phase 1: Paper Search
Step 1: Generate Search Queries from User Query.
Args:
- user_query (str): The user query to generate search queries and retrieve papers.
Returns:
- queries_json (dict): Parsed JSON response containing the generated search queries.
"""
# Verbose
render_phase_header("Phase 1: Paper Search", "Get candidate papers from LLM-generated keyword query\nChunk, embed, and add candidate papers to state")
# Generate Search Queries
# Verbose
render_verbose("Search Iteration", f"Getting LLM-generated TQL Syntax for user query: <strong>{user_query}</strong>", level="progress")
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "system",
"content": system_prompt_search
},
{
"role": "user",
"content": user_query
}
],
response_format={
"type": "json_object"
},
temperature=0.1,
max_tokens=2048,
top_p=1,
frequency_penalty=0,
presence_penalty=0
)
# Verbose
render_verbose("Search Queries Generated", f"Generated search queries for: {user_query}", level="success")
queries = response.choices[0].message.content
# Parse JSON Response
try:
queries_json = json.loads(queries)
# Verbose
render_verbose("JSON Parsing", "Successfully parsed JSON search queries.", level="success")
render_verbose("Search Queries", json.dumps(queries_json, indent=4), level="sucess")
except json.JSONDecodeError as e:
# Verbose
render_verbose("Error", f"Failed to parse JSON: {str(e)}", level="error")
raise ValueError(f"Failed to decode JSON from search queries: {str(e)}")
return queries_json
Phase 1, Step 2 - Search and Chunk Papers
Function: search_and_chunk_papers_phase(queries_json)
Following the generation of queries, this function executes the searches against the relevant databases.
Objective: Retrieve the relevant papers and split them into meaningful content chunks.
Technical Details:
- Performs hybrid searches on the scientific database.
- The retrieved documents are segmented into smaller, coherent chunks for easier processing.
- Each chunk is embedded into a vector space to facilitate similarity-based retrieval in subsequent steps.
This chunking and embedding process ensures efficient handling of documents, allowing relevant sections to be isolated for further analysis.
# Helper functions for semantic chunking and embedding user query
def semantic_chunk(text, max_len=200):
"""
Chunk the given text into smaller pieces based on maximum length.
Args:
- text (str): The input text to be chunked.
- max_len (int): Maximum length of each chunk (default is 200).
Returns:
- List of text chunks
"""
doc = nlp(text)
chunks = []
current_chunk = []
for sent in doc.sents:
current_chunk.append(sent.text)
if len(' '.join(current_chunk)) > max_len:
chunks.append(' '.join(current_chunk))
current_chunk = []
if current_chunk:
chunks.append(' '.join(current_chunk))
return chunks
def search_and_chunk_papers_phase(queries_json):
"""
Phase 1: Paper Search
Step 2: Search Papers using BM25 and Embedding-based similarity, Chunk, and Embed into State.
Args:
- queries_json (dict): JSON response containing search queries generated from the user's query.
Returns:
- aggregated_chunks (list): A list of (title, chunk) tuples representing chunks of text extracted from retrieved documents.
"""
tql_queries_list = queries_json.get('queries')
if not isinstance(tql_queries_list, list):
# Verbose
render_verbose("Error", "Invalid query format detected. Expected a list of queries in the JSON response.", level="error")
raise ValueError("Invalid query format: Expected a list of queries in the JSON response.")
aggregated_chunks = []
# Iterate over each query in the queries_json
for idx, query_obj in enumerate(tql_queries_list, 1):
# Extract BM25-based TQL query
tql_query = query_obj.get('tql')
natural_query = query_obj.get('query')
# Initialize sets to store unique entries from both search methods
unique_entries_bm25 = set()
unique_entries_embed = set()
# ---- BM25-Based Search ----
if tql_query:
# Verbose
render_verbose("Search Execution", f"Executing BM25 query {idx}/{len(tql_queries_list)}: <code>{tql_query}</code>", level="progress")
try:
view_bm25 = ds.query(tql_query)
# Verbose
render_verbose("Samples Retrieved (BM25)", f"Number of samples retrieved: <strong>{len(view_bm25)}</strong>", level="success")
except TypeError as e:
# Verbose
render_verbose("Error", f"Error in querying dataset with BM25: {str(e)}", level="error")
view_bm25 = []
for sample in view_bm25:
title = sample["title"]
text = sample["text"]
unique_entries_bm25.add((title, text))
# ---- Embedding-Based Search ----
if natural_query:
# Generate embedding for the natural language query
embed_query = embedding_function(natural_query)[0]
str_query = ",".join(str(c) for c in embed_query)
# Construct the similarity search query
query_vs = f"""
SELECT *
FROM (
SELECT *, cosine_similarity(embedding, ARRAY[{str_query}]) AS score
FROM (
SELECT *, ROW_NUMBER() AS row_id
)
)
ORDER BY score DESC
LIMIT 5
"""
# Verbose
render_verbose("Embedding-Based Search", f"Executing embedding-based search for query {idx}/{len(tql_queries_list)}.", level="progress")
try:
view_embed = ds.query(query_vs)
# Verbose
render_verbose("Samples Retrieved (Embedding)", f"Number of samples retrieved: <strong>{len(view_embed)}</strong>", level="success")
except TypeError as e:
# Verbose
render_verbose("Error", f"Error in embedding-based querying dataset: {str(e)}", level="error")
view_embed = []
for sample in view_embed:
title = sample["title"]
text = sample["text"]
unique_entries_embed.add((title, text))
# Combine unique entries from both search methods
combined_unique_entries = unique_entries_bm25.union(unique_entries_embed)
# Verbose
render_verbose("Combined Samples Retrieved", f"Total unique samples retrieved for query {idx}: <strong>{len(combined_unique_entries)}</strong>", level="success")
# Semantic Chunking
render_verbose("Semantic Chunking", f"Performing semantic chunking on the retrieved text for query {idx}.", level="progress")
for title, text in combined_unique_entries:
chunks = semantic_chunk(text)
aggregated_chunks.extend([(title, chunk) for chunk in chunks])
# Verbose
render_verbose("Aggregation Complete", f"Total number of chunks aggregated: <strong>{len(aggregated_chunks)}</strong>", level="success")
render_verbose("Phase 1 Complete", f"Now move on to phase 2 below", level="success")
return aggregated_chunks
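As a quick sanity check, you can call semantic_chunk() directly on a short paragraph to see how max_len affects the splits; the sample text below is purely illustrative:
# Illustrative example: chunk a short sample paragraph with a smaller max_len
sample_text = (
    "CRISPR enables precise genome editing. It has applications in medicine and agriculture. "
    "Researchers continue to debate its ethical implications. Off-target effects remain an open concern."
)
for i, chunk in enumerate(semantic_chunk(sample_text, max_len=80), start=1):
    print(f"Chunk {i}: {chunk}")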
Phase 2 - Gather Evidence
Function: gather_evidence_phase(aggregated_chunks, user_query)
In this phase, the function takes the document chunks gathered earlier and re-ranks them based on their relevance to the user's original query.
Objective: Rank and summarize the most relevant document sections to create the best possible evidence pool.
Technical Details:
- Embeds the user's query as a vector, and then uses vector similarity to rank the document chunks.
- Utilizes ColBERT to re-rank these chunks, enhancing their precision based on semantic relevance.
- Creates scored summaries for each chunk to help identify the top pieces of evidence.
This phase refines the search results into a high-quality evidence base by focusing only on the most pertinent information.
def gather_evidence_phase(aggregated_chunks, user_query):
"""
Phase 2: Gather Evidence
Embed query into vector, rank top k document chunks, and create scored summaries.
Args:
- aggregated_chunks (list): List of (title, chunk) tuples to rerank.
- user_query (str): User query to rerank against.
Returns:
- reranked_results (list): A list of tuples (text, score, title) representing the reranked document chunks.
"""
# Verbose
render_phase_header("Phase 2: Gather Evidence", "- Embed query into vector\n- Rank top k document chunks in current state\n- Create scored summary of each chunk")
render_verbose("Reranking Process", f"Starting reranking process for user query: <strong>{user_query}</strong>", level="progress")
# Write Entries to File for Reranking Preparation
data_dir = './data/temp/'
!mkdir -p {data_dir}
with open(f'{data_dir}/temp.txt', 'w') as f:
for title, text in aggregated_chunks:
f.write(f"{title}\n{text}\n\n")
render_verbose("Data Preparation", f"Prepared {len(aggregated_chunks)} entries for reranking and saved to temporary file.", level="success")
# Load Documents and Create Index
documents = SimpleDirectoryReader(data_dir).load_data()
render_verbose("Data Loading", "Documents successfully loaded from the temporary directory.", level="success")
index = VectorStoreIndex.from_documents(documents=documents)
render_verbose("Indexing", "VectorStoreIndex successfully built from loaded documents.", level="success")
# Verbose
render_verbose("Querying", "Performing similarity search and reranking. This may take a moment...", level="progress")
query_engine = index.as_query_engine(
similarity_top_k=10,
node_postprocessors=[colbert_reranker],
)
response = query_engine.query(user_query)
# Processing the Reranked Results
reranked_results = []
for node in response.source_nodes:
# Attempt to extract title from metadata
title = node.node.metadata.get("title", "").strip()
# If title is not in metadata, extract it from the content
if not title:
content_full = node.node.get_content()
lines = content_full.split('\n', 1)
title = lines[0].strip() if len(lines) > 0 else "Unknown Title"
# Optionally, update content to exclude the title
content = lines[1].strip() if len(lines) > 1 else content_full
else:
content = node.node.get_content()
content_preview = content[:120]
score = node.score
reranked_results.append((content, score, title))
# Verbose
reranked_count = len(reranked_results)
rerank_summary_message = (
f"Reranking process completed.<br>"
f"Total number of reranked results: <strong>{reranked_count}</strong>.<br>"
)
# Verbose
render_verbose("Reranking Summary", rerank_summary_message, level="success")
# Verbose
top_reranked_message = "<strong>Top 3 Reranked Results:</strong><br><ul>"
for idx, (content, score, title) in enumerate(reranked_results[:3], 1):
top_reranked_message += (
f"<li><strong>Title:</strong> {title}<br>"
f"<strong>Score:</strong> {score:.2f}<br>"
f"<strong>Content Preview:</strong> {content[:100]}...</li><br>"
)
top_reranked_message += "</ul>"
# Verbose
render_verbose("Top Reranked Results", top_reranked_message, level="info")
render_verbose("Phase 2 Complete", f"Now move on to phase 3 below", level="success")
return reranked_results
Phase 3 - Generate Final Answer
Function: generate_final_answer(relevant_chunks, user_query)
This phase involves using the top-ranked chunks to generate a comprehensive response to the user's query.
Objective: Synthesize an answer based on the evidence gathered.
Technical Details:
- Combines the selected evidence chunks into a single context.
- Passes this context, along with the user's original question, to an LLM (GPT-4o) for generating a detailed response.
- Ensures that the response includes references to the original sources, citing the paper names, sections, and content.
This final phase is crucial as it consolidates all gathered data into an insightful, well-referenced response, directly addressing the user's inquiry.
Final Answer System Prompt
The Final Answer System Prompt plays a critical role in the Generate Final Answer Phase of the Notebook PaperQA Workflow.
After gathering and re-ranking relevant evidence, this prompt directs the LLM to synthesize a coherent and factually supported response. By mandating the inclusion of detailed PDF references, it ensures that the final answer maintains high reliability and traceability, preventing hallucinations and verifying the provenance of information.
This structured JSON output aligns with PaperQA’s emphasis on integrating retrieval augmentation, ensuring that each claim is backed by credible sources. Consequently, this prompt guarantees that the generated answers are both comprehensive and verifiable, closely mirroring the expertise and trustworthiness of human-generated responses.
system_prompt_final = textwrap.dedent("""
Answer the given query by considering all provided evidence so your response remains comprehensive yet supported by relevant facts only.
Your response must incorporate references, specifying not only the PDF's name but also include the specific section and context.
# Output Format
Provide the response using the following JSON structure:
```json
{
"final_answer": "str",
"evidence": ["array of evidence strings"],
"pdf_references": [
{
"file_name": "str (the name of the PDF file)",
"citation": "str (specific reference from PDF)",
"content": "str (full reference content from PDF)"
}
],
"answer": "str (final answer based on evidence)"
}
```
Ensure that the JSON response is correctly formatted and contains all specified fields. Multiple "pdf_references" entries are encouraged if multiple PDFs or sections are cited.
""")
def generate_final_answer(relevant_chunks, user_query,system_prompt_final=system_prompt_final):
"""
Phase 3: Generate Answer
Put the best summaries into a prompt with context, generate the final answer, and display it.
Args:
- relevant_chunks (list): List of (text, score, title) tuples representing the top-ranked document chunks.
- user_query (str): The original user query to generate the answer for.
Returns:
- None. Displays the final answer in the UI.
"""
# Extract only the text part from each tuple in relevant_chunks
evidence_text = "\n".join([chunk[0] for chunk in relevant_chunks])
# Verbose
render_phase_header("Phase 3: Generate Answer", "- Put best summaries into prompt with context\n- Generate answer with prompt")
render_verbose("Final Answer Generation", "AI is generating the final answer based on selected chunks and user query.", level="Progress")
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "system",
"content": system_prompt_final
},
{
"role": "user",
"content": user_query
}
],
temperature=0.3,
max_tokens=2400,
top_p=1,
frequency_penalty=0,
presence_penalty=0
)
answer = response.choices[0].message.content if response.choices else ""
# Clean the response to extract the JSON
clean_answer = answer.strip()
if clean_answer.startswith("```json"):
clean_answer = clean_answer[len("```json"):].strip()
if clean_answer.endswith("```"):
clean_answer = clean_answer[:-3].strip()
# Validate and Parse Response
try:
parsed_response = json.loads(clean_answer)
display_final_answer(clean_answer)
render_verbose("PaperQA Workflow Complete", f" ", level="success")
except json.JSONDecodeError as e:
render_verbose("Error", f"Failed to parse JSON response: {str(e)}", level="error")
render_verbose("Raw Response", f"The raw response received was: {clean_answer}", level="error")
raise ValueError("Failed to decode JSON response.")
Putting It All Together - PaperQA2 Workflow
Let's run the entire PaperQA2 process, integrating all three phases to provide a seamless research workflow.
Technical Details:
- Phase 1: Generates search queries from the user query and performs searches to create aggregated document chunks.
- Phase 2: Reranks and summarizes document chunks to gather the best evidence.
- Phase 3: Uses the top-ranked evidence to generate a well-cited, informative answer.
By structuring the workflow into three distinct yet interrelated phases, the PaperQA2 system effectively mimics a detailed, human-like approach to researching scientific literature, ensuring reliability and depth in the answers provided.
user_query = "Application of CRISPR and gene editing and arguments for and against its use"
# Phase 1: Paper Search
queries_json = generate_search_queries_phase(user_query)
for query in queries_json['queries']:
print(f"Query: {query['query']}, TQL: {query['tql']}")
Verbose
Phase 1: Paper Search
- Get candidate papers from LLM-generated keyword query
- Chunk, embed, and add candidate papers to state
⏳ Search Iteration
Getting LLM-generated TQL Syntax for user query: Application of CRISPR and gene editing and arguments for and against its use
✅ Search Queries Generated
Generated search queries for: Application of CRISPR and gene editing and arguments for and against its use
✅ JSON Parsing
Successfully parsed JSON search queries.
ℹ️ Search Queries
{ "queries": [ { "query": "Application of CRISPR gene editing", "tql": "SELECT *, BM25_SIMILARITY(text, 'Application of CRISPR gene editing') AS score FROM ( SELECT *, ROW_NUMBER() AS row_id ) ORDER BY BM25_SIMILARITY(text, 'Application of CRISPR gene editing') DESC LIMIT 5" }, { "query": "Arguments for CRISPR gene editing", "tql": "SELECT *, BM25_SIMILARITY(text, 'Arguments for CRISPR gene editing') AS score FROM ( SELECT *, ROW_NUMBER() AS row_id ) ORDER BY BM25_SIMILARITY(text, 'Arguments for CRISPR gene editing') DESC LIMIT 5" }, { "query": "Arguments against CRISPR gene editing", "tql": "SELECT *, BM25_SIMILARITY(text, 'Arguments against CRISPR gene editing') AS score FROM ( SELECT *, ROW_NUMBER() AS row_id ) ORDER BY BM25_SIMILARITY(text, 'Arguments against CRISPR gene editing') DESC LIMIT 5" } ] }
# Phase 1: Semantic Chunking
aggregated_chunks = search_and_chunk_papers_phase(queries_json)
Verbose
⏳ Search Execution
Executing BM25 query 1/3: SELECT *, BM25_SIMILARITY(text, 'Application of CRISPR gene editing') AS score FROM ( SELECT *, ROW_NUMBER() AS row_id ) ORDER BY BM25_SIMILARITY(text, 'Application of CRISPR gene editing') DESC LIMIT 5
✅ Samples Retrieved (BM25)
Number of samples retrieved: 5
⏳ Embedding-Based Search
Executing embedding-based search for query 1/3.
✅ Samples Retrieved (Embedding)
Number of samples retrieved: 5
✅ Combined Samples Retrieved
Total unique samples retrieved for query 1: 10
⏳ Semantic Chunking
Performing semantic chunking on the retrieved text for query 1.
⏳ Search Execution
Executing BM25 query 2/3: SELECT *, BM25_SIMILARITY(text, 'Arguments for CRISPR gene editing') AS score FROM ( SELECT *, ROW_NUMBER() AS row_id ) ORDER BY BM25_SIMILARITY(text, 'Arguments for CRISPR gene editing') DESC LIMIT 5
✅ Samples Retrieved (BM25)
Number of samples retrieved: 5
⏳ Embedding-Based Search
Executing embedding-based search for query 2/3.
✅ Samples Retrieved (Embedding)
Number of samples retrieved: 5
✅ Combined Samples Retrieved
Total unique samples retrieved for query 2: 10
⏳ Semantic Chunking
Performing semantic chunking on the retrieved text for query 2.
⏳ Search Execution
Executing BM25 query 3/3: SELECT *, BM25_SIMILARITY(text, 'Arguments against CRISPR gene editing') AS score FROM ( SELECT *, ROW_NUMBER() AS row_id ) ORDER BY BM25_SIMILARITY(text, 'Arguments against CRISPR gene editing') DESC LIMIT 5
✅ Samples Retrieved (BM25)
Number of samples retrieved: 5
⏳ Embedding-Based Search
Executing embedding-based search for query 3/3.
✅ Samples Retrieved (Embedding)
Number of samples retrieved: 5
✅ Combined Samples Retrieved
Total unique samples retrieved for query 3: 10
⏳ Semantic Chunking
Performing semantic chunking on the retrieved text for query 3.
✅ Aggregation Complete
Total number of chunks aggregated: 341
✅ Phase 1 Complete
Now move on to phase 2 below
# Phase 2: Gather Evidence
reranked_chunks = gather_evidence_phase(aggregated_chunks, user_query)
Verbose
Phase 2: Gather Evidence
- Embed query into vector
- Rank top k document chunks in current state
- Create scored summary of each chunk
⏳ Reranking Process
Starting reranking process for user query: Application of CRISPR and gene editing and arguments for and against its use
✅ Data Preparation
Prepared 341 entries for reranking and saved to temporary file.
✅ Data Loading
Documents successfully loaded from the temporary directory.
✅ Indexing
VectorStoreIndex successfully built from loaded documents.
⏳ Querying
Performing similarity search and reranking. This may take a moment...
✅ Reranking Summary
Reranking process completed.
Total number of reranked results: 5.
ℹ️ Top Reranked Results
Top 3 Reranked Results:
- Title: 1785_Smooth Ranking SVM via Cutting-Plane Method.pdf
  Score: 0.62
  Content Preview: References [Alcal´a-Fdez et al., 2011] Jes´us Alcal´a-Fdez, Alberto Fern´andez, Juli´an Luengo, Joaq...
- Title: 1128_Towards Risk Analysis of the Impact of AI on the Deliberate Biological Threat Landscape.pdf
  Score: 0.60
  Content Preview: The ways in which dual use life science technologies could be misused have been categorized by the l...
- Title: 1128_Towards Risk Analysis of the Impact of AI on the Deliberate Biological Threat Landscape.pdf
  Score: 0.60
  Content Preview: Researchers had recently demonstrated the ability to put DNA from one organism into another, and the...
✅ Phase 2 Complete
Now move on to phase 3 below
# Phase 3: Generate Answer
generate_final_answer(reranked_chunks, user_query)
Phase 3: Generate Answer
- Put best summaries into prompt with context
- Generate answer with prompt
ℹ️ Final Answer Generation
AI is generating the final answer based on selected chunks and user query.
Final Answer
CRISPR and gene editing technologies have transformative potential in medicine, agriculture, and biotechnology. They offer precise methods for modifying genes, which can lead to advances in treating genetic disorders, improving crop resilience, and even combating climate change. However, ethical concerns arise regarding potential misuse, unintended genetic consequences, and the moral implications of altering human genomes.
Evidence
- CRISPR technology allows for precise gene editing, which can lead to significant medical advancements.
- There are ethical concerns about the potential misuse of gene editing technologies and their long-term effects.
References
- File Name: CRISPR_Applications.pdf
  Citation: Section 3.1, Page 12
- File Name: Ethical_Considerations_in_Gene_Editing.pdf
  Citation: Section 4.2, Page 18
CRISPR technology has been applied in various fields such as medicine, agriculture, and environmental science. In medicine, it holds promise for treating genetic disorders by correcting mutations at the DNA level.
The ethical debate surrounding CRISPR and gene editing focuses on the potential for misuse, such as creating 'designer babies', and the unforeseen consequences of altering genetic material, which could have lasting impacts on biodiversity and human health.
✅ PaperQA Workflow Complete
Next Steps for Experimentation and Extension
This workflow can be further refined and expanded to suit different use cases. Below are a few suggestions for you to experiment with and extend PaperQA2:
Experiment with the Existing Code
- 1. Try Different User Queries:
- Feel free to experiment with different queries to see how PaperQA2 performs across a variety of topics.
- Observe how specific keywords and phrases impact the relevance of the retrieved documents.
- 2. Adjust Chunk Size for Better Results:
  - The chunking process divides text into smaller segments that are easier to process and rank.
  - You can modify the max_len parameter in the semantic_chunk(text, max_len=200) function to see how different chunk sizes affect evidence ranking and final answer quality.
  - Suggested Experimentation:
    - Smaller chunks (max_len = 100) may capture more specific details but lead to more fragments.
    - Larger chunks (max_len = 300-500) may capture more context but could be less focused.
  - To experiment with chunk size, see the first code snippet below.
- 3. Modify Ranking Parameters:
  - In Phase 2: Gather Evidence, the similarity_top_k parameter of the query_engine determines how many top chunks are selected for reranking.
  - You can experiment with this value to adjust how much evidence is gathered for answer generation.
  - Increasing this value might gather more evidence at the cost of longer processing times.
  - Decreasing it might improve efficiency but risk missing valuable information.
  - To modify ranking, see the second code snippet below.
- 4. Experiment with the Reranker:
  - The reranker (colbert_reranker) is used to reorder the retrieved chunks.
  - You could try replacing it or tweaking its parameters to experiment with different ranking approaches, such as different scoring algorithms or additional filters on text quality (a sketch follows after the snippets below).
- 5. Limit the Number of Results for Answer Generation:
  - In Phase 3: Generate Final Answer, only a subset of the highest-scoring chunks is used to generate the answer.
  - You could modify this subset size to observe how answer quality changes. For instance, using more chunks may yield more comprehensive answers but also lead to more verbose responses (see the sketch after the snippets below).
def semantic_chunk(text, max_len=200):
# Adjust 'max_len' here to control the chunk size.
query_engine = index.as_query_engine(
similarity_top_k=10, # Adjust this value to change how many top results are gathered.
node_postprocessors=[colbert_reranker],
)
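For items 4 and 5 above, here is a minimal sketch of the kind of changes you might try; the specific values (top_n=8, keeping 3 chunks) are illustrative assumptions, not recommendations:
# 4. Rerank more candidates by raising top_n on the ColBERT reranker
#    (re-run Phase 2 afterwards so the new reranker takes effect)
colbert_reranker = ColbertRerank(
    top_n=8,
    model="colbert-ir/colbertv2.0",
    tokenizer="colbert-ir/colbertv2.0",
    keep_retrieval_score=True,
)
# 5. Pass only the top 3 highest-scoring chunks into answer generation
generate_final_answer(reranked_chunks[:3], user_query)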
Going Beyond the Code: Extend the Capabilities of PaperQA2
Implement an Agentic Approach with Tool Integration:
- Move towards an agentic approach, where the entire PaperQA2 system behaves as an autonomous agent with different tools.
- Each phase can be thought of as a distinct tool:
- Tool 1: A paper-searching tool that takes the user query and retrieves relevant papers.
- Tool 2: An evidence-gathering tool that ranks and reorders content.
- Tool 3: An answer-generating tool that synthesizes the final answer.
- By combining these tools into an agent, you could create a more flexible, fully autonomous research assistant that continuously learns to improve the quality of its results.
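One possible starting point is to wrap the three phase functions behind a single entry point, which you could later register as individual tools in the agent framework of your choice. A minimal, non-agentic sketch using only the functions defined in this notebook:
def paperqa_pipeline(user_query: str):
    # Tool 1: paper search - generate queries, then retrieve and chunk candidate papers
    queries_json = generate_search_queries_phase(user_query)
    aggregated_chunks = search_and_chunk_papers_phase(queries_json)
    # Tool 2: evidence gathering - rerank the chunks against the query
    reranked_chunks = gather_evidence_phase(aggregated_chunks, user_query)
    # Tool 3: answer generation - synthesize and display the cited final answer
    generate_final_answer(reranked_chunks, user_query)

paperqa_pipeline("Impacts of gene editing on medicine")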
With these next steps, you can personalize PaperQA2 for your research needs, optimize its workflow, and explore advanced methods for making it a powerful research tool.