This notebook implements the PaperQA2 Workflow, designed to assist researchers through three primary phases:
- Paper Search: Locate relevant scientific articles.
- Gather Evidence: Rank document chunks to determine their relevance to the user's query.
- Generate Final Answers: Use the best-ranked evidence to create a comprehensive response.
This workflow draws inspiration from the modular architecture of PaperQA, which aims to reduce hallucinations and improve interpretability by grounding responses in retrieved scientific literature. The PaperQA approach enhances information retrieval and synthesis, offering researchers a systematic way to navigate and process scientific knowledge.
Source: PaperQA: Retrieval-Augmented Generative Agent for Scientific Research
Notebook PaperQA Workflow
Search
- The workflow begins with generating user queries. The generate_search_queries_phase() function uses a Large Language Model (LLM) to decompose the user query into multiple focused sub-queries. This mirrors the iterative and dynamic capabilities of PaperQA, which adapts the retrieval strategy based on the requirements of the query.
- PaperQA's first tool, search, retrieves relevant papers using keywords and queries generated from the initial question. This step keeps the research comprehensive, just as our workflow uses the generated queries to broaden the search scope.
- Vector Search: calculates cosine similarity between the query embedding and each document embedding and returns the top 5 matches by score.
- BM25 Search: ranks text by BM25 similarity to the query, also limited to the top 5.
- Each generated sub-query is run through both search methods, and the semantic_chunk() function then splits the retrieved text into smaller segments, akin to how PaperQA uses vector embeddings to explore multiple knowledge sources at once.
Evidence Gathering
- Once the search results are retrieved, the workflow moves into the evidence-gathering phase. The semantic_chunk() function breaks the retrieved articles down into smaller, meaningful segments. PaperQA uses maximal marginal relevance to enhance diversity among returned documents, minimizing redundancy and improving the quality of retrieved evidence.
- The gather_evidence_phase() function then uses ColBERT to re-rank these segments for relevance, similar to PaperQA's gather evidence tool, which integrates retrieval augmentation to score each chunk by its importance. This process prevents irrelevant context from interfering, keeping the focus on the most pertinent data.
Generate Final Answer
- In the final phase, the workflow generates an answer from the top-ranked evidence chunks. These are combined into a single context, which is then passed to an LLM via the generate_final_answer() function.
- The LLM synthesizes a response drawing from the evidence provided. References to the original articles are maintained, ensuring that the answer has clear provenance, as emphasized in PaperQA's approach to minimizing hallucination and ensuring verifiable answers.
- Synthesizing and citing relevant evidence offers a high level of reliability, much like PaperQA's goal of presenting answers comparable to those of human experts by ensuring that every claim is supported by a source.
PaperQA is designed to construct its final answer from retrieved evidence, following a map-reduce approach to synthesize information from multiple sources. This mirrors our workflow's approach of combining evidence before generating the final response, ensuring an overview that is both thorough and trustworthy.
Instructions for Use
- Set Up: Ensure you have set the OpenAI API key to enable the notebook to make requests to OpenAI.
- Run Cells Sequentially: Follow the notebook by running cells in order, starting with the environment setup and imports.
- Enter Your Query: At the prompt cell, enter the query about your research topic (e.g., "Impacts of gene editing on medicine").
- View Results: Examine the outputs at each stage, which include the search queries, evidence ranking, and final answer.
For each section below, you will find detailed explanations to help understand how each phase contributes to the overall goal of answering a research query.
Environment Setup
Install all the necessary dependencies and import the required libraries.
You need to provide your OpenAI API key in order for the notebook to generate search queries and answers. This setup allows you to leverage Deep Lake for dataset querying, OpenAI for question generation, and various other tools for processing text data.
- Ensure that the API key is correctly configured and that all installations complete successfully.
!pip install --quiet deeplake
!pip install --quiet llama-index llama-index-core transformers torch llama-index-embeddings-openai llama-index-llms-openai llama-index-postprocessor-colbert-rerank spacy openai langchain numpy pydantic
import os
import getpass
import json
import openai
import deeplake
import spacy
import llama_index
import langchain
import textwrap
from IPython.display import display, HTML
from langchain.docstore.document import Document
from deeplake import types
from google.colab import userdata
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.postprocessor.colbert_rerank import ColbertRerank
from openai import OpenAI
# Set up OpenAI API key and client
OPENAI_API_KEY = getpass.getpass('Enter your OpenAI API key (platform.openai.com): ')
# Validate that the API key exists
if not OPENAI_API_KEY:
raise ValueError("OpenAI API Key not found. Please enter a valid key when prompted.")
# Set environment variable and initialize OpenAI client
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY
openai.api_key = os.getenv("OPENAI_API_KEY")
ACTIVELOOP_TOKEN = getpass.getpass('Enter your Activeloop Token (activeloop.ai): ')
os.environ['ACTIVELOOP_TOKEN'] = ACTIVELOOP_TOKEN
# Initialize the OpenAI client
client = OpenAI()
Dataset Loading
In this section, we load the primary resource for the core operations of the PaperQA2 workflow: the Deep Lake dataset of scientific papers.
The dataset is loaded using the deeplake.open_read_only() function. It is a collection of scientific papers, originally located at hub://demo_v4/scientific_papers, which we have preloaded into our own dataset (al://genai360/scientific_papers_paperqa_with_embeddings) in preparation. This dataset will serve as the source of information for generating responses to user queries.
org_id = "genai360"
dataset_name = "scientific_papers_paperqa_with_embeddings"
ds = deeplake.open_read_only(f"al://{org_id}/{dataset_name}")
ds.summary()
Explore The Dataset with Hybrid Search
The PaperQA workflow in this notebook uses both vector search and BM25 search queries; ColBERT then re-ranks the results after semantic chunks have been created.
As an addition to your workflow, you could use the hybrid search method outlined in the section below to enhance the search phase.
In this stage, the system enhances its search capabilities by combining BM25 with Approximate Nearest Neighbors (ANN) for a hybrid search. This approach blends lexical search with semantic search, improving relevance by considering both keywords and semantic meaning.
We open the scientific_papers_paperqa_with_embeddings dataset to perform a hybrid search. First, we define a natural-language query, "what do you know about drones?", and generate its embedding using embedding_function(natural_query)[0]. We then convert this embedding into a comma-separated string, embedding_string, preparing it for use in the vector-based search. A separate keyword term, "ARGoS" (stored in query), is used for the BM25 text search.
def embedding_function(texts, model="text-embedding-3-large"):
if isinstance(texts, str):
texts = [texts]
texts = [t.replace("\n", " ") for t in texts]
return [data.embedding for data in openai.embeddings.create(input = texts, model=model).data]
Search for relevant papers using a specific sentence
We create two queries:
- Vector Search (tql_vs): calculates cosine similarity with embedding_string and returns the top 5 matches by score.
- BM25 Search (tql_bm25): ranks text by BM25 similarity to query, also limited to the top 5.
natural_query = "what do you know about drones?"
query = "ARGoS"
embed_query = embedding_function(natural_query)[0]
embedding_string = ",".join(str(c) for c in embed_query)
We then execute both queries, storing vector results in vs_results and BM25 results in bm25_results. This allows us to compare results from both search methods.
tql_vs = f"""
SELECT *
FROM (
SELECT *, cosine_similarity(embedding, ARRAY[{embedding_string}]) AS score
FROM (
SELECT *, ROW_NUMBER() AS row_id
)
)
ORDER BY score DESC
LIMIT 5
"""
tql_bm25 = f"""
SELECT *, BM25_SIMILARITY(text, '{query}') AS score
FROM (
SELECT *, ROW_NUMBER() AS row_id
)
ORDER BY BM25_SIMILARITY(text, '{query}') DESC
LIMIT 5
"""
vs_results = ds.query(tql_vs)
bm25_results = ds.query(tql_bm25)
Show the scores
for el_vs in vs_results:
print(f"vector search score: {el_vs['score']}")
for el_bm25 in bm25_results:
print(f"bm25 score: {el_bm25['score']}")
vector search score: 0.2837725281715393
vector search score: 0.24017328023910522
vector search score: 0.22901776432991028
vector search score: 0.22634504735469818
vector search score: 0.2239888608455658
bm25 score: 19.381977081298828
bm25 score: 19.296512603759766
bm25 score: 18.8994140625
bm25 score: 18.501781463623047
bm25 score: 18.068836212158203
First, we import the required libraries and define the data structures used for score fusion.
- Setup and Classes: We define a Document class using pydantic.BaseModel. Each Document has an id, a data dictionary, and an optional score for ranking.
- Softmax Function: The softmax function normalizes a list of scores (retrieved_score) using the softmax formula. Scores are exponentiated (capped at max_weight to avoid overflow) and then normalized to sum to 1, returning new_weights, a list of normalized scores.
import math
import numpy as np
from typing import Any, Dict, List, Optional
from pydantic import BaseModel
class Document(BaseModel):
id: str
data: Dict[str, Any]
score: Optional[float] = None
def softmax(retrieved_score: List[float], max_weight: int = 700) -> List[float]:
# Compute the exponentials
exp_scores = [math.exp(min(score, max_weight)) for score in retrieved_score]
# Compute the sum of the exponentials
sum_exp_scores = sum(exp_scores)
# Update the scores of the documents using softmax
new_weights = []
for score in exp_scores:
new_weights.append(score / sum_exp_scores)
return new_weights
Normalize the score
- Apply Softmax to Scores:
  - We extract score values from vs_results and bm25_results and apply softmax to them, storing the results in vss and bm25s. This step scales both sets of scores for easy comparison.
- Create Document Dictionaries:
  - We create dictionaries docs_vs and docs_bm25 to store documents from vs_results and bm25_results, respectively. For each result, we add the title and text along with the normalized score. Each document is keyed by its row_id.
This code standardizes scores and organizes results, allowing comparison across both vector and BM25 search methods.
vs_score = vs_results["score"]
bm_score = bm25_results["score"]
vss = softmax(vs_score)
bm25s = softmax(bm_score)
print(vss)
print(bm25s)
[0.2087589635344044, 0.1998527916854391, 0.1976357199697022, 0.19710820089464823, 0.1966443239158061] [0.31065925447580744, 0.28521183623674395, 0.19173872690847832, 0.12883094603697134, 0.08355923634199894]
docs_vs = {}
docs_bm25 = {}
for el, score in zip(vs_results, vss):
docs_vs[str(el["row_id"])] = Document(id=str(el["row_id"]), data={"text": el["text"], "title": el["title"]}, score=score)
for el, score in zip(bm25_results, bm25s):
docs_bm25[str(el["row_id"])] = Document(id=str(el["row_id"]), data={"text": el["text"], "title": el["title"]}, score=score)
We define weights for our hybrid search: VECTOR_WEIGHT and LEXICAL_WEIGHT are both set to 0.5, giving equal importance to vector-based and BM25 scores.
- Initialize Results Dictionary:
  - We create an empty dictionary, results, to store documents with their combined scores from both search methods.
- Combine Scores:
  - We iterate over the unique document IDs from docs_vs and docs_bm25. For each document:
    - We add it to results, defaulting to whichever version is available (vector or BM25).
    - We calculate a weighted score: vs_score from the vector results (if present in docs_vs) and bm_score from the BM25 results (if present in docs_bm25).
    - The final results[k].score is the sum of vs_score and bm_score.
This produces a fused score for each document in results, ready to rank in the hybrid search (a short ranking sketch follows after the fusion output below).
Fusion method
def fusion(docs_vs: Dict[str, Document], docs_bm25: Dict[str, Document]) -> Dict[str, Document]:
VECTOR_WEIGHT = 0.5
LEXICAL_WEIGHT = 0.5
results: Dict[str, Document] = {}
for k in set(docs_vs) | set(docs_bm25):
results[k] = docs_vs.get(k, None) or docs_bm25.get(k, None)
vs_score = VECTOR_WEIGHT * docs_vs[k].score if k in docs_vs else 0
bm_score = LEXICAL_WEIGHT * docs_bm25[k].score if k in docs_bm25 else 0
results[k].score = vs_score + bm_score
return results
results = fusion(docs_vs, docs_bm25)
for k, v in results.items():
print(f"text: {v.data['text']}, score: {v.score}")
text: The software would actually be able to recognize hedge- hogs from…
text: Yanai Elazar, Nora Kassner, Shauli Ravfogel, Abhi- lasha Ravichander…
text: (a) UNSW-NB15 layer speed up compared to accuracy loss (b) UNSW-NB15…
text: 8 Symbol Description ip0 Current race control flag ip1 The active path…
text: The Effect of Predictive Formal Modelling at Runtime on Performance…
…
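To turn these fused scores into a final hybrid ranking, you could sort the fused documents by their combined score. The following is a minimal sketch, reusing the results dictionary produced by the fusion step above:
# Sort the fused documents by combined score (highest first) to obtain the hybrid ranking
ranked_docs = sorted(results.values(), key=lambda doc: doc.score or 0.0, reverse=True)
for rank, doc in enumerate(ranked_docs, start=1):
    print(f"{rank}. {doc.data['title']} (fused score: {doc.score:.4f})")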
Initializing the ColBERT Reranker
ColBERT allows us to perform a contextual ranking of document chunks, ensuring that the retrieved results are both relevant and contextually appropriate. By reranking using an effective scoring model like ColBERT, we improve the quality of the evidence used to generate final answers.
- ColBERT (Contextualized Late Interaction over BERT) is used for reranking document chunks.
- The initialize_colbert_reranker() function sets up the reranker using the ColBERTv2.0 model.
- top_n=5: Specifies that we want to rerank the top 5 results.
- model and tokenizer: Specifies the pre-trained ColBERT model to use.
- keep_retrieval_score=True: Keeps the original retrieval score, which allows us to compare how well the model performs in retrieval versus reranking.
# Initialize the ColBERT reranker
def initialize_colbert_reranker():
return ColbertRerank(
top_n=5,
model="colbert-ir/colbertv2.0",
tokenizer="colbert-ir/colbertv2.0",
keep_retrieval_score=True,
)
# Create a global instance of the reranker to reuse
colbert_reranker = initialize_colbert_reranker()
Loading the spaCy Model for NLP Tasks
The spaCy model is used during the chunking phase, where large documents are broken into smaller, more manageable chunks. The chunks are later ranked and used for generating answers, making the NLP model crucial for ensuring that the segmentation of documents is done in a meaningful way.
# Load the spaCy model for semantic chunking
nlp = spacy.load('en_core_web_sm')
Verbose Helper Functions
These helper functions render the workflow's intermediate output in a readable format.
display(HTML('''
<script>
// Select all output areas in the current document
const observer = new MutationObserver(function(mutations) {
for (let mutation of mutations) {
if (mutation.target.nodeName === 'DIV' && mutation.target.className.includes('output-subarea')) {
let outputDiv = mutation.target;
outputDiv.parentNode.style.maxHeight = "none"; // Remove any max-height set by default
outputDiv.parentNode.style.height = "auto"; // Set to auto to accommodate full content height
}
}
});
// Observe changes in the entire document for added output nodes
observer.observe(document.documentElement, {
attributes: false,
childList: true,
subtree: true,
});
</script>
'''))
# Inject global CSS styles for both light and dark modes
display(HTML('''
<style>
/* Final Answer Styles */
.final-answer-box {
border: 2px solid #4CAF50;
padding: 15px;
margin: 20px;
border-radius: 10px;
background-color: var(--final-answer-bg, #f9f9f9);
color: var(--final-answer-text, #333);
font-family: Arial, sans-serif;
}
.final-answer-title {
color: #4CAF50;
}
.final-answer-header {
color: #333;
}
.final-answer-list {
list-style-type: disc;
padding-left: 20px;
}
.final-answer-reference {
margin-bottom: 10px;
}
/* Verbose Message Styles */
.verbose-box {
border: 1px solid #ddd;
padding: 15px;
margin: 10px 0;
border-radius: 8px;
background-color: var(--verbose-bg, #f9f9f9);
color: var(--verbose-text, #333);
font-family: Arial, sans-serif;
font-size: 14px;
}
.verbose-info { background-color: #f9f9f9; }
.verbose-success { background-color: #d4edda; }
.verbose-warning { background-color: #fff3cd; }
.verbose-error { background-color: #f8d7da; }
.verbose-progress { background-color: #d1ecf1; }
/* Phase Header Styles */
.phase-header-box {
border: 2px solid #333;
padding: 20px;
margin: 20px 0;
border-radius: 10px;
background-color: var(--phase-header-bg, #f1f1f1);
color: var(--phase-header-text, #333);
font-family: Arial, sans-serif;
}
.phase-header-title {
font-weight: bold;
margin-bottom: 10px;
}
.phase-header-list {
list-style-type: disc;
padding-left: 20px;
font-size: 14px;
}
/* Light Mode Variables */
@media (prefers-color-scheme: light) {
:root {
--final-answer-bg: #f9f9f9;
--final-answer-text: #333;
--verbose-bg: #f9f9f9;
--verbose-text: #333;
--phase-header-bg: #f1f1f1;
--phase-header-text: #333;
}
}
/* Dark Mode Variables */
@media (prefers-color-scheme: dark) {
:root {
--final-answer-bg: #2e2e2e;
--final-answer-text: #f9f9f9;
--verbose-bg: #333333;
--verbose-text: #f9f9f9;
--phase-header-bg: #1e1e1e;
--phase-header-text: #f9f9f9;
}
/* Adjust border colors for better contrast in dark mode */
.final-answer-box {
border: 2px solid #4CAF50;
}
.phase-header-box {
border: 2px solid #555;
}
}
</style>
'''))
def display_final_answer(response_json):
"""
Displays the final answer, evidence, and references in a formatted HTML box.
Args:
- response_json (str): The JSON string containing the final answer, evidence, and references.
"""
response = json.loads(response_json)
# Extract data
final_answer = response.get("final_answer", "No answer found.")
evidence = response.get("evidence", [])
pdf_references = response.get("pdf_references", [])
# Format the output
html_content = f"""
<div class='final-answer-box'>
<h2 class='final-answer-title'>Final Answer</h2>
<p>{final_answer}</p>
<h3 class='final-answer-header'>Evidence</h3>
<ul class='final-answer-list'>
"""
for ev in evidence:
html_content += f"<li>{ev}</li>"
html_content += "</ul>"
html_content += """
<h3 class='final-answer-header'>References</h3>
<ul class='final-answer-list'>
"""
for ref in pdf_references:
file_name = ref.get("file_name", "Unknown file")
citation = ref.get("citation", "No citation provided")
content = ref.get("content", "No content available")
html_content += f"""
<li class='final-answer-reference'>
<strong>File Name:</strong> {file_name}<br>
<strong>Citation:</strong> {citation}<br>
<p>{content}</p>
</li>
"""
html_content += "</ul></div>"
# Display using IPython display
display(HTML(html_content))
def render_verbose(step, message, level="info", color=None):
"""
A unified HTML rendering function to display different verbose items in a user-friendly manner.
Args:
- step (str): Title or main action (e.g., 'Search Execution', 'Final Answer Generation').
- message (str): Detailed message to display.
- level (str): Severity level ('info', 'success', 'warning', 'error', 'progress').
- color (str): Optional custom color.
"""
icon = {
"info": "ℹ️",
"success": "✅",
"warning": "⚠️",
"error": "❌",
"progress": "⏳"
}.get(level, "ℹ️")
# Determine the appropriate CSS class based on the level
css_class = f"verbose-{level}" if level in ["info", "success", "warning", "error", "progress"] else "verbose-info"
html_content = f"""
<div class='verbose-box {css_class}'>
<h4><span>{icon}</span> <strong>{step}</strong></h4>
<p>{message}</p>
</div>
"""
display(HTML(html_content))
def render_phase_header(phase_name, phase_description):
"""
Render a header for each phase of the PaperQA2 workflow.
Args:
- phase_name (str): The name of the phase (e.g., 'Phase 1: Paper Search').
- phase_description (str): Description of what happens in the current phase,
separated by newlines for each action.
"""
# Split the phase description into individual bullet points
bullet_points = phase_description.split("\n")
bullet_points_html = "".join(f"<li>{point.strip()}</li>" for point in bullet_points if point.strip())
html_content = f"""
<div class='phase-header-box'>
<h2 class='phase-header-title'>{phase_name}</h2>
<ul class='phase-header-list'>{bullet_points_html}</ul>
</div>
"""
display(HTML(html_content))
PaperQA Workflow Functions
Phase 1, Step 1 - Paper Search
Function: generate_search_queries_phase(user_query)
This function is the starting point of the workflow, responsible for generating search queries based on the user's input.
Objective: Transform the user's input query into well-structured search queries.
Technical Details:
- Utilizes a Large Language Model (LLM) to generate multiple sub-queries. This decomposition ensures that the scope of research is comprehensive, with both narrow and broad focus.
- The LLM constructs search terms in Tensor Query Language (TQL) format to search Deep Lake's corpus effectively.
- Generates a JSON response containing multiple search queries and their corresponding TQL syntax, which is optimized for relevance using BM25 ranking.
This approach helps extract the most relevant documents, ensuring a strong foundation for subsequent phases by covering various aspects of the user query comprehensively.
System Message
The Search System Prompt is foundational to the Search Phase of the Notebook PaperQA Workflow.
By instructing the LLM to decompose the user's query into multiple focused sub-queries using extracted keywords, it ensures that both BM25 and Vector Search mechanisms retrieve comprehensive and relevant papers.
This decomposition mirrors PaperQA's capability to adapt retrieval strategies dynamically, allowing the system to cover various aspects of the user's intent effectively. Additionally, the strict formatting into JSON with properly structured TQL queries facilitates seamless integration with the subsequent search operations, enhancing the overall efficiency and accuracy of evidence retrieval.
system_prompt_search = textwrap.dedent("""
You are an assistant that generates search queries for a scientific papers database based on relevant keywords extracted from the user's query.
Your task is to analyze the user's input query and determine if multiple search queries are necessary to gather comprehensive evidence. This may involve creating both narrow and broad searches, or using different phrasings to capture all relevant aspects of the user's intent.
Extract relevant keywords from the user's input query before creating search queries. **Only use keywords** for both the search term and TQL output.
Provide the output in **valid JSON format only**. The JSON should include:
- **'queries'**: A list of objects.
- Each object should contain:
- **'query'**: The search term composed of strictly the extracted keywords.
- **'tql'**: The corresponding TQL (Tensor Query Language) query formatted for Deep Lake.
Ensure the TQL query is correctly formatted for Deep Lake's Tensor Query Language:
- Always set **LIMIT to 5**.
- Use **ORDER BY BM25_SIMILARITY(text, 'search terms')** to rank results based on relevance.
- Note that the database only contains **text fields** to search.
# Steps
1. Analyze the user's query to understand its scope and identify if multiple aspects or angles need to be addressed.
2. Extract the most relevant keywords from the user's query.
3. If multiple queries are needed, formulate each search term to cover different aspects (e.g., narrow and broad searches, different phrasings).
4. Generate a corresponding TQL query for Deep Lake based on each search term, using the keywords only and incorporating the BM25 similarity ranking.
# Output Format
Provide the response in the following JSON structure:
```json
{
"queries": [
{
"query": "[Search term 1 derived from keywords]",
"tql": "SELECT *, BM25_SIMILARITY(text, '[Search term 1]') AS score FROM ( SELECT *, ROW_NUMBER() AS row_id ) ORDER BY BM25_SIMILARITY(text, '[Search term 1]') DESC LIMIT 5"
},
{
"query": "[Search term 2 derived from keywords]",
"tql": "SELECT *, BM25_SIMILARITY(text, '[Search term 2]') AS score FROM ( SELECT *, ROW_NUMBER() AS row_id ) ORDER BY BM25_SIMILARITY(text, '[Search term 2]') DESC LIMIT 5"
},
...
]
}
```
# Examples
**Valid Example with Single Query**:
```json
{
"queries": [
{
"query": "CRISPR gene editing",
"tql": "SELECT *, BM25_SIMILARITY(text, 'CRISPR gene editing') AS score FROM ( SELECT *, ROW_NUMBER() AS row_id ) ORDER BY BM25_SIMILARITY(text, 'CRISPR gene editing') DESC LIMIT 5"
}
]
}
```
**Valid Example with Multiple Queries**:
```json
{
"queries": [
{
"query": "CRISPR gene editing",
"tql": "SELECT *, BM25_SIMILARITY(text, 'CRISPR gene editing') AS score FROM ( SELECT *, ROW_NUMBER() AS row_id ) ORDER BY BM25_SIMILARITY(text, 'CRISPR gene editing') DESC LIMIT 5"
},
{
"query": "Applications of CRISPR",
"tql": "SELECT *, BM25_SIMILARITY(text, 'Applications of CRISPR') AS score FROM ( SELECT *, ROW_NUMBER() AS row_id ) ORDER BY BM25_SIMILARITY(text, 'Applications of CRISPR') DESC LIMIT 5"
},
{
"query": "Ethical implications of gene editing",
"tql": "SELECT *, BM25_SIMILARITY(text, 'Ethical implications of gene editing') AS score FROM ( SELECT *, ROW_NUMBER() AS row_id ) ORDER BY BM25_SIMILARITY(text, 'Ethical implications of gene editing') DESC LIMIT 5"
}
]
}
```
# Notes
- Focus on identifying and extracting concise and relevant keywords from the user's query.
- Avoid unnecessary phrases such as "application of", "for and against", etc., unless they contribute to a distinct search aspect.
- When the user's query encompasses multiple facets or perspectives, generate separate search queries for each aspect to ensure comprehensive coverage.
- Ensure extracted keywords are used clearly and avoid redundant or overly specific terms. Always use keywords strictly for query generation.
""")
def generate_search_queries_phase(user_query, system_prompt_search=system_prompt_search):
"""
Phase 1: Paper Search
Step 1: Generate Search Queries from User Query.
Args:
- user_query (str): The user query to generate search queries and retrieve papers.
Returns:
- queries_json (dict): Parsed JSON response containing the generated search queries.
"""
# Verbose
render_phase_header("Phase 1: Paper Search", "Get candidate papers from LLM-generated keyword query\nChunk, embed, and add candidate papers to state")
# Generate Search Queries
# Verbose
render_verbose("Search Iteration", f"Getting LLM-generated TQL Syntax for user query: <strong>{user_query}</strong>", level="progress")
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "system",
"content": system_prompt_search
},
{
"role": "user",
"content": user_query
}
],
response_format={
"type": "json_object"
},
temperature=0.1,
max_tokens=2048,
top_p=1,
frequency_penalty=0,
presence_penalty=0
)
# Verbose
render_verbose("Search Queries Generated", f"Generated search queries for: {user_query}", level="success")
queries = response.choices[0].message.content
# Parse JSON Response
try:
queries_json = json.loads(queries)
# Verbose
render_verbose("JSON Parsing", "Successfully parsed JSON search queries.", level="success")
render_verbose("Search Queries", json.dumps(queries_json, indent=4), level="sucess")
except json.JSONDecodeError as e:
# Verbose
render_verbose("Error", f"Failed to parse JSON: {str(e)}", level="error")
raise ValueError(f"Failed to decode JSON from search queries: {str(e)}")
return queries_json
Phase 1, Step 2 - Search and Chunk Papers
Function: search_and_chunk_papers_phase(queries_json)
Following the generation of queries, this function executes the searches against the relevant databases.
Objective: Retrieve the relevant papers and split them into meaningful content chunks.
Technical Details:
- Performs hybrid searches on the scientific database.
- The retrieved documents are segmented into smaller, coherent chunks for easier processing.
- Each chunk is embedded into a vector space to facilitate similarity-based retrieval in subsequent steps.
This chunking and embedding process ensures efficient handling of documents, allowing relevant sections to be isolated for further analysis.
# Helper functions for semantic chunking and embedding user query
def semantic_chunk(text, max_len=200):
"""
Chunk the given text into smaller pieces based on maximum length.
Args:
- text (str): The input text to be chunked.
- max_len (int): Maximum length of each chunk (default is 200).
Returns:
- List of text chunks
"""
doc = nlp(text)
chunks = []
current_chunk = []
for sent in doc.sents:
current_chunk.append(sent.text)
if len(' '.join(current_chunk)) > max_len:
chunks.append(' '.join(current_chunk))
current_chunk = []
if current_chunk:
chunks.append(' '.join(current_chunk))
return chunks
def search_and_chunk_papers_phase(queries_json):
"""
Phase 1: Paper Search
Step 2: Search Papers using BM25 and Embedding-based similarity, Chunk, and Embed into State.
Args:
- queries_json (dict): JSON response containing search queries generated from the user's query.
Returns:
- aggregated_chunks (list): A list of (title, chunk) tuples representing chunks of text extracted from retrieved documents.
"""
tql_queries_list = queries_json.get('queries')
if not isinstance(tql_queries_list, list):
# Verbose
render_verbose("Error", "Invalid query format detected. Expected a list of queries in the JSON response.", level="error")
raise ValueError("Invalid query format: Expected a list of queries in the JSON response.")
aggregated_chunks = []
# Iterate over each query in the queries_json
for idx, query_obj in enumerate(tql_queries_list, 1):
# Extract BM25-based TQL query
tql_query = query_obj.get('tql')
natural_query = query_obj.get('query')
# Initialize sets to store unique entries from both search methods
unique_entries_bm25 = set()
unique_entries_embed = set()
# ---- BM25-Based Search ----
if tql_query:
# Verbose
render_verbose("Search Execution", f"Executing BM25 query {idx}/{len(tql_queries_list)}: <code>{tql_query}</code>", level="progress")
try:
view_bm25 = ds.query(tql_query)
# Verbose
render_verbose("Samples Retrieved (BM25)", f"Number of samples retrieved: <strong>{len(view_bm25)}</strong>", level="success")
except TypeError as e:
# Verbose
render_verbose("Error", f"Error in querying dataset with BM25: {str(e)}", level="error")
view_bm25 = []
for sample in view_bm25:
title = sample["title"]
text = sample["text"]
unique_entries_bm25.add((title, text))
# ---- Embedding-Based Search ----
if natural_query:
# Generate embedding for the natural language query
embed_query = embedding_function(natural_query)[0]
str_query = ",".join(str(c) for c in embed_query)
# Construct the similarity search query
query_vs = f"""
SELECT *
FROM (
SELECT *, cosine_similarity(embedding, ARRAY[{str_query}]) AS score
FROM (
SELECT *, ROW_NUMBER() AS row_id
)
)
ORDER BY score DESC
LIMIT 5
"""
# Verbose
render_verbose("Embedding-Based Search", f"Executing embedding-based search for query {idx}/{len(tql_queries_list)}.", level="progress")
try:
view_embed = ds.query(query_vs)
# Verbose
render_verbose("Samples Retrieved (Embedding)", f"Number of samples retrieved: <strong>{len(view_embed)}</strong>", level="success")
except TypeError as e:
# Verbose
render_verbose("Error", f"Error in embedding-based querying dataset: {str(e)}", level="error")
view_embed = []
for sample in view_embed:
title = sample["title"]
text = sample["text"]
unique_entries_embed.add((title, text))
# Combine unique entries from both search methods
combined_unique_entries = unique_entries_bm25.union(unique_entries_embed)
# Verbose
render_verbose("Combined Samples Retrieved", f"Total unique samples retrieved for query {idx}: <strong>{len(combined_unique_entries)}</strong>", level="success")
# Semantic Chunking
render_verbose("Semantic Chunking", f"Performing semantic chunking on the retrieved text for query {idx}.", level="progress")
for title, text in combined_unique_entries:
chunks = semantic_chunk(text)
aggregated_chunks.extend([(title, chunk) for chunk in chunks])
# Verbose
render_verbose("Aggregation Complete", f"Total number of chunks aggregated: <strong>{len(aggregated_chunks)}</strong>", level="success")
render_verbose("Phase 1 Complete", f"Now move on to phase 2 below", level="success")
return aggregated_chunks
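As a quick sanity check, you can call semantic_chunk() directly on a short paragraph to see how max_len affects the splits; the sample text below is purely illustrative:
# Illustrative example: chunk a short sample paragraph with a smaller max_len
sample_text = (
    "CRISPR enables precise genome editing. It has applications in medicine and agriculture. "
    "Researchers continue to debate its ethical implications. Off-target effects remain an open concern."
)
for i, chunk in enumerate(semantic_chunk(sample_text, max_len=80), start=1):
    print(f"Chunk {i}: {chunk}")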
Phase 2 - Gather Evidence
Function: gather_evidence_phase(aggregated_chunks, user_query)
In this phase, the function takes the document chunks gathered earlier and re-ranks them based on their relevance to the user's original query.
Objective: Rank and summarize the most relevant document sections to create the best possible evidence pool.
Technical Details:
- Embeds the user's query as a vector, and then uses vector similarity to rank the document chunks.
- Utilizes ColBERT to re-rank these chunks, enhancing their precision based on semantic relevance.
- Creates scored summaries for each chunk to help identify the top pieces of evidence.
This phase refines the search results into a high-quality evidence base by focusing only on the most pertinent information.
def gather_evidence_phase(aggregated_chunks, user_query):
"""
Phase 2: Gather Evidence
Embed query into vector, rank top k document chunks, and create scored summaries.
Args:
- aggregated_chunks (list): List of (title, chunk) tuples to rerank.
- user_query (str): User query to rerank against.
Returns:
- reranked_results (list): A list of tuples (text, score, title) representing the reranked document chunks.
"""
# Verbose
render_phase_header("Phase 2: Gather Evidence", "- Embed query into vector\n- Rank top k document chunks in current state\n- Create scored summary of each chunk")
render_verbose("Reranking Process", f"Starting reranking process for user query: <strong>{user_query}</strong>", level="progress")
# Write Entries to File for Reranking Preparation
data_dir = './data/temp/'
!mkdir -p {data_dir}
with open(f'{data_dir}/temp.txt', 'w') as f:
for title, text in aggregated_chunks:
f.write(f"{title}\n{text}\n\n")
render_verbose("Data Preparation", f"Prepared {len(aggregated_chunks)} entries for reranking and saved to temporary file.", level="success")
# Load Documents and Create Index
documents = SimpleDirectoryReader(data_dir).load_data()
render_verbose("Data Loading", "Documents successfully loaded from the temporary directory.", level="success")
index = VectorStoreIndex.from_documents(documents=documents)
render_verbose("Indexing", "VectorStoreIndex successfully built from loaded documents.", level="success")
# Verbose
render_verbose("Querying", "Performing similarity search and reranking. This may take a moment...", level="progress")
query_engine = index.as_query_engine(
similarity_top_k=10,
node_postprocessors=[colbert_reranker],
)
response = query_engine.query(user_query)
# Processing the Reranked Results
reranked_results = []
for node in response.source_nodes:
# Attempt to extract title from metadata
title = node.node.metadata.get("title", "").strip()
# If title is not in metadata, extract it from the content
if not title:
content_full = node.node.get_content()
lines = content_full.split('\n', 1)
title = lines[0].strip() if len(lines) > 0 else "Unknown Title"
# Optionally, update content to exclude the title
content = lines[1].strip() if len(lines) > 1 else content_full
else:
content = node.node.get_content()
content_preview = content[:120]
score = node.score
reranked_results.append((content, score, title))
# Verbose
reranked_count = len(reranked_results)
rerank_summary_message = (
f"Reranking process completed.<br>"
f"Total number of reranked results: <strong>{reranked_count}</strong>.<br>"
)
# Verbose
render_verbose("Reranking Summary", rerank_summary_message, level="success")
# Verbose
top_reranked_message = "<strong>Top 3 Reranked Results:</strong><br><ul>"
for idx, (content, score, title) in enumerate(reranked_results[:3], 1):
top_reranked_message += (
f"<li><strong>Title:</strong> {title}<br>"
f"<strong>Score:</strong> {score:.2f}<br>"
f"<strong>Content Preview:</strong> {content[:100]}...</li><br>"
)
top_reranked_message += "</ul>"
# Verbose
render_verbose("Top Reranked Results", top_reranked_message, level="info")
render_verbose("Phase 2 Complete", f"Now move on to phase 3 below", level="success")
return reranked_results
Phase 3 - Generate Final Answer
Function: generate_final_answer(relevant_chunks, user_query)
This phase involves using the top-ranked chunks to generate a comprehensive response to the user's query.
Objective: Synthesize an answer based on the evidence gathered.
Technical Details:
- Combines the selected evidence chunks into a single context.
- Passes this context, along with the user's original question, to an LLM (GPT-4o) for generating a detailed response.
- Ensures that the response includes references to the original sources, citing the paper names, sections, and content.
This final phase is crucial as it consolidates all gathered data into an insightful, well-referenced response, directly addressing the user's inquiry.
Final Answer System Prompt
The Final Answer System Prompt plays a critical role in the Generate Final Answer Phase of the Notebook PaperQA Workflow.
After gathering and re-ranking relevant evidence, this prompt directs the LLM to synthesize a coherent and factually supported response. By mandating the inclusion of detailed PDF references, it ensures that the final answer maintains high reliability and traceability, preventing hallucinations and verifying the provenance of information.
This structured JSON output aligns with PaperQA’s emphasis on integrating retrieval augmentation, ensuring that each claim is backed by credible sources. Consequently, this prompt guarantees that the generated answers are both comprehensive and verifiable, closely mirroring the expertise and trustworthiness of human-generated responses.
system_prompt_final = textwrap.dedent("""
Answer the given query by considering all provided evidence so your response remains comprehensive yet supported by relevant facts only.
Your response must incorporate references, specifying not only the PDF's name but also include the specific section and context.
# Output Format
Provide the response using the following JSON structure:
```json
{
"final_answer": "str",
"evidence": ["array of evidence strings"],
"pdf_references": [
{
"file_name": "str (the name of the PDF file)",
"citation": "str (specific reference from PDF)",
"content": "str (full reference content from PDF)"
}
],
"answer": "str (final answer based on evidence)"
}
```
Ensure that the JSON response is correctly formatted and contains all specified fields. Multiple "pdf_references" entries are encouraged if multiple PDFs or sections are cited.
""")
def generate_final_answer(relevant_chunks, user_query,system_prompt_final=system_prompt_final):
"""
Phase 3: Generate Answer
Put the best summaries into a prompt with context, generate the final answer, and display it.
Args:
- relevant_chunks (list): List of (text, score, title) tuples representing the top-ranked document chunks.
- user_query (str): The original user query to generate the answer for.
Returns:
- None. Displays the final answer in the UI.
"""
# Extract only the text part from each tuple in relevant_chunks
evidence_text = "\n".join([chunk[0] for chunk in relevant_chunks])
# Verbose
render_phase_header("Phase 3: Generate Answer", "- Put best summaries into prompt with context\n- Generate answer with prompt")
render_verbose("Final Answer Generation", "AI is generating the final answer based on selected chunks and user query.", level="Progress")
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "system",
"content": system_prompt_final
},
{
"role": "user",
"content": user_query
}
],
temperature=0.3,
max_tokens=2400,
top_p=1,
frequency_penalty=0,
presence_penalty=0
)
answer = response.choices[0].message.content if response.choices else ""
# Clean the response to extract the JSON
clean_answer = answer.strip()
if clean_answer.startswith("```json"):
clean_answer = clean_answer[len("```json"):].strip()
if clean_answer.endswith("```"):
clean_answer = clean_answer[:-3].strip()
# Validate and Parse Response
try:
parsed_response = json.loads(clean_answer)
display_final_answer(clean_answer)
render_verbose("PaperQA Workflow Complete", f" ", level="success")
except json.JSONDecodeError as e:
render_verbose("Error", f"Failed to parse JSON response: {str(e)}", level="error")
render_verbose("Raw Response", f"The raw response received was: {clean_answer}", level="error")
raise ValueError("Failed to decode JSON response.")
Putting It All Together - PaperQA2 Workflow
Let's run the entire PaperQA2 process, integrating all three phases to provide a seamless research workflow.
Technical Details:
- Phase 1: Generates search queries from the user query and performs searches to create aggregated document chunks.
- Phase 2: Reranks and summarizes document chunks to gather the best evidence.
- Phase 3: Uses the top-ranked evidence to generate a well-cited, informative answer.
By structuring the workflow into three distinct yet interrelated phases, the PaperQA2 system effectively mimics a detailed, human-like approach to researching scientific literature, ensuring reliability and depth in the answers provided.
user_query = "Application of CRISPR and gene editing and arguments for and against its use"
# Phase 1: Paper Search
queries_json = generate_search_queries_phase(user_query)
for query in queries_json['queries']:
print(f"Query: {query['query']}, TQL: {query['tql']}")
Verbose
Phase 1: Paper Search
- Get candidate papers from LLM-generated keyword query
- Chunk, embed, and add candidate papers to state
⏳ Search Iteration
Getting LLM-generated TQL Syntax for user query: Application of CRISPR and gene editing and arguments for and against its use
✅ Search Queries Generated
Generated search queries for: Application of CRISPR and gene editing and arguments for and against its use
✅ JSON Parsing
Successfully parsed JSON search queries.
ℹ️ Search Queries
{ "queries": [ { "query": "Application of CRISPR gene editing", "tql": "SELECT *, BM25_SIMILARITY(text, 'Application of CRISPR gene editing') AS score FROM ( SELECT *, ROW_NUMBER() AS row_id ) ORDER BY BM25_SIMILARITY(text, 'Application of CRISPR gene editing') DESC LIMIT 5" }, { "query": "Arguments for CRISPR gene editing", "tql": "SELECT *, BM25_SIMILARITY(text, 'Arguments for CRISPR gene editing') AS score FROM ( SELECT *, ROW_NUMBER() AS row_id ) ORDER BY BM25_SIMILARITY(text, 'Arguments for CRISPR gene editing') DESC LIMIT 5" }, { "query": "Arguments against CRISPR gene editing", "tql": "SELECT *, BM25_SIMILARITY(text, 'Arguments against CRISPR gene editing') AS score FROM ( SELECT *, ROW_NUMBER() AS row_id ) ORDER BY BM25_SIMILARITY(text, 'Arguments against CRISPR gene editing') DESC LIMIT 5" } ] }
# Phase 1: Semantic Chunking
aggregated_chunks = search_and_chunk_papers_phase(queries_json)
Verbose
⏳ Search Execution
Executing BM25 query 1/3: SELECT *, BM25_SIMILARITY(text, 'Application of CRISPR gene editing') AS score FROM ( SELECT *, ROW_NUMBER() AS row_id ) ORDER BY BM25_SIMILARITY(text, 'Application of CRISPR gene editing') DESC LIMIT 5
✅ Samples Retrieved (BM25)
Number of samples retrieved: 5
⏳ Embedding-Based Search
Executing embedding-based search for query 1/3.
✅ Samples Retrieved (Embedding)
Number of samples retrieved: 5
✅ Combined Samples Retrieved
Total unique samples retrieved for query 1: 10
⏳ Semantic Chunking
Performing semantic chunking on the retrieved text for query 1.
⏳ Search Execution
Executing BM25 query 2/3: SELECT *, BM25_SIMILARITY(text, 'Arguments for CRISPR gene editing') AS score FROM ( SELECT *, ROW_NUMBER() AS row_id ) ORDER BY BM25_SIMILARITY(text, 'Arguments for CRISPR gene editing') DESC LIMIT 5
✅ Samples Retrieved (BM25)
Number of samples retrieved: 5
⏳ Embedding-Based Search
Executing embedding-based search for query 2/3.
✅ Samples Retrieved (Embedding)
Number of samples retrieved: 5
✅ Combined Samples Retrieved
Total unique samples retrieved for query 2: 10
⏳ Semantic Chunking
Performing semantic chunking on the retrieved text for query 2.
⏳ Search Execution
Executing BM25 query 3/3: SELECT *, BM25_SIMILARITY(text, 'Arguments against CRISPR gene editing') AS score FROM ( SELECT *, ROW_NUMBER() AS row_id ) ORDER BY BM25_SIMILARITY(text, 'Arguments against CRISPR gene editing') DESC LIMIT 5
✅ Samples Retrieved (BM25)
Number of samples retrieved: 5
⏳ Embedding-Based Search
Executing embedding-based search for query 3/3.
✅ Samples Retrieved (Embedding)
Number of samples retrieved: 5
✅ Combined Samples Retrieved
Total unique samples retrieved for query 3: 10
⏳ Semantic Chunking
Performing semantic chunking on the retrieved text for query 3.
✅ Aggregation Complete
Total number of chunks aggregated: 341
✅ Phase 1 Complete
Now move on to phase 2 below
# Phase 2: Gather Evidence
reranked_chunks = gather_evidence_phase(aggregated_chunks, user_query)
Verbose
Phase 2: Gather Evidence
- Embed query into vector
- Rank top k document chunks in current state
- Create scored summary of each chunk
⏳ Reranking Process
Starting reranking process for user query: Application of CRISPR and gene editing and arguments for and against its use
✅ Data Preparation
Prepared 341 entries for reranking and saved to temporary file.
✅ Data Loading
Documents successfully loaded from the temporary directory.
✅ Indexing
VectorStoreIndex successfully built from loaded documents.
⏳ Querying
Performing similarity search and reranking. This may take a moment...
✅ Reranking Summary
Reranking process completed.
Total number of reranked results: 5.
ℹ️ Top Reranked Results
Top 3 Reranked Results:
- Title: 1785_Smooth Ranking SVM via Cutting-Plane Method.pdf
  Score: 0.62
  Content Preview: References [Alcal´a-Fdez et al., 2011] Jes´us Alcal´a-Fdez, Alberto Fern´andez, Juli´an Luengo, Joaq...
- Title: 1128_Towards Risk Analysis of the Impact of AI on the Deliberate Biological Threat Landscape.pdf
  Score: 0.60
  Content Preview: The ways in which dual use life science technologies could be misused have been categorized by the l...
- Title: 1128_Towards Risk Analysis of the Impact of AI on the Deliberate Biological Threat Landscape.pdf
  Score: 0.60
  Content Preview: Researchers had recently demonstrated the ability to put DNA from one organism into another, and the...
✅ Phase 2 Complete
Now move on to phase 3 below
# Phase 3: Generate Answer
generate_final_answer(reranked_chunks, user_query)
Phase 3: Generate Answer
- Put best summaries into prompt with context
- Generate answer with prompt
ℹ️ Final Answer Generation
AI is generating the final answer based on selected chunks and user query.
Final Answer
CRISPR and gene editing technologies have transformative potential in medicine, agriculture, and biotechnology. They offer precise methods for modifying genes, which can lead to advances in treating genetic disorders, improving crop resilience, and even combating climate change. However, ethical concerns arise regarding potential misuse, unintended genetic consequences, and the moral implications of altering human genomes.
Evidence
- CRISPR technology allows for precise gene editing, which can lead to significant medical advancements.
- There are ethical concerns about the potential misuse of gene editing technologies and their long-term effects.
References
- File Name: CRISPR_Applications.pdf
  Citation: Section 3.1, Page 12
- File Name: Ethical_Considerations_in_Gene_Editing.pdf
  Citation: Section 4.2, Page 18
CRISPR technology has been applied in various fields such as medicine, agriculture, and environmental science. In medicine, it holds promise for treating genetic disorders by correcting mutations at the DNA level.
The ethical debate surrounding CRISPR and gene editing focuses on the potential for misuse, such as creating 'designer babies', and the unforeseen consequences of altering genetic material, which could have lasting impacts on biodiversity and human health.
✅ PaperQA Workflow Complete
Next Steps for Experimentation and Extension
This workflow can be further refined and expanded to suit different use cases. Below are a few suggestions for you to experiment with and extend PaperQA2:
Experiment with the Existing Code
- 1. Try Different User Queries:
- Feel free to experiment with different queries to see how PaperQA2 performs across a variety of topics.
- Observe how specific keywords and phrases impact the relevance of the retrieved documents.
- 2. Adjust Chunk Size for Better Results:
  - The chunking process divides text into smaller segments that are easier to process and rank.
  - You can modify the max_len parameter in the semantic_chunk(text, max_len=200) function to see how different chunk sizes affect evidence ranking and final answer quality.
  - Suggested Experimentation:
    - Smaller chunks (max_len = 100) may capture more specific details but lead to more fragments.
    - Larger chunks (max_len = 300-500) may capture more context but could be less focused.
  - To experiment with chunk size, see the first code snippet below.
- 3. Modify Ranking Parameters:
  - In Phase 2: Gather Evidence, the similarity_top_k parameter of the query_engine determines how many top chunks are selected for reranking.
  - You can experiment with this value to adjust how much evidence is gathered for answer generation.
  - Increasing this value might gather more evidence at the cost of longer processing times.
  - Decreasing it might improve efficiency but risk missing valuable information.
  - To modify ranking, see the second code snippet below.
- 4. Experiment with the Reranker:
  - The reranker (colbert_reranker) is used to reorder the retrieved chunks.
  - You could try replacing it or tweaking its parameters to experiment with different ranking approaches, such as different scoring algorithms or additional filters on text quality (a sketch follows after the snippets below).
- 5. Limit the Number of Results for Answer Generation:
  - In Phase 3: Generate Final Answer, only a subset of the highest-scoring chunks is used to generate the answer.
  - You could modify this subset size to observe how answer quality changes. For instance, using more chunks may yield more comprehensive answers but also lead to more verbose responses (see the sketch after the snippets below).
def semantic_chunk(text, max_len=200):
# Adjust 'max_len' here to control the chunk size.
query_engine = index.as_query_engine(
similarity_top_k=10, # Adjust this value to change how many top results are gathered.
node_postprocessors=[colbert_reranker],
)
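For items 4 and 5 above, here is a minimal sketch of the kind of changes you might try; the specific values (top_n=8, keeping 3 chunks) are illustrative assumptions, not recommendations:
# 4. Rerank more candidates by raising top_n on the ColBERT reranker
#    (re-run Phase 2 afterwards so the new reranker takes effect)
colbert_reranker = ColbertRerank(
    top_n=8,
    model="colbert-ir/colbertv2.0",
    tokenizer="colbert-ir/colbertv2.0",
    keep_retrieval_score=True,
)
# 5. Pass only the top 3 highest-scoring chunks into answer generation
generate_final_answer(reranked_chunks[:3], user_query)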
Going Beyond the Code: Extend the Capabilities of PaperQA2
Implement an Agentic Approach with Tool Integration:
- Move towards an agentic approach, where the entire PaperQA2 system behaves as an autonomous agent with different tools.
- Each phase can be thought of as a distinct tool:
- Tool 1: A paper-searching tool that takes the user query and retrieves relevant papers.
- Tool 2: An evidence-gathering tool that ranks and reorders content.
- Tool 3: An answer-generating tool that synthesizes the final answer.
- By combining these tools into an agent, you could create a more flexible, fully autonomous research assistant that continuously learns to improve the quality of its results.
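One possible starting point is to wrap the three phase functions behind a single entry point, which you could later register as individual tools in the agent framework of your choice. A minimal, non-agentic sketch using only the functions defined in this notebook:
def paperqa_pipeline(user_query: str):
    # Tool 1: paper search - generate queries, then retrieve and chunk candidate papers
    queries_json = generate_search_queries_phase(user_query)
    aggregated_chunks = search_and_chunk_papers_phase(queries_json)
    # Tool 2: evidence gathering - rerank the chunks against the query
    reranked_chunks = gather_evidence_phase(aggregated_chunks, user_query)
    # Tool 3: answer generation - synthesize and display the cited final answer
    generate_final_answer(reranked_chunks, user_query)

paperqa_pipeline("Impacts of gene editing on medicine")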
With these next steps, you can personalize PaperQA2 for your research needs, optimize its workflow, and explore advanced methods for making it a powerful research tool.