Imagine a world where your AI systems can not only retrieve data but understand it in the context of other data, making connections that drive smarter decisions. That’s the magic of Knowledge Graphs (KGs) in action. They’re changing Retrieval-Augmented Generation (RAG) applications by offering a structured, dynamic way to access and interpret information. Distill-SynthKG is an innovative knowledge graph (KG) synthesis model that advances the efficiency and coverage of document-level KGs generated by large language models (LLMs). Recognizing the limitations of traditional prompt-based KG extraction methods, such as high inference costs and information loss in long documents, Distill-SynthKG leverages a fine-tuned, smaller LLM to synthesize KGs from documents in a single inference step. Distilled from an initial multi-step workflow called SynthKG, this streamlined approach reduces reliance on commercial API calls and enables efficient KG construction without sacrificing data quality.
Through SynthKG, Distill-SynthKG produces high-quality, ontology-free KGs that significantly enhance the performance of RAG applications, with demonstrated gains in retrieval and question-answering (QA) tasks. This article, the next in the course, reviews this promising paper by Intel Labs and Salesforce Research.
A Brief Reminder: What are KGs?
KGs are structured representations of information that capture entities and the relationships between them in a graph format. They’re crucial in enhancing RAG applications, which combine retrieval mechanisms with generative models to improve the quality and relevance of generated content.
Here’s why KGs are vital in this context:
- Structured Information Retrieval: KGs provide a structured way to access and retrieve information, enabling RAG systems to efficiently find relevant data points and relationships. This structured approach enhances the precision of information retrieval, which is critical for generating accurate and contextually relevant responses.
- Contextual Understanding: Think of KGs as the neural network for your data. They map out relationships and context, enabling RAG systems to weave together coherent narratives even from complex, multi-layered queries.
- Scalability and Flexibility: KGs expand effortlessly, adapting to new information and evolving datasets. This flexibility is crucial for keeping pace with rapidly changing knowledge domains.
- Enhanced Reasoning Capabilities: Using the relational power of KGs, RAG systems can leap beyond basic fact retrieval into advanced reasoning. They connect the dots to infer new insights, providing a better depth of understanding.
Barriers to Efficient KG Construction
While KGs clearly benefit RAG applications, building them efficiently from large datasets is no small feat.
Overview of how the new method generates KGs.
Let's dive into the key challenges:
- Scalability Issues: Traditional KG extraction methods often hit a wall when scaling up to vast amounts of unstructured data. These processes demand hefty computational resources and time, making real-time applications seem like a distant dream.
- Information Loss with Long Documents: Breaking down lengthy documents into chunks can lead to fragmented context and information loss, resulting in incomplete or inaccurate KGs. Maintaining the narrative thread is crucial for capturing the full picture.
- Breaking Free from Ontology Constraints: Many current techniques are shackled by predefined ontologies, limiting their reach across diverse domains. This dependency stifles innovation, preventing the capture of novel or domain-specific knowledge.
- High Computational Costs: The reliance on LLMs for KG extraction is a costly affair. Frequent inference calls ramp up processing time and resource consumption, posing a significant barrier to efficient KG generation.
- Quality vs. Efficiency Trade-off: There’s often a tug-of-war between achieving high-quality KGs and maintaining an efficient extraction process. Detailed processing ensures quality but can slow down workflows considerably.
Addressing these challenges head-on is crucial for crafting more efficient and effective KG construction methods that meet the growing demands of RAG applications.
What is SynthKG?
SynthKG is an LLM-based construction workflow that is changing the way we build KGs. It introduces a dynamic, multi-step workflow that operates at the document level without the constraints of predefined ontologies. This approach uses the power of LLMs to extract and structure information with unprecedented efficiency, tackling the limitations of traditional KG methods head-on.
There is also a smaller, fine-tuned LLM known as Distill-SynthKG for one-step generation of document-level KGs.
How SynthKG Solves Common Challenges
SynthKG takes on common challenges like document chunking and relation extraction in the following ways:
Document Chunking and Decontextualization
SynthKG starts by breaking down lengthy documents into smaller, manageable chunks. This strategic chunking along sentence boundaries ensures each piece remains semantically complete and ready for independent processing by LLMs.
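The paper's chunking code isn't reproduced here, but the core idea can be sketched in a few lines: greedily pack whole sentences into chunks under a size budget, so no sentence is ever split. A minimal sketch in Python, where the 200-word budget and the regex-based sentence splitter are illustrative assumptions rather than the paper's settings:

```python
import re

def chunk_by_sentences(text: str, max_words: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_words words."""
    # Naive sentence splitter; a production system might use spaCy or NLTK.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks
```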
Next, decontextualization refines these chunks by replacing entity mentions with their most informative forms, preserving context across the document. This meticulous process prevents information loss and maintains continuity, ensuring the KGs are both rich in detail and accurate.
Prompt for chunk decontextualization.
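The figure above shows the paper's actual prompt. As a hedged sketch of how a chunk might be decontextualized in practice, here is a provider-agnostic wrapper; the prompt wording and the `llm` callable are illustrative assumptions, not the paper's implementation:

```python
# Illustrative prompt, loosely paraphrasing the idea in the figure above.
DECONTEXTUALIZE_PROMPT = """\
Rewrite the passage so it can be understood on its own. Replace pronouns
and abbreviated mentions with the most informative form of each entity,
using the context provided.

Context (previous chunks): {context}
Passage: {chunk}
Rewritten passage:"""

def decontextualize(chunk: str, context: str, llm) -> str:
    """Rewrite a chunk so its entity mentions are self-contained.

    `llm` is assumed to be any callable mapping a prompt string to a
    completion string, e.g. a thin wrapper around your model provider's API.
    """
    return llm(DECONTEXTUALIZE_PROMPT.format(context=context, chunk=chunk))
```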
Once prepared, these decontextualized chunks move on to the entity and relation extraction stage, described in the next section.
Entity and Relation Extraction
Prompt for relation extraction.
In KG construction, SynthKG is setting new standards by seamlessly integrating entity and relation extraction into its workflow.
Here’s how it turns raw data into insightful graphs:
- Entity Extraction: Once documents are broken down into manageable chunks, SynthKG springs into action. The LLM is tasked with identifying all entities within each chunk - think of it as finding key players in a complex narrative. Whether it's people, places, or organizations, SynthKG ensures no significant detail is overlooked. This meticulous process lays the foundation for building a comprehensive KG that captures the essence of your data.
- Relation and Proposition Extraction: Following entity extraction, SynthKG goes a step further by generating propositions and relation triplets. Each relationship is distilled into a quadruplet: source entity, predicate, target entity, and a detailed proposition. These propositions act as the connective tissue of your KG, providing rich semantic context that enhances understanding and retrieval.
- Propositions as Retrieval Units: In SynthKG, propositions are more than just data points; they’re powerful retrieval units. By encapsulating essential relationship details, they form the backbone of efficient retrieval indices. Imagine a proposition stating "OWC Pharmaceutical Research Corp preferred stock is convertible to common stock at $0.20 per share." This level of detail ensures precision in retrieval tasks, empowering applications with contextually rich insights. The quadruplet format is sketched as a data structure right after this list.
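To make the quadruplet format concrete, here is a minimal sketch of how one might represent it, using the stock-conversion proposition from above; the class and field names are illustrative, not the paper's schema:

```python
from dataclasses import dataclass

@dataclass
class Quadruplet:
    """One KG edge: a relation triplet plus its supporting proposition."""
    source: str       # source entity
    predicate: str    # relation
    target: str       # target entity
    proposition: str  # fine-grained retrieval unit backing the triplet

# The proposition from the text, expressed as a quadruplet.
example = Quadruplet(
    source="OWC Pharmaceutical Research Corp preferred stock",
    predicate="is convertible to",
    target="common stock",
    proposition=("OWC Pharmaceutical Research Corp preferred stock is "
                 "convertible to common stock at $0.20 per share."),
)
```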
Average number of triplets per 100 words for varying document lengths.
By employing these steps, SynthKG effectively constructs high-quality KGs that maintain semantic richness and contextual accuracy across diverse documents. This structured approach allows for more efficient information retrieval and reasoning in RAG applications.
Evaluating SynthKG: Setting New Standards
To push the boundaries of KG synthesis, the authors introduced new datasets and metrics to evaluate the quality of KGs produced by SynthKG and Distill-SynthKG. This evaluation focuses on how well these graphs represent relevant information for complex multi-hop reasoning tasks.
Datasets
Existing multi-hop QA datasets, including MuSiQue, 2WikiMultiHopQA, and HotpotQA, were repurposed to create proxy ground-truth relation triplets.
Example from the HotpotQA dataset.
Each QA pair was turned into a triplet format, where the answer appears as either the head, relation, or tail. This approach ensures the evaluation zeroes in on the critical relationships necessary for answering questions effectively.
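As a hedged illustration of that conversion (the QA pair and triplet below are invented for this example; in the paper the conversion is performed automatically with an LLM):

```python
# Hypothetical QA pair and its proxy ground-truth triplet.
qa_pair = {
    "question": "Which company acquired the studio that created Halo?",
    "answer": "Microsoft",
}
# The answer appears as the head of the triplet.
proxy_triplet = ("Microsoft", "acquired", "Bungie")
```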
Metrics
Three complementary metrics were introduced to evaluate KG coverage; a sketch of how they might be computed follows the list:
- Semantic Score: By calculating cosine similarity between vector representations of ground truth triplets and those in the KG, this metric provides a precise measure of alignment. A higher score means a closer match.
- Triplet Coverage: This metric counts a ground truth triplet as covered when its semantic score against the KG exceeds a set threshold, measuring how completely the KG captures the reference relations.
- F1 Score: By comparing extracted triplets with ground truth data, this metric evaluates precision and recall, offering a robust measure of KG quality.
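As referenced above, here is a minimal sketch of how these three metrics might be computed, assuming triplets have already been embedded with a sentence encoder; the 0.8 threshold is an illustrative assumption, not necessarily the paper's value:

```python
import numpy as np

def semantic_score(gt_vec: np.ndarray, kg_vecs: np.ndarray) -> float:
    """Best cosine similarity between one ground-truth triplet embedding
    and all KG triplet embeddings (rows of kg_vecs)."""
    gt = gt_vec / np.linalg.norm(gt_vec)
    kg = kg_vecs / np.linalg.norm(kg_vecs, axis=1, keepdims=True)
    return float(np.max(kg @ gt))

def triplet_coverage(best_scores: list[float], threshold: float = 0.8) -> float:
    """Fraction of ground-truth triplets whose best match clears the threshold."""
    return sum(s >= threshold for s in best_scores) / len(best_scores)

def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0
```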
KG coverage evaluation.
Experimental Results
Distill-SynthKG is making waves by outperforming baselines such as Llama-3-8b and Llama-3-70b across all key metrics.
Here’s how it’s redefining quality and performance:
Quality of KGs
The quality of KGs generated by Distill-SynthKG was evaluated against baseline models, including Llama-3-8b and Llama-3-70b, and the results favored Distill-SynthKG on every metric.
Distill-SynthKG isn't just about quality - it's about quantity too. This model generates a significantly higher number of triplets, ensuring richer data representation than its competitors.
With higher semantic scores and triplet coverage, Distill-SynthKG aligns more closely with ground truth data, offering a more accurate depiction of complex relationships. Consistently achieving superior F1 scores, Distill-SynthKG excels in capturing relevant relations with precision and recall that set it apart from the rest.
For instance, in the MuSiQue dataset, Distill-SynthKG achieved a semantic score of 0.8546 and a triplet coverage of 46.90%, leaving Llama-3-70b trailing with a semantic score of 0.8346 and coverage of 40.34%. This shows its capability to deliver high-quality KGs that enhance retrieval accuracy and reasoning in RAG applications.
Retrieval Performance
Distill-SynthKG shines with impressive gains in Hits@2 and Hits@10 metrics across all datasets. Take the MuSiQue dataset, for example - Distill-SynthKG achieved a remarkable Hits@2 of 53.35%, leaving Llama-3-70b trailing at 48.64%.
Additionally, Distill-SynthKG's KGs led to higher mean reciprocal rank (MRR) and mean average precision (MAP) scores, indicating more effective retrieval of relevant information.
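For reference, the standard per-query definitions of Hits@k and MRR can be sketched as follows (scores are averaged over all queries; this is textbook code, not the paper's evaluation harness):

```python
def hits_at_k(ranked_ids: list[str], relevant: set[str], k: int) -> float:
    """1.0 if any relevant item appears in the top k results, else 0.0."""
    return float(any(doc_id in relevant for doc_id in ranked_ids[:k]))

def mrr(ranked_ids: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant item, 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```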
Retrieval evaluation.
These results show how effective SynthKG and Distill-SynthKG are at crafting high-quality KGs that not only boost retrieval accuracy but also improve question-answering performance. Their ability to maintain dense triplet representation across diverse documents further cements SynthKG's adaptability in KG synthesis.
Impact and Applications of SynthKG
SynthKG is at the forefront of redefining knowledge retrieval through its graph-based framework, leveraging the structured power of KGs to deliver precision and contextual relevance like never before.
Here's how it changes the retrieval landscape:
- Proposition-Entity Graph Retriever: Dive into a simplified retrieval process where propositions from KGs are matched with your query, narrowing down the search space to only the most relevant information. This efficiency ensures rapid access to critical data points.
- Graph Traversal: Once key propositions are identified, a sub-graph is constructed, linking these propositions with related entities. By traversing this sub-graph, starting from entities in your query, SynthKG filters out noise, retaining only logically connected insights (a simplified traversal sketch follows this list).
- Re-ranking with LLMs: An LLM then reviews the candidate propositions and selects those most relevant to the query, ensuring the retrieved context is focused and faithful to the question's intent rather than merely similar on the surface.
- Seamless Integration of Graph and LLM Insights: The final retrieval output combines both LLM-identified and graph-selected propositions, offering a comprehensive dataset for tackling complex queries with ease.
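Here is the traversal sketch promised above, assuming a bipartite proposition-entity graph built with networkx; proposition ranking and LLM re-ranking happen upstream and are abstracted away, so this is a simplification, not the paper's implementation:

```python
import networkx as nx

def filter_by_traversal(query_entities: list[str],
                        top_propositions: list[str],
                        graph: nx.Graph,
                        hops: int = 2) -> list[str]:
    """Keep only top-ranked propositions reachable within `hops` edges of a
    query entity. Proposition nodes link to the entity nodes they mention."""
    reachable: set[str] = set()
    for entity in query_entities:
        if entity in graph:
            # All nodes within `hops` of this entity, propositions included.
            reachable.update(
                nx.single_source_shortest_path_length(graph, entity, cutoff=hops)
            )
    return [p for p in top_propositions if p in reachable]
```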
This graph-based retrieval method outperforms traditional dense retrieval methods and other KG-based retrieval approaches across multiple datasets, demonstrating its effectiveness in improving retrieval accuracy and question-answering performance.
Potential Impact
By delivering high-quality KGs with unparalleled coverage and efficiency, these tools improve RAG systems across domains like healthcare, finance, and legal services.
The efficiency gains from Distill-SynthKG make large-scale KG construction feasible, supporting intelligent virtual assistants and automated reasoning systems.
Maintaining high triplet density across documents enhances multi-hop reasoning capabilities. This is especially important for scientific research and policy analysis.
SynthKG sets new standards for what’s possible in knowledge-driven applications. Whether it's powering smarter AI systems or opening new avenues for research, SynthKG is paving the way for a future where information is not just retrieved but understood and used to its fullest potential.
Moving Forward
The paper presents notable advancements in KG synthesis, especially when it comes to enhancing RAG applications.
Key Contributions
An overview of its key contributions:
- SynthKG Workflow: A multi-step process that synthesizes high-quality, ontology-free KGs at the document level using LLMs. This workflow addresses traditional inefficiencies by ensuring semantic richness and contextual accuracy.
- Distill-SynthKG Model: This model refines the SynthKG process into a single-step approach with a smaller LLM, significantly reducing inference calls and boosting efficiency without sacrificing quality.
- Evaluation Framework: The paper introduces innovative datasets and metrics for evaluating KG quality, focusing on semantic scores, triplet coverage, and F1 scores by repurposing existing multi-hop QA datasets.
- Graph-Based Retrieval Framework: A new method that leverages KGs from Distill-SynthKG to outperform existing retrieval methods, enhancing accuracy and performance across various benchmarks.
What’s Next?
While SynthKG and Distill-SynthKG represent significant strides in knowledge graph (KG) synthesis, the journey is just beginning.
The study focused exclusively on English-language documents, a well-trodden path in NLP research. The next frontier? Expanding KG synthesis to encompass a multitude of languages, unlocking global insights and applications.
In addition, the study relied on two foundational LLMs, Llama3-70b and GPT-4o, for synthetic data generation. Yet the landscape of LLMs is vast; exploring diverse models could reveal new dimensions of performance and capability, enhancing SynthKG's adaptability.
The benchmarks for assessing KG coverage rely on automated processes like question decomposition and triplet extraction with GPT-4o, which can introduce errors or omissions. There are opportunities to refine these methods, incorporating human evaluation for greater accuracy.
Future research could optimize KG synthesis workflows to handle larger datasets with minimal resources, exploring advanced decontextualization and chunk processing techniques.
There’s also potential for integrating SynthKG and Distill-SynthKG into broader AI systems, such as virtual assistants or automated reasoning frameworks, expanding their impact.
Moreover, developing sophisticated metrics to capture the nuances of ontology-free KGs could provide deeper insights into their quality and downstream impact.