%pip install deeplake llama-index llama-index-graph-stores-neo4j openai
from typing import Literal
from llama_index.core import PropertyGraphIndex
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.indices.property_graph import SchemaLLMPathExtractor
from llama_index.graph_stores.neo4j import Neo4jPGStore
import os
import openai
from llama_index.core import StorageContext
To proceed with this course example, please download the following recipe dataset.
Use Neo4j Aura
Neo4j Aura is a fully managed cloud service provided by Neo4j, designed to simplify the deployment and management of graph databases. It allows users to create, manage, and scale Neo4j databases in the cloud without the need for extensive infrastructure management. Key features of Neo4j Aura include:
- Ease of Use: Users can quickly set up and access graph databases via a user-friendly interface.
- Scalability: The service can automatically scale resources based on application needs, accommodating varying workloads.
- Security: Aura provides built-in security features, including data encryption and access controls.
- Performance: Optimized for performance, it ensures fast query responses and efficient data handling.
Relation to Graph RAG
Graph RAG (Retrieval-Augmented Generation over graphs) is a technique that grounds an LLM's answers in a knowledge graph rather than, or in addition to, a plain vector store. Instead of retrieving isolated text chunks, the system retrieves entities and the relationships between them, giving the model structured, connected context for answering questions.
Neo4j Aura is well suited to implementing Graph RAG because it handles the complex relationships and large datasets that knowledge graphs require. By leveraging Neo4j's graph structure, a Graph RAG pipeline can traverse entity relationships, answer multi-hop questions, and return results whose provenance is easy to interpret.
In summary, Neo4j Aura provides a robust, managed platform for graph databases, and its capabilities make it a natural backend for Graph RAG applications like the one we build in this lesson.
Here is the official website.
How to Use Graph RAG with Neo4j Aura
This guide walks through setting up a free Neo4j Aura instance and using it for Graph RAG.
Step 1: Go to the Neo4j Aura Website
- Visit the Neo4j Aura website in your browser, click on `Get Started Free`, and create a new account.
Step 2: Create a New Free Instance
- After logging in, select “Create New Instance” from the dashboard.
- Choose the free tier for your instance (usually labeled as “Aura Free”).
- Follow the prompts to configure your new instance, and confirm to create it.
Step 3: Download the Credentials
- Once the instance is ready, find the option to download or view the credentials. These include: `NEO4J_URI`, `NEO4J_USERNAME`, `NEO4J_PASSWORD`, `AURA_INSTANCEID`, and `AURA_INSTANCENAME`.
- NEO4J_URI: The URI (Uniform Resource Identifier) of your Neo4j Aura instance, which specifies the location of your database and the protocol to use. The `neo4j+s` prefix indicates that the connection is secure (encrypted with SSL). This URI is necessary for establishing a network connection to your specific database instance.
- NEO4J_USERNAME: The username for authenticating access to the Neo4j database. Default: `neo4j`. Neo4j Aura typically provides `neo4j` as the username, but you may configure other users in advanced setups. This is required to identify the user connecting to the database.
- NEO4J_PASSWORD: The password associated with the NEO4J_USERNAME for authenticating the connection to the database. Ensure this password is stored securely, as it allows access to the database; it is critical for establishing a successful and secure connection.
- AURA_INSTANCEID: A unique identifier for your specific Neo4j Aura instance, mainly used for internal tracking and reference. Although not always required by all applications, it is useful for organizing and keeping track of database instances, especially in multi-instance setups.
- AURA_INSTANCENAME: A custom or default name assigned to your Neo4j Aura instance (for example, Instance01), allowing for easy identification. This name helps distinguish between different instances in your Aura dashboard or environment configuration, especially when managing multiple databases.
Make sure to save these credentials securely, as they are necessary for connecting Graph RAG to your Neo4j Aura instance.
import os, getpass
os.environ["NEO4J_URI"] = getpass.getpass("NEO4J_URI: ")
os.environ["NEO4J_USERNAME"] = getpass.getpass("NEO4J_USERNAME: ")
os.environ["NEO4J_PASSWORD"] = getpass.getpass("NEO4J_PASSWORD: ")
os.environ["AURA_INSTANCEID"] = getpass.getpass("AURA_INSTANCEID: ")
os.environ["AURA_INSTANCENAME"] = getpass.getpass("AURA_INSTANCENAME: ")
os.environ["OPENAI_API_KEY"] = getpass.getpass("OPENAI_API KEY: ")
N.B.: If you are using a Mac, be sure to replace the `neo4j+s://` prefix with `neo4j+ssc://` (the scheme that accepts self-signed certificates).
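As an illustration, a minimal sketch of this substitution (assuming the URI has already been set in the environment, and using the standard `platform` module) could look like this:
import os
import platform

# On macOS, fall back to the self-signed-certificate scheme to avoid
# SSL verification errors when connecting to Aura.
if platform.system() == "Darwin":
    os.environ["NEO4J_URI"] = os.environ["NEO4J_URI"].replace(
        "neo4j+s://", "neo4j+ssc://"
    )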
In this initial step, we configure the environment variables needed to connect to the various services securely. The `getpass` module prompts the user to enter sensitive information without displaying it on the screen, enhancing security. Here's what each line accomplishes:
1. Setting up Neo4j connection details: `NEO4J_URI`, `NEO4J_USERNAME`, and `NEO4J_PASSWORD` are the essential variables required to establish a secure connection to a Neo4j graph database.
2. Configuring the Aura instance: `AURA_INSTANCEID` and `AURA_INSTANCENAME` identify the specific Neo4j Aura instance, often used in cloud-based Neo4j setups.
3. Setting the OpenAI API key: `OPENAI_API_KEY` is required to authenticate requests to the OpenAI API.
By using `os.environ`, we set these as environment variables, allowing secure, temporary access to sensitive data for the duration of the script.
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings
Settings.llm = OpenAI(temperature=0, model="gpt-4o-mini")
Settings.chunk_size = 512
In this section, we will use Deep Lake to access a read-only dataset containing recipe data stored in the Activeloop organization. Here's a breakdown of what's happening:
- Connecting to the dataset: `data = deeplake.open_read_only("al://activeloop/all_recipes")` opens the `all_recipes` dataset in read-only mode, ensuring the data remains unchanged during our operations. Read-only access is the only way to work with a public dataset that lives outside our own organization.
- Dataset Content: The `all_recipes` dataset includes the following columns, which provide comprehensive information about each recipe:
- group : Classification or category of the recipe.
- name : Name of the recipe.
- rating : Average rating score.
- n_rater : Number of individuals who rated the recipe.
- n_reviewer : Number of individuals who reviewed the recipe.
- recipe_summary : Brief description of the recipe.
- process : Step-by-step process or instructions for preparing the recipe.
- ingredient : List of ingredients required for the recipe.
This setup allows us to efficiently retrieve and analyze recipe data using `data`, the reference to our dataset.
import deeplake
data = deeplake.open_read_only("al://activeloop/all_recipes")
data
Dataset(columns=(group,name,rating,n_rater,n_reviewer,recipe_summary,process,ingredient), length=6204)
In this snippet, we retrieve the dataset schema to understand its structure. We then extract the name of each column in the schema, printing the names to verify the available data fields. Additionally, we store them in the `columns` list for easy reference later in the code.
schema = data.schema
columns = []
for el in schema.columns:
    print(el.name)
    columns.append(el.name)
group
name
rating
n_rater
n_reviewer
recipe_summary
process
ingredient
In this section, we prepare the dataset entries in a format compatible with LlamaIndex by creating a list of `Document` objects:
1. Formatting Each Entry: We iterate over each entry in the dataset (`data`), assembling its values into a comma-separated string based on the column names we retrieved earlier. This ensures each document has a consistent format with fields matching the dataset schema.
2. Handling Data Types: To avoid issues with non-string values, we convert any non-string field to a string before adding it. This guarantees that all values are compatible with LlamaIndex's `Document` format.
3. Creating Document Objects: Each formatted entry is then wrapped in a `Document` object (from `llama_index.core`) and added to `csv_node_documents`. This list is now ready for LlamaIndex operations, with each dataset entry represented as a `Document`.
from llama_index.core import Document

csv_node_documents = []
for el in data:
    elements = ""
    for col in columns:
        # convert non-string values (e.g. numeric ratings and counts) to strings
        if not isinstance(el[col], str):
            val = str(el[col])
        else:
            val = el[col]
        elements += val + ", "
    elements = elements[:-2]  # drop the trailing ", "
    csv_node_documents.append(Document(text=elements))
len(csv_node_documents)
6204
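As a quick sanity check (an optional step, not part of the original pipeline), we can inspect the beginning of the first formatted document:
# Print the first 200 characters of the first document's text.
print(csv_node_documents[0].text[:200])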
In this function, we define and configure a knowledge graph (KG) extractor using LlamaIndex to identify entities and relationships within our dataset:
1. Entity and Relation Types: We specify the main entities (RECIPE, INGREDIENT, CATEGORY, and RATING) and the types of relationships (CONTAINS, BELONGS_TO, HAS_RATING) that our extractor should identify. Defining entities and relations in uppercase follows best practices for readability and consistency.
2. Validation Schema: The `validation_schema` dictionary enforces rules for valid relationships. For instance, a RECIPE can have relationships like CONTAINS (for ingredients), BELONGS_TO (for categories), and HAS_RATING. Other entities, such as INGREDIENT, may not have associated relationships, indicating they are endpoints in the graph.
3. SchemaLLMPathExtractor: Using `SchemaLLMPathExtractor`, we specify the model (in this case, GPT-4o-mini) and pass our defined entities, relations, and validation schema. The `strict=True` parameter enforces adherence to the validation schema, preventing invalid relationships.
This function, `get_kg_extractor`, returns a configured knowledge graph extractor, ready to analyze text and identify structured relationships based on the defined schema.
def get_kg_extractor():
    # best practice: use upper-case entity and relation names
    entities = Literal["RECIPE", "INGREDIENT", "CATEGORY", "RATING"]
    relations = Literal["CONTAINS", "BELONGS_TO", "HAS_RATING"]
    # define which entities can have which relations
    validation_schema = {
        "RECIPE": ["CONTAINS", "BELONGS_TO", "HAS_RATING"],
        "INGREDIENT": [],
        "CATEGORY": ["BELONGS_TO"],
        "RATING": ["HAS_RATING"],
    }
    kg_extractor = SchemaLLMPathExtractor(
        llm=OpenAI(model="gpt-4o-mini", temperature=0.0),
        possible_entities=entities,
        possible_relations=relations,
        kg_validation_schema=validation_schema,
        strict=True,
    )
    return kg_extractor
In this function, we configure and return a Neo4j graph store connection, which allows us to store and query structured data in a Neo4j database:
1. Neo4jPGStore Initialization: We use `Neo4jPGStore` to connect to a Neo4j database, passing the necessary credentials and URL from the environment variables (NEO4J_USERNAME, NEO4J_PASSWORD, NEO4J_URI).
2. Returning the Connection: The function returns the configured `graph_store` instance, enabling other parts of the code to interact with the Neo4j database for storing and managing graph data.
This setup ensures secure and reusable access to Neo4j for handling knowledge graph data.
def get_graph_store():
    graph_store = Neo4jPGStore(
        username=os.environ["NEO4J_USERNAME"],
        password=os.environ["NEO4J_PASSWORD"],
        url=os.environ["NEO4J_URI"],
    )
    return graph_store
In this step, we initialize two key components for building our knowledge graph:
- Knowledge Graph Extractor (`kg_extractor`): Calling `get_kg_extractor()` creates an instance of the extractor configured to recognize the specified entities and relationships within our dataset. This component will process text data, extracting structured knowledge based on our schema.
- Graph Store (`graph_store`): Calling `get_graph_store()` establishes a connection to the Neo4j database, allowing us to store, query, and manage the extracted graph data.
With `kg_extractor` and `graph_store` initialized, we're ready to extract entities and relationships from our data and save this structured information to Neo4j for analysis.
kg_extractor = get_kg_extractor()
graph_store = get_graph_store()
In this function, we populate the Neo4j graph store with structured data extracted from our dataset:
- Creating the Property Graph Index: We use `PropertyGraphIndex.from_documents` to create a graph index from `nodes`, our list of `Document` objects representing each data entry. The `kg_extractors` parameter applies `kg_extractor` to extract entities and relationships from each document, following the schema we defined earlier. `embed_model` utilizes an embedding model (in this case, text-embedding-3-large) to create vector representations of the data, supporting advanced graph querying and similarity-based search.
- Storing in the Graph Store: `property_graph_store=graph_store` specifies Neo4j as the storage backend, allowing the index to be saved directly into our graph database.
- Return Values: The function returns the `index`, which holds the graph representation of the data.
With `populate_graph_store`, we set up a structured knowledge graph, storing each document's entities and relationships in Neo4j for easy access and querying.
def populate_graph_store(nodes, kg_extractor, graph_store):
    index = PropertyGraphIndex.from_documents(
        nodes,
        kg_extractors=[kg_extractor],
        embed_model=OpenAIEmbedding(model_name="text-embedding-3-large"),
        property_graph_store=graph_store,
        show_progress=True,
    )
    return index
Next, we ensure that asynchronous code can run within the Jupyter notebook environment, enabling compatibility with libraries that use asynchronous operations:
- Installing and Importing `nest_asyncio`: `nest_asyncio` allows nested asynchronous event loops, which is useful in Jupyter notebooks where a single event loop runs by default. Installing and applying `nest_asyncio` prevents runtime errors related to nested async functions.
!pip install nest_asyncio
- Applying the `populate_graph_store` Function: By calling `populate_graph_store(csv_node_documents, kg_extractor, graph_store)`, we create and store the knowledge graph in Neo4j based on our `csv_node_documents`. The function returns the `index` (graph index), which we can use for further data analysis and querying.
This setup ensures compatibility and successful execution of asynchronous tasks in the notebook environment.
import nest_asyncio

nest_asyncio.apply()

index = populate_graph_store(csv_node_documents, kg_extractor, graph_store)
Parsing nodes: 100%|██████████| 1000/1000 [00:00<00:00, 6377.17it/s]
Extracting paths from text with schema: 100%|██████████| 1000/1000 [28:20<00:00, 1.70s/it]
Generating embeddings: 100%|██████████| 10/10 [00:02<00:00, 3.79it/s]
Generating embeddings: 100%|██████████| 201/201 [00:09<00:00, 21.09it/s]
In this screenshot, we're looking at the Neo4j Aura dashboard, confirming that our knowledge graph data has been successfully populated on the Neo4j server:
- Instance Details: The instance is named "Instance01," and it's running in the Belgium region under a free plan.
- Node and Relationship Counts : We see that 3,804 nodes (out of a 200,000 limit) and 19,388 relationships (out of a 400,000 limit) have been created, indicating our data upload was successful.
- Neo4j Version : The instance is running Neo4j version 5.
- Connection URI : A unique URI is provided for connecting to this Neo4j instance securely.
These details confirm that our knowledge graph data has been uploaded to the server and is now available in Neo4j for querying and further analysis.
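If you prefer to check these counts programmatically instead of through the dashboard, a minimal sketch using the official `neo4j` Python driver (an extra dependency, not used elsewhere in this lesson) might look like this:
import os
from neo4j import GraphDatabase

# Count nodes and relationships to cross-check the Aura dashboard figures.
driver = GraphDatabase.driver(
    os.environ["NEO4J_URI"],
    auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"]),
)
with driver.session() as session:
    n_nodes = session.run("MATCH (n) RETURN count(n) AS c").single()["c"]
    n_rels = session.run("MATCH ()-[r]->() RETURN count(r) AS c").single()["c"]
print(f"Nodes: {n_nodes}, Relationships: {n_rels}")
driver.close()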
In this function, we retrieve the knowledge graph index from an existing graph store. This approach allows us to access a pre-existing graph structure, rather than creating a new one from raw data:
1. Loading from the Existing Graph Store: Using `PropertyGraphIndex.from_existing`, we load an index directly from `graph_store`, which represents our already populated Neo4j database.
2. Embedding Knowledge Graph Nodes: The parameter `embed_kg_nodes=True` specifies that the nodes in the knowledge graph should be embedded, which is useful for similarity searches or further analysis involving embeddings. The commented-out `vector_store` parameter is an optional addition if you have a dedicated vector store; Neo4j can support vector operations directly, so it may be omitted unless a separate vector store is in use.
The function returns the `index`, providing access to the knowledge graph for querying and analysis without re-processing or re-uploading the data. This is efficient when working with an existing dataset already stored in Neo4j.
def get_existing_graph_index(graph_store):
    # load from the existing graph/vector store
    index = PropertyGraphIndex.from_existing(
        property_graph_store=graph_store,
        # optional, Neo4j also supports vectors directly
        # vector_store=vector_store,
        embed_kg_nodes=True,
    )
    return index
This function provides another method for retrieving the knowledge graph index by re-processing the documents stored in an existing graph store. Here's how it works:
1. Retrieving Existing Nodes: We retrieve the stored nodes from `graph_store` using `graph_store.get()`. This gives us access to the documents (or nodes) that have already been saved in Neo4j.
2. Applying Knowledge Graph Extraction: We initialize `kg_extractor` by calling `get_kg_extractor()`, ensuring we use the predefined entity and relation schema to re-extract knowledge from the documents if needed.
3. Recreating the Property Graph Index: `PropertyGraphIndex.from_documents` is used to create a new index from the retrieved `nodes`, with `kg_extractor` specified to ensure the correct extraction of entities and relationships. We also use `embed_model` to generate embeddings for the nodes, enabling similarity search and enhanced querying.
4. Returning the Index: This function returns the `index`, which is now based on the documents retrieved from the existing graph store and ready for querying.
This approach is useful when you want to refresh or enhance the index by re-embedding or re-extracting knowledge from the existing data in the Neo4j database. It enables flexibility to update the index without re-uploading or reformatting the raw data.
def get_existing_graph_index_from_documents(graph_store):
    # load documents from the existing graph/vector store
    kg_extractor = get_kg_extractor()
    nodes = graph_store.get()
    index = PropertyGraphIndex.from_documents(
        nodes,
        kg_extractors=[kg_extractor],
        embed_model=OpenAIEmbedding(model_name="text-embedding-3-small"),
        property_graph_store=graph_store,
        show_progress=True,
    )
    return index
In this case, saving with `index.storage_context.persist("./storage")` stores the index locally on your machine. This local persistence has several advantages:
- Quick Access: By saving the index locally, you can quickly reload it in future sessions without needing to reprocess or re-upload the data from scratch.
- Data Integrity: This allows you to maintain a consistent version of your index, useful for backup or for working offline without reconnecting to Neo4j or re-fetching data.
- Flexibility: Storing locally means you can experiment, make updates, or analyze the index without affecting the original graph store in Neo4j.
The path `./storage` specifies where the index is saved on your local drive, so you can later reload it directly from this path whenever needed.
Once saved, you can load the index directly from this directory in future scripts or sessions, making it a convenient way to manage and maintain your knowledge graph.
index.storage_context.persist("./storage")
At this point, we have two options:
1. Use the Existing Graph Directly: If you've already populated the graph store (Neo4j database) and created the index, you can proceed with querying and analyzing the graph. This approach allows you to start working with the knowledge graph immediately without additional loading steps.
2. Load the Index from Local Storage (if needed): If you previously saved the index locally (e.g., in `./storage`), you can reload it to avoid reprocessing data or re-querying Neo4j. There are two ways to load the index:
- Using the Graph Store Directly: Call `get_existing_graph_index(graph_store)`, which loads the index directly from the Neo4j database.
- Loading from Local Storage: Use `load_index_from_storage` with `StorageContext.from_defaults(persist_dir="./storage")` to load the saved index from the specified local directory (replace `./storage` with the correct path if it's different).
This flexibility allows you to choose the most efficient option based on your needs:
- Use the existing graph if you’re actively working with the live data.
- Load from local storage if you want to work offline or avoid querying Neo4j repeatedly.
You can now start querying the graph index, whether it’s loaded from Neo4j or from local storage, depending on what best suits your workflow.
from llama_index.core import load_index_from_storage

graph_store = get_graph_store()

# Option 1: load the index directly from the Neo4j graph store
index = get_existing_graph_index(graph_store)

# Option 2: load the index from local storage
index = load_index_from_storage(
    StorageContext.from_defaults(persist_dir="./storage")
)

# We proceed with the index loaded from the graph store
index = get_existing_graph_index(graph_store)
Querying the Knowledge Graph for Recipe Instructions
In this section, we use the knowledge graph to retrieve specific information about a recipe by asking a natural language question.
- Configuring the Storage Context: `storage_context = StorageContext.from_defaults(graph_store=graph_store)` sets up a `StorageContext`, which serves as a container for managing various storage components, such as nodes, indices, vectors, and graph-related stores. In this instance, the `graph_store` component connects to our Neo4j graph store, enabling efficient access to stored nodes and relationships.
- Initializing the Retriever: `retriever = index.as_retriever(include_text=False)` creates a retriever from our knowledge graph index. This retriever allows us to query the graph based on a natural language question. By setting `include_text=False`, we exclude any additional source text, so the retriever returns only structured information related to the query.
- Formulating the Query: We pose a question, "How can I cook the Hungarian crepes?", which the retriever uses to find relevant nodes in the graph. This question targets specific recipe instructions or details related to cooking Hungarian crepes.
This approach enables an intuitive way to access recipe instructions and related information by querying the knowledge graph directly with natural language. The `StorageContext` simplifies data access and management by organizing the different storage layers needed for efficient retrieval.
storage_context = StorageContext.from_defaults(graph_store=graph_store)

retriever = index.as_retriever(
    include_text=False,  # include source text, default True
)

question = "How can I cook the Hungarian crepes?"
nodes = retriever.retrieve(question)
for node in nodes:
    print(f"Node: {node.text}, Score: {node.score}")
Node: Beef Stroganoff III -> CONTAINS -> condensed beef broth, Score: 0.5570871829986572
Node: Grandma Irena's Palacsinta (Hungarian Crepes) -> CONTAINS -> soda water, Score: 0.5568652153015137
Node: Slow Cooker London Broil -> CONTAINS -> condensed tomato soup, Score: 0.5565750598907471
Node: Gramma's Old Fashioned Chili Mac -> CONTAINS -> condensed tomato soup, Score: 0.5565750598907471
Node: Air Fryer Spanish Tortilla -> CONTAINS -> 1/3 cup chopped fresh flat-leaf parsley, Score: 0.5559759140014648
We can use a query engine to directly query the knowledge graph and obtain a more structured response to our natural language question.
- Creating a Query Engine: `query_engine = index.as_query_engine(include_text=True)` initializes a query engine based on our existing index. Setting `include_text=True` ensures that the full text or content associated with each result node is included in the response. This can be useful if you want to see more detailed explanations or descriptions directly in the output.
- Formulating and Executing the Query: We define our question as "How can I cook the Hungarian crepes?" and pass it to `query_engine.query(question)`. The query engine processes this question, searching through the graph to find relevant nodes and compiling a structured response based on the query.
This method allows for a more comprehensive and cohesive response, making it easier to access complex information stored in the graph with a single, natural language query. It's especially useful when you need detailed answers rather than just a list of matching nodes.
query_engine = index.as_query_engine(
    include_text=True,
)
response = query_engine.query(question)
print(str(response))
To cook Hungarian crepes, follow these steps:
1. **Prepare the Batter**: In a mixing bowl, combine 2 cups of all-purpose flour, 2 eggs, 1 cup of milk, 1 cup of soda water, and ½ cup of vegetable oil. Add a pinch of salt and mix until smooth.
2. **Cook the Crepes**: Heat a non-stick skillet over medium heat. Pour a small amount of batter into the skillet, tilting it to spread the batter evenly. Cook for about 1-2 minutes until the edges start to lift and the bottom is lightly golden. Flip and cook for another minute on the other side. Repeat with the remaining batter.
3. **Prepare the Filling**: In a separate bowl, mix 1 cup of chopped almonds, ½ cup of white sugar, ½ cup of chopped bittersweet chocolate, and 2 tablespoons of margarine. Optionally, you can add ½ teaspoon of vanilla extract and 1½ teaspoons of rum for extra flavor.
4. **Assemble the Crepes**: Place a crepe on a plate, add a portion of the filling, and roll it up.
5. **Serve**: Enjoy the crepes warm, either as a breakfast or dessert option.
The total preparation and cooking time is approximately 9 hours and 10 minutes, including an additional resting time.
Discovering Recipes Based on Available Ingredients
In this section, we use the query engine to find potential recipes based on specific ingredients:
- Defining the Question: The question, "What can I cook if I have eggs?", asks the knowledge graph for recipes that use eggs as an ingredient. This approach leverages the power of the graph to filter recipes based on ingredient requirements.
- Query Execution: `response = query_engine.query(question)` sends the question to the query engine, which searches the knowledge graph to identify relevant recipes.
question = "What can I cook if I have eggs?"response = query_engine.query(question)
print(str(response))
You can cook a variety of dishes if you have eggs, including:
1. **Amazing Muffin Cups** - A breakfast dish featuring eggs, sausage, red bell pepper, and cheese baked in hash brown potato cups.
2. **Make-Ahead Air Fryer Breakfast Burritos** - Sausage, egg, and cheese burritos that can be prepared in advance and cooked in an air fryer.
3. **Green Eggs And Hash Omelet** - An omelet made with eggs, spinach, and corned beef hash.
4. **Easy Egg and Avocado Breakfast Burrito** - A burrito filled with scrambled eggs and avocado.
5. **Scrambled Egg Omelet** - A combination of scrambled eggs and an omelet.
6. **Omuraisu (Japanese Rice Omelet)** - A Japanese dish featuring eggs and rice.
7. **Ham and Cheese Omelet Casserole** - A casserole dish with eggs, ham, and cheese.
8. **Easy Egg White Omelet** - A lighter omelet made with egg whites and vegetables.
9. **Simple Italian Omelet** - An omelet cooked with olive oil and goat cheese.
10. **Ultimate Low-Carb Ham and Cheese Omelet for Two** - A low-carb omelet with ham and Swiss cheese.
11. **Parmalet (Crisp Parmesan Omelet)** - A crispy omelet made with Parmesan cheese.
12. **Three Egg Omelet** - A large omelet filled with various meats and vegetables.
13. **Baked Brunch Omelet** - A baked omelet that can be prepared the night before.
14. **Paleo Omelet Muffins** - Muffins made with eggs and various vegetables and meats.
These options provide a range of flavors and styles to suit different tastes and preferences.
Exploring Variants of a Classic Dish
In this section, we use a natural language query to explore possible variations of a specific dish:
- Formulating the Question: The question, "Which is a variant to the classic Muffin Cups?", queries the knowledge graph for alternative versions or variations of the classic Muffin Cups recipe. This type of query is useful for discovering new twists or adaptations of familiar dishes.
- Query Execution: `response = query_engine.query(question)` sends the question to the query engine, which searches through the graph for nodes that describe recipe variants related to Muffin Cups.
- Displaying the Response: `print(str(response))` outputs a structured response that highlights any variations or related recipes found. This can help users quickly identify new ways to make a favorite dish or explore similar recipes.
This approach allows for easy access to recipe variations, enhancing flexibility and creativity in the kitchen by discovering new takes on classic dishes.
question = "Which is a variant to the classic Muffin Cups?"response = query_engine.query(question)
print(str(response))
A variant to the classic Muffin Cups is the Muffin Pan Frittatas.
Finding Recipes with Vector Search
Activeloop Overview
Activeloop is a data infrastructure platform that specializes in managing and optimizing large-scale datasets, particularly those used for machine learning, deep learning, and AI applications. Its primary tool, Deep Lake, is a data lake specifically designed for handling complex, unstructured data such as images, videos, audio, text, and embeddings.
One of Activeloop’s most convenient features is its web-based data upload functionality, which allows users to load datasets directly through the browser without needing command-line tools or additional software. This feature is particularly useful for data scientists and machine learning practitioners who want a quick, accessible way to ingest data into Deep Lake (Activeloop’s data lake) for immediate use in AI and machine learning workflows.
The screenshots demonstrate the process of creating and managing a dataset in Activeloop’s web interface . Here’s a step-by-step overview:
- Creating a New Dataset: In the first screenshot, we see the "Create Dataset" dialog. The user can define the dataset name (in this example, "recipes") and select the storage type, which in this case is "Activeloop DB." Clicking `Create` initializes the dataset in the Activeloop environment, ready to accept data uploads.
Create a new dataset.
We can then click on the `Import from files` option to upload our file. Once the upload process is complete, the dataset will appear in the Activeloop interface, ready for exploration and analysis.
- Uploading and Managing Data: In the second screenshot, the newly created dataset (named "recipes") is displayed with columns representing various attributes, including `group`, `name`, `rating`, `n_rater`, `n_reviewer`, `recipe_summary`, `process`, and `ingredient`.
- The interface also provides a query panel with options to perform queries on the dataset.
- Additionally, users have access to summary, structure, and analytics tabs, offering an overview of the dataset's content, structure, and analytics features.
- Activeloop’s interface allows uploading data in various formats (e.g., PDF, JSON) directly through the web interface, making it versatile for handling diverse data types.
These screenshots illustrate how Activeloop’s platform facilitates easy dataset creation, data management, and interactive querying, providing a user-friendly approach to work with structured data.
This `embedding_function` generates vector embeddings for text inputs using OpenAI's API:
1. Input Handling: If `texts` is a single string, it's converted to a list to ensure consistency.
2. Preprocessing: Newline characters (`\n`) are replaced with spaces to clean up formatting.
3. Embedding Generation: It calls `openai.embeddings.create` to get embeddings for each text using the specified model (`text-embedding-3-large` by default), returning a list of embedding vectors.
This function is useful for obtaining embeddings for tasks like semantic search or similarity matching.
def embedding_function(texts, model="text-embedding-3-large"):
    if isinstance(texts, str):
        texts = [texts]
    try:
        texts = [t.replace("\n", " ") for t in texts]
    except AttributeError:
        # leave non-string items untouched
        pass
    return [
        data.embedding
        for data in openai.embeddings.create(input=texts, model=model).data
    ]
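As a quick illustration (an optional check, not part of the original notebook; the example text is arbitrary), we can embed a single string and confirm the vector dimensionality:
# text-embedding-3-large returns 3,072-dimensional vectors, which matches
# the embedding column we create for the dataset below.
vec = embedding_function("Hungarian crepes with soda water")[0]
print(len(vec))  # expected: 3072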
Before creating a dataset, we need to authenticate with Activeloop using an API key:
os.environ["ACTIVELOOP_TOKEN"] = getpass.getpass("ACTIVELOOP_TOKEN: ")
Create the dataset from scratch
In this code, we’re creating and configuring a Deep Lake dataset within Activeloop , defining columns to store recipe-related data:
1. Dataset Creation :
- deeplake.create(f"al://{org_id}/{dataset_name}")
initializes a new dataset in the specified organization (org_id
) with a chosen name (dataset_name
), allowing us to store and organize our recipe data.
- Adding Columns :
- Each call to
ds.add_column
defines a new column in the dataset, specifying the data type and, in some cases, an indexing strategy: group
andname
: Text fields, wherename
has an inverted index for efficient keyword-based searching.rating
,n_rater
,n_reviewer
: Numeric columns representing ratings and counts, with specific types for optimized storage (Float32
for ratings,Int16
for counts).recipe_summary
andprocess
: Text fields whererecipe_summary
has an inverted index, andprocess
uses BM25 indexing (a ranking function) to improve search relevancy.ingredient
: A text field for listing ingredients.embedding_summary
: An embedding column with 3,072 dimensions, designed for vector-based similarity searches.
- Committing Changes :
ds.commit()
saves these configurations, making the dataset structure final and ready for data insertion.
- Dataset Summary :
ds.summary()
provides an overview of the dataset structure, helping verify the column setup and data types.
This configuration is tailored for storing and efficiently querying recipe data, enabling both keyword-based and embedding-based searches within the dataset.
import deeplake
from deeplake import types
org_id = "<your_org_id>"dataset_name = "<your_dataset_name>"ds = deeplake.create(f"al://{org_id}/{dataset_name}")
ds.add_column("group", types.Text)
ds.add_column("name", types.Text(index_type=types.Inverted))
ds.add_column("rating", types.Float32)
ds.add_column("n_rater", types.Int16)
ds.add_column("n_reviewer", types.Int16)
ds.add_column("recipe_summary", types.Text(index_type=types.Inverted))
ds.add_column("process", types.Text(index_type=types.BM25))
ds.add_column("ingredient", types.Text)
ds.add_column(name="embedding_summary", dtype=types.Embedding(3072))
ds.commit()
ds.summary()
Dataset(columns=(group,name,rating,n_rater,n_reviewer,recipe_summary,process,ingredient,embedding_summary), length=0)
+-----------------+---------------------+
| column | type |
+-----------------+---------------------+
| group | text |
+-----------------+---------------------+
| name |text (Inverted Index)|
+-----------------+---------------------+
| rating | float32 |
+-----------------+---------------------+
| n_rater | int16 |
+-----------------+---------------------+
| n_reviewer | int16 |
+-----------------+---------------------+
| recipe_summary |text (Inverted Index)|
+-----------------+---------------------+
| process | text (bm25 Index) |
+-----------------+---------------------+
| ingredient | text |
+-----------------+---------------------+
|embedding_summary| embedding(3072) |
+-----------------+---------------------+
In this code snippet, we prepare a dictionary to organize data extracted from the dataset by column:
- Initializing the Dictionary: `dict_values = {}` creates an empty dictionary to hold the data for each column.
- The first `for` loop iterates over each column name in `columns`, creating an empty list for each column in `dict_values`. This list will store the values for that column.
dict_values = {}
for col in columns:
    dict_values[col] = []
dict_values
{'group': [],
'name': [],
'rating': [],
'n_rater': [],
'n_reviewer': [],
'recipe_summary': [],
'process': [],
'ingredient': []}
- Populating the Dictionary: The second `for` loop iterates over each entry (`el`) in the dataset (`data`). For each entry, it iterates over each column name in `columns`, retrieves the value (`val`) for that column, and appends it to the corresponding list in `dict_values`.
After execution, `dict_values` will contain a list of values for each column, with keys as column names and values as lists of data for each entry. This structure is useful for organizing data for further processing or bulk insertion into another data store.
for el in data:
    for col in columns:
        val = el[col]
        dict_values[col].append(val)
dict_values
The following code appends recipe data to the Deep Lake dataset in batches, including embedding vectors generated from recipe summaries:
1. Extracting Column Data: Various lists (`groups`, `names`, `ratings`, etc.) are created to hold the values for each respective column. These lists are populated from `dict_values`, which organizes the dataset by columns.
2. Generating Embeddings: `embedding_summaries` is initialized as an empty list to store embeddings of `summaries`. The loop processes `summaries` in batches (of 500 in this case) to optimize the embedding generation process: `embedding_function(summaries[i : i + batch_size])` is called for each batch of summaries, and the resulting embeddings are appended to `embedding_summaries`.
groups = dict_values["group"]
names = dict_values["name"]
ratings = dict_values["rating"]
n_raters = dict_values["n_rater"]
n_reviewers = dict_values["n_reviewer"]
summaries = dict_values["recipe_summary"]
processes = dict_values["process"]
ingredients = dict_values["ingredient"]
embedding_summaries = []
batch_size = 500
for i in range(0, len(summaries), batch_size):
    embedding_summaries += embedding_function(summaries[i : i + batch_size])
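Before appending the data, a quick optional check (not in the original notebook) confirms that we produced exactly one embedding per recipe summary:
# Every summary should have a corresponding embedding vector.
assert len(embedding_summaries) == len(summaries)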
- Appending Data to the Dataset: `ds.append()` adds the data to the dataset in a structured format. Each column is matched with the corresponding list of values, including `embedding_summary`, which holds the generated embeddings.
- Committing Changes: `ds.commit()` saves the changes to the dataset, making this batch of data additions permanent.
- Verifying with the Dataset Summary: `ds.summary()` displays an updated summary of the dataset, confirming that the data, including the embeddings, has been successfully added.
This approach is efficient for handling large datasets and embedding generation, ensuring that all data, including vector representations, is correctly stored in Deep Lake for future querying or analysis.
ds.append(
    {
        "group": groups,
        "name": names,
        "rating": ratings,
        "n_rater": n_raters,
        "n_reviewer": n_reviewers,
        "recipe_summary": summaries,
        "process": processes,
        "ingredient": ingredients,
        "embedding_summary": embedding_summaries,
    }
)
ds.commit()
ds.summary()
Dataset(columns=(group,name,rating,n_rater,n_reviewer,recipe_summary,process,ingredient,embedding_summary), length=6204)
+-----------------+---------------------+
| column | type |
+-----------------+---------------------+
| group | text |
+-----------------+---------------------+
| name |text (Inverted Index)|
+-----------------+---------------------+
| rating | float32 |
+-----------------+---------------------+
| n_rater | int16 |
+-----------------+---------------------+
| n_reviewer | int16 |
+-----------------+---------------------+
| recipe_summary |text (Inverted Index)|
+-----------------+---------------------+
| process | text (bm25 Index) |
+-----------------+---------------------+
| ingredient | text |
+-----------------+---------------------+
|embedding_summary| embedding(3072) |
+-----------------+---------------------+
In this alternative approach, we handle the data efficiently by copying an existing dataset from Activeloop, generating embeddings, and appending only the missing `embedding_summary` column:
1. Copying the Dataset: `deeplake.copy(f"al://activeloop/all_recipes", f"al://{org_id}/{dataset_name}")` duplicates the original dataset (`all_recipes`) into the new location specified by `org_id` and `dataset_name`. This is efficient when you want to use a pre-existing dataset as a base.
2. Opening the Copied Dataset: `my_ds = deeplake.open(f"al://{org_id}/{dataset_name}")` opens the copied dataset, allowing us to modify it as needed.
deeplake.copy(f"al://activeloop/all_recipes", f"al://{org_id}/{dataset_name}")
my_ds = deeplake.open(f"al://{org_id}/{dataset_name}")
my_ds
Dataset(columns=(group,name,rating,n_rater,n_reviewer,recipe_summary,process,ingredient), length=6204)
3. Adding the Missing Column: `my_ds.add_column(name="embedding_summary", dtype=types.Embedding(3072))` adds an `embedding_summary` column for storing the 3,072-dimensional embeddings, which were not present in the original dataset.
4. Committing and Summarizing: After adding the column, `my_ds.commit()` saves this change, and `my_ds.summary()` confirms the updated dataset structure.
my_ds.add_column(name="embedding_summary", dtype=types.Embedding(3072))
my_ds.commit()
my_ds.summary()
Dataset(columns=(group,name,rating,n_rater,n_reviewer,recipe_summary,process,ingredient,embedding_summary), length=6204)
+-----------------+---------------------+
| column | type |
+-----------------+---------------------+
| group | text |
+-----------------+---------------------+
| name |text (Inverted Index)|
+-----------------+---------------------+
| rating | float32 |
+-----------------+---------------------+
| n_rater | int16 |
+-----------------+---------------------+
| n_reviewer | int16 |
+-----------------+---------------------+
| recipe_summary |text (Inverted Index)|
+-----------------+---------------------+
| process | text (bm25 Index) |
+-----------------+---------------------+
| ingredient | text |
+-----------------+---------------------+
|embedding_summary| embedding(3072) |
+-----------------+---------------------+
5. Filling the New Embedding Column: `my_ds["embedding_summary"][:emb_len] = embedding_summaries` assigns the generated embeddings (`embedding_summaries`) to the newly created column, up to the current length of the embeddings list.
6. Final Commit and Summary: A final commit and summary confirm that the embeddings have been successfully added to the dataset.
This method saves time by reusing an existing dataset, adding only new embeddings as needed, which minimizes data duplication and makes the process more efficient.
emb_len = len(embedding_summaries)
my_ds["embedding_summary"][:emb_len] = embedding_summaries
my_ds.commit()
my_ds.summary()
Dataset(columns=(group,name,rating,n_rater,n_reviewer,recipe_summary,process,ingredient,embedding_summary), length=6204)
+-----------------+---------------------+
| column | type |
+-----------------+---------------------+
| group | text |
+-----------------+---------------------+
| name |text (Inverted Index)|
+-----------------+---------------------+
| rating | float32 |
+-----------------+---------------------+
| n_rater | int16 |
+-----------------+---------------------+
| n_reviewer | int16 |
+-----------------+---------------------+
| recipe_summary |text (Inverted Index)|
+-----------------+---------------------+
| process | text (bm25 Index) |
+-----------------+---------------------+
| ingredient | text |
+-----------------+---------------------+
|embedding_summary| embedding(3072) |
+-----------------+---------------------+
Recipe Retrieval Based on Similarity Search
The following code retrieves recipes similar to a given query by leveraging embedding-based similarity search.
- Embedding the Query: `embedding_function(query)[0]` converts the text query ("How can I cook the Hungarian crepes?") into an embedding, creating a vector representation (`embed_query`). The embedding values are then formatted as a comma-separated string (`str_query`) for use in the SQL-like query.
query = "How can I cook the Hungarian crepes?"embed_query = embedding_function(query)[0]
str_query = ",".join(str(c) for c in embed_query)
- Similarity Query Construction: The SQL query (`query_vs`) calculates the cosine similarity between the query embedding and each recipe's embedding in the `embedding_summary` column. Recipes are ordered by similarity score (highest first), returning the top 5 most similar recipes.
query_vs = f""" SELECT * FROM ( SELECT *, cosine_similarity(embedding_summary, ARRAY[{str_query}]) AS score FROM ( SELECT *, ROW_NUMBER() AS row_id ) ) ORDER BY score DESC LIMIT 5"""view_vs = ds.query(query_vs)
view_vs
Dataset(columns=(group,name,rating,n_rater,n_reviewer,recipe_summary,process,ingredient,embedding_summary,row_id,score), length=5)
- Executing the Query and Displaying Results: `view_vs = ds.query(query_vs)` runs the query, and each result is printed, showing the `recipe_summary` and its similarity `score`.
for el in view_vs:
    print(el["recipe_summary"], el["score"])
This crepe recipe is essential for a fancy breakfast or eye-catching dessert. Sprinkle warm crepes with sugar and lemon, or serve with whipped cream, ice cream, and fruit. 0.56888694
This crepe recipe is essential for a fancy breakfast or eye-catching dessert. Sprinkle warm crepes with sugar and lemon, or serve with whipped cream, ice cream, and fruit. 0.56883913
This French delicacy is extremely versatile, as it can be filled with virtually anything -- fruits, pudding, mousse for desserts as well as vegetables and meats for dinner. No need to add more oil each time unless the pan begins to stick. Freeze extra crepes for later use. 0.54992
A simple crepe recipe which can be filled with whatever your heart desires; fruit, jam, applesauce or powdered sugar. 0.53856164
Umm Umm Ummm! This recipe was given to me by a friend and I just love it. Crepes with chocolate, strawberries and a whipped topping -- what more can I say! 0.52416164
Using this simple 'crepe cake' technique, you can turn any of your favorite cake fillings into visually stunning, multi-layered masterpieces. By the way, I say this is simple, not fast, as it does take some time to make and stack all those crepes, but once you get rolling, it goes pretty quickly. Use the ingredient amounts only as a guide, as crepe sizes and filling amounts will vary. Dust with powdered sugar and garnish with fresh strawberries. 0.51937455
Here we define a function, `get_answer`, that generates answers to questions based on a given context using OpenAI's chat model.
- Function Purpose: `get_answer(question, context)` receives a `question` and a `context`, and aims to answer the question based solely on the information provided in the context.
- Prompt Setup: `system_prompt` is a system-level instruction for the model, specifying that it should generate answers only from the provided context and format the answer as JSON (e.g., {"answer": "your answer here"}). `user_prompt` combines the question and context into a single message for the model to process.
- API Call to OpenAI: The prompt is sent to the OpenAI chat model (gpt-4o-mini) via `client.chat.completions.create`. The model returns a structured JSON response because `response_format={"type": "json_object"}` instructs it to return JSON-formatted answers.
- Parsing and Error Handling: The JSON response is parsed to extract the answer. If successful, the function returns the answer string; otherwise, it catches and logs any errors.
import json
from openai import OpenAI
client = OpenAI()
def get_answer(question, context):
    system_prompt = """
    You are a helpful assistant that answers questions based on the provided context.
    Given a question and a context, provide a clear and concise answer using only the
    information from the context.
    Format your response in the following JSON format: {"answer": "your answer here"}.
    """
    user_prompt = f"Question: {question}\nContext: {context}"
    # Send the prompt to OpenAI's chat model
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        response_format={"type": "json_object"},
    )
    # Parse the JSON response to retrieve the answer
    try:
        response_content = response.choices[0].message.content
        response_json = json.loads(response_content)
        answer = response_json.get("answer", "INSUFFICIENT INFORMATION")
        return answer
    except Exception as e:
        print(f"Error: {e}")
        return False
Now, we format the context by summarizing each recipe’s key details (summary, ingredients, and dish name). Using this structured context, we then call the get_answer
function, which generates an answer specifically based on the recipes retrieved. Finally, we display the question alongside the concise, context-based response, ensuring the answer is directly relevant to the recipes provided.
context = [f'Summary: {el["recipe_summary"]}, Ingredients: {el["ingredient"]}, Dish name: {el["name"]}' for el in view_vs]
answers = get_answer(query, context)
print(f"Question: {query}\nAnswer: {answers}")
[('This crepe recipe is essential for a fancy breakfast or eye-catching dessert. Sprinkle warm crepes with sugar and lemon, or serve with whipped cream, ice cream, and fruit.',
'4 eggs, lightly beaten + 1⅓ cups milk + 1 cup all-purpose flour + 2 tablespoons butter, melted + 2 tablespoons white sugar + � teaspoon salt'),
('This French delicacy is extremely versatile, as it can be filled with virtually anything -- fruits, pudding, mousse for desserts as well as vegetables and meats for dinner. No need to add more oil each time unless the pan begins to stick. Freeze extra crepes for later use.',
'2 eggs + 1 cup milk + ⅔ cup all-purpose flour + 1 pinch salt + 1� teaspoons vegetable oil'),
('A simple crepe recipe which can be filled with whatever your heart desires; fruit, jam, applesauce or powdered sugar.',
'1� cups all-purpose flour + 1 tablespoon white sugar + � teaspoon baking powder + � teaspoon salt + 2 cups milk + 2 tablespoons butter, melted + � teaspoon vanilla extract + 2 eggs'),
('Umm Umm Ummm! This recipe was given to me by a friend and I just love it. Crepes with chocolate, strawberries and a whipped topping -- what more can I say!',
'1 egg, beaten + � cup skim milk + ⅓ cup water + 1 tablespoon vegetable oil + ⅔ cup all-purpose flour + � teaspoon white sugar + 1 pinch salt + � cup semisweet chocolate chips + 1 cup sliced fresh strawberries + � cup frozen whipped topping, thawed'),
("Using this simple 'crepe cake' technique, you can turn any of your favorite cake fillings into visually stunning, multi-layered masterpieces. By the way, I say this is simple, not fast, as it does take some time to make and stack all those crepes, but once you get rolling, it goes pretty quickly. Use the ingredient amounts only as a guide, as crepe sizes and filling amounts will vary. Dust with powdered sugar and garnish with fresh strawberries.",
'5 large eggs + 2� cups all-purpose flour + 2 tablespoons white sugar + � teaspoon kosher salt + 2� tablespoons vegetable oil + 3� cups whole milk + � teaspoon vanilla extract + 4 tablespoons butter, or as needed + 1 (10 ounce) jar strawberry jam + 2 tablespoons water + � cup mascarpone cheese + 1� cups heavy cream + 3 tablespoons white sugar + � teaspoon vanilla extract')]
Finding Recipes with Specific Ingredients
In this example, we’re identifying recipes that include a particular ingredient, tomato , and ranking the results based on similarity to a natural language query.
- Embedding the Query: We start by embedding the query ("What can I cook if I have tomato?") to create a vector representation that helps identify similar recipes in the dataset.
query = "What can I cook if I have tomato?"embed_query = embedding_function(query)[0]
str_query = ",".join(str(c) for c in embed_query)
- Filtering for Specific Ingredients: We filter the dataset for recipes containing "tomato" in the ingredient list using the `CONTAINS` filter, allowing us to retrieve only relevant entries that match the ingredient requirement.
- Query Execution and Ranking: The query ranks the top 5 matching recipes by similarity score, ordering the results so that the most relevant recipes appear first.
word = "tomato"query_vs = f""" SELECT * FROM ( SELECT *, cosine_similarity(embedding_summary, ARRAY[{str_query}]) AS score FROM ( SELECT *, ROW_NUMBER() AS row_id ) ) WHERE CONTAINS(ingredient, '{word}') ORDER BY score DESC LIMIT 5"""view_vs = ds.query(query_vs)
view_vs
Dataset(columns=(group,name,rating,n_rater,n_reviewer,recipe_summary,process,ingredient,embedding_summary,row_id,score), length=5)
- Generating a Contextual Answer: We then format the results into a structured context that includes each recipe's summary, ingredients, and name, and pass this to `get_answer`. This function generates a clear answer based on the filtered recipes, indicating dishes you can prepare with tomatoes.
context = [f'Summary: {el["recipe_summary"]}, Ingredients: {el["ingredient"]}, Dish name: {el["name"]}' for el in view_vs]
answers = get_answer(query, context)
print(f"Question: {query}\nAnswer: {answers}")
Finding Dish Variants by Excluding Ingredients
In this example, we’re finding pizza recipes that exclude a specific ingredient , cheese, rather than searching for recipes that contain it. Here’s how we proceed:
- Embedding the Query: We start by embedding the query ("Which pizzas can I cook without cheese?") to help locate similar recipes based on vector similarity.
query = "Which pizzas can I cook without cheese?"embed_query = embedding_function(query)[0]
str_query = ",".join(str(c) for c in embed_query)
- Filtering Out Ingredients: Unlike previous examples where we searched for included ingredients, here we use a `NOT CONTAINS` filter in the SQL-like query. This filter excludes any recipes that list "cheese" in their ingredients, helping us find cheese-free pizza options.
- Query Execution and Ranking: The query retrieves the top 5 results, sorted by similarity to the query, ensuring that the returned recipes are both relevant and match the no-cheese criterion.
filter= "cheese"query_vs = f""" SELECT * FROM ( SELECT *, cosine_similarity(embedding_summary, ARRAY[{str_query}]) AS score FROM ( SELECT *, ROW_NUMBER() AS row_id ) ) WHERE NOT CONTAINS(ingredient, '{filter}') ORDER BY score DESC LIMIT 5"""view_vs = ds.query(query_vs)
view_vs
Dataset(columns=(group,name,rating,n_rater,n_reviewer,recipe_summary,process,ingredient,embedding_summary,row_id,score), length=5)
for el in view_vs:
    print(el["recipe_summary"], el["ingredient"])
A tasty, quick pizza crust that uses no yeast. 1⅓ cups all-purpose flour + 1 teaspoon baking powder + � teaspoon salt + � cup fat-free milk + 2 tablespoons olive oil
A tasty, quick pizza crust that uses no yeast. 1⅓ cups all-purpose flour + 1 teaspoon baking powder + � teaspoon salt + � cup fat-free milk + 2 tablespoons olive oil
A simple and easy pizza sauce. No cooking and quick to make. 1 (15 ounce) can tomato sauce + 1 (6 ounce) can tomato paste + 1 tablespoon ground oregano + 1� teaspoons dried minced garlic + 1 teaspoon ground paprika
A simple and easy pizza sauce. No cooking and quick to make. 1 (15 ounce) can tomato sauce + 1 (6 ounce) can tomato paste + 1 tablespoon ground oregano + 1� teaspoons dried minced garlic + 1 teaspoon ground paprika
Great sourdough pizza dough to top with your favorites. 1� cups sourdough starter + 1� cups all-purpose flour + 1 tablespoon olive oil + 1 teaspoon salt
- Generating a Contextual Answer: We format the retrieved recipes' details into a structured context and pass this to `get_answer`, which provides a concise response based on the filtered recipes.
context = [f'Summary: {el["recipe_summary"]}, Ingredients: {el["ingredient"]}, Dish name: {el["name"]}' for el in view_vs]
answers = get_answer(query, context)
print(f"Question: {query}\nAnswer: {answers}")
Question: Which pizzas can I cook without cheese?
Answer: You can cook the No-Yeast Pizza Crust and top it with the Easy Pizza Sauce without cheese.
Imagine a recipe discovery tool that combines the precision of a well-organized cookbook with the creativity of a chef’s intuition—this is what our system achieves by blending the power of knowledge graphs with the flexibility of vector search.
In this guide, we explored two complementary approaches—knowledge graphs and vector search —to build a powerful recipe discovery system. Using Graph RAG methods, we tapped into structured data with predefined entities and relationships. This allowed us to organize and query recipes by specific criteria, such as ingredients and dish categories, using the Neo4j knowledge graph as our primary structure. With vector search, we added a layer of flexibility by embedding text-based queries and recipe descriptions, enabling similarity-based searches for even more nuanced recipe recommendations.
Strengths of Each Approach
- Knowledge Graphs with Graph RAG:
  - Strengths:
    - Provides highly structured, schema-based querying, making it ideal for precise, context-based questions.
    - Enables relationship-based navigation of data (e.g., finding related recipes by ingredient categories or cuisine type).
    - Supports clear, well-defined entity extraction and is easy to understand in terms of data relationships.
  - Weaknesses:
    - Limited by the rigid structure of predefined entities and relationships; less flexible for broader, exploratory queries.
    - Requires more upfront configuration and a predefined schema to function effectively, making it less adaptive to unstructured data.
- Vector Search:
  - Strengths:
    - Flexible for semantic searches, allowing users to search by themes, flavors, or descriptions without exact keyword matches.
    - Useful for recommendation systems based on similarity, where embeddings capture nuances in recipe descriptions that go beyond structured fields.
    - Easily adaptable for exploratory queries and can handle unstructured data, such as user queries phrased in natural language.
  - Weaknesses:
    - Lacks the structured specificity of a knowledge graph; results may be less interpretable in terms of precise relationships.
    - Dependent on embedding quality and model choice, which can vary, affecting search accuracy.
Integrating Graph RAG and Exploring Vector Search Separately
In this guide, we began by creating a Property Graph Index that combines the strengths of structured knowledge graphs and vector search . This hybrid approach allows us to store and query structured relationships between entities (using a schema) while also enabling semantic similarity searches through embeddings.
1. Creating the Hybrid Property Graph: Using the `populate_graph_store` function, we populated a Property Graph Index by extracting entities and relationships from the dataset and embedding nodes with vector representations. This setup gives the graph the ability to handle both precise, schema-based queries and open-ended semantic searches.
2. Why Focus on Vector Search Separately?: After building the hybrid graph, we shifted our focus to vector search as a standalone approach, in order to dig into the mechanics of embedding-based searches and understand how they work independently of the graph structure. This allows us to explore scenarios where vector search alone excels, such as finding similar nodes without relying on predefined relationships or schema constraints.
3. Final Insights: By starting with Graph RAG, we showcased how a structured graph with embedded vectors can serve as a robust tool for both relational and semantic queries. Focusing separately on vector search demonstrated its standalone power.
Why This Approach Matters
The hybrid setup with Graph RAG showcases its potential for handling diverse queries, while the deep dive into vector search highlights its flexibility and adaptability. By combining the two, users gain a comprehensive understanding of their individual strengths and weaknesses, empowering them to build tailored solutions for complex data exploration and retrieval tasks.
Comparing Graph RAG and Vector Search for Enhanced Recipe Discovery
Each approach brings unique strengths to recipe discovery. The knowledge graph excels at delivering clear, relationship-based queries, allowing users to navigate structured, ingredient-focused searches with precision. On the other hand, vector search offers flexibility through semantic-based discovery, accommodating open-ended questions and capturing the nuanced similarities between recipes.
While Graph RAG serves as the structured backbone, providing organized, schema-driven insights, vector search enhances the user experience with its adaptability to natural language and exploratory queries. Together, these approaches create a versatile, high-performance recipe recommendation system, capable of meeting both specific and evolving user needs.
Congratulations on completing this journey! You’ve now gained powerful tools and insights to leverage both structured knowledge and flexible search in building advanced, user-friendly applications. Whether organizing complex data or enabling intuitive discovery, your new skills set the stage for innovative solutions that adapt to user needs. Keep experimenting, stay curious, and continue to push the boundaries of what’s possible!