The GraphRAG Python package from Neo4j offers a complete solution for building end-to-end workflows, from transforming unstructured data into a knowledge graph to enabling knowledge graph retrieval and implementing full GraphRAG pipelines. Whether you’re developing knowledge assistants, search APIs, chatbots, or report generators in Python, this package simplifies the integration of knowledge graphs to improve the relevance, accuracy, and explainability of retrieval-augmented generation (RAG).
In this guide, we’ll demonstrate how to get started with the GraphRAG Python package, build a GraphRAG pipeline from scratch, and explore various knowledge graph retrieval methods to customize the behavior of your GenAI application.
GraphRAG: Enhancing GenAI with Knowledge Graphs
By combining knowledge graphs with RAG, GraphRAG addresses common challenges of large language models (LLMs), such as hallucinations, while enriching responses with domain-specific context for better quality and precision than traditional RAG methods. Knowledge graphs provide essential contextual data, enabling LLMs to deliver reliable answers and act as trusted agents in complex tasks. Unlike conventional RAG solutions that focus on fragmented textual data, GraphRAG integrates both structured and semi-structured data into the retrieval process.
With the GraphRAG Python package, you can create knowledge graphs and implement advanced retrieval methods, including graph traversals, query generation via text-to-Cypher, vector searches, and full-text searches. The package also includes tools for building full RAG pipelines, enabling seamless integration of GraphRAG with Neo4j into GenAI workflows and applications.
Key Components of the GraphRAG Knowledge Graph Construction Pipeline
The GraphRAG knowledge graph (KG) construction pipeline consists of several components, each essential in transforming raw text into structured data for retrieval-augmented generation (RAG) with Neo4j. These components enable advanced retrieval methods such as graph-based searches and context-aware responses. Below are the core components:
- Document Parser: Extracts text from various document formats (e.g., PDFs).
- Document Chunker: Splits the text into smaller pieces that fit within the LLM’s token limit.
- Chunk Embedder (Optional): Computes vector embeddings for each chunk, enabling semantic matching.
- Schema Builder: Defines the structure of the KG, grounding entity extraction and ensuring consistency.
- LexicalGraphBuilder (Optional): Builds a lexical graph connecting documents and chunks.
- Entity and Relation Extractor: Identifies entities (e.g., people, dates) and their relationships.
- Knowledge Graph Writer: Saves the entities and relations to the graph database for retrieval.
- Entity Resolver: Merges duplicate or similar entities into a single node to maintain graph integrity.
These components work together to create a dynamic knowledge graph that powers GraphRAG, enabling more accurate and context-aware responses from LLMs.
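To make the Entity Resolver step more concrete, here is a minimal sketch in plain Python that merges entities sharing a label and a normalized name. The `resolve_entities` helper and the dictionary layout are hypothetical and chosen only for illustration; this is not the package's implementation, which works on nodes stored in the database.

```python
def resolve_entities(entities):
    """Merge entities that share a label and a case-insensitive name.

    `entities` is a list of dicts with "label" and "name" keys
    (a hypothetical layout used only for this illustration).
    """
    merged = {}
    for entity in entities:
        # Normalize the name so "Carbon Dioxide" and "carbon dioxide " collide.
        key = (entity["label"], entity["name"].strip().lower())
        # Keep the first spelling seen for each key.
        merged.setdefault(key, entity)
    return list(merged.values())


entities = [
    {"label": "GreenhouseGas", "name": "Carbon Dioxide"},
    {"label": "GreenhouseGas", "name": "carbon dioxide "},
    {"label": "GreenhouseGas", "name": "Methane"},
]
resolved = resolve_entities(entities)
print([e["name"] for e in resolved])
```

Exact-name matching is the simplest possible rule; a real resolver may also need fuzzy or embedding-based matching to catch near-duplicates.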
Set Up a Neo4j Database
To begin the RAG workflow, the first step is to set up a database for retrieval. Neo4j AuraDB provides an easy way to launch a free graph database. Depending on your requirements, you can choose AuraDB Free for basic use or try AuraDB Professional (Pro), which offers more memory and better performance for ingestion and retrieval tasks. While the Pro tier is ideal for optimal results thanks to its advanced features, this project uses AuraDB Free. AuraDB is a fully managed cloud service that provides a scalable, high-performance graph database. With its free tier, users can easily build and explore graph-based applications, using the relationships between data points for insights and analysis.
After logging in to Neo4j AuraDB, you can create a free instance. Once the instance is set up, you can download the required credentials, including the username, Neo4j URL, and password, to connect to your database.
Install the Required Libraries
We’ll install several libraries with pip, including Neo4j’s GraphRAG package (neo4j-graphrag) and the OpenAI client, to build GraphRAG with Neo4j and Python. This is an essential step for setting up the environment.
!pip install fsspec openai numpy torch neo4j-graphrag
Set Up Connection Details for Neo4j
NEO4J_URI = ""
username = ""
password = ""
In this section, we define the connection details for Neo4j. Replace the placeholders with your actual Neo4j database credentials:
- NEO4J_URI: URI of your Neo4j instance (e.g., bolt://localhost:7687).
- username and password: Your Neo4j authentication credentials.
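As an alternative to typing credentials into the notebook, the sketch below reads them from environment variables. The variable names `NEO4J_URI`, `NEO4J_USERNAME`, and `NEO4J_PASSWORD`, the defaults, and the `load_neo4j_settings` helper are assumptions made for this example, not names required by Neo4j or the GraphRAG package.

```python
import os


def load_neo4j_settings(env=os.environ):
    """Read connection details from a mapping of environment variables.

    The variable names and defaults are assumptions made for this sketch.
    """
    uri = env.get("NEO4J_URI", "bolt://localhost:7687")
    username = env.get("NEO4J_USERNAME", "neo4j")
    password = env.get("NEO4J_PASSWORD", "")
    if not password:
        raise ValueError("NEO4J_PASSWORD is not set")
    return uri, username, password


# Example with an explicit mapping instead of the real environment.
NEO4J_URI, username, password = load_neo4j_settings(
    {"NEO4J_URI": "bolt://localhost:7687", "NEO4J_PASSWORD": "example-password"}
)
print(NEO4J_URI, username)
```

Calling `load_neo4j_settings()` with no argument reads the real environment, which keeps secrets out of the notebook itself.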
Set OpenAI API Key
import os
os.environ['OPENAI_API_KEY'] = ''
Here, we set the OpenAI API key using os.environ. This allows us to use OpenAI’s models for entity extraction in the knowledge graph.
1. Building and Defining the Knowledge Graph Pipeline
To demonstrate GraphRAG with Neo4j and Python, we will transform research papers on the greenhouse effect into a structured knowledge graph and store it in a Neo4j database. Using a selection of PDF documents focused on greenhouse effect studies, we’ll organize the domain-specific knowledge these documents contain into a graph that enhances AI-driven applications. This approach allows for better structuring and retrieval of complex scientific information.
The knowledge graph will include these key node types:
- Document: Captures metadata related to the document sources.
- Chunk: Represents text segments from the documents, stored with vector embeddings for efficient retrieval.
- Entity: Entities extracted from the text chunks, providing structured context and connections.
To automate the creation of this knowledge graph, we use the SimpleKGPipeline class. This class enables seamless knowledge graph construction by requiring a few essential inputs:
- A Neo4j driver to connect to the Neo4j database.
- An LLM (large language model) for entity extraction.
- An embedding model to convert text into vectors, enabling similarity searches.
By combining document transformation with an automated pipeline, we can build a comprehensive knowledge graph that efficiently organizes and retrieves insights about the greenhouse effect.
Neo4j Driver Initialization
import neo4j
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings.openai import OpenAIEmbeddings
driver = neo4j.GraphDatabase.driver(NEO4J_URI, auth=(username, password))
Here, we initialize the Neo4j database driver using the NEO4J_URI, username, and password set earlier. We also import the components needed for LLM-based entity extraction (OpenAILLM) and embedding (OpenAIEmbeddings).
Initialize LLM and Embedding Model
llm = OpenAILLM(
    model_name="gpt-4o-mini",
    model_params={"response_format": {"type": "json_object"}, "temperature": 0},
)
embedder = OpenAIEmbeddings()
We’ve initialized the LLM (OpenAILLM) for entity extraction and set parameters such as the model name (gpt-4o-mini), the JSON response format, and a temperature of 0. The embedder is initialized with OpenAIEmbeddings, which will be used to convert text chunks into vectors for similarity search.
Setting Node Labels
Let’s define different categories of nodes based on our use case:
basic_node_labels = ["Object", "Entity", "Group", "Person", "Organization", "Place"]
academic_node_labels = ["ArticleOrPaper", "PublicationOrJournal"]
climate_change_node_labels = ["GreenhouseGas", "TemperatureRise", "ClimateModel", "CarbonFootprint", "EnergySource"]
node_labels = basic_node_labels + academic_node_labels + climate_change_node_labels
Here, we’ve grouped our node labels into:
- Basic node labels: Generic entity types such as “Person”, “Organization”, etc.
- Academic node labels: Related to academic publications such as articles or journals.
- Climate change node labels: Specific to climate change-related entities.
These labels will help categorize entities within your knowledge graph.
Defining Relationship Types
rel_types = ["AFFECTS", "CAUSES", "ASSOCIATED_WITH", "DESCRIBES", "PREDICTS", "IMPACTS"]
We’ve defined the possible relationships between nodes in the graph. These relationships describe how entities interact or are connected.
Creating the Prompt Template
prompt_template = '''
You are a climate researcher tasked with extracting information from research papers and structuring it in a property graph.
Extract the entities (nodes) and specify their type from the following text.
Also extract the relationships between these nodes.
Return the result as JSON using the following format:
{{"nodes": [ {{"id": "0", "label": "entity type", "properties": {{"name": "entity name"}} }} ],
"relationships": [{{"type": "RELATIONSHIP_TYPE", "start_node_id": "0", "end_node_id": "1", "properties": {{"details": "Relationship details"}} }}] }}
Input text:
{text}
'''
Here, we defined a prompt template for the LLM. The model is given a text (a chunk of a research paper) and must extract:
- Entities (nodes): Identified by type (e.g., Person, Organization) and their properties (e.g., name).
- Relationships: How the entities are related (e.g., “CAUSES”, “ASSOCIATED_WITH”).
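The doubled braces in the template matter. The sketch below assumes the template is filled with Python's `str.format`-style substitution, which the `{{ }}` escaping suggests; the shortened `template` string is a stand-in for illustration, not the full prompt.

```python
import json

# A shortened stand-in for the extraction prompt (not the full template).
template = '''Return the result as JSON using the following format:
{{"nodes": [ {{"id": "0", "label": "entity type", "properties": {{"name": "entity name"}} }} ]}}

Input text:
{text}
'''

# Only {text} is treated as a placeholder; {{ and }} become literal braces.
prompt = template.format(text="Carbon dioxide traps heat in the atmosphere.")
print(prompt)

# After formatting, the example line in the prompt is valid JSON.
json_line = prompt.splitlines()[1]
example = json.loads(json_line)
print(example["nodes"][0]["label"])
```

With single braces, `str.format` would try to interpret the JSON example as placeholders and raise an error, so the escaping is what lets the JSON example and the `{text}` slot coexist.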
Create the Knowledge Graph Pipeline
from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline
Here, we import the required classes:
- FixedSizeSplitter: Splits large text (from PDFs) into smaller chunks.
- SimpleKGPipeline: The main class for building your knowledge graph.
Building the Knowledge Graph Pipeline
kg_builder_pdf = SimpleKGPipeline(
    llm=llm,
    driver=driver,
    text_splitter=FixedSizeSplitter(chunk_size=500, chunk_overlap=100),
    embedder=embedder,
    entities=node_labels,
    relations=rel_types,
    prompt_template=prompt_template,
    from_pdf=True
)
- llm: Language model used for entity extraction (the OpenAILLM instance initialized earlier).
- driver: The Neo4j driver that connects to your Neo4j instance.
- text_splitter: FixedSizeSplitter breaks the PDF text into chunks of 500 characters, with an overlap of 100 characters between consecutive chunks.
- embedder: Embedding model used to convert the text chunks into vector embeddings.
- entities: Specifies the node labels that define the entities in your knowledge graph.
- relations: Specifies the relationship types that connect the nodes in the graph.
- prompt_template: The template that instructs the LLM to extract nodes and relationships.
- from_pdf=True: Tells the pipeline to extract data from PDF files.
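To see what chunk_size and chunk_overlap do, here is a plain-Python sketch of fixed-size splitting with overlap, measuring sizes in characters. The `split_fixed_size` function is a hypothetical illustration of the idea, not the package's FixedSizeSplitter.

```python
def split_fixed_size(text, chunk_size, chunk_overlap):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters. This illustrates
    the idea only; it is not the package's FixedSizeSplitter.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new chunk starts (chunk_size - chunk_overlap) characters later.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


chunks = split_fixed_size("abcdefghijkl", chunk_size=5, chunk_overlap=2)
print(chunks)  # ['abcde', 'defgh', 'ghijk', 'jkl']
```

The overlap repeats the tail of one chunk at the head of the next, so a sentence cut at a chunk boundary still appears intact in at least one chunk, at the cost of storing and embedding some text twice.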
Processing PDFs
Here, we use three different research papers on the greenhouse effect:
pdf_file_paths = ['/home/janvi/Downloads/ToxipediaGreenhouseEffectArchive.pdf',
                  '/home/janvi/Downloads/3.1.pdf',
                  '/home/janvi/Downloads/Shell_Climate_1988.pdf']
for path in pdf_file_paths:
    print(f"Processing: {path}")
    pdf_result = await kg_builder_pdf.run_async(file_path=path)
    print(f"Result: {pdf_result}")
This loop feeds the three PDF files into the SimpleKGPipeline one at a time. It uses run_async to process each document asynchronously and prints the result for each one.
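The top-level await in the loop works in a notebook, where an event loop is already running. In a regular Python script, the same loop has to live inside a coroutine started with asyncio.run. The sketch below shows that pattern; `fake_run_async` is a hypothetical stand-in for `kg_builder_pdf.run_async` so the example runs without a database.

```python
import asyncio


async def fake_run_async(file_path):
    """Hypothetical stand-in for kg_builder_pdf.run_async in this sketch."""
    await asyncio.sleep(0)  # Yield control as a real async call would.
    return f"processed {file_path}"


async def process_all(paths):
    results = []
    for path in paths:
        # Process the files one after another, as the loop above does.
        results.append(await fake_run_async(path))
    return results


# In a plain script, asyncio.run starts the event loop for us.
results = asyncio.run(process_all(["a.pdf", "b.pdf"]))
print(results)
```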
Once complete, you can explore the resulting knowledge graph. The Unified Console provides a good interface for this.
Go to the Query tab and enter the query below to see a sample of the graph.
MATCH p=()-->() RETURN p LIMIT 100;
You can see how the Document, Chunk, and __Entity__ nodes are all connected.
To see the “lexical” portion of the graph containing Document and Chunk nodes, run the following.
MATCH p=(:Chunk)--(:!__Entity__) RETURN p;
Note that these are disconnected components, one for each document we ingested. You can also see the embeddings that have been added to all chunks.
To look at just the domain graph of __Entity__ nodes, you can run the following:
MATCH p=(:!Chunk)-->(:!Chunk) RETURN p;
You will see how different concepts have been extracted and how they connect to one another. This domain graph connects information between the documents.
2. Retrieving Data From Your Knowledge Graph
Once the knowledge graph for greenhouse effect research is built, the next step is retrieving meaningful information to support analysis. The GraphRAG Python package provides flexible retrieval mechanisms tailored to your needs. These include:
- Vector Retriever: Conducts similarity searches using vector embeddings for efficient data retrieval.
- Vector Cypher Retriever: Combines vector search with queries in Cypher, Neo4j’s graph query language, enabling graph traversal to include related nodes and relationships in the retrieval.
- Hybrid Retriever: Merges vector and full-text search for comprehensive data retrieval.
- Hybrid Cypher Retriever: Combines hybrid search with Cypher queries for advanced graph traversal.
- Text2Cypher: Converts natural language queries into Cypher queries, enabling users to retrieve data directly from Neo4j without writing queries by hand.
- Weaviate & Pinecone Neo4j Retriever: Integrates vector searches from external systems like Weaviate or Pinecone with Neo4j nodes using external ID properties.
- Custom Retriever: Offers flexibility for implementing tailored retrieval methods for specific needs.
These retrieval mechanisms support diverse retrieval patterns, improving the relevance and accuracy of retrieval-augmented generation (RAG) pipelines.
Vector Retriever and Knowledge Graph Retrieval
For our greenhouse effect research knowledge graph, we use the Vector Retriever, which relies on Approximate Nearest Neighbor (ANN) vector search. This retriever fetches data by performing similarity searches on the embeddings associated with text chunks stored in the graph.
Setting Up a Vector Index
To enable vector-based retrieval, we create a vector index in Neo4j. This index operates on the text chunks in the graph, allowing the Vector Retriever to pull back the chunks most similar to a query.
By combining Neo4j’s vector search capabilities with these retrieval methods, we can query the knowledge graph to extract valuable information about the causes, effects, and solutions related to the greenhouse effect.
from neo4j_graphrag.indexes import create_vector_index
create_vector_index(driver, name="text_embeddings", label="Chunk",
                    embedding_property="embedding", dimensions=1536, similarity_fn="cosine")
create_vector_index: This function creates a vector index on the Chunk label in Neo4j. The index covers the embeddings (generated from the PDF text) stored in the embedding property of each Chunk node. It uses cosine similarity, and the embeddings have 1536 dimensions, which matches OpenAI embedding models such as text-embedding-ada-002.
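For intuition about what a cosine-similarity index ranks by, the sketch below computes cosine similarity by hand and returns the best-matching chunks. It is an exact, brute-force illustration with made-up 3-dimensional vectors; the Neo4j index uses approximate search over real embeddings, and the `cosine_similarity` and `top_k` helpers are hypothetical names for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, chunks, k):
    """Rank (text, embedding) pairs by similarity to the query, best first."""
    scored = [(cosine_similarity(query, emb), text) for text, emb in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy 3-dimensional embeddings; the real ones here have 1536 dimensions.
chunks = [
    ("CO2 traps heat", [1.0, 0.0, 0.0]),
    ("Methane is a greenhouse gas", [0.8, 0.6, 0.0]),
    ("Shell report from 1988", [0.0, 0.0, 1.0]),
]
best = top_k([1.0, 0.1, 0.0], chunks, k=2)
print(best)
```

Cosine similarity compares direction rather than length, so two texts with similar meaning score close to 1 regardless of the magnitude of their embedding vectors.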
Using the VectorRetriever
from neo4j_graphrag.retrievers import VectorRetriever
vector_retriever = VectorRetriever(
    driver,
    index_name="text_embeddings",
    embedder=embedder,
    return_properties=["text"],
)
VectorRetriever: This component queries the Chunk nodes using vector search, which lets us find the most relevant chunks for the input query. The return_properties parameter ensures that the search results include the text of each chunk.
Searching for Information in the Knowledge Graph
import json
vector_res = vector_retriever.get_search_results(
    query_text="What are the main greenhouse gases contributing to the Greenhouse Effect and their impacts as discussed in the documents?",
    top_k=3
)
for i in vector_res.records:
    print("====\n" + json.dumps(i.data(), indent=4))
- get_search_results: This function performs a vector search with the input query (in this case, asking about greenhouse gases and their impacts).
- top_k=3: We limit the number of results to the 3 most relevant chunks.
- The results are printed as formatted JSON, which includes the text and metadata of the retrieved chunks.
Using the VectorCypherRetriever for Graph Traversal
The VectorCypherRetriever offers a more advanced method of knowledge graph retrieval by combining vector search with Cypher queries. This lets us traverse the graph starting from the chunks that are semantically similar to the query, exploring related entities and their relationships.
Setting Up the VectorCypherRetriever
from neo4j_graphrag.retrievers import VectorCypherRetriever
vc_retriever = VectorCypherRetriever(
    driver,
    index_name="text_embeddings",
    embedder=embedder,
    retrieval_query="""
// 1) Go out 2-3 hops in the entity graph and get relationships
WITH node AS chunk
MATCH (chunk)<-[:FROM_CHUNK]-()-[relList:!FROM_CHUNK]-{1,2}()
UNWIND relList AS rel
// 2) Collect relationships and text chunks
WITH collect(DISTINCT chunk) AS chunks,
  collect(DISTINCT rel) AS rels
// 3) Format and return context
RETURN '=== text ===\n' + apoc.text.join([c in chunks | c.text], '\n---\n') + '\n\n=== kg_rels ===\n' +
  apoc.text.join([r in rels | startNode(r).name + ' - ' + type(r) + '(' + coalesce(r.details, '') + ')' + ' -> ' + endNode(r).name ], '\n---\n') AS info
"""
)
- retrieval_query: This Cypher query defines how the graph is traversed. Starting from each retrieved chunk, it goes 2-3 hops out through the entities extracted from that chunk and collects the relationships between those entities.
- Text and Relationship Formatting: The results are formatted to return the chunk text first, followed by the relationships encountered during the traversal.
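The traversal idea can be sketched without a database. The example below is a rough in-memory analogue: it starts from the entities linked to a chunk and collects relationships within a fixed number of hops, following them in either direction. The `from_chunk` mapping, the relationship tuples, and the `relationships_near` helper are hypothetical toy data and names, not what Neo4j executes.

```python
# Hypothetical toy data: which entities were extracted from which chunk,
# and (start, type, end) relationships between entities.
from_chunk = {"chunk-1": ["CO2"], "chunk-2": ["Methane"]}
relationships = [
    ("CO2", "CAUSES", "Warming"),
    ("Warming", "AFFECTS", "Sea level"),
    ("Sea level", "IMPACTS", "Coastal cities"),
]


def relationships_near(chunk_id, max_hops=2):
    """Collect relationships within max_hops of a chunk's entities."""
    frontier = set(from_chunk.get(chunk_id, []))
    visited = set(frontier)
    found = []
    for _ in range(max_hops):
        next_frontier = set()
        for rel in relationships:
            start, _type, end = rel
            # Follow relationships in either direction from the frontier.
            if start in frontier or end in frontier:
                if rel not in found:
                    found.append(rel)
                for node in (start, end):
                    if node not in visited:
                        visited.add(node)
                        next_frontier.add(node)
        frontier = next_frontier
    return found


rels = relationships_near("chunk-1")
print([f"{s} - {t} -> {e}" for s, t, e in rels])
```

Here the relationship to "Coastal cities" is left out because it sits three entity hops away from "CO2". Bounding the hop count keeps the context passed to the LLM focused on what is closely related to the retrieved chunk.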
Running a Query for Relevant Information
vc_res = vc_retriever.get_search_results(
    query_text="What are the causes and consequences of the Greenhouse Effect as discussed in the provided documents?",
    top_k=3
)
- get_search_results: This method performs a vector search based on the input query. It returns the 3 most relevant chunks and their associated relationships in the knowledge graph.
Extracting and Printing Results
kg_rel_pos = vc_res.records[0]['info'].find('\n\n=== kg_rels ===\n')
# Print the results, separating the text chunk context and the KG context
print("# Text Chunk Context:")
print(vc_res.records[0]['info'][:kg_rel_pos])
print("# KG Context From Relationships:")
print(vc_res.records[0]['info'][kg_rel_pos:])
- kg_rel_pos: This locates where the relationships start in the response.
- The results are then printed, separating the text context from the relationships found in the knowledge graph.
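If you want the two parts as Python lists rather than printed slices, a small parser helps. The sketch below assumes the context string uses a `=== text ===` header, a `=== kg_rels ===` marker, and `---` separators on their own lines; the `split_context` helper and the sample string are hypothetical and hand-written for illustration.

```python
HEADER = "=== text ===\n"
MARKER = "\n\n=== kg_rels ===\n"
SEPARATOR = "\n---\n"


def split_context(info):
    """Split a retrieved context string into text chunks and relationships."""
    text_part, _, rel_part = info.partition(MARKER)
    # Drop the leading header before splitting on the separator.
    if text_part.startswith(HEADER):
        text_part = text_part[len(HEADER):]
    chunks = text_part.split(SEPARATOR) if text_part else []
    rels = rel_part.split(SEPARATOR) if rel_part else []
    return chunks, rels


# A hand-written sample in the assumed layout.
sample = (
    "=== text ===\nchunk one\n---\nchunk two"
    "\n\n=== kg_rels ===\nCO2 - CAUSES() -> Warming"
)
chunks, rels = split_context(sample)
print(chunks)
print(rels)
```

Unlike slicing at a position returned by `find`, `partition` also behaves sensibly when the marker is missing: the whole string is treated as text and the relationship list is empty.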
3. Constructing a GraphRAG Pipeline
To further enhance the retrieval-augmented generation (RAG) process for our greenhouse effect research, we now integrate both the VectorRetriever and the VectorCypherRetriever into GraphRAG pipelines. This lets us retrieve relevant data and use that context to generate responses grounded in the knowledge graph, which improves the accuracy and reliability of the generated answers.
Instantiating and Running GraphRAG
The GraphRAG Python package simplifies the process of instantiating and running RAG pipelines. You can create a GraphRAG pipeline with the GraphRAG class. At its core, the class requires two essential components:
- LLM (large language model): Responsible for generating natural language responses based on the retrieved context.
- Retriever: Fetches relevant information from the knowledge graph (e.g., using VectorRetriever or VectorCypherRetriever).
Setting Up the GraphRAG Pipeline
from neo4j_graphrag.llm import OpenAILLM as LLM
from neo4j_graphrag.generation import RagTemplate
from neo4j_graphrag.generation.graphrag import GraphRAG
llm = LLM(model_name="gpt-4o", model_params={"temperature": 0.0})
rag_template = RagTemplate(template='''Answer the Question using the following Context. Only respond with information mentioned in the Context. Do not inject any speculative information not mentioned.
# Question:
{query_text}
# Context:
{context}
# Answer:
''', expected_inputs=['query_text', 'context'])
- RagTemplate: The template instructs the LLM to respond only on the basis of the provided context and to avoid speculative answers.
- GraphRAG: The GraphRAG class uses a language model and a retriever to pull in context and answer the query. We create two pipelines, one with vector_retriever and one with vc_retriever.
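Conceptually, the pipeline fills the template's two placeholders with the user's question and the retrieved context before calling the LLM. The sketch below shows that step in plain Python; the shortened `TEMPLATE`, the `build_prompt` helper, and joining retrieved items with newlines are assumptions for illustration, not the package's internals.

```python
# A shortened stand-in for the RAG prompt with the same two placeholders.
TEMPLATE = """Answer the Question using the following Context.

# Question:
{query_text}

# Context:
{context}

# Answer:
"""


def build_prompt(query_text, retrieved_items):
    """Join retrieved snippets into one context block and fill the template."""
    context = "\n".join(retrieved_items)
    return TEMPLATE.format(query_text=query_text, context=context)


prompt = build_prompt(
    "What causes the greenhouse effect?",
    ["CO2 traps heat.", "Methane is a greenhouse gas."],
)
print(prompt)
```

Because the answer can only be as good as the context placed in this slot, swapping the retriever changes what the LLM sees without changing the prompt itself.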
Creating the GraphRAG Pipelines
v_rag = GraphRAG(llm=llm, retriever=vector_retriever, prompt_template=rag_template)
vc_rag = GraphRAG(llm=llm, retriever=vc_retriever, prompt_template=rag_template)
- v_rag: Uses the VectorRetriever to search for relevant text chunks and answer questions.
- vc_rag: Uses the VectorCypherRetriever to both search for relevant text and traverse relationships in the knowledge graph.
Now we execute queries using both the VectorRetriever and the VectorCypherRetriever through the GraphRAG pipelines to retrieve context and generate answers from the knowledge graph. Here’s a breakdown of the code:
Query 1: “List the causes, effects, and solutions for the Greenhouse Effect.”
This query compares the answers produced by vector-based retrieval and by vector + Cypher graph traversal:
q = "List the causes, effects, and solutions for the Greenhouse Effect."
print(f"Vector Response: \n{v_rag.search(q, retriever_config={'top_k':5}).answer}")
print("\n===========================\n")
print(f"Vector + Cypher Response: \n{vc_rag.search(q, retriever_config={'top_k':5}).answer}")
Query 2: “Explain the Greenhouse Effect in detail. Include its natural process, human-induced causes, global warming impacts, and climate change effects as discussed in the provided documents.”
Here, we ask for a more detailed explanation. The return_context=True flag returns the retrieved context along with the answer:
q = "Explain the Greenhouse Effect in detail. Include its natural process, human-induced causes, impacts on global warming, and its effects on climate change as discussed in the provided documents."
v_rag_result = v_rag.search(q, retriever_config={'top_k': 5}, return_context=True)
vc_rag_result = vc_rag.search(q, retriever_config={'top_k': 5}, return_context=True)
print(f"Vector Response: \n{v_rag_result.answer}")
print("\n===========================\n")
print(f"Vector + Cypher Response: \n{vc_rag_result.answer}")
Exploring Retrieved Content: After getting the context results, we parse and print the contents returned by the vector and Cypher retrievers:
import ast
for i in v_rag_result.retriever_result.items:
    # literal_eval parses the dict-like string without executing arbitrary code
    print(json.dumps(ast.literal_eval(i.content), indent=1))
For the vc_rag_result, we split the content and filter for any text containing the keyword “deal with”:
vc_ls = vc_rag_result.retriever_result.items[0].content.split('\n---\n')
for i in vc_ls:
    if "deal with" in i:
        print(i)
Query 3: “Can you summarize the Greenhouse Effect?”
Finally, we ask for a summary in list format. As with the earlier queries, we retrieve the results and print the answers:
q = "Can you summarize the Greenhouse Effect? Include its natural process, greenhouse gases involved, impacts on the environment and human health, and challenges in addressing climate change. Provide in list format with details for each item."
print(f"Vector Response: \n{v_rag.search(q, retriever_config={'top_k': 5}).answer}")
print("\n===========================\n")
print(f"Vector + Cypher Response: \n{vc_rag.search(q, retriever_config={'top_k': 5}).answer}")
Conclusion
This article explored how the GraphRAG Python package can enhance the retrieval-augmented generation (RAG) process by integrating knowledge graphs with large language models (LLMs). We demonstrated how to create a knowledge graph from research documents related to the greenhouse effect and how to store and manage this graph using Neo4j. By defining the knowledge graph pipeline and using retrieval methods such as VectorRetriever and VectorCypherRetriever, we showed how to retrieve relevant information from the graph to generate accurate and contextually relevant responses.
Combining knowledge graphs with RAG helps address common issues such as hallucinations and provides domain-specific context that improves the quality of responses. By incorporating more than one retrieval technique, we also improved the accuracy and relevance of the generated content, making it more reliable and useful for answering complex questions related to the greenhouse effect.
Overall, GraphRAG with Neo4j offers a powerful toolset for building knowledge-powered applications that require both accurate data retrieval and natural language generation. Neo4j’s graph capabilities help keep responses contextually grounded and informed by structured and semi-structured data, offering a more robust solution than traditional RAG methods.
Frequently Asked Questions
Ans. GraphRAG is a Python package that combines knowledge graphs with retrieval-augmented generation (RAG) to improve the accuracy and relevance of responses from large language models (LLMs). It retrieves relevant information from knowledge graphs, processes it, and uses it to provide contextually grounded answers to queries. This combination helps mitigate issues like hallucinations, which are common in traditional LLM-based solutions.
Ans. Neo4j is a powerful graph database that efficiently stores and manages relationships between entities, making it an ideal platform for creating knowledge graphs. It supports advanced graph queries using Cypher, which allows for powerful data retrieval and graph traversal. GraphRAG with Neo4j lets you use these capabilities to integrate both structured and semi-structured data into your RAG workflows.
Ans. GraphRAG offers several retrievers for various data retrieval patterns:
- Vector Retriever
- Vector Cypher Retriever
- Hybrid Retriever
- Hybrid Cypher Retriever
- Text2Cypher
- Custom Retriever
Ans. GraphRAG addresses the issue of hallucinations by providing LLMs with structured, domain-specific data from knowledge graphs. Instead of relying solely on the language model’s internal knowledge, GraphRAG grounds the model’s responses in reliable and relevant information stored in the graph. This makes the responses more accurate and contextually grounded.
Ans. The Hybrid Retriever combines vector search and full-text search to retrieve data more comprehensively. This lets GraphRAG pull both semantically similar data found through vectors and exact textual matches, improving the accuracy and depth of retrieval. It’s particularly useful for complex queries that require context from diverse data sources.