The Flow of GraphRAG
Motivation
Current RAG techniques cannot answer global questions about a corpus. For example, we may want to know what the main topics of the corpus are. The answer usually does not exist verbatim in the corpus; producing it requires understanding the whole corpus and summarizing it. Such global questions are called query-focused summarization (QFS) problems in the paper[^1]. A naive RAG technique cannot handle this situation.
Moreover, it is unrealistic to put all the text of the corpus into the context window of an LLM. Even if we could, the LLM may miss information in the middle of the context window.
So the authors proposed GraphRAG[^1] to solve the aforementioned problems.
GraphRAG
To get a better understanding of the GraphRAG algorithm, it helps to have some background in Knowledge Graphs (KG), community detection, etc.
Document -> TextChunk
```mermaid
flowchart LR
    Document --> TextChunk1 & TextChunk2 & dots(...)
```
This step is quite intuitive: the document is split into multiple text chunks for further processing.
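For instance, here is a minimal sketch of token-based chunking, using tiktoken for token counting; the `chunk_size` and `overlap` values are illustrative, not the paper's settings:

```python
import tiktoken


def split_into_chunks(text: str, chunk_size: int = 600, overlap: int = 100) -> list[str]:
    """Split text into overlapping windows of roughly `chunk_size` tokens."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = chunk_size - overlap  # consecutive chunks share `overlap` tokens
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))
    return chunks
```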
TextChunk -> Entity & Relationship & Claim
```mermaid
flowchart LR
    TextChunk --> Entity & Relationship & Claim
```
By leveraging LLMs, we can extract the following information:

- Entity with `title`, `type`, and `description`
- Relationship with `source`, `description`, and `target`
- Claim

If we find duplicate entities/relationships (identical in every field except `description`), they are merged by creating a description array. That is to say, an entity/relationship may have multiple descriptions.
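As a concrete illustration, here is a minimal sketch of that merge step; the dataclass fields mirror the list above, but the names and the matching key are my assumptions, not the official implementation:

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    title: str
    type: str
    descriptions: list[str] = field(default_factory=list)  # duplicates get merged here


def merge_entities(entities: list[Entity]) -> dict[tuple[str, str], Entity]:
    """Merge entities that share (title, type), collecting all their descriptions."""
    merged: dict[tuple[str, str], Entity] = {}
    for e in entities:
        key = (e.title, e.type)
        if key in merged:
            merged[key].descriptions.extend(e.descriptions)
        else:
            merged[key] = Entity(e.title, e.type, list(e.descriptions))
    return merged
```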
The extraction method is to write a prompt using few-shot learning and in-context learning techniques. Clearly, the key to extraction quality is the prompt itself.
Besides crafting the prompt, the authors[^1] also suggest performing the extraction in multiple rounds to encourage the LLM to find entities missed in the prior round. The trick is telling the LLM that MANY entities were missed in the last extraction.
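Here is a rough sketch of that multi-round "gleaning" loop; `llm_complete` is a hypothetical wrapper around any chat-completion API, and a real implementation would also carry the conversation history between rounds:

```python
from typing import Callable


def extract_with_gleaning(
    chunk: str,
    llm_complete: Callable[[str], str],  # hypothetical chat-completion wrapper
    max_rounds: int = 2,
) -> list[str]:
    """Extract once, then claim entities were missed to trigger re-extraction."""
    rounds = [llm_complete(f"Extract all entities and relationships from:\n{chunk}")]
    for _ in range(max_rounds):
        followup = llm_complete(
            "MANY entities were missed in the last extraction. "
            "Add the missing ones using the same format."
        )
        if not followup.strip():  # nothing new found; stop early
            break
        rounds.append(followup)
    return rounds
```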
Notice: a larger text chunk size results in fewer LLM calls, making the extraction process significantly faster. However, the authors[^1] found that the number of extracted entities/relationships also decreases. To apply the GraphRAG technique, the text chunk size should be set to a reasonable value; see Figure 2 in the original paper.
Graph Communities Detection
```mermaid
flowchart LR
    Document("Document in the form of entities, relationships, and claims")
    subgraph community_tree
        root(Community1)
        root2(Community2)
        root3(Community3)
        root --> root2 & root3
        root --> dots(...)
        root2 --> dots1(...)
        root3 --> dots2(...)
        dots --> dots3(...)
    end
    Document --> community_tree
```
In the previous steps, we effectively built a KG from the documents. Next, we can apply the Leiden community detection algorithm to this KG. The output of the Leiden algorithm is a list of communities. Note that these communities are tree-structured (hereinafter referred to as the community tree), which means they have a hierarchical structure.
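If I understand correctly, the official implementation uses graspologic's hierarchical Leiden for this step; here is a minimal sketch on a toy networkx graph (the `max_cluster_size` value is illustrative):

```python
import networkx as nx
from graspologic.partition import hierarchical_leiden

# Toy stand-in for the KG: nodes are entities, edges are relationships.
kg = nx.karate_club_graph()

# Returns one row per (node, level) assignment, which forms the community tree.
clusters = hierarchical_leiden(kg, max_cluster_size=10)
for c in clusters:
    print(c.level, c.cluster, c.node)
```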
Graph Communities -> Community Summaries
```mermaid
flowchart LR
    Community --> CommunitySummary
```
The community summaries are also generated by the LLM. Different types of communities call for different strategies:
- Leaf-level community
  - Sort all relationships in decreasing order of combined `source` and `target` degrees. 🤔 I guess the degree value indicates the importance of a relationship.
  - For each relationship, add the following text to the LLM's context window until the token limit is reached:
    - The `description` of the `source`
    - The `description` of the `target`
    - The linked claims
    - The `description` of the relationship
- Higher-level community (its children are sub-communities)
  - If the token limit allows, pretend that the sub-communities do not exist and process all element summaries according to the leaf-level method above.
  - Otherwise, sort the sub-communities by their `description` length in decreasing order (🤔 I guess more text carries more information), and then substitute each sub-community's (shorter) summary for its (longer) merged element descriptions, just as at the leaf level, until the context fits. In my opinion, the purpose of substitution is to keep as much fine-grained information as possible.
To summarize, community summaries are generated in a bottom-up manner.
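Here is a minimal sketch of the leaf-level context packing described above; the `Node`/`Rel` types and the whitespace-based token counter are simplifications of mine, not the paper's exact data model:

```python
from dataclasses import dataclass


@dataclass
class Node:
    description: str
    degree: int  # number of edges touching this entity in the KG


@dataclass
class Rel:
    source: Node
    target: Node
    description: str
    claims: list[str]


def num_tokens(text: str) -> int:
    return len(text.split())  # crude stand-in for a real tokenizer count


def build_leaf_context(relationships: list[Rel], token_limit: int) -> str:
    """Pack relationship texts into a context, most prominent first."""
    # Higher combined endpoint degree first, per the sorting rule above.
    ranked = sorted(
        relationships,
        key=lambda r: r.source.degree + r.target.degree,
        reverse=True,
    )
    parts: list[str] = []
    used = 0
    for rel in ranked:
        block = "\n".join(
            [rel.source.description, rel.target.description]
            + rel.claims
            + [rel.description]
        )
        cost = num_tokens(block)
        if used + cost > token_limit:
            break
        parts.append(block)
        used += cost
    return "\n\n".join(parts)
```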
Community Summaries -> Community Answers -> Global Answer
```mermaid
flowchart LR
    a(Randomly shuffled community summaries)
    a --> Chunk1 & Chunk2 & annoy(...)
    Chunk1 --> p(Intermediate Answer1)
    Chunk2 --> q(Intermediate Answer2)
    annoy --> r(...)
    window(Context Window)
    p & q & r --> window
    window --> ans(Global Answer)
```
Now each community is associated with a community summary. These summaries can be exploited to answer the user's query with the following steps:
- Shuffle the community summaries randomly and split them into multiple chunks with a pre-specified token size. This step ensures that the relevant information is distributed across chunks.
- Map phase: Generate intermediate answers in parallel, one for each chunk. The LLM is also asked to score how helpful each generated answer is.
- Reduce phase: Sort all the intermediate answers in decreasing order of score and add them to the context window until the token limit is hit. The LLM then gives the global answer based on this information from all communities.
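Putting the two phases together, here is a minimal sketch of the map-reduce flow; the `answer_and_score` and `final_answer` callables are hypothetical stand-ins for the real prompt/LLM plumbing:

```python
import random
from typing import Callable


def global_answer(
    query: str,
    summaries: list[str],
    answer_and_score: Callable[[str, str], tuple[str, int]],  # map-step LLM call
    final_answer: Callable[[str, str], str],                  # reduce-step LLM call
    token_limit: int,
    group_size: int = 8,  # grouping by count for simplicity; the paper splits by token size
) -> str:
    random.shuffle(summaries)  # spread relevant info evenly across chunks
    chunks = [
        "\n".join(summaries[i:i + group_size])
        for i in range(0, len(summaries), group_size)
    ]

    # Map: one scored intermediate answer per chunk; the paper filters out score-0 answers.
    results = [answer_and_score(query, c) for c in chunks]
    scored = [(score, answer) for answer, score in results if score > 0]

    # Reduce: pack the most helpful answers into one context, then answer globally.
    scored.sort(key=lambda t: t[0], reverse=True)
    context: list[str] = []
    used = 0
    for score, answer in scored:
        cost = len(answer.split())  # crude token-count stand-in
        if used + cost > token_limit:
            break
        context.append(answer)
        used += cost
    return final_answer(query, "\n\n".join(context))
```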
Wrap-up
That's all for this post about the GraphRAG technique. In my opinion, the flow of GraphRAG is quite intuitive and reasonable. Microsoft also provides the official implementation. However, it only supports OpenAI's LLMs. After some searching on Google, I found a LangChain implementation called langchain-graphrag and plan to check it out to see if it works :)
[^1]: Edge, Darren, et al. "From Local to Global: A Graph RAG Approach to Query-Focused Summarization." arXiv preprint arXiv:2404.16130 (2024).