The Flow of GraphRAG

Current RAG techniques cannot answer global questions about a corpus, e.g., "What are the main topics of the corpus?" The answer to such a question usually does not appear verbatim anywhere in the corpus; producing it requires understanding the corpus as a whole and summarizing it. This paper1 calls such global questions query-focused summarization (QFS) problems, and a naive RAG technique cannot handle them.

It is also unrealistic to put all the text of the corpus into the LLM's context window. Even if we could, the LLM might miss information in the middle of the context window (the well-known "lost in the middle" problem).

So the authors proposed GraphRAG1 to solve the aforementioned problems.

Tip

To better understand the GraphRAG algorithm, it helps to have some background in Knowledge Graphs (KG), community detection, etc.

flowchart LR
    Document --> TextChunk1 & TextChunk2 & dots(...)

This step is quite intuitive: the document is split into multiple text chunks for further processing.
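To make this step concrete, here is a minimal sketch of token-based chunking, assuming OpenAI's tiktoken tokenizer; the chunk size and overlap below are illustrative values, not the paper's defaults.

import tiktoken  # OpenAI's tokenizer library

def split_into_chunks(text: str, chunk_tokens: int = 600, overlap: int = 100) -> list[str]:
    """Split a document into overlapping chunks of roughly chunk_tokens tokens."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = chunk_tokens - overlap  # consecutive chunks share `overlap` tokens
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + chunk_tokens]))
    return chunks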

flowchart LR
    TextChunk --> Entity & Relationship & Claim

By leveraging LLMs, we can extract the following information from each text chunk:

  • Entity, with a title, type, and description
  • Relationship, with a source, target, and description
  • Claim, i.e., a factual statement linked to the entities
Tip

If we find duplicate entities/relationships (matching on everything except the description), they are merged by collecting the descriptions into an array. That is to say, an entity/relationship may have multiple descriptions.
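A small sketch of what this merging could look like; the Entity dataclass and the merge helper are my own illustration, not the official implementation's schema.

from dataclasses import dataclass, field

@dataclass
class Entity:
    title: str
    type: str
    descriptions: list[str] = field(default_factory=list)

def merge_entities(extracted: list[tuple[str, str, str]]) -> dict[tuple[str, str], Entity]:
    """Merge duplicates (same title and type) into one entity with a description array."""
    merged: dict[tuple[str, str], Entity] = {}
    for title, etype, description in extracted:
        entity = merged.setdefault((title, etype), Entity(title, etype))
        entity.descriptions.append(description)
    return merged

# Two mentions of the same entity collapse into one record with two descriptions.
merged = merge_entities([
    ("GRAPHRAG", "TECHNOLOGY", "A RAG variant built on a knowledge graph."),
    ("GRAPHRAG", "TECHNOLOGY", "A method for query-focused summarization."),
])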

The extraction is done by prompting the LLM, using few-shot learning / in-context learning techniques. Clearly, the key to good extraction is the prompt itself.

Besides crafting the prompt, the authors1 also suggest that the extraction can be performed in multiple rounds to encourage the LLM to find entities missed in the prior round. The trick is telling the LLM that MANY entities were missed in the last extraction.
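A sketch of this multi-round "gleaning" loop, assuming a generic llm(prompt) -> str callable; the prompt wording is paraphrased from the idea above, and parse_records is a deliberately naive stand-in parser.

def parse_records(text: str) -> list[str]:
    # Naive stand-in parser: one extracted record per non-empty line.
    return [line.strip() for line in text.splitlines() if line.strip()]

def extract_with_gleaning(llm, chunk: str, max_gleanings: int = 2) -> list[str]:
    """Extract entities/relationships, then repeatedly ask for the ones missed."""
    history = f"Extract all entities and relationships from:\n{chunk}\n"
    reply = llm(history)
    records = parse_records(reply)
    for _ in range(max_gleanings):
        # The trick: assert that MANY entities were missed to push the
        # model to keep looking at the same chunk.
        history += reply + "\nMANY entities were missed in the last extraction. Add them below.\n"
        reply = llm(history)
        records += parse_records(reply)
    return records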

Tip

Notice: a larger text chunk size means fewer LLM calls, which makes the extraction process significantly faster. However, the authors1 found that fewer entities/relationships are extracted as the chunk size grows. To apply the GraphRAG technique, the chunk size should therefore be set to a reasonable value. You may refer to Figure 2 in the original paper.

flowchart LR
    Document(Document in the form of entities, relationships, and claims)
    subgraph community_tree
        root(Community1)
        root2(Community2)
        root3(Community3)
        root --> root2 & root3
        root --> dots(...)
        root2 --> dots1(...)
        root3 --> dots2(...)
        dots --> dots3(...)
    end
    Document --> community_tree

In the previous step, we effectively built a KG for each document. Next, we apply the Leiden community detection algorithm to this KG. The output of the Leiden algorithm is a list of communities. Note that these communities are tree-structured (hereinafter referred to as the community tree), i.e., they form a hierarchy in which sub-communities are nested inside their parents.
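A minimal sketch of this step using python-igraph and leidenalg on a toy graph. Note that find_partition returns a single flat level of communities; GraphRAG applies Leiden hierarchically (splitting large communities further) to obtain the community tree.

import igraph as ig
import leidenalg

# Toy knowledge graph: vertices are entities, edges are relationships.
g = ig.Graph()
g.add_vertices(["NEO4J", "GRAPHRAG", "LEIDEN", "LOUVAIN"])
g.add_edges([("NEO4J", "GRAPHRAG"), ("GRAPHRAG", "LEIDEN"), ("LEIDEN", "LOUVAIN")])

# One flat level of communities, maximizing modularity.
partition = leidenalg.find_partition(g, leidenalg.ModularityVertexPartition)
for community_id, members in enumerate(partition):
    print(community_id, [g.vs[v]["name"] for v in members])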

flowchart LR
    Community --> CommunitySummary

The community summary is also generated by the LLM. Different types of communities call for different strategies:

  • Leaf-level community
    • Sort all relationships in decreasing order of the combined degree of their source and target nodes. 🤔 I guess a higher combined degree indicates a more important relationship.
    • For each relationship, add the following text to the LLM's context window until the token limit is reached (see the sketch after this list):
      • The description of the source entity
      • The description of the target entity
      • The linked claims
      • The description of the relationship
  • Higher-level community (its children are sub-communities)
    • If all the element descriptions fit within the token limit, pretend that the sub-communities do not exist and process the community exactly as a leaf-level one.
    • Otherwise, sort the sub-communities in decreasing order of description length (🤔 I guess more text carries more information), and iteratively substitute each sub-community's (shorter) summary for the (longer) descriptions of its elements until the context fits. In my opinion, the purpose of this substitution order is to retain as much fine-grained information as possible.
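Here is the sketch of leaf-level context assembly referenced in the list above; the relationship fields and the count_tokens helper are assumptions standing in for whatever the real pipeline provides.

def build_leaf_context(relationships: list[dict], token_limit: int, count_tokens) -> str:
    """Assemble the LLM context for summarizing a leaf-level community."""
    # Prioritize relationships whose endpoints have a high combined degree.
    ranked = sorted(relationships,
                    key=lambda r: r["source_degree"] + r["target_degree"],
                    reverse=True)
    parts, used = [], 0
    for rel in ranked:
        block = "\n".join([rel["source_description"],
                           rel["target_description"],
                           *rel["claims"],
                           rel["description"]])
        cost = count_tokens(block)
        if used + cost > token_limit:
            break  # token limit reached; stop adding relationships
        parts.append(block)
        used += cost
    return "\n\n".join(parts)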
Tip

To summarize, community summaries are generated in a bottom-up manner: lower-level summaries are produced first so that higher levels can reuse them.

flowchart LR
    a(Random shuffled community summaries)
    a --> Chunk1 & Chunk2 & annoy(...)
    Chunk1 --> p(Intermediate Answer1)
    Chunk2 --> q(Intermediate Answer2)
    annoy --> r(...)
    window(Context Window)
    p & q & r --> window
    window --> ans(Global Answer)

Now each community is associated with a community summary. These summaries can be used to answer the user's query with the following steps:

  1. Shuffle the community summaries randomly and split them into chunks of a pre-specified token size. Shuffling ensures that the relevant information is spread across chunks instead of being concentrated in a single one.
  2. Map phase: generate intermediate answers in parallel, one per chunk. The LLM is also asked to score each intermediate answer by how helpful it is in answering the query.
  3. Reduce phase: sort the intermediate answers in decreasing order of score and add them to the context window until the token limit is hit. The LLM then produces the global answer based on the information from all communities (see the sketch after this list).
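Here is the map-reduce sketch referenced in the list above, again assuming hypothetical llm(prompt) -> str and count_tokens helpers and a simplified "score|answer" reply format.

import random

def answer_global_query(llm, query: str, summaries: list[str],
                        chunk_tokens: int, token_limit: int, count_tokens) -> str:
    """Answer a global question from community summaries via map-reduce."""
    # 1. Shuffle, then pack summaries into chunks of roughly chunk_tokens tokens.
    random.shuffle(summaries)
    chunks, current, used = [], [], 0
    for summary in summaries:
        cost = count_tokens(summary)
        if current and used + cost > chunk_tokens:
            chunks.append("\n".join(current))
            current, used = [], 0
        current.append(summary)
        used += cost
    if current:
        chunks.append("\n".join(current))

    # 2. Map: one scored intermediate answer per chunk.
    scored = []
    for chunk in chunks:
        reply = llm(f"Answer '{query}' using only:\n{chunk}\n"
                    "Reply as 'helpfulness_score|answer'.")
        score, _, answer = reply.partition("|")
        scored.append((int(score.strip()), answer.strip()))

    # 3. Reduce: keep the most helpful answers that fit, then synthesize.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    kept, used = [], 0
    for score, answer in scored:
        cost = count_tokens(answer)
        if used + cost > token_limit:
            break
        kept.append(answer)
        used += cost
    return llm(f"Combine these partial answers to '{query}':\n" + "\n".join(kept))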

That’s all for this post about the GraphRAG technique. In my opinion, the flow of GraphRAG is quite intuitive and reasonable. Microsoft also provides the official implementation; however, it only supports OpenAI's LLMs. After some searching on Google, I found a LangChain implementation called langchain-graphrag and plan to check it out to see if it works :)


  1. Edge, Darren, et al. “From Local to Global: A Graph RAG Approach to Query-Focused Summarization.” arXiv preprint arXiv:2404.16130 (2024).