
Many applications of language models (LMs) involve generating content based on source material, such as answering questions, summarizing information, and drafting documents. A critical challenge for these applications is that LMs may produce content that is not supported by the source text – a phenomenon known as “closed-domain hallucination.”1
Existing methods for detecting closed-domain hallucination typically compare a given LM output to the source text, implicitly assuming that there is only a single output to evaluate. However, applications of LMs increasingly involve processes with multiple generative steps: LMs generate intermediate outputs that serve as inputs to subsequent steps and culminate in a final output. Many agentic workflows follow this paradigm (e.g., each agent is responsible for a specific document or sub-task, and their outputs are synthesized into a final response).
In our paper “VeriTrail: Closed-Domain Hallucination Detection with Traceability,” we argue that, given the complexity of processes with multiple generative steps, detecting hallucination in the final output is necessary but not sufficient. We also need traceability, which has two components:
- Provenance: if the final output is supported by the source text, we should be able to trace its path through the intermediate outputs to the source.
- Error Localization: if the final output is not supported by the source text, we should be able to trace where the error was likely introduced.
Our paper presents VeriTrail, the first closed-domain hallucination detection method designed to provide traceability for processes with any number of generative steps. We also demonstrate that VeriTrail outperforms baseline methods commonly used for hallucination detection. In this blog post, we provide an overview of VeriTrail’s design and performance.2
VeriTrail’s hallucination detection process
A key idea leveraged by VeriTrail is that a wide range of generative processes can be represented as a directed acyclic graph (DAG). Each node in the DAG represents a piece of text (i.e., source material, an intermediate output, or the final output) and each edge from node A to node B indicates that A was used as an input to produce B. Each node is assigned a unique ID, as well as a stage reflecting its position in the generative process.
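To make this concrete, here is a minimal Python sketch of one way such a DAG could be represented. The class and field names are illustrative assumptions for this post, not VeriTrail's actual schema, and the toy graph below is not the one shown in Figure 1.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One piece of text in the generative process (illustrative fields)."""
    node_id: int                                    # unique ID
    stage: int                                      # position in the generative process (1 = source text)
    text: str                                       # source chunk, intermediate output, or final output
    input_ids: list = field(default_factory=list)   # edges: IDs of the nodes used as inputs to produce this one

# A toy DAG: two source chunks are summarized, and the summary is used to produce the final answer.
dag = {
    1: Node(1, stage=1, text="Source chunk A ..."),
    2: Node(2, stage=1, text="Source chunk B ..."),
    3: Node(3, stage=2, text="Summary of chunks A and B ...", input_ids=[1, 2]),
    4: Node(4, stage=3, text="Final answer ...", input_ids=[3]),
}
```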
An example of a process with multiple generative steps is GraphRAG. A DAG representing a GraphRAG run is illustrated in Figure 1, where the boxes and arrows correspond to nodes and edges, respectively.3

VeriTrail takes as input a DAG representing a completed generative process and aims to determine whether the final output is fully supported by the source text. It begins by extracting claims (i.e., self-contained, verifiable statements) from the final output using Claimify. VeriTrail verifies claims in the reverse order of the generative process: it starts from the final output and moves toward the source text. Each claim is verified separately. Below, we include two case studies that illustrate how VeriTrail works, using the DAG from Figure 1.
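A minimal driver for this flow might look like the sketch below. Here `extract_claims` is a hypothetical stand-in for Claimify, `select_evidence` and `generate_verdict` are hypothetical stand-ins for the LM prompts introduced in the case studies that follow, and `verify_claim` is the per-claim loop sketched after Case Study 2; none of these names come from VeriTrail's actual implementation.

```python
def detect_hallucination(dag, final_node_id, extract_claims, select_evidence, generate_verdict):
    """Extract claims from the final output, then verify each claim separately,
    starting from the final output and moving toward the source text."""
    claims = extract_claims(dag[final_node_id].text)   # self-contained, verifiable statements
    return {
        claim: verify_claim(claim, dag, final_node_id, select_evidence, generate_verdict)
        for claim in claims
    }
```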
Case study 1: A “Fully Supported” claim

Figure 2 shows an example of a claim that VeriTrail determined was not hallucinated:
- In Iteration 1, VeriTrail identified the nodes that were used as inputs for the final answer: Nodes 15 and 16. Each identified node was split into sentences, and each sentence was programmatically assigned a unique ID.
- An LM then performed Evidence Selection, selecting all sentence IDs that strongly implied the truth or falsehood of the claim. The LM also generated a summary of the selected sentences (not shown in Figure 2). In this example, a sentence was selected from Node 15.
- Next, an LM performed Verdict Generation. If no sentences had been selected during Evidence Selection, the claim would have been assigned a “Not Fully Supported” verdict automatically. Since evidence had been selected, an LM was prompted to classify the claim as “Fully Supported,” “Not Fully Supported,” or “Inconclusive” based on that evidence. In this case, the verdict was “Fully Supported.”
- Since the verdict in Iteration 1 was “Fully Supported,” VeriTrail proceeded to Iteration 2. It considered the nodes from which at least one sentence was selected in the latest Evidence Selection step (Node 15) and identified their input nodes (Nodes 12 and 13). VeriTrail repeated Evidence Selection and Verdict Generation for the identified nodes. Once again, the verdict was “Fully Supported.” This process – identifying candidate nodes, performing Evidence Selection and Verdict Generation – was repeated in Iteration 3, where the verdict was still “Fully Supported,” and likewise in Iteration 4.
- In Iteration 4, a single source text chunk was verified. Since the source text, by definition, does not have any inputs, verification terminated and the verdict was deemed final.
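The single iteration walked through above (sentence numbering, Evidence Selection, Verdict Generation) could be sketched roughly as follows, reusing the toy Node structure from earlier. `select_evidence` and `generate_verdict` are hypothetical callables standing in for the LM prompts, and the sentence splitting is deliberately naive; this is an illustration of the described behavior, not VeriTrail's actual code.

```python
def run_iteration(candidate_ids, dag, claim, select_evidence, generate_verdict):
    """One illustrative iteration over a set of candidate nodes."""
    # Split each candidate node into sentences and assign each sentence a unique ID.
    numbered = {}
    for nid in candidate_ids:
        for sentence in dag[nid].text.split(". "):       # naive sentence splitting, for illustration only
            numbered[len(numbered)] = (nid, sentence)

    # Evidence Selection: an LM returns the IDs of sentences that strongly imply
    # the truth or falsehood of the claim, plus a summary of the selected sentences.
    selected_ids, summary = select_evidence(claim, numbered)

    # With no selected evidence, the claim is "Not Fully Supported" by construction.
    if not selected_ids:
        return "Not Fully Supported", [], summary

    # Verdict Generation: an LM classifies the claim based only on the selected evidence.
    evidence = [numbered[sid] for sid in selected_ids]
    verdict = generate_verdict(claim, evidence)          # "Fully Supported" / "Not Fully Supported" / "Inconclusive"
    evidence_node_ids = sorted({nid for nid, _ in evidence})
    return verdict, evidence_node_ids, summary
```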
Case study 2: A “Not Fully Supported” claim

Figure 3 provides an example of a claim where VeriTrail identified hallucination:
- In Iteration 1, VeriTrail identified the nodes used as inputs for the final answer: Nodes 15 and 16. After Evidence Selection and Verdict Generation, the verdict was “Not Fully Supported.” Users can configure the maximum number of consecutive “Not Fully Supported” verdicts permitted. If the maximum had been set to 1, verification would have terminated here, and the verdict would have been deemed final. Let’s assume the maximum was set to 2, meaning that VeriTrail had to perform at least one more iteration.
- Even though evidence was selected only from Node 15 in Iteration 1, VeriTrail checked the input nodes for both Node 15 and Node 16 (i.e., Nodes 12, 13, and 14) in Iteration 2. Recall that in Case Study 1, where the verdict was “Fully Supported,” VeriTrail checked only the input nodes for Node 15. Why was the “Not Fully Supported” claim handled differently? If the Evidence Selection step had overlooked relevant evidence, the “Not Fully Supported” verdict might be incorrect. In that case, continuing verification based solely on the selected evidence (i.e., Node 15) would propagate the mistake, defeating the purpose of repeated verification.
- In Iteration 2, Evidence Selection and Verdict Generation were repeated for Nodes 12, 13, and 14. Once again, the verdict was “Not Fully Supported.” Since this was the second consecutive “Not Fully Supported” verdict, verification terminated and the verdict was deemed final.
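Putting the two case studies together, the outer loop (narrowing after supportive verdicts, widening and counting consecutive “Not Fully Supported” verdicts) could be sketched as follows, reusing `run_iteration` from the earlier sketch. Again, this is a simplified illustration under the assumptions stated above, not VeriTrail's actual implementation.

```python
def verify_claim(claim, dag, final_node_id, select_evidence, generate_verdict, max_consecutive_nfs=2):
    """Illustrative outer loop for verifying a single claim against the DAG."""
    candidates = list(dag[final_node_id].input_ids)       # Iteration 1: inputs of the final output
    consecutive_nfs = 0
    verdict = "Inconclusive"
    while candidates:
        verdict, evidence_node_ids, _ = run_iteration(
            candidates, dag, claim, select_evidence, generate_verdict)
        if verdict == "Not Fully Supported":
            consecutive_nfs += 1
            if consecutive_nfs >= max_consecutive_nfs:    # repeated confirmation -> final verdict
                return verdict
            # Widen: re-check the inputs of *all* candidates, in case evidence was overlooked.
            next_ids = {i for nid in candidates for i in dag[nid].input_ids}
        else:
            consecutive_nfs = 0
            # Narrow: follow only the nodes that actually contributed evidence.
            next_ids = {i for nid in evidence_node_ids for i in dag[nid].input_ids}
        if not next_ids:                                   # source text reached (no inputs) -> verdict is final
            return verdict
        candidates = sorted(next_ids)
    return verdict
```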
Providing traceability
In addition to assigning a final “Fully Supported,” “Not Fully Supported,” or “Inconclusive” verdict to each claim, VeriTrail returns (a) all Verdict Generation results and (b) an evidence trail composed of all Evidence Selection results: the selected sentences, their corresponding node IDs, and the generated summaries. Collectively, these outputs provide traceability:
- Provenance: For “Fully Supported” and “Inconclusive” claims, the evidence trail traces a path from the source material to the final output, helping users understand how the output may have been derived. For example, in Case Study 1, the evidence trail consists of Sentence 8 from Node 15, Sentence 11 from Node 13, Sentence 26 from Node 4, and Sentence 79 from Node 1.
- Error Localization: For “Not Fully Supported” claims, VeriTrail uses the Verdict Generation results to identify the stage(s) of the process where the unsupported content was likely introduced. For instance, in Case Study 2, where none of the verified intermediate outputs supported the claim, VeriTrail would indicate that the hallucination occurred in the final answer (Stage 6). Error stage identification helps users address hallucinations and understand where in the process they are most likely to occur.
The evidence trail also helps users verify the verdict: instead of reading through all nodes – which may be infeasible for processes that generate large amounts of text – users can simply review the evidence sentences and summaries.
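Under the same illustrative assumptions as the earlier sketches, one way to picture the structure of what VeriTrail returns for each claim:

```python
from dataclasses import dataclass

@dataclass
class EvidenceItem:
    """One selected sentence in the evidence trail (illustrative fields)."""
    node_id: int        # node the sentence was selected from
    sentence_id: int    # programmatically assigned sentence ID
    sentence: str       # the selected sentence itself

@dataclass
class ClaimResult:
    claim: str
    final_verdict: str      # "Fully Supported", "Not Fully Supported", or "Inconclusive"
    verdicts: list          # all Verdict Generation results, one per iteration
    summaries: list         # the Evidence Selection summaries, one per iteration
    evidence_trail: list    # EvidenceItem objects; e.g., for Case Study 1:
                            # Sentence 8 (Node 15) -> Sentence 11 (Node 13)
                            # -> Sentence 26 (Node 4) -> Sentence 79 (Node 1)
```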
Key design features
VeriTrail’s design prioritizes reliability, efficiency, scalability, and user agency. Notable features include:
- During Evidence Selection (introduced in Case Study 1), the sentence IDs returned by the LM are checked against the programmatically assigned IDs. If a returned ID does not match an assigned ID, it is discarded; otherwise, it is mapped to its corresponding sentence. This approach guarantees that the sentences included in the evidence trail are not hallucinated. (A minimal code sketch of this check appears after this list.)
- After a claim is assigned an interim “Fully Supported” or “Inconclusive” verdict (as in Case Study 1), VeriTrail verifies the input nodes of only the nodes from which evidence was previously selected – not all possible input nodes. By progressively narrowing the search space, VeriTrail limits the number of nodes the LM must evaluate. In particular, since VeriTrail starts from the final output and moves toward the source text, it tends to verify a smaller proportion of nodes as it approaches the source text. Nodes closer to the source text tend to be larger (e.g., a book chapter should be larger than its summary), so verifying fewer of them helps reduce computational cost.
- VeriTrail is designed to handle input graphs with any number of nodes, regardless of whether they fit in a single prompt. Users can specify an input size limit per prompt. For Evidence Selection, inputs that exceed the limit are split across multiple prompts. If the resulting evidence exceeds the input size limit for Verdict Generation, VeriTrail reruns Evidence Selection to compress the evidence further. Users can configure the maximum number of Evidence Selection reruns.
- The configurable maximum number of consecutive “Not Fully Supported” verdicts (introduced in Case Study 2) allows the user to find their desired balance between computational cost and how conservative VeriTrail is in flagging hallucinations. A lower maximum reduces cost by limiting the number of checks. A higher maximum increases confidence that a flagged claim is truly hallucinated since it requires repeated confirmation of the “Not Fully Supported” verdict.
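As a minimal illustration of the ID check described in the first bullet above (with made-up sentences and a hypothetical helper name):

```python
def keep_valid_selections(returned_ids, assigned):
    """Map LM-returned sentence IDs back to their sentences, discarding any ID
    that was never programmatically assigned (illustrative sketch)."""
    return {sid: assigned[sid] for sid in returned_ids if sid in assigned}

# Example: ID 99 was never assigned, so it is dropped rather than trusted.
assigned = {0: "The merger closed in 2021.", 1: "Revenue doubled the following year."}
print(keep_valid_selections([1, 99], assigned))   # {1: 'Revenue doubled the following year.'}
```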
Evaluating VeriTrail’s performance
We tested VeriTrail on two datasets covering distinct generative processes (hierarchical summarization4 and GraphRAG), tasks (summarization and question-answering), and types of source material (fiction novels and news articles). For the source material, we focused on long documents and large collections of documents (i.e., >100K tokens), where hallucination detection is especially challenging and processes with multiple generative steps are typically most valuable. The resulting DAGs were much more complex than the examples provided above (e.g., in one of the datasets, the average number of nodes was 114,368).
We compared VeriTrail to three types of baseline methods commonly used for closed-domain hallucination detection: Natural Language Inference models (AlignScore and INFUSE); Retrieval-Augmented Generation; and long-context models (Gemini 1.5 Pro and GPT-4.1 mini). Across both datasets and all language models tested, VeriTrail outperformed the baseline methods in detecting hallucination.5
Most importantly, VeriTrail traces claims through intermediate outputs – unlike the baseline methods, which directly compare the final output to the source material. As a result, it can identify where hallucinated content was likely introduced and how faithful content may have been derived from the source. By providing traceability, VeriTrail brings transparency to generative processes, helping users understand, verify, debug, and, ultimately, trust their outputs.
For an in-depth discussion of VeriTrail, please see our paper “VeriTrail: Closed-Domain Hallucination Detection with Traceability.”
1 The term “closed-domain hallucination” was introduced by OpenAI in the GPT-4 Technical Report.
2 VeriTrail is currently used for research purposes only and is not available commercially.
3 We focus on GraphRAG’s global search method.
4 In hierarchical summarization, an LM summarizes each source text chunk individually; the resulting summaries are then repeatedly grouped and summarized until a final summary is produced (Wu et al., 2021; Chang et al., 2023).
5 The only exception was the mistral-large-2411 model, where VeriTrail had the highest balanced accuracy, but not the highest macro F1 score.