
Many applications of language models (LMs) involve generating content based on source material, such as answering questions, summarizing information, and drafting documents. A critical challenge for these applications is that LMs may produce content that is not supported by the source text – a phenomenon known as “closed-domain hallucination.”1
Existing methods for detecting closed-domain hallucination typically compare a given LM output to the source text, implicitly assuming that there is only a single output to evaluate. However, applications of LMs increasingly involve processes with multiple generative steps: LMs generate intermediate outputs that serve as inputs to subsequent steps and culminate in a final output. Many agentic workflows follow this paradigm (e.g., each agent is responsible for a specific document or sub-task, and their outputs are synthesized into a final response).
In our paper “VeriTrail: Closed-Domain Hallucination Detection with Traceability,” we argue that, given the complexity of processes with multiple generative steps, detecting hallucination in the final output is necessary but not sufficient. We also need traceability, which has two components:
- Provenance: if the final output is supported by the source text, we should be able to trace its path through the intermediate outputs to the source.
- Error Localization: if the final output is not supported by the source text, we should be able to trace where the error was likely introduced.
Our paper presents VeriTrail, the first closed-domain hallucination detection method designed to provide traceability for processes with any number of generative steps. We also demonstrate that VeriTrail outperforms baseline methods commonly used for hallucination detection. In this blog post, we provide an overview of VeriTrail’s design and performance.2
VeriTrail’s hallucination detection process
A key idea leveraged by VeriTrail is that a wide range of generative processes can be represented as a directed acyclic graph (DAG). Each node in the DAG represents a piece of text (i.e., source material, an intermediate output, or the final output) and each edge from node A to node B indicates that A was used as an input to produce B. Each node is assigned a unique ID, as well as a stage reflecting its position in the generative process.
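To make this concrete, here is a minimal Python sketch of one way such a DAG could be represented. The class and field names are illustrative assumptions for this post, not VeriTrail's actual schema, and the toy graph below is not the one shown in Figure 1.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One piece of text in the generative process (illustrative fields)."""
    node_id: int                                    # unique ID
    stage: int                                      # position in the generative process (1 = source text)
    text: str                                       # source chunk, intermediate output, or final output
    input_ids: list = field(default_factory=list)   # edges: IDs of the nodes used as inputs to produce this one

# A toy DAG: two source chunks are summarized, and the summary is used to produce the final answer.
dag = {
    1: Node(1, stage=1, text="Source chunk A ..."),
    2: Node(2, stage=1, text="Source chunk B ..."),
    3: Node(3, stage=2, text="Summary of chunks A and B ...", input_ids=[1, 2]),
    4: Node(4, stage=3, text="Final answer ...", input_ids=[3]),
}
```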
An example of a process with multiple generative steps is GraphRAG. A DAG representing a GraphRAG run is illustrated in Figure 1, where the boxes and arrows correspond to nodes and edges, respectively.3

VeriTrail takes as input a DAG representing a completed generative process and aims to determine whether the final output is fully supported by the source text. It begins by extracting claims (i.e., self-contained, verifiable statements) from the final output using Claimify. VeriTrail verifies claims in the reverse order of the generative process: it starts from the final output and moves toward the source text. Each claim is verified separately. Below, we include two case studies that illustrate how VeriTrail works, using the DAG from Figure 1.
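A minimal driver for this flow might look like the sketch below. Here `extract_claims` is a hypothetical stand-in for Claimify, `select_evidence` and `generate_verdict` are hypothetical stand-ins for the LM prompts introduced in the case studies that follow, and `verify_claim` is the per-claim loop sketched after Case Study 2; none of these names come from VeriTrail's actual implementation.

```python
def detect_hallucination(dag, final_node_id, extract_claims, select_evidence, generate_verdict):
    """Extract claims from the final output, then verify each claim separately,
    starting from the final output and moving toward the source text."""
    claims = extract_claims(dag[final_node_id].text)   # self-contained, verifiable statements
    return {
        claim: verify_claim(claim, dag, final_node_id, select_evidence, generate_verdict)
        for claim in claims
    }
```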
Case study 1: A “Fully Supported” claim

Figure 2 shows an example of a claim that VeriTrail determined was not hallucinated:
- In Iteration 1, VeriTrail identified the nodes that were used as inputs for the final answer: Nodes 15 and 16. Each identified node was split into sentences, and each sentence was programmatically assigned a unique ID.
- An LM then performed Evidence Selection, selecting all sentence IDs that strongly implied the truth or falsehood of the claim. The LM also generated a summary of the selected sentences (not shown in Figure 2). In this example, a sentence was selected from Node 15.
- Next, an LM performed Verdict Generation. If no sentences had been selected during Evidence Selection, the claim would have been assigned a “Not Fully Supported” verdict automatically. Since evidence had been selected, an LM was prompted to classify the claim as “Fully Supported,” “Not Fully Supported,” or “Inconclusive” based on that evidence. In this case, the verdict was “Fully Supported.”
- Since the verdict in Iteration 1 was “Fully Supported,” VeriTrail proceeded to Iteration 2. It considered the nodes from which at least one sentence was selected in the latest Evidence Selection step (Node 15) and identified their input nodes (Nodes 12 and 13). VeriTrail repeated Evidence Selection and Verdict Generation for the identified nodes. Once again, the verdict was “Fully Supported.” This process – identifying candidate nodes, performing Evidence Selection and Verdict Generation – was repeated in Iteration 3, where the verdict was still “Fully Supported,” and likewise in Iteration 4.
- In Iteration 4, a single source text chunk was verified. Since the source text, by definition, does not have any inputs, verification terminated and the verdict was deemed final.
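The single iteration walked through above (sentence numbering, Evidence Selection, Verdict Generation) could be sketched roughly as follows, reusing the toy Node structure from earlier. `select_evidence` and `generate_verdict` are hypothetical callables standing in for the LM prompts, and the sentence splitting is deliberately naive; this is an illustration of the described behavior, not VeriTrail's actual code.

```python
def run_iteration(candidate_ids, dag, claim, select_evidence, generate_verdict):
    """One illustrative iteration over a set of candidate nodes."""
    # Split each candidate node into sentences and assign each sentence a unique ID.
    numbered = {}
    for nid in candidate_ids:
        for sentence in dag[nid].text.split(". "):       # naive sentence splitting, for illustration only
            numbered[len(numbered)] = (nid, sentence)

    # Evidence Selection: an LM returns the IDs of sentences that strongly imply
    # the truth or falsehood of the claim, plus a summary of the selected sentences.
    selected_ids, summary = select_evidence(claim, numbered)

    # With no selected evidence, the claim is "Not Fully Supported" by construction.
    if not selected_ids:
        return "Not Fully Supported", [], summary

    # Verdict Generation: an LM classifies the claim based only on the selected evidence.
    evidence = [numbered[sid] for sid in selected_ids]
    verdict = generate_verdict(claim, evidence)          # "Fully Supported" / "Not Fully Supported" / "Inconclusive"
    evidence_node_ids = sorted({nid for nid, _ in evidence})
    return verdict, evidence_node_ids, summary
```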
Case study 2: A “Not Fully Supported” claim

Figure 3 provides an example of a claim where VeriTrail identified hallucination:
- In Iteration 1, VeriTrail identified the nodes used as inputs for the final answer: Nodes 15 and 16. After Evidence Selection and Verdict Generation, the verdict was “Not Fully Supported.” Users can configure the maximum number of consecutive “Not Fully Supported” verdicts permitted. If the maximum had been set to 1, verification would have terminated here, and the verdict would have been deemed final. Let’s assume the maximum was set to 2, meaning that VeriTrail had to perform at least one more iteration.
- Even though evidence was selected only from Node 15 in Iteration 1, VeriTrail checked the input nodes for both Node 15 and Node 16 (i.e., Nodes 12, 13, and 14) in Iteration 2. Recall that in Case Study 1, where the verdict was “Fully Supported,” VeriTrail checked only the input nodes for Node 15. Why was the “Not Fully Supported” claim handled differently? If the Evidence Selection step had overlooked relevant evidence, the “Not Fully Supported” verdict might be incorrect. In that case, continuing verification based solely on the selected evidence (i.e., Node 15) would propagate the mistake, defeating the purpose of repeated verification.
- In Iteration 2, Evidence Selection and Verdict Generation were repeated for Nodes 12, 13, and 14. Once again, the verdict was “Not Fully Supported.” Since this was the second consecutive “Not Fully Supported” verdict, verification terminated and the verdict was deemed final.
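Putting the two case studies together, the outer loop (narrowing after supportive verdicts, widening and counting consecutive “Not Fully Supported” verdicts) could be sketched as follows, reusing `run_iteration` from the earlier sketch. Again, this is a simplified illustration under the assumptions stated above, not VeriTrail's actual implementation.

```python
def verify_claim(claim, dag, final_node_id, select_evidence, generate_verdict, max_consecutive_nfs=2):
    """Illustrative outer loop for verifying a single claim against the DAG."""
    candidates = list(dag[final_node_id].input_ids)       # Iteration 1: inputs of the final output
    consecutive_nfs = 0
    verdict = "Inconclusive"
    while candidates:
        verdict, evidence_node_ids, _ = run_iteration(
            candidates, dag, claim, select_evidence, generate_verdict)
        if verdict == "Not Fully Supported":
            consecutive_nfs += 1
            if consecutive_nfs >= max_consecutive_nfs:    # repeated confirmation -> final verdict
                return verdict
            # Widen: re-check the inputs of *all* candidates, in case evidence was overlooked.
            next_ids = {i for nid in candidates for i in dag[nid].input_ids}
        else:
            consecutive_nfs = 0
            # Narrow: follow only the nodes that actually contributed evidence.
            next_ids = {i for nid in evidence_node_ids for i in dag[nid].input_ids}
        if not next_ids:                                   # source text reached (no inputs) -> verdict is final
            return verdict
        candidates = sorted(next_ids)
    return verdict
```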
Providing traceability
In addition to assigning a final “Fully Supported,” “Not Fully Supported,” or “Inconclusive” verdict to each claim, VeriTrail returns (a) all Verdict Generation results and (b) an evidence trail composed of all Evidence Selection results: the selected sentences, their corresponding node IDs, and the generated summaries. Collectively, these outputs provide traceability:
- Provenance: For “Fully Supported” and “Inconclusive” claims, the evidence trail traces a path from the source material to the final output, helping users understand how the output may have been derived. For example, in Case Study 1, the evidence trail consists of Sentence 8 from Node 15, Sentence 11 from Node 13, Sentence 26 from Node 4, and Sentence 79 from Node 1.
- Error Localization: For “Not Fully Supported” claims, VeriTrail uses the Verdict Generation results to identify the stage(s) of the process where the unsupported content was likely introduced. For instance, in Case Study 2, where none of the verified intermediate outputs supported the claim, VeriTrail would indicate that the hallucination occurred in the final answer (Stage 6). Error stage identification helps users address hallucinations and understand where in the process they are most likely to occur.
The evidence trail also helps users verify the verdict: instead of reading through all nodes – which may be infeasible for processes that generate large amounts of text – users can simply review the evidence sentences and summaries.
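Under the same illustrative assumptions as the earlier sketches, one way to picture the structure of what VeriTrail returns for each claim:

```python
from dataclasses import dataclass

@dataclass
class EvidenceItem:
    """One selected sentence in the evidence trail (illustrative fields)."""
    node_id: int        # node the sentence was selected from
    sentence_id: int    # programmatically assigned sentence ID
    sentence: str       # the selected sentence itself

@dataclass
class ClaimResult:
    claim: str
    final_verdict: str      # "Fully Supported", "Not Fully Supported", or "Inconclusive"
    verdicts: list          # all Verdict Generation results, one per iteration
    summaries: list         # the Evidence Selection summaries, one per iteration
    evidence_trail: list    # EvidenceItem objects; e.g., for Case Study 1:
                            # Sentence 8 (Node 15) -> Sentence 11 (Node 13)
                            # -> Sentence 26 (Node 4) -> Sentence 79 (Node 1)
```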
Key design features
VeriTrail’s design prioritizes reliability, efficiency, scalability, and user agency. Notable features include:
- During Evidence Selection (introduced in Case Study 1), the sentence IDs returned by the LM are checked against the programmatically assigned IDs. If a returned ID does not match an assigned ID, it is discarded; otherwise, it is mapped to its corresponding sentence. This approach guarantees that the sentences included in the evidence trail are not hallucinated. (A minimal code sketch of this check appears after this list.)
- After a claim is assigned an interim “Fully Supported” or “Inconclusive” verdict (as in Case Study 1), VeriTrail verifies the input nodes of only the nodes from which evidence was previously selected – not all possible input nodes. By progressively narrowing the search space, VeriTrail limits the number of nodes the LM must evaluate. In particular, since VeriTrail starts from the final output and moves toward the source text, it tends to verify a smaller proportion of nodes as it approaches the source text. Nodes closer to the source text tend to be larger (e.g., a book chapter should be larger than its summary), so verifying fewer of them helps reduce computational cost.
- VeriTrail is designed to handle input graphs with any number of nodes, regardless of whether they fit in a single prompt. Users can specify an input size limit per prompt. For Evidence Selection, inputs that exceed the limit are split across multiple prompts. If the resulting evidence exceeds the input size limit for Verdict Generation, VeriTrail reruns Evidence Selection to compress the evidence further. Users can configure the maximum number of Evidence Selection reruns.
- The configurable maximum number of consecutive “Not Fully Supported” verdicts (introduced in Case Study 2) allows the user to find their desired balance between computational cost and how conservative VeriTrail is in flagging hallucinations. A lower maximum reduces cost by limiting the number of checks. A higher maximum increases confidence that a flagged claim is truly hallucinated since it requires repeated confirmation of the “Not Fully Supported” verdict.
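As a minimal illustration of the ID check described in the first bullet above (with made-up sentences and a hypothetical helper name):

```python
def keep_valid_selections(returned_ids, assigned):
    """Map LM-returned sentence IDs back to their sentences, discarding any ID
    that was never programmatically assigned (illustrative sketch)."""
    return {sid: assigned[sid] for sid in returned_ids if sid in assigned}

# Example: ID 99 was never assigned, so it is dropped rather than trusted.
assigned = {0: "The merger closed in 2021.", 1: "Revenue doubled the following year."}
print(keep_valid_selections([1, 99], assigned))   # {1: 'Revenue doubled the following year.'}
```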
Evaluating VeriTrail’s performance
We tested VeriTrail on two datasets covering distinct generative processes (hierarchical summarization4 and GraphRAG), tasks (summarization and question-answering), and types of source material (fiction novels and news articles). For the source material, we focused on long documents and large collections of documents (i.e., >100K tokens), where hallucination detection is especially challenging and processes with multiple generative steps are typically most valuable. The resulting DAGs were much more complex than the examples provided above (e.g., in one of the datasets, the average number of nodes was 114,368).
We compared VeriTrail to three types of baseline methods commonly used for closed-domain hallucination detection: Natural Language Inference models (AlignScore and INFUSE); Retrieval-Augmented Generation; and long-context models (Gemini 1.5 Pro and GPT-4.1 mini). Across both datasets and all language models tested, VeriTrail outperformed the baseline methods in detecting hallucination.5
Most importantly, VeriTrail traces claims through intermediate outputs – unlike the baseline methods, which directly compare the final output to the source material. As a result, it can identify where hallucinated content was likely introduced and how faithful content may have been derived from the source. By providing traceability, VeriTrail brings transparency to generative processes, helping users understand, verify, debug, and, ultimately, trust their outputs.
For an in-depth discussion of VeriTrail, please see our paper “VeriTrail: Closed-Domain Hallucination Detection with Traceability.”
1 The term “closed-domain hallucination” was introduced by OpenAI in the GPT-4 Technical Report.
2 VeriTrail is currently used for research purposes only and is not available commercially.
3 We focus on GraphRAG’s global search method.
4 In hierarchical summarization, an LM summarizes each source text chunk individually; the resulting summaries are then repeatedly grouped and summarized until a final summary is produced (Wu et al., 2021; Chang et al., 2023).
5 The only exception was the mistral-large-2411 model, where VeriTrail had the highest balanced accuracy, but not the highest macro F1 score.