Error Types in RAG for Research Automation and Fact Verification

By Wojciech Gryc on July 25, 2024

A key part of Emerging Trajectories is our ability to link individual facts or pieces of information to a query, question, or statement being verified. This is similar to how Gemini or Perplexity cite website URLs when generating content, but we tend to go a few steps further:

  1. We do not cite at the web page level; we cite the specific piece of information that supports or disproves the assertion being made. This is critical because a web page could contain thousands of words, and saying the page justifies an assertion puts the onus on the user to review everything themselves. This can be untenable or impossible, especially with dozens or hundreds of citations.
  2. Tools like Perplexity and ChatGPT will do a web search, obtain the first “x” (typically 5 or 10) results, and see if those inform the query. Emerging Trajectories, on the other hand, searches through thousands of articles or web pages to answer questions or generate content. This requires us to crawl and integrate the content beforehand, but it makes whatever content we generate significantly more holistic and comprehensive for end users.
  3. We might generate facts from other content, and we need this to be auditable. Most products cite content verbatim from the original sources, and this is a great starting point. However, our “facts” can also be generated from other sources, observations, and so on. For example, if we generate code to analyze a data set and come to a conclusion, we might cite that conclusion in other content. This recursive generation of facts (or “research automation”) is unique to Emerging Trajectories. Our requirement here is that all such facts must be auditable, so that humans can review the original sources, code, or whatever else led to the conclusion being cited. You should never have blind faith in the large language model (LLM) being used.
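
To make the third point concrete, here is a minimal sketch of what an auditable fact record could look like. This is an illustration only, not our production schema; the field names and example values are assumptions made for this post.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Fact:
    """A single citable fact with enough provenance to be audited by a human."""
    content: str                          # the specific claim, not the whole page
    source_url: Optional[str] = None      # where the underlying content was crawled from
    source_excerpt: Optional[str] = None  # the exact passage that supports the claim
    derived_from: list[str] = field(default_factory=list)  # IDs of facts this one was derived from
    generation_notes: Optional[str] = None  # e.g., a description of the code or analysis that produced it

# A fact cited verbatim from a crawled page:
f1 = Fact(
    content="France's debt-to-GDP ratio is expected to surpass 110% in 2026.",
    source_url="https://example.org/fiscal-report",  # placeholder URL
    source_excerpt="...the ratio is projected to exceed 110% by 2026...",
)

# A fact generated from other facts ("research automation"): it stays auditable
# because it records what it was derived from and how.
f2 = Fact(
    content="French interest payments are likely to grow as a share of the budget.",
    derived_from=["fact-001"],  # ID of f1 in a real fact store
    generation_notes="Derived by running a debt-servicing projection over fact-001.",
)
```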

Problems with Retrieval Augmented Generation (RAG)

The most obvious approach to building an Emerging Trajectories-like product is using RAG. In this scenario, you insert “facts” or “observations” into the prompts sent to the LLM and ask it to use that information in generating a response. This can be a decent starting point, but it can also introduce numerous types of errors and “hallucinations” if done improperly. We typically look for four types of errors.
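
As a baseline, the “insert facts into the prompt” pattern looks roughly like the sketch below. The prompt wording is illustrative, and `call_llm` is a stand-in for whichever model client you actually use.

```python
def build_rag_prompt(query: str, facts: list[str]) -> str:
    """Assemble a prompt that asks the model to answer only from the supplied facts."""
    numbered = "\n".join(f"[{i + 1}] {fact}" for i, fact in enumerate(facts))
    return (
        "Answer the question using ONLY the numbered facts below. "
        "Cite facts by number, and say so explicitly if the facts are insufficient.\n\n"
        f"Facts:\n{numbered}\n\nQuestion: {query}\nAnswer:"
    )

# Example usage (call_llm is a placeholder for your model client):
# prompt = build_rag_prompt("Does the French government report on its annual interest payments?", facts)
# answer = call_llm(prompt)
```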

Error Type #1: misattribution of fact content to the query

RAG approaches tend to struggle because of the way they split up the information to be cited (i.e., their “chunking” strategy). For example, if you have a 100-page report, at what level do you review and cite it? Do you take individual sentences, paragraphs, pages, or something else? When you break the document down, you might lose valuable context. For instance, a report on Europe’s AI labor shortage might mention specific countries in its first paragraph, but only refer to “the aforementioned countries” in its second. If you pass paragraphs in as facts, the LLM won’t know which countries are actually being referred to.

Broadly speaking, we define this as an attribution error — you want to know which entities are being discussed in facts/statements so you can reference them properly.

Note that some researchers suggest summarizing reports or building knowledge graphs to address this issue. These can be useful approaches, but they assume that the context of your queries or documents is the same as when the summary was generated. You might lose valuable information when building the summary, and in the case of a knowledge graph, you have to design the right nodes and edges for information to be cited properly later.
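
One mitigation we can illustrate here (a sketch under our own assumptions, not a complete solution) is to carry document-level context with every chunk, so that a passage like “the aforementioned countries” remains resolvable when cited on its own. The report title and country names below are made up for the example.

```python
from dataclasses import dataclass

@dataclass
class ContextualChunk:
    document_title: str
    document_context: str  # e.g., a short summary or the entities the report covers
    text: str              # the paragraph being cited

    def as_citable_text(self) -> str:
        # Prepend the document context so the chunk is self-contained when cited.
        return f"[{self.document_title} | {self.document_context}] {self.text}"

def chunk_report(title: str, context: str, paragraphs: list[str]) -> list[ContextualChunk]:
    """Split a report into paragraphs, carrying document-level context with each one."""
    return [ContextualChunk(title, context, p) for p in paragraphs]

chunks = chunk_report(
    title="Europe's AI Labor Shortage",
    context="Countries discussed: Germany, France, Poland",  # invented for the example
    paragraphs=[
        "Germany, France, and Poland face the steepest shortfalls.",
        "The aforementioned countries are expanding visa programs in response.",
    ],
)
print(chunks[1].as_citable_text())  # the second paragraph now carries the country list with it
```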

Error Type #2: picking facts that are not “as good” as other facts

A “softer” version of Error Type #1 is not using the best facts for the statement or query in question. Suppose you are researching political candidates and their policies in a popular election. In such a scenario, a RAG approach applied to media sources can generate thousands of relevant statements. Even if you address Error Type #1 and get dozens or hundreds of properly attributed facts or statements, you want to ensure the LLM picks the best ones for citations.

Tracking this error can be difficult, so we tend to review fact citations to determine whether they directly support an assertion or indirectly support it.

Another interesting variant here is the source of the fact or citation. Many of our users prefer high-credibility sources to be prioritized, so if you have two equivalent citations, one from an official government website and one from social media, you'll want to prioritize the former.
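
As a rough illustration, source credibility can be folded into fact selection by weighting a retriever's relevance score with a per-source credibility score. The tiers and numbers below are invented for the example; a real system would maintain a richer source registry.

```python
# Illustrative credibility tiers; the values are assumptions for this post.
CREDIBILITY = {
    "government": 1.0,
    "major_media": 0.8,
    "social_media": 0.4,
}

def rank_facts(facts: list[dict]) -> list[dict]:
    """Order candidate facts by relevance weighted by source credibility.

    Each fact is a dict with 'content', 'relevance' (0-1, e.g., from a retriever),
    and 'source_type' (one of the CREDIBILITY keys).
    """
    def score(fact: dict) -> float:
        return fact["relevance"] * CREDIBILITY.get(fact["source_type"], 0.5)
    return sorted(facts, key=score, reverse=True)

candidates = [
    {"content": "Official budget figures...", "relevance": 0.82, "source_type": "government"},
    {"content": "A viral post claims...",     "relevance": 0.90, "source_type": "social_media"},
]
best_first = rank_facts(candidates)  # the government source outranks the nominally more 'relevant' post
```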

Error Type #3: verifiable versus unverifiable information

A third issue lies with the queries that users put into the system itself. Tools like Emerging Trajectories, ChatGPT, and Perplexity are built to handle many different queries from users, and not all queries are easy to verify; some are not possible to verify at all.

For example, suppose you are doing research on the fiscal sustainability of G7 countries (i.e., can they keep paying their debts). Here are three queries any reasonable user might put into an LLM-powered software tool:

  1. Will France default on its national debt in 2030?
  2. What are France’s interest payments expected to be in 2025?
  3. Does the French government report on its annual interest payments?

All three queries could be put into an LLM-powered system, but the first is significantly less verifiable today in 2024, while the third is a relatively easy question for any RAG system to answer. Ensuring that the RAG system (and the associated LLM) actually pushes back when the facts needed to support an assertion are not present is critical.
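
The sketch below shows one way to enforce that behavior: an abstention gate that tells the model to refuse rather than guess when the supplied facts are insufficient. `call_llm` is again a placeholder for your model client, and the exact prompt wording and sentinel string are assumptions for this post.

```python
INSUFFICIENT = "INSUFFICIENT_EVIDENCE"

def answer_with_abstention(query: str, facts: list[str], call_llm) -> str:
    """Answer only if the facts support an answer; otherwise abstain explicitly.

    `call_llm` is a placeholder for whatever model client you use: it takes a
    prompt string and returns the model's text response.
    """
    if not facts:
        return "I can't verify this: no relevant facts were found."
    prompt = (
        "Using ONLY the facts below, answer the question. If the facts are not "
        f"sufficient to answer it, reply exactly with {INSUFFICIENT}.\n\n"
        + "\n".join(f"- {f}" for f in facts)
        + f"\n\nQuestion: {query}"
    )
    response = call_llm(prompt)
    if INSUFFICIENT in response:
        return "I can't verify this with the facts currently in the fact base."
    return response
```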

Error Type #4: fundamental logic errors

Suppose a verifiable question is submitted and we obtain the right facts. A final risk is that those facts will be combined into a conclusion that does not reasonably follow from the facts themselves.

Suppose we’re revisiting the query, “Will France default on its national debt in 2030?” We could get a response like, “Yes: France will not be able to service its national debt, based on its increasing debt burden[1] and the likelihood of high interest rates[2].”

Suppose the two cited facts are:

  1. France’s debt to GDP ratio is expected to surpass 110% in 2026.
  2. US interest rates are expected to stay above 5% per year in 2026.

You can see here that the two facts might contribute to a risk that France won’t be able to service its debt, but this is not a foregone conclusion based on the facts themselves. This sort of fundamental logic error can occur quite often with RAG-powered LLM systems.
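
One way to catch this class of error is a second pass that checks whether the drafted conclusion follows from the cited facts alone, using a model as an entailment judge. This is a sketch under the same placeholder assumptions as above (`call_llm` is your model client); a production check would be more careful about prompt wording and output parsing.

```python
def conclusion_is_supported(conclusion: str, cited_facts: list[str], call_llm) -> bool:
    """Return True only if the judge says the conclusion follows from the cited facts.

    `call_llm` is a placeholder for your model client (prompt in, text out).
    """
    prompt = (
        "Facts:\n" + "\n".join(f"- {f}" for f in cited_facts) + "\n\n"
        f"Conclusion: {conclusion}\n\n"
        "Does the conclusion follow from the facts alone, without additional "
        "assumptions? Answer strictly YES or NO."
    )
    verdict = call_llm(prompt).strip().upper()
    return verdict.startswith("YES")

# Applied to the example above, a good judge should answer NO: the facts describe a
# risk (a rising debt ratio, high US rates) but do not establish that France will default.
```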

Considerations for User Requirements

The issues above can, at least to a reasonable extent, be addressed with today’s technologies, but doing so also comes with the added challenge of user expectations and requirements.

For example, suppose you have 10,000 articles on global economics. A user doesn’t want you to re-analyze all 10,000 articles every time they have a question. Understanding the user’s use case allows you to decide where to sit on the spectrum between low-cost, imprecise answers and more expensive, more precise ones.

It’s important to note — “low cost and imprecise” is not fundamentally bad. You could argue that Google search queries or ChatGPT questions are low cost and imprecise — in both cases, you get an answer quickly and for (nearly) free, but you can’t simply trust whatever content you’re provided.

This is where building fact bases on specific content is such a valuable and interesting middle ground. In the case of Emerging Trajectories, we focus on data sources around global events, economics, and geopolitical risk. This makes us fast and precise for specific use cases, but you shouldn't depend on us for general knowledge about anything and everything.
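
To make the fact-base idea concrete, here is a sketch of the overall flow: facts are extracted once at ingest time, and questions are answered against the stored facts rather than by re-reading every article per query. The `extract_facts` callable and the naive keyword search are placeholders for whatever extraction and retrieval you actually use.

```python
class FactBase:
    """A store of pre-extracted facts, built once at crawl/ingest time."""

    def __init__(self, extract_facts):
        self._extract = extract_facts  # placeholder: article text -> list of fact strings
        self._facts: list[str] = []

    def ingest(self, articles: list[str]) -> None:
        # Pay the extraction cost once, when content is crawled.
        for article in articles:
            self._facts.extend(self._extract(article))

    def search(self, query: str, top_k: int = 20) -> list[str]:
        # Naive keyword overlap as a stand-in for a real retriever (embeddings, BM25, etc.).
        terms = set(query.lower().split())
        scored = [(sum(t in f.lower() for t in terms), f) for f in self._facts]
        return [f for score, f in sorted(scored, reverse=True)[:top_k] if score > 0]
```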

How Emerging Trajectories Uses the Above

The error types, challenges, and user requirements above are what make Emerging Trajectories different from other systems like Perplexity, ChatGPT, and Claude. We are neither “better” nor “worse”; we are simply the right fit for specific types of use cases.

With Emerging Trajectories, we do the following...

As the LLM space matures, we also see opportunities for more specialized technologies to enter our own processes and frameworks. For example...

If you're interested in learning more, please contact us!