Why RAG won’t solve generative AI’s hallucination problem

Hallucinations – in effect, the lies that generative AI models tell – pose a major problem for companies looking to integrate the technology into their operations.

Because models have no real intelligence and simply predict words, images, speech, music and other data according to a probabilistic internal schema, they sometimes get it wrong. Very wrong. In a recent piece in The Wall Street Journal, a source describes an instance in which Microsoft’s generative AI invented meeting attendees and implied that conference calls were about subjects that weren’t actually discussed on the call.

As I wrote a while ago, hallucinations may be an unsolvable problem with today’s transformer-based model architectures. But a number of generative AI vendors suggest that they can be more or less eliminated through a technical approach called Retrieval Augmented Generation, or RAG.

Here’s how one vendor, Squirro, pitches it:

At the core of the offering is the concept of Retrieval Augmented LLMs or Retrieval Augmented Generation (RAG) embedded in the solution… (our generative AI) is unique in its promise of zero hallucinations. Every piece of information it generates is traceable to a source, which guarantees credibility.

Here’s a similar pitch from SiftHub:

Using RAG technology and refined large language models with industry-specific knowledge training, SiftHub enables companies to generate personalized responses without hallucinations. This guarantees more transparency and less risk and instills absolute confidence to use AI for all their needs.

RAG was developed by data scientist Patrick Lewis, a researcher at Meta and University College London and lead author of the 2020 paper that coined the term. Applied to a model, RAG retrieves documents that might be relevant to a question – for example, a Wikipedia page about the Super Bowl – using what is essentially a keyword search, and then asks the model to generate an answer given this additional context.
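As a rough illustration of that retrieve-then-generate loop, here is a minimal sketch in Python. The tiny corpus, the overlap-based keyword_search and the generate() stub are illustrative assumptions standing in for a real search index and a real model API, not any vendor’s actual implementation.

# Minimal sketch of a retrieve-then-generate (RAG) loop.
# The corpus, the overlap-based scoring and the generate() stub are
# stand-ins for a real search index and a real LLM API.

CORPUS = {
    "super_bowl_lviii.txt": "The Kansas City Chiefs won Super Bowl LVIII last year.",
    "llm_overview.txt": "Large language models predict the next token given a context.",
}

def keyword_search(question: str, k: int = 1) -> list[str]:
    """Rank documents by naive keyword overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(
        ((len(q_words & set(text.lower().split())), doc_id)
         for doc_id, text in CORPUS.items()),
        reverse=True,
    )
    return [doc_id for _, doc_id in scored[:k]]

def generate(prompt: str) -> str:
    """Stand-in for a call to a generative model."""
    return f"<model answer conditioned on: {prompt[:60]}...>"

def rag_answer(question: str) -> str:
    docs = [CORPUS[d] for d in keyword_search(question)]
    # The retrieved text is prepended so the model can answer from the
    # documents rather than only from its parametric memory.
    prompt = "Context:\n" + "\n".join(docs) + f"\n\nQuestion: {question}\nAnswer:"
    return generate(prompt)

print(rag_answer("Who won the Super Bowl last year?"))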

“When you interact with a generative AI model like ChatGPT or Llama and you ask a question, the default is for the model to answer from its ‘parametric memory’ – that is, from the knowledge stored in its parameters as a result of training on the Internet’s vast data,” explains David Wadden, a researcher at AI2, the AI-focused research arm of the nonprofit Allen Institute. “But just as you’re likely to give more accurate answers if you have a reference (such as a book or a file) in front of you, the same is true for models in some cases.”

RAG is undeniably useful: it allows what a model generates to be traced back to the retrieved documents to verify its factuality (and, as a side benefit, to avoid potential copyright infringement). RAG also lets companies that do not want their documents used to train a model (for example, companies in highly regulated industries such as healthcare and law) allow models to draw on those documents in a more secure and temporary way.

But RAG certainly cannot prevent a model from hallucinating. And it has limitations that many vendors gloss over.

Wadden says RAG is most effective in “knowledge-intensive” scenarios where a user wants to use a model to meet an “information need” – for example, to find out who won the Super Bowl last year. In these scenarios, the document answering the question likely contains many of the same keywords as the question (e.g., “Super Bowl,” “last year”), making it relatively easy to find via a keyword search.

Things get trickier with ‘reasoning-intensive’ tasks like coding and arithmetic, where it’s harder to specify in a keyword-based search the concepts needed to answer a request – let alone identify which documents might be relevant.
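A toy comparison makes the gap concrete. The documents and queries below are invented for illustration, and the scoring reuses the same naive word-overlap idea as the sketch above:

import re

# Toy comparison of keyword overlap for a knowledge-intensive query
# versus a reasoning-intensive one. All texts are invented examples.

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap(query: str, document: str) -> int:
    return len(tokens(query) & tokens(document))

wiki_page = "Super Bowl LVIII was played last year; the Chiefs won the game."
proof_doc = "Proof by induction: establish the base case, then show P(n) implies P(n+1)."

factual_q = "Who won the Super Bowl last year?"
reasoning_q = "Show that the sum of the first n odd numbers is n squared."

print(overlap(factual_q, wiki_page))   # high: "super", "bowl", "last", "year", "won" all match
print(overlap(reasoning_q, proof_doc)) # low: mostly incidental words; the key concept, induction,
                                       # never appears in the query at all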

Even for basic questions, models can be ‘distracted’ by irrelevant content in documents, especially in long documents where the answer is not clear. Or they may – for reasons yet unknown – simply ignore the contents of the retrieved documents and choose to rely on their parametric memory.

RAG is also expensive in terms of the hardware required to deploy it at scale.

That’s because retrieved documents, whether they come from the Internet, an internal database or elsewhere, must be held in memory – at least temporarily – so that the model can reference them. Another expense is the extra compute needed to process the enlarged context before the model generates its response. For a technology already notorious for the amount of computing power and electricity it requires for even basic operations, that is a serious consideration.
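A back-of-envelope estimate shows how quickly the retrieved context dominates. The four-characters-per-token rule of thumb and the document sizes below are assumptions for illustration; real tokenizers and costs vary by model.

# Rough estimate of how retrieved documents inflate the context a model
# must store and attend over. The ~4-characters-per-token rule and the
# document sizes are illustrative assumptions, not measurements.

CHARS_PER_TOKEN = 4  # common rule of thumb, not exact for any tokenizer

def approx_tokens(num_chars: int) -> int:
    return num_chars // CHARS_PER_TOKEN

question_chars = 120           # a short user question
retrieved_chars = 3 * 8_000    # three retrieved documents of ~8,000 characters each

plain = approx_tokens(question_chars)
with_rag = approx_tokens(question_chars + retrieved_chars)

print(f"question only:  ~{plain} tokens")
print(f"with retrieval: ~{with_rag} tokens ({with_rag / plain:.0f}x more context to hold and process)")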

That does not mean that RAG cannot be improved. Wadden notes that many efforts are under way to train models to make better use of the documents RAG retrieves.

Some of these efforts involve models that can “decide” when to make use of the retrieved documents, or that can skip retrieval altogether when they deem it unnecessary. Others focus on ways to index massive document datasets more efficiently, and on improving search through better representations of documents – representations that go beyond keywords.
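Dense, embedding-based retrieval is one example of a representation that goes beyond keywords: queries and documents are mapped to vectors and compared by similarity rather than shared words. A minimal sketch, with a placeholder embed() function standing in for a real embedding model:

import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for a real embedding model (e.g. a sentence encoder).
    Here: a fixed random projection of character counts, just so the
    example runs on its own; it does not capture meaning."""
    rng = np.random.default_rng(0)            # fixed seed -> one shared projection
    projection = rng.standard_normal((256, 128))
    counts = np.zeros(256)
    for ch in text.lower():
        counts[ord(ch) % 256] += 1
    return counts @ projection

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

docs = {
    "induction.txt": "Proof by induction: base case, then the inductive step.",
    "superbowl.txt": "The Chiefs won Super Bowl LVIII last year.",
}
doc_vecs = {name: embed(text) for name, text in docs.items()}

query = "How do I prove a statement for every natural number?"
q_vec = embed(query)

# With a real embedding model, the proof document should rank highest even
# though it shares no keywords with the query; the toy embed() above is
# only a stand-in and makes no such promise.
best = max(doc_vecs, key=lambda name: cosine(q_vec, doc_vecs[name]))
print(best)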

“We’re pretty good at retrieving documents based on keywords, but not so good at retrieving documents based on more abstract concepts, such as a proof technique needed to solve a mathematical problem,” Wadden said. “Research is needed to develop document representations and search techniques that can identify relevant documents for more abstract generation tasks. I think this is mainly an open question at this point.”

So RAG can help reduce a model’s hallucinations, but it is not the answer to all of AI’s hallucinatory problems. Beware of vendors who try to claim otherwise.