Finding out whether the AI is uncertain about facts or phrasing is the key.

Researchers describe how to tell if ChatGPT is confabulating

It’s one of the world’s worst-kept secrets that large language models give blatantly false answers to queries and do so with a confidence that’s indistinguishable from when they get things right. There are a number of reasons for this. The AI could have been trained on misinformation; the answer could require extrapolating from facts in a way the LLM isn’t capable of; or some aspect of the LLM’s training might have incentivized a falsehood.

But perhaps the simplest explanation is that an LLM doesn’t recognize what constitutes a correct answer but is compelled to provide one. So it simply makes something up, a habit that has been termed confabulation.

Figuring out when an LLM is making something up would obviously have tremendous value, given how quickly people have started relying on them for everything from college essays to job applications. Now, researchers from the University of Oxford say they’ve found a relatively simple way to determine when LLMs appear to be confabulating that works with all popular models and across a broad range of subjects. And, in doing so, they develop evidence that most of the alternative facts LLMs provide are a product of confabulation.

Catching confabulation

The new research is strictly about confabulations, and not instances such as training on false inputs. As the Oxford team defines them in their paper describing the work, confabulations are where “LLMs fluently make claims that are both wrong and arbitrary—by which we mean that the answer is sensitive to irrelevant details such as random seed.”

The reasoning behind their work is actually quite simple. LLMs aren’t trained for accuracy; they’re simply trained on massive quantities of text and learn to produce human-sounding phrasing from it. If enough examples in an LLM’s training data consistently present something as a fact, then the LLM is likely to present it as a fact. But if the examples are few, or inconsistent in their facts, then the LLM synthesizes a plausible-sounding answer that is likely incorrect.

But the LLM could also run into a similar situation when it has multiple options for phrasing the right answer. To use an example from the researchers’ paper, “Paris,” “It’s in Paris,” and “France’s capital, Paris” are all valid answers to “Where’s the Eiffel Tower?” So, statistical uncertainty, termed entropy in this context, can arise either when the LLM isn’t certain about how to phrase the right answer or when it can’t identify the right answer.

This means it’s not a great idea to simply force the LLM to return “I don’t know” when confronted with several roughly equivalent answers. We’d probably block a lot of correct answers by doing so.
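
To see why, consider a small Python sketch (ours, not the paper’s) that computes plain Shannon entropy over the raw answer strings. The three Eiffel Tower phrasings all count as distinct answers, so the string-level uncertainty is maximal even though there is no factual disagreement at all:

```python
import math
from collections import Counter

def naive_entropy(answers):
    """Shannon entropy over raw answer strings, ignoring meaning."""
    counts = Counter(answers)
    total = len(answers)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Three phrasings of the same correct fact look like three different answers.
samples = ["Paris", "It's in Paris", "France's capital, Paris"]
print(naive_entropy(samples))  # ~1.58 bits of "uncertainty" despite zero factual disagreement
```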

So instead, the researchers focus on what they call semantic entropy. This evaluates all the statistically likely answers generated by the LLM and determines how many of them are semantically equivalent. If a large number all have the same meaning, then the LLM is likely uncertain about phrasing but has the right answer. If not, then it is presumably in a situation where it would be prone to confabulation and should be prevented from doing so.
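
In code, the only change from the naive measure above is that the entropy is computed over meaning clusters instead of raw strings. Here is a minimal sketch with the clustering step left as a caller-supplied function; one plausible way to build that step appears after the researchers’ own description below:

```python
import math
from collections import Counter

def semantic_entropy(answers, cluster_by_meaning):
    """Entropy over meaning clusters rather than surface strings.

    `cluster_by_meaning` maps a list of answers to one cluster label per
    answer; how those clusters are formed is sketched further down.
    """
    labels = cluster_by_meaning(answers)
    counts = Counter(labels)
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# If all three phrasings land in a single cluster, the entropy is 0:
# the model is unsure about wording, not about the fact.
one_cluster = lambda answers: [0] * len(answers)
print(semantic_entropy(["Paris", "It's in Paris", "France's capital, Paris"], one_cluster))  # 0.0
```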

Extracting meaning

How does this work in practice? The description is remarkably straightforward:

Our method works by sampling several possible answers to each question and clustering them algorithmically into answers that have similar meanings, which we determine on the basis of whether answers in the same cluster entail each other bidirectionally. That is, if sentence A entails that sentence B is true and vice versa, then we consider them to be in the same semantic cluster.

If a single cluster predominates, then the AI is selecting an answer from within one collection of options that share the same factual content. If there are multiple clusters, then the AI is selecting among collections that each have different factual content—a situation that’s likely to result in confabulation.
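
Here is one way that clustering step might look in Python. The entailment check is a stub standing in for a real natural language inference model, and the greedy single-pass grouping is just one plausible reading of the description above, not the authors’ code:

```python
def nli_entails(premise: str, hypothesis: str) -> bool:
    """Hypothetical hook: does `premise` entail `hypothesis`?
    A real system would ask an NLI model; this crude stand-in just treats
    one answer as entailing another if either string contains the other."""
    a, b = premise.lower().rstrip("."), hypothesis.lower().rstrip(".")
    return b in a or a in b

def cluster_by_bidirectional_entailment(answers):
    """Greedily put each answer into the first cluster whose representative
    it entails and is entailed by; otherwise start a new cluster."""
    representatives = []  # one exemplar answer per cluster
    labels = []           # cluster index assigned to each answer
    for answer in answers:
        for idx, rep in enumerate(representatives):
            if nli_entails(answer, rep) and nli_entails(rep, answer):
                labels.append(idx)
                break
        else:
            representatives.append(answer)
            labels.append(len(representatives) - 1)
    return labels

answers = ["Paris", "It's in Paris", "France's capital, Paris", "Rome"]
print(cluster_by_bidirectional_entailment(answers))  # [0, 0, 0, 1]: one cluster for Paris, one for Rome
```

With a real NLI model plugged into `nli_entails`, the same loop would group paraphrases together while keeping factually different answers apart.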

Beyond its conceptual simplicity, implementing a system based on the ideas is also straightforward. Most major LLMs will produce a set of statistically likely answers to queries, which are needed to evaluate semantic entropy. There are already LLMs and software called natural language inference tools that are set up to determine whether two sentences imply each other. And, since those tools exist, there’s no supervised training needed, meaning that the system doesn’t have to be fed examples of confabulations to learn to determine the semantic entropy of a set of potential answers.
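
Getting that set of likely answers is mostly a matter of sampling the same prompt several times at a nonzero temperature. As an illustration only (the paper isn’t tied to any particular provider, and the model name and parameters here are arbitrary), this is how it might look with the OpenAI Python client:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def sample_answers(question: str, n: int = 10, temperature: float = 1.0) -> list[str]:
    """Draw several independent completions of the same question."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # arbitrary choice for illustration
        messages=[{"role": "user", "content": question}],
        n=n,                  # number of sampled completions
        temperature=temperature,
    )
    return [choice.message.content for choice in response.choices]

answers = sample_answers("Where's the Eiffel Tower? Answer in one short sentence.")
```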

The researchers develop a measure to determine the improvement in accuracy a user would experience thanks to their semantic entropy filter. And then they test it and a number of other error-catching approaches on a huge range of topics: trivia and general knowledge, biology, and a set of Google search queries.

Two things became apparent during these tests. One is that, except for a few edge cases, semantic entropy caught more false answers than any other method. The second is that most errors produced by LLMs appear to be confabulations. That can be inferred from the fact that some of the other methods catch a variety of error types, yet they were outperformed by the semantic entropy tests, even though those tests only catch confabulations.

Beyond simple facts

The researchers also demonstrate that the system can be adapted to work with more than basic factual statements by extending it to handle biographies, which are essentially large collections of individual facts. They developed software that broke biographical text down into a set of individual factual statements and evaluated each of them using semantic entropy. This worked on short biographies containing as many as 150 individual factual claims.
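
A rough sketch of how that decomposition might look, with the claim-splitting and scoring steps deliberately simplified (the paper describes its own pipeline; the helper names below are ours):

```python
def split_into_claims(biography: str) -> list[str]:
    """Stand-in decomposition: treat each sentence as one factual claim.
    The real pipeline rewrites the text into atomic factual statements."""
    return [s.strip() for s in biography.split(".") if s.strip()]

def flag_likely_confabulations(biography, sample_restatements, semantic_entropy, threshold=1.0):
    """Score each claim by the semantic entropy of independently regenerated
    versions of it, and flag claims whose entropy exceeds `threshold`.

    `sample_restatements(claim)` should return several fresh answers covering
    the same fact (e.g., by re-asking the model); `semantic_entropy(answers)`
    maps a list of answers to a single uncertainty score, such as a
    one-argument wrapper around the cluster-entropy measure sketched earlier.
    """
    flagged = []
    for claim in split_into_claims(biography):
        entropy = semantic_entropy(sample_restatements(claim))
        if entropy > threshold:
            flagged.append((claim, entropy))
    return flagged

# Toy demo with stand-ins: when every regenerated sample agrees, nothing is flagged.
demo_bio = "Marie Curie was born in Warsaw. She won two Nobel Prizes."
print(flag_likely_confabulations(
    demo_bio,
    sample_restatements=lambda claim: [claim] * 5,  # pretend the model always agrees
    semantic_entropy=lambda answers: 0.0 if len(set(answers)) == 1 else 2.0,
))  # -> []
```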

Overall, this seems to be a highly flexible system that doesn’t require major new developments to put into practice and could provide some significant improvements in LLM performance. And, since it only catches confabulations and not other types of errors, it might be possible to combine it with other methods to boost performance even further.

As the researchers note, the work also implies that, buried in the statistics of answer options, LLMs seem to have all the information needed to know when they’ve got the right answer; it’s just not being leveraged. As they put it, “The success of semantic entropy at detecting errors suggests that LLMs are even better at ‘knowing what they don’t know’ than was argued… they just don’t know they know what they don’t know.”

Nature, 2024. DOI: 10.1038/s41586-024-07421-0
