What Evidence Do Language Models Find Convincing? (Paper Summary)

What is the best way to improve documents for search?

What happens when RAG models are provided with documents that have conflicting information?

This is one of the most difficult challenges to resolve in deploying RAG systems, especially on top of constantly changing/amended datasets. The paper, “What Evidence Do Language Models Find Convincing?”, sets out to answer this question. Their analysis is worth paying attention to.

Retrieval-augmented language models are being increasingly tasked with subjective, contentious, and conflicting queries such as “is aspartame linked to cancer”. To resolve these ambiguous queries, one must search through a large range of websites and consider “which, if any, of this evidence do I find convincing?”. In this work, we study how LLMs answer this question. In particular, we construct ConflictingQA, a dataset that pairs controversial queries with a series of real-world evidence documents that contain different facts (e.g., quantitative results), argument styles (e.g., appeals to authority), and answers (Yes or No). We use this dataset to perform sensitivity and counterfactual analyses to explore which text features most affect LLM predictions. Overall, we find that current models rely heavily on the relevance of a website to the query, while largely ignoring stylistic features that humans find important such as whether a text contains scientific references or is written with a neutral tone. Taken together, these results highlight the importance of RAG corpus quality (e.g., the need to filter misinformation), and possibly even a shift in how LLMs are trained to better align with human judgements.

As the old saying goes: Garbage In, Garbage Out. High-quality data is the most important variable in developing your AI systems. With RAG, this is as true for inference as it is for training. Consider the most common searches your users run, and restructure your data/chunks to make retrieving that information easier.

“We find that the question-paragraph similarity correlates most strongly with convincingness.”

“For example, we find that the question-paragraph similarity correlates most strongly with convincingness… We find that current models rely heavily on a website’s relevance, while largely ignoring stylistic features that humans find important.” This means that an LLM will prioritize a source that explicitly states, “This article is about whether aspartame is linked to cancer”, while ignoring important aspects like citations.
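To make "question-paragraph similarity" concrete, here is a minimal sketch. It is not the paper's measurement: it assumes a simple bag-of-words cosine similarity as a stand-in for whatever similarity measure the authors used, and the two example paragraphs and the helper names (`tokenize`, `cosine_similarity`) are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(question, paragraph):
    """Cosine similarity between bag-of-words count vectors (an assumed stand-in metric)."""
    q_counts = Counter(tokenize(question))
    p_counts = Counter(tokenize(paragraph))
    # Only tokens present in the question can contribute to the dot product.
    dot = sum(q_counts[t] * p_counts[t] for t in q_counts)
    q_norm = math.sqrt(sum(c * c for c in q_counts.values()))
    p_norm = math.sqrt(sum(c * c for c in p_counts.values()))
    if q_norm == 0 or p_norm == 0:
        return 0.0
    return dot / (q_norm * p_norm)


question = "Is aspartame linked to cancer?"
# Hypothetical paragraph that restates the query but offers no evidence.
on_topic = "This article examines whether aspartame is linked to cancer."
# Hypothetical paragraph that cites sources but never repeats the query terms.
well_cited = "Several peer-reviewed studies [1][2] report no significant risk from the sweetener."

score_on_topic = cosine_similarity(question, on_topic)
score_well_cited = cosine_similarity(question, well_cited)
print(f"on-topic: {score_on_topic:.2f}, well-cited: {score_well_cited:.2f}")
```

The paragraph that echoes the query scores higher than the one that carries the citations. If a model's sense of "convincing" tracks this kind of similarity, that is the failure mode the quote describes.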

This matches my own subjective experience and what I’ve written so far about LLMs and RAG → The best way to improve performance is to restructure your data to make it more searchable. Teams will throw money at vector DBs, LLMs, and 20 engineers when similar results could be gained by employing a bunch of people to reannotate data, make user inputs match templates, and chunk the data better.
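As one illustration of "restructuring your data to make it more searchable", here is a rough sketch of chunking that keeps each chunk tied to its topic. Neither this post nor the paper prescribes a scheme; the function name `chunk_with_context`, the window sizes, and the idea of prefixing each chunk with its document title are all assumptions chosen for the example.

```python
def chunk_with_context(title, text, max_words=50, overlap=10):
    """Split text into overlapping word windows, each prefixed with the title.

    Hypothetical helper: the title prefix is meant to keep every chunk
    explicitly on-topic for retrieval, even when the body text is not.
    """
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = " ".join(words[start:start + max_words])
        chunks.append(f"{title}: {window}")
        # Stop once a window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


# Usage with placeholder text: 120 words become three overlapping chunks.
sample_text = " ".join(f"w{i}" for i in range(120))
sample_chunks = chunk_with_context("Aspartame and cancer risk", sample_text)
print(len(sample_chunks))
```

Whether a title prefix, a section heading, or a templated question works best will depend on the corpus and the retriever, so treat the parameters here as starting points to test rather than recommendations.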

