According to a new study published in the journal Royal Society Open Science, large language models (LLMs) such as ChatGPT and DeepSeek often exaggerate scientific findings while summarising research papers. Researchers Uwe Peters from Utrecht University and Benjamin Chin-Yee from Western University and the University of Cambridge analysed 4,900 AI-generated summaries from ten leading LLMs. Their findings revealed that up to 73 percent of summaries contained overgeneralised or inaccurate conclusions. Surprisingly, the problem worsened when users explicitly prompted the models to prioritise accuracy, and newer models such as ChatGPT-4 performed worse than older versions.
What are the findings of the study
The study assessed how accurately leading LLMs summarised abstracts and full-length articles from prestigious science and medical journals, including Nature, Science, and The Lancet. Over a period of one year, the researchers collected and analysed 4,900 summaries generated by AI systems such as ChatGPT, Claude, DeepSeek, and LLaMA.
Six out of ten models routinely exaggerated claims, often by changing cautious, study-specific statements like “The treatment was effective in this study” into broader, definitive assertions like “The treatment is effective.” These subtle shifts in tone and tense can mislead readers into thinking that scientific findings apply more broadly than they actually do.
Why are these exaggerations happening
The tendency of AI models to exaggerate scientific findings appears to stem from both the data they are trained on and the behaviour they learn from user interactions. According to the study’s authors, one major reason is that overgeneralisations are already common in scientific literature. When LLMs are trained on this content, they learn to replicate the same patterns, often reinforcing existing flaws rather than correcting them.
Another contributing factor is user preference. Language models are optimised to generate responses that sound helpful, fluent, and widely applicable. As co-author Benjamin Chin-Yee explained, the models may learn that generalisations are more pleasing to users, even if they distort the original meaning. This results in summaries that may appear authoritative but fail to accurately represent the complexities and limitations of the research.
Accuracy prompts backfire
Contrary to expectations, prompting the models to be more accurate actually made the problem worse. When instructed to avoid inaccuracies, the LLMs were nearly twice as likely to produce summaries with exaggerated or overgeneralised conclusions compared to when given a simple, neutral prompt.
“This effect is concerning,” said Peters. “Students, researchers, and policymakers may assume that if they ask ChatGPT to avoid inaccuracies, they’ll get a more reliable summary. Our findings prove the opposite.”
Humans still do better
To compare AI and human performance directly, the researchers analysed summaries written by people alongside those generated by chatbots. The results showed that AI was nearly five times more likely to make broad generalisations than human writers. This gap underscores the need for careful human oversight when using AI tools in scientific or academic contexts.
Recommendations for safer use
To mitigate these risks, the researchers recommend using models like Claude, which demonstrated the highest generalisation accuracy in their tests. They also suggest setting LLMs to a lower "temperature" to reduce creative embellishments and using prompts that encourage past-tense, study-specific reporting.
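To make the temperature and prompting advice concrete, the sketch below shows one way to request a cautious, study-specific summary through the OpenAI Python SDK. It is a minimal illustration only: the model name, temperature value, and prompt wording are assumptions for demonstration, not settings evaluated in the study.

```python
# Minimal sketch: a low-temperature, study-specific summarisation request.
# Assumes the OpenAI Python SDK (pip install openai) and an API key in the
# OPENAI_API_KEY environment variable. Model, temperature, and prompt are
# illustrative choices, not the study's protocol.
from openai import OpenAI

client = OpenAI()

abstract = "..."  # paste the abstract to be summarised here

response = client.chat.completions.create(
    model="gpt-4o",      # any chat model; newer does not guarantee safer output
    temperature=0.2,     # lower temperature to curb creative embellishment
    messages=[
        {
            "role": "system",
            "content": (
                "Summarise the following abstract in the past tense, limited "
                "to the study's own sample, setting, and conditions. Do not "
                "generalise beyond what the authors report."
            ),
        },
        {"role": "user", "content": abstract},
    ],
)

print(response.choices[0].message.content)
```

Even with such settings, the study’s finding that accuracy-focused instructions can backfire suggests any generated summary should still be checked against the original paper.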
“If we want AI to support science literacy rather than undermine it,” Peters noted, “we need more vigilance and testing of these systems in science communication contexts.”