New chatbot ‘outperforms PhDs on literature reviews’

Cheap research round-ups by ‘hallucination-free’ OpenScholar model preferred by experts, says Nature study

Published on February 4, 2026

A new chatbot designed by scholars can outperform PhD students and postdocs in undertaking scientific literature reviews, according to a Nature study that says the large language model (LLM) is capable of producing reliable summaries for less than a penny.

Evaluating a new model designed to stop ChatGPT’s frequent “hallucinations” when it conducts literature reviews, US researchers asked experts in computer science, physics, neuroscience and biomedicine to assess summaries written by OpenScholar and a spin-off version ScholarQABench against reviews written by PhD students.

According to the study, published on 4 February, the domain-level experts – also PhDs and postdocs – preferred OpenScholar and ScholarQABench responses 51 per cent and 70 per cent of the time respectively.

Their advantage is "primarily attributed to their ability to provide a greater breadth and depth of information", having produced reviews roughly two to three times as long as the PhD-written summaries (706 and 1,447 words on average, compared with the 424-word average of human-written reviews), notes the paper.

By contrast, ChatGPT-written summaries were preferred over human-written responses in nearly a third of cases (31 per cent), as they "struggled with information coverage", says the study, titled "Synthesizing scientific literature with retrieval-augmented language models".

Importantly, OpenScholar did not hallucinate in the way that ChatGPT-4 or other LLMs such as Llama do; these models produce false citations in "78 to 90 per cent of cases when asked to cite recent literature across fields such as computer science and biomedicine", explains the paper, which notes that no hallucinations were found in reviews created for computer science or biomedicine by either OpenScholar LLM.

In contrast, other LLMs produced “plausible-looking reference lists” yet “78–98 per cent of titles are fabricated, with the worst rates in biomedicine”, says the study, which found “even when citations refer to real papers, most of them are not substantiated by the corresponding abstracts, resulting in near-zero citation accuracy”.

Unlike other LLMs, which are trained on the entire internet, OpenScholar's 8B model is built on a corpus of 45 million scientific papers and is designed to create a "self-feedback loop to improve factuality, coverage and citation accuracy", explains the Nature paper. The LLM has been used by more than 30,000 people since its demo launched, collecting nearly 90,000 user enquiries.

According to the study, OpenScholar's literature reviews cost either 1 cent (0.7p) or 5 cents (3.5p), depending on the pricing model, allowing scholars to run thousands of searches every month.

Introducing the OpenScholar models, the paper's authors state that the study's "results, together with the substantial reduction in citation hallucinations, demonstrate the potential of OpenScholar to support and accelerate future research efforts".

While acknowledging that the "system still has limitations" and emphasising that "language model-based systems cannot fully automate scientific literature synthesis", the authors are making both ScholarQABench and OpenScholar available to the community to encourage ongoing research and refinement.

jack.grove@timeshighereducation.com
