ChatGPT can pass US medical licence exams, study claims

AI-generated answers showed ‘new, non-obvious and clinically valid’ insights in tests usually taken by students after years of study

February 9, 2023

Answers generated by artificial intelligence can pass the examinations needed to be granted a medical licence in the US, a new study has claimed.

Researchers said OpenAI’s software ChatGPT scored at or around the 60 per cent pass threshold across the series of three tests that make up the United States Medical Licensing Examination (USMLE), with “coherent” responses that “contained frequent insights”.

Achieving a pass in the “notoriously difficult” assessments – usually taken by medical students after at least two years of study – was seen as a “milestone” for the development of AI tools that could have wide-reaching implications for medical education, according to the study’s authors.

But other academics questioned the validity of the findings, published in the open access journal Plos Digital Health, and called the study a publicity stunt for the healthcare company that backed the researchers involved.

Author Tiffany Kung – a clinical fellow in anaesthesia at Massachusetts General Hospital, part of Harvard Medical School – and colleagues used 350 questions from the June 2022 USMLE, incorporating most medical disciplines from biochemistry to diagnostic reasoning.

Their paper found that, after indeterminate responses were removed, ChatGPT scored between 52.4 per cent and 75 per cent across the exams, which usually have a pass threshold of around 60 per cent.


They add that ChatGPT also demonstrated 94.6 per cent concordance across all its responses and produced at least one significant insight – defined as “something that was new, non-obvious and clinically valid” – for 88.9 per cent of its responses.

These were higher scores than those achieved by another AI chatbot, PubMedGPT, which had been trained exclusively on biomedical domain literature. It scored 50.8 per cent on an older dataset of USMLE-style questions.

The authors note that the sample size of questions used was relatively small but feel their study provides “a glimpse of ChatGPT’s potential to enhance medical education, and eventually, clinical practice”.

A preprint of the article circulated on social media had listed ChatGPT as an author because the researchers had asked it to “synthesise, simplify and offer counterpoints to drafts in progress”. The chatbot’s citation was removed ahead of final publication, but Dr Kung stressed that it had “contributed substantially to the writing of [our] manuscript”.

Reacting to the study, Peter Bannister, executive chair of the Institution of Engineering and Technology, said ChatGPT “continues to demonstrate an impressive ability to generate logical content in numerous settings” and the results “serve to highlight the limitations of written tests as the only way of assessing performance in complex and multidisciplinary professions such as medicine”.

“While the results may be of great interest, the study has important limitations that call for caution,” warned Lucía Ortiz de Zárate Alcarazo, a pre-doctoral researcher in the ethics and governance of artificial intelligence at the Autonomous University of Madrid.

“We will have to wait and see what results are obtained when ChatGPT is applied to a larger number of questions and, in turn, is trained with a larger volume of data and more specialised content,” she said.

Ms Ortiz de Zárate Alcarazo added that the results had only been evaluated by two doctors and further studies would need to employ a larger number of qualified evaluators to be able to endorse the findings. 

Collin Bjork, senior lecturer in science communication at Massey University, said the claim that ChatGPT could pass the exams was “overblown and should come with a lengthy series of asterisks”.

He noted that all but one of the authors work for Ansible Health, a Silicon Valley-based healthcare start-up that was likely to need more investment capital soon. “The media splash from this well-timed journal article will certainly help fund their next round of growth,” Dr Bjork said.

He added that claims about the insight shown by the chatbot were “misleading” because of the “vague” definition the researchers used for what constituted an insight. Claims that AI would one day be able to teach medicine were “naive”, Dr Bjork said. “How can an unaware learner distinguish between true and false insights, especially when ChatGPT only offers ‘accurate’ answers on the USMLE a little more than half the time?”



Reader's comments (2)

Anyone reading the actual paper with a medical background will realise that there is zero visibility on the MCQ sample that ChatGPT is supposed to have successfully answered. Most likely, there is a strong bias towards questions not requiring differential diagnosis or pathophysiological reasoning, namely those for which the answer exists in near-literal form in one of the corpora crawled by the LLM.
So back to viva voce exams, in person, with no external links or practical exams in labs?