Clinicians and Chatbots: Boston Hospital Study Compares Their Clinical Reasoning
Although artificial intelligence has the potential to improve and transform health care, doctors aren’t likely to leave the decision-making to chatbots anytime soon.
Physician-scientists at Beth Israel Deaconess Medical Center (BIDMC) in Boston studied how well an AI chatbot program processed medical data and demonstrated clinical reasoning compared with clinicians completing the same tasks. Results from the study, shared in a research letter published in JAMA Internal Medicine, showed that ChatGPT-4, a large language model, can make a clinical diagnosis as well as humans — and in some cases, better — but the chatbot had more instances of incorrect clinical reasoning.
“The finding underscores the notion that AI will likely be most useful as a tool to augment, not replace, the human reasoning process,” according to a BIDMC blog.
The BIDMC investigators recruited 21 attending physicians and 18 residents, each of whom worked through one of 20 clinical cases comprising four sequential stages of diagnostic reasoning. The chatbot received a prompt with identical instructions and ran all 20 cases. To score the answers, the researchers used the revised-IDEA (r-IDEA) score, a validated tool for assessing clinical reasoning, along with several other measures.
Researchers found that ChatGPT-4 earned the highest scores on the r-IDEA assessment, with a median score of 10 out of 10 for the large language model, 9 out of 10 for attending physicians and 8 out of 10 for residents. However, diagnostic accuracy — “how high up the correct diagnosis was on the list of diagnoses they provided” — and correct clinical reasoning were “more of a draw,” the researchers observed. The chatbot also had instances of incorrect reasoning (being “just plain wrong”) in its answers significantly more often than residents did, the study found.