Humans and Large Language Models in Clinical Decision Support: A Study with Medical Calculators
Presentation Time: 03:30 PM - 03:45 PM
Abstract Keywords: Clinical Decision Support, Large Language Models (LLMs), Human-computer Interaction, Information Extraction, Evaluation, Data Mining
Primary Track: Applications
Programmatic Theme: Clinical Informatics
Although large language models (LLMs) have been assessed for general medical knowledge using licensing exams, their ability to support clinical decision-making, such as selecting medical calculators, remains uncertain. We assessed nine LLMs, including open-source, proprietary, and domain-specific models, on 1,009 multiple-choice question-answer pairs spanning 35 clinical calculators, and compared LLMs to humans on a subset of questions. While the highest-performing LLM, OpenAI’s o1, achieved an answer accuracy of 66.0% (CI: 56.7-75.3%) on the 100-question subset, two human annotators nominally outperformed the LLMs with an average answer accuracy of 79.5% (CI: 73.5-85.0%). In sum, we evaluated medical trainees and LLMs on recommending medical calculators across clinical scenarios such as risk stratification and diagnosis. With error analysis showing that even the highest-performing LLMs continue to make mistakes in comprehension (49.3% of errors) and calculator knowledge (7.1% of errors), our findings indicate that LLMs are not superior to humans in calculator recommendation.
Speaker(s):
Nicholas Wan, Bachelor of Engineering
National Institutes of Health
Author(s):
Nicholas Wan, Bachelor of Engineering - National Institutes of Health; Qiao Jin, MD - National Institutes of Health; Joey Chan, MS - National Library of Medicine; Guangzhi Xiong, BA - University of Virginia; Serina Applebaum, BS - Yale School of Medicine; Aidan Gilson, MD - Massachusetts Eye and Ear; Reid McMurry, MD - Boston Medical Center; Richard Taylor, MD - University of Virginia School of Medicine, Department of Emergency Medicine; Aidong Zhang, PhD - University of Virginia; Qingyu Chen, PhD - Yale University; Zhiyong Lu, PhD - National Library of Medicine, NIH
Category
Paper - Student