Hidden Flaws Behind Expert-Level Accuracy of GPT-4 Vision in Medicine
Poster Number: P26
Presentation Time: 05:00 PM - 06:30 PM
Abstract Keywords: Large Language Models (LLMs), Imaging Informatics, Deep Learning
Primary Track: Applications
We conducted a comprehensive evaluation of GPT-4V’s rationales when solving NEJM Image Challenges. We show that GPT-4V achieves higher multiple-choice accuracy than expert physicians (88.0% vs. 77.0%). However, we found that GPT-4V frequently presents flawed rationales even in cases where it makes the correct final choice (27.3%), most often in image comprehension. Our findings therefore emphasize the need for in-depth evaluation of such models before they are integrated into clinical workflows.
Speaker(s):
Qiao Jin, M.D.
National Institutes of Health
Author(s):
Qiao Jin, M.D. - National Institutes of Health; Fangyuan Chen, B.S. - University of Pittsburgh; Yiliang Zhou, Master - Weill Cornell Medicine; Ziyang Xu, MD, PhD - New York University Grossman School of Medicine; Justin Cheung, MD - Harvard Medical School and Massachusetts General Hospital; Robert Chen, MD - Weill Cornell Medicine; Ronald Summers, MD, PhD - National Institutes of Health; Justin Rousseau, MD, MMSc - University of Texas Southwestern Medical Center; Peiyun Ni, MD - Harvard Medical School and Massachusetts General Hospital; Marc Landsman, MD - Case Western Reserve University School of Medicine; Sally Baxter, MD, MSc - University of California - San Diego; Subhi Al'Aref, MD - University of Arkansas for Medical Sciences; Yijia Li, MD - University of Pittsburgh Medical Center; Michael Chiang, MD - National Institutes of Health, National Eye Institute; Yifan Peng, PhD - Weill Cornell Medicine, Dept of Population Health Sciences, Div of Health Informatics; Zhiyong Lu, PhD - National Library of Medicine, NIH
Category
Poster - Regular
Description
Date: Monday (11/11)
Time: 05:00 PM to 06:30 PM
Room: Grand Ballroom (Posters)