Times are displayed in (UTC-08:00) Pacific Time (US & Canada)
11/12/2024 | 10:30 AM – 12:00 PM | Franciscan B
S75: AI for Medical Diagnosis - Patho-logical
Presentation Type: Oral
Session Chair:
Majid Afshar, MD, MSCR - University of Wisconsin - Madison
Influence of a Large Language Model on Diagnostic Reasoning: A Randomized Clinical Vignette Study
Presentation Time: 10:30 AM - 10:45 AM
Abstract Keywords: Large Language Models (LLMs), Clinical Decision Support, Natural Language Processing, Diagnostic Systems, Machine Learning, Interoperability and Health Information Exchange, Deep Learning, Human-computer Interaction
Primary Track: Applications
Programmatic Theme: Clinical Research Informatics
Question: Does the use of a large language model (LLM) improve diagnostic reasoning performance among physicians compared to conventional diagnostic resources?
Findings: The use of GPT-4 by physicians did not significantly enhance diagnostic reasoning performance compared to use of conventional resources. GPT-4 alone performed better than both groups of physicians.
Meaning: While GPT-4 alone demonstrates substantial diagnostic reasoning performance, the use of GPT-4 as a diagnostic aid may not enhance diagnostic reasoning beyond conventional resources.
Speaker(s):
Robert Gallo, MD
VA Palo Alto Health Care System
Author(s):
Ethan Goh, MD, MS - Stanford University; Robert Gallo, MD - VA Palo Alto Health Care System; Jason Hom, MD - Stanford University School of Medicine; Eric Strong, MD - Stanford University School of Medicine; Yingjie Weng, MHS - Stanford University School of Medicine; Hannah Kerman, MD - Beth Israel Deaconess Medical Center; Josephine Cool, MD - Beth Israel Deaconess Medical Center; Zahir Kanjee, MD, MPH - Beth Israel Deaconess Medical Center; Andrew S. Parsons, MD, MPH - University of Virginia School of Medicine; Neera Ahuja, MD - Stanford University School of Medicine; Eric Horvitz, PhD, MD - Microsoft Research; Daniel Yang, MD - Kaiser Permanente; Arnold Milstein, MD, MPH - Clinical Excellence Research Center, Stanford University School of Medicine; Andrew Olson - University of Minnesota Medical School Twin Cities; Adam Rodman, MD, MPH - Beth Israel Deaconess Medical Center; Jonathan Chen - Stanford University Hospital;
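The findings above come from a randomized comparison of per-case diagnostic reasoning scores between physicians using GPT-4 and physicians using conventional resources. Purely as an illustration of that kind of between-arm comparison, here is a minimal sketch with hypothetical scores and a simple two-sample t-test; it is not the study's actual data or statistical model.

```python
# Illustrative sketch only (hypothetical data, not the study's analysis):
# compare per-vignette diagnostic reasoning scores between an LLM-assisted
# arm and a conventional-resources arm with a two-sample t-test.
from scipy import stats

llm_arm = [76, 81, 70, 74, 79, 77]           # percent scores, hypothetical
conventional_arm = [74, 73, 71, 78, 75, 72]  # percent scores, hypothetical

t, p = stats.ttest_ind(llm_arm, conventional_arm)
print(f"t = {t:.2f}, p = {p:.3f}")  # p >= 0.05 suggests no significant difference
```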
Clinician Perspectives on Adoption and Usability of a Digital Clinical Image Education and Differential Diagnosis Tool: A Mixed-Methods Study
Presentation Time: 10:45 AM - 11:00 AM
Abstract Keywords: Clinical Decision Support, Health Equity, Qualitative Methods, Workflow, Surveys and Needs Analysis
Primary Track: Applications
Programmatic Theme: Clinical Informatics
We designed a quality improvement study to understand clinicians’ perspectives on the use of the VisualDx™ tool across the M Health Fairview (MHFV) system. Using surveys and interviews, we characterized usage of the tool along with its barriers, benefits, and suggestions for its adoption. VisualDx™ offers multidimensional functionality that benefits clinicians in direct patient care and in patient and student education. However, its adoption requires effective dissemination, easier access, and more training for end users.
Speaker(s):
Sameepya Thatipelli, B.S.
University of Minnesota Medical School-Twin Cities
Author(s):
Rubina Rizvi, MD, PhD - University of Minnesota; Matthew Loth, PhD - Institute for Health Informatics (IHI), University of Minnesota; Elizabeth Lindemann, MHA - Fairview Health Services; Tamara Kasal, MBA - Fairview Health Services; Leyla Warsame, MD - M Health Fairview; Iva Ninkovic, MPH - Fairview Health Services; Rebecca Markowitz, MD - Fairview Health System; Sonja Short, MD - Fairview Health Services; Genevieve Melton-Meaux, MD, PhD - University of Minnesota;
Automated Stratification of Trauma Injury Severity Across Multiple Body Regions Using Multimodal, Multiclass Machine Learning Models
Presentation Time: 11:00 AM - 11:15 AM
Abstract Keywords: Diagnostic Systems, Machine Learning, Natural Language Processing, Deep Learning, Clinical Decision Support
Primary Track: Foundations
The timely stratification of trauma injury severity can enhance the quality of trauma care, but it requires intensive manual annotation by certified trauma coders. The objective of this study is to develop machine learning models that stratify trauma injury severity across various body regions using clinical text and structured electronic health record (EHR) data. Our study utilized clinical documents and structured EHR variables linked with trauma registry data to create two machine learning models with different approaches to representing text: the first fuses concept unique identifiers (CUIs) extracted from free text with structured EHR variables, while the second integrates the free text itself with structured EHR variables. Temporal validation was undertaken to ensure the models' temporal generalizability, and variable importance analyses were conducted. Both models performed strongly in categorizing leg injuries, achieving macro-F1 scores above 0.8, and showed considerable accuracy, with macro-F1 scores exceeding or approaching 0.7, in assessing chest and head injuries. Our variable importance analysis showed that the most important model features have strong face validity for determining clinically relevant trauma injuries. Both models can provide accurate stratification of trauma injury severity along with clinically relevant interpretations. The CUI-based model achieves comparable, if not higher, performance than the free-text-based model, with reduced complexity. Furthermore, integrating structured EHR data improves performance, particularly when the text modalities are insufficiently indicative.
Speaker(s):
Jifan Gao, MS
University of Wisconsin-Madison
Author(s):
Jifan Gao, MS - University of Wisconsin-Madison; Guanhua Chen, PhD - University of Wisconsin-Madison; Ann O’Rourke, MD, MPH, FACS - University of Wisconsin-Madison; John Caskey - University of Wisconsin-Madison; Kyle Carey, MPH - University of Chicago; Madeline Oguss, MS - University of Wisconsin - Madison; Anne Stey, MD, MSc - Northwestern University; Dmitriy Dligach, Ph.D. - Loyola University Chicago; Tim Miller, PhD - Children's Hospital Boston/Harvard Medical School; Anoop Mayampurath, PhD - University of Wisconsin - Madison; Matthew Churpek, MD, MPH, PhD - University of Wisconsin-Madison; Majid Afshar, MD, MSCR - University of Wisconsin - Madison;
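The abstract does not include implementation details; the sketch below is a minimal, hypothetical illustration of the first model's approach as described: fusing bag-of-CUIs text features with structured EHR variables for multiclass severity prediction, scored with macro-F1. The toy CUIs, EHR variables, and labels are placeholders, and a real pipeline would use a temporal train/test split as the abstract describes.

```python
# Minimal sketch (not the authors' code): early fusion of bag-of-CUIs text
# features with structured EHR variables for multiclass injury-severity
# prediction, scored with macro-F1 as in the abstract.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Hypothetical inputs: space-delimited CUIs extracted from each patient's
# notes, structured EHR variables (e.g., heart rate, systolic BP), and
# injury-severity labels (0 = minor, 1 = moderate, 2 = severe).
cuis = ["C0019699 C0006826", "C0032285", "C0006826 C0032285 C0019699"] * 10
ehr = np.array([[72.0, 118.0], [88.0, 95.0], [104.0, 82.0]] * 10)
y = np.array([0, 1, 2] * 10)

X_text = CountVectorizer().fit_transform(cuis).toarray()  # bag-of-CUIs features
X = np.hstack([X_text, ehr])                              # fuse with EHR variables

# Fit a multiclass classifier and score with macro-F1. (Training and scoring
# on the same toy data here is purely for illustration.)
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("macro-F1:", f1_score(y, clf.predict(X), average="macro"))
```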
Improving the Performance of LLM-Based Semi-Automated Psychiatric Case Diagnosis using Decision Tree-Based Prompting
Presentation Time: 11:15 AM - 11:30 AM
Abstract Keywords: Diagnostic Systems, Natural Language Processing, Large Language Models (LLMs), Rule-based artificial intelligence
Primary Track: Applications
Programmatic Theme: Clinical Research Informatics
The knowledge and reasoning capacity of LLMs in psychiatry is an active topic of interest. In this work, we demonstrate a decision tree-based approach to prompting LLMs to provide semi-automated case diagnoses for standardized psychiatric scenarios, drawing on previously developed structured diagnostic pathways within psychiatry. We found that this approach improved the recall and significantly improved the precision of LLM-predicted diagnoses from case vignettes, suggesting that LLMs hold promise for detecting and reasoning about psychiatric symptoms and diagnoses.
Speaker(s):
Kaitlin Hanss, MD, MPH
UCSF
Author(s):
Karthik Sarma, MD PhD - UCSF; Anne Glowinski, MD MPE - UCSF; Atul Butte, MD, PhD - University of California, San Francisco; Andrew Halls, MD - UCSF;
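As a sketch of what decision tree-based prompting can look like: walk a diagnostic decision tree and, at each node, put one yes/no screening question to the LLM alongside the case vignette, branching on the parsed answer until a leaf diagnosis is reached. The tree content and the `ask_llm` stub below are hypothetical stand-ins, not the authors' pathways or prompts.

```python
# Hypothetical sketch (not the authors' implementation) of decision
# tree-based prompting: traverse a diagnostic tree, asking the LLM one
# yes/no question per node about the case vignette.
from dataclasses import dataclass

@dataclass
class Node:
    question: str      # yes/no question posed to the LLM
    yes: "Node | str"  # next node, or leaf diagnosis, if the answer is yes
    no: "Node | str"   # next node, or leaf diagnosis, if the answer is no

def ask_llm(vignette: str, question: str) -> bool:
    """Placeholder for an LLM call: a real version would send the prompt
    below to a model and parse a yes/no answer from the response."""
    prompt = f"Case: {vignette}\nQuestion: {question}\nAnswer yes or no."
    # response = llm_client.complete(prompt)  # hypothetical LLM call
    return "depressed mood" in vignette.lower()  # keyword stub for the demo

def diagnose(vignette: str, node: "Node | str") -> str:
    while isinstance(node, Node):
        node = node.yes if ask_llm(vignette, node.question) else node.no
    return node  # leaf reached: the diagnosis

# Toy two-level tree, loosely in the spirit of structured diagnostic pathways.
tree = Node(
    question="Does the patient report a persistent depressed mood?",
    yes=Node("Have symptoms lasted two weeks or more?",
             yes="Consider major depressive disorder",
             no="Consider adjustment disorder"),
    no="Depressive disorder not suggested by screening",
)
print(diagnose("A 34-year-old reports depressed mood for a month.", tree))
```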
Development of a Human Evaluation Framework and Correlation with Automated Metrics for Natural Language Generation of Medical Diagnoses
Presentation Time: 11:30 AM - 11:45 AM
Abstract Keywords: Evaluation, Large Language Models (LLMs), Natural Language Processing, Clinical Decision Support
Primary Track: Applications
Programmatic Theme: Clinical Informatics
In the evolving landscape of clinical Natural Language Generation (NLG), assessing abstractive text quality remains challenging, as existing methods often overlook generative task complexities. This work aimed to examine the current state of automated evaluation metrics for NLG in healthcare. To establish a robust and well-validated baseline against which to examine the alignment of these metrics, we created a comprehensive human evaluation framework. Using ChatGPT-3.5-turbo generative output, we correlated human judgments with each metric. None of the metrics demonstrated high alignment; however, the SapBERT score, a Unified Medical Language System (UMLS)-based metric, showed the best results. This underscores the importance of incorporating domain-specific knowledge into evaluation efforts. Our work reveals the deficiency in quality evaluations for generated text and introduces our comprehensive human evaluation framework as a baseline. Future efforts should prioritize integrating medical knowledge databases to enhance the alignment of automated metrics, particularly focusing on refining the SapBERT score for improved assessments.
Speaker(s):
Emma Croxford, PhD Student
University of Wisconsin Madison
Author(s):
Majid Afshar, MD, MSCR - University of Wisconsin - Madison; Yanjun Gao, PhD - University of Wisconsin Madison; Brian Patterson, MD MPH - University of Wisconsin-Madison; Daniel To - University of Wisconsin, Madison - UW Health; Samuel Tesch, Medical Student/MD - University of Wisconsin School of Medicine and Public Health; Anoop Mayampurath, PhD - University of Wisconsin - Madison; Matthew Churpek, MD, MPH, PhD - University of Wisconsin-Madison; Dmitriy Dligach, Ph.D. - Loyola University Chicago;
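As a rough illustration of the kind of metric-versus-human comparison the abstract describes: embed a generated diagnosis and its reference with a public SapBERT checkpoint, score the pair by cosine similarity, and correlate the metric's scores with human ratings using Spearman's rho. The toy pairs and ratings are hypothetical, and this is an assumed reading of the SapBERT score, not the paper's exact implementation.

```python
# Sketch under assumptions (not the paper's exact metric): SapBERT-based
# similarity between generated and reference diagnoses, correlated with
# human judgments via Spearman's rho.
import torch
from transformers import AutoModel, AutoTokenizer
from scipy.stats import spearmanr

name = "cambridgeltl/SapBERT-from-PubMedBERT-fulltext"  # public SapBERT model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def embed(text: str) -> torch.Tensor:
    """[CLS] embedding, as commonly used with SapBERT."""
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt", truncation=True))
    return out.last_hidden_state[:, 0, :].squeeze(0)

def sapbert_score(generated: str, reference: str) -> float:
    return torch.cosine_similarity(embed(generated), embed(reference), dim=0).item()

# Toy generated/reference pairs and hypothetical human ratings (1-5 scale).
pairs = [("acute myocardial infarction", "heart attack"),
         ("community-acquired pneumonia", "lung infection"),
         ("type 2 diabetes mellitus", "fractured femur")]
metric = [sapbert_score(g, r) for g, r in pairs]
human = [5, 4, 1]
rho, _ = spearmanr(metric, human)
print("Spearman rho:", rho)
```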
Pheval: Evaluating the Use of LLMs to Improve Differential Diagnosis
Presentation Time: 11:45 AM - 12:00 PM
Abstract Keywords: Large Language Models (LLMs), Diagnostic Systems, Bioinformatics, Clinical Decision Support
Working Group: Genomics and Translational Bioinformatics Working Group
Primary Track: Applications
Programmatic Theme: Translational Bioinformatics
Differential diagnosis is crucial in rare disease research, and state-of-the-art software such as Exomiser performs this task, but systematically comparing Exomiser with other strategies is challenging. We constructed Pheval to compare performance across strategies using a set of solved cases, and we used it to evaluate the performance of LLMs on this task. Our results indicate that LLMs currently do not outperform state-of-the-art software for differential diagnosis, and they demonstrate the utility of Pheval in evaluating different strategies.
Speaker(s):
Justin Reese, PhD
Lawrence Berkeley National Laboratory
Author(s):
Julius Jacobsen, PhD - Queen Mary University of London; Yasemin Bridges, MSc - Queen Mary University of London; Carlo Kroll, BS - Queen Mary University of London; Harry Caufield, PhD - Lawrence Berkeley National Laboratory; Harshad Hegde, MS - Lawrence Berkeley National Laboratory; Nicolas Matentzoglu, PhD - Semanticly; Melissa Haendel, PhD - CU Anschutz; Damian Smedley, PhD - Queen Mary University of London; Peter Robinson, MD, PhD - Berlin Institute of Health, Charité Universitätsmedizin; Christopher Mungall, PhD - Lawrence Berkeley National Laboratory; Justin Reese - Lawrence Berkeley National Laboratory;
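As a minimal, hypothetical sketch of the rank-based comparison this kind of benchmarking implies (not Pheval's actual API): score each strategy's ranked differential lists against the known diagnoses of solved cases using top-k hit rate and mean reciprocal rank.

```python
# Hypothetical sketch (not Pheval's API): evaluate ranked differential
# diagnoses from any strategy (Exomiser, an LLM, ...) against solved cases
# with top-k hit rate and mean reciprocal rank (MRR).
def evaluate(ranked_lists: list[list[str]], truths: list[str], k: int = 3) -> dict:
    hits_at_k, rr = 0, 0.0
    for ranking, truth in zip(ranked_lists, truths):
        if truth in ranking[:k]:
            hits_at_k += 1          # true diagnosis appears in the top k
        if truth in ranking:
            rr += 1.0 / (ranking.index(truth) + 1)  # reciprocal of its rank
    n = len(truths)
    return {f"top_{k}": hits_at_k / n, "mrr": rr / n}

# Toy solved cases: each strategy returns a ranked differential per case.
llm_output = [["Marfan syndrome", "Ehlers-Danlos syndrome"],
              ["cystic fibrosis", "asthma", "bronchiectasis"]]
truths = ["Ehlers-Danlos syndrome", "bronchiectasis"]
print(evaluate(llm_output, truths))  # {'top_3': 1.0, 'mrr': 0.4166...}
```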