3/12/2025 | 1:30 PM – 3:00 PM | Urban
S27: Wrangling Clinical Documentation with LLMs
Presentation Type: Podium Abstract
Not the Models You Are Looking For: An Evaluation of Performance, Privacy, and Fairness of LLMs in EHR Tasks
2025 Informatics Summit On Demand
Presentation Time: 01:30 PM - 01:45 PM
Abstract Keywords: Machine Learning, Generative AI, and Predictive Modeling, Data Security and Privacy, Data Mining and Knowledge Discovery
Primary Track: Data Science/Artificial Intelligence
Programmatic Theme: Harnessing the Power of Large Language Models in Health Data Science
Using a private dataset derived from Vanderbilt University Medical Center’s EHR, we compare GPT-3.5, GPT-4, and traditional machine learning (ML), measuring predictive performance, output calibration, the privacy-utility tradeoff, and algorithmic fairness. Traditional ML vastly outperformed GPT-3.5 and GPT-4 in both predictive performance and output probability calibration. We find that traditional ML is far more robust than GPT-3.5 and GPT-4 to efforts to generalize demographic information. Surprisingly, GPT-4 is the fairest model according to our selected metrics. These findings imply that additional research into LLMs is necessary before deploying them as clinical prediction models.
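Below is a minimal sketch of how discrimination, calibration, and a group-fairness gap might be scored side by side for competing models. The metric choices (Brier score, demographic-parity gap), the threshold, and all variable names are illustrative assumptions, not the authors' evaluation code.

```python
# Illustrative sketch (not the authors' code): scoring predicted probabilities
# from two models on discrimination, calibration, and a simple fairness gap.
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

def evaluate(y_true, y_prob, group, threshold=0.5):
    """y_true, y_prob, group are numpy arrays of equal length."""
    y_pred = (y_prob >= threshold).astype(int)
    # Demographic-parity gap: spread in positive-prediction rates across groups.
    rates = [y_pred[group == g].mean() for g in np.unique(group)]
    return {
        "auroc": roc_auc_score(y_true, y_prob),
        "brier": brier_score_loss(y_true, y_prob),  # lower = better calibrated
        "dp_gap": max(rates) - min(rates),
    }

# Hypothetical usage: y_true, probs_ml, probs_gpt4, and `sex` come from the cohort.
# print(evaluate(y_true, probs_ml, sex), evaluate(y_true, probs_gpt4, sex))
```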
Speaker(s):
Katherine Brown, PhD
Vanderbilt University Medical Center
Author(s):
Katherine Brown, PhD - Vanderbilt University Medical Center; Chao Yan, PhD - Vanderbilt University Medical Center; Zhuohang Li, MS - Vanderbilt University; Xinmeng Zhang, BS - Vanderbilt University; Benjamin Collins, MD - Vanderbilt University Medical Center; You Chen, PhD - Vanderbilt University; Ellen Wright Clayton, MD, JD - Vanderbilt Medical Center; Murat Kantarcioglu, PhD - Virginia Tech; Yevgeniy Vorobeychik, PhD - Washington University; Bradley Malin, PhD - Vanderbilt University Medical Center;
Leveraging Open-Source Large-Language Model-Enabled Identification of Undiagnosed Patients with Rare Genetic Aortopathies
2025 Informatics Summit On Demand
Presentation Time: 01:45 PM - 02:00 PM
Abstract Keywords: Machine Learning, Generative AI, and Predictive Modeling, Clinical Decision Support for Translational/Data Science Interventions, Patient-centered Research and Care
Primary Track: Data Science/Artificial Intelligence
Programmatic Theme: Harnessing the Power of Large Language Models in Health Data Science
Hereditary aortopathies are often underdiagnosed, with many patients not receiving genetic testing until after a cardiac event. In this pilot study, we investigate the use of open-source LLMs for recommending genetic testing based on clinical notes. We evaluate the utility of injecting disease-specific knowledge into retrieval-augmented generation (RAG)-based and fine-tuned models. Our result of 93% accuracy using a base model alone surprisingly suggests that incorporating domain knowledge may sometimes hinder clinical model performance.
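A minimal sketch of the RAG-style knowledge injection described above: retrieving disease-specific snippets and prepending them to a clinical note before querying an open-source LLM. The encoder choice, knowledge snippets, and prompt wording are assumptions for illustration, not the study's pipeline.

```python
# Illustrative sketch (assumptions, not the study's pipeline): retrieve
# disease-specific knowledge and prepend it to a clinical note.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in retrieval encoder

knowledge = [
    "Marfan syndrome features include aortic root dilation and ectopia lentis.",
    "Loeys-Dietz syndrome is associated with arterial tortuosity and bifid uvula.",
]
kb_emb = encoder.encode(knowledge, convert_to_tensor=True)

def build_prompt(note: str, k: int = 2) -> str:
    """Prepend the k most similar knowledge snippets to the note."""
    query = encoder.encode(note, convert_to_tensor=True)
    hits = util.semantic_search(query, kb_emb, top_k=k)[0]
    context = "\n".join(knowledge[h["corpus_id"]] for h in hits)
    return (f"Background knowledge:\n{context}\n\n"
            f"Clinical note:\n{note}\n\n"
            "Should this patient be referred for genetic aortopathy testing? Answer yes or no.")
```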
Speaker(s):
Zilinghan Li, Master of Science
Argonne National Laboratory
Author(s):
Anurag Verma - University of Pennsylvania; Theodore Drivas, MD, PhD - University of Pennsylvania Perelman School of Medicine; Ravi Madduri, MS - Argonne National Laboratory; Zilinghan Li, MS - Argonne National Laboratory; Ze Yang, MS - University of Illinois Urbana-Champaign; Tarak Nandi, PhD - Argonne National Laboratory; Colleen Morse, PT, DPT - University of Pennsylvania; Zachary Rodriguez, PhD - University of Pennsylvania; Reed Pyeritz, MD, PhD - University of Pennsylvania; Giorgio Sirugo, MD, PhD - University of Pennsylvania; Alex Rodriguez, PhD - Argonne National Laboratory; Pankhuri Singhal, PhD in Genetics - University of Pennsylvania;
Identifying Opioid Overdose and Opioid Use Disorder and Related Information from Clinical Narratives Using Large Language Models
2025 Informatics Summit On Demand
Presentation Time: 02:00 PM - 02:15 PM
Abstract Keywords: Secondary Use of EHR Data, Natural Language Processing, Machine Learning, Generative AI, and Predictive Modeling
Primary Track: Data Science/Artificial Intelligence
Programmatic Theme: Harnessing the Power of Large Language Models in Health Data Science
Opioid overdose and opioid use disorder (OUD) remain a growing public health issue in the United States, affecting 6.1 million individuals in 2022, more than double the 2.5 million affected in 2021. Accurately identifying opioid overdose and OUD-related information is critical to studying outcomes and developing interventions. This study aims to identify opioid overdose and OUD mentions and their related information in clinical narratives. We compared encoder-based large language models (LLMs) and decoder-based generative LLMs in extracting nine crucial concepts related to opioid overdose and OUD, including problematic opioid use. Through a cost-effective p-tuning algorithm, our decoder-based generative LLM, GatorTronGPT, achieved the best strict/lenient F1-scores of 0.8637 and 0.9057, demonstrating the efficiency of using generative LLMs for opioid overdose/OUD-related information extraction. This study provides a tool to systematically extract opioid overdose, OUD, and related information to facilitate opioid-related studies using clinical narratives.
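The sketch below shows what cost-effective p-tuning of a decoder-only LLM looks like with Hugging Face PEFT: continuous prompt embeddings are trained while the base model stays frozen. The base checkpoint, virtual-token count, and hidden size are placeholders, not the authors' configuration.

```python
# Illustrative sketch (assumptions, not the authors' code): p-tuning a
# decoder-only LLM for concept extraction with parameter-efficient training.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PromptEncoderConfig, get_peft_model

base = "gpt2"  # small stand-in; the study used GatorTronGPT
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# P-tuning learns continuous prompt embeddings; the base model's weights stay frozen.
peft_config = PromptEncoderConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=20,
    encoder_hidden_size=128,
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # only the prompt encoder's parameters are trainable
```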
Speaker(s):
Daniel Paredes, MS
University of Florida
Author(s):
Sankalp Talankar, MS - University of Florida; Cheng Peng, PhD - University of Florida; Patrick Balian, DBA - University of Florida; Motomori Lewis, PhD - University of Florida; Shunhua Yan, MEd - University of Florida; Wen-Shan Tsai, PharmD - National Cheng Kung University Hospital; Ching-Yuan Chang, PhD - University of Florida; Debbie Wilson; Weihsuan Jenny Lo-Ciganic, PhD - University of Florida; Yonghui Wu, PhD - University of Florida;
Exploring ChatGPT 3.5 for structured data extraction from oncological notes
2025 Informatics Summit On Demand
Presentation Time: 02:15 PM - 02:30 PM
Abstract Keywords: Machine Learning, Generative AI, and Predictive Modeling, Secondary Use of EHR Data, Natural Language Processing, Data Security and Privacy, Data Sharing/Interoperability
Primary Track: Data Science/Artificial Intelligence
Programmatic Theme: Harnessing the Power of Large Language Models in Health Data Science
In large-scale clinical informatics, there is a need to maximize the amount of usable data from electronic health records. With the adoption of large language models in HIPAA-secure environments, there is potential to use them to extract structured data from unstructured clinical notes. We explored how ChatGPT 3.5 could be used to supplement data in cancer research, assessing how GPT used clinical notes to answer six relevant clinical questions. Four prompt engineering strategies were used: zero-shot, zero-shot with context, few-shot, and few-shot with context. Few-shot prompting often decreased the accuracy of GPT outputs, and context did not consistently improve accuracy. GPT extracted patients’ Gleason scores and ages with an F1 score of 0.99, and it identified whether patients received palliative care and whether patients were in pain with an F1 score of 0.86. This approach has the potential to increase interoperability between healthcare and clinical research.
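A minimal sketch of how the four prompting strategies could be assembled for a single extraction question. The question text, context string, and few-shot examples are hypothetical, not the study's actual prompts.

```python
# Illustrative sketch of zero-/few-shot prompting, with or without context.
def build_prompt(note: str, question: str, examples=None, context: str | None = None) -> str:
    parts = []
    if context:  # "with context": brief task or domain framing
        parts.append(f"Context: {context}")
    if examples:  # "few-shot": worked note/answer pairs
        for ex_note, ex_answer in examples:
            parts.append(f"Note: {ex_note}\nQuestion: {question}\nAnswer: {ex_answer}")
    parts.append(f"Note: {note}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(parts)

# Zero-shot:           build_prompt(note, "What is the Gleason score?")
# Zero-shot + context: build_prompt(note, q, context="You are reviewing oncology notes.")
# Few-shot:            build_prompt(note, q, examples=[(demo_note, "3+4=7")])
```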
Speaker(s):
Ty Skyles, BS candidate
Brigham Young University
Author(s):
Adam Wilcox, PhD - Washington University in St. Louis; Kendall Kiser, MD, MS - Washington University in St. Louis; Isaac Freeman, Bachelor's of Science in Data Science - Washington University Department of Bioinformatics; David Davila-Garcia, BS - Columbia University Department of Biomedical Informatics; Silpa Raju, MD - Washington University in St. Louis; Georgewilliam Kalibbala, BS - Washington University in St. Louis;
Enhancing Disease Detection in Radiology Reports Through Fine-tuning Lightweight LLM on Weak Labels
2025 Informatics Summit On Demand
Presentation Time: 02:30 PM - 02:45 PM
Abstract Keywords: Machine Learning, Generative AI, and Predictive Modeling, Natural Language Processing, EHR-based Phenotyping
Primary Track: Data Science/Artificial Intelligence
Programmatic Theme: Harnessing the Power of Large Language Models in Health Data Science
Despite significant progress in applying large language models (LLMs) to the medical domain, several limitations still prevent their practical application, among them constraints on model size and the lack of cohort-specific labeled datasets. In this work, we investigated the potential of improving a lightweight LLM, such as Llama 3.1-8B, through fine-tuning on datasets with synthetic labels. Two tasks are jointly trained by combining their respective instruction datasets. When the quality of the task-specific synthetic labels is relatively high (e.g., generated by GPT-4o), Llama 3.1-8B achieves satisfactory performance on the open-ended disease detection task, with a micro F1 score of 0.91. Conversely, when the quality of the task-relevant synthetic labels is relatively low (e.g., from the MIMIC-CXR dataset), fine-tuned Llama 3.1-8B is able to surpass its noisy teacher labels (micro F1 score of 0.67 vs. 0.63) when calibrated against curated labels, indicating the model's strong underlying capability. These findings demonstrate the potential of fine-tuning LLMs with synthetic labels, offering a promising direction for future research on LLM specialization in the medical domain.
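The sketch below shows how two tasks' weakly labeled data might be wrapped in a shared instruction format and combined before supervised fine-tuning. The field names, instructions, and record variables are assumptions for illustration, not the authors' pipeline.

```python
# Illustrative sketch (assumptions, not the authors' pipeline): building a joint
# instruction dataset from two tasks whose labels came from a synthetic teacher.
from datasets import Dataset, concatenate_datasets

def to_instructions(records, task_instruction):
    """Wrap (text, synthetic_label) pairs in a single instruction format."""
    return Dataset.from_list([
        {
            "prompt": f"{task_instruction}\n\nReport:\n{r['text']}",
            "completion": r["synthetic_label"],  # weak label from the teacher
        }
        for r in records
    ])

# Hypothetical inputs: detection_records and a second task's records.
# joint = concatenate_datasets([
#     to_instructions(detection_records, "List the diseases mentioned in this radiology report."),
#     to_instructions(other_task_records, "Answer the question about this report."),
# ]).shuffle(seed=42)
# `joint` would then feed a standard supervised fine-tuning loop (e.g., LoRA on Llama 3.1-8B).
```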
Speaker(s):
Yishu Wei, PhD
Department of Population Health Sciences, Weill Cornell Medicine
Author(s):
Predicting Antibiotic Resistance Patterns Using Sentence-BERT: A Machine Learning Approach
2025 Informatics Summit On Demand
Presentation Time: 02:45 PM - 03:00 PM
Abstract Keywords: Infectious Disease Modeling, EHR-based Phenotyping, Natural Language Processing, Clinical Decision Support for Translational/Data Science Interventions, Patient-centered Research and Care
Primary Track: Translation Bioinformatics/Precision Medicine
Programmatic Theme: Health Data Science and Artificial Intelligence Innovation: From Single-Center to Multi-Site
Antibiotic resistance poses a significant threat in inpatient settings, where mortality is high. Using MIMIC-III data, we generated Sentence-BERT embeddings from clinical notes and applied neural networks and XGBoost to predict antibiotic susceptibility. XGBoost achieved an average F1 score of 0.86, while neural networks scored 0.84. This study is among the first to use document embeddings for predicting antibiotic resistance, offering a novel pathway for improving antimicrobial stewardship.
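A minimal sketch of the embedding-plus-gradient-boosting pipeline described above, for a single antibiotic. The encoder checkpoint, hyperparameters, and input variables are assumptions, not the study's code.

```python
# Illustrative sketch (not the study's code): Sentence-BERT note embeddings
# feeding an XGBoost classifier that predicts antibiotic susceptibility.
from sentence_transformers import SentenceTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from xgboost import XGBClassifier

def train_susceptibility_model(notes, labels):
    """notes: list[str]; labels: 1 = susceptible, 0 = resistant (hypothetical)."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model
    X = encoder.encode(notes)                          # one vector per note
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)
    clf = XGBClassifier(n_estimators=300, max_depth=6, eval_metric="logloss")
    clf.fit(X_tr, y_tr)
    return clf, f1_score(y_te, clf.predict(X_te))
```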
Speaker(s):
Mahmoud Alwakeel, MD
Duke University
Author(s):
Mahmoud Alwakeel, MD - Duke University; Michael Yarrington, MD, MMCi - Duke University Health System; Rebekah Wrenn, PharmD - Duke University; Ethan Fang, Ph.D - Duke University; Jian Pei, Ph.D - Duke University; Anand Chowdhury, MD - Duke University Health System; An-Kwok Ian Wong, MD, Ph.D - Duke University;