Poor Detection, Strong Rephrasing: ChatGPT Divergent Performance with Stigmatizing Language in EHRs
Presentation Time: 02:48 PM - 03:00 PM
Abstract Keywords: Fairness and Elimination of Bias, Large Language Models (LLMs), Diversity, Equity, Inclusion, and Accessibility, Natural Language Processing
Primary Track: Applications
Programmatic Theme: Clinical Research Informatics
Stigmatizing language in electronic health records can undermine patient trust and exacerbate health disparities. This study evaluated ChatGPT-4o’s ability to identify and rephrase stigmatizing language in 140 clinical notes from birth admissions. While ChatGPT demonstrated strong rephrasing performance (average scores ≥2.7/3 for de-stigmatization, clarity, and faithfulness), its identification accuracy was limited, with low precision (0.41) and occasional hallucinations. These findings underscore the need for clinician oversight and improved detection approaches before clinical integration.
Speaker(s):
Zhihong Zhang, PhD
Columbia University
Author(s):
Zhihong Zhang, PhD - Columbia University; Jihye Kim Scroggins, PhD - School of Nursing, University of North Carolina at Chapel Hill; Sarah Harkins, MPhil, BSN, RN - Columbia University School of Nursing; Ismael Ibrahim Hulchafo, MD, MS - Columbia University School of Nursing; Hans Moen - Aalto University Department of Computer Science; Michele Tadiello, Master's degree - Columbia University Irving Medical Center; Veronica Barcelona, PhD - Columbia University School of Nursing; Maxim Topaz, PhD - Columbia University School of Nursing
Category
Podium Abstract