Development of a Human Evaluation Framework and Correlation with Automated Metrics for Natural Language Generation of Medical Diagnoses
Presentation Time: 11:30 AM - 11:45 AM
Abstract Keywords: Evaluation, Large Language Models (LLMs), Natural Language Processing, Clinical Decision Support
Primary Track: Applications
Programmatic Theme: Clinical Informatics
In the evolving landscape of clinical Natural Language Generation (NLG), assessing abstractive text quality
remains challenging, as existing methods often overlook generative task complexities. This work aimed to
examine the current state of automated evaluation metrics for NLG in healthcare. To establish a robust,
well-validated baseline against which to examine the alignment of these metrics, we created a comprehensive human
evaluation framework. Employing ChatGPT-3.5-turbo generative output, we correlated human judgments
with each metric. None of the metrics demonstrated high alignment; however, the SapBERT score, a metric grounded
in the Unified Medical Language System (UMLS), showed the best results. This underscores the importance of incorporating
domain-specific knowledge into evaluation efforts. Our work reveals the deficiency in quality evaluations for
generated text and introduces our comprehensive human evaluation framework as a baseline. Future efforts
should prioritize integrating medical knowledge databases to enhance the alignment of automated metrics,
particularly focusing on refining the SapBERT score for improved assessments.
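To make the correlation procedure concrete, the following is a minimal, hypothetical sketch and not the authors' implementation: it scores generated diagnoses against reference diagnoses with SapBERT embedding similarity and then computes a Spearman correlation against human ratings. The model identifier is the publicly released SapBERT checkpoint; the diagnosis pairs and human scores are illustrative placeholders.

```python
# Hypothetical sketch: score generated diagnoses against references with a
# SapBERT embedding similarity, then correlate those scores with human ratings.
import torch
from transformers import AutoTokenizer, AutoModel
from scipy.stats import spearmanr

# Public SapBERT checkpoint (pretrained to align UMLS synonym embeddings).
MODEL_NAME = "cambridgeltl/SapBERT-from-PubMedBERT-fulltext"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(text: str) -> torch.Tensor:
    """Return the [CLS] embedding of a diagnosis string."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=64)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :].squeeze(0)

def sapbert_score(reference: str, generated: str) -> float:
    """Cosine similarity between reference and generated diagnosis embeddings."""
    ref, gen = embed(reference), embed(generated)
    return torch.nn.functional.cosine_similarity(ref, gen, dim=0).item()

# Illustrative data only: reference diagnoses, generated diagnoses, and
# human quality ratings such as those a human evaluation framework might yield.
references = ["acute kidney injury", "community-acquired pneumonia", "type 2 diabetes mellitus"]
generations = ["acute renal failure", "bacterial lung infection", "hyperlipidemia"]
human_scores = [5.0, 3.0, 1.0]

metric_scores = [sapbert_score(r, g) for r, g in zip(references, generations)]
rho, p_value = spearmanr(human_scores, metric_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```

Because SapBERT is pretrained to pull UMLS synonyms together in embedding space, clinically equivalent but lexically different diagnoses (e.g., "acute kidney injury" vs. "acute renal failure") receive high similarity, which is the kind of domain-aware behavior the abstract credits for its comparatively better alignment with human judgment.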
Speaker(s):
Emma Croxford, PhD Student
University of Wisconsin-Madison
Author(s):
Majid Afshar, MD, MSCR - University of Wisconsin-Madison; Yanjun Gao, PhD - University of Wisconsin-Madison; Brian Patterson, MD, MPH - University of Wisconsin-Madison; Daniel To - University of Wisconsin-Madison - UW Health; Samuel Tesch, Medical Student/MD - University of Wisconsin School of Medicine and Public Health; Anoop Mayampurath, PhD - University of Wisconsin-Madison; Matthew Churpek, MD, MPH, PhD - University of Wisconsin-Madison; Dmitriy Dligach, PhD - Loyola University Chicago
Category: Paper - Student