- 2026 Annual Symposium Gallery
S104: Tall Fences & Safe Models: Keeping AI on the Ranch (Oral Presentations)
11/11/2026 | 8:00 AM – 9:15 AM | Room 10
Presentation Type: Oral Presentations
Disentangling Prompt Element Level Risk Factors for Hallucinations and Omissions in Mental Health LLM Responses
Presentation Type: Paper - Regular
Presentation Time: 08:00 AM - 08:12 AM
Abstract Keywords: Large Language Models (LLMs), Delivering Health Information and Knowledge to the Public, Patient Safety, Evaluation, Fairness and elimination of bias, Artificial Intelligence
Programmatic Theme: Consumer Health Informatics
Mental health concerns are often expressed outside clinical settings, including in high-distress help seeking, where safety-critical guidance may be needed. Consumer health informatics systems increasingly incorporate large language models (LLMs) for mental health question answering, yet many evaluations underrepresent narrative, high-distress inquiries. We introduce UTCO (User, Topic, Context, Tone), a prompt construction framework that represents an inquiry as four controllable elements for systematic stress testing. Using 2,075 UTCO-generated prompts, we evaluated Llama 3.3 and annotated hallucinations (fabricated or incorrect clinical content) and omissions (missing clinically necessary or safety-critical guidance). Hallucinations occurred in 6.5% of responses and omissions in 13.2%, with omissions concentrated in crisis and suicidal ideation prompts. Across regression, element-specific matching, and similarity-matched comparisons, failures were most consistently associated with context and tone, while user-background indicators showed no systematic differences after balancing. These findings support evaluating omissions as a primary safety outcome and moving beyond static benchmark question sets.
Speaker(s):
Congning Ni, Ph.D.
Vanderbilt University Medical Center
Author(s):
Congning Ni, Ph.D. - Vanderbilt University Medical Center; Sarvech Qadir, MS - Vanderbilt University; Bryan Steitz, PhD - Vanderbilt University Medical Center; Mihir Sachin Vaidya, MS - Vanderbilt University Medical Center; Qingyuan Song, Master of Engineering - Vanderbilt University; Lantian Xia, B.A. - Vanderbilt University; Shelagh Mulvaney, PhD, FAMIA - Vanderbilt University; Siru Liu, PhD - Vanderbilt University Medical Center; Hyeyoung Ryu, PhD - Vanderbilt University Medical Center; Leah Hecht, Ph.D. - Lirio, LLC; Amy Bucher, Ph.D. - Lirio, LLC; Christopher Symons, Ph.D. - Lirio, LLC; Laurie Novak, PhD, MHSA - Vanderbilt University Medical Center Dept of Biomedical Informatics; Susannah Rose, PhD - Vanderbilt University Medical Center; Murat Kantarcioglu, Ph.D. - Virginia Tech; Bradley Malin, PhD - Vanderbilt University Medical Center; Zhijun Yin, Ph.D. - Vanderbilt University Medical Center;
When Models Hesitate: Using High Token Entropy for Auditing Evidence-Grounded Health Reasoning in Long Contexts
Presentation Type: Paper - Student
Presentation Time: 08:12 AM - 08:24 AM
Abstract Keywords: Artificial Intelligence, Large Language Models (LLMs), Delivering Health Information and Knowledge to the Public, Knowledge Representation & Information Modeling, Evaluation
Working Group: Natural Language Processing Working Group
Programmatic Theme: Public Health Informatics
In evidence-grounded health QA, large language models must avoid unsupported text generation. While rubric-style evaluations identify post-hoc errors, they struggle to localize where error decisions emerge during generation. To address this, we propose a token-entropy auditing framework to pinpoint these possible errors. Using 103 CL-Bench healthcare cases, we extracted high-entropy peak spans from reasoning traces, assigning them interpretable fork types via an LLM-as-judge pipeline. We found that high-entropy mass concentrates in commitment and discourse regions rather than lexical completion. Notably, failing cases disproportionately allocate entropy to directive/action forks, whereas passing cases focus on attribution/coverage. Entropy features yield modest predictive discrimination for fatal failures (AUC 0.605) compared to a chance-level shuffled control (AUC 0.472), confirming non-random failure signals. Ultimately, token-level entropy serves as a practical diagnostic tool, suggesting a targeted alignment paradigm: focusing reinforcement learning on high-entropy decision nodes to improve medical evidence grounding.
Speaker(s):
Jiarong Qian, BEng Candidate
Shandong University
Author(s):
Jiarong Qian, BEng Candidate - Shandong University; Jing Huang, PhD - University of Pennsylvania;
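As an illustration of the token-entropy auditing idea in the abstract above, the following is a minimal sketch, not the authors' pipeline. It assumes per-token next-token probability distributions are already available, and the function names, toy distributions, and threshold value are all hypothetical. It computes the Shannon entropy at each position and groups consecutive high-entropy positions into spans that could then be passed to a downstream labeling step.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_entropy_spans(entropies, threshold):
    """Return (start, end) index pairs, end exclusive, for maximal runs
    of positions whose entropy is at or above the threshold."""
    spans = []
    start = None
    for i, h in enumerate(entropies):
        if h >= threshold and start is None:
            start = i  # a new high-entropy run begins here
        elif h < threshold and start is not None:
            spans.append((start, i))  # the run ended at the previous token
            start = None
    if start is not None:
        spans.append((start, len(entropies)))  # run extends to the end
    return spans


# Toy distributions over a 4-token vocabulary (hypothetical values).
distributions = [
    [0.97, 0.01, 0.01, 0.01],  # confident continuation
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
    [0.40, 0.30, 0.20, 0.10],  # still fairly uncertain
    [0.99, 0.01, 0.00, 0.00],  # confident again
]
entropies = [token_entropy(d) for d in distributions]
spans = high_entropy_spans(entropies, threshold=1.0)  # threshold is an assumption
print(spans)
```

In practice the threshold would need to be chosen per model and vocabulary size; a fixed value such as the one used here is only for demonstration.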
Confidence Calibration of Large Language Models under Medical Predictions
Presentation Type: Podium Abstract
Presentation Time: 08:24 AM - 08:36 AM
Abstract Keywords: Large Language Models (LLMs), Natural Language Processing, Artificial Intelligence, Deep Learning, Machine Learning, Evaluation, Clinical Decision Support, Patient Safety
Programmatic Theme: Clinical Research Informatics
Generating reliable confidence estimates is critical for deploying AI in clinical applications. Recent studies report results on the calibration of LLMs' explicit and implicit confidence that contradict earlier work. This work systematically examined explicit and implicit confidence calibration using two main calibration metrics, showing that calibration improves as LLMs evolve, with implicit confidence calibration matching or exceeding explicit confidence calibration.
Speaker(s):
Bowen Gu, MS
Brigham and Women's Hospital
Author(s):
Bowen Gu, MS - Brigham and Women's Hospital; Richard Wyss, PhD - Brigham and Women's Hospital; Kueiyu Joshua Lin, MD, ScD - Brigham and Women's Hospital; Jie Yang, PhD, FACMI, FAMIA - Harvard Medical School;
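The abstract above does not name its two calibration metrics. As a hedged illustration only, the sketch below assumes two widely used ones, the Brier score and expected calibration error (ECE), and applies them to hypothetical confidence values and binary correctness labels. The function names, bin count, and data are assumptions and are not taken from the presentation.

```python
def brier_score(confidences, outcomes):
    """Mean squared difference between stated confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(confidences, outcomes)) / len(outcomes)


def expected_calibration_error(confidences, outcomes, n_bins=5):
    """Weighted average gap between mean confidence and accuracy,
    computed over equal-width confidence bins on [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, outcomes):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    total = len(outcomes)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Hypothetical predictions: confidence in the chosen answer, and whether it was correct.
confidences = [0.95, 0.95, 0.55, 0.55]
outcomes = [1, 1, 1, 0]

brier = brier_score(confidences, outcomes)
ece = expected_calibration_error(confidences, outcomes)
print(round(brier, 4), round(ece, 4))
```

The same two functions can be applied to any source of confidence values, so one set of scores could be computed per confidence type and compared across models.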
Hallucination Resistance in Clinical Laboratory Test Interpretation: Evaluating Large Language Models Using Fabricated Case Reports and Retrieval-Augmented Generation
Presentation Type: Podium Abstract
Presentation Time: 08:36 AM - 08:48 AM
Abstract Keywords: Large Language Models (LLMs), Clinical Decision Support, Artificial Intelligence, Information Extraction
Programmatic Theme: Clinical Research Informatics
Large language models (LLMs) often hallucinate when interpreting clinical laboratory results. We evaluated five LLMs using 200 clinical vignettes injected with fabricated laboratory tests. Under default and mitigation prompting, hallucination rates remained substantial. However, employing Retrieval-Augmented Generation (RAG) to ground models in external medical knowledge significantly improved extraction accuracy and reduced hallucinations. This highlights the necessity of knowledge retrieval for clinical LLM deployment.
Speaker(s):
Balu Bhasuran, Ph.D
Florida State University
Author(s):
Balu Bhasuran, Ph.D - Florida State University; Zhe He, PhD, FIAHSI, FAMIA - Florida State University; Dhruv Kale, Masters - Florida State University; Nancy Chen, High School - Florida State University; Joseph Massa, High School - Florida State University;
Reliability and Systematic Bias in LLM-as-a-Judge Systems for Clinical PICO Extraction
Presentation Type: Podium Abstract
Presentation Time: 08:48 AM - 09:00 AM
Abstract Keywords: Fairness and Elimination of Bias, Large Language Models (LLMs), Evaluation
Programmatic Theme: Clinical Research Informatics
LLM-as-a-Judge (LaaJ) systems are deployed in clinical AI evaluation without reliability validation. We evaluated ten judges, eight open-source and two proprietary, on 1,200 evaluations from 30 papers. All fell below clinical thresholds (best κ=0.209, ICC=0.297, α=0.273); 61% of large-small score gaps reflected leniency artifacts. Rubric choice shifted scores up to 21.7%; order-swapping reversed 35.7% of winner designations. Findings validate mandatory IRR reporting and bias auditing for clinical LaaJ deployment.
Speaker(s):
Chenyu Li, M.S.
University of Pittsburgh
Author(s):
Chenyu Li, M.S. - University of Pittsburgh; Harold Lehmann, MD, PhD - Johns Hopkins University; Yanshan Wang, PhD - University of Pittsburgh;
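The abstract above reports agreement as κ without stating which variant was used. As a minimal sketch under that uncertainty, the code below computes Cohen's kappa for two raters, for example one LLM judge and one human reference, on hypothetical binary labels. The function name and data are illustrative assumptions, not the authors' evaluation code.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(ratings_a)
    # Observed proportion of items on which both raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical pass/fail labels from an LLM judge and a human reviewer.
judge_labels = [1, 1, 0, 0, 1, 0, 1, 0]
human_labels = [1, 0, 0, 0, 1, 1, 1, 0]

kappa = cohens_kappa(judge_labels, human_labels)
print(kappa)
```

Here the raters agree on 6 of 8 items (0.75) while chance agreement is 0.5, giving κ = 0.5. Values near 0.2, like the best judge in the abstract, indicate agreement only slightly above chance.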
First, Do NOHARM: Towards Clinically Safe Large Language Models
Presentation Type: Podium Abstract
Presentation Time: 09:00 AM - 09:12 AM
Abstract Keywords: Large Language Models (LLMs), Clinical Decision Support, Evaluation, Patient Safety, Artificial Intelligence, Natural Language Processing
Programmatic Theme: Clinical Informatics
Large language models (LLMs) are widely used in clinical decision support, yet their safety profiles remain poorly characterized. We present NOHARM, a specialist-validated benchmark of 100 real consultation cases with 12,747 expert annotations. Across 31 LLMs, severe harm occurs in up to 22.2% of cases, with omission errors predominating. Multi-agent orchestration reduces harm, underscoring clinical safety as a distinct evaluation dimension.
Speaker(s):
Fateme Nateghi Haredasht, PhD
Stanford University
Author(s):
David Wu, MD, PhD - Harvard Medical School; Fateme Nateghi Haredasht, PhD - Stanford University; Saloni Maharaj, MD - Stanford University; Priyank Jain, MD - Harvard Medical School; Jessica Tran, MD - Stanford University School of Medicine; Arjun Rustagi, MD - UCSF; Liam G. McCoy, MD - University of Alberta; Yingjie Weng, MHS - Stanford University School of Medicine; Vishnu Ravi, MD - Stanford University School of Medicine; David Wu, MD - Stanford; April Liang, MD - Stanford University; Kevin Schulman; Nigam Shah, MBBS - Stanford University; Jason Hom, MD - Stanford University School of Medicine; Arnold Milstein, MD, MPH - Clinical Excellence Research Center, Stanford University School of Medicine; Adam Rodman, MD - Harvard Medical School; Jonathan Chen, MD, PhD - Stanford University Hospital; Ethan Goh, MD, MS - Stanford University;
11/11/2026 09:15 AM (Central Time (US & Canada))