ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?
Presentation Time: 02:48 PM - 03:00 PM
Abstract Keywords: Artificial Intelligence, Large Language Models (LLMs), Clinical Decision Support
Primary Track: Applications
Programmatic Theme: Academic Informatics / LIEAF
Large Language Models (LLMs) show great promise in clinical systems due to their superior medical text processing
capabilities. However, traditional ML models like SVM and XGBoost remain dominant in clinical prediction tasks.
This raises the question: Can LLMs outperform traditional ML models in clinical prediction? To answer this, we
introduce a new benchmark, ClinicalBench, which evaluates 14 general-purpose LLMs, 8 medical LLMs, and 11
traditional ML models across three clinical tasks and two datasets. Our extensive empirical study reveals that both
general-purpose and medical LLMs, regardless of model scale, prompting, or fine-tuning strategies, still fail to
surpass traditional ML models in clinical prediction, highlighting their surprising limitations in clinical reasoning.
Speaker(s):
Kai Shu, PhD
Emory University
Author(s):
Canyu Chen, BS - Illinois Institute of Technology; Jian Yu, BS - N/A; Shan Chen, MS - Harvard-MGB; Che Liu, BS - Imperial College London; Zhongwei Wan, BS - Ohio State University; Danielle Bitterman, MD - Harvard Medical School; Fei Wang, PhD - Weill Cornell Medicine; Kai Shu, PhD - Emory University
Category
Podium Abstract