Evaluating Medical Knowledge in Large Language Models through Probing with the UMLS
Presentation Time: 08:45 AM - 09:00 AM
Abstract Keywords: Natural Language Processing, Large Language Models (LLMs), Diagnostic Systems
Primary Track: Foundations
Programmatic Theme: Clinical Research Informatics
This study investigates how medical knowledge is represented in Large Language Models (LLMs) such as ChatGPT and Llama-2, using the Unified Medical Language System (UMLS) as a benchmark. It introduces a novel probing method that assesses LLMs' ability to predict medical concepts along UMLS-defined knowledge paths. The evaluation, which compares the LLMs against a baseline model using Dice coefficients, shows that ChatGPT performs best at understanding and interpreting medical relationships, albeit with modest F-scores. These findings underscore the potential of UMLS as a resource for evaluating LLMs in medical contexts, highlight the challenges of leveraging LLMs for medical diagnosis, and point to the need for further refinement and tuning of LLMs for biomedical applications.
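The abstract does not specify the exact formulation, but a Dice coefficient over sets of predicted versus reference medical concepts is commonly computed as 2|A ∩ B| / (|A| + |B|). A minimal sketch, with hypothetical UMLS concept identifiers (CUIs) used purely for illustration:

```python
def dice_coefficient(predicted, gold):
    """Dice coefficient between two concept sets: 2*|A & B| / (|A| + |B|)."""
    a, b = set(predicted), set(gold)
    if not a and not b:
        return 1.0  # convention: two empty sets match perfectly
    return 2 * len(a & b) / (len(a) + len(b))

# Hypothetical CUIs: one of two predicted concepts matches the reference set.
score = dice_coefficient({"C0011849", "C0020538"}, {"C0011849", "C0027051"})
print(score)  # 0.5
```

The set-based form gives partial credit when an LLM recovers some but not all concepts on a knowledge path, which suits the modest F-scores reported.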
Speaker(s):
Majid Afshar, MD, MSCR
University of Wisconsin - Madison
Author(s):
Deepak Gupta; Yanjun Gao, PhD - University of Wisconsin - Madison; Emma Croxford, PhD Student - University of Wisconsin - Madison; Majid Afshar, MD, MSCR - University of Wisconsin - Madison; Dina Demner-Fushman, MD - National Library of Medicine
Category
Podium Abstract