Times are displayed in (UTC-07:00) Pacific Time (US & Canada)
11/12/2024 | 8:30 AM – 10:00 AM | Franciscan B
S66: Large Language Models, Hype or Hope - Como Se LLaMA
Presentation Type: Oral
Session Chair:
Jin Chen, PhD - University of Alabama at Birmingham
Toward Large Language Models as a Therapeutic Tool: Comparing Prompting Techniques to Improve GPT-Delivered Problem-Solving Therapy
Presentation Time: 08:30 AM - 08:45 AM
Abstract Keywords: Large Language Models (LLMs), Evaluation, Behavioral Change, Usability
Primary Track: Applications
While Large Language Models (LLMs) are being quickly adapted to many domains, including healthcare, many of their strengths and pitfalls remain under-explored. In our study, we examine the effects of employing in-context learning to guide LLMs in delivering parts of a Problem-Solving Therapy (PST) session, particularly during the symptom identification and assessment phase for personalized goal setting. We present an evaluation of the model's performance using automatic metrics and ratings from experienced medical professionals. We demonstrate that the model's capability to deliver protocolized therapy can be improved with the proper use of prompt engineering methods, albeit with limitations. To our knowledge, this study is among the first to assess the effects of various prompting techniques on a model's ability to deliver psychotherapy, focusing on overall quality, consistency, and empathy. Given the current shortage of mental health professionals amid significant need, exploring LLMs' potential for delivering psychotherapy holds promise for enhancing the utility of AI-based or AI-supported care services.
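The sketch below is one way the in-context learning approach described above could be set up with a chat-completion API: a system prompt constraining the model to the symptom-identification phase of PST, plus a single few-shot exchange illustrating the desired tone. The prompts, example dialogue, and model choice are hypothetical and are not the study's actual configuration.

```python
# Minimal sketch of in-context learning for a PST-style exchange.
# The instructions and few-shot example are invented for illustration;
# the study's actual prompts and model configuration are not shown here.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You are assisting with the symptom identification and assessment phase "
    "of Problem-Solving Therapy (PST). Ask one open-ended question at a time, "
    "reflect the client's feelings, and help them state a concrete, personal goal."
)

# One illustrative in-context example showing the desired tone and structure.
FEW_SHOT = [
    {"role": "user", "content": "I've been exhausted and behind on everything."},
    {"role": "assistant", "content": (
        "It sounds like the exhaustion is making it hard to keep up, and that "
        "feels overwhelming. Which part of falling behind weighs on you most right now?"
    )},
]

def pst_reply(user_message: str) -> str:
    """Return a protocol-guided reply for one turn of the conversation."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}, *FEW_SHOT,
                {"role": "user", "content": user_message}]
    response = client.chat.completions.create(model="gpt-4", messages=messages)
    return response.choices[0].message.content

if __name__ == "__main__":
    print(pst_reply("Lately I can't sleep and I snap at my kids over small things."))
```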
Speaker(s):
Daniil Filienko, PhD Student
University of Washington Tacoma
Author(s):
Daniil Filienko, PhD in Computer Science and Systems - University of Washington Tacoma; Yinzhou Wang; Caroline El Jazmi, B.S. - University of Washington; Serena Jinchen Xie, Masters - Biomedical Informatics and Medical Education, University of Washington; Trevor Cohen, MBChB, PhD - Biomedical Informatics and Medical Education, University of Washington; Martine De Cock, Ph.D. - University of Washington Tacoma; Weichao Yuwen, PhD, RN - University of Washington Tacoma;
Evaluating Medical Knowledge in Large Language Models through Probing with the UMLS
Presentation Time: 08:45 AM - 09:00 AM
Abstract Keywords: Natural Language Processing, Large Language Models (LLMs), Diagnostic Systems
Primary Track: Foundations
Programmatic Theme: Clinical Research Informatics
This study investigates the representation of medical knowledge in Large Language Models (LLMs) like ChatGPT and Llama-2, using the Unified Medical Language System (UMLS) as a benchmark. It introduces a novel probing method to assess LLMs' ability to predict medical concepts within UMLS-defined knowledge paths. The evaluation, involving a comparison with a baseline model using Dice coefficients, reveals ChatGPT's superior performance in understanding and interpreting medical relationships, albeit with modest F-scores. These findings underscore the potential of UMLS as a resource for evaluating LLMs in medical contexts and highlight the challenges in leveraging LLMs for medical diagnosis, pointing towards the need for further refinement and tuning of LLMs for biomedical applications.
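As a rough illustration of the probing setup described above, the sketch below turns one edge of a UMLS-style knowledge path into a fill-in-the-blank prompt and scores the model's prediction against the gold concept with a token-level Dice coefficient. The prompt template, example triple, and `query_llm` placeholder are assumptions for illustration, not the authors' actual probe or data.

```python
# Sketch of a probing step over a UMLS-style knowledge path and a token-level
# Dice score, as a rough stand-in for the evaluation the abstract describes.

def dice_coefficient(predicted: str, gold: str) -> float:
    """Token-level Dice overlap between a predicted and a gold concept name."""
    pred_tokens, gold_tokens = set(predicted.lower().split()), set(gold.lower().split())
    if not pred_tokens and not gold_tokens:
        return 1.0
    overlap = len(pred_tokens & gold_tokens)
    return 2 * overlap / (len(pred_tokens) + len(gold_tokens))

def build_probe(source_concept: str, relation: str) -> str:
    """Turn one edge of a UMLS knowledge path into a fill-in-the-blank prompt."""
    return (f"In the Unified Medical Language System, the concept '{source_concept}' "
            f"has the relation '{relation}' to which concept? Answer with the concept name only.")

def query_llm(prompt: str) -> str:
    """Placeholder for a call to ChatGPT, Llama-2, or any other model under test."""
    raise NotImplementedError("Plug in the model being probed here.")

# Hypothetical example edge: (Type 2 Diabetes Mellitus) --may_be_treated_by--> (Metformin)
# prediction = query_llm(build_probe("Type 2 Diabetes Mellitus", "may_be_treated_by"))
# print(dice_coefficient(prediction, "Metformin"))
```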
Speaker(s):
Majid Afshar, MD, MSCR
University of Wisconsin - Madison
Author(s):
Deepak Gupta; Yanjun Gao, PhD - University of Wisconsin Madison; Emma Croxford, PhD Student - University of Wisconsin Madison; Majid Afshar, MD, MSCR - University of Wisconsin - Madison; Dina Demner-Fushman, MD - National Library of Medicine;
Benchmarking Retrieval-Augmented Generation for Medicine
Presentation Time: 09:00 AM - 09:15 AM
Abstract Keywords: Large Language Models (LLMs), Evaluation, Natural Language Processing
Primary Track: Applications
Retrieval-augmented generation (RAG) is a promising solution to the problems of hallucinations and outdated knowledge in large language models, but there is a lack of best practices regarding the optimal RAG setting for various medical purposes. We propose MIRAGE, a first-of-its-kind benchmark, to systematically evaluate medical RAG systems. Large-scale experiments were conducted on MIRAGE using our MedRAG toolkit. We provide practical guidelines for future implementation based on our comprehensive evaluations.
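To make concrete the kind of evaluation a medical RAG benchmark involves, the sketch below shows a generic retrieve-prompt-score loop over multiple-choice questions. The retriever, generator, and dataset format are placeholders and do not reflect the MIRAGE benchmark or MedRAG toolkit APIs.

```python
# Generic sketch of a RAG evaluation loop: retrieve supporting snippets for each
# question, prompt an LLM with them, and score multiple-choice accuracy.
from typing import Callable, Dict, List

def rag_answer(question: str, options: Dict[str, str],
               retrieve: Callable[[str, int], List[str]],
               generate: Callable[[str], str], k: int = 8) -> str:
    """Answer one multiple-choice question using retrieved context."""
    snippets = retrieve(question, k)
    prompt = (
        "Use the context to answer the question with a single option letter.\n\n"
        "Context:\n" + "\n".join(f"- {s}" for s in snippets) +
        f"\n\nQuestion: {question}\nOptions: " +
        "; ".join(f"{key}. {text}" for key, text in options.items()) +
        "\nAnswer:"
    )
    return generate(prompt).strip()[:1].upper()  # keep only the option letter

def accuracy(dataset: List[dict], retrieve, generate) -> float:
    """Fraction of questions answered correctly under a given RAG setting."""
    correct = sum(
        rag_answer(item["question"], item["options"], retrieve, generate) == item["answer"]
        for item in dataset
    )
    return correct / len(dataset)
```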
Speaker(s):
Guangzhi Xiong, BA
University of Virginia
Author(s):
Guangzhi Xiong, BA - University of Virginia; Qiao Jin, M.D. - National Institutes of Health; Zhiyong Lu, PhD - National Library of Medicine, NIH; Aidong Zhang, PhD - University of Virginia;
Enhancing Large Language Models with Domain-specific Retrieval Augment Generation: A Case Study on Long-Form Medical Question Answering in Ophthalmology
Presentation Time: 09:15 AM - 09:30 AM
Abstract Keywords: Information Retrieval, Large Language Models (LLMs), Data Mining
Primary Track: Applications
Programmatic Theme: Consumer Health Informatics
Objectives:
Despite the potential of Large Language Models (LLMs) in medicine, they may generate responses lacking supporting evidence or based on hallucinated evidence. This study develops a domain-specific Retrieval-Augmented Generation (RAG) approach and systematically evaluates response accuracy and completeness, evidence factuality, selection, and attribution.
Materials and Methods:
We conducted a case study on long-form question answering in ophthalmology. A RAG pipeline with ~70,000 ophthalmology-specific documents was developed. The study compared LLM responses with and without RAG on 100 consumer health questions, evaluated by ten clinicians.
Results:
Without RAG, 45.3% of the 252 references were hallucinated and 34.1% were erroneous. With RAG, hallucinated references were reduced to 18.8%. However, only 62.5% of the documents retrieved by RAG were selected as the top references in the LLM responses. In addition, RAG significantly improved evidence attribution from 1.85 to 2.49 (on a scale from 1 to 5), with slight decreases in accuracy and completeness.
Discussion:
LLMs exhibited prevalent hallucinated and erroneous evidence in their responses. RAG substantially reduced the proportion of such evidence but encountered challenges. The results highlight that (1) LLMs may not select the documents retrieved by RAG, (2) LLMs may miss top-ranked documents retrieved by RAG, and (3) irrelevant documents retrieved by RAG degrade response accuracy and completeness, especially in challenging tasks.
Conclusion:
Despite their potential, LLMs in medicine require improved evidence factuality and relevance. Through a case investigation in long-form medical question answering, RAG demonstrated effectiveness but encountered challenges, highlighting the need for further development in domain-specific LLM and RAG techniques.
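The sketch below illustrates one simple proxy for the reference-verification step the results describe: flag cited references that cannot be matched to the retrieved document set. Matching by normalized title is an assumption for illustration; the study's actual verification procedure may differ.

```python
# Sketch of a hallucinated-reference check: count cited references with no
# match among the documents retrieved by the RAG pipeline.
from typing import List

def normalize(title: str) -> str:
    return " ".join(title.lower().split())

def hallucinated_fraction(cited_titles: List[str], retrieved_titles: List[str]) -> float:
    """Fraction of cited references with no match among retrieved documents."""
    retrieved = {normalize(t) for t in retrieved_titles}
    if not cited_titles:
        return 0.0
    missing = sum(normalize(t) not in retrieved for t in cited_titles)
    return missing / len(cited_titles)

# Hypothetical usage:
# print(hallucinated_fraction(cited, corpus_titles))  # e.g. 0.188 would correspond to 18.8%
```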
Speaker(s):
Qingyu Chen, PhD
Yale University
Author(s):
MKRAG: Medical Knowledge Retrieval Augmented Generation for Medical Question Answering
Presentation Time: 09:30 AM - 09:45 AM
Abstract Keywords: Usability, Deep Learning, Large Language Models (LLMs), Knowledge Representation and Information Modeling
Primary Track: Applications
Large Language Models (LLMs), although powerful in general domains, often perform poorly on domain-specific tasks like medical question answering (QA). Moreover, they tend to function as "black-boxes," making it challenging to modify their behavior. To address the problem, our study delves into retrieval augmented generation (RAG), aiming to improve LLM responses without the need for fine-tuning or retraining. Specifically, we propose a comprehensive retrieval strategy to extract medical facts from an external knowledge base, and then inject them into the query prompt for LLMs. Focusing on medical QA using the MedQA-SMILE dataset, we evaluate the impact of different retrieval models and the number of facts provided to the LLM. Notably, our retrieval-augmented Vicuna-7B model exhibited an accuracy improvement from 44.46% to 48.54%. This work underscores the potential of RAG to enhance LLM performance, offering a practical approach to mitigate the challenges of black-box LLMs.
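The sketch below shows the general fact-injection pattern the abstract describes: embed the question, retrieve the most similar facts from an external knowledge base by cosine similarity, and prepend them to the query prompt. The embedding model, fact store, and LLM call are placeholders, not the authors' MKRAG implementation.

```python
# Sketch of retrieval-augmented fact injection for medical QA: select the facts
# most similar to the question and place them in the prompt before the query.
import numpy as np
from typing import Callable, List

def top_k_facts(question_vec: np.ndarray, fact_vecs: np.ndarray,
                facts: List[str], k: int = 5) -> List[str]:
    """Return the k facts whose embeddings are most cosine-similar to the question."""
    sims = fact_vecs @ question_vec / (
        np.linalg.norm(fact_vecs, axis=1) * np.linalg.norm(question_vec) + 1e-9)
    return [facts[i] for i in np.argsort(-sims)[:k]]

def answer_with_facts(question: str, embed: Callable[[str], np.ndarray],
                      facts: List[str], fact_vecs: np.ndarray,
                      ask_llm: Callable[[str], str], k: int = 5) -> str:
    """Inject retrieved facts into the query prompt before asking the LLM."""
    selected = top_k_facts(embed(question), fact_vecs, facts, k)
    prompt = ("Relevant medical facts:\n" + "\n".join(f"- {f}" for f in selected) +
              f"\n\nQuestion: {question}\nAnswer:")
    return ask_llm(prompt)
```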
Speaker(s):
Yucheng Shi, Ph.D. Student
University of Georgia
Author(s):
Yucheng Shi, Ph.D. student - University of Georgia; Shaochen Xu, Ph.D. student - University of Georgia; Tianze Yang, Ph.D. student - University of Georgia; Zhengliang Liu; Tianming Liu, Ph.D. - University of Georgia; Xiang Li, PhD - Massachusetts General Hospital and Harvard Medical School; Ninghao Liu, Ph.D. - University of Georgia;
Me LLaMA: Foundation Large Language Models for Medical Applications
Presentation Time: 09:45 AM - 10:00 AM
Abstract Keywords: Large Language Models (LLMs), Natural Language Processing, Deep Learning
Primary Track: Foundations
Programmatic Theme: Clinical Research Informatics
Recent large language models (LLMs) such as ChatGPT and LLaMA have shown great promise in many AI applications. However, their performance on medical tasks is suboptimal and can be improved by training on extensive domain-specific datasets. This study introduces Me LLaMA, a medical LLM family that includes foundation models – Me LLaMA 13/70B, along with their chat-enhanced versions – Me LLaMA 13/70B-chat, developed through continual pre-training and instruction tuning of LLaMA2 using large medical datasets. Our domain-specific data suite for training and evaluation includes a large-scale, continual pre-training dataset with 129B tokens, an instruction tuning dataset with 214k samples, and a new medical evaluation benchmark (MIBE) across six tasks with 12 datasets. Our extensive evaluation using the MIBE shows that Me LLaMA models achieve overall better performance than existing open-source medical LLMs in zero-shot, few-shot and supervised learning abilities. With task-specific instruction tuning, Me LLaMA models outperform ChatGPT on 7 out of 8 datasets and GPT-4 on 5 out of 8 datasets. In addition, we investigated the catastrophic forgetting problem, and our results show that Me LLaMA models outperform other open-source medical LLMs in mitigating this issue. Me LLaMA is one of the largest open-source medical foundation LLMs that use both biomedical and clinical data. It exhibits superior performance across both general and medical tasks compared to other open-source medical LLMs, rendering it an attractive choice for medical AI applications.
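For readers unfamiliar with the recipe, the sketch below shows a minimal continual pre-training setup with the Hugging Face Trainer on the public LLaMA-2 13B checkpoint; the corpus path, sequence length, and hyperparameters are placeholders and are not the Me LLaMA training configuration.

```python
# Rough sketch of continual pre-training with the Hugging Face Trainer, to make
# the recipe in the abstract concrete. Compute requirements (distributed
# training, parameter-efficient methods) are not addressed here.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "meta-llama/Llama-2-13b-hf"          # gated checkpoint; access must be granted
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Placeholder medical corpus: one plain-text document per line.
corpus = load_dataset("text", data_files={"train": "medical_corpus.txt"})["train"]
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="me-llama-cpt", per_device_train_batch_size=1,
                           gradient_accumulation_steps=16, num_train_epochs=1,
                           learning_rate=2e-5, bf16=True, logging_steps=50),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # instruction tuning would follow on prompt/response pairs
```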
Speaker(s):
Qianqian Xie, PhD
Yale University
Author(s):
Qianqian Xie, PhD - Yale University; Qingyu Chen, PhD - Yale University; Aokun Chen; Cheng Peng, PhD - University of Florida; Yan Hu - UTHealth Science Center Houston; Fongci Lin, PhD - Yale University; Xueqing Peng, PhD - Yale University; Jimin Huang, MS - Yale University; Jeffrey Zhang, PhD - Yale University; Vipina K. Keloth, PhD - Yale University; Xingyu Zhou, Bachelor - Yale University; Huan He, Ph.D. - Yale University; Lucila Ohno-Machado, MD, PhD - UC San Diego School of Medicine; Yonghui Wu, PhD - University of Florida; Hua Xu, Ph.D - Yale University; Jiang Bian, PhD - University of Florida;