Times are displayed in (UTC-07:00) Pacific Time (US & Canada) Change
11/11/2024 |
3:30 PM – 5:00 PM |
Franciscan A
S55: AI Vulnerabilities and Limitations - Finding "ai" in "fail"
Presentation Type: Oral
Session Chair:
Josette Jones, RN, PhD - Indiana University
Adversarial Attacks on Large Language Models in Medicine
Presentation Time: 03:30 PM - 03:45 PM
Abstract Keywords: Large Language Models (LLMs), Privacy and Security, Clinical Decision Support, Diversity, Equity, Inclusion, Accessibility, and Health Equity
Primary Track: Applications
Advancements in Large Language Models (LLMs) have shown potential for healthcare applications. Our study exposes vulnerabilities to attacks that could manipulate outputs, revealing that LLM application settings can be easily compromised by malicious attackers or interested stakeholders through two distinct methods. Despite these manipulations, models maintain medical task performance, masking potential harms. This underscores critical risks, particularly in recommending harmful operation or medication to at-risk patients, highlighting the need for safeguards in medical LLM usage.
Speaker(s):
Yifan Yang, B.S.
NCBI, NLM/NIH
Author(s):
Yifan Yang, B.S. - NCBI, NLM/NIH; Qiao Jin, M.D. - National Institutes of Health; Furong Huang, PhD - Department of Computer Science at the University of Maryland; Zhiyong Lu, PhD - National Library of Medicine, NIH;
Presentation Time: 03:30 PM - 03:45 PM
Abstract Keywords: Large Language Models (LLMs), Privacy and Security, Clinical Decision Support, Diversity, Equity, Inclusion, Accessibility, and Health Equity
Primary Track: Applications
Advancements in Large Language Models (LLMs) have shown potential for healthcare applications. Our study exposes vulnerabilities to attacks that could manipulate outputs, revealing that LLM application settings can be easily compromised by malicious attackers or interested stakeholders through two distinct methods. Despite these manipulations, models maintain medical task performance, masking potential harms. This underscores critical risks, particularly in recommending harmful operation or medication to at-risk patients, highlighting the need for safeguards in medical LLM usage.
Speaker(s):
Yifan Yang, B.S.
NCBI, NLM/NIH
Author(s):
Yifan Yang, B.S. - NCBI, NLM/NIH; Qiao Jin, M.D. - National Institutes of Health; Furong Huang, PhD - Department of Computer Science at the University of Maryland; Zhiyong Lu, PhD - National Library of Medicine, NIH;
Time Matters: Examine Temporal Effects on Biomedical Language Models
Presentation Time: 03:45 PM - 04:00 PM
Abstract Keywords: Natural Language Processing, Large Language Models (LLMs), Evaluation, Deep Learning
Primary Track: Foundations
Programmatic Theme: Translational Bioinformatics
Time roots in developing and deploying language models for biomedical applications (e.g., phenotype inference) that we trained models on historical data and deploy those models on new or future data, which may vary from training data. While increasing biomedical tasks have employed state-of-the-art language models (e.g., T5 and GPT), there are very few studies have examined temporal effects on model performance over multiple key biomedical tasks when data shift across development and deployment.
In this study, we aim to fill the gap by systematically probing temporal relations between model performance and data shifts and language models three major and critical biomedical downstream tasks, including phenotype classification, information extraction, and question answering. We deploy diverse benchmark metrics to evaluate model performance, distance methods to evaluate data drifts over time spans, and statistical measurements to quantify temporal effects on biomedical language models. Our study has demonstrated that time matters for deploying language models on the biomedical applications, while the degree of performance degradation varies by specific biomedical tasks and statistical quantification approaches, and model performance variations highly correlate with data temporal shifts across development and deployment phases. We believe this study can establish a solid benchmark to evaluate and assess temporal effects on deploying biomedical language models.
Speaker(s):
Weisi Liu, Bachelor of Science in Statistics
University of Memphis
Author(s):
Weisi Liu, Bachelor of Science in Statistics - University of Memphis; Zhe He, PhD, FAMIA - Florida State University; Xiaolei Huang, PhD in Information Science - University of Memphis;
Presentation Time: 03:45 PM - 04:00 PM
Abstract Keywords: Natural Language Processing, Large Language Models (LLMs), Evaluation, Deep Learning
Primary Track: Foundations
Programmatic Theme: Translational Bioinformatics
Time roots in developing and deploying language models for biomedical applications (e.g., phenotype inference) that we trained models on historical data and deploy those models on new or future data, which may vary from training data. While increasing biomedical tasks have employed state-of-the-art language models (e.g., T5 and GPT), there are very few studies have examined temporal effects on model performance over multiple key biomedical tasks when data shift across development and deployment.
In this study, we aim to fill the gap by systematically probing temporal relations between model performance and data shifts and language models three major and critical biomedical downstream tasks, including phenotype classification, information extraction, and question answering. We deploy diverse benchmark metrics to evaluate model performance, distance methods to evaluate data drifts over time spans, and statistical measurements to quantify temporal effects on biomedical language models. Our study has demonstrated that time matters for deploying language models on the biomedical applications, while the degree of performance degradation varies by specific biomedical tasks and statistical quantification approaches, and model performance variations highly correlate with data temporal shifts across development and deployment phases. We believe this study can establish a solid benchmark to evaluate and assess temporal effects on deploying biomedical language models.
Speaker(s):
Weisi Liu, Bachelor of Science in Statistics
University of Memphis
Author(s):
Weisi Liu, Bachelor of Science in Statistics - University of Memphis; Zhe He, PhD, FAMIA - Florida State University; Xiaolei Huang, PhD in Information Science - University of Memphis;
Exposing Vulnerabilities in Clinical LLMs Through Data Poisoning Attacks: Case Study in Breast Cancer
Presentation Time: 04:00 PM - 04:15 PM
Abstract Keywords: Natural Language Processing, Large Language Models (LLMs), Legal, Ethical, Social and Regulatory Issues, Deep Learning
Primary Track: Applications
Programmatic Theme: Clinical Informatics
Training Large Language Models (LLMs) with billions of parameters on a dataset and publishing the model
for public access is the standard practice currently. Despite their transformative impact on natural language processing, public LLMs present notable vulnerabilities given the source of training data is often web-based or crowdsourced, and hence can be manipulated by perpetrators. We delve into the vulnerabilities of clinical LLMs, particularly BioGPT which is trained on publicly available biomedical literature and and clinical notes from MIMIC-III, in the realm of data poisoning attacks. Exploring susceptibility to data poisoning-based attacks on de-identified breast cancer clinical notes, our approach is the first one to assess the extent of such attacks and our findings reveal successful manipulation of LLM outputs. Through this work, we emphasize on the urgency of comprehending these vulnerabilities in LLMs and encourage the mindful and responsible usage of LLMs in the clinical domain.
Speaker(s):
Avisha Das, Ph.D.
Mayo Clinic
Author(s):
Avisha Das, Ph.D. - Mayo Clinic; Amara Tariq, Ph.D. - Mayo Clinic Arizona; Felipe Batalini, M.D. - Mayo Clinic Arizona; Boddhisattwa Dhara, B. Tech. - Birla Institute of Technology and Science (BITS) Pilani, Hyderabad Campus, India; Imon Banerjee, PhD - Arizona State U, Mayo Clinic;
Presentation Time: 04:00 PM - 04:15 PM
Abstract Keywords: Natural Language Processing, Large Language Models (LLMs), Legal, Ethical, Social and Regulatory Issues, Deep Learning
Primary Track: Applications
Programmatic Theme: Clinical Informatics
Training Large Language Models (LLMs) with billions of parameters on a dataset and publishing the model
for public access is the standard practice currently. Despite their transformative impact on natural language processing, public LLMs present notable vulnerabilities given the source of training data is often web-based or crowdsourced, and hence can be manipulated by perpetrators. We delve into the vulnerabilities of clinical LLMs, particularly BioGPT which is trained on publicly available biomedical literature and and clinical notes from MIMIC-III, in the realm of data poisoning attacks. Exploring susceptibility to data poisoning-based attacks on de-identified breast cancer clinical notes, our approach is the first one to assess the extent of such attacks and our findings reveal successful manipulation of LLM outputs. Through this work, we emphasize on the urgency of comprehending these vulnerabilities in LLMs and encourage the mindful and responsible usage of LLMs in the clinical domain.
Speaker(s):
Avisha Das, Ph.D.
Mayo Clinic
Author(s):
Avisha Das, Ph.D. - Mayo Clinic; Amara Tariq, Ph.D. - Mayo Clinic Arizona; Felipe Batalini, M.D. - Mayo Clinic Arizona; Boddhisattwa Dhara, B. Tech. - Birla Institute of Technology and Science (BITS) Pilani, Hyderabad Campus, India; Imon Banerjee, PhD - Arizona State U, Mayo Clinic;
Semantic Clinical Artificial Intelligence (SCAI) Improves LLM performance on the USMLE Step 1, 2 and 3 Examinations
Presentation Time: 04:15 PM - 04:30 PM
Abstract Keywords: Controlled Terminologies, Ontologies, and Vocabularies, Large Language Models (LLMs), Natural Language Processing, Clinical Decision Support
Primary Track: Applications
Programmatic Theme: Clinical Informatics
Introduction – Large Language Models (LLMs) such as ChatGPT and GPT4 have been shown to perform well on the USMLE examination, achieving over 60% accuracy. Current LLMs predict the next word given a string of words and although this turns out to be quite powerful it lacks formal semantics and therefore cannot reason. We have developed a semantically augmented LLM named Semantic Clinical Artificial Intelligence (SCAI) which is a Generative Pre-trained Transformer. Our hypothesis was that adding semantics to LLMs would improve accuracy. This was tested on the United States Medical Licensing Examination (USMLE) performance comparing the LLM alone against the SCAI model.
Results: There were 87 text based questions in the examination. The native 13B parameter Llama LLM was able to get 29 (33.3%) questions correct. The SCAI version of the same native Llama 13B parameter LLM was able to get 48 (55.2%) of the 87 questions correct, p<0.0001.
There were 101 text based questions in the step 2 examination. The native LLM was able to get 35 (35%) correct and the SCAI was able to get 49 correct (49%), with p=0.005.
There were 123 text based questions in the step 3 examination. The native LLM was able to get 45 (36.6%) correct and the SCAI was able to get 68 correct (55.3%), with p<0.0001.
Semantic augmentation using RAG (SCAI) led to significantly improved scores on the USMLE step 1, 2 and 3 tests. None of these methods were able to pass any of the USMLE step exams.
Speaker(s):
Peter Elkin, MD, MACP, FACMI, FNYAM, FAMIA, FIAHSI
Department of Biomedical Informatics, Jacobs School of Medicine and Biomedical Sciences, University at Buffalo, State University of New York
Author(s):
Presentation Time: 04:15 PM - 04:30 PM
Abstract Keywords: Controlled Terminologies, Ontologies, and Vocabularies, Large Language Models (LLMs), Natural Language Processing, Clinical Decision Support
Primary Track: Applications
Programmatic Theme: Clinical Informatics
Introduction – Large Language Models (LLMs) such as ChatGPT and GPT4 have been shown to perform well on the USMLE examination, achieving over 60% accuracy. Current LLMs predict the next word given a string of words and although this turns out to be quite powerful it lacks formal semantics and therefore cannot reason. We have developed a semantically augmented LLM named Semantic Clinical Artificial Intelligence (SCAI) which is a Generative Pre-trained Transformer. Our hypothesis was that adding semantics to LLMs would improve accuracy. This was tested on the United States Medical Licensing Examination (USMLE) performance comparing the LLM alone against the SCAI model.
Results: There were 87 text based questions in the examination. The native 13B parameter Llama LLM was able to get 29 (33.3%) questions correct. The SCAI version of the same native Llama 13B parameter LLM was able to get 48 (55.2%) of the 87 questions correct, p<0.0001.
There were 101 text based questions in the step 2 examination. The native LLM was able to get 35 (35%) correct and the SCAI was able to get 49 correct (49%), with p=0.005.
There were 123 text based questions in the step 3 examination. The native LLM was able to get 45 (36.6%) correct and the SCAI was able to get 68 correct (55.3%), with p<0.0001.
Semantic augmentation using RAG (SCAI) led to significantly improved scores on the USMLE step 1, 2 and 3 tests. None of these methods were able to pass any of the USMLE step exams.
Speaker(s):
Peter Elkin, MD, MACP, FACMI, FNYAM, FAMIA, FIAHSI
Department of Biomedical Informatics, Jacobs School of Medicine and Biomedical Sciences, University at Buffalo, State University of New York
Author(s):
BadCLM: Backdoor Attack in Clinical Language Models for Electronic Health Records
Presentation Time: 04:30 PM - 04:45 PM
Abstract Keywords: Clinical Decision Support, Clinical Decision Support, Natural Language Processing
Primary Track: Applications
Programmatic Theme: Clinical Research Informatics
The advent of clinical language models integrated into electronic health records (EHR) for clinical decision support has marked a significant advancement, leveraging the depth of clinical notes for improved decision-making. Despite their success, the potential vulnerabilities of these models remain largely unexplored. This paper delves into the realm of backdoor attacks on clinical language models, introducing an innovative attention-based backdoor attack method, BadCLM. This technique clandestinely embeds a backdoor within the models, causing them to produce incorrect predictions when a pre-defined trigger is present in inputs, while functioning accurately otherwise. We demonstrate the efficacy of BadCLM through an in-hospital mortality prediction task with MIMIC III dataset, showcasing its potential to compromise model integrity. Our findings illuminate a significant security risk in clinical decision support systems and pave the way for future endeavors in fortifying clinical language models against such vulnerabilities.
Speaker(s):
Weimin Lyu, Master
Stony Brook University
Author(s):
Zexin Bi, High school student; Fusheng Wang, Ph.D. - Stony Brook University; Chao Chen, Ph.D. - Stony Brook University;
Presentation Time: 04:30 PM - 04:45 PM
Abstract Keywords: Clinical Decision Support, Clinical Decision Support, Natural Language Processing
Primary Track: Applications
Programmatic Theme: Clinical Research Informatics
The advent of clinical language models integrated into electronic health records (EHR) for clinical decision support has marked a significant advancement, leveraging the depth of clinical notes for improved decision-making. Despite their success, the potential vulnerabilities of these models remain largely unexplored. This paper delves into the realm of backdoor attacks on clinical language models, introducing an innovative attention-based backdoor attack method, BadCLM. This technique clandestinely embeds a backdoor within the models, causing them to produce incorrect predictions when a pre-defined trigger is present in inputs, while functioning accurately otherwise. We demonstrate the efficacy of BadCLM through an in-hospital mortality prediction task with MIMIC III dataset, showcasing its potential to compromise model integrity. Our findings illuminate a significant security risk in clinical decision support systems and pave the way for future endeavors in fortifying clinical language models against such vulnerabilities.
Speaker(s):
Weimin Lyu, Master
Stony Brook University
Author(s):
Zexin Bi, High school student; Fusheng Wang, Ph.D. - Stony Brook University; Chao Chen, Ph.D. - Stony Brook University;
A systematic evaluation of large language models for biomedical natural language processing: benchmarks, baselines, and recommendations
Presentation Time: 04:45 PM - 05:00 PM
Abstract Keywords: Natural Language Processing, Large Language Models (LLMs), Deep Learning
Primary Track: Applications
Programmatic Theme: Clinical Informatics
Despite the potential of Large Language Models in biomedicine, they lack baseline performance, benchmarks, and recommendations for using LLMs in the biomedical domain. This study makes three contributions. First, it undertakes a comprehensive evaluation to establish the baseline performance of LLMs (GPT-3.5, GPT-4, and LLaMA) across 12 BioNLP datasets encompassing six distinct extractive and generative tasks. Second, we conducted thorough manual validation collectively over thousands of sample outputs in total. Third, the study offers valuable suggestions for the effective use of LLMs in BioNLP applications.
Speaker(s):
Qingyu Chen, PhD
Yale University
Author(s):
Presentation Time: 04:45 PM - 05:00 PM
Abstract Keywords: Natural Language Processing, Large Language Models (LLMs), Deep Learning
Primary Track: Applications
Programmatic Theme: Clinical Informatics
Despite the potential of Large Language Models in biomedicine, they lack baseline performance, benchmarks, and recommendations for using LLMs in the biomedical domain. This study makes three contributions. First, it undertakes a comprehensive evaluation to establish the baseline performance of LLMs (GPT-3.5, GPT-4, and LLaMA) across 12 BioNLP datasets encompassing six distinct extractive and generative tasks. Second, we conducted thorough manual validation collectively over thousands of sample outputs in total. Third, the study offers valuable suggestions for the effective use of LLMs in BioNLP applications.
Speaker(s):
Qingyu Chen, PhD
Yale University
Author(s):