Times are displayed in (UTC-08:00) Pacific Time (US & Canada)
11/10/2024 | 3:30 PM – 5:00 PM | Franciscan B
S11: Biomedical Literature Analysis - PubMed Pundits
Presentation Type: Oral
Session Chair:
Katrina Romagnoli, PhD, MS, MLIS - Geisinger
LitSense Insight: Bridging Gaps in Information Retrieval through Sentence Level Knowledge Discovery
Presentation Time: 03:30 PM - 03:45 PM
Abstract Keywords: Information Retrieval, Large Language Models (LLMs), Natural Language Processing, Delivering Health Information and Knowledge to the Public, Deep Learning
Primary Track: Applications
Programmatic Theme: Public Health Informatics
LitSense, a service provided by NCBI, is a web-based system that specializes in biomedical sentence retrieval. Given a query sentence, LitSense retrieves the most similar sentences from over 1.3 billion sentences in PubMed abstracts and PMC full-text articles. In this work, we propose an improvement to LitSense using semantic search technologies powered by MedCPT, a state-of-the-art language embedding model for biomedical text. By incorporating MedCPT embeddings into LitSense, we achieve a significant improvement in retrieval performance.
Speaker(s):
Lana Yeganova, PhD
NIH
Author(s):
Won G Kim, PhD - National Library of Medicine, NIH; Shubo Tian, PhD - National Library of Medicine, NIH; Donald C. Comeau, PhD - National Library of Medicine, NIH; W. John Wilbur, MD - Computer Craft Corporation; Zhiyong Lu, PhD - National Library of Medicine, NIH;
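The sketch below illustrates the kind of embedding-based sentence retrieval described in the abstract above: a query sentence and candidate sentences are embedded with the publicly released MedCPT query encoder and ranked by cosine similarity. This is a minimal sketch only; the production LitSense service searches roughly 1.3 billion sentences with a dedicated vector index rather than the brute-force comparison shown here, and the CLS-pooling choice is an assumption.

# Illustrative sketch: rank candidate sentences against a query sentence by
# cosine similarity of MedCPT embeddings. The checkpoint ID is the assumed
# public MedCPT query encoder; LitSense itself uses a large-scale vector
# index rather than this in-memory comparison.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "ncbi/MedCPT-Query-Encoder"  # assumed public checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

def embed(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    vecs = out.last_hidden_state[:, 0, :]          # [CLS] embedding (assumed pooling)
    return torch.nn.functional.normalize(vecs, dim=-1)

query = ["Statins reduce the risk of major cardiovascular events."]
candidates = [
    "Statin therapy lowers the incidence of myocardial infarction and stroke.",
    "The zebrafish is a widely used model organism in developmental biology.",
]
scores = embed(query) @ embed(candidates).T        # cosine similarity of unit vectors
print(scores)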
IntelliGenes: A novel, interactive, customizable, and user-friendly AI/ML application for biomarker discovery and predictive analysis
Presentation Time: 03:45 PM - 04:00 PM
Abstract Keywords: Machine Learning, Omics (genomics, metabolomics, proteomics, transcriptomics, etc.) and Integrative Analyses, Biomarkers
Primary Track: Applications
Programmatic Theme: Translational Bioinformatics
Artificial intelligence (AI) and machine learning (ML) have advanced in many areas and fields of life; however, their progress in genomics has not matched the levels attained elsewhere. Challenges include, but are not limited to, the handling and analysis of high volumes of complex genomic data and the expertise needed to implement and execute AI/ML approaches. In this study, we present IntelliGenes, a novel, interactive, customizable, cross-platform, and user-friendly AI/ML application for multi-genomic data exploration to discover novel biomarkers and predict rare, common, and complex diseases. The implemented methodology is based on a nexus of conventional statistical techniques and cutting-edge ML algorithms. The interactive and cross-platform graphical user interface of IntelliGenes is divided into three main sections: 1) Data Manager, 2) AI/ML Analysis, and 3) Visualization. Data Manager supports the user in loading and customizing the input data and the list of existing biomarkers. AI/ML Analysis allows the user to apply default combinations of statistical and ML algorithms, as well as to customize and create new AI/ML pipelines. Visualization provides options for interpreting the produced results. The performance of IntelliGenes has been successfully tested in various in-house and peer-reviewed studies to discover biomarkers associated with cardiovascular diseases and to predict them. We have designed and implemented it so that users with and without a computational background can apply AI/ML approaches to discover novel biomarkers and predict diseases.
Speaker(s):
Zeeshan Ahmed, PhD
Department of Medicine, Rutgers Robert Wood Johnson Medical School. Rutgers Institute for Health, Health Care Policy and Aging Research. Rutgers Biomedical and Health Sciences. Rutgers The State University of New Jersey.
Author(s):
William DeGroat; Rishabh Narayanan, BS - Rutgers Institute for Health, Health Care Policy and Aging Research; Dinesh Mendhe, MS in Computer Science; Habiba Abdelhalim; Zeeshan Ahmed, PhD - Department of Medicine, Rutgers Robert Wood Johnson Medical School. Rutgers Institute for Health, Health Care Policy and Aging Research. Rutgers Biomedical and Health Sciences. Rutgers The State University of New Jersey.;
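As a rough illustration of the general approach described in the abstract above (conventional statistics combined with ML for biomarker discovery and disease prediction), the sketch below pairs univariate statistical feature selection with a random-forest classifier in scikit-learn. It is not the IntelliGenes implementation or its API, and the input file and column names are hypothetical.

# Illustrative only: statistical feature selection + ML classification for
# biomarker discovery, in the spirit of the pipeline described above.
# Not the IntelliGenes codebase; data layout and column names are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

df = pd.read_csv("expression_matrix.csv")   # hypothetical: samples x genes plus a "disease" label
X, y = df.drop(columns=["disease"]), df["disease"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=50)),                 # conventional statistics (ANOVA F-test)
    ("clf", RandomForestClassifier(n_estimators=500, random_state=42)),  # ML prediction
])
pipe.fit(X_tr, y_tr)
print(classification_report(y_te, pipe.predict(X_te)))

# Features surviving selection can be treated as candidate biomarkers.
selected = X.columns[pipe.named_steps["select"].get_support()]
print(list(selected)[:10])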
Identifying Genomic Data Sources from Biomedical Literature
Presentation Time: 04:00 PM - 04:15 PM
Abstract Keywords: Data Mining, Information Extraction, Natural Language Processing
Primary Track: Applications
Genomic research is becoming increasingly data-intensive, yet proper referencing of data remains a persistent challenge. Despite various efforts to establish and standardize data citation practices, scientists frequently fall short of accurately referencing data in their papers. This deficiency complicates the attribution of contributions to data providers and impedes the reproducibility of findings in genomic research. This study addresses this gap by introducing a gold standard corpus designed to identify mentions of genomic data sources and associated attributes, thereby offering insights into data source availability and accessibility. Within this corpus, we categorize entities into six classes, encompassing three primary entities (Dataset, Repository, and Contributor) and three attributes (Accession Number, URL, and DOI). We also define and annotate the relations between these main entities and attributes. We perform a comprehensive analysis of the corpus, assessing inter-annotator agreement and implementing an information extraction pipeline using BERT-based models. Our BERT-based models achieve a best F1 score of 0.94 in recognizing mentions of genomic data sources and 0.76 in extracting relationships between these mentions and associated attributes. By introducing this genomic data source mention corpus, we aim to propel the progress of data sharing and reuse in forthcoming genomic research.
Speaker(s):
Kalpana Raja
Author(s):
Xu Zuo - UTHealth Health Science Center at Houston; Ashley Gilliam, B.S. - UTHealth Health Science Center at Houston; Yan Hu - UTHealth Science Center Houston; Kirk Roberts, PhD - University of Texas Health Science Center at Houston; Hua Xu, Ph.D - Yale University;
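A minimal sketch of the entity recognition step in a BERT-based pipeline like the one described in the abstract above, using the Hugging Face token-classification pipeline. The model ID is a hypothetical placeholder for a checkpoint fine-tuned on the authors' corpus, not a released artifact.

# Illustrative sketch of BERT-based mention recognition for genomic data sources.
# The model ID below is hypothetical; it stands in for a checkpoint fine-tuned on
# the gold standard corpus described above (Dataset, Repository, Contributor,
# Accession Number, URL, DOI).
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="my-org/genomic-data-source-ner",   # hypothetical fine-tuned checkpoint
    aggregation_strategy="simple",            # merge word pieces into entity spans
)

text = ("RNA-seq data were deposited in the Gene Expression Omnibus "
        "under accession number GSE123456.")
for ent in ner(text):
    print(ent["entity_group"], ent["word"], round(ent["score"], 3))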
Artificial Intelligence-assisted Biomedical Literature Knowledge Synthesis to Support Decision-making in Precision Oncology
Presentation Time: 04:15 PM - 04:30 PM
Abstract Keywords: Natural Language Processing, Information Extraction, Cancer Genetics
Primary Track: Applications
Programmatic Theme: Clinical Research Informatics
The delivery of effective targeted therapies requires comprehensive analyses of the molecular profiling of tumors and matching with clinical phenotypes in the context of existing knowledge described in biomedical literature, registries, and knowledge bases. We evaluated the performance of natural language processing approaches in supporting knowledge retrieval and synthesis from the biomedical literature. We tested PubTator 3.0, Bidirectional Encoder Representations from Transformers (BERT), and large language models, and evaluated their ability to support named entity recognition (NER) and relation extraction (RE) from biomedical texts. PubTator 3.0 and the BioBERT model performed best in the NER task (best F1-scores 0.93 and 0.89, respectively), while BioBERT outperformed all other solutions in the RE task (best F1-score 0.79) and in a specific use case to which it was applied, recognizing nearly all entity mentions and most of the relations. Our findings support the use of AI-assisted approaches in facilitating precision oncology decision-making.
Speaker(s):
Ting He
Johns Hopkins University
Author(s):
Ting He - Johns Hopkins University; Kory Kreimeyer - US Food and Drug Administration; Taxiarchis Botsis, PhD - Johns Hopkins University School of Medicine; Mimi Najjar, MD - Johns Hopkins University; Jonathan Spiker, BS - Johns Hopkins University; Maria Fatteh, MD - Johns Hopkins University; Valsamo Anagnostou, MD, PhD - Johns Hopkins University;
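One common way to frame the relation extraction (RE) task evaluated above is sequence classification over sentences with marked entities, sketched below with the public BioBERT base checkpoint. The entity markers and label set are hypothetical illustrations, not the authors' exact setup, and the model would still require fine-tuning on annotated relations before its predictions are meaningful.

# Illustrative sketch of relation extraction framed as sequence classification
# with a BioBERT encoder, one common formulation of the RE task evaluated above.
# The entity markers and label set are hypothetical; dmis-lab/biobert-base-cased-v1.1
# is the public BioBERT base checkpoint and would need fine-tuning on annotated
# gene-variant-therapy relations.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "dmis-lab/biobert-base-cased-v1.1"
LABELS = ["no_relation", "sensitizes_to", "confers_resistance_to"]   # hypothetical label set

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=len(LABELS))

sentence = ("Tumors harboring [E1] EGFR L858R [/E1] respond to "
            "[E2] osimertinib [/E2] in NSCLC.")
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(LABELS[int(logits.argmax(dim=-1))])   # meaningful only after fine-tuning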
Publication Type Tagging using Transformer Models and Multi-Label Classification
Presentation Time: 04:30 PM - 04:45 PM
Abstract Keywords: Natural Language Processing, Deep Learning, Machine Learning, Knowledge Representation and Information Modeling, Information Retrieval
Primary Track: Foundations
Programmatic Theme: Clinical Research Informatics
Indexing articles by their publication type and study design is essential for efficient search and filtering of the biomedical literature, but is understudied compared to indexing by MeSH topical terms. In this study, we leveraged the human-curated publication types and study designs in PubMed to generate a dataset of more than 1.2M articles (titles and abstracts) and used state-of-the-art Transformer-based models for automatic tagging of publication types and study designs. Specifically, we trained PubMedBERT-based models using a multi-label classification approach, and explored undersampling, feature verbalization, and contrastive learning to improve model performance. Our results show that PubMedBERT provides a strong baseline for publication type and study design indexing; undersampling, feature verbalization, and unsupervised contrastive loss have a positive impact on performance, whereas supervised contrastive learning degrades performance. We obtained the best overall performance with 80% undersampling and feature verbalization (0.632 macro-F1, 0.969 macro-AUC). The model outperformed previous models (MultiTagger) across all metrics, and the performance difference was statistically significant (p < 0.001). Despite its stronger performance, the model still has room for improvement, and future work could explore features based on full text as well as model interpretability. We make our data and code available at https://github.com/ScienceNLP-Lab/MultiTagger-v2.
Speaker(s):
Joseph Menke, MS
University of Illinois, Urbana-Champaign
Author(s):
Joseph Menke, MS - University of Illinois, Urbana-Champaign; Halil Kilicoglu, PhD - University of Illinois at Urbana Champaign; Neil Smalheiser, MD - University of Illinois at Chicago;
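A minimal sketch of the multi-label tagging setup described in the abstract above: a PubMedBERT encoder with sigmoid outputs over the label set, thresholded at 0.5. The label list here is a small hypothetical subset of PubMed publication types and study designs, and the fine-tuning details reported in the abstract (undersampling, feature verbalization, contrastive learning) are omitted.

# Illustrative sketch of multi-label publication-type tagging with a PubMedBERT
# encoder. The checkpoint ID is the commonly used public PubMedBERT release;
# the label list is a small hypothetical subset of PubMed publication types.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
LABELS = ["Randomized Controlled Trial", "Cohort Study", "Case Report", "Review"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID,
    num_labels=len(LABELS),
    problem_type="multi_label_classification",   # BCE-with-logits loss during fine-tuning
)

text = "Title and abstract of the article to be tagged ..."
inputs = tokenizer(text, truncation=True, return_tensors="pt")
with torch.no_grad():
    probs = torch.sigmoid(model(**inputs).logits)[0]
predicted = [lab for lab, p in zip(LABELS, probs) if p > 0.5]   # meaningful only after fine-tuning
print(predicted)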
Leveraging Large Language Models for Data Extraction in Living Systematic Reviews and Meta-analyses
Presentation Time: 04:45 PM - 05:00 PM
Abstract Keywords: Clinical Decision Support, Clinical Guidelines, Deep Learning
Primary Track: Applications
Programmatic Theme: Clinical Informatics
We maintain living, interactive systematic reviews (LISRs) to synthesize up-to-date evidence as soon as new evidence becomes available. Data extraction for systematic reviews and meta-analyses (SRMAs) is time-consuming, resource-intensive, and prone to errors, and it is therefore performed by two reviewers in practice. Automating this step can greatly enhance the efficiency of evidence synthesis, informing clinical practice guidelines in a timely manner. To address this need, we propose a pipeline that leverages the collaborative capabilities of large language models (LLMs) to automate data extraction for living systematic reviews.
Speaker(s):
Muhammad Ali Khan, M.B.B.S.
Mayo Clinic
Author(s):
Muhammad Ali Khan, M.B.B.S. - Mayo Clinic; Umair Ayub, PhD - Mayo Clinic; Syed Arsalan Ahmed Naqvi, M.B.B.S - Mayo Clinic; Kaneez Zahra Rubab Khakwani, M.B.B.S. - University of Arizona; Zaryab bin Riaz Sipra, M.B.B.S. - Rashid Latif Medical College, Pakistan; Sihan Zhou, PhD. - Mayo Clinic; Huan He, Ph.D. - Yale University; Seyyed Amir Hossein, MS - Mayo Clinic; Hasan Bashar, M.B.B.S - Mayo Clinic; Bryan Rumble, MSc - American Society of Clinical Oncology; Danielle S. Bitterman, MD - Dana Farber Cancer Institute; Jeremy Warner, MD, MS - Brown University; Jia Zou, PhD - Arizona State University; Chitta Baral, PhD - Arizona State University; Jeanne M. Palmer, MD - Mayo Clinic; M. Hassan Murad, M.D. - Mayo Clinic; Irbaz Riaz, MD - Mayo Clinic;
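As a rough sketch of LLM-assisted data extraction of the kind described above, the snippet below prompts a chat model to return study characteristics as JSON. The model name, prompt, and field list are hypothetical; this is not the authors' pipeline, which the abstract describes as leveraging the collaborative capabilities of multiple LLMs.

# Illustrative sketch of LLM-assisted data extraction for a systematic review:
# prompt a chat model to return study characteristics as JSON. Model name,
# prompt, and field list are hypothetical placeholders.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FIELDS = ["study_design", "sample_size", "intervention", "comparator", "primary_outcome"]

def extract(abstract_text: str) -> dict:
    prompt = (
        "Extract the following fields from the study abstract and reply with JSON only.\n"
        f"Fields: {', '.join(FIELDS)}. Use null when a field is not reported.\n\n"
        f"Abstract:\n{abstract_text}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",                        # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},    # ask for parseable JSON
    )
    return json.loads(resp.choices[0].message.content)

print(extract("In this randomized trial of 220 patients, drug X reduced ..."))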