Detecting Manuscripts Related to Computable Phenotypes Using a Transformer-based Language Model
Presentation Time: 05:00 PM - 06:30 PM
Abstract Keywords: Knowledge Representation and Information Modeling, Artificial Intelligence, Phenomics and Phenome-wide Association Studies, Large Language Models (LLMs)
Primary Track: Applications
Programmatic Theme: Public Health Informatics
Identifying relevant manuscripts for phenomics knowledgebases is a complex and time-consuming task. We developed a Transformer-based language model using a fine-tuned BioBERT model to detect manuscripts related to computable phenotypes. To address BioBERT’s 512-token limit, we introduced a sliding-window method that splits each document into multiple segments and aggregates the per-segment classification scores into a document-level prediction. Our model significantly outperformed the default approach (AUC: 0.99 vs. 0.83, Accuracy: 0.95 vs. 0.72). This method enhances automated identification of phenotyping literature, improving knowledgebase development efficiency.
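A minimal sketch of the sliding-window idea described above. The window size, stride, overlap, and mean aggregation are illustrative assumptions (the abstract does not specify them), and `classify_segment` is a hypothetical stand-in for the fine-tuned BioBERT classifier:

```python
def make_windows(tokens, window_size=512, stride=256):
    """Split a token sequence into overlapping fixed-size segments.

    window_size and stride are assumed values; the 512 cap mirrors
    BioBERT's maximum input length.
    """
    if len(tokens) <= window_size:
        return [tokens]
    windows = []
    for start in range(0, len(tokens) - stride, stride):
        windows.append(tokens[start:start + window_size])
    return windows


def aggregate_scores(scores):
    """Combine per-segment scores into one document score (mean here;
    max-pooling is another common choice)."""
    return sum(scores) / len(scores)


def classify_document(tokens, classify_segment):
    """Score every window with the segment classifier, then aggregate."""
    segments = make_windows(tokens)
    return aggregate_scores([classify_segment(seg) for seg in segments])
```

In practice the segment classifier would return the positive-class probability from the fine-tuned model, and the aggregated score would be thresholded to decide whether the manuscript is phenotype-related.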
Speaker(s):
Junghoon Chae, PhD
Oak Ridge National Laboratory
Author(s):
Junghoon Chae, PhD - Oak Ridge National Laboratory; David Heise; Keith Connatser; Jacqueline Honerlaw, RN, MPH - VA Boston Healthcare System; Monika Maripuri, MBBS, MPH - VA Boston Healthcare System; Yuk-Lam Ho, MPH - VA Boston Healthcare System; Kelly Cho, PhD - VA Boston Healthcare/Harvard Medical School
Category
Poster - Regular