Publication Type Tagging using Transformer Models and Multi-Label Classification
Presentation Time: 04:30 PM - 04:45 PM
Abstract Keywords: Natural Language Processing, Deep Learning, Machine Learning, Knowledge Representation and Information Modeling, Information Retrieval
Primary Track: Foundations
Programmatic Theme: Clinical Research Informatics
Indexing articles by their publication type and study design is essential for efficient search and filtering of the biomedical literature, but is understudied compared to indexing by MeSH topical terms. In this study, we leveraged the human-curated publication types and study designs in PubMed to generate a dataset of more than 1.2M articles (titles and abstracts) and used state-of-the-art Transformer-based models for automatic tagging of publication types and study designs. Specifically, we trained PubMedBERT-based models using a multi-label classification approach, and explored undersampling, feature verbalization, and contrastive learning to improve model performance. Our results show that PubMedBERT provides a strong baseline for publication type and study design indexing; undersampling, feature verbalization, and unsupervised contrastive loss have a positive impact on performance, whereas supervised contrastive learning degrades performance. We obtained the best overall performance with 80% undersampling and feature verbalization (0.632 macro-F1, 0.969 macro-AUC). The model outperformed the previous model (MultiTagger) across all metrics, and the performance difference was statistically significant (p < 0.001). Despite its stronger performance, the model still has room for improvement, and future work could explore features based on full text as well as model interpretability. We make our data and code available at https://github.com/ScienceNLP-Lab/MultiTagger-v2.
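For readers unfamiliar with the macro-F1 figure reported above, the following is a minimal sketch of how a macro-averaged F1 score can be computed for multi-label predictions. It is not taken from the authors' code: the label names, the toy probabilities, the 0.5 decision threshold, and the helper names `binarize` and `macro_f1` are all assumptions made for illustration.

```python
from typing import List, Sequence

# Hypothetical label set, for illustration only.
LABELS = ["Randomized Controlled Trial", "Review", "Case Reports"]


def binarize(probs: Sequence[Sequence[float]], threshold: float = 0.5) -> List[List[int]]:
    """Turn per-label probabilities into 0/1 predictions (threshold is an assumption)."""
    return [[1 if p >= threshold else 0 for p in row] for row in probs]


def macro_f1(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Compute F1 separately for each label, then take the unweighted mean."""
    n_labels = len(y_true[0])
    scores = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        # A label with no true or predicted positives is scored 0.0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_labels


# Toy data: four articles, three labels; an article may carry several labels.
y_true = [[1, 0, 0], [0, 1, 0], [1, 0, 1], [0, 0, 1]]
probs = [[0.9, 0.2, 0.1], [0.3, 0.8, 0.1], [0.6, 0.1, 0.4], [0.7, 0.1, 0.9]]
y_pred = binarize(probs)
score = macro_f1(y_true, y_pred)
```

Because every label contributes equally to the mean, rare publication types weigh as much as common ones, which is one reason macro-averaged scores are often much lower than micro-averaged ones on imbalanced label sets.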
Speaker(s):
Joseph Menke, MS
University of Illinois, Urbana-Champaign
Author(s):
Joseph Menke, MS - University of Illinois, Urbana-Champaign; Halil Kilicoglu, PhD - University of Illinois at Urbana-Champaign; Neil Smalheiser, MD - University of Illinois at Chicago
Category
Paper - Regular