Identifying Genomic Data Sources from Biomedical Literature
Presentation Time: 04:00 PM - 04:15 PM
Abstract Keywords: Data Mining, Information Extraction, Natural Language Processing
Primary Track: Applications
Genomic research is becoming increasingly data-intensive, yet referencing data properly remains a persistent challenge. Despite various efforts to establish and standardize data citation practices, scientists frequently fail to reference data accurately in their papers. This deficiency complicates the attribution of contributions to data providers and impedes the reproducibility of findings in genomic research. This study addresses this gap by introducing a gold standard corpus designed to identify mentions of genomic data sources and their associated attributes, thereby offering insights into data source availability and accessibility. Within this corpus, we categorize entities into six classes: three primary entities (Dataset, Repository, and Contributor) and three attributes (Accession Number, URL, and DOI). We also define and annotate the relations between these primary entities and attributes. We analyze the corpus by assessing inter-annotator agreement and implementing an information extraction pipeline using BERT-based models. Our BERT-based models achieve a best F1 score of 0.94 in recognizing mentions of genomic data sources and 0.76 in extracting relations between these mentions and their associated attributes. By introducing this genomic data source mention corpus, we aim to advance data sharing and reuse in future genomic research.
Speaker(s):
Kalpana Raja
Author(s):
Xu Zuo - University of Texas Health Science Center at Houston; Ashley Gilliam, B.S. - University of Texas Health Science Center at Houston; Yan Hu - University of Texas Health Science Center at Houston; Kirk Roberts, PhD - University of Texas Health Science Center at Houston; Hua Xu, PhD - Yale University
Category
Paper - Regular