5/19/2026 | 3:30 PM – 4:45 PM | Mt. Elbert B - 555 Building, 2nd Floor
TRI23: Learning Across Modalities, Institutions, and Tasks (Oral Presentation)
Presentation Type: Oral Presentations
2026 CIC 25x5 Presentation
Session Credits: 1.25
Multimodal Training to Unimodal Deployment: Leveraging Unstructured Data During Training To Optimize Structured Data Only Deployment
Presentation Type: Paper - Student
Presentation Time: 03:30 PM - 03:42 PM
Primary Track: Data Science/Artificial Intelligence
Unstructured Electronic Health Record (EHR) data, such as clinical notes, contain contextual clinical observations that are not directly reflected in structured data fields. This additional information can substantially improve model learning. However, due to their unstructured nature, these data are often unavailable or impractical to use when deploying a model. We introduce a multimodal learning framework that leverages unstructured EHR data during training while producing a model that can be deployed using only structured EHR data. Using a cohort of 3,466 children evaluated for late talking, we generated note embeddings with BioClinicalBERT and encoded structured embeddings from demographics and medical codes. A note-based teacher model and a structured-only student model were jointly trained using contrastive learning and a contrastive knowledge distillation loss, producing a strong classifier (AUROC = 0.985). Our proposed model reached an AUROC of 0.705, outperforming the structured-only baseline of 0.656. These results demonstrate that incorporating unstructured data during training enhances the model’s capacity to identify task-relevant information within structured EHR data, enabling a deployable structured-only phenotype model.
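The contrastive-distillation idea in this abstract (pull each patient's structured-only student embedding toward the same patient's note-based teacher embedding, with the rest of the batch as negatives) can be sketched with an InfoNCE-style loss. Everything below is illustrative, not the authors' implementation: the embedding dimension, batch size, and random embeddings are stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(Z):
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

def contrastive_distill_loss(student_z, teacher_z, temperature=0.1):
    """InfoNCE-style contrastive distillation: the student embedding of each
    patient should be most similar to the teacher embedding of the *same*
    patient (diagonal), with the rest of the batch serving as negatives."""
    logits = student_z @ teacher_z.T / temperature      # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)         # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                 # positives are on the diagonal

B, D = 32, 64
teacher_z = l2_normalize(rng.normal(size=(B, D)))   # stand-in note-based teacher output
aligned_student = teacher_z.copy()                  # student perfectly matched to teacher
random_student = l2_normalize(rng.normal(size=(B, D)))

# A student aligned with the teacher incurs a much lower loss than a random one,
# which is the gradient signal that transfers note information into the student.
low = contrastive_distill_loss(aligned_student, teacher_z)
high = contrastive_distill_loss(random_student, teacher_z)
```

Minimizing this loss with respect to the student's parameters is what lets the structured-only model inherit structure from the note modality while remaining deployable without notes.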
Speaker(s):
Zigui Wang, Bachelor
Duke University
Author(s):
Zigui Wang, Bachelor - Duke University;
Minghui Sun, Master of Biostatistics - Duke University;
Jiang Shu, Student - Duke University;
Matthew Engelhard, PhD, MD - Duke University School of Medicine;
Lauren Franz, MBBCh - Duke University;
Benjamin Goldstein, PhD - Duke University;
Federated R-Learner for Estimating Heterogeneous Treatment Effects for Personalized Treatments
Presentation Type: Podium Abstract
Presentation Time: 03:42 PM - 03:54 PM
Primary Track: Clinical Research Informatics
Federated health data networks such as OHDSI and PCORnet enable real-world evidence generation across diverse health systems without centralizing electronic health records (EHRs). These networks are well suited for estimating heterogeneous treatment effects (HTEs), which characterize how treatment benefits and harms vary across demographic, clinical, and social subgroups. However, most causal analyses in real-world data focus on average treatment effects, and state-of-the-art HTE estimators such as the R-learner typically require centralized individual-level data—a major limitation under modern privacy and governance constraints. To address this need, we developed the Federated R-learner (Fed-R), a privacy-preserving, communication-efficient, and heterogeneity-aware algorithm for estimating HTEs across multiple institutions without sharing patient-level data. Fed-R uses site-specific cross-fitted outcome and propensity residuals to compute aggregated sufficient statistics, which are then combined to solve a single global linear system. Variance-model–based conformal inference provides valid uncertainty quantification using only aggregated summaries.
Across six simulation scenarios varying in covariate shift and treatment-policy heterogeneity, Fed-R matched centralized performance when heterogeneity was mild and substantially outperformed centralized R-learner models when site-specific overlap assumptions failed. Although centralized analysis remained advantageous under pure covariate shift, Fed-R demonstrated strong stability and resistance to bias. These results establish Fed-R as a practical and robust approach for HTE estimation in real-world federated data networks.
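The abstract's pipeline (each site computes cross-fitted residuals, reduces them to aggregated sufficient statistics, and a server solves one global linear system) can be sketched for a linear effect model tau(x) = x @ beta. The simulation below is illustrative only: residuals are generated directly rather than cross-fitted from outcome and propensity models, and the conformal-inference step is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def site_statistics(X, y_resid, a_resid):
    """One site's aggregated sufficient statistics for the R-learner.
    With tau(x) = x @ beta, minimizing sum((y_resid - a_resid * (X @ beta))**2)
    is a weighted least-squares problem; only these p x p and p-vector
    aggregates leave the site, never patient-level rows."""
    W = a_resid ** 2                      # per-patient weights (A - e(X))^2
    XtWX = (X * W[:, None]).T @ X         # sum_i w_i x_i x_i^T
    XtWz = X.T @ (a_resid * y_resid)      # sum_i (A_i - e_i)(Y_i - m_i) x_i
    return XtWX, XtWz

# Simulate K sites sharing a true effect model tau(x) = x @ beta_true.
p, beta_true = 3, np.array([0.5, -1.0, 2.0])
stats = []
for _ in range(4):  # 4 hypothetical sites
    n = 500
    X = rng.normal(size=(n, p))
    a_resid = rng.normal(size=n)          # stand-in treatment residuals A - e(X)
    y_resid = a_resid * (X @ beta_true) + 0.1 * rng.normal(size=n)
    stats.append(site_statistics(X, y_resid, a_resid))

# Server: sum the aggregates across sites and solve one global linear system.
XtWX = sum(s[0] for s in stats)
XtWz = sum(s[1] for s in stats)
beta_hat = np.linalg.solve(XtWX, XtWz)
```

Because the per-site summaries are additive, the server-side solve recovers exactly the estimate a pooled analysis would produce under this linear model, without any patient-level data leaving a site.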
Speaker(s):
Linying Zhang, PhD
Washington University in St. Louis
Author(s):
Runyan Xin, BS - Washington University in St. Louis;
Yiqiao Jin, MS - Washington University in St. Louis;
Nan Lin, PhD - Washington University in St. Louis;
Linying Zhang, PhD - Washington University in St. Louis
Enhanced Atrial Fibrillation Prediction in ESUS Patients with Hypergraph-based Pre-training
Presentation Type: Paper - Regular
Presentation Time: 03:54 PM - 04:06 PM
Primary Track: Data Science/Artificial Intelligence
Atrial fibrillation (AF) is a major complication following embolic stroke of undetermined source (ESUS), elevating the risk of recurrent stroke and mortality. Early identification is clinically important, yet existing tools face limitations in accuracy, scalability, and cost. Machine learning (ML) offers promise but is hindered by small ESUS cohorts and high-dimensional medical features. To address these challenges, we introduce supervised and unsupervised hypergraph-based pre-training strategies to improve AF prediction in ESUS patients. We first pre-train hypergraph-based patient embedding models on a large stroke cohort (7,780 patients) to capture salient features and higher-order interactions. The resulting embeddings are transferred to a smaller ESUS cohort (510 patients), reducing feature dimensionality while preserving clinically meaningful information, enabling effective prediction with lightweight models. Experiments show that both pre-training approaches outperform traditional models trained on raw data, improving accuracy and robustness. This framework offers a scalable and efficient solution for AF prediction after stroke.
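The transfer step described here (pre-train a hypergraph patient-embedding model on the large stroke cohort, then reuse it to embed the small ESUS cohort in a low dimension) can be sketched with one HGNN-style hypergraph convolution. The incidence matrix, dimensions, and random data below are illustrative stand-ins, not the authors' model.

```python
import numpy as np

rng = np.random.default_rng(1)

def hypergraph_conv(H, X, Theta):
    """One hypergraph convolution step (HGNN-style): patients are nodes and
    shared medical codes are hyperedges. H is the (n_patients, n_codes)
    incidence matrix; features are smoothed over patients that share codes."""
    Dv = H.sum(axis=1)                            # node degrees (codes per patient)
    De = H.sum(axis=0)                            # hyperedge degrees (patients per code)
    Dv_inv = 1.0 / np.sqrt(np.maximum(Dv, 1.0))   # guard against empty rows
    De_inv = 1.0 / np.maximum(De, 1.0)            # guard against empty hyperedges
    A = (H * Dv_inv[:, None] * De_inv[None, :]) @ (H.T * Dv_inv[None, :])
    return np.maximum(A @ X @ Theta, 0.0)         # ReLU

# "Pre-training" stand-in: Theta would be fit on the large cohort, then reused
# to embed a small cohort that shares the same code vocabulary.
n_small, n_codes, d_in, d_emb = 20, 50, 16, 8
Theta = rng.normal(scale=0.1, size=(d_in, d_emb))
H_small = (rng.random((n_small, n_codes)) < 0.1).astype(float)
X_small = rng.normal(size=(n_small, d_in))
emb = hypergraph_conv(H_small, X_small, Theta)    # low-dimensional patient embeddings
```

The resulting embeddings replace the raw high-dimensional features, which is what makes a lightweight downstream classifier feasible on a 510-patient cohort.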
Speaker(s):
Yuzhang Xie, Doctoral Student
Emory University
Author(s):
Yuzhang Xie, Doctoral Student - Emory University;
Yuhua Wu, BSN - School of Nursing, Emory University;
Ruiyu Wang, Bachelor of Science - Emory University;
Fadi Nahab, MD - Emory University;
Xiao Hu, PhD - Emory University;
Carl Yang, PhD - Emory University;
Exploring Anti-Aging Literature via ConvexTopics and Large Language Models
Presentation Type: Paper - Regular
Presentation Time: 04:06 PM - 04:18 PM
Primary Track: Data Science/Artificial Intelligence
The rapid expansion of biomedical publications creates challenges for organizing knowledge and detecting emerging trends, underscoring the need for scalable and interpretable methods. Common clustering and topic modeling approaches such as K-means or LDA remain sensitive to initialization and prone to local optima, limiting reproducibility and evaluation. We propose a reformulation of a convex-optimization–based clustering algorithm that produces stable, fine-grained topics by selecting exemplars from the data and guaranteeing a global optimum. Applied to ~12,000 PubMed articles on aging and longevity, our method uncovers topics validated by medical experts. It yields interpretable topics spanning from molecular mechanisms to dietary supplements, physical activity, and gut microbiota. The method performs favorably, and most importantly, its reproducibility and interpretability distinguish it from common clustering approaches, including K-means, LDA, and BERTopic. This work provides a basis for developing scalable, web-accessible tools for knowledge discovery.
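The abstract does not reproduce its exact convex formulation, but one standard convex route to exemplar-based clustering with a global optimum is the LP relaxation of the k-median problem: choose k data points as exemplars and assign every point to one. A toy sketch on six 2-D "documents" (all data and sizes here are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

# Toy "documents": 2-D points in two well-separated groups.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
n, k = len(pts), 2
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)  # pairwise distances

# Variables: z[i, j] (point i assigned to exemplar j), then y[j] (j is an exemplar).
c = np.concatenate([D.ravel(), np.zeros(n)])

# Equalities: each point assigned exactly once; exactly k exemplars chosen.
A_eq = np.zeros((n + 1, n * n + n))
for i in range(n):
    A_eq[i, i * n:(i + 1) * n] = 1.0
A_eq[n, n * n:] = 1.0
b_eq = np.concatenate([np.ones(n), [k]])

# Inequalities: z[i, j] <= y[j] (assign only to chosen exemplars).
A_ub = np.zeros((n * n, n * n + n))
for i in range(n):
    for j in range(n):
        A_ub[i * n + j, i * n + j] = 1.0
        A_ub[i * n + j, n * n + j] = -1.0
b_ub = np.zeros(n * n)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
y = res.x[n * n:]
exemplars = set(np.flatnonzero(y > 0.5))
```

Because the relaxation is a linear program, the solver returns a certified global optimum regardless of initialization, which is the reproducibility property the abstract contrasts with K-means and LDA. On well-separated data like this, the relaxation is integral and picks one exemplar per group.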
Speaker(s):
Lana Yeganova, PhD
NIH
Author(s):
Won Kim, PhD - NIH;
Shubo Tian, PhD - NIH;
Natalie Xie, BS - NIH;
Donald Comeau, PhD - NIH;
W. John Wilbur, MD - Computer Craft Corporation;
Zhiyong Lu, PhD - National Library of Medicine, NIH;
Lana Yeganova, PhD - NIH
Leveraging LLM-Driven Weak Supervision to Classify Infant Feeding Behavior from Clinical Notes
Presentation Type: Podium Abstract
Presentation Time: 04:18 PM - 04:30 PM
Primary Track: Data Science/Artificial Intelligence
Infant feeding behavior is inconsistently documented in clinical notes, limiting the ability to identify breast milk exposure. Evaluated against labels produced through LLM-driven weak supervision, our rule-based pipeline achieved over 94% accuracy on a held-out test set. This approach provides an efficient, auditable method for transforming unstructured breastfeeding documentation into structured data suitable for downstream clinical and pharmacologic research.
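The study's actual rule set is not published in this abstract; as a minimal sketch of what an auditable rule-based feeding classifier looks like, the toy patterns and negation window below are entirely hypothetical.

```python
import re

# Hypothetical rules, not the study's rule set: mention patterns plus a crude
# same-sentence negation window checked before each mention.
BREAST_PAT = re.compile(r"\b(breast\s*fed|breastfeed\w*|expressed breast milk)\b", re.I)
FORMULA_PAT = re.compile(r"\b(formula)\b", re.I)
NEGATION = re.compile(r"\b(no|not|denies|without)\b[^.]{0,40}$", re.I)

def classify_feeding(note: str) -> str:
    """Toy rule-based classifier over a clinical-note snippet."""
    labels = set()
    for pat, label in [(BREAST_PAT, "breast_milk"), (FORMULA_PAT, "formula")]:
        for m in pat.finditer(note):
            # Negated if a cue word appears shortly before the mention,
            # with no sentence boundary (period) in between.
            if not NEGATION.search(note[:m.start()]):
                labels.add(label)
    if {"breast_milk", "formula"} <= labels:
        return "mixed"
    return labels.pop() if labels else "undocumented"
```

Every prediction traces back to a specific pattern match, which is the sense in which such pipelines are auditable compared with opaque model labels.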
Speaker(s):
Connor Grannis, MTDA
Nationwide Children's Hospital
Author(s):
Connor Grannis, MTDA - Nationwide Children's Hospital;
Austin Antoniou, PhD - Nationwide Children's Hospital;
David Gordon, BS - Nationwide Children's Hospital;
Jinyu Xu, PhD - Nationwide Children's Hospital;
Peter White, PhD - Nationwide Children's Hospital;
Sarah Keim, PhD - Nationwide Children's Hospital;
Christopher Bartlett, PhD, MHA - Nationwide Children's Hospital;
Leveraging Multi-Institutional Auxiliary Outcomes to Improve Prediction of High-Missingness Outcomes in Data-Limited Hospitals
Presentation Type: Podium Abstract
Presentation Time: 04:30 PM - 04:42 PM
Primary Track: Data Science/Artificial Intelligence
Rural hospitals often experience high rates of missing data, limiting the effectiveness of AI models. This study introduces AuxLike, a likelihood-based method that leverages auxiliary outcomes from multi-institutional datasets to improve prediction of unobserved outcomes. Simulation results show AuxLike outperforms complete case and doubly robust approaches, particularly when urban hospitals contribute large samples, even with added noise. These findings highlight AuxLike’s potential to make AI models more practical for data-limited rural hospitals.
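The abstract does not spell out AuxLike's likelihood model; as a generic illustration of how an auxiliary outcome observed everywhere can inform a primary outcome missing at one site, the Gaussian sketch below fits a joint likelihood on the fully observed (urban) data and predicts the missing outcome via the conditional mean. All distributions and sample sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Urban site: both the primary and the auxiliary outcome are observed.
n_urban, n_rural, rho = 5000, 200, 0.8
cov = np.array([[1.0, rho], [rho, 1.0]])
urban = rng.multivariate_normal([0.0, 0.0], cov, size=n_urban)       # cols: primary, aux
rural_full = rng.multivariate_normal([0.0, 0.0], cov, size=n_rural)  # truth, for checking
rural_aux = rural_full[:, 1]        # at the rural site only the auxiliary is observed

# MLE of the joint Gaussian from the urban data (where the primary is observed)...
mu = urban.mean(axis=0)
S = np.cov(urban.T)

# ...then predict the missing primary outcome from the auxiliary via the
# conditional mean E[primary | aux] = mu_p + (S_pa / S_aa) * (aux - mu_a).
pred = mu[0] + (S[0, 1] / S[1, 1]) * (rural_aux - mu[1])
```

The larger the urban sample and the stronger the primary-auxiliary association, the more this conditional prediction beats ignoring the auxiliary outcome, which mirrors the abstract's finding that large urban contributions help even with added noise.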
Speaker(s):
Jaeyoung Park, PhD
University of Central Florida
Author(s):
Jaeyoung Park, PhD - University of Central Florida;
Antonio Castellanos, PhD - Hebrew University of Jerusalem;
Johnathan Sheele, MD, MHS, MPH - Mayo Clinic Jacksonville;