American Medical Informatics Association - Identifying a Multi-Stage Symptom Evolution Pattern of Colorectal Cancer in High-Risk Young Adults Using Patient-Centered Artificial Intelligence

Early Prediction of Colorectal Cancer Using Electronic Health Record Data

Presentation Type: Paper - Regular
Presentation Time: 10:00 AM - 10:12 AM

Abstract Keywords: Machine Learning, Real-World Evidence Generation, Data Mining
Programmatic Theme: Clinical Informatics

Colorectal cancer (CRC) is one of the most common malignancies in the United States and a major contributor to cancer-related mortality. In this study, we developed predictive models to identify CRC 6 and 12 months prior to diagnosis using electronic health record (EHR) data that captured prior diagnoses, procedures, medications, and healthcare utilization patterns. CRC cases were identified between January 1, 2010, and December 31, 2019. A non-cancer control cohort was constructed and balanced with cases through 1:1 propensity score matching on age, race, and diagnosis year. We evaluated a linear support vector machine (SVM) and XGBoost using three-fold cross-validation. XGBoost outperformed the SVM. The 6-month prediction model achieved an F1 score of 0.83 and an AUC of 0.93, while the 12-month model achieved an F1 score of 0.82 and an AUC of 0.92. The most important predictors mainly reflected healthcare utilization and diagnostic testing patterns, including laboratory tests, consultation services, telehealth visits, and other outpatient diagnostic procedures.
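As a rough illustration of the 1:1 matching step mentioned in the abstract, the sketch below pairs each case with the nearest unused control by propensity score. It is not the authors' implementation: the function name, the greedy strategy, the caliper of 0.05, and all patient ids and scores are assumptions made up for the example, and the scores are taken as given rather than estimated from age, race, and diagnosis year.

```python
def match_one_to_one(case_scores, control_scores, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    case_scores and control_scores map a patient id to a propensity score.
    Returns a list of (case_id, control_id) pairs; cases with no control
    inside the caliper stay unmatched. The caliper value is an assumption.
    """
    available = dict(control_scores)
    pairs = []
    # Process cases in a fixed order so the result is reproducible.
    for case_id in sorted(case_scores):
        if not available:
            break
        score = case_scores[case_id]
        # Closest remaining control; ties are broken by control id.
        best = min(available, key=lambda cid: (abs(available[cid] - score), cid))
        if abs(available[best] - score) <= caliper:
            pairs.append((case_id, best))
            del available[best]  # each control is used at most once
    return pairs


# Hypothetical propensity scores, for illustration only.
cases = {"case_a": 0.30, "case_b": 0.62, "case_c": 0.95}
controls = {"ctrl_1": 0.31, "ctrl_2": 0.60, "ctrl_3": 0.10, "ctrl_4": 0.33}
pairs = match_one_to_one(cases, controls)
```

Here `case_c` stays unmatched because no remaining control lies within the caliper. Greedy matching is order-dependent, so an optimal matching routine may be preferable in practice.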

Speaker(s):
Wanting Cui, Masters
University of Arizona

Author(s):
Joseph Finkelstein, MD, PhD - University of Arizona;

Who Should Be Screened? Improving Lung Cancer Risk Prediction Using Clinical and Lifestyle Variables from the All of Us Research Program

Presentation Type: Paper - Student
Presentation Time: 10:12 AM - 10:24 AM

Abstract Keywords: Machine Learning, Clinical Decision Support, Deep Learning, Population Health, Real-World Evidence Generation, Quantitative Methods, Data Mining, Evaluation
Programmatic Theme: Clinical Research Informatics

Traditional lung cancer risk models rely on questionnaire-derived smoking variables that are often incomplete in routine clinical care. We evaluated whether routinely collected electronic health record data can provide comparable risk prediction and whether combining EHR and lifestyle data improves performance. Using the All of Us Research Program, we performed the first evaluation of the PLCOm2012 model in this cohort and compared four approaches for three-year lung cancer risk prediction among ever-smokers: original PLCOm2012, cohort-refitted PLCOm2012, EHR-only models using Phecodes or foundation model representations, and hybrid models integrating lifestyle and EHR features. The original PLCOm2012 model achieved an AUROC of 0.736 and improved to 0.810 after refitting. The EHR-only Phecode model achieved comparable or higher discrimination (AUROC 0.836), while the hybrid Phecode model performed best (AUROC 0.843). At the top 20% risk threshold, the hybrid model identified 70% of lung cancer cases compared with 51% using the original PLCOm2012 model.
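The "top 20% risk threshold" figure can be read as the share of all cancer cases that fall among the 20% of people with the highest predicted risk. The sketch below computes that quantity under this reading; the function name and the toy risks and labels are hypothetical and are not data from the study.

```python
def sensitivity_at_top_fraction(risks, labels, fraction=0.20):
    """Share of positive cases found among the highest-risk `fraction`.

    risks: predicted risks, one per person.
    labels: 1 for a case, 0 otherwise, in the same order as risks.
    """
    total_cases = sum(labels)
    if total_cases == 0:
        raise ValueError("at least one positive case is required")
    n_flagged = max(1, round(len(risks) * fraction))
    # Rank people from highest to lowest predicted risk.
    ranked = sorted(range(len(risks)), key=lambda i: risks[i], reverse=True)
    captured = sum(labels[i] for i in ranked[:n_flagged])
    return captured / total_cases


# Toy data: 10 people, 4 cases, 2 of them in the top 20% by risk.
risks = [0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 0, 0, 1, 0, 0, 0, 1]
result = sensitivity_at_top_fraction(risks, labels)
```

Tied risks are kept in input order here, which is a simplification; a real evaluation would need an explicit rule for ties at the cutoff.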

Speaker(s):
Jiayuan Wang, PhD
UCLA

Author(s):
Luoting Zhuang, M.S. - Medical Informatics Home Area, University of California, Los Angeles; William Hsu, PhD - University of California, Los Angeles; Yannan Lin, MD, MPH, PhD - UCLA;

Comparing Study Designs for Rare Cancer Risk Prediction Using Electronic Health Records

Presentation Type: Podium Abstract
Presentation Time: 10:24 AM - 10:36 AM

Abstract Keywords: Machine Learning, Clinical Decision Support, Data Mining, Knowledge Representation & Information Modeling
Programmatic Theme: Clinical Informatics

Gastrointestinal cancer incidence is rising among adults younger than 50 years. Using WashU/BJC EHR data, we compared retrospective cohort and nested case-control study designs for predicting early-onset gastrointestinal cancer (EOGIC), using L1-regularized logistic regression and temporal testing in a 2023 outpatient cohort. The case-control design showed higher internal AUPRC (0.61) but poor temporal performance (0.01), whereas the cohort design had lower internal AUPRC (0.17) yet more stable temporal performance (0.07), supporting cohort-based development for deployment-focused transportability.
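AUPRC depends on outcome prevalence: a model that ranks at random scores roughly the prevalence. The toy sketch below uses average precision (a common estimator of AUPRC) to show the same two cases scoring lower once more controls are added. This may be one reason, among possibly others, that an estimate from a case-control sample does not carry over to a full cohort. All scores are invented for illustration and are not study data.

```python
def average_precision(scores, labels):
    """Average precision: mean of precision at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("at least one positive label is required")
    return precision_sum / hits


case_scores = [0.90, 0.60]  # hypothetical model scores for two cases

# Case-control style sample: as many controls as cases.
few_controls = [0.70, 0.20]
ap_case_control = average_precision(
    case_scores + few_controls, [1, 1] + [0] * len(few_controls)
)

# Cohort style sample: the same cases among many more controls.
many_controls = [0.70, 0.20, 0.80, 0.65, 0.50, 0.30, 0.10, 0.05]
ap_cohort = average_precision(
    case_scores + many_controls, [1, 1] + [0] * len(many_controls)
)
```

In this toy example the average precision falls from about 0.83 to 0.70 even though the scores assigned to the cases do not change.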

Speaker(s):
Ruochong Fan, MA
Washington University in St. Louis

Author(s):
Ruochong Fan, MA - Washington University in St. Louis; Sina Azadnajafabad, MD, MPH - Division of Public Health Sciences, Department of Surgery, Washington University in St. Louis; Jenna Reps, PhD - Janssen Research and Development; Benjamin Bowe, PhD, MPH - Division of Public Health Sciences, Department of Surgery, Washington University in St. Louis; Mackenzie Hofford, MD - Washington University; Kian-Huat Lim, MD, PhD - Department of Medicine, Washington University in St. Louis; George Hripcsak, MD - Columbia University Irving Medical Center; Yin Cao, ScD, MPH - Division of Public Health Sciences, Department of Surgery, Washington University in St. Louis; Linying Zhang, PhD - Washington University in St. Louis;

Identifying a Multi-Stage Symptom Evolution Pattern of Colorectal Cancer in High-Risk Young Adults Using Patient-Centered Artificial Intelligence

Presentation Type: Podium Abstract
Presentation Time: 10:36 AM - 10:48 AM

Abstract Keywords: Large Language Models (LLMs), Machine Learning, Patient-/Person-Generated Health Data, Artificial Intelligence, Clinical Decision Support
Programmatic Theme: Clinical Research Informatics

Early-onset colorectal cancer (CRC) diagnosis is often delayed because symptoms in young adults are misattributed to benign conditions. We developed a large language model-augmented dual machine learning pipeline to identify a multi-stage symptom evolution pattern by temporal proximity, leveraging patient-reported signs extracted from secure messages of high-risk individuals in the past 10 years (2014-2024). We further designed and simulated an artificial intelligence (AI)-enabled symptom screening system for early detection of potential CRC.

Speaker(s):
Jiyeong Kim, PhD
Stanford University

Author(s):
Stephen Ma, MD, PhD - Stanford University School of Medicine; Jonathan Chen, MD, PhD - Stanford University Hospital; Julia Adler-Milstein, PhD, FACMI - UCSF School of Medicine;

Transforming Lung Cancer Screening Pathway Through Data-Driven Patient Journey Analysis and Simulation

Presentation Type: Podium Abstract
Presentation Time: 10:48 AM - 11:00 AM

Abstract Keywords: Population Health, Causal Inference, Data Mining
Programmatic Theme: Clinical Informatics

Although 14.5 million Americans are eligible for lung cancer screening (LCS), uptake remains below 20%. We developed a probabilistic Markov microsimulation grounded in EHR-derived patient journey archetypes to evaluate referral strategies and their downstream outcomes. Clustering revealed three distinct care pathways with heterogeneous screening benefits. Simulations demonstrated that increased referral rates reduce mortality across all clusters, with archetype-specific variation, supporting data-driven, subtype-aware capacity planning in learning health systems.
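To make the idea of a Markov microsimulation concrete, here is a deliberately minimal sketch that is far simpler than the model described in the abstract. It has no archetypes or clusters, only two living states and death. The state names, the per-cycle death probabilities, the referral rates, and the cohort size are all hypothetical values chosen for illustration.

```python
import random


def simulate_deaths(referral_rate, n_people=2000, n_cycles=10, seed=0):
    """Count deaths in a toy screening microsimulation.

    Each person starts unscreened. In every cycle an unscreened person is
    referred and screened with probability `referral_rate`, then may die
    with a state-specific probability (hypothetical values below).
    """
    rng = random.Random(seed)
    p_death = {"unscreened": 0.05, "screened": 0.03}  # assumed, not from the study
    deaths = 0
    for _ in range(n_people):
        # Pre-draw every random number so that runs sharing a seed use the
        # same draws for the same person, whatever path that person takes.
        draws = [(rng.random(), rng.random()) for _ in range(n_cycles)]
        state = "unscreened"
        for u_refer, u_death in draws:
            if state == "unscreened" and u_refer < referral_rate:
                state = "screened"
            if u_death < p_death[state]:
                deaths += 1
                break
    return deaths


deaths_low_referral = simulate_deaths(referral_rate=0.10)
deaths_high_referral = simulate_deaths(referral_rate=0.40)
```

Because both runs share the same random draws, the higher referral rate can never produce more deaths in this toy model. That property follows from the assumed probabilities, not from any real-world evidence.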

Speaker(s):
Yiye Zhang, PhD
Weill Cornell Medicine

Author(s):
Mahsa Zahery, PhD - NewYork-Presbyterian Hospital; Neil Kavthekar, MS - NewYork-Presbyterian Hospital; Ziqi Gao, MS - NewYork-Presbyterian Hospital; Vanesa Kovac, MS - NewYork-Presbyterian Hospital; Brandon Christophe, MS - NewYork-Presbyterian Hospital; Ernie Pascal, MS - NewYork-Presbyterian Hospital; Christine Garcia, MD - Weill Cornell Medicine; Hao Dai, PhD - Indiana University School of Medicine; Jiang Bian, PhD - Indiana University/Regenstrief Institute; Yiye Zhang, PhD - Weill Cornell Medicine;

External Validation and Calibration Analysis of a Deep Learning Algorithm for Mammography-based Breast Cancer Risk Prediction

Presentation Type: Podium Abstract
Presentation Time: 11:00 AM - 11:12 AM

Abstract Keywords: Evaluation, Artificial Intelligence, Clinical Decision Support, Deep Learning, Health Equity, Population Health, Machine Learning, Public Health
Working Group: Genomics and Translational Bioinformatics Working Group
Programmatic Theme: Clinical Research Informatics

The Mirai open-source deep learning algorithm has the potential to enable personalized, risk-stratified breast cancer screening. We conducted a large-scale external validation study to assess Mirai's discrimination and calibration, evaluating whether its risk scores can reliably inform eligibility for risk-reduction interventions and supplemental screening. Our findings suggest that Mirai's predictions are partly driven by cancer detection rather than purely prospective risk prediction, indicating that population-specific recalibration may be needed before clinical adoption.
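One simple calibration check is the observed-to-expected (O/E) ratio: observed events divided by the sum of predicted risks. The sketch below is a generic illustration and not the analysis used in the study. The function name and the toy risks and outcomes are hypothetical.

```python
def observed_expected_ratio(predicted_risks, outcomes):
    """Observed events divided by the sum of predicted risks.

    A ratio near 1 suggests good calibration overall; below 1 means the
    model overpredicts risk, above 1 means it underpredicts.
    """
    expected = sum(predicted_risks)
    if expected <= 0:
        raise ValueError("sum of predicted risks must be positive")
    return sum(outcomes) / expected


# Toy data: 20 people with about 4 expected events but only 2 observed.
predicted = [0.1] * 10 + [0.3] * 10
observed = [0] * 9 + [1] + [0] * 9 + [1]
ratio = observed_expected_ratio(predicted, observed)
```

A ratio of about 0.5, as in this toy example, would indicate overprediction. An overall ratio can hide miscalibration within risk groups, so it is usually reported alongside group-level comparisons.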

Speaker(s):
Ojas Ankurbhai Ramwala, PhD
University of Wisconsin–Madison

Author(s):
Daniel Hippe, MS - Fred Hutchinson Cancer Center; Christoph Lee, MD, MS, MBA - University of Wisconsin-Madison; Kathryn Lowry, MD - University of Washington;

S28: New Frontiers in Cancer Detection (Oral Presentations)
