Times are displayed in (UTC-04:00) Eastern Time (US & Canada)
3/12/2025 | 10:30 AM – 12:00 PM | Frick
S21: Privacy Preservation in Health Data
Presentation Type: Podium Abstract
Session Credits: 1.5
Session Chair:
Luke Rasmussen, MS, FAMIA - Northwestern University
Safeguarding Privacy in Genome Research: A Comprehensive Framework for Authors
Presentation Time: 10:30 AM - 10:45 AM
Abstract Keywords: Data Security and Privacy, Data Sharing/Interoperability, Data Mining and Knowledge Discovery, Ethical, Legal, and Social Issues, Real-World Evidence and Policy Making
Primary Track: Data Science/Artificial Intelligence
Programmatic Theme: Implementation Science and Deployment in Informatics: Enabling Clinical and Translational Research
As genomic research continues to advance, the sharing of genomic data and research outcomes has become increasingly important for fostering collaboration and accelerating scientific discovery. However, such data sharing must be balanced with the need to protect the privacy of individuals whose genetic information is being utilized. This paper presents a bidirectional framework for evaluating privacy risks associated with data shared (both summary statistics and research datasets) in genomic research papers, particularly focusing on re-identification risks such as membership inference attacks (MIA). The framework consists of a structured workflow that begins with a questionnaire designed to capture researchers' (authors') self-reported data sharing practices and privacy protection measures. Responses are used to calculate the re-identification risk for the study and to assess compliance with the National Institutes of Health (NIH) genomic data sharing policy. Any gaps in compliance help identify potential vulnerabilities and encourage researchers to enhance their privacy measures before submitting their research for publication. The paper also demonstrates the application of this framework, using published genomic research as case study scenarios to emphasize the importance of implementing bidirectional frameworks to support trustworthy open science and genomic data sharing practices.
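To make the workflow concrete, below is a minimal, hypothetical Python sketch of how self-reported questionnaire answers could be scored against a checklist of data-sharing safeguards; the item names, weights, and scoring rule are invented for illustration and do not reproduce the authors' framework or the NIH policy text.

```python
# Hypothetical sketch: scoring self-reported data-sharing practices against a
# checklist of policy-aligned safeguards. Item names and weights are
# illustrative only, not the authors' framework or the NIH Genomic Data
# Sharing Policy.
from dataclasses import dataclass

# Each checklist item: (question key, weight added to risk if answered "no")
CHECKLIST = [
    ("controlled_access_repository", 3),   # data deposited under controlled access?
    ("consent_covers_broad_sharing", 2),   # participant consent permits sharing?
    ("summary_stats_only", 2),             # only aggregate/summary statistics released?
    ("mia_mitigation_applied", 3),         # e.g., noise addition or cohort-size thresholds
    ("small_cell_suppression", 1),         # counts below a threshold suppressed?
]

@dataclass
class RiskReport:
    score: int
    max_score: int
    gaps: list

def assess(responses: dict) -> RiskReport:
    """Accumulate risk for every safeguard the authors report NOT having."""
    score, gaps = 0, []
    for key, weight in CHECKLIST:
        if not responses.get(key, False):
            score += weight
            gaps.append(key)
    return RiskReport(score, sum(w for _, w in CHECKLIST), gaps)

if __name__ == "__main__":
    answers = {
        "controlled_access_repository": True,
        "consent_covers_broad_sharing": True,
        "summary_stats_only": False,
        "mia_mitigation_applied": False,
        "small_cell_suppression": True,
    }
    report = assess(answers)
    print(f"Re-identification risk score: {report.score}/{report.max_score}")
    print("Compliance gaps to address before submission:", report.gaps)
```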
Speaker(s):
Maryam Ghasemian, PhD
Case Western Reserve University
Author(s):
Maryam Ghasemian, PhD - Case Western Reserve University; Lynette Hammond Gerido, PhD, MPH, MBA - Case Western Reserve University School of Medicine; Erman Ayday;
Reliable Generation of Privacy-preserving Synthetic EHR Time Series via Diffusion Models
Presentation Time: 10:45 AM - 11:00 AM
Abstract Keywords: Machine Learning, Generative AI, and Predictive Modeling, Data Security and Privacy, Public Health Informatics
Primary Track: Data Science/Artificial Intelligence
Programmatic Theme: Proactive Machine Learning in Biomedical Applications: The Power of Generative AI and Reinforcement Learning
This study addresses the challenges of privacy concerns and limited access to Electronic Health Records (EHRs) by proposing a novel method for generating realistic and privacy-preserving synthetic EHR time series. Current EHR de-identification methods pose risks of privacy leakage, while public EHR datasets are insufficient for advancing medical research. To overcome these limitations, we introduce a Denoising Diffusion Probabilistic Model (DDPM) to generate diverse and realistic synthetic EHR time series data. Our method was evaluated on five datasets, including MIMIC-III/IV, eICU, and non-EHR datasets like Stocks and Energy, and compared against benchmark methods. Results show that our approach outperforms all baseline models in data fidelity and requires less training effort. Additionally, our method produces synthetic data with lower privacy risk, as evidenced by reduced discriminative accuracy. This diffusion-based method offers an efficient and reliable solution for generating synthetic EHR time series, facilitating downstream medical data analysis.
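For readers unfamiliar with diffusion models, the following is a minimal, generic PyTorch sketch of the standard DDPM training objective (forward noising plus noise prediction) on toy multivariate time series; the tiny GRU denoiser, the noise schedule, and the toy data shapes are assumptions for illustration, not the authors' architecture or evaluation setup.

```python
# Minimal, generic DDPM sketch for multivariate time series (PyTorch).
# This is NOT the authors' model; it only illustrates the standard
# forward-noising / noise-prediction training step on toy data.
import torch
import torch.nn as nn

T_STEPS = 200                                # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T_STEPS)  # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

class Denoiser(nn.Module):
    """Tiny GRU noise predictor: (noisy series, step) -> predicted noise."""
    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.step_embed = nn.Embedding(T_STEPS, hidden)
        self.rnn = nn.GRU(n_features + hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_features)

    def forward(self, x_noisy, t):
        # Broadcast the step embedding across the time axis.
        emb = self.step_embed(t)[:, None, :].expand(-1, x_noisy.size(1), -1)
        h, _ = self.rnn(torch.cat([x_noisy, emb], dim=-1))
        return self.out(h)

def training_step(model, x0, optimizer):
    """One DDPM step: noise each series at a random timestep, predict that noise."""
    b = x0.size(0)
    t = torch.randint(0, T_STEPS, (b,))
    noise = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(b, 1, 1)
    x_noisy = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    loss = nn.functional.mse_loss(model(x_noisy, t), noise)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

if __name__ == "__main__":
    # Toy stand-in for an EHR batch: 32 patients x 48 hours x 8 vitals/labs.
    x0 = torch.randn(32, 48, 8)
    model = Denoiser(n_features=8)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for epoch in range(5):
        print("loss:", training_step(model, x0, opt))
```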
Speaker(s):
Anru Zhang, Ph.D.
Duke University
Author(s):
Muhang Tian, BS - Duke University; Bernie Chen, N/A - Duke University; Allan Guo, N/A - Duke University; Shiyi Jiang; Anru Zhang, Ph.D. - Duke University;
Not Fully Synthetic: LLM-based Hybrid Approaches Towards Privacy-Preserving Clinical Note Sharing
Presentation Time: 11:00 AM - 11:15 AM
Abstract Keywords: Data Security and Privacy, Data Sharing/Interoperability, Data Quality, Clinical and Research Data Collection, Curation, Preservation, or Sharing, Natural Language Processing, Machine Learning, Generative AI, and Predictive Modeling
Primary Track: Data Science/Artificial Intelligence
Programmatic Theme: Health Data Science and Artificial Intelligence Innovation: From Single-Center to Multi-Site
The publication and sharing of clinical notes are crucial for healthcare research and innovation. However, privacy regulations such as HIPAA and GDPR pose significant challenges. While de-identification techniques aim to remove protected health information, they often fall short of achieving complete privacy protection. Similarly, the current state of synthetic clinical note generation can lack nuance and content coverage. To address these limitations, we propose an approach that combines de-identification, filtration, and synthetic clinical note generation. Variations of this approach currently retain 36%-61% of the original note's content and fill the remaining gaps using an LLM, ensuring high information coverage. We also evaluated the de-identification performance of the hybrid notes, demonstrating that they surpass or at least match the standalone de-identification methods. Our results show that hybrid notes can maintain patient privacy while preserving the richness of clinical data. This approach offers a promising solution for safe and effective data sharing, encouraging further research.
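As a rough illustration of the hybrid idea (retain part of the original note, redact the rest, then have a language model fill the gaps), here is a small Python sketch; the regex patterns and the llm_fill() stub are placeholders for illustration only and are not the authors' de-identification or generation components.

```python
# Illustrative "hybrid note" pipeline sketch: redact likely PHI spans, keep the
# retained text verbatim, and have a language model rewrite the redacted gaps
# with surrogate content. Patterns and the llm_fill() stub are placeholders.
import re

PHI_PATTERNS = [
    (r"\b\d{3}-\d{2}-\d{4}\b", "[SSN]"),            # social security numbers
    (r"\b\d{1,2}/\d{1,2}/\d{2,4}\b", "[DATE]"),     # simple date formats
    (r"\b(Mr\.|Mrs\.|Ms\.|Dr\.)\s+[A-Z][a-z]+\b", "[NAME]"),
    (r"\b\d{10}\b", "[PHONE]"),
]

def redact(note: str) -> str:
    """Stage 1: rule-based redaction of high-confidence PHI spans."""
    for pattern, tag in PHI_PATTERNS:
        note = re.sub(pattern, tag, note)
    return note

def llm_fill(redacted_note: str) -> str:
    """Stage 2 (stub): a real system would prompt an LLM to replace [TAG] gaps
    with realistic surrogates while leaving retained clinical content intact."""
    surrogates = {"[NAME]": "Mr. Doe", "[DATE]": "01/01/2000",
                  "[SSN]": "000-00-0000", "[PHONE]": "5555555555"}
    for tag, surrogate in surrogates.items():
        redacted_note = redacted_note.replace(tag, surrogate)
    return redacted_note

if __name__ == "__main__":
    note = "Dr. Smith saw the patient on 3/4/2021; callback 7135550123."
    print(llm_fill(redact(note)))
```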
Speaker(s):
Yao-Shun Chuang, MS
UT Health Science Center at Houston
Author(s):
Atiquer Rahman Sarkar, MS - University of Manitoba; Yao-Shun Chuang, MS - UT Health Science Center at Houston; Xiaoqian Jiang, PhD - University of Texas Health Science Center at Houston; Noman Mohammed, PhD - University of Manitoba;
Exploring Privacy Preserving Record Linkage (PPRL): Case Studies from the National Center for Health Statistics (NCHS)
Presentation Time: 11:15 AM - 11:30 AM
Abstract Keywords: Data Security and Privacy, Data Integration, Data Sharing/Interoperability
Primary Track: Data Science/Artificial Intelligence
Programmatic Theme: Health Data Science and Artificial Intelligence Innovation: From Single-Center to Multi-Site
NCHS is exploring the potential for using PPRL to conduct linkages with new sources of data to expand public health surveillance capacity. Evaluations are being conducted by comparing PPRL results with those obtained from standard linkage methods. Results from these evaluations will be disseminated widely to inform efforts to integrate PPRL techniques into linkage activities.
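For context, one widely used PPRL building block (not necessarily the method NCHS is evaluating) encodes character bigrams of quasi-identifiers into salted Bloom filters at each data holder and compares the filters with a Dice coefficient, so raw names and birth dates are never exchanged. A minimal Python sketch with illustrative parameters follows.

```python
# Common PPRL building block (illustrative only): hash character bigrams of
# quasi-identifiers into salted Bloom filters, then compare filters with a
# Dice coefficient so the raw identifiers never leave each data holder.
import hashlib

BLOOM_BITS = 256
NUM_HASHES = 4

def bigrams(value: str):
    v = f"_{value.lower().strip()}_"
    return {v[i:i + 2] for i in range(len(v) - 1)}

def bloom_encode(fields, secret_salt: str) -> set:
    """Return the set of bit positions set in the record's Bloom filter."""
    bits = set()
    for field in fields:
        for gram in bigrams(field):
            for k in range(NUM_HASHES):
                digest = hashlib.sha256(f"{secret_salt}|{k}|{gram}".encode()).hexdigest()
                bits.add(int(digest, 16) % BLOOM_BITS)
    return bits

def dice(a: set, b: set) -> float:
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0

if __name__ == "__main__":
    salt = "shared-secret-agreed-by-both-parties"   # placeholder value
    rec_a = bloom_encode(["john", "smith", "1970-01-01"], salt)
    rec_b = bloom_encode(["jon", "smith", "1970-01-01"], salt)  # minor name variant
    print(f"Dice similarity: {dice(rec_a, rec_b):.2f}")  # high score -> likely the same person
```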
Speaker(s):
Cordell Golden, MPS
National Center for Health Statistics / Centers for Disease Control and Prevention
Author(s):
Christine Cox, MS - National Center for Health Statistics / Centers for Disease Control and Prevention; Cindy Zhang, MPH - National Center for Health Statistics / Centers for Disease Control and Prevention;
DIRI: Adversarial Patient Reidentification with Large Language Models for Evaluating Clinical Text Anonymization
Presentation Time: 11:30 AM - 11:45 AM
Abstract Keywords: Data Security and Privacy, Natural Language Processing, Ethical, Legal, and Social Issues, Clinical and Research Data Collection, Curation, Preservation, or Sharing, Secondary Use of EHR Data
Primary Track: Data Science/Artificial Intelligence
Programmatic Theme: Harnessing the Power of Large Language Models in Health Data Science
Sharing protected health information (PHI) is critical for furthering biomedical research. Before data can be distributed, practitioners often perform deidentification to remove any PHI contained in the text. Contemporary deidentification methods are evaluated on highly saturated datasets (tools achieve near-perfect accuracy) that may not reflect the full variability or complexity of real-world clinical text, and annotating such datasets is resource intensive, which is a barrier to real-world applications. To address this gap, we developed an adversarial approach that uses a large language model (LLM) to re-identify the patient corresponding to a redacted clinical note, and evaluated the performance with a novel De-Identification/Re-Identification (DIRI) Score. We demonstrate our method on medical notes from Weill Cornell Medicine anonymized with three deidentification tools: rule-based Philter and two deep-learning-based models, BiLSTM-CRF and ClinicalBERT. Although ClinicalBERT was the most effective, masking all identified PII, our tool still reidentified 9% of clinical notes. Our study highlights significant weaknesses in current deidentification technologies while providing a tool for iterative development and improvement.
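Below is a hedged Python sketch of an adversarial re-identification evaluation loop of the general kind described: for each redacted note, an attacker model picks the matching patient from a candidate pool, and the fraction of correct picks is reported. The guess() stub stands in for an LLM call; it is not the authors' DIRI implementation, and the toy data are invented.

```python
# Sketch of an adversarial re-identification evaluation loop. The guess() stub
# stands in for an LLM-based attacker; the data below are toy placeholders.
import random

def guess(redacted_note: str, candidates: list) -> int:
    """Stub attacker: a real attack would prompt an LLM with the redacted note
    and the candidate profiles and parse out its chosen index."""
    return random.randrange(len(candidates))

def reidentification_rate(notes, profiles) -> float:
    """notes[i] is the redacted note for the patient described by profiles[i]."""
    hits = 0
    for true_idx, note in enumerate(notes):
        if guess(note, profiles) == true_idx:
            hits += 1
    return hits / len(notes)

if __name__ == "__main__":
    notes = [f"[NAME] admitted with condition {i}" for i in range(100)]
    profiles = [f"patient profile {i}" for i in range(100)]
    rate = reidentification_rate(notes, profiles)
    print(f"Re-identified {rate:.0%} of redacted notes")  # lower is better for the de-identifier
```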
Speaker(s):
John Morris, M.Sc.
Cornell University
Author(s):
John Morris, M.Sc. - Cornell University; Thomas Campion, PhD - Weill Cornell Medicine; Sri Laasya Nutheti, M.Sc. - Cornell University; Yifan Peng, PhD - Weill Cornell Medicine, Dept of Population Health Sciences, Div of Health Informatics; Akhil Raj, M.Sc. - Cornell University; Ramin Zabih, PhD - Cornell Tech; Curtis Cole, MD - Cornell University;
Certified Large-Scale De-Identification of DICOM Medical Imaging and Clinical Notes Using HPC Environment for the Information Commons Research Platform
Presentation Time: 11:45 AM - 12:00 PM
Abstract Keywords: Data Commons, Medical Imaging, Real-World Evidence and Policy Making, Natural Language Processing, Data/System Integration, Standardization and Interoperability, Data Sharing/Interoperability, Implementation Science and Deployment, Data-Driven Research and Discovery
Primary Track: Clinical Research Informatics
Programmatic Theme: Real-World Evidence in Informatics: Bridging the Gap between Research and Practice
The UCSF clinical notes and images provide a wealth of research knowledge but are challenging to utilize due to HIPAA (Health Insurance Portability and Accountability Act) restrictions on identified data. At our institution, we have a committed team of informatics experts, data scientists, and software engineers dedicated to increasing data access and security. We consistently develop, expand, and maintain extensive de-identified clinical imaging and unstructured clinical notes data assets for the research community. The image and clinical notes de-identification pipelines are computationally demanding and require execution on an HPC (High Performance Computing) cluster to ensure reasonable throughput. After an extensive series of trials and development cycles, we have created a certified de-identification pipeline for notes and images, implemented on HPC for accurate and quick turnaround. Our work to adapt the de-identification algorithm for an HPC system has significantly reduced the processing time for millions of clinical records from months to weeks. This has enabled us to perform routine data refreshes, supplying our research community with newly de-identified data on a regular basis. Our success has enabled us to collaborate with other major institutions, such as Johns Hopkins, UC San Diego, the University of Michigan, and UC Davis, to help them establish similar pipelines.
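As an illustration of how image de-identification work might be sharded across an HPC job array, here is a small Python sketch that blanks a few direct-identifier DICOM tags with pydicom and partitions files across SLURM array tasks; the tag list, the use of pydicom, and the directory layout are assumptions and this is not UCSF's certified pipeline.

```python
# Illustrative sketch: shard DICOM de-identification across a SLURM job array.
# The tag list, the use of pydicom, and the paths are assumptions for
# illustration; this is not the certified UCSF pipeline.
import os
from pathlib import Path

import pydicom  # assumed available on the cluster

PHI_TAGS = ["PatientName", "PatientBirthDate", "PatientID",
            "OtherPatientIDs", "InstitutionName", "ReferringPhysicianName"]

def deidentify_file(src: Path, dst: Path) -> None:
    ds = pydicom.dcmread(str(src))
    for tag in PHI_TAGS:
        if tag in ds:
            setattr(ds, tag, "")          # blank direct identifiers
    ds.remove_private_tags()              # drop vendor-specific private elements
    dst.parent.mkdir(parents=True, exist_ok=True)
    ds.save_as(str(dst))

def main(in_dir: str, out_dir: str, n_shards: int) -> None:
    shard = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))  # which chunk this task owns
    files = sorted(Path(in_dir).rglob("*.dcm"))
    for i, src in enumerate(files):
        if i % n_shards == shard:         # simple round-robin partitioning
            deidentify_file(src, Path(out_dir) / src.name)

if __name__ == "__main__":
    main("raw_dicom", "deid_dicom",
         n_shards=int(os.environ.get("SLURM_ARRAY_TASK_COUNT", "1")))
```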
Speaker(s):
Lakshmi Radhakrishnan, M.S
University of California San Francisco
Author(s):
Marram Beck Olson, BS - University of California San Francisco; Lakshmi Radhakrishnan, M.S - University of California San Francisco;