Times are displayed in (UTC-07:00) Pacific Time (US & Canada)
11/11/2024 | 1:45 PM – 3:15 PM | Franciscan A
S44: Privacy and Data Lineage - Hide and Seek
Presentation Type: Oral
Session Chair:
Saptarshi Purkayastha, PhD - Indiana University, Luddy School of Informatics, Computing and Engineering
Exploring the use of Artificial Genomes for Genome-wide Association Studies through the lens of Utility and Privacy
Presentation Time: 01:45 PM - 02:00 PM
Abstract Keywords: Data Sharing, Deep Learning, Privacy and Security
Primary Track: Applications
Collaborative genome-wide association studies (GWAS) have the potential to uncover rare genetic variant-trait associations by leveraging larger datasets and diverse population samples. Despite this potential, privacy concerns and cumbersome review processes for data validation and collaborator selection hinder their broader implementation. Advances in generative models present a possible solution by generating synthetic datasets that closely resemble real genomic data, thus enhancing privacy and expediting the review process. This study assesses the capability of deep generative models to produce artificial genomic data for GWAS applications. We evaluate two state-of-the-art models on real-world datasets, identifying significant limitations in their ability to generate high-quality artificial genomes. Furthermore, we demonstrate that prevailing privacy measures, mainly based on membership inference attacks, are inadequate for providing insightful privacy evaluations. Our findings highlight the critical challenges and suggest future directions for the effective use of artificial genomes in GWAS.
Speaker(s):
Sitao Min, PhD
Rutgers University
Author(s):
Xinyue Wang, PhD - Rutgers University; Sitao Min, PhD - Rutgers University; Jaideep Vaidya, Ph.D. - Rutgers University;
Privacy-Preserving Record Linkage (PPRL): A Transformative Solution for Secure Data Linkage in Public Health
Presentation Time: 02:00 PM - 02:15 PM
Abstract Keywords: Informatics Implementation, Population Health, Privacy and Security, Real-World Evidence Generation, Personal Health Informatics
Primary Track: Applications
Programmatic Theme: Public Health Informatics
This study explores the application of Privacy-Preserving Record Linkage (PPRL) in securely linking individual immunization health records. Implementing PPRL significantly reduces processing time and improves both efficiency and data quality. PPRL technology offers a promising solution for linking deidentified datasets across various domains.
Speaker(s):
Jennifer McGehee, MA, MS
CDC
Author(s):
Janet Fath, PhD - CDC; Jennifer McGehee, MS - CDC; Jina Dcruz, MSW, PhD - US Centers for Disease Control and Prevention (CDC); Danielle Henderson; Agha Nabeel Khan, MD MPH MBA - CDC;
Interdisciplinary Platform for Bruise Image Research
Presentation Time: 02:15 PM - 02:30 PM
Abstract Keywords: Data Sharing, Deep Learning, Racial disparities, Data Transformation/ETL, Imaging Informatics
Primary Track: Foundations
Programmatic Theme: Clinical Research Informatics
Current research activities on bruise analysis and alternative light sources (ALS) are hampered by lack of data, difficulties in collecting bruise images and linked clinical data, and inability to collaborate across institutions. Access to very large datasets is needed especially when deep learning methods are applied to construct models for classifying images, but also when traditional statistical analyses are performed. This presentation describes a research platform that integrates longitudinal bruise images with clinical and measurement data. The web-based platform allows users (researchers, practitioners) to browse and compare images, access structured data, upload their own images and data, annotate images, and link to deep learning and analytic methods. The platform offers multiple levels of security and is designed with HIPAA compliance as a core functionality. The platform is currently populated with about 30,000 images along with structured data. Additional data are collected from partner institutions, including EHR data. Deep learning methods applied for bruise image classification show promising results in detecting bruises.
Speaker(s):
Janusz Wojtusiak, PhD
George Mason University
Author(s):
Janusz Wojtusiak, PhD - George Mason University; Mohammad Qodrati, MD - George Mason University; Michał Markiewicz, PhD - Jagiellonian University; Kiyarash Aminfar, MS - George Mason University; David Lattanzi, PhD - George Mason University; Katherine Scafide, RN, PhD - George Mason University;
A Data Governance Metadata Schema for Streamlining Decisions about Data Linkage
Presentation Time: 02:30 PM - 02:45 PM
Abstract Keywords: Data Sharing, Legal, Ethical, Social and Regulatory Issues, Data Standards, Interoperability and Health Information Exchange, Controlled Terminologies, Ontologies, and Vocabularies, Privacy and Security, Pediatrics, Governance of Artificial Intelligence
Primary Track: Policy
Programmatic Theme: Clinical Research Informatics
The NICHD Office of Data Science and Sharing is developing a data governance metadata schema to support responsible use of individual-level record linkage for patient-centered outcomes research with NICHD populations and testing schema implementation with two prototype tools. Our aim is to provide researchers and other stakeholders with frameworks and tools they can use to determine whether certain dataset linkages are appropriate, and if so, what rules and controls apply to the resulting linked dataset.
Speaker(s):
Valerie Cotton, BSc
NIH/NICHD
Author(s):
Toward Identifying New Risk Aversions and Subsequent Limitations and Biases When Making De-identified Structured Data Sets Openly Available in a Post-LLM world
Presentation Time: 02:45 PM - 03:00 PM
Abstract Keywords: Data Sharing, Large Language Models (LLMs), Privacy and Security
Primary Track: Foundations
Programmatic Theme: Clinical Informatics
Making clinical datasets openly available is critical to promoting the reproducibility and transparency of scientific research. Currently, few datasets are accessible to the public. To support the open science initiative, we plan to release the structured clinical datasets from the CONCERN study. In this paper, we present our de-identification approaches for structured data, considering the future inclusion of de-identified narrative notes and re-identification risks in the LLM era. Through literature review and collaborative consensus sessions, our team made informed decisions regarding dataset release, weighing the pros and cons of each choice and outlining the limitations and biases introduced by the de-identification algorithm. To the best of our knowledge, this is the first study describing the rationales of de-identification decisions in the LLM era and delineating the consequent problems that should be considered when using our dataset. We advocate for transparent disclosure of de-identification decisions and associated limitations and biases with all openly available datasets.
Speaker(s):
Fangyi Chen
Department of Biomedical Informatics, Columbia University
Author(s):
Fangyi Chen - Department of Biomedical Informatics, Columbia University; Kenrick Cato, PhD, RN, CPHIMS, FAAN - University of Pennsylvania/Children's Hospital of Philadelphia; Gamze Gursoy, PhD - Columbia University; Patricia C Dykes, PhD, MA, RN - Brigham and Women's Hospital; Graham Lowenthal, BA - Brigham and Women's Hospital; Sarah Rossetti, RN, PhD - Columbia University Department of Biomedical Informatics;