Toward Identifying New Risk Aversions and Subsequent Limitations and Biases When Making De-identified Structured Data Sets Openly Available in a Post-LLM World
Presentation Time: 02:45 PM - 03:00 PM
Abstract Keywords: Data Sharing, Large Language Models (LLMs), Privacy and Security
Primary Track: Foundations
Programmatic Theme: Clinical Informatics
Making clinical datasets openly available is critical to promoting the reproducibility and transparency of scientific research. Currently, few such datasets are accessible to the public. To support open science, we plan to release the structured clinical datasets from the CONCERN study. In this paper, we present our de-identification approaches for structured data, taking into account the future inclusion of de-identified narrative notes and re-identification risks in the LLM era. Through literature review and collaborative consensus sessions, our team made informed decisions regarding dataset release, weighing the pros and cons of each choice and outlining the limitations and biases introduced by the de-identification algorithm. To the best of our knowledge, this is the first study to describe the rationales behind de-identification decisions in the LLM era and to delineate the consequent problems that should be considered when using our dataset. We advocate for transparent disclosure of de-identification decisions and their associated limitations and biases for all openly available datasets.
Speaker(s):
Fangyi Chen
Department of Biomedical Informatics, Columbia University
Author(s):
Fangyi Chen - Department of Biomedical Informatics, Columbia University; Kenrick Cato, PhD, RN, CPHIMS, FAAN - University of Pennsylvania / Children's Hospital of Philadelphia; Gamze Gursoy, PhD - Columbia University; Patricia C. Dykes, PhD, MA, RN - Brigham and Women's Hospital; Graham Lowenthal, BA - Brigham and Women's Hospital; Sarah Rossetti, RN, PhD - Columbia University Department of Biomedical Informatics
Category
Paper - Student