Comparative Ranking of Marginal Confounding Impact of Natural Language Processing-Derived Versus Structured Features in Pharmacoepidemiology
Presentation Time: 09:30 AM - 09:45 AM
Abstract Keywords: Causal Inference, Natural Language Processing, Real-World Evidence Generation
Primary Track: Foundations
Objective: To explore the ability of natural language processing (NLP) methods to identify confounder information beyond what can be identified using claims codes alone in pharmacoepidemiology.
Methods: We developed a retrospective cohort of patients with a history of peptic ulcer disease receiving high- versus low-dose proton pump inhibitors, using linked Medicare claims (2008-2017) and clinical data. Clinical notes authored in the year prior to cohort entry were processed with three NLP tools: bag-of-n-grams, MTERMS, and clustered BERT sentence embeddings. Candidate features were ranked using the Bross formula.
Results: Of the top 100 ranked features, 75% were structured (including 19 prespecified covariates) and 25% were NLP-derived (across all three tools).
Conclusions: The Bross formula is a simple way to rank the marginal confounding impact of binary features on estimated causal effects. NLP (especially n-grams) identified many features that can supplement claims data and prespecified variables with additional confounder information.
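The ranking step above can be sketched with the classic Bross (1966) bias multiplier for a binary confounder: the apparent exposure-outcome risk ratio equals the true risk ratio times a factor determined by the confounder's prevalence in each exposure arm and its association with the outcome. This is a minimal illustration, not the authors' implementation; the feature names and prevalence values are hypothetical.

```python
def bross_bias_multiplier(p_exposed: float,
                          p_unexposed: float,
                          rr_confounder_outcome: float) -> float:
    """Bross (1966) bias multiplier for a binary confounder.

    p_exposed / p_unexposed: prevalence of the confounder among the
    exposed and unexposed groups. rr_confounder_outcome: risk ratio
    relating the confounder to the outcome. The observed risk ratio
    equals the true risk ratio times this multiplier.
    """
    return ((p_exposed * (rr_confounder_outcome - 1) + 1)
            / (p_unexposed * (rr_confounder_outcome - 1) + 1))


# Hypothetical candidate features: (p_exposed, p_unexposed, RR).
candidates = {
    "feature_a": (0.40, 0.20, 2.0),
    "feature_b": (0.30, 0.28, 1.5),
}

# Rank by how far each feature's multiplier departs from 1 (no bias).
ranked = sorted(
    candidates,
    key=lambda f: abs(bross_bias_multiplier(*candidates[f]) - 1.0),
    reverse=True,
)
```

With these made-up inputs, a feature with a larger prevalence imbalance between arms and a stronger outcome association produces a multiplier farther from 1 and therefore ranks higher as a potential confounder.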
Speaker(s):
Joseph Plasek, PhD
Mass General Brigham
Category
Podium Abstract