2026 Amplify Informatics Conference Gallery



5/19/2026 | 11:15 AM – 12:30 PM | Mt. Elbert B - 555 Building, 2nd Floor

TRI15: LLMs in the Clinic: From Hype to Help (Oral Presentations)


Presentation Type: Oral Presentations

Advances in Large Language Model Reasoning Enable Flexibility in Clinical Problem-Solving

Presentation Type: Paper - Regular

Presentation Time: 11:15 AM - 11:27 AM

Primary Track: Data Science/Artificial Intelligence


Large Language Models (LLMs) have achieved high accuracy on medical question-answer (QA) benchmarks, yet their capacity for flexible clinical reasoning has been debated. In this study, we asked whether advances in reasoning LLMs improve their cognitive flexibility in clinical reasoning. We assessed reasoning models from the OpenAI, Grok, Gemini, Claude, and DeepSeek families on the medicine abstraction and reasoning corpus (mARC), an adversarial medical QA benchmark which utilizes the Einstellung effect to induce inflexible over-reliance on learned heuristic patterns in contexts where they become suboptimal. We found that strong reasoning models avoided Einstellung-based traps more often than weaker reasoning models, achieving human-level performance on mARC. On questions most commonly missed by physicians, the top 5 performing models answered 55% to 70% correctly with high confidence, indicating that these models may be less susceptible than humans to Einstellung effects. Our results indicate that strong reasoning models demonstrate improved flexibility in medical reasoning, achieving performance on par with humans on mARC.

Speaker(s):
Kie Shidara, Master's of Science
University of California, San Francisco

Author(s):
Kie Shidara, Master's of Science - University of California, San Francisco; Preethi Prem, Bachelor's of Science - Carle Illinois College of Medicine; Jonathan Kim, M.D. - Stanford University; Anna Podlasek, M.D., PhD - University of Dundee; Feng Liu, PhD - Stevens Institute of Technology; Ahmed Alaa, PhD - University of California, Berkeley; Danilo Bernardo, M.D. - University of California, San Francisco;



Large Language Models and Primary Care: A Scoping Review

Presentation Type: Paper - Student

2026 Amplify 25x5 Presentation

Presentation Time: 11:27 AM - 11:39 AM

Primary Track: Data Science/Artificial Intelligence


Background: Large Language Models (LLMs) like ChatGPT offer transformative potential for primary care but carry risks regarding bias and reliability. This scoping review synthesizes current evidence on LLM applications in primary care.

Methods: Adhering to PRISMA-ScR guidelines, we searched 10 databases (including Medline, EMBASE, and IEEE Xplore) for original research on LLMs in primary care settings. We extracted data on study characteristics, performance metrics, and assessed risk of bias using PROBAST.

Results: Of 25,304 records, 28 studies met inclusion criteria. Most originated from high-income countries (US: 30%, UK: 20%), with significant underrepresentation of low-and-middle-income countries. Common applications included information extraction (25%) and predictive modeling (14%). While performance was generally high (F1-scores 0.70–0.95), particularly for structured tasks, interpretive tasks showed greater variability. Crucially, only 18% of studies had a low overall risk of bias, with 39% exhibiting high risk, often due to poor handling of participants or analysis domains.

Conclusion: LLMs demonstrate utility in administrative and clinical primary care tasks. However, the current landscape is geographically skewed and methodologically limited. Future implementation requires rigorous, standardized evaluation frameworks and a focus on equity to mitigate algorithmic bias before widespread adoption.

Speaker(s):
Julio Zhang, MD
Columbia University

Author(s):
Julio Min Fei Zhang, MD - Columbia University; Mariana Leite, MD - Faculdade Santa Marcelina; Carolina Baptista dos Santos, MD - Federal University of Fronteira Sul; Felipe Dirceu Dantas Leite Pessôa, MD - University of São Paulo School of Medicine; Wen-Jan Tuan, DHA, MPH, MS - Penn State University;



Optimizing an LLM-Based Clinical Data Querying System Using Metadata Enrichment and Task Decomposition

Presentation Type: Paper - Regular

Presentation Time: 11:39 AM - 11:51 AM

Primary Track: Data Science/Artificial Intelligence


Accessing complex clinical registries traditionally requires SQL programming expertise, limiting data accessibility for non-technical researchers. In this paper, we designed a text-to-SQL solution based on large language models (LLMs) and evaluated whether it could enable natural language querying of a real-world clinical registry under strict privacy and security constraints. Using self-hosted, open-source LLMs, we developed a multi-layered optimization framework incorporating metadata enrichment, query decomposition, hybrid retrieval, and SQL self-correction. We assessed its performance across 600 queries spanning one-, two-, and three-field complexity using execution-based validation. Accuracy improved from 88.0% to 94.5% for one-field queries and from 10.0% to 82.0% for three-field queries. Real-world testing by data scientists revealed domain-specific challenges related to coded variables, clinical ambiguity, and multi-step reasoning. We summarize key technical and operational lessons learned and discuss implications for safe, scalable deployment of LLM-assisted analytic tools in clinical registry environments.
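
As a rough illustration of what execution-based validation can look like, the sketch below runs a generated query and a reference query against the same database and compares their result sets. It is not the authors' pipeline: the `encounters` table, its columns, and the helper names `run_query` and `execution_match` are hypothetical, and an in-memory SQLite database stands in for the registry.

```python
import sqlite3

def run_query(conn, sql):
    """Return the result rows in a canonical order, or None if the SQL fails."""
    try:
        return sorted(conn.execute(sql).fetchall(), key=repr)
    except sqlite3.Error:
        return None

def execution_match(conn, predicted_sql, reference_sql):
    """A prediction counts as correct only if it executes and returns the
    same rows as the reference query (row order is ignored)."""
    predicted = run_query(conn, predicted_sql)
    reference = run_query(conn, reference_sql)
    return predicted is not None and predicted == reference

# Hypothetical toy table standing in for a clinical registry.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE encounters (id INTEGER, age INTEGER, sex TEXT)")
conn.executemany(
    "INSERT INTO encounters VALUES (?, ?, ?)",
    [(1, 70, "F"), (2, 54, "M"), (3, 81, "F")],
)

# Differently written but equivalent queries should match.
ok = execution_match(
    conn,
    "SELECT COUNT(*) FROM encounters WHERE age >= 65 AND sex = 'F'",
    "SELECT COUNT(id) FROM encounters WHERE sex = 'F' AND age > 64",
)

# A query that references a non-existent table fails to execute.
bad = execution_match(
    conn,
    "SELECT COUNT(*) FROM encounter",
    "SELECT COUNT(*) FROM encounters",
)
```

Comparing results rather than SQL text means two differently phrased queries are both accepted as long as they return the same rows, which is the usual motivation for this style of check.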

Speaker(s):
Weixin Liu, PhD Student
Vanderbilt University

Author(s):
Weixin Liu, PhD Student - Vanderbilt University; Bowen Qu, PhD - Vanderbilt University; Pratheek Mallya, MS - The American Heart Association; Jingyuan Wu, MS - The American Heart Association; Kathie Thomas, DHA, MPH - The American Heart Association; Jennifer Hall, PhD - The American Heart Association; Juan Zhao, PhD - The American Heart Association; Zhijun Yin, Ph.D. - Vanderbilt University Medical Center;



Optimizing Open‑Source LLMs for Reporting: A Practical, On‑Prem Alternative to a Validated Closed-API Baseline

Presentation Type: Podium Abstract

Presentation Time: 11:51 AM - 12:03 PM

Primary Track: Clinical Research Informatics


ClinicalTrials.gov reporting is labor-intensive and time-sensitive. We are exploring open-source large language models (LLMs) deployed on-premises as a viable, auditable alternative to a GPT-4 retrieval-augmented generation (RAG) baseline. Our approach emphasizes reproducibility, observability, and biomedical concept validation. Preliminary results on IRB protocols show high concept extraction precision and recall with minimal hallucinations, suggesting that an open LLM pipeline can approach GPT-4 performance while maintaining data privacy and reducing costs. This ongoing work will inform best practices for automating trial registry entries.

Speaker(s):
Ramya Sri Baluguri, Postdoctoral Scholar
University of California, Davis

Author(s):
Ramya Sri Baluguri, Postdoctoral Scholar - University of California, Davis; Nicholas Anderson, PhD - University of California, Davis;



Logit Fingerprinting: A Novel, Accuracy-Independent Method for Validating Large Language Model Stability in High-Stakes Clinical Applications

Presentation Type: Paper - Regular

Presentation Time: 12:03 PM - 12:15 PM

Primary Track: Data Science/Artificial Intelligence


The integration of Large Language Models (LLMs) into clinical settings requires quality assurance mechanisms capable of detecting the hidden effects of model compression and architectural instability. Conventional accuracy metrics often fail to capture the behavioral volatility introduced by quantization, distillation, and sparse architectures. We propose the "Single-Token Forced-Choice Logit Probe," a method that generates a "behavioral fingerprint" of a model by analyzing its decision-making stability on a domain-specific (MedQA) benchmark. Validated on 11 local model families, our approach achieved 100% accuracy in distinguishing full-precision models from quantized variants. Furthermore, a longitudinal audit of commercial APIs revealed a distinct "Stability Gap": distilled "Nano" models exhibited nearly double the decision instability (2.82% vs. 1.58% Flip Rate) of their standard counterparts. Forensic classification identified the underlying compression techniques (Q8 vs. FP8), while analysis suggests the inherent non-determinism stems from Sparse Mixture-of-Experts (SMoE) routing. We conclude that Flip Rate is a critical safety metric and that distilled and quantized models require rigorous stability auditing before clinical deployment.
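
The abstract does not define Flip Rate precisely, so the sketch below shows only one plausible reading: each question is asked several times, the model's choice on each run is the answer option with the highest single-token logit, and a flip is any repeat run whose choice differs from the first run. The function names and the toy logits are illustrative assumptions, not the authors' implementation or data.

```python
def forced_choice(logits):
    """Pick the answer option whose single-token logit is highest."""
    return max(logits, key=logits.get)

def flip_rate(runs_per_question):
    """Fraction of repeat runs whose forced choice differs from the first run.

    runs_per_question: one list per question, each holding one dict per run
    that maps answer options to logits. (Assumed definition; see lead-in.)
    """
    flips = 0
    comparisons = 0
    for runs in runs_per_question:
        baseline = forced_choice(runs[0])
        for later in runs[1:]:
            comparisons += 1
            if forced_choice(later) != baseline:
                flips += 1
    return flips / comparisons if comparisons else 0.0

# Made-up logits for two questions, two runs each.
toy = [
    # Stable: option A wins on both runs.
    [{"A": 2.1, "B": 0.3, "C": -1.0, "D": 0.0},
     {"A": 2.0, "B": 0.4, "C": -1.1, "D": 0.1}],
    # Unstable: a near-tie between A and B flips from B to A.
    [{"A": 0.9, "B": 1.0, "C": -0.5, "D": 0.2},
     {"A": 1.05, "B": 1.0, "C": -0.4, "D": 0.2}],
]

rate = flip_rate(toy)  # one flip out of two comparisons
```

Under this reading the metric needs no answer key, which is consistent with the abstract's description of the method as accuracy-independent.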

Speaker(s):
William Logan, B.S. in Computer Engineering
University of Kentucky

Author(s):
William Logan, B.S. in Computer Engineering - University of Kentucky; Cody Bumgardner, PhD - University of Kentucky;



Prompt injection of OpenAI custom GPTs leaks informatics secrets

Presentation Type: Paper - Student

Presentation Time: 12:15 PM - 12:27 PM

Primary Track: Data Science/Artificial Intelligence


Low-code platforms like OpenAI Custom GPTs (cGPTs) promise easy development of specialized AI assistants for complex bioinformatics and clinical tasks, allowing researchers to integrate proprietary data into intuitive chatbot interfaces. However, these commercial frameworks operate as opaque "black boxes," fundamentally clashing with open-science values and principles of reproducibility. To audit their hidden configurations, we performed "jailbreak" (prompt injection) attacks. We found that all tested cGPTs were critically vulnerable, leading to the leakage of private system instructions, full knowledge base files, and proprietary API details. This systemic failure poses severe security and privacy risks, particularly when handling sensitive patient data, clinical notes, or proprietary basic science assets. While low-code tools lower the barrier to AI adoption, their commercial nature and security flaws warrant extreme caution, forcing biomedical researchers to weigh convenience against the non-negotiable standards of data integrity and security.

Speaker(s):
Van Truong, PhD
University of Pennsylvania

Author(s):
Van Truong, PhD - University of Pennsylvania; Marylyn Ritchie, PhD - Medical University of South Carolina;




Logit Fingerprinting: A Novel, Accuracy-Independent Method for Validating Large Language Model Stability in High-Stakes Clinical Applications

Category

Informatics Summit > Paper - Regular




Headquarters:
6218 Georgia Avenue NW, Suite #1
PMB 3077
Washington, DC 20011
Phone: 301.657.1291

© 2026 American Medical Informatics Association. All Rights Reserved.