Subject Area

Biostatistics, Statistics

Abstract

Electronic Health Records (EHR) contain a wealth of structured and unstructured patient data that can be leveraged for computable phenotyping, the process of algorithmically identifying patient cohorts with specific diseases or conditions. Traditional rule-based phenotyping approaches, while interpretable, often struggle with scalability, portability across institutions, and effective use of unstructured clinical narratives. Recent advances in large language models (LLMs) present new opportunities for synthesizing complex free-text information into concise, clinically meaningful representations. However, integrating LLMs into phenotyping workflows requires careful design to maintain transparency, interpretability, and measurable uncertainty—features essential for clinical adoption and downstream applications such as decision support.

We developed an end-to-end, multimodal phenotyping pipeline that integrates structured EHR data with LLM-derived insights from unstructured clinical notes to improve disease classification. Using diabetes phenotyping as a proof-of-concept, the framework begins with a logistic-LASSO model trained on structured EHR features to generate patient-level predicted probabilities. Initially, augmentation targeted cases with intermediate probabilities—where uncertainty was highest and structured data alone was insufficient for accurate classification—by prompting an LLM to classify disease status from retrieved clinical notes. LLM-derived classifications were added to the structured predictor set as a three-level categorical variable indicating whether the patient was (1) not flagged for LLM augmentation, (2) LLM-classified as disease-absent, or (3) LLM-classified as disease-present. Compared with both a traditional rule-based phenotype and the structured-only logistic-LASSO, this probability-thresholding approach improved all measured performance metrics, demonstrating the added value of targeted unstructured data insights. Nonetheless, reliance on manually defined thresholds limited generalizability.
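
For concreteness, a minimal Python sketch of this thresholding step is given below. It assumes scikit-learn, a caller-supplied run_llm_classifier function, and an illustrative 0.3-0.7 probability band; none of these names or settings come from the dissertation itself.

    # Illustrative sketch only: the probability band, model settings, and the
    # run_llm_classifier callable are hypothetical, not the actual pipeline.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def llm_augmentation_feature(X_structured, y, notes, run_llm_classifier,
                                 low=0.3, high=0.7):
        """Build the three-level categorical predictor: 0 = not flagged,
        1 = LLM-classified disease-absent, 2 = LLM-classified disease-present."""
        # Logistic-LASSO on structured EHR features gives patient-level probabilities.
        lasso = LogisticRegression(penalty="l1", solver="liblinear")
        lasso.fit(X_structured, y)
        prob = lasso.predict_proba(X_structured)[:, 1]

        # Flag only the intermediate-probability cases, where uncertainty is highest.
        flagged = (prob >= low) & (prob <= high)

        # Prompt the LLM with retrieved clinical notes for flagged patients only.
        llm_feature = np.zeros(len(y), dtype=int)           # 0 = not flagged
        for i in np.where(flagged)[0]:
            disease_present = run_llm_classifier(notes[i])  # caller-supplied LLM call
            llm_feature[i] = 2 if disease_present else 1

        # The returned feature is appended to the structured predictor set and the
        # logistic-LASSO is refit on the augmented design matrix.
        return llm_feature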

To address these limitations, we advanced to an ensemble-guided LLM-augmentation strategy. Here, a diverse set of base learners trained on structured data flagged cases for augmentation based on disagreement, eliminating subjective thresholds and offering an objective, adaptable selection criterion. This improved identification of patients most likely to benefit from LLM augmentation, and the resulting ensemble-guided, LLM-augmented logistic-LASSO outperformed the threshold-based method.
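
A minimal sketch of disagreement-based flagging follows, again under stated assumptions: the particular base learners, their hyperparameters, and the all-agree criterion are illustrative choices, and held-out rather than in-sample predictions would ordinarily be used.

    # Illustrative sketch only: the base learners, their settings, and the
    # all-agree rule are assumptions; held-out predictions would be preferable.
    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB

    def flag_by_disagreement(X_structured, y):
        """Return a boolean mask of patients whose base learners disagree."""
        base_learners = [
            LogisticRegression(max_iter=1000),
            RandomForestClassifier(n_estimators=200, random_state=0),
            GradientBoostingClassifier(random_state=0),
            GaussianNB(),
        ]
        preds = []
        for model in base_learners:
            model.fit(X_structured, y)
            preds.append(model.predict(X_structured))
        preds = np.vstack(preds)                  # shape: (n_learners, n_patients)

        # A patient is flagged for LLM augmentation when the learners do not all agree.
        return preds.min(axis=0) != preds.max(axis=0)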

We evaluated this approach on both diabetes and peripheral artery disease (PAD), two phenotypes with distinct clinical presentations and documentation patterns. Ensemble disagreement proved to be a phenotype-agnostic and effective criterion for targeted augmentation. Compared with full-cohort augmentation, this strategy prompted the LLM for only 10% of patients on average, yet achieved comparable or occasionally superior performance, delivering substantial gains in cost-efficiency, scalability, and sustainability.

Finally, we incorporated a human-in-the-loop (HIL) mechanism for targeted label correction and identification of high-quality examples for LLM self-improvement. Iterative fine-tuning with expert-reviewed cases consistently improved sensitivity, negative predictive value, and overall accuracy across development, internal validation, and external validation cohorts. Together, these findings demonstrate that targeted, uncertainty-guided LLM integration can deliver high performance while preserving portability across settings.
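
The sketch below outlines one way such a HIL loop might be organized, assuming hypothetical expert_review and fine_tune_llm hooks that stand in for site-specific adjudication and fine-tuning tooling.

    # Illustrative sketch only: expert_review and fine_tune_llm stand in for
    # site-specific adjudication and fine-tuning tooling.
    def hil_self_improvement(flagged_cases, llm_labels, expert_review, fine_tune_llm):
        """Collect expert-adjudicated cases and feed them back for LLM self-improvement."""
        corrected, exemplars = [], []
        for case, llm_label in zip(flagged_cases, llm_labels):
            expert_label = expert_review(case)              # clinician adjudication
            if expert_label != llm_label:
                corrected.append((case, expert_label))      # targeted label correction
            else:
                exemplars.append((case, expert_label))      # high-quality example
        # One round of iterative fine-tuning on the reviewed cases.
        fine_tune_llm(corrected + exemplars)
        return corrected, exemplars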

Key contributions include: (1) a transparent, interpretable, and uncertainty-aware method for integrating LLMs into phenotyping pipelines; (2) an ensemble disagreement metric as a scalable and objective patient selection strategy for augmentation; and (3) a HIL-driven self-improvement process to refine performance. Limitations include the cost of LLM inference, the site-specific nature of self-improvement gains, and the need for adaptation to new clinical domains. Overall, this framework offers a practical, clinician-friendly pathway for enhancing disease detection from EHR data—balancing innovation with interpretability and adaptability.

Degree Date

Fall 2025

Document Type

Dissertation

Degree Name

Ph.D.

Department

Statistics and Data Science

Advisor

Jing Cao

Second Advisor

Mehak Gupta

Third Advisor

Ann Marie Navar

Fourth Advisor

Eric Peterson

Fifth Advisor

Sy Han Chiou

Sixth Advisor

Chul Moon

Number of Pages

115

Format

.pdf

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial 4.0 License.
