Sen Yang


The human microbiome, comprising trillions of microorganisms, plays a pivotal role in modulating host physiology via molecular and metabolite exchanges. A major challenge in this field is the effective integration of microbiome and metabolomics data, which promises to substantially improve the precision of disease prediction. However, many datasets prioritize microbiome data and lack paired metabolome information. In addition, prevalent analytical tools struggle to merge these complex datasets, which can lead to misinterpretation and reduced prediction accuracy.

To address these challenges, the first part of this research introduces the Microbiome-based Supervised Contrastive Learning Framework (MB-SupCon). The framework integrates microbiome and metabolome data to produce microbiome embeddings, which can then improve the prediction of disease traits in datasets that contain only microbiome data. For validation, MB-SupCon was applied to 720 samples with paired 16S microbiome and metabolomics data from type 2 diabetes patients. It outperformed existing predictive methods, achieving average prediction accuracies of 84.62% for insulin resistance status, 78.98% for sex, and 80.04% for race. The resulting microbiome embeddings also clustered distinctly by covariate group in a lower-dimensional space, which aids data visualization. Applied to a large inflammatory bowel disease study, MB-SupCon showed similar advantages. These results indicate that MB-SupCon can improve microbiome-based predictive models in multi-omics disease studies.

Despite its success in integrating omics data, MB-SupCon can process only categorical covariates. To overcome this limitation and extend supervised contrastive learning to continuous covariates, a new framework named MB-SupCon-cont was introduced. It comprises two main elements: a supervised contrastive learning model and a prediction head. By using a generalized contrastive loss defined on similar and dissimilar data pairs, MB-SupCon-cont unifies self-supervised and supervised contrastive learning and applies to both categorical and continuous covariates. In tests on simulated and real datasets, the model showed superior performance in predicting diverse continuous covariates. The framework is also robust and flexible with respect to the choice of prediction head. Moreover, the embeddings learned in the representation domain vary in a distinct trend along the continuous covariates. Together, these features make MB-SupCon-cont a more adaptable and efficient method for supervised multi-omics integration.
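The idea of a contrastive loss defined over similar and dissimilar pairs can be illustrated with a minimal Python sketch. It is not the dissertation's implementation. The function name generalized_contrastive_loss, the rule that two samples count as similar when their covariate values differ by at most threshold, the use of cosine similarity, and the temperature parameter are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def generalized_contrastive_loss(z_micro, z_metab, y, threshold, temperature=0.5):
    """Cross-modal contrastive loss with microbiome embeddings as anchors.

    Hypothetical pairing rule (an assumption, not taken from the source):
    samples i and j form a similar pair when |y[i] - y[j]| <= threshold,
    so every sample is always similar to its own paired profile.
    Requires threshold >= 0.
    """
    n = len(y)
    total = 0.0
    for i in range(n):
        # Similarity of anchor i to every metabolome embedding.
        logits = [cosine(z_micro[i], z_metab[j]) / temperature for j in range(n)]
        # Numerically stable log of the softmax denominator.
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        similar = [j for j in range(n) if abs(y[i] - y[j]) <= threshold]
        # Average negative log-probability assigned to the similar partners.
        total -= sum(logits[j] - log_denom for j in similar) / len(similar)
    return total / n


if __name__ == "__main__":
    # Toy embeddings: samples 0-1 have low covariate values, samples 2-3 high.
    z = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    y = [0.1, 0.2, 0.9, 1.0]
    print(generalized_contrastive_loss(z, z, y, threshold=0.2))
```

In this sketch, setting threshold to 0 with distinct covariate values leaves each sample's own paired profile as its only similar partner, which resembles a self-supervised objective. Integer-coded class labels with threshold 0 instead treat all samples sharing a label as similar, which resembles a supervised contrastive objective.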

Moving beyond multi-omics data integration, the last part of this study investigates the link between the intratumor microbiome and cancer. Although emerging evidence points to a profound link between the cancer microbiome and tumorigenesis, current studies focus narrowly on selected bacterial species or individual cancer types. To fill this knowledge gap, a three-stage computational framework named MB-LRP was developed. It uses explainable deep learning, specifically layer-wise relevance propagation (LRP), to identify bacterial biomarkers associated with various features of cancer patients. For validation, multiple MB-LRP-identified microbial biomarkers were examined against experimental evidence from colon and stomach cancer patients, confirming the presence of previously undetected microbial biomarkers. Furthermore, the clinical relevance of these biomarkers was demonstrated through association studies with patients’ survival outcomes. Overall, MB-LRP offers a comprehensive approach for detecting microbial biomarkers for diverse cancer attributes. For the broader cancer research community, all microbial biomarkers identified by MB-LRP, spanning immune, clinical, and genomic characteristics for multiple cancer types, are accessible on the MB-LRP data portal.
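As a rough illustration of how layer-wise relevance propagation attributes a prediction to input features, the following Python sketch applies the commonly used epsilon rule to a tiny two-layer ReLU network without bias terms. It is not the MB-LRP pipeline. The network weights, the three input values standing in for taxon abundances, and the function names lrp_dense and explain are all hypothetical.

```python
def matvec(weights, x):
    """Return W @ x for a weight matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def lrp_dense(weights, inputs, pre_activations, relevance_out, eps=1e-9):
    """Redistribute relevance from a dense layer's outputs to its inputs.

    LRP epsilon rule: input j receives the share
    inputs[j] * w[k][j] / (z[k] + eps * sign(z[k])) of output k's relevance.
    """
    relevance_in = [0.0] * len(inputs)
    for k, row in enumerate(weights):
        z = pre_activations[k]
        denom = z + (eps if z >= 0 else -eps)  # stabilizer avoids division by zero
        for j, w in enumerate(row):
            relevance_in[j] += inputs[j] * w / denom * relevance_out[k]
    return relevance_in


def explain(w1, w2, x):
    """Forward pass through a 2-layer ReLU net, then propagate relevance back."""
    h_pre = matvec(w1, x)
    h = [max(0.0, v) for v in h_pre]   # ReLU hidden activations
    out = matvec(w2, h)                # linear output scores
    r_hidden = lrp_dense(w2, h, out, list(out))  # start from the output scores
    r_input = lrp_dense(w1, x, h_pre, r_hidden)
    return out, r_input


if __name__ == "__main__":
    # Hypothetical weights and abundances for three taxa (illustrative only).
    w1 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
    w2 = [[1.0, -1.0]]
    x = [1.0, 2.0, 0.0]
    score, relevance = explain(w1, w2, x)
    print("score:", score, "relevance per taxon:", relevance)
```

Without bias terms, the input relevances sum approximately to the output score, so each value can be read as that feature's share of the prediction. Features with large absolute relevance are the candidates one would inspect as potential biomarkers in this toy setting.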

Degree Date

Fall 12-16-2023

Document Type


Degree Name



Department of Statistics and Data Science


Dr. Xiaowei Zhan

Subject Area

Biostatistics, Statistics

Number of Pages




Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial 4.0 License

Available for download on Saturday, November 16, 2024