Bayesian Multiple Instance Learning with Application to Cancer Detection Using TCR Repertoire Sequencing Data
As a branch of machine learning, multiple instance learning (MIL) learns from a collection of labeled bags, each containing a set of instances. Each instance is described by a feature vector. The learning process is weakly supervised because labels are available for bags while the labels of individual instances are ambiguous. Since its emergence, MIL has been applied to various problems, including content-based image retrieval, object tracking and detection, and computer-aided diagnosis. In biomedical research, the use of MIL has focused mainly on medical image analysis and molecule activity prediction.
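The bag-and-instance data layout described above can be sketched as follows. This is a minimal illustration, not code from the dissertation: the bag sizes, the feature dimension, the hidden instance labels, and the helper name `bag_label` are all made up for the example, and the labeling rule shown is the classical "standard" MIL assumption (a bag is positive if at least one of its instances is positive), which may differ from the assumptions used by the methods compared in this work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three bags with different numbers of instances; each instance has 4 features.
# (Sizes and feature dimension are arbitrary choices for illustration.)
bags = [rng.normal(size=(n, 4)) for n in (5, 2, 8)]

# Hidden instance labels: unobserved in practice, shown here only to
# illustrate why learning from bag labels alone is weakly supervised.
hidden_instance_labels = [
    np.array([0, 0, 1, 0, 0]),
    np.array([0, 0]),
    np.array([0, 1, 1, 0, 0, 0, 0, 0]),
]

def bag_label(instance_labels):
    """Standard MIL assumption: a bag is positive if any instance is positive."""
    return int(np.any(instance_labels == 1))

# Only these bag-level labels would be available to a MIL learner.
bag_labels = [bag_label(z) for z in hidden_instance_labels]
```

A MIL method receives `bags` and `bag_labels` only, and must infer which instances drive each bag label.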
The first part of this dissertation focuses on a comparative study of MIL methods for a novel biomedical application. To date, the majority of off-the-shelf MIL methods have been developed in the computer science domain and are therefore algorithm-driven. We review and apply a large collection of existing methods to investigate the applicability of MIL to cancer detection using T-cell receptor (TCR) sequences. This application is a potentially viable approach to large-scale cancer screening, as TCRs can be easily profiled from a subject's peripheral blood. Based on numerical results from extensive simulation and from analysis of sequencing data from The Cancer Genome Atlas for ten types of cancer, we offer guidance on which methods to select and which poorly performing methods to avoid. We further identify a pressing need for new model-based MIL methodologies that can accurately model the increasingly complex structures of real-world data and yield more explainable outcomes.
The second part of this dissertation proposes a novel Bayesian MIL method for binary classification based on hierarchical probit regression (MICProB), which adds to the suite of statistical methodologies for MIL. MICProB is composed of two nested probit regression models. The inner model predicts primary instances, which are regarded as the "important" instances that determine the bag label. The outer model predicts bag labels based on the features of the primary instances identified by the inner model. The posterior distribution of MICProB can be conveniently approximated using a Gibbs sampler, and prediction for new bags can be performed in a fully integrated Bayesian way. We evaluate the performance of MICProB against various benchmark methods and demonstrate its competitiveness in simulation and real data examples. In addition to its capability of identifying primary instances, MICProB offers advantages over existing optimization-based approaches: a transparent model structure, straightforward statistical inference on quantities related to model parameters, and favorable interpretability of covariate effects on the bag-level response.
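To make the nested structure concrete, the sketch below simulates one bag from a two-level probit model of the kind described above. It is one plausible reading of the abstract, not the dissertation's specification: the parameter names (`b0`, `b`, `beta0`, `beta`), the function names, and in particular the choice to pool primary instances by averaging their feature vectors are assumptions made for illustration. The sketch covers only the forward (generative) direction; it does not implement the Gibbs sampler or Bayesian prediction.

```python
import math
import numpy as np

def probit(t):
    """Standard normal CDF, Phi(t)."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def simulate_bag(X, b0, b, beta0, beta, rng):
    """Simulate primary-instance indicators and a label for one bag X (m x d)."""
    # Inner model: instance j is primary with probability Phi(b0 + x_j' b).
    p_primary = np.array([probit(b0 + x @ b) for x in X])
    delta = rng.random(len(X)) < p_primary
    # ASSUMPTION: pool primary instances by averaging their features;
    # fall back to the zero vector if no instance is selected.
    pooled = X[delta].mean(axis=0) if delta.any() else np.zeros(X.shape[1])
    # Outer model: the bag is positive with probability Phi(beta0 + pooled' beta).
    p_bag = probit(beta0 + pooled @ beta)
    y = int(rng.random() < p_bag)
    return delta, p_bag, y

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))  # one bag: 6 instances, 3 features (arbitrary)
delta, p_bag, y = simulate_bag(X, 0.0, np.ones(3), 0.0, np.ones(3), rng)
```

Under this reading, `delta` marks which instances act as primary, and the coefficients of the outer model describe covariate effects on the bag-level response, which is the source of the interpretability noted above.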
Department of Statistical Science
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial 4.0 License
XIONG, DANYI, "Bayesian Multiple Instance Learning with Application to Cancer Detection Using TCR Repertoire Sequencing Data" (2021). Statistical Science Theses and Dissertations. 28.