Abstract

With the rapid development of new data collection and acquisition techniques, high-dimensional data have emerged in many fields. Consequently, new variable selection methods, especially for ultra-high dimensional problems, are in demand.

The first part of this dissertation develops a new Bayesian variable selection method for differential expression analysis using raw NanoString nCounter data. The medium-throughput mRNA abundance platform NanoString nCounter has gained great popularity in the past decade, owing to its high sensitivity and technical reproducibility as well as its remarkable applicability to ubiquitous formalin-fixed paraffin-embedded (FFPE) tissue samples. Building on RCRnorm, a method developed for normalizing NanoString nCounter data, and on the Bayesian LASSO for variable selection, we propose a fully integrated Bayesian method, called RCRdiff, to detect differentially expressed (DE) genes between different groups of tissue samples (e.g., normal and cancer). Unlike existing methods, which often require normalization to be performed beforehand, RCRdiff directly handles raw read counts and jointly models the behaviors of different types of internal controls along with DE and non-DE gene patterns. Doing so avoids the efficiency loss that a sequential approach incurs by ignoring estimation uncertainty from the normalization step, and can thus offer more reliable statistical inference. We also propose clustering-based strategies for DE gene selection, which require no external dataset and no arbitrary cutoff. Extensive simulations and data examples demonstrate the attractiveness of RCRdiff.

The second part of this dissertation proposes a novel Bayesian variable selection method based on empirical likelihood for ultra-high dimensional data. Although a large body of literature has shown that variable selection techniques enable efficient and interpretable analysis of high-dimensional data, variable selection for ultra-high dimensional data, where the number of covariates p is (much) larger than the sample size n, remains a highly challenging task. Furthermore, many popular methods based on linear regression models assume Gaussian random noise. In the semi-parametric domain, under the ultra-high dimensional setting, we propose a Bayesian empirical likelihood method for variable selection, which requires no distributional assumptions but only estimating equations. Motivated by doubly penalized empirical likelihood (EL), we introduce priors to regularize both the regression parameters and the Lagrange multipliers associated with the estimating equations, to promote sparse learning. We further develop an efficient Markov chain Monte Carlo (MCMC) sampling algorithm based on the active set idea, which has proved useful in reducing computational burden in several existing studies. The proposed method not only inherits merits from both Bayesian and EL inference, but also shows superior performance in both prediction and variable selection in our numerical studies.
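
For orientation, the sketch below illustrates the textbook empirical likelihood construction that the second part builds on, not the dissertation's Bayesian method itself. Given estimating functions g_i(beta), the log EL ratio is -sum_i log(1 + lambda' g_i), where the Lagrange multiplier lambda solves sum_i g_i / (1 + lambda' g_i) = 0. The function name `log_el_ratio`, the Newton solver with step halving, the simulated data, and the least-squares estimating function g_i = x_i (y_i - x_i' beta) are all assumptions made for illustration; the priors on beta and lambda and the active-set MCMC sampler described above are not shown.

```python
import numpy as np

def log_el_ratio(g, max_iter=100, tol=1e-10):
    """Log empirical likelihood ratio for an (n x r) array of estimating
    functions g_i, using the standard Lagrange dual. Hypothetical helper,
    not taken from the dissertation."""
    g = np.asarray(g, dtype=float)
    lam = np.zeros(g.shape[1])
    for _ in range(max_iter):
        d = 1.0 + g @ lam                       # d_i = 1 + lambda' g_i
        grad = (g / d[:, None]).sum(axis=0)     # gradient of sum_i log(d_i)
        if np.linalg.norm(grad) < tol:
            break
        hess = (g / d[:, None] ** 2).T @ g      # sum_i g_i g_i' / d_i^2
        step = np.linalg.solve(hess, grad)      # Newton ascent direction
        f_old = np.log(d).sum()
        t, accepted = 1.0, False
        while t > 1e-12:                        # halve until weights stay valid
            cand = lam + t * step
            d_new = 1.0 + g @ cand
            if np.all(d_new > 0) and np.log(d_new).sum() >= f_old:
                lam, accepted = cand, True
                break
            t /= 2.0
        if not accepted:
            break
    return -np.log(1.0 + g @ lam).sum(), lam

# Simulated linear model with heavy-tailed (non-Gaussian) noise.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = X @ np.array([1.0, -0.5]) + rng.standard_t(df=5, size=n)

def estimating_functions(beta):
    # Least-squares estimating functions g_i = x_i * (y_i - x_i' beta).
    return X * (y - X @ beta)[:, None]

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
lr_ols, _ = log_el_ratio(estimating_functions(beta_ols))
lr_shift, _ = log_el_ratio(estimating_functions(beta_ols + 0.2))
print(lr_ols, lr_shift)
```

At the least-squares solution the estimating functions already sum to zero, so the multiplier is zero and the log ratio attains its maximum of 0; moving beta away from that solution makes the log ratio negative. No Gaussian assumption on the noise is used anywhere in the computation.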

Degree Date

Fall 2021

Document Type

Dissertation

Degree Name

Ph.D.

Department

Statistical Science

Advisor

Xinlei Wang

Creative Commons License

Creative Commons Attribution-Noncommercial 4.0 License