Statistical Modeling of High-throughput Sequencing Data and Spatially Resolved Transcriptomic Data
Recent studies have shown that RNA sequencing (RNA-seq) can measure mRNA extracted from formalin-fixed paraffin-embedded (FFPE) tissues with sufficient quality for genome-wide transcriptome analysis. However, little attention has been given to the normalization of FFPE RNA-seq data. In Chapters 1 and 2, we propose a new normalization method, MIXnorm, and its simplified version, SMIXnorm, for FFPE RNA-seq data. MIXnorm relies on a two-component mixture model that describes non-expressed genes with zero-inflated Poisson distributions and expressed genes with truncated normal distributions. To obtain maximum likelihood estimates, we develop a nested EM algorithm in which closed-form updates are available in each iteration. We evaluate MIXnorm and SMIXnorm through simulations and cancer studies.
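To make the two-component structure concrete, the following is a minimal generative sketch, not the MIXnorm implementation. It assumes each gene is expressed with some probability, that non-expressed genes yield counts from a zero-inflated Poisson distribution, and that expressed genes yield a value from a normal distribution truncated below at zero on an unspecified continuous scale (for example, a log-transformed count). All function names, parameter names, and parameter values are hypothetical, and the nested EM estimation step is not shown.

```python
import math
import random


def sample_zip(rng, pi_zero, lam):
    """Draw one count from a zero-inflated Poisson distribution."""
    if rng.random() < pi_zero:
        return 0  # structural zero
    # Knuth's multiplication method for a Poisson draw (suitable for small lam).
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def sample_truncated_normal(rng, mu, sigma, lower=0.0):
    """Draw from Normal(mu, sigma^2) truncated below at `lower` by rejection."""
    while True:
        x = rng.gauss(mu, sigma)
        if x >= lower:
            return x


def simulate_gene(rng, p_expressed, pi_zero, lam, mu, sigma):
    """Return (is_expressed, value) for one gene under the assumed mixture."""
    if rng.random() < p_expressed:
        # Expressed component: truncated normal on an assumed continuous scale.
        return True, sample_truncated_normal(rng, mu, sigma)
    # Non-expressed component: zero-inflated Poisson count.
    return False, float(sample_zip(rng, pi_zero, lam))


# Illustrative parameter values only; they are not taken from the dissertation.
rng = random.Random(0)
genes = [simulate_gene(rng, 0.6, 0.7, 1.0, 5.0, 2.0) for _ in range(1000)]
```

Simulating from the model in this way is a common first check for a mixture specification, since the known component labels can later be compared against whatever an estimation procedure recovers.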
Recently, spatial molecular profiling technologies have made it possible to catalog molecular profiling data together with tissue imaging data at known spatial locations. In spatial profiling, the research interest lies in investigating the association between gene expression levels and their spatial locations, i.e., identifying spatially expressed (SE) genes. However, gene expression data from spatial molecular profiling are subject to severe zero inflation. In Chapter 3, we propose a Bayesian Spatial HEAPing model (SHEAP), which aims to accurately recover the major spatial patterns underlying gene expression levels that are partially observed and subject to heaping at zero. We develop an efficient Markov chain Monte Carlo (MCMC) algorithm for Bayesian inference. We evaluate the proposed method through simulation studies and real data applications.
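As a rough illustration of what "heaping at zero" can mean for spatial data, the sketch below simulates counts on a grid of spots. It is not the SHEAP model: the left-to-right gradient, the Poisson latent counts, the single heaping probability, and every name and value in the code are assumptions made purely for illustration, and no Bayesian inference or MCMC step is included.

```python
import math
import random


def poisson_draw(rng, mean):
    """Draw a Poisson count using Knuth's method (suitable for small means)."""
    threshold = math.exp(-mean)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def simulate_heaped_spots(rng, n_rows, n_cols, heap_prob, base_rate=5.0):
    """Simulate latent and observed counts on a grid with zero heaping."""
    spots = []
    for r in range(n_rows):
        for c in range(n_cols):
            # Hypothetical spatial pattern: the mean increases across columns.
            mean = base_rate * (c + 1) / n_cols
            latent = poisson_draw(rng, mean)
            # With probability heap_prob the latent count is recorded as zero.
            observed = 0 if rng.random() < heap_prob else latent
            spots.append({"row": r, "col": c, "latent": latent, "observed": observed})
    return spots


# Illustrative settings only.
spots = simulate_heaped_spots(random.Random(1), n_rows=10, n_cols=10, heap_prob=0.5)
```

Comparing the `latent` and `observed` fields shows how excess zeros can obscure an underlying spatial gradient, which is the kind of pattern a heaping-aware model would try to recover.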
Department of Statistical Science
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial 4.0 License
Yin, Shen, "Statistical Modeling of High-throughput Sequencing Data and Spatially Resolved Transcriptomic Data" (2020). Statistical Science Theses and Dissertations. 17.