Analysis of biological data to differentiate organisms or cells, whether across species, within a species, or within a single organism, is important to a wide variety of applications. This work considers three biological data sets at the genome, proteome, and epigenome levels: DNA sequences, glycosylation data, and DNA methylation data, respectively. We explore statistical modeling approaches for handling these modern data sets and provide a set of experiments for explanation and illustration.

First, genomic Fourier coefficients, which capture information about the harmonics of genetic sequences in terms of nucleotide pattern recurrence, are investigated as summary metrics for the medium-sized genomes of the SARS-CoV-2 virus. Clustering and classification techniques are applied to these coefficients to identify the geographic location from which the original sample was submitted. It is shown that the Fourier coefficients are potential features with which the geographic location of sequences can be classified with 79% accuracy. Furthermore, the Fourier coefficients provide distance metrics for efficient clustering.
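As an illustration of how Fourier coefficients can summarize nucleotide pattern recurrence, the sketch below computes a power spectrum from a DNA string. It assumes the common indicator-sequence encoding (one 0/1 sequence per nucleotide) and sums the squared DFT magnitudes over the four bases; the function name `fourier_power_spectrum` and the toy sequence are hypothetical, and the exact definition used in this work may differ.

```python
import numpy as np

def fourier_power_spectrum(seq: str) -> np.ndarray:
    """Sum of squared DFT magnitudes over the four nucleotide indicator sequences.

    Assumption: each base is encoded as a 0/1 indicator sequence; this is one
    common convention, not necessarily the one used in the dissertation.
    """
    seq = seq.upper()
    power = np.zeros(len(seq))
    for base in "ACGT":
        # 1.0 where the sequence has this base, 0.0 elsewhere
        indicator = np.array([1.0 if ch == base else 0.0 for ch in seq])
        coeffs = np.fft.fft(indicator)
        power += np.abs(coeffs) ** 2
    return power

# Toy sequence with an exact period of 3 (not real viral data)
toy = "ATG" * 20
spectrum = fourier_power_spectrum(toy)

# Strongest non-zero frequency; a period-3 pattern should peak at len(toy) / 3
peak = int(np.argmax(spectrum[1 : len(toy) // 2 + 1])) + 1
print(peak, len(toy) / peak)
```

A vector of such spectral values, computed at a fixed set of frequencies, is the kind of fixed-length summary that could then be passed to a clustering or classification routine.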

Second, at the protein expression level, we describe data that measure the composition of protein glycosylation in tuberculosis patients and study the use of glycosylation profiles as markers for patients with a particular disease status. Three models are discussed: a classical approach, partial least squares discriminant analysis (PLS-DA), and two new approaches developed for general data sets with compositional data. These models are examined using protein data from capillary electrophoresis (CE) quantification of glycan species in tuberculosis patients. The new models show a marginal improvement over PLS-DA: 45% accuracy versus 41% (five-fold cross-validation, with five outcome categories).
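To make the term "compositional data" concrete, the sketch below normalizes raw peak measurements to proportions and applies the centered log-ratio (CLR) transform, a standard preprocessing step for such data. This is not one of the two new approaches described above, whose details are not given here; the function names and the numbers in `peaks` are hypothetical.

```python
import numpy as np

def closure(x):
    """Rescale each row of non-negative measurements so that it sums to 1."""
    x = np.asarray(x, dtype=float)
    return x / x.sum(axis=-1, keepdims=True)

def clr(x):
    """Centered log-ratio transform; assumes all parts are strictly positive."""
    log_comp = np.log(closure(x))
    return log_comp - log_comp.mean(axis=-1, keepdims=True)

# Hypothetical glycan peak areas for two samples (illustrative numbers only).
# The second row is the first row halved, so both share the same composition.
peaks = np.array([[120.0, 30.0, 50.0],
                  [60.0, 15.0, 25.0]])

proportions = closure(peaks)
transformed = clr(peaks)
print(proportions)
print(transformed)
```

Because only relative abundances carry information, the two rows map to the same transformed vector, which is the property that motivates models built specifically for compositional data.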

Third, at the epigenetic level, we present a critique of the use of local likelihood regression smoothing to determine methylation from bisulfite sequencing. We show via simulation a relationship between the sensitivity of these windowed averaging techniques and variations in the coverage of methylated areas. A procedure for combining read densities with methylation information to resolve multi-mapped reads is described.
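The interaction between windowed averaging and coverage can be sketched with a small simulation. The code below uses a simple pooled window estimate (methylated reads divided by total reads within a window), a deliberately simplified stand-in for local likelihood regression smoothing rather than the procedure critiqued in this work. The function name, window width, coverage levels, and methylation levels are all assumptions chosen for illustration.

```python
import numpy as np

def windowed_methylation(meth_reads, coverage, half_width):
    """Pooled methylation estimate in a sliding window around each site.

    Returns NaN where the window contains no reads at all.
    """
    meth_reads = np.asarray(meth_reads, dtype=float)
    coverage = np.asarray(coverage, dtype=float)
    n = len(coverage)
    est = np.full(n, np.nan)
    for i in range(n):
        lo, hi = max(0, i - half_width), min(n, i + half_width + 1)
        total = coverage[lo:hi].sum()
        if total > 0:
            est[i] = meth_reads[lo:hi].sum() / total
    return est

# Simulated region: 200 sites, low methylation then high methylation,
# with much lower read coverage in the highly methylated half.
rng = np.random.default_rng(0)
sites = np.arange(200)
true_level = np.where(sites < 100, 0.2, 0.8)
coverage = rng.poisson(lam=np.where(sites < 100, 30, 3))
meth_reads = rng.binomial(coverage, true_level)

estimate = windowed_methylation(meth_reads, coverage, half_width=10)
error = np.abs(estimate - true_level)
print(error[:100].mean(), error[100:].mean())
```

Because the pooled estimate weights each site by its read count, windows that straddle the boundary are dominated by the high-coverage side, which is one simple way coverage variation can distort a windowed methylation estimate.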

Degree Date

Winter 12-18-2021

Document Type


Degree Name



Statistical Science


Monnie McGee

Subject Area




Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial 4.0 License