Subject Area

Biostatistics, Statistics


This dissertation investigates two topics: (1) A Bayesian Semi-supervised Approach to Keyphrase Extraction with Only Positive and Unlabeled Data, and (2) Jackknife Empirical Likelihood Confidence Intervals for Assessing Heterogeneity in Meta-analysis of Rare Binary Events.

In the big data era, people have access to a huge amount of information. However, this abundance also poses great challenges. One major challenge is how to extract useful yet succinct information in an automated fashion. Keyphrase extraction methods address this challenge by summarizing an article with a list of keyphrases. Many existing keyphrase extraction methods focus on the unsupervised setting, in which all keyphrases are assumed unknown. In reality, a (small) subset of the keyphrases may be available for an article. To utilize such information, we propose a probability model based on a semi-supervised setup. Our method incorporates the graph-based information of an article into a Bayesian framework, so that our model facilitates statistical inference, which is often absent from existing methods. To overcome the difficulty arising from high-dimensional posterior sampling, we develop two Markov chain Monte Carlo algorithms based on Gibbs samplers and compare their performance using benchmark data. We further propose a false discovery rate (FDR) based approach for selecting the number of keyphrases, whereas existing methods use ad hoc threshold values. Our numerical results show that the proposed method compares favorably with state-of-the-art methods for keyphrase extraction.
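The abstract does not spell out the FDR rule, so the following is only a minimal sketch of one common Bayesian FDR selection rule, not necessarily the one used in the dissertation. It assumes that each candidate phrase already has a posterior probability of being a keyphrase (for example, estimated from Gibbs sampler output). The function name `select_by_bayesian_fdr` and the level `alpha` are hypothetical.

```python
from typing import List, Sequence


def select_by_bayesian_fdr(post_probs: Sequence[float], alpha: float = 0.1) -> List[int]:
    """Return indices of candidates kept at an estimated Bayesian FDR <= alpha.

    Assumption: post_probs[i] is the posterior probability that candidate i
    is a keyphrase. The estimated FDR of the top-k list is the average of
    (1 - post_probs) over those k candidates.
    """
    # Rank candidates from most to least probable.
    order = sorted(range(len(post_probs)), key=lambda i: post_probs[i], reverse=True)
    cumulative_error = 0.0
    best_k = 0
    for k, idx in enumerate(order, start=1):
        cumulative_error += 1.0 - post_probs[idx]
        # Keep the largest k whose estimated FDR stays within alpha.
        if cumulative_error / k <= alpha:
            best_k = k
    return order[:best_k]


if __name__ == "__main__":
    probs = [0.99, 0.2, 0.95, 0.6, 0.9]
    print(select_by_bayesian_fdr(probs, alpha=0.1))
```

Under this rule the number of selected keyphrases is determined by the posterior probabilities and the target level, rather than by a fixed cutoff on the list length.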

In meta-analysis, the extent to which effect sizes vary across component studies is called heterogeneity. Typically, it is reflected by a variance parameter in a widely used random-effects (RE) model. In the literature, methods for constructing confidence intervals (CIs) for this parameter often assume that study-level effect sizes are normally distributed. However, this assumption may be violated in practice, especially in meta-analysis of rare binary events. We propose to use jackknife empirical likelihood (JEL), a nonparametric approach that uses jackknife pseudo-values, to construct CIs for the heterogeneity parameter, which lifts the requirement of normality in the RE model. To compute jackknife pseudo-values, we employ a moment-based estimator and consider two commonly used weighting schemes (i.e., equal and inverse-variance weights). We prove that with each scheme, the resulting log empirical likelihood ratio follows a chi-square distribution asymptotically. We further examine the performance of the proposed JEL methods and compare them with existing CIs through simulation studies and data examples that focus on rare binary events. Our numerical results suggest that the JEL method with equal weights compares favorably with the alternatives, especially when (observed) effect sizes are non-normal and the number of component studies is large. Thus, it is worth serious consideration in statistical inference.
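The JEL construction can be illustrated with a minimal sketch. It makes several assumptions that the abstract does not state: the equal-weight moment estimator is taken to be the sample variance of the observed effect sizes minus the average within-study variance, the limiting chi-square distribution is taken to have one degree of freedom, and all function names are hypothetical. It should not be read as the dissertation's implementation.

```python
import math
from typing import List, Sequence

CHI2_1_95 = 3.841458820694124  # 0.95 quantile of chi-square with 1 df (assumed df)


def tau2_equal_weights(y: Sequence[float], s2: Sequence[float]) -> float:
    """Assumed equal-weight moment estimator of the between-study variance."""
    k = len(y)
    ybar = sum(y) / k
    sample_var = sum((yi - ybar) ** 2 for yi in y) / (k - 1)
    return sample_var - sum(s2) / k


def jackknife_pseudo_values(y: Sequence[float], s2: Sequence[float]) -> List[float]:
    """Pseudo-values V_i = K * T_K - (K - 1) * T_{K-1} with study i left out."""
    y, s2 = list(y), list(s2)
    k = len(y)
    full = tau2_equal_weights(y, s2)
    pseudo = []
    for i in range(k):
        loo = tau2_equal_weights(y[:i] + y[i + 1:], s2[:i] + s2[i + 1:])
        pseudo.append(k * full - (k - 1) * loo)
    return pseudo


def jel_statistic(pseudo: Sequence[float], theta: float) -> float:
    """-2 log empirical likelihood ratio of the pseudo-values at theta."""
    z = [v - theta for v in pseudo]
    if min(z) >= 0.0 or max(z) <= 0.0:
        return math.inf  # theta lies outside the convex hull of the pseudo-values
    # The Lagrange multiplier solves sum z_i / (1 + lam * z_i) = 0; the left side
    # is strictly decreasing in lam on (-1/max z, -1/min z), so bisection applies.
    lo, hi = -1.0 / max(z), -1.0 / min(z)
    margin = 1e-10 * (hi - lo)
    lo, hi = lo + margin, hi - margin
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if sum(zi / (1.0 + mid * zi) for zi in z) > 0.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return 2.0 * sum(math.log1p(lam * zi) for zi in z)


def in_jel_interval(pseudo: Sequence[float], theta: float) -> bool:
    """True if theta belongs to the approximate 95% JEL confidence region."""
    return jel_statistic(pseudo, theta) <= CHI2_1_95
```

A confidence interval can then be traced by evaluating `in_jel_interval` over a grid of candidate values for the heterogeneity parameter and keeping those that are accepted. No normality assumption on the effect sizes enters this calculation.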

Degree Date

Fall 12-19-2020

Document Type


Degree Name



Statistical Science


Xinlei Wang

Second Advisor

Yichen Cheng

Number of Pages




Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial 4.0 License