Probability sampling has served as the gold standard in survey practice for many decades. However, as new data collection methods become available, it is possible to improve the quality and efficiency of traditional survey practice by integrating different sample sources. Web-based surveys from so-called opt-in panels are one type of nonprobability sample that has become popular in recent years. They often come with large sample sizes that can yield efficient estimates, but selection bias may compromise the generalizability of results to the broader population.
Our motivating example is a survey conducted by the National Marine Fisheries Service (NMFS), which collects data to estimate the catch of recreational anglers. Currently, the samples come from two surveys: a mail survey measuring effort (number of trips made in a given area) and an intercept survey measuring catch per unit effort (number of fish per trip, by species). The two samples are combined to provide an estimate of the total catch. However, NMFS is experimenting with alternative data collection procedures that use self-reports submitted by anglers via electronic devices, such as cell phones. The self-reports come from a nonprobability sample of anglers and may not be accurate. The objective is to improve the quality and speed of estimation, and/or to reduce cost.
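As a rough illustration of how an effort estimate and a catch-per-unit-effort estimate combine into a total, the sketch below multiplies the two. The function name and the numbers are hypothetical and chosen only for illustration; they are not NMFS figures, and the estimators studied in the dissertation are more involved.

```python
def estimate_total_catch(effort_trips: float, catch_per_trip: float) -> float:
    """Total catch = estimated number of trips x estimated fish per trip.

    Hypothetical helper for illustration only; not the dissertation's estimator.
    """
    if effort_trips < 0 or catch_per_trip < 0:
        raise ValueError("effort and catch per trip must be non-negative")
    return effort_trips * catch_per_trip


# Made-up inputs: 12,000 trips (mail survey) and 2.5 fish per trip (intercept survey).
total = estimate_total_catch(effort_trips=12_000, catch_per_trip=2.5)
print(total)  # 30000.0
```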
This dissertation consists of two pieces of research related to this problem. The first part concerns finding the sampling design that allows the current estimators to meet a desired precision. Currently, the estimators proposed by Liu et al. (2017) treat the self-reports as auxiliary data to the sample of intercepts, so the self-reports are not used directly in estimation. The estimators’ precision depends on several factors, including the reporting rate, the accuracy and representativeness of reported counts, and the size of the dockside sample. We develop the R package OptimalFisheryDesign to compare the estimation precision of the new estimators, investigate the effects of different factors, and find the corresponding optimal designs for various implementations of the pilot survey.
The second part of this dissertation investigates whether better estimators of the catch can be developed by treating the large sample of voluntary reports as actual data, rather than simply as auxiliary information to improve estimates from the dockside sample. To integrate the nonprobability sample and the probability samples, we modify and evaluate two different weighting approaches proposed by Robbins et al. (2015): joint weighting and disjoint weighting. In the joint weighting approach, the samples are representative only when combined as one sample, while in disjoint weighting each sample is weighted to be individually representative of the population, and the resulting estimates are then averaged.
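The averaging step in disjoint weighting can be sketched as follows. This is a minimal illustration under stated assumptions: each sample already carries weights that make it individually representative, and the two weighted totals are combined with a mixing factor `mix`. The function names, the mixing factor, and the data are hypothetical and are not taken from Robbins et al. (2015) or from the dissertation.

```python
def weighted_total(values, weights):
    """Weighted total of one sample: sum of weight * value."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * v for v, w in zip(values, weights))


def disjoint_combined_total(prob_y, prob_w, nonprob_y, nonprob_w, mix=0.5):
    """Average two individually weighted totals.

    `mix` is the share given to the probability sample (an assumption
    made for this sketch; how to choose it is not specified here).
    """
    if not 0.0 <= mix <= 1.0:
        raise ValueError("mix must lie in [0, 1]")
    t_prob = weighted_total(prob_y, prob_w)
    t_nonprob = weighted_total(nonprob_y, nonprob_w)
    return mix * t_prob + (1.0 - mix) * t_nonprob


# Made-up data: the probability sample gives 40, the nonprobability sample gives 30.
combined = disjoint_combined_total([3, 1], [10, 10], [2, 4, 0], [5, 5, 10])
print(combined)  # 35.0
```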
In addition to propensity score adjustment (PSA), we propose a new method called Adaptive Propensity Score Adjustment (APSA). The method serves as an indicator of whether the propensity score model correctly predicts the selection probability. It can also reduce selection bias by detecting and dropping the part of the nonprobability sample whose selection mechanism cannot be explained by the model. Both the jackknife and bootstrap methods are proposed and examined for variance estimation.
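For readers unfamiliar with propensity score adjustment, the weighting step can be sketched as below. This is a generic inverse-propensity illustration, not the APSA method: it assumes the selection propensities have already been estimated by some model, and the function name and numbers are hypothetical. Summing the weights gives an estimate of the population size, which is relevant when that size is unknown.

```python
def ipw_estimates(y, propensity):
    """Inverse-propensity-weighted total and estimated population size.

    Assumes `propensity` holds already-estimated selection probabilities
    in (0, 1]; fitting the propensity model is outside this sketch.
    """
    if len(y) != len(propensity):
        raise ValueError("y and propensity must have the same length")
    if any(p <= 0.0 or p > 1.0 for p in propensity):
        raise ValueError("propensities must lie in (0, 1]")
    weights = [1.0 / p for p in propensity]
    total_hat = sum(w * yi for w, yi in zip(weights, y))  # estimated total
    n_hat = sum(weights)                                  # estimated population size
    return total_hat, n_hat


# Made-up reported catches and estimated propensities for four anglers.
total_hat, n_hat = ipw_estimates([2, 0, 5, 1], [0.5, 0.25, 0.5, 0.1])
print(total_hat, n_hat)
```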
Department of Statistical Science
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial 4.0 License
Liu, Zhaoce, "Integrating Different Data Sources for Estimation of Total with Unknown Population Size" (2020). Statistical Science Theses and Dissertations. 20.