Much recent research has focused on methods for combining a probability sample with a non-probability sample to improve estimation by using information from both sources. If units exist in both samples, the information from the two samples must be linked for these units. Record linkage is a technique for linking records from two lists that refer to the same unit but lack a unique identifier across both lists. Record linkage assigns a probability to each potential pair of records from the lists so that principled matching decisions can be made. Because record linkage is a probabilistic endeavor, it introduces randomness into estimators that use the linked data. The effects of this randomness on regression involving the linked datasets have been examined (for example, Lahiri and Larsen, 2005). However, the effect of matching error has not been considered for the case of estimating the total of a population from a capture-recapture model. In this dissertation we present a general model for matching errors arising from a linkage procedure and examine their effects on the bias and variance of some estimators used in such scenarios.
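To make the idea of assigning a probability to a candidate pair concrete, the following is a minimal sketch in the style of the classical Fellegi-Sunter framework. It is not the linkage procedure developed in the dissertation. The function name, the field-level agreement probabilities, and the prior match rate are all hypothetical, and fields are assumed to agree or disagree independently given match status.

```python
from math import prod

def match_probability(agreements, m_probs, u_probs, prior):
    """Posterior probability that a candidate record pair is a true match.

    agreements: per-field booleans (True if the two records agree on the field)
    m_probs:    assumed P(field agrees | pair is a true match), per field
    u_probs:    assumed P(field agrees | pair is not a match), per field
    prior:      assumed share of candidate pairs that are true matches
    """
    # Likelihood of the observed agreement pattern under each hypothesis,
    # assuming conditional independence across fields.
    like_match = prod(m if a else 1 - m for a, m in zip(agreements, m_probs))
    like_non = prod(u if a else 1 - u for a, u in zip(agreements, u_probs))
    # Bayes' rule combines the two likelihoods with the prior match rate.
    return prior * like_match / (prior * like_match + (1 - prior) * like_non)

# Hypothetical two-field comparison (for example, trip date and landing site).
p_agree = match_probability([True, True], [0.9, 0.9], [0.1, 0.1], prior=0.5)
p_disagree = match_probability([False, False], [0.9, 0.9], [0.1, 0.1], prior=0.5)
```

A pair that agrees on both fields receives a probability near one and a pair that disagrees on both receives a probability near zero. Pairs with mixed agreement fall in between, which is where matching decisions become uncertain and matching errors can arise.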
Our work is motivated by the application of estimating fish catch in the Gulf of Mexico. The National Marine Fisheries Service (NMFS) estimates the total number of fish caught by recreational marine anglers. Currently, NMFS arrives at this estimate by using independent surveys to estimate the total effort (the number of fishing trips) and the catch per unit effort, or CPUE (the number of fish caught per species per trip), and then multiplying the two. Effort data are collected via a mail survey of potential anglers. CPUE data are collected via face-to-face intercepts of anglers completing fishing trips at randomly selected times and docks. The interviewers identify the catch totals of intercepted anglers by species.
The effort survey has a high non-response rate. It is also retrospective, which causes the entire estimation process to take more than a month, precluding in-season management. Due to these limitations, NMFS is experimenting with replacing the effort survey with electronic self-reporting. Anglers report details of their trips via an electronic device and remain eligible to be sampled in the dockside intercept.
Several estimators have been proposed to estimate total catch from these self-reports and the dockside intercept using capture-recapture methodology (Liu et al., 2017). For the estimators to be valid, the records from trips that were both self-reported and sampled in the intercept survey must be linked. The self-reported data are a non-probability sample because they are voluntarily submitted and can be considered a big data source, while the dockside intercept is a smaller probability sample. Liu et al. assumed perfect matching; however, this is difficult to achieve in practice due to device and measurement error. Currently, the effect of potential matching errors on the estimators is unknown.
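As a simple illustration of why matching errors matter in a capture-recapture setting, the sketch below uses the classical Lincoln-Petersen estimator of population size. This is a stand-in chosen for brevity, not one of the estimators of Liu et al. (2017), and all counts are hypothetical. Here the self-reports play the role of the first capture and the dockside intercept plays the role of the recapture.

```python
def lincoln_petersen(n_reported, n_intercepted, n_matched):
    """Classical Lincoln-Petersen estimate of the total number of trips.

    n_reported:    trips captured by self-reports (hypothetical count)
    n_intercepted: trips sampled in the dockside intercept (hypothetical count)
    n_matched:     intercepted trips linked to a self-report
    """
    if n_matched <= 0:
        raise ValueError("at least one matched record is required")
    return n_reported * n_intercepted / n_matched

# With perfect matching, all 50 trips present in both sources are linked.
estimate_perfect = lincoln_petersen(200, 500, 50)

# If linkage misses 10 of the 50 true matches (false non-matches),
# the match count in the denominator shrinks and the estimate is inflated.
estimate_missed = lincoln_petersen(200, 500, 40)

print(estimate_perfect, estimate_missed)  # 2000.0 2500.0
```

In this toy example, missing one fifth of the true matches raises the estimate by 25 percent. False matches would act in the opposite direction by inflating the match count.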
In this research, we develop a novel model to investigate the effect of matching errors on the bias and mean squared error of the estimators. We describe and implement a record linkage algorithm for our pilot study data following the work of Bell et al. (1994). Then we discuss two other estimators appropriate for scenarios in which either there is no undercoverage or angler reporting is completely accurate (Breidt et al., 2018). Finally, we introduce a simulation study and future research plans.
S. Lynne Stokes
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial 4.0 License
Williams, Benjamin, "Samples, Unite! Understanding the Effects of Matching Errors on Estimation of Total when Combining Data Sources" (2019). Statistical Science Theses and Dissertations. 5.