Abstract

Grammatical triples extraction has become increasingly important for the analysis of large textual corpora. By providing insight into the sentence-level linguistic features of a corpus, extracted triples have supported interpretations of some of the most relevant problems of our time. However, the growing importance of triples extraction for analyzing large corpora has put the quality of extracted triples under new scrutiny. Triples outputs are known to contain large numbers of erroneous triples. The extraction of erroneous triples poses a risk for understanding a textual corpus because erroneous triples can be nonfactual and even analogous to misinformation. Disciplines such as the social sciences, history, and literature rely on accurate representations of events. In some cases, misrepresentations of language can be as problematic as describing a historical event that never occurred. The present research proposes a method of triples extraction that has been designed to meet the increasing need for high-accuracy triples outputs for the analysis of text. We propose a solution aimed at reducing errors related to: a) ungrammatical extractions; b) double counting; and c) the missed detection of triples. To improve the accuracy of triples extraction, we implement a series of 12 linguistic rules that leverage syntactic dependency parsing. For its case studies, this dissertation draws upon three data sets: a) Wikipedia; b) the 19th-century British Parliamentary debates, also known as Hansard; and c) half a year of online news articles (Aug. 2021 - Dec. 2021) from FOX News and NPR. In its final chapter, this dissertation offers a pedagogical piece that applies triples extraction to teach concepts related to data analysis.
Extracted triples are thus evaluated through two means: a) in Chapter 1, precision and recall are used to vet the accuracy of the present method; and b) in Chapters 2 and 3, we use human observation to show how the present method of triples extraction can give an accurate and insightful perspective into textual corpora that rivals and, in some cases, exceeds existing methods.
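
As a rough illustration of the precision and recall evaluation mentioned above, the following sketch compares a set of extracted triples against a hand-labeled gold set. It is not the dissertation's evaluation code: the function name precision_recall, the (subject, verb, object) tuple layout, and the sample triples are all assumptions made for this example.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) of extracted triples against a gold set.

    Both arguments are iterables of (subject, verb, object) tuples.
    This tuple layout is an assumption made for illustration only.
    """
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Precision: share of extracted triples that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of gold triples that were found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical sample data, not drawn from the dissertation's corpora.
gold = {("parliament", "pass", "bill"), ("members", "debate", "reform")}
extracted = {("parliament", "pass", "bill"), ("bill", "pass", "parliament")}

precision, recall = precision_recall(extracted, gold)
print(precision, recall)
```

Because the sketch stores triples in sets, repeated extractions of the same triple collapse into one entry. Measuring double counting would need a multiset such as collections.Counter instead.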

Degree Date

Spring 5-13-2023

Document Type

Dissertation

Degree Name

Ph.D.

Department

Applied Science

Advisor

Jo Guldi

Second Advisor

Corey Clark

Third Advisor

Mark Fontenot

Subject Area

Computer Science, Linguistics, Humanities

Number of Pages

253

Format

.pdf

Creative Commons License

Creative Commons Attribution-Noncommercial 4.0 License
