In this paper we present a long-text classification system for environmental regulations that uses background Subject Matter Expert (SME) knowledge for token selection to increase robustness and accuracy. We characterize a document as long text if it exceeds twice the typical token limit for neural networks (512). Over one thousand new environmental regulations are released each year, and our system uses Natural Language Processing (NLP) to help SMEs prioritize regulations after classification and to highlight their important aspects. We captured environmental regulation SME expertise in a business-specific keyword dictionary, which targets the tokens fed into NLP neural networks. We sought to optimize this method because it avoids the pitfalls of common approaches: inputting only the beginning tokens of a document, where formatting differences can bias a model, or segmenting the entire document, which muddles a model's ability to find important features and drastically increases computation time. We tested many NLP neural networks with different token pre-processing steps to find the optimal combination for this unique environmental regulation corpus. Our best results came from the BERT neural network with its extracted tokens lemmatized and stop words retained, giving a recall of 93.33% and an accuracy of 81.67%. We conclude that this mostly transparent method can yield highly accurate classifications and can be readily translated to any field of expertise, given enough background knowledge to build a proper keyword dictionary.
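The keyword-driven token selection described above can be illustrated with a minimal sketch. This is not the authors' implementation: the window size, the simple regex tokenizer, the single-word keyword matching, and the names `tokenize`, `select_tokens`, and `keywords` are all assumptions made for illustration. It keeps only the tokens near dictionary keywords, in document order, and truncates the result to the 512-token limit mentioned above.

```python
import re

MAX_TOKENS = 512  # token limit cited in the abstract


def tokenize(text):
    """Lowercase word tokenizer (a stand-in for a real NLP tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def select_tokens(tokens, keywords, window=5, max_tokens=MAX_TOKENS):
    """Keep tokens within `window` positions of any dictionary keyword.

    Selected tokens stay in document order, and the result is cut off
    at `max_tokens` so it fits a fixed-length model input.
    """
    keep = set()
    for i, tok in enumerate(tokens):
        if tok in keywords:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            keep.update(range(lo, hi))
    return [tokens[i] for i in sorted(keep)][:max_tokens]


# Hypothetical dictionary and text, for illustration only.
keywords = {"emissions", "discharge"}
text = (
    "Section 1. General provisions apply to all facilities. "
    "Operators must report annual emissions of sulfur dioxide to the agency. "
    "Nothing in this part affects unrelated administrative procedures."
)
selected = select_tokens(tokenize(text), keywords, window=3)
print(selected)
```

In a fuller pipeline the selected tokens would then be lemmatized (with stop words kept, per the best-performing configuration reported above) before being passed to the classifier; that step is omitted here.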
Bass, Clovis R.; Benefield, Brett; Horn, Debbie; and Morones, Rebecca
"Increasing Robustness in Long Text Classifications Using Background Corpus Knowledge for Token Selection,"
SMU Data Science Review: Vol. 2: No. 3, Article 10.
Available at: https://scholar.smu.edu/datasciencereview/vol2/iss3/10
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial 4.0 License