SMU Data Science Review

Reading PDFs Using Adversarially Trained Convolutional Neural Network Based Optical Character Recognition

Michael B. Brewer, Southern Methodist University
Michael Catalano, Southern Methodist University
Yat Leung, Southern Methodist University
David Stroud, Troy University

Abstract

Companies have struggled for years to digitize documents and make use of the data they contain. Optical Character Recognition (OCR) technology has flooded the market, but companies still face challenges productionizing these solutions at scale. Although these technologies can identify and recognize the text on a page, they fail to classify that data into the appropriate datatype when OCR serves as the data mining step of an automated system. This paper presents a novel framework for identifying datapoints on check stub images: generative adversarial networks (GANs) create stains that are superimposed onto images, and the stained images are used to train a convolutional neural network (CNN). For this project, the MNIST dataset is used as a proxy for validating the effectiveness of the approach. A baseline CNN recognizing characters from unperturbed images achieves 97.38% accuracy. Once perturbations are introduced, the accuracy of the baseline CNN dips to 94.7%. The results from training on the adversarially perturbed data are favorable: accuracy reaches 97.3%, an improvement of roughly 2.6 percentage points in correctly identifying characters in perturbed images.
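To make the stain-superimposition step concrete, the following is a minimal sketch and not the authors' implementation. It assumes 28x28 grayscale images with pixel values in [0, 1] (the MNIST shape), and it replaces the GAN generator with a hypothetical stand-in, make_blob_stain, that draws a soft circular blotch. The additive blend and the strength parameter are likewise assumptions chosen for illustration.

```python
import math
import random

SIZE = 28  # MNIST images are 28x28 grayscale


def make_blob_stain(size=SIZE, seed=0):
    """Hypothetical stand-in for a GAN-generated stain.

    Returns a size x size grid of intensities in [0, 1] forming a soft
    circular blotch. A real pipeline would sample this from a trained
    generator instead.
    """
    rng = random.Random(seed)
    cx = rng.uniform(0, size - 1)      # blotch centre, inside the image
    cy = rng.uniform(0, size - 1)
    radius = rng.uniform(4.0, 10.0)    # blotch radius in pixels
    stain = []
    for y in range(size):
        row = []
        for x in range(size):
            dist = math.hypot(x - cx, y - cy)
            # Intensity falls off linearly from 1 at the centre to 0 at the edge.
            row.append(max(0.0, 1.0 - dist / radius))
        stain.append(row)
    return stain


def superimpose(image, stain, strength=0.6):
    """Add the stain to the image, pixel by pixel, clipping at 1.0.

    The additive blend is an assumption; the abstract does not say how
    the stains are combined with the images.
    """
    return [
        [min(1.0, px + strength * s) for px, s in zip(img_row, stain_row)]
        for img_row, stain_row in zip(image, stain)
    ]


# Usage: stain a blank image; a real run would pass an MNIST digit instead.
blank = [[0.0] * SIZE for _ in range(SIZE)]
stained = superimpose(blank, make_blob_stain(seed=1))
```

In an adversarial-training setup along the lines the abstract describes, images perturbed this way would be added to the CNN's training set so that the classifier sees stained characters before it meets them at inference time.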

Recommended Citation

Brewer, Michael B.; Catalano, Michael; Leung, Yat; and Stroud, David (2020) "Reading PDFs Using Adversarially Trained Convolutional Neural Network Based Optical Character Recognition," SMU Data Science Review: Vol. 3: No. 3, Article 1.
Available at: https://scholar.smu.edu/datasciencereview/vol3/iss3/1