SMU Data Science Review

Reading Level Identification Using Natural Language Processing Techniques

William Arnost, Southern Methodist University
Ellen Lull, Southern Methodist University
Joseph Schueder, Southern Methodist University
Joseph Engler, Collins Aerospace

Abstract

This paper investigates using the Bidirectional Encoder Representations from Transformers (BERT) algorithm and lexical-syntactic features to measure readability. Readability is important in many disciplines, for tasks such as selecting passages for school children, assessing the complexity of publications, and writing documentation. Text written at an appropriate reading level helps make communication clear and effective. Readability is primarily measured using well-established statistical methods. Recent advances in Natural Language Processing (NLP) have had mixed success at incorporating higher-level text features in ways that consistently beat established metrics. This paper contributes a readability method using a modern transformer technique and compares the results to established metrics.

This paper finds that the combination of BERT and readability metrics provides a significant improvement in the estimation of readability as defined by Crossley et al. [1]. The BERT+Readability model has a root mean square error (RMSE) of 0.30, compared to an RMSE of 0.44 for a BERT-only model. This finding offers an alternative to the basic statistical measures currently offered by most word processing software.
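
For readers unfamiliar with the error measure quoted above, RMSE is the square root of the mean squared difference between predicted and reference scores, so lower values indicate closer agreement. The following is a minimal sketch of that calculation only. The function name `rmse` and the sample values are illustrative assumptions; they are not taken from the paper's code or data.

```python
import math


def rmse(predicted, actual):
    """Root mean square error between two equal-length sequences of scores."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Square each prediction error, average the squares, then take the root.
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up readability scores for illustration only; not data from the paper.
actual = [0.0, 1.0, 2.0, 3.0]
predicted = [0.5, 0.5, 2.5, 2.5]
score = rmse(predicted, actual)
print(f"RMSE: {score:.2f}")
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones, which is worth keeping in mind when comparing the two models' scores.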

Recommended Citation

Arnost, William; Lull, Ellen; Schueder, Joseph; and Engler, Joseph (2021) "Reading Level Identification Using Natural Language Processing Techniques," SMU Data Science Review: Vol. 5: No. 3, Article 7.
Available at: https://scholar.smu.edu/datasciencereview/vol5/iss3/7