SMU Data Science Review
Abstract
This paper presents a novel application of Natural Language Processing techniques to classify unstructured text into toxic and non-toxic categories. In the current century, social media has created many job opportunities and, at the same time, has become a unique place for people to freely express their opinions. Among these users, however, some groups take advantage of this framework and misuse this freedom to act on a toxic mindset (e.g., insults, verbal sexual harassment, threats, obscenity). The 2017 Youth Risk Behavior Surveillance System (Centers for Disease Control and Prevention) estimated that 14.9% of high school students were electronically bullied in the 12 months prior to the survey. The primary result could be an open-source model used by app developers in support of anti-bullying efforts. Analysis of the results showed that LSTM has a 20% higher true positive rate than the well-known Naive Bayes method, which can be a game changer in the field of comment classification. These results indicate that smart use of data science can help form a healthier environment for virtual societies. Additionally, we improved our working pipeline and incorporated Amazon Web Services (AWS) as a fast, reliable, online platform on which to run our classification algorithm. Among all the algorithms evaluated, LSTM gave the most promising result, with an accuracy of more than 70%.
Recommended Citation
Zaheri, Sara; Leath, Jeff; and Stroud, David (2020) "Toxic Comment Classification," SMU Data Science Review: Vol. 3: No. 1, Article 13.
Available at: https://scholar.smu.edu/datasciencereview/vol3/iss1/13
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial 4.0 License