
SMU Data Science Review

Abstract

The prevalence of data has given consumers the power to make informed choices based on reviews, ratings, and descriptive statistics. However, when a local judge comes up for re-election, there is little available data to help voters make a data-driven decision on their vote. Currently, court docket data is stored in text or PDF files with very little uniformity, so scaling the collection of this information could prove complicated and tedious. There is a demand for an automated, intelligent system that can extract and organize useful information from these datasets. This paper covers the process of web scraping and applying natural language processing (NLP) to pull court and criminal information from public datasets and tie it back to judges. A Conditional Random Field (CRF) model, a Support Vector Machine (SVM), and a bidirectional Long Short-Term Memory (LSTM) model were compared on predictive accuracy to determine the best model for tagging the dockets for key entities, or tokens. This paper focuses on the initial keywords that would be beneficial for analyzing sentencing trends (i.e., the names of the judge, defense lawyer, and state representative). The bidirectional LSTM achieved the highest accuracy score, 99.4%. This paper will serve as the blueprint for further NLP analysis that will be championed by Code for Tulsa with the possible assistance of other civic groups such as Tulsa Legal Hackers.
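
To make the tagging approach concrete, below is a minimal sketch of a bidirectional LSTM token tagger of the kind the abstract compares against the CRF and SVM baselines, written in Python with the Keras API. The vocabulary size, tag set, layer sizes, and toy training data are all illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a bidirectional LSTM sequence tagger for docket entities.
# All dimensions, the tag set, and the stand-in data are assumptions for
# illustration only; they are not taken from the paper.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, TimeDistributed, Dense

VOCAB_SIZE = 5000  # assumed size of the integer-encoded docket vocabulary
MAX_LEN = 50       # assumed fixed sequence length after padding
NUM_TAGS = 4       # e.g., O / JUDGE / DEFENSE_LAWYER / STATE_REP (hypothetical tag set)

model = Sequential([
    Embedding(input_dim=VOCAB_SIZE, output_dim=64),       # learn token embeddings
    Bidirectional(LSTM(64, return_sequences=True)),       # read context in both directions
    TimeDistributed(Dense(NUM_TAGS, activation="softmax")),  # per-token tag probabilities
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Toy stand-in data: integer-encoded token sequences with per-token tag ids.
X = np.random.randint(1, VOCAB_SIZE, size=(32, MAX_LEN))
y = np.random.randint(0, NUM_TAGS, size=(32, MAX_LEN))
model.fit(X, y, epochs=1, verbose=0)
```

In practice the tokens would come from the scraped dockets and the tags from hand-labeled entity spans; per-token accuracy, as reported above, is then computed over the held-out sequences.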

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial 4.0 License.
