SMU Data Science Review


In this paper, we present a novel framework and system for identifying primary research topics within a corpus of related publications and for classifying individual publications according to these topics, together with the results of applying the framework and system to the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a corpus of published peer-reviewed and pre-peer-reviewed articles related to the coronavirus that causes COVID-19. Using machine learning techniques, such as Non-negative Matrix Factorization for Natural Language Processing and a Bayesian classifier, we developed a system that automatically extracts sparse and meaningful features from the abstracts of the articles in CORD-19, allowing for primary topic identification and the classification of articles based upon these primary topics. The system uses an adaptive topic model classifier that allows new primary topics to be identified as papers are added to CORD-19. A new primary topic is added only when a sufficient number of papers cover that topic. Using our system, we identified ten primary topics for the CORD-19 articles existing as of June 2020. The COVID-19 pandemic began in or around December 2019; therefore, the June 2020 CORD-19 dataset reflects the early research on COVID-19, as well as research on earlier coronaviruses associated with previous epidemics such as SARS and MERS. The ten identified primary topics cover the breadth of the essential research questions that need to be answered in order to understand COVID-19 and find a cure or vaccine for it. This breadth of coverage demonstrates that, beginning early in the pandemic, the research community investigated all aspects of COVID-19 and the coronavirus that causes it, providing a broad foundation for ending the pandemic.
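The pipeline described above (factor a term matrix built from abstracts into topics, then classify articles by topic) can be illustrated with a minimal sketch. This is not the authors' implementation: the four toy abstracts, the choice of two topics, the raw word counts, the Lee-Seung multiplicative updates, and the multinomial naive Bayes stand-in for "a Bayesian classifier" are all assumptions made for illustration, and every function name below is hypothetical. The adaptive step that adds new topics is not covered.

```python
import math
import random
from collections import Counter

def vectorize(text, vocab_index):
    """Turn text into a row of word counts over a fixed vocabulary."""
    row = [0.0] * len(vocab_index)
    for word, count in Counter(text.lower().split()).items():
        if word in vocab_index:
            row[vocab_index[word]] = float(count)
    return row

def matmul(a, b):
    return [[sum(x * y for x, y in zip(r, c)) for c in zip(*b)] for r in a]

def transpose(a):
    return [list(c) for c in zip(*a)]

def frob_error(v, w, h):
    wh = matmul(w, h)
    return math.sqrt(sum((x - y) ** 2
                         for rv, rw in zip(v, wh) for x, y in zip(rv, rw)))

def nmf(v, k, iters=200, seed=0, eps=1e-9):
    """Factor V ~ W H (all entries non-negative) with multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    errors = [frob_error(v, w, h)]
    for _ in range(iters):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
        errors.append(frob_error(v, w, h))
    return w, h, errors

def train_nb(v, labels, k, alpha=1.0):
    """Multinomial naive Bayes with Laplace smoothing on topic labels."""
    m = len(v[0])
    counts = [[alpha] * m for _ in range(k)]
    prior = [alpha] * k
    for row, topic in zip(v, labels):
        prior[topic] += 1
        for j, c in enumerate(row):
            counts[topic][j] += c
    log_prior = [math.log(p / sum(prior)) for p in prior]
    log_like = [[math.log(c / sum(r)) for c in r] for r in counts]
    return log_prior, log_like

def predict_nb(model, row):
    """Return the topic index with the highest posterior log-score."""
    log_prior, log_like = model
    scores = [lp + sum(c * ll for c, ll in zip(row, lls))
              for lp, lls in zip(log_prior, log_like)]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical toy "abstracts"; real input would be CORD-19 abstracts.
abstracts = [
    "vaccine antibody immune response vaccine",
    "antibody immune vaccine trial",
    "transmission outbreak epidemic spread transmission",
    "outbreak spread epidemic cases",
]
k = 2  # assumed number of topics for the toy corpus
vocab = sorted({w for a in abstracts for w in a.lower().split()})
vocab_index = {w: j for j, w in enumerate(vocab)}
V = [vectorize(a, vocab_index) for a in abstracts]

W, H, errors = nmf(V, k)
# Label each document with its dominant topic (largest weight in its row of W).
labels = [max(range(k), key=row.__getitem__) for row in W]
# Describe each topic by its three highest-weighted terms in H.
top_terms = [[vocab[j] for j in sorted(range(len(vocab)),
                                       key=lambda j: -H[t][j])[:3]]
             for t in range(k)]

model = train_nb(V, labels, k)
new_topic = predict_nb(model, vectorize("immune response to vaccine", vocab_index))
print(labels, top_terms, new_topic)
```

In this sketch the rows of H play the role of topics (weights over terms) and the rows of W give each abstract's weight on each topic, which is what makes the factors readable as topic descriptions. A production system would more likely use a library implementation and TF-IDF weighting rather than raw counts.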

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial 4.0 License