SMU Data Science Review


In this paper, we present a new model to predict the probability that a personal computer will become infected with malware. The dataset comes from a Kaggle competition supported by Microsoft. The data include computer configuration, owner information, and installed software. We apply several classification models to assign each machine a probability of being infected with malware. Among these, the LightGBM classifier proved the best choice, training faster and with lower memory usage than the other models in this study. It obtained a cross-validated ROC-AUC score of 74%. We also use LightGBM to identify the leading factors and their feature importance. Our research shows that variables related to location, firmware version, operating system, and anti-virus software carry the highest weight in predicting malware detection.
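To make the reported metric concrete: ROC-AUC can be read as the probability that a randomly chosen infected machine receives a higher predicted score than a randomly chosen clean one. The following is a minimal pure-Python sketch of that definition, not the evaluation code used in the study; the function name `roc_auc` and the labels and scores are hypothetical and chosen only for illustration.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 ground-truth values (1 = infected).
    scores: iterable of predicted probabilities, same length as labels.
    Tied pairs count as half a correct ranking.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, not taken from the Kaggle dataset.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.4, 0.6, 0.7, 0.8, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 3))  # 8 of 9 pairs are ranked correctly -> 0.889
```

Under this reading, a score of 74% means that roughly three out of four infected/clean pairs are ordered correctly, while 50% would correspond to random ranking. The pairwise loop is quadratic and is meant only to expose the definition; a library routine would be the practical choice on a dataset of this size.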

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial 4.0 License