Abstract

Selecting a learning algorithm for a particular application on the basis of performance remains an ad-hoc process that relies on fundamental benchmarks such as a classifier’s overall loss function and misclassification metrics. In this paper we address the difficulty of model selection by comparing the overall classification performance of random forest and logistic regression on datasets whose underlying structure is varied in four ways: (1) increasing the variance in the explanatory and noise variables, (2) increasing the number of noise variables, (3) increasing the number of explanatory variables, and (4) increasing the number of observations. We developed a model evaluation tool that simulates classifier models under these dataset characteristics and reports performance metrics such as true positive rate, false positive rate, and accuracy under specific conditions. We found that when increasing the variance in the explanatory and noise variables, logistic regression consistently achieved a higher overall accuracy than random forest. However, random forest had a higher true positive rate than logistic regression, and it yielded a higher false positive rate for datasets with an increasing number of noise variables. Each case study consisted of 1000 simulations, and the results consistently showed the false positive rate of random forest with 100 trees to be statistically different from that of logistic regression. In all four cases, the relative classification performance of logistic regression and random forest varied with the simulated dataset conditions.
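
As an illustration of the three metrics named above, the following minimal sketch computes true positive rate, false positive rate, and accuracy from binary labels and predictions. It is not the authors' evaluation tool; the function name `classification_rates` and the 0/1 label encoding are assumptions made for this example only.

```python
def classification_rates(y_true, y_pred):
    """Return TPR, FPR, and accuracy for binary labels encoded as 0/1.

    Hypothetical helper for illustration; not taken from the paper's tool.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one observation is required")

    # Tally the four cells of the confusion matrix.
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:
            tn += 1

    # A rate is undefined (NaN) when its denominator class is absent.
    nan = float("nan")
    tpr = tp / (tp + fn) if (tp + fn) else nan   # sensitivity / recall
    fpr = fp / (fp + tn) if (fp + tn) else nan   # fall-out
    accuracy = (tp + tn) / len(y_true)
    return {"tpr": tpr, "fpr": fpr, "accuracy": accuracy}


if __name__ == "__main__":
    labels      = [1, 1, 1, 0, 0, 0, 0, 1]
    predictions = [1, 0, 1, 0, 1, 0, 0, 1]
    print(classification_rates(labels, predictions))
```

Reporting the true positive rate and false positive rate alongside accuracy matters for the comparison described here, because two classifiers can reach similar accuracy while trading off the two kinds of error differently.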

Recommended Citation

Kirasich, Kaitlin; Smith, Trace; and Sadler, Bivin (2018) "Random Forest vs Logistic Regression: Binary Classification for Heterogeneous Datasets," SMU Data Science Review: Vol. 1: No. 3, Article 9.
Available at: https://scholar.smu.edu/datasciencereview/vol1/iss3/9