SMU Data Science Review
Abstract
In this paper, we evaluate the self-declared industry classifications of, and industry relationships between, companies listed on either the Nasdaq or the New York Stock Exchange (NYSE). Large corporations typically operate in multiple industries simultaneously; however, for investment purposes they are classified as belonging to a single industry. This single-industry classification obscures the actual industries in which a company operates and, therefore, the investment risks of that company.
By applying Natural Language Processing (NLP) techniques to Securities and Exchange Commission (SEC) filings, we obtained self-defined industry classifications for each company. Using clustering techniques such as hierarchical agglomerative and k-means clustering, we were able to identify companies operating in similar industries. We found that the use of NLP to extract features from the text was more important to model performance than model selection or optimization.
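The general idea of turning filing text into features and then grouping companies can be illustrated with a minimal sketch. It is not the authors' pipeline: the company names and text snippets are hypothetical toy data, the features are plain TF-IDF weights (using the common smoothed form ln((1 + n) / (1 + df)) + 1), and the clustering is a simple average-linkage agglomerative procedure written with the Python standard library only.

```python
import math
from collections import Counter

# Hypothetical toy "filing" snippets; real SEC filings are far longer.
FILINGS = {
    "BANK_A": "commercial bank loans deposits mortgage lending interest",
    "BANK_B": "retail bank deposits loans credit lending branches",
    "OIL_A": "crude oil exploration drilling refinery pipeline",
    "OIL_B": "oil gas drilling exploration offshore pipeline",
}


def tfidf(docs):
    """Return {name: {term: weight}} with L2-normalised TF-IDF weights."""
    tokens = {name: text.lower().split() for name, text in docs.items()}
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokens.values() for term in set(toks))
    vectors = {}
    for name, toks in tokens.items():
        tf = Counter(toks)
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors[name] = {t: w / norm for t, w in vec.items()}
    return vectors


def cosine(u, v):
    """Cosine similarity of two L2-normalised sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def agglomerate(vectors, k):
    """Average-linkage agglomerative clustering, merging until k clusters remain."""
    clusters = [[name] for name in sorted(vectors)]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sims = [cosine(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j]]
                score = sum(sims) / len(sims)
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        # Merge the most similar pair of clusters.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


clusters = agglomerate(tfidf(FILINGS), k=2)
print(clusters)
```

On this toy data the two bank snippets share no vocabulary with the two oil snippets, so the procedure groups the banks together and the oil companies together. With real filings, the choice of text preprocessing and features would matter far more than in this sketch, which is consistent with the finding stated above.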
Recommended Citation
Torres, Vanessa; Deason, Travis; Landrum, Michael; and Lohria, Nibhrat (2019) "A Machine Learning Model for Clustering Securities," SMU Data Science Review: Vol. 2: No. 2, Article 18.
Available at: https://scholar.smu.edu/datasciencereview/vol2/iss2/18
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial 4.0 License
Included in
Numerical Analysis and Scientific Computing Commons, Programming Languages and Compilers Commons, Theory and Algorithms Commons