SMU Data Science Review


This paper presents several approaches for matching a user's query to the most similar existing question. The process has two steps. First, the tags or topics of the questions in the corpus are identified, using k-means clustering and topic modeling, respectively. Second, the user's query is matched against the most similar questions in the corpus using the k-means model, the topic model, and an ensemble of the two. Our motivation is to improve developer productivity by presenting the 10 questions most relevant to the user's query. The study focuses on Python (Windows) technical programming questions drawn from the Stack Overflow dataset. We chose these three approaches because the tags provided with the dataset were too generic to contextualize a question, which could lead to irrelevant results for future queries. Recall is the metric used to evaluate the models. NMF-based topic modeling and the ensemble method outperformed k-means: both achieved a recall of 67%, compared with 50% for k-means.
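As a rough illustration of the matching and evaluation steps only, the sketch below ranks corpus questions against a query and scores the result with recall. It is not the paper's method: it uses plain TF-IDF cosine similarity in place of the k-means, NMF, and ensemble models, and the function names (`top_k_similar`, `recall_at_k`), the whitespace tokenizer, and the definition of recall as the fraction of relevant questions retrieved are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, no stemming or stop words.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict) per document, plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF so that every term keeps a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    vectors = [
        {t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_similar(query, docs, k=10):
    """Return indices of the k corpus questions most similar to the query."""
    vectors, idf = tfidf_vectors(docs)
    # Query terms that never occur in the corpus are ignored.
    q_vec = {t: tf * idf[t] for t, tf in Counter(tokenize(query)).items() if t in idf}
    ranked = sorted(
        range(len(docs)), key=lambda i: cosine(q_vec, vectors[i]), reverse=True
    )
    return ranked[:k]


def recall_at_k(retrieved, relevant):
    """Fraction of the relevant questions that appear in the retrieved list."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved)) / len(relevant)
```

With a hypothetical corpus of question titles, `top_k_similar("read csv file python", corpus, k=10)` would return the indices of the ten closest questions, and `recall_at_k` could then compare that list against a hand-labelled set of relevant questions.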

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial 4.0 License