2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)

Abstract

Software engineers produce large amounts of publicly accessible data that enable researchers to mine knowledge, fostering a better understanding of the field. Knowledge extraction often relies on metadata, which can either be harvested from user-provided tags or inferred by algorithms from the data itself. The question arises to what extent either type of metadata can be trusted and relied upon. We study this problem in the context of language identification of code snippets posted on Stack Overflow. We analyse the consistency between user-provided tags and the classification obtained with GitHub Linguist, an industry-strength automated language recognition tool. We find that the results obtained by the two approaches are often inconsistent, which indicates that both have to be used with great care. Our results also suggest that developers may not follow the evolutionary path of programming languages beyond one step when seeking or providing answers to the software engineering challenges they encounter.