2015 IEEE 27th International Conference on Tools with Artificial Intelligence (ICTAI)

Abstract

Term weighting schemes have been widely used in information retrieval and text categorization models. In this paper, we first investigate the limitations of several state-of-the-art term weighting schemes in the context of text categorization. Since category-specific terms are more useful for discriminating between categories, and such terms tend to have low entropy with respect to those categories, we explore the relationship between a term's discriminating power and its entropy over a set of categories. To this end, we propose two entropy-based term weighting schemes (i.e., tf.dc and tf.bdc) which measure the discriminating power of a term by its global distributional concentration across the categories of a corpus. To demonstrate the effectiveness of the proposed term weighting schemes, we compare them with seven state-of-the-art schemes on a long-text corpus and a short-text corpus, respectively. Our experimental results show that the proposed schemes outperform the state-of-the-art schemes in text categorization tasks with KNN and SVM.
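The abstract states that category-specific terms tend to have low entropy over the categories, so a plausible form of the distributional-concentration (dc) weight is one minus the term's normalized entropy across categories. The sketch below illustrates that idea only; the exact formula, smoothing, and the balanced variant (bdc) are defined in the paper itself, and the function name and inputs here are assumptions.

```python
import math

def dc_weight(category_freqs):
    """Hypothetical distributional-concentration weight of a term:
    1 minus its normalized entropy over the categories it occurs in.
    Low entropy (term concentrated in few categories) -> weight near 1;
    an evenly spread term -> weight near 0. Inferred from the abstract,
    not the paper's exact definition."""
    total = sum(category_freqs)
    if total == 0:
        return 0.0
    entropy = 0.0
    for f in category_freqs:
        if f > 0:
            p = f / total
            entropy -= p * math.log(p)
    # Normalize by the maximum possible entropy, log of the category count.
    return 1.0 - entropy / math.log(len(category_freqs))

# A term appearing in only one of four categories is maximally concentrated,
# while a term spread evenly across all four carries no category information.
specific = dc_weight([10, 0, 0, 0])
spread = dc_weight([5, 5, 5, 5])
```

In a tf.dc scheme, this global factor would multiply the term's local frequency in a document, analogous to how idf multiplies tf in tf.idf.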
