Abstract
Decision tree classifiers are important classification techniques: they learn relatively quickly and provide classification accuracy comparable to other methods. However, continuous features must be handled before building a decision tree model. Various discretization techniques transform continuous-valued data into discrete-valued data, and it is difficult to select the appropriate discretization algorithm for data sets with different characteristics. In our experiments, we use the class sample proportion to separate data sets with numerical attributes into two groups and apply 12 commonly used discretization methods to each. We study both supervised and unsupervised as well as top-down and bottom-up approaches, including ChiMerge, MDLP, Chi2, FUSINTER, Modified Chi2, CAIM, Extended Chi2, MODL, CACC, Ameva, PKID and ZDISC. The experimental results show that MDLP, which uses class information entropy, gives the best overall performance for decision trees, especially on data sets with different class sample proportions. CACC, which uses the contingency coefficient criterion, is the best algorithm for data sets with similar class sample proportions. Both are supervised top-down approaches.
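To illustrate the class-information-entropy idea underlying MDLP, the sketch below shows the greedy step that entropy-based top-down discretization repeats: choose the cut point that minimizes the weighted class entropy of the two resulting intervals. This is a minimal sketch, not the full MDLP algorithm (which adds an MDL-based stopping criterion and recurses on each interval); the function names `entropy` and `best_cut` are our own illustrative choices.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    """Return the cut point that minimizes the weighted class entropy
    of the two intervals it creates. This is the greedy step an
    MDLP-style top-down discretizer applies recursively; the full
    method also tests an MDL criterion to decide when to stop."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_score, best_point = float("inf"), None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no valid cut between identical values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        # Weighted average entropy of the two candidate intervals
        score = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        if score < best_score:
            best_score = score
            best_point = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint cut
    return best_point
```

For example, `best_cut([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"])` returns `6.5`, the midpoint separating the two pure-class intervals. An unsupervised method such as PKID, by contrast, would place bin boundaries by sample counts alone, without consulting the labels.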