2014 IEEE International Conference on Bioinformatics and Bioengineering (BIBE)

Abstract

In bioinformatics, two common problems encountered when analyzing real-world datasets are class imbalance and high dimensionality. Boosting is a technique that can improve classification performance, even in the presence of class imbalance. In addition, data sampling and feature selection are two important preprocessing techniques used to counter the adverse effects of these two challenges together. In this study, we examine whether including boosting alongside the joint deployment of feature selection and data sampling affects the classification performance of inductive models. To this end, we used two approaches: filter-based feature selection followed by either data sampling (denoted FS-DS) or RUSBoost, a hybrid data sampling and boosting technique that integrates random undersampling within the boosting process (denoted FRB). We conducted an extensive experimental study using six high-dimensional, imbalanced bioinformatics datasets along with three learners and four feature subset sizes. Our results show that the improvement in classification performance due to boosting depends on the choice of learner used to build the model. We recommend FRB because it outperforms FS-DS in nearly all scenarios. Additionally, our ANOVA shows that FRB is statistically distinguishable from FS-DS when using the LR learner. To our knowledge, this is the first study to investigate the effects of boosting along with combined feature selection and data sampling on the classification performance of inductive models in the domain of bioinformatics.
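As a rough illustration of the FS-DS approach described above, the following Python sketch ranks features with a simple filter and then randomly undersamples the majority class. It is not the authors' implementation: the abstract does not name the filter or the sampling ratio, so the mean-difference score, the `ratio` parameter, the function names, and the convention that label 1 is the minority class are all assumptions made for this example.

```python
import random
from statistics import mean


def rank_features(X, y, k):
    """Return the indices of the k top-scoring features.

    Assumed filter score: absolute difference between the class means
    of each feature (a stand-in, not the ranker used in the paper).
    """
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(mean(pos) - mean(neg)))
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]


def random_undersample(X, y, ratio=1.0, seed=0):
    """Randomly drop majority rows (label 0) so that at most
    ratio * (number of minority rows) of them remain."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == 1]
    majority = [i for i, label in enumerate(y) if label == 0]
    n_keep = min(len(majority), int(ratio * len(minority)))
    chosen = sorted(minority + rng.sample(majority, n_keep))
    return [X[i] for i in chosen], [y[i] for i in chosen]


def fs_ds(X, y, k, ratio=1.0, seed=0):
    """Feature selection first, then data sampling (the FS-DS order)."""
    top = rank_features(X, y, k)
    X_reduced = [[row[j] for j in top] for row in X]
    X_bal, y_bal = random_undersample(X_reduced, y, ratio, seed)
    return X_bal, y_bal, top
```

The balanced output of `fs_ds` would then be passed to a learner. FRB differs in where the sampling happens: rather than undersampling once before training, RUSBoost repeats the random undersampling inside each boosting iteration.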
