Abstract
This paper presents a method for reducing the labeling cost of acquiring training data for a system that detects malicious domain names using supervised machine learning. The conventional system requires large quantities of both benign and malicious domain names as training data to obtain a classifier with high classification accuracy. Because malicious domain names are generally observed less frequently than benign domain names, it is difficult to acquire a large number of them without a dedicated labeling method. We propose a method based on active learning that labels only data around the decision boundary of classification, i.e., in the gray area, and we show that the classification accuracy can be improved using only approximately 2.5% of the training data used by the conventional system. A further disadvantage of the conventional system is that the generalization ability of a classifier trained with a small amount of training data cannot be guaranteed. To address this, we propose a method based on ensemble learning that integrates multiple classifiers, and we show that it stabilizes and improves the classification accuracy.
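The two ideas summarized above can be illustrated with a minimal sketch. It is not the implementation evaluated in this paper. It assumes that each classifier outputs a malicious-probability in [0, 1], that the gray area consists of the samples whose score is closest to 0.5, and that the ensemble integrates classifiers by averaging their scores. The function names `select_gray_area` and `ensemble_score` are hypothetical.

```python
from typing import Callable, List, Sequence


def select_gray_area(scores: Sequence[float], budget: int) -> List[int]:
    """Pick the `budget` unlabeled samples nearest the decision boundary.

    Assumption: `scores` are malicious-probabilities and the boundary is 0.5.
    Returns sample indices, most uncertain first.
    """
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i] - 0.5))
    return ranked[:budget]


def ensemble_score(classifiers: Sequence[Callable[[str], float]], domain: str) -> float:
    """Integrate several classifiers by averaging their scores (assumed rule)."""
    return sum(clf(domain) for clf in classifiers) / len(classifiers)


# Hypothetical scores for five unlabeled domain names.
scores = [0.05, 0.48, 0.97, 0.55, 0.30]
to_label = select_gray_area(scores, budget=2)  # indices sent to a human labeler

# Three stand-in classifiers that return fixed scores, for illustration only.
classifiers = [lambda d: 0.2, lambda d: 0.4, lambda d: 0.9]
combined = ensemble_score(classifiers, "example.com")
```

In this sketch only the two most ambiguous samples are sent for labeling, which is how a labeling budget can stay small, and averaging dampens the variance of any single classifier trained on few samples.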