Abstract
A deep learning-based approach to phishing detection is proposed. Specifically, websites' URLs and the characters in these URLs are treated as documents and words, respectively, in the context of word2vec-based word embedding learning. Consequently, character embeddings can be learned from a corpus of URLs in an unsupervised manner. Furthermore, we combine the character embeddings with the structure of URLs to obtain vector representations of the URLs. In particular, a URL is partitioned into the following five sections: URL protocol, sub-domain name, domain name, domain suffix, and URL path. To identify phishing URLs, existing classification algorithms can be applied directly to the vector representations of the URLs, avoiding the laborious manual and empirical design of effective features. For evaluation, we collect a large-scale dataset, the 1 Million Phishing Detection Dataset (1M-PD), which has been released for public use. Extensive experiments conducted on two real-world datasets show the effectiveness of the proposed approach, which achieves an accuracy of 99.69% with a false positive rate of 0.40% and a true positive rate of 99.79% on the 1M-PD dataset. In particular, the proposed approach classifies each URL in 32 ms on average on a personal computer, which is much faster than existing approaches and can be considered real-time.
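To make the five-section partition concrete, the following minimal sketch splits a URL into protocol, sub-domain name, domain name, domain suffix, and path, and then builds one fixed-length vector per URL. It is an illustration under stated assumptions, not the implementation evaluated here: `KNOWN_SUFFIXES` is a tiny hypothetical suffix list, `char_vector` is a deterministic stand-in for a learned word2vec character embedding, the query string is assumed to belong to the path section, and averaging the character vectors within each section is only one possible way to combine them.

```python
from urllib.parse import urlsplit

# Hypothetical, tiny suffix list for illustration only; a real system
# would need a complete list of domain suffixes.
KNOWN_SUFFIXES = ("co.uk", "com.cn", "com", "org", "net", "cn", "uk")

SECTION_ORDER = ("protocol", "subdomain", "domain", "suffix", "path")


def partition_url(url):
    """Split a URL into the five sections named in the abstract."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()

    # Pick the longest known suffix that the host name ends with.
    suffix = ""
    for candidate in sorted(KNOWN_SUFFIXES, key=len, reverse=True):
        if host == candidate or host.endswith("." + candidate):
            suffix = candidate
            break

    # Everything before the suffix: the last label is the domain name,
    # the labels before it form the sub-domain name.
    remainder = host[: len(host) - len(suffix)].rstrip(".") if suffix else host
    labels = remainder.split(".") if remainder else []
    domain = labels[-1] if labels else ""
    subdomain = ".".join(labels[:-1])

    # Assumption: the query string is treated as part of the path section.
    path = parts.path
    if parts.query:
        path += "?" + parts.query

    return {
        "protocol": parts.scheme,
        "subdomain": subdomain,
        "domain": domain,
        "suffix": suffix,
        "path": path,
    }


def char_vector(ch, dim=4):
    """Toy stand-in for a learned character embedding (not word2vec output)."""
    return [((ord(ch) * (i + 3)) % 17) / 17.0 for i in range(dim)]


def url_vector(url, dim=4):
    """Concatenate the mean character vector of each of the five sections."""
    sections = partition_url(url)
    vector = []
    for name in SECTION_ORDER:
        text = sections[name]
        if text:
            columns = zip(*(char_vector(c, dim) for c in text))
            vector.extend(sum(column) / len(text) for column in columns)
        else:
            # An empty section contributes a zero block, keeping the length fixed.
            vector.extend([0.0] * dim)
    return vector


example = "http://login.secure.example.co.uk/account/verify?id=1"
sections = partition_url(example)
features = url_vector(example)
```

Because every section contributes a block of the same width, `features` has a fixed length (five times `dim`) regardless of the URL's length, which is what allows an off-the-shelf classifier to consume it directly.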