Abstract
This paper proposes a supervised deep hashing approach for highly efficient and effective cover song detection. Our system consists of two identical sub-networks, each with a hash layer that learns a binary representation of the input audio, given in the form of spectral features. A loss function joins the outputs of the two sub-networks by minimizing the Hamming distance between the hashes of a pair of audio files covering the same musical work. We further enhance system performance through loudness embedding, beat synchronization, and early fusion of the input audio features. The 128-bit hash output achieves state-of-the-art performance in terms of mean pairwise accuracy. This system demonstrates that memory-efficient, real-time cover song detection with satisfactory accuracy is feasible at large scale.
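The abstract does not spell out the exact loss, but the idea of two identical sub-networks whose binary codes are pulled together in Hamming space for matching pairs can be sketched with a generic contrastive objective. The code below is a minimal illustration, not the paper's implementation: the hash-layer parameters `W`, `b` and the `margin` value are hypothetical, and the `tanh` relaxation with `sign` at inference is one common way to make binary codes trainable.

```python
import numpy as np

def soft_hash(x, W, b):
    # Hash layer sketch: linear projection squashed by tanh.
    # Taking sign() of the output at inference time yields the
    # binary code in {-1, +1}^k (W, b are hypothetical parameters).
    return np.tanh(x @ W + b)

def hamming_from_codes(h1, h2):
    # For codes in {-1, +1}^k, Hamming distance = (k - <h1, h2>) / 2.
    k = h1.shape[-1]
    return (k - np.sum(h1 * h2, axis=-1)) / 2.0

def pair_loss(h1, h2, same_work, margin=32.0):
    # Contrastive objective on the (relaxed) codes: pull covers of the
    # same work together in Hamming space, push non-covers apart until
    # they exceed a margin (margin value chosen for illustration).
    d = hamming_from_codes(h1, h2)
    pos = same_work * d
    neg = (1.0 - same_work) * np.maximum(0.0, margin - d)
    return float(np.mean(pos + neg))
```

With 128-bit codes as in the paper, identical codes give distance 0 and fully opposite codes give distance 128; the loss drives cover pairs toward the former.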