Abstract
Multiple-precision multiplication is widely used in scientific computing and cryptography. When the size of an integer grows beyond the native machine word (32-bit or 64-bit), the computational cost of multiplication becomes significant. In this paper, we propose a novel solution for implementing multiple-precision multiplication on massively parallel GPUs with the Kepler architecture. Our implementation is based on the Chinese Remainder Theorem and the Number Theoretic Transform (NTT) with a 64-bit prime. We implemented three versions of multiple-precision multiplication, which use global memory, shared memory, and registers, respectively, to store the precomputed twiddle factors. The register version uses the warp shuffle instruction (available on GPUs with the Kepler architecture) to exchange data among threads within the same warp. This technique avoids the bank conflict issue in shared memory and allows faster computation on the GPU. To the best of our knowledge, this is the first implementation reported in the literature that uses the warp shuffle instruction to accelerate NTT computation. Our best implementation performs 1024-bit, 2048-bit, 4096-bit, and 8192-bit multiplication in 0.095 ms, 0.169 ms, 0.444 ms, and 1.113 ms, respectively.
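For readers unfamiliar with NTT-based multiplication, the underlying idea can be illustrated with a minimal sequential sketch in Python. It is not the GPU implementation described here. The sketch assumes the prime 2^64 - 2^32 + 1, 16-bit limbs, and a recursive radix-2 transform; the abstract does not specify any of these. The names `ntt` and `multiply` are hypothetical. The Chinese Remainder Theorem step and the memory-placement optimizations are omitted.

```python
# Illustrative sketch only: the prime, limb width, and function names are
# assumptions, not details taken from the paper.
P = 2**64 - 2**32 + 1   # an NTT-friendly 64-bit prime (assumed choice)
G = 7                   # quadratic non-residue mod P; its powers yield 2^k-th roots of unity
LIMB_BITS = 16          # limb width; keeps convolution sums below P


def ntt(a, root):
    """Recursive radix-2 NTT of list a (length a power of two) modulo P."""
    n = len(a)
    if n == 1:
        return a[:]
    root_sq = root * root % P
    even = ntt(a[0::2], root_sq)
    odd = ntt(a[1::2], root_sq)
    out = [0] * n
    w = 1  # twiddle factor root^k
    for k in range(n // 2):
        t = w * odd[k] % P
        out[k] = (even[k] + t) % P           # butterfly: upper output
        out[k + n // 2] = (even[k] - t) % P  # butterfly: lower output
        w = w * root % P
    return out


def to_limbs(v):
    """Split a non-negative integer into little-endian LIMB_BITS-bit limbs."""
    mask = (1 << LIMB_BITS) - 1
    limbs = []
    while v:
        limbs.append(v & mask)
        v >>= LIMB_BITS
    return limbs or [0]


def multiply(x, y):
    """Multiply two non-negative integers via a cyclic convolution mod P."""
    a, b = to_limbs(x), to_limbs(y)
    n = 1
    while n < len(a) + len(b):   # transform length: next power of two
        n *= 2
    a += [0] * (n - len(a))
    b += [0] * (n - len(b))

    root = pow(G, (P - 1) // n, P)       # primitive n-th root of unity
    assert pow(root, n // 2, P) == P - 1

    fa, fb = ntt(a, root), ntt(b, root)
    fc = [u * v % P for u, v in zip(fa, fb)]   # pointwise product

    inv_root = pow(root, P - 2, P)       # inverse via Fermat's little theorem
    inv_n = pow(n, P - 2, P)
    c = [v * inv_n % P for v in ntt(fc, inv_root)]

    # Recombine limbs; Python's big integers absorb the carries.
    return sum(coeff << (LIMB_BITS * i) for i, coeff in enumerate(c))
```

With 16-bit limbs, each convolution coefficient is bounded by n · 2^32, which stays below the modulus for any practical transform length n, so the result is exact. In a GPU setting, the butterfly loop is the part that would be distributed across threads, with the twiddle factors `w` precomputed and stored in global memory, shared memory, or registers.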