2015 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Abstract

Graphics Processing Units (GPUs) contain multiple SIMD cores, and each core can run a large number of threads concurrently. Threads in a core are scheduled and executed in fixed-sized groups, called warps. Each core contains one or more warp schedulers that select and execute warps from a pool of ready warps. In spite of having a large number of concurrent warps - 48 on the NVIDIA Fermi architecture GPU - on many GPGPU applications, current warp scheduling algorithms cannot effectively utilize the hardware resources, resulting in stall cycles and loss in performance. The main reason for this is that current warp scheduling algorithms mostly focus on long-latency operations, especially global memory accesses, and do not take into account factors such as the progress of each thread block and the number of ready warps. In this paper, we propose PRO, a progress-based warp scheduling algorithm that focuses not only on finishing individual thread blocks faster but also on reducing the overall execution time. These goals are achieved by dynamically prioritizing thread blocks and warps based on their progress. We implemented our proposed algorithm in the GPGPU-Sim simulator and evaluated it on various applications from the GPGPU-Sim, Rodinia, and CUDA SDK benchmark suites. We achieved an average speedup of 1.12x and a maximum speedup of 1.94x over the commonly used Loose Round-Robin warp scheduling algorithm. Over the Two-Level warp scheduler, our algorithm showed an average speedup of 1.13x and a maximum speedup of 1.6x. Our proposed solution requires only a very small increase in GPU hardware.
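The abstract's core idea - dynamically prioritizing thread blocks and warps by how far they have progressed - can be illustrated with a small sketch. This is a hypothetical simplification, not the paper's actual hardware implementation: the `Warp` fields, the progress metric (fraction of instructions executed), and the tie-breaking rule are all assumptions made for illustration.

```python
# Hypothetical sketch of progress-based warp selection (not the paper's
# actual design): each warp belongs to a thread block, and the scheduler
# prefers ready warps from the block that has made the most progress,
# so nearly finished blocks complete sooner and free their resources.
from dataclasses import dataclass

@dataclass
class Warp:
    warp_id: int
    block_id: int
    executed: int   # instructions executed so far
    total: int      # total instructions for this warp
    ready: bool     # not stalled on a long-latency operation

def block_progress(warps, block_id):
    """Fraction of a thread block's instructions already executed."""
    ws = [w for w in warps if w.block_id == block_id]
    done = sum(w.executed for w in ws)
    total = sum(w.total for w in ws)
    return done / total if total else 1.0

def select_warp(warps):
    """Pick the ready warp whose block has made the most progress;
    break ties by the warp's own progress (an assumed tie-break rule)."""
    ready = [w for w in warps if w.ready]
    if not ready:
        return None
    return max(ready, key=lambda w: (block_progress(warps, w.block_id),
                                     w.executed / w.total))
```

For example, given two blocks where block 0 is 85% complete and block 1 is 15% complete, the selector picks a ready warp from block 0, pushing the nearly finished block toward completion; a round-robin scheduler would instead cycle across both blocks.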