2017 IEEE Trustcom/BigDataSE/ICESS

Abstract

Graphics processing units (GPUs) have been increasingly used to accelerate general-purpose computation. By exploiting massive thread-level parallelism (TLP), GPUs achieve high throughput and hide memory latency. As a result, a very large register file (RF) is typically required to enable fast, low-cost context switching among tens of thousands of active threads. However, RF capacity is still insufficient to support all available thread-level parallelism, and this shortage can hurt performance by limiting GPU thread occupancy. Moreover, if the available RF capacity cannot satisfy the requirements of a thread block, the GPU must spill some variables to local memory, which may incur long memory-access latencies. Observing that for many GPGPU applications a large fraction of computed results have fewer significant bits than the full width of a 32-bit register, we propose a GPU register packing scheme that dynamically exploits narrow-width operands and packs multiple operands into a single full-width register. Dynamic register packing frees RF space, which allows the GPU to expose more TLP by assigning additional thread blocks to SMs (Streaming Multiprocessors), and thus improves performance. Experimental results show that our GPU register packing scheme achieves a speedup of up to 1.96X, and 1.18X on average.