2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS)

Abstract

The problem of deepening memory hierarchy towards exascale is becoming serious for applications such as those based on stencil kernels, as it is difficult to satisfy both high memory bandwidth and capacity requirements simultaneously. This is evident even today, where problem sizes of stencil-based applications on GPU supercomputers are limited by the aggregate capacity of GPU device memory. Locality improvement techniques such as temporal blocking are known to preserve performance, but integrating them into existing stencil applications substantially raises programming cost, especially for complex applications, and as a result they are not typically used. We alleviate this problem with a run-time GPU-MPI process virtualization library we call HHRT, which automates data movement across the memory hierarchy, and a systematic methodology to convert and optimize code to accommodate temporal blocking. The proposed methodology is shown to significantly ease the adaptation of real applications, such as the whole-city airflow simulator comprising more than 12,000 lines of code; with careful tuning, we successfully maintain up to 85% performance even with problems whose footprint is four times larger than GPU device memory capacity, and scale to hundreds of GPUs on the TSUBAME2.5 supercomputer.