Architecture-Aware Mapping and Optimization on a 1600-Core GPU

By: M Daga, T Scogland, and Wu-chun Feng

In: Parallel and Distributed Systems (ICPADS), 2011 IEEE 17th International Conference on, 2011, pp. 316-323

Posted: 01 Jan 2011

Tagged: GPU GEM optimization

This paper provides an overview of some of the architecture-specific optimizations we have identified for AMD Radeon GPUs. Each is characterized using the GEM GPU application described in "Accelerating electrostatic surface potential calculation with multi-scale approximation on graphics processing units." While some of these optimizations are less necessary now, many can still be applied and will give benefits not only on AMD GPUs but on a variety of other platforms as well.

Abstract:

The graphics processing unit (GPU) continues to make in-roads as a computational accelerator for high-performance computing (HPC). However, despite its increasing popularity, mapping and optimizing GPU code remains a difficult task; it is a multi-dimensional problem that requires deep technical knowledge of GPU architecture. Although substantial literature exists on how to map and optimize GPU performance on the more mature NVIDIA CUDA architecture, the converse is true for OpenCL on an AMD GPU, such as the 1600-core AMD Radeon HD 5870 GPU. Consequently, we present and evaluate architecture-aware mapping and optimizations for the AMD GPU. The most prominent of these include (i) explicit use of registers, (ii) use of vector types, (iii) removal of branches, and (iv) use of image memory for global data. We demonstrate the efficacy of our AMD GPU mapping and optimizations by applying each in isolation as well as in concert to a large-scale, molecular modeling application called GEM. Via these AMD-specific GPU optimizations, our optimized OpenCL implementation on an AMD Radeon HD 5870 delivers more than a four-fold improvement in performance over the basic OpenCL implementation. In addition, it outperforms our optimized CUDA version on an NVIDIA GTX280 by 12%. Overall, we achieve a speedup of 371-fold over a serial but hand-tuned SSE version of our molecular modeling application, and in turn, a 46-fold speedup over an ideal scaling on an 8-core CPU.
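
As a rough illustration of item (iii), removal of branches, the sketch below contrasts a loop with a data-dependent `if` against an equivalent branch-free form. It is written in plain C rather than OpenCL, is not taken from the paper or from GEM, and the function names and the cutoff-based accumulation are hypothetical choices made only for this example.

```c
#include <stddef.h>

/* Hypothetical example, not the paper's kernel.
 * Branchy form: a data-dependent `if` sits inside the hot loop. */
float sum_within_cutoff_branchy(const float *dist, const float *val,
                                size_t n, float cutoff)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        if (dist[i] < cutoff) {
            sum += val[i];
        }
    }
    return sum;
}

/* Branch-free form: the comparison evaluates to 0 or 1 and is used
 * as a weight, so every iteration executes the same instructions. */
float sum_within_cutoff_branchless(const float *dist, const float *val,
                                   size_t n, float cutoff)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float keep = (float)(dist[i] < cutoff); /* 1.0f or 0.0f */
        sum += keep * val[i];
    }
    return sum;
}
```

The two forms agree for finite inputs; they can differ when `val[i]` is NaN or infinite, because multiplying such a value by zero does not yield zero. Whether the branch-free form is actually faster depends on the device and compiler, so it should be measured rather than assumed.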

BibTeX:
@INPROCEEDINGS{6121293, 
    author={Daga, M. and Scogland, T. and Wu-chun Feng}, 
    booktitle={Parallel and Distributed Systems (ICPADS), 2011 IEEE 17th
    International Conference on}, 
    title={Architecture-Aware Mapping and Optimization on a 1600-Core GPU}, 
    year={2011}, 
    month={dec.}, 
    volume={}, 
    number={}, 
    pages={316--323}, 
    keywords={1600 core AMD Radeon HD 5870 GPU;AMD specific GPU
    optimizations;GEM;GPU architecture;GPU code optimizing;NVIDIA CUDA
    architecture;NVIDIA GTX280;OpenCL;architecture aware mapping;graphics
    processing unit;high performance computing;optimization;graphics processing
    units;optimisation;parallel architectures;}, 
    doi={10.1109/ICPADS.2011.29}, 
    ISSN={1521-9097},
}
