StreamMR: An Optimized MapReduce Framework for AMD GPUs

I’m rather fond of this work. It directly contradicts the claim in the original Mars paper that a two-pass method is the only way to handle MapReduce on GPUs that cannot use atomics. Although StreamMR is compared here against implementations that do use atomics, it works on GPUs with or without them and does not require a second pass.
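
For context, here is a rough CPU-side sketch of the kind of two-pass scheme referred to above, as I understand it. It is an illustration in Python, not code from Mars or StreamMR, and the names `word_map` and `two_pass_map` are hypothetical. The first pass only counts how many records each thread would emit, a prefix sum turns those counts into write offsets, and the second pass reruns the map function to write into a preallocated buffer, so no atomics are needed.

```python
def word_map(line):
    """Hypothetical map function: emit (word, 1) for each word in a line."""
    return [(word, 1) for word in line.split()]


def two_pass_map(inputs, map_fn):
    """Sketch of two-pass output handling; each input stands in for one GPU thread."""
    # Pass 1: every "thread" runs the map function only to count its outputs.
    counts = [len(map_fn(item)) for item in inputs]

    # Exclusive prefix sum: offsets[i] is the first slot thread i may write to.
    offsets = []
    running = 0
    for count in counts:
        offsets.append(running)
        running += count

    # The total output size is now known, so the buffer is allocated once.
    output = [None] * running

    # Pass 2: rerun the map function and write at the precomputed offsets.
    # Threads never share a slot, so no atomic operations are required.
    for i, item in enumerate(inputs):
        for j, record in enumerate(map_fn(item)):
            output[offsets[i] + j] = record

    return output, offsets
```

The price of this approach is that the map function runs twice, which is the overhead a single-pass design avoids.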

Abstract:

MapReduce is a programming model from Google that facilitates parallel processing on a cluster of thousands of commodity computers. The success of MapReduce in cluster environments has motivated several studies of implementing MapReduce on a graphics processing unit (GPU), but generally focusing on the NVIDIA GPU. Our investigation reveals that the design and mapping of the MapReduce framework needs to be revisited for AMD GPUs due to their notable architectural differences from NVIDIA GPUs. For instance, current state-of-the-art MapReduce implementations employ atomic operations to coordinate the execution of different threads. However, atomic operations can implicitly cause inefficient memory access, and in turn, severely impact performance. In this paper, we propose StreamMR, an OpenCL MapReduce framework optimized for AMD GPUs. With efficient atomic-free algorithms for output handling and intermediate result shuffling, StreamMR is superior to atomic-based MapReduce designs and can outperform existing atomic-free MapReduce implementations by nearly five-fold on an AMD Radeon HD 5870.

BibTeX:

@inproceedings{Elteir:2011ec,
    author = {Elteir, M and Lin, Heshan and Feng, Wu-chun and Scogland, T},
    title = {{StreamMR: An Optimized MapReduce Framework for AMD GPUs}},
    booktitle = {Parallel and Distributed Systems (ICPADS), 2011 IEEE 17th
    International Conference on},
    year = {2011},
    pages = {364--371},
    publisher = {IEEE Computer Society}
}

By: M Elteir, Heshan Lin, Wu-chun Feng, and T Scogland

In: 2011 IEEE 17th International Conference on Parallel and Distributed Systems (ICPADS), pp. 364-371