Current publication:

Programmability and Performance of Heterogeneous Platforms


In work that could be considered a continuation of the architecture-specific optimization analysis of GEM, Konstantinos Krommydas and I evaluate programmability-performance tradeoffs across three architectures: an Intel CPU, an Intel Xeon Phi, and an NVIDIA Kepler GPU. Some of the results were surprising, not least that, when fully optimized, the GPU core code ended up more readable than the highly optimized CPU code.

Abstract:

General-purpose computing on an ever-broadening array of parallel devices has led to an increasingly complex and multi-dimensional landscape with respect to programmability and performance optimization. The growing diversity of parallel architectures presents many challenges to the domain scientist, including device selection, programming model, and level of investment in optimization. All of these choices influence the balance between programmability and performance. In this paper, we characterize the performance achievable across a range of optimizations, along with their programmability, for multi- and many-core platforms – specifically, an Intel Sandy Bridge CPU, Intel Xeon Phi co-processor, and NVIDIA Kepler K20 GPU – in the context of an n-body, molecular-modeling application called GEM. Our systematic approach to optimization delivers implementations with speed-ups of 194.98×, 885.18×, and 1020.88× on the CPU, Xeon Phi, and GPU, respectively, over the naïve serial version. Beyond the speed-ups, we characterize the incremental optimization of the code from naïve serial to fully hand-tuned on each platform through four distinct phases of increasing complexity to expose the strengths and weaknesses of the programming models offered by each platform.

Recent Blog Entries

Raise a Single Window by Title on Mac OS X

Update: Not three full days after posting this, I found it does not work on the OS X Mavericks beta. See the bottom for the updated...

N-Dimensional Array Allocation in C

As languages go, C is not well known for its support of multi-dimensional arrays. While recent standards updates in the C99 and C11 releases have...

Finding the GMail URL scheme for iOS: Part 2

Back in January I went on a quest looking for a good way to open gmail URLs on iOS. At first the attempt focused on...

OpenCL error checking

OpenCL is many things, but it is not the easiest programming model to check errors in, as it does not offer a conversion from error...

Recent Publications


Trends in energy-efficient computing: A perspective from the Green500

Balaji Subramaniam took point on our annual analysis of the Green500 this year, and reached out to Winston Saunders to include the Exascalar metric and...

The Green500 list: escapades to exascale

The most recent installment of our annual analysis of the Green500 list is appearing in ISC this year instead of HPPAC. As we collect more...

Heterogeneous Task Scheduling for Accelerated OpenMP

The final camera ready version of our paper “Heterogeneous Task Scheduling for Accelerated OpenMP” is finally in. This paper was a breaking point for me,...

OpenCL and the 13 Dwarfs: A Work in Progress

Our first publication discussing the OpenCL and the 13 Dwarfs benchmark suite; I'm glad to have a tangible artifact from this now. Keep a look out...

StreamMR: An Optimized MapReduce Framework for AMD GPUs

I’m rather fond of this work. It’s in direct opposition to the claims made in the original Mars paper that their two-pass method was...