A Study of the Effects of Machine Geometry and Mapping on Distributed Transpose Performance

This paper describes a parallel strategy for extending the scalability of a small 3D FFT on thousands of Blue Gene/L processors. The approach is to execute the intermediate phases of the 3D FFT on smaller processor subsets. Performance measurements of the standalone 3D FFT using two communication protocols, MPI and BG/L ADE [19], are presented. While the MPI-based and BG/L ADE-based implementations of the 3D FFT exhibited qualitatively similar behavior, the BG/L ADE-based version has lower communication cost than the MPI-based version for small message sizes. Measurements also show that the proposed approach significantly improves the performance of Particle-Mesh-based N-body simulations at the limits of scalability.

By: M. Eleftheriou; B. Fitch; A. Rayshutskiy; T. J. C. Ward; P. Heidelberger; R. S. Germain

Published in: RC24333 in 2007


