Within the ScalES project, DKRZ is investigating ways to reduce load imbalances that prevent applications from scaling to high numbers of processors. One common source of load imbalance is gathering a distributed array, i.e., collecting the subarrays held by each process on a single process (the target).
A naive approach implements only the unavoidable data transfer from all processes to the target. It does not account for the overhead of inserting the subarray data into the global array, which is concentrated entirely at the target.
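The insertion overhead can be illustrated with a small serial simulation (a sketch, not the project's actual MPI implementation; the 2-D array shape, process grid, and block decomposition are illustrative assumptions). In a row-major global array, each process's block occupies many short, noncontiguous row segments, and the naive scheme makes the target write every one of them:

```python
# Serial simulation of the naive gather: every "process" sends its
# block to the target, which scatters each block's rows into the
# row-major global array -- all insertion work lands on one process.

R, C = 4, 6          # global array shape (assumed for illustration)
PR, PC = 2, 3        # process grid
br, bc = R // PR, C // PC

# each "process" (i, j) owns a contiguous local copy of its block;
# element values encode their global row-major position for checking
blocks = {(i, j): [[(i*br + r) * C + (j*bc + c) for c in range(bc)]
                   for r in range(br)]
          for i in range(PR) for j in range(PC)}

def naive_gather(blocks):
    """Target inserts every block row by row (noncontiguous writes)."""
    global_flat = [None] * (R * C)
    for (i, j), block in blocks.items():
        for r in range(br):
            row = i * br + r
            global_flat[row*C + j*bc : row*C + (j+1)*bc] = block[r]
    return global_flat
```

Every one of the PR*PC*br short slice assignments is executed by the target alone, which is exactly the serialized overhead described above.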
Another approach reduces this overhead by assembling intermediate subarrays that are contiguous within the target's array. These can be inserted directly using a collective MPI routine. Since the intermediate subarrays can be assembled in parallel, the overhead is distributed over several processes, whose optimal number depends on the latency of the communication.
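The two-phase scheme can be sketched in the same serial-simulation style (again an illustrative model, not the actual MPI code; shapes and decomposition are assumptions). In phase one, one process per grid row interleaves that row's blocks into a buffer that is contiguous in the row-major global array; in phase two, the target only concatenates these contiguous chunks, analogous to a single collective such as MPI_Gatherv:

```python
# Serial simulation of the two-phase gather over a 2-D block
# decomposition (illustrative shapes and process grid).

R, C = 4, 6          # global array shape
PR, PC = 2, 3        # process grid
br, bc = R // PR, C // PC

# element values encode their global row-major position for checking
blocks = {(i, j): [[(i*br + r) * C + (j*bc + c) for c in range(bc)]
                   for r in range(br)]
          for i in range(PR) for j in range(PC)}

def two_phase_gather(blocks):
    # Phase 1 (done in parallel in the real scheme): one process per
    # grid row assembles an intermediate buffer covering full global
    # rows, which is contiguous in the row-major global array.
    intermediates = []
    for i in range(PR):
        buf = []
        for r in range(br):
            for j in range(PC):
                buf.extend(blocks[(i, j)][r])
        intermediates.append(buf)          # contiguous chunk of length br*C
    # Phase 2: the target merely concatenates contiguous chunks,
    # so its per-element insertion work disappears.
    flat = []
    for buf in intermediates:
        flat.extend(buf)
    return flat
```

The reshuffling cost now sits in phase one, spread over PR assembling processes, while the target performs only PR contiguous copies.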
This results in a two-phase gather that, e.g., for a 192x96x47 double-precision array distributed over 8x4 processes, is one order of magnitude faster than the naive approach (measured on an IBM Power6 p575 running AIX). With higher parallelization, performance can be improved further by adapting the procedure to the usually nonuniform communication properties of the interconnect, e.g., shared-memory communication for the first phase and InfiniBand communication for the second.