Effect of compact datatypes on MPIOM boundsexchange (TP04L40 with 16x8 tasks, ST mode)

My current implementation of the MPIOM boundsexchange using yaxt (to be committed to the trunk branch) shows the following performance for the 2d and 3d exchange, measured on blizzard (version info will be supplied later):

original boundsexchange:

 :-------------------------------------------------------------------------------
 : Timer report ( tasknum * threadnum =    128)
 : label                       :       t_min     t_avg     t_max     t_sum lbe[%]
 :-------------------------------------------------------------------------------
 : bex_3d_ref                  :      1.5099    1.9462    2.3534    249.11  82.70
 : bex_2d_ref                  :      1.0647    1.2523    1.3699    160.29  91.42
 :-------------------------------------------------------------------------------

yaxt boundsexchange without compact datatype generators in src/xt_mpi.c:

 :-------------------------------------------------------------------------------
 : Timer report ( tasknum * threadnum =    128)
 : label                       :       t_min     t_avg     t_max     t_sum lbe[%]
 :-------------------------------------------------------------------------------
 : bex_3d_yaxt                 :      1.8745    2.1560    3.2270    275.97  66.81
 : bex_2d_yaxt                 :      1.7461    2.1992    2.5191    281.50  87.30
 :-------------------------------------------------------------------------------

yaxt boundsexchange with compact datatype generators in src/xt_mpi.c:

 :-------------------------------------------------------------------------------
 : Timer report ( tasknum * threadnum =    128)
 : label                       :       t_min     t_avg     t_max     t_sum lbe[%]
 :-------------------------------------------------------------------------------
 : bex_2d_yaxt                 :      1.7271    2.2136    2.5461    283.34  86.94
 : bex_3d_yaxt                 :      1.8212    2.0893    2.3056    267.43  90.62
 :-------------------------------------------------------------------------------
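The three reports can be put side by side as ratios relative to the original exchange. The following is a small Python sketch that uses only the t_avg and t_max values from the tables above; the dictionary layout and the function name `ratios` are illustrative and not part of MPIOM or yaxt.

```python
# (t_avg, t_max) copied from the timer reports above
ref = {"2d": (1.2523, 1.3699), "3d": (1.9462, 2.3534)}
yaxt_plain = {"2d": (2.1992, 2.5191), "3d": (2.1560, 3.2270)}
yaxt_compact = {"2d": (2.2136, 2.5461), "3d": (2.0893, 2.3056)}


def ratios(run, baseline):
    """Return {kind: (t_avg ratio, t_max ratio)} of run relative to baseline."""
    return {
        kind: (run[kind][0] / baseline[kind][0], run[kind][1] / baseline[kind][1])
        for kind in baseline
    }


for name, run in (("without compact", yaxt_plain), ("with compact", yaxt_compact)):
    for kind, (r_avg, r_max) in ratios(run, ref).items():
        print(f"{name:16s} {kind}: t_avg x{r_avg:.2f}, t_max x{r_max:.2f}")
```

With these inputs the 3d t_max ratio drops from about 1.37 without compact datatypes to about 0.98 with them, while the 2d ratios stay near 1.8 in both yaxt runs.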

Summary:

In both yaxt runs the 2d exchange is clearly slower than the original exchange (t_avg of about 2.2 versus 1.25). This is due to the higher number of communication partners (8 instead of 4) in my current single-phase yaxt exchange. The original exchange uses more phases in order to avoid communicating the small sub-domain corners.
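The difference in partner count can be illustrated with a short Python sketch. It assumes a plain, non-periodic 16x8 task grid and a hypothetical helper `partners`; the actual MPIOM decomposition (for example its handling of periodic or tripolar boundaries) is not modelled here.

```python
def partners(ix, iy, nx, ny, with_corners):
    """Return the task coordinates that task (ix, iy) exchanges with.

    with_corners=True models a single-phase exchange that sends corner
    halos directly to the diagonal neighbors; with_corners=False models
    an exchange that only talks to the four edge neighbors.
    """
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if with_corners:
        offsets += [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    result = set()
    for dx, dy in offsets:
        jx, jy = ix + dx, iy + dy
        # Assumption: no periodic wrap-around, tasks outside the grid do not exist.
        if 0 <= jx < nx and 0 <= jy < ny:
            result.add((jx, jy))
    return result


NX, NY = 16, 8  # task grid of the TP04L40 runs above
single_phase = partners(5, 3, NX, NY, with_corners=True)
multi_phase = partners(5, 3, NX, NY, with_corners=False)
print(len(single_phase), len(multi_phase))
```

For an interior task this gives 8 partners for the single-phase variant and 4 for the edge-only variant. In a multi-phase scheme the corner values can travel in a later phase as part of an edge message, so no direct message to a diagonal neighbor is needed.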

For the 3d exchange the latency effect is less important (in this example), and the yaxt implementation with compact datatypes is about as fast as the original (t_max 2.31 versus 2.35, t_avg 2.09 versus 1.95). Without compact datatypes the IBM MPI seems to carry extra overhead (possibly due to a larger amount of metadata needed to describe the user data access in the gather/scatter phase of pre-/post-communication).
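To illustrate why a compact description can reduce metadata, here is a minimal Python sketch of the general idea. It is not the code in src/xt_mpi.c: the function `compress_displacements` and the 10x4x40 field layout are assumptions made for this example. A regular access pattern is described by a few (start, stride, count) triples, similar in spirit to MPI_Type_vector, instead of one displacement per element as in MPI_Type_create_indexed_block.

```python
def compress_displacements(displs):
    """Greedily group element displacements into (start, stride, count) runs."""
    runs = []
    i, n = 0, len(displs)
    while i < n:
        start = displs[i]
        if i + 1 < n:
            stride = displs[i + 1] - start
            count = 2
            # Extend the run while the spacing stays constant.
            while i + count < n and displs[i + count] - displs[i + count - 1] == stride:
                count += 1
        else:
            stride, count = 0, 1  # single trailing element
        runs.append((start, stride, count))
        i += count
    return runs


def expand(runs):
    """Inverse of compress_displacements, used to check the round trip."""
    return [start + k * stride for start, stride, count in runs for k in range(count)]


# Hypothetical local field: 10 x 4 points per level, 40 levels, x index fastest.
NI, NJ, NK = 10, 4, 40
east_column = [k * NI * NJ + j * NI + (NI - 1) for k in range(NK) for j in range(NJ)]
south_row = [k * NI * NJ + i for k in range(NK) for i in range(NI)]

print(len(east_column), len(compress_displacements(east_column)))
print(len(south_row), len(compress_displacements(south_row)))
```

In this layout the 160 displacements of the east halo column collapse into a single run, and the 400 displacements of the south halo row into 40 runs of 10 contiguous elements each. Whether and how much a given MPI library benefits from such a description is implementation dependent.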

Outlook:

At the moment we do not consider it worthwhile to change the single-phase yaxt exchange, since we are planning to replace the regular decomposition grid of MPIOM in order to adapt to the new workload profile after we dropped the dry-point iterations. After that, the number of neighbors will change anyway. The big performance potential of yaxt to be exploited next is the aggregation of communication steps. We will then compare the exchange performance again.