Exchange with neighbouring processors (lbclnk.F90, lib_mpp.F90)

For massively parallel processing (mpp), a domain decomposition method is used. The basic idea of the method is to split the large computation domain of a numerical experiment into several smaller domains and to solve the set of equations as a collection of independent local problems. Each processor has its own local memory and computes the model equations over a subdomain of the whole model domain. The subdomain boundary conditions are specified through communications between processors, organized by explicit statements (message passing method).

A major advantage of this method is that it requires few modifications of the original FORTRAN code. From the modeller's point of view, each subdomain running on a processor is identical to the "mono-domain" code. In addition, the programmer manages the communications between subdomains, and the code becomes faster as the number of processors increases. The porting of the OPA code onto an iPSC860 was achieved during Guyon's PhD [Guyon et al. 1994, 1995] in collaboration with CETIIS and ONERA. The implementation in an operational context and the performance studies on Cray T3D and T3E computers were carried out in collaboration with IDRIS and CNRS. The present implementation is largely inspired by Guyon's work [Guyon 1995].

The parallelization strategy is defined by the physical characteristics of the ocean model. Second order finite difference schemes lead to local discrete operators that depend at most on one neighbouring point. The only non-local computations concern the vertical physics (implicit diffusion, turbulent closure scheme, ...), which involve the whole water column, and the solution of the elliptic equation associated with the surface pressure gradient computation, which involves the whole horizontal domain. Therefore, a pencil strategy is used for the data sub-structuring: the 3D initial domain is laid out on local processor memories following a 2D horizontal topological splitting. Each sub-domain computes its own surface and bottom boundary conditions and has a side-wall overlapping interface which defines the lateral boundary conditions for computations in the inner sub-domain. The overlapping area consists of the two rows at each edge of the sub-domain.

After a computation, a communication phase starts: each processor sends to its neighbouring processors the updated values of the points corresponding to the interior overlapping area ($i.e.$ the innermost of the two overlapping rows). The communication is done through the Message Passing Interface (MPI). The data exchanges between processors are required at the very place where lateral domain boundary conditions are set in the mono-domain computation: the lbc_lnk routine (found in the lbclnk.F90 module), which manages such conditions, is interfaced with routines found in the lib_mpp.F90 module when running on an MPP computer ($i.e.$ when key_mpp_mpi is defined). It has to be pointed out that when using the MPP version of the model, the east-west cyclic boundary condition is applied implicitly, whilst the south-symmetric boundary condition option is not available.
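
For orientation, the fragment below sketches where such an exchange appears in the code: once a field has been updated over the interior of each subdomain, a single call to the generic lbc_lnk interface fills the overlapping rows. This is a fragment rather than a standalone program, and the argument list shown (a grid-point type and a sign to apply across special boundaries) reflects the usual form of the interface and is given only as an illustration.

\begin{verbatim}
use lbclnk          ! generic lbc_lnk interface (lbclnk.F90); when key_mpp_mpi is
                    ! defined it is implemented by the routines of lib_mpp.F90
! ...
! computation of a field over the interior of the subdomain, e.g. ua(:,:,:)
! ...
call lbc_lnk( ua, 'U', -1. )   ! fill the overlapping rows of ua: 'U' gives the
                               ! grid-point type, -1. the sign applied across
                               ! symmetric or north-fold boundaries
\end{verbatim}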

Figure 8.5: Positioning of a sub-domain when massively parallel processing is used.
\includegraphics[width=0.90\textwidth]{Fig_mpp}

In the standard version of NEMO, the splitting is regular and arithmetic. The i-axis is divided by jpni and the j-axis by jpnj, for a number of processors jpnij most often equal to $jpni \times jpnj$ (parameters set in the nammpp namelist). Each processor is independent: without message passing or synchronisation, each program runs on its own and accesses only its own local memory. For this reason, the main model dimensions are now the local dimensions of the subdomain (pencil), named jpi, jpj and jpk. These dimensions include the internal domain and the overlapping rows. The number of rows to exchange (known as the halo) is usually set to one (jpreci=1, in par_oce.F90). The whole domain dimensions are named jpiglo, jpjglo and jpk. The relationship between the whole domain and a sub-domain is:

$\displaystyle jpi = ( jpiglo - 2 \times jpreci + (jpni-1) ) / jpni + 2 \times jpreci$
$\displaystyle jpj = ( jpjglo - 2 \times jprecj + (jpnj-1) ) / jpnj + 2 \times jprecj$   (8.3)

where jpni and jpnj are the numbers of processors along the i- and j-axis respectively, and the divisions are integer divisions.
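
As a worked illustration of (8.3), assume (for the sake of example only) a global size $jpiglo = 182$ with $jpni = 4$ and the default halo $jpreci = 1$: the integer division gives $jpi = (182 - 2 + 3)/4 + 2 = 45 + 2 = 47$, $i.e.$ each subdomain holds 45 interior points plus the two overlapping rows along the i-axis.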

One also defines the variables nldi and nlei, which correspond to the internal domain bounds, and the variables nimpp and njmpp, which give the position of the (1,1) grid-point of the subdomain in the global domain. An element of $T_{l}$, a local array defined over the subdomain, corresponds to an element of $T_{g}$, the global array defined over the whole domain, through the relationship:

$\displaystyle T_{g} (i+nimpp-1,j+njmpp-1,k) = T_{l} (i,j,k),$ (8.4)

with $1 \leq i \leq jpi$, $1 \leq j \leq jpj$, and $1 \leq k \leq jpk$.
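
To make relation (8.4) concrete, the small helper below (a hypothetical routine, not part of NEMO) converts a local index pair into the corresponding global indices using nimpp and njmpp:

\begin{verbatim}
! Hypothetical helper illustrating relation (8.4): the local point (i,j) of a
! subdomain whose (1,1) grid-point sits at (nimpp,njmpp) in the global domain
! corresponds to the global point (ig,jg).
pure subroutine loc2glo( i, j, nimpp, njmpp, ig, jg )
   integer, intent(in ) :: i, j            ! local indices, 1 <= i <= jpi, 1 <= j <= jpj
   integer, intent(in ) :: nimpp, njmpp    ! global position of the local (1,1) point
   integer, intent(out) :: ig, jg          ! corresponding global indices
   ig = i + nimpp - 1
   jg = j + njmpp - 1
end subroutine loc2glo
\end{verbatim}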

Processors are numbered from 0 to $jpnij-1$; this number is saved in the variable nproc. In the standard version, a processor has no more than four neighbouring processors, named nono (for north), noea (east), noso (south) and nowe (west), and two variables, nbondi and nbondj, indicate the relative position of the processor within the processor grid.

During the simulation, processors exchange data with their neighbours. Where a neighbour exists, the processor receives that neighbour's values into its own overlapping row, and sends the values of its internal domain that correspond to the overlapping row of the other processor, as sketched below.
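
The following self-contained sketch (not NEMO code; the program name, the field t, the local sizes and the 1D splitting along j are all assumptions made for illustration) shows the mechanics of such an exchange with MPI_Sendrecv: each process sends the innermost of its overlapping rows to its north and south neighbours and receives their values into its own halo rows, with MPI_PROC_NULL standing in for a missing neighbour.

\begin{verbatim}
program halo_sketch
   use mpi
   implicit none
   integer, parameter :: jpi = 10, jpj = 8       ! local sizes incl. halo rows (illustrative)
   double precision   :: t(jpi, jpj)             ! a local 2D field
   integer            :: ierr, nproc, npes, noso, nono
   integer            :: istatus(MPI_STATUS_SIZE)

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, nproc, ierr)
   call MPI_Comm_size(MPI_COMM_WORLD, npes,  ierr)

   ! neighbour ranks for a simple 1D splitting along j;
   ! MPI_PROC_NULL makes MPI_Sendrecv a no-op where no neighbour exists
   noso = merge(nproc-1, MPI_PROC_NULL, nproc > 0)        ! southern neighbour, if any
   nono = merge(nproc+1, MPI_PROC_NULL, nproc < npes-1)   ! northern neighbour, if any

   t(:,:) = dble(nproc)                          ! dummy field: each rank stores its own number

   ! send the innermost southern interior row (j=2) southward and receive the
   ! northern neighbour's row into the northern halo row (j=jpj), then the reverse
   call MPI_Sendrecv(t(:,2),     jpi, MPI_DOUBLE_PRECISION, noso, 1,  &
                     t(:,jpj),   jpi, MPI_DOUBLE_PRECISION, nono, 1,  &
                     MPI_COMM_WORLD, istatus, ierr)
   call MPI_Sendrecv(t(:,jpj-1), jpi, MPI_DOUBLE_PRECISION, nono, 2,  &
                     t(:,1),     jpi, MPI_DOUBLE_PRECISION, noso, 2,  &
                     MPI_COMM_WORLD, istatus, ierr)

   call MPI_Finalize(ierr)
end program halo_sketch
\end{verbatim}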

The NEMO model computes the equation terms with the help of mask arrays (0 on land points and 1 on sea points). This approach is easy to read and very efficient on computers with a vector architecture. However, on a scalar processor, computations over the land regions become an unnecessary expense in terms of CPU time. The problem is worse for complex configurations with a realistic bathymetry, such as the global ocean, where more than 50% of the points are land points. For this reason, a pre-processing tool can be used to choose an mpp domain decomposition that maximises the number of land-only processors, which can then be eliminated (Fig. 8.6) (for example, the mpp_optimiz tools, available from the DRAKKAR web site). This optimisation depends on the specific bathymetry employed. The user then chooses optimal parameters jpni, jpnj and jpnij with $jpnij < jpni \times jpnj$, leading to the elimination of $jpni \times jpnj - jpnij$ land processors. When those parameters are specified in the nammpp namelist, the algorithm in the inimpp2 routine sets each processor's parameters (nbondi, nbondj, nono, noea, ...) so that the land-only processors are not taken into account.
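
As an illustration, for the CLIPPER example of Fig. 8.6 the corresponding nammpp entries would read as sketched below; only the decomposition parameters discussed here are shown, and the rest of the namelist is omitted.

\begin{verbatim}
&nammpp        !   massively parallel processing
   jpni  =   9      ! number of subdomains along the i-axis
   jpnj  =  20      ! number of subdomains along the j-axis
   jpnij = 128      ! number of subdomains actually used (land-only ones eliminated)
/
\end{verbatim}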

When land processors are eliminated, the values at the corresponding locations in the model output files are undefined. Note that this is a problem for the meshmask file, which needs to be defined over the whole domain. Therefore, the user should not eliminate land processors when creating a meshmask file ($i.e.$ when setting a non-zero value to nn_msh).

Figure 8.6: Example of an Atlantic domain defined for the CLIPPER project. The initial grid is composed of $773 \times 1236$ horizontal points. (a) the domain is split into $9 \times 20$ subdomains (jpni=9, jpnj=20); 52 of these subdomains are land areas. (b) the 52 land subdomains are eliminated (white rectangles) and the number of processors actually used during the computation is jpnij=128.
\includegraphics[width=0.90\textwidth]{Fig_mppini2}
