For massively parallel processing (mpp), a domain decomposition method is used. The basic idea of the method is to split the large computation domain of a numerical experiment into several smaller domains and solve the set of equations by addressing independent local problems. Each processor has its own local memory and computes the model equation over a subdomain of the whole model domain. The subdomain boundary conditions are specified through communications between processors which are organized by explicit statements (message passing method).
A big advantage is that the method does not need many modifications of the initial FORTRAN code. From the modeller's point of view, each sub domain running on a processor is identical to the "monodomain" code. In addition, the programmer manages the communications between subdomains, and the code is faster when the number of processors is increased. The porting of OPA code on an iPSC860 was achieved during Guyon's PhD [Guyon et al. 1994, 1995] in collaboration with CETIIS and ONERA. The implementation in the operational context and the studies of performance on a T3D and T3E Cray computers have been made in collaboration with IDRIS and CNRS. The present implementation is largely inspired by Guyon's work [Guyon 1995].
The parallelization strategy is defined by the physical characteristics of the ocean model. Second order finite difference schemes lead to local discrete operators that depend at the very most on one neighbouring point. The only nonlocal computations concern the vertical physics (implicit diffusion, turbulent closure scheme, ...) (delocalization over the whole water column), and the solving of the elliptic equation associated with the surface pressure gradient computation (delocalization over the whole horizontal domain). Therefore, a pencil strategy is used for the data substructuration : the 3D initial domain is laid out on local processor memories following a 2D horizontal topological splitting. Each subdomain computes its own surface and bottom boundary conditions and has a side wall overlapping interface which defines the lateral boundary conditions for computations in the inner subdomain. The overlapping area consists of the two rows at each edge of the subdomain. After a computation, a communication phase starts: each processor sends to its neighbouring processors the update values of the points corresponding to the interior overlapping area to its neighbouring subdomain ( the innermost of the two overlapping rows). The communication is done through the Message Passing Interface (MPI). The data exchanges between processors are required at the very place where lateral domain boundary conditions are set in the monodomain computation : the lbc_lnk routine (found in lbclnk.F90 module) which manages such conditions is interfaced with routines found in lib_mpp.F90 module when running on an MPP computer ( when key_ mpp_mpi defined). It has to be pointed out that when using the MPP version of the model, the eastwest cyclic boundary condition is done implicitly, whilst the southsymmetric boundary condition option is not available.
In the standard version of NEMO, the splitting is regular and arithmetic.
The iaxis is divided by jpni and the jaxis by jpnj for a number of processors
jpnij most often equal to
(parameters set in
nammpp namelist). Each processor is independent and without message passing
or synchronous process, programs run alone and access just its own local memory.
For this reason, the main model dimensions are now the local dimensions of the subdomain (pencil)
that are named jpi, jpj, jpk. These dimensions include the internal
domain and the overlapping rows. The number of rows to exchange (known as
the halo) is usually set to one (jpreci=1, in par_oce.F90). The whole domain
dimensions are named jpiglo, jpjglo and jpk. The relationship between
the whole domain and a subdomain is:
One also defines variables nldi and nlei which correspond to the internal domain bounds, and the variables nimpp and njmpp which are the position of the (1,1) gridpoint in the global domain. An element of , a local array (subdomain) corresponds to an element of , a global array (whole domain) by the relationship:
Processors are numbered from 0 to , the number is saved in the variable nproc. In the standard version, a processor has no more than four neighbouring processors named nono (for north), noea (east), noso (south) and nowe (west) and two variables, nbondi and nbondj, indicate the relative position of the processor :
The NEMO model computes equation terms with the help of mask arrays (0 on land points and 1 on sea points). It is easily readable and very efficient in the context of a computer with vectorial architecture. However, in the case of a scalar processor, computations over the land regions become more expensive in terms of CPU time. It is worse when we use a complex configuration with a realistic bathymetry like the global ocean where more than 50 % of points are land points. For this reason, a preprocessing tool can be used to choose the mpp domain decomposition with a maximum number of only land points processors, which can then be eliminated (Fig. 8.6) (For example, the mpp_optimiz tools, available from the DRAKKAR web site). This optimisation is dependent on the specific bathymetry employed. The user then chooses optimal parameters jpni, jpnj and jpnij with , leading to the elimination of land processors. When those parameters are specified in nammpp namelist, the algorithm in the inimpp2 routine sets each processor's parameters (nbound, nono, noea,...) so that the landonly processors are not taken into account.
When land processors are eliminated, the value corresponding to these locations in the model output files is undefined. Note that this is a problem for the meshmask file which requires to be defined over the whole domain. Therefore, user should not eliminate land processors when creating a meshmask file ( when setting a nonzero value to nn_msh).

Gurvan Madec and the NEMO Team
NEMO European Consortium20170217