$NLOGPRT: The first, second and third numbers on the line below
this keyword refer to (i) the debug verbosity, (ii) internal timing statistics
and (iii) component load balancing analysis. The information is written by OASIS3-MCT
for each component and (optionally) for each process.
The first number (which can be modified at runtime with the oasis_set_debug routine, see section
2.2.9) may be:
- 0 : production mode. One file debug.root.xx is opened by the master process of
each component and one file debug_notroot.xx is opened for all the
other processes of each component, to write only error information.
- 1 : one file debug.root.xx is opened by the master process of
each component to write information equivalent to level 10 (see
below) and also to write memory usage information;
one file debug_notroot.xx is opened for all the other
processes of each component to write error information.
- 2 : one file debug.yy.xxxxxx is opened by each process of each
component (with “yy” being the component number and “xxxxxx” the process number)
to write normal production diagnostics and memory usage information.
- 5 : as for 2, plus some initial debug information
- 10: as for 5, plus the routine calling tree
- 12: as for 10, plus some routine calling notes
- 15: as for 12, with even more debug diagnostics
- 20: as for 15, plus some extra runtime analysis
- 30: full debug information
The second number defines how time statistics are written out to
file comp_name.timers_xxxx (with comp_name being the component name, see section 2.2.2); it can be:
- 0 : nothing is calculated or written.
- 1 : some time statistics are calculated and written in a
single file by processor 0, together with the min and the max
times over all the processors.
- 2 : some time statistics are calculated and each processor
writes its own file; processor 0 also writes the min and the max
times over all the processors in its file.
- 3 : some time statistics are calculated and each processor
writes its own file; processor 0 also writes in its file the min
and the max times over all the processors, as well as
all the results for each processor.
For more information on the time statistics written out, see section
6.4.2.
The third number (new in OASIS3-MCT_5.0) can be set to 1 to activate a load balancing diagnostic.
Efficient use of the allocated computing resources in a coupled system requires harmonising the speed of the components. This operation, called load balancing, is often neglected, either because of the apparent abundance of resources or because of practical difficulties.
To facilitate this work, OASIS3-MCT can output the full timeline of all coupling-related events, for any of the allocated resources. This timeline is saved in one netCDF file per coupled component (timeline_XXX_component.nc). It provides the complete sequence of operations related to the coupling (field exchange through MPI, field output to disk, field interpolation and mapping, field reading from disk, restart writing, initialisation and termination phases of the OASIS3-MCT setup) so that any simulation slowdown linked to the use of the OASIS library can be identified.
The analysis of the coupling field exchanges, amongst all the
coupling events, not only identifies the resources wasted by components that are recurrently waiting for their coupling fields, but also reveals other bottlenecks such as disk access, OS interruptions or model internal load imbalance. The full picture of these events makes an optimum load balancing possible, even for the most complex configurations.
For detailed information on load balancing analysis and on timeline visualisation, see (Maisonnave et al 2020) and (Piacentini and Maisonnave 2020), respectively.
In addition to the timeline, computing information (time to solution, speed, cost) and a synthesis of the time spent in MPI routines for each coupled component can also help, in the simpler cases, to allocate resources in a balanced way (see file load_balancing_info.txt).
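As an illustration of the $NLOGPRT description above, a line such as "2 1 1" below the keyword would request one debug file per process, timing statistics written by processor 0 only, and the load balancing diagnostic. The minimal Python sketch below is not part of OASIS3-MCT; it only shows how such a line could be read back from the text of the configuration file. The helper name parse_nlogprt, the treatment of lines starting with "#" as comments, and the padding of missing trailing numbers with 0 are assumptions made for this example.

```python
def parse_nlogprt(config_text):
    """Return (debug_level, timer_level, load_balance_flag) as integers.

    Hypothetical helper: reads the first non-blank, non-comment line
    that follows the $NLOGPRT keyword in the given text.
    """
    lines = config_text.splitlines()
    for i, line in enumerate(lines):
        if line.strip().startswith("$NLOGPRT"):
            for following in lines[i + 1:]:
                stripped = following.strip()
                # Assumption: blank lines and '#' lines carry no values.
                if not stripped or stripped.startswith("#"):
                    continue
                values = [int(token) for token in stripped.split()]
                # Assumption: numbers that are not given count as 0.
                values += [0] * (3 - len(values))
                return tuple(values[:3])
    raise ValueError("no values found for $NLOGPRT")


example = """
# logging configuration (illustrative fragment)
 $NLOGPRT
   2 1 1
"""
debug_level, timer_level, load_balance = parse_nlogprt(example)
print(debug_level, timer_level, load_balance)
```

With the fragment shown, the three values map to debug verbosity 2, timer option 1 and load balancing diagnostic 1, in the order given in the text.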
$NWGTOPT : Optional (new in OASIS3-MCT_4.0); on the line below this keyword is a character string
that indicates how to handle bad remapping weights. There are four options:
abort_on_bad_index, ignore_bad_index, ignore_bad_index_silently, and
use_bad_index. Bad weights are defined as weights in the mapping file for which either
the source or destination index is out of bounds relative to the number of grid cells
in the grid; in that case, the weight references a grid cell that does not physically
exist. Note that an index equal to zero will not be considered as a bad index if the associated weight
is also zero. There are other situations where the value of the actual mapping weight is
scientifically incorrect, but this is not easy to detect and is not dealt with in OASIS3-MCT.
- abort_on_bad_index will write error messages to the log files and abort if a bad weight
index is detected. This is the default option.
- ignore_bad_index will write an error message and then remove bad
weights internally before continuing.
- ignore_bad_index_silently will remove bad weights and continue without writing an error
message.
- use_bad_index will attempt to keep bad weights in the interpolation computation,
but this can result in memory corruption, silent dropping of weights, and incorrect results; this is not recommended.
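The definition of a bad weight and the four options above can be sketched in a few lines of Python. This is only an illustration of the rules as stated here, not the OASIS3-MCT implementation: the function names, the representation of a link as a (source index, destination index, weight) tuple, and the use of 1-based indices are assumptions. For use_bad_index the sketch simply returns the links unchanged and does not reproduce the side effects mentioned above.

```python
import warnings

OPTIONS = (
    "abort_on_bad_index",
    "ignore_bad_index",
    "ignore_bad_index_silently",
    "use_bad_index",
)


def index_ok(index, n_cells, weight):
    """An index is accepted if it is within 1..n_cells (assumed 1-based),
    or if it is zero and the associated weight is also zero."""
    return 1 <= index <= n_cells or (index == 0 and weight == 0.0)


def is_bad_link(src, dst, weight, n_src, n_dst):
    """True if the source or the destination index is out of bounds."""
    return not (index_ok(src, n_src, weight) and index_ok(dst, n_dst, weight))


def apply_option(links, n_src, n_dst, option="abort_on_bad_index"):
    """Return the links that would be kept under the chosen option."""
    if option not in OPTIONS:
        raise ValueError(f"unknown option: {option}")
    bad = [link for link in links if is_bad_link(*link, n_src, n_dst)]
    if not bad or option == "use_bad_index":
        return list(links)
    if option == "abort_on_bad_index":
        raise RuntimeError(f"{len(bad)} bad weight index(es) detected")
    if option == "ignore_bad_index":
        warnings.warn(f"removing {len(bad)} bad weight(s)")
    # Both 'ignore' options drop the bad links before continuing.
    return [link for link in links if not is_bad_link(*link, n_src, n_dst)]


links = [(1, 1, 0.5), (3, 1, 0.5), (0, 2, 0.0)]  # source grid has 2 cells
kept = apply_option(links, n_src=2, n_dst=2, option="ignore_bad_index_silently")
print(kept)
```

In this example the second link is dropped because its source index exceeds the source grid size, while the third is kept because a zero index with a zero weight is not treated as bad.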
Note that the ability to check mapping files at runtime in OASIS3-MCT is limited. It is always
recommended that mapping files be analyzed offline before long production runs are carried out.
Checks can be done to make sure the source and destination indices are valid, that weight values
are reasonable (for instance, between 0 and 1, although this will depend on the mapping method),
and that the sum of weights on each destination cell is reasonable (for instance, 1, in many cases).
In addition, offline tests can be run with analytical functions to verify conservation, gradient-
preserving features and other characteristics associated with the particular mapping approach.
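As a rough sketch of the offline checks on weight values and weight sums mentioned above, the Python function below takes links that have already been read from a mapping file. The function name, the (source index, destination index, weight) tuple layout, the default bounds of 0 and 1, the expected sum of 1 and the tolerance are all assumptions chosen for illustration; suitable values depend on the mapping method.

```python
from collections import defaultdict


def check_weights(links, lower=0.0, upper=1.0, expected_sum=1.0, tol=1e-6):
    """Report weights outside [lower, upper] and destination cells whose
    weights do not sum to expected_sum within the tolerance tol."""
    out_of_range = [link for link in links if not lower <= link[2] <= upper]

    # Accumulate the weights received by each destination cell.
    sums = defaultdict(float)
    for _src, dst, weight in links:
        sums[dst] += weight

    bad_sums = {
        dst: total
        for dst, total in sums.items()
        if abs(total - expected_sum) > tol
    }
    return out_of_range, bad_sums


links = [(1, 1, 0.5), (2, 1, 0.5), (1, 2, 1.2), (2, 2, 0.3)]
out_of_range, bad_sums = check_weights(links)
print("weights out of range:", out_of_range)
print("destination cells with unexpected sums:", sorted(bad_sums))
```

Here destination cell 1 passes both checks, whereas destination cell 2 receives a weight above 1 and a total that differs from 1, so it is reported by both.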