array bound mismatch in mod_oasis_load_balancing.F90 (OASIS3-MCT5.2)
Posted by Anonymous at January 9 2025
Good morning
I've encountered a problem when activating the load balancing tool in OASIS3-MCT5.2 as part of EC-Earth4 (OpenIFS + NEMO etc).
Everything is fine when I use the following in namcouple:
$NLOGPRT
1 0
but when I activate the load balancing tool with
$NLOGPRT
1 0 1
the model runs fine but crashes at the last time step. I compiled with -fcheck=bounds in GCC (gfortran) and managed to trace the error back to:
0: At line 854 of file /nobackup/rossby22/sm_joakj/models_freja_gcc/ecearth4.1b/scripts/build/../../sources/oasis3-mct-5.2/lib/psmile/src/mod_oasis_load_balancing.F90
0: Fortran runtime error: Array bound mismatch for dimension 1 of array 'bk_avg' (21842/21850)
The line in question is:
bk_avg(:) = SUM ( tl_global_timer(:,:), dim=2) / REAL(mpi_size_local,4)
I added some print statements and found that tl_global_timer is allocated as size ievent x mpi_size_local, while bk_avg is allocated as size ievent.
However, the variable ievent changes between the two allocate statements, in my case from 21850 to 21842, causing the crash.
This seems like a bug to me, but it might not always lead to a crash, since a shape mismatch or out-of-bounds access in Fortran is only detected when run-time bounds checking is enabled (Intel generally seems more "forgiving" than GCC).
If I modify the code to force both arrays to have the same size, i.e.
ievent_old=SIZE(tl_global_timer, 1)
ALLOCATE(bk_minval(ievent_old), stat=ierror)
ALLOCATE(bk_maxval(ievent_old), stat=ierror)
ALLOCATE(bk_avg(ievent_old), stat=ierror)
the problem goes away and the load balancing tool seems to work.
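For anyone wanting to reproduce the pattern outside OASIS, here is a minimal standalone sketch. It is not the OASIS source: the program name and the variables timer, avg and nproc are hypothetical stand-ins for tl_global_timer, bk_avg and mpi_size_local, and the two values of ievent are simply the ones from my run. It assumes the only issue is that the counter changes between the two allocations.

```fortran
program bound_mismatch_sketch
  ! Hypothetical stand-alone illustration, not the OASIS code.
  implicit none
  integer, parameter :: nproc = 4          ! stands in for mpi_size_local
  integer :: ievent, ierror
  real(kind=4), allocatable :: timer(:,:)  ! stands in for tl_global_timer
  real(kind=4), allocatable :: avg(:)      ! stands in for bk_avg

  ! First allocation uses the event count at that moment.
  ievent = 21850
  allocate(timer(ievent, nproc), stat=ierror)
  if (ierror /= 0) stop 'allocation of timer failed'
  timer = 1.0

  ! The counter changes before the second allocation.
  ievent = 21842

  ! Sizing avg from ievent here would give 21842 elements, and the
  ! assignment below would then be a shape mismatch (reported by
  ! gfortran when compiled with -fcheck=bounds).
  ! Sizing it from the array actually being reduced avoids that.
  allocate(avg(size(timer, 1)), stat=ierror)
  if (ierror /= 0) stop 'allocation of avg failed'

  ! Average over the second (process) dimension.
  avg(:) = sum(timer(:,:), dim=2) / real(nproc, 4)

  print *, 'size(avg) =', size(avg), ' size(timer,1) =', size(timer, 1)
end program bound_mismatch_sketch
```

Replacing size(timer, 1) with ievent in the second allocate and compiling with gfortran -fcheck=bounds should reproduce the same kind of "Array bound mismatch" error as above.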
Could you comment on whether this really is a bug and if it's been addressed in a newer version?
I could make my solution a bit cleaner and submit it if needed.
Best wishes
Joakim Kjellsson, SMHI
PS. It is the OpenIFS executable that finally crashes, i.e. OASIS crashes when computing the timings for OpenIFS. NEMO, runoff mapper and XIOS do not encounter this issue.
Posted by Anonymous at January 9 2025
Hi Joakim,
This indeed looks like a bug and has not been fixed. Thanks for your fix, I will open a ticket (and fix it in the next official version).
Regards,
Sophie
Posted by Anonymous at January 10 2025
If I use the current case -1. . I will be fine & not ... Sophie ?
subhadeep.
Posted by Anonymous at January 10 2025
I do not understand your question. Please send me a personal email at valcke[at]cerfacs.fr