array bound mismatch in mod_oasis_load_balancing.F90 (OASIS3-MCT5.2)
Posted by Anonymous at January 9 2025
Good morning
I've encountered a problem when activating the load balancing tool in OASIS3-MCT5.2 as part of EC-Earth4 (OpenIFS + NEMO etc).
Everything is fine when I use the following in namcouple
$NLOGPRT
1 0
but when I activate the load balancing tool with
$NLOGPRT
1 0 1
the model runs fine but crashes at the last time step. I compiled with -fcheck=bounds in GCC (gfortran) and managed to trace the error back to:
0: At line 854 of file /nobackup/rossby22/sm_joakj/models_freja_gcc/ecearth4.1b/scripts/build/../../sources/oasis3-mct-5.2/lib/psmile/src/mod_oasis_load_balancing.F90
0: Fortran runtime error: Array bound mismatch for dimension 1 of array 'bk_avg' (21842/21850)
The line in question is:
bk_avg(:) = SUM ( tl_global_timer(:,:), dim=2) / REAL(mpi_size_local,4)
I added some print statements and found that tl_global_timer is allocated as size ievent x mpi_size_local, while bk_avg is allocated as size ievent.
However, the variable ievent changes between the two allocate statements, in my case from 21850 to 21842, causing the crash.
This seems like a bug to me, but it might not always lead to a crash, since out-of-bounds array accesses in Fortran are not checked at run time unless bounds checking is enabled, so they can pass silently (Intel generally seems more "forgiving" than GCC).
If I modify the code to force both arrays to have the same size, i.e.
ievent_old=SIZE(tl_global_timer, 1)
ALLOCATE(bk_minval(ievent_old), stat=ierror)
ALLOCATE(bk_maxval(ievent_old), stat=ierror)
ALLOCATE(bk_avg(ievent_old), stat=ierror)
the problem goes away and the load balancing tool seems to work.
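For illustration, here is a minimal standalone sketch of the same pattern. It is not the OASIS source: the array names mirror those above, while the program name, the sizes n_alloc and n_later, and the small mpi_size_local are hypothetical and scaled down from the 21850/21842 case. It only shows that sizing bk_avg from SIZE(tl_global_timer, 1) keeps both sides of the assignment conforming.

```fortran
program bound_mismatch_sketch
  implicit none
  ! Hypothetical, scaled-down sizes (the real case was 21850 vs 21842).
  integer, parameter :: n_alloc = 10        ! ievent when tl_global_timer was allocated
  integer, parameter :: n_later = 8         ! ievent at the later allocate (shown for contrast)
  integer, parameter :: mpi_size_local = 4  ! assumed number of ranks
  real(kind=4), allocatable :: tl_global_timer(:,:)
  real(kind=4), allocatable :: bk_avg(:)
  integer :: ievent_old, ierror

  allocate(tl_global_timer(n_alloc, mpi_size_local), stat=ierror)
  if (ierror /= 0) stop 'allocation of tl_global_timer failed'
  tl_global_timer(:,:) = 1.0

  ! Problematic pattern (not executed here):
  !   allocate(bk_avg(n_later))
  !   bk_avg(:) = sum(tl_global_timer(:,:), dim=2) / real(mpi_size_local, 4)
  ! The left-hand side would have n_later elements and the right-hand side
  ! n_alloc, which is what the run-time bounds check reports.

  ! Workaround: take the extent from the array that is actually reduced.
  ievent_old = size(tl_global_timer, 1)
  allocate(bk_avg(ievent_old), stat=ierror)
  if (ierror /= 0) stop 'allocation of bk_avg failed'

  bk_avg(:) = sum(tl_global_timer(:,:), dim=2) / real(mpi_size_local, 4)

  ! Both extents now agree; every average is 1.0 for this constant input.
  print *, 'size(bk_avg) =', size(bk_avg), ' n_later =', n_later
  print *, 'bk_avg(1)    =', bk_avg(1)
end program bound_mismatch_sketch
```

Building this with gfortran and -fcheck=bounds should run cleanly; uncommenting the problematic pattern instead of the workaround should reproduce an "Array bound mismatch" error of the same kind.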
Could you comment on whether this really is a bug and if it's been addressed in a newer version?
I could make my solution a bit cleaner and submit it if needed.
Best wishes
Joakim Kjellsson, SMHI
PS. It is the OpenIFS executable that finally crashes, i.e. OASIS crashes when computing the timings for OpenIFS. NEMO, runoff mapper and XIOS do not encounter this issue.
Posted by Anonymous at January 9 2025
Hi Joakim,
This indeed looks like a bug and has not been fixed. Thanks for your fix, I will open a ticket (and fix it in the next official version).
Regards,
Sophie
Posted by Anonymous at January 10 2025
If I use the current case -1. . I will be fine & not ... Sophie ?
subhadeep.
Posted by Anonymous at January 10 2025
I do not understand your question. Please send me a personal email at valcke[at]cerfacs.fr
Posted by Anonymous at May 21 2025
I just ran into the same issue with AWIESM3 (OpenIFS48r1 + FESOM2 + LPJGuess + Runoff-mapper & XIOS)
@Joakim & Sophie, is there a patch available? I do not see a fix on the EC-Earth SMHI oasis vendor repo:
https://git.smhi.se/ec-earth/vendor/oasis/oasis3-mct-5/-/commits/main?ref_type=HEADS
Nor under the cerfacs gitlab:
https://gitlab.com/cerfacs/oasis3-mct/-/commits/OASIS3-MCT_5.0?ref_type=HEADS
For now, I just recreated Joakim's fix locally.
Best, Jan
Posted by Anonymous at May 21 2025
Hi,
Sorry, the fix is still not in the official sources. You did the right thing in the meantime! We will certainly include it in the next official version.
Sorry for the slow reaction (we are a bit overloaded lately :-( )
Regards,
Sophie
Posted by Anonymous at May 21 2025
No worries, Sophie!
I've made an MR about this on the EC-Earth git portal for now.
It just needs to be a small fix on our side for now, and then we won't have this problem anymore when we upgrade to OASIS3-MCT6.
/J
Posted by Anonymous at June 25 2025
Hi Joakim,
The fix is now (finally) in the master branch and in the OASIS3-MCT_5.0 branch. And it will of course follow in the next version OASIS3-MCT_6.0 that will be released this autumn!
Thanks for the fix and have a nice day,
Sophie
Posted by Anonymous at June 26 2025
Hi,
I don't understand the load_balancing concept.
I have tried hard to understand it.
-*.*-
Posted by Anonymous at June 26 2025
Please read section 6.5 of the User Guide (https://cerfacs.fr/oa4web/oasis3-mct_5.0/oasis3mct_UserGuide/node65.html) and the specific documentation about load balancing in the balancing documentation.pdf file in the oasis3-mct/util/load balancing directory.
And ask more specific questions if you have any after reading the documentation!
Regards,
Sophie
Posted by Anonymous at June 26 2025
Hi Sophie,
Thank you for your kind guidance and support.
I will go through it.
Regards,
-*.*-