Hello, I am using OASIS3-MCT to couple an atmosphere model and an ocean model (NEMO). After setting up the basics and running the coupled system, I have a problem related to the number of MPI tasks. If we give the same number of MPI tasks to both models, or more MPI tasks to NEMO, the model runs fine. However, if NEMO has fewer MPI tasks than the atmosphere model (atmos cores = 4, ocean cores = 1~3), the model hangs in mpi_waitall. The code where it stops is in ~/oasis3-mct/lib/mct/mct/m_Transfer.F90:

    subroutine waitrecv_(aV, Rout, Sum)
    ...
    if(Rout%numratt .ge. 1) then
       call MPI_WAITALL(Rout%nprocs, Rout%rreqs, Rout%rstatus, ier)
       if(ier /= 0) call MP_perr_die(myname_, 'MPI_WAITALL(reals)', ier)
    endif
    ...

I look forward to the help of the OASIS3-MCT developers. Thanks.
Hi, It is hard to believe that this is a problem with MCT or with the OASIS API, as they have been stable for quite some time (but one never knows!). Ideally, could you set up a toy coupled model (i.e. no real models, but realistic coupling exchanges) reproducing the problem, that we could run on our side to try to understand it? In the meantime, some questions to consider:

- We assume the models are run as 2 executables running concurrently on separate pes. If not, what is your setup?
- Does the behaviour change if you switch the order of model launching (i.e. swap "model1" and "model2" in the job launch command)?
- Does this happen on the first communication?
- When you change the task count, is something overlooked in the individual model setup or in the calls to the OASIS API? Is the partition correctly expressed with oasis_def_partition? Are the coupling initialization calls correct (a reminder sketch of the standard call sequence follows below)? Are you sure you are launching each model on the correct number of tasks?

Let us know ... Sophie, for the OASIS3-MCT developers
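For reference, the standard OASIS3-MCT call sequence in each component looks roughly like the following sketch (the component name 'toyatm', the field name 'FSENDOCN' and the 10x10 box are illustrative placeholders, not code from this thread):

    PROGRAM toy_component
       ! Minimal OASIS3-MCT component skeleton (illustrative sketch).
       USE mod_oasis
       IMPLICIT NONE
       INTEGER :: compid, localcomm, part_id, var_id, ierr
       INTEGER :: paral(5), var_nodims(2), var_shape(4)

       CALL oasis_init_comp(compid, 'toyatm', ierr)   ! register this component
       CALL oasis_get_localcomm(localcomm, ierr)      ! communicator spanning this component's tasks only

       paral(1) = 2       ! 2 = BOX partition
       paral(2) = 0       ! global offset of the local lower-left corner
       paral(3) = 10      ! local extent in x
       paral(4) = 10      ! local extent in y
       paral(5) = 10      ! global extent in x
       CALL oasis_def_partition(part_id, paral, ierr)

       var_nodims(1) = 2                   ! rank of the coupling field
       var_nodims(2) = 1                   ! number of fields in the bundle
       var_shape(:)  = (/ 1, 10, 1, 10 /)  ! local bounds (ignored by recent versions)
       CALL oasis_def_var(var_id, 'FSENDOCN', part_id, var_nodims, OASIS_Out, var_shape, OASIS_Real, ierr)

       CALL oasis_enddef(ierr)   ! collective over ALL tasks of the component
       ! ... oasis_put / oasis_get exchanges here ...
       CALL oasis_terminate(ierr)
    END PROGRAM toy_component

One point worth checking in a hang like this: oasis_enddef is collective over all tasks of each component, so every launched task must reach it with consistent partition definitions.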
Hi Sophie, thanks for your kind reply.

- We are running the models as below:

    mpirun --hostfile ~/openmpi.hosts -np 15 nemo.exe : --hostfile ~/openmpi.hosts -np 15 ./atmos.exe

- The wait occurs even if atmos.exe and nemo.exe are swapped, as shown below:

    mpirun --hostfile ~/openmpi.hosts -np 15 nemo.exe : --hostfile ~/openmpi.hosts -np 15 atmos.exe
    mpirun --hostfile ~/openmpi.hosts -np 15 atmos.exe : --hostfile ~/openmpi.hosts -np 15 nemo.exe

  In both cases, the number of pes used by nemo.exe must not be less than the number used by atmos.exe.

- Initialization of the MPI communicator, the partitions and the variable definitions completes successfully, and the wait occurs during the get (NEMO->ATMOS) after the first put (ATMOS->NEMO).

- When changing the MPI task counts, nothing is changed in the model code settings. The atmos model uses an orange partition, configured as below:

    !! 1. PARTITION DEFINITION
    id_grid_size(1) = 3          ! orange partition
    id_grid_size(2) = latlen*2   ! total number of segments (2 per latitude row)
    idx = 3
    DO i = 1, latlen
       def_i   = latstr + (i-1)
       latdef_ = latdef(def_i)
       id_grid_size(idx)   = (jdim-latdef_)*idim   ! segment global offset
       id_grid_size(idx+1) = idim                  ! segment local extent
       id_grid_size(idx+2) = (latdef_-1)*idim      ! segment global offset
       id_grid_size(idx+3) = idim                  ! segment local extent
       idx = idx + 4
    END DO
    CALL oasis_def_partition(part_id(cnt), id_grid_size(:), ierr)

  The NEMO version is 3.6, and the box partition definition in cpl_oasis3.F90 is used as is:

    ! -----------------------------------------------------------------
    ! ... Define the partition
    ! -----------------------------------------------------------------
    paral(1) = 2                                            ! box partitioning
    paral(2) = jpiglo * (nldj-1+njmpp-1) + (nldi-1+nimpp-1) ! NEMO lower left corner global offset
    paral(3) = nlei-nldi+1                                  ! local extent in i
    paral(4) = nlej-nldj+1                                  ! local extent in j
    paral(5) = jpiglo                                       ! global extent in x
    IF( ln_ctl ) THEN
       WRITE(numout,*) ' multiexchg: paral (1:5)', paral
       WRITE(numout,*) ' multiexchg: jpi, jpj =', jpi, jpj
       WRITE(numout,*) ' multiexchg: nldi, nlei, nimpp =', nldi, nlei, nimpp
       WRITE(numout,*) ' multiexchg: nldj, nlej, njmpp =', nldj, nlej, njmpp
    ENDIF
    CALL oasis_def_partition ( id_part, paral, nerror )

There seems to be no problem in the initialization process, and each model runs on the configured number of MPI tasks. There is no problem if both models have the same number of MPI tasks, or if NEMO has more pes. Thanks... joon
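For context, the ORANGE partition describes the local domain as a list of segments expressed in the global 1D index space: two header values followed by an (offset, extent) pair per segment. A minimal illustrative layout (the offsets and extents below are made-up numbers, not taken from the models above):

    ! Illustrative ORANGE partition: this task owns two segments
    ! of the global grid, expressed as 1D global indices.
    INTEGER :: part_id, ierr
    INTEGER :: ig_paral(6)     ! 2 header values + 2 values per segment

    ig_paral(1) = 3            ! 3 = ORANGE partition
    ig_paral(2) = 2            ! total number of segments on this task
    ig_paral(3) = 0            ! segment 1: global offset
    ig_paral(4) = 96           ! segment 1: local extent
    ig_paral(5) = 384          ! segment 2: global offset
    ig_paral(6) = 96           ! segment 2: local extent
    CALL oasis_def_partition(part_id, ig_paral, ierr)

Whatever the partition style, the union of the segments declared across all tasks of a component must describe exactly the data those tasks pass to oasis_put/oasis_get; an inconsistency in the declared decomposition can surface later as unmatched messages, i.e. precisely a hang in MPI_WAITALL.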
Hello Sophie, I have an additional finding. The atm model requires the orange partition, but I tested it with a box partition instead. With the box partition, the run succeeds regardless of the MPI task counts. It seems there is a problem between the orange partition and the box partition in OASIS3-MCT v5. Thank you, joon
Hi Joon, This is pretty weird. I guess the only thing to do at this point is to set up a toy model reproducing your problem, so that we can run it and try to understand what happens. Can you set up this toy model, i.e. two "empty" codes (from the science point of view) that define the same grids and partitions as your models and perform the same coupling exchanges, so reproducing the problem (a sketch of what such a toy exchange loop could look like follows below)? If you do so, we could run it and try to fix the bug. Let me know ... Regards, Sophie
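For illustration, the exchange loop of such a toy component might look like the sketch below (the variable names, field sizes and time stepping are assumptions, not the thread's actual code; itime is the model time in seconds that OASIS matches against the coupling frequencies in the namcouple):

    ! Sketch of a toy component's coupling loop (illustrative only).
    INTEGER, PARAMETER :: dt = 3600, total_time = 86400   ! coupling step / run length (s)
    REAL(kind=8) :: field_out(10,10), field_in(10,10)     ! local fields on a 10x10 partition
    INTEGER :: var_out_id, var_in_id                      ! ids returned by oasis_def_var
    INTEGER :: itime, info

    field_out = 1.0d0                                     ! dummy data, no physics
    DO itime = 0, total_time - dt, dt
       ! OASIS only performs the exchange at the coupling time steps
       ! defined in the namcouple; other calls return without action.
       CALL oasis_put(var_out_id, itime, field_out, info)   ! e.g. ATMOS -> NEMO
       CALL oasis_get(var_in_id,  itime, field_in,  info)   ! e.g. NEMO -> ATMOS
    END DO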
Hi Sophie, I set up atm and ocn dummy models with the same grids and partitions as the coupled models, and after testing I found that the same problem occurs. I fixed the atm dummy model at 4 MPI tasks because the orange partition configuration is complicated; the ocn dummy model uses a box partition, so its MPI task count can be set to any number. Please let me know the email address where I can send you the code. Regards, joon.
Thanks. If it is not too big, you can send the toy model to oasishelp@cerfacs.fr. Regards, Sophie
I just sent you a mail. Thanks! I hope it can be solved. Regards, Joon.