Hello,

I am trying to couple NEMO and WRF through OASIS, but the run either freezes or fails. I am not at all a specialist of NEMO, so maybe I did something wrong there.

If I run mpirun -np 2 ./nemo : -np 1 wrf.exe, it fails at the beginning with the error:

(oasis_abort) ABORT: file = /home/wxop1/oasis3-mct/lib/psmile/src/mod_oasis_method.F90
(oasis_abort) ABORT: line = 335
(oasis_abort) ABORT: on model = oceanx

The end of the nout file is:

(oasis_init_comp) compnm gather 1 oceanx T
(oasis_init_comp) compnm gather 2 oceanx T
(oasis_init_comp) compnm gather 3 wrfexe T
(oasis_init_comp) COUPLED models 1 oceanx T

If I try the run in the other order, mpirun -np 1 wrf.exe : -np 2 ./nemo, it passes this step and all the initialization, and I get a first wrfout file, but then it freezes and could run forever without doing anything. The end of ocean.output is:

Namelist namtrd : set trends parameters
   global domain averaged dyn & tra trends   ln_glo_trd = F
   U & V trends: 3D output                   ln_dyn_trd = F
   U & V trends: Mixed Layer averaged        ln_dyn_mxl = F
   T & S trends: 3D output                   ln_tra_trd = F

The end of the ocean debug files is:

(oasis_advance_run) DEBUG sequence O_OTaux1 0 0 1 1
-------- NOTE (oasis_advance_run) compute field index and sizes
(oasis_advance_run) DEBUG nfav,nsav,nsa = 1 13320 13320
-------- NOTE (oasis_advance_run) comm_now compute
(oasis_advance_run) at 0 0 STAT: 1 READY
-------- NOTE (oasis_advance_run) comm_now
-------- NOTE (oasis_advance_run) get section
(oasis_advance_run) at 0 0 RECV: O_OTaux1
-------- NOTE (oasis_advance_run) get recv
(oasis_lb_measure) event index, coupler, kind, timestep_id 5 5 2 0

Thanks in advance for any insights.

Best
It's hard to know what the problem is here. The end of the nout file should be something like:

(oasis_init_comp) COUPLED models 1 oceanx T
(oasis_init_comp) COUPLED models 2 wrfexe T

i.e. you should see both of your coupled models. The fact that you only see one indicates to me that WRF did not do "oasis_init_comp" correctly, which would explain the freeze. Did you get debug.root.02 from WRF?

First, it is odd that the order of the executables given to mpirun matters. If your system uses SLURM (most do), you could try "srun" instead of "mpirun". The benefit of "srun" over "mpirun" is that "srun" comes from the system scheduler SLURM, so it knows how programs should be launched on your system; "mpirun" usually requires you to specify carefully how tasks are distributed across nodes, while "srun" can figure that out automatically. This of course relies on your system having SLURM. It could be that WRF is not started correctly.

Second, I would recommend activating all possible debug output. I don't know WRF, but NEMO has a namelist option "sn_cfctl%l_oasout" (default .FALSE.) which you can set to .TRUE.; this makes NEMO print every OASIS call to ocean.output, so you know what NEMO is doing. If no such option exists in WRF, you could add something like WRITE(*,*) "Calling OASIS_GET" (or similar) before the OASIS calls. Also run with NLOGPRT 5 0 0 to get detailed information from OASIS itself.

/Joakim
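If WRF really has no such switch, here is a minimal sketch of the kind of tracing suggested above, wrapping an OASIS receive in print statements. The names il_var_id, il_date and zfield are hypothetical stand-ins for whatever your WRF-OASIS interface actually uses:

   SUBROUTINE traced_oasis_get(il_var_id, il_date, zfield)
      ! Sketch only: wraps a single oasis_get in WRITE statements so the
      ! standard output shows when WRF enters and leaves the call.
      USE mod_oasis                        ! OASIS3-MCT Fortran interface
      IMPLICIT NONE
      INTEGER,      INTENT(in)    :: il_var_id   ! id returned by oasis_def_var
      INTEGER,      INTENT(in)    :: il_date     ! model time in seconds
      REAL(kind=8), INTENT(inout) :: zfield(:,:) ! local part of the coupling field
      INTEGER :: il_info
      WRITE(*,*) 'WRF: calling oasis_get, var id ', il_var_id, ', date ', il_date
      CALL oasis_get(il_var_id, il_date, zfield, il_info)
      WRITE(*,*) 'WRF: oasis_get returned, info = ', il_info
   END SUBROUTINE traced_oasis_get

A matching pair of WRITE statements around each oasis_put would show immediately whether a component blocks on its first receive before ever sending anything.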
Hi,

Thank you for your answer. Unfortunately, SLURM is not available where I run the computations. By the way, we use the PGI (NVIDIA) compiler; it sometimes causes issues, and I don't know if that could be the case here.

Here is the end of the various debug files (I set the options as you suggested):

debug.02.000000 (nemo 2)
(oasis_enddef) done prism_advance_init
(oasis_mem_print) memory use (MB) = 799.0152 282.6136 (oasis_enddef):advance_init
(oasis_mem_print) memory use (MB) = 799.0152 282.6136 (oasis_enddef):end
oasis_get_freqs Coupling frequency of this field O_QsrOce for coupling 1 is 10800
(oasis_advance_run) at 0 0 RECV: O_OTaux1

debug.02.000001 (nemo 2)
(oasis_enddef) done prism_advance_init
(oasis_mem_print) memory use (MB) = 876.5542 312.3792 (oasis_enddef):advance_init
(oasis_mem_print) memory use (MB) = 876.5542 312.3792 (oasis_enddef):end
oasis_get_freqs Coupling frequency of this field O_QsrOce for coupling 1 is 10800
(oasis_advance_run) at 0 0 RECV: O_OTaux1

debug.01.00000 (WRF)
----------------------------------------------------------------
(oasis_enddef) done prism_advance_init
(oasis_mem_print) memory use (MB) = 3372.8125 1956.4083 (oasis_enddef):advance_init
(oasis_mem_print) memory use (MB) = 3372.8125 1956.4083 (oasis_enddef):end
(oasis_advance_run) at 0 0 RECV: WRF_d01_EXT_d01_SST

This is when the freezing occurs. In the other case I only have the nout output.

Best,
Laëtitia
Dear Laëtitia,

I use something very much like this coupling system. The freezing step seems to take place where I would expect the coupling restart file to be read. What is your coupling restart file like for NEMO? In some cases, I have seen failures at this step where a component model receives a value from the exchange and instantly crashes before the coupler realises what has happened.

Joakim's advice is good as well. You may also be missing some important clues in the standard output/standard error files.

Nick
Could it be that the issue is that both models are trying to receive (RECV) first? They might never get to the part of the code where either of them starts sending. I can think of two options (a sketch of the blocking pattern follows below):

a) Provide OASIS restart files and a lag, if you can generate them externally.
b) Swap around the oasis_put and oasis_get calls in the source code of at least one of the models.

There may be a third option: resolving this through the namcouple with sequencing; see the OASIS3-MCT 5 User Guide, section 2.5.4. I have to say I have never used that myself, though, so I'm not sure about it.

Best,
Jan
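A minimal sketch of the pattern being described, with illustrative names and sizes only (nothing here is taken from the actual NEMO or WRF sources, and the set-up calls are elided):

   PROGRAM step_loop_sketch
      ! Sketch of a coupled component whose time-step loop does its
      ! oasis_get at the start of the step and its oasis_put only at the
      ! end. If BOTH components are built this way, with LAG=0 and no
      ! restart file, each first oasis_get waits for a put the other side
      ! never reaches, and the run "freezes" exactly as reported above.
      USE mod_oasis
      IMPLICIT NONE
      INTEGER, PARAMETER :: dt = 10800                 ! coupling period (s), illustrative
      INTEGER :: istep, il_info
      INTEGER :: id_taux_in, id_sst_out                ! ids from oasis_def_var (set-up elided)
      REAL(kind=8) :: taux(100,100), sst(100,100)      ! illustrative field sizes

      ! ... oasis_init_comp, oasis_def_partition, oasis_def_var and
      !     oasis_enddef would have to be called here ...

      DO istep = 0, 86400 - dt, dt                     ! model time in seconds
         CALL oasis_get(id_taux_in, istep, taux, il_info)   ! blocks until data arrive
         ! ... compute one time step ...
         CALL oasis_put(id_sst_out, istep, sst, il_info)
      END DO
   END PROGRAM step_loop_sketch

With LAG equal to the coupling period, the get at time t is matched with the put made at t - LAG, and the t = 0 get is served from the coupling restart file instead of waiting for the other component, which removes the deadlock without touching the source code.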
I realise I did not ask this, but are the models running lagged? For example, in EC-Earth (OpenIFS atmosphere + NEMO ocean) we use something like:

# Momentum fluxes for oce and ice on U grid
A_TauX_oce:A_TauY_oce:A_TauX_ice:A_TauY_ice O_OTaux1:O_OTauy1:O_ITaux1:O_ITauy1 1 2700 2 rstas.nc EXPORTED
TL255-ocean eORCA1-U-closedseas LAG=2700
P 0 P 2
LOCTRANS SCRIPR
AVERAGE
BILINEAR D SCALAR LATITUDE 1
# SST; sea-ice temperature, albedo, fraction, thickness; snow thickness over ice
O_SSTSST:O_TepIce:O_AlbIce:OIceFrc:OIceTck:OSnwTck A_SST:A_Ice_temp:A_Ice_albedo:A_Ice_frac:A_Ice_thickness:A_Snow_thickness 1 2700 2 rstos.nc EXPORTED
eORCA1-T-closedseas TL255-ocean LAG=2700
P 2 P 0
LOCTRANS SCRIPR
AVERAGE
BILINEAR LR SCALAR LATITUDE 1

Both NEMO and OpenIFS have a time step of 2700 s and they couple every 2700 s as well. The LAG of 2700 means that when NEMO calls OASIS_GET on time step 2, it receives what OpenIFS passed to OASIS_PUT on time step 1. That way both components can call OASIS_GET every time step without a deadlock, but it will not work if you use a LAG of 0. The OASIS documentation explains this really well.

/Joakim
Hi,

If you run mpirun -np 4 wrf.exe : -np 2 ./nemo it will show the same log: your SEND and RECV are not matched correctly (the SEND side has not done anything yet). Can you post your complete namcouple file?

Thank you.

Best,
Dear Laëtitia,

Sorry to react so late, but I see that Jan and Joakim have been providing useful help. I agree with their analysis that both NEMO and WRF are waiting on their first get and are therefore blocked. In the time steps of NEMO and WRF, the get of the incoming coupling fields is probably done before the put of the outgoing fields. Therefore you have to provide coupling restart fields and define a lag for these fields in the namcouple, and the lag has to be equal to the length of the time step of the source model. See section 2.5.3 of the User Guide.

I hope this helps.

With best regards,
Sophie
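If the coupling restart files cannot be produced by the models themselves, here is a minimal sketch of generating one externally with the netCDF Fortran library, under the assumption that the restart file simply has to contain each source-side field from the namcouple on the full source grid. The field name O_SSTSST, file name rstos.nc and the grid size used below are placeholders, not values from this particular set-up:

   PROGRAM write_oasis_restart_sketch
      ! Sketch only: writes one coupling field with a uniform first-guess
      ! value into a netCDF file that can serve as a coupling restart.
      ! Replace the field name, file name, grid size and first-guess value
      ! with the ones required by your own namcouple. Error checking omitted.
      USE netcdf
      IMPLICIT NONE
      INTEGER, PARAMETER :: nx = 182, ny = 149      ! assumed size of the full source grid
      INTEGER :: ncid, dim_x, dim_y, varid, ierr
      REAL(kind=8) :: sst(nx, ny)

      sst = 290.0d0                                 ! any physically sensible first guess (K)

      ierr = nf90_create('rstos.nc', NF90_CLOBBER, ncid)
      ierr = nf90_def_dim(ncid, 'x', nx, dim_x)
      ierr = nf90_def_dim(ncid, 'y', ny, dim_y)
      ierr = nf90_def_var(ncid, 'O_SSTSST', NF90_DOUBLE, (/dim_x, dim_y/), varid)
      ierr = nf90_enddef(ncid)
      ierr = nf90_put_var(ncid, varid, sst)
      ierr = nf90_close(ncid)
   END PROGRAM write_oasis_restart_sketch

With such a file in place and LAG set to the source model's time step, the very first oasis_get of each component is served from the restart file rather than from the other component, so neither side has to wait.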
Thanks everyone for your very helpful answers. I realize now that I haven't done things properly: I set the lag to zero... I'll change that.