Hello,

I am trying to couple NEMO and WRF through OASIS, but the run either freezes or fails. I am not at all a specialist of NEMO, so maybe I did something wrong there.

If I run mpirun -np 2 ./nemo : -np 1 wrf.exe, it fails at the beginning with the error:

(oasis_abort) ABORT: file = /home/wxop1/oasis3-mct/lib/psmile/src/mod_oasis_method.F90
(oasis_abort) ABORT: line = 335
(oasis_abort) ABORT: on model = oceanx

The end of the nout file is:

(oasis_init_comp) compnm gather 1 oceanx T
(oasis_init_comp) compnm gather 2 oceanx T
(oasis_init_comp) compnm gather 3 wrfexe T
(oasis_init_comp) COUPLED models 1 oceanx T

If I try the run in the other order, mpirun -np 1 wrf.exe : -np 2 ./nemo, it passes this step and all the initialization, and I get a first wrfout file, but then it freezes and could run forever without doing anything. The end of ocean.output is:

Namelist namtrd : set trends parameters
   global domain averaged dyn & tra trends   ln_glo_trd = F
   U & V trends: 3D output                   ln_dyn_trd = F
   U & V trends: Mixed Layer averaged        ln_dyn_mxl = F
   T & S trends: 3D output                   ln_tra_trd = F

The end of the ocean debug files is:

(oasis_advance_run) DEBUG sequence O_OTaux1 0 0 1 1
-------- NOTE (oasis_advance_run) compute field index and sizes
(oasis_advance_run) DEBUG nfav,nsav,nsa = 1 13320 13320
-------- NOTE (oasis_advance_run) comm_now compute
(oasis_advance_run) at 0 0 STAT: 1 READY
-------- NOTE (oasis_advance_run) comm_now
-------- NOTE (oasis_advance_run) get section
(oasis_advance_run) at 0 0 RECV: O_OTaux1
-------- NOTE (oasis_advance_run) get recv
(oasis_lb_measure) event index, coupler, kind, timestep_id 5 5 2 0

Thanks in advance for any insights.

Best
It's hard to know what the problem is here. The end of the nout file should be something like:

(oasis_init_comp) COUPLED models 1 oceanx T
(oasis_init_comp) COUPLED models 2 wrfexe T

i.e. you should see both of your coupled models. The fact that you only see one indicates to me that WRF did not do "oasis_init_comp" correctly, which would explain the freeze. Did you get debug.root.02 from WRF?

First, it is odd that the order of the executables given to mpirun matters. If your system uses SLURM (most do), you could try "srun" instead of "mpirun". The benefit of "srun" over "mpirun" is that "srun" comes from the system scheduler SLURM, so it knows how programs should be launched on your system; "mpirun" usually requires you to specify carefully how tasks are distributed across nodes, while "srun" can figure that out automatically. This of course relies on your system having SLURM. It could be that WRF is not started correctly.

Second, I would recommend activating all possible debug output. I don't know WRF, but NEMO has a namelist option "sn_cfctl%l_oasout" (default .FALSE.) which you can set to .TRUE.; this makes NEMO print every OASIS call to ocean.output, so you know what NEMO is doing. If no such option exists in WRF, you could add something like WRITE(*,*) "Calling OASIS_GET" (or similar) before the OASIS calls. Also run with NLOGPRT 5 0 0 to get detailed information from OASIS itself.

/Joakim
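If WRF really has no such switch, here is a minimal sketch of the kind of tracing suggested above, wrapping an OASIS receive in print statements. The names il_var_id, il_date and zfield are hypothetical stand-ins for whatever your WRF-OASIS interface actually uses:

   SUBROUTINE traced_oasis_get(il_var_id, il_date, zfield)
      ! Sketch only: wraps a single oasis_get in WRITE statements so the
      ! standard output shows when WRF enters and leaves the call.
      USE mod_oasis                        ! OASIS3-MCT Fortran interface
      IMPLICIT NONE
      INTEGER,      INTENT(in)    :: il_var_id   ! id returned by oasis_def_var
      INTEGER,      INTENT(in)    :: il_date     ! model time in seconds
      REAL(kind=8), INTENT(inout) :: zfield(:,:) ! local part of the coupling field
      INTEGER :: il_info
      WRITE(*,*) 'WRF: calling oasis_get, var id ', il_var_id, ', date ', il_date
      CALL oasis_get(il_var_id, il_date, zfield, il_info)
      WRITE(*,*) 'WRF: oasis_get returned, info = ', il_info
   END SUBROUTINE traced_oasis_get

A matching pair of WRITE statements around each oasis_put would show immediately whether a component blocks on its first receive before ever sending anything.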
Hi,

Thank you for your answer. Unfortunately, SLURM is not available where I run the computations. By the way, we use the PGI (NVIDIA) compiler; it sometimes causes issues, and I don't know if that could be the case here.

Here is the end of the various debug files (I set the options as you suggested):

debug.02.000000 (nemo 2)
(oasis_enddef) done prism_advance_init
(oasis_mem_print) memory use (MB) = 799.0152 282.6136 (oasis_enddef):advance_init
(oasis_mem_print) memory use (MB) = 799.0152 282.6136 (oasis_enddef):end
oasis_get_freqs Coupling frequency of this field O_QsrOce for coupling 1 is 10800
(oasis_advance_run) at 0 0 RECV: O_OTaux1

debug.02.000001 (nemo 2)
(oasis_enddef) done prism_advance_init
(oasis_mem_print) memory use (MB) = 876.5542 312.3792 (oasis_enddef):advance_init
(oasis_mem_print) memory use (MB) = 876.5542 312.3792 (oasis_enddef):end
oasis_get_freqs Coupling frequency of this field O_QsrOce for coupling 1 is 10800
(oasis_advance_run) at 0 0 RECV: O_OTaux1

debug.01.00000 (WRF)
----------------------------------------------------------------
(oasis_enddef) done prism_advance_init
(oasis_mem_print) memory use (MB) = 3372.8125 1956.4083 (oasis_enddef):advance_init
(oasis_mem_print) memory use (MB) = 3372.8125 1956.4083 (oasis_enddef):end
(oasis_advance_run) at 0 0 RECV: WRF_d01_EXT_d01_SST

This is when the freezing occurs. In the other case I only have the nout output.

Best,
Laëtitia
Dear Laëtitia,

I use something very much like this coupling system. The freezing step seems to take place where I would expect the coupling restart file to be read. What is your coupling restart file like for NEMO? In some cases, I have seen failures at this step where a component model receives a value from the exchange and instantly crashes before the coupler realises what has happened.

Joakim's advice is good as well. You may also be missing some important clues in the standard output/standard error files.

Nick
Could it be that the issue is that both models are trying to receive (RECV) first? They might never get to the part of the code where either of them starts sending. I can think of two options (a sketch of the blocking pattern follows below):

a) Provide OASIS restart files and a lag, if you can generate them externally.
b) Swap around the oasis_put and oasis_get calls in the source code of at least one of the models.

There may be a third option: resolving this through the namcouple with sequencing; see the OASIS3-MCT 5 User Guide, section 2.5.4. I have to say I have never used that myself, though, so I'm not sure about it.

Best,
Jan
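A minimal sketch of the pattern being described, with illustrative names and sizes only (nothing here is taken from the actual NEMO or WRF sources, and the set-up calls are elided):

   PROGRAM step_loop_sketch
      ! Sketch of a coupled component whose time-step loop does its
      ! oasis_get at the start of the step and its oasis_put only at the
      ! end. If BOTH components are built this way, with LAG=0 and no
      ! restart file, each first oasis_get waits for a put the other side
      ! never reaches, and the run "freezes" exactly as reported above.
      USE mod_oasis
      IMPLICIT NONE
      INTEGER, PARAMETER :: dt = 10800                 ! coupling period (s), illustrative
      INTEGER :: istep, il_info
      INTEGER :: id_taux_in, id_sst_out                ! ids from oasis_def_var (set-up elided)
      REAL(kind=8) :: taux(100,100), sst(100,100)      ! illustrative field sizes

      ! ... oasis_init_comp, oasis_def_partition, oasis_def_var and
      !     oasis_enddef would have to be called here ...

      DO istep = 0, 86400 - dt, dt                     ! model time in seconds
         CALL oasis_get(id_taux_in, istep, taux, il_info)   ! blocks until data arrive
         ! ... compute one time step ...
         CALL oasis_put(id_sst_out, istep, sst, il_info)
      END DO
   END PROGRAM step_loop_sketch

With LAG equal to the coupling period, the get at time t is matched with the put made at t - LAG, and the t = 0 get is served from the coupling restart file instead of waiting for the other component, which removes the deadlock without touching the source code.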
I realise I did not ask this, but are the models running lagged? For example, in EC-Earth (OpenIFS atmosphere + NEMO ocean) we use something like:

# Momentum fluxes for oce and ice on U grid
A_TauX_oce:A_TauY_oce:A_TauX_ice:A_TauY_ice O_OTaux1:O_OTauy1:O_ITaux1:O_ITauy1 1 2700 2 rstas.nc EXPORTED
TL255-ocean eORCA1-U-closedseas LAG=2700
P 0 P 2
LOCTRANS SCRIPR
AVERAGE
BILINEAR D SCALAR LATITUDE 1
# SST; sea-ice temperature, albedo, fraction, thickness; snow thickness over ice
O_SSTSST:O_TepIce:O_AlbIce:OIceFrc:OIceTck:OSnwTck A_SST:A_Ice_temp:A_Ice_albedo:A_Ice_frac:A_Ice_thickness:A_Snow_thickness 1 2700 2 rstos.nc EXPORTED
eORCA1-T-closedseas TL255-ocean LAG=2700
P 2 P 0
LOCTRANS SCRIPR
AVERAGE
BILINEAR LR SCALAR LATITUDE 1

Both NEMO and OpenIFS have a time step of 2700 s and they couple every 2700 s as well. The LAG of 2700 means that when NEMO calls OASIS_GET on time step 2, it receives what OpenIFS passed to OASIS_PUT on time step 1. That way both components can call OASIS_GET every time step without a deadlock, but it will not work if you use a LAG of 0. The OASIS documentation explains this really well.

/Joakim
Hi,

If you run mpirun -np 4 wrf.exe : -np 2 ./nemo it will show the same log: your SEND and RECV are not matched correctly (the SEND side has not done anything yet). Can you post your complete namcouple file?

Thank you.

Best,
Dear Laëtitia,

Sorry to react so late, but I see that Jan and Joakim have been providing useful help. I agree with their analysis that both NEMO and WRF are waiting on their first get and are therefore blocked. In the time steps of NEMO and WRF, the get of the incoming coupling fields is probably done before the put of the outgoing fields. Therefore you have to provide coupling restart fields and define a lag for these fields in the namcouple, and the lag has to be equal to the length of the time step of the source model. See section 2.5.3 of the User Guide.

I hope this helps.

With best regards,
Sophie
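If the coupling restart files cannot be produced by the models themselves, here is a minimal sketch of generating one externally with the netCDF Fortran library, under the assumption that the restart file simply has to contain each source-side field from the namcouple on the full source grid. The field name O_SSTSST, file name rstos.nc and the grid size used below are placeholders, not values from this particular set-up:

   PROGRAM write_oasis_restart_sketch
      ! Sketch only: writes one coupling field with a uniform first-guess
      ! value into a netCDF file that can serve as a coupling restart.
      ! Replace the field name, file name, grid size and first-guess value
      ! with the ones required by your own namcouple. Error checking omitted.
      USE netcdf
      IMPLICIT NONE
      INTEGER, PARAMETER :: nx = 182, ny = 149      ! assumed size of the full source grid
      INTEGER :: ncid, dim_x, dim_y, varid, ierr
      REAL(kind=8) :: sst(nx, ny)

      sst = 290.0d0                                 ! any physically sensible first guess (K)

      ierr = nf90_create('rstos.nc', NF90_CLOBBER, ncid)
      ierr = nf90_def_dim(ncid, 'x', nx, dim_x)
      ierr = nf90_def_dim(ncid, 'y', ny, dim_y)
      ierr = nf90_def_var(ncid, 'O_SSTSST', NF90_DOUBLE, (/dim_x, dim_y/), varid)
      ierr = nf90_enddef(ncid)
      ierr = nf90_put_var(ncid, varid, sst)
      ierr = nf90_close(ncid)
   END PROGRAM write_oasis_restart_sketch

With such a file in place and LAG set to the source model's time step, the very first oasis_get of each component is served from the restart file rather than from the other component, so neither side has to wait.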
Thanks everyone for your very helpful answers. I realize now that I haven't done things properly: I set the lag to zero... I'll change that.