I use XIOS trunk (Revision: 2320) with NEMO trunk (Revision: 15770) in coupled mode. I also use the latest official release of OASIS, oasis3-mct_4.0.

The coupled model freezes after printing:

    (oasis_unitget) 9999
    (oasis_unitget) 9999
    (oasis_unitget) 9999
    starting wrf task 0 of 1
    ---> prism_initxios.x 0

I tried setting call_oasis_enddef to false in iodef.xml, but that resulted in a segmentation fault:

    (oasis_unitget) 9999
    (oasis_unitget) 9999
    (oasis_unitget) 9999
    starting wrf task 0 of 1
    ---> prism_initxios.x 0
    -> info : CServer : Register new Context : nemo_server
    -> info : Register new Context : nemo
    forrtl: severe (174): SIGSEGV, segmentation fault occurred
    Image              PC                Routine            Line     Source
    xios_server.exe    0000000000C9A82A  for__signal_handl  Unknown  Unknown
    libpthread-2.17.s  00002B36B2DFA630  Unknown            Unknown  Unknown
    libmpi.so.12.0.0   00002B36B19D1071  Unknown            Unknown  Unknown
    libmpi.so.12.0.0   00002B36B19D2C47  PMPI_Iprobe        Unknown  Unknown
    xios_server.exe    000000000064FD3C  Unknown            Unknown  Unknown
    xios_server.exe    0000000000650690  Unknown            Unknown  Unknown
    xios_server.exe    0000000000904D26  Unknown            Unknown  Unknown
    xios_server.exe    0000000000653185  Unknown            Unknown  Unknown
    xios_server.exe    000000000044EBB9  Unknown            Unknown  Unknown
    xios_server.exe    0000000000CF7D96  Unknown            Unknown  Unknown
    libc-2.17.so       00002B36B352F555  __libc_start_main  Unknown  Unknown
    xios_server.exe    000000000044EACF  Unknown            Unknown  Unknown

What could be causing the model to hang? I would appreciate any suggestions.
Hi everyone,

Sorry that I cannot offer any help; in fact, we are facing a very similar problem of freezing models here. We use the new NEMO 4.2 version (revision 15557) with XIOS revision 2297 and oasis-mct_4.0 on the Météo-France HPC.

The results of our NEMO-XIOS/AROME and NEMO-XIOS/toymodel tests are as follows:

- When XIOS is used as a server (using_server = true): the two models freeze somewhere after entering oasis_enddef; XIOS is somewhere after the exit of oasis_get_intercomm.
- When XIOS is attached (using_server = false): the two coupled models start running, but NEMO/XIOS exits with the following error at nitend:

    ....
    ....
    In file "iccontext.cpp", function "void cxios_context_handle_create(xios::CContext **, const char *, int)", line 54 -> Contextunknown
    terminate called after throwing an instance of 'xios::CException'
    forrtl: error (76): Abort trap signal
    Image              PC                Routine            Line     Source
    oceanx             0000000001CCA29E  for__signal_handl  Unknown  Unknown
    libpthread-2.17.s  00002B6B811F9630  Unknown            Unknown  Unknown
    libc-2.17.so       00002B6B8173E387  gsignal            Unknown  Unknown
    libc-2.17.so       00002B6B8173FA78  abort              Unknown  Unknown
    libstdc++.so.6     00002B6B7E8C2A58  Unknown            Unknown  Unknown
    libstdc++.so.6     00002B6B7E8CF646  Unknown            Unknown  Unknown
    libstdc++.so.6     00002B6B7E8CF691  Unknown            Unknown  Unknown
    libstdc++.so.6     00002B6B7E8CF8C4  Unknown            Unknown  Unknown
    oceanx             00000000014956D0  Unknown            Unknown  Unknown
    oceanx             0000000000A3049C  Unknown            Unknown  Unknown
    oceanx             00000000004C9E6C  Unknown            Unknown  Unknown
    oceanx             000000000044EB47  Unknown            Unknown  Unknown
    oceanx             000000000044EA40  Unknown            Unknown  Unknown
    oceanx             000000000044EA0E  Unknown            Unknown  Unknown
    libc-2.17.so       00002B6B8172A555  __libc_start_main  Unknown  Unknown
    oceanx             000000000044E929  Unknown            Unknown  Unknown
    ....
    ....

  The toymodel ends fine.

When NEMO is compiled without key_xios, the coupling works fine.

So we would also appreciate any input.

Sincerely,
Cindy Lebeaupin Brossier
Hi,

It is hard for us to tell why your model is hanging. However, we recently wrote a note on the joint usage of OASIS3-MCT and XIOS in climate models:

https://oasis.cerfacs.fr/wp-content/uploads/sites/114/2022/02/Joint_usage_OASIS3-MCT_XIOS_2022.pdf

Perhaps you could start by reading this note and checking that your setup follows it?

With best regards,
Sophie
Hi again,

I have asked Yann Meurdesoif, who is an XIOS developer, as this indeed looks more like an XIOS problem than an OASIS problem.

Which MPI version are you using? Yann says that your problem looks like the Intel MPI bug in MPI_Iprobe. The workaround is to use the "release_mt" library:

    source $I_MPI_ROOT/intel64/bin/mpivars.sh release_mt

In recent versions, the standard MPI library has been merged with the release_mt library, but the bug fix from release_mt was not included. Tickets have been opened with Intel.

Let me know if this helps.

Regards,
Sophie
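The diagnosis above rests on reading the forrtl traceback: the first named routine below the signal handler is PMPI_Iprobe inside libmpi. For long traces, that check can be sketched in a few lines of Python. This is only an illustrative helper, not part of XIOS, OASIS or Intel MPI; the function names `named_frames` and `crashed_in` are hypothetical, and the parser assumes the five-column "Image PC Routine Line Source" layout with 16-digit hexadecimal addresses seen in the traces in this thread. The sample is abridged from the first trace posted above.

```python
import re

# One traceback row: image, 16-hex-digit program counter, routine, line, source.
# This layout is an assumption based on the traces posted in this thread.
FRAME = re.compile(r"^(\S+)\s+([0-9A-Fa-f]{16})\s+(\S+)\s+(\S+)\s+(\S+)\s*$")


def named_frames(traceback_text):
    """Return (image, routine) pairs for rows whose routine is not 'Unknown'."""
    frames = []
    for line in traceback_text.splitlines():
        match = FRAME.match(line.strip())
        if match and match.group(3) != "Unknown":
            frames.append((match.group(1), match.group(3)))
    return frames


def crashed_in(traceback_text, routine):
    """True if `routine` is the first named frame below the signal handler."""
    for _image, name in named_frames(traceback_text):
        if name.startswith("for__signal_handl"):
            continue  # skip the Fortran runtime's own handler frame
        return name == routine
    return False


# Abridged from the xios_server.exe trace earlier in this thread.
SAMPLE = """\
xios_server.exe    0000000000C9A82A  for__signal_handl  Unknown  Unknown
libpthread-2.17.s  00002B36B2DFA630  Unknown            Unknown  Unknown
libmpi.so.12.0.0   00002B36B19D1071  Unknown            Unknown  Unknown
libmpi.so.12.0.0   00002B36B19D2C47  PMPI_Iprobe        Unknown  Unknown
xios_server.exe    000000000064FD3C  Unknown            Unknown  Unknown
libc-2.17.so       00002B36B352F555  __libc_start_main  Unknown  Unknown
"""

if __name__ == "__main__":
    for image, name in named_frames(SAMPLE):
        print(image, name)
    print("crash under PMPI_Iprobe:", crashed_in(SAMPLE, "PMPI_Iprobe"))
```

Applied to the second trace in this thread (the oceanx abort), the same check would report gsignal rather than PMPI_Iprobe as the first named frame, which suggests that failure follows a different path (an uncaught xios::CException) even though the symptoms look related.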
Thank you for the suggestion. My MPI version is:

    Intel(R) MPI Library for Linux* OS, Version 2021.4 Build 20210831 (id: 758087adf)
    Copyright 2003-2021, Intel Corporation.

I switched to the release_mt library, but it didn't help. However, if NEMO is compiled without XIOS, the coupled model runs successfully. So it seems that, in my case, XIOS causes the NOW model to freeze.