The OASIS Coupler Forum


get_memusage: pid 1564339 is too large


Posted by Anonymous at June 26 2024

Good morning

I've started using FOCI-OpenIFS 3.0 (OpenIFS 43r3 + NEMO 3.6 + XIOS 2.5 + OASIS3-MCT5) on some brand-new computer resources we have at GWDG in Germany. 
The model was running fine on the older nodes running CentOS 7, Intel compilers v2021 and 48-core Cascade Lake 9242 chips. 
However, on the new compute nodes (Rocky OS 8, Intel compilers v2023, 48-core Sapphire Rapids 8468) I'm getting a strange error. 

700: get_memusage: pid 1564339 is too large

This comes from each of the 732 cores I'm using, and it only appears if NLOGPRT is not 0, i.e. if I choose some debug info. 
I think OASIS tries to track memory usage but can't handle the large process ID numbers. The support team at the computing centre said "On Red Hat Enterprise Linux 8.2 and later the pid_max value is 4194304, which is quite a bit higher than on older versions." and hinted that OASIS might need to be updated to handle it. 

So far I don't think this causes any problems in running the model, but perhaps the debug info won't be accurate? 
In any case, it would be great if someone from the team could test OASIS on some very new machine to see if this is just a problem at GWDG in Germany or a general issue with new OS etc. 
I'm also wondering if there is a way to solve the issue by e.g. increasing the precision of an integer somewhere? If "pid" is a short integer (=2^16) it won't fit on the GWDG system since their max is 2^22. 

Best wishes
Joakim Kjellsson 

PS. Details on the new (standard96s) vs old nodes (standard96) are here: https://docs.hpc.gwdg.de/compute_partitions/cpu_partitions/index.html

Posted by Anonymous at June 26 2024

Dear Joakim,

I get these messages frequently on my systems as well when using OASIS for a somewhat different modelling system (uses NEMO 4.2 but otherwise very different).

I haven't noticed any adverse consequences, but I am curious what the cause is.

I can definitely confirm this happens on a system that uses Debian GNU/Linux 11 (bullseye) if that helps.

Best regards,

Nicholas Heavens
Innovation Project Manager, Climate Modelling
Viridien

Posted by Anonymous at June 27 2024

Dear Joakim, dear Nicholas,

I'm running the coupled model ICON-CLM + NEMO 3.6 + HD + OASIS3-MCT 4.0 on the Levante computing system at DKRZ, Hamburg. I saw a similar error about two years ago when we moved from the Mistral computing system to Levante. A colleague at DKRZ found out that the error comes from a fixed limit of 999999 in lib/psmile/src/GPTLget_memusage.c:

  pid = (int) getpid ();
  if (pid > 999999) {
    fprintf (stderr, "get_memusage: pid %d is too large\n", pid);
    return -1;
  }

After we changed the number to 4194304, the error no longer appeared.
Perhaps the OASIS team can find a good solution for it in a new version.
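
One caveat when raising the limit: if the check exists to protect a fixed-size buffer that holds the /proc path (an assumption; the snippet above does not show that part of the file), the buffer has to grow as well. A minimal sketch of a path builder that needs no hard-coded PID limit is below. The helper name build_statm_path is hypothetical and not part of OASIS or GPTL.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not part of OASIS or GPTL: write "/proc/<pid>/statm"
 * into buf. Returns 0 on success, -1 if the path would not fit in buf. */
static int build_statm_path(char *buf, size_t buflen, pid_t pid)
{
    /* snprintf never writes more than buflen bytes and returns the length
     * the complete string would have had, so truncation is detectable. */
    int n = snprintf(buf, buflen, "/proc/%ld/statm", (long) pid);
    return (n < 0 || (size_t) n >= buflen) ? -1 : 0;
}
```

With a 64-byte buffer this accepts any PID a long can hold, and it reports failure instead of overflowing if the buffer is ever too small.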

Best regards,
Ha Hagemann

Posted by Anonymous at June 27 2024

Dear Ha,

Of course. What an elegant solution to the mystery! The general approach for Linux-based systems would be to somehow configure OASIS to use the value in: /proc/sys/kernel/pid_max, which is indeed 4194304 for my system.
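
As a rough sketch of that idea (not something OASIS does today, as far as this thread shows), the limit could be read at run time. The function name read_pid_max is hypothetical; only the /proc/sys/kernel/pid_max path comes from the discussion above.

```c
#include <stdio.h>

/* Hypothetical helper: return the kernel's pid_max, or -1 if the file
 * cannot be read or parsed (for example on a non-Linux system). */
static long read_pid_max(void)
{
    long value = -1;
    FILE *fp = fopen("/proc/sys/kernel/pid_max", "r");
    if (fp == NULL)
        return -1;
    if (fscanf(fp, "%ld", &value) != 1)
        value = -1;   /* file exists but did not contain a number */
    fclose(fp);
    return value;
}
```

A caller would have to decide what to do when -1 comes back, e.g. skip the range check altogether.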

Thanks,

Nick

Posted by Anonymous at June 27 2024

You can check the current value with:
cat /proc/sys/kernel/pid_max
or
sysctl kernel.pid_max

Mine is 32768 :)

best
subhadeep

Posted by Anonymous at July 3 2024

Hi, I've traced this piece of code back to CESM, where it originates.
I think this test on the value of pid after it has been cast from pid_t to int has no real meaning.  
Notice that in the __APPLE__ counterpart (line 113 of GPTLget_memusage.c) the check has been dropped.  
pid is the output of getpid, therefore it will certainly be smaller than the limit set in /proc/sys/kernel/pid_max. 
Whether it really fits in a signed int or not is another question.
On current systems we are very far from hitting the upper limit of 2,147,483,647.

Should future systems allow 8-byte PIDs, the cast should move to long int or unsigned long.
I think we are safe for the moment (and for quite a long time).
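
For anyone who prefers to keep a check rather than none at all, here is a minimal sketch that tests the thing that actually matters for the cast, namely whether the PID fits in an int. Both function names are hypothetical and not taken from GPTLget_memusage.c.

```c
#include <limits.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: 1 if pid survives a cast to int unchanged, else 0. */
static int pid_fits_in_int(pid_t pid)
{
    return (long long) pid >= INT_MIN && (long long) pid <= INT_MAX;
}

/* Hypothetical helper: current PID as int, or -1 if it would not fit. */
static int current_pid_as_int(void)
{
    pid_t pid = getpid();
    return pid_fits_in_int(pid) ? (int) pid : -1;
}
```

With pid_max at 4194304 (2^22) and INT_MAX at 2,147,483,647 (2^31 - 1) on common platforms, this check passes for any PID that such a system can hand out.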

My final suggestion: just remove lines 141-144 from GPTLget_memusage.c and live happily

Andrea