About this document

This document is part of an AVBP training course, where students must monitor several versions of the same run. Most of these versions have crashed; the others have flaws. The objective is to find the origin of the problem in each run.

To help AVBP users in this predicament, we gathered here what experienced users usually do. This is, therefore, a heuristic collection of good practices centered on AVBP usage.

The log file

When the run has ended, start before anything else with the AVBP log file. It should look something like this:

Number of MPI processes :           20

     __      ______  _____   __      ________ ___
    /\ \    / /  _ \|  __ \  \ \    / /____  / _ \
   /  \ \  / /| |_) | |__) |  \ \  / /    / / | | |
  / /\ \ \/ / |  _ <|  ___/    \ \/ /    / /| | | |
 / ____ \  /  | |_) | |         \  /    / / | |_| |
/_/    \_\/   |____/|_|          \/    /_/ (_)___/

Using branch  : D7_0
Version date  : Mon Jan 5 22:03:30 2015 +0100
Last commit   : 9ae0dd172d8145496e8d62f8bd34f25ae2595956

Computation #1/1

AVBP version : 7.0 beta


 ----> Building dual graph
    >> generation took 0.118s

 ----> Decomposition library: pmetis
    >> Partitioning took  0.161s

       | Boundary patches (no reordering)                                                |
       | Patch number   Patch name                         Boundary condition            |
       | ------------   ----------                         ------------------            |
       | 1              CanyonBottom                       INLET_FILM                    |
       | 15             PerioRight                         PERIODIC_AXI                  |

       | Info on initial grid                                        |
       | number of dimensions              : 3                       |
       | number of nodes                   : 34684                   |
       | number of cells                   : 177563                  |
       | number of cell per group          : 100                     |
       | number of boundary nodes          : 9436                    |
       | number of periodic nodes          : 2592                    |
       | number of axi-periodic nodes      : 0                       |
       | Type of axi-periodicity           : 3D                      |
       | After partitioning                                          |
       | number of nodes                   : 40538                   |
       | extra nodes due to partitioning   : 5854 [+  16.88%]        |


 ----> Starts the temporal loop.

       Iteration #      Time-step [s]         Total time [s]        Iter/sec [s-1]

                1    0.515455426286E-06    0.515455426286E-06    0.357797003731E+01
             ---> Storing isosurface: isoT
               50    0.517183938876E-06    0.258723148413E-04    0.184026441956E+02
              100    0.510691871555E-06    0.515103318496E-04    0.241920225176E+02
              150    0.517872233906E-06    0.772251848978E-04    0.239538511163E+02
              200    0.523273983650E-06    0.103271506216E-03    0.241928988318E+02
            27296    0.547339278850E-06    0.150002454353E-01    0.241129046630E+02

 ----> Solution stored in file : ./SOLUT/solut_00000007_end.h5

             ---> Storing cut: sliceX
             ---> Storing cut: cylinder
             ---> Storing isosurface: isoT


 ----> End computation.


       | 20 MPI tasks             Elapsed real time [s]       [s.cores]      [h.cores]             |
       | AVBP                   :      1137.27               0.2275E+05     0.6318E+01             |
       | Temporal loop          :      1134.31               0.2269E+05     0.6302E+01             |
       | Per iteration          :       0.0416               0.8311E+00                            |
       | Per iteration and node :   0.1198E-05               0.2396E-04                            |
       | Per iteration and cell :   0.2340E-06               0.4681E-05                            |

 ----> End of AVBP session

 ***** Maximum memory used : 12922716 B ( 0.123241E+02 MB)

The log file is your absolute compass. Let’s break it down:

The header

The header provides the exact version of the code. If you ask for help or support, please include this header in your communications.

Using branch  : D7_0
Version date  : Mon Jan 5 22:03:30 2015 +0100
Last commit   : 9ae0dd172d8145496e8d62f8bd34f25ae2595956

Computation #1/1

AVBP version : 7.0 beta
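
If you script your post-mortems, these identification lines are easy to collect automatically. A minimal Python sketch (the function name and the key list are our own choices, based on the header shown above):

```python
def extract_header(log_text):
    """Collect the version-identification lines from an AVBP log."""
    keys = ("Using branch", "Version date", "Last commit", "AVBP version")
    return [line.strip() for line in log_text.splitlines()
            if line.strip().startswith(keys)]

header = """Using branch  : D7_0
Version date  : Mon Jan 5 22:03:30 2015 +0100
Last commit   : 9ae0dd172d8145496e8d62f8bd34f25ae2595956
AVBP version : 7.0 beta"""
# extract_header(header) returns the four lines, ready to paste
# into a support request.
```
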

The pre-processing

The pre-processing is everything before the line:

----> Starts the temporal loop.

It is all the actions AVBP takes before actually solving the Navier-Stokes equations. It shows what AVBP understood from your setup. Reading it carefully often gives clues.

For example, as the mesh is read, this is where you may detect that your mesh was in millimeters and not in meters.

Some actions may be long, or even ridiculously long (computing wall distances, partitioning).

If you see that one of these steps takes, say, 5% of your total job time, you can probably do better: AVBP often offers to read the pre-processing data of a previous run instead of re-generating it every time.
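
A quick way to rank the pre-processing steps is to scrape their reported durations from the log. A minimal sketch (the function name and regex are ours; the pattern matches the ">> ... took ...s" lines shown above):

```python
import re

def preprocessing_timings(log_text):
    """Map each pre-processing step to its reported duration in seconds."""
    # Matches lines such as '>> Partitioning took  0.161s'
    pattern = re.compile(r">>\s+(.+?)\s+took\s+([0-9.]+)s")
    return {m.group(1): float(m.group(2)) for m in pattern.finditer(log_text)}

log = """ ----> Building dual graph
    >> generation took 0.118s
 ----> Decomposition library: pmetis
    >> Partitioning took  0.161s"""
# preprocessing_timings(log) -> {'generation': 0.118, 'Partitioning': 0.161}
```

Sorting the resulting dictionary by value immediately shows which step dominates your pre-processing.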

The temporal-loop

The temporal loop is between these lines

----> Starts the temporal loop.
----> End computation.

Pay attention to the time-step. You may discover there that it is perfectly constant (fixed time-step, not always optimal and potentially unstable) or dropping dramatically.

Also pay attention to the amount of post-processing you generate. The frequency of outputs should be adapted both to your characteristic time and your simulation time.

  • The characteristic time t_char is configuration dependent. It can be a distance divided by a velocity, or a volume divided by a mass flow rate and density.
  • The simulation time t_sim is the physical time you can simulate on one run.

Once you know these two quantities, check that you have enough, but not too many, outputs:

  • temporal outputs: every iteration for the first 1000 iterations, then maybe 10 000-100 000 over your t_sim
  • instantaneous: more than 100 in one t_char is probably a waste.
  • iso-surfaces: more than 100 in one t_char is probably a waste.
  • cuts: more than 1000 in one t_char is probably a waste.
  • averages: no point in having more than one average per t_char, with an averaging frequency of 20/t_char.
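
These budgets are easy to check numerically before launching a long run. A minimal sketch, with purely illustrative values of t_char, t_sim and output period:

```python
def output_budget(t_char, t_sim, dt_output):
    """Number of outputs per characteristic time and over the whole run."""
    per_t_char = t_char / dt_output
    total = t_sim / dt_output
    return per_t_char, total

# Illustrative numbers: t_char = 1 ms, simulated time = 50 ms,
# one instantaneous solution every 5 us.
per_t_char, total = output_budget(t_char=1e-3, t_sim=50e-3, dt_output=5e-6)
# per_t_char is about 200 > 100: too many instantaneous solutions per t_char.
```
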

The HPC monitoring

After the temporal loop, some HPC monitoring is available; do not skip it! Try to identify what a fair performance is for your own case. Compare with your comrades, and ask your supervisor.

Remember that AVBP needs at least 10 000 cells per process. (Asking for 16 processes on a 2D 10x100 Cartesian grid is a no-no.)
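
This rule of thumb translates into a one-line check. A sketch (the function name is ours; the 10 000-cell threshold is the one quoted above):

```python
def max_processes(n_cells, min_cells_per_process=10_000):
    """Largest MPI process count respecting the cells-per-process rule."""
    return max(1, n_cells // min_cells_per_process)

# For the 177563-cell mesh of the log above:
# max_processes(177563) -> 17
```
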

Ooops: the log ends with an error

Your computation ended with an error. Take your time reading it. Here are some of the most common error messages:

The “Error in temperature.F”

This is the most common error found in the temporal loop. It occurs when the fluid goes out of its nominal range of operation (Temperature > 5000 K, or Temperature < 0 K).

          1294803    0.111838215685E-08    0.137970567894E-02    0.217249658544E+01
          1294813    0.111838150869E-08    0.137971686275E-02    0.217847536081E+01
          1294823    0.111838077592E-08    0.137972804657E-02    0.217845113224E+01

 >>>>> ERROR detected in subroutine temperature

 >>>>> Temperature problem at local node 7155, x =  0.57287906E-01, y =  0.28629386E-01, z = -0.17347235E-15

 >>>>> ERROR detected in subroutine temperature
 >>>>> Temperature problem at local node 7602, x =  0.56900699E-01, y =  0.29165336E-01, z =  0.97071109E-02
 >>>>> ERROR detected in subroutine temperature
 >>>>> Temperature problem at local node 4912, x =  0.56900699E-01, y =  0.29165336E-01, z =  0.97071109E-02

Use the error locations given (x = 0.56900699E-01, y = 0.29165336E-01, z = 0.97071109E-02) to pinpoint the position of your problem in the 3D geometry. You can use an iso-surface of temperature or pressure on the instantaneous solut_crash.sol.h5 file to help in your search.
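
When many nodes fail at once, scraping the coordinates makes it easier to locate them in your visualization tool. A sketch based on the message format above (the function name and regex are our own):

```python
import re

def crash_locations(log_text):
    """Extract the (x, y, z) coordinates reported by temperature errors."""
    pattern = re.compile(
        r"x =\s*([-0-9.E+]+), y =\s*([-0-9.E+]+), z =\s*([-0-9.E+]+)")
    return [tuple(float(v) for v in m.groups())
            for m in pattern.finditer(log_text)]

log = (">>>>> Temperature problem at local node 7602, "
       "x =  0.56900699E-01, y =  0.29165336E-01, z =  0.97071109E-02")
# crash_locations(log) -> [(0.056900699, 0.029165336, 0.0097071109)]
```
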

Reading a backtrace

When the error is unexpected, AVBP gives you a “backtrace”, which looks like this:

AVBP_V75RGbeta.TA 0000000000E1B4ED Unknown              Unknown Unknown
libpthread-2.17.s 00007FB79509C630 Unknown              Unknown Unknown
AVBP_V75RGbeta.TA 0000000000783636 mod_collective_in        157 mod_collective_inputs.f90
AVBP_V75RGbeta.TA 0000000000789BC7 mod_collective_in       1374 mod_collective_inputs.f90
AVBP_V75RGbeta.TA 00000000004D9179 mod_pmesh_load_so        623 mod_pmesh_load_solut.f90
AVBP_V75RGbeta.TA 00000000004D680A mod_pmesh_load_so        700 mod_pmesh_load_solut.f90
AVBP_V75RGbeta.TA 000000000044F6CC slave_pre_tempora        158 slave_pre_temporal.f90
AVBP_V75RGbeta.TA 0000000000412163 slave_                    68 slave.f90
AVBP_V75RGbeta.TA 0000000000411A1B avbp_                    162 avbp.f90
AVBP_V75RGbeta.TA 000000000041156D MAIN__                    27 avbp_main.f90
AVBP_V75RGbeta.TA 000000000041151E Unknown              Unknown Unknown
libc-2.17.so      00007FB794CE1555 __libc_start_main    Unknown Unknown
AVBP_V75RGbeta.TA 0000000000411429 Unknown              Unknown Unknown

mod_pmesh_load_solut.f90:623 => CALL grid%inputs%read_vector_set ( hdf_groupname(17),hdf_nr_nscbc_setname(1),1,nr_nscbc_value(1) )

As with many error messages, start from the end and move up the message to gather context. Here the error occurred in a file named mod_pmesh_load_solut.f90 at line 623. The source line itself is even provided:

CALL grid%inputs%read_vector_set ( hdf_groupname(17),hdf_nr_nscbc_setname(1),1,nr_nscbc_value(1) )

Or, in human terms, “while loading a solution, reading a vector related to NSCBC values failed”.
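
This reading can be automated: keep only the frames that carry a source file and line number, and discard the Unknown ones. A sketch matching the backtrace format above (function name and regex are ours):

```python
import re

def source_frames(backtrace):
    """List (file, line) pairs from a backtrace, in the order printed."""
    pattern = re.compile(r"\s(\d+)\s+(\S+\.f90)\s*$", re.MULTILINE)
    return [(m.group(2), int(m.group(1))) for m in pattern.finditer(backtrace)]

bt = ("AVBP_V75RGbeta.TA 00000000004D9179 mod_pmesh_load_so        623 "
      "mod_pmesh_load_solut.f90\n"
      "AVBP_V75RGbeta.TA 0000000000412163 slave_                    68 slave.f90")
# source_frames(bt) -> [('mod_pmesh_load_solut.f90', 623), ('slave.f90', 68)]
```
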

Dealing with the dreaded SEGFAULT

A SEGFAULT means the software tried to access memory outside of what was expected. This is usually due to a wrong dimension in an array.

A word on the WRITE(*,*) black magic: a WRITE statement also asks for a bit of memory. Since the memory layout changes, adding a WRITE statement can “mute” the SEGFAULT, but the problem is not fixed: the computer is feeding your algorithm whatever data it found at the memory address you gave, probably garbage.

A real-life SEGFAULT looks like:

    ERROR  Problem while running command :/scratch/cfd/rossi/AVBP/dev/HOST/KRAKEN/BIN/check_perio.e_KRAKEN ../RUN_CURRENT
    forrtl: severe (174): SIGSEGV, segmentation fault occurred
    Image              PC                Routine            Line        Source
    check_perio.e_KRA  00000000008CADAD  Unknown               Unknown  Unknown
    libpthread-2.17.s  00002B562BFB55D0  Unknown               Unknown  Unknown
    check_perio.e_KRA  0000000000412F16  MAIN__                    355  check_perio.f90
    check_perio.e_KRA  000000000040F7DE  Unknown               Unknown  Unknown
    libc-2.17.so       00002B562C1E43D5  __libc_start_main     Unknown  Unknown
    check_perio.e_KRA  000000000040F6E9  Unknown               Unknown  Unknown

Again, do not panic; take a closer look, as the very line of your problem is given.

In the source we can read:

check_perio.f90, l355:
   diff = ABS( grid%w_spec(k,n1) - grid%w_spec(k,n2) )

This narrows your search for the dimension mismatch down to two possible origins:

  • The number of species (index k)
  • The number of nodes (indices n1 and n2)

Temporal monitoring

The global evolution of the run xm

The temporal tool xm (documentation) gives you the temporal evolution of global quantities. See first whether the kinetic energy is reaching, or has reached, a quasi-steady state for steady computations.

Keep an eye on the extrema in pressure and temperature, which can be quite different from the spatial average.

The operating point x_mc_exact

The temporal tool x_mc_exact (documentation) gives you the temporal evolution on the boundaries. You can check there that your inflow is reaching the operating point you want. If not, three options:

  • The target is wrong
  • The relaxation coefficient is not strong enough to impose the target. In other words, a stronger phenomenon prevents the BC from converging to its operating point.
  • A numerical correction is activated and is no longer negligible.

The time advancement xt

The temporal tool xt (documentation) gives insights into the time-step evolution. Be careful with interpretation: the tool gives information at the cell level, while the actual time-step is only known later. Dramatic drops of the time-step are especially visible in xt.
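
A dramatic drop can also be detected programmatically from the time-step column of the log. A minimal sketch (the function name is ours, and the factor-of-10 threshold is an arbitrary choice):

```python
def timestep_drop(dts, factor=10.0):
    """Index where the time-step first falls below (running maximum / factor),
    or None if it never does."""
    peak = dts[0]
    for i, dt in enumerate(dts):
        peak = max(peak, dt)
        if dt < peak / factor:
            return i
    return None

# A healthy run, then a dramatic collapse of the time-step:
series = [5.2e-7, 5.1e-7, 5.2e-7, 5.0e-7, 1.1e-9]
# timestep_drop(series) -> 4
```
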

The inlet/outlet coherence xu

The simple tool xu (documentation) shows the highest and lowest velocities on a patch. Outlets are negative and inlets are positive.

Make sure that no boundary involves both a positive AND a negative velocity. You would probably be injecting garbage into your simulation…
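
Once you have the per-patch extrema, this check can be scripted. A sketch with made-up patch names and numbers:

```python
def mixed_sign_patches(extrema):
    """Flag patches whose velocity extrema straddle zero.
    `extrema` maps a patch name to its (min, max) normal velocity."""
    return [name for name, (vmin, vmax) in extrema.items()
            if vmin < 0.0 < vmax]

# Illustrative xu-like data: a patch mixing inflow and outflow is suspicious.
patches = {"inlet": (1.2, 3.4), "outlet": (-2.0, 0.5), "wall": (0.0, 0.0)}
# mixed_sign_patches(patches) -> ['outlet']
```
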

Solution monitoring

Visualizing your flow is useful even if the run seems to behave well. Remember, however, that your sight will be limited by your saving frequency. Too high a saving frequency, and you will see a slow-motion evolution of your actual flow and pay attention to details that do not matter. Too low, and you may miss big motions in the intervals between your snapshots. Here again, you cannot visualize the flow without keeping a clear picture of the characteristic times involved.

A common mistake is looking only at cutting planes at important positions (constant radius, mid-plane, etc.). A lot occurs outside these planes. Start by looking at what is happening on the external frontier, and see if every part of it makes sense to you.

Alternate between pressure and velocity fields to create a correct mental image of the motion: the right-hand side of Navier-Stokes involves both the pressure gradient and viscous effects.

Use the additional fields of your solutions to get more accurate information. For example:

  • zeta_p is the artificial viscosity sensor. High levels mean that your LES is having a hard time at these positions.
  • dt_visual is the local time-step. Use it to find the place that limits your resolution speed: a combination of cell size, flow velocity and sound speed.

Make sure that the run_params keyword save_solution.additional allows the storage of these variables.

Moreover, a large collection of field-specific packages is also available on averaged solutions, triggered by the run_params keyword save_average.packages. Be sure to include the relevant packages for your situation.

General good practices when debugging

  • Try to reproduce your problem on a smaller mesh: a short time-to-solution is your best timesaver. Is your CAD something unusual? Remember that HIP can drastically coarsen your mesh in one line (constant coarsening with HIP).
  • If you feel lucky, de-activate the models and peculiarities of your setup one by one.
  • If you want to play it safe, restart from a simple case that works, and re-enable, one by one, the models and specificities of your configuration. Yes, a gaseous run with only walls and no reaction is a good starting point.
  • Check your executables. Make sure you are using the version you expect, in its nominal state. In particular, if you compiled the code yourself, rebuilding the executable (make clean ; make) will never hurt.


Antoine Dauptain is a research scientist focused on computer science and engineering topics for HPC.
