This document is part of an AVBP training course in which students must monitor several versions of the same run. Most of these versions have crashed; the others have flaws. The objective is to find the origin of the problem in each run.
To help AVBP users in this predicament, we gathered here what experienced users usually do. This is, therefore, a heuristic collection of good practices centered on AVBP usage.
When the run has ended, start before anything else with the AVBP log file. It should look something like:
```
 Number of MPI processes :  20

   ( ASCII-art AVBP banner )

 Using branch : D7_0
 Version date : Mon Jan 5 22:03:30 2015 +0100
 Last commit  : 9ae0dd172d8145496e8d62f8bd34f25ae2595956
 Computation #1/1
 AVBP version : 7.0 beta
 (...)
 ----> Building dual graph
   >> generation took 0.118s
 ----> Decomposition library: pmetis
   >> Partitioning took 0.161s
 ___________________________________________________________________________
|                   Boundary patches (no reordering)                        |
|___________________________________________________________________________|
|  Patch number    Patch name      Boundary condition                       |
|  ------------    ----------      ------------------                       |
|       1          CanyonBottom    INLET_FILM                               |
 (...)
|      15          PerioRight      PERIODIC_AXI                             |
|___________________________________________________________________________|
 _______________________________________________________________
|                     Info on initial grid                      |
|_______________________________________________________________|
| number of dimensions            : 3                           |
| number of nodes                 : 34684                       |
| number of cells                 : 177563                      |
| number of cell per group        : 100                         |
| number of boundary nodes        : 9436                        |
| number of periodic nodes        : 2592                        |
| number of axi-periodic nodes    : 0                           |
| Type of axi-periodicity         : 3D                          |
|_______________________________________________________________|
|                      After partitioning                       |
|_______________________________________________________________|
| number of nodes                 : 40538                       |
| extra nodes due to partitioning : 5854 [+ 16.88%]             |
|_______________________________________________________________|
 (...)
 ----> Starts the temporal loop.
```
```
 Iteration #      Time-step [s]        Total time [s]       Iter/sec [s-1]
           1   0.515455426286E-06   0.515455426286E-06   0.357797003731E+01
 ---> Storing isosurface: isoT
          50   0.517183938876E-06   0.258723148413E-04   0.184026441956E+02
         100   0.510691871555E-06   0.515103318496E-04   0.241920225176E+02
         150   0.517872233906E-06   0.772251848978E-04   0.239538511163E+02
         200   0.523273983650E-06   0.103271506216E-03   0.241928988318E+02
 (...)
       27296   0.547339278850E-06   0.150002454353E-01   0.241129046630E+02
 ----> Solution stored in file : ./SOLUT/solut_00000007_end.h5
 ---> Storing cut: sliceX
 ---> Storing cut: cylinder
 ---> Storing isosurface: isoT
 (...)
 ----> End computation.
 ___________________________________________________________________________
|  20 MPI tasks            Elapsed real time [s]   [s.cores]    [h.cores]  |
|__________________________________________________________________________|
| AVBP                   :       1137.27           0.2275E+05   0.6318E+01 |
| Temporal loop          :       1134.31           0.2269E+05   0.6302E+01 |
| Per iteration          :        0.0416           0.8311E+00              |
| Per iteration and node :    0.1198E-05           0.2396E-04              |
| Per iteration and cell :    0.2340E-06           0.4681E-05              |
|__________________________________________________________________________|
 ----> End of AVBP session
 ***** Maximum memory used : 12922716 B ( 0.123241E+02 MB)
```
The log file is your absolute compass. Let’s break it down:
The header provides the exact version of the code. If you ask for help or support, please include this header in your communications.
```
 Using branch : D7_0
 Version date : Mon Jan 5 22:03:30 2015 +0100
 Last commit  : 9ae0dd172d8145496e8d62f8bd34f25ae2595956
 Computation #1/1
 AVBP version : 7.0 beta
```
The pre-processing is everything before the line:

```
----> Starts the temporal loop.
```
It covers all the actions AVBP takes before actually solving the Navier-Stokes equations, and shows what the code understood from your setup. Reading it attentively often gives clues.
For example, as the mesh is read, this is where you may detect that your mesh was in millimeters instead of meters.
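If the pre-processing report prints the mesh extents, a quick heuristic can flag such a unit mistake. This is a minimal sketch, not part of AVBP: the function name and the x1000 assumption are ours.

```python
def looks_like_millimeters(extent, expected_size_m=1.0):
    """Heuristic: if the extent reported by the pre-processing is roughly
    1000x larger than the size you expect in meters, the mesh was probably
    exported in millimeters."""
    ratio = extent / expected_size_m
    # Accept one decade of slack around the x1000 mm-vs-m factor.
    return 100.0 <= ratio <= 10000.0


# A 5 cm chamber whose mesh reports a 50-unit extent: suspicious.
print(looks_like_millimeters(50.0, expected_size_m=0.05))  # True
```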
Some actions may be long, or even ridiculously long (computing wall distances, partitioning). If you see that one of these steps takes 5% of your total job time, you can probably do better: AVBP often offers to read the pre-processing data of a previous run instead of regenerating it for each job.
The temporal loop is between these lines:

```
----> Starts the temporal loop.
(...)
----> End computation.
```
Pay attention to the time-step. You may discover there that it is perfectly constant (fixed time-step, not always optimal and potentially unstable) or dropping dramatically.
Pay attention also to the amount of post-processing you generate. The frequency of outputs should be adapted both to your characteristic time and to your simulation time.
- The characteristic time t_char is configuration dependent. It can be a distance divided by a velocity, or a volume divided by a volumetric flow rate (mass flow rate over density).
- The simulation time t_sim is the physical time you can simulate in one run.
Once you know these two quantities, check that you have enough, but not too many, outputs:
- temporal outputs: every iteration for the first 1000 iterations, then maybe 10 000 to 100 000 over your t_sim
- instantaneous solutions: more than 100 in one t_char is probably a waste.
- iso-surfaces: more than 100 in one t_char is probably a waste.
- cuts: more than 1000 in one t_char is probably a waste.
- averages: there is no point in having more than one average per t_char, with an averaging frequency of 20/t_char.
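The guidelines above boil down to simple ratios. A small helper, hypothetical but handy, to size your output budget before launching:

```python
def output_budget(t_char, t_sim, dt_output):
    """Count outputs of one type, given the output period dt_output [s],
    the characteristic time t_char [s] and the simulated time t_sim [s]."""
    per_tchar = t_char / dt_output  # outputs per characteristic time
    total = t_sim / dt_output       # outputs over the whole run
    return per_tchar, total


# Hypothetical numbers: t_char = 1 ms, t_sim = 15 ms, one cut every 5 us.
per_tchar, total = output_budget(1e-3, 15e-3, 5e-6)
print(per_tchar, total)  # ~200 cuts per t_char, ~3000 in total
```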
After the temporal loop, some HPC monitoring is available; do not skip it! Try to identify what a fair performance is for your own case. Compare with your comrades, and ask your supervisor.
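The reduced costs of the end-of-log table can be recomputed by hand to compare runs. A sketch using the numbers of the example log above (34684 nodes, 177563 cells, 20 cores); the function and key names are ours:

```python
def reduced_costs(elapsed_s, n_cores, n_iter, n_nodes, n_cells):
    """Reduced HPC costs, mirroring the end-of-log summary table."""
    per_iter = elapsed_s / n_iter   # [s] per iteration
    core_s = per_iter * n_cores     # [s.cores] per iteration
    return {
        "per_iter_s": per_iter,                    # ~0.0416 s here
        "per_iter_node_s": per_iter / n_nodes,     # ~0.12E-05 s
        "per_iter_cell_core_s": core_s / n_cells,  # ~0.47E-05 s.cores
    }


print(reduced_costs(1134.31, 20, 27296, 34684, 177563))
```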
Remember that AVBP needs at least 10 000 cells per process. (Asking for 16 processes on a 2D 10x100 Cartesian grid is a no-no.)
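The 10 000 cells/process rule of thumb translates directly into a maximum process count. A hypothetical one-line check:

```python
def max_processes(n_cells, min_cells_per_proc=10_000):
    """Largest MPI process count honoring the ~10 000 cells/process rule."""
    return max(1, n_cells // min_cells_per_proc)


print(max_processes(177563))  # 17 processes for the example mesh above
print(max_processes(1000))    # 1: the 2D 10x100 grid is a serial job
```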
When your computation ends with an error, take your time reading it. Here are some of the most common error messages:
This is the most common error found in the temporal loop. It occurs when the fluid is going out of the nominal range of operation (Temperature > 5000K, or Temperature < 0K).
```
     1294803   0.111838215685E-08   0.137970567894E-02   0.217249658544E+01
     1294813   0.111838150869E-08   0.137971686275E-02   0.217847536081E+01
     1294823   0.111838077592E-08   0.137972804657E-02   0.217845113224E+01
 >>>>> ERROR detected in subroutine temperature
 >>>>> Temperature problem at local node 7155, x = 0.57287906E-01, y = 0.28629386E-01, z = -0.17347235E-15
 >>>>> ERROR detected in subroutine temperature
 >>>>> Temperature problem at local node 7602, x = 0.56900699E-01, y = 0.29165336E-01, z = 0.97071109E-02
 >>>>> ERROR detected in subroutine temperature
 >>>>> Temperature problem at local node 4912, x = 0.56900699E-01, y = 0.29165336E-01, z = 0.97071109E-02
```
Use the error locations given (e.g. x = 0.56900699E-01, y = 0.29165336E-01, z = 0.97071109E-02) to pinpoint the position of your problem in the 3D geometry. You can use an iso-surface of temperature or pressure on the instantaneous solut_crash.sol.h5 file to help in your search.
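In practice you would load the coordinates and temperature field from solut_crash.sol.h5 (with h5py, for instance) and scan them. The filtering logic itself is simple; here is a sketch on in-memory lists (the dataset values below are hypothetical):

```python
def out_of_range_nodes(coords, temperatures, t_min=0.0, t_max=5000.0):
    """Return (node index, coordinates, T) for nodes outside the nominal
    temperature range, mimicking AVBP's 'Temperature problem' report."""
    return [
        (i, xyz, t)
        for i, (xyz, t) in enumerate(zip(coords, temperatures))
        if not (t_min < t < t_max)
    ]


# Tiny hypothetical dataset: node 2 has diverged.
coords = [(0.0, 0.0, 0.0), (0.057, 0.029, 0.0), (0.0569, 0.0292, 0.0097)]
temps = [300.0, 1800.0, 6200.0]
print(out_of_range_nodes(coords, temps))  # [(2, (0.0569, 0.0292, 0.0097), 6200.0)]
```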
When the error is unexpected, AVBP gives you a “backtrace”, which looks like this:
```
AVBP_V75RGbeta.TA  0000000000E1B4ED  Unknown            Unknown  Unknown
libpthread-2.17.s  00007FB79509C630  Unknown            Unknown  Unknown
AVBP_V75RGbeta.TA  0000000000783636  mod_collective_in      157  mod_collective_inputs.f90
AVBP_V75RGbeta.TA  0000000000789BC7  mod_collective_in     1374  mod_collective_inputs.f90
AVBP_V75RGbeta.TA  00000000004D9179  mod_pmesh_load_so      623  mod_pmesh_load_solut.f90
AVBP_V75RGbeta.TA  00000000004D680A  mod_pmesh_load_so      700  mod_pmesh_load_solut.f90
AVBP_V75RGbeta.TA  000000000044F6CC  slave_pre_tempora      158  slave_pre_temporal.f90
AVBP_V75RGbeta.TA  0000000000412163  slave_                  68  slave.f90
AVBP_V75RGbeta.TA  0000000000411A1B  avbp_                  162  avbp.f90
AVBP_V75RGbeta.TA  000000000041156D  MAIN__                  27  avbp_main.f90
AVBP_V75RGbeta.TA  000000000041151E  Unknown            Unknown  Unknown
libc-2.17.so       00007FB794CE1555  __libc_start_main  Unknown  Unknown
AVBP_V75RGbeta.TA  0000000000411429  Unknown            Unknown  Unknown

mod_pmesh_load_solut.f90:623 =>
CALL grid%inputs%read_vector_set ( hdf_groupname(17),hdf_nr_nscbc_setname(1),1,nr_nscbc_value(1) )
```
As with many error messages, you should start from the end, where the error occurred, and move up through the message to get the context from the previous calls.
Here the error occurred in the file mod_pmesh_load_solut.f90, at line 623. The line itself is even provided:

```
CALL grid%inputs%read_vector_set ( hdf_groupname(17),hdf_nr_nscbc_setname(1),1,nr_nscbc_value(1) )
```

Or, for humans: “while loading a solution, reading a vector related to NSCBC values failed”.
A SEGFAULT means the software tried to access memory that does not fit what was expected. This is usually due to a wrong dimension in an array.
A word on the WRITE(*,*) black magic: a WRITE statement also asks for a bit of memory. Since the memory layout changes, adding a WRITE statement can “mute” the SEGFAULT, but the problem is not fixed: the computer feeds your algorithm with the data it finds at the memory address you gave, probably garbage.
A real-life SEGFAULT looks like:
```
**************************************************
*********************Species**********************
**************************************************
 ERROR Problem while running command :/scratch/cfd/rossi/AVBP/dev/HOST/KRAKEN/BIN/check_perio.e_KRAKEN ../RUN_CURRENT
=============StdErr=================
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line     Source
check_perio.e_KRA  00000000008CADAD  Unknown            Unknown  Unknown
libpthread-2.17.s  00002B562BFB55D0  Unknown            Unknown  Unknown
check_perio.e_KRA  0000000000412F16  MAIN__                 355  check_perio.f90
check_perio.e_KRA  000000000040F7DE  Unknown            Unknown  Unknown
libc-2.17.so       00002B562C1E43D5  __libc_start_main  Unknown  Unknown
check_perio.e_KRA  000000000040F6E9  Unknown            Unknown  Unknown
```
Again, do not panic, but take a closer look: the very line of your problem is given.
In the source we can read:

```
check_perio.f90, l. 355:  diff = ABS( grid%w_spec(k,n1) - grid%w_spec(k,n2) )
```
This narrows your search for the dimension mismatch down to two origins:
- The number of species (index k)
- The number of nodes (indices n1 and n2)
The temporal tool xm (documentation) gives you the temporal evolution of global quantities. See first whether the kinetic energy is reaching, or has reached, a quasi-steady state for steady computations.
Pay attention to the extrema in pressure and temperature, which can be quite different from the space average.
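"Quasi-steady" can be made concrete: the signal stays within a small band around its recent mean. A hedged sketch, with an arbitrary window and tolerance that you should tune to your case:

```python
import math


def is_quasi_steady(values, window=50, tol=0.02):
    """True if the last `window` samples of a temporal signal (e.g. the
    kinetic energy from xm) stay within +/- tol of their mean."""
    tail = values[-window:]
    mean = sum(tail) / len(tail)
    return all(abs(v - mean) <= tol * abs(mean) for v in tail)


# Hypothetical signal: a transient that settles around 1.0.
signal = [1.0 + math.exp(-0.1 * i) for i in range(200)]
print(is_quasi_steady(signal))  # True
```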
The temporal tool x_mc_exact (documentation) gives you the temporal evolution on the boundaries. You can check there that your inflow is reaching the operating point you want. If not, there are three options:
- The target is wrong
- The relaxation coefficient is not strong enough to impose the target. In other words, a stronger phenomenon prevents the BC from converging to its operating point.
- A numerical correction is activated and is no longer negligible.
The temporal tool xt (documentation) gives insights on the time-step evolution. Be careful with the interpretation, because the tool gives information at the cell level, while the actual time-step is known only later. Dramatic drops of the time-step are especially visible there.
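Such a drop can also be detected automatically once the iteration and time-step columns are extracted from the log. A minimal sketch on already-parsed values (the drop threshold is an assumption of ours):

```python
def timestep_drops(iterations, timesteps, drop_factor=2.0):
    """Flag iterations where the time-step fell by more than `drop_factor`
    below the running maximum."""
    drops, running_max = [], 0.0
    for it, dt in zip(iterations, timesteps):
        running_max = max(running_max, dt)
        if dt * drop_factor < running_max:
            drops.append((it, dt))
    return drops


# Hypothetical column extract: a sudden drop at iteration 150.
its = [1, 50, 100, 150]
dts = [5.15e-7, 5.17e-7, 5.10e-7, 1.1e-9]
print(timestep_drops(its, dts))  # [(150, 1.1e-09)]
```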
This simple tool xu (documentation) shows the highest and lowest velocity on a patch. Outlets are negative and inlets are positive.
Make sure that no boundary involves both a positive AND a negative velocity; you would probably be injecting garbage into your simulation…
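This sanity check is easy to automate once the per-patch extrema from xu are collected. A sketch on hypothetical values:

```python
def mixed_sign_patches(patch_velocities):
    """Given {patch name: (min velocity, max velocity)}, return the patches
    where flow enters AND leaves at the same time."""
    return [
        name for name, (vmin, vmax) in patch_velocities.items()
        if vmin < 0.0 < vmax
    ]


patches = {
    "CanyonBottom": (2.0, 15.0),   # pure inlet
    "Outlet": (-30.0, -5.0),       # pure outlet
    "SideVent": (-3.0, 4.0),       # suspicious: both signs
}
print(mixed_sign_patches(patches))  # ['SideVent']
```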
Visualizing your flow is useful even if the run seems to behave well. Remember, however, that your sight will be limited by your saving frequency. With a saving frequency too high, you will see a slow-motion evolution of your flow and pay attention to details that do not matter. With a saving frequency too low, you may miss some big motions in the intervals between your snapshots. Here again, you cannot visualize the flow without keeping a clear picture of the characteristic times involved.
A common mistake is looking only at cutting planes at important positions (constant radius, middle plane, etc.). A lot occurs outside of these planes. Start by looking at what is happening on the external frontier, and see if every part of it makes sense to you.
Alternate between pressure and velocity fields to build a correct mental image of the motion: the right-hand side of Navier-Stokes involves both the pressure gradient and viscous effects.
Look at the additional fields of your solutions to get more accurate information. For example:
- zeta_p is the artificial viscosity sensor. High levels mean that your LES is having a hard time at these positions.
- dt_visual is the local time step. Use this one to find the place that limits your resolution speed: a combination of cell size, flow velocity and sound speed.
Make sure that your save_solution.additional allows the storage of these variables.
Moreover, a large collection of field-specific packages is also available on averaged solutions, triggered by save_average.packages. Be sure to include the relevant packages for your situation.
- Try to reproduce your problem on a smaller mesh: a short time-to-solution is your timesaver. Is your CAD something unusual? Remember that HIP can help you drastically reduce your mesh in one line (constant coarsening with HIP).
- If you feel lucky, de-activate the models and peculiarities of your setup one by one.
- If you want to play it safe, restart from a simple case that works, and re-enable, one by one, the models and specificities of your configuration. Yes, a gaseous run with only walls and no reaction is a good starting point.
- Check your executables. Make sure you are using the version you expect, in its nominal state. In particular, if you compiled the code yourself, rebuilding the executable with

```
> make clean ; make
```

will never hurt.