A Hitchhiker’s guide to pre- and post-processing using Python at Cerfacs

Always code as if the person who ends up maintaining your code is a violent psychopath who knows where you live. - John Woods.

If you work at Cerfacs, you will eventually have to bring your creations to the community. This article will give you several ideas to avoid frustrating moments.

Foreword

If you want to plot a result using Matplotlib on Kraken, be aware that you may experience some backend trouble. For more information about Matplotlib backends, see the Matplotlib documentation.

When trying to plot a figure with Matplotlib on Kraken, if you get a backend error, installing PyQt5 can solve your problem: just run pip install PyQt5 in your virtual environment.
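If the figure only needs to end up in a file (a frequent case on a cluster), a minimal workaround is to select a non-interactive backend before importing pyplot:

import matplotlib
matplotlib.use("Agg")  # render to files, no display server required
import matplotlib.pyplot as plt

plt.plot([0, 1, 2], [0, 1, 4])
plt.savefig("result.png")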

Keep it Single purpose

This aspect is the most important, because a multi-purpose code is much harder to maintain. Let us see an example:

Bob’s code was making several types of averaging: 2-D fields averaged in time, 1-D averaged in space and time, 0-D both in space and time, 0-D in space but time dependent, etc. IF statements allowed skipping some of the averaging steps.

Then Bob hit a wall. As everything was tightly coupled, a single change could silently break the whole pipeline. At the same time, there were too many cases to memorize while reading. Refactoring was more expensive than rewriting from scratch.

Bob could not ship this code, which died, in the rain.

Some Cerfacs codes were never shipped simply because of their complexity.

Make the data flow obvious

If you need to produce several outputs at the same time, your code should show it through the data flow. If we take Bob’s code as an example, instead of a big bundle of smart coding in the same scope returning several outputs:

def myawesomecode(...):
    ...
    return avg2D, avg1D, avg0D_alltimes, avg0D

One should prefer:

avg2D = myawesomeavg_2D(...)
avg1D = myawesomeavg_1D(avg2D)
avg0D = myawesomeavg_0D(avg2D)
avg0D_alltimes = myawesomeavg_alltimes(avg0D)

Monitor your code with pylint and radon

A good Pylint score is a clue that the reading difficulty of the code is low. Read our article Is pylint adapted to scientific software crafting? for further reference.

If you feel your code is a bit too smart, you can easily measure its complexity with radon. Complexity dramatically increases when you cover many cases in the same code, and this can be bad news. In CFD, pay particular attention to code dealing with the various boundary conditions and/or the various models implemented.
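A possible invocation of both tools, assuming they are installed in your environment:

> pip install pylint radon
> pylint mymodule.py            # gives a score out of 10
> radon cc mymodule.py -a -s    # cyclomatic complexity per function, ranked A (simple) to F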

“Reduce complexity. The single most important reason to create a routine is to reduce a program’s complexity. Create a routine to hide information so that you won’t need to think about it.”― Steve McConnell, Code Complete

Handle errors

Prefer raising an exception to stopping the process (sys.exit()). A higher-level tool may use your code; leave it the freedom to decide whether the error is critical or not.
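A minimal sketch of the difference (load_field and its message are hypothetical):

import sys

def load_field_bad(sol, name):
    if name not in sol:
        sys.exit("field missing")  # kills the whole process, even the caller's

def load_field_good(sol, name):
    if name not in sol:
        # the caller decides whether this is fatal
        raise KeyError(f"Field {name} not found, available: {list(sol)}")
    return sol[name]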

Design informative and positive error messages. You can read these error message do’s and don’ts for further advice.

Beware of sanity checks

In compiled languages like Fortran, runtime errors are usually cryptic. This is why programmers add sanity checks, to catch errors with an explicit message. This is called the “Look Before You Leap” (LBYL) strategy. Unfortunately, these lines can introduce new causes of crashes. An example is needed here:

Bob’s code computes smart stuff from the output of a large software package. It needs two variables, T and P. He also uses a global sanity check to verify that the input binary file matches his nominal test case file, which contains U, V, W, T, P.

Enter Brenda, working on a 2-D case where there is no third dimension (no W). Here, the sanity check fails. Brenda says “there is a bug” to Bob.

Enter Brian. His case involves other phenomena. The output file features U, V, W, T, P, Rho, T. Again, the sanity check fails. Brian says “there is a bug” to Bob.

Enter Bambi. She uses a wrong file. The sanity check says “T and P are missing”. Bambi is in a hurry and asks Bob “there is a bug, can you help me?”.

It is easy to say “an error has occurred”, much harder to say what the user should do. Meanwhile, the code is bound to get more complex to cover all the possible cases where the check was never needed. Maintaining such a code is expensive in the long run. Sanity checks are expensive. Keep only the very smart ones, if any.
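A hedged sketch of such a brittle check (names and message are hypothetical):

NOMINAL_FIELDS = ["U", "V", "W", "T", "P"]

def sanity_check(fields):
    # fails for Brenda (no W in 2-D), fails for Brian (extra fields),
    # and stays unhelpful for Bambi (wrong file entirely)
    if sorted(fields) != sorted(NOMINAL_FIELDS):
        raise RuntimeError("Input file does not match the nominal test case")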

Easier to Ask forgiveness than permission

In high-level languages like Python, you can do things differently; this is called “Easier to Ask Forgiveness than Permission” (EAFP).

Indeed, there is a smart error manager giving quite good and informative error messages for free. It is called the traceback, and it usually ends with an Exception:

Traceback (most recent call last):
  File "acces_with_h5.py", line 39, in <module>
    sol['Average']['P'] *= 2
  File "/Users/dauptain/Python_envs/dev_opentea/lib/python3.8/site-packages/hdfdict-0.3.1-py3.8.egg/hdfdict/hdfdict.py", line 66, in __getitem__
    item = super().__getitem__(key)
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/collections/__init__.py", line 1010, in __getitem__
    raise KeyError(key)
KeyError: 'P'

Once the exception is known,

  1. you say to the angry user “Ah, sorry”.
  2. you prettify the message (RuntimeError: Field Average/P not found in this file. Aborting):

    try:
        sol['Average']['P'] *= 2
    except KeyError as err:
        raise RuntimeError("Field Average/P not found in this file. Aborting") from err

Read and write small data in ASCII

If you need a human-readable/writable file for your inputs or outputs, use one of the standard serialization formats. By doing so:

  • you will not have to write, explain and maintain your own parser.
  • you can rely on a huge amount of helper tools online.
  • standard formats come with impressive features out of reach for a single programmer: automatic type conversion, dynamic structures, automatic checking, and many more.

Let us introduce several solutions:

CSV

If you have a simple table of numbers, the CSV format is simple, Excel-friendly and widely supported. You can use the built-in Python csv writer, the NumPy writer, or the pandas writer, and their respective readers.

A small example:

Serial Number,Company Name,Employee Markme,Description,Leave
9788189999599,TALES OF SHIVA,Mark,mark,0
9780099578079,1Q84 THE COMPLETE TRILOGY,HARUKI MURAKAMI,Mark,0
9780198082897,MY KUMAN,Mark,Mark,0
9780007880331,THE GOD OF SMAAL THINGS,ARUNDHATI ROY,4TH HARPER COLLINS,2
9780545060455,THE BLACK CIRCLE,Mark,4TH HARPER COLLINS,0

This kind of data is particularly adapted to pandas. To install pandas:

pip install pandas

To read, print, edit, and write:

import pandas as pd

# read file
data = pd.read_csv('in.csv')  

# print data
print(data)

# edit data
data["Serial Number"] += 1

# dump data
data.to_csv("data_out.csv")
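The NumPy writer mentioned above also works for purely numeric columns; a minimal sketch (the column indices assume the example file above):

import numpy as np

# keep only the numeric columns (serial number, leave count), skip the header
arr = np.loadtxt("in.csv", delimiter=",", skiprows=1, usecols=(0, 4))
np.savetxt("out.csv", arr, delimiter=",", fmt="%d")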
YAML

For more complex structures with various shapes, you can rely on YAML, which is very widely supported (but a bit slow). The fact that blank lines and comments are allowed really helps understanding.

# Enable YAML syntax highlighting in your IDE for the best experience.

# input files
inst_solut: ./dummy.sol.h5
mesh: ./dummy.mesh.h5
inst_solut_output: ./gasouted.sol.h5

# list (each - block) of actions to take
actions:
- type: directional_linear_mask 
  direction: "x"             #direction 0: x , 1: y, 2: z
  max_abcissa: 0.11        #If None, takes  70% of CAO size
  min_abcissa: 0.1         #If None, takes  30% of CAO size
  new_pressure: null       # use null if you don't want to edit this field
  new_temperature: 2000.0  # Temp in K.
  new_yk: null

- type: spherical_tanh_mask
  center: [0.1, 0.1, 0]    # center sphere. If None, takes CAO center.
  rad_0: 0.01              # radius sphere [m]. If None, takes from CAO size
  delta: 0.05              # transition thickness [m], If None, takes half rad_0
  new_pressure: null
  new_temperature: 1600.
  new_yk:                  # dictionary of species Yk
    N2: 0.8                # order and completeness do not matter
    O2: 0.2                # sum MUST be 1!

To install:

> pip install pyyaml

To read, access, edit and write the content of a file:

import yaml

# read the data
with open('in.yml', "r") as fin:
    data = yaml.load(fin, Loader=yaml.FullLoader)

# access one element of the data
print(data["actions"][1]["new_yk"])

# change one element
data["mesh"] = "awesomemesh.h5"

# dump the new data
with open('out.yml', "w") as fout:
    yaml.dump(data, fout)
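Note: if the file comes from an untrusted source, yaml.safe_load is the safer variant:

import yaml

with open('in.yml', "r") as fin:
    data = yaml.safe_load(fin)  # refuses to build arbitrary Python objects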
NOB: serialize into JSON

If your data is extremely complex, a nested mix of lists, dictionaries and even NumPy arrays, maybe you could take a look at NOB, the nested object handler of COOP. Read (resp. write) Nob objects using the .np_deserialize() (resp. .np_serialize()) method.

Read and write big binary files using HDF5

There are several ways to read HDF5. We will look at four of them from the lowest level to the highest level.

Ehh, I just want to peek

Often you just want some info without coding. You can use one of the many HDF5 utilities such as h5ls, h5dump or hdfview. Let us boast a bit about a COOP creation, h5nav, which lets you navigate your file unix-style, then fetch your info:

>h5nav CAS01/RUN_001/AVE/ave_00000030.h5 
Welcome to the h5nav command line (V 0.1.6)
Type help or ? for a list of commands,
     ?about for more on this app
h5nav .../ave_00000030.h5/ > ls
Average/ Parameters/
h5nav .../ave_00000030.h5/ > cd Average
h5nav .../ave_00000030.h5/Average/ > ls
P P2 T T2 Y2CO Y2CO2 Y2H2O Y2KERO_LUCHE Y2N2 Y2O2 YCO YCO2 YH2O YKERO_LUCHE YN2 YO2 gamma_bar r_bar rho ss_bar u u2 v v2 w w2
h5nav .../ave_00000030.h5/Average/ > stats T
Type           mean +/- std*2       [        min, max        ] (Shape)
--------------------------------------------------------------------------
float64  6.6647e+02 +/-  9.2180e+02 [ 2.8901e+02,  1.9253e+03] (8288419,)

If you prefer global information, we have another COOP creation, h5cross, to scan and compare two HDF5 files. In the present case, the stats mode is all you need:

>h5cross stats AVE/ave_00000018_end.h5 
+---------------------------+---------------------+--------------------+------------------------+--------------------+--------------------+
| Dataset                   |         min         |        mean        |          max           |       st dev       |       median       |
+---------------------------+---------------------+--------------------+------------------------+--------------------+--------------------+
| /Average/HR               | -3086049827.3707995 | 170894033.6285233  |    65031763525.5007    | 905016697.1607009  |   15343.1318443    |
| /Average/P                |   339110.78174958   |  400655.31584555   |    422259.15154229     |   3078.97788994    |  399689.99195807   |
| /Average/P2               |  114996405055.76947 | 160534565811.71048 |   178305849630.52454   | 2474437551.6456594 | 159752521173.54858 |
| /Average/PHR              | -1232686987454650.2 | 68308707105944.75  | 2.5981158707509748e+16 | 361822226147803.0  | 6119571547.631945  |
| /Average/T                |     319.59674548    |   1300.33782424    |     2528.98693507      |    589.01787434    |   1330.14476087    |
| /Average/T2               |    102142.0832204   |  2070547.27483687  |    6395779.09016946    |  1688856.28945759  |  1809831.96637881  |
| /Average/Y2CO             |         0.0         |     0.00054164     |       0.00697124       |     0.00143603     |        0.0         |
| /Average/Y2CO2            |         0.0         |     0.00595402     |       0.0283705        |     0.00681201     |     0.00391219     |
| /Average/Y2H2O            |         0.0         |     0.00175708     |       0.00929521       |     0.00255393     |     0.00066479     |
| /Average/Y2KERO_LUCHE     |         0.0         |     0.00268532     |       1.00408617       |     0.02129689     |        0.0         |
| /Average/Y2N2             |        2e-08        |     0.54989768     |       0.58923636       |     0.05945364     |     0.56549275     |
| /Average/Y2O2             |         0.0         |     0.02995375     |       0.05453285       |     0.02014759     |     0.0266755      |
| /Average/YCO              |      -0.0001162     |     0.00882507     |       0.0828723        |     0.02075583     |      2.8e-07       |
| /Average/YCO2             |     -0.00044873     |     0.05657467     |       0.16842254       |     0.0497323      |     0.05973481     |
| /Average/YH2O             |     -0.00018367     |     0.02883394     |       0.09639336       |     0.02929298     |     0.02461185     |
| /Average/YKERO_LUCHE      |     -0.00080482     |     0.01265918     |        1.002041        |     0.04847676     |       -5e-08       |
| /Average/YN2              |     -0.00156545     |     0.74007008     |       0.76761731       |     0.04565138     |     0.7519775      |
| /Average/YO2              |     -0.00047555     |     0.15303707     |       0.23352268       |     0.07795984     |     0.16176179     |
| /Average/gamma_bar        |      1.03422559     |     1.30653757     |       1.36953937       |     0.05664241     |     1.30193652     |
| /Average/ksi              |     -0.00080485     |     0.03465032     |          1.0           |     0.05922417     |     0.01918564     |
| /Average/phi              |      -0.0120734     |    11.17464054     |    1501373.57464976    |    3662.2584582    |     0.29426107     |
| /Average/q2_01            |         0.0         |   80043.52321728   |   449747179.41849583   |  1916669.22563046  |        0.0         |
| /Average/q2_02            |         0.0         |  6809726.3341922   |   27212249454.92934    | 92992696.56978296  |     1.48642386     |
| /Average/q_01             |         0.0         |    28.09673787     |     17308.75096788     |    173.21254089    |        0.0         |
| /Average/q_02             |   -12110.42777223   |    271.20910404    |    159993.24698726     |   1655.78093793    |     0.00368811     |
| /Average/r_bar            |     60.13169424     |    286.79621898    |      298.53762146      |     9.76395706     |    288.28884759    |
| /Average/rho              |      0.53646641     |     1.37465047     |      21.91834814       |     0.7083422      |     1.06759107     |
| /Average/ss_bar           |     141.00768707    |    675.26301683    |      964.65334981      |    148.96910232    |    702.19539274    |
| /Average/u                |    -145.22385972    |     4.60281819     |      163.97780619      |    28.22784398     |    -0.70743543     |
| /Average/u2               |         0.0         |    874.04106501    |     26889.65786448     |   1416.17202391    |    195.52479074    |
| /Average/v                |    -123.01736988    |     -8.2968666     |      163.20288453      |    17.02120361     |    -4.28127887     |
| /Average/v2               |         0.0         |    404.67864268    |     26635.62158816     |    925.76104737    |    101.1098022     |
| /Average/w                |    -118.51770246    |    20.55499835     |      387.40234306      |    16.83837715     |    18.69572502     |
| /Average/w2               |         0.0         |    769.14037649    |    150085.72742717     |    887.38456293    |    457.59152389    |
| /Parameters/dtsum         |          NA         |         NA         |           NA           |         NA         |         NA         |
| /Parameters/gitid         |          NA         |         NA         |           NA           |         NA         |         NA         |
| /Parameters/nit_av        |          NA         |         NA         |           NA           |         NA         |         NA         |
| /Parameters/niter         |          NA         |         NA         |           NA           |         NA         |         NA         |
| /Parameters/nnode         |          NA         |         NA         |           NA           |         NA         |         NA         |
| /Parameters/t_av          |          NA         |         NA         |           NA           |         NA         |         NA         |
| /Parameters/versionstring |          NA         |         NA         |           NA           |         NA         |         NA         |
+---------------------------+---------------------+--------------------+------------------------+--------------------+--------------------+

Using h5py

The h5py package is the Python binding to the HDF5 library. Use this low-level approach if you want to read only a few fields, without writing.

import h5py
with h5py.File("CAS01/RUN_001/AVE/ave_00000030.h5", "r") as h5pf:
    awesomefield = h5pf['Average/T'][()]
    print(awesomefield.max())

Using h5py is low level: powerful but often clunky and dangerous. For example, if you open the file in write mode instead of read mode ("w" instead of "r" in the h5py.File() call), you will erase the content of your file. See the online h5py documentation for more info.

You can build smart things over h5py, but it will probably end up looking like hdfdict or h5py-wrapper.

For the record, here is the reading, editing and writing using h5py. Some of the parts are really h5py-ninja style. Note in particular the peculiar aspects of:

  • the iterator .visit()
  • the access-by-value syntax data[...]

import h5py

h5file = "CAS01/RUN_001/AVE/ave_00000030.h5"
#h5file="./ave_00000002.h5"
all_h5_objs = []
fh5 = h5py.File(h5file, 'r')

# Visit h5 file
fh5.visit(all_h5_objs.append)

# Create output file
newh5 = h5py.File('out.h5', 'w')

# Get all datasets (includes group path)
all_datasets = [obj for obj in all_h5_objs if isinstance(fh5[obj], h5py.Dataset)]

for obj in all_datasets:
    data = fh5[obj]
    if obj == "Average/T":
        # edit this field on the fly: multiply by two
        dset = newh5.create_dataset(obj, data=2 * data[...])
    else:
        dset = newh5.create_dataset(obj, data=data)
newh5.close()
fh5.close()

Using Hdfdict

This package exposes the content as a Python dictionary. Use this higher-level approach if you want to read all fields, edit them or change the structure the way you want, then write them back.

import h5py
import hdfdict

with h5py.File("CAS01/RUN_001/AVE/ave_00000030.h5", "r") as h5pf:
    sol = hdfdict.load(h5pf, lazy=False)
    awesomefield = sol['Average']['T']
    sol['Average']['T'] *= 2
    with h5py.File("out.h5", 'w') as fout:
        hdfdict.dump(sol, fout)

The reading is slower because all fields are loaded into memory. You can go faster with lazy=True. On the other hand, it is easy to edit the fields or the structure using common Python objects.

Using Antares

Antares is a Cerfacs Python project limited to Airbus and Safran perusal. This is a much higher-level approach. You will need Antares if you are in one or more of the following cases:

  • you want to work on the same field at multiple instants.
  • your field is distributed on multiple zones (i.e. connected meshes).
  • you need to apply an “Antares treatment”.
  • you want to apply an “Antares compute” for a pre-recorded quantity.

The same example as the previous one would look like:

import antares as ant

ard = ant.Reader("hdf_avbp")
ard["filename"] = "CAS01/RUN_001/AVE/ave_00000030.h5"
base = ard.read()

base[0][0]['T'] *= 2

writer = ant.Writer('hdf_antares')
writer['filename'] = 'out' 
writer['base'] = base
writer.dump()

Note that out.h5, the output, will follow the built-in writer structure: you cannot just add an arbitrary field this way. In the same way, the built-in reader will ignore any addition to the structure, unless you upgrade the Antares reader, which is highly recommended.

Note: to our knowledge, performances are very similar between Antares and hdfdict. For example, a full copy of a file with a single edit, on 8.3 million nodes:

H5py          Hdfdict   Antares
2.411s        3.544s    3.731s
100% (ref.)   145%      153%

IMPORTANT NOTICE: these numbers are for a full execution, including the Python startup and import time. They can change a lot if you process many files, or larger files.

Bad habits found in Antares scripts

If you work on pre- and post-processing, you cannot ignore the Cerfacs post-processing suite Antares. First, remember this is a high-level tool with an incredible amount of features. So keep the Antares documentation in your bookmarks, and do read what is behind the features you are using.

The documentation tells you what to do, but we felt it necessary to highlight some anti-patterns:

Importing the galaxy

from antares import *

Never import Antares, nor any other library, with the wildcard *. If a function whateverthename() arises in your code, the reader has no way to understand where it comes from. There are lots of resources about why import star in Python is a bad idea.
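Prefer a named import, so every call site shows its origin:

import antares as ant

reader = ant.Reader("hdf_avbp")  # unambiguously an Antares object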

Antares bases are stateful

This pattern is widely used:

r = ant.Reader("hdf_avbp")
r["filename"] = "awesome.mesh.h5"
r["shared"] = True
base = r.read()

r = ant.Reader("hdf_avbp")
r["base"] = base
r["filename"] = "awesome_average.h5"
base = r.read()

The name base is assigned two times, giving the false impression that the first base will be erased (and that the garbage collector will free the memory, someday… hopefully).

Some people could then be tempted to build a new, different base on top of the initial one:

r = ant.Reader("hdf_avbp")
r["filename"] = "awesome.mesh.h5"
r["shared"] = True
base_meshonly = r.read()

r = ant.Reader("hdf_avbp")
r["base"] = base_meshonly
r["filename"] = "awesome_average.h5"
base_meshandsol = r.read()

Here, the reader might think that base_meshandsol is base_meshonly augmented with the solution. Unfortunately, this is wrong! The second read statement changes the first base. You can test the following:

> print(base_meshandsol is base_meshonly)
True

This will become a large problem if you want to use the initial base as a stepping stone to build several other bases: many names for the same object, and you never thought about it.

To clear up this misunderstanding, show clearly that your Antares base is updated by the Antares .read() (in other words, it is “stateful”), and not by the script. Simply remove the unnecessary base =. This will greatly help the people in charge of refactoring.

avbp_reader_mesh = ant.Reader("hdf_avbp")
avbp_reader_mesh["filename"] = "awesome.mesh.h5"
avbp_reader_mesh["shared"] = True
base = avbp_reader_mesh.read()

avbp_reader = ant.Reader("hdf_avbp")
avbp_reader["base"] = base
avbp_reader["filename"] = "awesome_average.h5"
avbp_reader.read()

Use Antares .compute() when needed

The .compute() is a powerful method, but you should use it wisely. This method:

  1. computes on all zones and instants (it loops on zones and time);
  2. comes with built-in variables (R, theta, etc.), see the equation manager documentation.

On the other hand, some NumPy/SciPy functions will not be available through the string-based .compute() method. Therefore, it is good practice to use .compute() only when needed, and to stick to usual NumPy/SciPy by default.

For example, in the following we need a kind of Heaviside function:

theta_threshold = "0.01"
base.compute("FLAME_N2 =(tau - " + theta_threshold + ")")
base.compute("FLAME_D2 =(tau - " + theta_threshold + ")")
base[0][0]["FLAME_D2"] = np.abs(base[0][0]["FLAME_D2"])
base.compute("FLAME =0.5*(FLAME_N2/FLAME_D2+1)")

The following is more explicit, and avoids divide-by-zero situations:

base[0][0]["flame"] = np.where(base[0][0]["tau"]>0.01, 1., 0.)

If you have to do such a stunt on multiple zones and multiple instants on a recurrent basis, please contribute it to Antares.

To be completed eventually…
