A Hitchhiker’s guide to pre- and post-processing using Python at Cerfacs
Always code as if the person who ends up maintaining your code is a violent psychopath who knows where you live. - John Woods.
If you work at Cerfacs, you will eventually have to bring your creations to the community. This article will give you several ideas to avoid frustrating moments.
Foreword
If you want to plot a result using Matplotlib on Kraken, be aware that you may experience some backend troubles. For more information about Matplotlib and backends, see the Matplotlib documentation.
If you get an error when trying to plot a figure with Matplotlib on Kraken, PyQt5 can solve your problem. Just run pip install PyQt5 in your virtual environment.
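Alternatively, if you only need to save figures to files (a common case on a cluster), a non-interactive backend sidesteps the GUI dependency entirely. A minimal sketch:

import matplotlib
matplotlib.use("Agg")  # headless backend: renders to files, needs no display
import matplotlib.pyplot as plt

plt.plot([0, 1], [0, 1])
plt.savefig("result.png")  # no window is opened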
Keep it Single purpose
This aspect is the most important, because a multi-purpose code is much harder to maintain. Let us see an example:
Bob’s code was doing several types of averaging: 2-D fields averaged in time, 1-D averaged in space and time, 0-D both in space and time, 0-D in space but time-dependent, etc. IF statements allowed the user to skip some of the averaging steps.
Then Bob hit a wall. Because everything was tightly coupled, a single change could silently break the whole pipeline. At the same time, there were too many cases to memorize while reading. Refactoring was more expensive than rewriting from scratch.
Bob could not ship this code, which died, in the rain.
Some Cerfacs codes were never shipped, solely because of their complexity.
Make the data flow obvious
If you need to produce several outputs at the same time, your code should show it through the data flow. If we take Bob’s code example, instead of a big bundle of smart coding in the same scope, returning several outputs:
def myawesomecode(...):
    ...
    return avg2D, avg1D, avg0D_alltimes, avg_0D
One should prefer:
avg2D = myawesomeavg_2D(...)
avg1D = myawesomeavg_1D(avg2D)
avg0D = myawesomeavg_0D(avg2D)
avg0D_alltimes = myawesomeavg_alltimes(avg0D)
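Each step now has a single, explicit input and output: the reader can follow the data from one call to the next, and every step can be tested or replaced in isolation.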
Monitor your code with pylint and radon
A good Pylint score is a clue that the reading difficulty of the code is low. Read our article Is pylint adapted to scientific software crafting? for further reference.
If you feel your code is a bit too smart, you can measure the complexity easily with radon. Complexity dramatically increases when you cover many cases in the same code, and this can be bad news. In CFD, pay particular attention to code focusing on the various boundary conditions, and/or the various models implemented.
“Reduce complexity. The single most important reason to create a routine is to reduce a program’s complexity. Create a routine to hide information so that you won’t need to think about it.”― Steve McConnell, Code Complete
Handle errors
Prefer raising an exception to stopping the process (sys.exit()). A higher-level tool may use your code; leave it the freedom to decide whether the error is critical or not.
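As a minimal sketch (the function names and message are ours, not from an actual Cerfacs tool):

import sys

def load_field_rude(sol, name):
    """Kills the whole process: a calling tool can do nothing about it."""
    if name not in sol:
        sys.exit("field not found")
    return sol[name]

def load_field_polite(sol, name):
    """Raises instead: the caller decides whether the error is critical."""
    if name not in sol:
        raise KeyError(f"Field {name} not found in the solution")
    return sol[name]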
Design informative and positive error messages. You can read these error-message do’s and don’ts for further advice.
Beware of sanity checks
In compiled languages like FORTRAN, runtime errors are usually cryptic. This is why programmers add sanity checks: to catch errors with an explicit message. This is called the “Look Before You Leap” (LBYL) strategy. Unfortunately, these lines can bring new crash causes. An example is needed here:
Bob’s code computes smart stuff from the output of a large software. It needs two variables, T and P. It also runs a global sanity check to verify that the input binary file matches his test case’s nominal file, which features U, V, W, T, P.
Enter Brenda, working on a 2-D case where there is no third dimension (no W). Here, the sanity check fails. Brenda says “there is a bug” to Bob.
Enter Brian. His case involves other phenomena. The output file features U, V, W, T, P, Rho, T. Again, the sanity check fails. Brian says “there is a bug” to Bob.
Enter Bambi. She uses a wrong file. The sanity check says “T and P are missing”. Bambi is in a hurry and asks Bob “there is a bug, can you help me?”.
It is easy to say “an error has occurred”, much harder to say what the user should do. Meanwhile, the code is bound to get more complex to cover all the possible cases where the check was never needed. Maintaining such a code is expensive in the long run. Sanity checks are expensive; keep only the very smart ones, if any.
Easier to Ask forgiveness than permission
In high-level languages like Python, you can do differently, and this is called “Easier to Ask forgiveness than permission” (EAFP).
Indeed, there is a smart error manager giving quite good and informative error messages for free. It is called the traceback, and it usually ends with an Exception:
Traceback (most recent call last):
File "acces_with_h5.py", line 39, in <module>
sol['Average']['P'] *= 2
File "/Users/dauptain/Python_envs/dev_opentea/lib/python3.8/site-packages/hdfdict-0.3.1-py3.8.egg/hdfdict/hdfdict.py", line 66, in __getitem__
item = super().__getitem__(key)
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/collections/__init__.py", line 1010, in __getitem__
raise KeyError(key)
KeyError: 'P'
Once the exception is known:
- you say to the angry user “Ah, sorry”;
- you prettify the message (e.g. RuntimeError: Field Average/P not found in this file. Aborting):
try:
    sol['Average']['P'] *= 2
except KeyError as err:
    raise RuntimeError("Field Average/P not found in this file. Aborting") from err
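The from err clause chains the exceptions: the prettified message is shown first, while the original low-level traceback stays attached for power users.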
Read and write small data in ASCII
If you need a human-readable/writable file for your inputs or outputs, use one of the standard serialization formats. By doing so:
- you will not have to write, explain and maintain your own parser.
- you can rely on a huge number of helper tools online.
- standard formats come with impressive features out of reach for a single programmer: automatic type conversion, dynamic structures, automatic checking, and many more.
Let us introduce several solutions:
CSV
If you have a simple table of numbers, the CSV format is simple, Excel-friendly and widely supported. You can use the built-in Python csv writer, the Numpy writer, or the pandas writer, and their respective readers.
A small example:
Serial Number,Company Name,Employee Markme,Description,Leave
9788189999599,TALES OF SHIVA,Mark,mark,0
9780099578079,1Q84 THE COMPLETE TRILOGY,HARUKI MURAKAMI,Mark,0
9780198082897,MY KUMAN,Mark,Mark,0
9780007880331,THE GOD OF SMAAL THINGS,ARUNDHATI ROY,4TH HARPER COLLINS,2
9780545060455,THE BLACK CIRCLE,Mark,4TH HARPER COLLINS,0
This kind of data is particularly adapted to pandas. To install pandas:
pip install pandas
To read, print, edit, and write:
import pandas as pd
# read file
data = pd.read_csv('in.csv')
# print data
print(data)
# edit data
data["Serial Number"] += 1
# dump data
data.to_csv("data_out.csv")
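Note that to_csv() also writes the row index as an extra first column by default; pass index=False if you want the output to keep the same shape as the input.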
YAML
For more complex structures with various shapes, you can rely on YAML, which is very widely supported (but a bit slow). The fact that blank lines and comments are allowed really helps readability.
# Enable YAML syntax highlighting in your IDE for the best experience.
# input files
inst_solut: ./dummy.sol.h5
mesh: ./dummy.mesh.h5
inst_solut_output: ./gasouted.sol.h5

# list (each - block) of actions to take
actions:
  - type: directional_linear_mask
    direction: "x"           # direction 0: x, 1: y, 2: z
    max_abcissa: 0.11        # If None, takes 70% of CAO size
    min_abcissa: 0.1         # If None, takes 30% of CAO size
    new_pressure: null       # use null if you don't want to edit this field
    new_temperature: 2000.0  # Temp in K.
    new_yk: null
  - type: spherical_tanh_mask
    center: [0.1, 0.1, 0]    # center of the sphere. If None, takes CAO center.
    rad_0: 0.01              # radius of the sphere [m]. If None, takes from CAO size
    delta: 0.05              # transition thickness [m]. If None, takes half rad_0
    new_pressure: null
    new_temperature: 1600.
    new_yk:                  # dictionary of species Yk
      N2: 0.8                # order does not matter
      O2: 0.2                # Sum MUST be 1!
To install:
pip install pyyaml
To read, access, edit and write the content of a file:
import yaml

# read the data
with open('in.yml', "r") as fin:
    data = yaml.load(fin, Loader=yaml.FullLoader)
# access one element of the data
print(data["actions"][1]["new_yk"])
# change one element
data["mesh"] = "awesomemesh.h5"
# dump the new data
with open('out.yml', "w") as fout:
    yaml.dump(data, fout)
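If your file only contains plain data (as above), yaml.safe_load is a safer drop-in replacement: it never constructs arbitrary Python objects. Note also that a YAML null becomes a Python None after loading.

with open('in.yml', "r") as fin:
    data = yaml.safe_load(fin)  # safe even for untrusted files, plain data only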
NOB serialize into JSON
If your data is extremely complex, a nested mix of lists, dictionaries and even Numpy arrays, maybe you could take a look at NOB, the nested object handler of COOP. Read (resp. write) Nob objects using the .np_deserialize() (resp. .np_serialize()) method.
Read and write big binary files using HDF5
There are several ways to read HDF5. We will look at four of them from the lowest level to the highest level.
Ehh, I just want to peek
Often you just want some info without coding. You can use one of the many HDF5 utilities, such as h5ls, h5dump, or hdfview.
Let us boast a bit about a COOP creation, h5nav, which lets you navigate your file Unix-style, then fetch your info:
>h5nav CAS01/RUN_001/AVE/ave_00000030.h5
Welcome to the h5nav command line (V 0.1.6)
Type help or ? for a list of commands,
?about for more on this app
h5nav .../ave_00000030.h5/ > ls
Average/ Parameters/
h5nav .../ave_00000030.h5/ > cd Average
h5nav .../ave_00000030.h5/Average/ > ls
P P2 T T2 Y2CO Y2CO2 Y2H2O Y2KERO_LUCHE Y2N2 Y2O2 YCO YCO2 YH2O YKERO_LUCHE YN2 YO2 gamma_bar r_bar rho ss_bar u u2 v v2 w w2
h5nav .../ave_00000030.h5/Average/ > stats T
Type mean +/- std*2 [ min, max ] (Shape)
--------------------------------------------------------------------------
float64 6.6647e+02 +/- 9.2180e+02 [ 2.8901e+02, 1.9253e+03] (8288419,)
If you prefer global information, we have another COOP creation, H5cross, to scan and compare two HDF5 files. In the present case, the stats mode is all you need:
>h5cross stats AVE/ave_00000018_end.h5
+---------------------------+---------------------+--------------------+------------------------+--------------------+--------------------+
| Dataset | min | mean | max | st dev | median |
+---------------------------+---------------------+--------------------+------------------------+--------------------+--------------------+
| /Average/HR | -3086049827.3707995 | 170894033.6285233 | 65031763525.5007 | 905016697.1607009 | 15343.1318443 |
| /Average/P | 339110.78174958 | 400655.31584555 | 422259.15154229 | 3078.97788994 | 399689.99195807 |
| /Average/P2 | 114996405055.76947 | 160534565811.71048 | 178305849630.52454 | 2474437551.6456594 | 159752521173.54858 |
| /Average/PHR | -1232686987454650.2 | 68308707105944.75 | 2.5981158707509748e+16 | 361822226147803.0 | 6119571547.631945 |
| /Average/T | 319.59674548 | 1300.33782424 | 2528.98693507 | 589.01787434 | 1330.14476087 |
| /Average/T2 | 102142.0832204 | 2070547.27483687 | 6395779.09016946 | 1688856.28945759 | 1809831.96637881 |
| /Average/Y2CO | 0.0 | 0.00054164 | 0.00697124 | 0.00143603 | 0.0 |
| /Average/Y2CO2 | 0.0 | 0.00595402 | 0.0283705 | 0.00681201 | 0.00391219 |
| /Average/Y2H2O | 0.0 | 0.00175708 | 0.00929521 | 0.00255393 | 0.00066479 |
| /Average/Y2KERO_LUCHE | 0.0 | 0.00268532 | 1.00408617 | 0.02129689 | 0.0 |
| /Average/Y2N2 | 2e-08 | 0.54989768 | 0.58923636 | 0.05945364 | 0.56549275 |
| /Average/Y2O2 | 0.0 | 0.02995375 | 0.05453285 | 0.02014759 | 0.0266755 |
| /Average/YCO | -0.0001162 | 0.00882507 | 0.0828723 | 0.02075583 | 2.8e-07 |
| /Average/YCO2 | -0.00044873 | 0.05657467 | 0.16842254 | 0.0497323 | 0.05973481 |
| /Average/YH2O | -0.00018367 | 0.02883394 | 0.09639336 | 0.02929298 | 0.02461185 |
| /Average/YKERO_LUCHE | -0.00080482 | 0.01265918 | 1.002041 | 0.04847676 | -5e-08 |
| /Average/YN2 | -0.00156545 | 0.74007008 | 0.76761731 | 0.04565138 | 0.7519775 |
| /Average/YO2 | -0.00047555 | 0.15303707 | 0.23352268 | 0.07795984 | 0.16176179 |
| /Average/gamma_bar | 1.03422559 | 1.30653757 | 1.36953937 | 0.05664241 | 1.30193652 |
| /Average/ksi | -0.00080485 | 0.03465032 | 1.0 | 0.05922417 | 0.01918564 |
| /Average/phi | -0.0120734 | 11.17464054 | 1501373.57464976 | 3662.2584582 | 0.29426107 |
| /Average/q2_01 | 0.0 | 80043.52321728 | 449747179.41849583 | 1916669.22563046 | 0.0 |
| /Average/q2_02 | 0.0 | 6809726.3341922 | 27212249454.92934 | 92992696.56978296 | 1.48642386 |
| /Average/q_01 | 0.0 | 28.09673787 | 17308.75096788 | 173.21254089 | 0.0 |
| /Average/q_02 | -12110.42777223 | 271.20910404 | 159993.24698726 | 1655.78093793 | 0.00368811 |
| /Average/r_bar | 60.13169424 | 286.79621898 | 298.53762146 | 9.76395706 | 288.28884759 |
| /Average/rho | 0.53646641 | 1.37465047 | 21.91834814 | 0.7083422 | 1.06759107 |
| /Average/ss_bar | 141.00768707 | 675.26301683 | 964.65334981 | 148.96910232 | 702.19539274 |
| /Average/u | -145.22385972 | 4.60281819 | 163.97780619 | 28.22784398 | -0.70743543 |
| /Average/u2 | 0.0 | 874.04106501 | 26889.65786448 | 1416.17202391 | 195.52479074 |
| /Average/v | -123.01736988 | -8.2968666 | 163.20288453 | 17.02120361 | -4.28127887 |
| /Average/v2 | 0.0 | 404.67864268 | 26635.62158816 | 925.76104737 | 101.1098022 |
| /Average/w | -118.51770246 | 20.55499835 | 387.40234306 | 16.83837715 | 18.69572502 |
| /Average/w2 | 0.0 | 769.14037649 | 150085.72742717 | 887.38456293 | 457.59152389 |
| /Parameters/dtsum | NA | NA | NA | NA | NA |
| /Parameters/gitid | NA | NA | NA | NA | NA |
| /Parameters/nit_av | NA | NA | NA | NA | NA |
| /Parameters/niter | NA | NA | NA | NA | NA |
| /Parameters/nnode | NA | NA | NA | NA | NA |
| /Parameters/t_av | NA | NA | NA | NA | NA |
| /Parameters/versionstring | NA | NA | NA | NA | NA |
+---------------------------+---------------------+--------------------+------------------------+--------------------+--------------------+
Using h5py
The h5py package is the Python binding to your HDF5 library. Use this low-level approach if you only want to read a few fields, without writing.
import h5py

with h5py.File("CAS01/RUN_001/AVE/ave_00000030.h5", "r") as h5pf:
    awesomefield = h5pf['Average/T'][()]
print(awesomefield.max())
Using h5py is low level: powerful, but often clunky and dangerous. For example, if you open the file in write mode instead of read mode ("w" instead of "r" in the h5py.File() statement), you will erase the content of your file. See the online h5py documentation for more info.
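If you are worried about this foot-gun when creating files, h5py's "x" mode (also written "w-") refuses to overwrite an existing file instead of silently truncating it. A small sketch, reusing the field read above:

import h5py

# "x" fails with an error if out.h5 already exists, instead of erasing it
with h5py.File("out.h5", "x") as fout:
    fout.create_dataset("Average/T", data=awesomefield)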
You can build smart things over h5py, but it will probably look like hdfdict or h5py-wrapper.
For the record, here is the reading, editing and writing using h5py. Some parts are really h5py-ninja style. Note in particular the peculiar aspect of:
- the iterator .visit()
- the access-by-value syntax data[...]
h5file = "CAS01/RUN_001/AVE/ave_00000030.h5"
#h5file="./ave_00000002.h5"
all_h5_objs = []
fh5 = h5py.File(h5file, 'r')
# Visit h5 file
fh5.visit(all_h5_objs.append)
# Create output file
newh5 = h5py.File('out.h5', 'w')
# Get all datasets (includes group path)
all_datasets = [obj for obj in all_h5_objs if isinstance(fh5[obj], h5py.Dataset)]
for obj in all_datasets:
data = fh5[(obj)]
if obj == "Average/T":
dset = newh5.create_dataset(obj, data=2*data[...])
else:
dset = newh5.create_dataset(obj, data=data)
newh5.close()
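Wrapping both File() calls in with-blocks, as in the read-only example above, would make the explicit close() calls unnecessary.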
Using Hdfdict
This package shows the content as a Python dictionary. Use this higher-level approach if you want to read all fields, edit them or change the structure the way you want, then write them:
import h5py
import hdfdict

with h5py.File("CAS01/RUN_001/AVE/ave_00000030.h5", "r") as h5pf:
    sol = hdfdict.load(h5pf, lazy=False)

awesomefield = sol['Average']['T']
sol['Average']['T'] *= 2

with h5py.File("out.h5", 'w') as fout:
    hdfdict.dump(sol, fout)
The reading is slower because all fields are loaded into memory. You can go faster with lazy=True. On the other hand, it is easy to edit the fields or the structure using common Python objects.
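With lazy=True, fields are only read from disk when you access them, so keep the file open while you use the data (see the hdfdict documentation for the exact semantics of lazy loading).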
Using Antares
Antares is a Cerfacs Python project limited to Airbus and Safran perusal. This is a much higher-level approach. You will need Antares if you are in one (or more) of the following cases:
- you want to work on the same field at multiple instants.
- your field is distributed on multiple zones (i.e. connected meshes).
- you need to apply an “Antares treatment”.
- you want to apply an “Antares compute” for a pre-recorded quantity.
The same example as the previous one would look like this:
import antares as ant

ard = ant.Reader("hdf_avbp")
ard["filename"] = "CAS01/RUN_001/AVE/ave_00000030.h5"
base = ard.read()
base[0][0]['T'] *= 2
writer = ant.Writer('hdf_antares')
writer['filename'] = 'out'
writer['base'] = base
writer.dump()
Note that the output out.h5 will follow the built-in writer structure. You cannot just add a field this way.
In the same way, the built-in reader will ignore any addition to the structure unless you upgrade the Antares reader, which is highly recommended.
Note: to our knowledge, performances are very similar between Antares and Hdfdict. For example, the full copy of a file with a single edit, on 8.3 million nodes:
| H5py | Hdfdict | Antares |
|---|---|---|
| 2.411 s | 3.544 s | 3.731 s |
| 100% (ref.) | 145% | 153% |
IMPORTANT NOTICE: These numbers are for a full execution, including Python startup. The import time is also included. These numbers can change a lot if you process a lot of files or larger files.
Bad habits found in Antares scripts
If you work on pre- and post-processing, you cannot ignore the Cerfacs post-processing suite Antares. First, remember this is a high-level tool with an incredible number of features. So keep the Antares documentation in your bookmarks, and do read what is behind the features you are using.
The documentation tells you what to do, but we felt it necessary to highlight some anti-patterns:
Importing the galaxy
from antares import *
Never import Antares, nor any other library, with the wildcard *. If a function whateverthename() arises in your code, the reader has no way to know where it comes from. There are lots of resources online about why import star in Python is a bad idea.
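Prefer an explicit namespace import, as used throughout this article:

import antares as ant

reader = ant.Reader("hdf_avbp")  # the reader's origin is obvious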
Antares bases are stateful
This pattern is widely used:
r = ant.Reader("hdf_avbp")
r["filename"] = "awesome.mesh.h5"
r["shared"] = True
base = r.read()
r = ant.Reader("hdf_avbp")
r["base"] = base
r["filename"] = "awesome_average.h5"
base = r.read()
The name base is assigned twice, giving the false impression that the first base will be erased (and that the garbage collector will free the memory, someday… hopefully).
Some people could then be tempted to build a new different base over the initial one:
r = ant.Reader("hdf_avbp")
r["filename"] = "awesome.mesh.h5"
r["shared"] = True
base_meshonly = r.read()
r = ant.Reader("hdf_avbp")
r["base"] = base_meshonly
r["filename"] = "awesome_average.h5"
base_meshandsol = r.read()
Here, the reader might think that base_meshandsol is base_meshonly augmented with the solution. Unfortunately, wrong! The second read statement changes the first base. You can test the following:
> print(base_meshandsol is base_meshonly)
True
This will become a large problem if you want to use the initial base as a stepping stone to build several other bases: many names for the same object, and you never thought about it.
To clear this misunderstanding, show clearly that your Antares base is updated by the Antares .read() (in other words, it is “stateful”), and not by the script. Simply remove the unnecessary base =. This will help the people in charge of refactoring a lot.
avbp_reader_mesh = ant.Reader("hdf_avbp")
avbp_reader_mesh["filename"] = "awesome.mesh.h5"
avbp_reader_mesh["shared"] = True
base = avbp_reader_mesh.read()
avbp_reader = ant.Reader("hdf_avbp")
avbp_reader["base"] = base
avbp_reader["filename"] = "awesome_average.h5"
avbp_reader.read()
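Now each reader has a descriptive name, and the fact that .read() mutates the shared base is visible at the call site rather than hidden behind a misleading assignment.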
Use Antares .compute() when needed
The .compute() is a powerful method, but you should use it wisely. This method:
- computes on all zones and instants (loops on zones and time);
- comes with built-in variables (R, theta, etc.), see the equation manager documentation.
On the other hand, some Numpy/Scipy functions will not be available through the string-based .compute() method.
Therefore, it is good practice to use .compute() only when needed, and to stick to plain Numpy/Scipy by default.
For example, in the following we need a kind of Heaviside function:
theta_threshold = "0.01"
base.compute("FLAME_N2 =(tau - " + theta_threshold + ")")
base.compute("FLAME_D2 =(tau - " + theta_threshold + ")")
base[0][0]["FLAME_D2"] = np.abs(base[0][0]["FLAME_D2"])
base.compute("FLAME =0.5*(FLAME_N2/FLAME_D2+1)")
The following is more explicit, and avoids divide-by-zero situations:
base[0][0]["flame"] = np.where(base[0][0]["tau"]>0.01, 1., 0.)
If you have to do such a stunt on multiple zones and multiple instants on a recurrent basis, please contribute to Antares.
To be completed eventually…