
Acropolis of Athens: detail of a restoration. Once the overall shape is known, filling the holes becomes possible. The same goes for datasets, if you describe them with JSON Schema.

Reading time 5 min

Using the json-SCHEMA standard for scientific applications

JSON Schema is a standard used in web applications to define what data an application needs and how it can be modified. The standard is now extremely mature and ubiquitous, with excellent documentation.

Meanwhile, scientific applications often use large and very complex inputs without any standard. In engineering applications, for example, having four shape-shifting families of inputs is common. The higher the fidelity of the simulation, the more inputs you get: a typical aeronautical combustion chamber setup involves more than 3000 degrees of freedom.

We will see how the JSON Schema standard can help validate the input, add precise documentation, auto-fill the missing parts, and even create graphical user interfaces.

Our test case: the flow past an obstacle

Vortex shedding behind an obstacle is at the origin of many natural phenomena, from converting expiration into voice to wind destroying bridges. The flow past a cylinder is a classic academic Computational Fluid Dynamics (CFD) test case on vortex shedding.

The input could be expressed in the YAML format like this:

mesh:                # Geometry
  lenght: 1.         # x_direction [m]
  width: 0.3         # y direction [m]
  resolution: 0.01   # delta_x [m]

obstacle:
  type: cylinder
  size: 0.05         # diameter [m]

fluid:
  density: 1.2       # [kg/m3]
  viscosity: 1.8e-5  # [kg/(m.s)]
  init_speed: 3.     # [m/s]

numerics:
  poisson_tol: 0.05           # [-] tolerance 
  poisson_maxsteps: 4         # [it.] max. iterations to converge
  scheme: "first_order"       # either first_order or second_order

Generating a SCHEMA

JSON Schema is built on the JSON serialization format. First we convert the data into a JSON string, using for example an online YAML-to-JSON converter.
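If you prefer to stay scripted, the same conversion takes a few lines of Python (a minimal sketch; the pyyaml package is assumed to be installed):

```python
import json

import yaml  # pip install pyyaml

yaml_text = """
fluid:
  density: 1.2       # [kg/m3]
  init_speed: 3.     # [m/s]
"""

# yaml.safe_load turns the YAML text into plain Python dicts and lists,
# which json.dumps can serialize directly
data = yaml.safe_load(yaml_text)
json_text = json.dumps(data, indent=3)
print(json_text)
```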

The same information is therefore expressed in a more rigid format, using nested braces {} instead of nested indentation:

{
   "mesh": {
      "lenght": 1,
      "width": 0.3,
      "resolution": 0.01
   },
   "obstacle": {
      "type": "cylinder",
      "size": 0.05
   },
   "fluid": {
      "density": 1.2,
      "viscosity": 0.00001,
      "init_speed": 3
   },
   "numerics": {
      "poisson_tol": 0.05,
      "poisson_maxsteps": 4,
      "scheme": "first_order"
   }
}

Now we infer a SCHEMA adapted to the data, using one of the many online tools. If you have many examples of your data, you can use skinfer, which is able to make much more advanced inferences.

In the present case we get:

{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "properties": {
    "mesh": {
      "type": "object",
      "properties": {
        "lenght": {
          "type": "integer"
        },
        "width": {
          "type": "number"
        },
        "resolution": {
          "type": "number"
        }
      },
      "required": [
        "lenght",
        "width",
        "resolution"
      ]
    },
    "obstacle": {
      "type": "object",
      "properties": {
        "type": {
          "type": "string"
        },
        "size": {
          "type": "number"
        }
      },
      "required": [
        "type",
        "size"
      ]
    },
    "fluid": {
      "type": "object",
      "properties": {
        "density": {
          "type": "number"
        },
        "viscosity": {
          "type": "number"
        },
        "init_speed": {
          "type": "integer"
        }
      },
      "required": [
        "density",
        "viscosity",
        "init_speed"
      ]
    },
    "numerics": {
      "type": "object",
      "properties": {
        "poisson_tol": {
          "type": "number"
        },
        "poisson_maxsteps": {
          "type": "integer"
        },
        "scheme": {
          "type": "string"
        }
      },
      "required": [
        "poisson_tol",
        "poisson_maxsteps",
        "scheme"
      ]
    }
  },
  "required": [
    "mesh",
    "obstacle",
    "fluid",
    "numerics"
  ]
}

You can now tinker with this SCHEMA, using the JSON Schema reference to make the validation extremely precise:

"density": {
    "title": "Density",
    "description": "The density of the fluid, expressed in kg/m3.",
    "type": "number",
    "default": 1.2,
    "minimum": 0.001,
    "exclusiveMaximum": 100
},

You can also limit the options allowed:

"scheme": {
    "title": "Numerical scheme",
    "description": "The scheme used to express operators",
    "type": "string",
    "enum": ["centered", "upwind"],
    "default": "centered"
}
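A minimal sketch of how such constraints are enforced at validation time (the schema fragment below is a simplified stand-in for the full one):

```python
from jsonschema import ValidationError, validate  # pip install jsonschema

fluid_schema = {
    "type": "object",
    "properties": {
        "density": {"type": "number", "minimum": 0.001, "exclusiveMaximum": 100},
        "scheme": {"type": "string", "enum": ["centered", "upwind"]},
    },
}

# valid data passes silently
validate(instance={"density": 1.2, "scheme": "upwind"}, schema=fluid_schema)

# an out-of-range density raises a ValidationError
try:
    validate(instance={"density": 250.0, "scheme": "centered"}, schema=fluid_schema)
    rejected = False
except ValidationError:
    rejected = True
```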

Data validation

Once we have this SCHEMA, we can validate the original data. In Python you can use, for example, the jsonschema package. To install it:

pip install jsonschema

Then to use it:

import yaml
import json
from jsonschema import validate

# read the data
with open('input.yml', "r") as fin:
    data = yaml.load(fin, Loader=yaml.FullLoader)

# read the schema
with open('schema.json', "r") as fin:
    myschema = json.load(fin)

# validation
validate(instance=data, schema=myschema)

What would happen, then, if the keyword scheme were replaced by derivatives in the input? An exception, of course, but with a quite informative output:

Traceback (most recent call last):
  File "valida.py", line 14, in <module>
    validate(instance=data, schema=myschema)
  File "/Users/dauptain/Python_envs/dev_opentea/lib/python3.8/site-packages/jsonschema-3.2.0-py3.8.egg/jsonschema/validators.py", line 934, in validate
    raise error
jsonschema.exceptions.ValidationError: 'scheme' is a required property

Failed validating 'required' in schema['properties']['numerics']:
    {'properties': {'poisson_maxsteps': {'type': 'integer'},
                    'poisson_tol': {'type': 'number'},
                    'scheme': {'type': 'string'}},
     'required': ['poisson_tol', 'poisson_maxsteps', 'scheme'],
     'type': 'object'}

On instance['numerics']:
    {'derivatives': 'first_order',
     'poisson_maxsteps': 4,
     'poisson_tol': 0.05}

Customizing the validation message

This message is a bit verbose, and can become unreadable if the SCHEMA is very large. However, you can customize the validate() of jsonschema. In COOP's package opentea, you can use the validate_light() function like this:

import yaml
import json

from opentea.noob.validate_light import validate_light

# read the data
with open('input.yml', "r") as fin:
    data = yaml.load(fin, Loader=yaml.FullLoader)

# read the schema
with open('schema.json', "r") as fin:
    myschema = json.load(fin)

# validation
validate_light(data, myschema)

It does the same job as validate(), with a more human-readable output:

Traceback (most recent call last):
  File "valida.py", line 15, in <module>
    validate_light(data, myschema)
  File "/Users/dauptain/GITLAB/opentea/src/opentea/noob/validate_light.py", line 32, in validate_light
    raise ValidationErrorShort(err_msg)
opentea.noob.validate_light.ValidationErrorShort: 
========================
derivatives: first_order
poisson_maxsteps: 4
poisson_tol: 0.05

 does not validate against 
properties:
  poisson_maxsteps:
    type: integer
  poisson_tol:
    type: number
  scheme:
    type: string
required:
- poisson_tol
- poisson_maxsteps
- scheme
type: object

We have here an efficient and systematic way to validate extremely large and complex information.

Extension to an HDF5 structure validation

HDF5 files are also nested objects, and their structure can be scanned into a dictionary. In COOP's package opentea, the tool get_h5_structure() provides the structure of the file. It is used as follows:

import yaml
from opentea.tools.visit_h5 import get_h5_structure

dict_ = get_h5_structure("awesome_mesh.h5")
print(yaml.dump(dict_, default_flow_style=False))

The structure looks like this:

Connectivity:
  dtype: int32
  tet->node:
    dtype: int32
    value: array of 181879640 elements
  value: array of 181879640 elements
Coordinates:
  dtype: float64
  value: array of 8043365 elements
  x:
    dtype: float64
    value: array of 8043365 elements
  y:
    dtype: float64
    value: array of 8043365 elements
  z:
    dtype: float64
    value: array of 8043365 elements
(...)

This dictionary can be used like the input file of the initial example. One can therefore validate the compatibility of an HDF5 file with a program, using the SCHEMA standard.
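For instance, one could check that a mesh file contains 3-D coordinates before feeding it to a solver. A hypothetical sketch (the schema below is an illustration, not part of opentea):

```python
from jsonschema import validate  # pip install jsonschema

# structure as returned by get_h5_structure(), reduced for the example
h5_structure = {
    "Coordinates": {
        "x": {"dtype": "float64", "value": "array of 8043365 elements"},
        "y": {"dtype": "float64", "value": "array of 8043365 elements"},
        "z": {"dtype": "float64", "value": "array of 8043365 elements"},
    }
}

# hypothetical schema: the file must hold x, y and z coordinates
h5_schema = {
    "type": "object",
    "required": ["Coordinates"],
    "properties": {
        "Coordinates": {
            "type": "object",
            "required": ["x", "y", "z"],
        }
    },
}

result = validate(instance=h5_structure, schema=h5_schema)  # passes silently
```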

Data completion

If the SCHEMA is known, it is possible to infer the missing parts of an input by filling in default values. Assume we have only the following input:

obstacle:
  type: cylinder
  size: 0.09        # diameter [m]

We can infer the full data from the schema. The nob_complete() function is an implementation of this feature in opentea:

import yaml
import json

from opentea.noob.inferdefault import nob_complete

# read the data
with open('input2.yml', "r") as fin:
    data = yaml.load(fin, Loader=yaml.FullLoader)

# read the schema
with open('schema.json', "r") as fin:
    myschema = json.load(fin)

# Completion
dict_ = nob_complete(myschema, update_data=data)
print(yaml.dump(dict_, default_flow_style=False))

The result is then:

fluid:
  density: 1.2
  init_speed: 1.4
  viscosity: 1.0e-05
mesh:
  lenght: 3.0
  resolution: 0.01
  width: 1.0
numerics:
  poisson_maxsteps: 10
  poisson_tol: 0.001
  scheme: centered
obstacle:
  size: 0.09
  type: cylinder

Using this property, we can drastically reduce the amount of information needed for a complex setup, as long as we assume the user wants default values everywhere else.
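The idea behind such a completion is simple. A minimal sketch (not the actual opentea implementation) that walks the schema and fills missing values with defaults could look like this:

```python
def complete(schema, data=None):
    """Fill missing entries of `data` with the defaults found in `schema`.

    Minimal sketch: handles only nested objects and scalar defaults.
    """
    if schema.get("type") == "object":
        out = dict(data or {})
        for key, subschema in schema.get("properties", {}).items():
            out[key] = complete(subschema, out.get(key))
        return out
    # leaf node: keep user value if present, otherwise use the default
    return schema.get("default") if data is None else data


schema = {
    "type": "object",
    "properties": {
        "obstacle": {
            "type": "object",
            "properties": {
                "type": {"type": "string", "default": "cylinder"},
                "size": {"type": "number", "default": 0.05},
            },
        },
    },
}

print(complete(schema, {"obstacle": {"size": 0.09}}))
```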

Create graphical user interfaces from SCHEMA

JSON Schema was created to allow two web services to exchange information, and the usual way to get information from an end user is a form. There are therefore plenty of ways to generate forms from a SCHEMA, in other words to generate a Graphical User Interface.

Takeaway

SCHEMA is a core component of data exchange on the web today. An enormous number of tools and developers are available for this technology, far more than in the small HPC field.

If you are in the HPC business, make up your mind about what SCHEMA can do for your setups using this interactive page. It could replace, for free, many in-house parsers and global validation processes.

This work has been supported by the EXCELLERAT project which has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No 823691.



Antoine Dauptain is a research scientist focused on computer science and engineering topics for HPC.
