Estimating the cost of a Computational Fluid Dynamics computation

A. Dauptain, I. Monnier, G. Staffelbach

Figure: Simulation summary

                        Simu1
  t_simu (s)            100
  processes             360
  dofs                  1.00e+7
  Ideal scaling         0.9964
  FLOP to solve         6.02e+15
  Time to sol. (h)      1.53
  Nb. of jobs           0.1
  Energy cons. (kWh)    1.5
Human activities must become greener. What about supercomputing? Understanding the many costs of simulations is the key to staying on a sustainable path. Using an example of Computational Fluid Dynamics (CFD), this tutorial breaks the problem down into several steps, from the problem definition to the hardware resources involved.

The CFD configuration

Photo by Nicholas Thomas on Unsplash.

In a Computational Fluid Dynamics simulation, you want to study the flow induced by your problem: for example, the water motions induced by a fish in a tank. This problem shows specific traits:

First, there is the speed of the object (e.g. the fish): m/s.

Then we have the size of the object (e.g. the fish): m.

This object is moving in a domain (e.g. the tank) of typical size: m.

From this, we already have a characteristic time, for example the time needed by the object to cross the domain: 10.0 seconds.

t_char=domain_size/speed

We need to consider the timescale of the simulation. Are we looking at one fish crossing the aquarium, or a fish circling the domain one hundred times? The non-dimensional duration of the problem is the number of characteristic times we simulate: 10.00 char. times.

The full time to simulate is then 100 seconds.

Finally, some properties of the fluid will also influence the simulation, for example the speed of sound: m/s. Such properties allow us to compute non-dimensional numbers such as the Mach number, the ratio between the object speed and the speed of sound: 0.0294. Below 0.3, the flow is almost incompressible and the numerical problem to solve is quite different.
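To make these first steps concrete, here is a minimal Python sketch. The numerical values (object speed of 10 m/s, domain size of 100 m, speed of sound of 340 m/s) are illustrative assumptions chosen to reproduce the figures quoted above, not values read from the sliders.

    # Illustrative values (assumptions): adjust them to your own configuration.
    speed = 10.0          # object speed [m/s]
    domain_size = 100.0   # domain size [m]
    sound_speed = 340.0   # speed of sound [m/s]
    n_char_times = 10.0   # non-dimensional duration [characteristic times]

    t_char = domain_size / speed      # time for the object to cross the domain [s]
    t_simu = n_char_times * t_char    # physical time to simulate [s]
    mach = speed / sound_speed        # Mach number [-]

    print(f"t_char = {t_char:.1f} s, t_simu = {t_simu:.0f} s, Mach = {mach:.4f}")
    # -> t_char = 10.0 s, t_simu = 100 s, Mach = 0.0294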


Discretization

The first step in a simulation is to define the degrees of freedom (d.o.f.). Each degree of freedom, or dof, is a space location tracking what happens in its vicinity. The corresponding control volume is called a “cell”. This sampling of the problem space is called a discretization, because the continuous problem is converted into discrete numerical values. The cells, taken together, form a tessellation of the computational domain.

The number of dofs is a consequence of the precision of the simulation: your typical cell size must be smaller than the object you are computing. This parameter is controlled here by the precision ratio.

Another element is important: the smallest control volume has a heavy impact on the numerical approach, because it limits the global speed of resolution. This is why we cannot afford discretizations with huge differences between the resolved scale (domain_size) and the smallest control volume (min_dx). The size of the smallest cell (usually the edge length of the cell) is typically 10 times smaller than the typical cell size for a realistic heterogeneous mesh.

Precision ratio: 30.00. This is the ratio between the object size and the typical cell size.

typ_dx = object_size / precision_ratio

Cell-size ratio: 10.00. This is the ratio between the typical cell size and the smallest cell size.

min_dx = typ_dx / ratio_cells

According to these ratios, the typical cell size in use is 1.000e-1 m, and the smallest cell size is 1.000e-2 m.

We can get an estimate of the number of degrees of freedom by dividing the volume of the domain by the typical cell volume: 1.000e+7 d.o.f.s.

dofs= (domain_size/typ_dx)^3
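Continuing the sketch, the discretization estimates can be reproduced as follows. The object size of 3 m follows from the precision ratio above; the domain size of 21.5 m is an assumption chosen so the d.o.f. count lands near the 1.00e+7 quoted in the summary figure.

    # Illustrative values (assumptions) chosen to match the figures quoted above.
    object_size = 3.0        # object (fish) size [m]
    precision_ratio = 30.0   # object size / typical cell size
    ratio_cells = 10.0       # typical cell size / smallest cell size
    domain_size = 21.5       # domain (tank) size [m]

    typ_dx = object_size / precision_ratio   # typical cell size [m]
    min_dx = typ_dx / ratio_cells            # smallest cell size [m]
    dofs = (domain_size / typ_dx) ** 3       # number of degrees of freedom

    print(f"typ_dx = {typ_dx:.3e} m, min_dx = {min_dx:.3e} m, dofs = {dofs:.2e}")
    # -> typ_dx = 1.000e-01 m, min_dx = 1.000e-02 m, dofs = 9.94e+06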


Equations

Equations are how we approximate the physics of the problem. More physical phenomena to capture means more equations to solve at the same time. When brute-force computation is not enough - usually at the smallest scales of the simulation - some terms are modelled: an approximation of the physics is chosen to provide an acceptable behaviour of the term.

We assume here that the equations and models are known and never change.

We will solve explicitly the Navier-Stokes equations using a Large Eddy Simulation approach.

For compressible flows, the fastest phenomenon to catch is a sound wave propagating downstream with the flow. For incompressible flows, the fastest phenomenon is simply the flow speed. The mathematical resolution is however different: in brief, an incompressible iteration takes more time to compute, but allows a bigger timestep. Use the following tickbox to switch between compressible and incompressible equations:

Compressible.

The current maximum velocity to take into account is here 350 m/s.
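A minimal sketch of how the limiting velocity changes with this tickbox, assuming a flow speed of 10 m/s and a speed of sound of 340 m/s (illustrative values):

    # Illustrative values (assumptions).
    flow_speed = 10.0     # object / flow speed [m/s]
    sound_speed = 340.0   # speed of sound [m/s]
    compressible = True   # state of the tickbox

    # For compressible equations the fastest signal is an acoustic wave carried by the
    # flow; for incompressible equations it is simply the flow itself.
    max_speed = flow_speed + sound_speed if compressible else flow_speed
    print(f"max_speed = {max_speed:.0f} m/s")   # -> 350 m/s when compressible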


Algorithm

Once the problem is discretized, and the equations are decided, there are usually many algorithms available. Developers can be interested in:

  • Simplicity, because a simple code is easy to maintain.
  • Speed, when the time-to-solution is critical.
  • Scalability, for very large numbers of degrees of freedom.
  • Genericity, for a wide range of applications.

Obviously all these qualities are tightly coupled, and often mutually exclusive.
The degrees of freedom in space are iteratively updated. Each iteration is separated by a time step delta_t. The limiting parameter of the method is, in this case, the Courant-Friedrichs-Lewy (CFL) limit, which bounds the speed of information propagation in the method (CFL = max_speed * delta_t / min_dx). Other limitations exist, like the Fourier limit for cases driven by diffusion.

CFL number: 0.70. The Courant-Friedrichs-Lewy number is, in a way, the non-dimensional speed of information in this algorithm.

The time step in use is therefore 2.000e-5 seconds. As a consequence, the number of iterations to be done is 5e+6.
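With the same illustrative values as above (min_dx = 0.01 m, max_speed = 350 m/s, t_simu = 100 s), the time step and iteration count can be sketched as:

    # Illustrative values (assumptions) consistent with the figures above.
    cfl = 0.70          # Courant-Friedrichs-Lewy number [-]
    min_dx = 0.01       # smallest cell size [m]
    max_speed = 350.0   # fastest signal speed [m/s]
    t_simu = 100.0      # physical time to simulate [s]

    delta_t = cfl * min_dx / max_speed   # explicit time step [s]
    iterations = t_simu / delta_t        # number of iterations to run

    print(f"delta_t = {delta_t:.3e} s, iterations = {iterations:.1e}")
    # -> delta_t = 2.000e-05 s, iterations = 5.0e+06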

Floating operations

Depending on the equations and the implementation, the solver will have to perform many FLoating point OPerations (FLOP).

You can approximate this aspect by a fixed amount of FLOP per d.o.f. per iteration.

FLOP per d.o.f. per iteration: 120.00

Note that the quantity of FLOP/dof/it can change abruptly from one software to another, for example from an incompressible implicit implementation to a compressible explicit one.
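A sketch of the resulting operation count, neglecting for now the d.o.f. duplication introduced by domain decomposition (see the next section):

    # Illustrative values (assumptions) consistent with the figures above.
    dofs = 1.0e7          # degrees of freedom
    iterations = 5.0e6    # iterations to run
    flop_dof_it = 120.0   # floating-point operations per d.o.f. per iteration

    total_flop = dofs * iterations * flop_dof_it
    print(f"total FLOP = {total_flop:.2e}")   # -> 6.00e+15
    # The 6.02e+15 quoted in the summary figure also includes the duplicated d.o.f.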

Parallel computing and domain decomposition

With domain decomposition, the problem is distributed over many processing units to enable parallel computing. Each processing unit works on a fraction of the initial domain. The ideal scaling is the utopia of parallel processing: if you double the number of processors, you are twice as fast.

The information at the frontier of these fractions is duplicated among neighboring processing units, ensuring communication between processes. More processes therefore induce more d.o.f.s. The best performance reachable is the ideal strong scaling, taking into account the duplication of points.

The amount of duplication depends on the algorithm and its domain decomposition method. We can approximate it by a function proportional to the inverse of the number of dofs per process. We set it at 10% when the number of dofs per partition is equal to dof_p_0, a parameter you can play with using the next slider.

dof_duplication = 0.1 * dof_p_0 / (dofs / partitions)

Figure: Proportion of d.o.f. duplicated vs. number of d.o.f. per partition.

dof_p_0 = 1.000e+3 (adjustable from 100 to 40,000)

Figure: Strong scaling limit available versus number of partitions.

The strong scaling limit is the impact of d.o.f. duplication on your ideal linear scaling. It is intrinsic to the software and independent of the machine. However, the strong scaling limit highly depends on the initial number of dofs of your problem. That is why choosing a problem with a very large number of dofs makes the strong scaling curve look so much better. Just look at the following equation to grasp the power of insanely large meshes.

strong_scaling_limit = x / (1. + x * 0.1 * dof_p_0 / dofs)

where x is the number of partitions.
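Both formulas can be sketched with values matching the summary figure (1e7 d.o.f., 360 partitions, dof_p_0 = 1000):

    # Illustrative values (assumptions) consistent with the summary figure.
    dofs = 1.0e7       # degrees of freedom of the initial mesh
    partitions = 360   # number of processes / partitions
    dof_p_0 = 1.0e3    # dofs per partition at which duplication reaches 10%

    # Fraction of duplicated d.o.f.: it grows as the partitions get smaller.
    dof_duplication = 0.1 * dof_p_0 / (dofs / partitions)

    # Strong scaling limit: the speed-up actually reachable with `partitions` processes.
    strong_scaling_limit = partitions / (1.0 + partitions * 0.1 * dof_p_0 / dofs)
    efficiency = strong_scaling_limit / partitions

    print(f"duplication = {dof_duplication:.4f}, "
          f"speed-up limit = {strong_scaling_limit:.1f}, efficiency = {efficiency:.4f}")
    # -> duplication = 0.0036, speed-up limit = 358.7, efficiency = 0.9964
    #    (the "ideal scaling" reported in the summary figure)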

You can instead change the mesh simultaneously with the number of partitions. For example, keeping the same number of dofs per partition is called weak scaling. Used in the wrong context, weak scaling can be a way to fool the masses with parallel performance.

However, the bigger the mesh, the smaller the time step. Huge meshes easily translate into very short simulated durations. Do not get fooled. Roughly speaking, without care, the cost of your simulation scales as:

global_cost =~ 1. / typ_dx^4

(the number of dofs grows as 1/typ_dx^3, and through the CFL condition the number of iterations grows as 1/typ_dx, hence the fourth power).

This is why the meshing step is so critical: you aim for the minimal number of dofs for a given precision on a given phenomenon.
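As a quick, hypothetical sanity check of that exponent:

    def global_cost(typ_dx):
        """Rough relative cost of a simulation, up to a constant factor."""
        return 1.0 / typ_dx ** 4

    # Halving the typical cell size multiplies the cost estimate by 2^4 = 16.
    print(f"{global_cost(0.05) / global_cost(0.1):.1f}")   # -> 16.0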

A fun analogy: The crop field

In a way, your problem is translated by the software into a large set of computations to make. Assume these billions of computations are crops in a large field. The processing units, in other words the cores inside your GPUs or CPUs, are reaping these computations.

Photo by James Baltz on Unsplash.

Think about the following assertions:

  • The bigger the field is, the more machines you can use.
  • Two machines will usually be faster than one.
  • Many machines means, for each driver, more communication and waiting.
  • One thousand machines will have a hard time working efficiently.
  • For the same reaping power on the same field, using more, smaller machines implies losses through additional communications.
  • Partitioning the field between machines is done according to the number and type of machines.

The computing resources

Now the computation is all set up, and we have heard about the scaling. However, scaling is not performance! Indeed, scaling is a non-dimensional number and its reference can easily be misunderstood, or purposely hidden. Mixing up scaling and performance is, by the way, another stunt to fool the masses with parallel performance.

We now need to talk about the actual time-to-solution of the solver on the machine. The computing resource to consider is not a given processing unit, but the actual queue available to the end-user: a fixed number of processing units (nodes), each featuring a fixed number of cores, for a fixed allocated time. These processing units deliver a computational power expressed in FLoating-point OPerations per Second (FLOPS), the peak performance.

Number of nodes: 10.00

Number of cores per node: 36.00

Job duration in hours: 12.00

Peak performance per node in TFLOPS: 2.18

Power consumption per node in W: 100.00

Smart people managed to port the code to this machine with a given efficiency. The trick is to shape the computational algorithm according to the architecture. They call this the sustained efficiency.

Sustained efficiency : 0.05

In other words, with the present hardware, the peak, or ideal, performance is 2.180e+13 FLOPS, but the present combination of setup/software/machine only delivers 1.090e+12 FLOPS.

actual_perf = sustained_efficiency * peak_perf_per_node * n_nodes

You can finally estimate the time to solution using your actual dofs and performance:

time_to_solution = dofs * (1 + dof_duplication) * iterations * flop_dof_it / actual_perf
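Putting the hardware numbers together (values from the sliders above; the intermediate values are the illustrative ones used throughout this sketch):

    # Values taken from the sliders above.
    n_nodes = 10
    peak_perf_per_node = 2.18e12   # peak performance per node [FLOPS]
    sustained_efficiency = 0.05    # fraction of the peak actually sustained

    # Illustrative values carried over from the previous sections.
    dofs = 1.0e7
    dof_duplication = 0.0036
    iterations = 5.0e6
    flop_dof_it = 120.0

    peak_perf = peak_perf_per_node * n_nodes         # -> 2.18e+13 FLOPS
    actual_perf = sustained_efficiency * peak_perf   # -> 1.09e+12 FLOPS

    total_flop = dofs * (1.0 + dof_duplication) * iterations * flop_dof_it
    time_to_solution = total_flop / actual_perf      # [s]

    print(f"time to solution = {time_to_solution / 3600:.2f} h")   # -> 1.53 h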

Power and Energy comparisons

Even if your computing resource often provides energy or power figures on your consumption report, kWh and kW are always a bit abstract. We can convert these figures into more explicit numbers.

In terms of raw power first: 1.00 kW is equivalent to 0.38 electric ovens (2.6 kW) at full power for 1.5 hours. If this power were produced by humans, you would need 2e+1 slaves working non-stop for 1.5 hours (nights and meals excluded, of course!).

Considering the energy consumed, imagine a fuel-powered computing cluster. 1.53 kWh can be produced by efficiently burning 0.000942 oil barrels (approx. 1628.2 kWh each). Luckily, oil is not powering our cluster. The carbon footprint totally depends on the energy mix of the country. In France, assuming 38.95 gCO2/kWh, the simulation releases 0.0598 kg of CO2. This CO2 is equivalent to 0.504 km in an average car (118 gCO2/km), or 0.00120 flights Paris-London (50 kgCO2/seat).
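A sketch of these conversions; the conversion factors are the ones quoted in the text, and the power and duration come from the previous sections:

    # Values from the previous sections.
    n_nodes = 10
    power_per_node_w = 100.0    # power consumption per node [W]
    time_to_solution_h = 1.53   # time to solution [h]

    # Conversion factors quoted in the text.
    g_co2_per_kwh = 38.95           # French energy mix [gCO2/kWh]
    g_co2_per_km_car = 118.0        # average car [gCO2/km]
    kg_co2_per_flight_seat = 50.0   # Paris-London flight [kgCO2/seat]

    energy_kwh = n_nodes * power_per_node_w / 1000.0 * time_to_solution_h   # -> 1.53 kWh
    co2_kg = energy_kwh * g_co2_per_kwh / 1000.0                            # -> ~0.06 kg

    print(f"{energy_kwh:.2f} kWh, {co2_kg:.4f} kg CO2, "
          f"{co2_kg * 1000.0 / g_co2_per_km_car:.3f} car km, "
          f"{co2_kg / kg_co2_per_flight_seat:.5f} flight seats")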

If you like these comparisons, read more on the Green Algorithms dynamic page.

Comparing two machines

Comparing two different pieces of hardware is never easy. In the following table, you can see all the parameters that matter. The initial values provided here are typical of CPU-to-GPU comparisons. Note that, as the structure of newer machines becomes more complex, the sustained efficiency of a software naturally drops when porting. As a consequence, keeping the sustained efficiency up always requires more effort.

Ratio on nb. of nodes:

Ratio on nb. of core per node:

Ratio on peak perf per node:

Ratio on power consumption:

Ratio on efficiency code/machine:

Figure: Reference machine vs. new machine

  Quantity                    Reference     Ratio      New
  Nodes                       10            0.1        1
  Cores per node              36            13.8       497
  Node peak perf. (TFLOPS)    2.18          14         31
  Node power cons. (W)        100           4          400
  Sustained efficiency        0.05          0.0078     0.00039
  Final dofs                  10036003      1.0        10049703
  time_to_solution (h)        1.5           92         1.4e+2
  Energy (kWh)                1.5           37         56
With this table, see if the new hardware helps you to save energy while you decrease the time to solution. If it does not, well, you are not getting greener with your simulations...
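As a final sketch, the whole chain can be wrapped into one small function to reproduce the comparison of the table above (same illustrative assumptions as before):

    def simulation_cost(n_nodes, cores_per_node, peak_tflops_per_node,
                        power_w_per_node, sustained_efficiency,
                        dofs=1.0e7, iterations=5.0e6, flop_dof_it=120.0, dof_p_0=1.0e3):
        """Return (time to solution [h], energy [kWh]) for a given machine."""
        partitions = n_nodes * cores_per_node
        dof_duplication = 0.1 * dof_p_0 / (dofs / partitions)
        total_flop = dofs * (1.0 + dof_duplication) * iterations * flop_dof_it
        actual_perf = sustained_efficiency * peak_tflops_per_node * 1.0e12 * n_nodes
        time_h = total_flop / actual_perf / 3600.0
        energy_kwh = n_nodes * power_w_per_node / 1000.0 * time_h
        return time_h, energy_kwh

    # Reference (CPU-like) machine and new (GPU-like) machine, as in the table above.
    print(simulation_cost(10, 36, 2.18, 100.0, 0.05))               # -> ~1.5 h, ~1.5 kWh
    print(simulation_cost(1, 497, 2.18 * 14, 400.0, 0.05 * 0.0078)) # -> ~140 h, ~56 kWh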







This work has been supported by the EXCELLERAT CoE grant agreement.