Simulation of fluid flows with Lattice Boltzmann on GPUs
Joaquim José da Costa e Simas
Master Thesis on:
Mechanical Engineering
Jury
President: Prof. Doutor Luís Rego da Cunha de Eça
Supervisors: Prof. Doutor José Carlos Fernandes Pereira
Mestre Ricardo José Nunes dos Reis
Examiner: Doutor Mário António Prazeres Lino da Silva
December, 2010
Abstract
In this work, two Lattice Boltzmann numerical schemes are implemented on three different parallel
machines, each with its own programming model: a multiprocessor computer, a cluster of computers
and a GPU. The latter is a graphics processor with a parallel architecture, which allows it to achieve
high computational power with very competitive energy efficiency.
Lattice Boltzmann methods are known for their efficient and simple numerical schemes. Despite their
simplicity, these particle methods outperform almost any other CFD implementation when dealing with
shape-shifting, irregular and disordered boundary conditions. This fact alone gives them a very competitive
edge, but it is the combination of these characteristics with their highly parallel nature, and the still
growing fields of research and applications, that points to the present and future importance
of this method.
Noticeable results have been achieved in terms of simulation speed. The most promising of these
were obtained using the GPU: speedups on the order of hundreds. A cheap, common graphics card
can be used almost as a mini-cluster. This fact will lead to a great deal of work and investigation being
devoted to Lattice Boltzmann and GPUs.
Keywords: Lattice Boltzmann, LBGK, MRT, OpenMP, MPI, GPU, CUDA
Resumo
In this work, two numerical schemes of the Lattice Boltzmann method are implemented on three
different parallel computing platforms, each with its own model and way of thinking about parallelism:
a computer with multiple processors, a cluster of computers and a GPU. The latter, a graphics
processor, has an inherently parallel architecture, which allows it to reach very high computational
power and very competitive energy efficiency.
Lattice Boltzmann methods are recognized as simple but efficient numerical schemes. They are
particle methods which, despite their simplicity, have been shown to be more effective than other CFD
implementations when dealing with irregular geometries that change shape over time and are scattered
randomly throughout the fluid. This characteristic alone grants them a special status, but it is its
combination with the highly parallel nature of the scheme, and with the still growing body of knowledge
and applications, that suggests this method will be fundamental to the future of CFD.
Impressive results were obtained in terms of simulation speed. The implementation that stood out
the most was the one on the GPU: speedups on the order of hundreds were reached. GPUs are cheap,
almost commonplace components, present in any computer, yet they hold enormous computational power,
making them almost a mini-cluster. These facts are driving a great deal of research into the union
between particle models and GPUs.
Keywords: Lattice Boltzmann, LBGK, MRT, OpenMP, MPI, GPU, CUDA
Contents
1 Introduction
1.1 The problem and its relevance
1.2 Objectives
1.3 Organization of the thesis
2 Lattice Boltzmann Equation (LBE)
2.1 Introduction
2.2 Numerical Methods
2.2.1 LBGK
2.2.2 Multi-Relaxation Time method
2.3 Boundary Conditions (BC)
3 Parallel processing
3.1 Classifications of parallel computing
3.1.1 A primer on CPU architecture
3.1.2 Classifying parallel machines
3.2 Parallel Programming Models
3.3 OpenMP
3.4 MPI
3.5 Massively Parallel Processors: GPU
3.5.1 Introduction
3.5.2 CUDA architecture
3.5.3 CUDA programming model
3.5.4 An example
4 Implementations of the LB methods
4.1 Serial implementation
4.2 Shared memory implementation: OpenMP
4.3 Distributed memory implementation: MPI
4.4 Massively Parallel Processors: CUDA
5 Results
5.1 Test cases
5.1.1 Poiseuille Flow
5.2 Shared Memory parallelization: OpenMP
5.3 Distributed memory: MPI
5.3.1 Massively parallel processors: CUDA
6 Conclusion
.1 Terminology
.2 Rules of thumb for higher performances
List of Figures
2.1 Particles' direction of travel for the D2Q9 (left) and the D3Q19 (right).
2.2 Algorithm of the LBGK method.
2.3 Algorithm of MRT.
2.4 Periodic boundary conditions.
3.1 Simplified scheme of a CPU architecture.
3.2 Scheme of a UMA layout.
3.3 Scheme of a NUMA layout.
3.4 Distributed memory system.
3.5 Hybrid memory layout.
3.6 Representation of MPI's hierarchical chain.
3.7 Layout of a typical MPI program.
3.8 The GPU's architecture. The small squares are CUDA cores. Two rows of eight streaming processors can be seen. The remaining blocks represent memories.
3.9 The Tesla C1060 board.
3.10 Code running in the host with a single thread. Kernels are offloaded to the GPU and run with thousands of threads.
3.11 Availability of the different types of memory to the CUDA working structures.
3.12 Illustration of the organization of one bi-dimensional grid into blocks and one block into threads.
3.13 Memory layout in the device and its accessibility to the processors.
4.1 For a given node, each particle's PDF is stored one after the other, according to the numeration used for each direction. Nodes also follow a lattice numeration.
4.2 Domain divided in two parts, one white, one gray. Each part is assigned to a node of the cluster. In the case of the white sub-domain, there is an extra frame where data from the frontier cells of the neighboring sub-domain are made available: ghost cells.
4.3 Each direction of propagation of a particle has an array. For each array, nodes are assigned consecutively.
5.1 Longitudinal velocity profile obtained for the Poiseuille flow test case.
5.2 The error decreasing with the number of iterations. The criterion for convergence was set to 10^-6. It took 9556 iterations in a 256 × 64 mesh.
5.3 Bi-dimensional plot of the horizontal velocity of a Poiseuille flow for a converged case. The maximum velocity is in the middle of the channel section, and the velocity profile is constant throughout the channel.
5.4 Bi-dimensional plot of the modulus of the velocity for a Reynolds number of 1000.
5.5 Horizontal and vertical velocity profiles for, respectively, a longitudinal and a transverse section. Obtained with SRT for Re = 100.
5.6 Horizontal and vertical velocity profiles for, respectively, a longitudinal and a transverse section. Obtained with MRT for Re = 1000.
5.7 A solution is considered converged, for the cavity, when the residue is below 10^-7. A logarithmic scale was used for both the residue and the number of iterations for visualization purposes: oscillations in the residue were small enough to look like a straight line in a residue vs. iteration plot.
5.8 Horizontal velocity profile for the top half nodes in the vicinity of the wall (x = 1). The code is set to run 100 iterations for a Reynolds number of 5848.
5.9 Code scaling with the number of cores.
5.10 The increase in performance, up to a certain point, with the increase of the lattice dimensions.
5.11 Comparison between the three implementations. The filled nodes use the scale on the left, while the others use the scale on the right.
1 A 3 × 3 matrix stored in a row-major fashion.
2 The loop on the left accesses data non-contiguously in memory. The loop on the right uses a row-major accessing pattern.
3 The loop on the right is the unrolled version of the loop on the left.
List of Tables
4.1 General properties of the GPU.
5.1 Speedups obtained for the OpenMP implementation for up to four processors.
5.2 Speedups for different mesh sizes and cluster nodes.
5.3 Speedups for different mesh sizes and cluster nodes.
5.4 Comparison of the peak values in the three implementations.
Chapter 1
Introduction
1.1 The problem and its relevance
Computational Fluid Dynamics (CFD) is deeply and firmly rooted in the Navier-Stokes equations, but the
sheer computational effort required to solve them numerically has always driven the quest for faster
algorithms and for ways to decrease simulation time. Examples of both strategies include boundary
layer simplifications and the use of parallel computing.
Lattice Boltzmann methods hold all the physical validity of the Navier-Stokes equations while bringing a
simpler set of equations that is highly parallelizable. Adding these factors to the exceptionally fast
simulations performed with low cost Graphics Processing Units (GPUs) has generated high expectations
for this method as a future leader in many fields of the CFD world.
Parallelism is an intuitive way to increase the productivity of a given activity. Everyday life offers
numerous examples of systems that increase their productivity simply by operating in a parallel manner,
ranging from biological systems to industrial production lines.
The procedure is simple: divide the problem into smaller parts and introduce more active elements to
work on those parts. But bear in mind that not all tasks can be parallelized. As Fred Brooks said, "Nine
women can't make a baby in one month".
Imagine a coffee house with a queue of people wanting to buy coffee. This coffee house has only one
employee, one coffee machine - which can make only one coffee at a time - and one cash register. In this
case, the employee can only serve one client at a time. The rate of service would increase if the coffee
machine could make two coffees at the same time, or if a second employee was added to the system,
making coffees while the original employee handles the register. Better yet would be to have two of each:
two employees, two single-coffee machines and two cash registers.
Examples like this, seemingly simple, provide useful analogies for analyzing the advantages
and disadvantages of parallelism. And, as one introduces more realism, the trickier things
become, and the more they resemble the typical problems of computational parallelism. In
the example above, one could introduce constraints, such as the number of expected clients or the profit
to be made; these constraints determine what level of parallelism is needed to face them.
Many scientific problems can take advantage of parallelization. Examples include the simulation of
fluid flows, such as weather and climate forecasting, the interaction between atoms and molecules, code
cracking, medical visualization, finance, and so on.
So it is no surprise that most of the groundwork of parallelism, both theory and designs, traces back
to the 1960s and 70s. This era was also characterized by Moore's law, which tells us that every couple
of years the number of transistors on a die doubles. Back then it meant that some of the best
computers could easily become obsolete in a matter of months. Moore's law was found to extend to
computer memories, bandwidths and other digital components and devices.
But very recently, the number of transistors on a die could no longer be increased at the same pace.
It is not that the technology to do so is unavailable; rather, physical and economic limitations, such as
the required heat dissipation and the high energy consumption, made the transistor count reach a plateau.
In order to keep Moore's law alive, the architecture of a typical home computer has changed. Instead of
having one CPU, we now have more than one, either on the same die - multi-core - or on different dies -
multi-processor. In theory, this shift allowed the revival of Moore's law, but, in order to actually notice
the increased performance, software has to be purposely written to take advantage of the multi-core
paradigm.
Recently, servers and workstations have begun to feature processors with 8 or even 12 cores. They are
expensive, but given their high-demand workloads and the desired reliability, it is a price worth paying.
And, nowadays, a 4-core computer can be acquired cheaply - AMD's quad-core Phenom CPU is one
example - while a six-core part from Intel is already rumored for next year. So, ranging from hardware
design up to the applications, the use of parallelism has been a reality in recent years and will remain
one, without any doubt, into the near future.
The state of the art in parallel computing consists of Supercomputers, Clusters, the Grid and, more
recently, GPUs.
Supercomputers are a special type of computer, whose architecture is different from the one commonly
found in desktop or laptop computers. They have thousands of processors. Some of the top
Supercomputers include the Cray Jaguar, with a peak speed of 2.3 petaflops, and IBM's Roadrunner,
peaking at 1.7 petaflops.
A cluster is the connection, through a high speed network, of many separate computers, each with its
own memory resources and processors. A cluster can be seen as a single computing machine.
If the cluster nodes, the computers, are spread across large distances, the cluster goes by the name of
Grid. One of the biggest Grids in the world is the TeraGrid, with more than 1 petaflop of computing
capability and 30 petabytes of distributed storage.
One advantage of the Grid is that regular computer users can sign up and, when their computers
are idle, let the Grid use their resources to perform some type of work and store data. Examples
of this include seti@home, which helps process radio signals from space that may reveal extraterrestrial
intelligence; folding@home, which aids the study of protein folding, leading to better and innovative drugs;
and lhc@home, which works on elementary particle physics, fundamental for unraveling, for example,
the origins of the Universe.
GPUs are processors that work alongside the CPUs, relieving them of the intensive work of graphics
rendering. Built for that purpose, their architecture allows much higher computational power and
greater energy efficiency. These characteristics are making them an attractive and cheaper way of
'competing' with Supercomputers and Clusters, as one can pack a number of cheap GPUs together and
achieve exceptional computational power.
Numerous applications are already taking advantage of GPUs. In terms of CFD, Euler methods
have been successfully ported to the GPU, with achieved speedups around 12. A Burgers' equation
solver reached 18. But Lattice Boltzmann hit one of the strongest speedups: a 341 times improvement
over the CPU implementation [15].
Since Lattice Boltzmann methods have their strong point in handling traditionally cumbersome
boundary conditions well, the combination of this characteristic with a highly parallel implementation
is very promising.
1.2 Objectives
At the present moment, Lattice Boltzmann methods are an active field of research and development.
As some of the research carried out in the laboratory includes scenarios where recent papers point
to LB as the best choice for a fast numerical solution, LASEF decided to start, from the very bottom,
with these methods. Thus, the main purpose of this thesis was to establish some groundwork on
Lattice Boltzmann methods and their parallel implementations.
1.3 Organization of the thesis
Chapter 2 introduces, from a historical perspective, the Lattice Boltzmann methods and the theoretical
background of two numerical implementations. Chapter 3 gives a brief description of the machines
used for parallel computation, reviewing the programming models used on each machine.
Then, in Chapter 4, all the developed implementations are described in terms of the modifications
performed to adapt the serial code to each parallel machine. All the results obtained with the different
parallel approaches are compared and contrasted in Chapter 5. The most relevant conclusions of the
present work are presented in Chapter 6. Finally, one can find, in the appendices, a small glossary
and a brief description of basic principles of code optimization.
Chapter 2
Lattice Boltzmann Equation (LBE)
2.1 Introduction
The Universe is composed of atoms and molecules interacting with each other. It is like a random dance,
whose erratic behavior is best described within the field of statistical mechanics. In this sea of particles,
the probability of finding one particle at position $\vec{x}$, at time $t$ and with velocity $\vec{v}$, is given
by $f$. This $f$ is paramount to the Boltzmann Equation (BE) (eq. 2.1) and is called the probability
density function (PDF). It is expressed in units of $\left[\frac{1}{m^3}\left(\frac{m}{s}\right)^{-3}\right]$.

$$\left[\partial_t + \frac{\vec{p}}{m}\cdot\partial_{\vec{x}} + \vec{F}\cdot\partial_{\vec{p}}\right] f(\vec{x},\vec{p},t) = \frac{\partial f}{\partial t} = C \quad (2.1)$$
The left side of the Boltzmann equation represents the movement described by the particles (coupled
with the force field $\vec{F}$), while the right side represents the inter-particle collision operator. The letters
$\vec{p}$ and $m$ are used to represent the momentum and the mass of the atoms/molecules.
Boltzmann considered particles to be point-like entities in a dilute gas. These particles can only
interact through short-range two-body potentials and, as a consequence of the previous considerations,
there is no correlation between particles entering a collision (the molecular chaos assumption). After the
collision, however, the particles are coupled by the conservation laws of mass and momentum [14]. As such,
the collision operator calculated for a two-body collision is the following

$$C = C_{1,2} = \int \left(f_{1',2'} - f_{1,2}\right) g\,\sigma(g,\Omega)\, d\Omega\, d\vec{p}_2 \quad (2.2)$$

Here the collision operator, $C$, was renamed $C_{1,2}$ to reinforce the idea of a two-particle collision.
Also, with $f_i$ the PDF of particle $i$, the molecular chaos assumption gives $f_{1,2} = f_1 f_2$. The other
parameters are $\sigma$, the differential cross section, $\Omega$, the solid angle, $g$, the relative velocity of the
molecules, and $\vec{p}$, the momentum.
Boltzmann later showed, through his famous H-Theorem [4], the unification of two previously
unrelated fields - mechanics and thermodynamics - and that the BE could compute properties such as
diffusivity, viscosity and thermal conductivity, as well as other transport parameters [14].
Following in the wake of this idea, it was later shown, through the Chapman-Enskog procedure [4],
that the interaction between particles could lead to the mean macroscopic hydrodynamic properties.
This means that we can simulate fluid dynamics through the underlying statistical particle collisions.
In 1986, it was shown that a cellular automaton could reproduce the complexity of a fluid flow by
simulating the underlying microscopic interactions. Cellular automata are simple evolving models
defined on a regular lattice, where a certain number of particles reside in each node and change
their states through interactions between these particles [3]. The interactions are defined by a set
of previously given rules, and the states tend to be boolean (zero or one); hence, the microscopic
interactions are called boolean microdynamics [14].
This led to the Lattice Gas Cellular Automata (LGCA), where the boolean microdynamics gave rise
to the desired macroscopic parameters. However, this approach had some crucial faults, namely the
inability to reach high Reynolds numbers, the statistical noise and the difficulty of moving to three
dimensions [14].
The attempt to avoid these drawbacks led to the creation of the Lattice Boltzmann Equation (LBE)
[3]. It was a smooth transition, since the LBE was hidden inside the LGCA formulations. Despite having
eliminated the aforementioned LGCA problems, it still required some tuning to achieve the desired
behavior. The main equation of the LBE is as follows

$$\Delta_i f_i = C_{1,2} = A_{ij}\left(f_j - f_j^{eq}\right) \quad (2.3)$$

The left side represents the streaming step, while the right side represents the collision operator. All
the non-linearities of the Boltzmann Equation are placed in the collision operator (more precisely in the
$f^{eq}$ term), while the streaming step is totally linear [1]. The matrix $A_{ij}$ is called the scattering matrix.
Its entries are calculated in each iteration, as they depend on the boolean states of the lattice's nodes.
Experiments with LBE models revealed that one can go as low in viscosity (or as high in Reynolds
number) as the lattice allows and that, for numerical simulation of turbulence, the LBE is fairly
competitive with spectral methods.
More recently, further simplifications and improvements on the LBE led to the creation of its most
used models: the LBGK, MRT and Entropic models [1]. The last one holds the promise of very strong
stability; an analysis of this claim is still to be done, but it seems to be a promising method.
This leads us to what is being done with Lattice Boltzmann (LB) in our days and why.
LB methods can be found in many different areas. Examples include the simulation of combustion
inside porous media, blood flow inside arteries including the interaction between the blood and the
vessel's wall, microfluidics, especially in the area of liquid crystals, and the cinema industry, for CG
fluid flows and particle suspensions. A unique niche can be found in the linkage between molecular
dynamics and 'continuum' computational dynamics.
Overall, this method is simple and flexible, easy to implement and 'easily' adapted to non-uniform
grids and to time-changing irregular geometries. But it is the fact that it is highly parallel that has
led to its recent resurgence.
2.2 Numerical Methods
In general, every LB implementation consists of a temporal loop where, at a given time and for every
node of the lattice, there is a collision between particles, followed by the propagation of these particles
to neighboring nodes. The propagation path is well defined by the lattice used.
Differences between implementations come from the number of particles present in each node (at any
given time), the way the collision step is performed, and the type of formulation used for the boundary
conditions [14]. Lattice Boltzmann schemes are all explicit and second-order accurate (more in the case
of turbulence), with the possibility of dropping to first order at the boundaries if first-order boundary
conditions are used [14].
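The collide-then-stream loop described above can be sketched as follows. This is a generic illustration, not the thesis code: the D2Q9 velocity set and weights are those defined later in this chapter, the boundaries are simply periodic, and the collision relaxes toward a zero-velocity equilibrium as a placeholder.

```python
# Generic Lattice Boltzmann time loop: collide, then propagate (stream).
# Minimal pure-Python sketch on a D2Q9 lattice with periodic wrap-around.

# D2Q9 propagation vectors (index 0 is the rest particle; odd indices are
# axis directions, even indices are diagonals, as in eq. 2.4).
E = [(0, 0), (1, 0), (1, 1), (0, 1), (-1, 1),
     (-1, 0), (-1, -1), (0, -1), (1, -1)]
W = [4/9, 1/9, 1/36, 1/9, 1/36, 1/9, 1/36, 1/9, 1/36]

def collide(f, omega=1.0):
    """Relax every node's PDFs toward a zero-velocity equilibrium rho*W_i."""
    for node in f:
        rho = sum(node)
        for i in range(9):
            feq = rho * W[i]
            node[i] -= omega * (node[i] - feq)
    return f

def stream(f, nx, ny):
    """Move each PDF to the neighboring node along its direction (periodic)."""
    out = [[0.0] * 9 for _ in range(nx * ny)]
    for y in range(ny):
        for x in range(nx):
            for i, (ex, ey) in enumerate(E):
                xd, yd = (x + ex) % nx, (y + ey) % ny
                out[yd * nx + xd][i] = f[y * nx + x][i]
    return out

def step(f, nx, ny, omega=1.0):
    """One time step of the generic LB loop."""
    return stream(collide(f, omega), nx, ny)
```

A uniform field at rest is a fixed point of this loop: collision leaves each node at equilibrium and streaming merely exchanges identical values, while the total mass stays constant.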
Generally, the number of particles used per node and the type of space in which they exist are represented
as DnQm, where n stands for the number of dimensions and m for the number of particles. It is important
to choose the number of particles properly for a given space, as some models may be unable to capture
certain physical phenomena. This is due to the stability criterion of the LB: the lattice has to be able to
support the fastest speed of physical information (Courant number). For this reason, most LB methods
are limited to incompressible regimes (Mach numbers below 0.33) [18] [8]. Also, a wrong choice of the
number of particles and their directions of propagation may lead, through the Chapman-Enskog
procedure, to an equation different from the Navier-Stokes.
For two-dimensional simulations, the D2Q9 model was used. In this model, every node has 9 particles,
8 of which move to the 8 neighboring nodes, while one particle remains at rest. For three-dimensional
lattices, the D3Q19 model was used.
Even among DnQm models that can recover the Navier-Stokes equations, the number of particles and
their propagation paths still matter in terms of numerical errors and the quality of the obtained
results. For example, the D3Q15 model is very susceptible to short-wave oscillations [8], while the D3Q19
is not. That is due not only to the four extra particles, but also to the directions in which particles are
free to move. Choosing the D3Q27 model would not bring much more stability and precision than the
D3Q19, but it would bring more complexity and computational demand.
According to the coordinate system and the indexing chosen, it is necessary to define the translation
vectors of the particles and their weights, $e$ and $W$ respectively. These weights can be seen as
adimensional masses: the lighter the mass, the further the particle can travel (for the same time step).

Figure 2.1: Particles' direction of travel for the D2Q9 (left) and the D3Q19 (right).

For the D2Q9 model, represented in figure 2.1:

$$e_i^{D2Q9} = \begin{cases} (0,0), & i = 0 \\ (\pm 1, 0),\ (0, \pm 1), & i = 1, 3, 5, 7 \\ (\pm 1, \pm 1), & i = 2, 4, 6, 8 \end{cases} \quad (2.4)$$

$$W_i^{D2Q9} = \begin{cases} 4/9, & i = 0 \\ 1/9, & i = 1, 3, 5, 7 \\ 1/36, & i = 2, 4, 6, 8 \end{cases} \quad (2.5)$$

For the D3Q19, also in figure 2.1:

$$e_i^{D3Q19} = \begin{cases} (0,0,0), & i = 0 \\ (\pm 1, 0, 0),\ (0, \pm 1, 0),\ (0, 0, \pm 1), & i = 1, 2, \ldots, 6 \\ (\pm 1, \pm 1, 0),\ (0, \pm 1, \pm 1),\ (\pm 1, 0, \pm 1), & i = 7, 8, \ldots, 18 \end{cases} \quad (2.6)$$

$$W_i^{D3Q19} = \begin{cases} 1/3, & i = 0 \\ 1/18, & i = 1, 2, \ldots, 6 \\ 1/36, & i = 7, 8, \ldots, 18 \end{cases} \quad (2.7)$$
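The D2Q9 set of eqs. (2.4)-(2.5) can be checked numerically in a few lines: the weights sum to one, the first moment of the velocity set vanishes, and the second moment is isotropic with $c_s^2 = 1/3$ on the diagonal. A small illustrative script (the index ordering here is one possible choice consistent with eq. 2.4):

```python
# Sanity checks on the D2Q9 velocity set and weights of eqs. (2.4)-(2.5).
E = [(0, 0), (1, 0), (1, 1), (0, 1), (-1, 1),
     (-1, 0), (-1, -1), (0, -1), (1, -1)]
W = [4/9, 1/9, 1/36, 1/9, 1/36, 1/9, 1/36, 1/9, 1/36]

total = sum(W)                                        # zeroth moment: 1
mom_x = sum(w * e[0] for w, e in zip(W, E))           # first moment: 0
mom_xx = sum(w * e[0] * e[0] for w, e in zip(W, E))   # diagonal second moment: 1/3
mom_xy = sum(w * e[0] * e[1] for w, e in zip(W, E))   # off-diagonal: 0
```

These moment identities are exactly what the Chapman-Enskog procedure relies on to recover the Navier-Stokes equations from the discrete set.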
In Lattice Boltzmann methods, it is typical to work in lattice units - the lattice spacing and the time
step are taken as unity. As a result, the obtained hydrodynamic properties need to be scaled in order
to be physically consistent with experimental values. The main scale factors are those for time, space
and mass. All other scale factors can be deduced from these three.
Let $N$ be the number of lattice nodes along one dimension of the domain and $L$ the physical dimension
of the domain (in centimeters). Then the lattice spacing, $\Delta x$, is given (in centimeters) by

$$S_l = \Delta x = \frac{L}{N - 1} \quad (2.9)$$

The two other scale factors are:

$$S_t = \Delta t = \frac{c_s\, \Delta x}{C_s} \quad (2.10)$$

$$S_m = m \quad (2.11)$$

where $C_s$ and $c_s$ are, respectively, the physical and the lattice speed of sound, $m$ is the physical
mass of the fluid molecules (in grams) and $\Delta t$ the time step (in seconds). The lattice speed of sound is a
constant, $c_s = 1/\sqrt{3}$, which holds for both the two- and three-dimensional lattices used.
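The scale factors of eqs. (2.9)-(2.10) can be wrapped in a small helper. The domain length, node count and physical speed of sound in the example below are illustrative values, not taken from the thesis:

```python
import math

# Conversion between lattice and physical units (eqs. 2.9-2.10).
# L is the physical length of the domain, N the number of lattice nodes
# along that dimension, Cs the physical speed of sound.
CS_LATTICE = 1.0 / math.sqrt(3.0)  # lattice speed of sound, cs = 1/sqrt(3)

def space_scale(L, N):
    """Physical size of one lattice spacing: S_l = L / (N - 1)."""
    return L / (N - 1)

def time_scale(L, N, Cs):
    """Physical duration of one time step: S_t = cs * S_l / Cs."""
    return CS_LATTICE * space_scale(L, N) / Cs

# Example: a 10 cm channel resolved with 101 nodes, Cs = 34000 cm/s.
dx = space_scale(10.0, 101)            # 0.1 cm per lattice spacing
dt = time_scale(10.0, 101, 34000.0)    # seconds per lattice time step
```

Any derived quantity (velocity, viscosity, pressure) is converted by combining these three base factors, as the text notes.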
In the presented cases, two collision operators were used, both simplifications of the LBE model: the
BGK and the MRT.
2.2.1 LBGK
The Lattice Bhatnagar-Gross-Krook (LBGK) method, sometimes called Single Relaxation Time (SRT),
is the name of the model that uses the BGK collision operator. This collision operator was obtained by
simplifying the scattering matrix $A_{ij}$, since only one of the numerous parameters present in the matrix
is actually responsible for controlling viscosity and transport phenomena. All the other elements of this
matrix are there to control the non-hydrodynamic modes (called ghost modes) that induce numerical
instability [14]. This simplification reduces the scattering matrix to a single relaxation parameter: $\omega$.

$$C_{BGK} = -\omega\left(f - f^{eq}\right) \quad (2.12)$$
This operator removes some of the mathematical complexity and computational demand of the LBE
without compromising the physics - the Navier-Stokes equations can still be recovered from the LBGK
model [1].
Nonetheless, all the modes - the rates of transfer of mass, momentum, heat, etc. - relax on the same
time scale [14]. This implies, for example, that the Prandtl number is unitary and that short-wave
oscillations and numerical instabilities will acquire enough weight to take their toll on the model's
stability [18]. In theory, the LBGK can simulate any given Reynolds number as long as the lattice is
fine enough to capture the small scales. For example, a 128 × 128 lattice cannot capture Reynolds
numbers larger than 2000, but if we increase the mesh to 200 × 200, we can reach Reynolds numbers
up to 8000. This has to do with gradients that occur at singular points. These gradients create
short-wave oscillations that are only loosely captured by the lattice - an aliasing effect that leads to
numerical instabilities and strips the results of meaning.
So, generally, the LBGK model is given by
13
fi(x + ci t, t + t) − fi(x, t) = −ω(fi(x, t) − feqi (x, t)) (2.13)
The terms on the left-hand side represent the advection, and the ones on the right the collision.
ω is defined as the inverse of the relaxation time τ and is linked with the kinematic viscosity:
ω = 1/τ, 0 < ω < 2 (2.14)
ν = (2τ − 1)/6 (2.15)
feq is the Maxwell-Boltzmann equilibrium distribution approximation, a function of the local conserved
quantities: ρ, v, etc. It is inside this term that all the non-linearities of the Navier-Stokes equations are hidden.
feqi ≈ ρWi [ 1 + (ei · u)/cs² + (ei · u)²/(2cs⁴) − u²/(2cs²) + (ei · u)³/(6cs⁶) − (ei · u)u²/(2cs⁴) ] + O[u⁴] (2.16)
Here cs represents the speed of sound in lattice units.
Finally, conservation of mass (eq. 2.17) and momentum (eq. 2.18) have to be enforced, since they
were implicitly included in the scattering matrix Aij.
ρ = Σi=0..m−1 fi (2.17)
ρuα = Σi=0..m−1 fi eiα (2.18)
Since this method only holds for incompressible situations (or, more precisely, cases with small density
variations), the pressure, if needed, is obtained from:
p = c2sρ (2.19)
In case there are external forces acting upon the fluid, each particle will receive a momentum input
in accordance with the mass of the particle and its direction of propagation [14].
Figure 2.2: Algorithm of the LBGK method.
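As a concrete illustration of the collision step in the algorithm above, here is a minimal C sketch of one LBGK collision at a single D2Q9 node. The direction numbering, the arrays W, EX, EY and the function name lbgk_collide are our own illustrative choices, not the thesis code, and the equilibrium is truncated at second order in u.

```c
#include <assert.h>
#include <math.h>

/* Illustrative D2Q9 lattice constants (numbering is our own choice). */
static const double W[9]  = {4.0/9, 1.0/9, 1.0/9, 1.0/9, 1.0/9,
                             1.0/36, 1.0/36, 1.0/36, 1.0/36};
static const int    EX[9] = {0, 1, 0, -1, 0, 1, -1, -1, 1};
static const int    EY[9] = {0, 0, 1, 0, -1, 1, 1, -1, -1};

/* One BGK collision at a node: f_i <- f_i - omega*(f_i - feq_i), eq. 2.13. */
void lbgk_collide(double f[9], double omega)
{
    double rho = 0.0, ux = 0.0, uy = 0.0;
    for (int i = 0; i < 9; i++) {            /* moments, eq. 2.17 and 2.18 */
        rho += f[i];
        ux  += f[i] * EX[i];
        uy  += f[i] * EY[i];
    }
    ux /= rho;
    uy /= rho;
    double usq = ux*ux + uy*uy;
    for (int i = 0; i < 9; i++) {            /* feq truncated at O(u^2)    */
        double eu  = EX[i]*ux + EY[i]*uy;    /* c_s^2 = 1/3 in lattice units */
        double feq = rho * W[i] * (1.0 + 3.0*eu + 4.5*eu*eu - 1.5*usq);
        f[i] -= omega * (f[i] - feq);        /* relaxation, eq. 2.13       */
    }
}
```

The collision conserves mass and momentum by construction, whatever the value of ω; the streaming step then moves each f_i to the neighbor in direction (EX[i], EY[i]).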
2.2.2 Multi-Relaxation Time method
With the purpose of having a model that could reduce the ghost modes, allow high Reynolds numbers
in (relatively) small lattices, while maintaining the simplicity of the LBGK, the Multi-Relaxation Time
(MRT) method was proposed [8]. It was obtained through simplifications of the scattering matrix
of the LBE, while maintaining the possibility to control how the ghost modes and other instabilities
behave (although they cannot be completely eliminated) [1]. Reynolds numbers as high as 10⁶ have
been obtained for a 2D lid driven cavity, although the accuracy of the results could not be verified [18] [14].
The collision term of the MRT model is given by:
C = −[M]⁻¹[S] (|R〉 − |Req〉) (2.20)
To be in accordance with the literature, we have used [ ] to represent a matrix and | 〉 to represent a
column vector.
S is the relaxation matrix. It is diagonal and every element is given by sii = ωi = 1/τi. The possibility
of having multiple relaxations is what offers this model its increased stability. Note that the entries of S
that act upon mass and momentum are taken to be zero to enforce the conservation of these properties.
If all non-zero values of S take the same ω, the MRT model will have the same behavior as the LBGK
(with the same relaxation) [8].
S2D = diagonal (0, s2, s3, 0, s5, 0, s7, s8, s9) (2.21)
S3D = diagonal (0, s1, s2, 0, s4, 0, s4, 0, s4, s9, s10, s9, s10, s13, s13, s13, s16, s16, s16) (2.22)
The fluid viscosity is obtained by:
ν = (1/3)(1/s9 − 1/2) (2.23)
R is a column vector that stores the macroscopic properties of the node. The number of elements of
the column is equal to the number of particles living in a node. The elements of R are the density ρ, the
kinetic energy e, the part of the kinetic energy independent of the density ε, the momentum j, the energy
flux independent of the mass flux q, the viscous stress tensor p and an anti-symmetric third-order moment m [8].
R2D = (ρ, e, ε, jx, qx, jy, qy, pxx, pxy) (2.24)
R3D = (ρ, e, ε, jx, qx, jy, qy, jz, qz, pxx, πxx, pww, πww, pxy, pyz, pxz, mx, my, mz) (2.25)
R is obtained by performing a matrix transformation of f through the matrix M. This matrix depends
upon the numbering given to the directions a particle can travel (fig. 2.1). The use of a different numbering
makes it necessary to re-order the columns of M, and it is common in the literature to spot different versions
of these matrices. But keep in mind that its function is to summarize the conservation laws, as it is the
matrix version of eq. 2.17 and eq. 2.18, among others.
|R〉 = [M] |f〉 (2.26)
MD2Q9 =
 1  1  1  1  1  1  1  1  1
-4 -1  2 -1  2 -1  2 -1  2
 4 -2  1 -2  1 -2  1 -2  1
 0  1  1  0 -1 -1 -1  0  1
 0 -2  1  0 -1  2 -1  0  1
 0  0  1  1  1  0 -1 -1 -1
 0  0  1 -2  1  0 -1  2 -1
 0  1  0 -1  0  1  0 -1  0
 0  0  1  0 -1  0  1  0 -1
(2.27)
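The transformation of eq. 2.26 is nothing more than a dense matrix-vector product. A small C sketch, using the MD2Q9 entries of eq. 2.27 verbatim (the helper name moment_transform is ours): the first component of the result is the density of eq. 2.17, and the fourth and sixth give the momentum components.

```c
#include <assert.h>
#include <math.h>

/* MD2Q9 exactly as in eq. 2.27. */
static const double MD2Q9[9][9] = {
    { 1,  1,  1,  1,  1,  1,  1,  1,  1},
    {-4, -1,  2, -1,  2, -1,  2, -1,  2},
    { 4, -2,  1, -2,  1, -2,  1, -2,  1},
    { 0,  1,  1,  0, -1, -1, -1,  0,  1},
    { 0, -2,  1,  0, -1,  2, -1,  0,  1},
    { 0,  0,  1,  1,  1,  0, -1, -1, -1},
    { 0,  0,  1, -2,  1,  0, -1,  2, -1},
    { 0,  1,  0, -1,  0,  1,  0, -1,  0},
    { 0,  0,  1,  0, -1,  0,  1,  0, -1},
};

/* |R> = [M]|f> (eq. 2.26): R = (rho, e, eps, jx, qx, jy, qy, pxx, pxy). */
void moment_transform(const double f[9], double R[9])
{
    for (int a = 0; a < 9; a++) {
        R[a] = 0.0;
        for (int i = 0; i < 9; i++)
            R[a] += MD2Q9[a][i] * f[i];
    }
}
```

For a uniform population every row except the first sums to zero, so R reduces to (ρ, 0, ..., 0), which is exactly the conservation statement the text refers to.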
MD3Q19 =
  1   1   1   1   1   1   1  1  1  1  1  1  1  1  1  1  1  1  1
-30 -11 -11 -11 -11 -11 -11  8  8  8  8  8  8  8  8  8  8  8  8
 12  -4  -4  -4  -4  -4  -4  1  1  1  1  1  1  1  1  1  1  1  1
  0   1  -1   0   0   0   0  1 -1  1 -1  1 -1  1 -1  0  0  0  0
  0  -4   4   0   0   0   0  1 -1  1 -1  1 -1  1 -1  0  0  0  0
  0   0   0   1  -1   0   0  1  1 -1 -1  0  0  0  0  1 -1  1 -1
  0   0   0  -4   4   0   0  1  1 -1 -1  0  0  0  0  1 -1  1 -1
  0   0   0   0   0   1  -1  0  0  0  0  1  1 -1 -1  1  1 -1 -1
  0   0   0   0   0  -4   4  0  0  0  0  1  1 -1 -1  1  1 -1 -1
  0   2   2  -1  -1  -1  -1  1  1  1  1  1  1  1  1 -2 -2 -2 -2
  0  -4  -4   2   2   2   2  1  1  1  1  1  1  1  1 -2 -2 -2 -2
  0   0   0   1   1  -1  -1  1  1  1  1 -1 -1 -1 -1  0  0  0  0
  0   0   0  -2  -2   2   2  1  1  1  1 -1 -1 -1 -1  0  0  0  0
  0   0   0   0   0   0   0  1 -1 -1  1  0  0  0  0  0  0  0  0
  0   0   0   0   0   0   0  0  0  0  0  0  0  0  0  1 -1 -1  1
  0   0   0   0   0   0   0  0  0  0  0  1 -1 -1  1  0  0  0  0
  0   0   0   0   0   0   0  1 -1  1 -1 -1  1 -1  1  0  0  0  0
  0   0   0   0   0   0   0 -1 -1  1  1  0  0  0  0  1 -1  1 -1
  0   0   0   0   0   0   0  0  0  0  0  1  1 -1 -1 -1 -1  1  1
(2.28)
Req is a column vector responsible for storing the equilibrium macroscopic properties (similar to
feq in the LBGK). For the bi-dimensional case, the equilibrium terms are given by:
eeq = −2ρ + 3(jx² + jy²)
εeq = ρ − 3(jx² + jy²)
qxeq = −jx
qyeq = −jy
pxxeq = jx² − jy²
pxyeq = jx jy
And for the three dimensional case:
eeq = −11ρ + (19/ρ0)(jx² + jy² + jz²)
εeq = wε ρ + (wεj/ρ0)(jx² + jy² + jz²)
qxeq = −(2/3) jx
qyeq = −(2/3) jy
qzeq = −(2/3) jz
pxxeq = (1/ρ0)[2jx² − (jy² + jz²)]
pxyeq = (1/ρ0) jx jy
pyzeq = (1/ρ0) jy jz
pxzeq = (1/ρ0) jx jz
pwweq = (1/ρ0)[jy² − jz²]
πxxeq = wxx pxxeq
πwweq = wxx pwweq
mxeq = myeq = mzeq = 0
Here wε, wεj and wxx are weights used together with the relaxation matrix to provide stability.
According to [8], optimal stability is obtained with wε = 0, wεj = −475/63, wxx = 0 and s1 = 1.19,
s2 = s10 = 1.4, s4 = 1.2, s16 = 1.98.
More concisely, this method consists in working in the momentum space for the collision and in the
velocity space for the propagation. That means that in the collision phase we start by calculating the
macroscopic properties, R and Req, then calculate the post-collision macroscopic properties, R∗ (eq. 2.29),
and finally calculate the post-collision PDF as:
|f∗〉 = [M]⁻¹ (|R〉 − [S] (|R〉 − |Req〉)) = [M]⁻¹ |R∗〉 (2.29)
The streaming step is equal to the one of the LBGK.
Figure 2.3: Algorithm of MRT.
Overall, this method is only 15% slower than the SRT [8] [17]. All of this extra time is spent in the
collision step.
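The three steps just described (transform to moments, relax each moment, transform back) can be sketched generically in C. The function name and the flat row-major storage of M and its inverse are our own conventions; the caller supplies the matrices, the diagonal of S and the equilibrium moments.

```c
#include <assert.h>
#include <math.h>

/* One MRT collision at a node, following eq. 2.26 and 2.29 (a sketch).
   M and Minv (its inverse) are m x m, stored row-major; S holds the
   diagonal of the relaxation matrix; Req holds the equilibrium moments. */
void mrt_collide(int m, const double *M, const double *Minv,
                 const double *S, const double *Req, double *f)
{
    double R[32], Rstar[32];                  /* assumes m <= 32 */
    for (int a = 0; a < m; a++) {             /* |R> = [M]|f>    */
        R[a] = 0.0;
        for (int i = 0; i < m; i++)
            R[a] += M[a*m + i] * f[i];
    }
    for (int a = 0; a < m; a++)               /* relax each moment separately */
        Rstar[a] = R[a] - S[a] * (R[a] - Req[a]);
    for (int i = 0; i < m; i++) {             /* back to velocity space */
        f[i] = 0.0;
        for (int a = 0; a < m; a++)
            f[i] += Minv[i*m + a] * Rstar[a];
    }
}
```

The extra cost relative to the LBGK is the pair of matrix products, which is where the 15% overhead mentioned above comes from.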
2.3 Boundary Conditions (BC)
Boundary conditions have an important effect on the stability of the models [1]. Although complex
boundary conditions are still an active field of research [14], the simpler boundary conditions used in this
work are pretty well matured. They are the no-slip, the sliding wall and the periodic BC’s.
The no-slip boundary condition is replaced by the bounce-back boundary condition (BBBC). In the
no-slip case, we ensure that the velocity tangential to the wall is zero (and the normal component as well,
if the wall is non-permeable). In the BBBC this is not needed. Since LB works on the mesoscale, the particles'
velocity at the interface may be different from zero, even though macroscopically it is perceived as such.
Therefore, the BBBC only guarantees the conservation of the particles' energy. They collide elastically
with the wall and return in the opposite direction. Forcing the macroscopic velocity on the wall to be
zero would dampen the flow near the wall, as energy would not be conserved.
The wall where particles bounce back can either be placed on a node or in between nodes of the
grid. If the wall is placed on a grid node, then it takes two iterations for the presence of the wall to
be felt by the neighboring flow (one iteration to receive the particles, another to propagate them back).
On-grid walls give rise to first order accurate boundary conditions. If the wall is in between nodes, the
particle collides with the wall at a mid point, which means it bounces back to the fluid nodes in the
same iteration. This type of approach offers second order accuracy.
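The bounce-back itself is just a permutation of the populations at the wall node. A minimal C sketch for D2Q9; the direction numbering (E=1, N=2, W=3, S=4, NE=5, NW=6, SW=7, SE=8) and the name bounce_back are our own assumptions, not necessarily those of fig. 2.1.

```c
#include <assert.h>
#include <math.h>

/* Opposite direction of each D2Q9 population, under our assumed numbering:
   E=1, N=2, W=3, S=4, NE=5, NW=6, SW=7, SE=8 (0 is the rest particle). */
static const int OPP[9] = {0, 3, 4, 1, 2, 7, 8, 5, 6};

/* Full bounce-back at a solid node: every population is sent back the
   way it came, mimicking an elastic collision with the wall. */
void bounce_back(double f[9])
{
    double tmp[9];
    for (int i = 0; i < 9; i++) tmp[i] = f[i];
    for (int i = 0; i < 9; i++) f[i] = tmp[OPP[i]];
}
```

Mass is conserved since the populations are only permuted; the momentum at the node is reversed, which is what makes the wall impermeable on average.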
A sliding wall is a special case of the bounce-back condition where the particles reflected from the wall
receive momentum from the wall, propagating this momentum to the fluid below.
The simplest way to apply a sliding wall BC is using the macroscopic properties, such as the velocity
of the wall, to calculate the equilibrium PDF’s of the particles propagating out of the wall. This is a
first order approximation [8] [1].
fi = Wi ρ0 (ei · Uwall)/cs² (2.30)
A second order approximation would take into account the interaction of the particles colliding with
the wall and being reflected back [8] [1].
f∗i = fi + 2Wi ρ0 (ei · Uwall)/cs² (2.31)
This is a super-set of the first order approximation.
The periodic boundary condition, used for infinite domains, consists in taking the particles leaving
the computational domain and making them re-enter through the opposite side. For example,
let us suppose the periodic boundaries are the left and right edges of a bi-dimensional rectangular domain.
A node placed on the right frontier will have its east, north-east and south-east particles leaving the
domain. These particles are placed in the first node of the left boundary with the same
y value. The same occurs for the particles leaving the left frontier.
Figure 2.4: Periodic boundary conditions.
In terms of the PDF's, this condition is translated in the following way (with left and right denoting
the two periodic boundaries):
fin,left(E) = fout,right(E)
fin,left(NE) = fout,right(NE)
fin,left(SE) = fout,right(SE)
fin,right(W) = fout,left(W)
fin,right(NW) = fout,left(NW)
fin,right(SW) = fout,left(SW)
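In an implementation, this copy reduces to wrapping the destination index during streaming. A one-line C helper (the name wrap is ours):

```c
#include <assert.h>

/* Periodic wrap of a lattice index on an N-node row: a particle streaming
   off one edge re-enters through the opposite edge. */
int wrap(int x, int N)
{
    return (x % N + N) % N;   /* handles x == -1 and x == N alike */
}
```

During streaming, the east-going population of node x is written to node wrap(x + 1, N), so the right-most node feeds the left-most one automatically.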
Chapter 3
Parallel processing
3.1 Classifications of parallel computing
Roughly speaking, parallel computing consists in dividing a problem into smaller ones, sending each one
of these parts to be solved on a different processor, and combining all of the results. If all goes well, the
time it takes the problem to be solved in parallel, Tpar, should be less than the time it would take to
solve the same problem in a sequential way, Tseq. The ratio between the two, the speedup S, should be
greater than one, and is a way of comparing the performances.
S = Tseq / Tpar (3.1)
The processors mentioned above are either CPUs or GPUs. The latter, more correctly called
a co-processor, shall be dealt with later. For now, we will stick with the CPU-based parallel machines.
Nonetheless, some of the given concepts will be applicable to the GPUs.
3.1.1 A primer on CPU architecture
Given it’s pivotal nature in parallel machines, it is important to understand how the CPU works. This
kind of processor is better understood with the von Neumann architecture model [13]. It consists in
three major components: a central processing unit (or CPU), the main memory and the Input/Output
systems.
Figure 3.1: Simplified scheme of a CPU architecture.
A program, which is a set of instructions, is stored in the global memory, the RAM. An input, such
as a key stroke, tells the Control Unit, the CU, to retrieve a certain instruction from memory - fetch
operation. It then decodes the instruction and executes it. If a logic or arithmetic operation is required,
the ALU, the Arithmetic and Logic Unit, will be summoned and fed with the data.
If intermediate values are present, they are stored in the nearby memory. First the registers, then the
caches and, finally, the main memory. This hierarchy of memory storage has to do with the time it takes
to read/write to the memory. The register memory is the closest, therefore the fastest memory. It’s also
the smallest. Caches have bigger storage than registers but are further away. The slowest memory, yet
the biggest, is the main memory. This is where final values are recorded.
Finally, the result of the program is sent to the output. The display, for example.
All the connections between elements of the CPU, and between the CPU and the main memory, are
made by buses, collections of physical wires.
This was an example of how a CPU works. Nowadays, CPU architectures are much more complex
than this one, and they use far more complex instruction cycles than the fetch-decode-execute model
[13], mainly because a CPU has to be able to perform well under many different kinds of workloads.
This ends up making the CPU a compromise solution that balances competing demands.
3.1.2 Classifying parallel machines
It’s the way processors and memories are assembled ’together’ that determines the type of parallel ma-
chine. The most common ways to classify this machines are according to the Flynn’s taxonomy and to
the memory layout architecture [6].
Flynn’s taxonomy classifies a computer according to the number of concurrent instructions and the
data streams available in the architecture. There are four classifications possible:
• Single Instruction, Single Data (SISD) - The processor executes one instruction set over a
single data stream.
• Single Instruction, Multiple Data (SIMD) - Each processor executes the same instruction
but on different data streams. It’s the case of GPU’s and CPU’s of Supercomputers.
• Multiple Instruction, Single Data (MISD) - The same data stream is sent to different
processors, each one of these executing a different set of instructions over the data.
• Multiple Instruction, Multiple Data (MIMD) - Every processor may run a different set of
instructions over different sets of data. It’s a superset of SIMD. Most parallel computers fall into
this category: Supercomputers, Clusters, Grids.
According to the memory architecture, parallel machines can be classified as shared, distributed or
hybrid [6].
In the shared memory architecture, all processors have access to the same global memory resource.
If the distribution of the processors around the memory is done in a way that the memory access times
are equal, shared memory machines can be called UMA - Uniform Memory Access - (fig. 3.2). They are
called NUMA otherwise - Non Uniform Memory Access - (fig. 3.3). The former is the case of Symmetrical
Multiprocessor machines - SMP- found in desktop PC’s, laptops and Supercomputers boards. The latter
is made by linking two or more SMP’s by means of a bus interconnect.
Since processors act independently, sharing access to the same pool of memory holds the danger of
having one processor retrieve a certain value from memory unaware that another processor has just
changed that value. This leads to faulty results in the first processor. It also leads to apparently non-
deterministic behavior, as sometimes the first processor may retrieve the value before it was changed,
sometimes after. These situations, where results are affected by timing, are called race conditions [10].
They are undesired events. Normally, these machines have built-in cache coherence (cc),
which means that if one processor updates a certain value in memory, all other processors are notified.
Nonetheless, it is up to the programmer to be careful with memory accesses. Tools such as locks/unlocks
and semaphores are used to guarantee correct data accesses [6].
The main advantage of this type of architecture is the quick memory access, due to the memory's
proximity to the CPUs. The main disadvantage is the lack of scalability between CPUs and memories:
if we increase the number of processors we would expect an increase in the
parallel performance but, due to the increased memory traffic, that may not be the case. Race conditions
also become more probable.
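A minimal OpenMP sketch of the problem and of one of the tools just mentioned (the function name is ours; compiled without -fopenmp the pragmas are ignored and the loop simply runs serially):

```c
#include <assert.h>

/* A shared counter incremented by many threads. Without the atomic, two
   threads could read the same old value and one increment would be lost
   (a race condition); the atomic makes each update indivisible. */
long count_to(long n)
{
    long counter = 0;
    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        #pragma omp atomic
        counter++;
    }
    return counter;
}
```

With the atomic in place the result is always n, regardless of how the iterations are interleaved; removing it makes the result timing-dependent, which is exactly the non-deterministic behavior described above.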
Distributed memory systems consist in the connection, through a network, of processors, each with
a local pool of memory. Each processor has free access to its local memory (registers, caches and main)
but, to access the main memories of other processors, a protocol has to be established between the two.
Figure 3.2: Scheme of a UMA layout. Figure 3.3: Scheme of a NUMA layout.
Scalability of memory, which is the increase of main memory size with the increase of the number of
processors, proves to be one of the main advantages of this type of architecture. But, since communication
between processors has to go through the network's connection bandwidth, communication times are
high [6]. This poses the main disadvantage of this memory architecture, as it affects the overall time
of a simulation. To achieve good overall performance, strategies to hide the communication time have
to be applied.
Figure 3.4: Distributed memory system.
The hybrid memory architecture is the combination of the two previous systems: locally a shared
memory system, either UMA or NUMA, globally connected to other systems. It is the most commonly
used. Grids, clusters and supercomputers all follow this memory layout.
Figure 3.5: Hybrid memory layout.
3.2 Parallel Programming Models
In theory, any parallel programming model can be applied to any type of parallel machine or memory
architecture. Nonetheless, some models may prove to be more fitting to certain machines than others.
The major parallel programming models are the following: shared memory, threads, message passing,
data parallel and hybrid. Of these, only two were implemented on the CPU: threads and message
passing. The former using OpenMP; the latter with MPI. One commonly finds OpenMP and MPI used
together in a hybrid approach; clusters and supercomputers use it.
3.3 OpenMP
OpenMP is commonly used as a means to split tasks among different processors in a shared memory
architecture [5]. It acts upon threads, hence being a threads-type parallel programming model.
Threads are defined as “execution states that are able to process an instruction stream”[5]. This
means that they are a sub-task of a program, just like a sub-routine.
A sequential program running on only one processor is normally associated with only one
thread being executed at a time. Although not entirely false, CPUs can execute more than one thread
'simultaneously' using an interleaved method. It is an instruction-level parallelism used to minimize CPU
idle times [10] [5].
But to us, multiple threads make more sense with more than one processor, each processor running
a thread.
Threads are generated in forks of the execution, where different sets of instructions, with local data,
can be running concurrently. Besides their own private data, they all share the resources available to the
program. This is why one of the main problems with threads is coordinating their actions and avoid-
ing data corruption due to race conditions. For that purpose, methods like barriers and locks are used [5].
OpenMP consists of compiler directives, runtime library routines and environment variables, all meant
to take the burden of the low-level details and work assignment off the programmer. The programmer has
to identify where work can be divided and run concurrently, putting a special comment, a pragma, in the code.
Let us look at an example of a matrix, M, times a vector, v.
#pragma omp parallel for default(none) shared(m, n, a, b, c) private(i, j)
for (i = 0; i < m; i++) {
    a[i] = 0.0;
    for (j = 0; j < n; j++)
        a[i] += b[i*n + j] * c[j];
}
The two for loops are a sequential implementation of the multiplication. The only addition to a
normal code was the first line. #pragma omp is the sentinel that tells the compiler that an
OpenMP directive follows. It is followed by the construct parallel, meaning the command following this special
comment will be able to have more than one thread running. So, the thread that found the OpenMP
directive forks into a team of threads, giving each thread an ID number. But it does not tell the
threads what to do, hence the construct for. This is a loop construct responsible for assigning some of
the loop iterations to each thread. Next is an indication of which variables can be modified by all
threads and which ones are private to each thread.
Although not explicit, at the end of the parallel region there is a barrier responsible for synchronizing
the threads and, in this case, for joining them into a single thread that continues running the program.
It should be noted that we cannot make any assumptions about the order of execution of the threads.
It is not deterministic.
3.4 MPI
Since in parallel computation a processor is given only a small part of the whole problem, it is common
that a processor may need data calculated by another processor. In a shared memory architecture this
can be done through the use of main memory. In the case of a distributed memory architecture, a form
of communication must be established between the two: the processor in need of data has to receive
it, and the processor responsible for producing the data must send it to the right processor. This
communication problem can be solved with MPI.
MPI, which stands for Message Passing Interface, is the standard library specification for
message-passing between processors [10] [9]. It is a type of explicit parallelism - it is up to the programmer
to identify how the parallelism should be done and to use the appropriate MPI functions. Although not
obligatory, MPI is mostly used in distributed or hybrid memory architectures such as clusters and
supercomputers.
In order to keep track of the processors and control their activities, MPI establishes a hierarchy. The
main entities are communicators, groups and ranks [9]. A group is a set of processors. Each of the
processors inside the group is given an identification number starting at zero: its rank. Each group
has a communicator, which is a unique handle to the group.
Figure 3.6: Representation of MPI’s hierarchical chain.
In order for a message to be sent, the protocol establishes that each message has to define the
communicator - to reach the group -, the rank - to reach the specific processor inside the group -, the type of
data to be sent - integer, float, etc. -, the size of the data to be sent, a tag - an integer to define the order
of receiving, in case many messages reach the processor at the same time - and the source - the rank of
the processor that sent the message. Only if the receiving processor specifies matching parameters will it
be able to receive the message.
All send and receive actions can be performed in a blocking or non-blocking way [9]. A blocking
call means that only after the data has been safely sent or received will the processor continue to run the
program. It may bring performance down, as the processor becomes idle, waiting. Non-blocking calls
mean the opposite: the processor will continue to run the program even if the message has not yet been
sent or received. This allows the processor to continue working. Sooner or later, though, it will indeed
need to use the received data. This is handled through the use of a barrier, which will only let the program
advance after all pending sends and receives are complete.
A typical code using MPI takes the following format
Figure 3.7: Layout of a typical MPI program.
Every processor receives a copy of the code and, based upon its position inside the hierarchy, executes
a certain part of the problem, communicating with other nodes when necessary.
MPI can perform point-to-point communication - one processor sends data to another (specific)
processor - as well as collective communications - broadcast, scatter, gather, all to all, etc [9]. It also has
ways to synchronize the program between nodes.
Following is an example of a C MPI implementation where two processors send one character to each
other.
#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int numtasks, rank, dest, source, rc, tag = 1;
    char inmsg, outmsg = 'x';
    MPI_Status Stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        dest = 1;
        source = 1;
        rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
        rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
    } else if (rank == 1) {
        dest = 0;
        source = 0;
        rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
        rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
The program starts with the inclusion of the MPI library through the header #include "mpi.h".
Next, all the variables are declared and the number of available processes is assessed through the
use of MPI_Comm_size. Here, MPI_COMM_WORLD is the default communicator, and the number
of processes is an input argument given when executing the program.
Following this, each processor retrieves its rank inside the communicator. Then, based upon the rank,
each processor accesses its task. In this case they have similar tasks: each processor sends and receives
one character to and from the other processor. They use MPI_Send and MPI_Recv to do this.
These are blocking commands, which means the program will only continue after both functions return.
The arguments of both are the address of the variable to send or receive, its size, the type of variable,
the rank of the destination or source processor, the tag, and the communicator.
3.5 Massively Parallel Processors: GPU
3.5.1 Introduction
A Graphics Processing Unit is a specialized processor with the main purpose of aiding the CPU in
the arithmetic-intensive calculations of graphics rendering. With that purpose, the GPU's architecture
evolved to perform in the SIMD paradigm, which means that a GPU can perform many more floating
point operations per second than a CPU [10]. For example, the Pentium 4 has a peak computational
power, in single precision, of 125 GFLOP/s, while the NVIDIA GTX480 GPU has 1375 GFLOP/s, or
1.375 Teraflops. In order to match this peak power, eleven Pentium 4s would be needed, which would end up
being more expensive, as the ratio between price and computational power is one of the attractions of
GPU's [12].
It was only natural that non-graphics applications found a way to take advantage of this condensed
computational power. But, if one wanted to use a GPU for scientific computations, the code had
to be translated into elements known to the GPU, such as vertexes, textures, shaders, etc. Although
an enormous nuisance, numerous applications were developed, and this trend of using GPU's was coined
GPGPU, or General-Purpose computation on GPU's [10].
NVIDIA took the next step by 'opening' the architecture of the GPU: one no longer needed to convert
the algorithm to fit the device. This new architecture and its API's (Application Programming
Interfaces) were named Compute Unified Device Architecture, CUDA for short.
GPU’s are still riding on Moore’s law and there is transparent scalability between devices, meaning
that the same code will scale it’s performance with newer devices.
The increased interest in massively parallel processors led to the creation of the OpenCL standard.
With OpenCL one can have a program that can be ran across different plataforms, independent of
architecture or vendor.
At the moment, many algorithms are being ported to the GPU to test whether or not they can bring
an increased performance. An example of this is the GROMACS code used for molecular dynamics. In
some cases, the transition to the GPU can be done very smoothly, but others may require a fresh way
of viewing the problem. Years of optimizing codes to run extremely efficiently on the
CPU may be an obstacle, as a step back must be taken.
3.5.2 CUDA architecture
The GPU’s architecture is represented in fig. 3.8. Is is composed of various Streaming Multiprocessors,
SM’s for short. Each one of these SM’s contains numerous scalar processors, commercially named CUDA
cores, which are a type of ALU.
Each of this cores has it’s own private memories - registers and caches - a memory accessible to all
cores inside an SM - shared memory - and a memory accessible to all cores of all the SM’s - global
memory -. Like in the CPU, the nearer a memory bank is, the smaller and faster it is.
Figure 3.8: The GPU’s architecture. The small squares are CUDA cores. Two rows of eight Streaming
processors can be seen. The remaining blocks represent memories.
As can be seen in the figure, the high density of ALU's makes the device optimal for arithmetic-intensive
operations, this being one of the key differences between the architectures of a CPU and a GPU. In
single precision, the peak processing power of the Tesla is 933.12 GFLOPs (see table 4.1). For comparison,
let us use the AMD Opteron quad-core CPU. It is a state of the art CPU used in supercomputers such
as the Cray Jaguar. The Opteron has a peak processing power of 33.6 GFLOPs (8.4 per core), roughly 30
times less than that of the GPU [2]. So, if processing power were our only criterion, the Jaguar would need
30 times fewer processors, which would mean a decrease in the volume occupied by the infrastructure.
The Tesla supports double precision, but its peak computing power plummets to 77.76 GFLOPs, roughly
a one-twelfth drop. Nonetheless, this is still more than twice the Opteron's single precision figure.
Figure 3.9: The Tesla C1060 board.
Also, the relative power consumed by the GPU is lower than that of the CPU. The Tesla, at full load,
consumes roughly 190 Watts, while the Opteron, in the same situation, consumes about 230. So, in
general, the absolute energy consumption of the GPU is toe-to-toe with that of the CPU (although
there are GPU's whose power consumptions are well over 200 Watts), but it is in the ratio between the
rate of floating point operations performed and the energy consumption that the scale is tipped to the
GPU's side: 933/190 vs 34/230.
Returning to the Jaguar, if we used GPU's instead of CPU's, the peak power consumption would
drop from 10.5 MW to 0.29 MW, roughly 37 times less. Highly noticeable in the power bill. Note that
talk of replacing the CPU with GPU's may lead to some misconception. The GPU is not meant to
replace the CPU, but only to aid it; not all the nodes of the cluster could be GPU's. Also, using a
cluster of GPU's still suffers from drawbacks. At the moment, two GPU's cannot communicate directly
between themselves: they have to use the CPU as a middleman for message-passing. Due to the relatively
small bandwidth of the physical connections between GPU's and CPU's, communication will be
the performance bottleneck. It is expected that in future devices direct communication will be possible.
As data has to be loaded from memory before any operation takes place, the throughput of the GPU
is limited by the memory bandwidth [7]. This is an important bottleneck. For example, let us look at a
loop iteration that takes two different numbers and performs a sum and a multiplication between the two.
If we only take into account the two operations, the peak power of the iteration is 933.12 GFLOPs. But,
before any operation can be performed, the numbers have to be retrieved from memory. Two memory
fetches for two operations gives a compute to global memory access (CGMA) ratio of 1.0. Since the global
memory access bandwidth is 102 GB/sec, and each float retrieved takes 4 bytes, one cannot
perform more than 102/4 = 25.5 giga operations per second, which limits the throughput.
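The estimate in this paragraph is simple enough to write down (numbers from the text; the function name is ours):

```c
#include <assert.h>
#include <math.h>

/* Bandwidth-limited throughput for a CGMA ratio of 1.0: every 4-byte
   float fetched from global memory feeds exactly one operation. */
double bandwidth_limited_flops(double bytes_per_sec, double bytes_per_op)
{
    return bytes_per_sec / bytes_per_op;
}
```

With the Tesla's 102 GB/s this gives 25.5 GFLOP/s, a small fraction of the 933.12 GFLOP/s peak, which is why raising the CGMA ratio (for example by staging data in shared memory) is the central optimization on this hardware.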
3.5.3 CUDA programming model
The key working element of the GPU is the thread. A thread is an entity that takes the instructions to
be performed, along with the information of which data the instructions should use. They are different
from the previously mentioned CPU threads, as they are more lightweight [12].
In the GPU, threads are executed by CUDA cores. Each core can run multiple threads at the same
time, as it takes advantage of unproductive times of a thread. For example, if a thread has to retrieve
data from global memory, which is an operation that takes some time ( long latency operation), instead
of the core being idle, waiting for data to work on, it will start running another thread. This way a
GPU can hide the long-latency operations and achieve higher performances. The higher the number of
threads, the easier it is to hide these operations. That is why GPU’s have thousands of threads running
at the same time, contrasting with the CPU.
Threads are organized into Blocks, which, in turn, are organized into a Grid. This way, we have an abstract programming model that fits the hardware model of the GPU: the thread corresponds to the CUDA core, the Block to the SM and the Grid to the GPU (fig. 3.12).
The procedure is as follows. The CPU, called the host, runs the program, typically as a single CPU thread. When it encounters a part of the code that is supposed to run on the GPU, a kernel, it hands it off to the GPU. According to runtime parameters, the GPU will generate the Grid. The Grid will be divided, sending a certain number of Blocks to each SM. In the SM, groups of 32 threads are taken from the Blocks. Each 32-thread group is called a warp. The warp divides itself among the CUDA cores, filling long-latency operations with work from other threads [7].
It is up to the programmer to assign the size of the Grid and of the Blocks. A Grid can be laid out in a one- or two-dimensional fashion, and each Block in a one-, two- or three-dimensional manner (fig. 3.12). This multidimensional disposition may prove useful to better fit the natural indices of certain problems, as the thread's ID may be used as an index.
But the values chosen for the sizes of Blocks and Grids should not be taken lightly, as different values give different performances. A good block size has to take into account the resource usage and the occupancy of the SM [7].
For example, the maximum number of blocks that can be assigned to any SM is 8 (a fixed value, independent of the device generation). But, if each block has a high demand of the SM's resources, then fewer blocks can be assigned to the SM, as there are not enough resources to share among blocks. The extreme case is a single block that consumes almost all of the SM's resources, leaving less than another block needs. That way, the SM can only process a single block, bringing performance down.
Another important issue is assigning work to all the threads inside the SM. An SM has a limit on the number of threads it can run; let's say 1024. If we create blocks with 8 × 8 = 64 threads, 16 blocks would be needed to occupy all 1024 threads. But only 8 blocks are allowed in an SM, leaving the occupancy at half: 512 threads. Having fewer threads running means less efficiency in hiding long-latency operations and worse performance.

Figure 3.10: Code running in the host with a single thread. Kernels are offloaded to the GPU and run with thousands of threads.

Figure 3.11: Availability of the different types of memory to the CUDA working structures.
Using the occupancy calculator1 will help to visualize which combination of parameters provides the best performance.
There are other tricks that can be used to improve performance. Using, whenever possible, registers and shared memory instead of global memory also brings performance up, as less time is spent accessing memory. Accessing contiguous memory banks, in the row-major order of C, lets the hardware coalesce the accesses [7] [12].
One limitation of this programming model has to do with the communication between threads.
Threads can only communicate among themselves through the memories: threads inside a block can
communicate with each other through the shared memory, and threads within different blocks can only
communicate through the global memory. But, for optimized use, it is always preferable to minimize accesses to global memory, which implies a quest for a more fitting algorithm, one that minimizes communication between threads of different blocks. The scope of accessible memory is expressed in fig. 3.13.
1http://developer.download.nvidia.com/compute/cuda/CUDA_OCCUPANCY_CALCULATOR.xls

Figure 3.12: Illustration of the organization of one bi-dimensional grid into blocks and one block into threads.

Figure 3.13: Memory layout in the device and its accessibility to the processors.
Another limitation refers to synchronism. Threads may be synchronized, but only within the same block. At this moment, there is no barrier that coordinates threads across blocks and, due to hardware constraints, there may never be one. This is a huge nuisance, which, once again, implies a new algorithm and extra care.
3.5.4 An example
To better understand how CUDA works, an example of a sum of two matrices will be used.
First, the host code. We declare the two matrices A and B and initialize them. Then we give the sizes of the Grid and of the Blocks by storing them in a CUDA built-in type, dim3.
Then we allocate the needed space in the device's memory. This is done with the cudaMalloc() function, whose arguments are the address of the host variable that will point to the allocated memory in the GPU, and the size to be allocated. We call them A_d and B_d to differentiate the variables living in the device from the ones living in the host.
We then have to copy both matrices from the CPU to the GPU. Given the destination, the source, the size and the direction of the data flow, the function cudaMemcpy() does the job.
We then call the kernel, which is the function that runs on the GPU. It is like calling a normal function, but with the dimensions of the Grid and Blocks inside triple angle brackets.
The GPU then does its work and, after the calculations finish, we copy the results from the GPU back to the CPU.
dim3 dimGrid( dim_grid_x, dim_grid_y, 1 );
dim3 dimBlock( dim_block_x, dim_block_y, 1 );
cudaMalloc( (void **) &A_d, n_cols * n_rows * sizeof(float) );
cudaMalloc( (void **) &B_d, n_cols * n_rows * sizeof(float) );
cudaMalloc( (void **) &C_d, n_cols * n_rows * sizeof(float) );
cudaMemcpy( A_d, A, n_cols * n_rows * sizeof(float), cudaMemcpyHostToDevice );
cudaMemcpy( B_d, B, n_cols * n_rows * sizeof(float), cudaMemcpyHostToDevice );
mat_sum_kernel<<< dimGrid, dimBlock >>>( A_d, B_d, C_d, n_cols );
cudaMemcpy( C, C_d, n_cols * n_rows * sizeof(float), cudaMemcpyDeviceToHost );
The GPU kernel consists of a function or sub-routine. It has the special keyword __global__ to tell the compiler that this function is called from the host but runs on the device.
Since the GPU has thousands of threads running simultaneously, each entry of the output matrix, C_d, can be computed at the same time, so there is no need for 'for' loops. Also, the thread's number is used as the index to access the entries of matrices A and B.
__global__ void mat_sum_kernel( float *A_d, float *B_d, float *C_d, int n_cols )
{
    int trd_id_x = blockIdx.x * blockDim.x + threadIdx.x;
    int trd_id_y = blockIdx.y * blockDim.y + threadIdx.y;
    int idx = trd_id_y * n_cols + trd_id_x;
    C_d[ idx ] = A_d[ idx ] + B_d[ idx ];
}
This is a first implementation. Optimizations could be performed to increase performance, such as coalescing the accesses to global memory. The example of a matrix multiplication found in [7] is recommended.
Chapter 4
Implementations of the LB methods
All the implementations were made using the C language, its extensions and libraries. OpenMP and OpenMPI were used, respectively, for the thread and message-passing parallel programming models. All codes were compiled with gcc 4.3.2, except for the CUDA application, compiled with nvcc 2.2.
In terms of the machines, the OpenMP and CUDA implementations used a computer with four Intel Xeon processors and four GPUs (three Tesla C1060 and one Quadro FX1800). The properties of the GPUs can be found in table 4.1. For the OpenMPI application, a cluster of four AMD Opteron 252 processors was used.
To measure and compare performances we will mainly use the speedup and the MLUPs. The speedup (eq. 3.1) should only be used to compare implementations on the same computer. To compare implementations across different architectures we will use the MLUPs 'units'. MLUPs stands for Mega Lattice Updates per Second: the number of lattice nodes times the number of iterations of the temporal loop, divided by its elapsed time.
First, a few remarks about the code.
There are many arrangements for storing the PDFs of all the particles of all the nodes. A very organized way would be to create a structure where each node holds its own particles' properties. We did not implement it this way. Instead, and independently of the number of spatial dimensions, we used a one-dimensional array where, for each node, we stored the PDFs of the particles in the order of the numeration given (fig. 4.1). This minimizes memory accesses.
To access one specific particle of a certain node we calculate its location in the array. For example, in the bi-dimensional case, with L nodes in the horizontal direction and H in the vertical, and nodes numbered from left to right, bottom up, the location of particle i of the node with coordinates x and y is:

idx = (y × L + x) × m + i (4.1)
Table 4.1: General properties of the GPU

Tesla C1060
Global Memory [GB] 4.29
Number of CUDA Cores 240
Number of Multiprocessors 30
Shared memory per block [KB] 16
Number of 32-bit Registers per block 16384
Clock rate [GHz] 1.3
Memory Bandwidth [GB/sec] 102
Maximum Power Consumption [Watt] 187.8
Warp size 32
Maximum number of threads per block 512
Maximum size of each dimension of a block 512 × 512 × 64
Maximum size of each dimension of a grid 65535 × 65535 × 1
Figure 4.1: For a given node, each particle's PDF is stored one after the other, according to the numeration used for each direction. Nodes also follow a lattice numeration.
In eq. 4.1, idx is the location in the memory array and m the number of links per node (DnQm).
The propagation step and the boundary conditions can be applied in different ways. One could fixate a node and propagate out all the particles living in that node (as if the node exploded), or fixate the node and propagate into it the particles from the surrounding nodes. We chose the latter. Also, to determine which particle comes from/goes to which node, we could take the node position and, with the aid of the translation versors ei, calculate the new node of each particle. This brings numerous calculations to perform per iteration. A better way is to use a fixed vector of offsets to obtain the index, in memory, of the new location. For example, if a particle moves to the node at its right, all we need to do is add to the current node's memory address the offset needed to reach the memory entry of the node to the right, in this case m (DnQm). Since each direction has a unique constant offset, we avoid performing the translation calculation. We also take advantage of this method to perform the bounce-back and periodic boundary conditions. For a bi-dimensional situation this posed no problem, but in three dimensions the number of vectors needed made it a somewhat clumsy approach.
Typically, to deal with generic fluid scenarios, each node is assigned a boolean value telling whether it is a fluid node or an obstacle. So, per node, one has to evaluate the node's nature and act accordingly. Since our test cases had the bulk free of obstacles, we avoided these conditionals.
Independently of the approach chosen, the use of a secondary array to store the propagated PDFs is unavoidable.
4.1 Serial implementation
The serial implementation is a straightforward transcription of the flowchart. In pseudocode, one can describe the process as
initialize_pdfs(initial properties)
for t < tmax do
collision()
propagation()
boundary_conditions()
end for
As will be seen later, the collision step consumes about three quarters of the total time of the temporal
loop.
4.2 Shared memory implementation: OpenMP
The best approach to parallelizing LB with OMP is to create the threads at the beginning of the temporal loop (fig. 2.2). Each thread executes the collision and propagation of a certain number of nodes and, at the end of the iteration, the threads synchronize through a barrier. After the temporal loop is over, all threads join into a single one, responsible for finishing the program. This way we reduce the overhead of creating and destroying threads, as there is only one creation and one destruction. Unfortunately, this approach could not be implemented due to restrictions of OpenMP's parallel construct: only one if clause is supported, and unpredictable behavior occurs when branching in or out (as when calling a function), to name a few.
Of course, one could 'streamline' the code to try to fit the scope of the construct rules. But our second approach, although with a higher overhead, achieved the theoretical speedups, making a total make-over of the code unnecessary.
The second approach consists in creating the team of threads at the beginning of both spatial loops (fig. 2.2). Threads are joined at the end of each loop, meaning that, per iteration, there is the overhead of two creations and destructions of threads. The total cost of these overheads becomes negligible if the number of nodes in the mesh is high and the number of threads is not too high (they are normally in the order of tens). This holds true for most cases.
4.3 Distributed memory implementation: MPI
In LB, the collision step is totally local, independent of the neighboring nodes. It is in the streaming step that interaction between nodes occurs, and it is during this step that message-passing is needed on a distributed memory system.
The assignment of work to the nodes has to be done in a way that minimizes the amount of communication needed and, at the same time, maintains similar amounts of work on every processor (load-balancing). Minimizing communication is the same as minimizing the number of nodes that have to share data across processors. For example, in a rectangular domain with height H and length L = 4H, a longitudinal division would imply that L nodes need to exchange data, while a transverse division implies only H. Since both ways hold the same number of nodes per sub-domain, the latter provides the better solution, as fewer nodes exchange data.
Load-balancing is not as easy as assigning the same number of nodes to each processor, as some nodes need different care (boundary or obstacle nodes, for example). This leads to an unbalanced work load, a problem that is even worse in irregular meshes. The typical solution for these cases is to assign a weight to each node, according to the work it demands of the processor, and then divide the domain in such a way that the sums of the node weights of the sub-domains hold identical values. For more on that, see [3].
Independently of how the division is done, each sub-domain will have three sections: the bulk, the frame and the frame-buffer.
Figure 4.2: Domain divided in two parts, one white, one gray. Each part is assigned to a node of the cluster. In the case of the white sub-domain, there is an extra frame, where data from the frontier cells of the neighbor sub-domain are made available: ghost cells.
In the bulk everything proceeds normally, as all the data needed for propagation lives in the same sub-domain as these nodes.
The frame consists of the lattice nodes in contact with the border of the domain division. These are the last nodes of the sub-domain to be updated but, in order to do so, they have to use data from nodes living in another sub-domain. This is why there is the frame-buffer. Sometimes called ghost cells, these serve to hold data from the other domain and make it available to the frame nodes. At the end of every time iteration the ghost cells must be updated and the loops synchronized. This is where message-passing is used.
The MPI implementation had only two nodes of the cluster available. But since each node had 2 cores, we used all 4 processors as 4 separate nodes, assuring that each core had its own memory bank and did not use shared memory to communicate. That way we could use up to four processors.
The code was made so that the domain is divided into as many sub-domains as processors available. Since the lattice is regular and there are no obstacles in the midst of the fluid, a regular division suffices. One of the processors is the master, responsible for assigning the different sub-domains to the processors, beginning the temporal loop, guaranteeing that all messages have been sent and received at the end of each iteration, and finalizing the program.
According to the sub-domain a processor holds, proper work is assigned. For example, the processor responsible for the top domain in the lid-cavity must apply the top boundary condition, as well as the two side bounce-backs. The bottom domain, on the other hand, will have three bounce-backs to apply.
MPI_init()
initialize_flow_properties()
allocate_vars()
if rank == 0 then
    initialize_bulk()
    initialize_BC()
    initialize_lid()
else
    initialize_bulk()
    initialize_BC()
end if
for t < tmax do
    if rank == 0 then
        collision()                  (loops over all nodes of the sub-domain)
        MPI_non-blocking-send()
        MPI_non-blocking-receive()
        propagation()
        boundary_conditions()
        MPI_barrier()                (guarantee all messages sent and received)
        propagation_frame()
        CF_frame()
    else if rank == num_processors - 1 then
        similar to rank 0
    else
        similar to rank 0
    end if
end for
4.4 Massively Parallel Processors: CUDA
Since a GPU can have hundreds of thousands of threads running 'simultaneously', we make a unique association between each thread and a node, using the thread's ID. Each thread then performs the collision and streaming phases of that node. The spatial loops to sweep across all the nodes are no longer needed, although one might still loop over all the directions of a node.
An important fact, valid for all approaches, is that the temporal loop must be held by the CPU, as the GPU cannot synchronize all the threads at the end of each iteration.
Our first approach consisted of a direct conversion of the serial version: we used the same data structures as in the CPU, replacing the indices used to access the arrays with the threads' IDs.
First, before entering the temporal loop, we copy all the needed data to the GPU's global memory. Then we perform the temporal loop, calling the kernel in every iteration and using a barrier to keep the CPU from proceeding to the following time step. This is needed because the GPU returns control to the CPU as soon as the grid is generated, meaning the CPU would otherwise proceed with the program; we have to guarantee that the GPU finishes its task within the time step in which it was called. After the temporal loop is over, we copy the data back to the host.
initialize_flow_properties()
cuda_alloc_memory()
cuda_copy_host_to_device()
for t < tmax do
    kernel()
    cuda_barrier()
end for
cuda_copy_device_to_host()
store_macroscopic_properties()

Inside the kernel, each thread does:

node_index = threadID
collision( node_index )
propagate_outwards( node_index )
Subsequently we delved into more careful approaches, taking into account the "rules of thumb for higher speedups": coalesced memory accesses and low global memory usage.
First we reshaped the way the PDFs were stored in the array. To obtain coalescence we stored the values of a specific direction for all the domain nodes, then swept the domain again for the following direction, and so on. That way, when threads access global memory, they access consecutive data.
We copy the PDFs from global memory to shared memory, then perform the collision step. After that, we have a situation similar to an MPI sub-domain: as threads of different blocks cannot communicate through shared memory, the propagation step needed ghost cells in shared memory to store the values of neighboring PDFs. And, since there is no global synchronism between threads, we needed to break the GPU code into two kernels, split just after the collision.
Although this was a good improvement over the first implementation, it hardly reached the solution proposed by [16], whose approach is highly efficient.
Following in his footsteps, we separate each direction of the lattice into one-dimensional arrays. Each of these arrays holds the PDF values of the particles that move in a certain direction, for all the nodes.
Figure 4.3: Each direction of propagation of a particle has an array. For each array, nodes are assigned
consecutively.
We make use of three kernels: one performs the collision step and part of the propagation, another completes the propagation, and a third is responsible for the boundary conditions.
The first kernel uses one-dimensional blocks. We start by copying all the PDFs of each node to the registers of the thread associated with the node. This copy takes advantage of coalesced accesses: two threads of the same block, consecutive in their ID number, will access, at the same time, two PDFs of a certain direction that are consecutive in memory, and the hardware will notice this and coalesce the access. Then each thread performs the collision step, followed by part of the streaming step. This part is done with the aid of shared memory. We allocate arrays of shared memory, one per direction and with the size of the block, and pass all the values from the registers to this shared memory, taking the opportunity to write them into the place the PDF would propagate to.
For example, the PDF of a particle whose direction of travel is east will move to the following node, so, when writing to shared memory, we write to the following node. But the last node of the block cannot write to the first node of the following block, as threads cannot access the shared memory of different blocks. So we store the particles leaving the block through its right frontier (the last node of the block) in the first node of the block, as this node could not receive new PDFs and had its directions empty. This is done for all particles with horizontal components of propagation, so a particle that moves in the north-east direction will only move to the right node, remaining at the same vertical position. The vertical movement is done in the following step: when copying the values from shared memory back to global memory, we take the opportunity to make the vertical propagation. The particles that move in the north-east direction, and that have previously moved to the right node, will now move to the node above, completing the propagation of vertical and diagonal PDFs.
This ends the first kernel. We still have to exchange the PDFs that could not horizontally leave the block and were 'imprisoned' in it. The second kernel does this: it moves the PDFs stored at the beginning and at the end of each of the previously used blocks into their correct positions. It only uses global memory, as it is an unavoidable one-read-one-write.
The third kernel is a direct conversion of the serial boundary conditions.
Chapter 5
Results
5.1 Test cases
In order to benchmark both the LGBK and the MRT codes, two cases were used: the Poiseuille flow and
the lid-driven cavity.
5.1.1 Poiseuille Flow
The Poiseuille flow consists of a fluid between two infinite plates, initially at rest, subjected to a pressure gradient between the inlet and the outlet. This makes the fluid move, reaching a steady state with a parabolic velocity profile. This flow has an exact solution of the Navier-Stokes equations (eq. 5.1).
u(y) = \frac{G H^2}{8 \nu} \left( 1 - \left( \frac{2y}{H} \right)^2 \right), \qquad -\frac{H}{2} < y < \frac{H}{2} \quad (5.1)
Here G = P/(ρL); H, L and P are, respectively, the height of the channel, its length, and the pressure drop between the inlet and the outlet.
This flow is mainly used to 'calibrate the flow viscosity' [14]: we select a viscosity and compare the attained profile with that given by eq. 5.1.
The two plates are simulated with the first order bounce-back boundary condition, while the other two frontiers are considered periodic (infinite domain). The effect of the pressure is translated into a bias, g_i, applied to the PDFs during the collision step [14].
f_i' = f_i - \omega (f_i - f_i^{eq}) + g_i \quad (5.2)
According to [14], this bias, for a bi-dimensional case, is as follows
g_1 = 4 g_2 = 4 g_8 = \frac{1}{12} g \rho
g_5 = 4 g_4 = 4 g_6 = -\frac{1}{12} g \rho
g_0 = 4 g_3 = 4 g_7 = 0
The bias is calculated by considering the momentum needed to be equivalent to a given pressure drop. We are replacing the pressure at the inlet and outlet by a forcing term in every node of the domain (bulk and frontier nodes). Calculating the momentum of each node according to the expressions given above, we can see that there is only one horizontal component, pulling in the direction of the smaller pressure.
g = \frac{8 U_0 \nu}{H^2} \quad (5.3)
We used Succi's suggestion [14] of a 128 × 32 mesh with ν = 0.1 and expected to obtain the parabolic profile with a maximum equal to 0.1.
Figure 5.1: Longitudinal velocity profile obtained for the Poiseuille flow test case.
It can be seen that, near the walls, the profile drops abruptly instead of smoothly. This is the result of the first order accurate boundary condition, as its effect is independent of the mesh size.
To know when we have reached the steady state solution, we have used the difference between the
analytical solution and the numerical one. Representing the velocity in the horizontal direction as u and
Nt as the total number of points, the error (or residue) used was the following
e =
∑ |uexact − unum|Nt
(5.4)
Depending on the mesh used, the criterion for convergence may lead to smoother or oscillatory residue curves. For example, if we set 10−6 as the criterion for a 128 × 32 mesh, the error decreases smoothly down to 2 × 10−4; from that value onwards the residue oscillates, and the number of iterations needed to reach the required residue increases exponentially. The same criterion for a 256 × 64 mesh leads to the smooth residue-versus-iteration curve of fig. 5.2.

Figure 5.2: The error decreasing with the number of iterations. The criterion for convergence was set to 10−6. It took 9556 iterations in a 256 × 64 mesh.

Figure 5.3: Bidimensional plot of the horizontal velocity of a Poiseuille flow for a converged case. The maximum velocity is in the middle of the channel section, and the velocity profile is constant throughout the channel.
5.1.2 Lid-driven cavity flow
The lid-driven cavity is a rich testing environment with simple boundary conditions: three no-slip walls and a sliding wall at the top.
We used the cavity to benchmark the solution by comparing the velocity profiles to the ones obtained by Ghia1. We compare for Reynolds numbers of 100 and 1000, using Re = 100 for the LGBK method and Re = 1000 for the MRT. This is done only to avoid repeating the same graphics twice. Both boundary conditions were simulated with first order approximations. The lid velocity, in lattice units, is set to 0.1, the lattice is 128 × 128 and τ is set to 0.5384. For the MRT model we used the following relaxations
s_1 = s_4 = s_6 = 0, \quad s_2 = 1.1, \quad s_3 = 1.0, \quad s_5 = s_7 = 1.2, \quad s_8 = s_9 = \frac{1}{\tau} \quad (5.5)
For convergence, we have used the difference between the values of two consecutive iterations. This holds as the flow is expected to reach a steady state. If n + 1 represents the iteration at t + ∆t and n the iteration at t, then

e = \frac{\sum_i \left| u_i^{n+1} - u_i^n \right|}{N_t} \quad (5.6)

1More accurate solutions for the cavity have been obtained since Ghia, such as the ones from Peyret and Botella. Nonetheless, since the objective is just to verify the solution and not its accuracy, Ghia's solution suffices.

Figure 5.4: Bi-dimensional plot of the moduli of the velocity for a Reynolds number of 1000.

Figure 5.5: Horizontal and vertical velocity profiles for, respectively, a longitudinal and a transverse section. Obtained with SRT for Re = 100.
Figure 5.6: Horizontal and vertical velocity profiles for, respectively, a longitudinal and a transverse
section. Obtained with MRT for a Re=1000.
Figure 5.7: A solution is considered converged, for the cavity, when the residue falls below 10−7. We have used a logarithmic scale for both the residue and the number of iterations, for visualization purposes: the oscillations in the residue were small enough to look like a straight line if the plot were residue vs iteration number.
In terms of stability, it is easy to see how much improvement the MRT method brings. For the same exact set of properties and iterations, we compare the velocity values in the vicinity of the left wall, where the gradients are higher, at a relatively high Reynolds number. Figure 5.8 demonstrates the numerical instabilities that appear in the velocity profile.
The numerical instabilities in the LGBK scheme propagate, causing errors that collapse the numerical stability. In less than 1000 iterations all PDFs are NaNs (Not a Number). These NaNs are caused by PDF values decreasing below zero, which has no physical meaning and trashes the results. The MRT scheme manages to control these instabilities and exhibits a more stable behavior.
A few remarks are in order. First, in the bi-dimensional serial code, the collision step takes about 77% of the time, while the streaming step and boundary conditions take about 21%. Since both steps can be parallelized, Lattice Boltzmann methods offer a parallelizable fraction of 98%. What this means is that, according to Amdahl's law (see appendices), if we keep the problem size constant, the maximum theoretical speedup that can be achieved is 50. But, to obtain it, more than 10000 processors would be needed; for a speedup of 47, 'only' 767 (a saturation curve).
Our MRT implementation was about 30% slower than the SRT. This means that some optimization can still be performed in order to reach the 15% overhead referred to by [8].
For the three-dimensional case (3D lid-cavity), the collision and propagation took, respectively, 84% and 16% of the time. This means not only that the overhead of initializing and finalizing the program's variables is negligible, but also that 3D cases have a slight 2% edge in parallelization over 2D. This is in accordance with [14]. This little improvement translates, for the same problem size and according to Amdahl, into a linear speedup with the number of processors.
Depending on the processor used, a serial implementation typically achieves values ranging from around 0.5 to 2 MLUPs. A 3D case offers lower MLUPs, as each node has more particles on which to act.
5.2 Shared Memory parallelization: OpenMP
To run the OpenMP tests we used the four CPUs of the Tesla machine, with its shared memory architecture.
We compare the elapsed time of the program for different mesh sizes and different numbers of processors. Each processor is given, at most, one thread, so we generate up to four threads. All values were obtained for the same parameters and the same number of iterations (table 5.1).
Fixing a mesh size, we can see a linear evolution with the number of processors (fig. 5.9).
Table 5.1: Speedups obtained for the OpenMP implementation for up to four processors.

Speedups
Lattice   Serial (sec.)   2 procs   3 procs   4 procs
32²       3.13            1.84      2.85      3.48
64²       12.47           1.95      2.97      4.0
128²      50.0            2.02      2.82      3.89
256²      200.04          2.02      2.87      3.77
512²      798.99          2.02      3.02      3.98
1024²     3188.77         2.08      3.03      3.91
Figure 5.9: Code scaling with the number of cores.
As expected, increasing the number of cores reduces the time needed to complete the program. The
linear behavior, predicted by both Amdahl's and Gustafson's laws, is verified. Bigger meshes lie
closer to the speedup trend line, because the overhead of generating/destroying threads is hidden by
the amount of work each thread has to perform.
Note that, for a fixed mesh size, linear behavior with the introduction of more processors is not
expected ad infinitum. In the extreme situation where we had as many processors as lattice nodes, each
processor would be in charge of only one node. If the lattice stayed the same, there would be no way to
increase the speedup beyond this point (Amdahl's law); in practice, the overhead of generating so many
threads would probably kill all speedup. But if we increase both the number of processors and the problem
size, the linear behavior is expected to last as long as overheads and memory accesses allow it (Gustafson's law).
If one tries to use more threads per processor, in the hope that instruction-level parallelism can make
Table 5.2: Speedups for different mesh sizes and cluster nodes.
Speedups
Lattice serial [s] 2 nodes 3 nodes 4 nodes
32²      11.7      1.80    1.83    2.13
64²      42.7      1.92    2.79    3.14
128²     187.2     2.08    3.28    4.28
512²     2995.2    2.074   3.28    4.14
1024²    11980.8   2.02    3.1     4.3
use of them and reduce CPU idle times, we are confronted with, at best, slightly worse results. For
example, when using one processor, assigning it two threads slightly increases the elapsed time,
due to the extra overhead and to the swapping of the threads running on the processor. Assigning
three threads gives a better elapsed time than assigning two, but it is still worse than assigning
only one thread.
In terms of MLUPs, the total achieved by the system is roughly the MLUPs of a single processor
scaled by the total number of processors used. The top performance, achieved when using the four
CPUs, is 6.56 MLUPs.
5.3 Distributed memory: MPI
The times to run a simulation for different meshes are found in table 5.2.
With small meshes we verify that, even with 4 nodes, the improvement is only a factor of two. That
is due to the fine granularity of small meshes; coarse granularity requires bigger meshes, and only
then is the speedup close to the number of processors used. Note that speedups slightly higher than
the number of processors can occur due to caching effects in the processors.
We will now test the effect of communication with rectangular domains. For that purpose, with the
same mesh, we will split the domain longitudinally and transversely and compare the elapsed
times. We use a fixed lattice size, so that the number of nodes in each sub-domain remains constant.
That way, differences in wall-clock time are the result of communication time (as long as the difference
is bigger than the small fluctuation between runs of the same program in the same conditions).
We started with a 128 × 1024 lattice; the longer direction has 8 times more nodes than the shorter
one. When cutting in the longer direction, the runtime increases only 0.6 seconds. We then used a
128 × 10240 mesh, where the ratio between frontier nodes is 80; the difference rises to 5.3 seconds.
We continued this type of procedure and concluded that the way the domain is divided does
influence the communication time but, in absolute terms, the amount of time required is negligible due
Table 5.3: Speedups obtained with the GPU for the SRT and MRT schemes, for different mesh sizes.
Speedups
Lattice SRT MRT SRT
32²      0.99     0.82    2.05
64²      3.78     2.97    8.2
128²     12.3     6.6     27
256²     28.59    9.25    65.5
512²     42.38    11.1    83.2
1024²    47.45    11.5    84.56
to the high amounts of computational work per communication event.
With the four-node cluster we achieved a maximum of 3.88 MLUPs. It is lower than the value
obtained with OpenMP, but only because each individual processor of the cluster is slower than those
used for the OpenMP implementation: the same serial code achieves 1.64 MLUPs on a single Tesla
processor and 0.88 MLUPs on a single processor of the cluster, roughly half. Once again, the MLUPs
achieved by the cluster can be approximated by multiplying the number of processors in the cluster by
the MLUPs of a single processor. The velocity fields predicted with the serial, OpenMP and MPI
versions are virtually the same.
5.3.1 Massively parallel processors: CUDA
With the first approach, i.e., the literal porting of the serial code to the GPU, the highest speedup
achieved for the SRT method was 27. This is almost three times the speedup obtained with a similarly
crude implementation by [11], which reached only 9. The two values differ because of the hardware
used: we used a newer device with higher computational power, which is representative of the scaling
that comes with newer generations of hardware.
In terms of MLUPs, the speedup of 27 translates into 45 MLUPs. Without much effort, a single GPU
achieved the same performance that would be obtained with 51 cluster nodes or 27 shared-memory processors.
Comparing the SRT and MRT implementations, the latter falls a little behind: its peak performance
is 12 MLUPs, dropping to nearly half of the SRT.
The second implementation, using shared memory and coalesced accesses, achieved for the SRT a
maximum speedup of 44, peaking at 86 MLUPs, about double the previous implementation.
But it was with the approach introduced by [16] that the maximum speedup was obtained: his
highly optimized method allowed a maximum speedup of 110, or 180 MLUPs.
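To illustrate what coalescing means in practice, the sketch below (an illustration with assumed names, not the kernel of [16] nor the one used in this work) propagates one distribution in the +x direction with one thread per node. Because consecutive threads of a warp touch consecutive addresses, both the load and the store are coalesced.

```cuda
__global__ void stream_x(const float *f_in, float *f_out, int nx, int ny)
{
    /* One thread per node; threads of a warp get consecutive x indices,
       so the accesses to f_in and f_out below are coalesced. */
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y;
    if (x >= nx) return;

    int src = y * nx + x;
    int dst = y * nx + (x + 1) % nx;   /* periodic propagation in +x */
    f_out[dst] = f_in[src];
}

/* launch sketch:
   stream_x<<<dim3((nx + 255) / 256, ny), 256>>>(f_in, f_out, nx, ny); */
```

A naive port that keeps the serial array-of-structures layout forces strided accesses instead, which is one reason the literal porting above stays far below the optimized kernels.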
Table 5.4: Comparison of the peak values in the three implementations.
           OpenMP    MPI     CUDA
Speedup    4.0       4.28    110
MLUPs      6.56      3.88    180
Figure 5.10: The increase in performance, up to a certain point, with the increase of the lattice dimensions.
Independently of the implementation, the GPU achieves its peak performance for bigger lattices. We
also see that, beyond a certain point, it saturates, as expected.
Overall, the maximum MLUPs achieved with the three parallel methods are summarized in table 5.4.
Figure 5.11 illustrates the discrepancy between the values.
As can be seen, the GPU offers a great edge when it comes to Lattice Boltzmann and similar methods.
A single GPU outperformed all the other approaches: matching its performance would require 110 CPUs
(approximately 28 machines, each with 4 processors) or 205 cluster processors.
Figure 5.11: Comparison between the three implementations (MLUPs vs. lattice size). The filled
markers use the scale on the left, while the others use the scale on the right.
Chapter 6
Conclusion
In this work we implemented two Lattice Boltzmann numerical schemes: the LBGK and the MRT.
These implementations were initially made in serial and then ported to three different parallel
platforms: a shared memory architecture, a distributed memory architecture and a GPU. To do so,
we used, respectively, OpenMP, MPI and CUDA. All the codes were verified with two classical
benchmarks: the Poiseuille flow and the lid-driven cavity.
The LBGK method was found to be the fastest but, for limited mesh sizes, should only be used in
flows with low Reynolds numbers, typically laminar cases. The MRT was slower than the LBGK but
compensated with a much more stable nature, allowing computations at high Reynolds numbers with
relatively small meshes; it is the better choice for turbulent simulations.
The parallel nature of the Lattice Boltzmann methods revealed itself in every parallel implementation,
as high, scalable speedups were achieved (limited, of course, by the hardware available). It was found
that, independently of the memory architecture used, the code scaled linearly with the number of
processors added. It was also concluded that the GPU achieved the best performance of the three
implementations: a single GPU card reached a performance that could only be equaled by twenty-eight
machines, each with four processors. Moreover, in terms of efficiency, whether energetic, economic or
in room space, the GPU was by far the best solution.
Future work should extend the implementations to three dimensions, port them to the three parallel
approaches (especially the GPU), and test how a cluster of GPUs could perform. It is also necessary
to compare these parallel results with the ones achieved for more complex flows and geometries. That
way, one could assess whether the main competitive advantages of the Lattice Boltzmann methods
(the ability to adapt to dispersed media and to irregular, time-changing boundary conditions, and
their highly parallel nature) remain a competitive edge in parallel implementations.
Bibliography
[1] C. K. Aidun and J. R. Clausen. Lattice-Boltzmann method for complex flows. Annual Review of
Fluid Mechanics, 42(1):439–472, January 2010.
[2] Arthur S. Bland and Ricky A. Kendall. Jaguar: The world's most powerful computer. 2009.
[3] Jorg Matthias Bernsdorf. Simulation of complex flows and multi-physics with the lattice-Boltzmann
method. 2008.
[4] Carlo Cercignani. The Boltzmann Equation and its Applications. Springer-Verlag, 1988.
[5] Barbara Chapman, Gabriele Jost, and Ruud van der Pas. Using OpenMP: Portable Shared Memory
Parallel Programming (Scientific and Engineering Computation). The MIT Press, October 2007.
[6] David E. Culler and Jaswinder Pal Singh. Parallel Computer Architecture: A Hardware/Software
Approach. Morgan Kaufmann, 1999.
[7] David Kirk and Wen-mei Hwu. Programming Massively Parallel Processors: A Hands-on Approach. 2010.
[8] D. d'Humières, I. Ginzburg, M. Krafczyk, P. Lallemand, and L. S. Luo. Multiple-relaxation-time
lattice Boltzmann models in three dimensions. Philosophical Transactions: Mathematical, Physical
and Engineering Sciences, 360(1792):437+, 2002.
[9] Gropp, Huss-Lederman, and Lumsdaine. MPI: The Complete Reference. MIT Press, 1998.
[10] Harry Henderson. Encyclopedia of Computer Science and Technology. Springer-Verlag, 2009.
[11] Liu Peng, Ken-ichi Nomura, and Takehiro Oyakawa. Parallel lattice Boltzmann simulation on
emerging multi-core platforms. 2008.
[12] NVIDIA. NVIDIA CUDA C programming guide. 2010.
[13] Daniel Page. A Practical Introduction to Computer Architecture. Springer, 2009.
[14] S. Succi. The lattice Boltzmann equation for fluid dynamics and beyond. Numerical mathematics
and scientific computation. Clarendon Press ; Oxford University Press, August 2001.
[15] J. Tölke and M. Krafczyk. Teraflop computing on a desktop PC with GPUs for 3D CFD. Int. J.
Comput. Fluid Dyn., 22(7):443–456, August 2008.
[16] Jonas Tölke. Implementation of a lattice Boltzmann kernel using the Compute Unified Device
Architecture developed by NVIDIA. Comput. Vis. Sci., 13(1):29–39, 2009.
[17] J. S. Wu and Y. L. Shao. Simulation of lid-driven cavity flows by parallel lattice boltzmann
method using multi-relaxation-time scheme. International Journal for Numerical Methods in Fluids,
46(9):921–937, 2004.
[18] Zhou Xiao-Yang, Shi Bao-Chang, and Wang Neng-Chao. Numerical simulation of LBGK model for
high Reynolds number flow. Chinese Physics, 13(5):712, 2004.
.1 Terminology
Although some terms may have already been mentioned, I believe it is important to have a
non-exhaustive glossary of terms commonly found in the parallel-computing literature.
• Throughput - the rate at which data is successfully transmitted. A measure of productivity.
• Granularity - the ratio of computational work to communication. If high amounts of computational
work are performed between communications, the granularity is called coarse; otherwise it is
fine.
• Overhead - the non-productive time necessary to prepare or organize 'things' so that work
can be done. In an assembly-line analogy, it would be the time to start the machines.
Whenever a computer program starts or ends, some overhead is present; in parallel programs,
other sources of overhead include synchronization and data communication.
• Bottleneck - a single part of a program that imposes the limit on the maximum speedup
that can be obtained. No matter how much we improve or parallelize all the other parts of the
program, the elapsed time will not decrease below what the bottleneck dictates.
• Synchronization - a point of synchronization is a location in the code where all parallel tasks are
coordinated; the program does not continue until all the tasks reach this point. The entity used
for marking these points is called a barrier.
• Race Conditions - typically an undesired event where the outcome of an operation depends
on which task reached and modified a shared resource first. The mechanisms used to avoid race
conditions include locks, semaphores and atomic operations.
• Amdahl's law - a theoretical way to predict the speedup achieved with a system of multiple
processors. If f is the fraction of the code that can be run in parallel and Np is the number of
processors, then the speedup, S, is obtained by:
SAmdahl = 1 / (f/Np + (1 − f)) (1)
The fraction of code that cannot be run in parallel, 1 − f, is the bottleneck limiting the maximum
speedup. Amdahl's law treats problems of fixed size.
• Gustafson's law - another theoretical prediction of the speedups achievable with multiple
processors. It differs from Amdahl's law in that it takes into account the possibility of increasing
the problem size in order to take advantage of the available computational resources. With f the
fraction of the code taking advantage of parallelism and Np the number of processors, the maximum
speedup is
SGustafson = Np − (1 − f)(Np − 1) (2)
.2 Rules of thumb for higher performances
Although every model has its own quirks, there are general precautions that can be taken to obtain an
optimized implementation.
One of them relates to the way arrays are stored in memory and how to access them. A one-
dimensional array, a vector, is stored consecutively in memory, which poses no problem. But two- or
three-dimensional arrays are also stored in a one-dimensional fashion, and each computer language has
its preferential way of arranging multi-dimensional matrices in consecutive memory locations.
For example, in C, if we declare a 2D matrix, all elements are placed linearly in memory in
row-major order: starting with the top row, each element of the row, beginning with the lowest column
index, is placed one after the other in memory; after the first row is finished, the elements of the
second row follow, and so on (fig. 1). Fortran, on the other hand, uses a column-major approach,
storing all elements of a column sequentially.
Figure 1: A 3 × 3 matrix stored in a row major fashion.
The importance of knowing how data is stored lies in knowing how to access it in the most efficient
way. Data resides in main memory, which is slow; whenever a processor needs data from main memory,
instead of fetching just the data needed, it takes a whole chunk of contiguous data and places it in
cache memory, avoiding further long-latency accesses to main memory. So, if we program the accesses
to array entries with their contiguous storage pattern in mind, we can reduce the number of calls to
main memory and hence the overall run time of a program.
for (int j = 0; j < n; j++)
    for (int i = 0; i < n; i++)
        sum += a[i][j];

for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        sum += a[i][j];

Figure 2: The first loop accesses data non-contiguously in memory. The second uses a row-major
access pattern.
For the same reasons, whenever possible, it is wise to perform loop unrolling: instead of issuing one
instruction and having the loop perform it over a certain number of indices, we can obtain better
performance if the loop sweeps over fewer indices but performs more than one instruction per
iteration. It is better understood and expressed with an example.
for (int i = 1; i < n; i++) {
    a[i] = b[i] + 1;
    c[i] = a[i] + a[i-1] + b[i-1];
}

for (int i = 1; i < n; i += 2) {   /* assumes n - 1 is even */
    a[i]   = b[i]   + 1;
    c[i]   = a[i]   + a[i-1] + b[i-1];
    a[i+1] = b[i+1] + 1;
    c[i+1] = a[i+1] + a[i]   + b[i];
}

Figure 3: The second loop is the unrolled version of the first.
It is also important to become familiar with the compiler, as there are optimization flags that, when
active, can lead to better performance.