HPC for the PT-Code
Stefan Rosenberger
Supervisor: Univ.-Prof. Dipl.-Ing. Dr. Gundolf HaaseInstitut fur Mathematik und wissenschaftliches Rechnen
Universitat Graz
January 29, 2016
The PT-Code
1. The PT-Code
2. OpenMP Parallelization
3. OpenACC Parallelization
Basic Structure
Solve a linear system Au = f, where A is sparse, positive definite, and a 2×2 block system.
Apply the CG method to solve the system (with an approximate inverse operator C).
Use the AMG method to apply C efficiently.
CG - Structure
We want to minimize the residual r.
Start:
r_0 = f − A u_0
h_0 = C r_0
d_0 = h_0
Calculate (with z_k = A d_k):
α_k = (r_k^T h_k) / (d_k^T z_k)
and the new values:
u_{k+1} = u_k + α_k d_k
r_{k+1} = r_k − α_k z_k
h_{k+1} = C r_{k+1} (≈ A^{−1} r_{k+1})
β_k = (r_{k+1}^T h_{k+1}) / (r_k^T h_k)
d_{k+1} = h_{k+1} + β_k d_k
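For readers who prefer code, a minimal C++ sketch of this preconditioned CG loop. The callables apply_A and apply_C, the tolerance and the iteration cap are illustrative placeholders, not the PT-code's actual interface.

#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;
using Op  = std::function<void(const Vec&, Vec&)>;   // y = Op(x)

double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Preconditioned CG: solves A u = f with preconditioner C (~ A^-1).
void pcg(const Op& apply_A, const Op& apply_C, const Vec& f, Vec& u,
         int max_iter = 100, double tol = 1e-8) {
    const std::size_t n = f.size();
    Vec r(n), h(n), d(n), z(n);

    apply_A(u, z);                                              // z = A u_0
    for (std::size_t i = 0; i < n; ++i) r[i] = f[i] - z[i];     // r_0 = f - A u_0
    apply_C(r, h);                                              // h_0 = C r_0
    d = h;                                                      // d_0 = h_0

    double rh = dot(r, h);
    for (int k = 0; k < max_iter && std::sqrt(dot(r, r)) > tol; ++k) {
        apply_A(d, z);                                          // z_k = A d_k
        const double alpha = rh / dot(d, z);                    // alpha_k
        for (std::size_t i = 0; i < n; ++i) {
            u[i] += alpha * d[i];                               // u_{k+1}
            r[i] -= alpha * z[i];                               // r_{k+1}
        }
        apply_C(r, h);                                          // h_{k+1} = C r_{k+1}
        const double rh_new = dot(r, h);
        const double beta = rh_new / rh;                        // beta_k
        for (std::size_t i = 0; i < n; ++i) d[i] = h[i] + beta * d[i];  // d_{k+1}
        rh = rh_new;
    }
}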
Basic Structure
Use the AMG method to apply C efficiently.
Apply multigrid to compute h_{k+1} = C r_{k+1}, over l levels.
Necessary calculations: prolongation, restriction and smoothing (Jacobi method) between the levels (a V-cycle sketch follows below).
Those calculations are parallelizable!
Goal:
This final calculation should be done on a GPU.
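A rough sketch of the recursive V-cycle that realizes C. The Level interface used here (smooth, residual, restrict_to_coarse, prolongate_and_correct, solve_coarsest) is a placeholder for illustration; the PT-code's actual AMG classes look different.

#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// One multigrid V-cycle: u ≈ C f on level l of the hierarchy.
// A Level is expected to provide (placeholder interface):
//   smooth(u, f)                 -- damped Jacobi sweeps with A_l
//   residual(u, f)               -- returns f - A_l u
//   restrict_to_coarse(r)        -- restriction to level l+1
//   prolongate_and_correct(u, e) -- u += P e (prolongated correction)
//   solve_coarsest(u, f)         -- exact solve on the coarsest level
template <typename Level>
void v_cycle(const std::vector<Level>& hierarchy, std::size_t l, Vec& u, const Vec& f) {
    const Level& L = hierarchy[l];
    if (l + 1 == hierarchy.size()) {        // coarsest level reached
        L.solve_coarsest(u, f);
        return;
    }
    L.smooth(u, f);                         // pre-smoothing (Jacobi)
    Vec r  = L.residual(u, f);              // fine-level residual
    Vec rc = L.restrict_to_coarse(r);       // restriction
    Vec ec(rc.size(), 0.0);                 // coarse-grid correction, start at 0
    v_cycle(hierarchy, l + 1, ec, rc);      // recurse one level down
    L.prolongate_and_correct(u, ec);        // prolongation + correction
    L.smooth(u, f);                         // post-smoothing (Jacobi)
}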
OpenMP Parallelization
1. The PT-Code
2. OpenMP Parallelization
3. OpenACC Parallelization
OpenMP - General Concept
OpenMP provides parallelization on the CPU.
OpenMP - Application
We can apply simple pragma directives to create parallel loops.
e.g.:
#pragma omp parallel for shared(cnt, ...) private(s, ...)
for (int i = 0; i < size; i++) {
    ...
    // shared values: usually read values
    // private values: usually run index, pointer
    ...
}
Remark: As mentioned earlier, we use an initial matrix A wherein all entries are 2×2 matrices (cf. the function on the next slide).
OpenMP - Example Jacobi
void operator()(const toolbox_vector<S>& f, toolbox_vector<S>& u) {
    // define constants, pointers
    // ...
    #pragma omp parallel for default(none) shared(cnt, col, ele, dsp, uu, v, f) private(s, c, t) schedule(guided, 2)
    for (int ii = 0; ii < nsize; ii++) {
        s = f[ii];
        const T* pcol = col + dsp[ii];
        const S* pele = ele + dsp[ii];
        // First part of the matrix
        const T csize = cnt[ii];
        for (int jj = 0; jj < csize; jj++) {
            c = pcol[jj];
            t = pele[jj];
            s -= t * uu[c];
        }
        v[ii] = s;
    }
    const int dsize = dia.size() / 4;
    #pragma omp parallel for default(none) shared(u, v, dia) schedule(guided, 2)
    for (int ii = 0; ii < dsize; ii++) {
        u[2*ii]     += omega * ( v[2*ii] * dia[4*ii+3] - v[2*ii+1] * dia[4*ii+1]);
        u[2*ii + 1] += omega * (-v[2*ii] * dia[4*ii+2] + v[2*ii+1] * dia[4*ii]);
    }
}
OpenMP - Notes
One can apply the parallelization on all occurring loops (and we did that).
OpenMP calculates on the CPU → only in the parallelized regions do we have to take care of data access (shared, private).
Compare speed-up:
Without parallelization: ≈ 1.33614 seconds total time.
OpenMP with 8 threads: ≈ 0.967979 seconds total time.
Without parallelization: ≈ 0.0576582 seconds per loop.
OpenMP with 8 threads: ≈ 0.0445387 seconds per loop.
OpenACC Parallelization
1. The PT-Code
2. OpenMP Parallelization
3. OpenACC Parallelization
OpenACC - General Concept
OpenACC is parallelization on the GPU.
Distinguish between code on the host and code on the device.
Advantage of the host: one can use classes and corresponding functions (e.g. constructors).
Advantage of the device: much faster calculation is possible.
OpenACC - Application
We can again apply simple pragma directives to create parallel loops.
e.g.:
#pragma acc kernels loop private(s, c, t) independent pcopyout(out[0:n_out]), pcopyin(in[0:n_in])
for (int i = 0; i < size; i++) {
    ...
    // shared values: usually read values
    // private values: usually run index, pointer
    ...
}
Now we have to take care of data copies. We have to provide the data on the device to execute the code correctly.
OpenACC - Example Jacobi
void operator()(const toolbox_vector<S>& f, toolbox_vector<S>& u) {
    // define constants, pointers
    // ...
    #pragma acc kernels loop private(s, c, t) independent pcopyout(v[0:n_v]), pcopyin(uu[0:n_u], dsp[0:n_dsp], cnt[0:n_cnt], f[0:n_f], col[0:n_col], ele[0:n_ele])
    for (int ii = 0; ii < nsize; ii++) {
        s = f[ii];
        const T* pcol = col + dsp[ii];
        const S* pele = ele + dsp[ii];
        // First part of the matrix
        const T csize = cnt[ii];
        for (int jj = 0; jj < csize; jj++) {
            c = pcol[jj];
            t = pele[jj];
            s -= t * uu[c];
        }
        v[ii] = s;
    }
    const int dsize = dia.size() / 4;
    #pragma acc kernels loop independent pcopy(u[0:n_u]), pcopyin(v[0:n_v], dia[0:n_dia])
    for (int ii = 0; ii < dsize; ii++) {
        u[2*ii + 0] += omega * ( v[2*ii] * dia[4*ii + 3] - v[2*ii + 1] * dia[4*ii + 1]);
        u[2*ii + 1] += omega * (-v[2*ii] * dia[4*ii + 2] + v[2*ii + 1] * dia[4*ii + 0]);
    }
}
OpenACC - Notes
One can apply the parallelization on all occurring loops (and we did that).
But we have to take care of the data on the device and the host.
We need data regions:
#pragma acc data copyin(u[0:size], ...) copyout(u[0:size], ...)
{
    // do stuff with data
}
The pragma acc data creates the data only for the following scope!
For a complex code it is better to create a copy of the data only once and to delete it only at the end of the code.
We need enter and exit commands (see the sketch below).
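A minimal sketch of this pattern with illustrative array names: create the device copy once, launch many kernels against it, and release it only at the end.

// Illustrative: u and f stand in for the solver's arrays (names are placeholders).
void run_smoother_many_times(double* u, double* f, int size, int n_iter, double omega) {
    // Create the device copies once, at start-up.
    #pragma acc enter data copyin(u[0:size], f[0:size])

    for (int iter = 0; iter < n_iter; ++iter) {
        // Each kernel launch reuses the data already present on the device;
        // no per-call copy is needed.
        #pragma acc kernels loop present(u[0:size], f[0:size])
        for (int i = 0; i < size; ++i)
            u[i] += omega * f[i];           // placeholder update
    }

    // Copy the result back and release the device memory only at the end.
    #pragma acc exit data copyout(u[0:size])
    #pragma acc exit data delete(f[0:size])
}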
OpenACC - Data handling
OpenACC provides commands to create a hard copy of data on the device:
#pragma acc enter data copyin(data[0:size])
This will create a data array of length size on the device and copy the data to the device.
Create data on the device without knowing the values:
#pragma acc enter data create(tmp[0:size])
This will create a data array of length size on the device, but it will not copy any data to the device. Useful e.g. for temporary values.
To delete data on the device one can use the pragmas:
#pragma acc exit data delete(tmp[0:size])
#pragma acc exit data copyout(data[0:size])
These will remove the data from the device (copyout copies it back to the host first).
OpenACC - Data handling during Calculation
If data may already be on the device, one can use:
#pragma acc enter data pcopyin(data[0:size])
This will create a data array of length size on the device and copy the data to the device if the data is not yet on the device or the pointer has not been created. Otherwise the command will do nothing!
And similarly, one can use
#pragma acc enter data pcreate(tmp[0:size])
to create the data only if the array does not already exist on the device.
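For instance, a scratch array such as the v used in the Jacobi smoother can stay device-only. A small sketch, with illustrative names:

// v is a scratch vector; its host values are never needed.
void make_device_scratch(double* v, int n) {
    // Allocate v on the device only if it is not already present; copy nothing.
    #pragma acc enter data pcreate(v[0:n])

    // Kernels can then read and write v entirely on the device.
    #pragma acc kernels loop present(v[0:n])
    for (int i = 0; i < n; ++i)
        v[i] = 0.0;
}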
OpenACC - toolbox_vector
Most data is stored in toolbox_vector objects.
OpenACC needs the class information on the device.
Create copyin functions for the classes:
void todev() {
    #pragma acc enter data pcopyin(this[0:1], data[0:size])
}
void todev() const {
    #pragma acc enter data pcopyin(this[0:1], data[0:size])
}
Copying the pointer this copies all class-member information (not the data) to the device.
One should copy the class pointers of all required classes to the device.
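A hedged usage sketch, assuming the toolbox_vector class from these slides: each required vector is sent to the device once, before any kernels run.

// Send every vector the smoother needs to the device once, up front.
template <typename S>
void prepare_device(const toolbox_vector<S>& f, toolbox_vector<S>& u, toolbox_vector<S>& v) {
    f.todev();   // copies this[0:1] (class members) and the underlying data
    u.todev();
    v.todev();
}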
OpenACC - Compiling Checks
We can use pgc++ -Minfo=acc -acc ... to get compile-time information from OpenACC.
Information for the jacobi data copyin:
default_amg::jacobi_2x2<int, double>::todev():
     68, include "toolbox.h"
         116, include "amg_2x2.h"
              58, Generating enter data copyin(this[:1])
OpenACC - Compiling Checks
Information for jacobi parallelization of loops:
default_amg::jacobi_2x2<int, double>::operator()(const toolbox_vector<double> &, toolbox_vector<double> &):
     68, include "toolbox.h"
         116, include "amg_2x2.h"
              89, Generating copyin(this[:])
             115, Loop is parallelizable
                  Accelerator kernel generated
                  Generating Tesla code
                 115, #pragma acc loop gang /* blockIdx.x */
                 123, #pragma acc loop vector(128) /* threadIdx.x */
                 127, Sum reduction generated for s
             123, Loop is parallelizable
             152, Generating copyin(this[:])
             167, Loop is parallelizable
                  Accelerator kernel generated
                  Generating Tesla code
                 167, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
              89, Generating copyout(v[:v->size])
                  Generating copyin(uu[:u->size], dsp[:dsp->size], cnt[:cnt->size], f[:f->size], col[:col->size], ele[:ele->size])
NOTE: the output "89, Generating copyout(v[:v->size])" does not imply that the system actually copies data from or to the device!
OpenACC - Run Checks
One can use the command
export PGI_ACC_NOTIFY=3
to get runtime information about the data transfers that actually happen.
Information for jacobi enter data:
upload CUDA data file=/home/rosenbs/PT_Code/SR_PT/src/example_amg_openACC/example-amg-openacc.cpp function=_ZN11default_amg10jacobi_2x2IidEclERK14toolbox_vectorIdERS3_ line=152 device=0 threadid=1 variable=v bytes=352
We get: the line, the name of the variable, and the size!
We get this information for all operations on the device!
→ A lot of information!
Now one can go step by step through the code and reduce the copy operations to optimize the calculation (a sketch of this follows below).
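A sketch of that kind of optimization, reusing the second loop of the Jacobi example and assuming u is already on the device (e.g. via todev()): the per-call pcopyin/pcopy clauses are replaced by a one-time enter data region plus present clauses, so no upload is reported for every call.

// Once, after the setup phase: put the constant data on the device and
// create the scratch vector there (array names as in the Jacobi example).
#pragma acc enter data copyin(dia[0:n_dia])
#pragma acc enter data pcreate(v[0:n_v])

// In the smoother itself, the clauses now only assert presence instead of
// copying, so PGI_ACC_NOTIFY no longer reports an upload for every call.
#pragma acc kernels loop independent present(u[0:n_u], v[0:n_v], dia[0:n_dia])
for (int ii = 0; ii < dsize; ii++) {
    u[2*ii]     += omega * ( v[2*ii] * dia[4*ii+3] - v[2*ii+1] * dia[4*ii+1]);
    u[2*ii + 1] += omega * (-v[2*ii] * dia[4*ii+2] + v[2*ii+1] * dia[4*ii]);
}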
OpenACC - Notes
I applied the copyin data regions (almost everywhere)!
Compare speed-up, total time:
Without parallelization: ≈ 1.336 seconds total time.
OpenMP with 8 threads: ≈ 0.967 seconds total time (×1.38 faster).
OpenACC on the device: ≈ 0.559 seconds total time (×2.40 faster).
Compare speed-up, per loop:
Without parallelization: ≈ 0.0576 seconds per loop.
OpenMP with 8 threads: ≈ 0.0445 seconds per loop (×1.29 faster).
OpenACC on the device: ≈ 0.0241 seconds per loop (×2.39 faster).
Thank you for your attention!