HPC for the PT-Code
Stefan Rosenberger
Supervisor: Univ.-Prof. Dipl.-Ing. Dr. Gundolf HaaseInstitut fur Mathematik und wissenschaftliches Rechnen
Universitat Graz
January 29, 2016
The PT-Code
1. The PT-Code
2. OpenMP Parallelization
3. OpenACC Parallelization
Basic Structure
Solve a linear system Au = f, where A is sparse, positive definite, and a 2×2 block system.
Apply the CG method to solve the system (with an approximate inverse operator C).
Use the AMG method to apply C efficiently.
CG - Structure
We want to minimize the residual r.
Start:
r_0 = f − A u_0
h_0 = C r_0
d_0 = h_0
Calculate (with z_k = A d_k):
α_k = (r_k^T h_k) / (d_k^T z_k)
and the new values:
u_{k+1} = u_k + α_k d_k
r_{k+1} = r_k − α_k z_k
h_{k+1} = C r_{k+1} (≈ A^{−1} r_{k+1})
β_k = (r_{k+1}^T h_{k+1}) / (r_k^T h_k)
d_{k+1} = h_{k+1} + β_k d_k
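For readers who prefer code, a minimal C++ sketch of this preconditioned CG loop. The callables apply_A and apply_C, the tolerance and the iteration cap are illustrative placeholders, not the PT-code's actual interface.

#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;
using Op  = std::function<void(const Vec&, Vec&)>;   // y = Op(x)

double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Preconditioned CG: solves A u = f with preconditioner C (~ A^-1).
void pcg(const Op& apply_A, const Op& apply_C, const Vec& f, Vec& u,
         int max_iter = 100, double tol = 1e-8) {
    const std::size_t n = f.size();
    Vec r(n), h(n), d(n), z(n);

    apply_A(u, z);                                              // z = A u_0
    for (std::size_t i = 0; i < n; ++i) r[i] = f[i] - z[i];     // r_0 = f - A u_0
    apply_C(r, h);                                              // h_0 = C r_0
    d = h;                                                      // d_0 = h_0

    double rh = dot(r, h);
    for (int k = 0; k < max_iter && std::sqrt(dot(r, r)) > tol; ++k) {
        apply_A(d, z);                                          // z_k = A d_k
        const double alpha = rh / dot(d, z);                    // alpha_k
        for (std::size_t i = 0; i < n; ++i) {
            u[i] += alpha * d[i];                               // u_{k+1}
            r[i] -= alpha * z[i];                               // r_{k+1}
        }
        apply_C(r, h);                                          // h_{k+1} = C r_{k+1}
        const double rh_new = dot(r, h);
        const double beta = rh_new / rh;                        // beta_k
        for (std::size_t i = 0; i < n; ++i) d[i] = h[i] + beta * d[i];  // d_{k+1}
        rh = rh_new;
    }
}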
Basic Structure
Use the AMG method to apply C efficiently.
Apply multigrid to compute h_{k+1} = C r_{k+1}, over l levels.
Necessary calculations: prolongation, restriction and smoothing (Jacobi method) between the levels (a V-cycle sketch follows below).
Those calculations are parallelizable!
Goal:
This final calculation should be done on a GPU.
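A rough sketch of the recursive V-cycle that realizes C. The Level interface used here (smooth, residual, restrict_to_coarse, prolongate_and_correct, solve_coarsest) is a placeholder for illustration; the PT-code's actual AMG classes look different.

#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// One multigrid V-cycle: u ≈ C f on level l of the hierarchy.
// A Level is expected to provide (placeholder interface):
//   smooth(u, f)                 -- damped Jacobi sweeps with A_l
//   residual(u, f)               -- returns f - A_l u
//   restrict_to_coarse(r)        -- restriction to level l+1
//   prolongate_and_correct(u, e) -- u += P e (prolongated correction)
//   solve_coarsest(u, f)         -- exact solve on the coarsest level
template <typename Level>
void v_cycle(const std::vector<Level>& hierarchy, std::size_t l, Vec& u, const Vec& f) {
    const Level& L = hierarchy[l];
    if (l + 1 == hierarchy.size()) {        // coarsest level reached
        L.solve_coarsest(u, f);
        return;
    }
    L.smooth(u, f);                         // pre-smoothing (Jacobi)
    Vec r  = L.residual(u, f);              // fine-level residual
    Vec rc = L.restrict_to_coarse(r);       // restriction
    Vec ec(rc.size(), 0.0);                 // coarse-grid correction, start at 0
    v_cycle(hierarchy, l + 1, ec, rc);      // recurse one level down
    L.prolongate_and_correct(u, ec);        // prolongation + correction
    L.smooth(u, f);                         // post-smoothing (Jacobi)
}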
OpenMP Parallelization
1. The PT-Code
2. OpenMP Parallelization
3. OpenACC Parallelization
OpenMP - General Concept
OpenMP provides parallelization on the CPU.
OpenMP - Application
We can apply simple pragma directives to create parallel loops.
e.g.:
#pragma omp parallel for shared(cnt, ...) private(s, ...)
for (int i = 0; i < size; i++) {
    ...
    // shared values: usually read values
    // private values: usually run index, pointer
    ...
}
Remark: As mentioned earlier, we use an initial matrix A wherein all entries are 2×2 matrices (cf. the function on the next slide).
OpenMP - Example Jacobi
void operator()(const toolbox_vector<S>& f, toolbox_vector<S>& u) {
    // define constants, pointers
    // ...
    #pragma omp parallel for default(none) shared(cnt, col, ele, dsp, uu, v, f) private(s, c, t) schedule(guided, 2)
    for (int ii = 0; ii < nsize; ii++) {
        s = f[ii];
        const T* pcol = col + dsp[ii];
        const S* pele = ele + dsp[ii];
        // First part of the matrix
        const T csize = cnt[ii];
        for (int jj = 0; jj < csize; jj++) {
            c = pcol[jj];
            t = pele[jj];
            s -= t * uu[c];
        }
        v[ii] = s;
    }
    const int dsize = dia.size() / 4;
    #pragma omp parallel for default(none) shared(u, v, dia) schedule(guided, 2)
    for (int ii = 0; ii < dsize; ii++) {
        u[2*ii]     += omega * ( v[2*ii] * dia[4*ii+3] - v[2*ii+1] * dia[4*ii+1]);
        u[2*ii + 1] += omega * (-v[2*ii] * dia[4*ii+2] + v[2*ii+1] * dia[4*ii]);
    }
}
OpenMP - Notes
One can apply the parallelization on all occurring loops (and we did that).
OpenMP calculates on the CPU → only in the parallelized regions do we have to take care of data access (shared, private).
Compare speed-up:
Without parallelization: ≈ 1.33614 seconds total time.
OpenMP with 8 threads: ≈ 0.967979 seconds total time.
Without parallelization: ≈ 0.0576582 seconds per loop.
OpenMP with 8 threads: ≈ 0.0445387 seconds per loop.
OpenACC Parallelization
1. The PT-Code
2. OpenMP Parallelization
3. OpenACC Parallelization
OpenACC - General Concept
OpenACC is parallelization on the GPU.
Distinguish between code on the host and code on the device.
Advantage of the host: one can use classes and corresponding functions (e.g. constructors).
Advantage of the device: much faster calculation is possible.
OpenACC - Application
We can again apply simple pragma directives to create parallel loops.
e.g.:
#pragma acc kernels loop private(s, c, t) independent pcopyout(out[0:n_out]), pcopyin(in[0:n_in])
for (int i = 0; i < size; i++) {
    ...
    // shared values: usually read values
    // private values: usually run index, pointer
    ...
}
Now we have to take care of data copies. We have to provide the data on the device to execute the code correctly.
OpenACC - Example Jacobi
void operator()(const toolbox_vector<S>& f, toolbox_vector<S>& u) {
    // define constants, pointers
    // ...
    #pragma acc kernels loop private(s, c, t) independent pcopyout(v[0:n_v]), pcopyin(uu[0:n_u], dsp[0:n_dsp], cnt[0:n_cnt], f[0:n_f], col[0:n_col], ele[0:n_ele])
    for (int ii = 0; ii < nsize; ii++) {
        s = f[ii];
        const T* pcol = col + dsp[ii];
        const S* pele = ele + dsp[ii];
        // First part of the matrix
        const T csize = cnt[ii];
        for (int jj = 0; jj < csize; jj++) {
            c = pcol[jj];
            t = pele[jj];
            s -= t * uu[c];
        }
        v[ii] = s;
    }
    const int dsize = dia.size() / 4;
    #pragma acc kernels loop independent pcopy(u[0:n_u]), pcopyin(v[0:n_v], dia[0:n_dia])
    for (int ii = 0; ii < dsize; ii++) {
        u[2*ii + 0] += omega * ( v[2*ii] * dia[4*ii + 3] - v[2*ii + 1] * dia[4*ii + 1]);
        u[2*ii + 1] += omega * (-v[2*ii] * dia[4*ii + 2] + v[2*ii + 1] * dia[4*ii + 0]);
    }
}
OpenACC - Notes
One can apply the parallelization on all occurring loops (and we did that).
But we have to take care of the data on the device and the host.
We need data regions:
#pragma acc data copyin(u[0:size], ...) copyout(u[0:size], ...)
{
    // do stuff with data
}
The pragma acc data creates the data only for the following scope!
For a complex code it is better to create a copy of the data only once and to delete it only at the end of the code.
We need enter and exit commands (see the sketch below).
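A minimal sketch of this pattern with illustrative array names: create the device copy once, launch many kernels against it, and release it only at the end.

// Illustrative: u and f stand in for the solver's arrays (names are placeholders).
void run_smoother_many_times(double* u, double* f, int size, int n_iter, double omega) {
    // Create the device copies once, at start-up.
    #pragma acc enter data copyin(u[0:size], f[0:size])

    for (int iter = 0; iter < n_iter; ++iter) {
        // Each kernel launch reuses the data already present on the device;
        // no per-call copy is needed.
        #pragma acc kernels loop present(u[0:size], f[0:size])
        for (int i = 0; i < size; ++i)
            u[i] += omega * f[i];           // placeholder update
    }

    // Copy the result back and release the device memory only at the end.
    #pragma acc exit data copyout(u[0:size])
    #pragma acc exit data delete(f[0:size])
}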
OpenACC - Data handling
OpenACC provides commands to create a hard copy of data on the device:
#pragma acc enter data copyin(data[0:size])
This will create a data array of length size on the device and copy the data to the device.
Create data on the device without knowing the values:
#pragma acc enter data create(tmp[0:size])
This will create a data array of length size on the device, but it will not copy any data to the device. Useful e.g. for temporary values.
To delete data on the device one can use the pragmas:
#pragma acc exit data delete(tmp[0:size])
#pragma acc exit data copyout(data[0:size])
These will remove the data from the device (copyout copies it back to the host first).
OpenACC - Data handling during Calculation
If data may already be on the device, one can use:
#pragma acc enter data pcopyin(data[0:size])
This will create a data array of length size on the device and copy the data to the device if the data is not yet on the device or the pointer has not been created. Otherwise the command will do nothing!
And similarly, one can use
#pragma acc enter data pcreate(tmp[0:size])
to create the data only if the array does not already exist on the device.
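For instance, a scratch array such as the v used in the Jacobi smoother can stay device-only. A small sketch, with illustrative names:

// v is a scratch vector; its host values are never needed.
void make_device_scratch(double* v, int n) {
    // Allocate v on the device only if it is not already present; copy nothing.
    #pragma acc enter data pcreate(v[0:n])

    // Kernels can then read and write v entirely on the device.
    #pragma acc kernels loop present(v[0:n])
    for (int i = 0; i < n; ++i)
        v[i] = 0.0;
}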
OpenACC - toolbox_vector
Most data is stored in toolbox_vector objects.
OpenACC needs the class information on the device.
Create copyin functions for the classes:
void todev() {
    #pragma acc enter data pcopyin(this[0:1], data[0:size])
}
void todev() const {
    #pragma acc enter data pcopyin(this[0:1], data[0:size])
}
Copying the pointer this copies all class-member information (not the data) to the device.
One should copy the class pointers of all required classes to the device.
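A hedged usage sketch, assuming the toolbox_vector class from these slides: each required vector is sent to the device once, before any kernels run.

// Send every vector the smoother needs to the device once, up front.
template <typename S>
void prepare_device(const toolbox_vector<S>& f, toolbox_vector<S>& u, toolbox_vector<S>& v) {
    f.todev();   // copies this[0:1] (class members) and the underlying data
    u.todev();
    v.todev();
}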
OpenACC - Compiling Checks
We can use pgc++ -Minfo=acc -acc ... to get compile-time information from OpenACC.
Information for the jacobi data copyin:
default_amg::jacobi_2x2<int, double>::todev():
     68, include "toolbox.h"
         116, include "amg_2x2.h"
              58, Generating enter data copyin(this[:1])
OpenACC - Compiling Checks
Information for jacobi parallelization of loops:
default_amg::jacobi_2x2<int, double>::operator()(const toolbox_vector<double> &, toolbox_vector<double> &):
     68, include "toolbox.h"
         116, include "amg_2x2.h"
              89, Generating copyin(this[:])
             115, Loop is parallelizable
                  Accelerator kernel generated
                  Generating Tesla code
                 115, #pragma acc loop gang /* blockIdx.x */
                 123, #pragma acc loop vector(128) /* threadIdx.x */
                 127, Sum reduction generated for s
             123, Loop is parallelizable
             152, Generating copyin(this[:])
             167, Loop is parallelizable
                  Accelerator kernel generated
                  Generating Tesla code
                 167, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
              89, Generating copyout(v[:v->size])
                  Generating copyin(uu[:u->size], dsp[:dsp->size], cnt[:cnt->size], f[:f->size], col[:col->size], ele[:ele->size])
NOTE: the output "89, Generating copyout(v[:v->size])" does not imply that the system actually copies data from or to the device!
OpenACC - Run Checks
One can use the command
export PGI_ACC_NOTIFY=3
to get runtime information about the data transfers that actually happen.
Information for jacobi enter data:
upload CUDA data file=/home/rosenbs/PT_Code/SR_PT/src/example_amg_openACC/example-amg-openacc.cpp function=_ZN11default_amg10jacobi_2x2IidEclERK14toolbox_vectorIdERS3_ line=152 device=0 threadid=1 variable=v bytes=352
We get: the line, the name of the variable, and the size!
We get this information for all operations on the device!
→ A lot of information!
Now one can go step by step through the code and reduce the copy operations to optimize the calculation (a sketch of this follows below).
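A sketch of that kind of optimization, reusing the second loop of the Jacobi example and assuming u is already on the device (e.g. via todev()): the per-call pcopyin/pcopy clauses are replaced by a one-time enter data region plus present clauses, so no upload is reported for every call.

// Once, after the setup phase: put the constant data on the device and
// create the scratch vector there (array names as in the Jacobi example).
#pragma acc enter data copyin(dia[0:n_dia])
#pragma acc enter data pcreate(v[0:n_v])

// In the smoother itself, the clauses now only assert presence instead of
// copying, so PGI_ACC_NOTIFY no longer reports an upload for every call.
#pragma acc kernels loop independent present(u[0:n_u], v[0:n_v], dia[0:n_dia])
for (int ii = 0; ii < dsize; ii++) {
    u[2*ii]     += omega * ( v[2*ii] * dia[4*ii+3] - v[2*ii+1] * dia[4*ii+1]);
    u[2*ii + 1] += omega * (-v[2*ii] * dia[4*ii+2] + v[2*ii+1] * dia[4*ii]);
}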
OpenACC - Notes
I applied the copyin data regions (almost everywhere)!
Compare speed-up, total time:
Without parallelization: ≈ 1.336 seconds total time.
OpenMP with 8 threads: ≈ 0.967 seconds total time (×1.38 faster).
OpenACC on the device: ≈ 0.559 seconds total time (×2.40 faster).
Compare speed-up, per loop:
Without parallelization: ≈ 0.0576 seconds per loop.
OpenMP with 8 threads: ≈ 0.0445 seconds per loop (×1.29 faster).
OpenACC on the device: ≈ 0.0241 seconds per loop (×2.39 faster).
Thank you for your attention!