Rechen- und Kommunikationszentrum (RZ)
Performance Characteristics of
Large SMP Machines
Dirk Schmidl, Dieter an Mey, Matthias S. Müller
schmidl@rz.rwth-aachen.de
Agenda
Investigated Hardware
Kernel Benchmark Results
Memory Bandwidth
NUMA Distances
Synchronization
Applications
NestedCP
TrajSearch
Conclusion
Hardware
HP ProLiant DL980 G7
8 x Intel Xeon X6550 @ 2 GHz
256 GB main memory
internally composed of several boards
SGI Altix Ultraviolet
104 x Intel Xeon E7-4870 @ 2.4 GHz
about 2 TB main memory
2-socket boards connected with the NUMAlink network
Bull Coherence Switch System
16 x Intel Xeon X7550 @ 2 GHz
256 GB main memory
4-socket boards externally connected with the Bull Coherence Switch (BCS)
Hardware
ScaleMP System
64 x Intel Xeon X7550 @ 2 GHz
about 4 TB main memory
4-socket boards connected with InfiniBand
vSMP Foundation software is used to create a cache-coherent single system
Intel Xeon Phi
1 Intel Xeon Phi coprocessor @ 1.05 GHz
plugged into a PCIe slot
8 GB main memory
Serial Bandwidth
[Figure: serial write bandwidth in GB/s over data sizes from 1 B to 4 GB, one panel per machine (HP, Altix UV, BCS, ScaleMP, Phi), with curves for local, remote 1st-level, and remote 2nd-level accesses; the Phi panel compares standard accesses with software prefetching.]
Distance Matrix
Measured bandwidth between sockets
Memory and threads placed with numactl
Normalized to 10 for socket 0 (local access)

BCS (16 sockets):
Socket 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 10 13 13 13 57 57 57 57 59 59 59 59 59 57 57 57
1 13 10 13 13 56 55 56 56 56 56 56 55 55 56 56 55
2 14 13 10 13 58 58 58 58 56 56 56 56 58 58 58 58
3 13 13 13 10 56 55 56 55 56 56 56 55 56 55 56 55
4 56 56 56 56 10 13 13 13 56 56 56 57 58 58 58 58
5 55 55 55 55 13 10 13 13 55 55 55 55 56 56 56 55
6 58 58 58 59 13 13 10 13 58 58 58 58 56 56 56 57
7 56 55 56 55 13 13 13 10 56 56 56 56 56 56 56 56
8 58 58 58 58 56 57 56 56 10 13 13 13 56 56 56 56
9 56 56 55 55 55 55 55 55 13 10 13 13 55 55 56 55
10 56 56 56 56 58 58 58 58 13 13 10 13 58 58 58 58
11 56 56 56 55 56 56 56 55 13 13 13 10 56 56 56 56
12 56 56 56 56 58 58 58 58 56 57 56 56 10 13 13 13
13 55 55 55 55 56 56 55 55 56 55 55 55 13 10 13 13
14 58 58 58 58 56 56 56 56 58 58 58 58 13 13 10 13
15 56 56 56 56 56 56 56 56 56 56 56 56 13 13 13 10

HP (8 sockets):
Socket 0 1 2 3 4 5 6 7
0 10 10 17 13 18 18 18 18
1 10 10 17 13 18 18 18 18
2 17 17 10 11 18 18 18 18
3 17 17 10 11 19 19 18 18
4 18 18 18 18 10 10 17 17
5 18 18 18 18 10 10 17 17
6 18 18 18 18 17 17 10 10
7 18 19 18 18 17 17 10 9

• remote accesses are much more expensive on the BCS machine
• the HP machine internally also has several NUMA levels
Parallel Bandwidth
Read and Write Bandwidth on local data
16 MB memory footprint per thread
[Figure: aggregate read and write bandwidth in GB/s (0-250) over the number of threads (0-240) for HP, ALTIX, BCS, SCALEMP, and Phi.]
mem_go_around
Investigates the slow-down when remote accesses occur
Every thread initializes local memory and measures the bandwidth
In step n, thread t uses the memory of thread (t+n) % nthreads
This increases the number of remote accesses in every step
mem_go_around
[Figure: memory bandwidth in GB/s (log scale, 1-512) over the turn number (0-120) for HP, Altix, BCS, ScaleMP, and Phi.]
Synchronization
Overhead in microseconds to acquire a lock
Synchronization overhead rises with the number of threads
ScaleMP introduces much more overhead for large thread counts

#threads   BCS    SCALEMP   PHI    ALTIX   HP
1          0.06   0.07      0.40   0.05    0.93
8          0.27   0.29      1.89   0.21    0.26
32/30      0.62   0.99      1.77   3.29    0.97
64/60      1.04   24.36     1.94   3.72    1.07
128/120    1.64   35.78     2.01   2.99    -
240        -      -         2.26   -       -
NestedCP: Parallel Critical Point Extraction
Virtual Reality Group of RWTH Aachen University:
Analysis of large-scale flow simulations
Feature extraction from raw data
Interactive analysis in a virtual environment (e.g. a CAVE)
Critical point: a point in the vector field with zero velocity
Andreas Gerndt, Virtual Reality Center, RWTH Aachen
NestedCP
Parallelization done with OpenMP tasks
Many independent tasks, synchronized only at the end
[Figure: NestedCP runtime in seconds (0-300) and speedup (0-100) over the number of threads (1 to 240) for BCS, SCALEMP, PHI, ALTIX, and HP.]
TrajSearch
Direct numerical simulation of a three-dimensional turbulent flow field produces large output arrays
16384 processors on a BlueGene and about half a year of computation produced a 2048³ output grid (320 GB)
The trajectory analysis (TrajSearch), implemented with OpenMP, was optimized for large NUMA machines
Here the 1024³-cell data set (~40 GB) was used
Institute for Combustion Technology
TrajSearch
Optimizations:
Reduced the number of locks
NUMA-aware data initialization
Data blocked into 8x8x8 blocks to load the nearest data on ScaleMP
Self-written NUMA-aware scheduler
[Figure: TrajSearch runtime in hours (0-40) and speedup (0-140) over the number of threads (8 to 128) for ALTIX, BCS, and SCALEMP.]
Conclusion
Larger systems provide a larger total memory bandwidth
The overhead of many remote accesses is also higher on larger systems, as seen in the mem_go_around test
The caching in the vSMP software can hide the remote latency, even when larger arrays are read or written remotely
Synchronization is a problem on all systems and grows with the number of cores
The Xeon Phi system delivers good bandwidth and low synchronization overhead for a large number of threads
Applications can run well on large NUMA machines
Remark:
A revised version with newer performance measurements will soon be available on our website under publications: https://sharepoint.campus.rwth-aachen.de/units/rz/HPC/public/default.aspx
Thank you for your attention! Questions?