Rechen- und Kommunikationszentrum (RZ)
Performance Characteristics of
Large SMP Machines
Dirk Schmidl, Dieter an Mey, Matthias S. Müller
schmidl@rz.rwth-aachen.de
Agenda
Investigated Hardware
Kernel Benchmark Results
Memory Bandwidth
NUMA Distances
Synchronization
Applications
NestedCP
TrajSearch
Conclusion
Hardware
HP ProLiant DL980 G7
8 x Intel Xeon X6550 @ 2 GHz
256 GB main memory
internally composed of several boards
SGI Altix Ultraviolet
104 x Intel Xeon E7-4870 @ 2.4 GHz
about 2 TB main memory
2-socket boards connected with the NUMAlink network
Bull Coherence Switch System
16 x Intel Xeon X7550 @ 2 GHz
256 GB main memory
4-socket boards externally connected with the Bull Coherence Switch (BCS)
Hardware
ScaleMP System
64 x Intel Xeon X7550 @ 2 GHz
about 4 TB main memory
4-socket boards connected with InfiniBand
vSMP Foundation software is used to create a cache-coherent single system
Intel Xeon Phi
1 Intel Xeon Phi coprocessor @ 1.05 GHz
plugged into a PCIe slot
8 GB main memory
Serial Bandwidth
[Figure: serial write bandwidth in GB/s over data sizes from 1 B to 4 GB, one panel per machine (HP, Altix UV, BCS, ScaleMP, Phi), with curves for local, remote 1st-level, and remote 2nd-level accesses; the Phi panel compares standard accesses with software prefetching.]
Distance Matrix
Measured bandwidth between sockets
Memory and threads placed with numactl
Normalized to 10 for socket 0 (local access)

BCS (16 sockets):
Socket 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 10 13 13 13 57 57 57 57 59 59 59 59 59 57 57 57
1 13 10 13 13 56 55 56 56 56 56 56 55 55 56 56 55
2 14 13 10 13 58 58 58 58 56 56 56 56 58 58 58 58
3 13 13 13 10 56 55 56 55 56 56 56 55 56 55 56 55
4 56 56 56 56 10 13 13 13 56 56 56 57 58 58 58 58
5 55 55 55 55 13 10 13 13 55 55 55 55 56 56 56 55
6 58 58 58 59 13 13 10 13 58 58 58 58 56 56 56 57
7 56 55 56 55 13 13 13 10 56 56 56 56 56 56 56 56
8 58 58 58 58 56 57 56 56 10 13 13 13 56 56 56 56
9 56 56 55 55 55 55 55 55 13 10 13 13 55 55 56 55
10 56 56 56 56 58 58 58 58 13 13 10 13 58 58 58 58
11 56 56 56 55 56 56 56 55 13 13 13 10 56 56 56 56
12 56 56 56 56 58 58 58 58 56 57 56 56 10 13 13 13
13 55 55 55 55 56 56 55 55 56 55 55 55 13 10 13 13
14 58 58 58 58 56 56 56 56 58 58 58 58 13 13 10 13
15 56 56 56 56 56 56 56 56 56 56 56 56 13 13 13 10

HP (8 sockets):
Socket 0 1 2 3 4 5 6 7
0 10 10 17 13 18 18 18 18
1 10 10 17 13 18 18 18 18
2 17 17 10 11 18 18 18 18
3 17 17 10 11 19 19 18 18
4 18 18 18 18 10 10 17 17
5 18 18 18 18 10 10 17 17
6 18 18 18 18 17 17 10 10
7 18 19 18 18 17 17 10 9

• remote accesses are much more expensive on the BCS machine
• the HP machine internally also has several NUMA levels
Parallel Bandwidth
Read and Write Bandwidth on local data
16 MB memory footprint per thread
[Figure: aggregate read and write bandwidth in GB/s (0-250) over the number of threads (0-240) for HP, ALTIX, BCS, SCALEMP, and Phi.]
mem_go_around
Investigates the slow-down when remote accesses occur
Every thread initializes local memory and measures the bandwidth
In step n, thread t uses the memory of thread (t+n) % nthreads
This increases the number of remote accesses in every step
mem_go_around
[Figure: memory bandwidth in GB/s (log scale, 1-512) over the turn number (0-120) for HP, Altix, BCS, ScaleMP, and Phi.]
Synchronization
Overhead in microseconds to acquire a lock
Synchronization overhead rises with the number of threads
ScaleMP introduces much more overhead for large thread counts

#threads   BCS    SCALEMP   PHI    ALTIX   HP
1          0.06   0.07      0.40   0.05    0.93
8          0.27   0.29      1.89   0.21    0.26
32/30      0.62   0.99      1.77   3.29    0.97
64/60      1.04   24.36     1.94   3.72    1.07
128/120    1.64   35.78     2.01   2.99    -
240        -      -         2.26   -       -
NestedCP: Parallel Critical Point Extraction
Virtual Reality Group of RWTH Aachen University:
Analysis of large-scale flow simulations
Feature extraction from raw data
Interactive analysis in a virtual environment (e.g. a CAVE)
Critical point: a point in the vector field with zero velocity
Andreas Gerndt, Virtual Reality Center, RWTH Aachen
NestedCP
Parallelization done with OpenMP tasks
Many independent tasks, synchronized only at the end
[Figure: NestedCP runtime in seconds (0-300) and speedup (0-100) over the number of threads (1 to 240) for BCS, SCALEMP, PHI, ALTIX, and HP.]
TrajSearch
Direct numerical simulation of a three-dimensional turbulent flow field produces large output arrays
16384 processors on a BlueGene and about half a year of computation produced a 2048³ output grid (320 GB)
The trajectory analysis (TrajSearch), implemented with OpenMP, was optimized for large NUMA machines
Here the 1024³-cell data set (~40 GB) was used
Institute for Combustion Technology
TrajSearch
Optimizations:
Reduced the number of locks
NUMA-aware data initialization
Data blocked into 8x8x8 blocks to load the nearest data on ScaleMP
Self-written NUMA-aware scheduler
[Figure: TrajSearch runtime in hours (0-40) and speedup (0-140) over the number of threads (8 to 128) for ALTIX, BCS, and SCALEMP.]
Conclusion
Larger systems provide a larger total memory bandwidth
The overhead of many remote accesses is also higher on larger systems, as seen in the mem_go_around test
The caching in the vSMP software can hide the remote latency, even when larger arrays are read or written remotely
Synchronization is a problem on all systems and grows with the number of cores
The Xeon Phi system delivers good bandwidth and low synchronization overhead for a large number of threads
Applications can run well on large NUMA machines
Remark:
A revised version with newer performance measurements will soon be available on our website under publications: https://sharepoint.campus.rwth-aachen.de/units/rz/HPC/public/default.aspx
Thank you for your attention! Questions?