The Vampir Performance Analysis Tool
Hans–Christian Hoppe
Gesellschaft für Parallele Anwendungen und Systeme mbH
Pallas GmbH, Hermülheimer Straße 10, D-50321 Brühl, Germany
[email protected]
http://www.pallas.com
SCICOMP 2000 Tutorial, San Diego
© Pallas GmbH
Outline
Performance tools for parallel programming
Performance analysis for MPI
The Vampir tool
The Vampir roadmap
Why performance tools?
CPUs and interconnects are getting faster all the time
Compilers are improving
“Abundance of computing power”
Shouldn’t it be sufficient to just write an application and let the system do the rest?
Why performance tools?
In reality, there remain severe performance bottlenecks
– slow memory access (instructions and data)
– cache consistency effects
– starvation of instruction units
– contention of interconnection systems
– adverse interaction with schedulers
Why performance tools?
The application programmer does the rest
– excessive sequential sections
– bad load balance
– non–optimized communication patterns
– excessive synchronization
Performance analysis tools can
– help to diagnose system–level performance problems
– help to identify user–level performance bottlenecks
– assist the users in improving their applications
Achieved performance vs. effort
[Figure: code performance (vertical axis) against effort (horizontal axis), with OpenMP and MPI marked. Annotations: “Code doesn’t work”, “KAP, Debuggers”, “Performance tools”.]
Performance tools – goals?
Holy grail
– Automatic parallelisation and optimization
– One code version for sequential and parallel
– One code version for all platforms
– Automatic code verification
– Automatic performance verification
– Automatic detection of performance problems
– Integration of performance analysis and parallelisation
Event–based MPI Analysis
Record trace of application execution
– Calls to MPI and user routines
– MPI communication events
– Source locations
– Values of performance registers or program variables
From a trace, a performance analysis tool can show
– Protocol of execution over time
– Statistics for MPI routine execution
– Statistics for communication
– Dynamic calling tree
Important advantage
– Focus on any phase of the execution
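As a rough illustration of why an event trace allows focusing on any phase, the sketch below derives per-routine call counts and times from enter/leave events, optionally restricted to a time window. The `Event` record layout and the function name `routine_stats` are assumptions made for this example; they are not the Vampir trace format.

```python
from collections import defaultdict
from typing import NamedTuple


class Event(NamedTuple):
    time: float    # seconds since start of run
    process: int   # MPI rank
    kind: str      # "enter" or "leave"
    routine: str   # routine name


def routine_stats(events, t_start=0.0, t_end=float("inf")):
    """Return {routine: (calls, total_time)} for calls lying fully inside the window."""
    open_calls = defaultdict(list)          # per-process stack of (routine, enter_time)
    stats = defaultdict(lambda: [0, 0.0])   # routine -> [calls, total_time]
    for ev in sorted(events, key=lambda e: e.time):
        stack = open_calls[ev.process]
        if ev.kind == "enter":
            stack.append((ev.routine, ev.time))
        else:
            routine, entered = stack.pop()
            if entered >= t_start and ev.time <= t_end:
                stats[routine][0] += 1
                stats[routine][1] += ev.time - entered
    return {name: (calls, total) for name, (calls, total) in stats.items()}


# Made-up trace: two sends on rank 0, one receive on rank 1.
TRACE = [
    Event(0.0, 0, "enter", "MPI_Send"), Event(0.5, 0, "leave", "MPI_Send"),
    Event(0.0, 1, "enter", "MPI_Recv"), Event(0.75, 1, "leave", "MPI_Recv"),
    Event(2.0, 0, "enter", "MPI_Send"), Event(2.25, 0, "leave", "MPI_Send"),
]
whole_run = routine_stats(TRACE)
second_phase = routine_stats(TRACE, t_start=1.0)
```

A counter-based profile would only yield the equivalent of `whole_run`; because the trace keeps timestamps, the same data can be re-aggregated for any interval after the run.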
Vampirtrace details
Vampirtrace™
– Instrumentation library producing traces for Vampir and Dimemas
– Supports MPI–1 (incl. collective operations) and MPI–I/O
– Exploits MPI profiling interface
– Works with vendor MPI implementations
– API for user–level instrumentation
– Capability to filter for event subsets
Developed, productized and marketed by Pallas
Available for IBM SP, PE 3.x
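The MPI profiling interface makes every MPI routine callable under a second, `PMPI_`-prefixed name, so a tool library can supply its own `MPI_Send`, record an event, and forward to `PMPI_Send`. The sketch below is only a Python analogy of that interception idea combined with a simple event filter. The names `traced`, `ENABLED`, `send` and `barrier` are hypothetical and are not part of Vampirtrace or MPI.

```python
import functools
import time

TRACE = []            # recorded (kind, routine, timestamp) tuples
ENABLED = {"send"}    # hypothetical filter: only these routines are traced


def traced(func):
    """Wrap func so each call records enter/leave events, then delegates."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if func.__name__ not in ENABLED:
            return func(*args, **kwargs)    # filtered out: no events recorded
        TRACE.append(("enter", func.__name__, time.perf_counter()))
        try:
            return func(*args, **kwargs)    # analogous to calling the PMPI_ routine
        finally:
            TRACE.append(("leave", func.__name__, time.perf_counter()))
    return wrapper


@traced
def send(buf, dest):
    """Stand-in for a communication routine; returns the byte count."""
    return len(buf)


@traced
def barrier():
    """Stand-in for a synchronization routine."""
    return None


send(b"abcd", dest=1)
barrier()
```

The application code calls `send` and `barrier` unchanged; only the wrapper decides what gets recorded, which is the property that lets an instrumentation library work on top of a vendor MPI implementation.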
Vampir details
Vampir™
– Event–trace visualization tool
– Analyzes MPI and user routines
– Analyzes point–to–point, collective and MPI–IO operations
– Focus on arbitrary execution phases
– Execution and communication statistics
– Filter processes, messages, and user/MPI routines
Jointly developed by TU Dresden and Pallas
Productized and marketed by Pallas
Available for IBM RS6000, AIX 4.2/AIX 4.3
Dimemas details
Dimemas
– Event–based performance prediction tool
– Parameterized machine model
  • CPU performance
  • Communication and network performance
– Predicts performance on modeled platform
– What–if analysis determines influence of parameters
Jointly developed by UPC Barcelona and Pallas
Productized and marketed by Pallas
Available for IBM RS6000, AIX 4.2/AIX 4.3
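To make the idea of a parameterized machine model and what-if analysis concrete, here is a minimal sketch that replays a list of message sizes under a simple linear cost model (latency plus size divided by bandwidth). This model, the parameter values and the function names are assumptions for illustration only; the slides do not describe the actual Dimemas model.

```python
def transfer_time(nbytes, latency_s, bandwidth_bytes_per_s):
    """Assumed linear cost model for one message."""
    return latency_s + nbytes / bandwidth_bytes_per_s


def predicted_comm_time(message_sizes, latency_s, bandwidth_bytes_per_s):
    """Total predicted transfer time for all messages, ignoring overlap."""
    return sum(transfer_time(n, latency_s, bandwidth_bytes_per_s)
               for n in message_sizes)


# Made-up message sizes in bytes: two small messages and one large one.
MESSAGES = [1024, 1024, 1048576]

baseline = predicted_comm_time(MESSAGES, latency_s=1e-5, bandwidth_bytes_per_s=1e8)
low_latency = predicted_comm_time(MESSAGES, latency_s=1e-6, bandwidth_bytes_per_s=1e8)
double_bandwidth = predicted_comm_time(MESSAGES, latency_s=1e-5, bandwidth_bytes_per_s=2e8)
```

Comparing `low_latency` and `double_bandwidth` against `baseline` shows which parameter the (made-up) message mix is more sensitive to, which is the kind of question a what-if analysis answers.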
Vampir main window
[Figure: Vampir 2.5 main window]
– Tracefile loading can be interrupted at any time
– Tracefile loading can be resumed
– Tracefile can be loaded starting at a specified time offset
– Tracefile can be re–written
Summary chart
Aggregated profiling information
– Execution time
– Number of calls
Inclusive or exclusive of called routines
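The difference between inclusive and exclusive time can be sketched with a call stack over enter/leave events: inclusive time covers everything between enter and leave, while exclusive time subtracts the time spent in called routines. The event tuples and the function name `profile` are assumptions for this example, and recursion is not handled specially.

```python
from collections import defaultdict


def profile(events):
    """events: (time, kind, routine) tuples for one process, properly nested.

    Returns {routine: {"calls": n, "inclusive": s, "exclusive": s}}.
    """
    result = defaultdict(lambda: {"calls": 0, "inclusive": 0.0, "exclusive": 0.0})
    stack = []  # entries: [routine, enter_time, time_spent_in_callees]
    for t, kind, routine in events:
        if kind == "enter":
            stack.append([routine, t, 0.0])
        else:
            name, entered, child_time = stack.pop()
            inclusive = t - entered
            entry = result[name]
            entry["calls"] += 1
            entry["inclusive"] += inclusive
            entry["exclusive"] += inclusive - child_time
            if stack:
                stack[-1][2] += inclusive   # charge this call to its caller
    return dict(result)


# Made-up events: a user routine "exchange" that calls MPI_Send and MPI_Recv.
EVENTS = [
    (0.0, "enter", "exchange"),
    (1.0, "enter", "MPI_Send"), (2.0, "leave", "MPI_Send"),
    (2.0, "enter", "MPI_Recv"), (3.5, "leave", "MPI_Recv"),
    (4.0, "leave", "exchange"),
]
PROFILE = profile(EVENTS)
```

Here `exchange` has an inclusive time of 4.0 s but an exclusive time of only 1.5 s, because 2.5 s of it is spent inside the two MPI calls.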
Vampir state model
User specifies activities and symbol grouping
Look at all/any activities or all symbols
[Figure: summary chart. Activities: Calculation, Tracing, MPI. Symbols: MPI_Send, MPI_Recv, MPI_Wait, ssor, exchange.]
Timeline display
To zoom, mark region with the mouse
Timeline display – message details
Click on message line
Message receive op
Message send op
Message information
Communication statistics
Message statistics for each process/node pair:
– Byte and message count
– Min/max/avg message length, bandwidth
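The per-pair statistics listed above could be derived from message records roughly as follows. The record layout `(sender, receiver, nbytes, t_send, t_recv)` and the bandwidth definition (bytes divided by the summed send-to-receive times) are assumptions for this sketch, not Vampir's documented definitions.

```python
from collections import defaultdict


def pair_statistics(messages):
    """messages: iterable of (sender, receiver, nbytes, t_send, t_recv)."""
    grouped = defaultdict(list)
    for sender, receiver, nbytes, t_send, t_recv in messages:
        grouped[(sender, receiver)].append((nbytes, t_recv - t_send))
    stats = {}
    for pair, items in grouped.items():
        sizes = [n for n, _ in items]
        total_time = sum(d for _, d in items)
        stats[pair] = {
            "messages": len(sizes),
            "bytes": sum(sizes),
            "min_len": min(sizes),
            "max_len": max(sizes),
            "avg_len": sum(sizes) / len(sizes),
            # Assumed definition: bytes per second of transfer time.
            "bandwidth": sum(sizes) / total_time if total_time > 0 else float("inf"),
        }
    return stats


# Made-up messages between ranks 0 and 1.
MESSAGES = [
    (0, 1, 1000, 0.0, 0.5),
    (0, 1, 3000, 1.0, 1.5),
    (1, 0, 2000, 2.0, 2.25),
]
STATS = pair_statistics(MESSAGES)
```

Grouping by `(sender, receiver)` gives one cell of a process-pair matrix; grouping by node instead of rank would give the node-pair variant.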
Message histograms
Message statistics by length, tag or communicator
– Byte and message count
– Min/max/avg bandwidth
Collective operations
For each process: mark operation locally
Connect start/stop points by lines
Start of op
Data being sent
Data being received
Stop of op
Connection lines
Collective operations
Click on collective operation display
See global timing info
See local timing info
MPI–I/O operations
I/O transfers are shown as lines
Click on I/O line
See detailed I/O information
Activity chart
Profiling information for all processes
Global calling tree
Display for each symbol:
– Number of calls, min/max execution time
Fold/unfold or restrict to subtrees
Process–local displays
– Timeline (showing calling levels)
– Activity chart
– Calling tree (showing number of calls)
Effects of zooming
Select one iteration
Updated summary
Updated message statistics
Compare traces
Compare profiling information
– To check load balance (between processes)
– To evaluate scalability (different runs)
– To look at optimization effects (different code versions)
Compare processes 6 and 19
Comparison by routine
Coupling Vampir and Dimemas
Actual program run
vs.
Ideal communication
Vampir/Vampirtrace roadmap
Ongoing developments
– Scalability enhancements
– Functionality enhancements
– Instrumentation enhancements
Will be first available commercially on NEC and Compaq platforms
– Earth Simulator
– ASCI machines
PathForward developments for ASCI machines
Scalability challenges
Scalability in processor count
– ASCI–class machines have 1000s of processors
– High–end systems have 100s of processors
– Applications use most of them
Scalability in time
– Need to analyze actual production runs (hours/days)
Scalability in detail
– Record and analyze system–specific performance data
– Support for threaded and hybrid models
Scalability problems
Counter–based profiling tools are basically OK
– Severely limited in the level of detail
– Can’t focus on parts of the application run
Event–based tools have problems
– Event traces get really large
– Display tools use huge amounts of memory
– Many displays do not scale
Example: Vampir tracefiles for NAS NPB–LU
– 128 processes: 3,000,000 records (120 Mbyte)
– 256 processes: 15,000,000 records (600 Mbyte)
– 512 processes: 150,000,000 records (6 Gbyte)
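The arithmetic behind the NPB–LU example can be checked in a few lines. The sketch assumes decimal units (1 Mbyte = 10^6 bytes, 1 Gbyte = 10^9 bytes), which the slide does not state.

```python
# processes -> (records, trace size in bytes), taken from the figures above
RUNS = {
    128: (3_000_000, 120e6),
    256: (15_000_000, 600e6),
    512: (150_000_000, 6e9),
}

# Average size of one trace record for each run.
bytes_per_record = {p: size / records for p, (records, size) in RUNS.items()}

# Growth in record count each time the process count doubles.
growth_128_to_256 = RUNS[256][0] / RUNS[128][0]
growth_256_to_512 = RUNS[512][0] / RUNS[256][0]

# Records produced per process, which also grows with the process count.
records_per_process = {p: records / p for p, (records, _) in RUNS.items()}
```

Under that assumption every run averages 40 bytes per record, while doubling the process count multiplies the record count by 5 and then by 10, so trace volume grows much faster than the machine size.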
Threaded programming models
Enhance Vampir to display
– Thread fork/join
– Thread synchronization
– A timeline per thread, or threads aggregated into a single timeline
– Subroutine/code block execution for each thread
Create instrumentation library for thread packages
Integrate instrumentation capability into OpenMP systems
Cluster node display
Cluster information is already recorded
Enhance Vampir to
– show aggregate execution information per node
– show communication volume per node
Cluster timeline display
Display node–level information
Show communication volume within nodes
Show communication between nodes as usual
Allow nodes to be expanded into processes
There may be more than two hierarchy levels ...
Structured tracefile format
Subdivide the tracefile into frames
– Time intervals, thread/process/node subsets
Put frame data
– All in one file (as today)
– In multiple files (one per frame ...)
– On a parallel filesystem (exploit parallelism)
Frame index file holds
– Location of frame start/end
– Frame statistic data for immediate display
– “Frame thumbnail”
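One way to picture the frame index is as a small table of frame entries, each holding the frame's time interval, its location in the trace data, and summary statistics. The sketch below uses an in-memory byte buffer in place of a tracefile; the `FrameEntry` fields and both helper functions are hypothetical and do not describe the actual structured tracefile format.

```python
import io
from dataclasses import dataclass, field


@dataclass
class FrameEntry:
    t_start: float   # start of the frame's time interval (seconds)
    t_end: float     # end of the frame's time interval (seconds)
    offset: int      # byte offset of the frame data in the trace
    length: int      # size of the frame data in bytes
    stats: dict = field(default_factory=dict)   # summary for immediate display


def frames_overlapping(index, t0, t1):
    """Select the frames whose time interval overlaps [t0, t1]."""
    return [f for f in index if f.t_start < t1 and f.t_end > t0]


def load_frame(trace_file, frame):
    """Read only the bytes belonging to one frame."""
    trace_file.seek(frame.offset)
    return trace_file.read(frame.length)


# Stand-in trace data: three frames of 4, 6 and 2 bytes.
trace = io.BytesIO(b"AAAABBBBBBCC")
INDEX = [
    FrameEntry(0.0, 1.0, offset=0, length=4, stats={"records": 4}),
    FrameEntry(1.0, 2.5, offset=4, length=6, stats={"records": 6}),
    FrameEntry(2.5, 3.0, offset=10, length=2, stats={"records": 2}),
]
selected = frames_overlapping(INDEX, 1.2, 2.0)
payload = load_frame(trace, selected[0])
```

The overview statistics come from the index alone, and frame data is read only for the interval the user selects, so the cost of opening a trace no longer depends on its total size.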
Structured tracefile format
Vampir loads the frame index
Displays immediately available
– Global profiling/communication statistics
– By–frame profiling/communication statistics
– Thumbnail timeline
User gets overview of application run
– Can load particular frame data
– Can navigate between frames
User can refine instrumentation/tracing
– Get detailed trace of interesting frames
Dynamic tracing control
What can be controlled
– Definition of frames
– Data to be recorded per frame
Control methods
– Instrumentation with Vampirtrace API
– Binary instrumentation (atom) or use of a debugger
– Configuration file
– Interactive control agent (debugger)
Tracing the right data is an iterative process!
Cluster timeline display
For very large systems, the complete system still can’t be displayed at once (too many nodes)
Display “interesting” nodes only
– Regarding communication volume/delays
– Regarding load imbalance
– Regarding execution times of particular code modules
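Selecting "interesting" nodes amounts to ranking nodes by some metric and keeping the top few. The sketch below uses load imbalance, defined here as the ratio of maximum to mean per-process busy time within a node. That metric, the data and the function names are assumptions for illustration; the slides do not specify how Vampir would rank nodes.

```python
def node_imbalance(busy_times):
    """Ratio of max to mean busy time; 1.0 means perfectly balanced."""
    mean = sum(busy_times) / len(busy_times)
    return max(busy_times) / mean if mean > 0 else 1.0


def interesting_nodes(per_node_busy, top_n):
    """Return the names of the top_n nodes with the highest imbalance."""
    ranked = sorted(per_node_busy,
                    key=lambda node: node_imbalance(per_node_busy[node]),
                    reverse=True)
    return ranked[:top_n]


# Made-up busy times (seconds) for four processes on each of three nodes.
BUSY = {
    "node0": [10.0, 10.0, 10.0, 10.0],
    "node1": [4.0, 16.0, 10.0, 10.0],
    "node2": [9.0, 11.0, 10.0, 10.0],
}
SELECTED = interesting_nodes(BUSY, top_n=2)
```

The same ranking scheme works for the other criteria by swapping the key function, for example total communication volume per node or time spent in a particular code module.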
Scalable Vampir structure
Scalable user–interface
Scalable internals
[Figure: two components, Vampir SC (User Interaction, Trace Data Processing, Trace Data I/O) and Vampir DC (User Interaction, Trace Data Analysis, Display Handling), linked by Data and Control. Other labels: Structured Trace Data; runs on WS; runs on parallel system; may exploit parallel FS.]
Access to Pallas tools
Download free evaluation copies from
http://www.pallas.com