FSTF meeting 24 July 2013, T. Cornelissen
Thijs Cornelissen (Wuppertal)
Track fitting and vectorization
Updates in GlobalChi2Fitter
Rewritten calculation of Jacobians
Fewer temporary matrices calculated
Matrix multiplication now uses SSE instructions, a factor of two faster (with doubles)
Reorganized internal storage of matrices in fitter, mainly to make them properly aligned in memory (crucial for vector instructions)
  In covariance matrix, inserted empty column between perigee and scatter entries, to make total number of entries even
  Jacobians stored as a 5x4 matrix (by default they are 5x5, which is very bad for vectorization)
Optimized calculation of track errors at each measurement
  Also taking advantage of SSE instructions
After these optimizations, the main bottleneck is the allocation and deallocation (new/delete) of Tracking EDM objects
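
The alignment and padding points above can be illustrated with a small sketch. This is not the fitter's actual storage layout: `AlignedMatrix` and `paddedStride` are hypothetical names, and padding every row to an even number of doubles is only one possible way to keep each row on a 16-byte boundary, which is what aligned SSE loads of doubles require.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Round a row length up to an even number of doubles, so that every row
// spans a whole number of 16-byte SSE registers (2 doubles each).
constexpr std::size_t paddedStride(std::size_t cols) {
  return (cols + 1) & ~static_cast<std::size_t>(1);
}

// Hypothetical fixed-size matrix with 16-byte aligned, row-padded storage.
template <std::size_t Rows, std::size_t Cols>
struct AlignedMatrix {
  static constexpr std::size_t stride = paddedStride(Cols);
  alignas(16) double data[Rows * stride] = {};  // padding entries stay zero

  // Element access; padding columns are never addressed through this.
  double& operator()(std::size_t r, std::size_t c) {
    assert(r < Rows && c < Cols);
    return data[r * stride + c];
  }

  // True if row r starts on a 16-byte boundary (needed for aligned loads).
  bool rowIsAligned(std::size_t r) const {
    return reinterpret_cast<std::uintptr_t>(&data[r * stride]) % 16 == 0;
  }
};
```

With this layout a 5x5 matrix gets a row stride of 6, while a 5x4 matrix needs no padding at all, which is one reason a 4-column layout is convenient for vector code.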
Matrix multiplication: scalar
Simple 4x4 matrix multiplication routine
With gcc 4.7.2, auto-vectorization makes this routine 30% slower!
gcc 4.8.1 shows a small (~10%) improvement, still nowhere near the theoretical speed-up (factor 4)
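
The routine shown on the original slide is not preserved in this transcript. As a rough idea of what a simple scalar 4x4 multiplication looks like, here is a minimal sketch; the function name `mul4x4Scalar`, the row-major layout, and the use of doubles are assumptions. This is the kind of triple loop that the compiler's auto-vectorizer is asked to handle.

```cpp
#include <cassert>
#include <cstddef>

// Plain triple-loop product C = A * B for row-major 4x4 matrices.
void mul4x4Scalar(const double* a, const double* b, double* c) {
  assert(c != a && c != b);  // the output must not alias either input
  for (std::size_t i = 0; i < 4; ++i) {
    for (std::size_t j = 0; j < 4; ++j) {
      double sum = 0.0;
      for (std::size_t k = 0; k < 4; ++k) {
        sum += a[4 * i + k] * b[4 * k + j];  // row i of A times column j of B
      }
      c[4 * i + j] = sum;
    }
  }
}
```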
Matrix multiplication: vectorized
Vectorized 4x4 matrix multiplication routine, calculates four dot products in parallel
Tested to be four times faster than the scalar version
Runs in the development version of the fitter and gives correct results
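
The vectorized routine itself is also missing from the transcript. The sketch below shows one common way to compute four dot products in parallel with SSE intrinsics; it is not necessarily the author's implementation. It assumes single-precision, row-major, 16-byte aligned matrices, and `Mat4f`, `mul4x4Sse`, and `SKETCH_HAVE_SSE` are hypothetical names. A portable fallback is included so the sketch still builds where SSE is unavailable.

```cpp
#include <cassert>

#if defined(__SSE__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAVE_SSE 1
#else
#define SKETCH_HAVE_SSE 0
#endif

// Row-major 4x4 float matrix; each row fits exactly one aligned SSE load.
struct alignas(16) Mat4f {
  float m[16];
};

// C = A * B. Row i of C equals sum over k of A(i,k) * (row k of B), so the
// four dot products of one output row are accumulated in a single register.
void mul4x4Sse(const Mat4f& a, const Mat4f& b, Mat4f& c) {
  assert(&c != &a && &c != &b);  // the output must not alias the inputs
#if SKETCH_HAVE_SSE
  const __m128 b0 = _mm_load_ps(b.m + 0);
  const __m128 b1 = _mm_load_ps(b.m + 4);
  const __m128 b2 = _mm_load_ps(b.m + 8);
  const __m128 b3 = _mm_load_ps(b.m + 12);
  for (int i = 0; i < 4; ++i) {
    // Broadcast one element of A and scale the matching row of B.
    __m128 row = _mm_mul_ps(_mm_set1_ps(a.m[4 * i + 0]), b0);
    row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a.m[4 * i + 1]), b1));
    row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a.m[4 * i + 2]), b2));
    row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a.m[4 * i + 3]), b3));
    _mm_store_ps(c.m + 4 * i, row);
  }
#else
  // Portable fallback for targets without SSE.
  for (int i = 0; i < 4; ++i) {
    for (int j = 0; j < 4; ++j) {
      float sum = 0.0f;
      for (int k = 0; k < 4; ++k) sum += a.m[4 * i + k] * b.m[4 * k + j];
      c.m[4 * i + j] = sum;
    }
  }
#endif
}
```

With doubles, a 128-bit SSE register holds only two values, so each output row would need two registers instead of one.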
Profiling runIteration()
Large reduction in time spent on derivative calculation thanks to matrix optimizations
[Figure: profiles of runIteration(), devval vs. new]
Profiling calculateTrackErrors()
Overall factor 2 improvement after all optimizations (not just vectorization)
Performance of the errors2() function does not look optimal yet; still investigating additional techniques like loop unrolling, blocking, …
[Figure: profiles of calculateTrackErrors(), devval vs. new]
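
The slides do not show errors2(), so the following is not its code. It is only a generic sketch of loop unrolling, one of the techniques named above, applied to a dot product. The name `dotUnrolled4` is hypothetical. Four independent accumulators break the dependency chain of a single running sum, which can help both the compiler's vectorizer and the CPU pipeline.

```cpp
#include <cassert>
#include <cstddef>

// Dot product unrolled by four with independent partial sums.
double dotUnrolled4(const double* x, const double* y, std::size_t n) {
  assert(x != nullptr && y != nullptr);
  double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
  std::size_t i = 0;
  // Main loop: four multiply-adds per iteration, no dependency between them.
  for (; i + 4 <= n; i += 4) {
    s0 += x[i] * y[i];
    s1 += x[i + 1] * y[i + 1];
    s2 += x[i + 2] * y[i + 2];
    s3 += x[i + 3] * y[i + 3];
  }
  double sum = (s0 + s1) + (s2 + s3);
  // Remainder loop for the last n % 4 elements.
  for (; i < n; ++i) sum += x[i] * y[i];
  return sum;
}
```

Because the additions are reordered, the result can differ from a straightforward loop in the last bits for general floating-point input.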
SSE and portability
Intel MIC and ARM processors don’t support SSE; the code would crash immediately
Can implement a runtime check at initialization using the assembler instruction ‘cpuid’, as explained here
Could then use the result from cpuid to set a function pointer to the scalar or vectorized functions
In the end, using a higher-level library like Eigen would be more elegant
But its performance will have to match the low-level code
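
The runtime check and function-pointer dispatch described above could be sketched as follows. Instead of hand-written `cpuid` assembly, this sketch uses the GCC/Clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports`, which query the same CPU information on x86. The names `Mul4x4Fn`, `mul4x4Scalar`, `mul4x4Sse`, `cpuHasSse`, `selectMul4x4`, and `SKETCH_X86_SSE` are all hypothetical.

```cpp
#include <cassert>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__)) && defined(__SSE__)
#include <xmmintrin.h>
#define SKETCH_X86_SSE 1
#else
#define SKETCH_X86_SSE 0
#endif

// Common signature for both kernels: row-major 4x4 float matrices.
using Mul4x4Fn = void (*)(const float* a, const float* b, float* c);

// Portable reference kernel.
void mul4x4Scalar(const float* a, const float* b, float* c) {
  assert(c != a && c != b);  // the output must not alias the inputs
  for (int i = 0; i < 4; ++i) {
    for (int j = 0; j < 4; ++j) {
      float sum = 0.0f;
      for (int k = 0; k < 4; ++k) sum += a[4 * i + k] * b[4 * k + j];
      c[4 * i + j] = sum;
    }
  }
}

#if SKETCH_X86_SSE
// SSE kernel; unaligned loads and stores keep the sketch alignment-agnostic.
void mul4x4Sse(const float* a, const float* b, float* c) {
  const __m128 b0 = _mm_loadu_ps(b + 0), b1 = _mm_loadu_ps(b + 4);
  const __m128 b2 = _mm_loadu_ps(b + 8), b3 = _mm_loadu_ps(b + 12);
  for (int i = 0; i < 4; ++i) {
    __m128 row = _mm_mul_ps(_mm_set1_ps(a[4 * i + 0]), b0);
    row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a[4 * i + 1]), b1));
    row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a[4 * i + 2]), b2));
    row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a[4 * i + 3]), b3));
    _mm_storeu_ps(c + 4 * i, row);
  }
}
#endif

// Returns true if the running CPU reports SSE support.
bool cpuHasSse() {
#if SKETCH_X86_SSE
  __builtin_cpu_init();
  return __builtin_cpu_supports("sse") != 0;
#else
  return false;  // no SSE kernel was compiled for this target
#endif
}

// Called once at initialization; the result is stored in a function pointer.
Mul4x4Fn selectMul4x4() {
#if SKETCH_X86_SSE
  if (cpuHasSse()) return &mul4x4Sse;
#endif
  return &mul4x4Scalar;
}
```

A caller would store the result once, for example `Mul4x4Fn mul = selectMul4x4();`, and then call `mul(a, b, c)` in the hot loop. Note that SSE intrinsics do not compile at all for ARM, so a compile-time guard is needed in addition to the runtime check.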