Thijs Cornelissen (Wuppertal)

FSTF meeting 24 July 2013, T. Cornelissen

Page 1: Thijs Cornelissen  (Wuppertal)

Track fitting and vectorization

Page 2: Thijs Cornelissen  (Wuppertal)

Updates in GlobalChi2Fitter

Rewritten calculation of Jacobians

Fewer temporary matrices calculated

Matrix multiplication now uses SSE instructions, a factor of two faster (with doubles)

Reorganized internal storage of matrices in fitter, mainly to make them properly aligned in memory (crucial for vector instructions)

In covariance matrix, inserted empty column between perigee and scatter entries, to make total number of entries even

Jacobians stored as a 5x4 matrix (by default they are 5x5, which is very bad for vectorization)

Optimized calculation of track errors at each measurement

Also taking advantage of SSE instructions

After these optimizations, main bottleneck is allocation and deletion (new/delete) of Tracking EDM objects
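The alignment argument for the 5x4 storage can be illustrated with a small sketch. The type names Jacobian5x5 and Jacobian5x4 and the helper alignedRows are hypothetical and are not taken from GlobalChi2Fitter; the sketch only assumes row-major storage of doubles and the 16-byte alignment that aligned SSE loads require. With five doubles per row, every second row starts 8 bytes off a 16-byte boundary; with four doubles per row, every row starts on one.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical storage types for illustration only (not the fitter's classes).
// Both structs start on a 16-byte boundary; rows are stored contiguously.
struct alignas(16) Jacobian5x5 { double m[5][5]; };  // 40 bytes per row
struct alignas(16) Jacobian5x4 { double m[5][4]; };  // 32 bytes per row

// True if the address is a multiple of 16 bytes.
inline bool isAligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Counts how many of the five rows start on a 16-byte boundary.
template <class Matrix>
int alignedRows(const Matrix& j) {
    int n = 0;
    for (int r = 0; r < 5; ++r) {
        if (isAligned16(&j.m[r][0])) ++n;
    }
    return n;
}
```

For the 5x5 layout only rows 0, 2 and 4 are aligned (row offsets 0, 40, 80, 120, 160 bytes), whereas all five rows of the 5x4 layout are (offsets 0, 32, 64, 96, 128 bytes).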

Page 3: Thijs Cornelissen  (Wuppertal)

Matrix multiplication: scalar

Simple 4x4 matrix multiplication routine

With gcc 4.7.2, auto-vectorization makes this routine 30% slower!!

gcc 4.8.1 shows a small (~10%) improvement, still nowhere near the theoretical speed-up (factor 4)
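The code listing shown on this slide is not preserved in the transcript. As a stand-in, a simple scalar 4x4 routine might look like the sketch below. The function name multiply4x4Scalar, the row-major layout and the float element type are assumptions; float is chosen only because a theoretical factor of 4 matches four single-precision values per 128-bit SSE register.

```cpp
// Scalar 4x4 matrix product C = A * B.
// Assumption: a, b and c each point to 16 floats in row-major order.
void multiply4x4Scalar(const float* a, const float* b, float* c) {
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k) {
                // Dot product of row i of A with column j of B.
                sum += a[4 * i + k] * b[4 * k + j];
            }
            c[4 * i + j] = sum;
        }
    }
}
```

The inner loop walks down a column of B with a stride of four elements, which is one reason a compiler may struggle to vectorize this form profitably.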

Page 4: Thijs Cornelissen  (Wuppertal)

Matrix multiplication: vectorized

Vectorized 4x4 matrix multiplication routine, calculates four dot products in parallel

Tested to be four times faster than scalar version

Runs in development version of fitter, gives correct results
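The routine from the slide is not preserved in the transcript. One common way to compute four dot products in parallel with SSE is sketched below; it is not necessarily the author's implementation. The function name multiply4x4Vectorized, the float element type and the row-major layout are assumptions. The compile-time guard and scalar fallback are added here only so that the sketch also builds on targets without SSE.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE__)
#include <xmmintrin.h>  // SSE intrinsics (x86 only)
#endif

// C = A * B for row-major 4x4 float matrices.
// Assumption: b and c are 16-byte aligned, as _mm_load_ps/_mm_store_ps require.
void multiply4x4Vectorized(const float* a, const float* b, float* c) {
#if defined(__SSE__)
    assert(reinterpret_cast<std::uintptr_t>(b) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(c) % 16 == 0);
    // Load the four rows of B once.
    const __m128 b0 = _mm_load_ps(b + 0);
    const __m128 b1 = _mm_load_ps(b + 4);
    const __m128 b2 = _mm_load_ps(b + 8);
    const __m128 b3 = _mm_load_ps(b + 12);
    for (int i = 0; i < 4; ++i) {
        // Broadcast a[i][k] and scale row k of B. Summing over k gives
        // all four elements of row i of C (four dot products) at once.
        __m128 row = _mm_mul_ps(_mm_set1_ps(a[4 * i + 0]), b0);
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a[4 * i + 1]), b1));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a[4 * i + 2]), b2));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a[4 * i + 3]), b3));
        _mm_store_ps(c + 4 * i, row);
    }
#else
    // Scalar fallback so the sketch still compiles without SSE.
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k) sum += a[4 * i + k] * b[4 * k + j];
            c[4 * i + j] = sum;
        }
    }
#endif
}
```

Callers would declare the matrices with alignas(16), for example alignas(16) float b[16], which ties in with the aligned internal storage described for the fitter.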

Page 5: Thijs Cornelissen  (Wuppertal)

Profiling runIteration()

Large reduction in derivative calculation thanks to matrix optimizations

[Figure: profiles of runIteration(), devval vs. new]

Page 6: Thijs Cornelissen  (Wuppertal)

Profiling calculateTrackErrors()

Overall factor 2 improvement after all optimizations (not just vectorization)

Performance of errors2() function does not look optimal yet, still investigating additional techniques like loop unrolling, blocking, …

[Figure: profiles of calculateTrackErrors(), devval vs. new]
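As a generic illustration of the loop unrolling mentioned above, the sketch below unrolls a dot product by four with independent accumulators. It does not reproduce errors2(), whose contents are not shown in the slides; the function names dotSimple and dotUnrolled4 are hypothetical.

```cpp
#include <cstddef>

// Straightforward accumulation: every addition depends on the previous one.
double dotSimple(const double* x, const double* y, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) sum += x[i] * y[i];
    return sum;
}

// Unrolled by four with independent accumulators, which shortens the
// dependency chain and leaves the compiler more room to use vector registers.
double dotUnrolled4(const double* x, const double* y, std::size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i + 0] * y[i + 0];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    double sum = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i) sum += x[i] * y[i];  // remaining elements when n % 4 != 0
    return sum;
}
```

Because the unrolled version adds the terms in a different order, its result can differ from the simple loop in the last bits for general floating-point input.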

Page 7: Thijs Cornelissen  (Wuppertal)

SSE and portability

Intel MIC and ARM processors don’t support SSE, code would crash immediately

Can implement runtime check at initialization using the assembler instruction ‘cpuid’, as explained here

Could then use result from cpuid to set function pointer to scalar or vectorized functions

In the end, using a higher-level library like Eigen would be more elegant

But performance will have to match the low-level code
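The runtime check and function-pointer dispatch described on this slide could be sketched as follows. This is an illustration under stated assumptions, not the fitter's code: the names MultiplyFn, multiplyScalar, multiplySse, cpuHasSse and selectMultiply are hypothetical, and the cpuid query uses the __get_cpuid helper from the GCC/Clang header cpuid.h rather than hand-written assembler. On ARM the SSE intrinsics do not compile at all, so the sketch also uses a compile-time guard in addition to the runtime check.

```cpp
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>      // GCC/Clang helper around the cpuid instruction
#endif
#if defined(__SSE__)
#include <xmmintrin.h>  // SSE intrinsics (x86 only)
#endif

// Common signature of the scalar and vectorized routines (hypothetical).
using MultiplyFn = void (*)(const float*, const float*, float*);

// Portable scalar 4x4 product, row-major.
void multiplyScalar(const float* a, const float* b, float* c) {
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k) sum += a[4 * i + k] * b[4 * k + j];
            c[4 * i + j] = sum;
        }
    }
}

#if defined(__SSE__)
// SSE 4x4 product; unaligned loads/stores keep the sketch free of
// alignment requirements.
void multiplySse(const float* a, const float* b, float* c) {
    const __m128 b0 = _mm_loadu_ps(b + 0);
    const __m128 b1 = _mm_loadu_ps(b + 4);
    const __m128 b2 = _mm_loadu_ps(b + 8);
    const __m128 b3 = _mm_loadu_ps(b + 12);
    for (int i = 0; i < 4; ++i) {
        __m128 row = _mm_mul_ps(_mm_set1_ps(a[4 * i + 0]), b0);
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a[4 * i + 1]), b1));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a[4 * i + 2]), b2));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a[4 * i + 3]), b3));
        _mm_storeu_ps(c + 4 * i, row);
    }
}
#endif

// Returns true if the CPU reports SSE support (CPUID leaf 1, EDX bit 25).
bool cpuHasSse() {
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) == 0) return false;
    return (edx & (1u << 25)) != 0;
#else
    return false;  // not an x86 target: no cpuid, no SSE
#endif
}

// Called once at initialization to pick the implementation.
MultiplyFn selectMultiply() {
#if defined(__SSE__)
    if (cpuHasSse()) return &multiplySse;
#endif
    return &multiplyScalar;
}
```

A caller would store the result once, for example MultiplyFn multiply = selectMultiply();, and then call multiply(a, b, c) in the hot loop. The indirect call prevents inlining, so the cost of the dispatch would need to be measured against the gain from vectorization.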