This bugzilla service is closed. All entries have been migrated to https://gitlab.com/libeigen/eigen

Bug 937

Summary: Tracking: improve and fully undertand product blocking sizes selection
Product: Eigen Reporter: Benoit Jacob <jacob.benoit.1>
Component: Core - matrix productsAssignee: Nobody <eigen.nobody>
Status: NEW ---    
Severity: Optimization CC: benoit.steiner.goog, chtz, gael.guennebaud, ilja.j.honkonen
Priority: Normal    
Version: unspecified   
Hardware: All   
OS: All   
Whiteboard:
Bug Depends on: 930, 931, 938, 939, 953, 958    
Bug Blocks:    
Attachments:
Description Flags
Data: performance and L1 cache misses for various products and blocking sizes, on a Core i7
none
Data: performance for various products and blocking sizes, on a Nexus 4
none
Data: performance and L1 cache misses for various products and blocking sizes, on a Core i7
none
Data: performance for various products and blocking sizes, on a Nexus 4
none
Output of test program compiled with gcc
none
Output of test program compiled with icpc none

Description Benoit Jacob 2015-01-25 00:12:22 UTC
This is a meta-bug, see dependent bugs. This also serves as place where to put data that is relevant to multiple dependent bugs.
Comment 1 Benoit Jacob 2015-01-25 00:17:45 UTC
Created attachment 527 [details]
Data: performance and L1 cache misses for various products and blocking sizes, on a Core i7

This is on a Core i7 with:
L1=32k
L2=256k
L3=12M

In addition to performance, this also records L1 cache read ('r') and write ('w') misses.

The matrices are Matrixf: the Scalar type is float.

Only one thread is used.
Comment 2 Benoit Jacob 2015-01-25 00:20:33 UTC
More details: all size triples are in (k,m,n) order, where the matrix product is always (m x k) times (k x n). So m is LHS.rows(), n is RHS.cols(), and k is the middle dimension.

All block sizes are actually rounded down to the nearest multiple of the register-level blocking size, so don't worry about rounding.
Comment 3 Benoit Jacob 2015-01-25 00:22:43 UTC
Created attachment 528 [details]
Data: performance for various products and blocking sizes, on a Nexus 4

This is now on a Nexus 4 phone. Cache miss counters didn't work, so only performance numbers are included.

Caches are:
L1=16k
L2=1M

The same is as above. Again, all triples are (k,m,n) where products are (m x k) times (k x n).
Comment 4 Benoit Jacob 2015-01-25 21:44:06 UTC
Created attachment 529 [details]
Data: performance and L1 cache misses for various products and blocking sizes, on a Core i7

This updated version has been processed for readability, indicated k=/m=/n= so it's clear which field is which, and also shows computed block sizes so we can compare them to cache sizes to try to make sense of these results.
Comment 5 Benoit Jacob 2015-01-25 21:45:34 UTC
Created attachment 530 [details]
Data: performance for various products and blocking sizes, on a Nexus 4
Comment 6 Gael Guennebaud 2015-01-28 11:53:54 UTC
GFLOPS on core i7 2.6GHz, clang, -mfma, turbo-boost disabled.
Size: 1152x9600 * 9600x9600


nc\kc  16   32   48   64   80  128  160  240  320  400
   4 48.1 59.3 62.9 50.4 51.3   54 55.7 56.8 57.1 57.7
   8 48.3 57.5 60.4 52.5 54.1 58.6 58.5 62.5 64.3 64.8
  16   49 56.6   60 54.9 57.1 62.7 64.8 68.8 74.8   69
  32 33.8 52.4 48.9   55 59.1 64.7 65.6 70.9 70.4 70.5
  48 33.6 43.4 39.9   45 49.1 59.6 62.1 65.7 67.6 69.3
  64 33.6   41 40.4 45.2 49.2 56.9 60.4 65.8 68.3 70.4
  80 33.5 46.9 40.4 45.3 47.7 57.2 60.3   67 68.6 71.4
 128 33.8 47.3 40.2 45.5 47.6   57 60.1   68 71.9 72.5
 160 31.6 40.8   42 45.5   49 57.5 60.1   71 72.4 72.6
 240   34 47.4 40.5 45.7 49.2 55.7 61.7 73.4 72.6 72.8
 320 32.4 47.6 51.7 45.7 49.3 57.6 61.8 73.4 72.8   73
 400 33.9 45.9   55 45.8 49.5 56.5 61.1 72.8 72.9 72.7
Comment 7 Gael Guennebaud 2015-01-28 14:05:57 UTC
Same but with larger block sizes:

nc\kc 256  384  512  640  768  896 1024 1152
 256   73 72.6 73.3 73.2 73.2 73.8 73.5 73.8
 384 72.9 72.4 73.3 72.9 73.8 74.1 74.1 74.2
 512 73.3 72.6 73.1 73.8 74.1 74.2 73.8 74.1
 640 73.2 72.2   73 73.7 73.4 73.9 73.3 74.4
 768 72.8 72.1 72.9 73.2 73.5 73.9 74.2 73.4
 896 72.3 71.6 72.8   73 73.5 72.9 73.9 73.6
1024 71.4 71.7 72.5 72.9 73.3 73.1   73 73.5
1152 71.6 71.4 72.5 72.9 72.2   73   73 73.3

Best results are obtained for very large values of kc. This is counter intuitive because we are expecting that a mr x kc horizontal block of the lhs fit within L1. In my case, mr=24, and with kc=1152, this represent 108KB that is much larger than the L1 of the i7. Perhaps this is thanks to a very good prefetching and the average sized L2 of the i7.
Comment 8 Benoit Jacob 2015-01-28 18:13:53 UTC
We just had a phone call and I'm recording here what we agreed on -- this further elaborates on comment 5.

1) The data on bug 937 here, and issues such as bug 938 and bug 939 show that we actually need ad-hoc logic to determine blocking sizes, and this ad-hoc logic may vary between CPU architectures.

2) Actually, currently, computeProductBlockingSizes does contain ad-hoc logic (bug 939). Rather than trying to remove it, the way to run well everywhere is to embrace ad-hoc logic and just select a different one on a different CPU architecture.

3) Like bug 931 comment 5 says, it does seem that all the ad-hoc logics that we  may need, only need to know two cache size parameters: basically, "L1" and "last level of cache per CPU core".

With this, a lot of things simplify, and we're no longer blocked on staring at these big tables of data trying to find patterns. It's OK to be ad-hoc!
Comment 9 Gael Guennebaud 2015-01-30 18:19:36 UTC
For the record, 

Xeon X5570 @ 2.93GHz (Nehalem), theoretical peak 23.44

nc\kc  16   32   48   64   80  128  160  240  320  400
   4 15.6 18.2 18.9 19.4 19.4 19.9 19.7 20.1 20.1 20.2
   8 15.6 18.5 19.2 19.6 19.8 20.1 20.1 20.4 20.5 20.5
  16 14.1 18.4 19.4 19.8   20 20.3 20.4 20.6 20.7 20.7
  32 12.8 17.9 19.4 19.8   20 20.3 20.3 20.6 20.7 20.8
  48 12.9 17.9 19.4 19.8   20 20.3 20.4 20.6 20.8 20.8
  64 12.7 17.7 19.4 19.8   20 20.3 20.4 20.7 20.8 20.8
  80 12.8 17.7 19.4 19.8   20 20.4 20.4 20.7 20.8 20.9
 128 12.8 17.7 19.4 19.8   20 20.4 20.5 20.7 20.8 20.9
 160 12.8 17.8 19.4 19.8 20.1 20.4 20.5 20.8 20.8 20.9
 240 12.8 17.8 19.4 19.8 20.1 20.4 20.5 20.7 20.8 20.9
 320 12.8 17.7 19.4 19.8   20 20.3 20.5 20.7 20.8 20.9
 400 13.1 17.7 19.2 19.7 19.9 20.4 20.5 20.7 20.8 20.9

Xeon E5-2670 v2 @ 2.50GHz (IvyBridge), theoretical peak = 40

nc\kc  16   32   48   64   80  128  160  240  320  400
   4 26.4 30.3 31.1 30.8 31.2 31.5 32.4 32.9   33 32.8
   8 28.7 32.2 33.2 32.7 33.9 34.6   35 35.1 34.8 34.2
  16 29.8 32.4   34 34.8 35.1   36 36.1 36.3 36.2 35.5
  32   30 32.6 34.5 35.5 35.9 36.4 36.5 37.1 37.2 36.3
  48   30 32.9 34.4 35.5 35.8 36.7 36.8 37.4 37.5 36.5
  64 30.2   33 34.7 35.8   36   37   37 37.5 37.6 36.7
  80 30.4 33.1 34.8 35.9 36.1 37.1 37.1 37.6 37.8 36.8
 128 30.6 33.1 35.1 36.1 36.4 37.3 37.3 37.8   38 36.9
 160 30.5 33.2 35.2 36.2 36.5 37.4 37.4 37.9   38 36.9
 240 30.7 33.5 35.3 36.3 36.6 37.5 37.6   38 38.1   37
 320 30.8 33.5 35.4 36.4 36.7 37.5 37.6   38 38.2 37.1
 400 30.8 33.6 35.4 36.5 36.7 37.5 37.7   38 38.2 37.1
Comment 10 Gael Guennebaud 2015-02-04 16:12:09 UTC
Nexus 9:

nc/kc  16   32   48   64   80  128  160  240  320  400
   4 15.2 15.1 15.1 15.1 15.1 15.1 15.1 15.1 15.1 15.1
   8 15.2 11.6 15.1 15.2 15.2 15.1 15.2 15.1 15.2 15.1
  16 15.2 14.7 15.1 15.1 15.1 15.1 15.2 15.2 15.1 15.1
  32 15.2 14.8 15.1 15.2 15.1 15.2 15.1 15.2 15.1 15.1
  48 15.1 14.8 15.1 15.1 15.2 15.1 15.2 15.2 15.1 15.1
  64 15.1 14.5 15.1 15.1 15.1 15.1 15.1 15.1 15.1 15.1
  80 15.2 14.9 15.1 15.1 15.2 15.1 15.1 15.1 15.1 15.1
 128 15.2 14.9 15.1 15.2 15.1 15.1 15.1 15.1 15.1 15.1
 160 15.1 15.2 15.1 15.1 15.1 15.1 15.1 15.2 15.2 15.1
 240 15.1 15.1 15.1 15.1 15.1 15.1 15.1 15.1 15.2 15.2
 320 15.1 15.1 15.1 15.1 15.2 15.1 15.1 15.2 15.2 15.1
 400 15.1 15.1 15.1 15.1 15.2 15.1 15.2 15.1 15.2 15.1
Comment 11 Gael Guennebaud 2015-02-04 19:27:03 UTC
Nexus 9 with larger block sizes.

nc/kc  256  384  512  640  768  896 1024 1152
 256 15.6 15.6 15.4 15.3 15.5 15.5   15 15.3
 384 15.6 15.5 15.6 15.3 15.1   15   15   15
 512 15.6 15.5 15.6 15.3 15.1 15.4   15 14.9
 640 15.6 15.5 15.6 15.3 15.2 15.4 15.3 14.9
 768 15.6 15.5 15.6 15.2 15.2 15.4 14.9 14.9
 896 15.5 15.5 15.6 15.2   15 15.4 14.9 14.9
1024 15.6 15.6 15.5 15.5 15.6 15.3   15 15.3
1152 15.5 15.5 15.3 15.6 15.5   15 15.3 15.5

The heuristic to determine the best block sizes on Nexus9 should be pretty straightforward ;)
Comment 12 Ilja Honkonen 2016-12-02 04:10:08 UTC
Created attachment 754 [details]
Output of test program compiled with gcc

If this is still relevant here is output from test program run on Knights Landing. Setup:

eigen 3.3.0 with #define EIGEN_TEST_SPECIFIC_BLOCKING_SIZES 1

g++ --version
g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-4)

g++ -DNDEBUG -O3 -std=c++11 -march=native prog.cpp -o prog -I ../libraries/eigen
and here's what above command enables (g++ -march=native -Q --help=target):
  -m64                                  [enabled]
  -m80387                               [enabled]
  -m96bit-long-double                   [enabled]
  -mabm                                 [enabled]
  -madx                                 [enabled]
  -maes                                 [enabled]
  -malign-stringops                     [enabled]
  -mavx                                 [enabled]
  -mavx2                                [enabled]
  -mbmi                                 [enabled]
  -mbmi2                                [enabled]
  -mcx16                                [enabled]
  -mf16c                                [enabled]
  -mfancy-math-387                      [enabled]
  -mfentry                              [enabled]
  -mfma                                 [enabled]
  -mfp-ret-in-387                       [enabled]
  -mfsgsbase                            [enabled]
  -mfxsr                                [enabled]
  -mglibc                               [enabled]
  -mhard-float                          [enabled]
  -mieee-fp                             [enabled]
  -mlong-double-80                      [enabled]
  -mlzcnt                               [enabled]
  -mmovbe                               [enabled]
  -mpclmul                              [enabled]
  -mpopcnt                              [enabled]
  -mprfchw                              [enabled]
  -mpush-args                           [enabled]
  -mrdrnd                               [enabled]
  -mrdseed                              [enabled]
  -mred-zone                            [enabled]
  -msahf                                [enabled]
  -msse                                 [enabled]
  -msse2                                [enabled]
  -msse3                                [enabled]
  -msse4                                [enabled]
  -msse4.1                              [enabled]
  -msse4.2                              [enabled]
  -mssse3                               [enabled]
  -mstackrealign                        [enabled]
  -mtls-direct-seg-refs                 [enabled]
  -mxsave                               [enabled]
  -mxsaveopt                            [enabled]
Comment 13 Ilja Honkonen 2016-12-02 04:12:53 UTC
Created attachment 755 [details]
Output of test program compiled with icpc

And as previous but compiled with Intel:

icpc --version
icpc (ICC) 17.0.0 20160721

icpc prog.cpp -I ../libraries/eigen -O3 -xMIC-AVX512 -o prog.icpc -DNDEBUG -std=c++11
../libraries/eigen/Eigen/src/Core/products/GeneralMatrixMatrix.h(429): (col. 3) warning #13212: Reference to EBX in function requiring stack alignment
Comment 14 Gael Guennebaud 2016-12-02 09:11:03 UTC
Thanks for the log. There seems to be some performance issues with ICC: 3 times slower than with GCC. Probably some inlining issue or register spilling.
Comment 15 Nobody 2019-12-04 14:07:46 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/937.