This is a meta-bug; see the dependent bugs. It also serves as a place to put data that is relevant to multiple dependent bugs.
Created attachment 527
Data: performance and L1 cache misses for various products and blocking sizes, on a Core i7

This is on a Core i7 with: L1=32k, L2=256k, L3=12M.
In addition to performance, this also records L1 cache read ('r') and write ('w') misses.
The matrices are MatrixXf, i.e. the Scalar type is float. Only one thread is used.
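For reference, here is a minimal, hypothetical sketch of the kind of benchmark driver that produces such GFLOPS numbers (not the actual test program attached here), using only Eigen's public API; the matrix sizes and repeat count are purely illustrative:

  // Times a single-threaded float matrix product and reports GFLOPS.
  #include <Eigen/Core>
  #include <chrono>
  #include <iostream>

  int main() {
    const int m = 1000, k = 1000, n = 1000;   // (m x k) * (k x n), illustrative sizes
    Eigen::MatrixXf lhs = Eigen::MatrixXf::Random(m, k);
    Eigen::MatrixXf rhs = Eigen::MatrixXf::Random(k, n);
    Eigen::MatrixXf result(m, n);

    result.noalias() = lhs * rhs;             // warm-up run

    const int repeats = 4;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i)
      result.noalias() = lhs * rhs;
    auto t1 = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(t1 - t0).count() / repeats;
    double gflops = 2.0 * m * n * double(k) / seconds * 1e-9;  // 2*m*n*k flops per product
    std::cout << gflops << " GFLOPS\n";
  }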
More details: all size triples are in (k,m,n) order, where the matrix product is always (m x k) times (k x n). So m is LHS.rows(), n is RHS.cols(), and k is the middle dimension. All block sizes are actually rounded down to the nearest multiple of the register-level blocking size, so don't worry about rounding.
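To make the role of the three block sizes concrete, here is a generic sketch of the cache-blocking loop structure used by GEMM-style kernels. This is not Eigen's actual implementation: kc, mc and nc stand for the block sizes being swept in the tables below, and the packed mr x nr register micro-kernel is replaced by a naive inner loop.

  #include <algorithm>

  // Generic cache-blocked GEMM sketch: C (m x n) += A (m x k) * B (k x n),
  // all matrices column-major. Real implementations pack the A and B blocks
  // and use an mr x nr register micro-kernel instead of the naive loops below.
  void blocked_gemm(const float* A, const float* B, float* C,
                    int m, int n, int k, int mc, int nc, int kc) {
    for (int jc = 0; jc < n; jc += nc) {           // nc columns of B and C
      int nb = std::min(nc, n - jc);
      for (int pc = 0; pc < k; pc += kc) {         // kc-deep slice of A and B
        int kb = std::min(kc, k - pc);
        for (int ic = 0; ic < m; ic += mc) {       // mc rows of A and C
          int mb = std::min(mc, m - ic);
          // naive stand-in for the packed mr x nr micro-kernel:
          for (int j = 0; j < nb; ++j)
            for (int p = 0; p < kb; ++p)
              for (int i = 0; i < mb; ++i)
                C[(ic + i) + (jc + j) * m] +=
                    A[(ic + i) + (pc + p) * m] * B[(pc + p) + (jc + j) * k];
        }
      }
    }
  }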
Created attachment 528
Data: performance for various products and blocking sizes, on a Nexus 4

This is now on a Nexus 4 phone. Cache-miss counters didn't work, so only performance numbers are included.
Caches are: L1=16k, L2=1M.
The setup is the same as above. Again, all triples are (k,m,n) where products are (m x k) times (k x n).
Created attachment 529
Data: performance and L1 cache misses for various products and blocking sizes, on a Core i7

This updated version has been processed for readability: it indicates k=/m=/n= so it is clear which field is which, and it also shows the computed block sizes so we can compare them to the cache sizes and try to make sense of these results.
Created attachment 530
Data: performance for various products and blocking sizes, on a Nexus 4
GFLOPS on Core i7 2.6GHz, clang, -mfma, turbo-boost disabled. Size: 1152x9600 * 9600x9600

nc\kc 16    32    48    64    80    128   160   240   320   400
4     48.1  59.3  62.9  50.4  51.3  54    55.7  56.8  57.1  57.7
8     48.3  57.5  60.4  52.5  54.1  58.6  58.5  62.5  64.3  64.8
16    49    56.6  60    54.9  57.1  62.7  64.8  68.8  74.8  69
32    33.8  52.4  48.9  55    59.1  64.7  65.6  70.9  70.4  70.5
48    33.6  43.4  39.9  45    49.1  59.6  62.1  65.7  67.6  69.3
64    33.6  41    40.4  45.2  49.2  56.9  60.4  65.8  68.3  70.4
80    33.5  46.9  40.4  45.3  47.7  57.2  60.3  67    68.6  71.4
128   33.8  47.3  40.2  45.5  47.6  57    60.1  68    71.9  72.5
160   31.6  40.8  42    45.5  49    57.5  60.1  71    72.4  72.6
240   34    47.4  40.5  45.7  49.2  55.7  61.7  73.4  72.6  72.8
320   32.4  47.6  51.7  45.7  49.3  57.6  61.8  73.4  72.8  73
400   33.9  45.9  55    45.8  49.5  56.5  61.1  72.8  72.9  72.7
Same but with larger block sizes:

nc\kc 256   384   512   640   768   896   1024  1152
256   73    72.6  73.3  73.2  73.2  73.8  73.5  73.8
384   72.9  72.4  73.3  72.9  73.8  74.1  74.1  74.2
512   73.3  72.6  73.1  73.8  74.1  74.2  73.8  74.1
640   73.2  72.2  73    73.7  73.4  73.9  73.3  74.4
768   72.8  72.1  72.9  73.2  73.5  73.9  74.2  73.4
896   72.3  71.6  72.8  73    73.5  72.9  73.9  73.6
1024  71.4  71.7  72.5  72.9  73.3  73.1  73    73.5
1152  71.6  71.4  72.5  72.9  72.2  73    73    73.3

The best results are obtained for very large values of kc. This is counter-intuitive, because we expect an mr x kc horizontal block of the LHS to fit within L1. In my case mr=24, and with kc=1152 such a block is 24*1152*4 bytes = 108KB, much larger than the i7's 32KB L1. Perhaps this still works thanks to very good prefetching and the reasonably large (256KB) L2 of the i7.
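To spell out the arithmetic behind that remark, a small hypothetical helper (mr and the cache sizes are the values quoted above, the kc values are a subset of the ones swept in the tables):

  #include <cstdio>

  // Which kc values keep an mr x kc float LHS panel within L1/L2 on this
  // Core i7 (L1 = 32KB, L2 = 256KB), with mr = 24 as stated above?
  int main() {
    const int mr = 24;
    const long l1 = 32 * 1024, l2 = 256 * 1024;
    const int kcs[] = {48, 160, 320, 640, 1152};
    for (int kc : kcs) {
      long bytes = 4L * mr * kc;  // e.g. kc=1152 -> 110592 bytes, i.e. 108KB
      std::printf("kc=%4d  panel=%6ld bytes  fits L1: %s  fits L2: %s\n",
                  kc, bytes, bytes <= l1 ? "yes" : "no", bytes <= l2 ? "yes" : "no");
    }
  }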
We just had a phone call and I'm recording here what we agreed on -- this further elaborates on comment 5.

1) The data here on bug 937, and issues such as bug 938 and bug 939, show that we actually need ad-hoc logic to determine the blocking sizes, and this ad-hoc logic may vary between CPU architectures.

2) Actually, computeProductBlockingSizes currently does contain ad-hoc logic (bug 939). Rather than trying to remove it, the way to run well everywhere is to embrace ad-hoc logic and simply select a different one on a different CPU architecture.

3) As bug 931 comment 5 says, it does seem that all the ad-hoc logic we may need only requires two cache-size parameters: essentially "L1" and "last level of cache per CPU core" (see the sketch below). With this, a lot of things simplify, and we're no longer blocked on staring at these big tables of data trying to find patterns. It's OK to be ad-hoc!
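As an illustration only (this is not Eigen's actual computeProductBlockingSizes, and the constants, mr/nr defaults and rounding rules are made up), a per-architecture heuristic driven by just those two cache sizes could look roughly like this:

  #include <algorithm>
  #include <cstddef>

  // Hypothetical sketch: derive (kc, mc, nc) for a float GEMM from only two
  // cache sizes, with hand-tuned, architecture-specific rules.
  struct BlockSizes { int kc, mc, nc; };

  BlockSizes adhocBlockingSizes(std::size_t l1, std::size_t lastLevelPerCore,
                                int m, int n, int k, int mr = 8, int nr = 4) {
    const std::size_t s = sizeof(float);
    // kc: let an mr x kc LHS strip plus a kc x nr RHS strip fit within L1.
    int kc = std::max(1, std::min<int>(int(l1 / (s * (mr + nr))), k));
    // mc: let the packed mc x kc LHS block use roughly half the last-level cache per core.
    int mc = int(lastLevelPerCore / (2 * s * std::size_t(kc)));
    // nc: likewise for the packed kc x nc RHS block.
    int nc = int(lastLevelPerCore / (2 * s * std::size_t(kc)));
    // Clamp to the problem size and round down to the register-level blocking.
    mc = std::max(mr, std::min(mc, m) / mr * mr);
    nc = std::max(nr, std::min(nc, n) / nr * nr);
    return {kc, mc, nc};
  }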
For the record, Xeon X5570 @ 2.93GHz (Nehalem), theoretical peak 23.44 GFLOPS:

nc\kc 16    32    48    64    80    128   160   240   320   400
4     15.6  18.2  18.9  19.4  19.4  19.9  19.7  20.1  20.1  20.2
8     15.6  18.5  19.2  19.6  19.8  20.1  20.1  20.4  20.5  20.5
16    14.1  18.4  19.4  19.8  20    20.3  20.4  20.6  20.7  20.7
32    12.8  17.9  19.4  19.8  20    20.3  20.3  20.6  20.7  20.8
48    12.9  17.9  19.4  19.8  20    20.3  20.4  20.6  20.8  20.8
64    12.7  17.7  19.4  19.8  20    20.3  20.4  20.7  20.8  20.8
80    12.8  17.7  19.4  19.8  20    20.4  20.4  20.7  20.8  20.9
128   12.8  17.7  19.4  19.8  20    20.4  20.5  20.7  20.8  20.9
160   12.8  17.8  19.4  19.8  20.1  20.4  20.5  20.8  20.8  20.9
240   12.8  17.8  19.4  19.8  20.1  20.4  20.5  20.7  20.8  20.9
320   12.8  17.7  19.4  19.8  20    20.3  20.5  20.7  20.8  20.9
400   13.1  17.7  19.2  19.7  19.9  20.4  20.5  20.7  20.8  20.9

Xeon E5-2670 v2 @ 2.50GHz (IvyBridge), theoretical peak = 40 GFLOPS:

nc\kc 16    32    48    64    80    128   160   240   320   400
4     26.4  30.3  31.1  30.8  31.2  31.5  32.4  32.9  33    32.8
8     28.7  32.2  33.2  32.7  33.9  34.6  35    35.1  34.8  34.2
16    29.8  32.4  34    34.8  35.1  36    36.1  36.3  36.2  35.5
32    30    32.6  34.5  35.5  35.9  36.4  36.5  37.1  37.2  36.3
48    30    32.9  34.4  35.5  35.8  36.7  36.8  37.4  37.5  36.5
64    30.2  33    34.7  35.8  36    37    37    37.5  37.6  36.7
80    30.4  33.1  34.8  35.9  36.1  37.1  37.1  37.6  37.8  36.8
128   30.6  33.1  35.1  36.1  36.4  37.3  37.3  37.8  38    36.9
160   30.5  33.2  35.2  36.2  36.5  37.4  37.4  37.9  38    36.9
240   30.7  33.5  35.3  36.3  36.6  37.5  37.6  38    38.1  37
320   30.8  33.5  35.4  36.4  36.7  37.5  37.6  38    38.2  37.1
400   30.8  33.6  35.4  36.5  36.7  37.5  37.7  38    38.2  37.1
Nexus 9:

nc\kc 16    32    48    64    80    128   160   240   320   400
4     15.2  15.1  15.1  15.1  15.1  15.1  15.1  15.1  15.1  15.1
8     15.2  11.6  15.1  15.2  15.2  15.1  15.2  15.1  15.2  15.1
16    15.2  14.7  15.1  15.1  15.1  15.1  15.2  15.2  15.1  15.1
32    15.2  14.8  15.1  15.2  15.1  15.2  15.1  15.2  15.1  15.1
48    15.1  14.8  15.1  15.1  15.2  15.1  15.2  15.2  15.1  15.1
64    15.1  14.5  15.1  15.1  15.1  15.1  15.1  15.1  15.1  15.1
80    15.2  14.9  15.1  15.1  15.2  15.1  15.1  15.1  15.1  15.1
128   15.2  14.9  15.1  15.2  15.1  15.1  15.1  15.1  15.1  15.1
160   15.1  15.2  15.1  15.1  15.1  15.1  15.1  15.2  15.2  15.1
240   15.1  15.1  15.1  15.1  15.1  15.1  15.1  15.1  15.2  15.2
320   15.1  15.1  15.1  15.1  15.2  15.1  15.1  15.2  15.2  15.1
400   15.1  15.1  15.1  15.1  15.2  15.1  15.2  15.1  15.2  15.1
Nexus 9 with larger block sizes:

nc\kc 256   384   512   640   768   896   1024  1152
256   15.6  15.6  15.4  15.3  15.5  15.5  15    15.3
384   15.6  15.5  15.6  15.3  15.1  15    15    15
512   15.6  15.5  15.6  15.3  15.1  15.4  15    14.9
640   15.6  15.5  15.6  15.3  15.2  15.4  15.3  14.9
768   15.6  15.5  15.6  15.2  15.2  15.4  14.9  14.9
896   15.5  15.5  15.6  15.2  15    15.4  14.9  14.9
1024  15.6  15.6  15.5  15.5  15.6  15.3  15    15.3
1152  15.5  15.5  15.3  15.6  15.5  15    15.3  15.5

The heuristic to determine the best block sizes on the Nexus 9 should be pretty straightforward ;)
Created attachment 754
Output of test program compiled with gcc

If this is still relevant, here is the output of the test program run on Knights Landing.

Setup: eigen 3.3.0 with
  #define EIGEN_TEST_SPECIFIC_BLOCKING_SIZES 1

g++ --version
g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-4)

g++ -DNDEBUG -O3 -std=c++11 -march=native prog.cpp -o prog -I ../libraries/eigen

and here is what the above command enables (g++ -march=native -Q --help=target), all reported as [enabled]:
  -m64 -m80387 -m96bit-long-double -mabm -madx -maes -malign-stringops -mavx -mavx2
  -mbmi -mbmi2 -mcx16 -mf16c -mfancy-math-387 -mfentry -mfma -mfp-ret-in-387 -mfsgsbase
  -mfxsr -mglibc -mhard-float -mieee-fp -mlong-double-80 -mlzcnt -mmovbe -mpclmul
  -mpopcnt -mprfchw -mpush-args -mrdrnd -mrdseed -mred-zone -msahf -msse -msse2 -msse3
  -msse4 -msse4.1 -msse4.2 -mssse3 -mstackrealign -mtls-direct-seg-refs -mxsave -mxsaveopt
Created attachment 755
Output of test program compiled with icpc

Same as the previous attachment, but compiled with the Intel compiler:

icpc --version
icpc (ICC) 17.0.0 20160721

icpc prog.cpp -I ../libraries/eigen -O3 -xMIC-AVX512 -o prog.icpc -DNDEBUG -std=c++11

../libraries/eigen/Eigen/src/Core/products/GeneralMatrixMatrix.h(429): (col. 3) warning #13212: Reference to EBX in function requiring stack alignment
Thanks for the log. There seem to be some performance issues with ICC: it is 3 times slower than GCC. Probably an inlining issue or register spilling.
-- GitLab Migration Automatic Message -- This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new issue on our GitLab instance: https://gitlab.com/libeigen/eigen/issues/937.