|Summary:||Implement faster GEMM kernel for AVX512|
|Component:||Core - matrix products||Assignee:||Nobody <eigen.nobody>|
|Severity:||Optimization||CC:||chtz, gael.guennebaud, hans.j.johnson, jacob.benoit.1, markos|
|Hardware:||x86 - AVX|
|Bug Depends on:||1633|
Description william.tambellini 2018-12-10 20:45:44 UTC
Created attachment 907: bench_matrix_vs_tensor.cpp

Good morning all,

I'm benchmarking Eigen master matmul on AWS C5: https://aws.amazon.com/ec2/instance-types/c5/

model name : Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat

The speedup of AVX512f compared to AVX/AVX2 is not that good:

ubuntu@ip-172-30-0-228:~/eigen-eigen-9f48e814419e/unsupported/bench$ ./buildrun.sh "-mavx2 -mfma"
g++-6 (Ubuntu 6.5.0-2ubuntu1~16.04) 6.5.0 20181026
Bench Eigen Matrix vs Tensor
Usage: program numberOfEigenThreads (default to 1)
GCC: 6.5.0 20181026
Eigen version: 3.3.90
Simd: AVX SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
Eigen::nbThreads: 1
EIGEN_NO_DEBUG
EIGEN_VECTORIZE
EIGEN_HAS_OPENMP: 201511
omp_get_num_threads: 1
Matmul: M=N=K
Repeat: 10
MNK     EMatrix       ETensor
256     0.00510111    0.00469228
512     0.0387391     0.0315717
1024    0.272151      0.263714
2048    2.24269       2.22049

ubuntu@ip-172-30-0-228:~/eigen-eigen-9f48e814419e/unsupported/bench$ ./buildrun.sh "-mavx512f -mavx512cd -mfma"
g++-6 (Ubuntu 6.5.0-2ubuntu1~16.04) 6.5.0 20181026
Bench Eigen Matrix vs Tensor
Usage: program numberOfEigenThreads (default to 1)
GCC: 6.5.0 20181026
Eigen version: 3.3.90
Simd: AVX512, FMA, AVX2, AVX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
Eigen::nbThreads: 1
EIGEN_NO_DEBUG
EIGEN_VECTORIZE
EIGEN_HAS_OPENMP: 201511
omp_get_num_threads: 1
Matmul: M=N=K
Repeat: 10
MNK     EMatrix       ETensor
256     0.00408185    0.00253315
512     0.0228611     0.0203946
1024    0.194071      0.170996
2048    1.52432       1.51736

Is it expected?

Kind
WT
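To put a number on "not that good", the ratio between the two runs can be computed directly from the EMatrix timings reported above. The snippet below is only an illustrative sketch: the names `Timing`, `kEMatrix`, `speedup` and `print_speedups` are made up for this example and are not part of the attached benchmark. The values are copied from the two logs.

```cpp
#include <cstdio>

// One row of the benchmark output: matrix size and the EMatrix timing
// (in the benchmark's own units) for each build.
struct Timing {
    int n;          // M = N = K
    double avx2;    // build with "-mavx2 -mfma"
    double avx512;  // build with "-mavx512f -mavx512cd -mfma"
};

// EMatrix column, copied from the two runs above.
const Timing kEMatrix[] = {
    {256,  0.00510111, 0.00408185},
    {512,  0.0387391,  0.0228611},
    {1024, 0.272151,   0.194071},
    {2048, 2.24269,    1.52432},
};

// Speedup of the AVX512 build over the AVX2 build for one row.
inline double speedup(const Timing& t) { return t.avx2 / t.avx512; }

// Print one line per matrix size.
inline void print_speedups() {
    for (const Timing& t : kEMatrix) {
        std::printf("%5d  x%.2f\n", t.n, speedup(t));
    }
}
```

Calling `print_speedups()` from a `main` gives ratios between roughly x1.25 and x1.7 for these four sizes, i.e. clearly below the x2 that doubling the SIMD width might suggest.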
Comment 1 Gael Guennebaud 2018-12-10 21:02:39 UTC
For this kind of benchmark, make sure that the frequencies of your CPUs are fixed. Nonetheless, I don't expect a x2 speedup, because with SIMD registers twice as large there is higher pressure on the cache bandwidth. This could be reduced by exploiting the additional registers offered by AVX512, but that has yet to be implemented.
Comment 2 Gael Guennebaud 2018-12-11 11:55:44 UTC
I did some experiments on a Xeon W-2155, and we're indeed far away from MKL's perf on AVX512:

float:  138 vs 186 GFlops
double:  60 vs  90 GFlops

I also tested larger kernels exploiting the additional registers for better pipelining and reduced loads/stores, in GFlops (for the kernel alone):

          float     double
3pX4 r    152.91    69.163    (current)
3pX8 r    166.905   74.3742
4pX4 r    161.998   75.5747
5pX4 r    176.016   83.0351
6pX4 r    189.619   88.7183

The clear winner is the 6pX4 kernel, which is good news because it's much simpler to integrate than the 3pX8 one. To make it even easier to integrate, I also tried a 6pX4 variant with the LHS packed as for the 3pX4 (i.e., as it is now), and the results are quite surprising:

6pX4 p3   200.65    96.5875

This is very promising! Hopefully I did not make a mistake (I triple checked, but I don't have unit tests for these micro kernels).
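A rough way to see why the larger kernels help is to count registers and loads per inner-loop step. The sketch below is a simplified model, not Eigen's actual register allocation: it assumes that a "PpXN" kernel keeps P*N accumulator registers, loads P LHS packets, and uses one extra register to broadcast each of the N RHS coefficients in turn. `KernelShape` and the helper functions are hypothetical names introduced only for this illustration.

```cpp
// Hypothetical model of a "PpXN" micro-kernel: P SIMD packets of the LHS
// combined with N RHS coefficients per step of the inner (k) loop.
struct KernelShape {
    int lhs_packets;  // P
    int rhs_cols;     // N
};

// One accumulator register per (LHS packet, RHS column) pair.
constexpr int accumulators(KernelShape s) {
    return s.lhs_packets * s.rhs_cols;
}

// Assumed budget: accumulators + P LHS registers + 1 RHS broadcast register.
constexpr int registers_needed(KernelShape s) {
    return accumulators(s) + s.lhs_packets + 1;
}

// Loads issued per FMA in one k-step: (P + N) loads for P*N FMAs.
constexpr double loads_per_fma(KernelShape s) {
    return static_cast<double>(s.lhs_packets + s.rhs_cols) / accumulators(s);
}

// AVX/AVX2 expose 16 vector registers on x86-64, AVX-512 exposes 32.
static_assert(registers_needed(KernelShape{3, 4}) == 16, "3pX4 fills 16 registers");
static_assert(registers_needed(KernelShape{6, 4}) == 31, "6pX4 fits in 32 registers");
static_assert(registers_needed(KernelShape{3, 8}) == 28, "3pX8 fits in 32 registers");
```

Under these assumptions the loads-per-FMA ratio drops from 7/12 for 3pX4 to 10/24 for 6pX4, and ranking the five kernel shapes by this ratio gives the same order as the measured GFlops in the table above. That agreement is suggestive only; the model ignores latency, packing layout and prefetching.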
Comment 3 Gael Guennebaud 2018-12-11 13:15:24 UTC
OK, I messed up a stride, so the last variant reusing the current packing is just as fast, not faster (this makes more sense):

6pX4 p3   188.077   88.0545
Comment 4 Gael Guennebaud 2018-12-11 14:53:44 UTC
A +10% for almost free, but no good rationale:
https://bitbucket.org/eigen/eigen/commits/b500fef42ced/

Summary: Artificially increase l1-blocking size for AVX512. +10% speedup with current kernels. With a 6pX4 kernel (not committed yet), this provides a +20% speedup.
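As a back-of-the-envelope illustration of how the l1 size and the packet width interact, the sketch below uses a deliberately simplified model and is not Eigen's actual blocking heuristic. It assumes the depth k is chosen so that one LHS panel (mr scalars per k-step) plus one RHS panel (nr scalars per k-step) fit in L1, rounded down to a multiple of 8. The 32 KiB L1 size, the function name `max_depth` and the `peel` parameter are all assumptions made for this example.

```cpp
#include <cstddef>

// Largest depth k, rounded down to a multiple of `peel`, such that an LHS
// panel of `mr` scalars per k-step plus an RHS panel of `nr` scalars per
// k-step fit in `l1_bytes`. Simplified model for illustration only.
constexpr std::size_t max_depth(std::size_t l1_bytes,
                                std::size_t mr,
                                std::size_t nr,
                                std::size_t scalar_bytes,
                                std::size_t peel = 8) {
    return (l1_bytes / ((mr + nr) * scalar_bytes)) / peel * peel;
}

// Assumed L1 data cache size.
constexpr std::size_t kL1 = 32 * 1024;

// 3pX4 in double: mr = 3 packets * 8 doubles with AVX512, 3 * 4 with AVX2.
static_assert(max_depth(kL1, 3 * 8, 4, sizeof(double)) == 144, "AVX512 3pX4");
static_assert(max_depth(kL1, 3 * 4, 4, sizeof(double)) == 256, "AVX2 3pX4");
```

In this model, doubling the packet width nearly halves the depth that fits in a fixed L1 budget (256 down to 144 for the 3pX4 double kernel), so a larger nominal l1 value restores a longer inner loop. Whether this accounts for the measured +10% is not established in this thread.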
Comment 5 Gael Guennebaud 2018-12-11 15:19:53 UTC
For the record, the previous trick (increasing l1) also yields a +2% speedup on some Haswell CPUs: +2% on a Xeon E5 v3, and no effect on my i7 laptop. For the 6pX4 kernel, better to wait for 1633 first.
Comment 6 Hans Johnson 2019-10-29 16:20:13 UTC
NOTE: #1633 has been committed. Is this ready for moving forward now?
Comment 7 Christoph Hertzberg 2019-10-30 16:22:00 UTC
I guess this is Gael's call (I assume you have the necessary changes essentially ready on your machine?)
Comment 8 Nobody 2019-12-04 18:16:06 UTC
-- GitLab Migration Automatic Message -- This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/1642.