This bugzilla service is closed. All entries have been migrated to https://gitlab.com/libeigen/eigen

Bug 1642

Summary: Implement faster GEMM kernel for AVX512
Product: Eigen Reporter: william.tambellini
Component: Core - matrix products    Assignee: Nobody <eigen.nobody>
Status: CONFIRMED
Severity: Optimization CC: chtz, gael.guennebaud, hans.j.johnson, jacob.benoit.1, markos
Priority: Normal    
Version: 3.4 (development)   
Hardware: x86 - AVX   
OS: All   
Whiteboard:
Bug Depends on: 1633    
Bug Blocks: 1608    
Attachments:
Description Flags
bench_matrix_vs_tensor.cpp none

Description william.tambellini 2018-12-10 20:45:44 UTC
Created attachment 907 [details]
bench_matrix_vs_tensor.cpp

Good morning all,
I'm benchmarking Eigen master matmul on an AWS C5 instance:
https://aws.amazon.com/ec2/instance-types/c5/
model name      : Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat

The speedup of AVX512F compared to AVX/AVX2 is not that good:

ubuntu@ip-172-30-0-228:~/eigen-eigen-9f48e814419e/unsupported/bench$ ./buildrun.sh "-mavx2 -mfma"
g++-6 (Ubuntu 6.5.0-2ubuntu1~16.04) 6.5.0 20181026
Bench Eigen Matrix vs Tensor
Usage: program numberOfEigenThreads (default to 1)
GCC: 6.5.0 20181026
Eigen version: 3.3.90
Simd: AVX SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
Eigen::nbThreads: 1
EIGEN_NO_DEBUG
EIGEN_VECTORIZE
EIGEN_HAS_OPENMP: 201511
omp_get_num_threads: 1
Matmul: M=N=K
Repeat: 10
             MNK         EMatrix         ETensor
             256      0.00510111      0.00469228
             512       0.0387391       0.0315717
            1024        0.272151        0.263714
            2048         2.24269         2.22049


ubuntu@ip-172-30-0-228:~/eigen-eigen-9f48e814419e/unsupported/bench$ ./buildrun.sh "-mavx512f -mavx512cd -mfma"
g++-6 (Ubuntu 6.5.0-2ubuntu1~16.04) 6.5.0 20181026
Bench Eigen Matrix vs Tensor
Usage: program numberOfEigenThreads (default to 1)
GCC: 6.5.0 20181026
Eigen version: 3.3.90
Simd: AVX512, FMA, AVX2, AVX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
Eigen::nbThreads: 1
EIGEN_NO_DEBUG
EIGEN_VECTORIZE
EIGEN_HAS_OPENMP: 201511
omp_get_num_threads: 1
Matmul: M=N=K
Repeat: 10
             MNK         EMatrix         ETensor
             256      0.00408185      0.00253315
             512       0.0228611       0.0203946
            1024        0.194071        0.170996
            2048         1.52432         1.51736


Is this expected?
Kind regards,
WT
Comment 1 Gael Guennebaud 2018-12-10 21:02:39 UTC
For this kind of benchmark, make sure the frequencies of your CPUs are fixed. Nonetheless, I don't expect a 2x speedup: with 2x larger SIMD registers there is higher pressure on cache bandwidth. This could be reduced by exploiting the additional registers offered by AVX512, but that has yet to be implemented.
Comment 2 Gael Guennebaud 2018-12-11 11:55:44 UTC
I did some experiments on a Xeon W-2155, and we're indeed far away from MKL's performance on AVX512:

float: 138 vs 186 GFlops
double: 60 vs  90 GFlops

I also tested larger kernels exploiting the additional registers for better pipelining and fewer loads/stores; numbers in GFlops (for the kernel alone):

          float         double
3pX4  r   152.91 	69.163    (current)
3pX8  r   166.905 	74.3742
4pX4  r   161.998 	75.5747
5pX4  r   176.016 	83.0351
6pX4  r   189.619 	88.7183

The clear winner is the 6pX4 kernel, which is good news because it is much simpler to integrate than the 3pX8 one. To make it even easier to integrate, I also tried a 6pX4 variant with the LHS packed as for the 3pX4 (i.e., as it is now), and the results are quite surprising:

6pX4  p3  200.65 	96.5875

This is very promising!

Hopefully I did not make a mistake (I triple checked, but I don't have unit tests for these micro kernels).
Comment 3 Gael Guennebaud 2018-12-11 13:15:24 UTC
OK, I messed up a stride, so the last variant reusing the current packing is just as fast, not faster (which makes more sense):

6pX4  p3  188.077 	88.0545
Comment 4 Gael Guennebaud 2018-12-11 14:53:44 UTC
A +10% speedup for almost free, though without a good rationale:

https://bitbucket.org/eigen/eigen/commits/b500fef42ced/
Summary:     Artificially increase l1-blocking size for AVX512. +10% speedup with current kernels.
With a 6pX4 kernel (not committed yet), this provides a +20% speedup.
Comment 5 Gael Guennebaud 2018-12-11 15:19:53 UTC
For the record, the previous trick (increasing the l1 blocking size) also helps on some Haswell CPUs: +2% on a Xeon E5 v3, and no effect on my i7 laptop.

For the 6pX4 kernel, better to wait for bug 1633 first.
Comment 6 Hans Johnson 2019-10-29 16:20:13 UTC
NOTE: #1633 has been committed. Is this ready to move forward now?
Comment 7 Christoph Hertzberg 2019-10-30 16:22:00 UTC
I guess this is Gael's call (I assume you have the necessary changes essentially ready on your machine?)
Comment 8 Nobody 2019-12-04 18:16:06 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/1642.