Summary: | Implement faster GEMM kernel for AVX512 | ||||||
---|---|---|---|---|---|---|---|
Product: | Eigen | Reporter: | william.tambellini | ||||
Component: | Core - matrix products | Assignee: | Nobody <eigen.nobody> | ||||
Status: | CONFIRMED --- | ||||||
Severity: | Optimization | CC: | chtz, gael.guennebaud, hans.j.johnson, jacob.benoit.1, markos | ||||
Priority: | Normal | ||||||
Version: | 3.4 (development) | ||||||
Hardware: | x86 - AVX | ||||||
OS: | All | ||||||
Whiteboard: | |||||||
Bug Depends on: | 1633 | ||||||
Bug Blocks: | 1608 | ||||||
Attachments: |
|
Description
william.tambellini
2018-12-10 20:45:44 UTC
For this kind of benchmark, make sure that the frequencies of your CPUs are fixed. Nonetheless, I don't expect a x2 speedup, because with x2 larger SIMD registers there is higher pressure on the cache bandwidth. That pressure could be reduced by exploiting the additional registers offered by AVX512, but this has yet to be implemented.

I did some experiments on a Xeon W-2155, and we're indeed far from MKL's performance on AVX512: float: 138 vs 186 GFlops; double: 60 vs 90 GFlops.

I also tested larger kernels exploiting the additional registers for better pipelining and fewer loads/stores. Results in GFlops (for the kernel alone):

Kernel | float | double |
---|---|---|
3pX4 r (current) | 152.91 | 69.163 |
3pX8 r | 166.905 | 74.3742 |
4pX4 r | 161.998 | 75.5747 |
5pX4 r | 176.016 | 83.0351 |
6pX4 r | 189.619 | 88.7183 |

The clear winner is the 6pX4 kernel, which is good news because it's much simpler to integrate than the 3pX8 one. To make it even easier to integrate, I also tried a 6pX4 variant with the LHS packed as for the 3pX4 (i.e., as it is now), and the results are quite surprising:

6pX4 p3: 200.65 (float), 96.5875 (double)

This is very promising! Hopefully I did not make a mistake (I triple checked, but I don't have unit tests for these micro kernels).

OK, I messed up a stride, so the last variant reusing the current packing is just as fast, not faster (this makes more sense):

6pX4 p3: 188.077 (float), 88.0545 (double)

A +10% speedup for almost free, but with no good rationale: https://bitbucket.org/eigen/eigen/commits/b500fef42ced/

Summary: Artificially increase l1-blocking size for AVX512. +10% speedup with current kernels. With a 6pX4 kernel (not committed yet), this provides a +20% speedup.

For the record, the previous trick (increasing l1) also yields a +2% speedup on some Haswell CPUs: +2% on a Xeon E5 v3, and no effect on my i7 laptop.

For the 6pX4 kernel, better to wait for 1633 first.

NOTE: #1633 has been committed. Is this ready for moving forward now? I guess this is Gael's call (I assume you have the necessary changes essentially ready on your machine?)
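To make the register argument above a bit more concrete, here is a rough sketch of the register budget and load traffic for the kernel shapes mentioned in this thread. It assumes that "MpXN" means M LHS packets by N RHS columns, and that a kernel needs one accumulator per (packet, column) pair, M registers for the LHS packets and one register for the broadcast RHS coefficient. That accounting, and all names in the code (`KernelShape`, `registers_needed`, `loads_per_fma`, `print_table`), are assumptions made for illustration; this is not Eigen's kernel code, and a real kernel may allocate registers differently.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical, simplified model of a GEMM micro kernel of shape "MpXN":
// M packets (SIMD registers) of the LHS times N columns of the RHS.
// Illustration only; not taken from Eigen's sources.
struct KernelShape {
  const char* name;
  int lhs_packets;  // M
  int rhs_cols;     // N
};

// One accumulator register per (LHS packet, RHS column) pair.
constexpr int accumulators(KernelShape k) {
  return k.lhs_packets * k.rhs_cols;
}

// Assumed budget: accumulators + M LHS packets + 1 broadcast RHS register.
constexpr int registers_needed(KernelShape k) {
  return accumulators(k) + k.lhs_packets + 1;
}

// Per step along the depth: M packet loads and N broadcasts feed M*N FMAs.
inline double loads_per_fma(KernelShape k) {
  assert(k.lhs_packets > 0 && k.rhs_cols > 0);
  return static_cast<double>(k.lhs_packets + k.rhs_cols) / accumulators(k);
}

constexpr int kAvx2Registers = 16;    // ymm0..ymm15 in 64-bit mode
constexpr int kAvx512Registers = 32;  // zmm0..zmm31 in 64-bit mode

constexpr KernelShape kShapes[] = {
    {"3pX4", 3, 4}, {"3pX8", 3, 8}, {"4pX4", 4, 4},
    {"5pX4", 5, 4}, {"6pX4", 6, 4}};

// Prints one line per kernel shape with its modeled register use.
inline void print_table() {
  for (const KernelShape& k : kShapes) {
    const int regs = registers_needed(k);
    std::printf("%-5s regs=%2d loads/FMA=%.3f fits16=%s fits32=%s\n",
                k.name, regs, loads_per_fma(k),
                regs <= kAvx2Registers ? "yes" : "no",
                regs <= kAvx512Registers ? "yes" : "no");
  }
}
```

Calling `print_table()` from a `main` function lists the five shapes. Under these assumptions the current 3pX4 shape uses exactly the 16 registers available with AVX, while 6pX4 needs 31 of the 32 registers available with AVX512 and issues fewer loads per FMA. This is only a counting argument; it does not predict the measured GFlops.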
-- GitLab Migration Automatic Message -- This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/1642. |