Summary: | array_reverse fails with clang >=6 + AVX + -O2 | ||
---|---|---|---|
Product: | Eigen | Reporter: | Gael Guennebaud <gael.guennebaud> |
Component: | Core - general | Assignee: | Nobody <eigen.nobody> |
Status: | DECISIONNEEDED --- | ||
Severity: | Wrong Result | CC: | chtz, gael.guennebaud, jacob.benoit.1 |
Priority: | Normal | ||
Version: | 3.4 (development) | ||
Hardware: | All | ||
OS: | All | ||
Whiteboard: | |||
Bug Depends on: | |||
Bug Blocks: | 814 |
Description
Gael Guennebaud
2019-02-21 17:06:52 UTC
There is nothing obvious from the generated ASM. In the failing case, it is fully inlined with: vmovaps 96(%rsp), %xmm0 vmovaps 112(%rsp), %xmm1 vmovaps 128(%rsp), %xmm2 vmovaps 144(%rsp), %xmm3 vmovaps %xmm0, 144(%rsp) vmovaps %xmm3, 96(%rsp) vmovaps %xmm1, 128(%rsp) vmovaps %xmm2, 112(%rsp) otherwise (e.g., by adding exit(0)) the reverseInPlace function is not inlined, and its body is equivalent: vmovups (%rax), %xmm0 vmovups 16(%rax), %xmm1 vmovups 32(%rax), %xmm2 vmovups 48(%rax), %xmm3 vmovups %xmm0, 48(%rax) vmovups %xmm3, (%rax) vmovups %xmm1, 32(%rax) vmovups %xmm2, 16(%rax) By the way, here is what I get when it fails: m1: -0.8572 -0.7386 0.5127 0.5553 0.006737 0.3555 0.9296 -0.05272 -0.3065 0.316 -0.3913 -0.5955 -0.7644 -0.1698 -0.1655 0.6492 m2 after m2=m1; m2.rowwise().reverseInPlace(); 0.9296 0.5127 0.006737 -0.8572 -0.3913 0.9296 -0.3065 0.006737 -0.1655 -0.3913 -0.7644 -0.3065 0.5553 -0.1655 -0.7386 -0.7644 m1.rowwise().reverse(): 0.5553 0.5127 -0.7386 -0.8572 -0.05272 0.9296 0.3555 0.006737 -0.5955 -0.3913 0.316 -0.3065 0.6492 -0.1655 -0.1698 -0.7644 and recall that the implementation of reverseInPlace is: Index half = xpr.cols()/2; xpr.leftCols(half).swap(xpr.rightCols(half).rowwise().reverse()); OK, so now I managed to reproduce with something as simple as: template<typename T> EIGEN_DONT_INLINE void dowork(T& m1, T& m2) { m2 = m1; m2.rowwise().reverseInPlace(); } with T=Matrix4f. The generated ASM is indeed wrong: vmovaps (%rdi), %ymm0 vmovaps %ymm0, (%rsi) vmovaps 32(%rdi), %ymm1 vperm2f128 $1, %ymm0, %ymm1, %ymm2 # ymm2 = ymm1[2,3,0,1] vblendps $14, %ymm1, %ymm2, %ymm1 # ymm1 = ymm2[0],ymm1[1,2,3],ymm2[4,5,6,7] vmovaps .LCPI22_0(%rip), %ymm2 # ymm2 = [1,2,3,0,4,5,6,7] vpermilps %ymm2, %ymm1, %ymm1 vmovups %ymm1, (%rsi) vperm2f128 $1, %ymm0, %ymm0, %ymm1 # ymm1 = ymm0[2,3,0,1] vblendps $14, %ymm0, %ymm1, %ymm0 # ymm0 = ymm1[0],ymm0[1,2,3],ymm1[4,5,6,7] vpermilps %ymm2, %ymm0, %ymm0 vmovups %ymm0, 32(%rsi) We are only generating loads and stores, its clang/llvm that converted them to vperm2f/vblend. The AVX512 version is even more impressive: vmovaps .LCPI22_0(%rip), %zmm0 # zmm0 = [3,4,5,6,2,3,4,5,1,2,3,4,0,1,2,3] vpermps (%rdi), %zmm0, %zmm0 vmovups %zmm0, (%rsi) but even more clearly wrong: m1: 0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15 m2 after reverseInPlace: 3 2 1 0 4 3 2 1 5 4 3 2 6 5 4 3 Re-implementing `reverseInPlace` by something trivial exposes the same issue: https://godbolt.org/z/bQSLNk I did not manage to remove the `Matrix4f` dependency yet (but, should not be to complicated). This looks clearly like a clang issue ... Test case not depending on Eigen: https://godbolt.org/z/sDVk1B I can't find my llvm-bug account (I may never registered one). Can someone report this to bugs.llvm.org? Question: How do we work-around this? We could actually implement our own (correct) vpermps-based reverse operation (this would benefit gcc/msvc as well). Thanks, that's exactly what I wanted to try before reporting to clang, but this seems to be fixed in trunk. Regarding the workaround, implementing our own vperm based solution would be very tricky because we have to figure out that we can reverse and swap multiple columns at once. I'me also pretty sure that this issue could show up with other expressions. Submitted: https://bugs.llvm.org/show_bug.cgi?id=40815 So far the only solution I can think of is to ban clang6/7+AVX with a brutal #error. This would be very bad, hopefully clang/llvm folks we'll have some hint for us. Note that we could workaround reverseInPlace but we cannot prevent users to write: void broken(Eigen::Matrix4f& m1, Eigen::Matrix4f& m2) { m2 = m1; m2.col(0).swap(m2.col(3)); m2.col(1).swap(m2.col(2)); } https://bitbucket.org/eigen/eigen/commits/117db395220b/ Summary: Bug 1684: add simplified regression test for respective clang's bug (this also reveal the same bug in Apples's clang) (In reply to Gael Guennebaud from comment #6) > Regarding the workaround, implementing our own vperm based solution would be > very tricky [...] True, and I agree this won't fix the issue if the user manually swaps columns accordingly. Still, I opened a low-priority bug for that: Bug 1685 I applied a partial workaround working at the level of swap: https://bitbucket.org/eigen/eigen/commits/f8f6872d6f8 This should be enough as reproducing this issue without re-implementing swap seems to be pretty tough. Performance-wise, clang is brave enough to vectorize copies within swap on its own, so I don't expect any performance loss. I really don't feel disabling clang+AVX{512}. -- GitLab Migration Automatic Message -- This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/1684. |