New user self-registration is disabled due to spam. Please email eigen-core-team @ lists.tuxfamily.org if you need an account.
Before reporting a bug, please make sure that your Eigen version is up-to-date!
Bug 1684 - array_reverse fails with clang >=6 + AVX + -O2
Summary: array_reverse fails with clang >=6 + AVX + -O2
Status: DECISIONNEEDED
Alias: None
Product: Eigen
Classification: Unclassified
Component: Core - general (show other bugs)
Version: 3.4 (development)
Hardware: All All
: Normal Wrong Result
Assignee: Nobody
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 3.4
  Show dependency treegraph
 
Reported: 2019-02-21 17:06 UTC by Gael Guennebaud
Modified: 2019-03-13 09:53 UTC (History)
3 users (show)



Attachments

Description Gael Guennebaud 2019-02-21 17:06:52 UTC
Since the following changeset:

https://bitbucket.org/eigen/eigen/commits/a5e0ab8403

the unit test array_reverse_3 (and 4) fail at runtime line 113 with clang 6 and 7 with -O2 (or -O3) and AVX (or above).

This is the call to rowwise().reverseInPlace() which is failing (line 112).

This error is very suspicious because the above changeset is only adding a few "inline" and it is impossible to reproduce it on a standalone test. For instance, only one of following changes is enough to silent the issue:

- add exit(0); line 114 of array_reverse.cpp
- make reverseInPlace() no-inline
Comment 1 Gael Guennebaud 2019-02-21 17:16:54 UTC
There is nothing obvious from the generated ASM. In the failing case, it is fully inlined with:

	vmovaps	96(%rsp), %xmm0
	vmovaps	112(%rsp), %xmm1
	vmovaps	128(%rsp), %xmm2
	vmovaps	144(%rsp), %xmm3
	vmovaps	%xmm0, 144(%rsp)
	vmovaps	%xmm3, 96(%rsp)
	vmovaps	%xmm1, 128(%rsp)
	vmovaps	%xmm2, 112(%rsp)

otherwise (e.g., by adding exit(0)) the reverseInPlace function is not inlined, and its body is equivalent:

	vmovups	(%rax), %xmm0
	vmovups	16(%rax), %xmm1
	vmovups	32(%rax), %xmm2
	vmovups	48(%rax), %xmm3
	vmovups	%xmm0, 48(%rax)
	vmovups	%xmm3, (%rax)
	vmovups	%xmm1, 32(%rax)
	vmovups	%xmm2, 16(%rax)


By the way, here is what I get when it fails:

m1:
 -0.8572   -0.7386    0.5127    0.5553
0.006737    0.3555    0.9296  -0.05272
 -0.3065     0.316   -0.3913   -0.5955
 -0.7644   -0.1698   -0.1655    0.6492

m2 after m2=m1; m2.rowwise().reverseInPlace();
  0.9296    0.5127  0.006737   -0.8572
 -0.3913    0.9296   -0.3065  0.006737
 -0.1655   -0.3913   -0.7644   -0.3065
  0.5553   -0.1655   -0.7386   -0.7644

m1.rowwise().reverse():
  0.5553    0.5127   -0.7386   -0.8572
-0.05272    0.9296    0.3555  0.006737
 -0.5955   -0.3913     0.316   -0.3065
  0.6492   -0.1655   -0.1698   -0.7644

and recall that the implementation of reverseInPlace is:

Index half = xpr.cols()/2;
xpr.leftCols(half).swap(xpr.rightCols(half).rowwise().reverse());
Comment 2 Gael Guennebaud 2019-02-21 17:52:00 UTC
OK, so now I managed to reproduce with something as simple as:
template<typename T> EIGEN_DONT_INLINE void dowork(T& m1, T& m2)
{
  m2 = m1;
  m2.rowwise().reverseInPlace();
}

with T=Matrix4f.

The generated ASM is indeed wrong:

	vmovaps	(%rdi), %ymm0
	vmovaps	%ymm0, (%rsi)
	vmovaps	32(%rdi), %ymm1
	vperm2f128	$1, %ymm0, %ymm1, %ymm2 # ymm2 = ymm1[2,3,0,1]
	vblendps	$14, %ymm1, %ymm2, %ymm1 # ymm1 = ymm2[0],ymm1[1,2,3],ymm2[4,5,6,7]
	vmovaps	.LCPI22_0(%rip), %ymm2  # ymm2 = [1,2,3,0,4,5,6,7]
	vpermilps	%ymm2, %ymm1, %ymm1
	vmovups	%ymm1, (%rsi)
	vperm2f128	$1, %ymm0, %ymm0, %ymm1 # ymm1 = ymm0[2,3,0,1]
	vblendps	$14, %ymm0, %ymm1, %ymm0 # ymm0 = ymm1[0],ymm0[1,2,3],ymm1[4,5,6,7]
	vpermilps	%ymm2, %ymm0, %ymm0
	vmovups	%ymm0, 32(%rsi)

We are only generating loads and stores, its clang/llvm that converted them to vperm2f/vblend.

The AVX512 version is even more impressive:

	vmovaps	.LCPI22_0(%rip), %zmm0  # zmm0 = [3,4,5,6,2,3,4,5,1,2,3,4,0,1,2,3]
	vpermps	(%rdi), %zmm0, %zmm0
	vmovups	%zmm0, (%rsi)

but even more clearly wrong:

m1:
 0   4   8  12
 1   5   9  13
 2   6  10  14
 3   7  11  15

m2 after reverseInPlace:
3  2  1  0
4  3  2  1
5  4  3  2
6  5  4  3
Comment 3 Christoph Hertzberg 2019-02-21 18:59:54 UTC
Re-implementing `reverseInPlace` by something trivial exposes the same issue:

https://godbolt.org/z/bQSLNk

I did not manage to remove the `Matrix4f` dependency yet (but, should not be to complicated).

This looks clearly like a clang issue ...
Comment 4 Christoph Hertzberg 2019-02-21 19:26:35 UTC
Test case not depending on Eigen:
https://godbolt.org/z/sDVk1B

I can't find my llvm-bug account (I may never registered one). Can someone report this to bugs.llvm.org?


Question: How do we work-around this?
We could actually implement our own (correct) vpermps-based reverse operation (this would benefit gcc/msvc as well).
Comment 5 Gael Guennebaud 2019-02-21 20:30:22 UTC
Thanks, that's exactly what I wanted to try before reporting to clang, but this seems to be fixed in trunk.
Comment 6 Gael Guennebaud 2019-02-21 20:35:19 UTC
Regarding the workaround, implementing our own vperm based solution would be very tricky because we have to figure out that we can reverse and swap multiple columns at once. I'me also pretty sure that this issue could show up with other expressions.
Comment 7 Gael Guennebaud 2019-02-22 08:58:31 UTC
Submitted: https://bugs.llvm.org/show_bug.cgi?id=40815
Comment 8 Gael Guennebaud 2019-02-22 09:02:46 UTC
So far the only solution I can think of is to ban clang6/7+AVX with a brutal #error. This would be very bad, hopefully clang/llvm folks we'll have some hint for us.

Note that we could workaround reverseInPlace but we cannot prevent users to write:

void broken(Eigen::Matrix4f& m1, Eigen::Matrix4f& m2)
{
  m2 = m1;
  m2.col(0).swap(m2.col(3));
  m2.col(1).swap(m2.col(2));
}
Comment 9 Gael Guennebaud 2019-02-22 10:05:26 UTC
https://bitbucket.org/eigen/eigen/commits/117db395220b/
Summary:     Bug 1684: add simplified regression test for respective clang's bug (this also reveal the same bug in Apples's clang)
Comment 10 Christoph Hertzberg 2019-02-22 10:39:13 UTC
(In reply to Gael Guennebaud from comment #6)
> Regarding the workaround, implementing our own vperm based solution would be
> very tricky [...]

True, and I agree this won't fix the issue if the user manually swaps columns accordingly.

Still, I opened a low-priority bug for that: Bug 1685
Comment 11 Gael Guennebaud 2019-03-13 09:52:55 UTC
I applied a partial workaround working at the level of swap:

https://bitbucket.org/eigen/eigen/commits/f8f6872d6f8

This should be enough as reproducing this issue without re-implementing swap seems to be pretty tough.

Performance-wise, clang is brave enough to vectorize copies within swap on its own, so I don't expect any performance loss.

I really don't feel disabling clang+AVX{512}.

Note You need to log in before you can comment on or make changes to this bug.