New user self-registration is disabled due to spam. Please email eigen-core-team @ lists.tuxfamily.org if you need an account.
Before reporting a bug, please make sure that your Eigen version is up-to-date!
Bug 973 - vectorization_logic fails on AVX
Summary: vectorization_logic fails on AVX
Status: RESOLVED FIXED
Alias: None
Product: Eigen
Classification: Unclassified
Component: Core - vectorization (show other bugs)
Version: unspecified
Hardware: x86 - AVX All
: Highest Internal Design
Assignee: Nobody
URL:
Whiteboard:
Keywords: performance
Depends on:
Blocks: 3.3
  Show dependency treegraph
 
Reported: 2015-02-27 21:27 UTC by Benoit Jacob
Modified: 2015-09-02 09:02 UTC (History)
5 users (show)



Attachments
re-enable vectorization of Vector4i and alignment of Vector4f (14.27 KB, patch)
2015-03-09 16:08 UTC, Gael Guennebaud
no flags Details | Diff
Refactoring of the macro-level control of alignment. (43.76 KB, patch)
2015-07-24 09:25 UTC, Gael Guennebaud
no flags Details | Diff

Description Benoit Jacob 2015-02-27 21:27:34 UTC
With EIGEN_TEST_AVX=ON, vectorization_logic fails like this:

 Expected Traversal == InnerVectorizedTraversal got LinearTraversal
 Expected Unrolling == CompleteUnrolling got CompleteUnrolling
Test vectorization_logic<int>::run() failed in /usr/local/google/home/benoitjacob/eigen/test/vectorization_logic.cpp (142)
    test_assign(Vector1(),Vector1(), InnerVectorizedTraversal,CompleteUnrolling)
Stack:
  - vectorization_logic<int>::run()
  - vectorization_logic
  - Seed: 1425068736
Comment 1 Christoph Hertzberg 2015-02-27 21:48:19 UTC
I thought I filed a bug for this already, but apparently, I didn't.

The problem is that on AVX Vector4i is vectorized, but sizeof(Vector4i)==16 and AlignmentSize is 32.
Actually, this exposes another problem, namely that Vector4f and Vector2d expressions do not get vectorized anymore with AVX (even though it would be possible using HalfPackets).

See this example

  #include <Eigen/Core>
  void testFun(Eigen::Vector2d& res, const Eigen::Vector2d& a, 
               const Eigen::Vector2d& b) { res = a+b; }

and compile with -O2 -S -mavx (stripped uninteresting parts):
        vmovsd  (%rsi), %xmm0
        vaddsd  (%rdx), %xmm0, %xmm0
        vmovsd  %xmm0, (%rdi)
        vmovsd  8(%rsi), %xmm0
        vaddsd  8(%rdx), %xmm0, %xmm0
        vmovsd  %xmm0, 8(%rdi)
compared to compiling with -O2 -S -msse2:
        movapd  (%rsi), %xmm0
        addpd   (%rdx), %xmm0
        movapd  %xmm0, (%rdi)
Comment 2 Benoit Jacob 2015-02-27 23:55:09 UTC
Right! So this test failure is a very important problem - and it's good that we have at least one test that fails for that. I had come across this regression too recently, see http://listengine.tuxfamily.org/lists.tuxfamily.org/eigen/2015/01/msg00042.html

There's consensus that this should block 3.3; the question is do we want to properly fix this in 3.3 so that enabling AVX doesn't disable SSE anymore, or do we want to make AVX opt-in i.e. not only you have to pass -mavx but you also need to pass, say, -DENABLE_AVX_AND_DISABLE_SSE, so that only people who specifically want this behavior get it? Ugly, but easy and makes this a non-regression so could unblock 3.3.
Comment 3 Gael Guennebaud 2015-02-28 15:46:00 UTC
Since switching between SSE and AVX is costly, we could make AVX => no SSE by default. Then, in order to ease the use of multiple vector engine in the same application, we could template the evaluators and product kernels by the vectorization engine. This way one could enable AVX for the translation units where it makes sense, and SSE for integer or, as a current workaround, small float-based computations.

Nonetheless, we should still make sure that Vector4f/Vector4i/Vetor2i are properly aligned to a 16 bytes boundary even if they are not vectorized yet in AVX.
Comment 4 Christoph Hertzberg 2015-02-28 16:04:26 UTC
Switching between SSE and AVX is indeed costly, which is why compilers disable SSE if AVX is enabled.
Fortunately, you can still call _mm_add_pd and so on from AVX code. This is then automatically translated into the corresponding VEX.128 functions. See e.g.:

$ cat avxtest.cpp
#include <emmintrin.h>
void test(double * ptr) {
  __m128d p = _mm_loadu_pd(ptr);
  p = _mm_add_pd(p, p);
  _mm_storeu_pd(ptr, p);
}

$ g++ -msse2 avxtest.cpp -S -O2 && tail -6 avxtest.s | head -3
        movupd  (%rcx), %xmm0
        addpd   %xmm0, %xmm0
        movups  %xmm0, (%rcx)

$ g++ -mavx avxtest.cpp -S -O2 && tail -6 avxtest.s | head -3
        vmovupd (%rcx), %xmm0
        vaddpd  %xmm0, %xmm0, %xmm0
        vmovups %xmm0, (%rcx)
Comment 5 Gael Guennebaud 2015-02-28 23:11:36 UTC
yes, this is how we can support both Packet4f and Packet8f, but I was thinking about Packet4i (SSE only or AVX2) and Packet8f (AVX only)
Comment 6 Christoph Hertzberg 2015-03-01 01:17:42 UTC
Packet4i still works with AVX. In AVX mode this also basically just adds the VEX prefix to the instruction code:
http://en.wikipedia.org/wiki/VEX_prefix
All they "forgot" to implement in AVX were 256bit integer operations.

To solve the actual problem, perhaps meta packets are again the solution.
This would easily allow to handle all cases of Packet<float,4>, Packet<float,8>, Packet<int,4> and Packet<int,8> merely by having generic implementations and different specializations for SSE, AVX and AVX2.
We could also avoid lots of code duplication in /src/Core/arch/*/MathFunctions.h
Comment 7 Gael Guennebaud 2015-03-09 16:08:19 UTC
Created attachment 554 [details]
re-enable vectorization of Vector4i and alignment of Vector4f

This patch enable the alignment of Vector2d, Vector4f, and Vector4i when AVX is enabled. It also re-enable vectorization of Vector4i with AVX. Vectorization of Vector2d/Vector4f still required deeper changes as discussed in other entries.
Comment 8 Gael Guennebaud 2015-03-13 21:20:08 UTC
Patch applied:

https://bitbucket.org/eigen/eigen/commits/d8e950d48cb1/
Changeset:   d8e950d48cb1
User:        ggael
Date:        2015-03-13 20:15:50+00:00
Summary:     Bug 973, improve AVX support by enabling vectorization of Vector4i-like types, and enforcing alignement of Vector4f/Vector2d-like types to preserve compatibility with SSE and future Eigen versions that will vectorize them with AVX enabled.

This should be enough to close this specific bug as enabling vectorization through half-packet types is another topic and not a 3.3 blocker.
Comment 9 Christoph Hertzberg 2015-03-13 21:56:58 UTC
Sorry, for not having responded earlier, but doesn't the first part of this condition prevent Vector6d, Matrix<float, 4,3>, etc from getting aligned?

  bool TryHalf   =  bool(unpacket_traits<Packet>::size > Size)
                 && ...

Could we do some calculation using
  min((Size & -Size), EIGEN_MAX_ALIGN/sizeof(Scalar))
(and cap it to zero, for values < HalfPacket::size)

Also, is it possible somehow to write code that always uses AVX-like alignment, even for SSE code? (As it was the default for non-SSE code to be 16 byte aligned)
I'm actually ok, with making EIGEN_MAX_ALIGN=32 the default starting from Eigen3.3 (even for Matrix<int, 8, 1> to ensure AVX2 compatibility). We can't prevent some minor API breaks by introducing AVX, anyways. And compiling with
  -DEIGEN_MAX_ALIGN=16
should be easy enough for the experienced user.

Another thing: What shall the behavior of AlignedMaps be? I guess, making the alignment depend on the architecture could lead to confusing errors, when switching between both, e.g., the following will break with AVX:

  Matrix<double, 6, 1> A;  // should be 16 byte aligned on all architectures
  Vector4d::AlignedMapType head(A.data());   // both are 16 byte aligned, but 
  Vector4d::AlignedMapType tail(A.data()+2); // only one is 32 byte aligned

I'm afraid, for maximal compatibility and performance, we'll need to provide both a 16 and a 32 byte aligned (and an unaligned) Map.


I agree that we don't need to block 3.3 for the actual vectorization of half-packets.
Comment 10 Gael Guennebaud 2015-03-13 22:54:15 UTC
Ah indeed, I completely overlooked the (2*n+1)*half-packet-size cases.


I'd rather keep the current 16 bytes default alignment, and ask experienced users to compile their SSE-only code with -DEIGEN_MAX_ALIGN=32 if they want AVX compatibility.


It is clear to me that Map should accept Aligned16/Aligned32 flags. The question is what should be the meaning of the not specific "Aligned" flag, i.e.,  Map<...,Aligned> and Matrix::AlignedMap? I'm tempted to be conservative and interpret it as Aligned16.
Comment 11 Gael Guennebaud 2015-04-01 13:58:01 UTC
https://bitbucket.org/eigen/eigen/commits/d7c36b69bcac/
Date:        2015-04-01 11:55:09+00:00
Summary:     Bug 973: enable alignment of multiples of half-packet size (e.g., Vector6d with AVX)

Still have to work on Aligned* flags and a user controllable EIGEN_MAX_ALIGN.
Comment 12 Christoph Hertzberg 2015-04-01 23:07:06 UTC
(In reply to Gael Guennebaud from comment #11)
> Still have to work on Aligned* flags and a user controllable EIGEN_MAX_ALIGN.

Some questions/remarks on that:
* Shall Aligned16/Aligned32 flags be affected by EIGEN_MAX_ALIGN?
* Shall the Aligned* flags be usable for Plain types and interpreted as "ForceAlign", i.e., make Matrix<float, 3, 1, Aligned16> replace the (unsupported) AlignedVector3<float>? The same goes for Aligned Maps and Refs, i.e., we could support aligned maps to types which don't have a multiple of the alignment-size. This can still be beneficial, because all but the tail of the data can be accessed by aligned loads (as it is done for dynamic sizes already). Alternatively, Aligned16 could imply that we are allowed to read/overwrite excessive data after the actual data (which is what AlignedVector3 does already) -- this could be really beneficial for vectorization (care has to be taken with reductions).
* With the former, Matrix<float,4,1,Aligned16> and Matrix<float,4,1,AutoAlign> would be equivalent (unless EIGEN_MAX_ALIGN=0).
* Shall we allow EIGEN_MAX_ALIGN>32? I don't see a reason why not -- on the one hand for future development, on the other hand, people may want to (mis-) use this feature to force cache alignment. Of course, only 0 or powers of 2 shall be allowed.
* Should types for which we don't have vectorization (i.e., also no packet_traits<...>) be aligned? This would be important for people who use Array<short, ...> and want to operate on that with hand-coded SIMD instructions (actually, we could consider supporting [u]int{8,16} in the future -- this has many applications, e.g., in computer vision)
Comment 13 Gael Guennebaud 2015-06-09 12:28:30 UTC
(In reply to Christoph Hertzberg from comment #12)
> Some questions/remarks on that:

> * Shall Aligned16/Aligned32 flags be affected by EIGEN_MAX_ALIGN?

I don't think so. EIGEN_MAX_ALIGN should only control the effect of the default AutoAlign.

> * Shall the Aligned* flags be usable for Plain types and interpreted as
> "ForceAlign", i.e., make Matrix<float, 3, 1, Aligned16> replace the
> (unsupported) AlignedVector3<float>? The same goes for Aligned Maps and
> Refs, i.e., we could support aligned maps to types which don't have a
> multiple of the alignment-size. This can still be beneficial, because all
> but the tail of the data can be accessed by aligned loads (as it is done for
> dynamic sizes already). Alternatively, Aligned16 could imply that we are
> allowed to read/overwrite excessive data after the actual data (which is
> what AlignedVector3 does already) -- this could be really beneficial for
> vectorization (care has to be taken with reductions).


From a user point of view, the main use case will be in Map/Ref. In this case they will provide at compile-time the minimum alignment of the provided pointer, and nothing more. In particular, I don't think that this could allow us to read/write outside the bounds. We would need another flag for that.

A related welcome feature would be to offer the possibility to specify that all columns/rows are aligned the same say, or more generally, to set at compile-time that the outerstride is a multiple of some given number.


In Matrix<>/Array<>, IMO, they should only define an upper bound, that is, it will be ignored if the matrix size if not a multiple of the requested alignment. Some examples:

Array<float,8,1,Aligned32> with SSE -> aligned on 32bytes
Array<float,8,1,Aligned16> with AVX -> aligned on 16bytes only
Array<float,4,1,Aligned32> with ??? -> aligned on 16bytes only
Array<float,9,1,Aligned32> with ??? -> not aligned

In the future, we could imagine introducing an additional option to enable padding and thus enforce full alignment:

Array<float,7,7,Aligned32> -> will occupy a block of 8x8 floats with outer-stride=8.
Array<float,Dynamic,Dynamic,Aligned32> -> will exhibit an outer stride multiple of 8.

but that's another topic!



> * With the former, Matrix<float,4,1,Aligned16> and
> Matrix<float,4,1,AutoAlign> would be equivalent (unless EIGEN_MAX_ALIGN=0).

It would also be equivalent to Matrix<float,4,1,Align32>


> * Shall we allow EIGEN_MAX_ALIGN>32? I don't see a reason why not -- on the
> one hand for future development, on the other hand, people may want to
> (mis-) use this feature to force cache alignment. Of course, only 0 or
> powers of 2 shall be allowed.

yes we should allow EIGEN_MAX_ALIGN>32. Since EIGEN_MAX_ALIGN should define only an upper-bound, misuses should be limited. 


> * Should types for which we don't have vectorization (i.e., also no
> packet_traits<...>) be aligned? This would be important for people who use
> Array<short, ...> and want to operate on that with hand-coded SIMD
> instructions (actually, we could consider supporting [u]int{8,16} in the
> future -- this has many applications, e.g., in computer vision)

yes, I think that Array<T,Rows,Cols,Aligned32> should be aligned on 32 bytes for any T, as long as:

  sizeof(T)*Rows*Cols == k * 32
Comment 14 Gael Guennebaud 2015-07-24 09:25:29 UTC
Created attachment 593 [details]
Refactoring of the macro-level control of alignment.

Here is a first patch implementing the proposed novel mechanism to control alignment. This first patch focuses on the set of macros.

Changeset summary:
"Bug 973: update macro-level control of alignement by introducing user-controllable EIGEN_MAX_ALIGN_BYTES and EIGEN_MAX_STATIC_ALIGN_BYTES macros. This changeset also removes EIGEN_ALIGN (replaced by EIGEN_MAX_ALIGN_BYTES>0), EIGEN_ALIGN_STATICALLY (replaced by EIGEN_MAX_STATIC_ALIGN_BYTES>0), EIGEN_USER_ALIGN*, EIGEN_ALIGN_DEFAULT (replaced by EIGEN_ALIGN_MAX)."

The most interesting changes are visible in Macros.h and DenseStorage.h

The next step will be to add the Aligned16/32/64 constants to expressions and to propagate this information.
Comment 15 Gael Guennebaud 2015-07-29 08:25:37 UTC
I pushed an updated version of the previous patch:
https://bitbucket.org/eigen/eigen/commits/9bdd0b78e5ec/
Comment 16 Gael Guennebaud 2015-09-02 09:02:23 UTC
This changeset ece6e5c0b64d enable vectorization with half-packets, so I think we can close that one. Feel free to reopen if I oversaw some aspects.

Note You need to log in before you can comment on or make changes to this bug.