Summary: | vectorization_logic fails on AVX | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Eigen | Reporter: | Benoit Jacob <jacob.benoit.1> | ||||||
Component: | Core - vectorization | Assignee: | Nobody <eigen.nobody> | ||||||
Status: | RESOLVED FIXED | ||||||||
Severity: | Internal Design | CC: | benoit.steiner.goog, chtz, gael.guennebaud, jacob.benoit.1, markos | ||||||
Priority: | Highest | Keywords: | performance | ||||||
Version: | unspecified | ||||||||
Hardware: | x86 - AVX | ||||||||
OS: | All | ||||||||
Whiteboard: | |||||||||
Bug Depends on: | |||||||||
Bug Blocks: | 558 | ||||||||
Attachments: |
|
Description
Benoit Jacob
2015-02-27 21:27:34 UTC
I thought I filed a bug for this already, but apparently, I didn't. The problem is that on AVX Vector4i is vectorized, but sizeof(Vector4i)==16 and AlignmentSize is 32. Actually, this exposes another problem, namely that Vector4f and Vector2d expressions do not get vectorized anymore with AVX (even though it would be possible using HalfPackets). See this example #include <Eigen/Core> void testFun(Eigen::Vector2d& res, const Eigen::Vector2d& a, const Eigen::Vector2d& b) { res = a+b; } and compile with -O2 -S -mavx (stripped uninteresting parts): vmovsd (%rsi), %xmm0 vaddsd (%rdx), %xmm0, %xmm0 vmovsd %xmm0, (%rdi) vmovsd 8(%rsi), %xmm0 vaddsd 8(%rdx), %xmm0, %xmm0 vmovsd %xmm0, 8(%rdi) compared to compiling with -O2 -S -msse2: movapd (%rsi), %xmm0 addpd (%rdx), %xmm0 movapd %xmm0, (%rdi) Right! So this test failure is a very important problem - and it's good that we have at least one test that fails for that. I had come across this regression too recently, see http://listengine.tuxfamily.org/lists.tuxfamily.org/eigen/2015/01/msg00042.html There's consensus that this should block 3.3; the question is do we want to properly fix this in 3.3 so that enabling AVX doesn't disable SSE anymore, or do we want to make AVX opt-in i.e. not only you have to pass -mavx but you also need to pass, say, -DENABLE_AVX_AND_DISABLE_SSE, so that only people who specifically want this behavior get it? Ugly, but easy and makes this a non-regression so could unblock 3.3. Since switching between SSE and AVX is costly, we could make AVX => no SSE by default. Then, in order to ease the use of multiple vector engine in the same application, we could template the evaluators and product kernels by the vectorization engine. This way one could enable AVX for the translation units where it makes sense, and SSE for integer or, as a current workaround, small float-based computations. Nonetheless, we should still make sure that Vector4f/Vector4i/Vetor2i are properly aligned to a 16 bytes boundary even if they are not vectorized yet in AVX. Switching between SSE and AVX is indeed costly, which is why compilers disable SSE if AVX is enabled. Fortunately, you can still call _mm_add_pd and so on from AVX code. This is then automatically translated into the corresponding VEX.128 functions. See e.g.: $ cat avxtest.cpp #include <emmintrin.h> void test(double * ptr) { __m128d p = _mm_loadu_pd(ptr); p = _mm_add_pd(p, p); _mm_storeu_pd(ptr, p); } $ g++ -msse2 avxtest.cpp -S -O2 && tail -6 avxtest.s | head -3 movupd (%rcx), %xmm0 addpd %xmm0, %xmm0 movups %xmm0, (%rcx) $ g++ -mavx avxtest.cpp -S -O2 && tail -6 avxtest.s | head -3 vmovupd (%rcx), %xmm0 vaddpd %xmm0, %xmm0, %xmm0 vmovups %xmm0, (%rcx) yes, this is how we can support both Packet4f and Packet8f, but I was thinking about Packet4i (SSE only or AVX2) and Packet8f (AVX only) Packet4i still works with AVX. In AVX mode this also basically just adds the VEX prefix to the instruction code: http://en.wikipedia.org/wiki/VEX_prefix All they "forgot" to implement in AVX were 256bit integer operations. To solve the actual problem, perhaps meta packets are again the solution. This would easily allow to handle all cases of Packet<float,4>, Packet<float,8>, Packet<int,4> and Packet<int,8> merely by having generic implementations and different specializations for SSE, AVX and AVX2. We could also avoid lots of code duplication in /src/Core/arch/*/MathFunctions.h Created attachment 554 [details]
re-enable vectorization of Vector4i and alignment of Vector4f
This patch enable the alignment of Vector2d, Vector4f, and Vector4i when AVX is enabled. It also re-enable vectorization of Vector4i with AVX. Vectorization of Vector2d/Vector4f still required deeper changes as discussed in other entries.
Patch applied: https://bitbucket.org/eigen/eigen/commits/d8e950d48cb1/ Changeset: d8e950d48cb1 User: ggael Date: 2015-03-13 20:15:50+00:00 Summary: Bug 973, improve AVX support by enabling vectorization of Vector4i-like types, and enforcing alignement of Vector4f/Vector2d-like types to preserve compatibility with SSE and future Eigen versions that will vectorize them with AVX enabled. This should be enough to close this specific bug as enabling vectorization through half-packet types is another topic and not a 3.3 blocker. Sorry, for not having responded earlier, but doesn't the first part of this condition prevent Vector6d, Matrix<float, 4,3>, etc from getting aligned? bool TryHalf = bool(unpacket_traits<Packet>::size > Size) && ... Could we do some calculation using min((Size & -Size), EIGEN_MAX_ALIGN/sizeof(Scalar)) (and cap it to zero, for values < HalfPacket::size) Also, is it possible somehow to write code that always uses AVX-like alignment, even for SSE code? (As it was the default for non-SSE code to be 16 byte aligned) I'm actually ok, with making EIGEN_MAX_ALIGN=32 the default starting from Eigen3.3 (even for Matrix<int, 8, 1> to ensure AVX2 compatibility). We can't prevent some minor API breaks by introducing AVX, anyways. And compiling with -DEIGEN_MAX_ALIGN=16 should be easy enough for the experienced user. Another thing: What shall the behavior of AlignedMaps be? I guess, making the alignment depend on the architecture could lead to confusing errors, when switching between both, e.g., the following will break with AVX: Matrix<double, 6, 1> A; // should be 16 byte aligned on all architectures Vector4d::AlignedMapType head(A.data()); // both are 16 byte aligned, but Vector4d::AlignedMapType tail(A.data()+2); // only one is 32 byte aligned I'm afraid, for maximal compatibility and performance, we'll need to provide both a 16 and a 32 byte aligned (and an unaligned) Map. I agree that we don't need to block 3.3 for the actual vectorization of half-packets. Ah indeed, I completely overlooked the (2*n+1)*half-packet-size cases. I'd rather keep the current 16 bytes default alignment, and ask experienced users to compile their SSE-only code with -DEIGEN_MAX_ALIGN=32 if they want AVX compatibility. It is clear to me that Map should accept Aligned16/Aligned32 flags. The question is what should be the meaning of the not specific "Aligned" flag, i.e., Map<...,Aligned> and Matrix::AlignedMap? I'm tempted to be conservative and interpret it as Aligned16. https://bitbucket.org/eigen/eigen/commits/d7c36b69bcac/ Date: 2015-04-01 11:55:09+00:00 Summary: Bug 973: enable alignment of multiples of half-packet size (e.g., Vector6d with AVX) Still have to work on Aligned* flags and a user controllable EIGEN_MAX_ALIGN. (In reply to Gael Guennebaud from comment #11) > Still have to work on Aligned* flags and a user controllable EIGEN_MAX_ALIGN. Some questions/remarks on that: * Shall Aligned16/Aligned32 flags be affected by EIGEN_MAX_ALIGN? * Shall the Aligned* flags be usable for Plain types and interpreted as "ForceAlign", i.e., make Matrix<float, 3, 1, Aligned16> replace the (unsupported) AlignedVector3<float>? The same goes for Aligned Maps and Refs, i.e., we could support aligned maps to types which don't have a multiple of the alignment-size. This can still be beneficial, because all but the tail of the data can be accessed by aligned loads (as it is done for dynamic sizes already). Alternatively, Aligned16 could imply that we are allowed to read/overwrite excessive data after the actual data (which is what AlignedVector3 does already) -- this could be really beneficial for vectorization (care has to be taken with reductions). * With the former, Matrix<float,4,1,Aligned16> and Matrix<float,4,1,AutoAlign> would be equivalent (unless EIGEN_MAX_ALIGN=0). * Shall we allow EIGEN_MAX_ALIGN>32? I don't see a reason why not -- on the one hand for future development, on the other hand, people may want to (mis-) use this feature to force cache alignment. Of course, only 0 or powers of 2 shall be allowed. * Should types for which we don't have vectorization (i.e., also no packet_traits<...>) be aligned? This would be important for people who use Array<short, ...> and want to operate on that with hand-coded SIMD instructions (actually, we could consider supporting [u]int{8,16} in the future -- this has many applications, e.g., in computer vision) (In reply to Christoph Hertzberg from comment #12) > Some questions/remarks on that: > * Shall Aligned16/Aligned32 flags be affected by EIGEN_MAX_ALIGN? I don't think so. EIGEN_MAX_ALIGN should only control the effect of the default AutoAlign. > * Shall the Aligned* flags be usable for Plain types and interpreted as > "ForceAlign", i.e., make Matrix<float, 3, 1, Aligned16> replace the > (unsupported) AlignedVector3<float>? The same goes for Aligned Maps and > Refs, i.e., we could support aligned maps to types which don't have a > multiple of the alignment-size. This can still be beneficial, because all > but the tail of the data can be accessed by aligned loads (as it is done for > dynamic sizes already). Alternatively, Aligned16 could imply that we are > allowed to read/overwrite excessive data after the actual data (which is > what AlignedVector3 does already) -- this could be really beneficial for > vectorization (care has to be taken with reductions). From a user point of view, the main use case will be in Map/Ref. In this case they will provide at compile-time the minimum alignment of the provided pointer, and nothing more. In particular, I don't think that this could allow us to read/write outside the bounds. We would need another flag for that. A related welcome feature would be to offer the possibility to specify that all columns/rows are aligned the same say, or more generally, to set at compile-time that the outerstride is a multiple of some given number. In Matrix<>/Array<>, IMO, they should only define an upper bound, that is, it will be ignored if the matrix size if not a multiple of the requested alignment. Some examples: Array<float,8,1,Aligned32> with SSE -> aligned on 32bytes Array<float,8,1,Aligned16> with AVX -> aligned on 16bytes only Array<float,4,1,Aligned32> with ??? -> aligned on 16bytes only Array<float,9,1,Aligned32> with ??? -> not aligned In the future, we could imagine introducing an additional option to enable padding and thus enforce full alignment: Array<float,7,7,Aligned32> -> will occupy a block of 8x8 floats with outer-stride=8. Array<float,Dynamic,Dynamic,Aligned32> -> will exhibit an outer stride multiple of 8. but that's another topic! > * With the former, Matrix<float,4,1,Aligned16> and > Matrix<float,4,1,AutoAlign> would be equivalent (unless EIGEN_MAX_ALIGN=0). It would also be equivalent to Matrix<float,4,1,Align32> > * Shall we allow EIGEN_MAX_ALIGN>32? I don't see a reason why not -- on the > one hand for future development, on the other hand, people may want to > (mis-) use this feature to force cache alignment. Of course, only 0 or > powers of 2 shall be allowed. yes we should allow EIGEN_MAX_ALIGN>32. Since EIGEN_MAX_ALIGN should define only an upper-bound, misuses should be limited. > * Should types for which we don't have vectorization (i.e., also no > packet_traits<...>) be aligned? This would be important for people who use > Array<short, ...> and want to operate on that with hand-coded SIMD > instructions (actually, we could consider supporting [u]int{8,16} in the > future -- this has many applications, e.g., in computer vision) yes, I think that Array<T,Rows,Cols,Aligned32> should be aligned on 32 bytes for any T, as long as: sizeof(T)*Rows*Cols == k * 32 Created attachment 593 [details] Refactoring of the macro-level control of alignment. Here is a first patch implementing the proposed novel mechanism to control alignment. This first patch focuses on the set of macros. Changeset summary: "Bug 973: update macro-level control of alignement by introducing user-controllable EIGEN_MAX_ALIGN_BYTES and EIGEN_MAX_STATIC_ALIGN_BYTES macros. This changeset also removes EIGEN_ALIGN (replaced by EIGEN_MAX_ALIGN_BYTES>0), EIGEN_ALIGN_STATICALLY (replaced by EIGEN_MAX_STATIC_ALIGN_BYTES>0), EIGEN_USER_ALIGN*, EIGEN_ALIGN_DEFAULT (replaced by EIGEN_ALIGN_MAX)." The most interesting changes are visible in Macros.h and DenseStorage.h The next step will be to add the Aligned16/32/64 constants to expressions and to propagate this information. I pushed an updated version of the previous patch: https://bitbucket.org/eigen/eigen/commits/9bdd0b78e5ec/ This changeset ece6e5c0b64d enable vectorization with half-packets, so I think we can close that one. Feel free to reopen if I oversaw some aspects. -- GitLab Migration Automatic Message -- This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/973. |