The idea is to have meta-packages which consist of smaller packages. This could help with casting between different-sized scalars on the vector unit, cleaner code for unrolled loops (the actual unrolling happens in the meta-package code), horizontal reductions, transposing matrix blocks, and so on. It may also help with porting to AVX or other vectorization engines whose package sizes exceed 16 bytes.
I guess you meant "packets".
> I guess you meant "packets".

Yes, of course ...
I have experimented with a naive implementation of a square meta packet, which looks roughly like this for SSE floats:

typedef struct {
  Packet4f packets[4];
} Kernel4f;

template<> void pbroadcast<Packet4f>(const float* from, Kernel4f* to) {
  EIGEN_ASM_COMMENT("Start pbroadcast<Packet4f>");
  to->packets[3] = pload<Packet4f>(from);
  to->packets[0] = vec4f_swizzle1(to->packets[3], 0,0,0,0);
  to->packets[1] = vec4f_swizzle1(to->packets[3], 1,1,1,1);
  to->packets[2] = vec4f_swizzle1(to->packets[3], 2,2,2,2);
  to->packets[3] = vec4f_swizzle1(to->packets[3], 3,3,3,3);
  EIGEN_ASM_COMMENT("Done pbroadcast<Packet4f>");
}

int main() {
  float vals[4] = {1, 2, 3, 4};
  Kernel4f kernel;
  internal::pbroadcast<Packet4f>(vals, &kernel);

  for (int i = 0; i < 4; ++i) {
    cout << kernel.packets[i][0] << " " << kernel.packets[i][1] << " "
         << kernel.packets[i][2] << " " << kernel.packets[i][3] << endl;
  }
  return 0;
}

Unfortunately gcc will not keep the content of the Kernel4f struct in SSE registers. Here's the assembly code that it generates:

# 683 "Eigen/src/Core/arch/SSE/PacketMath.h" 1
  #Start pbroadcast<Packet4f>
# 0 "" 2
#NO_APP
  movdqa  (%rdi), %xmm0
  pshufd  $0, %xmm0, %xmm1
  movaps  %xmm1, (%rsi)
  pshufd  $85, %xmm0, %xmm1
  movaps  %xmm1, 16(%rsi)
  pshufd  $170, %xmm0, %xmm1
  pshufd  $255, %xmm0, %xmm0
  movaps  %xmm1, 32(%rsi)
  movaps  %xmm0, 48(%rsi)
#APP
# 689 "Eigen/src/Core/arch/SSE/PacketMath.h" 1
  #Done pbroadcast<Packet4f>

Instead I'll have to leverage the power of templates to implement Kernel4f. The idea is to define the MetaPacket as follows, assuming that S is always even:

template <typename Packet, int S>
struct MetaPacket {
  MetaPacket<Packet, S/2> first_half;
  MetaPacket<Packet, S/2> second_half;
};

template <typename Packet>
struct MetaPacket<Packet, 1> {
  Packet packet;
};

and create template accessors to be able to access individual packets by index, which would look like this pseudo-code:

template <typename Packet, int S, int I>
const Packet& get(const MetaPacket<Packet, S>& mp) {
  if (I < S/2) {
    return get<Packet, S/2, I>(mp.first_half);
  } else {
    return get<Packet, S/2, I-S/2>(mp.second_half);
  }
}

template <typename Packet>
const Packet& get(const MetaPacket<Packet, 1>& mp) {
  return mp.packet;
}

To the compiler optimizer the meta packet will look like a bunch of __m128 variables, which it should be able to keep in registers.
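Note that the accessor pseudo-code above cannot compile as-is: the runtime if() instantiates both branches, eventually with out-of-range indices. A sketch of one way to do the dispatch at compile time in C++98, using a hypothetical helper struct get_impl (illustrative names, not from any patch):

template <typename Packet, int S, int I, bool InFirstHalf = (I < S/2)>
struct get_impl;

// index falls into the first half: recurse into first_half
template <typename Packet, int S, int I>
struct get_impl<Packet, S, I, true> {
  static const Packet& run(const MetaPacket<Packet, S>& mp) {
    return get_impl<Packet, S/2, I>::run(mp.first_half);
  }
};

// index falls into the second half: recurse with the index shifted
template <typename Packet, int S, int I>
struct get_impl<Packet, S, I, false> {
  static const Packet& run(const MetaPacket<Packet, S>& mp) {
    return get_impl<Packet, S/2, I - S/2>::run(mp.second_half);
  }
};

// base case: a MetaPacket of size 1 holds the packet directly
// (for S==1 and I==0, the default argument I < S/2 evaluates to false)
template <typename Packet>
struct get_impl<Packet, 1, 0, false> {
  static const Packet& run(const MetaPacket<Packet, 1>& mp) {
    return mp.packet;
  }
};

template <int I, typename Packet, int S>
const Packet& get(const MetaPacket<Packet, S>& mp) {
  return get_impl<Packet, S, I>::run(mp);
}

With this, a call like get<2>(kernel) resolves to a direct member access at compile time, with no runtime branch left for the optimizer to worry about.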
(In reply to comment #3)
> I have experimented with a naive implementation of a square meta packet,
> which looks roughly like this for SSE floats:
>
> typedef struct {
>   Packet4f packets[4];
> } Kernel4f;
>
> template<> void pbroadcast<Packet4f>(const float* from, Kernel4f* to) {

I tried that, but replaced Kernel4f* by Kernel4f&.

> int main() {
>   float vals[4] = {1, 2, 3, 4};
>   Kernel4f kernel;
>   internal::pbroadcast<Packet4f>(vals, &kernel);
>
>   for (int i = 0; i < 4; ++i) {
>     cout << kernel.packets[i][0] << " " << kernel.packets[i][1] << " "
>          << kernel.packets[i][2] << " " << kernel.packets[i][3] << endl;
>   }

It would not make sense to store the packets in registers in your case, so I made a simple copy operation afterwards instead. The result (using g++ 4.7.1, compiled with -O2) was basically:

#APP
# 245 "eigen_test.cpp" 1
  #Start pbroadcast<Packet4f>
# 0 "" 2
#NO_APP
  movdqa  32(%esp), %xmm0
  pshufd  $0, %xmm0, %xmm3
  pshufd  $85, %xmm0, %xmm2
  pshufd  $170, %xmm0, %xmm1
  pshufd  $255, %xmm0, %xmm0
#APP
# 251 "eigen_test.cpp" 1
  #Done pbroadcast<Packet4f>
# 0 "" 2
# 266 "eigen_test.cpp" 1
  #Copy a to b
# 0 "" 2
#NO_APP
  movaps  %xmm3, 64(%esp)
  movaps  %xmm2, 80(%esp)
  movaps  %xmm1, 96(%esp)
  movaps  %xmm0, 112(%esp)
#APP

So basically pbroadcast works as intended :) Actually, using a reference or a pointer does not seem to make a difference, but I think passing by reference makes it more obvious (at least to the human reader) that the Kernel is not required to be stored in memory.
Created attachment 443 [details]
Patch against the latest version of the codebase that introduces the notion of MetaPacket as well as the broadcast_from() initializer method.
Indeed, it looks like gcc-4.8 is able to optimize the code as well. I have attached the corresponding patch. As an experiment, I have updated the gebp_kernel to use meta packets to load the data from the rhs matrix. On Nehalem, this yields performance similar to what I get after applying the patch for bug 721.
We have a PacketBlock, which is so far used for optimized loading inside products: https://bitbucket.org/eigen/eigen/src/9c5d2b7/Eigen/src/Core/GenericPacketMath.h#cl-423
I re-opened this, since the current implementation is far from being feature-complete -- at least compared to what I had in mind originally. To really help solving the dependent bugs, we need something like

template<class Scalar, int size> struct MetaPacket;

with generic implementations for all p-functions (padd, pmul, ...). For SSE/AVX we can then specialize (the following code is simplified and not necessarily working):

#if SSE
template<> struct MetaPacket<float,4> {
  __m128 packet;
};
typedef MetaPacket<float, 4> Packet4f;

template<> Packet4f padd(const Packet4f& a, const Packet4f& b) {
  return _mm_add_ps(a,b);
}
#endif

#if AVX
template<> struct MetaPacket<float,8> {
  __m256 packet;
};
typedef MetaPacket<float, 8> Packet8f;

template<> Packet8f padd(const Packet8f& a, const Packet8f& b) {
  return _mm256_add_ps(a,b);
}
#endif

This would also be very easy to extend to AVX2 (256-bit integer operations) and AVX-512 in the future.
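For reference, a variant of the SSE branch that should actually compile might look as follows (a sketch; the explicit .packet member access and the plain-overload form of padd are just one possible choice, not taken from any patch):

#include <emmintrin.h>  // SSE2 intrinsics

template<class Scalar, int Size> struct MetaPacket;

template<> struct MetaPacket<float,4> {
  __m128 packet;
};
typedef MetaPacket<float,4> Packet4f;

// plain overload instead of a function-template specialization
inline Packet4f padd(const Packet4f& a, const Packet4f& b) {
  Packet4f r;
  r.packet = _mm_add_ps(a.packet, b.packet);  // 4 float additions in one instruction
  return r;
}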
The default implementation could be implemented recursively as follows:

template<typename S> struct MetaPacket<S,1> {
  S val;
};

template<typename S>
MetaPacket<S,1> padd(const MetaPacket<S,1>& a, const MetaPacket<S,1>& b) {
  return MetaPacket<S,1>(a.val + b.val);
}

template<typename S, int N> struct MetaPacket {
  MetaPacket<S,N/2> p1, p2;
};

template<typename S, int N>
MetaPacket<S,N> padd(const MetaPacket<S,N>& a, const MetaPacket<S,N>& b) {
  return MetaPacket<S,N>(padd(a.p1,b.p1), padd(a.p2,b.p2));
}
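A self-contained sketch of this fallback, including the constructors that the recursive padd implicitly assumes (all names illustrative):

template<class S, int N> struct MetaPacket {
  MetaPacket() {}
  MetaPacket(const MetaPacket<S,N/2>& a, const MetaPacket<S,N/2>& b)
    : p1(a), p2(b) {}
  MetaPacket<S,N/2> p1, p2;
};

template<class S> struct MetaPacket<S,1> {
  MetaPacket() {}
  explicit MetaPacket(S v) : val(v) {}
  S val;
};

// scalar base case: one plain addition
template<class S>
MetaPacket<S,1> padd(const MetaPacket<S,1>& a, const MetaPacket<S,1>& b) {
  return MetaPacket<S,1>(a.val + b.val);
}

// recursive case: add both halves independently
template<class S, int N>
MetaPacket<S,N> padd(const MetaPacket<S,N>& a, const MetaPacket<S,N>& b) {
  return MetaPacket<S,N>(padd(a.p1, b.p1), padd(a.p2, b.p2));
}

Combined with architecture specializations as sketched above, padd on, e.g., MetaPacket<float,8> with only SSE available would then decompose into two hardware padds on MetaPacket<float,4>.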
Created attachment 553 [details]
Proof-of-concept implementation for meta packets

I attached a proof-of-concept implementation. For simple functions like the following it generates reasonable code with SSE or AVX enabled or disabled:

Packet4d testFun(const Packet4d& a, const Packet4d& b) {
  return pmul(a, pneg(b));
}

Of course, a lot of work has to be done for a complete port. Also some design decisions have to be made (and some TODOs):
* I would suggest asserting that N is a power of 2
* Shall p1 and p2 be made private?
  --> We could provide upper() and lower() member functions instead
* I haven't yet thought about how to elegantly implement vectorized higher-level functions in a way that std-functions are still called for the Scalar case (should not be too complicated)
* Specialization for different architectures could also be aided by macros
* Packet2f could be specialized for SSE, by ignoring one half of the register
* Constructors from aligned or unaligned memory are missing
  --> We could just stick to the current pload methods
* What's the return type of comparison operators? Without vectorization, bool makes more sense; otherwise Packet<S,N> is better for pblend
Sounds promising.

(In reply to Christoph Hertzberg from comment #10)
> Also some design decisions have to be made (and some TODOs):
> * I would suggest asserting that N is a power of 2

I'm a bit concerned about compilation overhead, as Packets are instantiated all the time. Perhaps we add empty partial specializations of MetaPacket for small non-power-of-two sizes (3,5,7,9,10,11,12,13,14,15).

> * Shall p1 and p2 be made private?
>   --> We could provide upper() and lower() member functions instead

Good idea; this way, in AVX, Packet8f::upper()/lower() would be possible too.

> * I haven't yet thought about how to elegantly implement vectorized
>   higher-level functions in a way that std-functions are still called
>   for the Scalar case (should not be too complicated)

If we have overloads for MetaPacket<T,1>, MetaPacket<float,N>, MetaPacket<double,N>, then we'll end up with ambiguous calls, which can be resolved by either:
- overloads for MetaPacket<float,1> and MetaPacket<double,1>
- or explicit overloads in each vector engine calling the respective generic vectorized implementation. This way each vectorization engine can choose to enable only a subset of the vectorized functions (I don't know why one would, though), and this also leaves the opportunity to specialize the scalar path.

> * Packet2f could be specialized for SSE, by ignoring one half of the register

I have strong doubts about the possible performance gain, but indeed that's theoretically possible and somewhat easy.

> * Constructors from aligned or unaligned memory are missing
>   --> We could just stick to the current pload methods

Yes.

> * What's the return type of comparison operators? Without vectorization,
>   bool makes more sense; otherwise Packet<S,N> is better for pblend

I'd address this question at the evaluator level to catch all subexpressions which can be advantageously mapped to a pblend.
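To illustrate the ambiguity with a hypothetical psin (declarations only, just a sketch):

template<class S, int N> struct MetaPacket;

template<class T> MetaPacket<T,1>     psin(const MetaPacket<T,1>& a);      // generic scalar path
template<int N>   MetaPacket<float,N> psin(const MetaPacket<float,N>& a);  // vectorized float path

// psin(MetaPacket<float,1>()) matches both templates, and neither wins
// partial ordering -> error: ambiguous call. Resolution (a) from above:
MetaPacket<float,1> psin(const MetaPacket<float,1>& a);                    // exact overload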
(In reply to Gael Guennebaud from comment #11)
> > * I would suggest asserting that N is a power of 2
> I'm a bit concerned about compilation overhead, as Packets are instantiated
> all the time. Perhaps we add empty partial specializations of MetaPacket
> for small non-power-of-two sizes (3,5,7,9,10,11,12,13,14,15).

Good point on the overhead. As it will be just an internal class anyway, we could also simply document the behavior as undefined/suboptimal for non-powers of 2.

> > * I haven't yet thought about how to elegantly implement vectorized
> >   higher-level functions in a way that std-functions are still called
> >   for the Scalar case (should not be too complicated)
> If we have overloads for MetaPacket<T,1>, MetaPacket<float,N>,
> MetaPacket<double,N>, then we'll end up with ambiguous calls, which can be
> resolved by either:
> - overloads for MetaPacket<float,1> and MetaPacket<double,1>
> - or explicit overloads in each vector engine calling the respective
>   generic vectorized implementation. This way each vectorization engine can
>   choose to enable only a subset of the vectorized functions (I don't know
>   why one would, though), and this also leaves the opportunity to
>   specialize the scalar path.

Perhaps we need some kind of delegator structs anyway (i.e., with the actual implementation in a sin_impl::run() method). This is likely necessary to prevent calling these functions for complex numbers -- or to provide different specializations, if that is feasible. Specializing the scalar path might indeed be worth evaluating. I never benchmarked the performance of std math versus our implementation. They can't be significantly different, as long as they don't fall back to x87 math (the only actual difference might be the use of branches instead of masking/blending).

> > * Packet2f could be specialized for SSE, by ignoring one half of the register
> I have strong doubts about the possible performance gain, but indeed that's
> theoretically possible and somewhat easy.

Yes, I haven't really thought that through. A possible problem is that sizeof(Packet2f) would be 16 instead of 8 (not sure if that matters anywhere). And I'm not sure if it gains performance anywhere, either.

> > * What's the return type of comparison operators? Without vectorization,
> >   bool makes more sense; otherwise Packet<S,N> is better for pblend
> I'd address this question at the evaluator level to catch all subexpressions
> which can be advantageously mapped to a pblend.

Right, so on this level we simply implement pcmp methods which return Packet<S,N>. They could already be used to implement MathFunctions.h, and the decision on how they can be used for select, etc., can be delegated to the evaluator (and Bug 97).
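For illustration, the scalar fallback of such a pcmp could produce an all-ones/all-zeros bit pattern per element, so that it composes with a pblend exactly like a SIMD comparison would (the memset trick is just a sketch):

#include <cstring>

template<class S, int N> struct MetaPacket;
template<class S> struct MetaPacket<S,1> { S val; };

// scalar fallback of a comparison: emulate a SIMD mask where every bit of
// the result element is 1 if the comparison holds, 0 otherwise
template<class S>
MetaPacket<S,1> pcmp_lt(const MetaPacket<S,1>& a, const MetaPacket<S,1>& b) {
  MetaPacket<S,1> r;
  std::memset(&r.val, (a.val < b.val) ? 0xFF : 0x00, sizeof(S));
  return r;
}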
Another small question: what shall we call the cast functions? SSE/AVX use *cast* for 'reinterpret_cast' and *cvt* for type conversion. In our API, cast<>() refers to type conversion (and we don't have functionality for the former).
I'm ok with cast and cvt as in SSE/AVX intrinsics.
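To illustrate the distinction in SSE terms (the p-function names below are hypothetical):

#include <emmintrin.h>

// cvt: value conversion, e.g. 1.5f -> 1 (generates a cvttps2dq instruction)
inline __m128i pcvt_demo(__m128 a)  { return _mm_cvttps_epi32(a); }

// cast: bit reinterpretation, e.g. 1.0f -> 0x3f800000 (no instruction at all)
inline __m128i pcast_demo(__m128 a) { return _mm_castps_si128(a); }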
N.B.: On architectures with masked load/store instructions (such as AVX and AVX512), we could also allow odd-sized packets. This could really benefit expressions where the size is not a multiple of a natural packet-size (also for calculating the last elements of a dynamic-sized expression).
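For illustration, loading a hypothetical 3-float packet with an AVX masked load might look like this (MetaPacket<float,3> does not exist anywhere yet; this is just a sketch):

#include <immintrin.h>  // AVX intrinsics

// load only the first 3 floats; the remaining lanes are zeroed and the
// masked-out memory is never touched (no out-of-bounds read)
inline __m256 load3floats(const float* from) {
  const __m256i mask = _mm256_setr_epi32(-1, -1, -1, 0, 0, 0, 0, 0);
  return _mm256_maskload_ps(from, mask);
}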
-- GitLab Migration Automatic Message -- This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/692.