New user self-registration is disabled due to spam. Please email eigen-core-team @ lists.tuxfamily.org if you need an account.
Before reporting a bug, please make sure that your Eigen version is up-to-date!
Bug 692 - Meta-Packets
Summary: Meta-Packets
Status: DECISIONNEEDED
Alias: None
Product: Eigen
Classification: Unclassified
Component: Core - vectorization (show other bugs)
Version: 3.3 (current stable)
Hardware: All All
: Normal Internal Design
Assignee: Nobody
URL:
Whiteboard:
Keywords: performance
Depends on:
Blocks: 97 312 512 976 1714
  Show dependency treegraph
 
Reported: 2013-10-28 13:08 UTC by Christoph Hertzberg
Modified: 2019-05-24 15:13 UTC (History)
4 users (show)



Attachments
Patch against the latest version of the codebase that introduces the notion of MetaPacket as well as the broadcast_from() initializer method. (10.19 KB, patch)
2014-03-26 19:34 UTC, Benoit Steiner
no flags Details | Diff
Proof-of-concept implementation for meta packets (3.43 KB, text/plain)
2015-03-04 02:51 UTC, Christoph Hertzberg
no flags Details

Description Christoph Hertzberg 2013-10-28 13:08:35 UTC
The idea is to have meta-packages which consist of smaller packages.
This can help casting different-sized scalars using the vector machine, implement more clean code for unrolled loops (the actual unrolling happens in the meta-package code), implement horizontal reduction, transpose matrix blocks, ...

This may also help porting to AVX or other vectorization engines which have package sizes above 16 bytes.
Comment 1 Gael Guennebaud 2013-10-28 15:59:07 UTC
I guess you meant "packets".
Comment 2 Christoph Hertzberg 2013-10-28 16:23:33 UTC
> I guess you meant "packets".
Yes, of course ...
Comment 3 Benoit Steiner 2014-03-26 03:04:08 UTC
I have experimented with a naive implementation of a square meta packet, which looks roughly like this for SSE floats:

typedef struct {
  Packet4f packets[4];
} Kernel4f;

template<> void pbroadcast<Packet4f>(const float* from, Kernel4f* to) {
  EIGEN_ASM_COMMENT("Start pbroadcast<Packet4f>");
  to->packets[3] = pload<Packet4f>(from);
  to->packets[0] = vec4f_swizzle1(to->packets[3], 0,0,0,0);
  to->packets[1] = vec4f_swizzle1(to->packets[3], 1,1,1,1);
  to->packets[2] = vec4f_swizzle1(to->packets[3], 2,2,2,2);
  to->packets[3] = vec4f_swizzle1(to->packets[3], 3,3,3,3);
  EIGEN_ASM_COMMENT("Done pbroadcast<Packet4f>");
}

int main() {
  float vals[4] = {1, 2, 3, 4};
  Kernel4f kernel;
  internal::pbroadcast<Packet4f>(vals, &kernel);

  for (int i = 0; i < 4; ++i) {
    cout << kernel.packets[i][0] << " " << kernel.packets[i][1] << " " <<kernel.packets[i][2] << " " <<kernel.packets[i][3] << " " << endl;
  }
  return 0;
}

Unfortunately gcc will not keep the content of the Kernel4f struct on SSE registers. Here's the assembly code that it generates:

# 683 "Eigen/src/Core/arch/SSE/PacketMath.h" 1
        #Start pbroadcast<Packet4f>
# 0 "" 2
#NO_APP
        movdqa  (%rdi), %xmm0
        pshufd  $0, %xmm0, %xmm1
        movaps  %xmm1, (%rsi)
        pshufd  $85, %xmm0, %xmm1
        movaps  %xmm1, 16(%rsi)
        pshufd  $170, %xmm0, %xmm1
        pshufd  $255, %xmm0, %xmm0
        movaps  %xmm1, 32(%rsi)
        movaps  %xmm0, 48(%rsi)
#APP
# 689 "Eigen/src/Core/arch/SSE/PacketMath.h" 1
        #Done pbroadcast<Packet4f>

Instead I'll have to leverage the power of templates to implement Kernel4f. The idea is to define the MetaPacket as follow, assuming that S is always even:
template <typename Packet, int S> struct MetaPacket {
  MetaPacket<Packet, S/2> first_half;
  MetaPacket<Packet, S/2> second_half;
};
template <typename Packet, 1> struct MetaPacket {
  Packet packet;
};
and create template accessors to be able to access individual packets by index that would look like this pseudo code:
template <typename Packet, int S, int I>
const Packet& get(const MetaPacket<Packet, S>& mp) {
  if (I<S/2) {
    return get<Packet, S/2, I>(mp.first_half);
  } else {
    return get<Packet, S/2, I-S/2>(mp.second_half);
  }
}
template <typename Packet, 1, 0>
const Packet& get(const MetaPacket<Packet, 1>& mp) {
  return mp.packet;
}

To the compiler optimizer the meta packet will look like a bunch of __m128, which it should be able to properly keep on registers.
Comment 4 Christoph Hertzberg 2014-03-26 10:57:03 UTC
(In reply to comment #3)
> I have experimented with a naive implementation of a square meta packet, which
> looks roughly like this for SSE floats:
> 
> typedef struct {
>   Packet4f packets[4];
> } Kernel4f;
> 
> template<> void pbroadcast<Packet4f>(const float* from, Kernel4f* to) {

I tried that, but replaced Kernel4f* by Kernel4f&.

> int main() {
>   float vals[4] = {1, 2, 3, 4};
>   Kernel4f kernel;
>   internal::pbroadcast<Packet4f>(vals, &kernel);
> 
>   for (int i = 0; i < 4; ++i) {
>     cout << kernel.packets[i][0] << " " << kernel.packets[i][1] << " "
> <<kernel.packets[i][2] << " " <<kernel.packets[i][3] << " " << endl;
>   }

It would not make sense to store the Packets into registers in your case.
I made a simple copy operation afterwards.
The result (using g++ 4.7.1, compiled with -O2) was basically:

#APP
# 245 "eigen_test.cpp" 1
	#Start pbroadcast<Packet4f>
# 0 "" 2
#NO_APP
	movdqa	32(%esp), %xmm0
	pshufd	$0, %xmm0, %xmm3
	pshufd	$85, %xmm0, %xmm2
	pshufd	$170, %xmm0, %xmm1
	pshufd	$255, %xmm0, %xmm0
#APP
# 251 "eigen_test.cpp" 1
	#Done pbroadcast<Packet4f>
# 0 "" 2
# 266 "eigen_test.cpp" 1
	#Copy a to b
# 0 "" 2
#NO_APP
	movaps	%xmm3, 64(%esp)
	movaps	%xmm2, 80(%esp)
	movaps	%xmm1, 96(%esp)
	movaps	%xmm0, 112(%esp)
#APP

So basically pbroadcast works as intended :)

Actually, using reference or pointer does not seem to make a difference, but I think passing by reference makes it more obvious (at least to the human reader) that the Kernel is not required to be stored in memory.
Comment 5 Benoit Steiner 2014-03-26 19:34:00 UTC
Created attachment 443 [details]
Patch against the latest version of the codebase that introduces the notion of MetaPacket as well as the broadcast_from() initializer method.
Comment 6 Benoit Steiner 2014-03-26 19:37:28 UTC
Indeed, it looks like gcc-4.8 is able to optimize the code as well. I have attached the corresponding patch.
As an experiment, I have updated the gebp_kernel to use meta packets to load the data from the rhs matrix. I have measured similar performance on Nehalem to what I get after applying the patch for bug 721.
Comment 7 Christoph Hertzberg 2014-06-20 16:54:01 UTC
We have a PacketBlock, which is so far used for optimized loading inside products:
https://bitbucket.org/eigen/eigen/src/9c5d2b7/Eigen/src/Core/GenericPacketMath.h#cl-423
Comment 8 Christoph Hertzberg 2015-03-03 17:57:28 UTC
I re-opened this, since the current implementation is far from being feature-complete -- at least compared to what I had in mind originally.

To really help solving the dependent bugs, we need something like

  template<class Scalar, int size>
  struct MetaPacket;

with generic implementations for all p-functions. (padd, pmul, ...)

For SSE/AVX we can then specialize (the following code is simplified necessarily working):

#if SSE
  template<>
  struct MetaPacket<float,4> {
    __m128 packet;
  };
  typedef MetaPacket<float, 4> Packet4f;
  template<>
  Packet4f padd(const Packet4f& a, const Packet4f& b) { 
    return _mm_add_ps(a,b);
  }
#endif
#if AVX
  template<>
  struct MetaPacket<float,8> {
    __m256 packet;
  };
  typedef MetaPacket<float, 8> Packet8f;
  Packet8f padd(const Packet8f& a, const Packet8f& b) {
    return _mm256_add_ps(a,b);
  }
#endif

This would also be very easy to extend to AVX2 (256bit integer operations) and AVX-512 in the future.
Comment 9 Gael Guennebaud 2015-03-03 23:29:15 UTC
The default implementation could be implemented recursively as follows:

template<typename S> struct MetaPacket<S,1> { S val; };

template<typename S>
Packet<S,1> padd(Packet<S,1> a, Packet<S,1> b) {
  return a.val + b.val;
}

template<typename S,int N> struct MetaPacket<S,N> { MetaPacket<S,N/2> p1, p2; };

template<typename S,int N>
Packet<S,N> padd(Packet<S,N> a, Packet<S,N> b) {
  return Packet<S,N>(padd(a.p1,b.p1), padd(a.p2,b.p2));
}
Comment 10 Christoph Hertzberg 2015-03-04 02:51:33 UTC
Created attachment 553 [details]
Proof-of-concept implementation for meta packets

I attached a proof-of-concept implementation.

For simple functions like the following it generates reasonable code with SSE or AVX enabled or disabled:

  Packet4d testFun(const Packet4d& a, const Packet4d& b) {
    return  pmul(a, pneg(b));
  }

Of course, a lot of work has to be done for a complete port.

Also some design decisions have to be made (and some TODOs):
* I would suggest asserting that N is a power of 2
* Shall p1 and p2 be made private?
  --> We could provide upper() and lower() member functions instead
* I haven't yet thought about how to elegantly implement vectorized 
  higher-level functions in a way that std-functions are still called 
  for the Scalar case (should not be too complicated)
* Specialization for different architectures could also be aided by macros
* Packet2f could be specialized for SSE, by ignoring one half of the register
* Constructors from aligned or unaligned memory are missing
  --> We could just stick to the current pload methods
* What's the return type of comparison operators? Without vectorization, 
  bool makes more sense, otherwise Packet<S,N> is better for pblend
Comment 11 Gael Guennebaud 2015-03-04 11:28:07 UTC
Sounds promising.

(In reply to Christoph Hertzberg from comment #10)
> Also some design decisions have to be made (and some TODOs):
> * I would suggest asserting that N is a power of 2

I'm a bit concerned about compilation overhead as Packets are instantiated all the times. Perhaps we add empty partial specializations of MetaPacket for small non-power-of-two sizes (3,5,7,9,10,11,12,13,14,15).

> * Shall p1 and p2 be made private?
>   --> We could provide upper() and lower() member functions instead

Good idea, this way in AVX Packet8f::upper()/lower() would be possible too.

> * I haven't yet thought about how to elegantly implement vectorized 
>   higher-level functions in a way that std-functions are still called 
>   for the Scalar case (should not be too complicated)

If we have overloads for MetaPacket<T,1>, MetaPacket<float,N>, MetaPacket<double,N>, then we'll end up with ambiguous calls which can be resolved by either:
 - overloads for MetaPacket<float,1> and MetaPacket<double,1>
 - or through explicit overloads in each vector engine calling the respective generic vectorized implementation. This way each vectorization engine can choose to enable only a few subset of the vectorized function (I don't known why) and this also leaves the opportunity to specialize the scalar path.

> * Packet2f could be specialized for SSE, by ignoring one half of the register

I've string doubt about the possible performance gain, but indeed that's theoretically possible and somewhat easy.

> * Constructors from aligned or unaligned memory are missing
>   --> We could just stick to the current pload methods

yes.

> * What's the return type of comparison operators? Without vectorization, 
>   bool makes more sense, otherwise Packet<S,N> is better for pblend

I'd address this question at the evaluator level to catch all subexpressions which can be advantageously mapped to a pblend.
Comment 12 Christoph Hertzberg 2015-03-04 13:15:53 UTC
(In reply to Gael Guennebaud from comment #11)
> > * I would suggest asserting that N is a power of 2
> I'm a bit concerned about compilation overhead as Packets are instantiated
> all the times. Perhaps we add empty partial specializations of MetaPacket
> for small non-power-of-two sizes (3,5,7,9,10,11,12,13,14,15).

Good point on the overhead. As it will be an just internal class anyways, we could also simply document the behavior as undefined/suboptimal for non-powers of 2.

> > * I haven't yet thought about how to elegantly implement vectorized 
> >   higher-level functions in a way that std-functions are still called 
> >   for the Scalar case (should not be too complicated)
> If we have overloads for MetaPacket<T,1>, MetaPacket<float,N>,
> MetaPacket<double,N>, then we'll end up with ambiguous calls which can be
> resolved by either:
>  - overloads for MetaPacket<float,1> and MetaPacket<double,1>
>  - or through explicit overloads in each vector engine calling the
> respective generic vectorized implementation. This way each vectorization
> engine can choose to enable only a few subset of the vectorized function (I
> don't known why) and this also leaves the opportunity to specialize the
> scalar path.

Perhaps we need some kind of delegator structs anyways (i.e., with the actual implementation in a sin_impl::run() method). This is likely necessary to prevent calling these functions for complex numbers -- or to provide different specializations, if that is feasible.
Specializing for the scalar path might indeed be worth evaluating. I never benchmarked the performance of std-math versus our implementation. They can't be significantly different, if they don't fall back to x87 math (only actual difference might be the usage of branches instead of masking/blending).

> > * Packet2f could be specialized for SSE, by ignoring one half of the register
> 
> I've string doubt about the possible performance gain, but indeed that's
> theoretically possible and somewhat easy.

Yes, I haven't really thought that through. A possible problem is that sizeof(Packet2f) would be 16 instead of 8 (not sure if that matters anywhere). And I'm not sure if that gains performance anywhere, either.

> > * What's the return type of comparison operators? Without vectorization, 
> >   bool makes more sense, otherwise Packet<S,N> is better for pblend
> 
> I'd address this question at the evaluator level to catch all subexpressions
> which can be advantageously mapped to a pblend.

Right, so on this level we simply implement pcmp methods which return Packet<S,N>.
They could already be used to implement MathFunctions.h and the decision on how they can be used for select, etc, can be delegated to the evaluator (and Bug 97).
Comment 13 Christoph Hertzberg 2015-03-04 13:20:28 UTC
Another small question: How shall we call the cast functions?
SSE/AVX use *cast* for 'reinterpret_cast' and *cvt* for type conversion. In our API, cast<>() refers to type conversion (and we don't have functionality for the former).
Comment 14 Gael Guennebaud 2015-03-05 16:09:42 UTC
I'm ok with cast and cvt as in SSE/AVX intrinsics.
Comment 15 Christoph Hertzberg 2019-05-24 15:13:14 UTC
N.B.: On architectures with masked load/store instructions (such as AVX and AVX512), we could also allow odd-sized packets. 
This could really benefit expressions where the size is not a multiple of a natural packet-size (also for calculating the last elements of a dynamic-sized expression).

Note You need to log in before you can comment on or make changes to this bug.