This bugzilla service is closed. All entries have been migrated to https://gitlab.com/libeigen/eigen
Bug 359 - Vectorization of Matrix*Vector with aligned matrix and unaligned vector
Summary: Vectorization of Matrix*Vector with aligned matrix and unaligned vector
Status: RESOLVED FIXED
Alias: None
Product: Eigen
Classification: Unclassified
Component: Core - matrix products
Version: 3.0
Hardware: x86 - SSE All
Importance: --- enhancement
Assignee: Nobody
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 404
Reported: 2011-10-12 15:58 UTC by Christoph Hertzberg
Modified: 2019-12-04 11:10 UTC (History)
CC: 2 users



Attachments
Implements vectorized product between aligned matrix and unaligned vector (please review) (4.83 KB, patch)
2013-10-21 14:33 UTC, Christoph Hertzberg
chtz: review? (gael.guennebaud)
Fix AlignedBit flag of coeff based products (1.29 KB, patch)
2013-10-28 11:43 UTC, Gael Guennebaud
no flags

Description Christoph Hertzberg 2011-10-12 15:58:08 UTC
The following is very well vectorized for even DIM but not for odd values, even though the vector is only accessed via movddup, which does not require alignment.

enum{DIM = 4}; // does not vectorize for odd values

void mult_test(const Eigen::Matrix<double, 2, DIM> &a, Eigen::Matrix<double, DIM, 1> &v)
{
	Eigen::Vector2d temp;
	EIGEN_ASM_COMMENT("mult_test");
	temp = a * v;
	EIGEN_ASM_COMMENT("mult-transposed_test");
	v = a.transpose() * temp;
	EIGEN_ASM_COMMENT("end of mult_test");
}


Furthermore, in the transposed product the following lines
	haddpd	%xmm1, %xmm1
	haddpd	%xmm0, %xmm0
	movlpd	%xmm1, -24(%ebp)
	movlpd	%xmm0, -16(%ebp)
## (some independent code) ##
	movapd	-24(%ebp), %xmm0

could be replaced by a simple:
	haddpd	%xmm1, %xmm0

Maybe the other way around; I sometimes mix this up. The point is that haddpd can calculate two horizontal sums at the same time and store them packed into a single register:
http://www.rz.uni-karlsruhe.de/rz/docs/VTune/reference/HADDPD--Packed_Double-FP_Horizontal_Add.htm
Comment 1 Christoph Hertzberg 2012-01-12 15:05:32 UTC
This is not strictly blocking bug 404, but it is one possible optimization.
Furthermore, I accidentally found that my second issue had already been noticed in bug 312.
Comment 2 Christoph Hertzberg 2013-10-21 14:33:13 UTC
Created attachment 387 [details]
Implements vectorized product between aligned matrix and unaligned vector
(please review)

I think the problem is that Assign.h checks if the source is aligned and not if it is vectorizable. I attached a patch which fixes that. For ColMajor expressions I also make the packet loading depend on the alignment of the LHS not of the product expression.

Maybe it would be simpler to directly set the AlignedBit in CoeffBasedProduct, instead of just the PacketAccessBit. Is there actually a point, where we need to distinguish between both?
Comment 3 Gael Guennebaud 2013-10-28 10:59:19 UTC
Comment on attachment 387 [details]
Implements vectorized product between aligned matrix and unaligned vector
(please review)

Review of attachment 387 [details]:
-----------------------------------------------------------------

overall, I think a more proper fix would be to fix traits<CoeffBasedProduct>::Flags to make it have the AlignedBit in this case. See inline comments for details.

::: Eigen/src/Core/Assign.h
@@ +47,5 @@
>      StorageOrdersAgree = (int(Derived::IsRowMajor) == int(OtherDerived::IsRowMajor)),
>      MightVectorize = StorageOrdersAgree
>                    && (int(Derived::Flags) & int(OtherDerived::Flags) & ActualPacketAccessBit),
>      MayInnerVectorize  = MightVectorize && int(InnerSize)!=Dynamic && int(InnerSize)%int(PacketSize)==0
> +                       && int(DstIsAligned),

This change will affect many more expressions. I think this "&& int(SrcIsAligned)" was there because we found that for small fixed-size objects, unaligned loads were not worth the effort.

::: Eigen/src/Core/products/CoeffBasedProduct.h
@@ -304,5 @@
>    static EIGEN_STRONG_INLINE void run(Index row, Index col, const Lhs& lhs, const Rhs& rhs, RetScalar &res)
>    {
>      Packet pres;
>      product_coeff_vectorized_unroller<UnrollingIndex+1-PacketSize, Lhs, Rhs, Packet>::run(row, col, lhs, rhs, pres);
> -    product_coeff_impl<DefaultTraversal,UnrollingIndex,Lhs,Rhs,RetScalar>::run(row, col, lhs, rhs, res);

oops, this line is very strange. I guess the compiler was able to optimize it away

@@ +379,5 @@
>  template<int UnrollingIndex, typename Lhs, typename Rhs, typename Packet, int LoadMode>
>  struct product_packet_impl<ColMajor, UnrollingIndex, Lhs, Rhs, Packet, LoadMode>
>  {
>    typedef typename Lhs::Index Index;
> +  enum {LhsLoadMode = Lhs::Flags  & ActualPacketAccessBit ? Aligned : Unaligned};

I think the template parameter 'LoadMode' should still be used, but it should be more properly set, and perhaps renamed 'LhsLoadMode' ?
Comment 4 Gael Guennebaud 2013-10-28 11:43:18 UTC
Created attachment 388 [details]
Fix AlignedBit flag of coeff based products
Comment 5 Christoph Hertzberg 2013-10-28 12:45:49 UTC
(In reply to comment #3)
> overall, I think a more proper fix would be to fix
> traits<CoeffBasedProduct>::Flags to make it have the AlignedBit in this case.

Actually, I was wondering if there is an intended difference between AlignedBit and PacketAccessBit?
Eventually, I would suggest to set the PacketAccessBit if it is practically possible to load the expression packet-wise and introduce a PacketReadCost in addition to the CoeffReadCost. Then the evaluator can decide if the expression shall be evaluated coefficient-wise or packet-wise.

> ::: Eigen/src/Core/Assign.h
> @@ +47,5 @@
> >      StorageOrdersAgree = (int(Derived::IsRowMajor) == int(OtherDerived::IsRowMajor)),
> >      MightVectorize = StorageOrdersAgree
> >                    && (int(Derived::Flags) & int(OtherDerived::Flags) & ActualPacketAccessBit),
> >      MayInnerVectorize  = MightVectorize && int(InnerSize)!=Dynamic && int(InnerSize)%int(PacketSize)==0
> > +                       && int(DstIsAligned),
> 
> This change will affect many more expressions. I think this "&&
> int(SrcIsAligned)" was there because we found that for small fixed size
> objects, unaligned loads were not worth the effort.

Ok, so you'd suggest to rather have CoeffBasedProduct claim to be aligned (although the product itself does not reside in memory)?

> > -    product_coeff_impl<DefaultTraversal,UnrollingIndex,Lhs,Rhs,RetScalar>::run(row, col, lhs, rhs, res);
> 
> oops, this line is very strange. I guess the compiler was able to optimize it
> away

Yes, even without optimization it resulted in a call to a NOP-method.


> @@ +379,5 @@
> >  template<int UnrollingIndex, typename Lhs, typename Rhs, typename Packet, int LoadMode>
> >  struct product_packet_impl<ColMajor, UnrollingIndex, Lhs, Rhs, Packet, LoadMode>
> >  {
> >    typedef typename Lhs::Index Index;
> > +  enum {LhsLoadMode = Lhs::Flags  & ActualPacketAccessBit ? Aligned : Unaligned};
> 
> I think the template parameter 'LoadMode' should still be used, but it should
> be more properly set, and perhaps renamed 'LhsLoadMode' ?

It kind of depends on the type of product whether Lhs or Rhs alignment is required, doesn't it?
Comment 6 Gael Guennebaud 2013-10-28 15:34:14 UTC
(In reply to comment #5)
> Actually, I was wondering if there is an intended difference between AlignedBit
> and PacketAccessBit?
> Eventually, I would suggest to set the PacketAccessBit if it is practically
> possible to load the expression packet-wise and introduce a PacketReadCost
> in addition to the CoeffReadCost. Then the evaluator can decide if the
> expression shall be evaluated coefficient-wise or packet-wise.

Yes, AlignedBit says that the expression can be accessed using aligned loads/stores without skipping any elements. It allows some optimizations. A PacketReadCost would be difficult to estimate because in most cases only the evaluator knows whether it can guarantee aligned or non-aligned accesses. Perhaps, with an evaluator and a top-down summation...


> > This change will affect many more expressions. I think this "&&
> > int(SrcIsAligned)" was there because we found that for small fixed size
> > objects, unligned loads were not worth the effort.
> 
> Ok, so you'd suggest to rather have CoeffBasedProduct claim to be aligned (also
> the product itself does not reside in memory)?

That's what my patch does. Previously, the AlignedBit was set only when both the rhs and lhs had it.


> > I think the template parameter 'LoadMode' should still be used, but it should
> > be more properly set, and perhaps renamed 'LhsLoadMode' ?
> 
> It kind-of depends on the type of product whether Lhs or Rhs alignment is
> required, doesn't it?

Actually, this is related to the comment above: only the evaluator knows if the loads can be aligned or not. For instance it is ok to call:

(M.lazyProduct(v)).packet<Unaligned>(1);

So I think my simple patch fixes all these issues.
Comment 7 Christoph Hertzberg 2013-10-28 17:29:53 UTC
Comment on attachment 388 [details]
Fix AlignedBit flag of coeff based products

Review of attachment 388 [details]:
-----------------------------------------------------------------

I think this patch is good. But somewhere we need to clarify what 'AlignedBit' is supposed to mean for non-DirectAccess types.

::: Eigen/src/Core/products/CoeffBasedProduct.h
@@ +85,5 @@
>        Flags = ((unsigned int)(LhsFlags | RhsFlags) & HereditaryBits & ~RowMajorBit)
>              | (EvalToRowMajor ? RowMajorBit : 0)
>              | NestingFlags
> +            | (CanVectorizeLhs ? (LhsFlags & AlignedBit) : 0)
> +            | (CanVectorizeRhs ? (RhsFlags & AlignedBit) : 0)

If you also check for SameType, this would be exactly the same as PacketAccessBit, wouldn't it?
Is checking for SameType necessary, or do we rely on this never happening, since type casting is not vectorized yet?
Comment 8 Gael Guennebaud 2013-10-28 17:54:14 UTC
I hope that's better for AlignedBit:

https://bitbucket.org/eigen/eigen/commits/cabee3ae2ac3/
Changeset:   cabee3ae2ac3
User:        ggael
Date:        2013-10-28 17:44:07
Summary:     Clarify the meaning of AlignedBit (bug 359)

And here is the real fix:

https://bitbucket.org/eigen/eigen/commits/0b95bbbf5833/
Changeset:   0b95bbbf5833
User:        ggael
Date:        2013-10-28 17:48:32
Summary:     Fix bug 359: fix AlignedBit flag of CoeffBasedProduct thus enabling the vectorization of more matrix products

It is the same as PacketAccessBit only for fixed-size matrices. Also, I've just realized that because of this "SameType" limitation CoeffBasedProduct can have AlignedBit without PacketAccessBit, which is quite stupid and in contradiction with my extended documentation of AlignedBit! Anyway, that has no practical impact.
Comment 9 Christoph Hertzberg 2013-10-29 13:15:10 UTC
Thanks, that made it somewhat clearer. 

However, it seems to me that now the documentation of PacketAccessBit basically says the same. If I understood you correctly, PacketAccessBit shall be enabled if it is theoretically possible to access elements packet-wise, regardless of the actual alignment?

Currently, for Matrix<>::Flags the AlignedBit is set if the size is dynamic or a multiple of 16 (and DontAlign is not set). On the other hand PacketAccessBit requires that the scalar type is vectorizable AND the AlignedBit is set.

E.g., Matrix<short, 8, 1> is aligned but not packet-accessible (at the moment), and Matrix<double, 2, 1, DontAlign> is neither (although unaligned packet access would be possible).
Comment 10 Gael Guennebaud 2013-10-29 13:55:12 UTC
good point, it seems there are some redundancies in our mechanism, as the vectorization of fixed-size matrices is disabled multiple times if not aligned. The one that matters the most is the one in the assignment logic. In Matrix, it's true that Matrix<double, 2, 1, DontAlign> could still have the PacketAccessBit, even though it will never be used later on.

Now let's see an example where they clearly differ:

MatrixXd A;
Matrix4d B;

A and B both have PacketAccessBit and AlignedBit.

A.block(...) and B.block(...) have PacketAccessBit but not AlignedBit. Vectorization will still be enabled for the former, but disabled for the latter.

B.col(2) has both and will be vectorized.

A.col(2) has only PacketAccessBit and will be vectorized too.
Comment 11 Nobody 2019-12-04 11:10:51 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/359.
