Bug 256 - (0x100) Assigning a vectorizable expression to an unaligned destination loses vectorization

Status: DECISIONNEEDED
Product: Eigen
Classification: Unclassified
Component: Core - vectorization
Version: 3.0
Hardware/OS: All / All
Importance: Normal Optimization
Assigned To: Nobody
Depends on: ExprEval
Blocks:
Reported: 2011-04-26 18:00 UTC by Christoph Hertzberg
Modified: 2017-08-22 14:15 UTC
CC: 2 users



Attachments:
hand-coded test (712 bytes, text/source), 2011-04-26 21:31 UTC, Christoph Hertzberg
a benchmark for unaligned vectorization of small fixed size objects (2.51 KB, text/plain), 2011-04-26 22:13 UTC, Gael Guennebaud
Enable unaligned vectorization (16.12 KB, patch), 2016-05-24 11:38 UTC, Gael Guennebaud

Description Christoph Hertzberg 2011-04-26 18:00:51 UTC
Functions such as

void minus_test(Eigen::Vector2d &a, Eigen::Vector2d &b, Eigen::Vector3d &d){
        d.head<2>() = a - b;
}

could easily use vectorization of (a - b), requiring just a single unaligned store to d.head<2>() (similar use-cases with Map<Vector2d> exist).

One question is to what extent this would be worthwhile (e.g., unaligned = aligned + unaligned would require just one additional unaligned load).
If the logic to do this automatically gets too complicated, maybe some modifier could be introduced:

        d.head<2>().packetWise() = a - b.packetWise();

where packetWise() returns a wrapper that overrides packet() and copyPacket() and pretends to be Aligned.
Comment 1 Gael Guennebaud 2011-04-26 20:40:57 UTC
For small fixed sizes, unaligned loads or stores are much too expensive to be worthwhile. You can easily check that by tweaking the logic in Assign.h or by writing some manual SSE code.
Comment 2 Christoph Hertzberg 2011-04-26 21:31:11 UTC
Created attachment 163 [details]
hand-coded test

Test case showing that an unaligned store can be worthwhile (not intended to win a code beauty contest).
Compile with -msse2 -O2 -DNDEBUG and optionally with -DUNALIGNED_PACKET.
I got a performance gain of about 50% on my AMD Phenom(tm) II X6 1055T.
Comment 3 Christoph Hertzberg 2011-04-26 21:36:52 UTC
(In reply to comment #1)
> for small fixed sizes unaligned loads or stores are much too expensive to be
> worthwhile. You can easily check that by tweaking the logic in the Assign.h or
> writing some manual SSE code.

Just attached a simple test case. For float the performance gain is even slightly higher. Also, the generated code seems to be much smaller.
Comment 4 Gael Guennebaud 2011-04-26 22:08:09 UTC
Hm, here (Intel Core2), the "unaligned" version is twice as slow (17s vs 11s). Moreover, your example is biased in the sense that the res buffer is actually aligned, and the unaligned store instruction is much faster on aligned pointers.

Nevertheless, I have another test case that shows that thanks to our fast unaligned store implementation, vectorizing small unaligned objects might still be worthwhile.
Comment 5 Gael Guennebaud 2011-04-26 22:13:16 UTC
Created attachment 164 [details]
a benchmark for unaligned vectorization of small fixed size objects
Comment 6 Gael Guennebaud 2011-04-26 22:14:32 UTC
for the record I get this output on my core2:

unaligned       default         aligned
0.00152814      0.00192138      0.000358575
Comment 7 Gael Guennebaud 2011-04-27 11:43:12 UTC
Here are the results on various CPUs, for A = 2 * B + C where A, B, and C are unaligned fixed size objects of size 2*packet_size (i.e., 8 for floats, and 4 for double). The generated ASM looks pretty good in all cases.


Intel(R) Core(TM)2 Quad CPU    Q9400  @ 2.66GHz
        unaligned       default         aligned
float   0.000450109     0.000637118     0.000263085
double  0.000487511     0.000375307     0.000300493


Intel(R) Xeon(R) CPU           X5560  @ 2.80GHz
        unaligned       default         aligned
float   0.00037499      0.000530938     0.00021904
double  0.000343805     0.000312618     0.000250232


Six-Core AMD Opteron(tm) Processor 8439 SE
        unaligned       default         aligned
float   0.000509662     0.00071595      0.000286267
double  0.000393508     0.000381583     0.000286296


Intel(R) Xeon(R) CPU           E5540  @ 2.53GHz
        unaligned       default         aligned
float   0.000430173     0.000609098     0.000439273
double  0.000689779     0.000627162     0.000501908


Intel(R) Xeon(R) CPU E7- 4870  @ 2.40GHz
        unaligned       default         aligned
float   0.000417886     0.00070977      0.00029277
double  0.000459579     0.000417881     0.000334474



All the results seem to be pretty consistent: unaligned vectorization is worth it for float but not for double, though the overhead is not high.
Comment 8 Christoph Hertzberg 2011-04-27 14:08:31 UTC
(In reply to comment #7)

> all the results seem to be pretty consistent: unaligned vectorization is worth
> it for float but not for double though the overhead is not high.

And this was the extreme case where everything is unaligned. My original case was that just the destination is unaligned, which I assume is worth vectorizing even for double. I guess it is really hard to tell, and very machine dependent, for which operations the overhead is compensated. E.g., just doing a copy would hardly ever be worthwhile, but I guess a matrix-vector product is worth the effort even with source and destination vectors unaligned, as long as the matrix is aligned.
Comment 9 Christoph Hertzberg 2014-09-07 15:55:37 UTC
For newer hardware the cost of unaligned loads/stores tends to get smaller, which makes it impossible to optimize for all processors at the same time.
It might sometimes be worthwhile to evaluate sub-expressions in an aligned manner (this requires bug 99).
Comment 10 Gael Guennebaud 2016-05-24 11:38:36 UTC
Created attachment 709 [details]
Enable unaligned vectorization

I guess it is time to reflect HW improvements in unaligned loads/stores: this patch enables vectorization for unaligned input/output, regardless of the object size and hardware architecture. The user can still disable this feature by defining EIGEN_UNALIGNED_VECTORIZE=0.
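For reference, the macro can be set on the compiler command line (-DEIGEN_UNALIGNED_VECTORIZE=0) or in code, in which case it must come before the first Eigen include:

```cpp
// Opt out of the unaligned vectorization paths enabled by this patch;
// must be defined before any Eigen header is included.
#define EIGEN_UNALIGNED_VECTORIZE 0
#include <Eigen/Core>
```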


On recent x86, we could even go further and replace the aligned load/store intrinsics with unaligned ones for SSE, for which there is no overhead anymore. This way, people who do not enable AVX would not have to care about alignment issues at all...
Comment 11 Gael Guennebaud 2016-05-24 19:59:18 UTC
I applied an updated version of the previous patch:

https://bitbucket.org/eigen/eigen/commits/6c2dc56e73b3/
Date:        2016-05-24 19:54:03+00:00
Summary:     Bug 256: enable vectorization with unaligned loads/stores.
This concerns all architectures and all sizes.
This new behavior can be disabled by defining EIGEN_UNALIGNED_VECTORIZE=0
Comment 12 Christoph Hertzberg 2016-05-24 20:07:16 UTC
I agree that it was time to enable vectorization in these cases. We should of course document that new macro.
Furthermore, we may consider partial vectorization of (constant size) expressions which are not multiples of packet sizes.
Comment 13 Gael Guennebaud 2016-05-24 20:52:37 UTC
This is already partly the case; the current condition is:

int(InnerMaxSize)>=3*InnerPacketSize

But this is for the "SliceVectorizedTraversal" path, which performs a runtime re-alignment for each column. When unaligned stores have zero overhead, we could instead complete the unrolling paths with half-packet and scalar iterations for the remaining coefficients...
Comment 14 Gael Guennebaud 2016-05-24 21:25:40 UTC
For the record, here are some relative timings for unaligned loads/stores on Haswell (relative to aligned load/store):

Load:
type : offset : u-SSE ; u-AVX
Ref: 0.000839483 0.00049734
f    :  0  :  99  ;  100
f    :  1  : 100  ;  101
f    :  2  : 100  ;  101
f    :  3  : 100  ;  101
f    :  4  :  99  ;  101
f    :  5  : 100  ;  101
f    :  6  : 100  ;  101
f    :  7  : 106  ;  101
Ref: 0.00162389 0.000943376
d    :  0  :  99  ;   99
d    :  1  : 109  ;  120
d    :  2  : 100  ;  120
d    :  3  : 109  ;  120

Store:
type : offset : u-SSE ; u-AVX : s-AVX
Ref: 0.000633079 0.000427755
f    :  0  : 100  ;   86  :  663
f    :  1  : 108  ;  126  :  663
f    :  2  : 108  ;  126  :  663
f    :  3  : 108  ;  126  :  663
f    :  4  :  99  ;  126  :  663
f    :  5  : 108  ;  126  :  663
f    :  6  : 108  ;  126  :  663
f    :  7  : 108  ;  126  :  663
Ref: 0.00126168 0.000633104
d    :  0  : 103  ;  100  :  649
d    :  1  : 137  ;  168  :  649
d    :  2  :  99  ;  168  :  649
d    :  3  : 137  ;  168  :  649

"s-AVX" is the "streaming" store instruction.

As we can see, the overhead is marginal, and most interestingly, if the input happens to be aligned (offset=0), calling the unaligned instruction has no overhead at all.
Comment 15 Christoph Hertzberg 2017-08-03 17:14:49 UTC
(In reply to Gael Guennebaud from comment #11)
> I applied an updated version on the previous patch:
> 
> https://bitbucket.org/eigen/eigen/commits/6c2dc56e73b3/

That commit seems to break things with clang 3.9, for some specific cases:
https://stackoverflow.com/questions/45469057/

Quite likely this is a bug in clang 3.9 (it works fine with 3.8 and 4.0); however, we could disable EIGEN_UNALIGNED_VECTORIZE by default on clang 3.9.
Comment 16 Gael Guennebaud 2017-08-22 14:15:29 UTC
I'm not sure; it seems to me that the same clang bug could show up at any time, even with EIGEN_UNALIGNED_VECTORIZE off. The SO example is very specific, and changing only one bit of this example is enough to switch off the bug, like:

cwiseAbs -> cwiseAbs2
norm -> sum
(1,2) -> (0,2)
etc.
