Functions such as void minus_test(Eigen::Vector2d &a, Eigen::Vector2d &b, Eigen::Vector3d &d){ d.head<2>() = a - b; } could easily vectorize (a - b), requiring just a single unaligned store to d.head<2>() (similar use cases exist with Map<Vector2d>). One question is to what extent this would be worthwhile (e.g., unaligned = aligned + unaligned would require just one additional unaligned load). If the logic to do this automatically gets too complicated, maybe a modifier could be introduced: d.head<2>().packetWise() = a - b.packetWise(); where packetWise returns a wrapper that overrides packet() and copyPacket() and pretends to be Aligned.
For small fixed sizes, unaligned loads or stores are much too expensive to be worthwhile. You can easily check that by tweaking the logic in Assign.h or by writing some manual SSE code.
Created attachment 163 [details] hand-coded test

Test case showing that an unaligned store can be worthwhile (not intended to win a code beauty contest). Compile with -msse2 -O2 -DNDEBUG, and optionally with -DUNALIGNED_PACKET. I got a performance gain of about 50% on my AMD Phenom(tm) II X6 1055T.
(In reply to comment #1)
> for small fixed sizes unaligned loads or stores are much too expensive to be
> worthwhile. You can easily check that by tweaking the logic in the Assign.h or
> writing some manual SSE code.

Just attached a simple test case. For float the performance gain is even slightly higher. Also, the generated code seems to be much smaller.
Hm, here (Intel Core2), the "unaligned" version is twice as slow (17s vs 11s). Moreover, your example is biased in the sense that the res buffer is actually aligned, and the unaligned store instruction is much faster on aligned pointers. Nevertheless, I have another test case showing that, thanks to our fast unaligned store implementation, vectorizing small unaligned objects might still be worthwhile.
Created attachment 164 [details] a benchmark for unaligned vectorization of small fixed size objects
For the record, I get this output on my Core2:

unaligned    default      aligned
0.00152814   0.00192138   0.000358575
Here are the results on various CPUs for A = 2 * B + C, where A, B, and C are unaligned fixed-size objects of size 2*packet_size (i.e., 8 for float and 4 for double). The generated ASM looks pretty good in all cases.

Intel(R) Core(TM)2 Quad CPU Q9400 @ 2.66GHz
         unaligned     default       aligned
float    0.000450109   0.000637118   0.000263085
double   0.000487511   0.000375307   0.000300493

Intel(R) Xeon(R) CPU X5560 @ 2.80GHz
         unaligned     default       aligned
float    0.00037499    0.000530938   0.00021904
double   0.000343805   0.000312618   0.000250232

Six-Core AMD Opteron(tm) Processor 8439 SE
         unaligned     default       aligned
float    0.000509662   0.00071595    0.000286267
double   0.000393508   0.000381583   0.000286296

Intel(R) Xeon(R) CPU E5540 @ 2.53GHz
         unaligned     default       aligned
float    0.000430173   0.000609098   0.000439273
double   0.000689779   0.000627162   0.000501908

Intel(R) Xeon(R) CPU E7- 4870 @ 2.40GHz
         unaligned     default       aligned
float    0.000417886   0.00070977    0.00029277
double   0.000459579   0.000417881   0.000334474

All the results seem to be pretty consistent: unaligned vectorization is worth it for float but not for double, though the overhead is not high.
(In reply to comment #7)
> all the results seem to be pretty consistent: unaligned vectorization is worth
> it for float but not for double though the overhead is not high.

And this was the extreme case where everything is unaligned. My original case was that only the destination is unaligned, which I assume is worth vectorizing even for double. I guess it is really hard to tell, and very machine dependent, for which operations the overhead is compensated. E.g., just doing a copy would hardly ever be worthwhile, but I guess a matrix-vector product is worth the effort even with the source and destination vectors unaligned, as long as the matrix is aligned.
For newer hardware the cost of unaligned loads/stores tends to get smaller, which makes it impossible to optimize for all processors at the same time. It might sometimes be worthwhile to evaluate sub-expressions in an aligned manner (this requires bug 99).
Created attachment 709 [details] Enable unaligned vectorization

I guess it is time to reflect HW improvements in unaligned loads/stores: this patch enables vectorization for unaligned input/output, regardless of the object size and hardware architecture. The user can still control this feature by defining EIGEN_UNALIGNED_VECTORIZE=0.

On recent x86, we could even go further and replace the aligned load/store intrinsics with unaligned ones for SSE, for which there is no overhead anymore. This way, people who do not enable AVX would not have to care about alignment issues at all...
I applied an updated version of the previous patch:

https://bitbucket.org/eigen/eigen/commits/6c2dc56e73b3/
Date: 2016-05-24 19:54:03+00:00
Summary: Bug 256: enable vectorization with unaligned loads/stores. This concerns all architectures and all sizes. This new behavior can be disabled by defining EIGEN_UNALIGNED_VECTORIZE=0
I agree that it was time to enable vectorization in these cases. We should of course document that new macro. Furthermore, we may consider partial vectorization of (constant size) expressions which are not multiples of packet sizes.
This is already partly the case; the current condition is: int(InnerMaxSize)>=3*InnerPacketSize. But this is for the "SliceVectorizedTraversal" path, which performs a runtime re-alignment for each column. When unaligned stores have zero overhead, we could instead complete the unrolling paths with half-packet and scalar iterations for the remaining coefficients...
For the record, here are some relative timings for unaligned loads/stores on Haswell (relative to aligned load/store):

Load:
type : offset : u-SSE ; u-AVX
Ref: 0.000839483 0.00049734
f : 0 :  99 ; 100
f : 1 : 100 ; 101
f : 2 : 100 ; 101
f : 3 : 100 ; 101
f : 4 :  99 ; 101
f : 5 : 100 ; 101
f : 6 : 100 ; 101
f : 7 : 106 ; 101
Ref: 0.00162389 0.000943376
d : 0 :  99 ;  99
d : 1 : 109 ; 120
d : 2 : 100 ; 120
d : 3 : 109 ; 120

Store:
type : offset : u-SSE ; u-AVX : s-AVX
Ref: 0.000633079 0.000427755
f : 0 : 100 ;  86 : 663
f : 1 : 108 ; 126 : 663
f : 2 : 108 ; 126 : 663
f : 3 : 108 ; 126 : 663
f : 4 :  99 ; 126 : 663
f : 5 : 108 ; 126 : 663
f : 6 : 108 ; 126 : 663
f : 7 : 108 ; 126 : 663
Ref: 0.00126168 0.000633104
d : 0 : 103 ; 100 : 649
d : 1 : 137 ; 168 : 649
d : 2 :  99 ; 168 : 649
d : 3 : 137 ; 168 : 649

"s-AVX" is the "streaming" store instruction. As we can see, the overhead is marginal, and most interestingly, if the input happens to be aligned (offset=0), then calling the unaligned instruction has no overhead at all.
(In reply to Gael Guennebaud from comment #11)
> I applied an updated version on the previous patch:
> https://bitbucket.org/eigen/eigen/commits/6c2dc56e73b3/

That commit seems to break things with clang 3.9 in some specific cases: https://stackoverflow.com/questions/45469057/ Quite likely this is a bug in clang 3.9 (it works fine with 3.8 and 4.0); however, we could disable EIGEN_UNALIGNED_VECTORIZE by default on clang 3.9.
I'm not sure; it seems to me that the same clang bug could show up anytime, even with EIGEN_UNALIGNED_VECTORIZE off. The SO example is very specific, and changing just one bit of it is enough to make the bug disappear, e.g.: cwiseAbs -> cwiseAbs2, norm -> sum, (1,2) -> (0,2), etc.
-- GitLab Migration Automatic Message -- This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/256.