|Summary:||vectorization_logic fails: Matrix3().cwiseQuotient(Matrix3()) expected CompleteUnrolling, got NoUnrolling|
|Product:||Eigen||Reporter:||Benoit Jacob <jacob.benoit.1>|
|Component:||Core - expression templates||Assignee:||Benoit Steiner <benoit.steiner.goog>|
|Severity:||Failed Unit Test||CC:||benoit.steiner.goog, chtz, gael.guennebaud, jacob.benoit.1, neil.gatenby, netjoke.r.45|
|Version:||3.3 (current stable)|
|Bug Depends on:|
Description Benoit Jacob 2016-04-08 12:43:04 UTC
The vectorization_logic fails for me (Ubuntu 14.04, x86-64, GCC 4.8): Initializing random number generator with seed 1460119105 Repeating each test 10 times Src: | Linear Dst: | Lvalue | Direct | NestByRef | Packet | Linear | Direct DstXpr: N5Eigen6MatrixIiLi6ELi2ELi0ELi6ELi2EEE SrcXpr: N5Eigen13CwiseBinaryOpINS_8internal18scalar_quotient_opIiiEEKNS_6MatrixIiLi6ELi2ELi0ELi6ELi2EEES6_EE DstFlags = 58 ( | Packet | Linear | Direct ) SrcFlags = 10 ( | Linear ) DstAlignment = 16 SrcAlignment = 16 RequiredAlignment = 16 JointAlignment = 16 InnerSize = 6 InnerMaxSize = 6 PacketSize = 4 StorageOrdersAgree = 1 MightVectorize = 0 MayLinearize = 1 MayInnerVectorize = 0 MayLinearVectorize = 0 MaySliceVectorize = 0 Traversal = 1 (LinearTraversal) UnrollingLimit = 100 MayUnrollCompletely = 0 MayUnrollInner = 1 Unrolling = 0 (NoUnrolling) Expected Traversal == LinearTraversal got LinearTraversal Expected Unrolling == CompleteUnrolling got NoUnrolling Test vectorization_logic<int>::run() failed in /usr/local/google/home/benoitjacob/eigen/test/vectorization_logic.cpp (163) test_assign(Matrix3(),Matrix3().cwiseQuotient(Matrix3()), PacketTraits::HasDiv ? LinearVectorizedTraversal : LinearTraversal,CompleteUnrolling) Stack: - vectorization_logic<int>::run() - vectorization_logic - Seed: 1460119105
Comment 1 Christoph Hertzberg 2016-04-11 15:56:06 UTC
This appears to be the commit which causes the fail: https://bitbucket.org/eigen/eigen/commits/1254b8c4 Apparently the cost for division was increased so high that it is not worth to unroll anymore here. I don't know if we should increase the UnrollingLimit, or relax the test case somehow (or reduce the estimated cost of divisions?).
Comment 2 Christoph Hertzberg 2016-04-11 15:59:44 UTC
(In reply to Christoph Hertzberg from comment #1) > https://bitbucket.org/eigen/eigen/commits/1254b8c4 That's actually the last working, the next one starts failing: https://bitbucket.org/eigen/eigen/commits/b4ec7935eeaa
Comment 3 Benoit Steiner 2016-04-11 19:39:21 UTC
The cost of division was underestimated so much than it was unusable for any kind of tuning, which is why we came up with a more accurate estimate. I can update the test to use the cwiseProduct instead of cwiseQuotient to fix the issue.
Comment 4 Gael Guennebaud 2016-04-11 20:15:47 UTC
Benoit Steiner: in the Tensor module, are you using the cost estimate for other purposes than inlining/sub-expression evals? For such use cases, I don't really see the point in having such "accurate" cost estimate...
Comment 5 Benoit Steiner 2016-04-11 20:28:39 UTC
To distribute the computation over the K threads of a thread pool, we simply shard it evenly over the K threads. This works well for very large and expensive computations. However for smaller computations, the overhead of a context switch is so high that it's much more efficient to either do the computation inline in the main thread or shard it over a small subset of the available threads. This strategy has led to performance improvements of 2x to 100x (2 to 10x improvement being the common case, and 100x improvements corresponding to pathological cases). However for this strategy to work reliably we need to be able to assess the cost of a computation with some degree of accuracy, and being more than 10x off like we were for the cost estimate of a division was too much. We haven't upstreamed this code yet, but expect to do it in the next few days or week. We're also going to upstream improvements to the cost estimates of functions such as exp, log, sqrt, and so on.
Comment 6 Gael Guennebaud 2016-04-13 13:26:24 UTC
Indeed, for work balancing you definitely need more accuracy. I see you updated the unit test, so good for me, thanks.
Comment 7 Gael Guennebaud 2016-08-23 13:29:05 UTC
I'm reopening because I still think that this NumTraits::Div cost estimates have to be cleaned up: 1 - the current logic is limited to int+float+double on Haswell 2 - the architecture specific logic must be handled in arch/* 3 - adjusting the cost wrt to the vectorization status is not sound because when NumTraits<>::Div<vectorized> is called we don't know yet whether the expression is really going to be vectorized (and which packet size are going to be used...). 4 - the current values do not match the ones given there: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE,SSE2,AVX&text=div&expand=2047,2050,2059,2056,2080,2077,2050,2047,2059 Regarding point #4, for Haswell we can summarize the throughputs as follows: scalar 128 256 float 5 5 13 double 12 12 25 Then, in order to address points #1 and #2, we could remove NumTraits::Div and rather implement an internal helper structure (e.g., div_cost) implementing a generic estimate from NumTraits::MulCost. This struct could then be specialized in arch/*. Still remains points #3 that would require to completely revise the cost model so that one can query the cost for arbitrary packet size (including scalar). However, this seems to be overkill to me.
Comment 8 Gael Guennebaud 2016-08-23 13:50:59 UTC
For the record, on skylake, divisions have been significantly optimized: scalar 128 256 float 3 3 5 double 4 4 8 though there is still a penalty when moving to AVX. Here some numbers for Atom (~10x slower): scalar 128 float 30 60 double 60 120
Comment 9 Gael Guennebaud 2016-09-09 13:10:41 UTC
https://bitbucket.org/eigen/eigen/commits/90bfa4423d31/ Summary: Bug 1195: move NumTraits::Div<>::Cost to internal::scalar_div_cost (with some specializations in arch/SSE and arch/AVX)
Comment 10 Neil Gatenby 2016-09-09 14:39:12 UTC
Hi With this fix in place I'm seeing an awful lot of ... <snip>/util/XprHelper.h(676): error C2027: use of undefined type 'Eigen::internal::enable_if<false,void>' on Windows Any Help would be much appreciated Neil
Comment 11 Gael Guennebaud 2016-09-09 20:36:06 UTC
Here is the fix: https://bitbucket.org/eigen/eigen/commits/fbc536aef564/ Summary: Fix compilation on 32 bits systems.
Comment 12 Neil Gatenby 2016-09-22 07:48:08 UTC