New user self-registration is disabled due to spam. Please email eigen-core-team @ lists.tuxfamily.org if you need an account.
Before reporting a bug, please make sure that your Eigen version is up-to-date!
Bug 1195 - vectorization_logic fails: Matrix3().cwiseQuotient(Matrix3()) expected CompleteUnrolling, got NoUnrolling
Summary: vectorization_logic fails: Matrix3().cwiseQuotient(Matrix3()) expected Comple...
Status: RESOLVED FIXED
Alias: None
Product: Eigen
Classification: Unclassified
Component: Core - expression templates (show other bugs)
Version: 3.3 (current stable)
Hardware: All All
: Normal Failed Unit Test
Assignee: Benoit Steiner
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 3.4
  Show dependency treegraph
 
Reported: 2016-04-08 12:43 UTC by Benoit Jacob
Modified: 2019-01-16 15:47 UTC (History)
6 users (show)



Attachments

Description Benoit Jacob 2016-04-08 12:43:04 UTC
The vectorization_logic fails for me (Ubuntu 14.04, x86-64, GCC 4.8):

Initializing random number generator with seed 1460119105
Repeating each test 10 times
Src: 
      | Linear
Dst:  | Lvalue | Direct | NestByRef
      | Packet | Linear | Direct
DstXpr: N5Eigen6MatrixIiLi6ELi2ELi0ELi6ELi2EEE
SrcXpr: N5Eigen13CwiseBinaryOpINS_8internal18scalar_quotient_opIiiEEKNS_6MatrixIiLi6ELi2ELi0ELi6ELi2EEES6_EE
DstFlags = 58 ( | Packet | Linear | Direct )
SrcFlags = 10 ( | Linear )
DstAlignment = 16
SrcAlignment = 16
RequiredAlignment = 16
JointAlignment = 16
InnerSize = 6
InnerMaxSize = 6
PacketSize = 4
StorageOrdersAgree = 1
MightVectorize = 0
MayLinearize = 1
MayInnerVectorize = 0
MayLinearVectorize = 0
MaySliceVectorize = 0
Traversal = 1 (LinearTraversal)
UnrollingLimit = 100
MayUnrollCompletely = 0
MayUnrollInner = 1
Unrolling = 0 (NoUnrolling)

 Expected Traversal == LinearTraversal got LinearTraversal
 Expected Unrolling == CompleteUnrolling got NoUnrolling
Test vectorization_logic<int>::run() failed in /usr/local/google/home/benoitjacob/eigen/test/vectorization_logic.cpp (163)
    test_assign(Matrix3(),Matrix3().cwiseQuotient(Matrix3()), PacketTraits::HasDiv ? LinearVectorizedTraversal : LinearTraversal,CompleteUnrolling)
Stack:
  - vectorization_logic<int>::run()
  - vectorization_logic
  - Seed: 1460119105
Comment 1 Christoph Hertzberg 2016-04-11 15:56:06 UTC
This appears to be the commit which causes the fail:
https://bitbucket.org/eigen/eigen/commits/1254b8c4

Apparently the cost for division was increased so high that it is not worth to unroll anymore here.
I don't know if we should increase the UnrollingLimit, or relax the test case somehow (or reduce the estimated cost of divisions?).
Comment 2 Christoph Hertzberg 2016-04-11 15:59:44 UTC
(In reply to Christoph Hertzberg from comment #1)
> https://bitbucket.org/eigen/eigen/commits/1254b8c4
That's actually the last working, the next one starts failing:

https://bitbucket.org/eigen/eigen/commits/b4ec7935eeaa
Comment 3 Benoit Steiner 2016-04-11 19:39:21 UTC
The cost of division was underestimated so much than it was unusable for any kind of tuning, which is why we came up with a more accurate estimate. I can update the test to use the cwiseProduct instead of cwiseQuotient to fix the issue.
Comment 4 Gael Guennebaud 2016-04-11 20:15:47 UTC
Benoit Steiner: in the Tensor module, are you using the cost estimate for other purposes than inlining/sub-expression evals? For such use cases, I don't really see the point in having such "accurate" cost estimate...
Comment 5 Benoit Steiner 2016-04-11 20:28:39 UTC
To distribute the computation over the K threads of a thread pool, we simply shard it evenly over the K threads. This works well for very large and expensive computations. However for smaller computations, the overhead of a context switch is so high that it's much more efficient to either do the computation inline in the main thread or shard it over a small subset of the available threads.

This strategy has led to performance improvements of 2x to 100x (2 to 10x improvement being the common case, and 100x improvements corresponding to pathological cases). However for this strategy to work reliably we need to be able to assess the cost of a computation with some degree of accuracy, and being more than 10x off like we were for the cost estimate of a division was too much.

We haven't upstreamed this code yet, but expect to do it in the next few days or week. We're also going to upstream improvements to the cost estimates of functions such as exp, log, sqrt, and so on.
Comment 6 Gael Guennebaud 2016-04-13 13:26:24 UTC
Indeed, for work balancing you definitely need more accuracy.

I see you updated the unit test, so good for me, thanks.
Comment 7 Gael Guennebaud 2016-08-23 13:29:05 UTC
I'm reopening because I still think that this NumTraits::Div cost estimates have to be cleaned up:

1 - the current logic is limited to int+float+double on Haswell
2 - the architecture specific logic must be handled in arch/*
3 - adjusting the cost wrt to the vectorization status is not sound because when NumTraits<>::Div<vectorized> is called we don't know yet whether the expression is really going to be vectorized (and which packet size are going to be used...).
4 - the current values do not match the ones given there: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE,SSE2,AVX&text=div&expand=2047,2050,2059,2056,2080,2077,2050,2047,2059

Regarding point #4, for Haswell we can summarize the throughputs as follows:

         scalar  128  256
float         5    5   13
double       12   12   25

Then, in order to address points #1 and #2, we could remove NumTraits::Div and rather implement an internal helper structure (e.g., div_cost) implementing a generic estimate from NumTraits::MulCost. This struct could then be specialized in arch/*.

Still remains points #3 that would require to completely revise the cost model so that one can query the cost for arbitrary packet size (including scalar). However, this seems to be overkill to me.
Comment 8 Gael Guennebaud 2016-08-23 13:50:59 UTC
For the record, on skylake, divisions have been significantly optimized:

         scalar  128  256
float         3    3    5
double        4    4    8

though there is still a penalty when moving to AVX.

Here some numbers for Atom (~10x slower):

         scalar  128
float        30   60
double       60  120
Comment 9 Gael Guennebaud 2016-09-09 13:10:41 UTC
https://bitbucket.org/eigen/eigen/commits/90bfa4423d31/
Summary:     Bug 1195: move NumTraits::Div<>::Cost to internal::scalar_div_cost (with some specializations in arch/SSE and arch/AVX)
Comment 10 Neil Gatenby 2016-09-09 14:39:12 UTC
Hi

With this fix in place I'm seeing an awful lot of ...

  <snip>/util/XprHelper.h(676): error C2027: use of undefined type 'Eigen::internal::enable_if<false,void>'

on Windows

Any Help would be much appreciated

Neil
Comment 11 Gael Guennebaud 2016-09-09 20:36:06 UTC
Here is the fix:
https://bitbucket.org/eigen/eigen/commits/fbc536aef564/
Summary:     Fix compilation on 32 bits systems.
Comment 12 Neil Gatenby 2016-09-22 07:48:08 UTC
Much appreciated

Note You need to log in before you can comment on or make changes to this bug.