The vectorization_logic fails for me (Ubuntu 14.04, x86-64, GCC 4.8):
Initializing random number generator with seed 1460119105
Repeating each test 10 times
Dst: | Lvalue | Direct | NestByRef
| Packet | Linear | Direct
DstFlags = 58 ( | Packet | Linear | Direct )
SrcFlags = 10 ( | Linear )
DstAlignment = 16
SrcAlignment = 16
RequiredAlignment = 16
JointAlignment = 16
InnerSize = 6
InnerMaxSize = 6
PacketSize = 4
StorageOrdersAgree = 1
MightVectorize = 0
MayLinearize = 1
MayInnerVectorize = 0
MayLinearVectorize = 0
MaySliceVectorize = 0
Traversal = 1 (LinearTraversal)
UnrollingLimit = 100
MayUnrollCompletely = 0
MayUnrollInner = 1
Unrolling = 0 (NoUnrolling)
Expected Traversal == LinearTraversal got LinearTraversal
Expected Unrolling == CompleteUnrolling got NoUnrolling
Test vectorization_logic<int>::run() failed in /usr/local/google/home/benoitjacob/eigen/test/vectorization_logic.cpp (163)
test_assign(Matrix3(),Matrix3().cwiseQuotient(Matrix3()), PacketTraits::HasDiv ? LinearVectorizedTraversal : LinearTraversal,CompleteUnrolling)
- Seed: 1460119105
This appears to be the commit which causes the fail:
Apparently the cost for division was increased so high that it is not worth to unroll anymore here.
I don't know if we should increase the UnrollingLimit, or relax the test case somehow (or reduce the estimated cost of divisions?).
(In reply to Christoph Hertzberg from comment #1)
That's actually the last working, the next one starts failing:
The cost of division was underestimated so much than it was unusable for any kind of tuning, which is why we came up with a more accurate estimate. I can update the test to use the cwiseProduct instead of cwiseQuotient to fix the issue.
Benoit Steiner: in the Tensor module, are you using the cost estimate for other purposes than inlining/sub-expression evals? For such use cases, I don't really see the point in having such "accurate" cost estimate...
To distribute the computation over the K threads of a thread pool, we simply shard it evenly over the K threads. This works well for very large and expensive computations. However for smaller computations, the overhead of a context switch is so high that it's much more efficient to either do the computation inline in the main thread or shard it over a small subset of the available threads.
This strategy has led to performance improvements of 2x to 100x (2 to 10x improvement being the common case, and 100x improvements corresponding to pathological cases). However for this strategy to work reliably we need to be able to assess the cost of a computation with some degree of accuracy, and being more than 10x off like we were for the cost estimate of a division was too much.
We haven't upstreamed this code yet, but expect to do it in the next few days or week. We're also going to upstream improvements to the cost estimates of functions such as exp, log, sqrt, and so on.
Indeed, for work balancing you definitely need more accuracy.
I see you updated the unit test, so good for me, thanks.
I'm reopening because I still think that this NumTraits::Div cost estimates have to be cleaned up:
1 - the current logic is limited to int+float+double on Haswell
2 - the architecture specific logic must be handled in arch/*
3 - adjusting the cost wrt to the vectorization status is not sound because when NumTraits<>::Div<vectorized> is called we don't know yet whether the expression is really going to be vectorized (and which packet size are going to be used...).
4 - the current values do not match the ones given there: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE,SSE2,AVX&text=div&expand=2047,2050,2059,2056,2080,2077,2050,2047,2059
Regarding point #4, for Haswell we can summarize the throughputs as follows:
scalar 128 256
float 5 5 13
double 12 12 25
Then, in order to address points #1 and #2, we could remove NumTraits::Div and rather implement an internal helper structure (e.g., div_cost) implementing a generic estimate from NumTraits::MulCost. This struct could then be specialized in arch/*.
Still remains points #3 that would require to completely revise the cost model so that one can query the cost for arbitrary packet size (including scalar). However, this seems to be overkill to me.
For the record, on skylake, divisions have been significantly optimized:
scalar 128 256
float 3 3 5
double 4 4 8
though there is still a penalty when moving to AVX.
Here some numbers for Atom (~10x slower):
float 30 60
double 60 120
Summary: Bug 1195: move NumTraits::Div<>::Cost to internal::scalar_div_cost (with some specializations in arch/SSE and arch/AVX)
With this fix in place I'm seeing an awful lot of ...
<snip>/util/XprHelper.h(676): error C2027: use of undefined type 'Eigen::internal::enable_if<false,void>'
Any Help would be much appreciated
Here is the fix:
Summary: Fix compilation on 32 bits systems.