1195
2016-04-08 12:43:04 +0000
vectorization_logic fails: Matrix3().cwiseQuotient(Matrix3()) expected CompleteUnrolling, got NoUnrolling
2019-01-16 15:47:43 +0000
1
1
1
Unclassified
Eigen
Core - expression templates
3.3 (current stable)
All
All
RESOLVED
FIXED
Normal
Failed Unit Test
---
814
1
jacob.benoit.1
benoit.steiner.goog
benoit.steiner.goog
chtz
gael.guennebaud
jacob.benoit.1
neil.gatenby
netjoke.r.45
oldest_to_newest
5634
0
jacob.benoit.1
2016-04-08 12:43:04 +0000
The vectorization_logic fails for me (Ubuntu 14.04, x86-64, GCC 4.8):
Initializing random number generator with seed 1460119105
Repeating each test 10 times
Src:
| Linear
Dst: | Lvalue | Direct | NestByRef
| Packet | Linear | Direct
DstXpr: N5Eigen6MatrixIiLi6ELi2ELi0ELi6ELi2EEE
SrcXpr: N5Eigen13CwiseBinaryOpINS_8internal18scalar_quotient_opIiiEEKNS_6MatrixIiLi6ELi2ELi0ELi6ELi2EEES6_EE
DstFlags = 58 ( | Packet | Linear | Direct )
SrcFlags = 10 ( | Linear )
DstAlignment = 16
SrcAlignment = 16
RequiredAlignment = 16
JointAlignment = 16
InnerSize = 6
InnerMaxSize = 6
PacketSize = 4
StorageOrdersAgree = 1
MightVectorize = 0
MayLinearize = 1
MayInnerVectorize = 0
MayLinearVectorize = 0
MaySliceVectorize = 0
Traversal = 1 (LinearTraversal)
UnrollingLimit = 100
MayUnrollCompletely = 0
MayUnrollInner = 1
Unrolling = 0 (NoUnrolling)
Expected Traversal == LinearTraversal got LinearTraversal
Expected Unrolling == CompleteUnrolling got NoUnrolling
Test vectorization_logic<int>::run() failed in /usr/local/google/home/benoitjacob/eigen/test/vectorization_logic.cpp (163)
test_assign(Matrix3(),Matrix3().cwiseQuotient(Matrix3()), PacketTraits::HasDiv ? LinearVectorizedTraversal : LinearTraversal,CompleteUnrolling)
Stack:
- vectorization_logic<int>::run()
- vectorization_logic
- Seed: 1460119105
5653
1
chtz
2016-04-11 15:56:06 +0000
This appears to be the commit which causes the fail:
https://bitbucket.org/eigen/eigen/commits/1254b8c4
Apparently the cost for division was increased so high that it is not worth to unroll anymore here.
I don't know if we should increase the UnrollingLimit, or relax the test case somehow (or reduce the estimated cost of divisions?).
5654
2
chtz
2016-04-11 15:59:44 +0000
(In reply to Christoph Hertzberg from comment #1)
> https://bitbucket.org/eigen/eigen/commits/1254b8c4
That's actually the last working, the next one starts failing:
https://bitbucket.org/eigen/eigen/commits/b4ec7935eeaa
5655
3
benoit.steiner.goog
2016-04-11 19:39:21 +0000
The cost of division was underestimated so much than it was unusable for any kind of tuning, which is why we came up with a more accurate estimate. I can update the test to use the cwiseProduct instead of cwiseQuotient to fix the issue.
5656
4
gael.guennebaud
2016-04-11 20:15:47 +0000
Benoit Steiner: in the Tensor module, are you using the cost estimate for other purposes than inlining/sub-expression evals? For such use cases, I don't really see the point in having such "accurate" cost estimate...
5657
5
benoit.steiner.goog
2016-04-11 20:28:39 +0000
To distribute the computation over the K threads of a thread pool, we simply shard it evenly over the K threads. This works well for very large and expensive computations. However for smaller computations, the overhead of a context switch is so high that it's much more efficient to either do the computation inline in the main thread or shard it over a small subset of the available threads.
This strategy has led to performance improvements of 2x to 100x (2 to 10x improvement being the common case, and 100x improvements corresponding to pathological cases). However for this strategy to work reliably we need to be able to assess the cost of a computation with some degree of accuracy, and being more than 10x off like we were for the cost estimate of a division was too much.
We haven't upstreamed this code yet, but expect to do it in the next few days or week. We're also going to upstream improvements to the cost estimates of functions such as exp, log, sqrt, and so on.
5664
6
gael.guennebaud
2016-04-13 13:26:24 +0000
Indeed, for work balancing you definitely need more accuracy.
I see you updated the unit test, so good for me, thanks.
6094
7
gael.guennebaud
2016-08-23 13:29:05 +0000
I'm reopening because I still think that this NumTraits::Div cost estimates have to be cleaned up:
1 - the current logic is limited to int+float+double on Haswell
2 - the architecture specific logic must be handled in arch/*
3 - adjusting the cost wrt to the vectorization status is not sound because when NumTraits<>::Div<vectorized> is called we don't know yet whether the expression is really going to be vectorized (and which packet size are going to be used...).
4 - the current values do not match the ones given there: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE,SSE2,AVX&text=div&expand=2047,2050,2059,2056,2080,2077,2050,2047,2059
Regarding point #4, for Haswell we can summarize the throughputs as follows:
scalar 128 256
float 5 5 13
double 12 12 25
Then, in order to address points #1 and #2, we could remove NumTraits::Div and rather implement an internal helper structure (e.g., div_cost) implementing a generic estimate from NumTraits::MulCost. This struct could then be specialized in arch/*.
Still remains points #3 that would require to completely revise the cost model so that one can query the cost for arbitrary packet size (including scalar). However, this seems to be overkill to me.
6095
8
gael.guennebaud
2016-08-23 13:50:59 +0000
For the record, on skylake, divisions have been significantly optimized:
scalar 128 256
float 3 3 5
double 4 4 8
though there is still a penalty when moving to AVX.
Here some numbers for Atom (~10x slower):
scalar 128
float 30 60
double 60 120
6165
9
gael.guennebaud
2016-09-09 13:10:41 +0000
https://bitbucket.org/eigen/eigen/commits/90bfa4423d31/
Summary: Bug 1195: move NumTraits::Div<>::Cost to internal::scalar_div_cost (with some specializations in arch/SSE and arch/AVX)
6166
10
neil.gatenby
2016-09-09 14:39:12 +0000
Hi
With this fix in place I'm seeing an awful lot of ...
<snip>/util/XprHelper.h(676): error C2027: use of undefined type 'Eigen::internal::enable_if<false,void>'
on Windows
Any Help would be much appreciated
Neil
6169
11
gael.guennebaud
2016-09-09 20:36:06 +0000
Here is the fix:
https://bitbucket.org/eigen/eigen/commits/fbc536aef564/
Summary: Fix compilation on 32 bits systems.
6236
12
neil.gatenby
2016-09-22 07:48:08 +0000
Much appreciated