Created attachment 749 [details]
Benchmark for the performance regression

In the attached benchmark, I observe a fairly significant performance difference (~48% slower with Eigen 3.3) between the 3.2 and 3.3 branches of Eigen. This is on an Intel(R) Xeon(R) CPU E3-1276 v3 @ 3.60GHz using g++ 4.9.2 with the following compilation flags:

> g++ -std=c++11 -Wno-deprecated -Ofast -DNDEBUG -fno-finite-math-only -I eigen eigen_bench.cpp

A large part of the performance difference goes away if I disable unaligned vectorization (-DEIGEN_UNALIGNED_VECTORIZE=0), but a small difference of ~9% remains. That remaining difference goes away when I completely disable vectorization (-DEIGEN_DONT_VECTORIZE).

The run-times (multiple runs, fastest one taken) I see are as follows:

Eigen 3.2.10: 0.91s
Eigen 3.2.10: 0.91s (EIGEN_DONT_VECTORIZE)
Eigen 3.3.0 : 1.35s
Eigen 3.3.0 : 0.99s (EIGEN_UNALIGNED_VECTORIZE=0)
Eigen 3.3.0 : 0.91s (EIGEN_DONT_VECTORIZE)
My unfounded gut feeling is that the sub-vector accesses (i.e. head<>() and segment<>()) may be the root cause, but I don't know for sure. Almost everything else consists of individual scalar accesses.
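For illustration, the kind of mixture I mean looks roughly like this (a hypothetical sketch, not the attached benchmark; the vector size and the expressions are made up):

#include <Eigen/Dense>

using Vec8 = Eigen::Matrix<double, 8, 1>;

double kernel(Vec8& augState)
{
    // fixed-size sub-vector views that Eigen 3.3 may try to vectorize
    augState.segment<3>(1) += augState.tail<3>();
    // individual scalar accesses
    augState(0) += augState(7);
    return augState.head<5>().sum();
}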
Created attachment 750 [details]
updated benchmark

The original benchmark got optimized away in the eigen-3.2.10 case, so here is a slightly modified version.
Eigen 3.3 results in code that seems about 4-5 times slower; at the very least it contains about twice as many instructions. The assembly output for the inner loop of the updated benchmark looks like this:

# Eigen 3.2.10, 0.21s
.L363:
    pxor      %xmm0, %xmm0
    cvtsi2sd  %edx, %xmm0
    movapd    %xmm0, %xmm1
    movapd    %xmm0, %xmm11
    mulsd     %xmm0, %xmm1
    addsd     %xmm0, %xmm6
    addsd     %xmm0, %xmm2
    movapd    %xmm1, %xmm10
    movsd     %xmm6, 16(%rsp)
    movsd     %xmm2, 48(%rsp)
    addsd     %xmm1, %xmm10
    addsd     %xmm10, %xmm1
    mulsd     %xmm9, %xmm1
    divsd     %xmm0, %xmm1
    subsd     %xmm1, %xmm11
    movapd    %xmm11, %xmm1
    mulsd     %xmm8, %xmm1
    movapd    %xmm1, %xmm10
    divsd     %xmm0, %xmm1
    addsd     %xmm0, %xmm10
    addsd     %xmm10, %xmm5
    movsd     %xmm5, 24(%rsp)
    addsd     %xmm0, %xmm1
    addsd     %xmm1, %xmm4
    movapd    %xmm0, %xmm1
    addsd     %xmm7, %xmm1
    movsd     %xmm4, 32(%rsp)
    addsd     %xmm1, %xmm3
    movsd     %xmm3, 40(%rsp)
    addl      $1, %edx
    cmpl      %edx, %ebx
    jg        .L363

versus

# Eigen 3.3.0, 0.95s
.L363:
    pxor      %xmm0, %xmm0
    addl      $1, %edx
    leaq      64(%rsp), %rcx
    cvtsi2sd  %edx, %xmm0
    movapd    %xmm0, %xmm1
    movsd     %xmm0, 80(%rsp)
    unpcklpd  %xmm1, %xmm1
    movups    %xmm1, 48(%rsp)
    movups    %xmm1, 64(%rsp)
    movupd    48(%rsp), %xmm1
    movups    %xmm1, (%rbx)
    movupd    (%rcx), %xmm1
    movups    %xmm1, 16(%rbx)
    movsd     %xmm0, 128(%rsp)
    movsd     120(%rsp), %xmm3
    movupd    8(%rbx), %xmm1
    mulpd     %xmm1, %xmm1
    movsd     96(%rsp), %xmm7
    movapd    %xmm1, %xmm2
    unpckhpd  %xmm1, %xmm2
    addsd     %xmm2, %xmm1
    movapd    %xmm3, %xmm2
    mulsd     %xmm3, %xmm2
    addsd     %xmm2, %xmm1
    mulsd     %xmm6, %xmm1
    divsd     %xmm7, %xmm1
    subsd     %xmm1, %xmm0
    mulsd     %xmm5, %xmm0
    movsd     %xmm0, 136(%rsp)
    divsd     %xmm7, %xmm0
    movsd     .LC12(%rip), %xmm7
    movsd     %xmm7, 152(%rsp)
    movsd     %xmm0, 144(%rsp)
    movupd    8(%rbx), %xmm1
    movupd    40(%rbx), %xmm0
    addpd     %xmm1, %xmm0
    movups    %xmm0, 8(%rbx)
    addsd     152(%rsp), %xmm3
    movsd     %xmm3, 120(%rsp)
    movupd    (%rbx), %xmm0
    movupd    0(%rbp), %xmm1
    addpd     %xmm1, %xmm0
    movupd    16(%rbp), %xmm1
    movups    %xmm0, 0(%rbp)
    movupd    16(%rbx), %xmm0
    addpd     %xmm1, %xmm0
    movups    %xmm0, 16(%rbp)
    addsd     128(%rsp), %xmm4
    movsd     %xmm4, 32(%rsp)
    cmpl      %r13d, %edx
    jne       .L363
hm, if I isolate the Augment function, I get:

# Eigen 3.2
movsd 24(%rdi), %xmm0
movsd 16(%rdi), %xmm2
mulsd %xmm0, %xmm0
movsd 8(%rdi), %xmm1
movsd LC2(%rip), %xmm3
mulsd %xmm2, %xmm2
mulsd %xmm1, %xmm1
addsd %xmm2, %xmm0
addsd %xmm1, %xmm0
movsd 32(%rdi), %xmm1
mulsd LC0(%rip), %xmm0
divsd (%rdi), %xmm0
subsd %xmm0, %xmm1
movsd LC1(%rip), %xmm0
mulsd %xmm1, %xmm0
movsd %xmm0, 40(%rsi)
divsd (%rdi), %xmm0
movsd %xmm3, 56(%rsi)
movsd %xmm0, 48(%rsi)
ret

# Eigen 3.3
movupd 8(%rdi), %xmm0
movsd 24(%rdi), %xmm1
mulpd %xmm0, %xmm0
movsd 32(%rdi), %xmm2
movsd LC2(%rip), %xmm3
mulsd %xmm1, %xmm1
haddpd %xmm0, %xmm0
addsd %xmm1, %xmm0
mulsd LC0(%rip), %xmm0
divsd (%rdi), %xmm0
subsd %xmm0, %xmm2
movsd LC1(%rip), %xmm0
mulsd %xmm2, %xmm0
movsd %xmm0, 40(%rsi)
divsd (%rdi), %xmm0
movsd %xmm3, 56(%rsi)
movsd %xmm0, 48(%rsi)
ret

So the 3.3 version looks better for that part. On the other hand, the code around the call to Augment is less pretty:

# Eigen 3.2
L138:
addl $1, %eax
pxor %xmm0, %xmm0
cvtsi2sd %eax, %xmm0
movq %rbx, %rsi
movq %rbx, %rdi
movsd %xmm0, 48(%rsp)
movsd %xmm0, 56(%rsp)
movsd %xmm0, 64(%rsp)
movsd %xmm0, 72(%rsp)
movsd %xmm0, 80(%rsp)
call __Z7Augment
movsd 88(%rsp), %xmm0
movsd 56(%rsp), %xmm2
movsd 64(%rsp), %xmm1
addsd %xmm0, %xmm2
movsd 96(%rsp), %xmm0
movsd (%rsp), %xmm3
addsd %xmm0, %xmm1
movsd 104(%rsp), %xmm0
addsd 72(%rsp), %xmm0
addsd 24(%rsp), %xmm0
addsd 48(%rsp), %xmm3
addsd 8(%rsp), %xmm2
addsd 16(%rsp), %xmm1
movsd %xmm0, 24(%rsp)
movsd 32(%rsp), %xmm0
movsd %xmm3, (%rsp)
addsd 80(%rsp), %xmm0
movsd %xmm2, 8(%rsp)
movsd %xmm1, 16(%rsp)
movsd %xmm0, 32(%rsp)
cmpl %ebp, %eax
jl L138

# Eigen 3.3
L138:
addl $1, %edx
pxor %xmm0, %xmm0
leaq 16(%rax), %rcx
cvtsi2sd %edx, %xmm0
movq %rbx, %rsi
movq %rbx, %rdi
movddup %xmm0, %xmm1
movups %xmm1, (%rax)
movups %xmm1, 16(%rax)
movsd %xmm0, 80(%rsp)
movupd (%rax), %xmm1
movups %xmm1, (%rbx)
movupd (%rcx), %xmm1
movups %xmm1, 16(%rbx)
movsd %xmm0, 128(%rsp)
call __Z7Augment
movupd 8(%rbx), %xmm1
movupd 40(%rbx), %xmm0
addpd %xmm1, %xmm0
movups %xmm0, 8(%rbx)
movsd 120(%rsp), %xmm0
addsd 152(%rsp), %xmm0
movsd %xmm0, 120(%rsp)
movupd (%rbx), %xmm0
movupd 0(%rbp), %xmm1
addpd %xmm1, %xmm0
movupd 16(%rbp), %xmm1
movups %xmm0, 0(%rbp)
movupd 16(%rbx), %xmm0
addpd %xmm1, %xmm0
movsd 32(%rsp), %xmm1
addsd 128(%rsp), %xmm1
movups %xmm0, 16(%rbp)
movsd %xmm1, 32(%rsp)
cmpl %r12d, %edx
jl L138

We can see some redundant copies before the call that could be avoided by filling augState.head<5>() directly instead of going through the intermediate state variable. After the call we can also see some extra memory copies; these are due to the fact that the packets used to compute:

augState.segment<3>(1) += augState.tail<3>();

are not the same as the ones used to compute:

sum += augState.head<5>();

Nonetheless, it's hard to believe that those few copies are so costly.
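Concretely, the suggestion amounts to something like the following (a hypothetical sketch; the sizes are taken from the discussion, but the real benchmark computes more than a constant fill):

#include <Eigen/Dense>

using Vec8 = Eigen::Matrix<double, 8, 1>;
using Vec5 = Eigen::Matrix<double, 5, 1>;

// variant used by the benchmark: go through an intermediate 'state' vector
void fillViaTemporary(Vec8& augState, double x)
{
    Vec5 state = Vec5::Constant(x); // temporary the compiler has to copy away
    augState.head<5>() = state;
}

// suggested variant: write into augState.head<5>() directly
void fillDirectly(Vec8& augState, double x)
{
    augState.head<5>().setConstant(x);
}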
(In reply to Gael Guennebaud from comment #4)
> Eigen 3.3
> ...
> mulsd %xmm1, %xmm1
> haddpd %xmm0, %xmm0
> addsd %xmm1, %xmm0
> ...
> So the 3.3 version looks better for that part.

Sorry, that claim was wrong: haddpd has a huge latency.
> We can see some redundant copies before the call that could be avoided by
> filling augState.head<5>() directly instead of using the intermediate state
> variable.

That would work for this reduced example (where quite a few parts were excised and simplified), but in the real application we have many states; we load some of them, augment them (which leads to an augState whose first N elements are a copy of state), do computations on them, and produce a shape-sized update.

> haddpd has a huge latency.

From what I can gather, that still shouldn't account for this much of a speed difference (a factor of 4-5), unless I'm measuring badly or something else is going on?
(In reply to Daniel Vollmer from comment #6)
> > We can see some redundant copies before the call that could be avoided by
> > filling augState.head<5>() directly instead of using the intermediate
> > state variable.
>
> That would work for this reduced example (where quite a few parts were
> excised & simplified), but in the real application we have many states; we
> load some of them, augment them (which leads to an augState whose first N
> elements are a copy of state), do computations on them, and produce a
> shape-sized update.

Sure, but this introduces a bias in the benchmark because some compiler optimizations are possible in one case but not in the other.
All I'm hoping for is that there is no performance regression when moving from Eigen 3.2 to Eigen 3.3 (i.e. that vectorization is only attempted when it is actually profitable), as Eigen 3.2 doesn't attempt to vectorize these cases (and I'd prefer not to turn it off altogether, as there may be areas / expressions where it helps). :)
Yes, of course it speeds up a lot of code, and so far we had not observed such a slowdown. In the devel branch, I've removed the use of 'hadd', so the slowdown should be reduced.

Nevertheless, there is something extremely strange in your benchmark, because if I replace the call to:

rhoV.squaredNorm()

with:

(rhoV(0)*rhoV(0)+rhoV(1)*rhoV(1)+rhoV(2)*rhoV(2))

then I get the same performance as with 3.2. Between these two versions, the ASM is only marginally different:

# hand-written
movsd 8(%rdi), %xmm0
movsd 16(%rdi), %xmm1
mulsd %xmm0, %xmm0
mulsd %xmm1, %xmm1
movsd 24(%rdi), %xmm2
mulsd %xmm2, %xmm2
addsd %xmm1, %xmm2

# rhoV.squaredNorm()
movupd 8(%rdi), %xmm0
mulpd %xmm0, %xmm0
movapd %xmm0, %xmm1
shufpd $1, %xmm1, %xmm1
addsd %xmm0, %xmm1
movsd 24(%rdi), %xmm0
mulsd %xmm0, %xmm0

and cannot explain a x3 speed difference. Moreover, if I write another benchmark comparing the following two functions:

void foo1(const Scalar* data, Scalar* out, int n)
{
  int j = 0;
  for(int i=0; i<n/5; i+=5)
    out[j++] = data[i+4] - 0.5 * Map<const Matrix<Scalar,3,1> >(data+i).squaredNorm() / data[i+3];
}

void foo2(const Scalar* data, Scalar* out, int n)
{
  int j = 0;
  for(int i=0; i<n/5; i+=5)
    out[j++] = data[i+4] - 0.5 * ((data[i]*data[i]+data[i+1]*data[i+1])+(data[i+2]*data[i+2]))/data[i+3];
}

then I observe the same ASM differences, but they are equally fast. So that's a mystery to me.

BTW, as you can see in the above ASM, in the default branch I've removed the calls to the slow haddpd intrinsics.
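For reference, a minimal driver for such a comparison could look like the following (a sketch under assumptions: Scalar=double, arbitrary problem size and repetition count; not the exact benchmark used here):

#include <chrono>
#include <cstdio>
#include <vector>

typedef double Scalar;

// foo1/foo2 as defined above
void foo1(const Scalar* data, Scalar* out, int n);
void foo2(const Scalar* data, Scalar* out, int n);

int main()
{
    const int n = 5000000; // arbitrary problem size
    std::vector<Scalar> data(n, 1.0), out(n, 0.0);
    for (int pass = 0; pass < 2; ++pass) {
        auto t0 = std::chrono::steady_clock::now();
        for (int rep = 0; rep < 100; ++rep) // arbitrary repetition count
            (pass == 0 ? foo1 : foo2)(data.data(), out.data(), n);
        std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
        // print a result to keep the optimizer from discarding the loops
        std::printf("foo%d: %.3fs (out[0]=%g)\n", pass + 1, dt.count(), double(out[0]));
    }
    return 0;
}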
My compiler's default settings (no explicit -m<anything>) resulted in SSE2 only, so I never saw the haddpd. When I explicitly enable SSE3, I see it with Eigen 3.3, but not with your changes in the devel branch (as expected). However, I do not observe any performance difference between the two.

I really am fairly clueless about where the large speed difference between 3.2 and 3.3 comes from. Do you see this difference as well (a factor of 4-5)?
Created attachment 758 [details]
even further simplified assignment benchmark

The root cause seems to be the decision in AssignEvaluator.h to allow LinearVectorization and SliceVectorization for very small sizes.

For Eigen 3.2.10, the (non-vectorized) loop looks like this:

.L363:
pxor %xmm1, %xmm1
movl %edx, %ecx
pxor %xmm0, %xmm0
cvtsi2sd %edx, %xmm1
addl $1, %edx
cvtsi2sd %edx, %xmm0
cmpl %ebx, %ecx
addsd %xmm1, %xmm6
addsd %xmm1, %xmm0
addsd %xmm1, %xmm2
movsd %xmm6, 16(%rsp)
addsd %xmm0, %xmm5
addsd %xmm0, %xmm4
movsd %xmm2, 48(%rsp)
addsd %xmm0, %xmm3
movsd %xmm5, 24(%rsp)
movsd %xmm4, 32(%rsp)
movsd %xmm3, 40(%rsp)
jne .L363

whereas Eigen 3.3 vectorizes and produces:

.L362:
pxor %xmm0, %xmm0
movl %edx, %ecx
cvtsi2sd %edx, %xmm0
addl $1, %edx
cmpl %r12d, %ecx
movapd %xmm0, %xmm2
movsd %xmm0, 80(%rsp)
unpcklpd %xmm2, %xmm2
movups %xmm2, 48(%rsp)
movups %xmm2, 64(%rsp)
movupd 48(%rsp), %xmm2
movups %xmm2, (%rbx)
movupd 64(%rsp), %xmm2
movups %xmm2, 16(%rbx)
movsd %xmm0, 128(%rsp)
pxor %xmm0, %xmm0
cvtsi2sd %edx, %xmm0
movapd %xmm0, %xmm2
unpcklpd %xmm2, %xmm2
movups %xmm2, 40(%rbx)
movsd %xmm0, 152(%rsp)
movupd 8(%rbx), %xmm2
movupd 40(%rbx), %xmm0
addpd %xmm2, %xmm0
movups %xmm0, 8(%rbx)
movsd 120(%rsp), %xmm0
addsd 152(%rsp), %xmm0
movsd %xmm0, 120(%rsp)
movupd (%rbx), %xmm0
movupd 0(%rbp), %xmm2
addpd %xmm2, %xmm0
movupd 16(%rbp), %xmm2
movups %xmm0, 0(%rbp)
movupd 16(%rbx), %xmm0
addpd %xmm2, %xmm0
movups %xmm0, 16(%rbp)
addsd 128(%rsp), %xmm1
movsd %xmm1, 32(%rsp)
jne .L362

Maybe it is worthwhile to add a check to MayInnerVectorize / MayLinearVectorize / MaySliceVectorize for a minimum size that seems profitable (if the size is known at compile time)? There already is a similar check in MaySliceVectorize, but it seems a bit optimistic. When I use the "InnerMaxSize >= 3*InnerPacketSize" check for MayInner- and MaySliceVectorize, my blocks no longer get vectorized and I get exactly the same code as with Eigen 3.2.10; a sketch of the idea follows below.
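A self-contained sketch of the kind of check I have in mind (hypothetical names; the real enum entries in AssignEvaluator.h combine this kind of term with several other conditions):

#include <Eigen/Core>

// Hypothetical helper illustrating the proposed minimum-size condition.
template<int InnerMaxSize, int InnerPacketSize>
struct profitable_to_vectorize
{
    enum {
        // unknown size at compile time: keep vectorization enabled;
        // known size: require room for at least three packets
        value = (InnerMaxSize == Eigen::Dynamic)
             || (InnerMaxSize >= 3 * InnerPacketSize)
    };
};

// a 5-element double block with 2-wide SSE packets would then not be vectorized:
static_assert(!profitable_to_vectorize<5, 2>::value, "5 < 3*2, scalar path");
static_assert( profitable_to_vectorize<Eigen::Dynamic, 2>::value, "dynamic size, keep packets");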
-- GitLab Migration Automatic Message --

This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity. You can subscribe and participate further in the new bug via this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/1342.