New user self-registration is disabled due to spam. Please email eigen-core-team @ lists.tuxfamily.org if you need an account.
Before reporting a bug, please make sure that your Eigen version is up-to-date!
Bug 1342 - Performance regression in Eigen 3.3.0 for sub-vector access
Summary: Performance regression in Eigen 3.3.0 for sub-vector access
Status: NEW
Alias: None
Product: Eigen
Classification: Unclassified
Component: Core - vectorization (show other bugs)
Version: 3.3 (current stable)
Hardware: x86 - SSE All
: Normal Performance Problem
Assignee: Nobody
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-11-16 12:00 UTC by Daniel Vollmer
Modified: 2016-12-05 16:08 UTC (History)
4 users (show)



Attachments
Benchmark for the performance regression (2.02 KB, text/x-csrc)
2016-11-16 12:00 UTC, Daniel Vollmer
no flags Details
updated benchmark (2.14 KB, text/x-c++src)
2016-11-18 12:21 UTC, Daniel Vollmer
no flags Details
even further simplified assignment benchmark (1.02 KB, text/x-c++src)
2016-12-05 16:08 UTC, Daniel Vollmer
no flags Details

Description Daniel Vollmer 2016-11-16 12:00:53 UTC
Created attachment 749 [details]
Benchmark for the performance regression

In the attached benchmark, I observe a fairly significant performance difference (~48% slower with Eigen 3.3) between the 3.2 and 3.3 branch of Eigen.

This is on an Intel(R) Xeon(R) CPU E3-1276 v3 @ 3.60GHz
using g++ 4.9.2 with the following compilation flags:
> g++ -std=c++11 -Wno-deprecated -Ofast -DNDEBUG -fno-finite-math-only -I eigen eigen_bench.cpp

A large part of the performance difference goes away if I disabled unaligned vectorization (-DEIGEN_UNALIGNED_VECTORIZE=0), but a small performance difference of ~9% remains. This difference goes away when I completely disable vectorization (-DEIGEN_DONT_VECTORIZE).

The run-times (multiple runs, fastest one taken) I see are as follows:
Eigen 3.2.10: 0.91s                                
Eigen 3.2.10: 0.91s (EIGEN_DONT_VECTORIZE)
Eigen 3.3.0 : 1.35s
Eigen 3.3.0 : 0.99s (EIGEN_UNALIGNED_VECTORIZE=0)
Eigen 3.3.0 : 0.91s (EIGEN_DONT_VECTORIZE)
Comment 1 Daniel Vollmer 2016-11-16 12:02:14 UTC
My unfounded gut feeling is that maybe the sub-vector accesses (i.e. head<> and segment<>) may be the root cause, but I don't know for sure. Almost everything else is individual scalar accesses.
Comment 2 Daniel Vollmer 2016-11-18 12:21:09 UTC
Created attachment 750 [details]
updated benchmark

the original benchmark got optimized away in the eigen-3.2.10 case

slight modification
Comment 3 Daniel Vollmer 2016-11-18 12:24:06 UTC
Eigen 3.3 results in code that seems about 4-5 times slower. At least it contains about twice as many instructions.

The assembly output for the inner loop of the updated benchmark looks like

# Eigen 3.2.10, 0.21s
.L363:
	pxor	%xmm0, %xmm0
	cvtsi2sd	%edx, %xmm0
	movapd	%xmm0, %xmm1
	movapd	%xmm0, %xmm11
	mulsd	%xmm0, %xmm1
	addsd	%xmm0, %xmm6
	addsd	%xmm0, %xmm2
	movapd	%xmm1, %xmm10
	movsd	%xmm6, 16(%rsp)
	movsd	%xmm2, 48(%rsp)
	addsd	%xmm1, %xmm10
	addsd	%xmm10, %xmm1
	mulsd	%xmm9, %xmm1
	divsd	%xmm0, %xmm1
	subsd	%xmm1, %xmm11
	movapd	%xmm11, %xmm1
	mulsd	%xmm8, %xmm1
	movapd	%xmm1, %xmm10
	divsd	%xmm0, %xmm1
	addsd	%xmm0, %xmm10
	addsd	%xmm10, %xmm5
	movsd	%xmm5, 24(%rsp)
	addsd	%xmm0, %xmm1
	addsd	%xmm1, %xmm4
	movapd	%xmm0, %xmm1
	addsd	%xmm7, %xmm1
	movsd	%xmm4, 32(%rsp)
	addsd	%xmm1, %xmm3
	movsd	%xmm3, 40(%rsp)
	addl	$1, %edx
	cmpl	%edx, %ebx
	jg	.L363

versus

# Eigen 3.3.0, 0.95s

.L363:
	pxor	%xmm0, %xmm0
	addl	$1, %edx
	leaq	64(%rsp), %rcx
	cvtsi2sd	%edx, %xmm0
	movapd	%xmm0, %xmm1
	movsd	%xmm0, 80(%rsp)
	unpcklpd	%xmm1, %xmm1
	movups	%xmm1, 48(%rsp)
	movups	%xmm1, 64(%rsp)
	movupd	48(%rsp), %xmm1
	movups	%xmm1, (%rbx)
	movupd	(%rcx), %xmm1
	movups	%xmm1, 16(%rbx)
	movsd	%xmm0, 128(%rsp)
	movsd	120(%rsp), %xmm3
	movupd	8(%rbx), %xmm1
	mulpd	%xmm1, %xmm1
	movsd	96(%rsp), %xmm7
	movapd	%xmm1, %xmm2
	unpckhpd	%xmm1, %xmm2
	addsd	%xmm2, %xmm1
	movapd	%xmm3, %xmm2
	mulsd	%xmm3, %xmm2
	addsd	%xmm2, %xmm1
	mulsd	%xmm6, %xmm1
	divsd	%xmm7, %xmm1
	subsd	%xmm1, %xmm0
	mulsd	%xmm5, %xmm0
	movsd	%xmm0, 136(%rsp)
	divsd	%xmm7, %xmm0
	movsd	.LC12(%rip), %xmm7
	movsd	%xmm7, 152(%rsp)
	movsd	%xmm0, 144(%rsp)
	movupd	8(%rbx), %xmm1
	movupd	40(%rbx), %xmm0
	addpd	%xmm1, %xmm0
	movups	%xmm0, 8(%rbx)
	addsd	152(%rsp), %xmm3
	movsd	%xmm3, 120(%rsp)
	movupd	(%rbx), %xmm0
	movupd	0(%rbp), %xmm1
	addpd	%xmm1, %xmm0
	movupd	16(%rbp), %xmm1
	movups	%xmm0, 0(%rbp)
	movupd	16(%rbx), %xmm0
	addpd	%xmm1, %xmm0
	movups	%xmm0, 16(%rbp)
	addsd	128(%rsp), %xmm4
	movsd	%xmm4, 32(%rsp)
	cmpl	%r13d, %edx
	jne	.L363
Comment 4 Gael Guennebaud 2016-11-18 17:51:55 UTC
hm, if I isolate the Augment function, I get:


Eigen 3.2                       Eigen 3.3
movsd	24(%rdi), %xmm0      	movupd	8(%rdi), %xmm0
movsd	16(%rdi), %xmm2      	movsd	24(%rdi), %xmm1
mulsd	%xmm0, %xmm0         	mulpd	%xmm0, %xmm0
movsd	8(%rdi), %xmm1       	movsd	32(%rdi), %xmm2
movsd	LC2(%rip), %xmm3     	movsd	LC2(%rip), %xmm3
mulsd	%xmm2, %xmm2         	mulsd	%xmm1, %xmm1
mulsd	%xmm1, %xmm1         	haddpd	%xmm0, %xmm0
addsd	%xmm2, %xmm0         	addsd	%xmm1, %xmm0
addsd	%xmm1, %xmm0         	
movsd	32(%rdi), %xmm1      
mulsd	LC0(%rip), %xmm0        mulsd	LC0(%rip), %xmm0
divsd	(%rdi), %xmm0        	divsd	(%rdi), %xmm0
subsd	%xmm0, %xmm1         	subsd	%xmm0, %xmm2
movsd	LC1(%rip), %xmm0     	movsd	LC1(%rip), %xmm0
mulsd	%xmm1, %xmm0         	mulsd	%xmm2, %xmm0
movsd	%xmm0, 40(%rsi)      	movsd	%xmm0, 40(%rsi)
divsd	(%rdi), %xmm0        	divsd	(%rdi), %xmm0
movsd	%xmm3, 56(%rsi)      	movsd	%xmm3, 56(%rsi)
movsd	%xmm0, 48(%rsi)      	movsd	%xmm0, 48(%rsi)
ret

So the 3.3 version looks better for that part.
On the other hand, the code around the call to Augment is less pretty:

Eigen 3.2                       Eigen 3.3
L138:                           L138:          
addl	$1, %eax               	addl	$1, %edx
pxor	%xmm0, %xmm0           	pxor	%xmm0, %xmm0
                                leaq	16(%rax), %rcx
cvtsi2sd	%eax, %xmm0     cvtsi2sd	%edx, %xmm0
movq	%rbx, %rsi            	movq	%rbx, %rsi
movq	%rbx, %rdi              movq	%rbx, %rdi
movsd	%xmm0, 48(%rsp)         movddup	%xmm0, %xmm1
movsd	%xmm0, 56(%rsp)         movups	%xmm1, (%rax)
movsd	%xmm0, 64(%rsp)       	movups	%xmm1, 16(%rax)
movsd	%xmm0, 72(%rsp)       	movsd	%xmm0, 80(%rsp)
movsd	%xmm0, 80(%rsp)         movupd	(%rax), %xmm1
                                movups	%xmm1, (%rbx)
                                movupd	(%rcx), %xmm1
                                movups	%xmm1, 16(%rbx)
                                movsd	%xmm0, 128(%rsp)
call	__Z7Augment            	call	__Z7Augment
movsd	88(%rsp), %xmm0        	movupd	8(%rbx), %xmm1
movsd	56(%rsp), %xmm2        	movupd	40(%rbx), %xmm0
movsd	64(%rsp), %xmm1        	addpd	%xmm1, %xmm0
addsd	%xmm0, %xmm2           	movups	%xmm0, 8(%rbx)
movsd	96(%rsp), %xmm0        	movsd	120(%rsp), %xmm0
movsd	(%rsp), %xmm3          	addsd	152(%rsp), %xmm0
addsd	%xmm0, %xmm1           	movsd	%xmm0, 120(%rsp)
movsd	104(%rsp), %xmm0       	movupd	(%rbx), %xmm0
addsd	72(%rsp), %xmm0        	movupd	0(%rbp), %xmm1
addsd	24(%rsp), %xmm0        	addpd	%xmm1, %xmm0
addsd	48(%rsp), %xmm3        	movupd	16(%rbp), %xmm1
addsd	8(%rsp), %xmm2         	movups	%xmm0, 0(%rbp)
addsd	16(%rsp), %xmm1        	movupd	16(%rbx), %xmm0
movsd	%xmm0, 24(%rsp)        	addpd	%xmm1, %xmm0
movsd	32(%rsp), %xmm0        	
movsd	%xmm3, (%rsp)          	movsd	32(%rsp), %xmm1
addsd	80(%rsp), %xmm0        	addsd	128(%rsp), %xmm1
movsd	%xmm2, 8(%rsp)         	movups	%xmm0, 16(%rbp)
movsd	%xmm1, 16(%rsp)        
movsd	%xmm0, 32(%rsp)         movsd	%xmm1, 32(%rsp)
cmpl	%ebp, %eax              cmpl	%r12d, %edx
jl	L138                    jl	L138             


We can some redundant copies before the call that could be avoided by filling augState.head<5>() directly instead of using the intermediate state variable.

After the call we can also see some extra memory copies that are due to the fact that the packets used to compute:

augState.segment<3>(1) += augState.tail<3>();

are not the same than the one to compute:

sum += augState.head<5>();

Nonetheless, it's hard to believe that those few copies are so costly.
Comment 5 Gael Guennebaud 2016-11-18 18:03:40 UTC
(In reply to Gael Guennebaud from comment #4)

> Eigen 3.3
> ...
> mulsd	%xmm1, %xmm1
> haddpd	%xmm0, %xmm0
> addsd	%xmm1, %xmm0
> ...
> So the 3.3 version looks better for that part.

Sorry, that claim was wrong, haddpd has a huge latency.
Comment 6 Daniel Vollmer 2016-11-19 10:12:10 UTC
> We can some redundant copies before the call that could be avoided by filling 
> augState.head<5>() directly instead of using the intermediate state variable.

That would work for this reduced example (where quite a few parts were excised & 
simplified), but in the real application we have many states, then load some of them, 
augment them (which leads to an augState whose first N elements are a copy of state), 
do computations on them, and produce a shape-sized update.

> haddpd has a huge latency.

From what I can gather, it still shouldn't account for this much of a speed difference (factor between 4-5), unless I'm measuring badly / something else?
Comment 7 Gael Guennebaud 2016-11-22 14:08:16 UTC
(In reply to Daniel Vollmer from comment #6)
> > We can some redundant copies before the call that could be avoided by filling 
> > augState.head<5>() directly instead of using the intermediate state variable.
> 
> That would work for this reduced example (where quite a few parts were
> excised & 
> simplified), but in the real application we have many states, then load some
> of them, 
> augment them (which leads to an augState whose first N elements are a copy
> of state), 
> do computations on them, and produce a shape-sized update.

sure, but this introduces a bias in the benchmark because some compiler optimizations are possible in one case but not in the other.
Comment 8 Daniel Vollmer 2016-11-22 17:11:16 UTC
All I'm hoping for is there is no performance regression when moving from Eigen 3.2 to Eigen 3.3 (i.e. that vectorization is only attempted when it is actually profitable), as Eigen 3.2 doesn't attempt to vectorise these cases (and I'd prefer not to have to turn if off altogether, as there may be areas / expressions where it helps). :)
Comment 9 Gael Guennebaud 2016-11-22 21:47:30 UTC
Yes of course it speeds up a lot of code, and so far we did not observe such slowdown.

In the devel branch, I've removed the use of 'hadd' so the slowdown should be reduced. Nevertheless, there is something extremely strange in your benchmark because if I change the call to:

    rhoV.squaredNorm()

by

    (rhoV(0)*rhoV(0)+rhoV(1)*rhoV(1)+rhoV(2)*rhoV(2))

then I get same performance as 3.2. Between these two versions, the ASM is marginally different:

hand-written             rhoV.squaredNorm()
movsd	8(%rdi), %xmm0   movupd	8(%rdi), %xmm
movsd	16(%rdi), %xmm1  mulpd	%xmm0, %xmm0
mulsd	%xmm0, %xmm0     movapd	%xmm0, %xmm1
mulsd	%xmm1, %xmm1     shufpd	$1, %xmm1, %x
movsd	24(%rdi), %xmm2  addsd	%xmm0, %xmm1
mulsd	%xmm2, %xmm2     movsd	24(%rdi), %xmm0
addsd	%xmm1, %xmm2     mulsd	%xmm0, %xmm0

and cannot explain a x3 speed difference. Moreover, if I write another benchmark comparing the following two functions:

void foo1(const Scalar* data, Scalar* out, int n)
{
  int j = 0;
  for(int i=0; i<n/5; i+=5)
    out[j++] = data[i+4] - 0.5 * Map<const Matrix<Scalar,3,1> >.squaredNorm() / data[i+3];
}

void foo2(const Scalar* data, Scalar* out, int n)
{
  int j = 0;
  for(int i=0; i<n/5; i+=5)
    out[j++] = data[i+4] - 0.5 * ((data[i]*data[i]+data[i+1]*data[i+1])+(data[i+2]*data[i+2]))/data[i+3];
}

then I observe the same ASM differences, but they are equally fast. So that's a mystery to me.


BTW, as you can see in the above ASM, in the default branch I've removed the calls to the slow haddpd intrinsics.
Comment 10 Daniel Vollmer 2016-11-23 10:13:16 UTC
My compiler default settings (no explicit -m<anything>) resulted in SSE2, so I never saw the haddpd. When I explicitly enable it, I see it with Eigen 3.3, but not with your changes to the devel-branch (as expected). But, I do not observe any performance difference between the two.

I really am fairly clueless where the large speed difference comes from (between 3.2 3.3). Do you see this difference as well (factor 4-5)?
Comment 11 Daniel Vollmer 2016-12-05 16:08:15 UTC
Created attachment 758 [details]
even further simplified assignment benchmark

The root cause seems to be the decision in AssignEvaluator.h to allow LinearVectorization and SliceVectorization for very small sizes.

For Eigen 3.2.10, the (non-vectorized) loop looks like this:
.L363:
	pxor	%xmm1, %xmm1
	movl	%edx, %ecx
	pxor	%xmm0, %xmm0
	cvtsi2sd	%edx, %xmm1
	addl	$1, %edx
	cvtsi2sd	%edx, %xmm0
	cmpl	%ebx, %ecx
	addsd	%xmm1, %xmm6
	addsd	%xmm1, %xmm0
	addsd	%xmm1, %xmm2
	movsd	%xmm6, 16(%rsp)
	addsd	%xmm0, %xmm5
	addsd	%xmm0, %xmm4
	movsd	%xmm2, 48(%rsp)
	addsd	%xmm0, %xmm3
	movsd	%xmm5, 24(%rsp)
	movsd	%xmm4, 32(%rsp)
	movsd	%xmm3, 40(%rsp)
	jne	.L363

whereas Eigen 3.3 vectorizes and produces:
.L362:
	pxor	%xmm0, %xmm0
	movl	%edx, %ecx
	cvtsi2sd	%edx, %xmm0
	addl	$1, %edx
	cmpl	%r12d, %ecx
	movapd	%xmm0, %xmm2
	movsd	%xmm0, 80(%rsp)
	unpcklpd	%xmm2, %xmm2
	movups	%xmm2, 48(%rsp)
	movups	%xmm2, 64(%rsp)
	movupd	48(%rsp), %xmm2
	movups	%xmm2, (%rbx)
	movupd	64(%rsp), %xmm2
	movups	%xmm2, 16(%rbx)
	movsd	%xmm0, 128(%rsp)
	pxor	%xmm0, %xmm0
	cvtsi2sd	%edx, %xmm0
	movapd	%xmm0, %xmm2
	unpcklpd	%xmm2, %xmm2
	movups	%xmm2, 40(%rbx)
	movsd	%xmm0, 152(%rsp)
	movupd	8(%rbx), %xmm2
	movupd	40(%rbx), %xmm0
	addpd	%xmm2, %xmm0
	movups	%xmm0, 8(%rbx)
	movsd	120(%rsp), %xmm0
	addsd	152(%rsp), %xmm0
	movsd	%xmm0, 120(%rsp)
	movupd	(%rbx), %xmm0
	movupd	0(%rbp), %xmm2
	addpd	%xmm2, %xmm0
	movupd	16(%rbp), %xmm2
	movups	%xmm0, 0(%rbp)
	movupd	16(%rbx), %xmm0
	addpd	%xmm2, %xmm0
	movups	%xmm0, 16(%rbp)
	addsd	128(%rsp), %xmm1
	movsd	%xmm1, 32(%rsp)
	jne	.L362


Maybe it is worthwhile to add a check to MayInnerVectorize / MayLinearVectorize / MaySliceVectorize for a minimum size that seems profitable (if size is known)? There already is a similar check in MaySliceVectorize, but that seems a bit optimistic. When I use the "InnerMaxSize >= 3*InnerPacketSize" check for MayInner- and MaySlizeVectorize, my blocks won't get vectorized anymore and I get the exact same code as Eigen 3.2.10.

Note You need to log in before you can comment on or make changes to this bug.