New user self-registration is disabled due to spam. Please email eigen-core-team @ if you need an account.
Before reporting a bug, please make sure that your Eigen version is up-to-date!
Bug 1365 - Slowdown when switching from Eigen 3.2.10 to Eigen 3.3.1
Summary: Slowdown when switching from Eigen 3.2.10 to Eigen 3.3.1
Status: NEW
Alias: None
Product: Eigen
Classification: Unclassified
Component: Core - vectorization (show other bugs)
Version: 3.3 (current stable)
Hardware: x86 - 64-bit Windows
: Normal Performance Problem
Assignee: Nobody
Depends on:
Reported: 2016-12-19 18:07 UTC by Luca Vandelli
Modified: 2017-04-01 19:35 UTC (History)
5 users (show)

Test C++ code and assembly obtained using the 2 versions of Eigen (2.80 KB, text/plain)
2016-12-19 18:07 UTC, Luca Vandelli
no flags Details

Description Luca Vandelli 2016-12-19 18:07:38 UTC
Created attachment 762 [details]
Test C++ code and assembly obtained using the 2 versions of Eigen

I have written this small piece of code performing squared norm on vectors of floating points (see the attachment).

This is the output I obtain using Eigen 3.2.10 : 
nTrials       200     nPoints 1000    time    0.392

This is the output I obtain using Eigen 3.3.1 : 
nTrials       200     nPoints 1000    time    2.098

This means that using Eigen 3.3.1 is about 5x slower than using Eigen previous version!
The test is performed on a Intel Core I5 4690K 3.50GHz with 16 GB RAM.
The code is built using Microsoft Visual Studio 2015. The AVX2 vector extension (/arch:AVX2 switch) is enabled.
I have tried also to disassembly the code, though I do not understand the reason for such a slow-down.
The attachment contains the assembly code corresponding to the inner loop (for (int j = 0; j < np; ++j)) obtained with the two Eigen versions.

Another strange thing I noticed: introducing a temporary Vector4f like this:
Eigen::Vector4f dx = points[j] - points[i];
out(i, j) = dx.squaredNorm();
allows obtaining a 2x speed up in Eigen 3.3 (the computation time becomes about 0.9sec). A slow down is obtained instead in Eigen 3.2.10.

Thanks a lot for your help.
Comment 1 Gael Guennebaud 2016-12-19 22:18:33 UTC
Here with either gcc or clang I get same performance (0.37s) and similar asm:

3.3, AVX:
	vmovaps	(%rax), %xmm0
	vsubps	(%rsi,%rcx,4), %xmm0, %xmm0
	addq	$16, %rax
	vmulps	%xmm0, %xmm0, %xmm0
	vmovhlps%xmm0, %xmm0, %xmm1
	vaddps	%xmm1, %xmm0, %xmm0
	vshufps	$1, %xmm0, %xmm0, %xmm1
	vaddss	%xmm1, %xmm0, %xmm0
	vmovss	%xmm0, (%rdx)
	addq	%rdi, %rdx
	cmpq	%rax, %r8
	jne	L5

3.2, SSE2 only:
	movaps	(%rax), %xmm0
	addq	$16, %rax
	subps	(%rsi,%rcx,4), %xmm0
	mulps	%xmm0, %xmm0
	movaps	%xmm0, %xmm1
	movhlps	%xmm0, %xmm1
	addps	%xmm1, %xmm0
	movaps	%xmm0, %xmm1
	shufps	$1, %xmm0, %xmm1
	addss	%xmm1, %xmm0
	movss	%xmm0, (%rdx)
	addq	%rdi, %rdx
	cmpq	%rax, %r8
	jne	L5
Comment 2 Gael Guennebaud 2016-12-19 22:20:54 UTC
Oh, actually I was using the head of the 3.3 branch which contains some optimization regarding the usage of haddp* instruction. Please try with the head, your issue is likely already fixed, but this worth checking because your ASM is also exhibiting some weird stores.
Comment 3 Gael Guennebaud 2016-12-19 22:27:22 UTC
hm, I also get good perf. (0.37s) with 3.3.1 and usage of haddps:

	vmovaps	(%rax), %xmm0
	vsubps	(%rsi,%rcx,4), %xmm0, %xmm0
	addq	$16, %rax
	vmulps	%xmm0, %xmm0, %xmm0
	vhaddps	%xmm0, %xmm0, %xmm0
	vhaddps	%xmm0, %xmm0, %xmm0
	vmovss	%xmm0, (%rdx)
	addq	%rdi, %rdx
	cmpq	%rax, %r8
	jne	L5

In your case, those four instructions:

00007FF6D13812E7  mov         qword ptr [rbp-79h],rcx  
00007FF6D13812EB  mov         qword ptr [rbp-71h],rax  
00007FF6D13812EF  vmovups     ymm0,ymmword ptr [rsp+28h]  
00007FF6D13812F5  vmovups     ymmword ptr [rbp-59h],ymm0 

are really suspicious. I've no idea why MSVC generates those.
Comment 4 Luca Vandelli 2016-12-20 08:23:27 UTC
Hi Gael
I have tried building with the unstable source code from the development branch (from but I still get bad performances (1.86s).
This is the asm (it looks the same):

00007FF74A9B12D0  shl         rcx,4  
00007FF74A9B12D4  add         rcx,rbx  
00007FF74A9B12D7  mov         qword ptr [rbp-79h],rcx  
00007FF74A9B12DB  mov         qword ptr [rbp-71h],rax  
00007FF74A9B12DF  vmovups     ymm0,ymmword ptr [rsp+28h]  
00007FF74A9B12E5  vmovups     ymmword ptr [rbp-59h],ymm0  
00007FF74A9B12EA  vmovups     xmm0,xmmword ptr [rcx]  
00007FF74A9B12EE  vsubps      xmm1,xmm0,xmmword ptr [rax]  
00007FF74A9B12F2  vmulps      xmm3,xmm1,xmm1  
00007FF74A9B12F6  vmovhlps    xmm1,xmm3,xmm3  
00007FF74A9B12FA  vaddps      xmm3,xmm1,xmm3  
00007FF74A9B12FE  vshufps     xmm0,xmm3,xmm3,1  
00007FF74A9B1303  vaddss      xmm2,xmm3,xmm0  
00007FF74A9B1307  vmovss      dword ptr [r10],xmm2  
00007FF74A9B130C  inc         r9d  
00007FF74A9B130F  lea         r10,[r10+0FA0h]  
00007FF74A9B1316  movsxd      rcx,r9d  
00007FF74A9B1319  cmp         rcx,r11  
00007FF74A9B131C  jb          performTest+140h (07FF74A9B12D0h)
Comment 5 Gael Guennebaud 2016-12-20 08:40:45 UTC
yes, there are still those useless copies. could you paste the ASM on the entire computeDistances function so that we have some more context.
Comment 6 Gael Guennebaud 2016-12-20 08:46:23 UTC
also, maybe the /DEBUG /Z7 flags will make MSVC inserts somme helpful comments in the asm?
Comment 7 Luca Vandelli 2016-12-20 08:51:20 UTC
This is the entire generated asm, built with the /DEBUG /Z7 flags : 

static void performTest(int np, int nTrials) {
00007FF6E2931190  mov         rax,rsp  
00007FF6E2931193  push        rbp  
00007FF6E2931194  lea         rbp,[rax-5Fh]  
00007FF6E2931198  sub         rsp,100h  
00007FF6E293119F  mov         qword ptr [rsp+20h],0FFFFFFFFFFFFFFFEh  
00007FF6E29311A8  mov         qword ptr [rax+8],rbx  
00007FF6E29311AC  mov         qword ptr [rax+10h],rsi  
00007FF6E29311B0  mov         qword ptr [rax+18h],rdi  
00007FF6E29311B4  mov         qword ptr [rax+20h],r14  
00007FF6E29311B8  vmovaps     xmmword ptr [rax-18h],xmm6  
00007FF6E29311BD  vmovaps     xmmword ptr [rax-28h],xmm7  
00007FF6E29311C2  vmovaps     xmmword ptr [rax-38h],xmm8  
00007FF6E29311C7  vmovaps     xmmword ptr [rax-48h],xmm9  
00007FF6E29311CC  mov         rax,qword ptr [__security_cookie (07FF6E2935000h)]  
00007FF6E29311D3  xor         rax,rsp  
00007FF6E29311D6  mov         qword ptr [rbp+0Fh],rax  
00007FF6E29311DA  xor         eax,eax  
00007FF6E29311DC  vpxor       xmm0,xmm0,xmm0  
00007FF6E29311E0  vmovdqu     xmmword ptr [rbp-9],xmm0  

	std::vector<Eigen::Vector4f> points;
00007FF6E29311E5  mov         qword ptr [rbp+7],rax  
00007FF6E29311E9  mov         edx,3E8h  
00007FF6E29311EE  lea         rcx,[rbp-9]  
00007FF6E29311F2  call        std::vector<Eigen::Matrix<float,4,1,0,4,1>,std::allocator<Eigen::Matrix<float,4,1,0,4,1> > >::_Reallocate (07FF6E29316D0h)  

	Create a sample set of points
	for (int i = 0; i < np; ++i) {
00007FF6E29311F7  xor         ebx,ebx  
00007FF6E29311F9  vmovss      xmm6,dword ptr [__real@40400000 (07FF6E2933470h)]  
00007FF6E2931201  vmovss      xmm7,dword ptr [__real@40000000 (07FF6E293346Ch)]  
00007FF6E2931209  vmovss      xmm8,dword ptr [__real@3f800000 (07FF6E2933468h)]  
00007FF6E2931211  vxorps      xmm9,xmm9,xmm9  
00007FF6E2931216  nop         word ptr [rax+rax]  
00007FF6E2931220  vxorps      xmm0,xmm0,xmm0  
		float iFloat = (float)i;
00007FF6E2931224  vcvtsi2ss   xmm0,xmm0,ebx  
		Vector4f point(0.0f + iFloat, 1.0f + iFloat, 2.0f + iFloat, 3.0f + iFloat);
00007FF6E2931228  vaddss      xmm3,xmm0,xmm6  
00007FF6E293122C  vaddss      xmm2,xmm0,xmm7  
00007FF6E2931230  vaddss      xmm1,xmm0,xmm8  
00007FF6E2931235  vaddss      xmm0,xmm0,xmm9  
00007FF6E293123A  vmovss      dword ptr [rbp-29h],xmm0  
00007FF6E293123F  vmovss      dword ptr [rbp-25h],xmm1  
00007FF6E2931244  vmovss      dword ptr [rbp-21h],xmm2  
00007FF6E2931249  vmovss      dword ptr [rbp-1Dh],xmm3  
00007FF6E293124E  lea         rdx,[rbp-29h]  
00007FF6E2931252  lea         rcx,[rbp-9]  
00007FF6E2931256  call        std::vector<Eigen::Matrix<float,4,1,0,4,1>,std::allocator<Eigen::Matrix<float,4,1,0,4,1> > >::push_back (07FF6E29314C0h)  

	Create a sample set of points
	for (int i = 0; i < np; ++i) {
00007FF6E293125B  inc         ebx  
00007FF6E293125D  cmp         ebx,3E8h  
00007FF6E2931263  jl          performTest+90h (07FF6E2931220h)  
00007FF6E2931265  xor         eax,eax  
00007FF6E2931267  vpxor       xmm0,xmm0,xmm0  
00007FF6E293126B  vmovdqu     xmmword ptr [rbp-29h],xmm0  
	MatrixXf result(np, np);
00007FF6E2931270  mov         qword ptr [rbp-19h],rax  
00007FF6E2931274  mov         ecx,0F4240h  
00007FF6E2931279  call        Eigen::internal::conditional_aligned_new_auto<float,1> (07FF6E2931A50h)  
00007FF6E293127E  mov         rsi,rax  
00007FF6E2931281  mov         qword ptr [rbp-29h],rax  
00007FF6E2931285  vmovdqu     xmm0,xmmword ptr [__xmm@00000000000003e800000000000003e8 (07FF6E2933480h)]  
00007FF6E293128D  vmovdqu     xmmword ptr [rbp-21h],xmm0  

	clock_t begin = clock();
00007FF6E2931292  call        qword ptr [__imp_clock (07FF6E2933228h)]  
00007FF6E2931298  mov         r14d,eax  
00007FF6E293129B  mov         r11,qword ptr [rbp-1]  
00007FF6E293129F  mov         rbx,qword ptr [rbp-9]  
00007FF6E29312A3  sub         r11,rbx  
00007FF6E29312A6  sar         r11,4  
00007FF6E29312AA  mov         edi,0C8h  
00007FF6E29312AF  nop  
		Compute the distances between all the couples of points.
		computeDistances(result, points);
00007FF6E29312B0  xor         edx,edx  
00007FF6E29312B2  test        r11,r11  
00007FF6E29312B5  je          performTest+19Ch (07FF6E293132Ch)  
00007FF6E29312B7  xor         eax,eax  
00007FF6E29312B9  mov         r8,rsi  
00007FF6E29312BC  nop         dword ptr [rax]  
00007FF6E29312C0  xor         r9d,r9d  
00007FF6E29312C3  shl         rax,4  
00007FF6E29312C7  add         rax,rbx  
00007FF6E29312CA  xor         ecx,ecx  
00007FF6E29312CC  mov         r10,r8  
00007FF6E29312CF  nop  
00007FF6E29312D0  shl         rcx,4  
00007FF6E29312D4  add         rcx,rbx  
00007FF6E29312D7  mov         qword ptr [rbp-79h],rcx  
00007FF6E29312DB  mov         qword ptr [rbp-71h],rax  
00007FF6E29312DF  vmovups     ymm0,ymmword ptr [rsp+28h]  
00007FF6E29312E5  vmovups     ymmword ptr [rbp-59h],ymm0  
00007FF6E29312EA  vmovups     xmm0,xmmword ptr [rcx]  
00007FF6E29312EE  vsubps      xmm1,xmm0,xmmword ptr [rax]  
00007FF6E29312F2  vmulps      xmm3,xmm1,xmm1  
00007FF6E29312F6  vmovhlps    xmm1,xmm3,xmm3  
00007FF6E29312FA  vaddps      xmm3,xmm1,xmm3  
00007FF6E29312FE  vshufps     xmm0,xmm3,xmm3,1  
00007FF6E2931303  vaddss      xmm2,xmm3,xmm0  
00007FF6E2931307  vmovss      dword ptr [r10],xmm2  
00007FF6E293130C  inc         r9d  
00007FF6E293130F  lea         r10,[r10+0FA0h]  
00007FF6E2931316  movsxd      rcx,r9d  
00007FF6E2931319  cmp         rcx,r11  
00007FF6E293131C  jb          performTest+140h (07FF6E29312D0h)  
00007FF6E293131E  inc         edx  
00007FF6E2931320  add         r8,4  
00007FF6E2931324  movsxd      rax,edx  
00007FF6E2931327  cmp         rax,r11  
00007FF6E293132A  jb          performTest+130h (07FF6E29312C0h)  

	for (int i = 0; i < nTrials; ++i) {
00007FF6E293132C  sub         rdi,1  
00007FF6E2931330  jne         performTest+120h (07FF6E29312B0h)  
00007FF6E2931336  vzeroupper  

	clock_t end = clock();
00007FF6E2931339  call        qword ptr [__imp_clock (07FF6E2933228h)]  

	double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
00007FF6E293133F  sub         eax,r14d  
00007FF6E2931342  vxorps      xmm0,xmm0,xmm0  
00007FF6E2931346  vcvtsi2sd   xmm0,xmm0,eax  
00007FF6E293134A  vdivsd      xmm6,xmm0,mmword ptr [__real@408f400000000000 (07FF6E2933478h)]  

	std::cout << "computeDistances: nTrials\t" << nTrials << "\tnPoints\t" << np << "\ttime\t" << elapsed_secs << "\r\n";
00007FF6E2931352  lea         rdx,[string "computeDistances: nTrials\t" (07FF6E2933368h)]  

	std::cout << "computeDistances: nTrials\t" << nTrials << "\tnPoints\t" << np << "\ttime\t" << elapsed_secs << "\r\n";
00007FF6E2931359  mov         rcx,qword ptr [__imp_std::cout (07FF6E2933088h)]  
00007FF6E2931360  call        std::operator<<<std::char_traits<char> > (07FF6E2931880h)  
00007FF6E2931365  mov         rcx,rax  
00007FF6E2931368  mov         edx,0C8h  
00007FF6E293136D  call        qword ptr [__imp_std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF6E2933090h)]  
00007FF6E2931373  mov         rcx,rax  
00007FF6E2931376  lea         rdx,[string "\tnPoints\t" (07FF6E2933358h)]  
00007FF6E293137D  call        std::operator<<<std::char_traits<char> > (07FF6E2931880h)  
00007FF6E2931382  mov         rcx,rax  
00007FF6E2931385  mov         edx,3E8h  
00007FF6E293138A  call        qword ptr [__imp_std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF6E2933090h)]  
00007FF6E2931390  mov         rcx,rax  
00007FF6E2931393  lea         rdx,[string "\ttime\t" (07FF6E293334Ch)]  
00007FF6E293139A  call        std::operator<<<std::char_traits<char> > (07FF6E2931880h)  
00007FF6E293139F  mov         rcx,rax  
00007FF6E29313A2  vmovaps     xmm1,xmm6  
00007FF6E29313A6  call        qword ptr [__imp_std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF6E2933080h)]  
00007FF6E29313AC  mov         rcx,rax  
00007FF6E29313AF  lea         rdx,[string "\r\n" (07FF6E2933348h)]  
00007FF6E29313B6  call        std::operator<<<std::char_traits<char> > (07FF6E2931880h)  
00007FF6E29313BB  nop  

00007FF6E29313BC  test        rsi,rsi  
00007FF6E29313BF  je          performTest+23Ch (07FF6E29313CCh)  
00007FF6E29313C1  mov         rcx,qword ptr [rsi-8]  
00007FF6E29313C5  call        qword ptr [__imp_free (07FF6E2933138h)]  
00007FF6E29313CB  nop  
00007FF6E29313CC  test        rbx,rbx  
00007FF6E29313CF  je          performTest+2B2h (07FF6E2931442h)  
00007FF6E29313D1  mov         rax,qword ptr [rbp+7]  
00007FF6E29313D5  sub         rax,rbx  
00007FF6E29313D8  sar         rax,4  
00007FF6E29313DC  mov         rcx,0FFFFFFFFFFFFFFFh  
00007FF6E29313E6  cmp         rax,rcx  
00007FF6E29313E9  jbe         performTest+262h (07FF6E29313F2h)  
00007FF6E29313EB  call        qword ptr [__imp__invalid_parameter_noinfo_noreturn (07FF6E29331E0h)]  
00007FF6E29313F1  int         3  
00007FF6E29313F2  shl         rax,4  

00007FF6E29313F6  cmp         rax,1000h  
00007FF6E29313FC  jb          performTest+2AAh (07FF6E293143Ah)  
00007FF6E29313FE  test        bl,1Fh  
00007FF6E2931401  je          performTest+27Ah (07FF6E293140Ah)  
00007FF6E2931403  call        qword ptr [__imp__invalid_parameter_noinfo_noreturn (07FF6E29331E0h)]  
00007FF6E2931409  int         3  
00007FF6E293140A  mov         rax,qword ptr [rbx-8]  
00007FF6E293140E  cmp         rax,rbx  
00007FF6E2931411  jb          performTest+28Ah (07FF6E293141Ah)  
00007FF6E2931413  call        qword ptr [__imp__invalid_parameter_noinfo_noreturn (07FF6E29331E0h)]  
00007FF6E2931419  int         3  
00007FF6E293141A  sub         rbx,rax  
00007FF6E293141D  cmp         rbx,8  
00007FF6E2931421  jae         performTest+29Ah (07FF6E293142Ah)  
00007FF6E2931423  call        qword ptr [__imp__invalid_parameter_noinfo_noreturn (07FF6E29331E0h)]  
00007FF6E2931429  int         3  
00007FF6E293142A  cmp         rbx,27h  
00007FF6E293142E  jbe         performTest+2A7h (07FF6E2931437h)  
00007FF6E2931430  call        qword ptr [__imp__invalid_parameter_noinfo_noreturn (07FF6E29331E0h)]  
00007FF6E2931436  int         3  
00007FF6E2931437  mov         rbx,rax  
00007FF6E293143A  mov         rcx,rbx  
00007FF6E293143D  call        operator delete (07FF6E2931B70h)  
00007FF6E2931442  mov         rcx,qword ptr [rbp+0Fh]  
00007FF6E2931446  xor         rcx,rsp  
00007FF6E2931449  call        __security_check_cookie (07FF6E2931B10h)  
00007FF6E293144E  lea         r11,[rsp+100h]  
00007FF6E2931456  mov         rbx,qword ptr [r11+10h]  
00007FF6E293145A  mov         rsi,qword ptr [r11+18h]  
00007FF6E293145E  mov         rdi,qword ptr [r11+20h]  
00007FF6E2931462  mov         r14,qword ptr [r11+28h]  
00007FF6E2931466  vmovaps     xmm6,xmmword ptr [r11-10h]  
00007FF6E293146C  vmovaps     xmm7,xmmword ptr [r11-20h]  
00007FF6E2931472  vmovaps     xmm8,xmmword ptr [r11-30h]  
00007FF6E2931478  vmovaps     xmm9,xmmword ptr [r11-40h]  
00007FF6E293147E  mov         rsp,r11  
00007FF6E2931481  pop         rbp  
00007FF6E2931482  ret
Comment 8 Christoph Hertzberg 2016-12-20 09:27:40 UTC
I'm not sure if that changes anything: Could you try replacing
  auto np = points.size();
  const size_t np = points.size();
And replace both int i and j by size_t.
Also (this is unrelated), you should probably iterate through the output matrix in column-major order (you can actually save computing half the matrix, due to symmetry). What might also help is to store (points[i]) in a const Vector4f (in the outer loop).

The issue itself looks a bit like an actual evaluator object is stored on the stack (though it is strange that ymm0 is also stored back). Did you compile with full optimization enabled?
What is also a bit strange is that MSVC uses unaligned moves instead of aligned ones.
Comment 9 Christoph Hertzberg 2016-12-20 09:36:05 UTC
(In reply to Christoph Hertzberg from comment #8)
> the stack (though it is strange that ymm0 is also stored back). [...]

Using ymm0 at all is actually strange here, since no Packet8f are used. Maybe MSVC just wants to copy 32 bytes from [rsp+28h] to [rbp-59h] (perhaps this copies an Evaluator object?)
Comment 10 Luca Vandelli 2016-12-20 09:50:41 UTC
(In reply to Christoph Hertzberg from comment #8)
> I'm not sure if that changes anything: Could you try replacing
>   auto np = points.size();
> by
>   const size_t np = points.size();
> And replace both int i and j by size_t.

I tried replacing int with size_t as you suggested but the performances are the same.

> Also (this is unrelated), you should probably iterate through the output
> matrix in column-major order (you can actually save computing half the
> matrix, due to symmetry). What might also help is to store (points[i]) in a
> const Vector4f (in the outer loop).

Yes, I have written this procedure just for benchmarking, I did not think to optimize it...

> The issue itself looks a bit like an actual evaluator object is stored on
> the stack (though it is strange that ymm0 is also stored back). Did you
> compile with full optimization enabled?

I compiled with the "Maximize Speed" (/O2) and with the "Favor fast code" (/Ot) flags enabled.
Comment 11 Luca Vandelli 2016-12-20 11:12:27 UTC
Just for completeness: these are all the compilation options

/GS /GL /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /sdl /Fd"x64\Release\vc140.pdb" /Zc:inline /D "_SCL_SECURE_NO_WARNINGS" /errorReport:prompt /WX- /Zc:forScope /arch:AVX2 /Gd /Oi /MD /Fa"x64\Release\" /EHsc /nologo /Fo"x64\Release\" /Fp"x64\Release\DistancesComputation.pch"
Comment 12 Gael Guennebaud 2016-12-20 14:58:05 UTC
Could you retry with the head of default branch, I've drastically reduced the storage cost of evaluators:
Date:        2016-12-20 14:51:30+00:00
Summary:     Remove common "noncopyable" base class from evaluator_base to get a chance to get EBO (Empty Base Optimization)
Note: we should probably get rid of this class and define a macro instead.
Date:        2016-12-20 14:55:40+00:00
Summary:     Optimize storage layout of Cwise* and PlainObjectBase evaluator to remove the functor or outer-stride if they are empty.
For instance, sizeof("(A-B).cwiseAbs2()") with A,B Vector4f is now 16 bytes, instead of 48 before this optimization.
In theory, evaluators should be completely optimized away by the compiler, but this might help in some cases.
Comment 13 Luca Vandelli 2016-12-20 16:10:30 UTC
Hi Gael
I have tried compiling with the head of the default branch but unfortunately the performance does not change.
This is the asm:

00007FF76B7312B0  test        r11,r11  
00007FF76B7312B3  je          performTest+194h (07FF76B731324h)  
00007FF76B7312B5  mov         rdx,rsi  
00007FF76B7312B8  mov         rax,rbx  
00007FF76B7312BB  mov         r10,r11  
00007FF76B7312BE  xchg        ax,ax  
00007FF76B7312C0  mov         rcx,rbx  
00007FF76B7312C3  mov         r8,rdx  
00007FF76B7312C6  mov         r9,r11  
00007FF76B7312C9  nop         dword ptr [rax]  
00007FF76B7312D0  mov         qword ptr [rbp-79h],rcx  
00007FF76B7312D4  mov         qword ptr [rbp-71h],rax  
00007FF76B7312D8  vmovups     ymm0,ymmword ptr [rsp+28h]  
00007FF76B7312DE  vmovups     ymmword ptr [rbp-59h],ymm0  
00007FF76B7312E3  vmovups     xmm0,xmmword ptr [rcx]  
00007FF76B7312E7  vsubps      xmm1,xmm0,xmmword ptr [rax]  
00007FF76B7312EB  vmulps      xmm3,xmm1,xmm1  
00007FF76B7312EF  vmovhlps    xmm1,xmm3,xmm3  
00007FF76B7312F3  vaddps      xmm3,xmm1,xmm3  
00007FF76B7312F7  vshufps     xmm0,xmm3,xmm3,1  
00007FF76B7312FC  vaddss      xmm2,xmm3,xmm0  
00007FF76B731300  vmovss      dword ptr [r8],xmm2  
00007FF76B731305  add         rcx,10h  
00007FF76B731309  lea         r8,[r8+0FA0h]  
00007FF76B731310  sub         r9,1  
00007FF76B731314  jne         performTest+140h (07FF76B7312D0h)  
00007FF76B731316  add         rax,10h  
00007FF76B73131A  add         rdx,4  
00007FF76B73131E  sub         r10,1  
00007FF76B731322  jne         performTest+130h (07FF76B7312C0h)
Comment 14 Gael Guennebaud 2016-12-20 16:52:49 UTC
I'm clueless, those 4 lines really make no sense to me, rbp-79h, rbp-71h, and rbp-59h are never used afterwards, and rsp+28h is not even initialized. Perhaps, they are produced by /GS (buffer security check). What if you remove it?

Also, what about /Oy ? (Frame-Pointer Omission)
Comment 15 Gael Guennebaud 2016-12-20 16:58:45 UTC
To check the effect of /GS- you must also disable /sdl with /sdl-
Comment 16 Luca Vandelli 2016-12-20 17:09:57 UTC
I have tried setting the options as you suggested but the performances are still the same.
These are the compilation options:

/GS- /GL /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /sdl- /Fd"x64\Release\vc140.pdb" /Zc:inline /D "_SCL_SECURE_NO_WARNINGS" /D "UMFPACK_MDLAB_MBCS" /errorReport:prompt /WX- /Zc:forScope /arch:AVX2 /Gd /Oy /Oi /MD /Fa"x64\Release\" /EHsc /nologo /Fo"x64\Release\" /Fp"x64\Release\DistancesComputation.pch"
Comment 17 Gael Guennebaud 2016-12-20 17:17:42 UTC
And what is the ASM without the frame pointer (/Oy), this might give some insights.
Comment 18 Gael Guennebaud 2016-12-20 17:20:16 UTC
Also, what the is ASM produced when you introduce the temporary:

Eigen::Vector4f dx = points[j] - points[i];
out(i, j) = dx.squaredNorm();
Comment 19 Luca Vandelli 2016-12-20 17:28:37 UTC
Here it is:

00007FF7B48412A8  test        r11,r11  
00007FF7B48412AB  je          performTest+187h (07FF7B4841317h)  
00007FF7B48412AD  mov         rcx,rsi  
00007FF7B48412B0  mov         rax,rbx  
00007FF7B48412B3  mov         r10,r11  
00007FF7B48412B6  nop         word ptr [rax+rax]  
00007FF7B48412C0  mov         r8,rbx  
00007FF7B48412C3  mov         rdx,rcx  
00007FF7B48412C6  mov         r9,r11  
00007FF7B48412C9  nop         dword ptr [rax]  
00007FF7B48412D0  vmovups     xmm1,xmmword ptr [r8]  
00007FF7B48412D5  vsubps      xmm1,xmm1,xmmword ptr [rax]  
00007FF7B48412D9  vmovups     xmmword ptr [point],xmm1  
00007FF7B48412DF  vmulps      xmm3,xmm1,xmm1  
		Compute the distances between all the couples of points.
		computeDistances(result, points);
00007FF7B48412E3  vmovhlps    xmm1,xmm3,xmm3  
00007FF7B48412E7  vaddps      xmm3,xmm1,xmm3  
00007FF7B48412EB  vshufps     xmm0,xmm3,xmm3,1  
00007FF7B48412F0  vaddss      xmm2,xmm3,xmm0  
00007FF7B48412F4  vmovss      dword ptr [rdx],xmm2  
00007FF7B48412F8  lea         r8,[r8+10h]  
00007FF7B48412FC  lea         rdx,[rdx+0FA0h]  
00007FF7B4841303  sub         r9,1  
00007FF7B4841307  jne         performTest+140h (07FF7B48412D0h)  
00007FF7B4841309  add         rax,10h  
00007FF7B484130D  add         rcx,4  
00007FF7B4841311  sub         r10,1  
00007FF7B4841315  jne         performTest+130h (07FF7B48412C0h)  

	for (int i = 0; i < nTrials; ++i) {
00007FF7B4841317  sub         rdi,1  
00007FF7B484131B  jne         performTest+118h (07FF7B48412A8h)  
Comment 20 Gael Guennebaud 2016-12-20 17:48:01 UTC
I guess this is with the temp, and what about the asm with /Oy ?

This is interesting, the 4 weird lines are gone, but it still copy the temporary (unnecessarily) to memory:

vmovups     xmmword ptr [point],xmm1

MSVC is really stupid.
Comment 21 Luca Vandelli 2016-12-21 08:05:00 UTC
Hi Gael the asm I sent you was indeed compiled with /oy .
Comment 22 Денис Сенькин 2017-04-01 19:35:58 UTC
Are there any changes in this bug? If I use the fix mentioned in the first message, I get the performance same to eigen 3.2. Is it possible to get performance improvement using eigen 3.3 in calculating of squaredNorm?

Note You need to log in before you can comment on or make changes to this bug.