Created attachment 762 [details] Test C++ code and assembly obtained using the 2 versions of Eigen I have written this small piece of code performing squared norm on vectors of floating points (see the attachment). This is the output I obtain using Eigen 3.2.10 : nTrials 200 nPoints 1000 time 0.392 This is the output I obtain using Eigen 3.3.1 : nTrials 200 nPoints 1000 time 2.098 This means that using Eigen 3.3.1 is about 5x slower than using Eigen previous version! The test is performed on a Intel Core I5 4690K 3.50GHz with 16 GB RAM. The code is built using Microsoft Visual Studio 2015. The AVX2 vector extension (/arch:AVX2 switch) is enabled. I have tried also to disassembly the code, though I do not understand the reason for such a slow-down. The attachment contains the assembly code corresponding to the inner loop (for (int j = 0; j < np; ++j)) obtained with the two Eigen versions. Another strange thing I noticed: introducing a temporary Vector4f like this: Eigen::Vector4f dx = points[j] - points[i]; out(i, j) = dx.squaredNorm(); allows obtaining a 2x speed up in Eigen 3.3 (the computation time becomes about 0.9sec). A slow down is obtained instead in Eigen 3.2.10. Thanks a lot for your help.
Here with either gcc or clang I get same performance (0.37s) and similar asm: 3.3, AVX: L5: vmovaps (%rax), %xmm0 vsubps (%rsi,%rcx,4), %xmm0, %xmm0 addq $16, %rax vmulps %xmm0, %xmm0, %xmm0 vmovhlps%xmm0, %xmm0, %xmm1 vaddps %xmm1, %xmm0, %xmm0 vshufps $1, %xmm0, %xmm0, %xmm1 vaddss %xmm1, %xmm0, %xmm0 vmovss %xmm0, (%rdx) addq %rdi, %rdx cmpq %rax, %r8 jne L5 3.2, SSE2 only: L5: movaps (%rax), %xmm0 addq $16, %rax subps (%rsi,%rcx,4), %xmm0 mulps %xmm0, %xmm0 movaps %xmm0, %xmm1 movhlps %xmm0, %xmm1 addps %xmm1, %xmm0 movaps %xmm0, %xmm1 shufps $1, %xmm0, %xmm1 addss %xmm1, %xmm0 movss %xmm0, (%rdx) addq %rdi, %rdx cmpq %rax, %r8 jne L5
Oh, actually I was using the head of the 3.3 branch which contains some optimization regarding the usage of haddp* instruction. Please try with the head, your issue is likely already fixed, but this worth checking because your ASM is also exhibiting some weird stores.
hm, I also get good perf. (0.37s) with 3.3.1 and usage of haddps: 3.3.1: L5: vmovaps (%rax), %xmm0 vsubps (%rsi,%rcx,4), %xmm0, %xmm0 addq $16, %rax vmulps %xmm0, %xmm0, %xmm0 vhaddps %xmm0, %xmm0, %xmm0 vhaddps %xmm0, %xmm0, %xmm0 vmovss %xmm0, (%rdx) addq %rdi, %rdx cmpq %rax, %r8 jne L5 In your case, those four instructions: 00007FF6D13812E7 mov qword ptr [rbp-79h],rcx 00007FF6D13812EB mov qword ptr [rbp-71h],rax 00007FF6D13812EF vmovups ymm0,ymmword ptr [rsp+28h] 00007FF6D13812F5 vmovups ymmword ptr [rbp-59h],ymm0 are really suspicious. I've no idea why MSVC generates those.
Hi Gael I have tried building with the unstable source code from the development branch (from http://bitbucket.org/eigen/eigen/get/default.zip) but I still get bad performances (1.86s). This is the asm (it looks the same): 00007FF74A9B12D0 shl rcx,4 00007FF74A9B12D4 add rcx,rbx 00007FF74A9B12D7 mov qword ptr [rbp-79h],rcx 00007FF74A9B12DB mov qword ptr [rbp-71h],rax 00007FF74A9B12DF vmovups ymm0,ymmword ptr [rsp+28h] 00007FF74A9B12E5 vmovups ymmword ptr [rbp-59h],ymm0 00007FF74A9B12EA vmovups xmm0,xmmword ptr [rcx] 00007FF74A9B12EE vsubps xmm1,xmm0,xmmword ptr [rax] 00007FF74A9B12F2 vmulps xmm3,xmm1,xmm1 00007FF74A9B12F6 vmovhlps xmm1,xmm3,xmm3 00007FF74A9B12FA vaddps xmm3,xmm1,xmm3 00007FF74A9B12FE vshufps xmm0,xmm3,xmm3,1 00007FF74A9B1303 vaddss xmm2,xmm3,xmm0 00007FF74A9B1307 vmovss dword ptr [r10],xmm2 00007FF74A9B130C inc r9d 00007FF74A9B130F lea r10,[r10+0FA0h] 00007FF74A9B1316 movsxd rcx,r9d 00007FF74A9B1319 cmp rcx,r11 00007FF74A9B131C jb performTest+140h (07FF74A9B12D0h)
yes, there are still those useless copies. could you paste the ASM on the entire computeDistances function so that we have some more context.
also, maybe the /DEBUG /Z7 flags will make MSVC inserts somme helpful comments in the asm?
This is the entire generated asm, built with the /DEBUG /Z7 flags : static void performTest(int np, int nTrials) { 00007FF6E2931190 mov rax,rsp 00007FF6E2931193 push rbp 00007FF6E2931194 lea rbp,[rax-5Fh] 00007FF6E2931198 sub rsp,100h 00007FF6E293119F mov qword ptr [rsp+20h],0FFFFFFFFFFFFFFFEh 00007FF6E29311A8 mov qword ptr [rax+8],rbx 00007FF6E29311AC mov qword ptr [rax+10h],rsi 00007FF6E29311B0 mov qword ptr [rax+18h],rdi 00007FF6E29311B4 mov qword ptr [rax+20h],r14 00007FF6E29311B8 vmovaps xmmword ptr [rax-18h],xmm6 00007FF6E29311BD vmovaps xmmword ptr [rax-28h],xmm7 00007FF6E29311C2 vmovaps xmmword ptr [rax-38h],xmm8 00007FF6E29311C7 vmovaps xmmword ptr [rax-48h],xmm9 00007FF6E29311CC mov rax,qword ptr [__security_cookie (07FF6E2935000h)] 00007FF6E29311D3 xor rax,rsp 00007FF6E29311D6 mov qword ptr [rbp+0Fh],rax 00007FF6E29311DA xor eax,eax 00007FF6E29311DC vpxor xmm0,xmm0,xmm0 00007FF6E29311E0 vmovdqu xmmword ptr [rbp-9],xmm0 std::vector<Eigen::Vector4f> points; 00007FF6E29311E5 mov qword ptr [rbp+7],rax points.reserve(np); 00007FF6E29311E9 mov edx,3E8h 00007FF6E29311EE lea rcx,[rbp-9] 00007FF6E29311F2 call std::vector<Eigen::Matrix<float,4,1,0,4,1>,std::allocator<Eigen::Matrix<float,4,1,0,4,1> > >::_Reallocate (07FF6E29316D0h) /* Create a sample set of points */ for (int i = 0; i < np; ++i) { 00007FF6E29311F7 xor ebx,ebx 00007FF6E29311F9 vmovss xmm6,dword ptr [__real@40400000 (07FF6E2933470h)] 00007FF6E2931201 vmovss xmm7,dword ptr [__real@40000000 (07FF6E293346Ch)] 00007FF6E2931209 vmovss xmm8,dword ptr [__real@3f800000 (07FF6E2933468h)] 00007FF6E2931211 vxorps xmm9,xmm9,xmm9 00007FF6E2931216 nop word ptr [rax+rax] 00007FF6E2931220 vxorps xmm0,xmm0,xmm0 float iFloat = (float)i; 00007FF6E2931224 vcvtsi2ss xmm0,xmm0,ebx Vector4f point(0.0f + iFloat, 1.0f + iFloat, 2.0f + iFloat, 3.0f + iFloat); 00007FF6E2931228 vaddss xmm3,xmm0,xmm6 00007FF6E293122C vaddss xmm2,xmm0,xmm7 00007FF6E2931230 vaddss xmm1,xmm0,xmm8 00007FF6E2931235 vaddss xmm0,xmm0,xmm9 00007FF6E293123A vmovss dword ptr [rbp-29h],xmm0 00007FF6E293123F vmovss dword ptr [rbp-25h],xmm1 00007FF6E2931244 vmovss dword ptr [rbp-21h],xmm2 00007FF6E2931249 vmovss dword ptr [rbp-1Dh],xmm3 points.push_back(point); 00007FF6E293124E lea rdx,[rbp-29h] 00007FF6E2931252 lea rcx,[rbp-9] 00007FF6E2931256 call std::vector<Eigen::Matrix<float,4,1,0,4,1>,std::allocator<Eigen::Matrix<float,4,1,0,4,1> > >::push_back (07FF6E29314C0h) /* Create a sample set of points */ for (int i = 0; i < np; ++i) { 00007FF6E293125B inc ebx 00007FF6E293125D cmp ebx,3E8h 00007FF6E2931263 jl performTest+90h (07FF6E2931220h) 00007FF6E2931265 xor eax,eax 00007FF6E2931267 vpxor xmm0,xmm0,xmm0 00007FF6E293126B vmovdqu xmmword ptr [rbp-29h],xmm0 } MatrixXf result(np, np); 00007FF6E2931270 mov qword ptr [rbp-19h],rax 00007FF6E2931274 mov ecx,0F4240h 00007FF6E2931279 call Eigen::internal::conditional_aligned_new_auto<float,1> (07FF6E2931A50h) 00007FF6E293127E mov rsi,rax 00007FF6E2931281 mov qword ptr [rbp-29h],rax 00007FF6E2931285 vmovdqu xmm0,xmmword ptr [__xmm@00000000000003e800000000000003e8 (07FF6E2933480h)] 00007FF6E293128D vmovdqu xmmword ptr [rbp-21h],xmm0 clock_t begin = clock(); 00007FF6E2931292 call qword ptr [__imp_clock (07FF6E2933228h)] 00007FF6E2931298 mov r14d,eax 00007FF6E293129B mov r11,qword ptr [rbp-1] 00007FF6E293129F mov rbx,qword ptr [rbp-9] 00007FF6E29312A3 sub r11,rbx 00007FF6E29312A6 sar r11,4 00007FF6E29312AA mov edi,0C8h 00007FF6E29312AF nop /* Compute the distances between all the couples of points. */ computeDistances(result, points); 00007FF6E29312B0 xor edx,edx 00007FF6E29312B2 test r11,r11 00007FF6E29312B5 je performTest+19Ch (07FF6E293132Ch) 00007FF6E29312B7 xor eax,eax 00007FF6E29312B9 mov r8,rsi 00007FF6E29312BC nop dword ptr [rax] 00007FF6E29312C0 xor r9d,r9d 00007FF6E29312C3 shl rax,4 00007FF6E29312C7 add rax,rbx 00007FF6E29312CA xor ecx,ecx 00007FF6E29312CC mov r10,r8 00007FF6E29312CF nop 00007FF6E29312D0 shl rcx,4 00007FF6E29312D4 add rcx,rbx 00007FF6E29312D7 mov qword ptr [rbp-79h],rcx 00007FF6E29312DB mov qword ptr [rbp-71h],rax 00007FF6E29312DF vmovups ymm0,ymmword ptr [rsp+28h] 00007FF6E29312E5 vmovups ymmword ptr [rbp-59h],ymm0 00007FF6E29312EA vmovups xmm0,xmmword ptr [rcx] 00007FF6E29312EE vsubps xmm1,xmm0,xmmword ptr [rax] 00007FF6E29312F2 vmulps xmm3,xmm1,xmm1 00007FF6E29312F6 vmovhlps xmm1,xmm3,xmm3 00007FF6E29312FA vaddps xmm3,xmm1,xmm3 00007FF6E29312FE vshufps xmm0,xmm3,xmm3,1 00007FF6E2931303 vaddss xmm2,xmm3,xmm0 00007FF6E2931307 vmovss dword ptr [r10],xmm2 00007FF6E293130C inc r9d 00007FF6E293130F lea r10,[r10+0FA0h] 00007FF6E2931316 movsxd rcx,r9d 00007FF6E2931319 cmp rcx,r11 00007FF6E293131C jb performTest+140h (07FF6E29312D0h) 00007FF6E293131E inc edx 00007FF6E2931320 add r8,4 00007FF6E2931324 movsxd rax,edx 00007FF6E2931327 cmp rax,r11 00007FF6E293132A jb performTest+130h (07FF6E29312C0h) for (int i = 0; i < nTrials; ++i) { 00007FF6E293132C sub rdi,1 00007FF6E2931330 jne performTest+120h (07FF6E29312B0h) 00007FF6E2931336 vzeroupper } clock_t end = clock(); 00007FF6E2931339 call qword ptr [__imp_clock (07FF6E2933228h)] double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC; 00007FF6E293133F sub eax,r14d 00007FF6E2931342 vxorps xmm0,xmm0,xmm0 00007FF6E2931346 vcvtsi2sd xmm0,xmm0,eax 00007FF6E293134A vdivsd xmm6,xmm0,mmword ptr [__real@408f400000000000 (07FF6E2933478h)] std::cout << "computeDistances: nTrials\t" << nTrials << "\tnPoints\t" << np << "\ttime\t" << elapsed_secs << "\r\n"; 00007FF6E2931352 lea rdx,[string "computeDistances: nTrials\t" (07FF6E2933368h)] std::cout << "computeDistances: nTrials\t" << nTrials << "\tnPoints\t" << np << "\ttime\t" << elapsed_secs << "\r\n"; 00007FF6E2931359 mov rcx,qword ptr [__imp_std::cout (07FF6E2933088h)] 00007FF6E2931360 call std::operator<<<std::char_traits<char> > (07FF6E2931880h) 00007FF6E2931365 mov rcx,rax 00007FF6E2931368 mov edx,0C8h 00007FF6E293136D call qword ptr [__imp_std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF6E2933090h)] 00007FF6E2931373 mov rcx,rax 00007FF6E2931376 lea rdx,[string "\tnPoints\t" (07FF6E2933358h)] 00007FF6E293137D call std::operator<<<std::char_traits<char> > (07FF6E2931880h) 00007FF6E2931382 mov rcx,rax 00007FF6E2931385 mov edx,3E8h 00007FF6E293138A call qword ptr [__imp_std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF6E2933090h)] 00007FF6E2931390 mov rcx,rax 00007FF6E2931393 lea rdx,[string "\ttime\t" (07FF6E293334Ch)] 00007FF6E293139A call std::operator<<<std::char_traits<char> > (07FF6E2931880h) 00007FF6E293139F mov rcx,rax 00007FF6E29313A2 vmovaps xmm1,xmm6 00007FF6E29313A6 call qword ptr [__imp_std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF6E2933080h)] 00007FF6E29313AC mov rcx,rax 00007FF6E29313AF lea rdx,[string "\r\n" (07FF6E2933348h)] 00007FF6E29313B6 call std::operator<<<std::char_traits<char> > (07FF6E2931880h) 00007FF6E29313BB nop } 00007FF6E29313BC test rsi,rsi 00007FF6E29313BF je performTest+23Ch (07FF6E29313CCh) 00007FF6E29313C1 mov rcx,qword ptr [rsi-8] 00007FF6E29313C5 call qword ptr [__imp_free (07FF6E2933138h)] 00007FF6E29313CB nop 00007FF6E29313CC test rbx,rbx 00007FF6E29313CF je performTest+2B2h (07FF6E2931442h) 00007FF6E29313D1 mov rax,qword ptr [rbp+7] 00007FF6E29313D5 sub rax,rbx 00007FF6E29313D8 sar rax,4 00007FF6E29313DC mov rcx,0FFFFFFFFFFFFFFFh 00007FF6E29313E6 cmp rax,rcx 00007FF6E29313E9 jbe performTest+262h (07FF6E29313F2h) 00007FF6E29313EB call qword ptr [__imp__invalid_parameter_noinfo_noreturn (07FF6E29331E0h)] 00007FF6E29313F1 int 3 00007FF6E29313F2 shl rax,4 } 00007FF6E29313F6 cmp rax,1000h 00007FF6E29313FC jb performTest+2AAh (07FF6E293143Ah) 00007FF6E29313FE test bl,1Fh 00007FF6E2931401 je performTest+27Ah (07FF6E293140Ah) 00007FF6E2931403 call qword ptr [__imp__invalid_parameter_noinfo_noreturn (07FF6E29331E0h)] 00007FF6E2931409 int 3 00007FF6E293140A mov rax,qword ptr [rbx-8] 00007FF6E293140E cmp rax,rbx 00007FF6E2931411 jb performTest+28Ah (07FF6E293141Ah) 00007FF6E2931413 call qword ptr [__imp__invalid_parameter_noinfo_noreturn (07FF6E29331E0h)] 00007FF6E2931419 int 3 00007FF6E293141A sub rbx,rax 00007FF6E293141D cmp rbx,8 00007FF6E2931421 jae performTest+29Ah (07FF6E293142Ah) 00007FF6E2931423 call qword ptr [__imp__invalid_parameter_noinfo_noreturn (07FF6E29331E0h)] 00007FF6E2931429 int 3 00007FF6E293142A cmp rbx,27h 00007FF6E293142E jbe performTest+2A7h (07FF6E2931437h) 00007FF6E2931430 call qword ptr [__imp__invalid_parameter_noinfo_noreturn (07FF6E29331E0h)] 00007FF6E2931436 int 3 00007FF6E2931437 mov rbx,rax 00007FF6E293143A mov rcx,rbx 00007FF6E293143D call operator delete (07FF6E2931B70h) 00007FF6E2931442 mov rcx,qword ptr [rbp+0Fh] 00007FF6E2931446 xor rcx,rsp 00007FF6E2931449 call __security_check_cookie (07FF6E2931B10h) 00007FF6E293144E lea r11,[rsp+100h] 00007FF6E2931456 mov rbx,qword ptr [r11+10h] 00007FF6E293145A mov rsi,qword ptr [r11+18h] 00007FF6E293145E mov rdi,qword ptr [r11+20h] 00007FF6E2931462 mov r14,qword ptr [r11+28h] 00007FF6E2931466 vmovaps xmm6,xmmword ptr [r11-10h] 00007FF6E293146C vmovaps xmm7,xmmword ptr [r11-20h] 00007FF6E2931472 vmovaps xmm8,xmmword ptr [r11-30h] 00007FF6E2931478 vmovaps xmm9,xmmword ptr [r11-40h] 00007FF6E293147E mov rsp,r11 00007FF6E2931481 pop rbp 00007FF6E2931482 ret
I'm not sure if that changes anything: Could you try replacing auto np = points.size(); by const size_t np = points.size(); And replace both int i and j by size_t. Also (this is unrelated), you should probably iterate through the output matrix in column-major order (you can actually save computing half the matrix, due to symmetry). What might also help is to store (points[i]) in a const Vector4f (in the outer loop). The issue itself looks a bit like an actual evaluator object is stored on the stack (though it is strange that ymm0 is also stored back). Did you compile with full optimization enabled? What is also a bit strange is that MSVC uses unaligned moves instead of aligned ones.
(In reply to Christoph Hertzberg from comment #8) > the stack (though it is strange that ymm0 is also stored back). [...] Using ymm0 at all is actually strange here, since no Packet8f are used. Maybe MSVC just wants to copy 32 bytes from [rsp+28h] to [rbp-59h] (perhaps this copies an Evaluator object?)
(In reply to Christoph Hertzberg from comment #8) > I'm not sure if that changes anything: Could you try replacing > auto np = points.size(); > by > const size_t np = points.size(); > And replace both int i and j by size_t. I tried replacing int with size_t as you suggested but the performances are the same. > Also (this is unrelated), you should probably iterate through the output > matrix in column-major order (you can actually save computing half the > matrix, due to symmetry). What might also help is to store (points[i]) in a > const Vector4f (in the outer loop). Yes, I have written this procedure just for benchmarking, I did not think to optimize it... > The issue itself looks a bit like an actual evaluator object is stored on > the stack (though it is strange that ymm0 is also stored back). Did you > compile with full optimization enabled? I compiled with the "Maximize Speed" (/O2) and with the "Favor fast code" (/Ot) flags enabled.
Just for completeness: these are all the compilation options /GS /GL /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /sdl /Fd"x64\Release\vc140.pdb" /Zc:inline /D "_SCL_SECURE_NO_WARNINGS" /errorReport:prompt /WX- /Zc:forScope /arch:AVX2 /Gd /Oi /MD /Fa"x64\Release\" /EHsc /nologo /Fo"x64\Release\" /Fp"x64\Release\DistancesComputation.pch"
Could you retry with the head of default branch, I've drastically reduced the storage cost of evaluators: https://bitbucket.org/eigen/eigen/commits/a5c3bbc49e62/ Date: 2016-12-20 14:51:30+00:00 Summary: Remove common "noncopyable" base class from evaluator_base to get a chance to get EBO (Empty Base Optimization) Note: we should probably get rid of this class and define a macro instead. https://bitbucket.org/eigen/eigen/commits/f251524e716b/ Date: 2016-12-20 14:55:40+00:00 Summary: Optimize storage layout of Cwise* and PlainObjectBase evaluator to remove the functor or outer-stride if they are empty. For instance, sizeof("(A-B).cwiseAbs2()") with A,B Vector4f is now 16 bytes, instead of 48 before this optimization. In theory, evaluators should be completely optimized away by the compiler, but this might help in some cases.
Hi Gael I have tried compiling with the head of the default branch but unfortunately the performance does not change. This is the asm: 00007FF76B7312B0 test r11,r11 00007FF76B7312B3 je performTest+194h (07FF76B731324h) 00007FF76B7312B5 mov rdx,rsi 00007FF76B7312B8 mov rax,rbx 00007FF76B7312BB mov r10,r11 00007FF76B7312BE xchg ax,ax 00007FF76B7312C0 mov rcx,rbx 00007FF76B7312C3 mov r8,rdx 00007FF76B7312C6 mov r9,r11 00007FF76B7312C9 nop dword ptr [rax] 00007FF76B7312D0 mov qword ptr [rbp-79h],rcx 00007FF76B7312D4 mov qword ptr [rbp-71h],rax 00007FF76B7312D8 vmovups ymm0,ymmword ptr [rsp+28h] 00007FF76B7312DE vmovups ymmword ptr [rbp-59h],ymm0 00007FF76B7312E3 vmovups xmm0,xmmword ptr [rcx] 00007FF76B7312E7 vsubps xmm1,xmm0,xmmword ptr [rax] 00007FF76B7312EB vmulps xmm3,xmm1,xmm1 00007FF76B7312EF vmovhlps xmm1,xmm3,xmm3 00007FF76B7312F3 vaddps xmm3,xmm1,xmm3 00007FF76B7312F7 vshufps xmm0,xmm3,xmm3,1 00007FF76B7312FC vaddss xmm2,xmm3,xmm0 00007FF76B731300 vmovss dword ptr [r8],xmm2 00007FF76B731305 add rcx,10h 00007FF76B731309 lea r8,[r8+0FA0h] 00007FF76B731310 sub r9,1 00007FF76B731314 jne performTest+140h (07FF76B7312D0h) 00007FF76B731316 add rax,10h 00007FF76B73131A add rdx,4 00007FF76B73131E sub r10,1 00007FF76B731322 jne performTest+130h (07FF76B7312C0h)
I'm clueless, those 4 lines really make no sense to me, rbp-79h, rbp-71h, and rbp-59h are never used afterwards, and rsp+28h is not even initialized. Perhaps, they are produced by /GS (buffer security check). What if you remove it? Also, what about /Oy ? (Frame-Pointer Omission)
To check the effect of /GS- you must also disable /sdl with /sdl-
I have tried setting the options as you suggested but the performances are still the same. These are the compilation options: /GS- /GL /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /sdl- /Fd"x64\Release\vc140.pdb" /Zc:inline /D "_SCL_SECURE_NO_WARNINGS" /D "UMFPACK_MDLAB_MBCS" /errorReport:prompt /WX- /Zc:forScope /arch:AVX2 /Gd /Oy /Oi /MD /Fa"x64\Release\" /EHsc /nologo /Fo"x64\Release\" /Fp"x64\Release\DistancesComputation.pch"
And what is the ASM without the frame pointer (/Oy), this might give some insights.
Also, what the is ASM produced when you introduce the temporary: Eigen::Vector4f dx = points[j] - points[i]; out(i, j) = dx.squaredNorm();
Here it is: 00007FF7B48412A8 test r11,r11 00007FF7B48412AB je performTest+187h (07FF7B4841317h) 00007FF7B48412AD mov rcx,rsi 00007FF7B48412B0 mov rax,rbx 00007FF7B48412B3 mov r10,r11 00007FF7B48412B6 nop word ptr [rax+rax] 00007FF7B48412C0 mov r8,rbx 00007FF7B48412C3 mov rdx,rcx 00007FF7B48412C6 mov r9,r11 00007FF7B48412C9 nop dword ptr [rax] 00007FF7B48412D0 vmovups xmm1,xmmword ptr [r8] 00007FF7B48412D5 vsubps xmm1,xmm1,xmmword ptr [rax] 00007FF7B48412D9 vmovups xmmword ptr [point],xmm1 00007FF7B48412DF vmulps xmm3,xmm1,xmm1 /* Compute the distances between all the couples of points. */ computeDistances(result, points); 00007FF7B48412E3 vmovhlps xmm1,xmm3,xmm3 00007FF7B48412E7 vaddps xmm3,xmm1,xmm3 00007FF7B48412EB vshufps xmm0,xmm3,xmm3,1 00007FF7B48412F0 vaddss xmm2,xmm3,xmm0 00007FF7B48412F4 vmovss dword ptr [rdx],xmm2 00007FF7B48412F8 lea r8,[r8+10h] 00007FF7B48412FC lea rdx,[rdx+0FA0h] 00007FF7B4841303 sub r9,1 00007FF7B4841307 jne performTest+140h (07FF7B48412D0h) 00007FF7B4841309 add rax,10h 00007FF7B484130D add rcx,4 00007FF7B4841311 sub r10,1 00007FF7B4841315 jne performTest+130h (07FF7B48412C0h) for (int i = 0; i < nTrials; ++i) { 00007FF7B4841317 sub rdi,1 00007FF7B484131B jne performTest+118h (07FF7B48412A8h) }
I guess this is with the temp, and what about the asm with /Oy ? This is interesting, the 4 weird lines are gone, but it still copy the temporary (unnecessarily) to memory: vmovups xmmword ptr [point],xmm1 MSVC is really stupid.
Hi Gael the asm I sent you was indeed compiled with /oy .
Are there any changes in this bug? If I use the fix mentioned in the first message, I get the performance same to eigen 3.2. Is it possible to get performance improvement using eigen 3.3 in calculating of squaredNorm?
-- GitLab Migration Automatic Message -- This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/1365.