Bug 1680 - More VS EIGEN_STRONG_INLINE and VS code performance
Status: NEW
Alias: None
Product: Eigen
Classification: Unclassified
Component: Core - general
Version: 3.3 (current stable)
Hardware: x86 - AVX Windows
Importance: Normal Performance Problem
Assignee: Nobody
Reported: 2019-02-14 14:31 UTC by neumann
Modified: 2019-03-25 09:48 UTC
CC List: 4 users



Attachments
EIGEN_STRONG_INLINE Patch to inline everything with VS2019 (40.25 KB, patch)
2019-02-14 14:32 UTC, neumann
Assembly generated by clang-cl (33.66 KB, text/plain)
2019-02-14 14:33 UTC, neumann
Assembly generated by vs2019 (97.89 KB, text/plain)
2019-02-14 14:33 UTC, neumann
even more strong inlining (1.97 KB, patch)
2019-02-15 13:49 UTC, neumann
More complex example generating a lot of mov in VS (2.51 KB, text/plain)
2019-02-15 13:51 UTC, neumann
VS assembly for the complex case (45.04 KB, text/plain)
2019-02-15 13:51 UTC, neumann
clang cl assembly for the complex case (11.97 KB, text/plain)
2019-02-15 13:52 UTC, neumann

Description neumann 2019-02-14 14:31:53 UTC
Short story: With every new VS version, I test whether my simulation runs as fast when compiled with VS as it does when compiled with clang-cl. I tested this because I wanted to use some SVML features, but it seems I have to wait until clang-cl supports the same intrinsics to get a speed advantage.

Benchmark: 
clang-cl: around 140-150 ns/step
vs2019 (without adding force inline to eigen3): around 450 ns/step
vs2019 (with adding force inline to eigen3): around 400 ns/step
vs2019 (forcing everything in hot loop inline; like clang-cl does): around 700 ns/step
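
For anyone who wants to reproduce numbers of this kind, here is a minimal timing sketch. It is not the harness used for the figures above; the name ns_per_step and the callable passed to it are hypothetical, and a real measurement would need warm-up runs and repetitions.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper: runs `step` exactly `steps` times and returns the
// average wall-clock time per call in nanoseconds.
template <typename Step>
double ns_per_step(Step&& step, std::size_t steps) {
    assert(steps > 0);  // avoid dividing by zero below
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < steps; ++i) {
        step();
    }
    const auto stop = std::chrono::steady_clock::now();
    // Convert the elapsed interval to fractional nanoseconds.
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(steps);
}
```

The step callable should have an observable side effect (for example a write to a volatile variable), otherwise the optimizer may remove the loop body.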

What can be seen from the assembly is the following:
The vs2019 assembly is roughly 3 times as long as the clang-cl assembly.
The vs2019 assembly has 934 instructions containing the word "mov", of which at least 307 are unaligned moves (vmovupd).
clang-cl, in comparison, generates only 184 mov-like instructions, of which only 5 are vmovupd.
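
Counts like these can be obtained by scanning the assembly listing for substrings. The following is only a sketch of that idea, assuming a plain-text listing with one instruction per line; MovCounts and count_mov_lines are hypothetical names, and a substring match will also count labels or comments that happen to contain "mov".

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical result type for the scan below.
struct MovCounts {
    std::size_t lines = 0;     // non-blank lines in the listing
    std::size_t mov_like = 0;  // lines containing the substring "mov"
    std::size_t vmovupd = 0;   // lines containing the substring "vmovupd"
};

// Scans an assembly listing line by line and counts mov-like instructions.
inline MovCounts count_mov_lines(const std::string& listing) {
    MovCounts counts;
    std::istringstream in(listing);
    std::string line;
    while (std::getline(in, line)) {
        // Skip lines that are empty or contain only whitespace.
        if (line.find_first_not_of(" \t\r") == std::string::npos) continue;
        ++counts.lines;
        if (line.find("mov") != std::string::npos) ++counts.mov_like;
        if (line.find("vmovupd") != std::string::npos) ++counts.vmovupd;
    }
    // Every "vmovupd" line also contains "mov".
    assert(counts.vmovupd <= counts.mov_like);
    return counts;
}
```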

The question thus is: 
Why does VS generate so many mov instructions with the eigen3 library?
Why are there so many unaligned moves even though eigen3 aligns the data?

Compiler Flags:
/permissive- /MP /we"4289" /GS- /TP /W4 /Zc:wchar_t /Zi /Gm- /O2 /Ob2 /Zc:inline /fp:fast /D "WIN32" /D "_WINDOWS" /D "NDEBUG" /D "_SILENCE_CXX17_RESULT_OF_DEPRECATION_WARNING" /D "USE_BOOST_RANDOM" /D "USE_PCG_RANDOM" /D "CMAKE_INTDIR=\"RelWithDebInfo\"" /D "_MBCS" /fp:except- /errorReport:prompt /WX- /Zc:forScope /GR /arch:AVX /Gd /Oy /Oi /MD /std:c++17 /Fa"RelWithDebInfo/" /EHsc  /Ot /diagnostics:classic  /w44640 /w14242 /w14254 /w14263 /w14265 /w14287 /w14296 /w14311 /w14545 /w14546 /w14547 /w14549 /w14555 /w14619 /w14640 /w14826 /w14905 /w14906 /w14928 /bigobj /Ob3

Side Note: 
My code also ran into the problem of #1365, but I just decided to switch to clang-cl ;)
I also posted the bug to VS Feedback because it could be an optimizer bug.
Comment 1 neumann 2019-02-14 14:32:42 UTC
Created attachment 925 [details]
EIGEN_STRONG_INLINE Patch to inline everything with VS2019
Comment 2 neumann 2019-02-14 14:33:10 UTC
Created attachment 926 [details]
Assembly generated by clang-cl
Comment 3 neumann 2019-02-14 14:33:41 UTC
Created attachment 927 [details]
Assembly generated by vs2019
Comment 4 Gael Guennebaud 2019-02-15 10:19:39 UTC
Is this new to VS 2019? Was the speed better with the previous version of VS? Also, please attach the code of the relevant function, otherwise we cannot build any insights from the ASM alone except that VS is doing countless useless copies.
Comment 5 neumann 2019-02-15 11:56:22 UTC
Here is a small example:
https://godbolt.org/z/hxeVQc
(I did not even need to implement more things)
(the code also gives an error if I add /permissive- as a compiler flag)

Performance has been bad since I switched from 3.2 to 3.3, when I was still using VS2015. clang-cl performance was always good (better than VS, although not by a factor of 3 with Eigen 3.2).

Personally, I think it is an optimizer bug.
That is why I also posted it to VS Feedback.
Comment 7 neumann 2019-02-15 13:10:58 UTC
> Performance has been bad since I switched from 3.2 to 3.3 when I was still
> using VS2015. Clang-Cl Performance was always good (better as VS although
> not a factor of 3 with eigen 3.2). 

see also: 
http://eigen.tuxfamily.org/bz/show_bug.cgi?id=1365
Comment 8 neumann 2019-02-15 13:49:17 UTC
Created attachment 928 [details]
even more strong inlining

Found even more places that need strong inlining.
Comment 9 neumann 2019-02-15 13:51:25 UTC
Created attachment 929 [details]
More complex example generating a lot of mov in VS

clang-cl: 224 lines of assembly / 77 mov-like / 9 vmovupd
vs2019: 802 lines of assembly / 614 mov-like / 120 vmovupd
Comment 10 neumann 2019-02-15 13:51:57 UTC
Created attachment 930 [details]
VS assembly for the complex case
Comment 11 neumann 2019-02-15 13:52:20 UTC
Created attachment 931 [details]
clang cl assembly for the complex case
Comment 12 Gael Guennebaud 2019-02-15 15:46:00 UTC
Thank you for reporting those issues to VS team. Looking forward to what they have to say.


I applied the strong inline changes (I added more):

https://bitbucket.org/eigen/eigen/commits/2c6c790507
https://bitbucket.org/eigen/eigen/commits/a5e0ab8403

and kudos for finding the binary op members in Macros.h ;)
Comment 13 neumann 2019-02-15 19:27:24 UTC
(In reply to Gael Guennebaud from comment #12)
> and kudos for finding the binary op members in Macros.h ;)

That one wasn't even difficult. The difficult one was the compiler-generated copy constructor, which was not inlined. Going back from the assembly to the code went nowhere, and the name of the called function was so long that it was not fully displayed in the assembly. With all other functions I could always step into the function call, go back to the source code, and see where the STRONG_INLINE or inline was missing. The macro one was just a right-click and "go to definition" away ;)
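
To illustrate the copy-constructor point: an implicitly generated copy constructor has no declaration in the source that an inlining attribute could be attached to, so one option is to write it out explicitly. The sketch below is not Eigen code and not the content of the patches; SKETCH_STRONG_INLINE, SumExpr, and eval_by_value are hypothetical names, and the macro only assumes the usual __forceinline (MSVC, clang-cl) and always_inline (GCC, Clang) spellings.

```cpp
#include <cassert>

// Hypothetical force-inline macro for this sketch only.
#if defined(_MSC_VER)
  #define SKETCH_STRONG_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
  #define SKETCH_STRONG_INLINE inline __attribute__((always_inline))
#else
  #define SKETCH_STRONG_INLINE inline
#endif

// Hypothetical expression-like wrapper that stores two operands by value.
template <typename Lhs, typename Rhs>
struct SumExpr {
    Lhs lhs;
    Rhs rhs;

    SKETCH_STRONG_INLINE SumExpr(const Lhs& l, const Rhs& r) : lhs(l), rhs(r) {}

    // Spelled out explicitly so that the force-inline request can be attached;
    // an implicitly generated copy constructor offers no place to put it.
    SKETCH_STRONG_INLINE SumExpr(const SumExpr& other)
        : lhs(other.lhs), rhs(other.rhs) {}

    SKETCH_STRONG_INLINE double eval() const { return lhs + rhs; }
};

// Taking the expression by value forces a call to the copy constructor.
SKETCH_STRONG_INLINE double eval_by_value(SumExpr<double, double> e) {
    return e.eval();
}

// Small usage check for the sketch.
inline void sketch_self_check() {
    const SumExpr<double, double> e(1.5, 2.0);
    const SumExpr<double, double> copy(e);  // explicit copy constructor
    assert(copy.eval() == 3.5);
    assert(eval_by_value(e) == 3.5);        // copies again at the call site
}
```

Whether the compiler actually honors the request still has to be checked in the generated assembly.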

Hopefully the VS team will solve the inlining and optimizer issue (some day). Until then clang-cl is my friend ;). (Or the Intel compiler, if it can someday compile my code... Maybe version 20 has all the required C++17 features.)

Until then I wait for clang-cl to implement SVML intrinsics so that I can finally optimize:

//Prepare Sines and Cosines Cache
//Could try to get the compiler to emit sincos call!<-does not work with clang-cl 
StateSines = yi.array().sin();
StateCosines = yi.array().cos();
(5D vectors)
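
A plain-C++ sketch of the same cache without Eigen, computing both values in one pass over the 5 entries. Some compiler and math-library combinations can merge adjacent sin and cos calls on the same argument into one combined call, but that is not guaranteed, and as noted above it does not happen with clang-cl. SinCosCache and make_sin_cos_cache are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

constexpr std::size_t kStateSize = 5;  // 5D vectors, as in the snippet above

// Hypothetical container for the cached values.
struct SinCosCache {
    std::array<double, kStateSize> sines;
    std::array<double, kStateSize> cosines;
};

// Computes sin and cos of every entry in a single loop, so that both calls
// for one argument sit next to each other.
inline SinCosCache make_sin_cos_cache(const std::array<double, kStateSize>& yi) {
    SinCosCache cache{};
    for (std::size_t i = 0; i < kStateSize; ++i) {
        cache.sines[i] = std::sin(yi[i]);
        cache.cosines[i] = std::cos(yi[i]);
    }
    return cache;
}
```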
Comment 14 Gael Guennebaud 2019-02-16 10:08:27 UTC
> StateSines = yi.array().sin();
> StateCosines = yi.array().cos();

For that see: bug 984.
Comment 15 Gael Guennebaud 2019-02-20 16:57:56 UTC
For the record, regarding compiler-generated copy-ctor, I've already hit this issue with ICC in CwiseUnaryOp, see bug 667.
Comment 16 neumann 2019-03-25 09:48:47 UTC
Seems like I got someone to at least partially look into the issue.

https://www.reddit.com/r/cpp/comments/b2ulzp/c_team_blog_game_performance_and_compilation_time/ejb08rv?utm_source=share&utm_medium=web2x

Have there been any changes in Eigen from 3.2 to 3.3 related to what the reddit post mentions?
