Short story: with every new VS version I test whether my simulation runs faster, or at least as fast, when compiled with VS as it does with clang-cl. I tested it because I wanted to use some SVML features, but it seems I have to wait until clang-cl supports the same intrinsics to get a speed advantage.
clang-cl: around 140-150 ns/step
vs2019 (without adding force inline to eigen3): around 450 ns/step
vs2019 (with adding force inline to eigen3): around 400 ns/step
vs2019 (forcing everything in hot loop inline; like clang-cl does): around 700 ns/step
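For anyone who wants to reproduce ns/step numbers like the ones above, here is a minimal timing sketch. The report does not say how the timings were taken, so the helper name `ns_per_step` and the use of `std::chrono::steady_clock` are assumptions and not the reporter's actual harness.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper (not from the report): runs `step` `iterations` times
// and returns the average wall-clock time per call in nanoseconds.
template <typename Step>
double ns_per_step(Step&& step, std::size_t iterations) {
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        step();
    }
    const auto stop = std::chrono::steady_clock::now();
    // Convert the elapsed time to a floating-point nanosecond count.
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

The step should write its result somewhere the optimizer cannot discard (for example a `volatile` variable), otherwise the loop may be removed entirely and the number becomes meaningless.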
The assembly shows the following:
The vs2019 assembly is roughly 3 times as long as the clang-cl assembly.
The vs2019 assembly has 934 instructions containing the word "mov", of which at least 307 are unaligned moves (vmovupd).
clang-cl, in comparison, generates only 184 mov-like instructions, of which only 5 are vmovupd.
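Counts like these can be reproduced from the /Fa listings with a small script. The sketch below is one possible way to do it; `MovCounts` and `count_movs` are hypothetical names, and it assumes one instruction per line with the mnemonic as the first whitespace-separated token, which may not hold for every listing format.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical tally of mov-like instructions in an assembly listing.
struct MovCounts {
    std::size_t mov_like = 0;  // mnemonics containing "mov"
    std::size_t vmovupd = 0;   // unaligned packed-double moves
};

// Assumption: the mnemonic is the first token on each line. Comment lines
// (starting with ';' or '#') and labels (ending with ':') are skipped.
inline MovCounts count_movs(std::istream& listing) {
    MovCounts counts;
    std::string line;
    while (std::getline(listing, line)) {
        std::istringstream tokens(line);
        std::string mnemonic;
        if (!(tokens >> mnemonic)) continue;                      // blank line
        if (mnemonic[0] == ';' || mnemonic[0] == '#') continue;   // comment
        if (mnemonic.back() == ':') continue;                     // label
        if (mnemonic.find("mov") != std::string::npos) {
            ++counts.mov_like;
            if (mnemonic == "vmovupd") ++counts.vmovupd;
        }
    }
    return counts;
}
```

Opening the listing with `std::ifstream` and passing it to `count_movs` gives both totals in one pass, which makes it easy to compare the two compilers after each patch.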
The questions thus are:
Why does VS generate so many mov instructions with the eigen3 library?
Why are there so many unaligned moves even though eigen3 aligns the data?
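To rule out the data itself as the cause, the alignment can be checked at runtime. This is a minimal sketch in plain C++ without Eigen; `is_aligned` and `State4d` are hypothetical names, and the 32-byte figure is an assumption matching the width of an AVX register.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if `p` is a multiple of `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical stand-in for a small fixed-size vector of four doubles,
// explicitly aligned to 32 bytes (one AVX register).
struct alignas(32) State4d {
    double values[4];
};
```

If a check like `is_aligned(ptr, 32)` holds for the buffers in the hot loop and the compiler still emits vmovupd, the unaligned forms are a code generation choice rather than a consequence of misaligned data.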
/permissive- /MP /we"4289" /GS- /TP /W4 /Zc:wchar_t /Zi /Gm- /O2 /Ob2 /Zc:inline /fp:fast /D "WIN32" /D "_WINDOWS" /D "NDEBUG" /D "_SILENCE_CXX17_RESULT_OF_DEPRECATION_WARNING" /D "USE_BOOST_RANDOM" /D "USE_PCG_RANDOM" /D "CMAKE_INTDIR=\"RelWithDebInfo\"" /D "_MBCS" /fp:except- /errorReport:prompt /WX- /Zc:forScope /GR /arch:AVX /Gd /Oy /Oi /MD /std:c++17 /Fa"RelWithDebInfo/" /EHsc /Ot /diagnostics:classic /w44640 /w14242 /w14254 /w14263 /w14265 /w14287 /w14296 /w14311 /w14545 /w14546 /w14547 /w14549 /w14555 /w14619 /w14640 /w14826 /w14905 /w14906 /w14928 /bigobj /Ob3
My code also hit the problem from #1365, but I just decided to switch to clang-cl ;)
I also posted the bug to VS Feedback because it could be an optimizer bug.
Created attachment 925 [details]
EIGEN_STRONG_INLINE Patch to inline everything with VS2019
Created attachment 926 [details]
Assembly generated by clang-cl
Created attachment 927 [details]
Assembly generated by vs2019
Is this new to VS 2019? Was the speed better with the previous version of VS? Also, please attach the code of the relevant function, otherwise we cannot gain any insight from the ASM alone, except that VS is doing countless useless copies.
Here is just a small example:
(I did not even need to implement more things)
(the code also gives an error if I add /permissive- as a compiler flag)
Performance has been bad since I switched from 3.2 to 3.3, when I was still using VS2015. clang-cl performance was always good (better than VS, although not by a factor of 3 with eigen 3.2).
Personally, I think it is an optimizer bug.
That is why I also posted it in VS Feedback.
VS Developer Community Links:
Created attachment 928 [details]
even more strong inlining
Found even more places that need strong inlining.
Created attachment 929 [details]
More complex example generating a lot of mov in VS
clang-cl 224 lines of assembly/ 77 mov like / 9 vmovupd
vs2019 802 lines of assembly/ 614 mov like / 120 vmovupd
Created attachment 930 [details]
VS assembly for the complex case
Created attachment 931 [details]
clang cl assembly for the complex case
Thank you for reporting those issues to VS team. Looking forward to what they have to say.
I applied the strong inline changes (I added more):
and kudos for finding the binary op members in Macros.h ;)
(In reply to Gael Guennebaud from comment #12)
> and kudos for finding the binary op members in Macros.h ;)
That one wasn't even difficult. The difficult one was the compiler-generated copy constructor, which was not inlined. Going back from the assembly to the code went nowhere, and the name of the function call was so long that it was not fully displayed in the assembly. With all other functions I could always step into the function call, go back to the source code, and see where the STRONG_INLINE or inline was missing. The macro one was just a right click and "go to definition" away ;)
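The copy-constructor problem can be illustrated in isolation. An implicitly generated copy constructor cannot carry an inline hint, so one workaround is to write it out by hand and annotate it. The sketch below is not Eigen code: `MY_STRONG_INLINE`, `ScaleOp` and `apply_twice` are hypothetical names, and the macro only mirrors the idea of mapping to `__forceinline` on MSVC.

```cpp
// Hypothetical macro, modeled on the idea of a "strong inline" hint.
#if defined(_MSC_VER)
#  define MY_STRONG_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#  define MY_STRONG_INLINE inline __attribute__((always_inline))
#else
#  define MY_STRONG_INLINE inline
#endif

// Hypothetical functor standing in for a small expression-template node.
struct ScaleOp {
    double factor;

    explicit MY_STRONG_INLINE ScaleOp(double f) : factor(f) {}

    // Written out by hand only so that it can carry the inline hint;
    // it does exactly what the implicit copy constructor would do.
    MY_STRONG_INLINE ScaleOp(const ScaleOp& other) : factor(other.factor) {}

    MY_STRONG_INLINE double operator()(double x) const { return factor * x; }
};

// Takes the functor by value, so the copy constructor runs on every call.
template <typename Op>
MY_STRONG_INLINE double apply_twice(Op op, double x) {
    return op(op(x));
}
```

The trade-off is that a user-provided copy constructor makes the type non-trivially copyable, which can itself affect code generation, so this is worth checking in the assembly rather than assuming it helps.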
Hopefully the VS team will solve the inlining and optimizer issue (some day). Until then clang-cl is my friend ;) (or the Intel compiler, if it can someday compile my code... maybe version 20 has all the required C++17 features).
Until then I wait for clang-cl to implement SVML intrinsics so that I can finally optimize:
//Prepare Sines and Cosines Cache
//Could try to get the compiler to emit sincos call!<-does not work with clang-cl
StateSines = yi.array().sin();
StateCosines = yi.array().cos();
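For comparison, the same cache can be filled in a single pass in plain C++, so that sin and cos of the same argument sit next to each other. This is only a sketch with standard library calls: `sin_cos_cache` is a hypothetical name, and whether a compiler merges the two calls into one combined sincos routine depends on the compiler and flags and is not guaranteed.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: fills both caches in one loop over the angles.
inline void sin_cos_cache(const std::vector<double>& angles,
                          std::vector<double>& sines,
                          std::vector<double>& cosines) {
    sines.resize(angles.size());
    cosines.resize(angles.size());
    for (std::size_t i = 0; i < angles.size(); ++i) {
        // Same argument, adjacent calls: the pattern a compiler would need
        // to see in order to fuse them.
        sines[i] = std::sin(angles[i]);
        cosines[i] = std::cos(angles[i]);
    }
}
```

Compared with two separate array expressions, this reads the input only once, which is the part that is certain; any further gain from a fused sincos would have to be confirmed in the generated assembly.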
> StateSines = yi.array().sin();
> StateCosines = yi.array().cos();
For that see: bug 984.
For the record, regarding compiler-generated copy-ctor, I've already hit this issue with ICC in CwiseUnaryOp, see bug 667.
Seems like I got someone to at least partially look into the issue.
Have there been any changes to eigen from 3.2 to 3.3 related to what the reddit post mentions?