Bug 357 - Poor fixed-size vectorizable performance with MSVC 2010
Status: NEW
Alias: None
Product: Eigen
Classification: Unclassified
Component: Core - vectorization
Version: 3.0
Hardware: x86 - 64-bit Windows
: --- Unknown
Assignee: Nobody
Reported: 2011-10-04 18:06 UTC by Colm
Modified: 2013-08-19 22:06 UTC



Attachments
GCC Assembler output for test case. (9.84 KB, text/plain)
2011-10-04 20:48 UTC, Colm
Portion of MSVC assembler output (6.94 KB, text/plain)
2011-10-04 20:49 UTC, Colm

Description Colm 2011-10-04 18:06:42 UTC
I was having trouble with the matrix exponential function on Windows with Visual C++: it refused to produce optimized code. After some experimenting I came up with this simple test, which seems to show a huge performance difference between gcc and MSVC for fixed-size vectorizable classes.

e.g.

#include <ctime>
#include <iostream>
#include <Eigen/Dense>

using Eigen::Matrix4d;

int main()
{
	Matrix4d A = Matrix4d::Random();
	Matrix4d A_Id = Matrix4d::Identity();
	Matrix4d A2 = A * A;
	Matrix4d A4 = A2 * A2;
	Matrix4d m_tmp2;

	// Scalar coefficients; only the odd-indexed ones are used below.
	const double b[] = {17297280., 8648640., 1995840., 277200., 25200., 1512., 56., 1.};

	std::clock_t start = std::clock();

	// Evaluate the same fixed-size expression ten million times.
	for (int ct = 0; ct < 10000000; ++ct)
	{
		m_tmp2 = b[7]*A + b[5]*A4 + b[3]*A2 + b[1]*A_Id;
	}
	std::clock_t stop = std::clock();
	std::cout << "Time taken: " << (stop-start)/(double)CLOCKS_PER_SEC << std::endl;

	return 0;
}

Using gcc 4.6.1 (TDM-GCC) as 
>> g++ -s -O3 -DNDEBUG -Ieigen 
gives 
Time taken: 0.03

Using MSVC 2010 64bit as
>> cl /O2 /D"NDEBUG" -I"eigen"
gives
Time taken: 0.213

The difference vanishes (gcc is actually a little slower) with dynamic-size matrices.
Comment 1 Gael Guennebaud 2011-10-04 18:31:44 UTC
Be careful with this kind of simple benchmark, where the compiler can optimize your code too aggressively. For instance, the compiler could remove the for loop completely or take advantage of the fact that the values of the objects are known at compile time.

It is better to write a small function with EIGEN_DONT_INLINE:

EIGEN_DONT_INLINE void foo(Matrix4d& A, ....) {
 m_tmp2 = b[7]*A + b[5]*A4 + b[3]*A2 + b[1]*A_Id;
}

and then call this function multiple times to bench it.

However, I'm not sure EIGEN_DONT_INLINE does anything with MSVC, so perhaps implement the function in a separate .cpp file to make sure it won't be inlined.
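
A minimal sketch of that benchmarking approach, written without Eigen so that it stands alone. Everything here is an assumption made for illustration: BENCH_NOINLINE, Mat4, kernel and time_kernel are hypothetical names, and a plain array of 16 doubles stands in for Matrix4d. The compiler-specific no-inline annotations should keep the kernel out of the timing loop, but it is still worth confirming that in the generated assembler.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical macro (not part of Eigen): ask the compiler not to inline.
#if defined(_MSC_VER)
#  define BENCH_NOINLINE __declspec(noinline)
#else
#  define BENCH_NOINLINE __attribute__((noinline))
#endif

// Stand-in for a 4x4 matrix of doubles: 16 contiguous values.
struct Mat4 { double v[16]; };

// out = b[7]*A + b[5]*A4 + b[3]*A2 + b[1]*Id, element by element.
BENCH_NOINLINE void kernel(Mat4& out, const Mat4& A, const Mat4& A2,
                           const Mat4& A4, const Mat4& Id, const double* b)
{
    for (std::size_t i = 0; i < 16; ++i)
        out.v[i] = b[7]*A.v[i] + b[5]*A4.v[i] + b[3]*A2.v[i] + b[1]*Id.v[i];
}

// Calls the kernel `iterations` times and returns the elapsed seconds.
double time_kernel(int iterations, Mat4& out, const Mat4& A, const Mat4& A2,
                   const Mat4& A4, const Mat4& Id, const double* b)
{
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int ct = 0; ct < iterations; ++ct)
        kernel(out, A, A2, A4, Id, b);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Filling the inputs at run time (for example from Random()) rather than with literals also keeps the compiler from folding the whole expression into constants.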
Comment 2 Gael Guennebaud 2011-10-04 18:42:31 UTC
OK, ICC seems to perform poorly as well. I checked the assembler, and the reason is poor inlining. I bet the reason is the same with MSVC.

To check the asm, I add an enclosing pair:

EIGEN_ASM_COMMENT("mybegin");
...
EIGEN_ASM_COMMENT("myend");

around the critical expression to make the relevant asm lines easier to find.

You could try to figure out which functions are poorly inlined with MSVC, declare them with EIGEN_STRONG_INLINE, and get back to us.
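
For readers without Eigen at hand, the marker idea can be sketched roughly as follows. MY_ASM_COMMENT and scaled_sum are hypothetical names introduced here, and the macro is only an approximation of the idea, not Eigen's definition. It emits an assembler comment with GCC-compatible compilers on x86 and expands to nothing elsewhere, so with 64-bit MSVC (which has no inline assembler) one would presumably rely on a listing that interleaves source lines, such as the one produced by cl /FAs.

```cpp
#include <cassert>

// Hypothetical marker macro: an assembler comment on GCC-compatible
// x86 compilers, nothing elsewhere.
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#  define MY_ASM_COMMENT(X) __asm__ volatile("#" X)
#else
#  define MY_ASM_COMMENT(X)
#endif

// Returns the sum over i of a*x[i] + b*y[i]; the markers bracket the
// loop so that it is easy to locate in the output of `g++ -S`.
double scaled_sum(const double* x, const double* y, double a, double b, int n)
{
    assert(n >= 0);
    double acc = 0.0;
    MY_ASM_COMMENT("mybegin");
    for (int i = 0; i < n; ++i)
        acc += a * x[i] + b * y[i];
    MY_ASM_COMMENT("myend");
    return acc;
}
```

Searching the generated .s file for "mybegin" then leads straight to the instructions of interest.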
Comment 3 Gael Guennebaud 2011-10-04 18:49:18 UTC
EDIT:

Actually, ICC does not inline the assignment, but the rest is properly inlined, so inlining is not the performance issue for ICC. The reason is excessive use of the movddup instruction, which is issued multiple times (4) on the same variable when it should be issued only once per b[i].

Could you check the assembler produced by MSVC?
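
To illustrate what "once per b[i]" means, here is a hand-written sketch using SSE2 intrinsics, with a scalar fallback for other targets. It is not what Eigen or any of the compilers generates; scaled_sum16 and SKETCH_HAVE_SSE2 are hypothetical names, and only two terms are shown to keep it short. Each scalar is broadcast into a two-double packet once, before the loop (a compiler may lower that broadcast to movddup when SSE3 is enabled), and the packet is then reused for all eight packets of a 4x4 matrix.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE2__) || defined(_M_X64)
#  include <emmintrin.h>
#  define SKETCH_HAVE_SSE2 1
#endif

// out[i] = c0*x0[i] + c1*x1[i] for the 16 doubles of a 4x4 matrix.
void scaled_sum16(double* out, const double* x0, const double* x1,
                  double c0, double c1)
{
    assert(out && x0 && x1);
#ifdef SKETCH_HAVE_SSE2
    // Broadcast each scalar exactly once, outside the loop.
    const __m128d p0 = _mm_set1_pd(c0);
    const __m128d p1 = _mm_set1_pd(c1);
    for (std::size_t i = 0; i < 16; i += 2) {
        const __m128d t0 = _mm_mul_pd(p0, _mm_loadu_pd(x0 + i));
        const __m128d t1 = _mm_mul_pd(p1, _mm_loadu_pd(x1 + i));
        _mm_storeu_pd(out + i, _mm_add_pd(t0, t1));
    }
#else
    // Scalar fallback for targets without SSE2.
    for (std::size_t i = 0; i < 16; ++i)
        out[i] = c0 * x0[i] + c1 * x1[i];
#endif
}
```

Re-broadcasting the scalar inside the loop would give the same result but repeat the shuffle for every packet, which is the pattern described for ICC above.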
Comment 4 Colm 2011-10-04 20:48:33 UTC
Created attachment 216
GCC Assembler output for test case.
Comment 5 Colm 2011-10-04 20:49:02 UTC
Created attachment 217
Portion of MSVC assembler output
Comment 6 Colm 2011-10-04 20:50:12 UTC
(In reply to comment #3)
> EDIT:
> 
> actually ICC does not inline the assignment but the rest is properly inlined,
> the performance issue is not here for ICC. The reason is an abusive use of the
> movddup instruction which is called multiple times (4) on the same variable
> while it should be called only once per b[i].
> 
> could you check the assembler produced by MSVC?

Wow, thanks for the fast response.  

Sorry, I'm not much use with assembler code. I spent some time poking around, and the best I can come up with is that gcc is creating these nice "assign_LinearTraversal_CompleteUnrolling" functions whereas MSVC is not. However, I'm clueless as to why. Poking around in Assign.h, everything seems to be EIGEN_STRONG_INLINE.
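
For context, complete unrolling of a fixed-size assignment is typically built from template recursion along these lines. This is a heavily simplified, hypothetical sketch and not the code in Assign.h; unrolled_assign, ScaledSum and assign16 are names made up here. The point is that every recursion level is a separate small function, so the result is only fast if the compiler inlines all of them into one straight-line sequence.

```cpp
#include <cassert>

// Hypothetical sketch: copies coefficients Index..Stop-1, one
// template instantiation per coefficient.
template <int Index, int Stop>
struct unrolled_assign {
    template <typename Dst, typename Src>
    static inline void run(Dst& dst, const Src& src) {
        dst[Index] = src[Index];
        unrolled_assign<Index + 1, Stop>::run(dst, src);
    }
};

// End of the recursion: nothing left to copy.
template <int Stop>
struct unrolled_assign<Stop, Stop> {
    template <typename Dst, typename Src>
    static inline void run(Dst&, const Src&) {}
};

// Tiny lazy expression: coefficient i is ca*a[i] + cb*b[i].
struct ScaledSum {
    const double* a;
    const double* b;
    double ca, cb;
    double operator[](int i) const { return ca * a[i] + cb * b[i]; }
};

// Evaluates the expression into 16 doubles without a run-time loop.
inline void assign16(double (&dst)[16], const ScaledSum& src) {
    assert(src.a != nullptr && src.b != nullptr);
    unrolled_assign<0, 16>::run(dst, src);
}
```

If a compiler declines to inline some of these levels, the assignment degrades into a chain of nested calls, which would be consistent with the difference between the two assembler listings.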
