New user self-registration is currently disabled. Please email eigen-core-team @ if you need an account.
Bug 1181 - Performance Issue changing from 3.2.8 to 3.3beta1
Performance Issue changing from 3.2.8 to 3.3beta1
Product: Eigen
Classification: Unclassified
Component: Core - matrix products
3.3 (current stable)
x86 - 64-bit All
: Normal Performance Problem
Assigned To: Nobody
Depends on:
Blocks: 3.3
  Show dependency treegraph
Reported: 2016-03-18 12:37 UTC by neumann
Modified: 2016-05-31 15:38 UTC (History)
2 users (show)

Test code (1.07 KB, text/plain)
2016-04-14 20:23 UTC, neumann
no flags Details

Description neumann 2016-03-18 12:37:28 UTC
I wanted to try out Eigen 3.3beta1 and found that my code was running approx. ten times slower than with Eigen 3.2.8

By playing around with the code I found that it will also be slow in Eigen 3.2.8 if const is removed from the line "const Vector3d yi{ tmpyi };" 

Testcode: (Inlcude eigen/core und chrono)

int main(int argc, char** argv)
	using namespace Eigen;

	Matrix3d rot, rott;
	rot << 0, 0, 1,
		1, 0, 0,
		0, 1, 0;
	rott << rot.transpose();

	const double Prefactor1 = 1.5456;
	const double Prefactor2 = 3.5456;
	Vector3d tmpyi;
	tmpyi << 2, 1, 9;

	const Vector3d yi{ tmpyi };
	std::chrono::time_point<std::chrono::high_resolution_clock, std::chrono::nanoseconds> det_init_begin, det_init_finished;
	det_init_begin = std::chrono::high_resolution_clock::now();
	Matrix3d tmp;
	for (int i = 0; i < 1000000; ++i)
		tmp = (Prefactor1*(rot*yi.asDiagonal()*rott*-1 * rott + rott*yi.asDiagonal()*rot*rot) + Prefactor2*(yi.asDiagonal()*Matrix3d::Ones()*yi.asDiagonal()));

	det_init_finished = std::chrono::high_resolution_clock::now();
	std::cout << "It took " << (det_init_finished - det_init_begin).count() / 1E6 << " ms to calc 10^6 values " << std::endl << std::endl;
        return 0;
Comment 1 Gael Guennebaud 2016-04-13 14:40:56 UTC
Be careful with such benchmark as the compiler is free to move some computation out of the loop. Anyway, I guess the problem is mostly on inline, so please first try with the latest devel branch as we've already enforced the inline of some critical functions for MSVC. Also, you can avoid redundant computations by rewriting your expressions as: 

  Vector3d p1yi = Prefactor1*yi;
  Matrix3d rr = rot*rot;
  Matrix3d tmp0 = rot*p1yi.asDiagonal();
  Matrix3d tmp1 = rott*p1yi.asDiagonal();
  return tmp1*rr - tmp0*rr.transpose() + (Prefactor2*yi*yi.transpose());

Also, I cannot reproduce such a perf. regression with clang or gcc.
Comment 2 neumann 2016-04-14 20:23:11 UTC
Created attachment 695 [details]
Test code

Did some additional tests and found this easy test case.

Tested with following options under MSVS 2013 (Update 2) / 2015 (Update 1 and 2)
(Do not have any other compilers installed to test it out)

Compiler options:
/FR"x64\Release\" /GS /GL /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /Ob2 /sdl /Fd"x64\Release\vc140.pdb" /Zc:inline /D "_MBCS" /errorReport:prompt /WX- /Zc:forScope /arch:AVX2 /Gd /Oi /MT /Fa"x64\Release\" /EHsc /nologo /Fo"x64\Release\" /Ot /Fp"x64\Release\Clustertest.pch" 

Disassembly in the slow case has this additional call:
00007FF727782B74 E8 D7 40 00 00       call        Eigen::internal::generic_product_impl<Eigen::Matrix<double,3,1,0,3,1>,Eigen::Transpose<Eigen::Matrix<double,3,1,0,3,1> >,Eigen::DenseShape,Eigen::DenseShape,3>::evalTo<Eigen::Matrix<double,3,3,0,3,3> > (07FF727786C50h) 

so it is a inline problem. The question now is why is it happening in eigen 3.3 and not in 3.2.8? Speed drops from 12-14ms to 140-160ms for me.
(Tested with build: 518152aee731 and 4c94692de3e5 (latest?))

Another call I see is:
00007FF727782D01 E8 4A 34 00 00       call        Eigen::DenseBase<Eigen::Matrix<double,3,1,0,3,1> >::operator<< (07FF727786150h)  

I think this is normal and does not greatly affect speed.

And thanks for the yi*yi.transpose() optimization. Helped a bit getting my code faster.
Comment 3 neumann 2016-04-14 20:47:34 UTC
Hmm changing this line (89) in GeneralProduct.h:

template<> struct product_type_selector<Small, Small, 1>    { enum { ret = LazyCoeffBasedProductMode}; };


template<> struct product_type_selector<Small, Small, 1>    { enum { ret = OuterProduct}; };

makes it a lot faster (about 22ms) but it is still slower than before
Comment 4 neumann 2016-04-14 21:26:25 UTC
So i solved my performance problem by adding EIGEN_STRONG_INLINE to the following functions by just follwing the disassembly calls:

Eigen::internal::generic_product_impl<Eigen::Matrix<double,3,1,0,3,1>,Eigen::Transpose<Eigen::Matrix<double,3,1,0,3,1>>,Eigen::DenseShape,Eigen::DenseShape,3>::evalTo<Eigen::Matrix<double,3,3,0,3,3> >()

Eigen::internal::Assignment<Eigen::Matrix<double, 3, 3, 0, 3, 3>, Eigen::Product<Eigen::Matrix<double, 3, 1, 0, 3, 1>, Eigen::Transpose<Eigen::Matrix<double, 3, 1, 0, 3, 1> >, 0>, Eigen::internal::assign_op<double>, Eigen::internal::Dense2Dense, double>::run()

Eigen::internal::variable_if_dynamic<__int64, 3>::variable_if_dynamic<__int64, 3>()

i do not know if it is a good solution but it works
Comment 5 Christoph Hertzberg 2016-04-15 13:23:46 UTC
If it solves the issue, I'm fine with adding some additional EIGEN_STRONG_INLINE -- but I don't think that <Small,Small, 1> should be changed to OuterProduct (comment 3).
Comment 6 Gael Guennebaud 2016-04-15 20:50:32 UTC
I'm fine with adding such EIGEN_STRONG_INLINE, could you attach the diff so that we are sure we are adding to right ones. Thanks.
Comment 7 Gael Guennebaud 2016-05-31 15:38:28 UTC

Note You need to log in before you can comment on or make changes to this bug.