Bug 1317 - Performance hit in 3.3rc1
Status: CONFIRMED
Product: Eigen
Classification: Unclassified
Component: Core - general
Version: 3.3 (current stable)
Hardware: x86 - 64-bit Windows
Importance: Normal Performance Problem
Assigned To: Nobody
Depends on:
Blocks:
Reported: 2016-09-29 15:12 UTC by Avi Ginsburg
Modified: 2016-10-12 16:23 UTC



Attachments
7z'd assembly dumps for the different versions (88.54 KB, application/x-7z-compressed)
2016-10-05 08:28 UTC, Avi Ginsburg

Description Avi Ginsburg 2016-09-29 15:12:26 UTC
Degraded performance with 3.3rc1 compared to 3.2.9 for the MCVE below.
My test setup: MSVC 2013, 64-bit (32-bit performance is identical between versions); without AVX to be fair, but both with and without SSE. I have not tested on Ubuntu/gcc yet (it's on my todo list).

MCVE:
#include <iostream>
#include <chrono>

// Can be on or off
//#define EIGEN_DONT_VECTORIZE

#ifdef _MSC_VER

#define OLDVER
#ifdef OLDVER
#include <Eigen3.2.9/Eigen/Core>
#else
#include <Eigen3.3rc1/Eigen/Core>
#endif

#else
#include <Eigen/Core>
#endif


int main()
{
	int len = 16 * 512;

	Eigen::Matrix3Xf l; l.setRandom(3, len);
	float res = 0;

	std::cout << "Hello Eigen\t";
	std::cout << EIGEN_WORLD_VERSION << "." 
		<< EIGEN_MAJOR_VERSION << "." 
		<< EIGEN_MINOR_VERSION << "\n";
	std::cout << Eigen::SimdInstructionSetsInUse() << "\n";

	auto t1 = std::chrono::high_resolution_clock::now();

	for (int i = 0; i < len; i++)
	{
		res += (l.leftCols(i).colwise() - l.col(i)).colwise().norm().eval().sum();
	}

	auto t2 = std::chrono::high_resolution_clock::now();
	auto t = std::chrono::duration_cast <std::chrono::milliseconds>(t2 - t1).count();

	std::cout << "timing: " << t << " ms\t" << res << std::endl;

	return 0;
}

Outputs:
Hello Eigen     3.2.9
SSE, SSE2
timing: 275 ms  4.44498e+007

Hello Eigen     3.2.94
SSE, SSE2
timing: 733 ms  4.44498e+007

Just a note: with 3.2.9, using eval() is faster than omitting it. With 3.3rc1, omitting eval() is faster than even 3.2.9 with eval().

The main issue is `(l.leftCols(i).colwise() - l.col(i)).colwise()`. Whatever appears to the right (as long as eval is used) matters less.
Comment 1 Gael Guennebaud 2016-09-30 15:44:32 UTC
Interesting. With gcc I get the exact same performance for both versions (90ms), but with clang I get:

3.2: 90ms
3.3: 440ms

No choice but to investigate the asm to track down inlining issues.
Comment 2 Gael Guennebaud 2016-09-30 15:52:21 UTC
hm, no inlining issue.

Here is what is generated with 3.2, or with gcc on 3.3:

L7:
	movss	8(%rax), %xmm2
	addq	$12, %rax
	movss	-8(%rax), %xmm8
	subss	%xmm5, %xmm2
	movss	-12(%rax), %xmm1
	cmpq	%rax, %rdx
	subss	%xmm6, %xmm8
	subss	%xmm4, %xmm1
	mulss	%xmm2, %xmm2
	mulss	%xmm8, %xmm8
	mulss	%xmm1, %xmm1
	addss	%xmm8, %xmm2
	addss	%xmm2, %xmm1
	movss	%xmm1, -12(%rsp)
	movss	-12(%rsp), %xmm1
	sqrtss	%xmm1, %xmm1
	addss	%xmm1, %xmm3
	jne	L7


And with clang on 3.3:


LBB0_4:                                 ## %.lr.ph.i
                                        ##   Parent Loop BB0_2 Depth=1
                                        ## =>  This Inner Loop Header: Depth=2
	movq	%rdx, -344(%rbp)
	movq	%rbx, -328(%rbp)
	movq	%rdi, -320(%rbp)
	movq	-368(%rbp), %r9
	movq	-360(%rbp), %r11
	movq	%r11, 8(%rax)
	movq	%r9, (%rax)
	movq	$3, -296(%rbp)
	movq	%rcx, -288(%rbp)
	movq	%rdi, -272(%rbp)
	movq	$0, -264(%rbp)
	movq	%rbx, -256(%rbp)
	movq	%rdx, -160(%rbp)
	movq	$3, -248(%rbp)
	movq	%r15, -152(%rbp)
	movq	%rcx, -144(%rbp)
	movq	%rdi, -128(%rbp)
	movq	%rbx, -232(%rbp)
	movq	$0, -120(%rbp)
	movq	%rbx, -112(%rbp)
	movq	$3, -104(%rbp)
	movq	$0, -216(%rbp)
	movq	%rcx, -96(%rbp)
	movq	%r12, -88(%rbp)
	movq	$0, -72(%rbp)
	movq	%r8, -208(%rbp)
	movq	%r8, -64(%rbp)
	movq	%r13, -56(%rbp)
	movss	-8(%rsi), %xmm3         ## xmm3 = mem[0],zero,zero,zero
	movss	-4(%rsi), %xmm4         ## xmm4 = mem[0],zero,zero,zero
	subss	(%rcx), %xmm3
	mulss	%xmm3, %xmm3
	subss	4(%rcx), %xmm4
	mulss	%xmm4, %xmm4
	movss	(%rsi), %xmm5           ## xmm5 = mem[0],zero,zero,zero
	subss	8(%rcx), %xmm5
	mulss	%xmm5, %xmm5
	addss	%xmm4, %xmm5
	addss	%xmm3, %xmm5
	xorps	%xmm3, %xmm3
	movss	%xmm5, %xmm3            ## xmm3 = xmm5[0],xmm3[1,2,3]
	sqrtss	%xmm3, %xmm3
	incq	%r8
	addq	$12, %rsi
	cmpq	%r8, %rbx
	addss	%xmm3, %xmm2
	jne	LBB0_4

so 28 lines of completely useless movq....

It would be interesting to see what MSVC generates.
Comment 3 Gael Guennebaud 2016-09-30 16:37:53 UTC
OK, a simpler way to exhibit the issue is:

for (int i = 0; i < l.cols(); i++)
  for (int j = 0; j < l.cols(); j++)
    res += (l.col(j)-l.col(i)).sum();

versus:

for (int i = 0; i < l.cols(); i++)
  for (int j = 0; j < l.cols(); j++)
    res += (l - l.col(i).rowwise().replicate(l.cols())).col(j).sum();

Both are equivalent, but the second generates more layers of abstraction, which seems to confuse clang with Eigen 3.3 (even though nothing fundamentally changed between 3.2 and 3.3: instead of having a tree of expressions we have a tree of evaluators).

This could probably be fixed by writing more code to push any Block expression down to the leaves of the expression tree, but that will be for later!

In the meantime, I really don't know what could be done to ease the compiler's job. My experiments have not given me any clue so far.
Comment 4 Gael Guennebaud 2016-09-30 16:55:00 UTC
I think I got some hints.

Actually, one can work around the issue by evaluating the most deeply nested block expression:

Vector3f li = l.col(i);
res += (l.colwise() - li).colwise().norm().eval().sum();

Even more surprising is that declaring li as a Ref:

Ref<const Vector3f> li = l.col(i);

also "fixes" the issue, whereas here Ref<const Vector3f> is really like l.col(i):
 - no copy
 - both inherit MapBase, which implements the evaluation details.

So the only difference is that Block provides additional accessors to the starting row/col indices and a reference to the nested expression, so that one can crawl through the expression tree and get everything needed to rewrite it.

The "problem" is that Block has to store this information, even though it is not needed here. This is something new in 3.2. GCC is smart enough to get rid of it completely, but apparently clang and msvc are not that capable in this case.
Comment 5 Gael Guennebaud 2016-09-30 20:04:39 UTC
Sadly, that's not so simple. If I replace:


Ref<const Vector3f> li = l.col(i);

with

Map<const Vector3f> li(l.data()+i*3);

then clang messes up again, even though in this case Map and Ref are binary compatible and share the same traits and the same evaluator... And after simplifying Block<>, I get the same result as with Map.
Comment 6 Gael Guennebaud 2016-10-01 13:38:19 UTC
Problem solved with clang:

https://bitbucket.org/eigen/eigen/commits/00094bd861b8/
Summary:     Bug 1317: fix performance regression with some Block expressions and clang by helping it to remove dead code.
The trick is to get rid of the nested expression in the evaluator by copying only the required information (here, the strides).


Let us know if this also does the trick with Visual.
Comment 7 Avi Ginsburg 2016-10-01 18:45:53 UTC
I won't be in the office until Wednesday. I'll let you know then.
Comment 8 Avi Ginsburg 2016-10-05 08:28:20 UTC
Created attachment 743 [details]
7z'd assembly dumps for the different versions

Somewhat better, but still not optimal (for cl; I'm not talking about gcc). I get the following times:
3.2.7    270ms
3.3rc1   716ms
c7527e   318ms

I'm attaching the entire asm output from VS2013. My assembly is about as good as my French, so I'll leave them to you.
Comment 9 Avi Ginsburg 2016-10-05 08:55:09 UTC
Just as an aside, VS2015 produces performance very similar to VS2013 for all three versions.
Comment 10 Gael Guennebaud 2016-10-12 16:07:17 UTC
It's hard to see because Visual is generating ASM for all levels of inlining... so it is not clear which code is really executed. Nonetheless, I see some ugly lines of useless copies like:

	movups	xmm0, XMMWORD PTR [rax]
	movups	XMMWORD PTR [rcx], xmm0
	movups	xmm1, XMMWORD PTR [rax+16]
	movups	XMMWORD PTR [rcx+16], xmm1
	movups	xmm0, XMMWORD PTR [rax+32]
	movups	XMMWORD PTR [rcx+32], xmm0
	movups	xmm1, XMMWORD PTR [rax+48]
	movups	XMMWORD PTR [rcx+48], xmm1
	movups	xmm0, XMMWORD PTR [rax+64]
	movups	XMMWORD PTR [rcx+64], xmm0
	movups	xmm1, XMMWORD PTR [rax+80]
	movups	XMMWORD PTR [rcx+80], xmm1
	movups	xmm0, XMMWORD PTR [rax+96]
	movups	XMMWORD PTR [rcx+96], xmm0
	movups	xmm1, XMMWORD PTR [rax+112]
	movups	XMMWORD PTR [rcx+112], xmm1
	movups	xmm0, XMMWORD PTR [rax+128]
	movups	XMMWORD PTR [rcx+128], xmm0
	movups	xmm1, XMMWORD PTR [rax+144]
	movups	XMMWORD PTR [rcx+144], xmm1

coming from:

EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE
explicit CwiseUnaryOp(const XprType& xpr, const UnaryOp& func = UnaryOp())
 : m_xpr(xpr), m_functor(func) {}


This represents 40 floats; I have no idea where they come from.
Comment 11 Gael Guennebaud 2016-10-12 16:23:14 UTC
ok, those 160 bytes come from the nested expression:

sizeof( decltype((l.leftCols(3).colwise() - l.col(2)).array().col(1)) )

returns 152 with 3.3, and 120 with 3.2. In theory the expression objects only have to live at compile time, and the compiler should generate no such code (as do clang and gcc).
