| Field | Value |
|---|---|
| Summary | Performance hit in 3.3rc1 |
| Product | Eigen |
| Reporter | Avi Ginsburg <coffeelover82> |
| Component | Core - general |
| Assignee | Nobody <eigen.nobody> |
| Status | CONFIRMED |
| Severity | Performance Problem |
| CC | chtz, coffeelover82, gael.guennebaud, jacob.benoit.1 |
| Priority | Normal |
| Version | 3.3 (current stable) |
| Hardware | x86 - 64-bit |
| OS | Windows |
| Whiteboard | |
| Attachments | |
Description
Avi Ginsburg, 2016-09-29 15:12:26 UTC
Interesting. With gcc I get the exact same performance (90ms), but with clang I get: 3.2: 90ms, 3.3: 440ms. No choice but to investigate the asm to track inlining issues.

Hm, no inlining issue. Here is what is generated with 3.2, or with gcc on 3.3:

```asm
L7:
	movss   8(%rax), %xmm2
	addq    $12, %rax
	movss   -8(%rax), %xmm8
	subss   %xmm5, %xmm2
	movss   -12(%rax), %xmm1
	cmpq    %rax, %rdx
	subss   %xmm6, %xmm8
	subss   %xmm4, %xmm1
	mulss   %xmm2, %xmm2
	mulss   %xmm8, %xmm8
	mulss   %xmm1, %xmm1
	addss   %xmm8, %xmm2
	addss   %xmm2, %xmm1
	movss   %xmm1, -12(%rsp)
	movss   -12(%rsp), %xmm1
	sqrtss  %xmm1, %xmm1
	addss   %xmm1, %xmm3
	jne     L7
```

And with clang on 3.3:

```asm
LBB0_4:                             ## %.lr.ph.i
                                    ##   Parent Loop BB0_2 Depth=1
                                    ## =>  This Inner Loop Header: Depth=2
	movq    %rdx, -344(%rbp)
	movq    %rbx, -328(%rbp)
	movq    %rdi, -320(%rbp)
	movq    -368(%rbp), %r9
	movq    -360(%rbp), %r11
	movq    %r11, 8(%rax)
	movq    %r9, (%rax)
	movq    $3, -296(%rbp)
	movq    %rcx, -288(%rbp)
	movq    %rdi, -272(%rbp)
	movq    $0, -264(%rbp)
	movq    %rbx, -256(%rbp)
	movq    %rdx, -160(%rbp)
	movq    $3, -248(%rbp)
	movq    %r15, -152(%rbp)
	movq    %rcx, -144(%rbp)
	movq    %rdi, -128(%rbp)
	movq    %rbx, -232(%rbp)
	movq    $0, -120(%rbp)
	movq    %rbx, -112(%rbp)
	movq    $3, -104(%rbp)
	movq    $0, -216(%rbp)
	movq    %rcx, -96(%rbp)
	movq    %r12, -88(%rbp)
	movq    $0, -72(%rbp)
	movq    %r8, -208(%rbp)
	movq    %r8, -64(%rbp)
	movq    %r13, -56(%rbp)
	movss   -8(%rsi), %xmm3         ## xmm3 = mem[0],zero,zero,zero
	movss   -4(%rsi), %xmm4         ## xmm4 = mem[0],zero,zero,zero
	subss   (%rcx), %xmm3
	mulss   %xmm3, %xmm3
	subss   4(%rcx), %xmm4
	mulss   %xmm4, %xmm4
	movss   (%rsi), %xmm5           ## xmm5 = mem[0],zero,zero,zero
	subss   8(%rcx), %xmm5
	mulss   %xmm5, %xmm5
	addss   %xmm4, %xmm5
	addss   %xmm3, %xmm5
	xorps   %xmm3, %xmm3
	movss   %xmm5, %xmm3            ## xmm3 = xmm5[0],xmm3[1,2,3]
	sqrtss  %xmm3, %xmm3
	incq    %r8
	addq    $12, %rsi
	cmpq    %r8, %rbx
	addss   %xmm3, %xmm2
	jne     LBB0_4
```

So, 28 lines of completely useless movq... It would be interesting to see what MSVC generates.
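For reference, the good 3.2/gcc inner loop above corresponds to straightforward scalar code like the following sketch (plain C++, no Eigen; the column-major 3-by-n layout and function name are assumptions, not from the report): three loads, three subtractions, three squarings, two adds, one sqrtss, one accumulate per column.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sum of Euclidean distances from a reference column to every column of a
// 3 x n column-major float matrix -- the computation whose inner loop is
// shown in the asm dumps above.
float sum_of_distances(const std::vector<float>& m, int cols, int ref) {
    const float rx = m[3 * ref + 0];
    const float ry = m[3 * ref + 1];
    const float rz = m[3 * ref + 2];
    float total = 0.0f;
    for (int j = 0; j < cols; ++j) {
        const float dx = m[3 * j + 0] - rx;
        const float dy = m[3 * j + 1] - ry;
        const float dz = m[3 * j + 2] - rz;
        // One sqrtss per column, accumulated into the running total.
        total += std::sqrt(dx * dx + dy * dy + dz * dz);
    }
    return total;
}
```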
OK, a simpler way to exhibit the issue is:

```cpp
for (int i = 0; i < l.cols(); i++)
  for (int j = 0; j < l.cols(); j++)
    res += (l.col(j) - l.col(i)).sum();
```

versus:

```cpp
for (int i = 0; i < l.cols(); i++)
  for (int j = 0; j < l.cols(); j++)
    res += (l - l.col(i).rowwise().replicate(l.cols())).col(j).sum();
```

Both are equivalent, but the second generates more layers of abstraction, which seems to confuse clang with Eigen 3.3 (though nothing fundamentally changed between 3.2 and 3.3: instead of a tree of expressions we have a tree of evaluators). This could probably be fixed by writing more code to push any Block expression down to the leaves of the expression tree, but that will be for later! In the meantime, I really don't know what could be done to ease the compiler's job. My experiments are not giving me any clue so far.

I think I got some hints. Actually, one can work around the issue by evaluating the most nested block expression:

```cpp
Vector3f li = l.col(i);
res += (l.colwise() - li).colwise().norm().eval().sum();
```

Even more surprising is that declaring li as a Ref:

```cpp
Ref<const Vector3f> li = l.col(i);
```

also "fixes" the issue, whereas here Ref<const Vector3f> is really like l.col(i):
- no copy
- both inherit MapBase, which implements the evaluation details.

So the only difference is that Block provides additional accessors to the starting row/col indices and a reference to the nested expression, so that one can crawl through the expression tree and get all that is needed to rewrite it. The "problem" is that Block has to store this information, even though it is not needed here. This is something new in 3.2. GCC is smart enough to get completely rid of it, but apparently clang and MSVC are not that capable in this case.

Sadly, it's not so simple. If I replace:

```cpp
Ref<const Vector3f> li = l.col(i);
```

with

```cpp
Map<const Vector3f> li(l.data() + i*3);
```

then clang messes up again, even though in this case Map and Ref are binary compatible and they share the same traits and the same evaluator...
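The extra state that Block carries can be sketched with toy stand-in types (hypothetical structs, not Eigen's actual classes): a Map-like wrapper stores only the data pointer that evaluation needs, while a Block-like wrapper also stores a reference to the nested expression plus the start indices, all of which the optimizer must prove dead before it can delete the copies.

```cpp
#include <cassert>
#include <cstddef>

// Toy stand-in for the nested expression being viewed.
struct ExprStub {
    const float* data;
    long rows, cols;
};

// Map-like design: only what coefficient access needs.
struct MapLike {
    const float* data;
};

// Block-like design: additionally keeps a reference back into the
// expression tree and the starting row/column, so the tree can be
// rewritten later -- dead weight at evaluation time.
struct BlockLike {
    const float* data;
    const ExprStub& nested;   // reference to the nested expression
    long start_row, start_col;
};

// The bookkeeping makes every copy of the object strictly larger.
static_assert(sizeof(BlockLike) > sizeof(MapLike),
              "rewriting bookkeeping enlarges the expression object");
```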
And after simplifying Block<>, I get the same as Map. Problem solved with clang: https://bitbucket.org/eigen/eigen/commits/00094bd861b8/

Summary: Bug 1317: fix performance regression with some Block expressions and clang by helping it to remove dead code. The trick is to get rid of the nested expression in the evaluator by copying only the required information (here, the strides).

Let us know if this also does the trick with Visual.

I won't be in the office until Wednesday. I'll let you know then.

Created attachment 743 [details]
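The gist of that fix can be sketched as follows (toy types with hypothetical names; the real change is in the commit linked above): instead of the evaluator holding a reference to the whole Block expression, it copies out the pointer and stride it actually needs, so the Block object itself becomes dead and removable.

```cpp
#include <cassert>

// Toy stand-in for a Block expression: data pointer, offset, stride.
struct Expr {
    const float* data;
    long outer_stride;
    long start;
};

// Before: the evaluator references the full expression object, which must
// therefore stay live; clang kept spilling its members to the stack.
struct EvaluatorBefore {
    const Expr& block;
    const float* coeff_ptr() const { return block.data + block.start; }
};

// After: only the information required for coefficient access is copied
// in at construction, cutting the tie to the nested expression.
struct EvaluatorAfter {
    const float* data;
    long outer_stride;
    explicit EvaluatorAfter(const Expr& b)
        : data(b.data + b.start), outer_stride(b.outer_stride) {}
    const float* coeff_ptr() const { return data; }
};
```

Both designs address the same coefficients; the difference is only in what the evaluator keeps alive.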
7z'd assembly dumps for the different versions
Somewhat better, but still not optimal (for cl; I'm not talking about gcc). I get the following times:
- 3.2.7: 270 ms
- 3.3rc1: 716 ms
- c7527e: 318 ms
I'm attaching the entire asm output from VS2013. My assembly is about as good as my French, so I'll leave them to you.
Just as an aside, VS2015 produces performance very similar to VS2013 for all three versions.

It's hard to see because Visual generates asm for every level of inlining, so it is not clear which code is really executed. Nonetheless, I see some ugly runs of useless copies like:

```asm
movups  xmm0, XMMWORD PTR [rax]
movups  XMMWORD PTR [rcx], xmm0
movups  xmm1, XMMWORD PTR [rax+16]
movups  XMMWORD PTR [rcx+16], xmm1
movups  xmm0, XMMWORD PTR [rax+32]
movups  XMMWORD PTR [rcx+32], xmm0
movups  xmm1, XMMWORD PTR [rax+48]
movups  XMMWORD PTR [rcx+48], xmm1
movups  xmm0, XMMWORD PTR [rax+64]
movups  XMMWORD PTR [rcx+64], xmm0
movups  xmm1, XMMWORD PTR [rax+80]
movups  XMMWORD PTR [rcx+80], xmm1
movups  xmm0, XMMWORD PTR [rax+96]
movups  XMMWORD PTR [rcx+96], xmm0
movups  xmm1, XMMWORD PTR [rax+112]
movups  XMMWORD PTR [rcx+112], xmm1
movups  xmm0, XMMWORD PTR [rax+128]
movups  XMMWORD PTR [rcx+128], xmm0
movups  xmm1, XMMWORD PTR [rax+144]
movups  XMMWORD PTR [rcx+144], xmm1
```

coming from:

```cpp
EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE
explicit CwiseUnaryOp(const XprType& xpr, const UnaryOp& func = UnaryOp())
  : m_xpr(xpr), m_functor(func) {}
```

This represents 40 floats; no idea where they come from.

OK, those 160 bytes come from the nested expression:

```cpp
sizeof( decltype((l.leftCols(3).colwise() - l.col(2)).array().col(1)) )
```

returns 152 with 3.3, and 120 with 3.2. In theory the expression objects only have to live at compile time, and the compiler should generate no such code (as clang and gcc do).

-- GitLab Migration Automatic Message --

This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/1317.
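As a closing illustration of that last point, the way expression-template objects grow with every layer of nesting, even though they only exist to drive compile-time evaluation, can be shown with toy types (not Eigen's; the names and layout here are invented for the sketch):

```cpp
#include <cassert>
#include <cstddef>

// Toy expression templates: each wrapper stores its operands by value,
// so sizeof grows with every layer of nesting -- state the optimizer
// must fully eliminate, since it is never needed at run time.
struct Leaf {
    const float* data;
};

template <class L, class R>
struct Sub {          // element-wise difference node
    L lhs;
    R rhs;
};

template <class X>
struct Col {          // column-view node: nested expression + index
    X nested;
    long index;
};

// A nested view of a difference of views is several times the size of a
// bare leaf, mirroring the 152-byte expression measured above.
using Nested = Col<Sub<Leaf, Col<Leaf>>>;
static_assert(sizeof(Nested) > sizeof(Leaf),
              "each abstraction layer adds state");
```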