Bug 203 - Vector4f result = a1*v1 + a2*v2 compiled like ass by eigen3; much better with eigen2
: Vector4f result = a1*v1 + a2*v2 compiled like ass by eigen3; much better with...
Status: RESOLVED FIXED
Product: Eigen
Classification: Unclassified
Component: Core - vectorization
: unspecified
: x86 - 64-bit Linux
: --- normal
Assigned To: Gael Guennebaud
:
:
:
:
: 3.0
  Show dependency treegraph
 
Reported: 2011-02-28 05:17 UTC by Benoit Jacob
Modified: 2014-03-20 10:16 UTC (History)
2 users (show)



Attachments
revert to just using _mm_set1_p[sd] (2.06 KB, patch)
2011-02-28 05:55 UTC, Benoit Jacob
no flags Details | Diff | Splinter Review

Description Benoit Jacob 2011-02-28 05:17:27 UTC
This test program:


#include <Eigen/Core>
using namespace Eigen;
void foo(float a1, const Vector4f& v1,
         float a2, const Vector4f& v2,
         Vector4f& result)
{
  asm volatile("#begin");
  result = a1*v1 + a2*v2;
  asm volatile("#end");
}


compiled like this with eigen3 and gcc 4.4.5 x86-64 linux:
$ g++ -c -S -O2 -I eigen derf.cpp -DNDEBUG -o derf.s


gives this crappy assembly:


#APP
# 9 "derf.cpp" 1
        #begin
# 0 "" 2
#NO_APP
        xorps   %xmm2, %xmm2
        movss   %xmm1, %xmm2
        pshufd  $0, %xmm2, %xmm1
        xorps   %xmm2, %xmm2
        mulps   (%rsi), %xmm1
        movss   %xmm0, %xmm2
        pshufd  $0, %xmm2, %xmm0
        mulps   (%rdi), %xmm0
        addps   %xmm1, %xmm0
        movaps  %xmm0, (%rdx)
#APP
# 11 "derf.cpp" 1
        #end
# 0 "" 2
#NO_APP


while with eigen2, it gives this good assembly:


#APP
# 9 "derf.cpp" 1
        #begin
# 0 "" 2
#NO_APP
        shufps  $0, %xmm1, %xmm1
        shufps  $0, %xmm0, %xmm0
        mulps   (%rsi), %xmm1
        mulps   (%rdi), %xmm0
        addps   %xmm1, %xmm0
        movaps  %xmm0, (%rdx)
#APP
# 11 "derf.cpp" 1
        #end
# 0 "" 2
#NO_APP
Comment 1 Benoit Jacob 2011-02-28 05:31:33 UTC
The first bad revision is:

changeset:   1396:ab39cda02b30
user:        Gael Guennebaud <g.gael@free.fr>
date:        Fri Aug 07 11:09:34 2009 +0200
summary:     * implement a second level of micro blocking (faster for small
sizes)
Comment 2 Benoit Jacob 2011-02-28 05:35:50 UTC
And then the asm changes again at this revision, to give the current asm:

changeset:   1593:402e5f111006
parent:      1561:cf40d4554411
user:        Gael Guennebaud <g.gael@free.fr>
date:        Thu Sep 17 23:18:21 2009 +0200
summary:     fix #53: performance regression, hopefully I did not resurected
another
Comment 3 Benoit Jacob 2011-02-28 05:51:27 UTC
So, the regression is introduced by the 'smart' implementation of pset1(), that
is, when you replace _mm_set1_ps by _mm_set_ss + shufps.

The comment says that's to work around a GCC bug whereby it implements
_mm_set1_ps using multiple moves, which is slow. I can reproduce that with
-m32, but not in 64bit. However, still with -m32, the 'fix' does not seem to
help at all.
Comment 4 Benoit Jacob 2011-02-28 05:55:10 UTC
Created attachment 109 [details] [review]
revert to just using _mm_set1_p[sd]

This patch fixes the problem on x86-64 and does not change much on x86-32 (asm
remains bad on x86-32).
Comment 5 Benoit Jacob 2011-02-28 06:12:39 UTC
OK, pushed this patch. After all it's just removing a weird workaround that
creates problems and I can't reproduce any improvement from using this
workaround. Feel free to reintroduce it if you know what exactly it fixes, but
please make sure that it doesn't hurt perf in other cases (here it hurted on
linux/gcc4.4/x86-64).

There remains the problem that _mm_set1_ps compiles very poorly with gcc/i386
(i.e. -m32). I don't know what to do about it. Maybe inline asm again?
Comment 6 Gael Guennebaud 2014-03-20 10:16:09 UTC
pshufd is really faster than shufps. Using asm does the job without introducing
such a  regression:

https://bitbucket.org/eigen/eigen/commits/60ca549abed6/
Changeset:   60ca549abed6
User:        ggael
Date:        2014-03-20 10:14:26
Summary:     Makes gcc to generate a pshufd instruction for pset1

Note You need to log in before you can comment on or make changes to this bug.