Bug 203 - Vector4f result = a1*v1 + a2*v2 compiled like ass by eigen3; much better with eigen2
Vector4f result = a1*v1 + a2*v2 compiled like ass by eigen3; much better with...
Status: RESOLVED FIXED
Product: Eigen
Classification: Unclassified
Component: Core - vectorization
unspecified
x86 - 64-bit Linux
: --- Unknown
Assigned To: Gael Guennebaud
:
Depends on:
Blocks: 3.0
  Show dependency treegraph
 
Reported: 2011-02-28 05:17 UTC by Benoit Jacob
Modified: 2014-03-20 10:16 UTC (History)
2 users (show)



Attachments
revert to just using _mm_set1_p[sd] (2.06 KB, patch)
2011-02-28 05:55 UTC, Benoit Jacob
no flags Details | Diff

Description Benoit Jacob 2011-02-28 05:17:27 UTC
This test program:


#include <Eigen/Core>
using namespace Eigen;
void foo(float a1, const Vector4f& v1,
         float a2, const Vector4f& v2,
         Vector4f& result)
{
  asm volatile("#begin");
  result = a1*v1 + a2*v2;
  asm volatile("#end");
}


compiled like this with eigen3 and gcc 4.4.5 x86-64 linux:
$ g++ -c -S -O2 -I eigen derf.cpp -DNDEBUG -o derf.s


gives this crappy assembly:


#APP
# 9 "derf.cpp" 1
        #begin
# 0 "" 2
#NO_APP
        xorps   %xmm2, %xmm2
        movss   %xmm1, %xmm2
        pshufd  $0, %xmm2, %xmm1
        xorps   %xmm2, %xmm2
        mulps   (%rsi), %xmm1
        movss   %xmm0, %xmm2
        pshufd  $0, %xmm2, %xmm0
        mulps   (%rdi), %xmm0
        addps   %xmm1, %xmm0
        movaps  %xmm0, (%rdx)
#APP
# 11 "derf.cpp" 1
        #end
# 0 "" 2
#NO_APP


while with eigen2, it gives this good assembly:


#APP
# 9 "derf.cpp" 1
        #begin
# 0 "" 2
#NO_APP
        shufps  $0, %xmm1, %xmm1
        shufps  $0, %xmm0, %xmm0
        mulps   (%rsi), %xmm1
        mulps   (%rdi), %xmm0
        addps   %xmm1, %xmm0
        movaps  %xmm0, (%rdx)
#APP
# 11 "derf.cpp" 1
        #end
# 0 "" 2
#NO_APP
Comment 1 Benoit Jacob 2011-02-28 05:31:33 UTC
The first bad revision is:

changeset:   1396:ab39cda02b30
user:        Gael Guennebaud <g.gael@free.fr>
date:        Fri Aug 07 11:09:34 2009 +0200
summary:     * implement a second level of micro blocking (faster for small sizes)
Comment 2 Benoit Jacob 2011-02-28 05:35:50 UTC
And then the asm changes again at this revision, to give the current asm:

changeset:   1593:402e5f111006
parent:      1561:cf40d4554411
user:        Gael Guennebaud <g.gael@free.fr>
date:        Thu Sep 17 23:18:21 2009 +0200
summary:     fix #53: performance regression, hopefully I did not resurected another
Comment 3 Benoit Jacob 2011-02-28 05:51:27 UTC
So, the regression is introduced by the 'smart' implementation of pset1(), that is, when you replace _mm_set1_ps by _mm_set_ss + shufps.

The comment says that's to work around a GCC bug whereby it implements _mm_set1_ps using multiple moves, which is slow. I can reproduce that with -m32, but not in 64bit. However, still with -m32, the 'fix' does not seem to help at all.
Comment 4 Benoit Jacob 2011-02-28 05:55:10 UTC
Created attachment 109 [details]
revert to just using _mm_set1_p[sd]

This patch fixes the problem on x86-64 and does not change much on x86-32 (asm remains bad on x86-32).
Comment 5 Benoit Jacob 2011-02-28 06:12:39 UTC
OK, pushed this patch. After all it's just removing a weird workaround that creates problems and I can't reproduce any improvement from using this workaround. Feel free to reintroduce it if you know what exactly it fixes, but please make sure that it doesn't hurt perf in other cases (here it hurted on linux/gcc4.4/x86-64).

There remains the problem that _mm_set1_ps compiles very poorly with gcc/i386 (i.e. -m32). I don't know what to do about it. Maybe inline asm again?
Comment 6 Gael Guennebaud 2014-03-20 10:16:09 UTC
pshufd is really faster than shufps. Using asm does the job without introducing such a  regression:

https://bitbucket.org/eigen/eigen/commits/60ca549abed6/
Changeset:   60ca549abed6
User:        ggael
Date:        2014-03-20 10:14:26
Summary:     Makes gcc to generate a pshufd instruction for pset1

Note You need to log in before you can comment on or make changes to this bug.