This bugzilla service is closed. All entries have been migrated to https://gitlab.com/libeigen/eigen
Bug 1688 - Perf issue with quat*vec3 with double and GCC
Summary: Perf issue with quat*vec3 with double and GCC
Status: NEW
Alias: None
Product: Eigen
Classification: Unclassified
Component: Geometry
Version: 3.4 (development)
Hardware: All
OS: All
Importance: Normal Performance Problem
Assignee: Nobody
URL:
Whiteboard:
Keywords:
Duplicates: 1779 (view as bug list)
Depends on:
Blocks:
 
Reported: 2019-03-04 20:47 UTC by Gael Guennebaud
Modified: 2019-12-04 18:32 UTC
CC List: 5 users



Description Gael Guennebaud 2019-03-04 20:47:58 UTC
With GCC, "Quaternion<double> * Vector3d" is about 2x slower with EIGEN_UNALIGNED_VECTORIZE=1 (the default) than without it (or without vectorization at all). The "problem" is that this operation mixes non-vectorized cross products with coefficient-wise adds and muls, which are vectorized using 1 Packet2d for x,y and 1 scalar for z.

Clang is pretty happy with it because it manages to vectorize the cross products too.

There is no easy solution:

1 - I don't think vectorizing cross with Packet2d will help, because this would generate too many data movements. You really have to consider the whole quat*vec3 code to be able to do something meaningful.

2 - So far we have no easy mechanism to disable vectorization locally (we need bug 1159). 

To disable vectorization we could re-introduce the Flagged expression we had in 3.2 and before, at least for internal purposes.

Another trick is to let Eigen think that some Vec3 has a runtime inner stride:

    #if EIGEN_COMP_GNUC && (!defined(EIGEN_VECTORIZE_AVX)) && EIGEN_UNALIGNED_VECTORIZE
    Vector3 uv0 = this->vec().cross(v);
    Ref<Vector3,0,InnerStride<-1> > uv = uv0;
    #else
    Vector3 uv = this->vec().cross(v);
    #endif
    uv += uv;
    return v + this->w() * uv + this->vec().cross(uv);

This does the job (the runtime inner stride keeps Eigen from taking the vectorized path), but that's not really satisfactory.


So the best option would probably be to vectorize the whole expression by hand. This way we would get the same speed with clang, GCC, and others.
Comment 1 Gael Guennebaud 2019-03-04 20:58:47 UTC
To give some numbers:

               gcc  clang
float          5.7   5.8
double        19.9   4.9
float+novec    5.6   5.8
double+novec   5.6   5.2

With "Ref<Vector3,0,InnerStride<-1> >" trick:

float:  5.7
double: 7.6
Comment 2 Christoph Hertzberg 2019-03-05 10:54:53 UTC
Instead of manually vectorizing this, how about just implicitly loading `v` into a `Quaternion`, and doing:
  res = (q*[v;0]*q.conjugate()).vec();
Loading [v;0] into two Packet2d is just one unaligned load and one `movsd` instruction.
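
In intrinsics, that load boils down to something like this (a sketch, assuming SSE2 and a Vector3d `v`):

  __m128d v_xy = _mm_loadu_pd(v.data());     // movupd: unaligned load of x,y
  __m128d v_z0 = _mm_load_sd(v.data() + 2);  // movsd: loads z into the low lane, zeroes the high lane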

This would also partially solve the problem of non-unit quaternions (Bug 560, https://bitbucket.org/eigen/eigen/pull-requests/600/)

Of course, hand-vectorized code will likely be faster, since we could explicitly avoid trivial (x*0) products, optimize away the `.conjugate()` and some swizzles, and would not need to care about the correctness of the real part of the result.
Comment 3 Christoph Hertzberg 2019-03-05 12:54:40 UTC
Manually doing this does not look very promising (at least not for double):
   Quaterniond qv; qv.vec()=v; qv.w()=0;
   Quaterniond qr = q*qv*q.conjugate();
   return qr.vec();

The float variant actually generates slightly shorter code:
  https://godbolt.org/z/I14oNi
I did not benchmark this, but it looks like it could easily be hand-tuned to be much more efficient, e.g., by re-ordering the end result to be
  [x, *, y, z] or [z, *, x, y]
so that the result can be stored using a single movss + movhps (the same goes for loading the initial `v` into `qv.vec()`).
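
In intrinsics, that store could look like this (a hypothetical sketch, with register `r` holding [x, *, y, z] and `out` a float pointer):

  _mm_store_ss(out, r);                                 // movss: writes x
  _mm_storeh_pi(reinterpret_cast<__m64*>(out + 1), r);  // movhps: writes y,z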

Also (in both cases) all products involving 0 could be avoided, as well as products that are only required for the final w (this likely requires re-ordering of the shuffles).
Comment 4 Gael Guennebaud 2019-03-05 22:35:09 UTC
For float, I have the following vectorized version of the current code which is much faster than everything else:


EIGEN_DONT_INLINE
Vector3f quat_mul_vec3(const Quaternionf& q, const Vector3f& v)
{
  using namespace Eigen::internal;
  Packet4f a = pload<Packet4f>(q.coeffs().data());
  Packet4f b = _mm_setr_ps(v.x(),v.y(),v.z(),0.0f);
  Packet4f a1 = vec4f_swizzle1(a,1,2,0,3);
  Packet4f a2 = vec4f_swizzle1(a,2,0,1,3);
  Packet4f uv = psub(pmul(a1,vec4f_swizzle1(b,2,0,1,3)),
                     pmul(a2,vec4f_swizzle1(b,1,2,0,3)));
  uv = padd(uv,uv);
  Packet4f t  = psub(pmul(a1,vec4f_swizzle1(uv,2,0,1,3)),
                     pmul(a2,vec4f_swizzle1(uv,1,2,0,3)));

  Vector4f res;
  pstore(&res.x(), padd(padd(b, pmul(pset1<Packet4f>(q.w()), uv)), t));
  return res.head<3>();
}

gcc:   3.2
clang: 3.8


For comparison purposes, here are some other variants that apply scaling as well:

         gcc          clang
     float/double  float/double
[1]   6.4 / 15      6.7 / 5.3
[2]  14   / 14      5.3 / 6.1
[3]   5.8 /  9.4    6.2 / 5.6

[1] = current version + scaling by squared-norm:
  Vec3 uv = q.vec().cross(v);
  uv += uv;
  return q.squaredNorm()*v + q.w() * uv + q.vec().cross(uv);

[2] = q*(0,v)*q'

[3] = same but optimized by hand:
  Vec3 u = q.vec();
  float s = q.w();
  return 2 * u.dot(v) * u
      + (s*s - u.squaredNorm()) * v
      + 2 * s * u.cross(v);
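
(For reference, [3] is just the expanded sandwich product q*(0,v)*q' for q = (s, u), using the standard identity

  q (0,v) q' = (s*s - u.u) v + 2 (u.v) u + 2 s (u x v)

which matches the code above and reduces to a plain rotation when q is unit.)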
Comment 5 Gael Guennebaud 2019-03-05 22:39:15 UTC
I applied the Ref+InnerStride trick in my fork; that's better than nothing:
https://bitbucket.org/ggael/eigen-quat2/commits/c1b1777085938e452542fa2aa2538020c768905b?at=default

It's 3x faster, moving from 15 to 5 (almost the same as clang).
Comment 6 Christoph Hertzberg 2019-03-08 03:25:12 UTC
Hand-optimized code for double (operators need to be replaced by pmul, pmadd, pmsub, etc) -- to be inserted into src/Geometry/arch/Geometry_SSE.h:

~~~~~~
template<class Derived>
struct quat_transform_vector<Architecture::SSE, Derived, double>
{
  enum {
    ResAlignment = traits<Quaternion<double> >::Alignment
  };
  typedef typename QuaternionBase<Derived>::Vector3 Vector3;
  static EIGEN_STRONG_INLINE Vector3 run(const QuaternionBase<Derived>& q, const Vector3& v)
  {
    evaluator<typename Derived::Coefficients> qe(q.coeffs());
    Packet2d q_xy = qe.template packet<traits<Derived>::Alignment,Packet2d>(0);
    Packet2d q_zw = qe.template packet<traits<Derived>::Alignment,Packet2d>(2);

    Packet2d v_xy = ploadu<Packet2d>(v.data());
    Packet2d v_z = _mm_set_sd(v.z());
    // Packet2d v_yz = _mm_loadu_pd(v.data()+1);
    Packet2d v_yz = _mm_shuffle_pd(v_xy, v_z,1);
    Packet2d v_zx = _mm_shuffle_pd(v_z,v_xy,0);

    Packet2d q_yz = _mm_shuffle_pd(q_xy,q_zw,1);
    Packet2d q_zx = _mm_shuffle_pd(q_zw,q_xy,0);
    Packet2d q_ww = _mm_shuffle_pd(q_zw,q_zw,3);

    Packet2d t_zx = q_xy*v_yz - q_yz*v_xy;
    Packet2d t_yz = q_zx*v_xy - q_xy*v_zx;

    t_zx += t_zx;
    t_yz += t_yz;

    Packet2d t_xy = _mm_shuffle_pd(t_zx,t_yz,1);

    Packet2d r_xy = v_xy + q_yz*t_zx - q_zx*t_yz + q_ww*t_xy;
    Packet2d r_z_ = v_zx + q_ww*t_zx + q_xy*t_yz - q_yz*t_xy;

    Vector3 res;
    _mm_storeu_pd(res.data(), r_xy);
    _mm_store_sd(res.data()+2, r_z_);
    return res;
  }
};
~~~~~~

IACA (Intel Architecture Code Analyzer) analysis gives this a throughput of 7.84 cycles on Haswell when compiled with g++-5 or later (g++-4.x does not compile, since it does not overload operators for packets).
 $ g++-5 -O2 -DNDEBUG -I. -c -march=native quat-test.cpp -S

Benchmarking with /bench/geometry.cpp actually gives slightly better results with an additional `-DEIGEN_DONT_VECTORIZE` (which according to IACA has a throughput of 12 cycles, though).
With clang, `-DEIGEN_DONT_VECTORIZE` is also slightly worse -- both in benchmarking and in IACA.
Comment 7 Gael Guennebaud 2019-03-08 13:46:33 UTC
With my benchmark (float / double):

     gcc       clang
[1]  5.1/5.8   5.1/4.5
[2]  3.3/4.3   3.8/3.9

[1] = current eigen-quat2 fork (i.e., the 30 flop version with the InnerStride trick for gcc)
[2] = my and Christoph's manual vectorizations for float and double, respectively.

and with EIGEN_UNALIGNED_VECTORIZE=0

     gcc       clang
[1]  5.1/5.1   5.1/4.5
[2]  3.3/4.3   3.8/3.9
Comment 8 Christoph Hertzberg 2019-03-08 15:45:02 UTC
The 15%/10% difference between gcc and clang seems a bit much to me, but I guess we can live with that for now.

What do you use for benchmarking? 
The results from ./bench/geometry.cpp looked partially spurious to me, e.g., I think I had some compilation options where the `ToRotationMatrixWrapper` path was faster, even for a single vector. Maybe the compiler managed to pull the conversion to a rotation matrix out of the loop (I need to have a look at that again; maybe I just misread a number).

Also, it seems there is no `pmsub` yet. I would add that together with `pnmadd` (matching the AVX-FMA naming), with fallbacks `pmsub(a,b,c)=psub(pmul(a,b),c)` and `pnmadd(a,b,c)=psub(c,pmul(a,b))`.
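
A minimal sketch of those fallbacks (the names follow the proposal above; none of this is in Eigen yet):

~~~~~~
// Generic fallbacks; architectures with FMA can specialize these.
template<typename Packet>
EIGEN_STRONG_INLINE Packet pmsub(const Packet& a, const Packet& b, const Packet& c)
{ return psub(pmul(a,b), c); }   // a*b - c

template<typename Packet>
EIGEN_STRONG_INLINE Packet pnmadd(const Packet& a, const Packet& b, const Packet& c)
{ return psub(c, pmul(a,b)); }   // c - a*b
~~~~~~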

I can do that probably tomorrow.
Comment 9 Gael Guennebaud 2019-03-09 20:38:48 UTC
My benchmark isolates a single q*v within a non-inlined function returning by value, and calls it thousands of times on the same pair of q and v.
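
Something along these lines (a hypothetical sketch of the harness, not the actual code):

#include <Eigen/Geometry>
#include <chrono>
#include <iostream>

EIGEN_DONT_INLINE
Eigen::Vector3d transform(const Eigen::Quaterniond& q, const Eigen::Vector3d& v)
{
  return q * v; // returned by value, never inlined
}

int main()
{
  Eigen::Quaterniond q = Eigen::Quaterniond::UnitRandom();
  Eigen::Vector3d v = Eigen::Vector3d::Random();
  Eigen::Vector3d acc = Eigen::Vector3d::Zero();
  auto t0 = std::chrono::high_resolution_clock::now();
  for (long i = 0; i < 10000000; ++i)
    acc += transform(q, v); // same q and v every iteration
  auto t1 = std::chrono::high_resolution_clock::now();
  // print acc so the loop cannot be optimized away
  std::cout << acc.transpose() << "\n"
            << std::chrono::duration<double>(t1 - t0).count() << " s\n";
}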
Comment 10 Christoph Hertzberg 2019-11-15 09:35:29 UTC
*** Bug 1779 has been marked as a duplicate of this bug. ***
Comment 11 Nobody 2019-12-04 18:32:32 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new bug through this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/1688.
