With GCC, "Quaternion<double> * Vector3d" is about x2 slower with EIGEN_UNALIGNED_VECTORIZE=1 (the default) than without it (or without vectorization at all). The "problem" is that this operation mixes non-vectorized cross products with coefficient-wise adds and muls which are vectorizatized using 1 Packet2d for x,y and 1 scalar for z. Clang is pretty happy with it because it manages to vectorize the cross products too. There is no easy solution: 1 - I don't think vectorizing cross with Packet2d will help because this would generate too many data-movements. You really have to consider the whole quat*vec3 code to be able to do something meaningful. 2 - So far we have no easy mechanism to disable vectorization locally (we need bug 1159). To disable vectorization we could re-introduce the Flagged expression we had in 3.2 and before, at least for internal purposes. Another trick is to let Eigen thinks that some Vec3 has a runtime inner-stride: #if EIGEN_COMP_GNUC && (!defined(EIGEN_VECTORIZE_AVX)) && EIGEN_UNALIGNED_VECTORIZE Vector3 uv0 = this->vec().cross(v); Ref<Vector3,0,InnerStride<-1> > uv = uv0; #else Vector3 uv = this->vec().cross(v); #endif uv += uv; return v + this->w() * uv + this->vec().cross(uv); This does the job, but that's not really satisfactory. So the best would probably to vectorize by hand the whole expressions. This way we would have the same speed with clang and GCC and others.
To give some numbers:

~~~~~~
              gcc    clang
float         5.7    5.8
double        19.9   4.9
float+novec   5.6    5.8
double+novec  5.6    5.2
~~~~~~

With the "Ref<Vector3,0,InnerStride<-1> >" trick:

~~~~~~
float:  5.7
double: 7.6
~~~~~~
Instead of manually vectorizing this, how about just implicitly loading `v` into a `Quaternion` and doing:

  res = (q*[v;0]*q.conjugate()).vec();

Loading [v;0] into two Packet2d is just one unaligned load and one `movsd` instruction. This would also partially solve the problem of non-unit quaternions (Bug 560, https://bitbucket.org/eigen/eigen/pull-requests/600/).

Of course, hand-vectorized code will likely be faster, since we would explicitly avoid trivial (x*0) products, optimize away the `.conjugate()` and some swizzles, and would not need to care about the correctness of the real part of the result.
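A minimal sketch of the load mentioned above (my own illustration; `load_v0` is a hypothetical helper name, assuming SSE2 intrinsics and `v` being a Vector3d):

~~~~~~
#include <emmintrin.h>
#include <Eigen/Core>

// load [v; 0] as two __m128d halves: (x, y) and (z, 0)
inline void load_v0(const Eigen::Vector3d& v, __m128d& v_xy, __m128d& v_z0)
{
  v_xy = _mm_loadu_pd(v.data());     // one unaligned load: (x, y)
  v_z0 = _mm_load_sd(v.data() + 2);  // one movsd: (z, 0) -- upper lane zeroed
}
~~~~~~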
Manually doing this does not look very promising (at least not for double):

~~~~~~
Quaterniond qv;
qv.vec() = v;
qv.w() = 0;
Quaterniond qr = q*qv*q.conjugate();
return qr.vec();
~~~~~~

The float variant actually generates slightly shorter code: https://godbolt.org/z/I14oNi

I did not benchmark this, but it looks like it could easily be hand-tuned to be much more efficient, e.g., by re-ordering the end result to be [x, *, y, z] or [z, *, x, y] so that the result can be stored using a single movss + movhps (same goes for loading the initial `v` into `qv.vec()`).

Also (in both cases) all products involving 0 could be avoided, as well as products that are only required for the final w (this likely requires re-ordering of the shuffles).
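A minimal sketch of the movss + movhps store mentioned above (my own illustration; `store_xyz` is a hypothetical helper name, assuming a result register `r` laid out as [x, *, y, z]):

~~~~~~
#include <xmmintrin.h>

// store a result register laid out as [x, don't-care, y, z] into a float[3]
inline void store_xyz(float* out, __m128 r)
{
  _mm_store_ss(out, r);                                  // movss:  out[0] = x
  _mm_storeh_pi(reinterpret_cast<__m64*>(out + 1), r);   // movhps: out[1] = y, out[2] = z
}
~~~~~~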
For float, I have the following vectorized version of the current code which is much faster than everything else:

~~~~~~
EIGEN_DONT_INLINE
Vector3f quat_mul_vec3(const Quaternionf& q, const Vector3f& v)
{
  using namespace Eigen::internal;
  Packet4f a  = pload<Packet4f>(q.coeffs().data());
  Packet4f b  = _mm_setr_ps(v.x(), v.y(), v.z(), 0.0f);
  Packet4f a1 = vec4f_swizzle1(a,1,2,0,3);
  Packet4f a2 = vec4f_swizzle1(a,2,0,1,3);
  Packet4f uv = psub(pmul(a1, vec4f_swizzle1(b,2,0,1,3)),
                     pmul(a2, vec4f_swizzle1(b,1,2,0,3)));
  uv += uv;
  Packet4f t  = psub(pmul(a1, vec4f_swizzle1(uv,2,0,1,3)),
                     pmul(a2, vec4f_swizzle1(uv,1,2,0,3)));
  Vector4f res;
  pstore(&res.x(), b + pmul(pset1<Packet4f>(q.w()), uv) + t);
  return res.head<3>();
}
~~~~~~

gcc:   3.2
clang: 3.8

For comparison purposes, here are some other variants that apply scaling as well:

~~~~~~
       gcc             clang
       float/double    float/double
[1]    6.4 / 15        6.7 / 5.3
[2]    14  / 14        5.3 / 6.1
[3]    5.8 / 9.4       6.2 / 5.6
~~~~~~

[1] = current version + scaling by squared-norm:

~~~~~~
Vec3 uv = q.vec().cross(v);
uv += uv;
return q.squaredNorm()*v + q.w() * uv + q.vec().cross(uv);
~~~~~~

[2] = q*(0,v)*q'

[3] = same but optimized by hand:

~~~~~~
Vec3 u = q.vec();
float s = q.w();
return 2 * u.dot(v) * u
     + (s*s - u.squaredNorm()) * v
     + 2 * s * u.cross(v);
~~~~~~
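For completeness, a quick correctness check one might run against the reference path (my own sketch, not from the original comment; `quat_mul_vec3` is the function above):

~~~~~~
#include <Eigen/Geometry>
#include <iostream>
// quat_mul_vec3 as defined above

int main()
{
  Eigen::Quaternionf q;
  q.coeffs().setRandom();
  q.normalize();
  Eigen::Vector3f v = Eigen::Vector3f::Random();
  // difference between the hand-vectorized kernel and Eigen's q * v
  std::cout << (quat_mul_vec3(q, v) - q * v).norm() << "\n";
  return 0;
}
~~~~~~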
I applied the Ref+InnerStride trick to my fork, that's better than nothing: https://bitbucket.org/ggael/eigen-quat2/commits/c1b1777085938e452542fa2aa2538020c768905b?at=default

It's 3x faster, moving from 15 to 5 (almost the same as clang).
Hand-optimized code for double (operators need to be replaced by pmul, pmadd, pmsub, etc.) -- to be inserted into src/Geometry/arch/Geometry_SSE.h:

~~~~~~
template<class Derived>
struct quat_transform_vector<Architecture::SSE, Derived, double>
{
  enum {
    ResAlignment = traits<Quaternion<double> >::Alignment
  };
  typedef typename QuaternionBase<Derived>::Vector3 Vector3;
  static EIGEN_STRONG_INLINE Vector3 run(const QuaternionBase<Derived>& q, const Vector3& v)
  {
    evaluator<typename Derived::Coefficients> qe(q.coeffs());
    Packet2d q_xy = qe.template packet<traits<Derived>::Alignment,Packet2d>(0);
    Packet2d q_zw = qe.template packet<traits<Derived>::Alignment,Packet2d>(2);
    Packet2d v_xy = ploadu<Packet2d>(v.data());
    Packet2d v_z  = _mm_set_sd(v.z());
    // Packet2d v_yz = _mm_loadu_pd(v.data()+1);
    Packet2d v_yz = _mm_shuffle_pd(v_xy, v_z, 1);
    Packet2d v_zx = _mm_shuffle_pd(v_z, v_xy, 0);
    Packet2d q_yz = _mm_shuffle_pd(q_xy, q_zw, 1);
    Packet2d q_zx = _mm_shuffle_pd(q_zw, q_xy, 0);
    Packet2d q_ww = _mm_shuffle_pd(q_zw, q_zw, 3);

    Packet2d t_zx = q_xy*v_yz - q_yz*v_xy;
    Packet2d t_yz = q_zx*v_xy - q_xy*v_zx;
    t_zx += t_zx;
    t_yz += t_yz;
    Packet2d t_xy = _mm_shuffle_pd(t_zx, t_yz, 1);

    Packet2d r_xy = v_xy + q_yz*t_zx - q_zx*t_yz + q_ww*t_xy;
    Packet2d r_z_ = v_zx + q_ww*t_zx + q_xy*t_yz - q_yz*t_xy;

    Vector3 res;
    _mm_storeu_pd(res.data(), r_xy);
    _mm_store_sd(res.data()+2, r_z_);
    return res;
  }
};
~~~~~~

IACA (Intel Architecture Code Analyzer) analysis gives this a throughput of 7.84 cycles on Haswell when compiled with g++-5 or later (g++-4.x does not compile, since it does not overload operators for packets).

$ g++-5 -O2 -DNDEBUG -I. -c -march=native quat-test.cpp -S

Benchmarking with /bench/geometry.cpp actually gives slightly better results with an additional `-DEIGEN_DONT_VECTORIZE` (which according to IACA has a throughput of 12 cycles, though). Clang with `-DEIGEN_DONT_VECTORIZE` is also slightly worse -- both in benchmarking and in IACA.
With my benchmark (float / double):

~~~~~~
       gcc        clang
[1]    5.1/5.8    5.1/4.5
[2]    3.3/4.3    3.8/3.9
~~~~~~

[1] = current eigen-quat2 fork (i.e., the 30-flop version with the InnerStride trick for gcc)
[2] = mine and Christoph's manual vectorizations for float and double respectively.

and with EIGEN_UNALIGNED_VECTORIZE=0:

~~~~~~
       gcc        clang
[1]    5.1/5.1    5.1/4.5
[2]    3.3/4.3    3.8/3.9
~~~~~~
The 15%/10% difference between gcc and clang seems a bit much to me, but I guess we can live with that for now.

What do you use for benchmarking? The results from ./bench/geometry.cpp partially looked spurious to me, e.g., I think I had some compilation options where the `ToRotationMatrixWrapper` path was faster, even for a single vector. Maybe the compiler managed to pull the conversion to a rotation matrix out of the loop (I need to have a look at that again, maybe I just misread a number).

Also, it seems there is no `pmsub` yet. I would add that together with `pnmadd` (matching the AVX-FMA naming), with fallbacks `pmsub(a,b,c) = psub(pmul(a,b),c)` and `pnmadd(a,b,c) = psub(c,pmul(a,b))`. I can probably do that tomorrow.
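A minimal sketch of those generic fallbacks, assuming they would sit next to the existing `pmadd` default (the exact location and signatures are my guess; FMA-enabled packet types would specialize them):

~~~~~~
// Generic fallbacks, mirroring the existing pmadd default.
template<typename Packet>
EIGEN_DEVICE_FUNC inline Packet pmsub(const Packet& a, const Packet& b, const Packet& c)
{ return psub(pmul(a, b), c); }   // a*b - c

template<typename Packet>
EIGEN_DEVICE_FUNC inline Packet pnmadd(const Packet& a, const Packet& b, const Packet& c)
{ return psub(c, pmul(a, b)); }   // c - a*b
~~~~~~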
My benchmark isolates a single q*v within a non-inline function returning by value and calls it thousands of times on the same pair of q and v.
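Something along these lines, I assume (my own reconstruction of that setup; the function name, timing loop, and iteration count are placeholders):

~~~~~~
#include <Eigen/Geometry>
#include <iostream>

// non-inline, returns by value, so the q*v kernel itself is what gets measured
EIGEN_DONT_INLINE Eigen::Vector3d transform(const Eigen::Quaterniond& q, const Eigen::Vector3d& v)
{
  return q * v;
}

int main()
{
  Eigen::Quaterniond q;
  q.coeffs().setRandom();
  q.normalize();
  Eigen::Vector3d v = Eigen::Vector3d::Random();
  Eigen::Vector3d acc = Eigen::Vector3d::Zero();
  for (int i = 0; i < 100000; ++i)   // same pair of q and v every iteration
    acc += transform(q, v);
  std::cout << acc.norm() << "\n";   // keep the result alive
  return 0;
}
~~~~~~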
*** Bug 1779 has been marked as a duplicate of this bug. ***
-- GitLab Migration Automatic Message -- This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/1688.