This bugzilla service is closed. All entries have been migrated to GitLab.
Bug 1559 - Quaternion performance degradation on Power VSX
Summary: Quaternion performance degradation on Power VSX
Status: NEW
Alias: None
Product: Eigen
Classification: Unclassified
Component: Core - vectorization
Version: 3.3 (current stable)
Hardware: PPC - AltiVec
OS: All
Importance: Normal Performance Problem
Assignee: Nobody
Depends on:
Reported: 2018-06-21 14:58 UTC by David Edelsohn
Modified: 2019-12-04 17:43 UTC
CC: 7 users


Description David Edelsohn 2018-06-21 14:58:38 UTC
The quatmul.cpp test case exhibits poor performance on Power8 processors when compiled with either GCC 7.3 or GCC 8.1.  Our investigation shows that this is due to a flush in the load-store unit that occurs when four 32-bit floating-point values are stored as scalars, and then immediately reloaded using a VSX load instruction.

Compilation flags:  -O3 -mcpu=power8
Eigen version:  3.3.4
Test case:  quatmul.cpp
Functions where flush occurs:  quatmul_default and quatmul_novec

The code produced by GCC for quatmul_default ends with:

11.53 :        10000ef8:   stfs    f9,-16(r1)
 0.08 :        10000efc:   stfs    f10,-4(r1)
49.09 :        10000f00:   stfs    f12,-12(r1)
 1.73 :        10000f04:   stfs    f0,-8(r1)
23.31 :        10000f08:   lxvd2x  vs0,0,r9
 7.33 :        10000f0c:   stxvd2x vs0,0,r5
 0.82 :        10000f10:   blr

where r9 has the value of (r1)-16.  In both cases we have generic C++ code (no Power-specific intrinsics) to perform multiply operations on quaternions, followed by some Power-intrinsic code (vec_vsx_ld, vec_vsx_st) to move those values to a location specified by an input parameter.  (There's a great deal of inlining that takes place to cause the intrinsic and non-intrinsic code to wind up in the same function.)

Because part of the code has been optimized with Power VSX intrinsics, but the quaternion multiplication is still done in scalar C++ code, this "impedance mismatch" leads to the float stores followed immediately by a vector reload.  Without the load/store intrinsics, the compiler would likely store the floats directly into the return value location.  However, the "pload<Packet4f>" and "pstore<Packet4f>" routines that generate lxvd2x/stxvd2x are used throughout Eigen, so a better fix is probably to write code with VSX intrinsics to perform the quaternion multiplications.

A temporary fix that would remove the flushes and reduce the cost of this code sequence would be to add inline assembly code to generate two "ori 2,2,0" instructions following the quaternion multiplications.  This would produce:

  stfs    f9,-16(r1)
  stfs    f10,-4(r1)
  stfs    f12,-12(r1)
  stfs    f0,-8(r1) 
  ori 2,2,0  
  ori 2,2,0
  lxvd2x  vs0,0,r9
  stxvd2x vs0,0,r5

These ori instructions are group-ending nops on Power8 hardware, and two of them are sufficient to prevent a flush (there will still be some stall delay).  These aren't needed on Power9 hardware, but will do no harm there.  Of course, this is only recommended as a temporary workaround pending a better solution.  Adding the two nops was measured to improve runtime performance of this benchmark by 45%.

It's unclear if the pload / pstore functions should be converted to be less SIMD-dependent or if the other algorithms need to be SIMDized. Does Eigen fundamentally require all routines to be hand-tuned to avoid the performance problems of transitions between SIMD and non-SIMD code?
Comment 1 Gael Guennebaud 2018-06-22 12:34:34 UTC
The best would be to make the current implementation of quat_product suitable for all SIMD engines. Its current implementation is:

    Quaternion<float> res;
    const Packet4f mask = _mm_setr_ps(0.f,0.f,0.f,-0.f);
    Packet4f a = _a.coeffs().template packet<AAlignment>(0);
    Packet4f b = _b.coeffs().template packet<BAlignment>(0);
    Packet4f s1 = pmul(vec4f_swizzle1(a,1,2,0,2),vec4f_swizzle1(b,2,0,1,2));
    Packet4f s2 = pmul(vec4f_swizzle1(a,3,3,3,1),vec4f_swizzle1(b,0,1,2,1));
    pstoret<float,Packet4f,ResAlignment>(&res.x(),
              padd( psub( pmul(a,vec4f_swizzle1(b,3,3,3,3)),
                          pmul( vec4f_swizzle1(a,2,0,1,0),
                                vec4f_swizzle1(b,1,2,0,0))),
                    pxor(mask, padd(s1,s2))));
    return res;
Generalizing _mm_setr_ps should be straightforward, so all we need is to "standardize" the swizzling done by vec4f_swizzle1, whose SSE version is:

#define vec4f_swizzle1(v,p,q,r,s) \
  (_mm_castsi128_ps(_mm_shuffle_epi32( _mm_castps_si128(v), ((s)<<6|(r)<<4|(q)<<2|(p)))))

Is it easy to make an equivalent for VSX? Ideally we would like to have equivalents for NEON and AltiVec too.
Comment 2 Christoph Hertzberg 2018-06-22 12:47:56 UTC
It would be easy to generalize with Meta-Packets (Bug 692) -- this could also unify the double and float paths for both SSE and AVX (as well as other architectures). I don't think I'll have much time to continue my first attempts in that direction any time soon; in particular, I don't have any way to test non-x86 architectures.
Comment 3 David Edelsohn 2018-06-22 13:12:32 UTC
You can request free access to an IBM Power8 VM for Open Source development to test the solution on non-x86 architectures.

List me as the sponsor.
Comment 4 Nobody 2019-12-04 17:43:32 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to Eigen's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new bug via this link to our GitLab instance:
