The quatmul.cpp test case exhibits poor performance on Power8 processors when compiled with either GCC 7.3 or GCC 8.1. Our investigation shows that this is due to a flush in the load-store unit that occurs when four 32-bit floating-point values are stored as scalars, and then immediately reloaded using a VSX load instruction.
Compilation flags: -O3 -mcpu=power8
Eigen version: 3.3.4
Test case: quatmul.cpp
Functions where flush occurs: quatmul_default and quatmul_novec
The code produced by GCC for quatmul_default ends with:
11.53 : 10000ef8: stfs f9,-16(r1)
0.08 : 10000efc: stfs f10,-4(r1)
49.09 : 10000f00: stfs f12,-12(r1)
1.73 : 10000f04: stfs f0,-8(r1)
23.31 : 10000f08: lxvd2x vs0,0,r9
7.33 : 10000f0c: stxvd2x vs0,0,r5
0.82 : 10000f10: blr
where r9 holds the value r1-16, i.e. the address of the first scalar store. In both cases we have generic C++ code (no Power-specific intrinsics) performing the quaternion multiplications, followed by Power-intrinsic code (vec_vsx_ld, vec_vsx_st) that moves the results to a location specified by an input parameter. (A great deal of inlining causes the intrinsic and non-intrinsic code to wind up in the same function.)
Because part of the code has been optimized with Power VSX intrinsics, but the quaternion multiplication is still done in scalar C++ code, this "impedance mismatch" leads to the float stores followed immediately by a vector reload. Without the load/store intrinsics, the compiler would likely store the floats directly into the return value location. However, the "pload<Packet4f>" and "pstore<Packet4f>" routines that generate lxvd2x/stxvd2x are used throughout Eigen, so a better fix is probably to write code with VSX intrinsics to perform the quaternion multiplications.
A temporary fix that removes the flushes and reduces the cost of this code sequence is to add inline assembly that emits two "ori 2,2,0" instructions immediately after the quaternion multiplications.
These ori instructions are group-ending nops on Power8 hardware, and two of them are sufficient to prevent a flush (there will still be some stall delay). These aren't needed on Power9 hardware, but will do no harm there. Of course, this is only recommended as a temporary workaround pending a better solution. Adding the two nops was measured to improve runtime performance of this benchmark by 45%.
It's unclear if the pload / pstore functions should be converted to be less SIMD-dependent or if the other algorithms need to be SIMDized. Does Eigen fundamentally require all routines to be hand-tuned to avoid the performance problems of transitions between SIMD and non-SIMD code?
The best would be to make the current implementation of quat_product suitable for all SIMD engines. Its current implementation is:
const Packet4f mask = _mm_setr_ps(0.f,0.f,0.f,-0.f);
Packet4f a = _a.coeffs().template packet<AAlignment>(0);
Packet4f b = _b.coeffs().template packet<BAlignment>(0);
Packet4f s1 = pmul(vec4f_swizzle1(a,1,2,0,2),vec4f_swizzle1(b,2,0,1,2));
Packet4f s2 = pmul(vec4f_swizzle1(a,3,3,3,1),vec4f_swizzle1(b,0,1,2,1));
pstoret<float,Packet4f,ResAlignment>(&res.x(),
  padd( psub( pmul(a,vec4f_swizzle1(b,3,3,3,3)),
              pmul(vec4f_swizzle1(a,2,0,1,0),vec4f_swizzle1(b,1,2,0,0))),
        pxor(mask,padd(s1,s2))));
Generalizing _mm_setr_ps should be straightforward, so all we need is to "standardize" the swizzling: vec4f_swizzle1, whose SSE version is:
#define vec4f_swizzle1(v,p,q,r,s) \
(_mm_castsi128_ps(_mm_shuffle_epi32( _mm_castps_si128(v), ((s)<<6|(r)<<4|(q)<<2|(p)))))
Is it easy to make an equivalent for VSX? Ideally we would like to have equivalents for NEON and Altivec too.
It would be easy to generalize with Meta-Packets (Bug 692) -- this could also unify the double and float paths for both SSE and AVX (as well as other architectures). I don't think I'll have much time to continue my first attempts in that direction any time soon, especially since I have no way to test non-x86 architectures.
You can request free access to an IBM Power8 VM for Open Source development to test the solution on non-x86 architectures.
List me as the sponsor.