This is a follow-up from bug 1624.

*** Summary so far: ***

1. We know what a fast float GEMM kernel looks like on ARM NEON: it should take advantage of multiply-accumulate-against-single-element instructions, like this:
https://github.com/google/gemmlowp/blob/3fb5c176c17c765a3492cd2f0321b0dab712f350/standalone/neon-gemm-kernel-benchmark.cc#L4670-L4716

2. The patches in bug 1624 implemented that basic idea of taking advantage of multiply-by-element instructions. However, they didn't take advantage of the ability to multiply by an *arbitrary* element; they only used multiplication by the 0-th element of a vector, repeatedly loading new data into that 0-th element. That was eventually submitted as efda481cbd7a.

3. The patches in bug 1624 also had a crash issue: they loaded 8 bytes when only 4 may be present.

4. The crash issue was fixed in e01823ce7f6e, but that turned out to regress performance. Looking at the disassembly, the reason is that at least Clang compiles vmlaq_n_f32 as a ld1r instruction (load a scalar and duplicate it onto all lanes) followed by an ordinary multiply-add instruction, not taking advantage of multiply-by-element; and it seems that ld1r does not dual-issue well with multiply-add.

*** New patch ***

This new patch:
- Implements exactly the original idea of fast kernels on NEON, with a 128-bit load loading 4 RHS float values, each of them used in place by a multiply-add-by-element instruction.
- Offers higher performance overall than even the fast code from step 2 above (efda481cbd7a).
- Does not read data out of bounds, unlike the fast code from step 2 above (efda481cbd7a).
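To make the two code-generation patterns from points 2-4 concrete, here is a minimal standalone sketch (illustrative names only, not the actual Eigen kernel) contrasting the scalar-broadcast multiply-accumulate with the multiply-accumulate-by-element pattern the new patch relies on:

```cpp
#include <arm_neon.h>

// Slower pattern: multiply-accumulate against a scalar. Clang lowers this to a
// ld1r/dup (broadcast the scalar to all lanes) followed by an fmla, and the
// broadcast does not dual-issue well with the multiply-add.
float32x4_t madd_by_scalar(float32x4_t acc, float32x4_t lhs, const float* rhs) {
  return vmlaq_n_f32(acc, lhs, *rhs);
}

// Faster pattern: load 4 RHS values once with a single 128-bit load, then use
// each of them in place via multiply-accumulate-by-element
// (fmla vD.4s, vN.4s, vM.s[lane]).
void madd_by_element(float32x4_t acc[4], float32x4_t lhs, const float* rhs) {
  float32x4_t b = vld1q_f32(rhs);             // one 128-bit load of 4 RHS floats
  acc[0] = vfmaq_laneq_f32(acc[0], lhs, b, 0);
  acc[1] = vfmaq_laneq_f32(acc[1], lhs, b, 1);
  acc[2] = vfmaq_laneq_f32(acc[2], lhs, b, 2);
  acc[3] = vfmaq_laneq_f32(acc[3], lhs, b, 3);
}
```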
Created attachment 901 [details] benchmark
benchmark results here: https://docs.google.com/spreadsheets/d/1x6SWiDLYfYLPzvp1bIbAxkpkVX4bJST1RT3T0f_j9BU/edit?usp=sharing
Created attachment 902 [details]
faster kernel

The faster kernel. Certainly a bit of a departure from the current design, and I expect, Gael, that you'll have better ideas about how to integrate this optimally. Hopefully this patch is useful as a starting point and shows what performance one can shoot for.
Thanks Benoit for making the improvement and fix!
Maybe this idea could be generalized to also handle architectures with 32 registers (AltiVec/VSX, MIPS/MSA, and AVX512). With more registers, we could load 4 rhs elements and broadcast them afterwards, exactly as we do for the 2pX4 micro kernel.

Maybe we could unify the 3 variants by introducing a generic proxy to the four rhs elements, let's call it "rhs_panel". It needs to be parametrized by the number of registers already taken:

Traits::RhsPanel<15> rhs_panel;

traits.loadLhs(&blA[(0+3*K)*LhsProgress], A0); \
traits.loadLhs(&blA[(1+3*K)*LhsProgress], A1); \
traits.loadLhs(&blA[(2+3*K)*LhsProgress], A2); \
traits.loadRhs(blB + (0+4*K)*Traits::RhsProgress, rhs_panel); \
traits.madd(A0, rhs_panel(fix<0>), C0); \
traits.madd(A1, rhs_panel(fix<0>), C4); \
traits.madd(A2, rhs_panel(fix<0>), C8); \
traits.madd(A0, rhs_panel(fix<1>), C1); \
traits.madd(A1, rhs_panel(fix<1>), C5); \
traits.madd(A2, rhs_panel(fix<1>), C9); \
...

If fewer than 4 registers remain, RhsPanel would simply store a RhsScalar*, and operator() would issue a standard broadcast/splat from memory. This assumes the compiler is able to factor out the redundant broadcasts.

If at least 4 registers remain, RhsPanel would hold 4 packets filled by loadRhs (as in broadcastRhs).

With NEON, RhsPanel would hold only a single Packet, and operator() would return a proxy embedding this packet together with the compile-time lane number (see the sketch below).

If that's simpler, we could also unify operator() with madd:

traits.madd(A0, rhs_panel, C0, fix<0>);

With 32 registers we should also add a 4x4 or even a 5x4, or maybe a 3x8 micro-kernel, but that's another topic!
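For the NEON variant just described (a single Packet plus a lane-indexed proxy), here is a hedged, self-contained sketch; names such as QuadPacket, QuadPacketLane and lane_tag are hypothetical stand-ins for whatever Eigen would actually use (e.g. fix<N>):

```cpp
#include <arm_neon.h>

template <int N> struct lane_tag {};   // stand-in for Eigen's fix<N>

// Proxy returned by rhs_panel(fix<N>): the full packet plus a compile-time lane.
template <int Lane>
struct QuadPacketLane { float32x4_t packet; };

// The NEON "rhs_panel": holds a single 128-bit packet of 4 rhs coefficients.
struct QuadPacket {
  float32x4_t packet;
  template <int Lane>
  QuadPacketLane<Lane> operator()(lane_tag<Lane>) const { return {packet}; }
};

// madd overload taking the proxy: maps directly to fmla-by-element.
template <int Lane>
inline void madd(float32x4_t a, QuadPacketLane<Lane> b, float32x4_t& c) {
  c = vfmaq_laneq_f32(c, a, b.packet, Lane);
}
```

With such a proxy, a call like madd(A0, rhs_panel(lane_tag<2>()), C0) can still compile down to a single fmla v.4s, v.4s, v.s[2] instruction.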
It would be great if we could implement all this using class templates instead of macros, which would unroll the same way fixed-sized matrix multiplications unroll. Ideally, this could even share the same code.
(In reply to Gael Guennebaud from comment #5)
> Maybe this idea could be generalized to also handle arch with 32 registers
> (Altivec/VSX, Mips/MSA, and AVX512).

Note that on 64-bit ARM, there are 32 NEON (128-bit) registers.

> With more registers, we could load 4
> rhs elements and broadcast them afterward exactly as we do for 2pX4 micro
> kernel.

Indeed!

> Maybe we could unify the 3 variants by introducing a generic proxy to the
> four rhs elements, let's call it "rhs_panel". [...]

Glad I got your brain started on this :-D

I'm happy with anything that performs well on both the big cores and the little cores of any Pixel phone or similar, per the attached benchmark or any other benchmark you can run on a device.

The collection of asm kernels in this file,
https://github.com/google/gemmlowp/blob/3fb5c176c17c765a3492cd2f0321b0dab712f350/standalone/neon-gemm-kernel-benchmark.cc
with their code comments, represents all I know on this topic (if you're interested, I can try to publicly release more complete benchmark results from it than we have so far in https://docs.google.com/spreadsheets/d/1-0LjdMvW0XtH1bYknC0bQINoFaxjTuL9eplZZcitykI/edit#gid=0 )
Friendly ping, Hey Gael & Christoph, any update on this? Let me know if I can help. Thanks a lot!
Renjie, do you feel brave enough to try to implement the unified approach I suggested?

To reduce the meta-programming kung-fu, we could simplify the implementation a bit with the following API. Let's recall the 3 cases:

(a) we have plenty of registers, and the 4 rhs coeffs are broadcast to 4 RhsPackets;

(b) we are tight in registers (AVX+FMA+3x4 kernel), and we need to broadcast the rhs coeffs one at a time into a single register (see the sketch below);

(c) NEON: we copy the 4 rhs coeffs into a single RhsPacket and perform madd_by_lane.

Then a micro kernel would look like:

Traits::RhsPanel<15> rhs_panel; // here 15 = number of reserved registers for A* and C*
// RhsPanel could be:
//  - one RhsPacket in cases (b) and (c)
//  - a struct of 4 RhsPackets B_0, B1, B2, B3 in case (a)

// no changes for loadLhs:
traits.loadLhs(&blA[(0+3*K)*LhsProgress], A0); \
traits.loadLhs(&blA[(1+3*K)*LhsProgress], A1); \
traits.loadLhs(&blA[(2+3*K)*LhsProgress], A2); \

// loadRhs and broadcastRhs would be unified:
traits.loadRhs(blB + (0+4*K)*Traits::RhsProgress, rhs_panel); \

// in the following calls, the 5th parameter would be ignored in case (b)
traits.madd(A0, rhs_panel, C0, TMP, fix<0>); \
traits.madd(A1, rhs_panel, C1, TMP, fix<0>); \
traits.madd(A2, rhs_panel, C2, TMP, fix<0>); \

// update the rhs_panel in case (b), empty in cases (a) and (c)
traits.updateRhs(blB + (1+4*K)*Traits::RhsProgress, rhs_panel); \

traits.madd(A0, rhs_panel, C3, TMP, fix<1>); \
traits.madd(A1, rhs_panel, C4, TMP, fix<1>); \
traits.madd(A2, rhs_panel, C5, TMP, fix<1>); \
...

This way we don't have to assume that the compiler will always do a good job at factoring out redundant broadcasts, and we don't ask more of the compiler in terms of register allocation. So this should be safe.
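To illustrate case (b) concretely, here is a hedged sketch using plain AVX intrinsics rather than Eigen's packet API; the function names and the omission of the TMP parameter are illustrative simplifications. The panel is a single register that updateRhs re-broadcasts one coefficient at a time, and the compile-time lane parameter is simply ignored:

```cpp
#include <immintrin.h>

using Packet8f = __m256;

inline void loadRhs(const float* b, Packet8f& rhs_panel) {
  rhs_panel = _mm256_broadcast_ss(b);        // broadcast the first rhs coeff
}

inline void updateRhs(const float* b, Packet8f& rhs_panel) {
  rhs_panel = _mm256_broadcast_ss(b);        // re-broadcast the next rhs coeff
}

template <int Lane>                          // Lane is ignored in this case
inline void madd(Packet8f a, Packet8f rhs_panel, Packet8f& c) {
  c = _mm256_fmadd_ps(a, rhs_panel, c);      // requires FMA
}
```

Case (a) would instead keep a struct of four broadcast packets and select B_0..B3 from the lane index, and case (c) is the NEON proxy sketched earlier.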
Thanks for the guidance! I will give it a try! :D
Hey Gael, just want to make sure I understand it correctly:

1) Have a RhsPanel defined in each gebp_traits, templated by the number of registers taken (it's either a RhsPacket or a RhsPacketx4).

2) Rewrite madd & loadRhs in each gebp_traits -- madd for NEON and for vectorization in general, basically like http://eigen.tuxfamily.org/bz/attachment.cgi?id=902&action=diff (but we need to make sure RhsProgress == 1)? Maybe we still need a different signature?

3) Declare RhsPanel<registers_taken> in the different micro kernels for 3pX4, 2pX4 and 1pX4.

4) Get rid of broadcastRhs in the gebp_traits, since loadRhs can handle RhsPacketx4.

WDYT? Thanks for your advice!
Yes, that's basically what I had in mind, but feel free to adjust it while implementing it; I would not be surprised if some unexpected difficulties come up in the process. For instance, I haven't thought about the compatibility with the "swapped rhs-lhs" mode.
got it! I will try to do it!
Hey Gael, I'm truly sorry for the delay. I'm busy working on a project that needs to be finished soon, and I noticed this issue is blocking another one. Please take it over if necessary, or I can still work on it if this issue can wait?

I did not make much progress except for 1); I haven't rewritten madd/loadRhs yet. For 1), my declaration looks like

```
template <typename RhsPacket, typename RhsPacketx4, int registers_taken>
using RhsPanelHelper =
    typename conditional<(registers_taken < 15), RhsPacket, RhsPacketx4>::type;
```

and inside each gebp_traits I have something like

```
typedef struct {RhsPacket B_0; RhsPacket B1; RhsPacket B2; RhsPacket B3;} RhsPacketx4;
```

then I defined

```
template <int registers_taken>
using RhsPanel = RhsPanelHelper<RhsPacket, RhsPacketx4, registers_taken>;
```

inside gebp_kernel, so inside gebp_kernel we can use the syntax described in 1). However, I'm not sure whether my approach is permitted for Eigen, since `using` for template aliases requires C++11?

Thanks,
No worries, the dependency is not that strong; I was just concerned about limiting conflicts.

Regarding:

template <typename RhsPacket, typename RhsPacketx4, int registers_taken>
using RhsPanelHelper =
    typename conditional<(registers_taken < 15), RhsPacket, RhsPacketx4>::type;

and other "using" statements, you can wrap it:

template <typename RhsPacket, typename RhsPacketx4, int registers_taken>
struct RhsPanelHelper {
  typedef typename conditional<(registers_taken < 15), RhsPacket, RhsPacketx4>::type type;
};
*** Bug 1385 has been marked as a duplicate of this bug. ***
got it, thank you!
Created attachment 910 [details]
proposed refactoring patch

Really sorry about the delay; I finally found some time to work on this. I managed to generate a patch as Gael described, with help from Gael & Benoit. :)

I have checked that it passes the tests & verified the speedup on the NEON kernels. However, I have not verified the performance on other platforms. Specifically, I'm not sure whether the performance of the 1pX4 kernel will regress: in my current approach I defined an extra T0 as a struct of 4 RhsPackets, but it's also an unused variable, so hopefully it can be optimized away by the compiler?

PTAL. Thanks a lot!
Friendly ping: hi Gael, can you take a look at the patch? Thanks a lot! :D
Thank you for the efforts.

I just had a look, but I cannot test it as it doesn't apply on the head, and I don't know when you branched.

Have you tested it on AVX+FMA? I don't understand how it can still work in this case. This corresponds to case (b) in my comment #9:

  (b) we are tight in registers (AVX+FMA+3x4 kernel),
      we need to broadcast the rhs coeffs one at a time into a single register

which I suggested handling with calls to:

  traits.updateRhs(blB + (1+4*K)*Traits::RhsProgress, rhs_panel);

I also have other comments:
- RhsPacketx4 should be generically defined only once.
- It seems you replaced some calls to pbroadcast4 by manual multiple loads.
Created attachment 914 [details] new_proposed_refactoring_patch
(In reply to Gael Guennebaud from comment #20)

Hey Gael, really sorry about that, I didn't fully understand what you said before. Indeed, as you said, my patch caused a performance regression on AVX+FMA for the 3pX4 kernel. I have updated my patch to resolve the issue. PTAL. Some replies inline. Thanks a lot for your advice and help!

> Thank you for the efforts.
>
> I just had a look, but I cannot test it as it doesn't apply on the head,
> and I don't know when you branched.
>
> Have you tested it on AVX+FMA? I don't understand how it can still work in
> this case. This corresponds to case (b) in my comment #9:
>
>   (b) we are tight in registers (AVX+FMA+3x4 kernel),
>       we need to broadcast the rhs coeffs one at a time into a single register
>
> which I suggested handling with calls to:
>
>   traits.updateRhs(blB + (1+4*K)*Traits::RhsProgress, rhs_panel);
>
> I also have other comments:
> - RhsPacketx4 should be generically defined only once.

I found it's a little bit hard to do so: for instance, the NEON kernel needs RhsPacketx4 to be typedefed as float32x4_t, since that's what the NEON intrinsics require.

> - It seems you replaced some calls to pbroadcast4 by manual multiple loads.

Since pbroadcast4 is only used in broadcastRhs, can I get rid of it?

Again, thanks for the review and happy new year. :)
Hi Gael, do you mind taking another look for the new proposed patch? thanks a lot!
This looks better. It would help a lot if you could merge your patch with the head of Eigen so that I can give it a shot.

Some comments:

These lines are not correct:

#if defined(EIGEN_VECTORIZE_FMA) && defined(EIGEN_VECTORIZE_AVX)
  typedef typename Traits::RhsPacket RhsPanel15;
#else
  typedef typename RhsPanelHelper<RhsPacket, RhsPacketx4, 15>::type RhsPanel15;
#endif

This will reduce perf on AVX512, for which we have 32 registers. This logic should be handled by RhsPanelHelper based on EIGEN_ARCH_DEFAULT_NUMBER_OF_REGISTERS.

Regarding pbroadcast4, code like:

  loadRhs(b, dest.B_0);
  loadRhs(b + 1, dest.B1);
  loadRhs(b + 2, dest.B2);
  loadRhs(b + 3, dest.B3);

for which loadRhs is a simple pset1 should rather call pbroadcast4 (on some SIMD engines it is better implemented as one pload + 4 shuffles).
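A minimal sketch of that suggestion, assuming the EIGEN_ARCH_DEFAULT_NUMBER_OF_REGISTERS macro is available (the fallback define below exists only to keep the snippet self-contained); the threshold and member names are illustrative, not the final implementation:

```cpp
#ifndef EIGEN_ARCH_DEFAULT_NUMBER_OF_REGISTERS
#define EIGEN_ARCH_DEFAULT_NUMBER_OF_REGISTERS 16   // fallback for this sketch
#endif

// C++03-style conditional, as in the wrapping suggested in comment #15.
template <bool Cond, typename Then, typename Else>
struct conditional_ { typedef Then type; };
template <typename Then, typename Else>
struct conditional_<false, Then, Else> { typedef Else type; };

// Select the panel type from the number of registers left over, instead of
// hard-coding the AVX+FMA case: enough spare registers -> 4-packet panel,
// otherwise fall back to a single RhsPacket.
template <typename RhsPacket, typename RhsPacketx4, int registers_taken>
struct RhsPanelHelper {
 private:
  static const int remaining_registers =
      EIGEN_ARCH_DEFAULT_NUMBER_OF_REGISTERS - registers_taken;
 public:
  typedef typename conditional_<(remaining_registers >= 4),
                                RhsPacketx4, RhsPacket>::type type;
};
```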
Created attachment 917 [details]
updated_patch

Thanks for the review. I have synced to the head and updated the patch; please take a look. Thanks a lot!
Your patch does not apply. Make sure it includes all your local commits.
Created attachment 918 [details]
updated_patch

Really sorry about that, I'm not an hg expert. I re-uploaded the patch; this time I made sure it applies to a clean repo. Please let me know if you have any questions. Thanks a lot.
Alright, I applied your patch:

https://bitbucket.org/eigen/eigen/commits/556fb4ceb654/
User: renjieliu
Summary: Bug: 1633: refactor gebp kernel and optimize for neon

and did some cleanup:

https://bitbucket.org/eigen/eigen/commits/465744fa91ea/
User: ggael
Summary: Bug 1633: use proper type for madd temporaries, factorize RhsPacketx4.

Thank you for the work!
Thanks a lot, Gael and Renjie, for resolving all this!
Renjie, have you benchmarked the final version of the patch that we applied? I now have access to a ThunderX2 and I observed a 10% slowdown with the patch:

3.3:              12.5 GFlops
bug 1624:         13.5 GFlops (but out-of-bounds memory access)
+ fix:            12.5 GFlops
after this patch: 11.5 GFlops

The generated assembly is also pretty bad; here is one step of the 3pX4 kernel:

ldr q4, [x2, 16]
ldr q16, [x1, 48]
ldr q15, [x1, 64]
dup d5, v4.d[0]
dup d4, v4.d[1]
ldr q14, [x1, 80]
dup v19.4s, v5.s[0]
dup v17.4s, v4.s[0]
dup v5.4s, v5.s[1]
dup v4.4s, v4.s[1]
fmla v13.4s, v16.4s, v19.4s
fmla v12.4s, v15.4s, v19.4s
fmla v11.4s, v14.4s, v19.4s
fmla v10.4s, v16.4s, v5.4s
fmla v9.4s, v15.4s, v5.4s
fmla v8.4s, v14.4s, v5.4s
fmla v7.4s, v16.4s, v17.4s
fmla v1.4s, v15.4s, v17.4s
fmla v2.4s, v14.4s, v17.4s
fmla v6.4s, v16.4s, v4.4s
fmla v3.4s, v15.4s, v4.4s
fmla v0.4s, v14.4s, v4.4s
add x28, x1, 352
prfm pldl1keep, [x28]

whereas I would expect:

ldr q4, [x2, 16]
ldr q16, [x1, 48]
ldr q15, [x1, 64]
ldr q14, [x1, 80]
fmla v13.4s, v16.4s, v4.s[2]
fmla v12.4s, v15.4s, v4.s[2]
fmla v11.4s, v14.4s, v4.s[2]
fmla v10.4s, v16.4s, v4.s[3]
fmla v9.4s, v15.4s, v4.s[3]
fmla v8.4s, v14.4s, v4.s[3]
fmla v7.4s, v16.4s, v4.s[0]
fmla v1.4s, v15.4s, v4.s[0]
fmla v2.4s, v14.4s, v4.s[0]
fmla v6.4s, v16.4s, v4.s[1]
fmla v3.4s, v15.4s, v4.s[1]
fmla v0.4s, v14.4s, v4.s[1]
add x28, x1, 352
prfm pldl1keep, [x28]

Any ideas?
Here, changing the four madd overloads to:

template<int LaneID>
EIGEN_STRONG_INLINE void madd(const LhsPacket& a, const RhsPacketx4& b, AccPacket& c, RhsPacket& /*tmp*/, const FixedInt<LaneID>&) const
{
  c = vfmaq_laneq_f32(c, a, b, LaneID);
}

works as expected, and I get a significant speed increase: 14.3 GFlops.

According to https://developer.arm.com/technologies/neon/intrinsics, vfmaq_laneq_f32 seems to be just as widely supported as vfmaq_lane_f32. Any reason for not using vfmaq_laneq_f32 in the first place??
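For context (a hedged illustration, not the Eigen code): the two intrinsics differ in the type of the lane-source operand. vfmaq_lane_f32 indexes a 64-bit float32x2_t (lanes 0-1), so using it with a 4-wide rhs packet forces an extra vget_low_f32/vget_high_f32 split, whereas vfmaq_laneq_f32 indexes the full 128-bit float32x4_t directly (lanes 0-3):

```cpp
#include <arm_neon.h>

float32x4_t madd_with_lane(float32x4_t c, float32x4_t a, float32x4_t b) {
  // "lane" variant: must first extract the 64-bit half that contains lane 2
  return vfmaq_lane_f32(c, a, vget_high_f32(b), 0);   // uses b[2]
}

float32x4_t madd_with_laneq(float32x4_t c, float32x4_t a, float32x4_t b) {
  // "laneq" variant: indexes the full 128-bit register directly
  return vfmaq_laneq_f32(c, a, b, 2);                 // uses b[2]
}
```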
Hm, the assembly is still not as good as it should be:

ldr q1, [x2, 16]
add x30, x1, 352
ldr q25, [x1, 48]
dup v28.4s, v1.s[0]
dup v27.4s, v1.s[1]
dup v26.4s, v1.s[2]
dup v1.4s, v1.s[3]
ldr q24, [x1, 64]
ldr q23, [x1, 80]
fmla v22.4s, v25.4s, v28.4s
fmla v21.4s, v24.4s, v28.4s
fmla v20.4s, v23.4s, v28.4s
fmla v3.4s, v25.4s, v27.4s
fmla v19.4s, v24.4s, v27.4s
fmla v18.4s, v23.4s, v27.4s
fmla v6.4s, v25.4s, v26.4s
fmla v16.4s, v24.4s, v26.4s
fmla v7.4s, v23.4s, v26.4s
fmla v17.4s, v25.4s, v1.4s
fmla v5.4s, v24.4s, v1.4s
fmla v2.4s, v23.4s, v1.4s

I don't know why GCC issues those dup instructions...
OK, so now if I play with inline asm and replace vfmaq_laneq_f32 by:

if(LaneID==0)      asm("fmla %0.4s, %1.4s, %2.s[0]\n" : "+w" (c) : "w" (a), "w" (b) : );
else if(LaneID==1) asm("fmla %0.4s, %1.4s, %2.s[1]\n" : "+w" (c) : "w" (a), "w" (b) : );
else if(LaneID==2) asm("fmla %0.4s, %1.4s, %2.s[2]\n" : "+w" (c) : "w" (a), "w" (b) : );
else if(LaneID==3) asm("fmla %0.4s, %1.4s, %2.s[3]\n" : "+w" (c) : "w" (a), "w" (b) : );

I eventually get 18.8 GFlops with the following generated ASM:

ldr q23, [x1, 192]
add x30, x1, 496
ldr q22, [x2, 64]
fmla v20.4s, v23.4s, v22.s[0]
fmla v17.4s, v23.4s, v22.s[1]
fmla v6.4s, v23.4s, v22.s[2]
fmla v3.4s, v23.4s, v22.s[3]
ldr q23, [x1, 208]
fmla v19.4s, v23.4s, v22.s[0]
fmla v16.4s, v23.4s, v22.s[1]
fmla v5.4s, v23.4s, v22.s[2]
fmla v2.4s, v23.4s, v22.s[3]
ldr q23, [x1, 224]
fmla v18.4s, v23.4s, v22.s[0]
fmla v7.4s, v23.4s, v22.s[1]
fmla v4.4s, v23.4s, v22.s[2]
fmla v1.4s, v23.4s, v22.s[3]
prfm pldl1keep, [x30]

So now I have to teach GCC to load A0, A1, and A2 into three distinct registers (as it did before I added the inline asm... damn register allocators!)
For the dup instructions, here is how GCC implements vfmaq_laneq_f32:

__extension__ static __inline float32x4_t __attribute__ ((__always_inline__))
vfmaq_laneq_f32 (float32x4_t __a, float32x4_t __b,
                 float32x4_t __c, const int __lane)
{
  return __builtin_aarch64_fmav4sf (__b,
                                    __aarch64_vdupq_laneq_f32 (__c, __lane),
                                    __a);
}

I'll report a bug to GCC.
GCC's bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89101
I think you're right, vfmaq_laneq_f32 is better than vfmaq_lane_f32, and it makes the code easier as well. :) Thanks for catching this!

For the GCC part, I really have no idea why that happens. I compiled with clang 4.0.1, and it seems that even with vfmaq_lane_f32 it produces the right result:

#begin step of gebp micro kernel 3pX4
//NO_APP
//APP
#Note: these asm comments work around bug 935!
//NO_APP
//APP
prfm pldl1keep, [x19]
//NO_APP
fmla v19.4s, v20.4s, v23.s[0]
fmla v17.4s, v20.4s, v23.s[1]
fmla v5.4s, v20.4s, v23.s[2]
fmla v3.4s, v20.4s, v23.s[3]
ldp q22, q20, [x13, #32]
fmla v18.4s, v21.4s, v23.s[0]
fmla v7.4s, v21.4s, v23.s[1]
fmla v4.4s, v21.4s, v23.s[2]
fmla v1.4s, v21.4s, v23.s[3]
fmla v16.4s, v22.4s, v23.s[0]
fmla v6.4s, v22.4s, v23.s[1]
fmla v2.4s, v22.4s, v23.s[2]
fmla v0.4s, v22.4s, v23.s[3]
ldp q21, q22, [x13, #64]
ldr q23, [x16, #16]
add x19, x13, #352 // =352
//APP
#end step of gebp micro kernel 3pX4

Thanks for filing the GCC bug, by the way.
Created attachment 922 [details]
Speed up gebp on arm64

Here are my local changes, which allowed me to reach 19.3 GFlops (the current Eigen version reaches 11.5). I also added faster support for double.
Created attachment 923 [details]
Fix perf with arm64 and gcc, add explicit support for double

Here is the patch I plan to apply.
thanks for the fix!
Applied: https://bitbucket.org/eigen/eigen/commits/e2b2a2b25718/
-- GitLab Migration Automatic Message -- This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/1633.