This bugzilla service is closed. All entries have been migrated to https://gitlab.com/libeigen/eigen
Bug 1633 - Improve float matrix multiplication performance on ARM NEON (take 2)
Summary: Improve float matrix multiplication performance on ARM NEON (take 2)
Status: RESOLVED FIXED
Alias: None
Product: Eigen
Classification: Unclassified
Component: Core - matrix products
Version: unspecified
Hardware: ARM - NEON All
Importance: Normal Performance Problem
Assignee: Nobody
URL:
Whiteboard:
Keywords:
Duplicates: 1385
Depends on:
Blocks: 1642
Reported: 2018-11-28 18:01 UTC by Benoit Jacob
Modified: 2019-12-04 18:12 UTC
CC List: 5 users



Attachments
benchmark (2.03 KB, text/x-c++src)
2018-11-28 18:02 UTC, Benoit Jacob
faster kernel (22.08 KB, patch)
2018-11-28 18:12 UTC, Benoit Jacob
proposed refactoring patch (39.29 KB, patch)
2018-12-24 11:21 UTC, Renjie Liu
new_proposed_refactoring_patch (42.08 KB, patch)
2019-01-10 08:27 UTC, Renjie Liu
updated_patch (40.84 KB, patch)
2019-01-15 10:29 UTC, Renjie Liu
updated_patch (40.81 KB, patch)
2019-01-16 04:56 UTC, Renjie Liu
Speed up gebp on arm64 (5.71 KB, patch)
2019-01-29 14:09 UTC, Gael Guennebaud
Fix perf with arm64 and gcc, add explicit support for double (7.26 KB, patch)
2019-01-29 14:24 UTC, Gael Guennebaud

Description Benoit Jacob 2018-11-28 18:01:56 UTC
This is a follow-up from bug 1624.

*** Summary so far: ***

1. We know what a fast float GEMM kernel looks like on ARM NEON: it should take advantage of multiply-accumulate-against-single-element instructions, like this:
https://github.com/google/gemmlowp/blob/3fb5c176c17c765a3492cd2f0321b0dab712f350/standalone/neon-gemm-kernel-benchmark.cc#L4670-L4716


2. The patches in bug 1624 implemented that basic idea of taking advantage of multiply-by-element instructions. However, they didn't take advantage of the ability to multiply by an *arbitrary* element; they only used multiplication by the 0-th element of a vector, with repeated loading of new data into that 0-th element. That was eventually submitted as efda481cbd7a.

3. The patches in bug 1624 also had a crash issue, loading 8 bytes when only 4 may be present.

4. The crash issue was fixed in e01823ce7f6e; however, that turned out to regress performance. Looking at the disassembly, the reason is that Clang (at least) compiles vmlaq_n_f32 as an ld1r instruction (load scalar and duplicate onto all lanes) followed by an ordinary multiply-add instruction, not taking advantage of multiply-by-element; and ld1r does not seem to dual-issue well with multiply-add.
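
To illustrate the contrast (a minimal sketch, not the actual patch; vfmaq_laneq_f32 is the AArch64 by-element form that comes up later in this bug):

```
#include <arm_neon.h>

// Sketch only. vmlaq_n_f32 takes a scalar, so the compiler must first
// duplicate it onto all lanes (ld1r/dup) before the multiply-add:
float32x4_t madd_by_scalar(float32x4_t acc, float32x4_t lhs, const float* rhs) {
  return vmlaq_n_f32(acc, lhs, *rhs);
}

// The style this bug aims for: load 4 rhs values once, then accumulate
// against each lane in place (one "fmla v.4s, v.4s, v.s[lane]" per call):
void madd_by_lane(float32x4_t rhs4, float32x4_t lhs,
                  float32x4_t& c0, float32x4_t& c1,
                  float32x4_t& c2, float32x4_t& c3) {
  c0 = vfmaq_laneq_f32(c0, lhs, rhs4, 0);
  c1 = vfmaq_laneq_f32(c1, lhs, rhs4, 1);
  c2 = vfmaq_laneq_f32(c2, lhs, rhs4, 2);
  c3 = vfmaq_laneq_f32(c3, lhs, rhs4, 3);
}
```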


*** New patch ***

This new patch:
  - Implements exactly the original idea of fast kernels on NEON: a 128-bit load brings in 4 RHS float values, each of them used in place by a multiply-add-by-element instruction.
  - Offers higher overall performance than even the fast code from step 2 above (efda481cbd7a).
  - Does not read data out of bounds, unlike the fast code from step 2 above (efda481cbd7a).
Comment 1 Benoit Jacob 2018-11-28 18:02:21 UTC
Created attachment 901 [details]
benchmark
Comment 3 Benoit Jacob 2018-11-28 18:12:46 UTC
Created attachment 902 [details]
faster kernel

The faster kernel. It's certainly a bit of a departure from the current design, and Gael, I expect you'll have better ideas about how to integrate this optimally. Hopefully this patch is useful as a starting point and shows what performance one can shoot for.
Comment 4 Renjie Liu 2018-11-29 01:54:51 UTC
Thanks Benoit for making the improvement and fix!
Comment 5 Gael Guennebaud 2018-11-29 09:30:42 UTC
Maybe this idea could be generalized to also handle architectures with 32 registers (Altivec/VSX, MIPS/MSA, and AVX512). With more registers, we could load 4 rhs elements and broadcast them afterwards exactly as we do for the 2pX4 micro kernel.

Maybe we could unify the 3 variants by introducing a generic proxy to the four rhs elements, let's call it "rhs_panel". It needs to be parametrized by the number of registers already taken:

Traits::RhsPanel<15> rhs_panel;

traits.loadLhs(&blA[(0+3*K)*LhsProgress], A0);  \
traits.loadLhs(&blA[(1+3*K)*LhsProgress], A1);  \
traits.loadLhs(&blA[(2+3*K)*LhsProgress], A2);  \
traits.loadRhs(blB + (0+4*K)*Traits::RhsProgress, rhs_panel); \
traits.madd(A0, rhs_panel(fix<0>), C0); \
traits.madd(A1, rhs_panel(fix<0>), C4); \
traits.madd(A2, rhs_panel(fix<0>), C8); \
traits.madd(A0, rhs_panel(fix<1>), C1); \
traits.madd(A1, rhs_panel(fix<1>), C5); \
traits.madd(A2, rhs_panel(fix<1>), C9); \
...

If fewer than 4 registers remain, RhsPanel would simply store a RhsScalar*, and operator() would issue a standard broadcast/splat from memory. This assumes the compiler is able to factor out the redundant broadcasts.

If at least 4 registers remain, RhsPanel would hold 4 packets filled by loadRhs (as in broadcastRhs).

With NEON, RhsPanel would hold only a single Packet and operator() would return a proxy embedding this packet with the compile-time lane number.
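
A minimal sketch of that NEON variant, with hypothetical names (lane_tag, RhsLaneProxy, RhsPanelNeon are mine, just to fix ideas):

```
#include <arm_neon.h>

// Hypothetical sketch: a compile-time lane tag, and a proxy carrying the
// packet of 4 rhs coefficients together with the lane to multiply by.
template <int N> struct lane_tag {};

template <int Lane>
struct RhsLaneProxy {
  float32x4_t packet;  // the 4 rhs coefficients, loaded once per step
};

struct RhsPanelNeon {
  float32x4_t packet;
  template <int Lane>
  RhsLaneProxy<Lane> operator()(lane_tag<Lane>) const {
    RhsLaneProxy<Lane> proxy;
    proxy.packet = packet;
    return proxy;
  }
};

// madd resolves the lane at compile time, mapping to one by-element fmla.
template <int Lane>
inline void madd(float32x4_t a, RhsLaneProxy<Lane> b, float32x4_t& c) {
  c = vfmaq_laneq_f32(c, a, b.packet, Lane);
}
```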

If that's simpler, we could also unify operator() with madd:

traits.madd(A0, rhs_panel, C0, fix<0>);


With 32 registers we should also add a 4x4 or even 5x4, or maybe 3x8, micro-kernel, but that's another topic!
Comment 6 Christoph Hertzberg 2018-11-29 10:02:55 UTC
It would be great if we could implement all this using class templates instead of macros, which would unroll the same way fixed-sized matrix multiplications unroll. Ideally, this could even share the same code.
Comment 7 Benoit Jacob 2018-11-29 14:38:26 UTC
(In reply to Gael Guennebaud from comment #5)
> Maybe this idea could be generalized to also handle architectures with 32
> registers (Altivec/VSX, MIPS/MSA, and AVX512).

Note that on 64-bit ARM, there are 32 NEON (128-bit) registers.


> With more registers, we could load 4
> rhs elements and broadcast them afterwards exactly as we do for the 2pX4
> micro kernel.

Indeed!

> 
> Maybe we could unify the 3 variants by introducing a generic proxy to the
> four rhs elements, let's call it "rhs_panel". [...]

Glad I got your brain started on this :-D

I'm happy with anything that performs well on both the big and little cores of any Pixel phone or similar, per the attached benchmark or any other benchmark you can run on a device.

The collection of asm kernels in this file,

https://github.com/google/gemmlowp/blob/3fb5c176c17c765a3492cd2f0321b0dab712f350/standalone/neon-gemm-kernel-benchmark.cc

with their code comments, represents all I know on this topic. (If you're interested, I can try to publicly release more complete benchmark results from it than we have so far in https://docs.google.com/spreadsheets/d/1-0LjdMvW0XtH1bYknC0bQINoFaxjTuL9eplZZcitykI/edit#gid=0.)
Comment 8 Renjie Liu 2018-12-03 09:00:22 UTC
Friendly ping: hey Gael & Christoph, any update on this?
Let me know if I can help. Thanks a lot!
Comment 9 Gael Guennebaud 2018-12-03 14:34:28 UTC
Renjie, do you feel brave enough to try to implement the unified approach I suggested?

To reduce the meta-programming kung-fu, we could simplify the implementation a bit with the following API.

Let me recall the 3 cases:

(a) we have plenty of registers and the 4 rhs coeffs are broadcast to 4 RhsPackets
(b) we are tight on registers (AVX+FMA+3x4 kernel); we need to broadcast the rhs coeffs one at a time into a single register
(c) NEON: we copy the 4 rhs coeffs into a single RhsPacket and perform madd_by_lane.

Then a micro kernel would look like:

Traits::RhsPanel<15> rhs_panel;
// here 15 = number of reserved registers for A* and C*
// RhsPanel could be:
//  one RhsPacket in cases (b) and (c)
//  a struct of 4 RhsPackets B_0, B1, B2, B3 in case (a)

// no changes for loadLhs:
traits.loadLhs(&blA[(0+3*K)*LhsProgress], A0);  \
traits.loadLhs(&blA[(1+3*K)*LhsProgress], A1);  \
traits.loadLhs(&blA[(2+3*K)*LhsProgress], A2);  \

// loadRhs and broadcastRhs would be unified:
traits.loadRhs(blB + (0+4*K)*Traits::RhsProgress, rhs_panel); \

// in the following calls, the 5th parameter would be ignored in case (b)
traits.madd(A0, rhs_panel, C0, TMP, fix<0>); \
traits.madd(A1, rhs_panel, C1, TMP, fix<0>); \
traits.madd(A2, rhs_panel, C2, TMP, fix<0>); \

// update the rhs_panel in case (b), empty in cases (a) and (c)
traits.updateRhs(blB + (1+4*K)*Traits::RhsProgress, rhs_panel); \
traits.madd(A0, rhs_panel, C3, TMP, fix<1>); \
traits.madd(A1, rhs_panel, C4, TMP, fix<1>); \
traits.madd(A2, rhs_panel, C5, TMP, fix<1>); \

...


This way we don't have to assume that the compiler will always do a good job of factoring out redundant broadcasts, and we don't ask more of the compiler in terms of register allocation. So this should be safe.
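
For instance, in case (b), updateRhs could be as small as this sketch (a fragment of a gebp_traits specialization, assuming pset1 broadcast semantics):

```
// Sketch of case (b) only: rhs_panel is a single RhsPacket, and each update
// re-broadcasts the next rhs coefficient from memory.
EIGEN_STRONG_INLINE void updateRhs(const RhsScalar* b, RhsPacket& rhs_panel) const
{
  rhs_panel = pset1<RhsPacket>(*b);
}
// In cases (a) and (c), updateRhs would be an empty no-op.
```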
Comment 10 Renjie Liu 2018-12-04 01:42:59 UTC
Thanks for the guidance! I will give it a try! :D
Comment 11 Renjie Liu 2018-12-05 02:15:32 UTC
Hey Gael, just want to make sure I understand it correctly:

1) Have a RhsPanel defined in each gebp_traits, templated on the number of registers taken (it's either a RhsPacket or a RhsPacketx4).
2) Rewrite madd & loadRhs in each gebp_traits (madd for NEON with vectorization, basically like http://eigen.tuxfamily.org/bz/attachment.cgi?id=902&action=diff), but we need to make sure RhsProgress == 1? Maybe we still need a different signature?
3) Declare RhsPanel<registers_taken> in the different micro kernels for 3px4, 2px4 and 1px4.
4) Get rid of broadcastRhs in the gebp_traits since loadRhs can handle RhsPacketx4.

WDYT?

Thanks for your advice!
Comment 12 Gael Guennebaud 2018-12-05 17:08:26 UTC
Yes, that's basically what I had in mind, but feel free to adjust it as you implement it; I would not be surprised if some unexpected difficulties come up in the process. For instance, I haven't thought about compatibility with the "swapped rhs-lhs" mode.
Comment 13 Renjie Liu 2018-12-06 15:33:43 UTC
got it! I will try to do it!
Comment 14 Renjie Liu 2018-12-12 02:56:19 UTC
Hey Gael, I'm truly sorry for the delay.

I'm busy working on a project that needs to be finished soon, and I noticed this issue is blocking another one. Please take it over if necessary, or I can still work on it if this issue can wait?

I did not make much progress except for 1); I haven't rewritten madd/loadRhs yet. Also for 1), my declaration looks like

```
template <typename RhsPacket, typename RhsPacketx4, int registers_taken>
using RhsPanelHelper = typename conditional<(registers_taken < 15), RhsPacket, RhsPacketx4>::type;
```

and inside each gebp_traits,

I have something like 

```
typedef struct {RhsPacket B_0; RhsPacket B1; RhsPacket B2; RhsPacket B3;} RhsPacketx4;

```

then I defined

```
template <int registers_taken>
using RhsPanel = RhsPanelHelper<RhsPacket, RhsPacketx4, registers_taken>;
```

inside gebp_kernel, so there we can use the syntax described in 1).

However, I'm not sure whether my approach is permitted for Eigen (template `using` aliases are C++11-only, right?).

Thanks,
Comment 15 Gael Guennebaud 2018-12-12 14:03:34 UTC
No worries, the dependency is not that strong; I was just concerned about limiting conflicts.

Regarding:

template <typename RhsPacket, typename RhsPacketx4, int registers_taken>
using RhsPanelHelper = typename conditional<(registers_taken < 15), RhsPacket, RhsPacketx4>::type;

and other "using" statements, you can wrap it:

template <typename RhsPacket, typename RhsPacketx4, int registers_taken>
struct RhsPanelHelper {
  typedef typename conditional<(registers_taken < 15), RhsPacket, RhsPacketx4>::type type;
};
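
A use site then stays C++03-compatible:

```
// No C++11 alias templates needed at the point of use:
typename RhsPanelHelper<RhsPacket, RhsPacketx4, 15>::type rhs_panel;
```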
Comment 16 Gael Guennebaud 2018-12-12 16:49:13 UTC
*** Bug 1385 has been marked as a duplicate of this bug. ***
Comment 17 Renjie Liu 2018-12-13 02:11:04 UTC
got it, thank you!
Comment 18 Renjie Liu 2018-12-24 11:21:00 UTC
Created attachment 910 [details]
proposed refactoring patch

Really sorry about the delay; I finally found some time to work on this. I managed to produce a patch as Gael described, with help from Gael & Benoit. :)

I have checked that it passes the tests & verified the speedup of the NEON kernels. However, I have not verified the performance on other platforms.

Specifically, I'm not sure whether the performance of the 1px4 kernel will regress with my current approach: I defined an extra T0 that is a struct of 4 RhsPackets, but it's an unused variable, so hopefully it can be optimized away by the compiler?

PTAL, thanks a lot!
Comment 19 Renjie Liu 2019-01-07 03:57:30 UTC
Friendly ping: hi Gael, can you take a look at the patch? Thanks a lot! :D
Comment 20 Gael Guennebaud 2019-01-09 16:27:41 UTC
Thank you for the efforts.

I just had a look, but I cannot test it, as it doesn't apply on the head and I don't know when you branched.


Have you tested it on AVX+FMA? I don't understand how it can still work in this case. This corresponds to case (b) in my comment #9:

  (b) we are tight on registers (AVX+FMA+3x4 kernel);
  we need to broadcast the rhs coeffs one at a time into a single register

that I suggested to handle with calls to:

  traits.updateRhs(blB + (1+4*K)*Traits::RhsProgress, rhs_panel);


I also have other comments:
- RhsPacketx4 should be generically defined only once.
- It seems you replaced some calls to pbroadcast4 by manual multiple loads.
Comment 21 Renjie Liu 2019-01-10 08:27:27 UTC
Created attachment 914 [details]
new_proposed_refactoring_patch
Comment 22 Renjie Liu 2019-01-10 08:32:12 UTC
(In reply to Gael Guennebaud from comment #20)

Hey Gael, really sorry about that; I didn't fully understand what you said before. Indeed, as you said, my patch caused a performance regression on AVX+FMA for the 3px4 kernel. I have updated my patch to resolve the issue. PTAL.

And some replies inline.

Thanks a lot for your advice and help!

> Thank you for the efforts.
> 
> I just had a look, but I cannot test it, as it doesn't apply on the head
> and I don't know when you branched.
> 
> 
> Have you tested it on AVX+FMA? I don't understand how it can still work in
> this case. This corresponds to case (b) in my comment #9:
> 
>   (b) we are tight on registers (AVX+FMA+3x4 kernel);
>   we need to broadcast the rhs coeffs one at a time into a single register
> 
> that I suggested to handle with calls to:
> 
>   traits.updateRhs(blB + (1+4*K)*Traits::RhsProgress, rhs_panel);
> 
> 
> I also have other comments:
> - RhsPacketx4 should be generically defined only once.
I found it's a little bit hard to do so; for example, the NEON kernel needs RhsPacketx4 to be typedefed as float32x4_t, since that's what the NEON intrinsics need.

> - It seems you replaced some calls to pbroadcast4 by manual multiple loads.
Since pbroadcast4 is only used in broadcastRhs, can I get rid of it?

Again, thanks for the review and happy new year. :)
Comment 23 Renjie Liu 2019-01-12 00:58:00 UTC
Hi Gael, do you mind taking another look at the new proposed patch?
Thanks a lot!
Comment 24 Gael Guennebaud 2019-01-14 21:16:57 UTC
This looks better. It would help a lot if you could merge your patch with the head of Eigen so that I can give it a shot.

Some comments:

These lines are not correct:

#if defined(EIGEN_VECTORIZE_FMA) && defined(EIGEN_VECTORIZE_AVX)
typedef typename Traits::RhsPacket RhsPanel15;
#else
typedef typename RhsPanelHelper<RhsPacket, RhsPacketx4, 15>::type RhsPanel15;
#endif

This will reduce perf on AVX512, for which we have 32 registers.

This logic should be handled by RhsPanelHelper based on EIGEN_ARCH_DEFAULT_NUMBER_OF_REGISTERS.
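
For example, something along these lines (a sketch; the exact cutoff is a judgment call):

```
// Sketch: pick RhsPacketx4 whenever at least 4 registers remain free,
// deriving the budget from EIGEN_ARCH_DEFAULT_NUMBER_OF_REGISTERS
// (16 for SSE/AVX, 32 for AVX512 and 64-bit ARM).
template <typename RhsPacket, typename RhsPacketx4, int registers_taken>
struct RhsPanelHelper {
 private:
  static const int remaining_registers =
      EIGEN_ARCH_DEFAULT_NUMBER_OF_REGISTERS - registers_taken;
 public:
  typedef typename conditional<(remaining_registers >= 4), RhsPacketx4, RhsPacket>::type type;
};
```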


Regarding pbroadcast4, code like:

loadRhs(b, dest.B_0);
loadRhs(b + 1, dest.B1);
loadRhs(b + 2, dest.B2);
loadRhs(b + 3, dest.B3);

for which loadRhs is a simple pset1, should rather call pbroadcast4 (on some SIMD engines it is better implemented as one pload + 4 shuffles).
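
i.e., something like this sketch, reusing the B_0/B1/B2/B3 members from above:

```
// Sketch: fill the four packets of RhsPacketx4 with one pbroadcast4 call
// (one pload + 4 shuffles on some SIMD engines) instead of four pset1.
EIGEN_STRONG_INLINE void loadRhs(const RhsScalar* b, RhsPacketx4& dest) const
{
  pbroadcast4(b, dest.B_0, dest.B1, dest.B2, dest.B3);
}
```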
Comment 25 Renjie Liu 2019-01-15 10:29:00 UTC
Created attachment 917 [details]
updated_patch

Thanks for the review. I have synced to the head and updated the patch; please take a look, thanks a lot!
Comment 26 Gael Guennebaud 2019-01-15 14:33:58 UTC
Your patch does not apply. Make sure it includes all your local commits.
Comment 27 Renjie Liu 2019-01-16 04:56:25 UTC
Created attachment 918 [details]
updated_patch

Really sorry about that; I'm not an hg expert. I re-uploaded the patch, and this time I made sure it applies to a clean repo. Please let me know if you have any questions.

thanks a lot.
Comment 28 Gael Guennebaud 2019-01-16 13:01:13 UTC
Alright, I applied your patch:

https://bitbucket.org/eigen/eigen/commits/556fb4ceb654/
User:        renjieliu
Summary:     Bug: 1633: refactor gebp kernel and optimize for neon

and did some cleanup:

https://bitbucket.org/eigen/eigen/commits/465744fa91ea/
User:        ggael
Summary:     Bug 1633: use proper type for madd temporaries, factorize RhsPacketx4.


Thank you for the work!
Comment 29 Benoit Jacob 2019-01-21 15:23:31 UTC
Thanks a lot, Gael and Renjie, for resolving all this!
Comment 30 Gael Guennebaud 2019-01-29 11:51:56 UTC
Renjie,

have you benchmarked the final version of the patch that we applied?

I now have access to a ThunderX2, and I observed a 10% slowdown with the patch:

3.3:              12.5 GFlops
bug 1624:         13.5 GFlops (but out-of-bounds memory access)
+fix:             12.5 GFlops
after this patch: 11.5 GFlops

The generated assembly is also pretty bad; here is one step of the 3pX4 kernel:

        ldr     q4, [x2, 16]
        ldr     q16, [x1, 48]
        ldr     q15, [x1, 64]
        dup     d5, v4.d[0]
        dup     d4, v4.d[1]
        ldr     q14, [x1, 80]
        dup     v19.4s, v5.s[0]
        dup     v17.4s, v4.s[0]
        dup     v5.4s, v5.s[1]
        dup     v4.4s, v4.s[1]
        fmla    v13.4s, v16.4s, v19.4s
        fmla    v12.4s, v15.4s, v19.4s
        fmla    v11.4s, v14.4s, v19.4s
        fmla    v10.4s, v16.4s, v5.4s
        fmla    v9.4s,  v15.4s, v5.4s
        fmla    v8.4s,  v14.4s, v5.4s
        fmla    v7.4s,  v16.4s, v17.4s
        fmla    v1.4s,  v15.4s, v17.4s
        fmla    v2.4s,  v14.4s, v17.4s
        fmla    v6.4s,  v16.4s, v4.4s
        fmla    v3.4s,  v15.4s, v4.4s
        fmla    v0.4s,  v14.4s, v4.4s
        add     x28, x1, 352
        prfm pldl1keep, [x28]

whereas I would expect:

        ldr     q4, [x2, 16]
        ldr     q16, [x1, 48]
        ldr     q15, [x1, 64]
        ldr     q14, [x1, 80]
        fmla    v13.4s, v16.4s, v4.s[2]
        fmla    v12.4s, v15.4s, v4.s[2]
        fmla    v11.4s, v14.4s, v4.s[2]
        fmla    v10.4s, v16.4s, v4.s[3]
        fmla    v9.4s,  v15.4s, v4.s[3]
        fmla    v8.4s,  v14.4s, v4.s[3]
        fmla    v7.4s,  v16.4s, v4.s[0]
        fmla    v1.4s,  v15.4s, v4.s[0]
        fmla    v2.4s,  v14.4s, v4.s[0]
        fmla    v6.4s,  v16.4s, v4.s[1]
        fmla    v3.4s,  v15.4s, v4.s[1]
        fmla    v0.4s,  v14.4s, v4.s[1]
        add     x28, x1, 352
        prfm pldl1keep, [x28]

Any ideas?
Comment 31 Gael Guennebaud 2019-01-29 12:06:45 UTC
Here, changing the four madd overloads to:

template<int LaneID>
EIGEN_STRONG_INLINE void madd(const LhsPacket& a, const RhsPacketx4& b, AccPacket& c, RhsPacket& /*tmp*/, const FixedInt<LaneID>&) const
{
  c = vfmaq_laneq_f32(c, a, b, LaneID);
}

works as expected, and I get a significant speed increase:

14.3 GFlops

According to: https://developer.arm.com/technologies/neon/intrinsics

vfmaq_laneq_f32 seems to be as widely supported as vfmaq_lane_f32. Any reason for not using vfmaq_laneq_f32 in the first place?
Comment 32 Gael Guennebaud 2019-01-29 12:10:21 UTC
hm, the assembly is still not as good as it should be:

        ldr	q1, [x2, 16]
	add	x30, x1, 352
	ldr	q25, [x1, 48]
	dup	v28.4s, v1.s[0]
	dup	v27.4s, v1.s[1]
	dup	v26.4s, v1.s[2]
	dup	v1.4s, v1.s[3]
	ldr	q24, [x1, 64]
	ldr	q23, [x1, 80]
	fmla	v22.4s, v25.4s, v28.4s
	fmla	v21.4s, v24.4s, v28.4s
	fmla	v20.4s, v23.4s, v28.4s
	fmla	v3.4s, v25.4s, v27.4s
	fmla	v19.4s, v24.4s, v27.4s
	fmla	v18.4s, v23.4s, v27.4s
	fmla	v6.4s, v25.4s, v26.4s
	fmla	v16.4s, v24.4s, v26.4s
	fmla	v7.4s, v23.4s, v26.4s
	fmla	v17.4s, v25.4s, v1.4s
	fmla	v5.4s, v24.4s, v1.4s
	fmla	v2.4s, v23.4s, v1.4s

I don't know why GCC issues those dup instructions...
Comment 33 Gael Guennebaud 2019-01-29 12:30:33 UTC
OK, so now if I play with inline asm and replace vfmaq_laneq_f32 by:

         if(LaneID==0)  asm("fmla %0.4s, %1.4s, %2.s[0]\n" : "+w" (c) : "w" (a), "w" (b) :  );
    else if(LaneID==1)  asm("fmla %0.4s, %1.4s, %2.s[1]\n" : "+w" (c) : "w" (a), "w" (b) :  );
    else if(LaneID==2)  asm("fmla %0.4s, %1.4s, %2.s[2]\n" : "+w" (c) : "w" (a), "w" (b) :  );
    else if(LaneID==3)  asm("fmla %0.4s, %1.4s, %2.s[3]\n" : "+w" (c) : "w" (a), "w" (b) :  );

I eventually get 18.8 GFlops with the following generated ASM:

	ldr	q23, [x1, 192]
	add	x30, x1, 496
	ldr	q22, [x2, 64]
	fmla v20.4s, v23.4s, v22.s[0]
	fmla v17.4s, v23.4s, v22.s[1]
	fmla v6.4s, v23.4s, v22.s[2]
	fmla v3.4s, v23.4s, v22.s[3]
	ldr	q23, [x1, 208]
	fmla v19.4s, v23.4s, v22.s[0]
	fmla v16.4s, v23.4s, v22.s[1]
	fmla v5.4s, v23.4s, v22.s[2]
	fmla v2.4s, v23.4s, v22.s[3]
	ldr	q23, [x1, 224]
	fmla v18.4s, v23.4s, v22.s[0]
	fmla v7.4s, v23.4s, v22.s[1]
	fmla v4.4s, v23.4s, v22.s[2]
	fmla v1.4s, v23.4s, v22.s[3]
	prfm pldl1keep, [x30]

So now I have to teach GCC to load A0, A1, and A2 into three distinct registers (as it did before I added the inline asm... damn register allocators!)
Comment 34 Gael Guennebaud 2019-01-29 12:41:52 UTC
For the dup, here is how GCC implements vfmaq_laneq_f32:


__extension__ static __inline float32x4_t __attribute__ ((__always_inline__))
vfmaq_laneq_f32 (float32x4_t __a, float32x4_t __b,
	         float32x4_t __c, const int __lane)
{
  return __builtin_aarch64_fmav4sf (__b,
				    __aarch64_vdupq_laneq_f32 (__c, __lane),
				    __a);
}


I'll report a bug to GCC.
Comment 35 Gael Guennebaud 2019-01-29 12:57:02 UTC
GCC's bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89101
Comment 36 Renjie Liu 2019-01-29 13:45:29 UTC
I think you're right, vfmaq_laneq_f32 is better than vfmaq_lane_f32, and it makes the code easier as well. :)

Thanks for catching this!

For the GCC part, I really have no idea why it happens. I compiled with Clang 4.0.1, and it seems that even with vfmaq_lane_f32 it produces the right code:

#begin step of gebp micro kernel 3pX4
        //NO_APP
        //APP
        #Note: these asm comments work around bug 935!
        //NO_APP
        //APP
        prfm pldl1keep, [x19]

        //NO_APP
        fmla    v19.4s, v20.4s, v23.s[0]
        fmla    v17.4s, v20.4s, v23.s[1]
        fmla    v5.4s, v20.4s, v23.s[2]
        fmla    v3.4s, v20.4s, v23.s[3]
        ldp     q22, q20, [x13, #32]
        fmla    v18.4s, v21.4s, v23.s[0]
        fmla    v7.4s, v21.4s, v23.s[1]
        fmla    v4.4s, v21.4s, v23.s[2]
        fmla    v1.4s, v21.4s, v23.s[3]
        fmla    v16.4s, v22.4s, v23.s[0]
        fmla    v6.4s, v22.4s, v23.s[1]
        fmla    v2.4s, v22.4s, v23.s[2]
        fmla    v0.4s, v22.4s, v23.s[3]
        ldp     q21, q22, [x13, #64]
        ldr     q23, [x16, #16]
        add     x19, x13, #352          // =352
        //APP
        #end step of gebp micro kernel 3pX4


Thanks for filing the bug btw.
Comment 37 Gael Guennebaud 2019-01-29 14:09:38 UTC
Created attachment 922 [details]
Speed up gebp on arm64

Here are my local changes, which allowed me to reach 19.3 GFlops (the current Eigen version reaches 11.5). I also added faster support for double.
Comment 38 Gael Guennebaud 2019-01-29 14:24:49 UTC
Created attachment 923 [details]
Fix perf with arm64 and gcc, add explicit support for double

Here is the patch I plan to apply.
Comment 39 Renjie Liu 2019-01-29 14:40:57 UTC
thanks for the fix!
Comment 40 Gael Guennebaud 2019-01-30 10:54:06 UTC
Applied: https://bitbucket.org/eigen/eigen/commits/e2b2a2b25718/
Comment 41 Nobody 2019-12-04 18:12:27 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new bug through this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/1633.
