Created attachment 332: NEON Duplicate lane load

NEON implementations of ploaddup can be improved by using the vld1_dup_*() intrinsics instead of splitting the scalar loads from the vdup_n_*() splat/duplication. Patch for Eigen/src/Core/arch/NEON/PacketMath.h attached.

I found gcc 4.6.3 to go from (pseudo asm):

    ldmia.w r0, {r2, r3}
    vdup.32 d0, r2
    vdup.32 d1, r3

to

    vld1.32 {d0[]}, [r0]!
    vld1.32 {d1[]}, [r0]
I don't know enough ARM & NEON, so I'm not sure I understand why this version is better. vdup seems to be exactly what we want. The fact that GCC added a register load instruction seems to be unrelated?
Sorry for the slow response. I admit this patch is very minor, but the vld1_dup_*() intrinsics were provided with exactly the ploaddup style of operation in mind. They discourage the compiler from going through the general-purpose registers (which adds the cost of transferring the values to the NEON registers) and from loading scalar floats in a way that may use the VFP pipeline (which causes stalls when the NEON pipeline takes over again).
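To illustrate the point, here is a minimal sketch of a float ploaddup written both ways. It is not the attached patch: the function names load_dup4_split and load_dup4_fused and the store into a plain float array are assumptions made for the example, and a scalar fallback is included only so the snippet also builds on non-ARM targets. Only the NEON intrinsics themselves (vdup_n_f32, vld1_dup_f32, vcombine_f32, vst1q_f32) are real.

```cpp
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper: scalar loads followed by a splat, as in the old code.
// Writes {from[0], from[0], from[1], from[1]} to out.
inline void load_dup4_split(const float* from, float* out) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
  // The compiler is free to load from[0] and from[1] into general-purpose
  // or VFP registers first, then move them across to NEON for the vdup.
  float32x2_t lo = vdup_n_f32(from[0]);
  float32x2_t hi = vdup_n_f32(from[1]);
  vst1q_f32(out, vcombine_f32(lo, hi));
#else
  // Scalar fallback so the sketch compiles on targets without NEON.
  out[0] = out[1] = from[0];
  out[2] = out[3] = from[1];
#endif
}

// Hypothetical helper: load and duplicate in a single NEON instruction.
// Produces the same result as load_dup4_split.
inline void load_dup4_fused(const float* from, float* out) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
  // vld1_dup_f32 loads one float from memory and replicates it to both
  // lanes of a 64-bit register, so the value never leaves the NEON unit.
  float32x2_t lo = vld1_dup_f32(from);
  float32x2_t hi = vld1_dup_f32(from + 1);
  vst1q_f32(out, vcombine_f32(lo, hi));
#else
  // Scalar fallback so the sketch compiles on targets without NEON.
  out[0] = out[1] = from[0];
  out[2] = out[3] = from[1];
#endif
}
```

Both helpers return the same values; the difference is only in which instructions the compiler is likely to pick, so the generated assembly should be checked on the actual toolchain.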
Alright: https://bitbucket.org/eigen/eigen/commits/03c0153b9f2f/

Changeset: 03c0153b9f2f
User: Simon Pilgrim
Date: 2013-06-23 14:13:21
Summary: Fix bug 590: NEON Duplicate lane load
-- GitLab Migration Automatic Message -- This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/590.