Created attachment 543
On ARM, prefetches are vital for good performance. If I remove them, matrix products typically run about twice as slow!
The current prefetches aren't optimal. This patch fixes the 3px4 kernel, which is the one used for floats on ARM. The new prefetching pattern is both what makes sense given the access patterns and what actually runs fastest of all the prefetching patterns that I tried. It gives a speedup of more than 10% on both a Nexus 4 and a Nexus 5. It also unblocks a better understanding of cache tuning.
Comment on attachment 543
Fine with me.
-- GitLab Migration Automatic Message --
This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity.
You can subscribe and participate further in the new issue via this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/953.