Created attachment 413 [details]
Patch against the latest version of the code
Unaligned SSE loads/stores are slower than the corresponding aligned instructions, even when the underlying address is aligned.
The attached patch attempts to use aligned loads/stores as much as possible in the gebp_kernel. This results in a performance gain of a few percent on the Eigen matrix-matrix benchmark as depicted by the attached before/after pictures (run on Sandy Bridge with gcc 4.6).
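To make the distinction concrete, here is a minimal sketch of the two instruction forms. It is not taken from the patch: `sum4` is a hypothetical helper, and the scalar fallback is only there so the sketch also builds on targets without SSE. `_mm_load_ps` requires a 16-byte aligned address, while `_mm_loadu_ps` accepts any address.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define SKETCH_HAS_SSE 1
#else
#define SKETCH_HAS_SSE 0
#endif

// Hypothetical helper (not Eigen code): sums n floats, n a multiple of 4.
// When `aligned` is true, `data` must be 16-byte aligned.
inline float sum4(const float* data, std::size_t n, bool aligned) {
  assert(n % 4 == 0);
#if SKETCH_HAS_SSE
  __m128 acc = _mm_setzero_ps();
  for (std::size_t i = 0; i < n; i += 4) {
    // Aligned form vs. unaligned form of the same 4-float load.
    __m128 v = aligned ? _mm_load_ps(data + i) : _mm_loadu_ps(data + i);
    acc = _mm_add_ps(acc, v);
  }
  alignas(16) float lanes[4];
  _mm_store_ps(lanes, acc);
  return lanes[0] + lanes[1] + lanes[2] + lanes[3];
#else
  // Scalar fallback so the sketch still compiles without SSE.
  float s = 0.0f;
  for (std::size_t i = 0; i < n; ++i) s += data[i];
  return s;
#endif
}
```

Both forms return the same values on an aligned buffer; only the unaligned form is valid when the pointer is offset by one float.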
Created attachment 414 [details]
Results of the Matrix-Matrix benchmark before applying the patch
Created attachment 415 [details]
Results of the Matrix-Matrix benchmark after applying the patch
The drawback of this patch is that it has a non-negligible impact on both compilation time and binary code size, since it duplicates each instantiation of a matrix-matrix product. Since the speedup seems to be rather small, I'm not sure it's a reasonable change.
Created attachment 423 [details]
merged benchmark plots
Looking at this old benchmark (http://download.tuxfamily.org/eigen/btl-results-110323/matrix_matrix.pdf), it seems that MKL was (and perhaps still is) doing something similar, since it exhibits strong spikes for sizes allowing aligned accesses.
Two more remarks:
- there already exist conditional versions of pstore/pload; they are called pstoret and ploadt
- to avoid binary code duplication, maybe a dynamic branching in the kernel would have a negligible impact
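As a rough illustration of what a "conditional" load/store looks like, the sketch below selects the aligned or unaligned path from a template argument. This is only in the spirit of ploadt/pstoret and is not Eigen's implementation: `Packet4f`, `load_t`, `store_t` and the `Mode` enum are hypothetical stand-ins, and plain `memcpy` is used in place of real SIMD instructions.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-ins, not Eigen's definitions.
struct Packet4f { float v[4]; };
enum Mode { Unaligned = 0, Aligned = 1 };

inline bool is_aligned16(const void* p) {
  return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

inline Packet4f load_aligned(const float* from) {
  assert(is_aligned16(from));  // the aligned path may assume 16-byte alignment
  Packet4f p;
  std::memcpy(p.v, from, sizeof p.v);
  return p;
}

inline Packet4f load_unaligned(const float* from) {
  Packet4f p;
  std::memcpy(p.v, from, sizeof p.v);  // valid for any address
  return p;
}

// Conditional load: the path is chosen at compile time from M.
template <int M>
inline Packet4f load_t(const float* from) {
  if (M == Aligned) return load_aligned(from);
  else return load_unaligned(from);
}

// Conditional store, same idea.
template <int M>
inline void store_t(float* to, const Packet4f& p) {
  if (M == Aligned) assert(is_aligned16(to));
  std::memcpy(to, p.v, sizeof p.v);
}
```

Because the mode is a template argument, a kernel written against `load_t<M>` gets one instantiation per mode, which is where the code-size concern above comes from.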
(In reply to comment #4)
> Created attachment 423 [details]
> merged benchmark plots
The impact of that patch looks indeed very minor (there are even some outliers for which performance slightly drops). If this means duplicating the binary size and compile-time I would rather not do it.
Related to the discussion you started in bug 721: How would the impact be on other architectures?
(In reply to comment #6)
> - to avoid binary code duplication, maybe a dynamic branching in the kernel
> would have a negligible impact
Have you checked the impact of that idea (on performance and binary size)?
(In reply to comment #7)
> (In reply to comment #4)
> > Created attachment 423 [details]
> > merged benchmark plots
> The impact of that patch looks indeed very minor (there are even some outliers
> for which performance slightly drops). If this means duplicating the binary
> size and compile-time I would rather not do it.
> Related to the discussion you started in bug 721: How would the impact be on
> other architectures?
We have to check, especially with AVX and NEON.
> (In reply to comment #6)
> > - to avoid binary code duplication, maybe a dynamic branching in the kernel
> > would have a negligible impact
> Have you checked the impact of that idea (on performance and binary size)?
No, this was more of a to-do item, but my guess is that the overhead should be negligible with this approach.
Created attachment 433 [details]
I have attached an updated version of the patch: it relies on the existing ploadt and pstoret primitives instead of introducing new ones that do the same thing. I also updated the code to choose between aligned and unaligned memory accesses in the gebp_kernel itself. This reduces the size of the binary.
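The general shape of such a choice made inside the kernel can be sketched as follows. This is an assumption about the approach, not the code from the patch: `axpy_kernel`, `axpy_inner` and `is_aligned16` are hypothetical names, and the inner loop is scalar where a real kernel would use aligned or unaligned packet accesses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

inline bool is_aligned16(const void* p) {
  return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Inner loop, instantiated twice. In a real kernel the two instantiations
// would differ in the load/store instructions they use.
template <bool AlignedAccess>
void axpy_inner(float* dst, const float* src, float alpha, std::size_t n) {
  if (AlignedAccess) assert(is_aligned16(dst) && is_aligned16(src));
  for (std::size_t i = 0; i < n; ++i) dst[i] += alpha * src[i];
}

// Single non-template entry point: alignment is tested once per call,
// so callers are not duplicated per alignment mode.
// Returns true if the aligned path was taken (for illustration only).
inline bool axpy_kernel(float* dst, const float* src, float alpha, std::size_t n) {
  const bool aligned = is_aligned16(dst) && is_aligned16(src);
  if (aligned) axpy_inner<true>(dst, src, alpha, n);
  else axpy_inner<false>(dst, src, alpha, n);
  return aligned;
}
```

Only the inner loop exists in two versions here; everything that calls `axpy_kernel` is compiled once, at the cost of one branch per kernel call.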
I have rerun the matrix-matrix benchmarks on Sandy Bridge with gcc 4.8.3 and the updated patch: the code is now a little faster for matrices smaller than 2328x2328 and a little slower for bigger matrices. At this point it looks like a wash; I'll revisit this patch later once the AVX code is merged.
-- GitLab Migration Automatic Message --
This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity.
You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/724.