1615
2018-10-19 21:45:53 +0000
Performance of (aliased) matrix multiplication with fixed size 3x3 matrices slow
2019-12-04 18:04:01 +0000
1
1
1
Unclassified
Eigen
Core - matrix products
3.4 (development)
All
All
RESOLVED
FIXED
Normal
Performance Problem
---
814
0
rmlarsen
eigen.nobody
chtz
gael.guennebaud
rmlarsen
oldest_to_newest
7595
0
rmlarsen
2018-10-19 21:45:53 +0000
The following code:
void foo(float* A, float* B, int iterations) {
  Eigen::Matrix<float, 3, 3> R1, R2;
  for (int i = 0; i < 9; ++i) {
    R1.data()[i] = A[i];
    R2.data()[i] = B[i];
  }
  for (int i = 0; i < iterations; ++i) {
    R2 = R2 * R1;
  }
}
runs about 3x slower with Eigen @ HEAD than with our internal branch (a mix of 3.2 and many backports from 3.3 and 3.4). Have there been any significant changes to this code? Where should I be looking?
7596
1
rmlarsen
2018-10-19 21:48:09 +0000
I should mention that performance for 4x4 matrices seems unchanged:
@Head:
Benchmark                          Time(ns)  CPU(ns)  Iterations
----------------------------------------------------------------
BM_Matrix3x3MultiplyFloatsEigen        21.4     21.4    32458344
BM_Matrix4x4MultiplyFloatsEigen        5.33     5.33   100000000
Internal Eigen fork:
Benchmark                          Time(ns)  CPU(ns)  Iterations
----------------------------------------------------------------
BM_Matrix3x3MultiplyFloatsEigen        6.57     6.58   100000000
BM_Matrix4x4MultiplyFloatsEigen        5.32     5.33   100000000
7597
2
chtz
2018-10-19 23:18:14 +0000
Do you have the same drop from any (public) ancestor of HEAD as well? In that case you can try to bisect (https://www.mercurial-scm.org/repo/hg/help/bisect) to a version where the performance drops significantly.
Running the lazy_gemm from bench/perf_monitoring, I don't see any significant drop for 3x3 gemm, from any of the revisions listed in changesets.txt to the current head. This is not quite the same as your benchmark, of course.
I was testing with g++-5.5 on a Core i5-4210U (with -march=native -O3).
7600
3
gael.guennebaud
2018-10-20 17:25:53 +0000
I cannot think of any relevant changes. Looking at the generated assembly will very likely be insightful. Also, make sure R2 is fully used, for instance by making foo return R2.sum().
7607
4
rmlarsen
2018-10-22 20:53:19 +0000
Thanks for the feedback. Let me try to narrow it down further.
7777
5
gael.guennebaud
2018-12-12 17:30:09 +0000
I confirm that with 3.2 the 3x3 product is more than 2x faster. I get:
3.2:
3x3: 6.6
4x4: 5.0
3.3/head:
3x3: 15.0
4x4: 5.0
I'll look at the asm later.
7778
6
gael.guennebaud
2018-12-12 20:09:20 +0000
Same issue without aliasing (C.noalias() = A*B).
The problem is that the following function:
template<typename Kernel>
struct dense_assignment_loop<Kernel, DefaultTraversal, InnerUnrolling>
{
  EIGEN_DEVICE_FUNC static EIGEN_STRONG_INLINE void run(Kernel &kernel)
  {
    typedef typename Kernel::DstEvaluatorType::XprType DstXprType;
    const Index outerSize = kernel.outerSize();
    for(Index outer = 0; outer < outerSize; ++outer)
      copy_using_evaluator_DefaultTraversal_InnerUnrolling<Kernel, 0, DstXprType::InnerSizeAtCompileTime>::run(kernel, outer);
  }
};
is not inlined starting with 3.3. Actually it was visible in my (old) bench:
http://eigen.tuxfamily.org/perf_monitoring/ggaelmacbook26_gcc/haswell-fma-gcc-slazy_gemm.html
Replacing EIGEN_STRONG_INLINE by EIGEN_ALWAYS_INLINE does the trick with both gcc and clang, but that's of course not an option! I'm clueless...
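For readers unfamiliar with this code path, here is a simplified, Eigen-free sketch of the same shape: a runtime loop over the outer dimension around a compile-time-unrolled inner dimension. All names (unrolled_inner, assign_inner_unrolled) are hypothetical and this is not Eigen's implementation, only an illustration of the technique.

```cpp
#include <cstddef>

// Hypothetical stand-in for copy_using_evaluator_DefaultTraversal_InnerUnrolling:
// recursion on the compile-time index expands into straight-line code.
template <std::size_t Index, std::size_t Stop>
struct unrolled_inner {
  template <typename Assign>
  static inline void run(Assign& assign, std::size_t outer) {
    assign(outer, Index);                                 // one coefficient
    unrolled_inner<Index + 1, Stop>::run(assign, outer);  // next inner index
  }
};

// Recursion terminator: Index == Stop, nothing left to assign.
template <std::size_t Stop>
struct unrolled_inner<Stop, Stop> {
  template <typename Assign>
  static inline void run(Assign&, std::size_t) {}
};

// Runtime loop over the outer dimension, unrolled inner dimension.
template <std::size_t Inner, typename Assign>
inline void assign_inner_unrolled(Assign& assign, std::size_t outer_size) {
  for (std::size_t outer = 0; outer < outer_size; ++outer)
    unrolled_inner<0, Inner>::run(assign, outer);
}
```

In this sketch the plain inline keyword is only a hint, so the benefit of the unrolling depends on the compiler actually inlining every run() call, which is the point at issue in the comment above.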
7780
7
gael.guennebaud
2018-12-13 09:20:43 +0000
The regression started with:
changeset: 9175:abc7a3600098
user: Gael Guennebaud <g.gael@free.fr>
date: Wed Jun 15 00:01:16 2016 +0200
summary: Include the cost of stores in unrolling (also fix infinite unrolling with expression costing 0 like Constant)
So this means that compiling with an increased unrolling limit (-DEIGEN_UNROLLING_LIMIT=110) does fix the issue (the default is 100).
I did not spot it because the compiler fully unrolled the for loop anyway, but it looks like our meta unroller does a better job than the compiler, which is surprising to me.
Anyway, I simply propose to increase the default unrolling limit, which makes sense given the aforementioned changeset.
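To make the reasoning concrete, here is a toy model of a cost-based unrolling decision. The function name and the per-coefficient numbers used below are assumptions chosen for illustration, not values read from Eigen's source; only the limits 100 and 110 come from this thread.

```cpp
// Illustrative model only: unroll when the estimated total cost of the
// assignment (evaluation plus store, per coefficient) fits under the limit.
constexpr bool should_unroll(int coeffs, int cost_per_coeff, int store_cost, int limit) {
  return coeffs * (cost_per_coeff + store_cost) <= limit;
}
```

With a hypothetical cost of 11 per coefficient, a 9-coefficient assignment costs 99 and fits under a limit of 100. Counting a store cost of 1 per coefficient raises it to 108, which no longer fits under 100 but does fit under 110.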
7781
8
gael.guennebaud
2018-12-13 10:09:17 +0000
Partial fix:
https://bitbucket.org/eigen/eigen/commits/53bf4b24aba5/
Summary: Bug 1615: slightly increase the default unrolling limit to compensate for changeset abc7a3600098.
This solves a performance regression with clang and 3x3 matrix products.
However, there is still a regression with gcc when the product aliases, i.e.:
B = B*A;
<=>
T = B*A;
B = T;
This time the culprit is:
changeset: 8989:6c2dc56e73b3
summary: Bug 256: enable vectorization with unaligned loads/stores.
So compiling with -DEIGEN_UNALIGNED_VECTORIZE=0 does fix this precise issue, but introduces several other regressions! Looking at the assembly, the "problem" is that with unaligned vectorization, the copy "B=T" is vectorized whereas the product B*A is evaluated one coeff at a time. The consequence is that GCC really allocates T on the stack and performs an explicit copy after the product.
Without explicit vectorization, GCC manages to keep parts of T in registers with a better interleaving of mul/add and load/stores.
I don't have a good workaround for that last issue, and since this only concerns products with aliasing (which are not recommended for performance anyway), I propose wontfix.
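As a reminder of why the aliased form needs the temporary T at all, here is a small Eigen-free sketch. The Mat3 type and both function names are hypothetical helpers, and the matrices are stored row-major purely for readability.

```cpp
#include <array>

using Mat3 = std::array<float, 9>;  // hypothetical row-major 3x3 matrix

// dst = lhs * rhs, written one coefficient at a time. Only correct when dst
// does not share storage with lhs or rhs.
inline void multiply_noalias(Mat3& dst, const Mat3& lhs, const Mat3& rhs) {
  for (int r = 0; r < 3; ++r) {
    for (int c = 0; c < 3; ++c) {
      float acc = 0.0f;
      for (int k = 0; k < 3; ++k) acc += lhs[r * 3 + k] * rhs[k * 3 + c];
      dst[r * 3 + c] = acc;
    }
  }
}

// B = B * A done safely: T = B * A, then B = T.
inline void multiply_in_place(Mat3& B, const Mat3& A) {
  Mat3 T;
  multiply_noalias(T, B, A);
  B = T;  // the extra copy discussed above
}
```

Calling multiply_noalias(B, B, A) directly would overwrite coefficients of B that later coefficients in the same row still need to read. When the destination is a distinct matrix C, the temporary and the copy are unnecessary, which is what C.noalias() = A*B expresses in Eigen.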
7783
9
rmlarsen
2018-12-13 17:16:18 +0000
Thanks for investigating and fixing, Gael. I'll ask the Google team in question if they can switch to the noalias version.
7841
10
rmlarsen
2019-01-10 21:46:48 +0000
FYI: The increase in unrolling limit fixes the issue for float, but not double in our case.
10059
11
eigen.nobody
2019-12-04 18:04:01 +0000
-- GitLab Migration Automatic Message --
This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity.
You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/1615.