The following code:

    void foo(float* A, float* B, int iterations) {
      Eigen::Matrix<float, 3, 3> R1, R2;
      for (int i = 0; i < 9; ++i) {
        R1.data()[i] = A[i];
        R2.data()[i] = B[i];
      }
      for (int i = 0; i < iterations; ++i) {
        R2 = R2 * R1;
      }
    }

runs about 3x slower with Eigen @ HEAD compared to our internal branch (a mix of 3.2 and many backports from 3.3 and 3.4). Have there been any significant changes to this code path? Where should I be looking?
I should mention that performance for 4x4 matrices seems unchanged:

    @HEAD:
    Benchmark                          Time(ns)  CPU(ns)  Iterations
    -----------------------------------------------------------------
    BM_Matrix3x3MultiplyFloatsEigen        21.4     21.4    32458344
    BM_Matrix4x4MultiplyFloatsEigen        5.33     5.33   100000000

    Internal Eigen fork:
    Benchmark                          Time(ns)  CPU(ns)  Iterations
    -----------------------------------------------------------------
    BM_Matrix3x3MultiplyFloatsEigen        6.57     6.58   100000000
    BM_Matrix4x4MultiplyFloatsEigen        5.32     5.33   100000000
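The BM_ prefixes and the Time(ns)/CPU(ns)/Iterations columns look like Google Benchmark output; for context, a minimal sketch of a harness of that kind (the input values and the per-call iteration count here are placeholders, not the actual benchmark):

    #include <benchmark/benchmark.h>

    void foo(float* A, float* B, int iterations);  // the function from the report above

    static void BM_Matrix3x3MultiplyFloatsEigen(benchmark::State& state) {
      float A[9], B[9];
      for (int i = 0; i < 9; ++i) {  // arbitrary placeholder inputs
        A[i] = 0.1f * (i + 1);
        B[i] = 0.2f * (i + 1);
      }
      for (auto _ : state)
        foo(A, B, 1);
    }
    BENCHMARK(BM_Matrix3x3MultiplyFloatsEigen);
    BENCHMARK_MAIN();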
Do you have the same drop from any (public) ancestor of HEAD as well? In that case you can try to bisect (https://www.mercurial-scm.org/repo/hg/help/bisect) to a version where the performance drops significantly.

Running the lazy_gemm benchmark from bench/perf_monitoring, I don't see any significant drop for 3x3 gemm from any of the revisions listed in changesets.txt to the current HEAD. This is not quite the same as your benchmark, of course. I was testing with g++-5.5 on a Core i5-4210U (with -march=native -O3).
I cannot think of any relevant changes. Looking at the generated assembly will very likely be insightful. Also, make sure R2 is fully used, for instance by making foo return R2.sum().
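A minimal sketch of that suggestion applied to the function from the report; returning R2.sum() consumes the result, so the compiler cannot discard the products:

    #include <Eigen/Core>

    float foo(float* A, float* B, int iterations) {
      Eigen::Matrix<float, 3, 3> R1, R2;
      for (int i = 0; i < 9; ++i) {
        R1.data()[i] = A[i];
        R2.data()[i] = B[i];
      }
      for (int i = 0; i < iterations; ++i)
        R2 = R2 * R1;
      return R2.sum();  // force R2 to be fully used
    }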
Thanks for the feedback. Let me try to narrow it down further.
I confirm that with 3.2 the 3x3 product is more than 2x faster. I get:

    3.2:      3x3: 6.6   4x4: 5.0
    3.3/HEAD: 3x3: 15.0  4x4: 5.0

I'll look at the asm later.
Same issue without aliasing (C.noalias() = A*B). The problem is that the following function:

    template<typename Kernel>
    struct dense_assignment_loop<Kernel, DefaultTraversal, InnerUnrolling>
    {
      EIGEN_DEVICE_FUNC static EIGEN_STRONG_INLINE void run(Kernel &kernel)
      {
        typedef typename Kernel::DstEvaluatorType::XprType DstXprType;
        const Index outerSize = kernel.outerSize();
        for(Index outer = 0; outer < outerSize; ++outer)
          copy_using_evaluator_DefaultTraversal_InnerUnrolling<Kernel, 0, DstXprType::InnerSizeAtCompileTime>::run(kernel, outer);
      }
    };

is not inlined starting with 3.3. Actually, it was visible in my (old) bench: http://eigen.tuxfamily.org/perf_monitoring/ggaelmacbook26_gcc/haswell-fma-gcc-slazy_gemm.html

Replacing EIGEN_STRONG_INLINE with EIGEN_ALWAYS_INLINE does the trick with both gcc and clang, but that's of course not an option! I'm clueless...
The regression started with:

    changeset:   9175:abc7a3600098
    user:        Gael Guennebaud <g.gael@free.fr>
    date:        Wed Jun 15 00:01:16 2016 +0200
    summary:     Include the cost of stores in unrolling (also fix infinite unrolling with expressions costing 0, like Constant)

So this means that compiling with an increased unrolling limit (-DEIGEN_UNROLLING_LIMIT=110) does fix the issue (the default is 100). I did not spot it because the compiler fully unrolled the for loop anyway, but it looks like our meta unroller does a better job than the compiler, which is surprising to me. Anyway, I simply propose to increase the default unrolling limit, which makes sense in regard to the aforementioned changeset.
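For anyone trying this, the limit can also be raised in code rather than on the command line; a minimal sketch (110 is just the value tested above):

    // Must be defined before the first Eigen header is included;
    // the built-in default is 100.
    #define EIGEN_UNROLLING_LIMIT 110
    #include <Eigen/Core>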
Partial fix: https://bitbucket.org/eigen/eigen/commits/53bf4b24aba5/

Summary: Bug 1615: slightly increase the default unrolling limit to compensate for changeset abc7a3600098. This solves a performance regression with clang and 3x3 matrix products.

However, there is still a regression with gcc when the product aliases, i.e.:

    B = B*A;   <=>   T = B*A; B = T;

This time the culprit is:

    changeset:   8989:6c2dc56e73b3
    summary:     Bug 256: enable vectorization with unaligned loads/stores.

So compiling with -DEIGEN_UNALIGNED_VECTORIZE=0 does fix this precise issue, but introduces several other regressions! Looking at the assembly, the "problem" is that with unaligned vectorization, the copy "B=T" is vectorized whereas the product B*A is evaluated one coefficient at a time. The consequence is that GCC really allocates T on the stack and performs an explicit copy after the product. Without explicit vectorization, GCC manages to keep parts of T in registers, with a better interleaving of mul/add and load/stores.

I don't have a good workaround for that last issue, and since it only concerns products with aliasing (which are not recommended for performance anyway), I propose wontfix.
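For completeness, the same workaround expressed in code rather than as a compiler flag; as noted above, it trades this regression for several others, so it is only for experimentation:

    // Disables the unaligned load/store vectorization introduced by
    // changeset 6c2dc56e73b3; define before any Eigen header.
    #define EIGEN_UNALIGNED_VECTORIZE 0
    #include <Eigen/Core>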
Thanks for investigating and fixing, Gael. I'll ask the Google team in question if they can switch to the noalias version.
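For reference, one possible alias-free rewrite of the loop from the original report (a hypothetical sketch, not necessarily what the team will ship): ping-pong between two matrices so the destination of each product is always distinct from its operands:

    #include <Eigen/Core>

    float foo_noalias(float* A, float* B, int iterations) {
      Eigen::Matrix<float, 3, 3> R1, buf[2];
      for (int i = 0; i < 9; ++i) {
        R1.data()[i] = A[i];
        buf[0].data()[i] = B[i];
      }
      int cur = 0;
      for (int i = 0; i < iterations; ++i) {
        // The destination is distinct from both operands, so Eigen needs
        // neither a hidden temporary nor a copy-back.
        buf[1 - cur].noalias() = buf[cur] * R1;
        cur = 1 - cur;
      }
      return buf[cur].sum();  // the result lives in buf[cur]
    }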
FYI: the increase in the unrolling limit fixes the issue for float, but not for double in our case.
-- GitLab Migration Automatic Message --

This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity. You can subscribe and participate further in the new issue through this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/1615.