This bugzilla service is closed. All entries have been migrated to

Bug 1615

Summary: Performance of (aliased) matrix multiplication with fixed size 3x3 matrices slow
Product: Eigen Reporter: Rasmus Munk Larsen <rmlarsen>
Component: Core - matrix productsAssignee: Nobody <eigen.nobody>
Severity: Performance Problem CC: chtz, gael.guennebaud, rmlarsen
Priority: Normal    
Version: 3.4 (development)   
Hardware: All   
OS: All   
Bug Depends on:    
Bug Blocks: 814    

Description Rasmus Munk Larsen 2018-10-19 21:45:53 UTC
The following code:

void foo(float* A, float* B, int iterations) {
  Eigen::Matrix<T, 3, 3> R1, R2;
  for (int i = 0; i < 9; ++i) {[i] = A[i];[i] = B[i];
  Eigen::Matrix<T, 3, 3> R1, R2;
  for (int i = 0; i < 9; ++i) {[i] = A[i];[i] = B[i];
  for (int i = 0; i < iterations; ++i) {
    R2 = R2 * R1;

runs about 3x slower with Eigen @ HEAD, compared to our internal branch (a mix of 3.2 and many backports from 3.3 and 3.4). Have there been any significant changes to the this code? Where should I be looking?
Comment 1 Rasmus Munk Larsen 2018-10-19 21:48:09 UTC
I should mention that performance for 4x4 matrices seems unchanged:


Benchmark                         Time(ns)        CPU(ns)     Iterations
BM_Matrix3x3MultiplyFloatsEigen         21.4           21.4     32458344  
BM_Matrix4x4MultiplyFloatsEigen          5.33           5.33   100000000  

Internal Eigen fork:

Benchmark                         Time(ns)        CPU(ns)     Iterations
BM_Matrix3x3MultiplyFloatsEigen          6.57           6.58   100000000  
BM_Matrix4x4MultiplyFloatsEigen          5.32           5.33   100000000  
Comment 2 Christoph Hertzberg 2018-10-19 23:18:14 UTC
Do you have the same drop from any (public) ancestor of HEAD as well? In that case you can try to bisect ( to a version where the performance drops significantly.

Running the lazy_gemm from bench/perf_monitoring, I don't see any significant drop for 3x3 gemm, from any of the revisions listed in changesets.txt to the current head. This is not quite the same as your benchmark, of course.

I was testing with g++-5.5 on an Core i5-4210U (with -march=native -O3).
Comment 3 Gael Guennebaud 2018-10-20 17:25:53 UTC
I cannot think about any relevant changes. Looking to the generated assembly will very likely be insightful. Also, make sure R2 is fully used, for instance by making foo returning R2.sum().
Comment 4 Rasmus Munk Larsen 2018-10-22 20:53:19 UTC
Thanks for the feedback. Let me try to narrow it down further.
Comment 5 Gael Guennebaud 2018-12-12 17:30:09 UTC
I confirm that with 3.2 the 3x3 product is more than x2 faster. I get:

3x3: 6.6
4x4: 5.0

3x3: 15.0
4x4:  5.0

I'll look at the asm later.
Comment 6 Gael Guennebaud 2018-12-12 20:09:20 UTC
Same issue without aliasing (C.noalias() = A*B).

The problem is that the following function:

template<typename Kernel>
struct dense_assignment_loop<Kernel, DefaultTraversal, InnerUnrolling>
  EIGEN_DEVICE_FUNC static EIGEN_STRONG_INLINE void run(Kernel &kernel)
    typedef typename Kernel::DstEvaluatorType::XprType DstXprType;

    const Index outerSize = kernel.outerSize();
    for(Index outer = 0; outer < outerSize; ++outer)
      copy_using_evaluator_DefaultTraversal_InnerUnrolling<Kernel, 0, DstXprType::InnerSizeAtCompileTime>::run(kernel, outer);

is not inlined starting with 3.3. Actually it was visible in my (old) bench:

Replacing EIGEN_STRONG_INLINE by EIGEN_ALWAYS_INLINE does the trick with both gcc and clang, but that's of course not an option! I'm clueless...
Comment 7 Gael Guennebaud 2018-12-13 09:20:43 UTC
The regression started with:

changeset:   9175:abc7a3600098
user:        Gael Guennebaud <>
date:        Wed Jun 15 00:01:16 2016 +0200
summary:     Include the cost of stores in unrolling (also fix infinite unrolling with expression costing 0 like Constant)

SO this means that compiling with increased unrolling limit (-DEIGEN_UNROLLING_LIMIT=110) does fix the issue (the default is 100).

I did not spotted it because the compiler fully unrolled the for loop anyway, but looks like our meta unroller does a better job than the compiler, which is surprising to me.

Anyway, I simply propose to increase the default unrolling limit, which makes sense in regards to the aforementioned changeset.
Comment 8 Gael Guennebaud 2018-12-13 10:09:17 UTC
Partial fix:
Summary:     Bug 1615: slightly increase the default unrolling limit to compensate for changeset abc7a3600098.
This solves a performance regression with clang and 3x3 matrix products.

However, there is still a regression with gcc when the product alias, i.e.:

B = B*A;


T = B*A;
B = T;

This time the culprit is:

changeset:   8989:6c2dc56e73b3
summary:     Bug 256: enable vectorization with unaligned loads/stores.

So compiling with -DEIGEN_UNALIGNED_VECTORIZE=0 does fix this precise issue, but introduce several other regressions! Looking at the assembly, the "problem" is that with unaligned vectorization, the copy "B=T" is vectorized whereas the product B*A is evaluated one coeff at a time. The consequence is that GCC really allocate T on the stack and perform an explicit copy after the product.

Without explicit vectorization, GCC manages to keep parts of T in registers with a better interleaving of mul/add and load/stores.

I don't have a good workaround for that last issue, and since this only concerns product with aliasing (that are not very recommended for performance anyway), I propose wontfix.
Comment 9 Rasmus Munk Larsen 2018-12-13 17:16:18 UTC
Thanks for investigating and fixing, Gael. I'll ask the Google team in question if they can switch to the noalias version.
Comment 10 Rasmus Munk Larsen 2019-01-10 21:46:48 UTC
FYI: The increase in unrolling limit fixes the issue for float, but not double in our case.
Comment 11 Nobody 2019-12-04 18:04:01 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: