This bugzilla service is closed. All entries have been migrated to

Bug 303

Summary:   bigMat.block(...).array()*mat.array() is slower than mat.array()*mat.array()
Product:   Eigen
Component: Core - expression templates
Version:   3.0
Hardware:  x86 - 64-bit
OS:        Linux
Status:    NEW
Severity:  Unknown
Priority:  ---
Reporter:  Philippe Marti <philippe.marti>
Assignee:  Nobody <eigen.nobody>
CC:        gael.guennebaud, jacob.benoit.1

Attachments
  Description               Flags
  Program to get timings    none

Description Philippe Marti 2011-06-23 14:36:02 UTC
I was doing some simple benchmarks and got surprising timings (at least to me).

With matrices bigMat (500x20), mat1 (100x20) and mat2 (100x20), I compute:

sol = bigMat.block(0,0,100,20).array() * mat2.array();

But compared to:

sol = mat1.array() * mat2.array();

The first computation is ~30% slower on my machine. But:

sol = bigMat.block(0,0,100,20).transpose().array() * mat2.transpose().array();

takes about the same time. Is that really what should be expected?

Comment 1 Jitse Niesen 2011-09-11 12:11:28 UTC
Created attachment 208
Program to get timings

On my computer (Intel Core i5-2410M, 2.30 GHz), using gcc with -O2, the attached program produces the following timings:

Multiplication 1 takes 2.82 usec 
Multiplication 2 takes 1.25 usec 
Multiplication 3 takes 2.73 usec 

Here, 'multiplication 1' refers to the multiplication with block(), 'multiplication 2' refers to the multiplication without block() and 'multiplication 3' refers to the multiplication with block() and transpose().

It's not surprising to me that the second formulation (without the block) is faster. The second formulation compiles to something like this (ignoring vectorization and some additional optimizations):

   for (int i = 0; i < 100*20; ++i)
      *(sol + i) = *(mat1 + i) * *(mat2 + i);

The first formulation yields something like this:

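   // Eigen stores matrices column-major by default, so the block's columns are
   // 500 doubles apart inside bigMat, while mat2 and sol use a stride of 100.
   for (int col = 0; col < 20; ++col)
      for (int row = 0; row < 100; ++row)
         *(sol + col * 100 + row) = *(bigMat + col * 500 + row) * *(mat2 + col * 100 + row);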

The point is that the block's columns are not contiguous in memory, so the first formulation requires a double loop, while the second one translates into a single loop. The single loop executes fewer branches, so it is faster.
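For anyone who wants to reproduce the two loop shapes without Eigen, here is a minimal stand-alone sketch, assuming plain column-major arrays of double. The names colMajorIndex, productFlat and productBlock are hypothetical helpers, not Eigen API, and like the pseudocode above the sketch ignores vectorization and unrolling.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Column-major storage (Eigen's default): coefficient (row, col) of a matrix
// whose columns are `outerStride` doubles apart lives at col * outerStride + row.
inline std::size_t colMajorIndex(std::size_t row, std::size_t col, std::size_t outerStride) {
    return col * outerStride + row;
}

// "Multiplication 2": both operands are plain, contiguous matrices, so the
// coefficient-wise product is one flat loop over all n = rows * cols entries.
void productFlat(const double* a, const double* b, double* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] * b[i];
}

// "Multiplication 1": `a` is a rows x cols block of a larger column-major
// matrix whose columns are `aStride` doubles apart (500 for bigMat). The
// block's columns are not adjacent in memory, so each column gets its own
// inner loop. `b` and `out` are contiguous rows x cols matrices.
void productBlock(const double* a, std::size_t aStride,
                  const double* b, double* out,
                  std::size_t rows, std::size_t cols) {
    assert(aStride >= rows);  // a block cannot be taller than its parent's columns
    for (std::size_t col = 0; col < cols; ++col)
        for (std::size_t row = 0; row < rows; ++row)
            out[colMajorIndex(row, col, rows)] =
                a[colMajorIndex(row, col, aStride)] * b[colMajorIndex(row, col, rows)];
}
```

If mat1 holds a copy of the first 100 rows of bigMat, then productBlock(bigMat, 500, mat2, sol, 100, 20) and productFlat(mat1, mat2, sol, 100 * 20) fill sol with the same coefficients; only the memory traversal differs. Inside the block, the last coefficient of column 0 sits at offset 99 and the first coefficient of column 1 at offset 500, which is why a single flat loop cannot be used.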

This also explains why the difference gets (relatively) smaller if the matrices get bigger.
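One rough way to see this trend is to count loop-condition checks, i.e. the conditional branches spent purely on loop control, for the two loop shapes. The sketch below assumes no vectorization or unrolling, and its function names are hypothetical. A loop with k iterations evaluates its condition k + 1 times, so the nested loop pays about 2 extra checks per column and its relative overhead is roughly 2/rows: it shrinks as the columns get longer.

```cpp
#include <cassert>
#include <cstddef>

// Condition checks for one flat loop over rows * cols coefficients.
std::size_t flatLoopChecks(std::size_t rows, std::size_t cols) {
    return rows * cols + 1;
}

// Condition checks for an outer loop over columns (cols + 1 checks) plus
// one inner loop over rows per column (rows + 1 checks each).
std::size_t nestedLoopChecks(std::size_t rows, std::size_t cols) {
    return (cols + 1) + cols * (rows + 1);
}

// Extra loop-control checks of the nested loop, relative to the flat loop.
double relativeOverhead(std::size_t rows, std::size_t cols) {
    assert(rows > 0 && cols > 0);
    const double flat = static_cast<double>(flatLoopChecks(rows, cols));
    const double nested = static_cast<double>(nestedLoopChecks(rows, cols));
    return (nested - flat) / flat;
}
```

For the 100x20 case this count gives 2041 versus 2001 checks, only about 2% extra, so it illustrates the direction of the effect rather than the full size of the measured gap.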

That being said, the difference is bigger than I'd expected. I'll leave it open for our performance gurus to decide whether there is an issue here that needs to be fixed or not.
Comment 2 Nobody 2019-12-04 10:54:52 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: