Summary:  bigMat.block(...).array()*mat.array() is slower than mat.array()*mat.array()  

Description
Philippe Marti
2011-06-23 14:36:02 UTC
Created attachment 208
Program to get timings
On my computer (Intel Core i5-2410M, 2.30GHz) and using gcc with -O2, the following timings are produced by the attached program:
Multiplication 1 takes 2.82 usec
Multiplication 2 takes 1.25 usec
Multiplication 3 takes 2.73 usec
Here, 'multiplication 1' refers to the multiplication with block(), 'multiplication 2' refers to the multiplication without block() and 'multiplication 3' refers to the multiplication with block() and transpose().
It's not surprising to me that the second formulation (without the block) is faster. The second formulation compiles to something like this (ignoring vectorization and some additional optimizations):
for (int i = 0; i < 100*20; ++i)
    *(sol + i) = *(mat1 + i) * *(mat2 + i);
The first formulation yields something like this:
for (int row = 0; row < 100; ++row)
    for (int col = 0; col < 20; ++col)
        *(sol + row * 20 + col) = *(bigMat + row * 20 + col) * *(mat2 + row * 20 + col);
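The point is that the first formulation requires a double loop, because a block is in general not contiguous in memory, while the second one is translated into a single loop over all coefficients. The single loop has less branching and simpler index arithmetic, so it's faster.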
This also explains why the difference gets (relatively) smaller if the matrices get bigger.
That being said, the difference is bigger than I'd expected. I'll leave it open for our performance gurus to decide whether there is an issue here that needs to be fixed or not.
