This bugzilla service is closed. All entries have been migrated to
Bug 469 - very bad performance with small dynamic pre-allocated matrices
Summary: very bad performance with small dynamic pre-allocated matrices
Status: RESOLVED DUPLICATE of bug 404
Alias: None
Product: Eigen
Classification: Unclassified
Component: Core - matrix products (show other bugs)
Version: 3.0
Hardware: All All
: Normal Performance Problem
Assignee: Nobody
Depends on:
Blocks: 404
  Show dependency treegraph
Reported: 2012-05-30 08:40 UTC by laurent.deniau
Modified: 2019-12-04 11:41 UTC (History)
2 users (show)

41 lines of C++ showing the problem with small dynamic matrices (689 bytes, application/octet-stream)
2012-05-30 08:40 UTC, laurent.deniau
no flags Details
te (772 bytes, text/plain)
2012-06-02 16:21 UTC, Gael Guennebaud
no flags Details
timings showing the constant time added by allocations in Eigen (25.15 KB, application/pdf)
2012-06-02 17:13 UTC, laurent.deniau
no flags Details
code use for timings (945 bytes, application/octet-stream)
2012-06-04 07:04 UTC, laurent.deniau
no flags Details
g++-4.7 output (warnings) (8.20 KB, application/octet-stream)
2012-06-04 07:05 UTC, laurent.deniau
no flags Details
assembly output gzipped (g++ -S) (20.00 KB, application/x-gzip)
2012-06-04 07:06 UTC, laurent.deniau
no flags Details

Description laurent.deniau 2012-05-30 08:40:09 UTC
Created attachment 271 [details]
41 lines of C++ showing the problem with small dynamic matrices

The attached file show a difference of x130 in speed on my laptop between fixed and dynamic pre-allocated matrices. I don't see any reason for such discrepancy (I would expect few percents difference) since it takes care of intermediate temporaries. Dynamic matrices seems to systematically reallocate the storage instead of reuse it when storage sizes are compatible, which is a killer for small matrices. In many problems, matrices sizes are unknown (input-defined) but size of intermediate results can be computed and storages pre-allocated. This is a very important "real life" case. Maybe I have missed some Eigen trick to solve the problem...

Timings on my laptop Intel Core i5 2.3 Ghz with g++ 4.7 and -O3 -DNDEBUG -msse4:

C (naive): 10 sec
Eigen fixed size: 1.2 sec (very impressive!)
Eigen dynamic size: 163 sec (very bad!)

g++-4.7 -msse4.2 -std=c++11 -O3 -Ieigen-3.0.5/ -DNDEBUG matrix.cpp -o matrix
Comment 1 Jitse Niesen 2012-05-30 17:42:22 UTC
On my computer, the difference is substantial but not as big as you claim. My timings )with the same compiler options, but with G++ 4.5) are:

Fixed size: 1.37 sec
Dynamic size: 11.3 sec

For some reason, G++ 4.7 seem to generate very bad code on your platform.

Nevertheless, this is a big penalty for using dynamic sizes. Part of the reason is that A = B * C allocates a temporary object for holding the result of the product; if you write the two products as A.noalias() = B * C then the timings improve to:

Fixed size: 1.24 sec
Dynamic size: 6.5 sec

Another reason is that the matrix product code for dynamic-size matrices is written with large matrices in mind (bug 404).

Re "Dynamic matrices seems to systematically reallocate the storage instead of reuse it when storage sizes are compatible", as far as I can see, no (re)allocations take place when you add noalias() .
Comment 2 Jitse Niesen 2012-05-30 17:44:47 UTC
When I wrote "On my computer, the difference is substantial but not as big as you claim", I did not mean to imply that I don't believe you. I'm sure you got the timings that you show. I only wanted to say that my configuration yields different results.
Comment 3 laurent.deniau 2012-05-30 18:48:32 UTC
Yes, I have observed large difference (slower) too between 4.5 and 4.6/4.7 series.

But I think that it is more on the assignment operator that the problem occurs, because I tested both versions with/without explicit temporaries (see the commented code) and I observe no timing difference.

 MatrixXd m1(N,M);
 MatrixXd m2(M,N);
 MatrixXd m3(N,N);
 MatrixXd m4(N,N);
 MatrixXd mr(N,N);


 for (int i=0; i < 100000000; i++) {
   // explicit temporaries version
   m3.noalias() = m1 * m2;   
   m4.noalias() = m1 * m2;
   mr.noalias() = m3 + m4;

   // implicit temporaries version
//  mr = m1 * m2 + m1 * m2;

Ah, I forgot to say that I also tested with noalias as you mentioned and saw no difference as well. I suspect that some template specialization does not work here... I can send the assembly output from gcc4.7 on demand. The loop is easy to find on static-size but the dynamic-size is very very long... I tryied to compile the same code on linux (ubuntu x86_64) with g++ 4.4.3 without success.

It's unfortunate that "small" dynamic-size matrices do not use the same algorithm (Strassen algorithm?) because the static-size code performs very well on 100x100 matrices. But I doubt that the x180 slowdown is coming from that, and I can see many assembly call_malloc inside the loop, which would let suppose that memory allocations are still performed...
Comment 4 Gael Guennebaud 2012-06-02 16:21:20 UTC
Created attachment 273 [details]

Here I get a factor 8 on a Core2 and a factor 13 on a core i7. Such a speedup factor has to be expected.
I also attached a more elaborated test file that avoids agressive compiler optimizations. In your example the compiler could simply drop the loop without changing the result.
Comment 5 laurent.deniau 2012-06-02 17:11:43 UTC
I have tested Eigen against my own matrix library using only dynamic-size matrices. The size are provided on the command line to ensure no knowledge at compile time. The last column is the Eigen time of the code above, including the allocations that occur even with noalias (!). But since the overhead should be almost constant, I remove the constant time in column Eigen-alloc. Then I plot the result of my lib (hand written SSE2 code), a simple matrix mul optimized by with GCC with -msse4 and Eigen-alloc. The timings are the user time of the time unix command.

For the case of static sizes, my lib performs like Eigen if I force it to execute the timing loop. Otherwise the compiler detects the invariance of the result and discards the loop giving timing = 0 (but not for Eigen).
Comment 6 laurent.deniau 2012-06-02 17:13:15 UTC
Created attachment 274 [details]
timings showing the constant time added by allocations in Eigen
Comment 7 Jitse Niesen 2012-06-03 23:51:11 UTC
I tried the code of comment 3 with GCC 4.4, 4.5, 4.6 and 4.7, but there are no allocations inside the loop (checked with valgrind + memcheck). The assembly does show allocations inside the loop, but these allocations are never executed. The cause for the calls to malloc is (partially?) that Eigen will resize the matrix on the left-hand side if necessary. 

I also tried with different sizes, and the N = 1 case takes about 75% of the time of the N = 2 case. This is a considerable overhead but nowhere as bad as your timings in comment 6. 

So it's still a bit of a mystery to me. Exactly what code are these timings from comment 6 taken from? (You can remove your own library if that makes it easier.) 
And perhaps we do have to look at the assembly to understand where the problem lies.
Comment 8 laurent.deniau 2012-06-04 07:03:47 UTC
Here is all the information:

file attached: matrix.cpp, matrix.cpp.gcc-output (compilation output), matrix.s.gz (assembly output)

$g++-4.7 -msse4.2 -std=c++11 -Wall -W -pedantic -O3 -I/Users/ldeniau/Projects/madx/projects/eigen-3.0.5/ -DNDEBUG matrix.cpp -o matrix++

$time ./matrix++ 6 6 6
P = 6, Q = 6, R= 6
  154.5  386.25     618  849.75  1081.5 1313.25
 231.75 579.375     927 1274.62 1622.25 1969.88
    309   772.5    1236  1699.5    2163  2626.5
 386.25 965.625    1545 2124.38 2703.75 3283.12
  463.5 1158.75    1854 2549.25  3244.5 3939.75
 540.75 1351.88    2163 2974.12 3785.25 4596.38

real	2m8.224s
user	2m8.053s
sys	0m0.113s

laptop with:
  Processor Name:	Intel Core i5
  Processor Speed:	2.3 GHz
  Number of Processors:	1
  Total Number of Cores:	2
  L2 Cache (per Core):	256 KB
  L3 Cache:	3 MB
  Memory:	8 GB
Comment 9 laurent.deniau 2012-06-04 07:04:42 UTC
Created attachment 275 [details]
code use for timings
Comment 10 laurent.deniau 2012-06-04 07:05:10 UTC
Created attachment 276 [details]
g++-4.7 output (warnings)
Comment 11 laurent.deniau 2012-06-04 07:06:04 UTC
Created attachment 277 [details]
assembly output gzipped (g++ -S)
Comment 12 Christoph Hertzberg 2014-06-20 16:41:08 UTC

*** This bug has been marked as a duplicate of bug 404 ***
Comment 13 Christoph Hertzberg 2014-06-20 17:44:10 UTC
I got a mail from Laurent, that this is actually more about lots of memory allocations inside the product.
If you are using gcc on a non-linux system (e.g., MacOS, MinGW or Cygwin) then it is more likely a duplicate of Bug 729.
Can you try if defining the following before including Eigen solves the issue?
#define EIGEN_ALLOCA alloca

Apologies for not reading the comments carefully, before closing.
Comment 14 Nobody 2019-12-04 11:41:54 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance:

Note You need to log in before you can comment on or make changes to this bug.