| Summary: | very bad performance with small dynamic pre-allocated matrices |
|---|---|
| Product: | Eigen |
| Component: | Core - matrix products |
| Reporter: | laurent.deniau |
| Assignee: | Nobody <eigen.nobody> |
| Status: | RESOLVED DUPLICATE |
| Severity: | Performance Problem |
| Priority: | Normal |
| Version: | 3.0 |
| Hardware: | All |
| OS: | All |
| CC: | chtz, jitseniesen |
| Bug Blocks: | 404 |
On my computer, the difference is substantial but not as big as you claim. My timings (with the same compiler options, but with G++ 4.5) are:

Fixed size: 1.37 sec
Dynamic size: 11.3 sec

For some reason, G++ 4.7 seems to generate very bad code on your platform. Nevertheless, this is a big penalty for using dynamic sizes. Part of the reason is that `A = B * C` allocates a temporary object to hold the result of the product; if you write the two products as `A.noalias() = B * C`, then the timings improve to:

Fixed size: 1.24 sec
Dynamic size: 6.5 sec

Another reason is that the matrix product code for dynamic-size matrices is written with large matrices in mind (bug 404).

Re "Dynamic matrices seem to systematically reallocate the storage instead of reusing it when storage sizes are compatible": as far as I can see, no (re)allocations take place when you add noalias().

When I wrote "On my computer, the difference is substantial but not as big as you claim", I did not mean to imply that I don't believe you. I'm sure you got the timings that you show. I only wanted to say that my configuration yields different results.

Yes, I have also observed a large difference (slower) between the 4.5 and 4.6/4.7 series. But I think the problem occurs more in the assignment operator, because I tested both versions with and without explicit temporaries (see the commented code) and I observe no timing difference.

```cpp
MatrixXd m1(N,M);
MatrixXd m2(M,N);
MatrixXd m3(N,N);
MatrixXd m4(N,N);
MatrixXd mr(N,N);
...
for (int i=0; i < 100000000; i++) {
  // explicit temporaries version
  m3.noalias() = m1 * m2;
  m4.noalias() = m1 * m2;
  mr.noalias() = m3 + m4;

  // implicit temporaries version
  // mr = m1 * m2 + m1 * m2;
}
```

Ah, I forgot to say that I also tested with noalias as you mentioned and saw no difference either. I suspect that some template specialization does not work here. I can send the assembly output from GCC 4.7 on demand. The loop is easy to find in the static-size version, but the dynamic-size version is very, very long.

I tried to compile the same code on Linux (Ubuntu x86_64) with g++ 4.4.3, without success. It's unfortunate that "small" dynamic-size matrices do not use the same algorithm (Strassen algorithm?), because the static-size code performs very well on 100x100 matrices. But I doubt that the x180 slowdown comes from that, and I can see many assembly calls to malloc inside the loop, which suggests that memory allocations are still being performed.

Created attachment 273 [details]
te
Here I get a factor of 8 on a Core 2 and a factor of 13 on a Core i7. Such a speedup factor is to be expected.
I also attached a more elaborate test file that avoids aggressive compiler optimizations. In your example, the compiler could simply drop the loop without changing the result.
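To illustrate that pitfall, here is a minimal sketch (not the attached test file; the size and iteration count are arbitrary assumptions) of how a benchmark loop can be kept observable so the compiler cannot discard it:

```cpp
#include <cstdio>
#include <Eigen/Dense>
using Eigen::MatrixXd;

int main()
{
  const int N = 6;
  MatrixXd a = MatrixXd::Random(N, N);
  MatrixXd b = MatrixXd::Random(N, N);
  MatrixXd c(N, N);                      // destination pre-allocated outside the loop

  double sink = 0;                       // value the compiler cannot prove unused
  for (long i = 0; i < 1000000; ++i) {
    c.noalias() = a * b;                 // product without an aliasing temporary
    sink += c(0, 0);                     // observable side effect keeps the loop alive
  }
  std::printf("%g\n", sink);             // printing the sink keeps the computation live
  return 0;
}
```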
I have tested Eigen against my own matrix library, using only dynamic-size matrices. The sizes are provided on the command line to ensure no knowledge at compile time (a sketch of this setup follows the attachment below). The last column is the Eigen time of the code above, including the allocations that occur even with noalias (!). But since that overhead should be almost constant, I removed the constant time in the Eigen-alloc column. Then I plotted the results of my lib (hand-written SSE2 code), a simple matrix multiplication optimized by GCC with -msse4, and Eigen-alloc. The timings are the user time reported by the Unix time command.

For the case of static sizes, my lib performs like Eigen if I force it to execute the timing loop. Otherwise the compiler detects the invariance of the result and discards the loop, giving a timing of 0 (but not for Eigen).

Created attachment 274 [details]
timings showing the constant time added by allocations in Eigen
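For reference, a minimal sketch (not the reporter's actual code; the iteration count and use of Random are assumptions) of a benchmark whose matrix sizes are only known at run time, because they come from the command line:

```cpp
#include <cstdio>
#include <cstdlib>
#include <Eigen/Dense>
using Eigen::MatrixXd;

int main(int argc, char* argv[])
{
  if (argc < 4) { std::fprintf(stderr, "usage: %s P Q R\n", argv[0]); return 1; }

  // Sizes come from argv, so the compiler cannot specialize on them.
  const int P = std::atoi(argv[1]);
  const int Q = std::atoi(argv[2]);
  const int R = std::atoi(argv[3]);

  MatrixXd m1 = MatrixXd::Random(P, Q);
  MatrixXd m2 = MatrixXd::Random(Q, R);
  MatrixXd mr(P, R);                    // destination pre-allocated once, outside the loop

  for (long i = 0; i < 1000000; ++i)
    mr.noalias() = m1 * m2;             // noalias: no extra temporary for the product result

  std::printf("%g\n", mr.sum());        // use the result so the loop is not optimized away
  return 0;
}
```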
I tried the code of comment 3 with GCC 4.4, 4.5, 4.6 and 4.7, but there are no allocations inside the loop (checked with valgrind + memcheck). The assembly does show allocations inside the loop, but these allocations are never executed. The cause of the calls to malloc is (partially?) that Eigen will resize the matrix on the left-hand side if necessary (a runtime allocation check is sketched after the attachments below).

I also tried with different sizes, and the N = 1 case takes about 75% of the time of the N = 2 case. This is a considerable overhead, but nowhere near as bad as your timings in comment 6. So it's still a bit of a mystery to me. Exactly what code are the timings from comment 6 taken from? (You can remove your own library if that makes it easier.) And perhaps we do have to look at the assembly to understand where the problem lies.

Here is all the information. Files attached: matrix.cpp, matrix.cpp.gcc-output (compilation output), matrix.s.gz (assembly output).

```
$ g++-4.7 -msse4.2 -std=c++11 -Wall -W -pedantic -O3 -I/Users/ldeniau/Projects/madx/projects/eigen-3.0.5/ -DNDEBUG matrix.cpp -o matrix++
$ time ./matrix++ 6 6 6
P = 6, Q = 6, R = 6
154.5    386.25   618    849.75   1081.5   1313.25
231.75   579.375  927    1274.62  1622.25  1969.88
309      772.5    1236   1699.5   2163     2626.5
386.25   965.625  1545   2124.38  2703.75  3283.12
463.5    1158.75  1854   2549.25  3244.5   3939.75
540.75   1351.88  2163   2974.12  3785.25  4596.38

real  2m8.224s
user  2m8.053s
sys   0m0.113s
```

Laptop specs:
Processor Name: Intel Core i5
Processor Speed: 2.3 GHz
Number of Processors: 1
Total Number of Cores: 2
L2 Cache (per Core): 256 KB
L3 Cache: 3 MB
Memory: 8 GB

Created attachment 275 [details]
code used for timings
Created attachment 276 [details]
g++-4.7 output (warnings)
Created attachment 277 [details]
assembly output gzipped (g++ -S)
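As a side note on the allocation question debated above (valgrind sees no allocations executed, while the assembly contains malloc calls): later Eigen 3.x releases offer a runtime switch that can confirm whether heap allocations actually happen inside the timed loop. A minimal sketch, assuming a recent Eigen with this feature and assertions enabled (i.e., no -DNDEBUG):

```cpp
// Must be defined before any Eigen header to enable the runtime malloc check.
#define EIGEN_RUNTIME_NO_MALLOC
#include <Eigen/Dense>
using Eigen::MatrixXd;

int main()
{
  MatrixXd m1 = MatrixXd::Random(6, 6);
  MatrixXd m2 = MatrixXd::Random(6, 6);
  MatrixXd mr(6, 6);                    // destination pre-allocated before the check starts

  // From here on, any heap allocation done by Eigen triggers an assertion failure.
  Eigen::internal::set_is_malloc_allowed(false);
  for (long i = 0; i < 1000; ++i)
    mr.noalias() = m1 * m2;             // asserts if the product still allocates internally
  Eigen::internal::set_is_malloc_allowed(true);

  return mr(0, 0) > 0 ? 0 : 1;          // use the result so the loop is not removed
}
```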
*** This bug has been marked as a duplicate of bug 404 ***

I got a mail from Laurent saying that this is actually more about the many memory allocations inside the product. If you are using GCC on a non-Linux system (e.g., MacOS, MinGW or Cygwin), then it is more likely a duplicate of Bug 729. Can you try whether defining the following before including Eigen solves the issue (see the sketch after this comment)?

`#define EIGEN_ALLOCA alloca`

Apologies for not reading the comments carefully before closing.

-- GitLab Migration Automatic Message --

This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug via this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/469.
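A minimal sketch of that suggestion: the define itself comes from the comment above, while the include and the surrounding code are assumptions (for a platform where alloca is declared in <alloca.h>; on MinGW it typically lives in <malloc.h>):

```cpp
#include <alloca.h>            // declares alloca on MacOS/Linux
#define EIGEN_ALLOCA alloca    // tell Eigen it may use alloca for small temporaries (cf. Bug 729)
#include <Eigen/Dense>

int main()
{
  Eigen::MatrixXd a = Eigen::MatrixXd::Random(6, 6);
  Eigen::MatrixXd b = Eigen::MatrixXd::Random(6, 6);
  Eigen::MatrixXd c(6, 6);
  c.noalias() = a * b;         // small dynamic product; temporaries may now live on the stack
  return c(0, 0) > 0 ? 0 : 1;
}
```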
Created attachment 271 [details]
41 lines of C++ showing the problem with small dynamic matrices

The attached file shows a difference of x130 in speed on my laptop between fixed and dynamic pre-allocated matrices. I don't see any reason for such a discrepancy (I would expect a few percent difference), since the code takes care of intermediate temporaries. Dynamic matrices seem to systematically reallocate the storage instead of reusing it when the storage sizes are compatible, which is a killer for small matrices. In many problems, matrix sizes are unknown (input-defined) but the sizes of intermediate results can be computed and their storage pre-allocated. This is a very important "real life" case. Maybe I have missed some Eigen trick to solve the problem...

Timings on my laptop (Intel Core i5, 2.3 GHz) with g++ 4.7 and -O3 -DNDEBUG -msse4:

C (naive): 10 sec
Eigen fixed size: 1.2 sec (very impressive!)
Eigen dynamic size: 163 sec (very bad!)

```
g++-4.7 -msse4.2 -std=c++11 -O3 -Ieigen-3.0.5/ -DNDEBUG matrix.cpp -o matrix
```
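The original 41-line attachment is not reproduced here; the following is only a minimal sketch of the comparison it describes (fixed-size versus pre-allocated dynamic-size matrices, both using noalias), with made-up sizes and iteration count:

```cpp
#include <cstdio>
#include <ctime>
#include <Eigen/Dense>

// Time `iters` products c = a*b for a given matrix type (fixed- or dynamic-size).
template <typename Mat>
double bench(long iters)
{
  Mat a = Mat::Random(6, 6), b = Mat::Random(6, 6), c(6, 6);
  std::clock_t t0 = std::clock();
  for (long i = 0; i < iters; ++i)
    c.noalias() = a * b;                 // destination pre-allocated, no aliasing temporary
  double sec = double(std::clock() - t0) / CLOCKS_PER_SEC;
  std::printf("c(0,0) = %g, %g sec\n", c(0, 0), sec);  // use the result to keep the loop alive
  return sec;
}

int main()
{
  const long iters = 10000000;
  bench<Eigen::Matrix<double, 6, 6> >(iters);  // fixed size: dimensions known at compile time
  bench<Eigen::MatrixXd>(iters);               // dynamic size: the slow case reported here
  return 0;
}
```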