This bugzilla service is closed. All entries have been migrated to https://gitlab.com/libeigen/eigen

Bug 1763

Summary: Optimization of symmetric eigendecomposition
Product: Eigen Reporter: Tadej Ciglarič <tadej.c>
Component: EigenvaluesAssignee: Nobody <eigen.nobody>
Status: REVIEWNEEDED ---    
Severity: Optimization CC: chtz, gael.guennebaud, jacob.benoit.1, jitseniesen, tadej.c
Priority: Normal    
Version: 3.4 (development)   
Hardware: All   
OS: All   
Whiteboard:
Bug Depends on:    
Bug Blocks: 1608    
Attachments:
Description Flags
Addition of block tridiagonalization into Eigen none

Description Tadej Ciglarič 2019-10-29 16:16:31 UTC
I implemented a symmetric eigendecomposition algorithm that is significantly faster than one currently in Eigen (~3x on Core i5 2500). I achieved that speedup by using blocked tridiagonalization and replaced QR iteration with MRRR. 

If you are interested I would like to add that to Eigen. Currently my code works with symmetric matrices of doubles, but I think it should not be too much work to adapt it for general selfadjoint matrices.
Comment 1 Christoph Hertzberg 2019-10-30 16:32:59 UTC
Generally we welcome all contributions! You may also share preliminary code.

Do you think it is portable to complex scalars as well? 
And ideally also to custom types? We prefer not having to provide entirely different implementations for different types.

Also try to keep the tridiagonalization and the MRRR iteration as separated as possible, to make it possible to test them individually -- currently, we provide SelfAdjointEigenSolver::computeFromTridiagonal for that (this method must be kept, of course).
Comment 2 Tadej Ciglarič 2019-10-30 20:40:31 UTC
Currently my code is in these three files: https://github.com/bstatcomp/math/blob/cpu_eigendecomposition/stan/math/prim/mat/fun/tridiagonalization.hpp 
https://github.com/bstatcomp/math/blob/cpu_eigendecomposition/stan/math/prim/mat/fun/mrrr.hpp
https://github.com/bstatcomp/math/blob/cpu_eigendecomposition/stan/math/prim/mat/fun/symmetric_eigensolver.hpp

I think it would require some changes to make everything work with complex numbers. MRRR and tridiagonalization code is separated and documented.

At which point would you suggest I integrate these into Eigen? Just modify  what SelfAdjointEigenSolver::computeFromTridiagonal does (and find a similar method for tridiagonalization), or provide a sepatare class for using different eigensolver?

Also I am not familiar with marcurial. What is prefered way of accepting contributions? Do I open a pull request, attach a patch here or ... ?
Comment 3 Tadej Ciglarič 2019-11-11 12:45:16 UTC
I made it work wirh general selfadjoint matrices. 

To integrate it with Eigen I still need some answers to questions from the previous post. Unless one of you that better know internals of Eigen would prefere to do it?
Comment 4 Christoph Hertzberg 2019-11-11 17:37:24 UTC
Thanks again!

I can't promise when I will have time for a thorough review, but some first suggestions:

* For all sizes or indexes use `Eigen::Index` instead of `int` or `size_t`
* Avoid using `auto` (at the moment we want to keep Eigen C++03 compatible)
* Instead of `std::norm` use `Eigen::numext::abs2` ("norm" is just confusing)
* Avoid allocating dynamic matrices as much as possible (especially inside loops)
  You can either provide "workspace" class variables, or pass them as parameters
  (for the latter, you can provide convenience methods which allocate the
  temporaries and then call the "worker" methods)
* Is it possible to share code with "Eigen/src/Householder/BlockHouseholder.h"?
  I haven't had a closer look yet, so no idea how different these are.


Do you happen to already have implemented benchmarks showing the speed-difference? (Ideally, separately for tri-diagonalization and the reduction)
How much does the speed depend on the problem size? 
Does the speed difference come at the cost of less accuracy?

You can also have a look at bench/benchEigenSolver.cpp (hopefully this still compiles -- I barely run these). Of course, the existing benchmark could also be split into tri-diagonalization and the remaining computation).


Depending on the last questions, we can entirely replace the existing implementation, or make runtime-switches depending on the problem size, or offer this as an alternative implementation (like BDCSVD vs JacobiSVD).

For final integration, you can either attach a patch here or make a PR (if you don't like mercurial: We will be migrating to gitlab within the next months, if you can wait that long).
Comment 5 Tadej Ciglarič 2019-11-12 12:15:27 UTC
Thanks for the suggestions. I already avoided dynamic allocations in favor of allocating workspace in advance in performance-critical parts. I can look over the code again, but I think it should be ok to leave them where allocations are not in critical parts of code? 

I use slightly different definition of packed form and how I store householder vectors in it. While it would be possible to change to how Eigen does it it would require some effort. Also my version of multiplication with matrix represented by householder vectors is slightly (10-20%) faster. I am not sure if that is due to changed format or something else.

I already did some benchmarks. I will post them here soon. My version is significantly faster at all sizes, except for very small ones. Even at those sizes I can try to better tune block size.

I think block triangularization should not significantly affect accuraccy. MRRR, however, is less acurate than QR iteration. Also it can completely fail in some rare cases when multiple eigenvalues are extremely close but different in exact arithmetic (with relative difference close to machine percision). When I quickly looked at QR iteration code it seemed it can also fail in some cases.
Comment 6 Tadej Ciglarič 2019-11-13 09:38:10 UTC
I posted benchamrking and error calculation source here: https://gist.github.com/t4c1/2d766bb2076006bc0a9a448842055a4a. Results for double in raw form and graphs are also included.
Comment 7 Christoph Hertzberg 2019-11-13 10:26:28 UTC
Just looking at your benchmark results (without any in-depth analysis yet), I'd say we can make Tridiagonalization have a threshold which decides when to use blocking. But given the significant difference in precision we should have two different EigenSolver implementations (alternatively, have a flag in the template parameter).

For smaller sizes, large parts of the overhead may very likely be caused by repeated allocation/de-allocation inside the loops.
Ideally, we should not need any runtime allocations if an EigenSolver object is repeatedly used with same-sized matrices (except in the first run). For some decompositions we already have unit-tests which guarantee that.
Comment 8 Gael Guennebaud 2019-11-14 17:25:45 UTC
Those are very welcome contributions! Block-tridiagonal is something I wanted to implement for years, so I'm glade you managed to have a look at it!

I think we should first try to get the block-tridiagonalization which is the easiest as there is no API considerations. However, the implementation should really be decomposed as an unblocked and blocked path with some threshold, as well as leverage the existing BlockHouseholder code. See the HouseholderQR implementation for a good example: https://bitbucket.org/eigen/eigen/src/06e8e999d79f2a9a61107c3c39d86764f21ce363/Eigen/src/QR/HouseholderQR.h#lines-265

The UpperBidiagonalization code is also a good example: https://bitbucket.org/eigen/eigen/src/06e8e999d79f2a9a61107c3c39d86764f21ce363/Eigen/src/SVD/UpperBidiagonalization.h#lines-90

If your version is 10%-20% faster we should really understand why as I don't see any major difference that would explain such a gap.

For MRRR, we definitely should provide it in addition to the current one. It could be exposed as a computeMRRR() method or as new solver class. In any case we need a compile-time option so that only the desired code is compiled.
Comment 9 Tadej Ciglarič 2019-11-18 08:42:05 UTC
I removed all allocations in block tridiagonalization and most in MRRR (remaining ones would be difficult to remove). It makes no significant difference in speed, even with small sizes.
Comment 10 Tadej Ciglarič 2019-11-19 10:48:06 UTC
I made an error calculating the speed difference. Actually my code is no more than 5-10% faster. 

I have a hard time comparing eigne and my code. I figured out that apply_block_householder_on_the_left() coencides with body of the outer loop in my code and make_block_householder_triangular_factor() seems to do approximately the same as the inner loop in my code. However I think they are not exactely the same. I guess there are some differences in mathematical formulations that are the basis for code.
Comment 11 Tadej Ciglarič 2019-11-22 11:26:18 UTC
Created attachment 964 [details]
Addition of block tridiagonalization into Eigen
Comment 12 Tadej Ciglarič 2019-11-22 11:28:35 UTC
I changed block tridiagonalization so that it uses Householder primitives from Eigen. I also added a patch that shows how I would integrate it to Eigen. It probably is not perfect, but should be ready for a review.
Comment 13 Nobody 2019-12-04 18:51:53 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/1763.