Difference between revisions of "Todo"
From Eigen
Revision as of 02:28, 3 June 2008
Core module
This module is essentially complete and working, but needs improvement in these areas:
- Optimizing executable code size
- When an xpr evaluates its own arguments, we should use that to make it depend only on the resulting matrix type, not on the xpr type. This way, code would be generated only once per actual matrix type, not once per xpr type.
- Can that be done systematically? Theoretically, any xpr may evaluate its arguments; this is handled in ei_nested.
- In practice, the main xpr for which this matters is Product, for two reasons. First, it is one of the few xprs using ei_nested with parameter n>1, which means that it is much more likely to evaluate its arguments. Second, it generates a lot of code.
- Use if's instead of template specialization whenever possible. When the condition evaluates at compile-time, this gets optimized; when the condition is not known at compile-time, generating code for each possible value is a bad idea (bad codesize-vs-speed trade-off).
- Vectorization
- Evaluate adding vectorization support to (certain) block expressions. Related to adding support for specifying startRow/startCol parameters as template parameters. => WIP via unaligned load and store
- BLAS/LAPACK backend?
- BLAS: apparently that would only be useful for matrix product and backward substitution. However, Gael already optimized the matrix product enough to beat the best free BLAS out there (though some proprietary BLAS such as MKL are still faster)... so perhaps a BLAS backend is not the top priority. Especially as it would be nontrivial to do (many different kinds of expressions to catch using partial template specialization).
- BLAS integration idea: instead of catching all kinds of expressions using partial template specialization, a much easier solution would be to have the notion of a Blasable (!) expression. Matrix is of course Blasable, as well as Map and any combination of Block and Transpose (and maybe Triangular too). All other expressions have to be evaluated. Such Blasable expressions (i.e., the ones having the flag ReferencableBit) have to define a getBlasParameters function that returns (and recursively updates) the parameters needed to send the data to BLAS:
- address of the first element (actually this one is already given by coeffRef(0,0)),
- the real storage order (not the RowMajorBit flag),
- transpose status,
- stride (length of a row or column according to the storage order).
- LAPACK: letting big algorithms (such as inversion, QR...) use a LAPACK backend should be easier (because here we deal with plain matrices not expressions) and more useful (because there are too many different algorithms to optimize for us to do it ourselves).
- Fast product for large matrices
- Principle of the algorithm:
- Let's assume we are doing a product A * B = C of size: MxK * KxN = MxN
- The first concept is to perform the product by large blocks fitting into L2 cache (currently we deal with blocks of 256x256).
- The outermost loop loops over the rows of A per block of 256 rows. At each iteration the block of A of size 256xK is copied to a temporary buffer with some reorganization/interleaving such that the product kernel is 100% cache-friendly (see below). This reorganization also guarantees that the data will be correctly aligned for fast vectorization.
- Next, for each macro block C' of 256x256 of the result C, we loop over each macro block A' (resp. B') of A (resp. B) that contributes to C', and we apply the following product kernel:
- Here the problem is to perform a matrix product: C' += A' * B' of 256x256x256.
- This product is performed per micro block C" of C', of size Wx1. This allows keeping the Wx1 coefficients in W vector registers (in an expanded way).
- More precisely, assuming a hardware vector size of 4, the innermost loop performs at each iteration:
- load a column vector of B' (4x1), multiply it by the W row vectors of A', and accumulate each product into W vector registers.
- At the end of the inner loop, the elements of each vector register have to be summed together and accumulated into the micro block C".
- This last accumulation step requires "horizontal" computations that can be performed one vector at a time (then C is accessed in a scalar way, which is good if it is not aligned or to handle boundaries), or we can perform 4 accumulations/reductions at a time, which is slightly faster (typically only 3 instructions with SSE3).
- Current implementation in Eigen:
- One of the goals is to have only one instantiation of that code per scalar type, therefore:
- C is assumed to be column major (otherwise the product is transposed).
- The code to copy A to a blocky version is duplicated to handle the different storage orders (transposition) of A.
- B is assumed to be column major, otherwise it is evaluated to a ColumnMajor matrix.
- when computing the dot products of W sub-rows of A times one sub-column of B, if the sub-column is not aligned, then it is copied to a small aligned buffer. This is much faster than using unaligned loads and allows having a single instance of the inner loop.
- the alignment of C is managed dynamically without overhead, as explained in the algorithm.
- Issues/todo:
- B gets evaluated if it is not column-major => offer two different algorithms?
- the algorithm actually computes C+=A*B => find a way to catch such expressions to avoid temporaries and clearing C to zero.
- some trivial expressions get evaluated, like conjugate or negate,
- more generally, if A has to be evaluated, then directly evaluate it to a blocky form and tell the algorithm to directly use A.
- to handle negate: the user should take care to write c = -(a*b)
- to handle conjugate or adjoint: let's add to UnaryOp the ability to return a reference (_coeffRef) if the functor is the identity (that is the case if the scalar type is real). This is also applicable to cast, or to a possible functor extracting the real part of a real matrix.
- Misc Features
- Improve current meta unroller to the more general concept of partial loop unrolling.
- Consider the use of alloca (dynamic allocation on the stack for dynamic-sized temporaries). After some benchmarks, it seems this really pays off for matrices smaller than 16x16; for larger matrices the gain is not so obvious. In practice alloca cannot be called in MatrixStorage, and it has to be called in the function creating the temporary. This might be implemented using Matrix::map().
- Another fine-tuning idea (very low priority): reuse a previously allocated temporary when possible. For instance, theoretically, the Cholesky decomposition could be done "in place", by working directly on the input matrix. Currently, Cholesky stores its own data. However, a typical use case is: (m.adjoint() * m).cholesky()... In that case, if m is large enough, the matrix product creates a temporary and Cholesky does too, while the latter could simply "pick" the data allocated by the product, since we know this data cannot be reused by any other expression. I guess this concept could be extended to a more general mechanism. (This is only applicable for dynamic-size matrices and in the case we don't use alloca, cf. the previous point.)
LU module
- Inversion: it's done, but more work is needed:
- fixed-size 4x4: either merge the vectorization code by Markos (hopefully making it work with SSE as well as AltiVec) or improve Eigen's generic vectorization code enough to achieve comparable performance.
- general case: evaluate doing at least one step of divide-and-conquer as in the size 4 case, to reduce overall complexity and improve parallelizability.
- Determinant: only fixed-size <= 4 is done, the general case remains to do. Perhaps make it just use a LU decomposition, as large determinants are seldom used anyway.
- LU decomposition: to do. Should have optimized paths for fixed-size <= 4. Very important, as this will be our main device for exact solving.
- provide isInvertible(), inverse(), determinant(). It's not redundant to have them also here: once the LU decomposition is computed, these are much faster to compute.
- provide rank(), kernelBasis(), imageBasis(), someAntecedent(vector).
Cholesky module
This would provide nice exact-solving capabilities for positive matrices.
- The code for Cholesky decomposition and positive solver is done
- still need to check the API
- applications: positive exact solving, generalized eigen problems, etc.
- should linear regression use that, or SVD? I heard SVD is ideal for regression, but still Cholesky seems good too. At least for now one could implement linear regression with Cholesky, as this is needed soon (required for libavogadro to port to eigen2).
QR module
We have QR, eigenvalues in selfadjoint case, half-working eigenvectors, and matrix norm. Todo:
- fix bug: eigenvectors are wrong in selfadjoint case.
- implement eigensolver in non-selfadjoint case
- make sure we support complex numbers.
- generalized eigen problem using Cholesky decomposition (here or in Cholesky ?), spectral radius, matrix exponential and maybe more operator calculus by applying a cwiseUnaryOp to the eigenvalues vector.
SVD module
To do. Should provide SVD decomposition, pseudoinverse, what the vision folks need, and least squares / linear regression.
- Maybe Gram-Schmidt goes here? It's really similar to SVD.
Geometry module
Started by gael. Still need:
- Projective geometry / homographies
- Rotation matrices: reuse Eigen 1 code.
- Need a fixed-size Gram-Schmidt here, so it's not clear whether it belongs here or in SVD. Is this a case for having it in a separate module?
Sparse matrices/vectors
- We could implement a wrapper around GMM++ bringing nice features such as arithmetic operators coupled with copy-on-write. The point of this wrapper would be to enable good integration with Eigen, by providing conversion operators/constructors. The demand comes from the Krita and Step projects.
- Another ambitious long-term plan would be to implement sparse matrices/vectors on top of Eigen. One could then reuse the algorithms provided in GMM++.
Unit-tests
- Keep adding more!
Documentation
- Keep updating
- all our issues with doxygen are now solved (i.e. we have workarounds)
- Write special pages on certain topics, illustrate them with examples:
- Compiler support/options
- Vectorization
- Parallelization
- Fixed-size matrices/vectors
- in particular, usage with OpenGL
- Dynamic-size matrices/vectors