Todo

Core module

This module is essentially complete and working, but needs improvement in these areas:

  • Optimizing executable code size
    • When an xpr evaluates its own arguments, we should use that to make it depend only on the resulting matrix type, not on the xpr type. This way, code would be generated only once per actual matrix type, not once per xpr type.
      • Can that be done systematically? Theoretically, any xpr may evaluate its arguments; this is handled in ei_nested.
      • In practice, the main xpr for which this matters is Product, for two reasons. First, it is one of the few xprs using ei_nested with parameter n>1, which means that it is much more likely to evaluate its arguments. Second, it generates a lot of code.
  • Vectorization
    • Add AltiVec support. [done - thanks to markos]
    • Fix compilation errors with complex numbers (probably just disable vectorization in this case). In particular, ensure the tests compile and succeed with vectorization turned on. [hopefully done]
    • Add ability for certain operations to view a matrix op as a vector op. Needed e.g. to vectorize ops on 2x2 matrices. [done - ggael]
    • Evaluate adding vectorization support to (certain) block expressions. Related to adding support for specifying startRow/startCol parameters as template parameters. => WIP via unaligned load and store
    • Add vectorization support to dynamic-size matrices. Here the good approach is not really to align the data array on a 16-byte boundary (that might help sometimes, but typically we'll be working with dynamic block expressions that don't preserve the alignment). The better approach is probably to allow arbitrary operations to vectorize the part of themselves that can be vectorized: handle the operations near the unaligned boundaries with non-vectorized instructions, and vectorize what lies in the middle (between aligned boundaries). That means runtime checks, but looping over dynamic-sized objects means conditional branching anyway. [partly done - ggael]

Currently, vectorization is enabled if and only if we can guarantee at compile time that the first element is 16-byte aligned. It is actually too difficult to check the alignment of the data at runtime, since addresses are hidden behind expressions; moreover, shifting the start of the vectorization as suggested could only work if ALL matrices involved also required the same shift. The vectorization of Block is a real nightmare. One option would be to use the expensive unaligned load/store operations when the expression might require them (as Block does). This could be done by making packetCoeff templated. A rough sketch of the peeling and unaligned-load ideas follows below.
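
For illustration, here is a hypothetical sketch (not Eigen code; the function addTo and its signature are made up) of the two ideas above applied to a simple dst[i] += src[i] on dynamic-size float arrays: scalar peeling before the first 16-byte boundary and after the last one, aligned SSE in the middle, and an unaligned load for the operand whose alignment cannot be guaranteed.

  #include <xmmintrin.h>  // SSE intrinsics
  #include <cstdint>

  // Hypothetical sketch, not actual Eigen code: vectorize dst[i] += src[i]
  // for dynamic sizes by peeling the unaligned head and tail with scalar code,
  // and using an unaligned load for src since its alignment offset may differ
  // from dst's.
  void addTo(float* dst, const float* src, int size)
  {
    int i = 0;

    // scalar head: advance until dst reaches a 16-byte boundary
    while (i < size && (reinterpret_cast<std::uintptr_t>(dst + i) % 16) != 0)
    { dst[i] += src[i]; ++i; }

    // vectorized middle: aligned load/store on dst, unaligned load on src
    for (; i + 4 <= size; i += 4)
    {
      __m128 d = _mm_load_ps(dst + i);
      __m128 s = _mm_loadu_ps(src + i);
      _mm_store_ps(dst + i, _mm_add_ps(d, s));
    }

    // scalar tail
    for (; i < size; ++i)
      dst[i] += src[i];
  }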

  • Parallelization?
    • I haven't managed to get any benefit from OpenMP with generic expressions, even in simple cases that should be easy, such as assignment of a cwise xpr. Unless we get a good benefit for such assignments and for redux, I don't think that we should bother about OpenMP.
    • It is easier to get a benefit in particular algorithms such as matrix product evaluation, but that's pointless, since in that case the even better solution is a good BLAS backend.
  • BLAS/LAPACK backend?
    • BLAS: apparently that would only be useful for matrix product and backward substitution. However, Gael already optimized the matrix product enough to beat the best free BLAS out there (though some proprietary BLAS such as MKL are still faster)... so perhaps a BLAS backend is not the top priority. Especially as it would be nontrivial to do (many different kinds of expressions to catch using partial template specialization).
    • LAPACK: letting big algorithms (such as inversion, QR...) use a LAPACK backend should be easier (because here we deal with plain matrices not expressions) and more useful (because there are too many different algorithms to optimize for us to do it ourselves).
  • Fast matrix product
    • The major remaining limitation of the new, WIP product implementation is the requirement that rhs (in res = lhs * rhs) be column-major.
      • So if both lhs and rhs are row-major, there is no problem: we can simply flip the product.
      • But if lhs is col-major and rhs is row-major, we have to first copy/evaluate rhs to a col-major temporary. Another option would be to implement another algorithm that directly deals with that case. I (ggael) already have such an algorithm; however, it is currently a bit slower than the current one and less flexible at handling arbitrary matrix dimensions. Moreover, it doubles the code size, which is not nice for a template library. Any ideas?
    • Principle of the algorithm:
      • Let's assume we are doing a product A * B = C of size: MxK * KxN = MxN
      • The first concept is to perform the product by large blocks fitting into L2 cache (currently we deal with blocks of 256x256).
      • The outermost loop runs over the rows of A in blocks of 256 rows. At each iteration, the 256xK block of A is copied to a temporary buffer with some reorganization/interleaving such that the product kernel is 100% cache friendly (see below). This reorganization also allows us to guarantee that the data is correctly aligned for fast vectorization.
      • Next, for each 256x256 macro block C' of the result C, we loop over the macro blocks A' (resp. B') of A (resp. B) that contribute to C', and we apply the following product kernel:
        • Here the problem is to perform a matrix product C' += A' * B' of size 256x256x256.
        • This product is performed per micro block C" of C', of size Wx1. This allows keeping the Wx1 coefficients in W vector registers (in expanded form).
        • More precisely, and assuming a hardware vector size of 4, the innermost loop performs at each iteration:
        • - load a column vector of B' (4x1), multiply it by the W row vectors of A', and accumulate each product into the W vector registers.
        • - At the end of the inner loop, the elements of each vector register have to be summed together and accumulated into the micro block C".
        • This last accumulation step requires "horizontal" computations. They can be performed one vector at a time (then C is accessed in a scalar way, which is good when it is not aligned or to handle boundaries), or we can perform 4 accumulations/reductions at a time, which is much faster (typically only 3 instructions with SSE3). A simplified sketch of the overall blocking structure follows below.
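
To make the loop structure above concrete, here is a purely scalar, hypothetical sketch (the function blockedProduct and its signature are made up for illustration): it shows only the blocking into 256-wide macro blocks and omits the packing/interleaving of A, the Wx1 micro blocks and all vectorization. All matrices are assumed column-major.

  #include <algorithm>

  // Hypothetical, purely scalar sketch of the blocking structure only:
  // C (MxN) += A (MxK) * B (KxN), all column-major, L2 block size 256.
  // The real kernel additionally packs A' into an aligned, interleaved buffer
  // and computes C' per Wx1 micro block held in vector registers.
  void blockedProduct(const double* A, const double* B, double* C,
                      int M, int K, int N)
  {
    const int Block = 256;
    for (int i0 = 0; i0 < M; i0 += Block)        // blocks of rows of A / C
      for (int j0 = 0; j0 < N; j0 += Block)      // blocks of columns of B / C
        for (int k0 = 0; k0 < K; k0 += Block)    // blocks along the depth K
        {
          // macro-block kernel: C' += A' * B'
          const int iMax = std::min(i0 + Block, M);
          const int jMax = std::min(j0 + Block, N);
          const int kMax = std::min(k0 + Block, K);
          for (int j = j0; j < jMax; ++j)
            for (int k = k0; k < kMax; ++k)
            {
              const double b = B[k + j * K];      // B(k, j)
              for (int i = i0; i < iMax; ++i)
                C[i + j * M] += A[i + k * M] * b; // C(i, j) += A(i, k) * B(k, j)
            }
        }
  }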


  • Misc Features
    • We need an efficient way of computing m * m.adjoint() and m.adjoint() * m. Can this be done cleanly? I am afraid we'll have to re-write at least a separate meta-unroller for this one. So maybe make a new Xpr for that while we're at it. [partly done - ggael]

Currently you can write a = m * m.adjoint().upper() or a.upper() = m * m.adjoint(), and only the upper part of the product will be evaluated, unless the cacheOptimalProduct code path is taken. That path needs to be disabled when the left and right xprs have different storage orders, since in that case the default implementation already seems OK with respect to cache.

    • I no longer think we need masks for copying upper/lower triangles. Probably too much hassle. They were mostly useful in fixed-size cases, but there the algorithms that might use them are often superseded by special algorithms for small sizes (such as cofactors for computing an inverse).
    • Extend the current meta unroller to the more general concept of partial loop unrolling (see the sketch below).
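
For illustration, here is a hypothetical (non-Eigen) sketch of partial loop unrolling: the body is unrolled by a compile-time factor via a small meta-program, and a plain loop handles the remainder. The names UnrollBody and partiallyUnrolledLoop are made up.

  #include <cstddef>

  // Hypothetical sketch of partial loop unrolling (not current Eigen code):
  // unroll the body by a compile-time Factor and loop over the chunks at runtime.
  template<int K, int N>
  struct UnrollBody
  {
    template<typename Op>
    static void run(Op& op, std::size_t base)
    {
      op(base + K);
      UnrollBody<K + 1, N>::run(op, base);
    }
  };
  template<int N>
  struct UnrollBody<N, N>   // termination of the compile-time recursion
  {
    template<typename Op>
    static void run(Op&, std::size_t) {}
  };

  template<int Factor, typename Op>
  void partiallyUnrolledLoop(Op op, std::size_t size)
  {
    std::size_t i = 0;
    for (; i + Factor <= size; i += Factor)   // unrolled chunks of Factor iterations
      UnrollBody<0, Factor>::run(op, i);
    for (; i < size; ++i)                     // scalar remainder
      op(i);
  }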

LU module

  • Inversion: it's done, but more work is needed:
    • fixed-size 4x4: either merge the vectorization code by Markos (hopefully making it work with SSE as well as AltiVec) or improve Eigen's generic vectorization code enough to achieve comparable performance.
    • general case: evaluate doing at least one step of divide-and-conquer as in the size-4 case, to reduce overall complexity and improve parallelizability (see the block-inversion identity after this list).
  • Determinant: only fixed-size <= 4 is done; the general case remains to be done. Perhaps make it just use an LU decomposition, as large determinants are seldom used anyway.
  • LU decomposition: to do. Should have optimized paths for fixed-size <= 4. Very important, as this will be our main device for exact solving.
    • provide isInvertible(), inverse(), determinant(). It's not redundant to also have them here: once the LU decomposition is computed, they are much faster to compute.
    • provide rank(), kernelBasis(), imageBasis(), someAntecedent(vector).
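
For reference, the one-step divide-and-conquer mentioned above is just the standard block-inversion identity (S being the Schur complement), and the LU route for the determinant reduces it to the product of the pivots:

  \begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1}
    = \begin{pmatrix} A^{-1} + A^{-1} B S^{-1} C A^{-1} & -A^{-1} B S^{-1} \\
                      -S^{-1} C A^{-1}                  &  S^{-1} \end{pmatrix},
  \qquad S = D - C A^{-1} B

  P M = L U \;\Longrightarrow\; \det M = \det(P)^{-1} \prod_i u_{ii}

so the determinant is the product of the diagonal entries of U, up to the sign of the permutation.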

Cholesky module

This would provide nice exact-solving capabilities for positive matrices.

  • The code for Cholesky decomposition and positive solver is done
  • still need to check the API
  • applications: positive exact solving, generalized eigen problems, etc.
  • should linear regression use that, or SVD? I heard SVD is ideal for regression, but Cholesky still seems good too. At least for now, one could implement linear regression with Cholesky, as this is needed soon (required for libavogadro to port to eigen2); the normal-equations route is sketched below.
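
For the record, the Cholesky route for linear regression would be the usual normal-equations approach (a sketch of the math, not a decided design):

  \min_x \|A x - b\|^2 \;\Longleftrightarrow\; A^T A \, x = A^T b

Since A^T A is symmetric positive (semi-)definite, one can factor A^T A = L L^T with Cholesky and solve the two triangular systems L y = A^T b and L^T x = y.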

QR module

Basic code for the QR decomposition is already there. To do: eigenvalues (using the QR algorithm), the generalized eigenproblem using a Cholesky decomposition (here or in the Cholesky module?), spectral radius, operator norm, the operator exponential, and maybe more operator calculus by applying a cwiseUnaryOp to the eigenvalues vector.
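
For a diagonalizable (e.g. self-adjoint) matrix, the operator calculus mentioned above amounts to applying the scalar function coefficient-wise to the eigenvalues:

  A = V \Lambda V^{-1} \;\Longrightarrow\; f(A) = V \, f(\Lambda) \, V^{-1},
  \qquad f(\Lambda) = \mathrm{diag}(f(\lambda_1), \dots, f(\lambda_n))

so the operator exponential corresponds to f = exp applied (e.g. via a cwiseUnaryOp) to the eigenvalues vector, and the spectral radius is \max_i |\lambda_i|.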

SVD module

To do. Should provide SVD decomposition, pseudoinverse, what the vision folks need, and least squares / linear regression.
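
For reference, the pseudoinverse and least squares both follow directly from the SVD:

  A = U \Sigma V^T \;\Longrightarrow\; A^+ = V \Sigma^+ U^T,
  \qquad x_{\mathrm{ls}} = A^+ b

where \Sigma^+ inverts the nonzero singular values and leaves the zero ones, and x_{\mathrm{ls}} is the (minimum-norm) least-squares solution of A x = b.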

  • Maybe Gram-Schmidt goes here? It's really similar to SVD.

Geometry module

  • cross product goes here, as an expression.
  • Projective geometry / homographies: I'm (bjacob) starting to have precise ideas about this, so leave that to me.
  • Rotation matrices: reuse Eigen 1 code.
  • Quaternions: evaluate reusing Tackat's code in Marble.
  • Need a fixed-size Gram-Schmidt here, so it's not clear whether it belongs here or in SVD. Is this a case for having it in a separate module?

Sparse matrices/vectors

  • We could implement a wrapper around GMM++ bringing nice features such as arithmetic operators coupled with copy-on-write. The point of this wrapper would be to enable good integration with Eigen, by providing conversion operators/constructors. The demand comes from the Krita and Step projects.
  • Another ambitious long-term plan would be to implement sparse matrices/vectors on top of Eigen. One could then reuse the algorithms provided in GMM++.

Unit-tests

  • Update them to cover all of the API and add tests for bad bugs we had in the past, such as the sizeof bug.
  • Also add tests to check that fixed-size objects really don't cause mallocs (see the sketch below).
  • Get rid of QTestLib, and thereby remove the QtCore dependency.
  • Split tests into many small executables, using CTest to run them. That might even allow us to use -O3 -g3 for compiling, getting decent speed; currently that causes g++ to use too much memory, especially with more than one job.
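
A possible way to implement the "no malloc" check (a hypothetical sketch, not an existing test; it assumes a fixed-size matrix type with the usual setZero() and product operators): count heap allocations by replacing the global operator new and assert that the counter does not move across fixed-size operations.

  #include <cstdlib>
  #include <new>
  #include <cassert>

  // Hypothetical sketch of a "no malloc" test: every heap allocation goes
  // through this counting operator new, so fixed-size operations must leave
  // the counter untouched.
  static int g_allocCount = 0;

  void* operator new(std::size_t size)
  {
    ++g_allocCount;
    if (void* p = std::malloc(size))
      return p;
    throw std::bad_alloc();
  }
  void operator delete(void* p) noexcept { std::free(p); }

  template<typename MatrixType>   // e.g. a fixed-size type such as Matrix4d
  void checkNoMalloc()
  {
    const int before = g_allocCount;
    MatrixType a, b;
    a.setZero(); b.setZero();
    MatrixType c = a * b;         // everything should live on the stack
    (void)c;
    assert(g_allocCount == before);
  }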

Documentation

  • Update massively: it's out of date
  • Fix doxygen issues, such as the groups in MatrixBase (as determined by \name) appearing in random order.
  • Write special pages on certain topics, illustrate them with examples:
    • Compiler support/options
    • Vectorization
    • Parallelization
    • Fixed-size matrices/vectors
      • in particular, usage with OpenGL
    • Dynamic-size matrices/vectors