Working notes - Expression evaluator
Goals
The goal here is to refactor our expression template mechanism so that the construction of the expression tree and its evaluation are 100% decoupled. Basically, at construction time, we should only assemble stupid *Expression* objects that only brings the proper API. Then, at evaluation time, via partial specialization, we would recursively create *Evaluator* objects reflecting how the considered sub-expressions can and have to be evaluated. In between we could then implement *TreeOptimizer* objects that would reorganize and optimize the expression tree through partial specialization.
Expressions
An expression has to be as simple as possible. Here is a list of elements to take into account for the new design:
One unique expression class semantic operation
For instance, this means that the expression type of the linear product of two abstract expressions A and B will always be Product<A,B>.
Bring the correct API
In practice this means that the expression has to know about both its StorageKind (Dense, Sparse, Band, etc.) and XprKind (Matrix versus Array) so that it can inherit from the right base class (MatrixBase, ArrayBase, SparseMatrixBase, etc.). This is already achieved via a per expression intermediate FooImpl class, e.g.:
Transpose<Xpr> <- TransposeImpl<Transpose<Xpr>,Xpr::StorageKind> TransposeImpl<TransposeXpr,Dense> <- dense_xpr_base<Transpose<TransposeXpr> >::type TransposeImpl<TransposeXpr,Sparse> <- SparseMatrixBase<TransposeXpr>
After this evaluator refactoring, these FooImpl classes should not bring any implementation details anymore, and so they could be renamed, and maybe even factorized into a unique dispatcher helper class. Note that the way we select whether the expression itself is, e.g., Sparse or Dense, it still specific to the expression semantic and is not always trivial (e.g., for binary operations). Nevertheless, it still seems cleaner and simpler to me to decouple this bit of logic from the dispatcher mechanism such that later one can be implemented once for all per storage/xpr kinds.
Operations that should still be done by the expression
The fact that the expressions do not have to bring any implementation details does not means they have to be as stupid as possible. Indeed, they know about the semantic of the operation they represent, and so they should still be responsible for some basic stuffs like:
- the computation of the scalar type,
- computation of the sizes,
- sanity checks (sizes, scalar types),
- ...
On the other hand, they should not, and cannot, determine whether they are "vectorizable", "linear", if it has to be evaluated or not, etc. They cannot do any of this because:
- this depends on the StorageKind,
- this depends on the sub expressions that will be evaluated into temporary and we cannot know this information at construction time (e.g., D = A + B*C won't produce a temporary once processed by the evaluator),
- ...
Proposed implementation
The interface that any expression has to provide:
template<...> struct traits<ExpressionFoo<...> > { typedef ... Scalar; typedef ... StorageKind; typedef ... XprKind; enum { RowsAtCompileTime = ..., ColsAtCompileTime = ..., MaxRowsAtCompileTime = ..., // this seems specific to Dense and related to the evaluation => Evaluator?? MaxColsAtCompileTime = ..., // this seems specific to Dense and related to the evaluation => Evaluator?? /* no Cost, Vectorization flags, linear flags, storage order flags, etc. */ }; }; template<...> class ExpressionFooDispatcher<ExpressionFoo<...>, Dense> { Scalar* data(); // optional Index innerStride(); // optional Index outerStride(); // optional }; template<...> class ExpressionFoo : public ExpressionFooDispatcher<ExpressionFoo<...> > { // import of the traits Expression(const Arg0& arg0, ...) : m_arg0(arg0), ... { // check the sizes } Index rows() { /* to be implemented */ } Index cols() { /* to be implemented */ } // no coeff like method protected: const typename Arg0::Nested m_arg0; // Nested: by reference for Matrix and the like, and by value for others, no evaluation anymore. ... };
Evaluator
The main goal of the Evaluator class is to create an *evaluable object* while *taking care of the evaluation of sub expressions into temporaries*. It is worth noting that the evaluation mechanisms depend a lot on the StorageKind. For instance, Dense expression are evaluated by means of coeff(i,j) and/or packet(i,j) calls inside an evaluation loop. The evaluation of Sparse expression is based on iterators. The evaluation of "large" Product expressions is also based on different paradigm. Therefore, *the evaluator of generic expressions has to specialized differently according to the StorageKind*. On the other hand, the XprKind (Matrix vs Array) does not seems to be relevant here since this is only affecting the API and the semantic of some operators. These decisions are handled by the Expressions.
In the current design, the *evaluable* object would simply be an instance of the Evaluator itself.
Temporaries
The first difficultly here is to find a way to create the temporaries without having the risk that they are destroyed when we need them, and without having to copy them multiple times.
To this end, each evaluator has to store its "sub" evaluator by value. Always. Then, when an expression knows that it has to be evaluated itself, it can simply store a PlainObject result member, inherits the evaluator of PlainObject, and in its constructor, evaluates itself into the result member.
The case of an expression that has to evaluate one (or multiple) of its children is more tricky and depends whether the expression itself has to be evaluated or not. Here are two examples:
(A+B) * C; (A+B).lazyProduct(C);
The first case is easier because the Product expression itself has to be evaluated. Therefore it is enough to evaluate the lhs (A+B) into a temporary in the constructor of the evaluator of Product:
evaluator(const Product<Lhs,Rhs>& prod) { Lhs::PlainObject lhs_tmp = prod.lhs(); evaluate_product(m_result, lhs_tmp, prod.rhs()); }
Of course this code has to be adapted to evaluate the sub expressions only when needed, but that's not the difficult part.
In the second example (lazy product), the product is not evaluated, and so the lhs has to be evaluated in a member of the evaluator and the evaluation has to happen in the constructor. Perhaps we could imagine something like that:
struct evaluator<LazyProduct<Lhs,Rhs> { Lhs::PlainObject m_lhs_tmp; evaluator<Lhs::PlainObject> m_lhs_eval; evaluator<Rhs> m_rhs_eval; evaluator(const LazyProduct<Lhs,Rhs>& prod) : m_lhs_tmp(prod.lhs()), m_lhs_eval(m_lhs_tmp), m_rhs_eval(prod.rhs()) {} Scalar coeff(i,j) { return sum_k m_lhs_eval(i,k) * m_rhs_eval(k,j); } };
This case seems to be more tricky to implement because it probably requires a different evaluator specialization for each of the 4 possible cases.
Dense world
For now, let's consider the dense world only.
template<typename T> struct evaluator_impl {}; template<typename T> struct evaluator { typedef evaluator_impl<T> type; }; template<typename T> struct evaluator<const T> { typedef evaluator_impl<T> type; }; template<...> struct evaluator_impl<ExpressionFoo<...> > : ExpressionFoo_evaluator_dispatcher<ExpressionFoo<...>, ExpressionFoo<...>::StorageKind> { evaluator_impl(const ExpressionFooType& t) : Base(t) {} } template<...> struct ExpressionFoo_evaluator_dispatcher<ExpressionFoo<...>, Dense> { typedef ExpressionFoo<...> ExpressionFooType; ExpressionFoo_evaluator_dispatcher(const ExpressionFooType& t) : m_arg0_eval(t.nestedArg0()), ... {} typedef typename ExpressionFooType::Scalar Scalar; typedef typename ExpressionFooType::CoeffReturnType CoeffReturnType; CoeffReturnType coeff(Index i, Index j) const { /* implementation goes here */ } CoeffReturnType coeff(Index index) const; // if it has linear access Scalar& coeffRef(Index i, Index j); // if it has direct access Scalar& coeffRef(Index index); // if linear and direct access // packet methods below only need to be provided if expression provides packet access typedef typename ExpressionFooType::PacketScalar PacketScalar; typedef typename ExpressionFooType::PacketReturnType PacketReturnType; template<int LoadMode> PacketReturnType packet(Index row, Index col) const; template<int LoadMode> PacketReturnType packet(Index index) const; // if linear access template<int StoreMode> void writePacket(Index row, Index col, const PacketScalar& x); // if direct access template<int StoreMode> void writePacket(Index index, const PacketScalar& x); // if linear and direct access protected: typename evaluator<Arg0>::type m_arg0_eval; // ... };
TreeOptimizer
In comparison to the previous technical choices, implementing a TreeOptimizer looks like a piece of cake. Typically, a TreeOptimizer takes as input an Expression type representing the expression tree, and build a new, hopefully better, expression type (NewXpr). To be able to instantiate a NewXpr object from an initial Xpr object, the TreeOptimizer also has to implement a static NewXpr make(const Xpr&) function. Here is a proof of concept:
// the TreeOpt has to be specialized at least for each Expression template<typename Xpr> struct TreeOpt; // Matrix: nothing to do => trivial template<> struct TreeOpt<Matrix> { typedef Matrix NewXpr; static const Matrix& make(const Matrix& xpr) { return xpr; } }; // Sum: recursively optimize each child and assemble a new Sum expression from the optimized version of its children template<typename A, typename B> struct TreeOpt<Sum<A,B> > { typedef Sum<A,B> Xpr; typedef typename TreeOpt<A>::NewXpr NewA; typedef typename TreeOpt<B>::NewXpr NewB; typedef Sum<NewA,NewB> NewXpr; static NewXpr make(const Xpr& xpr) { return TreeOpt<A>::make(xpr.m_a) + TreeOpt<B>::make(xpr.m_b); } }; // This partial specialization put any product on the rhs of a sum (e.g., A*B + C => C + A*B) // This is mainly to simplify further processing of the tree by reducing the required number of partial specialization. template<typename A, typename B, typename C> struct TreeOpt<Sum<Product<A,B>,C> > { typedef Sum<Product<A,B>, C> Xpr; typedef typename TreeOpt<C>::NewXpr NewC; typedef Sum<NewC,Product<A,B> > NewXpr; static NewXpr make(const Xpr& xpr) { return TreeOpt<C>::make(xpr.m_b) + xpr.m_a; } }; // This partial specialization transform a sub expression like (X + A*B) + Y to (X+Y) + A*B by interchanging A*B with Y. template<typename A, typename B, typename X, typename Y> struct TreeOpt<Sum<Sum<X,Product<A,B> >,Y> > { typedef Sum<Sum<X,Product<A,B> >,Y> Xpr; typedef typename TreeOpt<X>::NewXpr NewX; typedef typename TreeOpt<Y>::NewXpr NewY; typedef Sum<Sum<NewX,NewY>,Product<A,B> > NewXpr; static NewXpr make(const Xpr& xpr) { return (TreeOpt<X>::make(xpr.m_a.m_a) + TreeOpt<Y>::make(xpr.m_b)) + xpr.m_a.m_b; } };
These few rules allow to transform an expression like R = A + B*C + D + E; into R = (A+D+E) + (A*B); which can then be easily evaluated as R = A+D+E; R+=A*B; without any temporary. To make this happen the tree optimization has to be executed multiple times. We can easily know when to stop the iteration by tracking the number of tree changes in, e.g., a TreeOpt::CountChange enum.
On the one hand, this tree optimization might be expensive at compilation time and so it should be optional. On the other hand it would also allow to reduce the number of expressions which to be really evaluated by transforming any complex expression into a canonical form.
Working plan
In order to keep disruption to the Eigen repository to a minimum, the new evaluator mechanism is currently developed in a separate repository: https://bitbucket.org/ggael/eigen-evaluators.
What has been done so far?
- General expressions
- Implementation of evaluators for most of the basic expressions: NullaryCwiseOp, UnaryCwiseOp, BinaryCwiseOp, Transpose, Replicate, Block, Map, Ref.
- Remove Flagged expression
- Dense evaluation
- Refactoring of the dense assignment loops using a generic kernel.
- New assignment functor for a clean implementation of compound assignments and swap
- Reductions also use a generic kernel
- Refactoring of triangular_assignment_loop.
- Products
- New uniform Product<Lhs,Rhs,Options> expression. (Options = DefaultProduct, AliasFreeProduct or LazyProduct)
- ProductBase, GeneralProduct, DiagonalProduct, TriangularProduct, SelfAdjointProduct are not used anymore.
- Changes in utility classes
- internal::nested<> never evaluates: it must only be used to assemble an expression tree
- new internal::nested_eval<T,N> which introduces a temporary according to the number of evaluations N. It is used in some functions (like normalized()) and evaluators.
- Decompositions
- Inverse and Solve in all official and unsupported modules (LU, Cholesky, QR, SVD, SparseCholesky, SparseLU, SparseQR, LinearIterativeSolvers, CholmodSupport, UmfpackSupport, PastixSupport, SuperLUSupport, PardisoSupport)
- Sparse evaluation
- Refactoring of the sparse assignment loops.
- iterators are now provided by the evaluators.
Open questions
- What about DirectAccessBit?
- Currently it is still handled by the expressions because DenseCoeffBase need it to enable data/outerStride/innerStride and the likes. However, it not clear that the propagation of DirectAccess is generic and it seems to rather be an evaluation details.
- What to do with TriangularView::solve<OnTheRight>(lhs)?
- We want to deprecate this API and use lhs * tri.inverse().
- So TriangularView::solve<OnTheRight> could return a Product<Lhs,Inverse<TriangularView>> instead of a Solve<> expression for the "OnTheLeft" case.
- Clarify StorageKind versus XprShape
- What to do with the intermediate *Impl classes like TransposeImpl?
- It seems they should not be needed anymore.
- What to do with the packet(i,j) and similar functions?
- Ideally we would simply drop them.
- What to do with NestByValue, Flagged and ForceAlignedAccess classes?
- NestByValue: never used in Eigen, designed to return expression with temporaries, but's that not the right solution for this because this introduces numerous copies. We should rather offer a wrapper with shallow copies and reference counting.
- ForceAlignedAccess: not used anymore. It has been replaced in StableNorm.h by a Ref<>, so do we still want to support it in the future, or make it part of Block, like vec.segment<Aligned16>(start,size), vec.segment<Start,Aligned16>(size)
Partly resolved questions
- What to do with expr.coeff(i,j)?
- Shall we provide a generic implementation based on evaluators: evaluator<Derived>(derived()).coeff(i,j)?
- This is what we currently do.
- Shall we provide a generic implementation based on evaluators: evaluator<Derived>(derived()).coeff(i,j)?
TODO list
- Refactor the traits<> class for expressions and move part of the logic to evaluators. Almost done, though a pass of cleaning would be needed, in particular to make sure that when evaluating an expression into a temp, then the storage order of the temp must come from the evaluator, not the expression.
- Implement evaluators for Sparse modules. Almost done. Cleaning needed.
- Extend unit tests to check for the creation of the right number of temporaries (e.g., for solve, inverse, transforms, homogeneous, hnormalized, etc.)
TODO list for 3.4
- Implement evaluators for the Geometry module.
- More expression templates for triangular and selfadjoint views.
- Requires propagation of expression shapes
- See whether PermutationMatrix and Transpositions could be unified, and see other cleaning opportunities (e.g., wrt Transpose, and traits<>)
- Cleaner refactoring of the product logic. This could be much simpler if we could have a more generic implementation taking any dense shapes (full, triangular, selfadjoint) on either sides...
- Fused evaluation-reduction, see for instance (a+b).normalized() where a+b and its norm could be computed at once.
- FMA
- Tree optimizer
Summary
Here is an outline of the initial working plan:
Nr | Description | Status | Who | Prereqs |
---|---|---|---|---|
1 | Implement the evaluators for some simple cases (Matrix, Array, Transpose, CwiseNullaryOp, CwiseUnaryOp except Random, CwiseBinaryOp). Implement copy_using_evaluator to mimic the assignment operator using these evaluators. | DONE | Benoit, Thomas, Jitse | — |
2 | Implement the evaluator for Product. This is the tricky case because it needs to use temporaries. We will need to introduce another expression object, e.g. Product, which does not automatically evaluate when nested. | DONE | Gael | — |
3 | Extend copy_using_evaluator to do vectorization. | DONE | Jitse, … | 1 |
4 | Extend copy_using_evaluator to do unrolling. | DONE | 3 | |
5 | Implement the evaluator for all other simple cases (Map, Block, Select, Replicate, Reverse, *Wrapper, *View) | DONE | 1, 4? | |
6 | Implement the evaluator for solvers | DONE | Gael | 5 |
7 | Implement the evaluator for inverse() | WIP | Gael | 6 |
8 | Finish the refactoring around ReturnByValue | 2, 6 | ||
9 | Figure if this allows us to refactor TransposeImpl, perhaps move stuff to evaluator<Transpose> | 1 | ||
10 | Refactor BLAS traits | for 3.4 | 2, 4? | |
11 | Put everything in the main Eigen branch. Change operator= to copy_using_evaluator and operator* to evalprod. Is this the correct time? | DONE | 1–10 | |
12 | Allow optimizations like D=A+B*C --> D=A;D+=B*C. We agreed that this will be best done by partially specializing the implementation of operator= as a template struct (like the current assign_selector/assign_impl). | DONE | Gael | 2 |
13 | Allow optimizations like A.inverse()*B --> A.lu().solve(B) | 7 | ||
14 | General tree optimizations, chained products optimizations | 2 |