# Working notes - Expression evaluator

## Contents

# Goals

The goal here is to refactor our expression template mechanism so that the construction of the expression tree and its evaluation are 100% decoupled. Basically, at construction time, we should only assemble stupid *Expression* objects that only brings the proper API. Then, at evaluation time, via partial specialization, we would recursively create *Evaluator* objects reflecting how the considered sub-expressions can and have to be evaluated. In between we could then implement *TreeOptimizer* objects that would reorganize and optimize the expression tree through partial specialization.

# Expressions

An expression has to be as simple as possible. Here is a list of elements to take into account for the new design:

### One unique expression class semantic operation

For instance, this means that the expression type of the linear product of two abstract expressions A and B will always be Product<A,B>.

### Bring the correct API

In practice this means that the expression as to know about both its StorageKind (Dense, Sparse, Band, etc.) and XprKind (Matrix versus Array) such that it can inherit from the right base class (MatrixBase, ArrayBase, SparseMatrixBase, etc.). This is already achieved via a per expression intermediate FooImpl class, e.g.:

Transpose<Xpr> <- TransposeImpl<Transpose<Xpr>,Xpr::StorageKind> TransposeImpl<TransposeXpr,Dense> <- dense_xpr_base<Transpose<TransposeXpr> >::type TransposeImpl<TransposeXpr,Sparse> <- SparseMatrixBase<TransposeXpr>

After this evaluator refactoring, these FooImpl classes should not bring any implementation details anymore, and so they could be renamed, and maybe even factorized into a unique dispatcher helper class. Note that the way we select whether the expression itself is, e.g., Sparse or Dense, it still specific to the expression semantic and is not always trivial (e.g., for binary operations), nevertheless, it still seems cleaner and simpler to me to decouple this bit of logic from the dispatcher mechanism such that later one can be implemented once for all per storage/xpr kinds.

### Operations that should still be done by the expression

The fact that the expressions do not have to bring any implementation details does not means they have to be as stupid as possible. Indeed, they know about the semantic of the operation they represent, and so they should still be responsible for some basic stuffs like:

- the computation of the scalar type,
- computation of the sizes,
- sanity checks (sizes, scalar types),
- ...

On the other hand, they should not, and cannot, determine whether they are "vectorizable", "linear", if it has to be evaluated or not, etc. They cannot do any of this because:

- this depends on the StorageKind,
- this depends on the sub expressions that will be evaluated into temporary and we cannot know this information at construction time (e.g., D = A + B*C won't produce a temporary once processed by the evaluator),
- ...

### Proposed implementation

The interface that any expression has to provide:

template<...> struct traits<Expression<...> > { typedef ... Scalar; typedef ... StorageKind; typedef ... XprKind; enum { RowsAtCompileTime = ..., ColsAtCompileTime = ..., MaxRowsAtCompileTime = ..., // this seems specific to Dense and related to the evaluation => Evaluator?? MaxColsAtCompileTime = ..., // this seems specific to Dense and related to the evaluation => Evaluator?? /* no Cost, Vectorization flags, linear flags, storage order flags, etc. */ }; }; template<...> class Expression : public expression_dispatcher<Expression<...> >::type { // import of the traits Expression(const Arg0& arg0, ...) : m_arg0(arg0), ... { // check the sizes } Index rows() { /* to be implemented */ } Index cols() { /* to be implemented */ } protected: const typename Arg0::Nested m_arg0; ... };

# Evaluator

The main goal of the Evaluator class is to create an *evaluable object* while *taking care of the evaluation of sub expressions into temporaries*. It is worth noting that the evaluation mechanisms depend a lot on the StorageKind. For instance, Dense expression are evaluated by means of coeff(i,j) and/or packet(i,j) calls inside an evaluation loop. The evaluation of Sparse expression is based on iterators. The evaluation of "large" Product expressions is also based on different paradigm. Therefore, *the evaluator of generic expressions has to specialized differently according to the StorageKind*. On the other hand, the XprKind (Matrix vs Array) does not seems to be relevant here since this is only affecting the API and the semantic of some operators. These decisions are handled by the Expressions.

In the current design, the *evaluable* object would simply be an instance of the Evaluator itself.

## Temporaries

The first difficultly here is to find a way to create the temporaries without having the risk that they are destroyed when we need them, and without having to copy them multiple times.

To this end, each evaluator has to store its "sub" evaluator by value. Always. Then, when an expression knows that is has to be evaluated itself, it can simply store a PlainObject result member, inherits the evaluator of PlainObject, and in its constructor, evaluates itself into the result member.

The case of an expression that has to evaluate one (or multiple) of its children is more tricky and depends whether the expression itself has to be evaluated or not. Here are two examples:

(A+B) * C; (A+B).lazyProduct(C);

The first case is easier because the Product expression itself has to be evaluated. Therefore it is enough to evaluate the lhs (A+B) into a temporary in the constructor of the evaluator of Product:

evaluator(const Product<Lhs,Rhs>& prod) { Lhs::PlainObject lhs_tmp = prod.lhs(); evaluate_product(m_result, lhs_tmp, prod.rhs()); }

Of course this code has to be adapted to evaluate the sub expressions only when needed, but that's not the difficult part.

In the second example (lazy product), the product is not evaluated, and so the lhs as to be evaluated in a member of the evaluator and the evaluation has to happen in the constructor. PErhaps we could imagine something like that:

struct evaluator<LazyProduct<Lhs,Rhs> { Lhs::PlainObject m_lhs_tmp; evaluator<Lhs::PlainObject> m_lhs_eval; evaluator<Rhs> m_rhs_eval;

evaluator(const LazyProduct<Lhs,Rhs>& prod) : m_lhs_tmp(prod.lhs()), m_lhs_eval(m_lhs_tmp), m_rhs_eval(prod.rhs()) {} Scalar coeff(i,j) { return sum_k m_lhs_eval(i,k) * m_rhs_eval(k,j); } };

This case seems to be more tricky to implement because it probably require different a evaluator specialization for each of the 4 possible cases.

## Dense world

For now, let's consider the dense world only.

# TreeOptimizer

In comparison to the previous technical choices, implementing a TreeOptimizer looks like a piece of cake. Typically, a TreeOptimizer takes as input an Expression type representing the expression tree, and build a new, hopefully better, expression type (NewXpr). To be able to instantiate a NewXpr object from an initial Xpr object, the TreeOptimizer also has to implement a static NewXpr make(const Xpr&) function. Here is a proof of concept:

// the TreeOpt has to be specialized at least for each Expression template<typename Xpr> struct TreeOpt; // Matrix: nothing to do => trivial template<> struct TreeOpt<Matrix> { typedef Matrix NewXpr; static const Matrix& make(const Matrix& xpr) { return xpr; } }; // Sum: recursively optimize each child and assemble a new Sum expression from the optimized version of its children template<typename A, typename B> struct TreeOpt<Sum<A,B> > { typedef Sum<A,B> Xpr; typedef typename TreeOpt<A>::NewXpr NewA; typedef typename TreeOpt<B>::NewXpr NewB; typedef Sum<NewA,NewB> NewXpr; static NewXpr make(const Xpr& xpr) { return TreeOpt<A>::make(xpr.m_a) + TreeOpt<B>::make(xpr.m_b); } }; // This partial specialization put any product on the rhs of a sum (e.g., A*B + C => C + A*B) // This is mainly to simplify further processing of the tree by reducing the required number of partial specialization. template<typename A, typename B, typename C> struct TreeOpt<Sum<Product<A,B>,C> > { typedef Sum<Product<A,B>, C> Xpr; typedef typename TreeOpt<C>::NewXpr NewC; typedef Sum<NewC,Product<A,B> > NewXpr; static NewXpr make(const Xpr& xpr) { return TreeOpt<C>::make(xpr.m_b) + xpr.m_a; } }; // This partial specialization transform a sub expression like (X + A*B) + Y to (X+Y) + A*B by interchanging A*B with Y. template<typename A, typename B, typename X, typename Y> struct TreeOpt<Sum<Sum<X,Product<A,B> >,Y> > { typedef Sum<Sum<X,Product<A,B> >,Y> Xpr; typedef typename TreeOpt<X>::NewXpr NewX; typedef typename TreeOpt<Y>::NewXpr NewY; typedef Sum<Sum<NewX,NewY>,Product<A,B> > NewXpr; static NewXpr make(const Xpr& xpr) { return (TreeOpt<X>::make(xpr.m_a.m_a) + TreeOpt<Y>::make(xpr.m_b)) + xpr.m_a.m_b; } };

These few rules allow to transform an expression like R = A + B*C + D + E; into R = (A+D+E) + (A*B); which can then be easily evaluated as R = A+D+E; R+=A*B; without any temporary. To make this happen the tree optimization has to be executed multiple times. We can easily know when to stop the iteration by tracking the number of tree changes in, e.g., a TreeOpt::CountChange enum.

On the one hand, this tree optimization might be expensive at compilation time and so it should be optional. On the other hand it would also allow to reduce the number of expressions which to be really evaluated by transforming any complex expression into a canonical form.