As motivated by bug 404, it would be useful to allow tuning evaluation parameters on a per expression basis (or group of expressions). Parameters includes at least:
- vectorization: ON/OFF/enforce packet size
- multi-threading: ON/OFF/nb threads
- unrolling: ON/OFF/limit
This could be addressed through a "device" similar to what is done in Tensor, or by the definition of an evaluation context:
It would be also quite helpful to add some control on per-object basis. I would propose to allow an additional flag (say, Eigen::SingleThread) to the Options template parameter of Eigen::Matrix and Eigen::SparseMatrix templates which will allow to suppress the automatic OMP parallelization of all the operations involving this matrix.
1) In some situations it makes sense to use the parallelized and non-parallelized matrix operations in one program. For example, one might prefer to avoid parallelization of of matrix operations when calculating fitness of each individual in evolutionary strategies (like CMA-ES) (and parallelizing the analysis of several individuals instead) whereas would desire to parallelize the subsequent treatment of covariance matrix. Currently, Eigen allows global control over multithreading either via EIGEN_DONT_PARALLELIZE token or via Eigen::setNbThreads(n) but lacks the levers for thread-specific, operationwise control.
2) Performance improvement. From the implementation of Parallelizer.h it is clear that Eigen decides whether and how to parallelize the specific matrix operation based on the structure of arguments by evaluating first several if checks. This is an evident performance bottleneck for very small dynamic size matrices.
Also, it would be a nice idea to use some sort of wrapper for parallelization which would allow to choose among OMP, POSIX, std::thread-based parallelizations etc implementations which then can be chosen by defining the appropriate preprocessor token (conceptual implementation is attached).
Created attachment 694 [details]
wrapper for parallelization
I'm not sur about the "per object" control: what if you do A*B with A wanting single-threading and B multiple threads?
We can think about 5 levels of control:
1 - macros: define the default behavior through macros
2 - global: adjust some parameters at runtime through function calls (for all threads, not compile-time control)
3 - block level: compile-time control for a block of code: probably difficult to achieve in C++03 unless we restrict the block of code to Eigen's expressions:
A = B*C,
C = A+B+2*G.col(j)*G.row(i)
In c++11, we can probably do something more general using a macro and a lambda (not 100% sure though).
4 - expression level: as is Tensor, something like: A.device(my_device) = B*C;
5 - operation level (especially for accuracy control): A = sqrt(B, policy) + sin(C, policy);
Basically, the level3 would essentially be syntactic sugar added on top of level4 to avoid repeating the underlying device every time.
In the Tensor module, different multi-threading devices have already been implemented, including CUDA.
Added to SIMD, threading, and unrolling, we should also be able to control temporary allocations, and accuracy.
Regarding accuracy control, we could take inspiration from boost::math's policies: http://www.boost.org/doc/libs/1_60_0/libs/math/doc/html/math_toolkit/pol_tutorial.html
I guess 4 would be the easiest to achieve and we would be consistent with the Tensor module. If we wanted a "per object" control, we could always let the target object decide the evaluation strategy.
For variant 3: Where would the actual assignment happen? In the destructor of an assigment_proxy object?
Then (A=exprA, B=exprB) would actually be of some type Assigner<AType, AExpr, Assigner<BType, BExpr,void> > and operator of useDevice() would somehow modify the Assigner?
I'm not sure, if we won't introduce trouble with that mechanism -- e.g., how sure are we that Assigner objects are not copy-constructed/destructed (and thus executed) while they are concatenated by the comma operator (and before useDevice has a chance to actually access it)?
Alternative to 5:
A = (sqrt(B).device(policy1) + sin(C).device(policy2)).device(policy3);
Otherwise, how would you decide the policy of the + operator in this case. OTOH, I doubt that this really makes sense here (unless, maybe if sqrt(B) needs to be stored into a temporary).
Perhaps, we could even simply overload the already existing .eval() method to decide the evaluation strategy.
(In reply to Gael Guennebaud from comment #3)
> I'm not sur about the "per object" control: what if you do A*B with A
> wanting single-threading and B multiple threads?
I think that in the case of per-object control all matrices must share the same evaluation control flag. Otherwise, we can observe unexpected behavior when evaluating matrix expressions like A*B+A*C=A*(B+C) Changing the evaluation strategy can be achieved only by type casting all the operands.
Level 3 and 4 controls look as a great and highly reasonable alternatives though. The only concern is performance issue: maybe in addition to (instead of?) run-time switch device(my_device) it is worth to introduce a templated compile-time strategy selector device<my_device>().
Just some thoughts until someone pickup this feature request:
I still think level #3 is doable if we evaluate the sub-expressions immediately directly in operator,. But this still means that for standard expressions, the evaluation have to be triggered in the destructor, which still look risky and need special care for A = B = expr.
Regarding #4, I'm not fan of:
A.device(omp) = B*C;
D.device(omp) = C*B;
which is kind of verbose. We could reduce typing with:
omp (A) = B*C;
omp (D) = C*B;
Here "omp" becomes more like an attribute/qualifier of the expression. We could even go one step further by abusing operator overloading:
omp | A = B*C;
omp | D = C*B;
or a prompt style:
omp >> A = B*C;
omp >> D = C*B;
Regarding #5, I think it only makes sense for controlling speed vs accuracy of some costly functions like sqrt, sin, cos, log, exp, etc. It could also be passed as a template parameter:
y.template cos<Accurate>() // in template code
but this cannot apply to free-function:
maybe no big deal?
More importantly I don't know how to specify speed-vs-accuracy, as the available tradeoffs depend a lot on the given functions. For instance, one might define it in terms of ULP, or absolute errors, and/or provides guarantees on input range:
- no NaN, no INF
- >=0 for sqrt
- >0 for log and rsqrt
- [-pi,pi] for trigo