Arguably one of the key features of Eigen is that it provides top speed with very high ease of use.
I am currently looking to optimize some vectorial code, and converting from STL to Eigen types seems like one of the easiest ways to make the best of the CPU.
Other than sums and multiplications, this particular code needs a cast operator (float to int, in a binning operation).
From what I can understand from the Eigen code base, it seems that the ::cast<> function simply calls static_cast<> as a unary operator. However, there seem to exist SIMD instructions for specific common kinds of casts (such as CVTTSD2SI).
It would then be great if such ::cast<> calls were also SIMD enabled.
Any chance of seeing this feature appear?
Given the Eigen infrastructure, it seems that this should not be too hard to implement, but getting "hands on" inside Eigen is quite scary to me.
Indeed, as long as the respective packet-size does not change that should be easy. This includes float<->int and double<->int64.
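To make the same-packet-size case concrete, here is a minimal sketch of a float-to-int binning loop written directly with SSE2 intrinsics. It is not Eigen code: the function name `bin_indices` and the parameter `inv_bin_width` are made up for illustration, the length is assumed to be a multiple of 4, and a scalar fallback is used when SSE2 is not available.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical helper (not part of Eigen):
// bins[i] = int(values[i] * inv_bin_width), truncating toward zero.
// Assumes n is a multiple of 4; tail handling is omitted.
void bin_indices(const float* values, int* bins, std::size_t n, float inv_bin_width) {
    assert(n % 4 == 0);
#if defined(__SSE2__)
    const __m128 scale = _mm_set1_ps(inv_bin_width);
    for (std::size_t i = 0; i < n; i += 4) {
        // 4 floats in, 4 ints out: the packet size does not change.
        __m128 v = _mm_mul_ps(_mm_loadu_ps(values + i), scale);
        __m128i b = _mm_cvttps_epi32(v);  // packed truncating conversion (CVTTPS2DQ)
        _mm_storeu_si128(reinterpret_cast<__m128i*>(bins + i), b);
    }
#else
    // Scalar fallback with the same truncating semantics.
    for (std::size_t i = 0; i < n; ++i)
        bins[i] = static_cast<int>(values[i] * inv_bin_width);
#endif
}
```

Note that truncation matches static_cast<int> but differs from floor() for negative inputs, which matters if bins can have negative indices.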
It's also relatively easy if the source is bigger than the destination, as you can create a new packet by converting 2/4/8 packets of the bigger type and packing them into a single packet of the smaller type.
For int types, use PACKSSWB/PACKSSDW/PACKUSWB or shuffling; for double->float, use, e.g., MOVLHPS.
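As a rough illustration of the narrowing case, the sketch below converts two packets of the bigger type into one packet of the smaller type with SSE2 intrinsics. The function names are hypothetical and this is not how Eigen implements it; lengths are assumed to be multiples of the destination packet size, and a scalar fallback is provided for non-SSE2 targets.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical helper: double -> float, two source packets per destination packet.
// Assumes n is a multiple of 4.
void narrow_to_float(const double* src, float* dst, std::size_t n) {
    assert(n % 4 == 0);
#if defined(__SSE2__)
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 lo = _mm_cvtpd_ps(_mm_loadu_pd(src + i));      // [f0 f1 0 0]
        __m128 hi = _mm_cvtpd_ps(_mm_loadu_pd(src + i + 2));  // [f2 f3 0 0]
        _mm_storeu_ps(dst + i, _mm_movelh_ps(lo, hi));        // MOVLHPS: [f0 f1 f2 f3]
    }
#else
    for (std::size_t i = 0; i < n; ++i) dst[i] = static_cast<float>(src[i]);
#endif
}

// Hypothetical helper: int32 -> int16 with signed saturation (PACKSSDW).
// Assumes n is a multiple of 8.
void narrow_to_int16(const std::int32_t* src, std::int16_t* dst, std::size_t n) {
    assert(n % 8 == 0);
#if defined(__SSE2__)
    for (std::size_t i = 0; i < n; i += 8) {
        __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i + 4));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), _mm_packs_epi32(a, b));
    }
#else
    // Scalar fallback reproducing the saturating behaviour of PACKSSDW.
    for (std::size_t i = 0; i < n; ++i) {
        std::int32_t c = std::min<std::int32_t>(32767, std::max<std::int32_t>(-32768, src[i]));
        dst[i] = static_cast<std::int16_t>(c);
    }
#endif
}
```

The integer version saturates rather than wraps, so it only matches static_cast for values that already fit in the smaller type.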
Converting smaller types to bigger ones requires either redundant work, some kind of meta-packets, or temporaries (e.g., it would be relatively easy to implement some specializations if the source has direct access).
We had this discussion on the list some time ago.
Doable, yes. Easy, ... well let's see!
Handling change of packet size would require:
1) the concept of a "meta packet" grouping multiple native packets in a single type and implement all the p* functions for it. Not difficult.
2) functions for the conversion from/to meta packets to/from native packets. As you said, that's relatively easy.
3) update the vectorization logic to permit this. The vectorization logic is already one of the most complicated mechanisms in Eigen, so extending and further complicating it sounds really scary!!
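Points (1) and (2) could look roughly like the sketch below. Everything here is hypothetical: `Packet2d` and `Packet4f` are plain structs standing in for native packets such as __m128d and __m128 so that the example stays portable, and `MetaPacket` and `pcast_to_double` are illustrative names, not existing Eigen types or functions.

```cpp
#include <cstddef>

// Portable stand-ins for native packets (assumption: real code would use
// __m128d / __m128 or the equivalent NEON/AVX types).
struct Packet2d { double v[2]; };
struct Packet4f { float v[4]; };

// A native p* function for the stand-in packet.
inline Packet2d padd(const Packet2d& a, const Packet2d& b) {
    return {{a.v[0] + b.v[0], a.v[1] + b.v[1]}};
}

// (1) A meta packet groups N native packets in a single type...
template <typename Packet, int N>
struct MetaPacket { Packet p[N]; };

// ...and each p* function is implemented by forwarding to the native one.
template <typename Packet, int N>
MetaPacket<Packet, N> padd(const MetaPacket<Packet, N>& a, const MetaPacket<Packet, N>& b) {
    MetaPacket<Packet, N> r;
    for (int k = 0; k < N; ++k) r.p[k] = padd(a.p[k], b.p[k]);
    return r;
}

// (2) Conversion from one native packet of 4 floats to a meta packet of
// 2 x 2 doubles, so the number of coefficients per "packet" stays at 4.
inline MetaPacket<Packet2d, 2> pcast_to_double(const Packet4f& a) {
    MetaPacket<Packet2d, 2> r;
    for (int k = 0; k < 2; ++k)
        for (int j = 0; j < 2; ++j)
            r.p[k].v[j] = static_cast<double>(a.v[2 * k + j]);
    return r;
}
```

With this shape, an expression mixing float and double sees 4 coefficients per packet on both sides of the cast, which is the property the vectorization logic in (3) would need.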
Nonetheless, we should support as soon as possible the ability to use and instantiate packets of different sizes for a given scalar type (NEON and AVX permit that). In practice this means packet access members of the form:
packet<PacketSize,Alignment>(Index i, Index j)
So if we combine this with meta packets, implementing (3) should be relatively easy. Meta-packets should also be extremely useful for partial loop unrolling, which is only effective if all load operations are done first and into different registers, a task the compiler is not good at when naively unrolling a loop.
Using the ExprEval, we probably don't need to use Meta-Packets for this.
(In reply to Christoph Hertzberg from comment #4)
> Using the ExprEval, we probably don't need to use Meta-Packets for this.
Do you remember what was your idea about how evaluator could avoid the need for meta-packets? The float-to-double case is not obvious to me.
We could imagine a mechanism of per-block evaluation that would enable the evaluation of parts of sub-expressions. The blocks would have to be small enough to fit in L1 cache. This would solve the float-to-double case without meta-packets, and open the door to many other optimizations.
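A hand-written sketch of what per-block evaluation could mean for an expression like a + (x * f).cast<double>() is given below. This is only an illustration of the idea, not an Eigen mechanism: the function name and the block size `kBlock` are assumptions, and a real implementation would pick the block size from the cache configuration.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical illustration: res[i] = a[i] + double(x * f[i]),
// evaluated block by block through a small float temporary.
void add_scaled_cast_blocked(const double* a, const float* f, float x,
                             double* res, std::size_t n) {
    constexpr std::size_t kBlock = 256;  // assumed block size: 1 KiB of floats, well within L1
    float tmp[kBlock];
    for (std::size_t start = 0; start < n; start += kBlock) {
        const std::size_t len = std::min(kBlock, n - start);
        // Pass 1: the float sub-expression, vectorizable with float packets only.
        for (std::size_t j = 0; j < len; ++j) tmp[j] = x * f[start + j];
        // Pass 2: cast and add in double; the only mixed-type step reads from
        // plain memory, so no packet ever changes size mid-expression.
        for (std::size_t j = 0; j < len; ++j)
            res[start + j] = a[start + j] + static_cast<double>(tmp[j]);
    }
}
```

The trade-off is an extra store and load per coefficient, which should stay cheap as long as the temporary remains in L1.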
(In reply to Gael Guennebaud from comment #5)
> Do you remember what was your idea about how evaluator could avoid the need
> for meta-packets? The float-to-double case is not obvious to me.
At the time I wrote this I had only a very vague idea how the evaluator was going to work. I think I mostly made this dependency because you mentioned somewhere that this should work somehow.
One easy case I can think of is if cast<>() is the last operation before writing to memory. In that case only a custom DstEvaluator would be required which casts before writing to memory -- that might be a rather uncommon use case, however.
Another easy case (probably the more relevant one in practice) is loading from float and operating in double. In that case we could simply combine _mm_cvtps_pd with _mm_loadl_pi into a SrcEvaluator.
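For the load-from-float case, the combination of _mm_loadl_pi and _mm_cvtps_pd might look like the following sketch. The function name is hypothetical and this is plain intrinsics code rather than an actual SrcEvaluator; the length is assumed to be even, and a scalar fallback covers non-SSE2 targets.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical helper: acc[i] += double(src[i]).
// Assumes n is a multiple of 2 (one double packet per iteration).
void accumulate_float_as_double(const float* src, double* acc, std::size_t n) {
    assert(n % 2 == 0);
#if defined(__SSE2__)
    for (std::size_t i = 0; i < n; i += 2) {
        // Load only two floats into the low half of a register...
        __m128 f = _mm_loadl_pi(_mm_setzero_ps(),
                                reinterpret_cast<const __m64*>(src + i));
        // ...and widen them to one packet of two doubles.
        __m128d d = _mm_cvtps_pd(f);
        _mm_storeu_pd(acc + i, _mm_add_pd(_mm_loadu_pd(acc + i), d));
    }
#else
    for (std::size_t i = 0; i < n; ++i) acc[i] += static_cast<double>(src[i]);
#endif
}
```

Here the rest of the expression only ever sees double packets, so the packet size is constant from the evaluator's point of view.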
I see you just suggested basically the same here:
For slightly more complicated expressions, this gets much more complicated:
VectorXd res = VectorXd() + (float(x)*VectorXf()).cast<double>();
Ignoring loop control and handling of the tail, this should result in something like this:
// x * VectorXf():
__m128 t0 = _mm_mul_ps(_mm_load_ps(srcFloat), _mm_set1_ps(x));
// cast and add each half:
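The "cast and add each half" step is not spelled out above; one possible way to write the whole loop body is sketched here. The function and parameter names are made up, unaligned loads are used for simplicity, the length is assumed to be a multiple of 4, and there is a scalar fallback for non-SSE2 targets.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical helper: res[i] = a[i] + double(x * f[i]).
// Assumes n is a multiple of 4; tail handling is omitted.
void add_cast_product(const double* a, const float* f, float x,
                      double* res, std::size_t n) {
    assert(n % 4 == 0);
#if defined(__SSE2__)
    const __m128 xs = _mm_set1_ps(x);
    for (std::size_t i = 0; i < n; i += 4) {
        // x * VectorXf(): one float packet, 4 coefficients.
        __m128 t0 = _mm_mul_ps(_mm_loadu_ps(f + i), xs);
        // Cast each half: the low two floats directly, the high two after MOVHLPS.
        __m128d lo = _mm_cvtps_pd(t0);
        __m128d hi = _mm_cvtps_pd(_mm_movehl_ps(t0, t0));
        // Add each half to the matching double packet and store.
        _mm_storeu_pd(res + i,     _mm_add_pd(_mm_loadu_pd(a + i),     lo));
        _mm_storeu_pd(res + i + 2, _mm_add_pd(_mm_loadu_pd(a + i + 2), hi));
    }
#else
    for (std::size_t i = 0; i < n; ++i)
        res[i] = a[i] + static_cast<double>(x * f[i]);
#endif
}
```

One float packet feeds two double packets per iteration, which is exactly the size mismatch the vectorization logic would have to track.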
I guess with meta-packets this tends to be easier to solve:
Basically, a Packet<float,4> just needs to have a cast<double>() function which simply returns a Packet<double,4> (using _mm_cvtps_pd and _mm_movehl_ps).
However, with that implementation the second use case above would be sub-optimal, as it would disallow vectorizing Vector2f::cast<double>() and would basically just replace one load with one movhl instruction.
-- GitLab Migration Automatic Message --
This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity.
You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/512.