New user self-registration is disabled due to spam. Please email eigen-core-team @ lists.tuxfamily.org if you need an account.
Before reporting a bug, please make sure that your Eigen version is up-to-date!
Bug 1044 - Parallel assignment
Summary: Parallel assignment
Status: DECISIONNEEDED
Alias: None
Product: Eigen
Classification: Unclassified
Component: Core - general (show other bugs)
Version: 3.4 (development)
Hardware: All All
: Normal Feature Request
Assignee: Nobody
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-07-26 11:35 UTC by Jonas Adler
Modified: 2015-07-26 19:35 UTC (History)
3 users (show)



Attachments

Description Jonas Adler 2015-07-26 11:35:43 UTC
Many users (myself included) use Eigen because of its simplicity in writing fast and clear code. One part where i find it lacking is in support for parallel assignment.

In a recent project I had a line of code similar to this using most of the run-time:

A.row(i) += A.row(j)

The point being that in the end, my Eigen code was as fast as manually writing:

for (int k=0; k<n; k++)
    A(i,k) += A(j,k)

But this was on a 8 core machine. If i wanted to improve the performance of the code by parallelizing this loop, I'd be forced to do so myself. By doing this myself I'd need to sacrifice both simplicity and clarity, being forced to do something like:

#pragma omp parallel for
for (int k=0; k<n; k++)
    A(i,k) += A(j,k)

My suggestion here (which I would be willing to implement myself if accepted) is to add the following syntax:

A.row(i).parallel() += A.row(j)

This could be implemented similarly to how noalias() is currently implemented. At a lower level this could be written by having an outer loop over blocks inside assign_impl, possibly by adding another template parameter 'Parallelization' with default value 'NoParallelization'. An example in the one dimensional case would be:

template<typename Derived1, typename Derived2, int Unrolling>
struct internal::assign_impl<Derived1, Derived2, LinearVectorization, Unrolling, OMPParallelization>
{
    static void run(Derived1 &dst, const Derived2 &src)
    {
        const int size = dst.size();
        const int n_threads = nbThreads();
        #pragma omp parallel
        {
            const int id = omp_get_thread_num();
            const Index istart = id*size/n_threads;
            Index iend = (id+1)*size/n_threads;
            if(id == n_threads-1) iend = size;
            internal::assign_impl<Eigen::VectorBlock<Derived1>,Eigen::VectorBlock<Derived2>,LinearVectorization,Unrolling,NoParallelization>::run(dst.segment(istart,iend),src.segment(istart,iend));
        }
    }
};

In the long run, this could also be added to the assign_selector, automatically picking a parallel assignment if deemed correct (for example if the size of the array is large enough), but this can be put of for now.

Is this a decent suggestion that I should try implementing on a fork, or do you think it is to complicated or out of Eigen's scope?

If it is interesting, is my implementation suggestion reasonable, or is there some better way to do it?
Comment 1 Christoph Hertzberg 2015-07-26 14:17:07 UTC
Interesting idea. We only have parallelization for Matrix*Matrix products integrated so far, but even for that it might be worth-while to give the user control on where to enable or disable parallelization.

How much do you gain with parallelization, in your example?
I'd suggest making some benchmarks how much you gain (depending on operation/scalar type/vectorization/number of cores, etc), before going into implementation details.

Side topic:
If you do a lot of .row(*) operations on A, have you tried storing it row-major? I would assume this has a big impact on performance as well -- especially, if the rows are aligned to packet-sizes this would make it possible to vectorize the operation (could gain near 2-8 speed-up depending on architecture and scalar type).
Comment 2 Gael Guennebaud 2015-07-26 19:35:33 UTC
This is definitely a wanted feature. At the time of Eigen 2.0, I've also experimented with omp for coefficient-wise evaluation:

https://bitbucket.org/eigen/eigen/src/429fa67a258fd3e0c60483f12341be23970c0a77/disabled/EvalOMP.h?at=2.0

but at that time compilers and common CPU were not that good to leverage significant speed up. Nowadays, I've also experienced very nice speed-up with omp and simple coefficient-wise loops.

Actually, a similar feature is already available in the unsupported/Tensors module. It is implemented through a "Device" template parameter of the evaluator. This mechanism is more general as it allows for different multi-threading backends, GPUs, etc:

https://bitbucket.org/eigen/eigen/src/e8733e565660e6e1d3b48be52864352117872f7c/unsupported/Eigen/CXX11/src/Tensor/?at=default#markdown-header-controlling-how-expressions-are-evaluated

Note You need to log in before you can comment on or make changes to this bug.