This bugzilla service is closed. All entries have been migrated to https://gitlab.com/libeigen/eigen

Bug 1044

Summary: Parallel assignment
Product: Eigen
Reporter: Jonas Adler <jonas_lajv>
Component: Core - general
Assignee: Nobody <eigen.nobody>
Status: DECISIONNEEDED
Severity: Feature Request
CC: chtz, gael.guennebaud, jacob.benoit.1
Priority: Normal
Version: 3.4 (development)
Hardware: All
OS: All
Whiteboard:

Description Jonas Adler 2015-07-26 11:35:43 UTC
Many users (myself included) use Eigen because of its simplicity in writing fast and clear code. One area where I find it lacking is support for parallel assignment.

In a recent project, a line of code similar to this accounted for most of the run-time:

A.row(i) += A.row(j);

The upside is that, in the end, my Eigen code was as fast as manually writing:

for (int k = 0; k < n; k++)
    A(i, k) += A(j, k);

But this was on an 8-core machine. If I wanted to improve the performance of the code by parallelizing this loop, I'd be forced to do so myself, sacrificing both simplicity and clarity by writing something like:

#pragma omp parallel for
for (int k = 0; k < n; k++)
    A(i, k) += A(j, k);

My suggestion here (which I would be willing to implement myself if accepted) is to add the following syntax:

A.row(i).parallel() += A.row(j);

This could be implemented similarly to how noalias() is currently implemented. At a lower level, this could be realized as an outer loop over blocks inside assign_impl, possibly by adding another template parameter 'Parallelization' with default value 'NoParallelization'. An example in the one-dimensional case would be:

template<typename Derived1, typename Derived2, int Unrolling>
struct internal::assign_impl<Derived1, Derived2, LinearVectorization, Unrolling, OMPParallelization>
{
    static void run(Derived1 &dst, const Derived2 &src)
    {
        const Index size = dst.size();
        const int n_threads = nbThreads();
        #pragma omp parallel num_threads(n_threads)
        {
            // Each thread assigns one contiguous chunk; the last thread
            // picks up the remainder.
            const int id = omp_get_thread_num();
            const Index istart = id*size/n_threads;
            const Index iend = (id == n_threads-1) ? size : (id+1)*size/n_threads;
            // Note: segment() takes (start, length), not (start, end).
            Eigen::VectorBlock<Derived1> dst_chunk = dst.segment(istart, iend-istart);
            Eigen::VectorBlock<const Derived2> src_chunk = src.segment(istart, iend-istart);
            internal::assign_impl<Eigen::VectorBlock<Derived1>, Eigen::VectorBlock<const Derived2>,
                                  LinearVectorization, Unrolling, NoParallelization>::run(dst_chunk, src_chunk);
        }
    }
};

In the long run, this could also be added to the assign_selector, automatically picking a parallel assignment when it is deemed beneficial (for example, when the array is large enough), but this can be put off for now.
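For concreteness, here is a minimal sketch of such a size-based heuristic, written as a free function rather than as part of the actual assign_selector machinery; the function name assign_maybe_parallel and the threshold value are assumptions for illustration only.

#include <Eigen/Core>
#include <omp.h>

// Hypothetical helper: dispatch to a chunked parallel assignment only when
// the expression is large enough for the fork/join overhead to pay off.
template<typename Dst, typename Src>
void assign_maybe_parallel(Dst& dst, const Src& src)
{
    const Eigen::Index size = dst.size();
    const Eigen::Index threshold = 1 << 15; // assumed tuning parameter

    if (size < threshold)
    {
        dst = src; // small expression: sequential assignment wins
        return;
    }

    #pragma omp parallel
    {
        const int nt = omp_get_num_threads();
        const int id = omp_get_thread_num();
        const Eigen::Index begin = id*size/nt;
        const Eigen::Index end = (id == nt-1) ? size : (id+1)*size/nt;
        dst.segment(begin, end - begin) = src.segment(begin, end - begin);
    }
}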

Is this a decent suggestion that I should try implementing in a fork, or do you think it is too complicated or out of Eigen's scope?

If it is interesting, is my implementation suggestion reasonable, or is there some better way to do it?
Comment 1 Christoph Hertzberg 2015-07-26 14:17:07 UTC
Interesting idea. So far we only have parallelization integrated for Matrix*Matrix products, but even for that it might be worthwhile to give the user control over where to enable or disable parallelization.
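For reference, that existing product parallelization is controlled only globally, through the setNbThreads()/nbThreads() API:

#include <Eigen/Core>
#include <iostream>

int main()
{
    Eigen::setNbThreads(4);                   // limit Eigen's parallel products to 4 threads
    std::cout << Eigen::nbThreads() << "\n";  // query the current setting
    Eigen::setNbThreads(0);                   // restore the OpenMP default
}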

How much do you gain from parallelization in your example?
I'd suggest running some benchmarks of how much you gain (depending on operation, scalar type, vectorization, number of cores, etc.) before going into implementation details.
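A minimal benchmark along these lines could look as follows; the vector size and repetition count are arbitrary, and one would of course vary the parameters listed above:

#include <Eigen/Core>
#include <chrono>
#include <iostream>

int main()
{
    const int n = 1 << 22;
    Eigen::VectorXd a = Eigen::VectorXd::Random(n);
    Eigen::VectorXd b = Eigen::VectorXd::Random(n);

    auto t0 = std::chrono::steady_clock::now();
    for (int rep = 0; rep < 100; ++rep)
        a += b;                              // sequential Eigen assignment
    auto t1 = std::chrono::steady_clock::now();

    for (int rep = 0; rep < 100; ++rep)
    {
        #pragma omp parallel for
        for (int k = 0; k < n; ++k)          // hand-rolled parallel loop
            a[k] += b[k];
    }
    auto t2 = std::chrono::steady_clock::now();

    std::cout << "sequential: " << std::chrono::duration<double>(t1 - t0).count() << " s\n"
              << "parallel:   " << std::chrono::duration<double>(t2 - t1).count() << " s\n";
}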

Side topic:
If you do a lot of .row(*) operations on A, have you tried storing it row-major? I would assume this has a big impact on performance as well -- especially, if the rows are aligned to packet sizes, this would make it possible to vectorize the operation (potentially a near 2-8x speed-up, depending on architecture and scalar type).
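Concretely, that just means choosing the RowMajor storage order:

#include <Eigen/Core>

int main()
{
    // With RowMajor storage, A.row(i) is contiguous in memory, so the
    // row-wise update below can be vectorized like a plain vector operation.
    Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor> A(100, 100);
    A.setRandom();
    A.row(3) += A.row(7);
}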
Comment 2 Gael Guennebaud 2015-07-26 19:35:33 UTC
This is definitely a wanted feature. Back at the time of Eigen 2.0, I also experimented with OpenMP for coefficient-wise evaluation:

https://bitbucket.org/eigen/eigen/src/429fa67a258fd3e0c60483f12341be23970c0a77/disabled/EvalOMP.h?at=2.0

but at that time compilers and common CPUs were not good enough to achieve a significant speed-up. Nowadays, I have also seen very nice speed-ups with OpenMP and simple coefficient-wise loops.

Actually, a similar feature is already available in the unsupported Tensor module. It is implemented through a "Device" template parameter of the evaluator. This mechanism is more general, as it allows for different multi-threading backends, GPUs, etc.:

https://bitbucket.org/eigen/eigen/src/e8733e565660e6e1d3b48be52864352117872f7c/unsupported/Eigen/CXX11/src/Tensor/?at=default#markdown-header-controlling-how-expressions-are-evaluated
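A minimal usage sketch of that Device mechanism, based on the Tensor module's documentation (the exact API may differ between versions):

#define EIGEN_USE_THREADS
#include <unsupported/Eigen/CXX11/Tensor>

int main()
{
    Eigen::Tensor<float, 1> a(1000000), b(1000000), c(1000000);
    a.setRandom();
    b.setRandom();

    // Evaluate the expression on a thread-pool device: the assignment
    // is split across the pool's worker threads.
    Eigen::ThreadPool pool(8);
    Eigen::ThreadPoolDevice device(&pool, 8);
    c.device(device) = a + b;
}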
Comment 3 Nobody 2019-12-04 14:47:06 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new issue via this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/1044.