This bugzilla service is closed. All entries have been migrated to https://gitlab.com/libeigen/eigen
Bug 1294 - RowMajor Matrix-Vector mul x10 slower than ColMajor
Summary: RowMajor Matrix-Vector mul x10 slower than ColMajor
Status: CONFIRMED
Alias: None
Product: Eigen
Classification: Unclassified
Component: Core - matrix products
Version: 3.3 (current stable)
Hardware: x86 - 64-bit Windows
Importance: High Performance Problem
Assignee: Nobody
URL:
Whiteboard:
Keywords:
Depends on: 312
Blocks: 3.4
Reported: 2016-09-10 10:56 UTC by Nikolai
Modified: 2019-12-04 16:13 UTC (History)



Attachments
Assembly listing for Col and Row major Transformation mult (12.64 KB, text/plain)
2016-09-12 13:29 UTC, Nikolai
Code for benchmark (7.86 KB, text/plain)
2016-09-12 13:31 UTC, Nikolai

Description Nikolai 2016-09-10 10:56:40 UTC
Visual Studio 2015 Upd3, Windows10 x64. Full Optimization. Test code:

#include<Eigen/StdVector>
using HRC = high_resolution_clock;

template <class Type, class Type2, int RM1, int RM2>
Matrix<Type2, 3, 1, RM2> mul(Matrix<Type, 4, 4, RM1> const& m, Matrix<Type2, 3, 1, RM2> const& p) {
    Matrix<Type2, 4, 1, RM2> p4;
    p4.template head<3>() = p;
    p4[3] = 1;
    return (m * p4.template cast<Type>()).template head<3>().template cast<Type2>();
}

template <typename T1, typename T2>
void test(string cap, T1 init, T2 f) {
    static constexpr size_t N = 2000000;
    static constexpr size_t R = 20;

    auto m = init();
    using VT = Matrix<float, 3, 1>;
    vector<VT, aligned_allocator<VT>> data(N);
    for (auto& i:data) i.setRandom();

    cout << cap << ": ";

    int64_t avg = 0, avgmin = numeric_limits<int64_t>::max();
    for (size_t i = 0; i < R; ++i) {
        auto t1 = HRC::now();
        f(data, m);
        auto t2 = HRC::now();
        auto dur = duration_cast<microseconds>(t2 - t1).count();
        avg += dur;
        avgmin = min(avgmin, dur);       
    }
    avg /= R;
    cout << "Avg: " << avg / 1000.0 << ", min: " << avgmin / 1000.0 << endl;
}

int main() {
    test("MatFloat ColMajor",
         [=] {
             Matrix<float, 4, 4, ColMajor> m{};
             m.setRandom();
             return m;
         },
         [=](auto& data, auto const& m) {
             for (size_t j = 0; j < data.size(); ++j) {
                 data[j] = mul(m, data[j]);
             }
         }
    );
    test("MatFloat RowMajor",
         [=] {
             Matrix<float, 4, 4, RowMajor> m{};
             m.setRandom();
             return m;
         },
         [=](auto& data, auto const& m) {
             for (size_t j = 0; j < data.size(); ++j) {
                 data[j] = mul(m, data[j]);
             }
         }
    );
    test("MatDouble ColMajor",
         [=] {
             Matrix<double, 4, 4, ColMajor> m{};
             m.setRandom();
             return m;
         },
         [=](auto& data, auto const& m) {
             for (size_t j = 0; j < data.size(); ++j) {
                 data[j] = mul(m, data[j]);
             }
         }
    );
    test("MatDouble RowMajor",
         [=] {
             Matrix<double, 4, 4, RowMajor> m{};
             m.setRandom();
             return m;
         },
         [=](auto& data, auto const& m) {
             for (size_t j = 0; j < data.size(); ++j) {
                 data[j] = mul(m, data[j]);
             }
         }
    );
    return 0;
}

Results:
MatFloat  ColMajor: Avg: 15.595,  min: 14.881
MatFloat  RowMajor: Avg: 243.246, min: 186.862
MatDouble ColMajor: Avg: 111.187, min: 77.764
MatDouble RowMajor: Avg: 270.19,  min: 222.239

If I try to mul with something like this:

template <class Type, class Type2, int RM1, int RM2>
Matrix<Type2, 3, 1, RM2> mul(Matrix<Type, 4, 4, RM1> const& m, Matrix<Type2, 3, 1, RM2> const& p) {
    using RT = decltype(Type() * Type2());
    Matrix<Type2, 3, 1, RM2> res;
    for (size_t i = 0; i < 3; ++i) {
        RT sum = 0;
        for (size_t j = 0; j < 3; ++j) {
            sum += m(i, j) * p[j];
        }
        sum += m(i, 3);
        res[i] = static_cast<Type2>(sum);
    }
    return res;
}

then RowMajor multiplication runs at the same speed as ColMajor.
Comment 1 Nikolai 2016-09-10 10:58:53 UTC
Same thing with Transform.
Comment 2 Nikolai 2016-09-10 11:00:42 UTC
    test("TrFloat ColMajor",
         [=] {
             Transform<float, 3, Affine, ColMajor> m{};
             m.matrix().setRandom();
             return m;
         },
         [=](auto& data, auto const& m) {
             for (size_t j = 0; j < data.size(); ++j) {
                 data[j] = (m * data[j]).eval();
             }
         }
    );
    test("TrFloat RowMajor",
         [=] {
             Transform<float, 3, Affine, RowMajor> m{};
             m.matrix().setRandom();
             return m;
         },
         [=](auto& data, auto const& m) {
             for (size_t j = 0; j < data.size(); ++j) {
                 data[j] = (m * data[j]).eval();
             }
         }
    );
    test("TrDouble ColMajor",
         [=] {
             Transform<double, 3, Affine, ColMajor> m{};
             m.matrix().setRandom();
             return m;
         },
         [=](auto& data, auto const& m) {
             for (size_t j = 0; j < data.size(); ++j) {
                 data[j] = (m * data[j].template cast<double>()).template cast<float>().eval();
             }
         }
    );
    test("TrDouble RowMajor",
         [=] {
             Transform<double, 3, Affine, RowMajor> m{};
             m.matrix().setRandom();
             return m;
         },
         [=](auto& data, auto const& m) {
             for (size_t j = 0; j < data.size(); ++j) {
                 data[j] = (m * data[j].template cast<double>()).template cast<float>().eval();
             }
         }
    );

Results:
TrFloat  ColMajor: Avg: 17.346,  min: 15.697
TrFloat  RowMajor: Avg: 183.117, min: 166.935
TrDouble ColMajor: Avg: 16.455,  min: 15.88
TrDouble RowMajor: Avg: 175.963, min: 167.702
Comment 3 Christoph Hertzberg 2016-09-12 11:47:34 UTC
Hi, first of all, if you provide examples, please make sure they actually compile (I needed to add several includes and using namespace; also gcc does not seem to like the auto-lambda functions -- that might be a C++14 thing, though). Also use the attachment feature for longer listings.

Regarding the bug: After some cleaning up, I get the following timings with g++-4.8.4 (on an Intel(R) Core(TM) i7 CPU 950 @ 3.07GHz) with sse4.2 enabled:
MatFloat ColMajor: Avg: 8.49, min: 8.366
MatFloat RowMajor: Avg: 33.426, min: 33.195
MatDouble ColMajor: Avg: 23.953, min: 23.833
MatDouble RowMajor: Avg: 37.581, min: 37.362

Partially, the reason is Bug 312 (i.e., inefficient use of haddps). Furthermore, ColMajor is just more efficient here, since the last column can just be added without multiplying by 1.0 first (1.0*x always results in x, even for inf and NaN).
The factor of 10 you experience is likely because some function does not get inlined. Can you have a look at the generated assembly and try to figure out which function is called inside your loop? (Sorry, I can't help much with MSVC myself).
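
To illustrate the layout argument above, here is a minimal scalar sketch (plain C++, no Eigen) of the two access patterns. The names Mat4, Vec3, transformColMajor and transformRowMajor are hypothetical and are not Eigen API; this only shows the shape of the computation, not what Eigen actually generates. In the column-major form the result is a sum of scaled columns and the last column is added as-is, so no multiplication by 1 is needed. In the row-major form each output coefficient is a separate dot product, which needs a horizontal reduction when vectorized.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper types: a 4x4 matrix stored as 16 contiguous floats.
using Mat4 = std::array<float, 16>;
using Vec3 = std::array<float, 3>;

// Column-major storage: element (r, c) lives at m[c * 4 + r].
// The result is a linear combination of the columns. The last column is
// simply added, because the homogeneous coordinate is 1.
inline Vec3 transformColMajor(const Mat4& m, const Vec3& p) {
    Vec3 res{m[12], m[13], m[14]};  // start from the last column
    for (std::size_t c = 0; c < 3; ++c) {
        for (std::size_t r = 0; r < 3; ++r) {
            res[r] += m[c * 4 + r] * p[c];  // contiguous reads along a column
        }
    }
    return res;
}

// Row-major storage: element (r, c) lives at m[r * 4 + c].
// Each output coefficient is a dot product of one row with (x, y, z, 1),
// i.e. a horizontal sum over contiguous elements.
inline Vec3 transformRowMajor(const Mat4& m, const Vec3& p) {
    const std::array<float, 4> p4{p[0], p[1], p[2], 1.0f};
    Vec3 res{};
    for (std::size_t r = 0; r < 3; ++r) {
        float sum = 0.0f;
        for (std::size_t c = 0; c < 4; ++c) {
            sum += m[r * 4 + c] * p4[c];
        }
        res[r] = sum;
    }
    return res;
}
```

Both functions compute the same affine transform when given the same matrix in the matching storage order; only the memory access pattern differs.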
Comment 4 Christoph Hertzberg 2016-09-12 11:59:38 UTC
Addendum:
With your manual multiplication I get:
MatFloat ColMajor: Avg: 9.688, min: 9.605
MatFloat RowMajor: Avg: 22.339, min: 22.071
MatDouble ColMajor: Avg: 14.97, min: 14.881
MatDouble RowMajor: Avg: 33.213, min: 32.936

What do you get on MSVC?

Replacing the manual construction of p4 (in the original version) by
  p.homogeneous()
(requires #include <Eigen/Geometry>) gives me:
MatFloat ColMajor: Avg: 12.575, min: 12.32
MatFloat RowMajor: Avg: 24.775, min: 24.382
MatDouble ColMajor: Avg: 24.322, min: 23.947
MatDouble RowMajor: Avg: 36.702, min: 36.107

This means there is some room for improvement, even on gcc.
Comment 5 Nikolai 2016-09-12 13:03:02 UTC
As the disassembly shows, VS2015 doesn't inline `Eigen::internal::product_evaluator<Eigen::Product<Eigen::Matrix<float,4,4,1,4,4>,Eigen::Matrix<float,4,1,0,4,1>,1>,3,Eigen::DenseShape,Eigen::DenseShape,float,float>::coeff` when the RowMajor option is set, even with the Inline Function Expansion flag set to Any Suitable (/Ob2).
Comment 6 Nikolai 2016-09-12 13:28:07 UTC
But even if I force inlining by replacing inline with EIGEN_STRONG_INLINE in the functions that are not inlined, it still runs very slowly and generates three times more assembly code than ColMajor. The assembly listing is in the attachment.
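
For readers unfamiliar with forced inlining, a minimal sketch of the general technique follows. MY_FORCE_INLINE and dot4 are hypothetical names made up for this illustration; the macro is a stand-in and not Eigen's actual definition of EIGEN_STRONG_INLINE. It relies only on the compiler-specific keywords `__forceinline` (MSVC) and `__attribute__((always_inline))` (GCC/Clang).

```cpp
// Hypothetical macro: ask the compiler to inline regardless of its own
// cost heuristics, falling back to plain `inline` on unknown compilers.
#if defined(_MSC_VER)
#  define MY_FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#  define MY_FORCE_INLINE inline __attribute__((always_inline))
#else
#  define MY_FORCE_INLINE inline
#endif

// A small coefficient-style kernel of the kind that should never end up
// as an out-of-line call inside a hot loop.
MY_FORCE_INLINE float dot4(const float* a, const float* b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}
```

Forcing inlining only removes the call overhead; as noted above, it does not by itself make the generated row-major code as compact as the column-major code.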
Comment 7 Nikolai 2016-09-12 13:29:02 UTC
Created attachment 728
Assembly listing for Col and Row major Transformation mult
Comment 8 Nikolai 2016-09-12 13:31:50 UTC
Created attachment 729
Code for benchmark

Benchmark results:
MatFloat ColMajor 1: Avg: 73.384, min: 71.113
MatFloat ColMajor 2: Avg: 46.094, min: 45.129
MatFloat ColMajor 3: Avg: 10.38, min: 10.119
MatFloat ColMajor 4: Avg: 10.593, min: 10.043
MatFloat ColMajor Fast: Avg: 7.048, min: 6.8
MatFloat RowMajor 1: Avg: 212.208, min: 209.467
MatFloat RowMajor 2: Avg: 115.892, min: 114.243
MatFloat RowMajor 3: Avg: 115.658, min: 113.437
MatFloat RowMajor 4: Avg: 116.359, min: 114.114
MatFloat RowMajor Fast: Avg: 6.975, min: 6.762
MatFloat RowMajor SSE: Avg: 6.105, min: 5.915
TrFloat ColMajor: Avg: 11.039, min: 10.815
TrFloat RowMajor: Avg: 123.501, min: 121.11
Comment 9 Gael Guennebaud 2016-09-13 07:19:54 UTC
Here is what I get with "clang++ -std=c++14 -DNDEBUG -msse4 -O3" on a  Haswell @2.6GHz:

MatFloat ColMajor 1: Avg: 4.146, min: 4.089
MatFloat ColMajor 2: Avg: 4.585, min: 4.212
MatFloat ColMajor 3: Avg: 4.227, min: 4.055
MatFloat ColMajor 4: Avg: 4.154, min: 4.107
MatFloat ColMajor Fast: Avg: 5.723, min: 5.194
MatFloat RowMajor 1: Avg: 8.995, min: 8.301
MatFloat RowMajor 2: Avg: 8.767, min: 8.578
MatFloat RowMajor 3: Avg: 8.755, min: 8.548
MatFloat RowMajor 4: Avg: 9.66, min: 8.945
MatFloat RowMajor Fast: Avg: 5.397, min: 5.195
MatFloat RowMajor SSE: Avg: 4.835, min: 4.694
TrFloat ColMajor: Avg: 5.055, min: 4.67
TrFloat RowMajor: Avg: 9.221, min: 8.561

With Intel Compiler:

MatFloat ColMajor 1: Avg: 15.943, min: 15.775
MatFloat ColMajor 2: Avg: 5.492, min: 5.332
MatFloat ColMajor 3: Avg: 5.635, min: 5.29
MatFloat ColMajor 4: Avg: 5.881, min: 5.386
MatFloat ColMajor Fast: Avg: 5.447, min: 5.118
MatFloat RowMajor 1: Avg: 28.182, min: 26.78
MatFloat RowMajor 2: Avg: 27.205, min: 26.289
MatFloat RowMajor 3: Avg: 27.539, min: 26.425
MatFloat RowMajor 4: Avg: 28.775, min: 27.005
MatFloat RowMajor Fast: Avg: 5.403, min: 5.111
MatFloat RowMajor SSE: Avg: 4.744, min: 4.527
TrFloat ColMajor: Avg: 4.078, min: 3.893
TrFloat RowMajor: Avg: 31.083, min: 27.413

With GCC:

MatFloat ColMajor 1: Avg: 5.345, min: 5.204
MatFloat ColMajor 2: Avg: 4.077, min: 3.918
MatFloat ColMajor 3: Avg: 4.552, min: 4.18
MatFloat ColMajor 4: Avg: 5.068, min: 4.007
MatFloat ColMajor Fast: Avg: 5.044, min: 4.446
MatFloat RowMajor 1: Avg: 23.524, min: 20.414
MatFloat RowMajor 2: Avg: 23.614, min: 21.394
MatFloat RowMajor 3: Avg: 23.7, min: 21.56
MatFloat RowMajor 4: Avg: 22.083, min: 21.384
MatFloat RowMajor Fast: Avg: 4.313, min: 4.031
MatFloat RowMajor SSE: Avg: 6.06, min: 4.896
TrFloat ColMajor: Avg: 4.732, min: 3.947
TrFloat RowMajor: Avg: 22.606, min: 21.592
Comment 10 Nobody 2019-12-04 16:13:37 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/1294.
