Visual Studio 2015 Update 3, Windows 10 x64, Full Optimization.

Test code (includes and using-directives added so it compiles standalone):

#include <Eigen/StdVector>
#include <Eigen/Dense>
#include <algorithm>
#include <chrono>
#include <iostream>
#include <limits>
#include <string>
#include <vector>

using namespace std;
using namespace std::chrono;
using namespace Eigen;

using HRC = high_resolution_clock;

template <class Type, class Type2, int RM1, int RM2>
Matrix<Type2, 3, 1, RM2> mul(Matrix<Type, 4, 4, RM1> const& m,
                             Matrix<Type2, 3, 1, RM2> const& p) {
  Matrix<Type2, 4, 1, RM2> p4;
  p4.template head<3>() = p;
  p4[3] = 1;
  return (m * p4.template cast<Type>()).template head<3>().template cast<Type2>();
}

template <typename T1, typename T2>
void test(string cap, T1 init, T2 f) {
  static constexpr size_t N = 2000000;
  static constexpr size_t R = 20;
  auto m = init();
  using VT = Matrix<float, 3, 1>;
  vector<VT, aligned_allocator<VT>> data(N);
  for (auto& i : data) i.setRandom();
  cout << cap << ": ";
  int64_t avg = 0, avgmin = numeric_limits<int64_t>::max();
  for (size_t i = 0; i < R; ++i) {
    auto t1 = HRC::now();
    f(data, m);
    auto t2 = HRC::now();
    auto dur = duration_cast<microseconds>(t2 - t1).count();
    avg += dur;
    avgmin = min<int64_t>(avgmin, dur);
  }
  avg /= R;
  cout << "Avg: " << avg / 1000.0 << ", min: " << avgmin / 1000.0 << endl;
}

int main() {
  test("MatFloat ColMajor",
       [=] { Matrix<float, 4, 4, ColMajor> m{}; m.setRandom(); return m; },
       [=](auto& data, auto const& m) {
         for (size_t j = 0; j < data.size(); ++j) data[j] = mul(m, data[j]);
       });
  test("MatFloat RowMajor",
       [=] { Matrix<float, 4, 4, RowMajor> m{}; m.setRandom(); return m; },
       [=](auto& data, auto const& m) {
         for (size_t j = 0; j < data.size(); ++j) data[j] = mul(m, data[j]);
       });
  test("MatDouble ColMajor",
       [=] { Matrix<double, 4, 4, ColMajor> m{}; m.setRandom(); return m; },
       [=](auto& data, auto const& m) {
         for (size_t j = 0; j < data.size(); ++j) data[j] = mul(m, data[j]);
       });
  test("MatDouble RowMajor",
       [=] { Matrix<double, 4, 4, RowMajor> m{}; m.setRandom(); return m; },
       [=](auto& data, auto const& m) {
         for (size_t j = 0; j < data.size(); ++j) data[j] = mul(m, data[j]);
       });
  return 0;
}

Results:

MatFloat ColMajor:  Avg: 15.595,  min: 14.881
MatFloat RowMajor:  Avg: 243.246, min: 186.862
MatDouble ColMajor: Avg: 111.187, min: 77.764
MatDouble RowMajor: Avg: 270.19,  min: 222.239

If I instead multiply manually like this:

template <class Type, class Type2, int RM1, int RM2>
Matrix<Type2, 3, 1, RM2> mul(Matrix<Type, 4, 4, RM1> const& m,
                             Matrix<Type2, 3, 1, RM2> const& p) {
  using RT = decltype(Type() * Type2());
  Matrix<Type2, 3, 1, RM2> res;
  for (size_t i = 0; i < 3; ++i) {
    RT sum = 0;
    for (size_t j = 0; j < 3; ++j) {
      sum += m(i, j) * p[j];
    }
    sum += m(i, 3);
    res[i] = static_cast<Type2>(sum);
  }
  return res;
}

then the RowMajor multiplication runs at the same speed as ColMajor.
Same thing with Transform.
test("TrFloat ColMajor",
     [=] { Transform<float, 3, Affine, ColMajor> m{}; m.matrix().setRandom(); return m; },
     [=](auto& data, auto const& m) {
       for (size_t j = 0; j < data.size(); ++j) data[j] = (m * data[j]).eval();
     });
test("TrFloat RowMajor",
     [=] { Transform<float, 3, Affine, RowMajor> m{}; m.matrix().setRandom(); return m; },
     [=](auto& data, auto const& m) {
       for (size_t j = 0; j < data.size(); ++j) data[j] = (m * data[j]).eval();
     });
test("TrDouble ColMajor",
     [=] { Transform<double, 3, Affine, ColMajor> m{}; m.matrix().setRandom(); return m; },
     [=](auto& data, auto const& m) {
       for (size_t j = 0; j < data.size(); ++j)
         data[j] = (m * data[j].cast<double>()).cast<float>().eval();
     });
test("TrDouble RowMajor",
     [=] { Transform<double, 3, Affine, RowMajor> m{}; m.matrix().setRandom(); return m; },
     [=](auto& data, auto const& m) {
       for (size_t j = 0; j < data.size(); ++j)
         data[j] = (m * data[j].cast<double>()).cast<float>().eval();
     });

Results:

TrFloat ColMajor:  Avg: 17.346,  min: 15.697
TrFloat RowMajor:  Avg: 183.117, min: 166.935
TrDouble ColMajor: Avg: 16.455,  min: 15.88
TrDouble RowMajor: Avg: 175.963, min: 167.702
Hi, first of all, if you provide examples, please make sure they actually compile (I needed to add several includes and using-namespace directives; also gcc does not seem to like the generic auto lambdas -- that might be a C++14 thing, though). Also, please use the attachment feature for longer listings.

Regarding the bug: after some cleaning up, I get the following timings with g++-4.8.4 (on an Intel(R) Core(TM) i7 CPU 950 @ 3.07GHz) with SSE4.2 enabled:

MatFloat ColMajor:  Avg: 8.49,   min: 8.366
MatFloat RowMajor:  Avg: 33.426, min: 33.195
MatDouble ColMajor: Avg: 23.953, min: 23.833
MatDouble RowMajor: Avg: 37.581, min: 37.362

Part of the reason is Bug 312 (i.e., inefficient use of haddps). Furthermore, ColMajor is simply more efficient here, since the last column can be added without first multiplying by 1.0 (1.0*x always yields x, even for inf and NaN). The factor of 10 you experience is likely because some function does not get inlined. Can you have a look at the generated assembly and try to figure out which function is called inside your loop? (Sorry, I can't help much with MSVC myself.)
Addendum: with your manual multiplication I get:

MatFloat ColMajor:  Avg: 9.688,  min: 9.605
MatFloat RowMajor:  Avg: 22.339, min: 22.071
MatDouble ColMajor: Avg: 14.97,  min: 14.881
MatDouble RowMajor: Avg: 33.213, min: 32.936

What do you get on MSVC?

Replacing the manual construction of p4 (in the original version) by p.homogeneous() (requires #include <Eigen/Geometry>) gives me:

MatFloat ColMajor:  Avg: 12.575, min: 12.32
MatFloat RowMajor:  Avg: 24.775, min: 24.382
MatDouble ColMajor: Avg: 24.322, min: 23.947
MatDouble RowMajor: Avg: 36.702, min: 36.107

This means there is some room for improvement, even on gcc.
As the disassembly shows, VS2015 does not inline `Eigen::internal::product_evaluator<Eigen::Product<Eigen::Matrix<float,4,4,1,4,4>,Eigen::Matrix<float,4,1,0,4,1>,1>,3,Eigen::DenseShape,Eigen::DenseShape,float,float>::coeff` when the RowMajor option is set, even with the Inline Function Expansion flag set to Any Suitable (/Ob2).
But even if I force inlining by replacing inline with EIGEN_STRONG_INLINE in the functions that are not inlined, it still runs very slowly and generates three times more assembly code than ColMajor. Assembly listing in attachment.
Created attachment 728 [details] Assembly listing for Col and Row major Transformation mult
Created attachment 729 [details] Code for benchmark

Benchmark results:

MatFloat ColMajor 1:    Avg: 73.384,  min: 71.113
MatFloat ColMajor 2:    Avg: 46.094,  min: 45.129
MatFloat ColMajor 3:    Avg: 10.38,   min: 10.119
MatFloat ColMajor 4:    Avg: 10.593,  min: 10.043
MatFloat ColMajor Fast: Avg: 7.048,   min: 6.8
MatFloat RowMajor 1:    Avg: 212.208, min: 209.467
MatFloat RowMajor 2:    Avg: 115.892, min: 114.243
MatFloat RowMajor 3:    Avg: 115.658, min: 113.437
MatFloat RowMajor 4:    Avg: 116.359, min: 114.114
MatFloat RowMajor Fast: Avg: 6.975,   min: 6.762
MatFloat RowMajor SSE:  Avg: 6.105,   min: 5.915
TrFloat ColMajor:       Avg: 11.039,  min: 10.815
TrFloat RowMajor:       Avg: 123.501, min: 121.11
Here is what I get with "clang++ -std=c++14 -DNDEBUG -msse4 -O3" on a Haswell @ 2.6GHz:

MatFloat ColMajor 1:    Avg: 4.146, min: 4.089
MatFloat ColMajor 2:    Avg: 4.585, min: 4.212
MatFloat ColMajor 3:    Avg: 4.227, min: 4.055
MatFloat ColMajor 4:    Avg: 4.154, min: 4.107
MatFloat ColMajor Fast: Avg: 5.723, min: 5.194
MatFloat RowMajor 1:    Avg: 8.995, min: 8.301
MatFloat RowMajor 2:    Avg: 8.767, min: 8.578
MatFloat RowMajor 3:    Avg: 8.755, min: 8.548
MatFloat RowMajor 4:    Avg: 9.66,  min: 8.945
MatFloat RowMajor Fast: Avg: 5.397, min: 5.195
MatFloat RowMajor SSE:  Avg: 4.835, min: 4.694
TrFloat ColMajor:       Avg: 5.055, min: 4.67
TrFloat RowMajor:       Avg: 9.221, min: 8.561

With the Intel Compiler:

MatFloat ColMajor 1:    Avg: 15.943, min: 15.775
MatFloat ColMajor 2:    Avg: 5.492,  min: 5.332
MatFloat ColMajor 3:    Avg: 5.635,  min: 5.29
MatFloat ColMajor 4:    Avg: 5.881,  min: 5.386
MatFloat ColMajor Fast: Avg: 5.447,  min: 5.118
MatFloat RowMajor 1:    Avg: 28.182, min: 26.78
MatFloat RowMajor 2:    Avg: 27.205, min: 26.289
MatFloat RowMajor 3:    Avg: 27.539, min: 26.425
MatFloat RowMajor 4:    Avg: 28.775, min: 27.005
MatFloat RowMajor Fast: Avg: 5.403,  min: 5.111
MatFloat RowMajor SSE:  Avg: 4.744,  min: 4.527
TrFloat ColMajor:       Avg: 4.078,  min: 3.893
TrFloat RowMajor:       Avg: 31.083, min: 27.413

With GCC:

MatFloat ColMajor 1:    Avg: 5.345,  min: 5.204
MatFloat ColMajor 2:    Avg: 4.077,  min: 3.918
MatFloat ColMajor 3:    Avg: 4.552,  min: 4.18
MatFloat ColMajor 4:    Avg: 5.068,  min: 4.007
MatFloat ColMajor Fast: Avg: 5.044,  min: 4.446
MatFloat RowMajor 1:    Avg: 23.524, min: 20.414
MatFloat RowMajor 2:    Avg: 23.614, min: 21.394
MatFloat RowMajor 3:    Avg: 23.7,   min: 21.56
MatFloat RowMajor 4:    Avg: 22.083, min: 21.384
MatFloat RowMajor Fast: Avg: 4.313,  min: 4.031
MatFloat RowMajor SSE:  Avg: 6.06,   min: 4.896
TrFloat ColMajor:       Avg: 4.732,  min: 3.947
TrFloat RowMajor:       Avg: 22.606, min: 21.592
-- GitLab Migration Automatic Message -- This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/1294.