Eigen::Affine3f m;

1)
Eigen::Vector3f v(p[0], p[1], p[2]);
Eigen::Vector3f res = m * v;

2)
Eigen::Vector4f v(p[0], p[1], p[2], 1);
Eigen::Vector4f res = m.matrix() * v;

The first example is about 4 times slower than the second! Compiled with Visual Studio 2015 Update 2 with full optimization.
I cannot reproduce such a x4 factor with clang or gcc, but I indeed observed that the current evaluation strategy is not optimal. See the attached exhaustive benchmark. The results with clang+SSE2+i7 are:

         A*v          A.mat()*h    A.mat()*[v,1]  A*[v,1]      A*h
  1f : 0.00158264 - 0.00184644 - 0.00145647 - 0.0013189  - 0.00158265
  2f : 0.00191453 - 0.00263772 - 0.00184641 - 0.00189244 - 0.00237394
  3f : 0.00300886 - 0.00211027 - 0.00239747 - 0.00391927 - 0.00342901
  4f : 0.00237394 - 0.00369411 - 0.00237395 - 0.00237394 - 0.0026797
  5f : 0.00474868 - 0.00527828 - 0.00369278 - 0.00606751 - 0.00527539
  1d : 0.00157213 - 0.00131888 - 0.00139887 - 0.00144739 - 0.00158264
  2d : 0.00158266 - 0.00237394 - 0.00184642 - 0.00188674 - 0.00184641
  3d : 0.00263771 - 0.00266235 - 0.00248774 - 0.00274708 - 0.00290148
  4d : 0.00316524 - 0.00456824 - 0.00316524 - 0.00316525 - 0.00369278
  5d : 0.00555547 - 0.00481819 - 0.00501457 - 0.00771533 - 0.00637442

According to this benchmark, the best strategy would thus be:

  typedef Matrix<Scalar,Dim+1,1> VecH;
  VecH BH, CH;
  BH << B, 1;
  CH = A.matrix() * BH;
  C = CH.template head<Dim>();

The reason is that for appropriate sizes we get vectorization, and if no vectorization is possible, then the compiler knows that the last entry of the result is not needed and can be removed.
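To make the semantics of the proposed strategy concrete, here is a minimal Eigen-free sketch (helper names `transform_padded` and `transform_split` are mine, not Eigen's): it checks that padding v with a 1, multiplying by the full 4x4 matrix, and dropping the last entry gives the same result as the linear part times v plus the translation column.

```cpp
#include <cassert>
#include <cmath>

// 4x4 affine matrix stored row-major, last row assumed to be (0,0,0,1).

// Strategy from the comment above: C = (A * [v,1]).head(3).
void transform_padded(const float A[4][4], const float v[3], float out[3]) {
    float h[4] = {v[0], v[1], v[2], 1.0f};   // BH << B, 1
    for (int r = 0; r < 3; ++r) {            // only the first 3 rows are kept
        out[r] = 0.0f;
        for (int c = 0; c < 4; ++c) out[r] += A[r][c] * h[c];
    }
}

// Reference semantics of an affine transform: C = A.linear()*v + A.translation().
void transform_split(const float A[4][4], const float v[3], float out[3]) {
    for (int r = 0; r < 3; ++r) {
        out[r] = A[r][3];                    // translation column
        for (int c = 0; c < 3; ++c) out[r] += A[r][c] * v[c];
    }
}

// Returns true iff both strategies agree on a sample matrix and vector.
bool agree() {
    const float A[4][4] = {{1, 2, 3, 4},
                           {5, 6, 7, 8},
                           {9, 10, 11, 12},
                           {0, 0, 0, 1}};
    const float v[3] = {0.5f, -1.0f, 2.0f};
    float a[3], b[3];
    transform_padded(A, v, a);
    transform_split(A, v, b);
    for (int i = 0; i < 3; ++i)
        if (std::fabs(a[i] - b[i]) > 1e-5f) return false;
    return true;
}
```

The padded form trades three redundant multiplies by 1 for a product whose 4-entry columns map directly onto SSE registers, which is where the speedup in the benchmark comes from.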
Created attachment 696 [details] benchmark for affine*vector product
My test code:

const int iters = 1000000;

long long testMat() {
  vector<Eigen::Vector3f> points;
  for (int i = 0; i < iters; ++i) {
    points.push_back(Eigen::Vector3f::Random());
  }
  Eigen::Affine3f m(Eigen::Matrix4f::Random());
  auto t1 = high_resolution_clock::now();
  for (int i = 0; i < iters; ++i) {
    auto &p = points[i];
    Eigen::Vector4f vv(p[0], p[1], p[2], 1);
    vv = m.matrix() * vv;
    p = vv.head<3>() * (1 / vv[3]);
  }
  auto t2 = high_resolution_clock::now();
  auto cnt = duration_cast<microseconds>(t2 - t1).count();
  cout << "Elapsed " << cnt << endl;
  return cnt;
}

long long testAff() {
  vector<Eigen::Vector3f> points;
  for (int i = 0; i < iters; ++i) {
    points.push_back(Eigen::Vector3f::Random());
  }
  Eigen::Affine3f m(Eigen::Matrix4f::Random());
  auto t1 = high_resolution_clock::now();
  for (int i = 0; i < iters; ++i) {
    auto p = points[i];
    points[i] = m * p;
  }
  auto t2 = high_resolution_clock::now();
  auto cnt = duration_cast<microseconds>(t2 - t1).count();
  cout << "Elapsed " << cnt << endl;
  return cnt;
}

int main() {
  long long avgAff = 0, avgMat = 0;
  const int N = 20;
  for (int n = 0; n < N; ++n) {
    cout << endl;
    cout << "Aff ";
    avgAff += testAff();
    cout << "Mat ";
    avgMat += testMat();
  }
  cout << endl;
  cout << "Aff " << avgAff / N << endl;
  cout << "Mat " << avgMat / N << endl;
  cout << "Diff " << (float)avgAff / avgMat << endl;
  return 0;
}
Another variant would be:

  CH = A.matrix().template leftCols<Dim>() * v + A.matrix().col(Dim);
  C = CH.template head<Dim>();

This actually often generates the exact same code as your foo3 variant on gcc, i.e., sometimes gcc was already smart enough to optimize away the multiplication by 1. This revealed, however, that we should add (at least) these two product selectors:

  template<int M> struct product_type_selector<M, 1, 1> { enum { ret = LazyCoeffBasedProductMode }; };
  template<int N> struct product_type_selector<1, N, 1> { enum { ret = LazyCoeffBasedProductMode }; };

Alternatively, make product_type_selector<M,N,1> = LazyCoeffBasedProductMode and specialize product_type_selector<Large,Large,1> = OuterProduct.

Otherwise, Matrix<S,X,1>*Matrix<S,1,1> is evaluated using an inefficient (non-inlined) outer product. It would be even better if this could be reduced to a Vector*Scalar product.
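Why an N x 1 times 1 x 1 product should reduce to Vector*Scalar can be shown with plain arithmetic. The sketch below (function name `outer_equals_scale` is mine) runs a generic triple-loop matrix product, which is what a non-inlined outer-product kernel effectively computes, and checks that the result is just the column vector scaled by the single entry of the 1x1 matrix:

```cpp
// C = A*B where A is 4x1 and B is 1x1, computed by a generic matrix product,
// then compared against a plain vector-times-scalar scaling.
bool outer_equals_scale() {
    const int N = 4;
    double A[N][1] = {{1.0}, {-2.0}, {0.5}, {3.0}};  // N x 1 column vector
    double B[1][1] = {{2.5}};                        // 1 x 1 "matrix"
    double C[N][1] = {};
    // Generic product loop: C(i,j) = sum_k A(i,k)*B(k,j); here k and j are
    // both fixed to 0, so each entry is a single multiply.
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < 1; ++j)
            for (int k = 0; k < 1; ++k)
                C[i][j] += A[i][k] * B[k][j];
    // Same result as scaling the vector by B(0,0).
    for (int i = 0; i < N; ++i)
        if (C[i][0] != A[i][0] * B[0][0]) return false;
    return true;
}
```

Since the inner dimension is 1, dispatching this through a full outer-product path buys nothing; a coefficient-based (lazy) evaluation inlines to exactly the scaling loop above.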
For the 1x1 case: https://bitbucket.org/eigen/eigen/commits/e0b52ff4af0b/
After several tries, this seems to provide a good tradeoff across different compilers: https://bitbucket.org/eigen/eigen/commits/d0e244c462f0
Still 4 times slower. The bottleneck is the rhs << other, 1; operation. I suggest replacing it with:

  rhs.template head<Dim>() = other;
  rhs[Dim] = 1;

In that case the performance is equal.
Alright (with gcc or clang it does not make any difference):
https://bitbucket.org/eigen/eigen/commits/c9f83e3a9bbd/

Summary: Bug 1201: improve code generation of affine*vec with MSVC
-- GitLab Migration Automatic Message -- This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/1201.