Bug 1201 - Very slow Affine3f to Vector3f multiplication
Status: RESOLVED FIXED
Product: Eigen
Classification: Unclassified
Component: Geometry
Version: 3.3 (current stable)
Hardware: All
OS: All
Importance: Normal (severity Unknown)
Assigned To: Nobody
Depends on:
Blocks: 3.3
Reported: 2016-04-14 15:00 UTC by Nikolai
Modified: 2016-06-06 14:47 UTC
CC: 5 users



Attachments
benchmark for affine*vector product (1.94 KB, text/x-csrc)
2016-04-15 10:16 UTC, Gael Guennebaud

Description Nikolai 2016-04-14 15:00:47 UTC
Eigen::Affine3f m;
1)
    Eigen::Vector3f v(p[0], p[1], p[2]);
    Eigen::Vector3f res = m*v;

2)
    Eigen::Vector4f v(p[0], p[1], p[2], 1);
    Eigen::Vector4f res = m.matrix()*v;

The first example is 4x slower than the second! Compiled with Visual Studio 2015 Update 2 with full optimization.
Comment 1 Gael Guennebaud 2016-04-15 10:16:06 UTC
I cannot reproduce such a 4x factor with clang or gcc, but I did observe that the current evaluation strategy is not optimal.

See the attached exhaustive benchmark. The results with clang+SSE2+i7 are:

        A*v       A.mat()*h   A.mat()*[v,1]   A*[v,1]        A*h
1f : 0.00158264 - 0.00184644 - 0.00145647 - 0.0013189 - 0.00158265
2f : 0.00191453 - 0.00263772 - 0.00184641 - 0.00189244 - 0.00237394
3f : 0.00300886 - 0.00211027 - 0.00239747 - 0.00391927 - 0.00342901
4f : 0.00237394 - 0.00369411 - 0.00237395 - 0.00237394 - 0.0026797
5f : 0.00474868 - 0.00527828 - 0.00369278 - 0.00606751 - 0.00527539
1d : 0.00157213 - 0.00131888 - 0.00139887 - 0.00144739 - 0.00158264
2d : 0.00158266 - 0.00237394 - 0.00184642 - 0.00188674 - 0.00184641
3d : 0.00263771 - 0.00266235 - 0.00248774 - 0.00274708 - 0.00290148
4d : 0.00316524 - 0.00456824 - 0.00316524 - 0.00316525 - 0.00369278
5d : 0.00555547 - 0.00481819 - 0.00501457 - 0.00771533 - 0.00637442

According to this benchmark, the best strategy would thus be:

  typedef Matrix<Scalar,Dim+1,1> VecH;
  VecH BH, CH;
  BH << B, 1;
  CH = A.matrix() * BH;
  C = CH.template head<Dim>();

The reason is that for appropriate sizes we get vectorization, and if no vectorization is possible, then the compiler knows that the last entry of the result is not needed and can eliminate its computation.
Comment 2 Gael Guennebaud 2016-04-15 10:16:54 UTC
Created attachment 696 [details]
benchmark for affine*vector product
Comment 3 Nikolai 2016-04-15 12:47:38 UTC
My test code:


#include <chrono>
#include <iostream>
#include <vector>
#include <Eigen/Geometry>

using namespace std;
using namespace std::chrono;

const int iters = 1000000;

long long testMat() {
  vector<Eigen::Vector3f> points;
  for (int i = 0; i < iters; ++i) {
    points.push_back(Eigen::Vector3f::Random());
  }

  Eigen::Affine3f m(Eigen::Matrix4f::Random());

  auto t1 = high_resolution_clock::now();

  for (int i = 0; i < iters; ++i) {
    auto &p = points[i];
    Eigen::Vector4f vv(p[0], p[1], p[2], 1);
    vv = m.matrix() * vv;
    p = vv.head<3>() * (1 / vv[3]);
  }

  auto t2 = high_resolution_clock::now();
  auto cnt = duration_cast<microseconds>(t2 - t1).count();
  cout << "Elapsed " << cnt << endl;
  return cnt;
}

long long testAff() {
  vector<Eigen::Vector3f> points;
  for (int i = 0; i < iters; ++i) {
    points.push_back(Eigen::Vector3f::Random());
  }

  Eigen::Affine3f m(Eigen::Matrix4f::Random());

  auto t1 = high_resolution_clock::now();

  for (int i = 0; i < iters; ++i) {
    auto p = points[i];
    points[i] = m * p;
  }

  auto t2 = high_resolution_clock::now();
  auto cnt = duration_cast<microseconds>(t2 - t1).count();
  cout << "Elapsed " << cnt << endl;
  return cnt;
}

int main() {
  long long avgAff = 0, avgMat = 0;
  const int N = 20;

  for (int n = 0; n < N; ++n) {
    cout << endl;

    cout << "Aff ";
    avgAff += testAff();

    cout << "Mat ";
    avgMat += testMat();
  }

  cout << endl;
  cout << "Aff    " << avgAff / N << endl;
  cout << "Mat    " << avgMat / N << endl;
  cout << "Diff   " << (float)avgAff / avgMat << endl;
  return 0;
}
Comment 4 Christoph Hertzberg 2016-04-15 13:07:35 UTC
Another variant would be:
  CH = A.matrix().template leftCols<Dim>()*v+A.matrix().col(Dim);
  C  = CH.template head<Dim>();

This actually often generates exactly the same code as your foo3 variant on gcc; i.e., sometimes gcc was already smart enough to optimize away the multiplication by 1.

This revealed however that we should add (at least) these two product selectors:

template<int M> struct product_type_selector<M, 1, 1> { enum { ret = LazyCoeffBasedProductMode }; };
template<int N> struct product_type_selector<1, N, 1> { enum { ret = LazyCoeffBasedProductMode }; };

Alternatively, make 
  product_type_selector<M,N,1> = LazyCoeffBasedProductMode
and specialize
  product_type_selector<Large,Large,1> = OuterProduct

Otherwise, Matrix<S,X,1>*Matrix<S,1,1> is evaluated using an inefficient (non-inlined) outer product. It would be even better if this could be reduced to a Vector*Scalar product.
Comment 5 Gael Guennebaud 2016-04-16 19:49:01 UTC
For the 1x1 case: https://bitbucket.org/eigen/eigen/commits/e0b52ff4af0b/
Comment 6 Gael Guennebaud 2016-05-19 14:11:11 UTC
After several tries, this seems to provide a good tradeoff across different compilers:
https://bitbucket.org/eigen/eigen/commits/d0e244c462f0
Comment 7 Nikolai 2016-06-06 10:27:07 UTC
Still 4 times slower. The bottleneck is the "rhs << other, 1;" operation. I suggest replacing it with:

  rhs.template head<Dim>() = other;
  rhs[Dim] = 1;

With that change, performance is equal.
Comment 8 Gael Guennebaud 2016-06-06 14:47:55 UTC
Alright (with gcc or clang it does not make any difference)

https://bitbucket.org/eigen/eigen/commits/c9f83e3a9bbd/
Summary:     Bug 1201: improve code generation of affine*vec with MSVC
