Bug 404 - Optimize products for small objects
Summary: Optimize products for small objects
Status: RESOLVED FIXED
Alias: None
Product: Eigen
Classification: Unclassified
Component: Core - matrix products
Version: 3.1
Hardware: All
OS: All
Importance: Normal Optimization
Assignee: Nobody
URL:
Whiteboard:
Keywords:
Duplicates: 469
Depends on: 359 469
Blocks: 3.3
Reported: 2012-01-12 08:45 UTC by Gael Guennebaud
Modified: 2016-01-28 21:34 UTC
CC: 3 users



Attachments:
- Test case for small products (1.33 KB, text/plain), 2014-03-06 16:12 UTC, Christoph Hertzberg
- Helper program to find a good threshold to fallback to lazyProduct (2.42 KB, application/octet-stream), 2014-04-28 16:46 UTC, Gael Guennebaud
- Assembly for Eigen 3.2.7 vs. 3.3 in Dynamic lazyProduct benchmark (45.32 KB, application/x-gzip), 2015-11-30 12:42 UTC, Daniel Vollmer

Description Gael Guennebaud 2012-01-12 08:45:52 UTC
Both gemv and gemm like operations could be significantly improved for small objects. First thing to do is implement an exhaustive performance test comparing various implementations and try it on different compiler/architecture.

See also the "Eigen 2 to Eigen 3 performance regressions with mapped matrices" and "Blas performance on mapped matrices" threads on the mailing list.
Comment 1 Christoph Hertzberg 2012-03-09 13:00:21 UTC
Another thing that could be drastically improved is products involving Matrix<Scalar, FIXED, Dynamic, ColMajor>. If I see it correctly, they all fall back to the generic product at the moment -- which causes a lot of overhead. Usually, a naive loop accessing one (unrolled) column at a time should be more efficient. The same is obviously true for Matrix<Scalar, Dynamic, FIXED, RowMajor>.
Comment 2 Christoph Hertzberg 2013-10-21 14:14:32 UTC
A suggestion to reduce code duplication:
If the destination is RowMajor we could essentially call:
  dst.transpose() = rhs.transpose() * lhs.transpose();

I don't know where in the product logic this would best be done.
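The layout identity behind this suggestion can be sketched without Eigen: a row-major matrix has exactly the storage of its transpose in column-major order, so a single column-major kernel can serve a row-major destination by swapping and "transposing" the operands, as in dst.transpose() = rhs.transpose() * lhs.transpose(). All names and the flat-vector storage below are made up for this illustration; this is not Eigen's product logic:

```cpp
#include <cassert>
#include <vector>

// Flat storage for a matrix; interpretation (row- vs column-major) is by convention.
using Mat = std::vector<double>;

// C (m x n, col-major) = A (m x k, col-major) * B (k x n, col-major).
// Col-major element (i,j) of an m-row matrix lives at index i + j*m.
void colmajor_product(const Mat& A, const Mat& B, Mat& C, int m, int k, int n) {
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            double s = 0;
            for (int l = 0; l < k; ++l)
                s += A[i + l * m] * B[l + j * k];
            C[i + j * m] = s;
        }
}

// Row-major C = A*B via the same column-major kernel: a row-major matrix is
// the column-major storage of its transpose, so computing C^T = B^T * A^T
// column-major fills C row-major -- no data movement, just swapped operands.
void rowmajor_product(const Mat& A, const Mat& B, Mat& C, int m, int k, int n) {
    colmajor_product(B, A, C, n, k, m);
}
```

Only one kernel needs to be maintained; the row-major case costs nothing beyond reinterpreting the buffers.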
Comment 3 Christoph Hertzberg 2013-11-01 13:44:44 UTC
In many small-size dynamic cases, Eigen is significantly slower than straightforward textbook implementations. Maybe an option to switch to the trivial version would be nice. Alternatively, this could also be detected by the routine at runtime, basically something like this:
  if(lhs.rows() < EIGEN_SMALL_PRODUCT_THRESHOLD) 
    small_product(lhs, rhs, res);  // simple 
  else 
    big_product(lhs, rhs, res);

Because this concerns many users and might even prevent them from using Eigen, I am marking this as blocking 3.3.
Comment 4 Christoph Hertzberg 2014-03-06 16:12:12 UTC
Created attachment 422 [details]
Test case for small products

I attached a simple test program showing that for most small matrices, a trivial implementation is much faster than Eigen's implementation (which is optimized for big matrices). A 'ratio' > 1 means the trivial implementation is faster; after 'ratio', the MFLOPS for the trivial implementation and for Eigen's are displayed.

Theoretically, some 'trivial' SIMD optimizations could be made, e.g., if the inner dimension of the lhs (and result) is a multiple of the packet size.
I guess it's hard, though, to decide on platform-independent thresholds for when to switch to a trivial implementation.
Comment 5 Daniel Vollmer 2014-04-27 22:38:19 UTC
Note that the gemm also has a higher overhead if EIGEN_ALLOCA is not used / available on the platform. In that case small dynamic matrix * matrix products essentially turn into a malloc benchmark.

There is already a small fix for this in https://bitbucket.org/eigen/eigen/commits/2dd06ce4a7370df01535472caf8aa3621bb4721f

I'm not sure whether it's worthwhile to do something similar for gemv as well.
Comment 6 Gael Guennebaud 2014-04-28 16:46:17 UTC
Created attachment 457 [details]
Helper program to find a good threshold to fallback to lazyProduct

For the record, I attach the program I used to find a good metric and threshold to switch between GEMM and lazyProduct. This program compares four metrics:
- min(rows, cols, depth)
- max(rows, cols, depth)
- rows+cols+depth
- rows*cols*depth
Then we test all configurations in [1:32]^3, and for each metric we add 1 for the respective value/algorithm pair when one algorithm is clearly better than the other (I used a factor of 0.85).

Maybe, instead of adding 1, we could add weights proportional to the speedup.

This program also records the number of false positives/negatives for a given metric/threshold pair, to ease testing of more complicated heuristics.

From my experiment, the sum of the sizes worked the best while being quite simple.


Note that we are still working on making GEMM naturally faster on small matrices.
Comment 7 Christoph Hertzberg 2014-04-28 17:35:55 UTC
(In reply to comment #6)
> Created attachment 457 [details]
> Helper program to find a good threshold to fallback to lazyProduct

We should also compare to a straightforward, non-vectorized 3-loop implementation (trivial_product in attachment 422 [details]). I found it to be faster than lazyProduct by a factor of up to 1.5, especially when lazyProduct was faster than the default product.
It might also be worth checking whether certain sizes are multiples of the packet size (the inner size of the result being the most promising candidate, I guess) -- of course, we must ensure we don't spend more time on size testing than on the actual product ...

And I think we should consider partial unrolling, if some (but not all) dimensions are fixed. Of course, this comes at the cost of more generated code.
Comment 8 Daniel Vollmer 2014-04-28 17:42:43 UTC
What might also be worthwhile is to dispatch to gemv from gemm if one of the dimensions turns out to be 1, but maybe the "low-overhead" implementation (whatever it looks like) will already be "good enough".
Comment 9 Gael Guennebaud 2014-04-28 17:59:56 UTC
I observed that considering vectors (i.e., one dimension is 1) introduced a lot of noise in the analysis. Vector sizes are usually known at compile time, and if not, we should call gemv anyway.

So after discarding them, the product of the dimensions seems to be the most relevant choice to switch between gemm and lazyProduct. With rows*cols*depth<196 I obtain only 3 significant failures (x2) compared to 40 with the sum of the dimensions.

Now, if we add trivial_product into the competition, then none of the previous metrics is good (we have complete overlaps), and checking for multiples of packet-size is probably needed. However, I'd rather first try to make both lazyProduct and gemm naturally as good as possible!
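The heuristic discussed in the last two comments can be sketched as a tiny selector. The threshold 196 is the value quoted above; treating an outer dimension of 1 as the gemv case follows the preceding comments, while the enum and function names are made up for this sketch:

```cpp
#include <cassert>

// Hypothetical dispatch between the three product paths discussed above.
enum class ProductKind { Gemv, LazyProduct, Gemm };

ProductKind select_product(long rows, long cols, long depth) {
    if (rows == 1 || cols == 1)
        return ProductKind::Gemv;         // result is a vector: dispatch to gemv
    if (rows * cols * depth < 196)
        return ProductKind::LazyProduct;  // small problem: low-overhead coefficient path
    return ProductKind::Gemm;             // otherwise the blocked GEMM kernel
}
```

In practice the threshold would be retuned per architecture; the point is only that the product of the three dimensions is the metric that separated the two regimes best once vectors were discarded.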
Comment 10 Christoph Hertzberg 2014-06-20 16:41:08 UTC
*** Bug 469 has been marked as a duplicate of this bug. ***
Comment 11 Daniel Vollmer 2015-11-20 16:37:24 UTC
I just retested my test case with Eigen 3.2.7, and I no longer require lazyProduct to obtain decent performance for fixed-size matrices; but when doing the same with Dynamic matrices of the same size, lazyProduct is still 50% faster.


***** Using lazyProduct *****
[----------] 1 test from BlockMatrixMultiplyPerformance/0, where TypeParam = std::tuple<DX::SparseMatrixIndexMapping<true, unsigned int>, DX::DenseMatrixIndexMapping<true, unsigned int>, DX::DenseMatrixIndexMapping<true, unsigned int>, Eigen::Matrix<double, 5, 5, 1, 5, 5>, Eigen::Matrix<double, 5, 1, 0, 5, 1>, Eigen::Matrix<double, 5, 1, 0, 5, 1> >
[ RUN      ] BlockMatrixMultiplyPerformance/0.MatrixVector
[       OK ] BlockMatrixMultiplyPerformance/0.MatrixVector (5647 ms)
[----------] 1 test from BlockMatrixMultiplyPerformance/0 (5647 ms total)

[----------] 1 test from BlockMatrixMultiplyPerformance/1, where TypeParam = std::tuple<DX::SparseMatrixIndexMapping<true, unsigned int>, DX::DenseMatrixIndexMapping<true, unsigned int>, DX::DenseMatrixIndexMapping<true, unsigned int>, Eigen::Matrix<double, -1, -1, 1, -1, -1>, Eigen::Matrix<double, -1, -1, 0, -1, -1>, Eigen::Matrix<double, -1, -1, 0, -1, -1> >
[ RUN      ] BlockMatrixMultiplyPerformance/1.MatrixVector
[       OK ] BlockMatrixMultiplyPerformance/1.MatrixVector (28054 ms)
[----------] 1 test from BlockMatrixMultiplyPerformance/1 (28054 ms total)


***** Using operator* *****
[----------] 1 test from BlockMatrixMultiplyPerformance/0, where TypeParam = std::tuple<DX::SparseMatrixIndexMapping<true, unsigned int>, DX::DenseMatrixIndexMapping<true, unsigned int>, DX::DenseMatrixIndexMapping<true, unsigned int>, Eigen::Matrix<double, 5, 5, 1, 5, 5>, Eigen::Matrix<double, 5, 1, 0, 5, 1>, Eigen::Matrix<double, 5, 1, 0, 5, 1> >
[ RUN      ] BlockMatrixMultiplyPerformance/0.MatrixVector
[       OK ] BlockMatrixMultiplyPerformance/0.MatrixVector (5665 ms)
[----------] 1 test from BlockMatrixMultiplyPerformance/0 (5665 ms total)

[----------] 1 test from BlockMatrixMultiplyPerformance/1, where TypeParam = std::tuple<DX::SparseMatrixIndexMapping<true, unsigned int>, DX::DenseMatrixIndexMapping<true, unsigned int>, DX::DenseMatrixIndexMapping<true, unsigned int>, Eigen::Matrix<double, -1, -1, 1, -1, -1>, Eigen::Matrix<double, -1, -1, 0, -1, -1>, Eigen::Matrix<double, -1, -1, 0, -1, -1> >
[ RUN      ] BlockMatrixMultiplyPerformance/1.MatrixVector
[       OK ] BlockMatrixMultiplyPerformance/1.MatrixVector (45689 ms)
[----------] 1 test from BlockMatrixMultiplyPerformance/1 (45689 ms total)
Comment 12 Gael Guennebaud 2015-11-22 21:59:21 UTC
The related changes have not been backported to the 3.2 branch, so you have to try the default branch to see the current situation.
Comment 13 Daniel Vollmer 2015-11-27 14:54:03 UTC
OK, I've tried it with the tip from just now (b922ac1b9e69), and this gives the same results for fixed-size matrices and only a slight slowdown when using operator* vs. lazyProduct:

***** Using lazyProduct *****
[----------] 1 test from BlockMatrixMultiplyPerformance/1, where TypeParam = std::tuple<DX::SparseMatrixIndexMapping<true, unsigned int>, DX::DenseMatrixIndexMapping<true, unsigned int>, DX::DenseMatrixIndexMapping<true, unsigned int>, Eigen::Matrix<double, -1, -1, 1, -1, -1>, Eigen::Matrix<double, -1, -1, 0, -1, -1>, Eigen::Matrix<double, -1, -1, 0, -1, -1> >
[ RUN      ] BlockMatrixMultiplyPerformance/1.MatrixVector
[       OK ] BlockMatrixMultiplyPerformance/1.MatrixVector (17523 ms)
[----------] 1 test from BlockMatrixMultiplyPerformance/1 (17523 ms total)

***** Using operator* *****
[----------] 1 test from BlockMatrixMultiplyPerformance/1, where TypeParam = std::tuple<DX::SparseMatrixIndexMapping<true, unsigned int>, DX::DenseMatrixIndexMapping<true, unsigned int>, DX::DenseMatrixIndexMapping<true, unsigned int>, Eigen::Matrix<double, -1, -1, 1, -1, -1>, Eigen::Matrix<double, -1, -1, 0, -1, -1>, Eigen::Matrix<double, -1, -1, 0, -1, -1> >
[ RUN      ] BlockMatrixMultiplyPerformance/1.MatrixVector
[       OK ] BlockMatrixMultiplyPerformance/1.MatrixVector (18289 ms)
[----------] 1 test from BlockMatrixMultiplyPerformance/1 (18289 ms total)

So the small dynamic-size matrix-vector product is quite a bit faster for both operator* and lazyProduct (on OS X, using g++ 5.2.0 -Ofast -DNDEBUG) comparing 3.2.7 vs. default-branch, going from 28secs to 18.3secs, but operator* is still slightly slower than lazyProduct for this benchmark (17.5secs vs 18.3secs).

There is no longer a huge gulf between the two, which is great.

I'm not sure where the general speed-up comes from, though. The benchmark essentially multiplies a lot of fixed- or dynamic-sized 5x5 matrices with 5x1 vectors repeatedly and sums up the results. Not that I'm complaining... :)
Comment 14 Gael Guennebaud 2015-11-27 15:34:56 UTC
The speedup comes from a runtime branching to lazyProduct if the problem is "small":

https://bitbucket.org/eigen/eigen/src/797b2a823bf73ea94267639afa6ad739494c3e47/Eigen/src/Core/products/GeneralMatrixMatrix.h?at=default&fileviewer=file-view-default#GeneralMatrixMatrix.h-431

btw, for dynamic vectors, you should really use a Matrix<double,Dynamic,1> (aka VectorXd), and not a general MatrixXd!
Comment 15 Daniel Vollmer 2015-11-27 15:52:42 UTC
The runtime branching to lazyProduct should be the speedup from 45secs to 28secs; with it, it no longer matters much whether I use lazyProduct or operator* for small matrices. But there is an additional speedup in the dynamic case (for both the lazyProduct and operator* variants) when going from Eigen 3.2.7 to the development branch, which is the one from 28secs to ~18secs.

i.e., for my small, dynamic-size matrix-vector multiply:
  lazyProduct in 3.2.7 takes 28secs
  operator*   in 3.2.7 takes 45secs
  lazyProduct in -dev  takes 17.5secs
  operator*   in -dev  takes 18.3secs
At least this is what my numbers seem to indicate. So I was wondering why lazyProduct itself got faster...

And yes, I will make sure to fix as many dimensions as we can once we actually use this... :)
Comment 16 Gael Guennebaud 2015-11-28 20:04:58 UTC
I cannot reproduce such a difference, and the underlying code is rather the same, so the only way to see what's going on is to look at the generated assembly (-S with gcc/clang). 

BTW, which compiler and which flags are you using?
Comment 17 Daniel Vollmer 2015-11-30 12:42:10 UTC
Created attachment 629 [details]
Assembly for Eigen 3.2.7 vs. 3.3 in Dynamic lazyProduct benchmark

After remeasuring, I still see a consistent improvement of about 5-6% for Dynamic-sized matrices for Eigen 3.3 versus 3.2.7 in a benchmark that largely relies on summing lazyProducts of (small) matrices with (small) vectors.

The difference is not as stark as I indicated in http://eigen.tuxfamily.org/bz/show_bug.cgi?id=404#c15 (where the improvement seemed to be ~35%); I must have measured something bogus there. The hopefully correct timings are now (only for Matrix<double, Dynamic, Dynamic>):
BlockMatrixMultiplyPerformance/0.MatrixVector/operator*_Eigen327   (41877 ms)
BlockMatrixMultiplyPerformance/0.MatrixVector/lazyProduct_Eigen327 (18742 ms)
BlockMatrixMultiplyPerformance/0.MatrixVector/operator*_Eigen33    (18634 ms)
BlockMatrixMultiplyPerformance/0.MatrixVector/lazyProduct_Eigen33  (17763 ms)


Anyway, I've attached the assembly output of my test-case for both versions, but I wouldn't investigate too much now that the difference is less pronounced.

Compiler g++-5.2 (OS X 10.11.1, homebrew)
Invocation: g++-5 -std=c++11 -Wall -DNDEBUG -Ofast -fdiagnostics-color=always -Wno-deprecated -Wno-unused -fopenmp 


The actual method being benchmarked looks like

template<typename BlockType, class MatrixIndexMappingType>
class BlockMatrix
{
  // [snip lots of stuff]

  template<typename OtherBlockT, bool OtherIsRowMajor, typename ResultBlockT, bool ResultIsRowMajor,
    typename ResultZeroT = ResultBlockT,
    class Block = BlockT, class = typename std::enable_if<IsMatrixType<Block>::value>::type>
  void Multiply(const BlockMatrix<OtherBlockT, DenseMatrixIndexMapping<OtherIsRowMajor, IndexT> > &other,
    BlockMatrix<ResultBlockT, DenseMatrixIndexMapping<ResultIsRowMajor, IndexT> > &result,
    const ResultZeroT &resultZero = ResultBlockT::Zero()) const
  {
    if (GetNumCols() != other.GetNumRows())
      throw std::invalid_argument(DX_MSG_ADD_CALL_SITE("# of columns in this != # of rows in other!"));
    if (GetNumRows() != result.GetNumRows())
      throw std::invalid_argument(DX_MSG_ADD_CALL_SITE("# of rows in this != # of rows in result!"));
    if (other.GetNumCols() != result.GetNumCols())
      throw std::invalid_argument(DX_MSG_ADD_CALL_SITE("# of columns in other != # of columns in result!"));
    if (other.GetNumCols() != 1 || result.GetNumCols() != 1)
      throw std::invalid_argument(DX_MSG_ADD_CALL_SITE("# of columns in other or result != 1!"));

    if (IsRowMajorT::value)
    { // outer = row: each result-entry only touched once, no before-hand zero-fill, local sum
      ResultBlockT sum;
      for (IndexT outer = 0; outer < GetNumOuter(); ++outer)
      {
        sum = resultZero;
        const auto end = InnerIteratorEnd(outer);
        for (auto innerIt = InnerIteratorBegin(outer); innerIt != end; ++innerIt)
        {
          // Eigen's GEMM and GEMV routines (GEneralMatrixMatrix / GEneralMatrixVector product) are optimized for large
          // matrices (and gemm even dynamically allocates memory). These get used automatically by the normal
          // Matrix::operator*(...) when the dimensions of the rhs-matrix are Eigen::Dynamic.
          // For our rather small matrices, these routines incur a huge overhead; also see
          // http://eigen.tuxfamily.org/bz/show_bug.cgi?id=404
          // The use of lazyProduct forces a "simple" implementation that has less overhead (no temporaries, no set-up
          // for alignment / packetization / SIMD) but obviously is slower for large matrices.
          sum.noalias() += innerIt->lazyProduct(other(innerIt.col(), 0));
        }
        result(outer, 0) = sum;
      }
    }
    else
    { // outer = col
      result.Fill(resultZero);
      for (IndexT outer = 0; outer < GetNumOuter(); ++outer)
      {
        const auto end = InnerIteratorEnd(outer);
        for (auto innerIt = InnerIteratorBegin(outer); innerIt != end; ++innerIt)
          result(innerIt.row(), 0).noalias() += innerIt->lazyProduct(other(innerIt.col(), 0));
      }
    }
  }
// ...
};
Comment 18 Gael Guennebaud 2016-01-28 21:34:19 UTC
(In reply to Christoph Hertzberg from comment #7)
> We should also compare to a straight-forward, non-vectorized 3-loop
> implementation (trivial_product in attachment 422 [details]). I experienced
> it to be faster than lazyProduct by a factor of up-to 1.5 especially when
> lazyProduct was faster than the default product.


Actually, if we compile this test program with -DEIGEN_DONT_VECTORIZE, then Eigen's lazyProduct is (almost) consistently faster. Since we cannot predict the size of a MatrixX?, the only way to address this issue would be to offer an easy way to control vectorization at the level of an individual expression (or block of expressions). The same goes for multi-threading, unrolling, and any other fine-tuning parameters.

I'll create another entry for that.
