Summary:  Optimize products for small objects  

Product:  Eigen  Reporter:  Gael Guennebaud <gael.guennebaud>  
Component:  Core  matrix products  Assignee:  Nobody <eigen.nobody>  
Status:  RESOLVED FIXED  
Severity:  Optimization  CC:  chtz, daniel.vollmer, laurent.deniau  
Priority:  Normal  
Version:  3.1  
Hardware:  All  
OS:  All  
Whiteboard:  
Bug Depends on:  359, 469  
Bug Blocks:  558  
Attachments: 

Description
Gael Guennebaud
20120112 08:45:52 UTC
Another thing which can be drastically improved is products involving Matrix<Scalar, FIXED, Dynamic, ColMajor>. If I see it correctly they all fall back to the generic product at the moment  which causes a lot of overhead. Usually, a naive loop accessing one (unrolled) column at a time should be more efficient. The same is true for Matrix<Scalar, Dynamic, FIXED, RowMajor>, obviously. A suggestion to reduce code duplication: If the destination is RowMajor we could essentially call: dst.transpose() = rhs.transpose() * lhs.transpose(); I don't know where in the product logic this preferably done. In many smallsize dynamic cases Eigen is significantly slower than straightforward textbook implementations. Maybe an option to switch to the trivial version would be nice. Alternatively, this could also be detected by the routine at runtime, basically something like this: if(lhs.rows() < EIGEN_SMALL_PRODUCT_THRESHOLD) small_product(lhs, rhs, res); // simple else big_product(lhs, rhs, res); Because this concerns many users or might even prevent them from using Eigen, I mark this blocking 3.3. Created attachment 422 [details]
Test case for small products
I attached a simple test program showing that for most small matrices a trivial implementation is much faster than Eigen's implementation (which is optimized for big matrices). 'ratio'>1 means the trivial implementation is faster, after 'ratio' the MFLOPS for the trivial and Eigen's implementation are displayed.
Theoretically, some 'trivial' SIMDoptimizations could be made, e.g., if the inner dimension of the lhs (and result) is a multiple of the packetsize.
I guess it's hard though, to decide platformindependent thresholds when to switch to a trivial implementation.
Note that the gemm also has a higher overhead if EIGEN_ALLOCA is not used / available on the platform. In that case small dynamic matrix * matrix products essentially turn into a malloc benchmark. There is already a small fix for this in https://bitbucket.org/eigen/eigen/commits/2dd06ce4a7370df01535472caf8aa3621bb4721f I'm not sure whether it's worthwhile to do something similar for gemv as well. Created attachment 457 [details]
Helper program to find a good threshold to fallback to lazyProduct
For the record, I attach the program I used to find a good metric and threshold to switch between GEMM and lazyProduct. This program compares three metrics:
 min (rows,cols,depth)
 max (rows,cols,depth)
 rows+cols+depth
 rows*cols*depth
Then we test all configuration in [1:32]^3 and for each metric we add 1 for the respective pair of value/algorithm when one is clearly better than the other (I used a factor of 0.85).
Maybe, instead of adding 1, we could add weights proportional to the speedup.
This program also record the number of false positive/negative for a given pair of metric/threshold to ease testing of more complicated heuristics.
From my experiment, the sum of the sizes worked the best while being quite simple.
Note that we are still working on making GEMM naturally faster on small matrices.
(In reply to comment #6) > Created attachment 457 [details] > Helper program to find a good threshold to fallback to lazyProduct We should also compare to a straightforward, nonvectorized 3loop implementation (trivial_product in attachment 422 [details]). I experienced it to be faster than lazyProduct by a factor of upto 1.5 especially when lazyProduct was faster than the default product. It might be also worth checking if certain sizes are multiples of the Packetsize (most promising would be the innersize of the result, I guess)  of course we must ensure not spending more time on size testing than on the actual product ... And I think we should consider partial unrolling, if some (but not all) dimensions are fixed. Of course, this comes at the cost of more generated code. What might also be worthwhile is to dispatch to gemv from gemm if one of the dimensions turns out to be 1, but maybe the "lowoverhead" implementation (whatever it looks like) will already be "good enough". I observed that considering vectors (i.e., one dimension is 1) introduced a lot of noise in the analysis. Vectors are usually known at compile times, and if not we should call gemv anyways. So after discarding them, the product of the dimensions seems to be the most relevant choice to switch between gemm and lazyProduct. With rows*cols*depth<196 I obtain only 3 significant failures (x2) compared to 40 with the sum of the dimensions. Now, if we add trivial_product into the competition, then none of the previous metrics is good (we have complete overlaps), and checking for multiples of packetsize is probably needed. However, I'd rather first try to make both lazyProduct and gemm naturally as good as possible! *** Bug 469 has been marked as a duplicate of this bug. *** I just retested my testcase with Eigen 3.2.7, and I no longer require lazyProduct to obtain decent performance for fixedsize matrices, but when doing the same with Dynamic matrices of the same size, lazyProduct is still 50% faster. ***** Using lazyProduct ***** [] 1 test from BlockMatrixMultiplyPerformance/0, where TypeParam = std::tuple<DX::SparseMatrixIndexMapping<true, unsigned int>, DX::DenseMatrixIndexMapping<true, unsigned int>, DX::DenseMatrixIndexMapping<true, unsigned int>, Eigen::Matrix<double, 5, 5, 1, 5, 5>, Eigen::Matrix<double, 5, 1, 0, 5, 1>, Eigen::Matrix<double, 5, 1, 0, 5, 1> > [ RUN ] BlockMatrixMultiplyPerformance/0.MatrixVector [ OK ] BlockMatrixMultiplyPerformance/0.MatrixVector (5647 ms) [] 1 test from BlockMatrixMultiplyPerformance/0 (5647 ms total) [] 1 test from BlockMatrixMultiplyPerformance/1, where TypeParam = std::tuple<DX::SparseMatrixIndexMapping<true, unsigned int>, DX::DenseMatrixIndexMapping<true, unsigned int>, DX::DenseMatrixIndexMapping<true, unsigned int>, Eigen::Matrix<double, 1, 1, 1, 1, 1>, Eigen::Matrix<double, 1, 1, 0, 1, 1>, Eigen::Matrix<double, 1, 1, 0, 1, 1> > [ RUN ] BlockMatrixMultiplyPerformance/1.MatrixVector [ OK ] BlockMatrixMultiplyPerformance/1.MatrixVector (28054 ms) [] 1 test from BlockMatrixMultiplyPerformance/1 (28054 ms total) ***** Using operator* ***** [] 1 test from BlockMatrixMultiplyPerformance/0, where TypeParam = std::tuple<DX::SparseMatrixIndexMapping<true, unsigned int>, DX::DenseMatrixIndexMapping<true, unsigned int>, DX::DenseMatrixIndexMapping<true, unsigned int>, Eigen::Matrix<double, 5, 5, 1, 5, 5>, Eigen::Matrix<double, 5, 1, 0, 5, 1>, Eigen::Matrix<double, 5, 1, 0, 5, 1> > [ RUN ] BlockMatrixMultiplyPerformance/0.MatrixVector [ OK ] BlockMatrixMultiplyPerformance/0.MatrixVector (5665 ms) [] 1 test from BlockMatrixMultiplyPerformance/0 (5665 ms total) [] 1 test from BlockMatrixMultiplyPerformance/1, where TypeParam = std::tuple<DX::SparseMatrixIndexMapping<true, unsigned int>, DX::DenseMatrixIndexMapping<true, unsigned int>, DX::DenseMatrixIndexMapping<true, unsigned int>, Eigen::Matrix<double, 1, 1, 1, 1, 1>, Eigen::Matrix<double, 1, 1, 0, 1, 1>, Eigen::Matrix<double, 1, 1, 0, 1, 1> > [ RUN ] BlockMatrixMultiplyPerformance/1.MatrixVector [ OK ] BlockMatrixMultiplyPerformance/1.MatrixVector (45689 ms) [] 1 test from BlockMatrixMultiplyPerformance/1 (45689 ms total) The related changes have not been backported to the 3.2 branch. So you have to try with the default branch to see what's the current situation. Ok, I've tried it with the tip from just now (b922ac1b9e69) and this gives the same for fixedsize matrices and only a slight slowdown for using operator* vs. lazyProduct: ***** Using lazyProduct ***** [] 1 test from BlockMatrixMultiplyPerformance/1, where TypeParam = std::tuple<DX::SparseMatrixIndexMapping<true, unsigned int>, DX::DenseMatrixIndexMapping<true, unsigned int>, DX::DenseMatrixIndexMapping<true, unsigned int>, Eigen::Matrix<double, 1, 1, 1, 1, 1>, Eigen::Matrix<double, 1, 1, 0, 1, 1>, Eigen::Matrix<double, 1, 1, 0, 1, 1> > [ RUN ] BlockMatrixMultiplyPerformance/1.MatrixVector [ OK ] BlockMatrixMultiplyPerformance/1.MatrixVector (17523 ms) [] 1 test from BlockMatrixMultiplyPerformance/1 (17523 ms total) ***** Using operator* ***** [] 1 test from BlockMatrixMultiplyPerformance/1, where TypeParam = std::tuple<DX::SparseMatrixIndexMapping<true, unsigned int>, DX::DenseMatrixIndexMapping<true, unsigned int>, DX::DenseMatrixIndexMapping<true, unsigned int>, Eigen::Matrix<double, 1, 1, 1, 1, 1>, Eigen::Matrix<double, 1, 1, 0, 1, 1>, Eigen::Matrix<double, 1, 1, 0, 1, 1> > [ RUN ] BlockMatrixMultiplyPerformance/1.MatrixVector [ OK ] BlockMatrixMultiplyPerformance/1.MatrixVector (18289 ms) [] 1 test from BlockMatrixMultiplyPerformance/1 (18289 ms total) So the small dynamicsize matrixvector product is quite a bit faster for both operator* and lazyProduct (on OS X, using g++ 5.2.0 Ofast DNDEBUG) comparing 3.2.7 vs. defaultbranch, going from 28secs to 18.3secs, but operator* is still slightly slower than lazyProduct for this benchmark (17.5secs vs 18.3secs). There is no longer a huge gulf between the two, which is great. I'm not sure where the general speedup comes from, though. The benchmark is essential repeatedly computing a lot of fixed or dynamicsized 5x5 matrix with a 5x1 vector and summing up the results. Not that I'm complaining... :) The speedup comes from a runtime branching to lazyProduct if the problem is "small": https://bitbucket.org/eigen/eigen/src/797b2a823bf73ea94267639afa6ad739494c3e47/Eigen/src/Core/products/GeneralMatrixMatrix.h?at=default&fileviewer=fileviewdefault#GeneralMatrixMatrix.h431 btw, for dynamic vectors, you should really use a Matrix<double,Dynamic,1> (aka VectorXd), and not a general MatrixXd! The runtime branching to lazyProduct should be the speedup from 45secs to 28secs, so with that it no longer matters that much whether I use lazyProduct or operator* for small matrices, but there is an additional speedup for the dynamic case (for both lazyProduct and operator* cases) when going from Eigen 3.2.7 to the development branch which is the one from 28secs to ~18secs). i.e. for my small, dynamicsize matrixvector multiply lazyProduct in 3.2.7 takes 28secs operator* in 3.2.7 takes 45secs lazyProduct dev takes 17.5secs operator* in dev takes 18.3secs At least this is what my numbers seem to indicate. So I was wondering why lazyProduct itself got faster... And yes, I will make sure to fix as many dimensions as we can once we actually use this... :) I cannot reproduce such a difference, and the underlying code is rather the same, so the only way to see what's going on is to look at the generated assembly (S with gcc/clang). BTW, which compiler and which flags are you using? Created attachment 629 [details] Assembly for Eigen 3.2.7 vs. 3.3 in Dynamic lazyProduct benchmark After remeasuring, I still see a consistent improvement of about 56% for Dynamicsized matrices for Eigen 3.3 versus 3.2.7 in a benchmark that largely relies on summing lazyProducts of (small) matrices with (small) vectors. The difference is not as stark as I indicated in http://eigen.tuxfamily.org/bz/show_bug.cgi?id=404#c15 (where the improvement seemed to be ~35%). I must've measured bogus there. The hopefully correct timings are now (only for Matrix<double, Dynamic, Dynamic>): BlockMatrixMultiplyPerformance/0.MatrixVector/operator*_Eigen327 (41877 ms) BlockMatrixMultiplyPerformance/0.MatrixVector/lazyProduct_Eigen327 (18742 ms) BlockMatrixMultiplyPerformance/0.MatrixVector/operator*_Eigen33 (18634 ms) BlockMatrixMultiplyPerformance/0.MatrixVector/lazyProduct_Eigen33 (17763 ms) Anyway, I've attached the assembly output of my testcase for both versions, but I wouldn't investigate too much now that the difference is less pronounced. Compiler g++5.2 (OS X 10.11.1, homebrew) Invocation: g++5 std=c++11 Wall DNDEBUG Ofast fdiagnosticscolor=always Wnodeprecated Wnounused fopenmp The actual method being benchmarked looks like template<typename BlockType, class MatrixIndexMappingType> class BlockMatrix { // [snip lots of stuff] template<typename OtherBlockT, bool OtherIsRowMajor, typename ResultBlockT, bool ResultIsRowMajor, typename ResultZeroT = ResultBlockT, class Block = BlockT, class = typename std::enable_if<IsMatrixType<Block>::value>::type> void Multiply(const BlockMatrix<OtherBlockT, DenseMatrixIndexMapping<OtherIsRowMajor, IndexT> > &other, BlockMatrix<ResultBlockT, DenseMatrixIndexMapping<ResultIsRowMajor, IndexT> > &result, const ResultZeroT &resultZero = ResultBlockT::Zero()) const { if (GetNumCols() != other.GetNumRows()) throw std::invalid_argument(DX_MSG_ADD_CALL_SITE("# of columns in this != # of rows in other!")); if (GetNumRows() != result.GetNumRows()) throw std::invalid_argument(DX_MSG_ADD_CALL_SITE("# of rows in this != # of rows in result!")); if (other.GetNumCols() != result.GetNumCols()) throw std::invalid_argument(DX_MSG_ADD_CALL_SITE("# of columns in other != # of columns in result!")); if (other.GetNumCols() != 1  result.GetNumCols() != 1) throw std::invalid_argument(DX_MSG_ADD_CALL_SITE("# of columns in other or result != 1!")); if (IsRowMajorT::value) { // outer = row: each resultentry only touched once, no beforehand zerofill, local sum ResultBlockT sum; for (IndexT outer = 0; outer < GetNumOuter(); ++outer) { sum = resultZero; const auto end = InnerIteratorEnd(outer); for (auto innerIt = InnerIteratorBegin(outer); innerIt != end; ++innerIt) { // Eigen's GEMM and GEMV routines (GEneralMatrixMatrix / GEneralMatrixVector product) are optimized for large // matrices (and gemm even dynamically allocates memory). These get used automatically by the normal // Matrix::operator*(...) when the dimensions of the rhsmatrix are Eigen::Dynamic. // For our rather small matrices, these routines occur a huge overhead, also see // http://eigen.tuxfamily.org/bz/show_bug.cgi?id=404 // The use of lazyProduct forces a "simple" implementation that has less overhead (no temporaries, no setup // for alignment / packetization / SIMD) but obviously is slower for large matrices. sum.noalias() += innerIt>lazyProduct(other(innerIt.col(), 0)); } result(outer, 0) = sum; } } else { // outer = col result.Fill(resultZero); for (IndexT outer = 0; outer < GetNumOuter(); ++outer) { const auto end = InnerIteratorEnd(outer); for (auto innerIt = InnerIteratorBegin(outer); innerIt != end; ++innerIt) result(innerIt.row(), 0).noalias() += innerIt>lazyProduct(other(innerIt.col(), 0)); } } } // ... }; (In reply to Christoph Hertzberg from comment #7) > We should also compare to a straightforward, nonvectorized 3loop > implementation (trivial_product in attachment 422 [details]). I experienced > it to be faster than lazyProduct by a factor of upto 1.5 especially when > lazyProduct was faster than the default product. Actually, if we compile this test program with DEIGEN_DONT_VECTORIZE, then Eigen's lazyProduct is consistently faster (almost). Since we cannot predict the size of a MatrixX?, the only way to address this issue would be to offer an easy way to control vectorization on a per expression level (or block of expression). Same for multithreading, and unrolling and any fine tuning parameters. I'll create another entry for that.  GitLab Migration Automatic Message  This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/404. 