EigenInternals

From Eigen
Revision as of 20:25, 13 June 2008 by Ggael (Talk | contribs)


Cache friendly product

If you feel the code of the fast matrix-matrix product in CacheFriendlyProduct.h looks scary, don't worry, here are all the secrets.

First of all, let's assume we are doing a product A * B = C of size: MxK * KxN = MxN.

  • The first idea is to perform the product by large blocks that fit into the L2 cache (currently we deal with blocks of 256x256). The outermost loop iterates over the rows of A per block of 256 rows. At each iteration the corresponding 256xK block of A is copied to a temporary buffer with some reorganization/interleaving such that the product kernel is 100% cache friendly (see below). This reorganization also guarantees that the data is correctly aligned for a fast vectorized implementation.
  • Next, for each 256x256 macro block C' of the result C, we loop over each macro block A' (resp. B') of A (resp. B) that contributes to C', and we apply the following product kernel:
    • Now the problem is to perform the matrix product C' += A' * B' of size 256x256x256. This product is performed per micro block C" of C', of size Wx1. This allows us to keep the Wx1 coefficients in W vector registers (in an expanded way). More precisely, assuming a hardware vector size of 4, each iteration of the innermost loop does the following:
      • load a column vector of B' (4x1), multiply it by each of the W row vectors of A', and accumulate each product into one of the W vector registers.
      • At the end of the inner loop, the elements of each vector register have to be summed together and accumulated into the micro block C".
    • This last accumulation step requires "horizontal" computations. They can be performed one vector at a time (C is then accessed in a scalar way, which is convenient when it is not aligned or to handle boundaries), or we can perform 4 accumulations/reductions at once, which is slightly faster (typically only 3 instructions with SSE3).
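The blocking scheme described above can be sketched in plain scalar C++. This is a simplified illustration only (no packing buffer, no vectorization, illustrative names and block size), not Eigen's actual code from CacheFriendlyProduct.h; matrices are stored column-major as assumed below for C.

```cpp
#include <algorithm>
#include <cstddef>

// Illustrative L2 block size; Eigen uses 256x256 blocks as described above.
constexpr std::size_t Block = 256;

// Blocked product C += A * B, with A of size MxK, B of size KxN, C of size
// MxN, all column-major. The three outer loops walk over macro blocks so
// that each C' += A' * B' kernel works on data that fits in the L2 cache.
void blockedProduct(const double* A, const double* B, double* C,
                    std::size_t M, std::size_t K, std::size_t N)
{
    for (std::size_t i0 = 0; i0 < M; i0 += Block) {          // block rows of A/C
        const std::size_t iMax = std::min(i0 + Block, M);
        for (std::size_t j0 = 0; j0 < N; j0 += Block) {      // block cols of B/C
            const std::size_t jMax = std::min(j0 + Block, N);
            for (std::size_t k0 = 0; k0 < K; k0 += Block) {  // depth blocks
                const std::size_t kMax = std::min(k0 + Block, K);
                // macro-block kernel: C' += A' * B'
                for (std::size_t j = j0; j < jMax; ++j)
                    for (std::size_t k = k0; k < kMax; ++k) {
                        const double b = B[k + j * K];       // column-major access
                        for (std::size_t i = i0; i < iMax; ++i)
                            C[i + j * M] += A[i + k * M] * b;
                    }
            }
        }
    }
}
```

In the real implementation the innermost kernel is replaced by the vectorized Wx1 register kernel described above, and A' is first repacked into the interleaved, aligned buffer.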

The current implementation of this algorithm in Eigen looks like this:

  • One of the goals is to have only one instantiation of that code per scalar type, therefore:
    • C is assumed to be column major (otherwise the product is transposed).
    • The code to copy A to a blocky version is duplicated to handle the different storage orders (transposition) of A.
    • B is assumed to be column major, otherwise it is evaluated to a ColumnMajor matrix.
  • When computing the dot products of W sub-rows of A with one sub-column of B, if the sub-column is not aligned, it is first copied to a small aligned buffer. This is much faster than using unaligned loads and allows a single instance of the inner loop.
  • The alignment of C is managed dynamically without overhead, as explained in the algorithm above.
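The Wx1 register kernel with its final horizontal reduction can be emulated in scalar code as follows. Each 4-wide "vector register" is modeled as a `double[4]` array; the names (`microKernel`, `W`, `PacketSize`) are illustrative, and tail handling for depths that are not a multiple of the packet size is omitted for brevity.

```cpp
#include <cstddef>

constexpr std::size_t W = 4;          // number of rows handled per micro block
constexpr std::size_t PacketSize = 4; // hardware vector width assumed above

// rows: W pointers to (aligned) sub-rows of A'; b: one (aligned) sub-column
// of B'; cOut: the Wx1 micro block C" of C that receives the result.
void microKernel(const double* rows[W], const double* b,
                 std::size_t depth, double* cOut)
{
    double acc[W][PacketSize] = {};   // W expanded "vector registers"
    for (std::size_t k = 0; k + PacketSize <= depth; k += PacketSize)
        for (std::size_t r = 0; r < W; ++r)
            for (std::size_t p = 0; p < PacketSize; ++p)
                acc[r][p] += rows[r][k + p] * b[k + p]; // one vector mul-add
    // horizontal reduction of each register, then accumulation into C"
    for (std::size_t r = 0; r < W; ++r)
        cOut[r] += acc[r][0] + acc[r][1] + acc[r][2] + acc[r][3];
}
```

With SSE, each `acc[r]` would live in an xmm register, the inner loop would use packed multiplies and adds, and the final reduction is where the SSE3 horizontal instructions mentioned above come in.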

Cost Model

Eigen internally implements a cost model to control at compile time both the evaluation of nested expressions and explicit unrolling. The read cost (RC) associated with an expression represents an approximation of the effective number of instructions needed to read or evaluate a single coefficient of the expression. Given the following notations:

  • SC = NumTraits<Scalar>::ReadCost
  • MC = NumTraits<Scalar>::MulCost
  • AC = NumTraits<Scalar>::AddCost

the cost of the expression 2*m1+m2 will be RC = (MC + SC) + AC + SC. For native types we have SC==MC==AC==1, so in our example RC = 4.

Such a mechanism is particularly useful to accurately control the generated code size during explicit unrolling. All we have to do is divide a user-defined maximal number of instructions per unrolled loop (EIGEN_UNROLLING_LIMIT) by RC to get the number of iterations to unroll.
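This arithmetic can be written down as a small compile-time sketch. The constants mirror the NumTraits values named above; the budget value here is made up for illustration (the real one is the user-configurable EIGEN_UNROLLING_LIMIT).

```cpp
// Cost constants for a native scalar type, as in the example above.
constexpr int SC = 1; // NumTraits<Scalar>::ReadCost
constexpr int MC = 1; // NumTraits<Scalar>::MulCost
constexpr int AC = 1; // NumTraits<Scalar>::AddCost

// Read cost of 2*m1 + m2: (MC + SC) for 2*m1, plus (AC + SC) for "+ m2".
constexpr int RC = (MC + SC) + AC + SC;
static_assert(RC == 4, "matches the worked example in the text");

// Number of iterations to unroll: instruction budget / per-coefficient cost.
constexpr int UnrollingLimit = 100;              // illustrative budget
constexpr int UnrollCount = UnrollingLimit / RC; // 25 iterations here
```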

This cost model is also used to decide whether a nested expression has to be evaluated or not. For instance, in (m1+m2)*m3 it might be preferable to first evaluate tmp = m1+m2 before performing the matrix product tmp * m3, because each coefficient of the left-hand-side expression (m1+m2) will be evaluated m3.cols() times. This automatic evaluation mechanism is implemented by the ei_nested<> structure. Assuming the coefficients of a nested expression of cost NC (NestedCost) are read R times each, not evaluating it will cost:

  • R * NC.

On the other hand, evaluating the expression will cost NC + SC to evaluate and store it, plus R loads:

  • NC + (R+1)*SC

Finally, evaluating is cheaper as soon as NC + (R+1)*SC < R*NC, which simplifies to the following condition:

  • (R+1)*SC < (R-1) * NC

Of course, in the case of equality it might still be preferable to evaluate the expression, therefore the following condition is reasonable too:

  • (R+1)*SC <= (R-1) * NC
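The decision above can be expressed as a tiny predicate. This is only a sketch of the cost comparison, not Eigen's actual ei_nested<> code, and the sample numbers are made up.

```cpp
// Evaluating a nested expression costs NC + (R+1)*SC (evaluate, store, and
// R later reads); reading it unevaluated R times costs R*NC. Evaluation
// pays off when the former is no larger, i.e. (R+1)*SC <= (R-1)*NC.
constexpr bool shouldEvaluate(int R, int NC, int SC)
{
    return (R + 1) * SC <= (R - 1) * NC;
}

// Example: coefficients read R = 4 times, nested cost NC = 3, SC = 1:
// (4+1)*1 <= (4-1)*3, i.e. 5 <= 9, so evaluating the expression pays off.
```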

Further experiments will tell us which version is better in practice.

Vectorization

TODO: discuss when and how an expression gets vectorized.