Bug 717 - Use vectorization to pack matrix operands
Summary: Use vectorization to pack matrix operands
Status: NEW
Alias: None
Product: Eigen
Classification: Unclassified
Component: Core - vectorization
Version: 3.2
Hardware: All All
Importance: Normal Unknown
Assignee: Nobody
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-12-17 19:57 UTC by Benoit Steiner
Modified: 2014-03-08 21:08 UTC (History)



Attachments
patch against eigen 3.2 (2.57 KB, patch)
2013-12-17 19:57 UTC, Benoit Steiner

Description Benoit Steiner 2013-12-17 19:57:28 UTC
Created attachment 408
patch against eigen 3.2

gemm_pack_rhs<Scalar, Index, nr, RowMajor, Conjugate, PanelMode>::operator() could leverage vectorized instructions when nr == PacketSize. The attached patch does this, which leads to a small speedup when multiplying a large col major matrix with a large row major matrix.
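For readers unfamiliar with the packing step, here is a minimal sketch of the idea, not the attached patch itself. It assumes float scalars, nr == PacketSize == 4, and no PanelMode offset. The names pack_rhs_row_major, kPacketSize and demo_pack are hypothetical, and std::memcpy stands in for an unaligned packet load followed by an unaligned packet store. Because the rhs is row major, the nr coefficients needed at each depth step are contiguous in memory, so they can be moved as one packet.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Assumption: a packet holds 4 floats, as with SSE.
constexpr std::size_t kPacketSize = 4;

// Packs a row-major rhs (depth x cols, row stride rhsStride) into blockB,
// panel by panel, where each panel covers nr == kPacketSize columns.
void pack_rhs_row_major(float* blockB, const float* rhs, std::size_t rhsStride,
                        std::size_t depth, std::size_t cols) {
  assert(rhsStride >= cols);
  std::size_t count = 0;
  const std::size_t packedCols = (cols / kPacketSize) * kPacketSize;
  for (std::size_t j2 = 0; j2 < packedCols; j2 += kPacketSize) {
    for (std::size_t k = 0; k < depth; ++k) {
      // The 4 coefficients of row k are contiguous: copy them as one packet
      // instead of four scalar assignments.
      std::memcpy(blockB + count, rhs + k * rhsStride + j2,
                  kPacketSize * sizeof(float));
      count += kPacketSize;
    }
  }
  // Leftover columns (cols not a multiple of 4) are copied one scalar at a time.
  for (std::size_t j2 = packedCols; j2 < cols; ++j2) {
    for (std::size_t k = 0; k < depth; ++k) {
      blockB[count++] = rhs[k * rhsStride + j2];
    }
  }
}

// Small demonstration: packs a 2 x 5 row-major matrix.
std::vector<float> demo_pack() {
  const std::vector<float> rhs = {0, 1, 2, 3, 4,
                                  10, 11, 12, 13, 14};
  std::vector<float> blockB(rhs.size());
  pack_rhs_row_major(blockB.data(), rhs.data(), 5, 2, 5);
  return blockB;
}
```

In this sketch demo_pack() returns the first four columns interleaved by row (0 1 2 3 10 11 12 13), followed by the leftover fifth column (4 14).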
Comment 1 Gael Guennebaud 2013-12-17 20:31:01 UTC
I cannot observe any speedup. Could you specify what you used for benchmarking: CPU, compiler, flags, scalar type, and matrix sizes?

On my side I tried gcc 4.7 on a Core2 P8700, with floats and various square matrices ranging from 24x24 to 2048x2048.

thanks.
Comment 2 Christoph Hertzberg 2013-12-18 01:15:24 UTC
If I see it correctly, blockB is always aligned (and therefore so is blockB[count] if nr == PacketSize), so maybe using an aligned store would make this worthwhile?
Furthermore, if nr == 2*PacketSize, this could be done using two packet loads/stores. Only for nr < PacketSize would vectorization depend on offset, stride and depth.
If further assumptions about rhsStride can be made, loading can also be aligned.
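The alignment argument can be checked with a small sketch. It assumes 4-byte floats and 16-byte packets of 4 floats, and the names all_stores_aligned and demo_aligned are hypothetical. It also assumes blockB itself is packet aligned, which the caller would have to guarantee. When nr is a multiple of the packet size, every store position blockB + k*nr + p*PacketSize is then aligned as well.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumptions: 4-byte floats and 16-byte packets of 4 floats (as with SSE).
static_assert(sizeof(float) == 4, "sketch assumes 4-byte floats");
constexpr std::size_t kPacketSize = 4;
constexpr std::size_t kPacketBytes = kPacketSize * sizeof(float);

inline bool is_packet_aligned(const float* p) {
  return reinterpret_cast<std::uintptr_t>(p) % kPacketBytes == 0;
}

// Returns true if every packet store issued while packing `depth` rows of
// `nr` coefficients into blockB would hit a packet-aligned address.
// Returns false when nr is not a multiple of the packet size, since that
// case cannot be covered by whole-packet stores alone.
bool all_stores_aligned(const float* blockB, std::size_t nr, std::size_t depth) {
  if (nr == 0 || nr % kPacketSize != 0) return false;
  const std::size_t packetsPerRow = nr / kPacketSize;  // 1 or 2 in the cases discussed
  for (std::size_t k = 0; k < depth; ++k) {
    for (std::size_t p = 0; p < packetsPerRow; ++p) {
      if (!is_packet_aligned(blockB + k * nr + p * kPacketSize)) return false;
    }
  }
  return true;
}

// Demonstration on a 16-byte aligned buffer, optionally shifted by `offset` floats.
bool demo_aligned(std::size_t offset, std::size_t nr, std::size_t depth) {
  alignas(16) static float buffer[64];
  assert(offset + nr * depth <= 64);
  return all_stores_aligned(buffer + offset, nr, depth);
}
```

With an aligned buffer, both nr == 4 and nr == 8 pass the check. Shifting the start by a single float breaks it, which is why any aligned-store variant depends on how the packed block is positioned.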
Comment 3 Benoit Steiner 2013-12-19 02:15:02 UTC
(In reply to comment #1)
I am using a real life application in which most of the cpu time is spent in the loop:
for (int i = 0; i < N; ++i) {
  C += Ai * Bi.transpose();
}

Typical values of N range from 3 to 11. C, Ai and Bi are dynamic row major matrices of floating point values. The matrix sizes we use are varied. Here are some examples:
 1/ C is 32x48, Ai are 32x33, Bi are 48x33 
 2/ C is 32x128, Ai are 32x240, Bi are 128x240
 3/ C is 32x384, Ai are 32x384, Bi are 384x384
 4/ C is 32x192, Ai are 32x1152, Bi are 192x1152

I am seeing speedups of 5% on an Intel Xeon X5550 @ 2.67GHz and 1% on an Intel Xeon CPU E5-1650 0 @ 3.20GHz. I am compiling with gcc 4.8, using the -O2 -msse2 -mfpmath=both -fno-common -ffast-math options and the EIGEN_NO_DEBUG define.
Comment 4 Benoit Steiner 2013-12-19 02:37:51 UTC
(In reply to comment #2)
I also thought that I could use aligned stores, but this leads to crashes in some of the Eigen tests (in particular when dealing with triangular matrices). Using aligned stores made the code almost 1% faster on Nehalem, but had no noticeable impact on Sandy Bridge.

I am looking into using aligned loads for gemm_pack_rhs and gemm_pack_lhs, and aligned stores in the gebp_kernel, whenever possible. In our application the data is almost always suitably aligned, and this makes the code a few percent faster. If I get it to work properly I'll submit this in a separate patch.
Comment 5 Gael Guennebaud 2014-03-08 21:08:57 UTC
Patch applied:

https://bitbucket.org/eigen/eigen/commits/12f73491e3f0/
Changeset:   12f73491e3f0
User:        benoitsteiner
Date:        2013-12-17 19:49:43
Summary:     Use vectorization when packing row-major rhs matrices. (bug 717)

Other pack routines could be vectorized too. So let's keep this entry open. (I changed the title)
