This bugzilla service is closed. All entries have been migrated to

Bug 717

Summary: Use vectorization to pack matrix operands
Product: Eigen
Reporter: Benoit Steiner <>
Component: Core - vectorization
Assignee: Nobody <eigen.nobody>
Status: NEW
Severity: Unknown
CC: chtz, gael.guennebaud, jacob.benoit.1
Priority: Normal
Version: 3.2
Hardware: All
OS: All

Attachment: patch against eigen 3.2 (Flags: none)

Description Benoit Steiner 2013-12-17 19:57:28 UTC
Created attachment 408 [details]
patch against eigen 3.2

gemm_pack_rhs<Scalar, Index, nr, RowMajor, Conjugate, PanelMode>::operator() could leverage vectorized instructions when nr == PacketSize. The attached patch does this, which leads to a small speedup when multiplying a large col major matrix with a large row major matrix.
Comment 1 Gael Guennebaud 2013-12-17 20:31:01 UTC
I cannot observe any speedup. Could you specify what you used for benchmarking: CPU/compiler/flags/scalar type/matrix sizes?

On my side I tried with gcc 4.7, on a Core2 P8700, with floats and various square matrices ranging from 24x24 to 2048x2048.

Comment 2 Christoph Hertzberg 2013-12-18 01:15:24 UTC
If I see it correctly, blockB is always aligned (and therefore so is blockB[count] if nr == PacketSize); so maybe using an aligned store would make this worthwhile?
Furthermore, if nr == 2*PacketSize this could be done with two packet loads/stores. Only for nr < PacketSize would vectorization depend on offset, stride and depth.
If further assumptions about rhsStride can be made, the loads could also be aligned.
Comment 3 Benoit Steiner 2013-12-19 02:15:02 UTC
(In reply to comment #1)
I am using a real-life application in which most of the CPU time is spent in the loop:
for (int i = 0; i < N; ++i) {
  C += Ai * Bi.transpose();
}

Typical values of N range from 3 to 11. C, Ai and Bi are dynamically-sized row-major matrices of floats. The matrix sizes we use vary; here are some examples:
 1/ C is 32x48, Ai are 32x33, Bi are 48x33 
 2/ C is 32x128, Ai are 32x240, Bi are 128x240
 3/ C is 32x384, Ai are 32x384, Bi are 384x384
 4/ C is 32x192, Ai are 32x1152, Bi are 192x1152

I am seeing speedups of 5% on an Intel Xeon X5550 @ 2.67GHz and 1% on an Intel Xeon E5-1650 @ 3.20GHz. I am compiling with gcc 4.8, with the -O2 -msse2 -mfpmath=both -fno-common -ffast-math options and the EIGEN_NO_DEBUG define.
Comment 4 Benoit Steiner 2013-12-19 02:37:51 UTC
(In reply to comment #2)
I also thought about using aligned stores, but that leads to crashes in some of the Eigen tests (in particular those dealing with triangular matrices). Using aligned stores made the code almost 1% faster on Nehalem, but had no noticeable impact on Sandy Bridge.

I am looking into using aligned loads in gemm_pack_rhs and gemm_pack_lhs, and aligned stores in the gebp_kernel whenever possible. In our case the operands are almost always aligned, and this makes the code a few percent faster. If I get it to work properly I'll submit it as a separate patch.
Comment 5 Gael Guennebaud 2014-03-08 21:08:57 UTC
Patch applied:
Changeset:   12f73491e3f0
User:        benoitsteiner
Date:        2013-12-17 19:49:43
Summary:     Use vectorization when packing row-major rhs matrices. (bug 717)

Other pack routines could be vectorized too. So let's keep this entry open. (I changed the title)
Comment 6 Nobody 2019-12-04 12:53:10 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to the project's GitLab instance and has been closed to further activity.

You can subscribe and participate further in the new issue through this link to our GitLab instance: