| Summary: | Use vectorization to pack matrix operands |
|---|---|
| Product: | Eigen |
| Component: | Core - vectorization |
| Reporter: | Benoit Steiner <benoit.steiner.goog> |
| Assignee: | Nobody <eigen.nobody> |
| Status: | NEW |
| Severity: | Unknown |
| Priority: | Normal |
| CC: | chtz, gael.guennebaud, jacob.benoit.1 |
| Version: | 3.2 |
| Hardware: | All |
| OS: | All |
| Attachments: | attachment 408: patch against eigen 3.2 |
**Description** (Benoit Steiner):

Created attachment 408: patch against eigen 3.2

`gemm_pack_rhs<Scalar, Index, nr, RowMajor, Conjugate, PanelMode>::operator()` could leverage vectorized instructions when `nr == PacketSize`. The attached patch does this, which leads to a small speedup when multiplying a large column-major matrix with a large row-major matrix.

**Comment 1:**

I cannot observe any speedup. Could you specify what you used for benchmarking: CPU, compiler, flags, scalar type, and matrix sizes? On my side I tried with gcc 4.7 on a Core2 P8700, with floats and various square matrices ranging from 24x24 to 2048x2048. Thanks.

**Comment 2:**

If I see it correctly, `blockB` is always aligned (and therefore so is `blockB[count]` when `nr == PacketSize`), so using an aligned store might make this worthwhile. Furthermore, if `nr == 2*PacketSize`, this could be done with two packet loads/stores. Only for `nr < PacketSize` would vectorization depend on offset, stride, and depth. If further assumptions about `rhsStride` can be made, the loads can also be aligned.

**Comment 3** (in reply to comment 1):

I am using a real-life application in which most of the CPU time is spent in the loop:

```cpp
for (int i = 0; i < N; ++i) {
  C += Ai * Bi.transpose();
}
```

Typical values of N range from 3 to 11. C, Ai, and Bi are dynamic row-major matrices of floats. The matrix sizes we use vary; some examples:

1. C is 32x48, Ai are 32x33, Bi are 48x33
2. C is 32x128, Ai are 32x240, Bi are 128x240
3. C is 32x384, Ai are 32x384, Bi are 384x384
4. C is 32x192, Ai are 32x1152, Bi are 192x1152

I am seeing speedups of 5% on an Intel Xeon X5550 @ 2.67GHz and 1% on an Intel Xeon E5-1650 @ 3.20GHz. I compile with gcc 4.8, using `-O2 -msse2 -mfpmath=both -fno-common -ffast-math` and the `EIGEN_NO_DEBUG` define.

**Comment 4** (in reply to comment 2):

I also thought I could use aligned stores, but that leads to crashes in some of the Eigen tests (in particular when dealing with triangular matrices). Using an aligned store made the code almost 1% faster on Nehalem, but had no noticeable impact on Sandy Bridge. I am looking into using aligned loads in `gemm_pack_rhs` and `gemm_pack_lhs`, and aligned stores in the `gebp_kernel`, whenever possible. In our case this is almost always possible, and it makes the code a few percent faster. If I get it to work properly I'll submit it as a separate patch.

**Comment 5:**

Patch applied: https://bitbucket.org/eigen/eigen/commits/12f73491e3f0/

Changeset: 12f73491e3f0
User: benoitsteiner
Date: 2013-12-17 19:49:43
Summary: Use vectorization when packing row-major rhs matrices. (bug 717)

**Comment 6:**

Other pack routines could be vectorized too, so let's keep this entry open. (I changed the title.)

-- GitLab Migration Automatic Message --

This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug via this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/717.
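To make the packing idea concrete, here is a minimal, hypothetical sketch of the vectorized path discussed above, written with raw SSE2 intrinsics rather than Eigen's internal packet helpers. The function name `pack_rhs_panel` and its parameters are invented for illustration; the real `gemm_pack_rhs` also handles conjugation, panel mode, offsets, and the `nr < PacketSize` tail cases. It assumes `nr == PacketSize == 4` for floats and, following comment 2, that `blockB` is 16-byte aligned.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics; the report was built with -msse2

// Hypothetical, simplified packing of a row-major rhs panel:
// copies nr == 4 contiguous floats per depth step into blockB.
void pack_rhs_panel(float* blockB, const float* rhs, int rhsStride,
                    int depth, int j2)
{
  int count = 0;
  for (int k = 0; k < depth; ++k) {
    // Row-major rhs: the 4 packed columns are contiguous in memory,
    // so one unaligned load replaces 4 scalar copies.
    __m128 p = _mm_loadu_ps(&rhs[k * rhsStride + j2]);
    // blockB is assumed 16-byte aligned and count advances by 4, so the
    // store can stay aligned (comment 2); _mm_storeu_ps would be the
    // safe fallback given the crashes reported in comment 4.
    _mm_store_ps(&blockB[count], p);
    count += 4;
  }
}
```

For `nr == 2*PacketSize`, the same loop body would issue two load/store pairs per depth step, as comment 2 suggests.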
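For context, a self-contained version of the hot loop from comment 3 might look as follows. Only the loop body, the row-major float matrix types, and the example sizes come from the report; the `accumulate` harness and the use of `std::vector` are hypothetical.

```cpp
#include <vector>
#include <Eigen/Dense>

using RowMajorMatrixXf =
    Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;

// Hot loop from the report: C += Ai * Bi.transpose(), N typically 3 to 11.
// Example sizes from comment 3: C is 32x48, Ai are 32x33, Bi are 48x33.
void accumulate(RowMajorMatrixXf& C,
                const std::vector<RowMajorMatrixXf>& A,
                const std::vector<RowMajorMatrixXf>& B)
{
  const int N = static_cast<int>(A.size());
  for (int i = 0; i < N; ++i)
    C += A[i] * B[i].transpose();
}
```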