The following two routines are functionally equivalent, though the latter generates better x86-SSE code. The main (only) difference is the use of the 'haddps' instruction.

void func1(Matrix4f& m, Vector4f& v, Vector4f& r)
{
  r = m.transpose() * v;
}

void func2(Matrix4f& m, Vector4f& v, Vector4f& r)
{
  __m128 v0 = _mm_load_ps(&v(0));

  __m128 m0 = _mm_load_ps(&m(0, 0));
  __m128 m1 = _mm_load_ps(&m(0, 1));
  __m128 m2 = _mm_load_ps(&m(0, 2));
  __m128 m3 = _mm_load_ps(&m(0, 3));

  __m128 t0 = _mm_mul_ps(v0, m0);
  __m128 t1 = _mm_mul_ps(v0, m1);
  __m128 t2 = _mm_mul_ps(v0, m2);
  __m128 t3 = _mm_mul_ps(v0, m3);

  __m128 t01 = _mm_hadd_ps(t0, t1);
  __m128 t23 = _mm_hadd_ps(t2, t3);
  __m128 t0123 = _mm_hadd_ps(t01, t23);

  _mm_store_ps(&r(0), t0123);
}

Generated assembly:

func1(Eigen::Matrix<float, 4, 4, 0, 4, 4>&, Eigen::Matrix<float, 4, 1, 0, 4, 1>&, Eigen::Matrix<float, 4, 1, 0, 4, 1>&):
        movaps  (%rsi), %xmm0
        movaps  (%rdi), %xmm1
        mulps   %xmm0, %xmm1
        haddps  %xmm1, %xmm1
        haddps  %xmm1, %xmm1
        movss   %xmm1, -24(%rsp)
        movaps  16(%rdi), %xmm1
        mulps   %xmm0, %xmm1
        haddps  %xmm1, %xmm1
        haddps  %xmm1, %xmm1
        movss   %xmm1, -20(%rsp)
        movaps  32(%rdi), %xmm1
        mulps   %xmm0, %xmm1
        mulps   48(%rdi), %xmm0
        haddps  %xmm1, %xmm1
        haddps  %xmm0, %xmm0
        haddps  %xmm1, %xmm1
        haddps  %xmm0, %xmm0
        movss   %xmm1, -16(%rsp)
        movss   %xmm0, -12(%rsp)
        movaps  -24(%rsp), %xmm0
        movaps  %xmm0, (%rdx)
        ret

func2(Eigen::Matrix<float, 4, 4, 0, 4, 4>&, Eigen::Matrix<float, 4, 1, 0, 4, 1>&, Eigen::Matrix<float, 4, 1, 0, 4, 1>&):
        movaps  (%rsi), %xmm2
        movaps  (%rdi), %xmm0
        movaps  16(%rdi), %xmm3
        mulps   %xmm2, %xmm0
        mulps   %xmm2, %xmm3
        movaps  32(%rdi), %xmm1
        haddps  %xmm3, %xmm0
        mulps   %xmm2, %xmm1
        mulps   48(%rdi), %xmm2
        haddps  %xmm2, %xmm1
        haddps  %xmm1, %xmm0
        movaps  %xmm0, (%rdx)
        ret
As noticed in bug 359, the same issue occurs for haddpd, and it also occurs when haddp* is emulated (SSE2). The reason is that for such "transposed" products the expression is evaluated one coefficient at a time. Therefore each dot product has to be fully evaluated independently, and the calls to haddp* cannot be factorized. The solution would be to mark such an expression "vectorizable" so that multiple rows (or columns) can be reduced at once. The most tricky part is that for matrix-matrix products we can arbitrarily choose to evaluate the result rowwise or columnwise, so we need a way to say that the expression has no preferred storage order. Actually, the same is also true for matrix * matrix.transpose(), which can be vectorized both rowwise and columnwise.
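To make the "evaluated one coefficient at a time" point concrete, here is a minimal sketch in plain intrinsics (names like dot4 and transposed_product_coeffwise are purely illustrative, not Eigen internals) of what each coefficient costs when it is reduced on its own:

#include <pmmintrin.h>   // SSE3, for _mm_hadd_ps

// Horizontal dot product of two packets, reduced independently.
static inline float dot4(__m128 a, __m128 b)
{
  __m128 t = _mm_mul_ps(a, b);
  t = _mm_hadd_ps(t, t);   // two haddps per coefficient...
  t = _mm_hadd_ps(t, t);
  return _mm_cvtss_f32(t);
}

// Per-coefficient evaluation: 8 haddps for the 4 coefficients,
// versus 3 haddps when the four reductions are factorized as in func2 above.
void transposed_product_coeffwise(const float* m /* 4x4, column-major */,
                                  const float* v, float* r)
{
  __m128 v0 = _mm_load_ps(v);
  for (int j = 0; j < 4; ++j)
    r[j] = dot4(_mm_load_ps(m + 4*j), v0);
}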
In bug 404 I suggested making products always evaluate to ColMajor and, if the result shall be RowMajor, essentially replacing it by: res.transpose() = rhs.transpose() * lhs.transpose(); Theoretically, this could be done for all assignments, but maybe this requires the expression evaluator to be implemented first (bug 99) (and things get complicated if different majorness is mixed in bigger expressions). As for using the haddp* reduction, I think the cleanest way would be to implement it via the meta-packet idea (I don't think we have a bz entry for that yet).
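A user-level sketch of the algebraic identity behind that rewrite (only an illustration of the proposed assignment, not the internal implementation; the function name is hypothetical):

#include <Eigen/Dense>

// For a row-major destination, res = lhs * rhs can equivalently be written as
// res.transpose() = rhs.transpose() * lhs.transpose(), since (A*B)^T = B^T * A^T.
// The left-hand side res.transpose() is then a column-major view of res.
void assign_rowmajor(const Eigen::Matrix4f& lhs, const Eigen::Matrix4f& rhs,
                     Eigen::Matrix<float, 4, 4, Eigen::RowMajor>& res)
{
  res.transpose() = rhs.transpose() * lhs.transpose();
  // ...same result as the direct form:
  // res = lhs * rhs;
}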
I'm not sure meta-packets are needed here. This can be done entirely internally to CoeffBasedProduct, something like:

  Packet pres[PacketSize];
  for(int k=0; k<PacketSize; ++k)
    product_coeff_vectorized_unroller<UnrollingIndex+1-PacketSize, Lhs, Rhs, Packet>::run(row+k, col, lhs, rhs, pres[k]);
  Packet res = preduxp(pres);
  return res;

Of course it would be better to factorize this for loop to explicitly load packets of the rhs only once - as we do for dynamic-sized matrix-vector products - but that's more tricky to write in a general fashion.
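For reference, on SSE the final preduxp step boils down to the same hadd chain used in func2 above; a rough sketch (illustrative only, Eigen's actual packet-math implementation may differ):

#include <pmmintrin.h>

// Reduce four packets into one: lane i of the result is the horizontal sum
// of p[i]. This is exactly the three-haddps pattern of func2 above.
static inline __m128 preduxp4(const __m128 p[4])
{
  __m128 t01 = _mm_hadd_ps(p[0], p[1]);
  __m128 t23 = _mm_hadd_ps(p[2], p[3]);
  return _mm_hadd_ps(t01, t23);
}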
Yes, that's indeed possible with reasonable efficiency without meta-packets. We need to check whether the above construct leads to unnecessary store/load operations (but that would be the same for meta-packets, I assume). So the main problem is to decide whether the expression needs to be unrolled row-wise or column-wise? I previously assumed this had already been solved for matrix * matrix.transpose() products (but I never checked, since I hardly ever have row-major result matrices).
-- GitLab Migration Automatic Message -- This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/312.