I'm calculating "scalar = vector1.transpose() * matrix * vector2". This is the innermost, most time-critical portion of my code, yet the expression is evaluated using a temporary matrix on the heap. Vectors 1 and 2 have compile-time fixed sizes, and the matrix is a block of a dynamically allocated matrix, so there should be no need to allocate a temporary on the heap in this case. The heap allocation accounts for over 50% of my total runtime.
I would very much like if this could be optimized to not cause a temporary on the heap.
The following code illustrates my use case:
Eigen::ArrayXXd data(1000, 1000);
Eigen::Ref<const Eigen::Array<double, Eigen::Dynamic, Eigen::Dynamic>> m_coeff(data.block(30, 30, 4, 4));
double x0 = 0.5;
double x1 = 0.5;
Eigen::Vector4d X0, X1;
X0[0] = x0*(4 - 3 * x0) - 1;
X0[1] = x0*(9 * x0 - 10);
X0[2] = x0*(8 - 9 * x0) + 1;
X0[3] = x0*(3 * x0 - 2);
X1[0] = x1*((2 - x1)*x1 - 1);
X1[1] = x1*x1*(3 * x1 - 5) + 2;
X1[2] = x1*((4 - 3 * x1)*x1 + 1);
X1[3] = x1*x1*(x1 - 1);
// Will trigger the heap-allocation assert, as transpose() is evaluated into a temporary on the heap.
double ans = double(X0.transpose() * (m_coeff.matrix() * X1)) / 4.0;
X0.transpose() is not allocated on the heap, but m_coeff.matrix() * X1 is, because it results in a dynamically-sized vector.
This will eventually be fixed with evaluator-tree optimizations, but certainly not in 3.2.
If you know the size of the block (and the Ref) at compile-time, write:
Ref<const Array<double, 4, 4> > m_coeff(data.block<4,4>(30,30));
If you always access m_coeff in a matrix-like manner, you can also use
Ref<const Matrix4d> m_coeff(data.block<4,4>(30,30));
Furthermore, you may want to replace the last line with
X0.dot(m_coeff.matrix() * X1) / 4.0;
Thanks, those tips gave me another 30% on top of the roughly 50% I gained by removing the temporary by hand (evaluating into a Vector4d temporary).
I love you guys!
I renamed the bug to describe the remaining issue.
Further examples that might eventually benefit:
MatrixXd A, B;
Matrix4d C;
Vector4d x, y;
C + A*B; // A*B must be 4x4
x + A*y; // A must be 4x4, A*y must be 4x1
Probably not important enough to block 3.3.
I'm not sure this is worth the effort: it increases internal logic complexity and compilation time for very little gain in practice, because this is something that is expected to be handled by user code.
There are many more interesting optimization opportunities, e.g., mat*mat*vec, to name just one.