Consider this function:
auto constexpr Aligned = ::Eigen::Aligned16;
auto constexpr Unaligned = ::Eigen::Unaligned;
void apply_kernels // a specialized gemv
(
    value_t       * __restrict const output,
    value_t const * __restrict const kernel_weights,
    value_t const * __restrict const biases,
    value_t const * __restrict const patch,
    float         const kernel_dim,
    std::uint8_t  const num_kernels
)
{
    using namespace MB::Math;

    auto const kernels ( rowmajor_matrix( kernel_weights, num_kernels, kernel_dim ) );
    auto const v_patch ( colvector< Aligned >( patch , kernel_dim ) );
    auto const v_biases( colvector< Aligned >( biases, num_kernels ) );
    auto       v_output( colvector<Unaligned>( output, num_kernels ) );

    v_output.noalias()  = kernels * v_patch;
    v_output.noalias() += v_biases;
} // apply_kernels()
AFAIK, this snippet should not allocate any memory, yet we observe memory allocation. Is that a bug or normal behaviour?
We are using commit 246af84e462d from mercurial.
Could you try making a self-contained example? E.g., what is colvector<>? (I assume it is a function which takes a pointer and a size and returns a Map ...) -- same question for rowmajor_matrix.
Also, what is value_t?
Our gemm/gemv implementations do blocking which requires temporary memory. By default this should happen on the stack (using alloca, if available) -- what OS/compiler are you using?
For debugging, you can try defining either EIGEN_NO_MALLOC or EIGEN_RUNTIME_NO_MALLOC, to see where the allocation happens. See here for details:
If memory allocation is the biggest concern, you can try using
kernels * v_patch;
For bigger matrices this is considerably slower, however.
Actually, the two-statement version above is the one that does not insert allocation calls.
If we write the final calculation as a single expression:
v_output.noalias() = kernels * v_patch + v_biases;
then it will insert allocation calls.
ps. kernel_dim is actually an integer (it's a typo in the snippet above)
pps. rowmajor_matrix() and colvector() are just utility functions that create maps/views of the respective types from the raw pointers and sizes...
value_t = float
Our biggest concern is performance: we are trying to equal or exceed Apple's Accelerate framework/BLAS implementation. Currently Eigen is slightly slower here, even though it has a lot more compile-time information available (alignment and aliasing), so it should go through less setup code for each invocation.
Notice these are relatively small data patches (kernel_dim and num_kernels are 8 bit integers).
Memory allocation is obviously a performance killer, and it also makes the function no longer noexcept (which "isn't nice")...
This is with the latest Apple Clang, fast-math, LTO, march=native and all that...
ps. FWIW, with MSVC the codegen is significantly worse than with Clang, as expected (but you do have some probably-stale MSVC workarounds in the code which may no longer be necessary with MSVC14u3 and hinder performance)...
If it helps I can paste you the produced codegen...
(In reply to Domagoj Saric from comment #2)
> v_output.noalias() = kernels * v_patch + v_biases;
That might not be optimized correctly yet, indeed. Does the following work?
v_output.noalias() = v_biases + kernels * v_patch;
Otherwise, just use the workaround with two separate lines for now.
Regarding the performance, how much is "slightly"?
What CPU are you running on?
Does compiling with EIGEN_UNALIGNED_VECTORIZE=0 make a difference?
If dimensions are "very" small, have you tried using lazyProduct?
(In reply to Domagoj Saric from comment #5)
> If it helps I can paste you the produced codegen...
Generally, a minimal, complete, verifiable example would be helpful (ideally, a single .cpp file and some info about how you compile) -- don't paste generated code.
See also: http://stackoverflow.com/help/mcve
I confirm that in 3.3, we rewrite only expressions matching:
"Dense ?= xpr + Product<>"
where "?=" is either =, +=, or -=.
With many more specializations, we could easily support the following patterns too:
"Dense ?= xpr - Product<>"
"Dense ?= Product<> + xpr"
"Dense ?= Product<> - xpr"
Do you confirm that once you have rewritten your expression to get rid of the temp, then the performance matches your expectations? What are the typical sizes for kernel_dim and num_kernels?
Yes, the v_output.noalias() = v_biases + kernels * v_patch; version of the expression does produce better code (i.e. it no longer allocates memory, but it still generates some calls to allocation functions that merely get skipped by runtime checks). Ideally we should be able to get allocation-free codegen for a simple gemv-like operation, no? (AFAICT the algorithm involves simple linear scans through memory.)
Typical sizes for kernel_dim and num_kernels are ~200 and 32 respectively.
OTOH, this function is now history; I've moved to using bigger patches so that I can use matrix*matrix multiplication (and benefit from automatic parallelization).
So, now I essentially have:
auto const kernels ( rowmajor_matrix<Aligned>( p_kernel_weights, num_kernels, kernel_dim ) );
auto const row_patches( colmajor_matrix<Aligned>( p_row_patches , kernel_dim , patches_per_row ) );
auto output ( rowmajor_matrix<Aligned>( p_output , num_kernels, out_h * out_w ) );
And then, once a row of patches is extracted, I do:
auto output_block( output.block( 0, row * patches_per_row, num_kernels, patches_per_row ) );
output_block.noalias() = kernels * row_patches;
Is this the optimal way to approach this?
Notice that row_patches is 'colmajor', i.e. it is used transposed... AFAICT this should make the operation faster, as it can simply scan memory linearly, so it shouldn't need to create blocks, or am I still missing something? Would it make things better if even the output matrix were transposed? (FWIW we default to row-major order.)
Finally, if blocking is unavoidable, and considering that we have to prepare/transform all this data into the appropriate matrix form anyway... maybe Eigen could expose a public API for creating 'pre-blocked data', to avoid the mem allocs and the additional data-transformation passes?
ps. I'm sorry if these questions no longer belong in this bug report; I can move them to an Eigen support forum if you'd like?
@Christoph Hertzberg: Sorry, I forgot to answer your question about 'slightly': in this particular case it was about 3%, but that is for the overall test, in which this operation was the main bottleneck yet still didn't account for all of the elapsed time...
-- GitLab Migration Automatic Message --
This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity.
You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/1267.