Bug 1568 - Add a fast stack-like allocator for temporaries
Status: NEW
Alias: None
Product: Eigen
Classification: Unclassified
Component: Core - general
Version: 3.4 (development)
Hardware: All All
Importance: Normal Optimization
Assignee: Nobody
Blocks: 1364
Reported: 2018-07-09 14:35 UTC by Gael Guennebaud
Modified: 2018-07-09 14:37 UTC

Description Gael Guennebaud 2018-07-09 14:35:09 UTC
This is a feature I've had in mind for a while and have mentioned on a few occasions in the past. Basically, the idea would be to implement our own stack for allocating temporaries with dynamic sizes, and thus save a huge amount of running time for small matrices by avoiding heap allocations.

For temporaries that are local to a function, alloca + Map<> should be enough, as I just implemented here:

https://bitbucket.org/eigen/eigen/commits/5007ff66c9f6/

In theory, it only remains to replace

  nested_eval<T,N>::type V(X)

by

  ei_declare_local_nested_eval(T,X,N,V)

:)
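
As a rough illustration of the alloca + Map<> idea for function-local temporaries, here is a minimal self-contained sketch. It does not use Eigen and is not the code from the commit above: VecView stands in for Map<>, and SKETCH_DECLARE_LOCAL_TEMP, HeapGuard and kStackLimitBytes are hypothetical names. alloca itself is non-standard, so the header and spelling are guarded per compiler.

```cpp
#include <cstddef>
#include <cstdlib>
#if defined(_MSC_VER)
#include <malloc.h>
#define SKETCH_ALLOCA _alloca
#else
#include <alloca.h>
#define SKETCH_ALLOCA alloca
#endif

// Hypothetical threshold above which the sketch falls back to the heap.
constexpr std::size_t kStackLimitBytes = 16 * 1024;

// Non-owning view over a buffer; a stand-in for Eigen's Map<>.
struct VecView {
  double* data;
  std::size_t size;
};

// Frees the heap fallback, if any (std::free(nullptr) is a no-op).
struct HeapGuard {
  void* ptr;
  ~HeapGuard() { std::free(ptr); }
};

// This has to be a macro: memory from alloca is released when the function
// that called alloca returns, so the call must sit in the user's own frame.
// Allocation-failure handling is omitted for brevity.
#define SKETCH_DECLARE_LOCAL_TEMP(NAME, N)                          \
  const std::size_t NAME##_bytes = (N) * sizeof(double);            \
  const bool NAME##_on_heap = NAME##_bytes > kStackLimitBytes;      \
  void* NAME##_raw = NAME##_on_heap ? std::malloc(NAME##_bytes)     \
                                    : SKETCH_ALLOCA(NAME##_bytes);  \
  HeapGuard NAME##_guard{NAME##_on_heap ? NAME##_raw : nullptr};    \
  VecView NAME{static_cast<double*>(NAME##_raw), (N)}

// Example user: squares the input into a local temporary, then sums it.
double sum_of_squares(const double* x, std::size_t n) {
  SKETCH_DECLARE_LOCAL_TEMP(tmp, n);
  for (std::size_t i = 0; i < tmp.size; ++i) tmp.data[i] = x[i] * x[i];
  double sum = 0.0;
  for (std::size_t i = 0; i < tmp.size; ++i) sum += tmp.data[i];
  return sum;
}
```

The macro form is the point of the sketch: wrapping alloca in a helper function would free the buffer as soon as the helper returned.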

But some temporaries require a longer lifetime, typically the ones created in the ctor of the evaluators. One possibility would be to keep using alloca by delegating the allocation to the function that creates/destroys the evaluator (e.g. the assignment) and then propagating the allocated buffer down the tree:

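  evaluator<Src> src_eval(src);                      // pass 1: build the evaluator tree
  void* buffer = alloca(src_eval.nbTemporyBytes());  // pass 2: one allocation for all temporaries
  src_eval.init(buffer);                             // pass 3: hand the buffer down the tree
  for(i=...)
    dst_eval.coeffRef(i) = src_eval.coeff(i);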


nbTemporyBytes() would sum up the required bytes, and init(void* &buffer) would pass the remaining buffer to the leaves, e.g.:

void binary_evaluator::init(void* &buffer) {
  // each child consumes what it needs and advances buffer
  lhs_eval.init(buffer);
  rhs_eval.init(buffer);
}

and an evaluator that requires a temporary buffer would do:

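void product_evaluator::init(void* &buffer) {
  // construct the temporary in place over the shared buffer
  // (MapType is a placeholder for the type of m_tmp, e.g. a Map<>)
  ::new(&m_tmp) MapType(static_cast<Scalar*>(buffer), rows, cols);
  generic_product_impl(m_tmp, lhs, rhs);
  // advance past the bytes just consumed (no pointer arithmetic on void*)
  buffer = static_cast<char*>(buffer) + sizeof(Scalar)*rows*cols;
}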

Of course, if the total amount of requested memory is too large, then we would allocate on the heap, but only once, which is still a win if there are multiple temporaries.
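
The three passes described above (ctor, nbTemporyBytes, init) can be sketched end to end with toy evaluators. This is only an illustration under stated assumptions, not Eigen code: LeafEval, TempEval, SumEval, assign, eval_2a_plus_3b and kStackLimitBytes are all hypothetical names, TempEval stands in for any evaluator that needs a temporary (such as a product), and a fixed-size local array replaces alloca to keep the sketch portable.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

constexpr std::size_t kStackLimitBytes = 4096;  // hypothetical threshold

// Leaf: reads existing data and needs no temporary.
struct LeafEval {
  const double* data;
  std::size_t n;
  std::size_t size() const { return n; }
  std::size_t nbTemporyBytes() const { return 0; }
  void init(void*&) {}
  double coeff(std::size_t i) const { return data[i]; }
};

// Evaluates factor * arg eagerly into a temporary carved out of the buffer.
template <class Arg>
struct TempEval {
  Arg arg;
  double factor;
  double* tmp;
  std::size_t size() const { return arg.size(); }
  std::size_t nbTemporyBytes() const {
    return arg.nbTemporyBytes() + sizeof(double) * size();
  }
  void init(void*& buffer) {
    arg.init(buffer);                    // children take their share first
    tmp = static_cast<double*>(buffer);  // claim our slice
    for (std::size_t i = 0; i < size(); ++i) tmp[i] = factor * arg.coeff(i);
    buffer = static_cast<char*>(buffer) + sizeof(double) * size();
  }
  double coeff(std::size_t i) const { return tmp[i]; }
};

// Binary node: owns no temporary, forwards the remaining buffer.
template <class L, class R>
struct SumEval {
  L lhs;
  R rhs;
  std::size_t size() const { return lhs.size(); }
  std::size_t nbTemporyBytes() const {
    return lhs.nbTemporyBytes() + rhs.nbTemporyBytes();
  }
  void init(void*& buffer) { lhs.init(buffer); rhs.init(buffer); }
  double coeff(std::size_t i) const { return lhs.coeff(i) + rhs.coeff(i); }
};

// The assignment owns the single allocation shared by all temporaries.
// All slices are multiples of sizeof(double), so they stay aligned.
template <class Src>
void assign(double* dst, Src src) {
  const std::size_t bytes = src.nbTemporyBytes();
  alignas(std::max_align_t) unsigned char stack_buf[kStackLimitBytes];
  void* heap = nullptr;
  if (bytes > kStackLimitBytes) {  // too large: one heap allocation instead
    heap = std::malloc(bytes);
    if (!heap) throw std::bad_alloc();
  }
  void* buffer = heap ? heap : static_cast<void*>(stack_buf);
  src.init(buffer);
  for (std::size_t i = 0; i < src.size(); ++i) dst[i] = src.coeff(i);
  std::free(heap);
}

// dst = 2*a + 3*b, with both scaled operands held in temporaries.
void eval_2a_plus_3b(const double* a, const double* b, double* dst, std::size_t n) {
  using T = TempEval<LeafEval>;
  SumEval<T, T> expr{T{LeafEval{a, n}, 2.0, nullptr}, T{LeafEval{b, n}, 3.0, nullptr}};
  assign(dst, expr);
}
```

In this sketch the two temporaries end up back to back in one buffer, so the expression costs at most one heap allocation regardless of how many temporaries the tree contains.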



Another option would be to implement our own stack-based allocator, which would have the advantage of being cross-platform and could be much larger than the stack. It could even be dynamically expanded to amortize heap allocations across expressions. Another advantage is that the above three-pass approach (ctor/nbTemporyBytes/init) could be avoided by allocating directly in the ctors.
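
One possible shape for such an allocator is a growable bump allocator with LIFO release, sketched below. Everything here is an assumption made for illustration (the names TempStack, ScopedTemps, mark, release, the default block size, and alignment to std::max_align_t rather than a SIMD-friendly alignment). It is single-threaded only and does not address the multi-threading question.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Growable stack of heap blocks; blocks are kept and reused after release.
class TempStack {
 public:
  struct Marker { std::size_t block; std::size_t used; };

  explicit TempStack(std::size_t block_bytes = 64 * 1024) : block_bytes_(block_bytes) {}
  ~TempStack() { for (Block& b : blocks_) std::free(b.base); }
  TempStack(const TempStack&) = delete;
  TempStack& operator=(const TempStack&) = delete;

  // Remembers the current top of the stack.
  Marker mark() const {
    return Marker{current_, current_ < blocks_.size() ? blocks_[current_].used : 0};
  }

  // Bump-allocates; moves to the next block or grows when the current one is full.
  void* allocate(std::size_t bytes) {
    const std::size_t a = alignof(std::max_align_t);
    bytes = (bytes + a - 1) / a * a;  // keep every allocation aligned
    for (;;) {
      if (current_ < blocks_.size()) {
        Block& b = blocks_[current_];
        if (b.capacity - b.used >= bytes) {
          void* p = b.base + b.used;
          b.used += bytes;
          return p;
        }
        ++current_;  // blocks past current_ are always empty
      } else {
        add_block(bytes > block_bytes_ ? bytes : block_bytes_);
      }
    }
  }

  // Pops everything allocated since the marker was taken (LIFO order).
  void release(Marker m) {
    for (std::size_t i = m.block + 1; i < blocks_.size() && i <= current_; ++i)
      blocks_[i].used = 0;
    if (m.block < blocks_.size()) blocks_[m.block].used = m.used;
    current_ = m.block;
  }

  std::size_t bytes_in_use() const {
    std::size_t total = 0;
    for (const Block& b : blocks_) total += b.used;
    return total;
  }
  std::size_t block_count() const { return blocks_.size(); }

 private:
  struct Block { unsigned char* base; std::size_t capacity; std::size_t used; };

  void add_block(std::size_t capacity) {
    blocks_.reserve(blocks_.size() + 1);  // so push_back below cannot throw
    void* p = std::malloc(capacity);
    if (!p) throw std::bad_alloc();
    blocks_.push_back(Block{static_cast<unsigned char*>(p), capacity, 0});
  }

  std::vector<Block> blocks_;
  std::size_t current_ = 0;
  std::size_t block_bytes_;
};

// RAII scope: temporaries allocated inside are popped when the scope ends.
class ScopedTemps {
 public:
  explicit ScopedTemps(TempStack& s) : stack_(s), marker_(s.mark()) {}
  ~ScopedTemps() { stack_.release(marker_); }
 private:
  TempStack& stack_;
  TempStack::Marker marker_;
};
```

With something like this, an evaluator ctor could call allocate directly and the enclosing assignment would hold a ScopedTemps, so the separate nbTemporyBytes/init passes would not be needed. Blocks obtained once are reused by later expressions, which is how the heap cost gets amortized.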

The main difficulty of this approach is multi-threading, which I have no clue how to deal with.
