It turns out that Intel Composer 14 does NOT respect the __forceinline directive when it appears on out-of-class member function definitions, only when it appears on the member declarations. For example, DenseBase::lazyAssign, defined separately in Assign.h, is NOT inlined for any but the simplest expression templates (which makes a horrible mess of performance for expressions with a large number of scalars, as they're all pushed onto the stack). Visual Studio, on the other hand, requires __forceinline on the definitions (but has no problem with it also being present on the declarations), so please consider duplicating EIGEN_STRONG_INLINE onto the member declarations.
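To make the request concrete, here is a minimal sketch of the pattern, with the marker on both the in-class declaration and the out-of-class definition. MY_STRONG_INLINE, Base and Vec are hypothetical stand-ins made up for illustration, not Eigen's actual macro or classes, and the compiler behaviour noted in the comments is only what is reported above.

```cpp
// Hypothetical stand-in for EIGEN_STRONG_INLINE (an assumption, not Eigen's definition).
#if defined(_MSC_VER) || defined(__INTEL_COMPILER)
#  define MY_STRONG_INLINE __forceinline
#else
#  define MY_STRONG_INLINE inline
#endif

// Hypothetical CRTP base, loosely mimicking how DenseBase declares lazyAssign.
template <typename Derived>
struct Base {
  // Marker on the declaration: per the report, the only place ICC 14 honours it.
  template <typename Other>
  MY_STRONG_INLINE Derived& lazyAssign(const Other& other);
};

// Marker repeated on the out-of-class definition: per the report, what Visual Studio requires.
template <typename Derived>
template <typename Other>
MY_STRONG_INLINE Derived& Base<Derived>::lazyAssign(const Other& other) {
  Derived& self = static_cast<Derived&>(*this);
  self.value = other.value;  // trivial body, just to have something to inline
  return self;
}

// Hypothetical derived type holding a single scalar.
struct Vec : Base<Vec> {
  int value;
  Vec() : value(0) {}
};

// Small driver: assigns one Vec to another through lazyAssign and returns the result.
inline int assign_demo() {
  Vec a, b;
  b.value = 42;
  a.lazyAssign(b);
  return a.value;
}
```

Whether the call really gets inlined can only be confirmed by inspecting the generated assembly for the compiler in question; the sketch only shows where the marker would be duplicated.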
Sorry for not looking at this issue earlier. ICC is indeed that stupid, and pretty bad at inlining in general. I have examples where it fails to inline the trivial copy-constructor that it generated itself. For instance, for CwiseUnaryOp it introduces calls to functions with a body as trivial as:
movq (%rsi), %rax
movq %rax, (%rdi)
movq 8(%rsi), %rdx
movq %rdx, 8(%rdi)
I'll try to fix as many of them as possible, but I guess that we should also recommend that users compile with -inline-forceinline (or use gcc or clang ;).
Regarding the discrepancies between declarations and definitions: since there are more than 2000 occurrences of EIGEN_STRONG_INLINE, we would need an automatic way to detect them... any ideas?
Here is a first bunch of fixes limiting the damage:
I haven't included the explicit copy-ctor because I'd prefer to find another workaround, hopefully...