For sufficiently expensive operations it could be worthwhile to vectorize even if the input or output is not stored contiguously in memory.
For the trivial case of a unary operator with SSE4.1 where only the input has a dynamic stride, loading the data would cost 1 movss and 3 insertps instructions instead of one movups, but vectorizing could accelerate the computation by almost a factor of 4: https://godbolt.org/z/xKZcVM
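To illustrate the load sequence described above, here is a minimal sketch using intrinsics. The names load_strided and sqrt_strided are hypothetical (they are not Eigen API), sqrt merely stands in for "some expensive unary operator", and the target attribute assumes GCC or Clang on x86:

```cpp
#include <cmath>
#include <cstddef>
#include <smmintrin.h>  // SSE4.1: _mm_insert_ps

// Hypothetical helper: gather 4 floats that are `stride` elements apart.
// Intended to map to 1 movss + 3 insertps.
__attribute__((target("sse4.1")))
inline __m128 load_strided(const float* p, std::ptrdiff_t stride) {
    __m128 v = _mm_load_ss(p);                                 // lane 0, upper lanes zeroed
    v = _mm_insert_ps(v, _mm_load_ss(p + stride), 0x10);       // source lane 0 -> lane 1
    v = _mm_insert_ps(v, _mm_load_ss(p + 2 * stride), 0x20);   // source lane 0 -> lane 2
    v = _mm_insert_ps(v, _mm_load_ss(p + 3 * stride), 0x30);   // source lane 0 -> lane 3
    return v;
}

// Hypothetical unary kernel: strided input, contiguous output.
__attribute__((target("sse4.1")))
void sqrt_strided(const float* in, std::ptrdiff_t stride, float* out, std::ptrdiff_t n) {
    std::ptrdiff_t i = 0;
    for (; i + 4 <= n; i += 4)  // 4 results per iteration
        _mm_storeu_ps(out + i, _mm_sqrt_ps(load_strided(in + i * stride, stride)));
    for (; i < n; ++i)          // scalar tail
        out[i] = std::sqrt(in[i * stride]);
}
```

Whether this pays off depends on how expensive the operator is relative to the four scalar loads; for something as cheap as a negation the gather overhead would likely dominate.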
Further optimizations are possible if the stride is known at compile time to be exactly 2 (load 16 bytes twice and shuffle the two registers together).
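A sketch of the stride-2 case, using only baseline SSE. The helper name load_stride2 is hypothetical. Note that this version reads p[0..7], i.e. one element past the last value actually needed, so the caller has to guarantee that p[7] is readable:

```cpp
#include <xmmintrin.h>  // SSE: _mm_loadu_ps, _mm_shuffle_ps

// Hypothetical helper: returns {p[0], p[2], p[4], p[6]}.
// Caution: also reads p[7], which must be valid memory.
inline __m128 load_stride2(const float* p) {
    __m128 lo = _mm_loadu_ps(p);      // p[0] p[1] p[2] p[3]
    __m128 hi = _mm_loadu_ps(p + 4);  // p[4] p[5] p[6] p[7]
    // Take lanes 0 and 2 from lo, then lanes 0 and 2 from hi.
    return _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(2, 0, 2, 0));
}
```

This replaces the four scalar loads with two unaligned vector loads and one shufps.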
Storing could equivalently be done using extractps (or, for stride=2, two shuffles and masked stores, if they are available).
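The store side could look roughly like this. The name store_strided is hypothetical, and the target attribute again assumes GCC or Clang on x86. _mm_extract_ps returns the lane's bit pattern as an int, so the sketch copies the bits into place with memcpy; compilers typically fold this into an extractps with a memory destination, but that is not guaranteed:

```cpp
#include <cstddef>
#include <cstring>
#include <smmintrin.h>  // SSE4.1: _mm_extract_ps

// Hypothetical helper: scatter the 4 lanes of v to p[0], p[stride],
// p[2*stride], p[3*stride].
__attribute__((target("sse4.1")))
inline void store_strided(float* p, std::ptrdiff_t stride, __m128 v) {
    _mm_store_ss(p, v);  // lane 0 via movss
    int b1 = _mm_extract_ps(v, 1);  // raw bits of lane 1
    int b2 = _mm_extract_ps(v, 2);  // raw bits of lane 2
    int b3 = _mm_extract_ps(v, 3);  // raw bits of lane 3
    std::memcpy(p + stride, &b1, sizeof b1);
    std::memcpy(p + 2 * stride, &b2, sizeof b2);
    std::memcpy(p + 3 * stride, &b3, sizeof b3);
}
```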
-- GitLab Migration Automatic Message --
This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity.
You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/1737.