I changed the code as follows to reduce the shared memory cost:
//__shared__ volatile Scalar lhs_shmem[72 * 64];
//__shared__ volatile Scalar rhs_shmem[72 * 64];
__shared__ volatile Scalar lhs_shmem[72 * 32];
__shared__ volatile Scalar rhs_shmem[72 * 32];
After this change it compiles without errors. Is it possible to redesign the kernel to reduce its shared memory cost? NVIDIA GPUs usually have at most 48 KB of shared memory per block.
I guess Benoit should have a look at this.
Maybe add a compile-time define to declare the size of the available shared memory?
You want to pack as many values into shared memory as possible in order to maximize performance. Since the amount of shared memory is fixed, this number depends on the size of the scalar used in the contraction. This means that we need to specialize the kernels for each possible input type. Unfortunately we haven't had time to do this so far.
One good strategy would be to write a fallback kernel that does a decent job on the biggest scalar we're likely to encounter (probably complex<double>) and use it unless we have an optimized kernel for the type we care about.
Another strategy would be to call cuBLAS directly whenever possible (i.e. when the input data for the two operands is directly addressable by pointer).
-- GitLab Migration Automatic Message --
This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity.
You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/1212.