Summary: | Products where RHS is narrow perform better with non-default blocking sizes | ||
---|---|---|---|
Product: | Eigen | Reporter: | Benoit Jacob <jacob.benoit.1> |
Component: | Core - matrix products | Assignee: | Nobody <eigen.nobody> |
Status: | RESOLVED FIXED | ||
Severity: | Unknown | CC: | benoit.steiner.goog, chtz, gael.guennebaud |
Priority: | Normal | ||
Version: | unspecified | ||
Hardware: | All | ||
OS: | All | ||
Whiteboard: | |||
Bug Depends on: | |||
Bug Blocks: | 937 |
Description
Benoit Jacob
2015-01-25 00:37:04 UTC
Sorry, rather there are 4 classes of cases:

kc == 64 -> 7470 L1 cache read misses, 69.5 GFlop/s
kc == 128 and mc <= 32 -> 7590 L1 cache read misses, 68.5 GFlop/s
kc == 128 and mc >= 64 -> 9610 L1 cache read misses, 64.6 GFlop/s
kc == 256 -> 10900 L1 cache read misses, 61 GFlop/s

So we see a very clear correlation between performance and L1 cache read misses. There is also an isolated special case:

kc == mc == 64 -> 7500 L1 cache misses, 67.7 GFlop/s

First of all, make sure you disabled turbo-boost to get reliable benchmark values. My guess is that since the rhs is very narrow, if kc is small enough then the packed version of the rhs might also fit in the L1 cache. Given your numbers, even with kc=256, both should fit in L1, but my guess is that when moving to the next 12 x KC horizontal panel of the LHS, the values of the rhs get evicted from L1.

Yes, I agree that this is likely the explanation. But since CPUs are so smart about prefetching, it's not easy to figure out the best choice of blocking sizes, even with this observation. See bug 937 comment 8.

Let's accept that it's OK to have ad-hoc logic for each CPU architecture to determine blocking sizes. The present bug can then easily be addressed in an ad-hoc way.

Blocking on the lhs while keeping kc=256 also performs well, assuming we don't pack the rhs multiple times.

I think that the following changesets allow closing this bug.

https://bitbucket.org/eigen/eigen/commits/c8c042f286b2/
Changeset: c8c042f286b2
User: ggael
Date: 2015-02-26 16:01:33+00:00
Summary: Avoid packing rhs multiple-times when blocking on the lhs only.

and

https://bitbucket.org/eigen/eigen/commits/52572e60b5d3/
Changeset: 52572e60b5d3
User: ggael
Date: 2015-02-26 15:04:35+00:00
Summary: Implement a more generic blocking-size selection algorithm.

-- GitLab Migration Automatic Message --

This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity.
You can subscribe and participate further in the new bug through this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/938.
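
The cache-footprint guess in the discussion above (the packed rhs and a 12 x kc lhs panel both fit in L1, yet the rhs may be evicted when the next lhs panel streams in) can be illustrated with some back-of-the-envelope arithmetic. This is only a rough sketch, not Eigen's blocking logic: the 32 KiB L1 size, the `float` scalar type, the 16-column rhs, and all function names are assumptions made for illustration, and the model ignores associativity and prefetching.

```cpp
#include <cstddef>

// Hypothetical parameters, NOT taken from the bug report:
// a 32 KiB L1 data cache and float scalars.
constexpr std::size_t kL1Bytes = 32 * 1024;
// Height of the horizontal lhs panel ("12 x KC" in the discussion).
constexpr std::size_t kLhsPanelRows = 12;

// Bytes occupied by the packed rhs block (kc rows, rhs_cols columns).
constexpr std::size_t packed_rhs_bytes(std::size_t kc, std::size_t rhs_cols) {
  return kc * rhs_cols * sizeof(float);
}

// Bytes occupied by one horizontal lhs panel (12 rows, kc columns).
constexpr std::size_t lhs_panel_bytes(std::size_t kc) {
  return kLhsPanelRows * kc * sizeof(float);
}

// True if the packed rhs and a single lhs panel fit in L1 together.
constexpr bool fits_in_l1(std::size_t kc, std::size_t rhs_cols) {
  return packed_rhs_bytes(kc, rhs_cols) + lhs_panel_bytes(kc) <= kL1Bytes;
}

// True if the packed rhs still fits while the current lhs panel and the
// next one are both resident, i.e. loading the next panel need not push
// the rhs out (under this simplistic capacity-only model).
constexpr bool rhs_survives_next_panel(std::size_t kc, std::size_t rhs_cols) {
  return packed_rhs_bytes(kc, rhs_cols) + 2 * lhs_panel_bytes(kc) <= kL1Bytes;
}
```

Under these assumed numbers, kc = 256 with a 16-column rhs gives 16 KiB of packed rhs plus 12 KiB per lhs panel: one panel fits alongside the rhs, but two do not, whereas kc = 128 leaves room for both. That is consistent with the direction of the measurements reported above, though it does not by itself explain them.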