This bugzilla service is closed. All entries have been migrated to https://gitlab.com/libeigen/eigen

Bug 938

Summary: Products where RHS is narrow perform better with non-default blocking sizes
Product: Eigen
Reporter: Benoit Jacob <jacob.benoit.1>
Component: Core - matrix products
Assignee: Nobody <eigen.nobody>
Status: RESOLVED FIXED
Severity: Unknown
CC: benoit.steiner.goog, chtz, gael.guennebaud
Priority: Normal
Version: unspecified
Hardware: All
OS: All
Whiteboard:
Bug Depends on:
Bug Blocks: 937

Description Benoit Jacob 2015-01-25 00:37:04 UTC
For example, consider MatrixXf products of size (256 x 256) times (256 x 16), so that the RHS is narrow.

The attachment in bug 937 comment 3 shows that on a Nexus 4, the default blocking parameter kc=256 gives only 3.8 GFlop/s, while the lower value kc=128 gives 9.2 GFlop/s, more than 2x faster!

On a Core i7, the attachment in bug 937 comment 1 shows that the default blocking parameter kc=256 gives 61 GFlop/s, while the lower value kc=128 gives 68.5 GFlop/s for sufficiently small mc, and kc=64 even gives 69.5 GFlop/s for all values of mc!
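
For reference, a minimal sketch of how such a throughput figure can be measured for this shape with stock Eigen and a plain timer is below; the repetition count and warm-up are arbitrary choices, it does not override Eigen's blocking sizes, and it does not count cache misses:

// Minimal throughput sketch for the (256 x 256) * (256 x 16) MatrixXf product.
// Public Eigen only: blocking sizes are whatever Eigen picks by default.
#include <Eigen/Core>
#include <chrono>
#include <iostream>

int main()
{
  const int m = 256, k = 256, n = 16;
  Eigen::MatrixXf A = Eigen::MatrixXf::Random(m, k);
  Eigen::MatrixXf B = Eigen::MatrixXf::Random(k, n);
  Eigen::MatrixXf C(m, n);

  C.noalias() = A * B;  // warm-up

  const int reps = 1000;
  auto t0 = std::chrono::high_resolution_clock::now();
  for (int r = 0; r < reps; ++r)
    C.noalias() = A * B;
  auto t1 = std::chrono::high_resolution_clock::now();

  const double seconds = std::chrono::duration<double>(t1 - t0).count();
  const double gflops = 2.0 * m * n * k * reps / seconds * 1e-9;
  std::cout << "time per product = " << seconds / reps << " s, "
            << gflops << " GFlop/s (check: " << C(0, 0) << ")\n";
}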

The Core i7 results also have L1 cache miss counts, which show that the performance degradation comes with an increase of the L1 cache read miss count.

size (256,256,16), block (64,32,16), l1_misses (r: 7.47e+03, w: 3.04e+03), time=3.02e-05s, achieved 69.5 GFlop/s
size (256,256,16), block (64,64,16), l1_misses (r: 7.5e+03, w: 3.03e+03), time=3.1e-05s, achieved 67.7 GFlop/s
size (256,256,16), block (64,128,16), l1_misses (r: 7.47e+03, w: 3.04e+03), time=3.02e-05s, achieved 69.5 GFlop/s
size (256,256,16), block (64,256,16), l1_misses (r: 7.47e+03, w: 3.04e+03), time=3.02e-05s, achieved 69.5 GFlop/s
size (256,256,16), block (128,16,16), l1_misses (r: 7.59e+03, w: 2.88e+03), time=3.06e-05s, achieved 68.5 GFlop/s
size (256,256,16), block (128,32,16), l1_misses (r: 7.59e+03, w: 2.88e+03), time=3.06e-05s, achieved 68.5 GFlop/s
size (256,256,16), block (128,64,16), l1_misses (r: 9.61e+03, w: 2.88e+03), time=3.25e-05s, achieved 64.6 GFlop/s
size (256,256,16), block (128,128,16), l1_misses (r: 9.61e+03, w: 2.88e+03), time=3.24e-05s, achieved 64.7 GFlop/s
size (256,256,16), block (128,256,16), l1_misses (r: 9.6e+03, w: 2.88e+03), time=3.24e-05s, achieved 64.6 GFlop/s
size (256,256,16), block (256,16,16), l1_misses (r: 1.09e+04, w: 2.7e+03), time=3.42e-05s, achieved 61.3 GFlop/s
size (256,256,16), block (256,32,16), l1_misses (r: 1.09e+04, w: 2.7e+03), time=3.43e-05s, achieved 61.1 GFlop/s
size (256,256,16), block (256,64,16), l1_misses (r: 1.09e+04, w: 2.7e+03), time=3.42e-05s, achieved 61.2 GFlop/s
size (256,256,16), block (256,128,16), l1_misses (r: 1.09e+04, w: 2.7e+03), time=3.42e-05s, achieved 61.4 GFlop/s
size (256,256,16), block (256,256,16), l1_misses (r: 1.09e+04, w: 2.7e+03), time=3.43e-05s, achieved 61.2 GFlop/s

Thus we see three classes of cases:

kc == 64 or (kc == 128 and mc <= 32)  ->  7590 L1 cache read misses, 69.5 GFlop/s
kc == 128 and mc >= 64  ->  9610 L1 cache read misses, 68.5 GFlop/s
kc == 256  ->  10900 L1 cache read misses, 61 GFlop/s

Given that this CPU has an L1 data cache of 32 KB, can you make sense of these results?
Comment 1 Benoit Jacob 2015-01-25 00:41:21 UTC
Sorry, rather there are 4 classes of cases:

kc == 64 -> 7470 L1 cache read misses, 69.5 GFlop/s
kc == 128 and mc <= 32 -> 7590 L1 cache read misses, 68.5 GFlop/s
kc == 128 and mc >= 64  ->  9610 L1 cache read misses, 64.6 GFlop/s
kc == 256 -> 10900 L1 cache read misses, 61 GFlop/s

So we see a very clear correlation between performance and L1 cache read misses.
Comment 2 Benoit Jacob 2015-01-25 01:48:14 UTC
There is also an isolated special case:
kc == mc == 64 -> 7500 L1 cache read misses, 67.7 GFlop/s
Comment 3 Gael Guennebaud 2015-01-26 18:28:55 UTC
First of all, make sure you disabled turbo-boost to get reliable benchmark values.

My guess is that since the rhs is very narrow, if kc is small enough then the packed version of the rhs might also fit in the L1 cache. Given your numbers, even with kc=256 both should fit in L1, but my guess is that when moving to the next 12 x kc horizontal panel of the lhs, the values of the rhs get evicted from L1.
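
A quick back-of-the-envelope check of that guess (the numbers below assume 4-byte floats, the 12-row register blocking mentioned above, and the rhs width of 16 from this benchmark):

// Working-set estimate: one packed 12 x kc lhs strip plus the whole packed
// kc x 16 rhs, compared to the 32 KB L1 data cache. Assumes sizeof(float) == 4.
#include <cstdio>

int main()
{
  const int mr = 12, n = 16, bytes = 4, l1 = 32 * 1024;
  for (int kc : {64, 128, 256}) {
    const int lhs_strip = mr * kc * bytes;  // one 12 x kc lhs panel
    const int rhs_panel = kc * n * bytes;   // the full kc x 16 packed rhs
    std::printf("kc=%3d: %5d B + %5d B = %5d B (%d%% of L1)\n",
                kc, lhs_strip, rhs_panel, lhs_strip + rhs_panel,
                100 * (lhs_strip + rhs_panel) / l1);
  }
}

With kc=256 the two packed panels alone already occupy about 28 KB of the 32 KB L1, so any other traffic (the result block, prefetched lhs data, stack) can easily evict the rhs; with kc=128 or kc=64 there is plenty of headroom, which is consistent with the measured miss counts.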
Comment 4 Benoit Jacob 2015-01-28 18:23:24 UTC
Yes, I agree that this is likely the explanation. But since CPUs are so smart about prefetching, it's not easy to figure out the best choice of blocking sizes, even with this observation.

See bug 937 comment 8. Let's accept that it's OK to have ad-hoc logic for each CPU architecture to determine blocking sizes. The present bug can then easily be addressed in an ad-hoc way.
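
For illustration only, such an ad-hoc rule could look like the sketch below. The helper name, the starting value of 256 (the current default mentioned above), and the half-of-L1 safety margin are all made up for the example; mr = 12 and the L1 size would come from the per-architecture tuning:

// Hypothetical heuristic (not Eigen's actual code): shrink kc until one
// mr x kc lhs strip plus the whole packed kc x n rhs fits comfortably in L1.
#include <algorithm>

int chooseKc(int k, int n, int mr, int l1_bytes, int scalar_bytes)
{
  int kc = std::min(k, 256);                            // start from the default
  while (kc > 32 && (mr + n) * kc * scalar_bytes > l1_bytes / 2)
    kc /= 2;                                            // halve until it fits
  return kc;
}

For the Core i7 case above, chooseKc(256, 16, 12, 32 * 1024, 4) returns 128, which is one of the well-performing settings in the table.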
Comment 5 Gael Guennebaud 2015-02-26 17:05:50 UTC
Blocking on the lhs while keeping kc=256 also performs well, provided we don't pack the rhs multiple times.
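
The loop structure being described, as a plain column-major sketch (not Eigen's internals): the rhs is packed exactly once and reused for every mc-row block of the lhs.

// Sketch: C(m x n) += A(m x k) * B(k x n), all column-major.
// The point is only the packing pattern: B is packed a single time,
// while A is re-packed once per mc-row block.
#include <algorithm>
#include <cstddef>
#include <vector>

void gemm_block_lhs_only(const float* A, const float* B, float* C,
                         int m, int n, int k, int mc)
{
  std::vector<float> packedB(B, B + std::size_t(k) * n);   // pack the rhs once
  std::vector<float> packedA(std::size_t(mc) * k);

  for (int i0 = 0; i0 < m; i0 += mc) {
    const int mb = std::min(mc, m - i0);
    // Pack the current mb x k row block of A.
    for (int p = 0; p < k; ++p)
      for (int i = 0; i < mb; ++i)
        packedA[std::size_t(p) * mb + i] = A[std::size_t(p) * m + i0 + i];
    // Multiply the packed block against the already-packed rhs.
    for (int j = 0; j < n; ++j)
      for (int p = 0; p < k; ++p) {
        const float b = packedB[std::size_t(j) * k + p];
        for (int i = 0; i < mb; ++i)
          C[std::size_t(j) * m + i0 + i] += packedA[std::size_t(p) * mb + i] * b;
      }
  }
}

Since k = kc = 256 here there is no k-blocking at all; the narrow packed rhs (256 x 16 floats, i.e. 16 KB) stays resident while each lhs block streams through.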

I think the following changesets allow us to close this bug.

https://bitbucket.org/eigen/eigen/commits/c8c042f286b2/
Changeset:   c8c042f286b2
User:        ggael
Date:        2015-02-26 16:01:33+00:00
Summary:     Avoid packing rhs multiple-times when blocking on the lhs only.

and

https://bitbucket.org/eigen/eigen/commits/52572e60b5d3/
Changeset:   52572e60b5d3
User:        ggael
Date:        2015-02-26 15:04:35+00:00
Summary:     Implement a more generic blocking-size selection algorithm.
Comment 6 Nobody 2019-12-04 14:08:03 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new bug through this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/938.