This bugzilla service is closed. All entries have been migrated to

Bug 931

Summary: Fix tracking and use of cache sizes
Product: Eigen
Reporter: Benoit Jacob <jacob.benoit.1>
Component: Core - matrix products
Assignee: Nobody <eigen.nobody>
Status: NEW
Severity: Unknown
CC: chtz, gael.guennebaud
Priority: Normal
Version: unspecified
Hardware: All
OS: All
Bug Depends on:
Bug Blocks: 937
Attachments:
  Part 1: refactor the cache sizes tracking
  Part 1.1: move the CPUID code to CacheSizes.h
  Part 2: fix computeProductBlockingSizes, actually use the cache sizes
  Part 2: fix computeProductBlockingSizes, actually use the cache sizes (flags: none)

Description Benoit Jacob 2015-01-19 22:00:47 UTC
Two parts here:

1) Refactor how we track cache sizes with static-storage variables, to make it cleaner and more future-proof. See Patches 1 and 1.1 below. This also makes it easy to provide one's own description of the cache levels, such as can be obtained with the new function added in bug 930.

2) Fix computeProductBlockingSizes to: a) actually take cache sizes into account, and b) avoid applying on ARM an optimization (limiting blocking sizes) that seems to be beneficial only on x86/x86-64 and is actually harmful on ARM.
Comment 1 Benoit Jacob 2015-01-19 22:03:46 UTC
Created attachment 517 [details]
Part 1: refactor the cache sizes tracking

How do you like this API?
Initially I wanted to use std::vector instead of rolling my own arrays. But that didn't play well with EIGEN_RUNTIME_NO_MALLOC. I could have used std::array, but at this point it was looking like I might as well use a plain array rather than increasing Eigen's surface of contact with the STL.
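The plain-array design described above can be sketched as follows. This is an illustrative mock-up, not the actual patch's API: the struct and member names are hypothetical, chosen only to show how a fixed-capacity array avoids heap allocation (and thus stays compatible with EIGEN_RUNTIME_NO_MALLOC) without pulling in std::vector or std::array.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch of a cache-sizes descriptor backed by a plain array.
// Fixed capacity means no dynamic allocation, which is what makes this
// approach work under EIGEN_RUNTIME_NO_MALLOC.
struct CacheSizesDescriptor {
  static const int kMaxLevels = 8;  // generous upper bound on cache levels

  std::size_t sizes[kMaxLevels];  // sizes[i] = size in bytes of level i+1
  int num_levels;

  CacheSizesDescriptor() : num_levels(0) {}

  // Append one cache level (innermost first).
  void push_level(std::size_t bytes) {
    assert(num_levels < kMaxLevels);
    sizes[num_levels++] = bytes;
  }

  // Size of the outermost (top-level) cache, or 0 if none recorded.
  std::size_t top_level() const {
    return num_levels ? sizes[num_levels - 1] : 0;
  }
};
```

A user-supplied description (as discussed in bug 930) would then just be a matter of filling such a struct instead of querying CPUID.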
Comment 2 Benoit Jacob 2015-01-19 22:04:53 UTC
Created attachment 518 [details]
Part 1.1: move the CPUID code to CacheSizes.h

This is really part of Part 1, but I kept it a separate patch for better hg history and ease of reviewing: this is just moving a big chunk of existing code.
Comment 3 Benoit Jacob 2015-01-19 22:09:41 UTC
Created attachment 519 [details]
Part 2: fix computeProductBlockingSizes, actually use the cache sizes

Please review carefully: I am not 100% confident...

This gives me near-optimal performance (within a few %) on both a Nexus 4 (ARM) and a Core i7. I found that the existing std::min code clamping k,n,m is actually beneficial on Core i7, but is detrimental on Nexus 4.

Regarding the choice of TopLevel() instead of L2(): due to the above-mentioned clamping, it doesn't make a huge difference (on desktop we never use huge blocks, and on mobile we mostly have no more than 2 levels of cache). Still, this is maybe 2% faster with TopLevel() than with L2(), and my limited understanding was that on Core i7 CPUs, where L2 is ridiculously small, TopLevel is what we should want here, but let me know if that's not right.
Comment 4 Benoit Jacob 2015-01-19 22:17:53 UTC
Created attachment 520 [details]
Part 2: fix computeProductBlockingSizes, actually use the cache sizes
Comment 5 Gael Guennebaud 2015-01-22 00:57:10 UTC
If you look at the CPUID code, we already return as the "L2" cache size the size of the L3 cache on x86 architectures having 3 levels. As you noticed, this is because the L2 cache on these architectures is too small to be of any use for us. So perhaps the API to set the cache sizes should reflect that aspect and only accept the information that is useful for us, which is:
 - the size of the cache to be used for inner blocking
 - the size of the cache to be used for medium blocking
 - optionally the size of the cache to be used for outer-most blocking (default to memory-size)

For instance, on CPUs having four levels of cache, shall we use the L2, the L3, or the L4 for our second level of blocking? The answer depends on the respective cache sizes and speeds...

Regarding the block size computation, it is probably better to make sure that the sizes of the blocks are aligned with the largest register/loop-peeling sizes. I see that you took care of that for mc and nc, but kc should be a multiple of 8 (or even 16 to be safe in the future):

SizeType kc = (CacheSizes::L1() / kdiv) & (-SizeType(16));
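The masking trick here can be checked in isolation: ANDing with the negation of a power of two clears the corresponding low bits, rounding down to a multiple of that power of two (note the mask must be -16, not -15, for multiples of 16). A standalone sketch, with SizeType stood in by std::ptrdiff_t:

```cpp
#include <cassert>
#include <cstddef>

typedef std::ptrdiff_t SizeType;

// Round x down to the nearest multiple of 16 by clearing the low 4 bits.
// -SizeType(16) in two's complement has all bits set except the low four,
// so the AND keeps exactly the multiples of 16. This trick only works for
// powers of two; a mask like -15 would not produce multiples of 16.
SizeType round_down_16(SizeType x) {
  return x & (-SizeType(16));
}
```

For example, a hypothetical L1-derived value of 1000 would be rounded down to 992 before being used as kc.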

Another important aspect is that on x86 the L3 is shared among multiple cores, so in practice we should divide the L3 cache size by the number of cores. The question is where to do this adjustment: when setting the second-level blocking cache size, or when computing the blocking sizes? This also probably explains why it is better to arbitrarily clamp the blocking sizes on Core i7.
Comment 6 Benoit Jacob 2015-01-28 18:20:34 UTC
Thanks for this explanation; for the record, we further talked about this today and the outcome is in bug 937 comment 8.

So let's only track 2 cache sizes, "L1" and "Last level, divided by number of cores".

The rationale for the latter is that in most CPUs, the last level of cache is shared by all cores, and even in single-threaded cases, there's not much point in trying to use more than lastlevel/numberofcores:
 - it's typically the case that each core has faster access to a specific portion of the last level cache;
 - it's typically the case (see bug 937 data) that not much extra performance is achieved by using very large blocks anyway, so lastlevel/numberofcores is mostly enough.
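The two tracked quantities settled on above can be sketched as a small helper. This is illustrative only; the function and struct names are made up and do not correspond to Eigen's actual code:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical representation of the two cache sizes this comment proposes
// to track: "L1" and "last level, divided by number of cores".
struct TrackedCacheSizes {
  std::size_t l1;
  std::size_t last_level_per_core;
};

TrackedCacheSizes tracked_cache_sizes(std::size_t l1,
                                      std::size_t last_level,
                                      int num_cores) {
  TrackedCacheSizes r;
  r.l1 = l1;
  // Even single-threaded, using more than last_level / num_cores gains
  // little (see bug 937 data), since each core typically has faster access
  // to its own slice of the shared last-level cache.
  r.last_level_per_core = last_level / static_cast<std::size_t>(num_cores);
  return r;
}
```

For example, a 4-core CPU with an 8 MB shared last-level cache would be tracked as 2 MB per core.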
Comment 7 Benoit Jacob 2015-01-28 18:21:30 UTC
Cancelled reviews; new patches coming later.
Comment 8 Gael Guennebaud 2015-02-26 16:38:36 UTC
Update of computeProductBlockingSizes in this commit:
Summary:     Implement a more generic blocking-size selection algorithm. See explanations inline.

It performs extremely well on Haswell.

The major remaining issues are:

- reliably and quickly find the actual cache size to be used for our 2nd level of blocking, that is: max(l2,l3/nb_core_sharing_l3).

- test on other platforms

- API to let the user specify the cache size
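The quantity named in the first remaining issue, max(l2, l3/nb_core_sharing_l3), is simple enough to sketch directly. The helper below is hypothetical (not Eigen's actual code), showing only the arithmetic:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Illustrative computation of the cache size to use for the 2nd level of
// blocking, as described above: max(l2, l3 / nb_core_sharing_l3).
std::size_t second_level_blocking_cache(std::size_t l2, std::size_t l3,
                                        int nb_core_sharing_l3) {
  return std::max(l2, l3 / static_cast<std::size_t>(nb_core_sharing_l3));
}
```

The hard part flagged in the comment is not this formula but reliably and quickly discovering l3 and nb_core_sharing_l3 at runtime.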
Comment 9 Nobody 2019-12-04 14:04:23 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to a GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new bug through this link to our GitLab instance: