This bugzilla service is closed. All entries have been migrated to https://gitlab.com/libeigen/eigen
Bug 721 - Reduce the pressure on the L1 cache to speedup matrix multiplication
Summary: Reduce the pressure on the L1 cache to speedup matrix multiplication
Status: RESOLVED FIXED
Alias: None
Product: Eigen
Classification: Unclassified
Component: Core - matrix products
Version: 3.2
Hardware: All
OS: All
Importance: Normal Unknown
Assignee: Nobody
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-01-03 01:22 UTC by Benoit Steiner
Modified: 2019-12-04 12:54 UTC
CC List: 2 users



Attachments
Results for the matrix_matrix benchmarks with the patch applied (38.75 KB, image/png)
2014-01-03 01:22 UTC, Benoit Steiner
Results for the matrix_matrix benchmarks before applying the patch (36.50 KB, image/png)
2014-01-03 01:24 UTC, Benoit Steiner
Code changes (32.97 KB, patch)
2014-01-03 01:24 UTC, Benoit Steiner
Small benchmark of micro kernel strategies (9.86 KB, application/octet-stream)
2014-03-21 15:20 UTC, Gael Guennebaud
Small benchmark of micro kernel strategies (v2) (14.82 KB, application/octet-stream)
2014-03-21 16:50 UTC, Gael Guennebaud
Small benchmark of micro kernel strategies (v3) (20.45 KB, application/octet-stream)
2014-03-26 10:26 UTC, Gael Guennebaud
benchmark on AVX+FMA (125.13 KB, image/png)
2014-03-27 23:47 UTC, Gael Guennebaud

Description Benoit Steiner 2014-01-03 01:22:55 UTC
Created attachment 410 [details]
Results for the matrix_matrix benchmarks with the patch applied

The attached patch speeds up the general block-panel multiplication code.
Comment 1 Benoit Steiner 2014-01-03 01:24:06 UTC
Created attachment 411 [details]
Results for the matrix_matrix benchmarks before applying the patch
Comment 2 Benoit Steiner 2014-01-03 01:24:40 UTC
Created attachment 412 [details]
Code changes
Comment 3 Benoit Steiner 2014-01-03 01:26:38 UTC
Improved the efficiency of the block-panel matrix multiplication code: the change reduces the pressure on the L1 cache by removing the calls to gebp_traits::unpackRhs(). Instead, the packetization of the rhs blocks is done on the fly in gebp_traits::loadRhs(). This adds numerous calls to pset1<ResPacket> (since we're packetizing on the fly in the inner loop), but this is more than compensated by the fact that we're decreasing the memory transfers by a factor of RhsPacketSize.

The benchmarks were run on a Xeon X5550 CPU running at 2.67GHz. The code was compiled with gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5).
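
As a minimal sketch of the two strategies (the function names below are illustrative only, not the actual gebp_traits interface):

  // Old scheme: unpackRhs() pre-expands every rhs coefficient into a full packet
  // stored in a temporary buffer, and the kernel simply loads packets from it:
  EIGEN_STRONG_INLINE void loadRhs_preexpanded(const float* b, Packet4f& dest)
  { dest = pload<Packet4f>(b); }   // b walks the expanded buffer: 4x more bytes streamed

  // New scheme: broadcast the scalar on the fly, no expanded buffer at all:
  EIGEN_STRONG_INLINE void loadRhs_on_the_fly(const float* b, Packet4f& dest)
  { dest = pset1<Packet4f>(*b); }  // extra pset1 call, but 4x fewer bytes streamed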
Comment 4 Gael Guennebaud 2014-03-06 21:27:21 UTC
The problem is that this change kills the performance on older architectures for which pset1 is very slow. I'm also unsure about ARM/NEON. So this triggers a more general question about the policy we should adopt regarding optimizations:

a) shall we limit the number of variants and favor modern architectures?

b) shall we keep multiple variants and detect the best one at compile time through the compiler's predefined macros or a user-defined preprocessor token?
Comment 5 Christoph Hertzberg 2014-03-07 11:12:19 UTC
What are 'older' architectures? And is the instruction slow, or does it just have high latency, which we might be able to compensate for by re-arranging instructions?

Generally, I think we should aim for (near) top-performance on modern architectures and 'not-too-bad' performance on older ones.
If we can achieve it with reasonable maintenance efforts, it would certainly be nice to have the option to optimize for concrete architectures (maybe even provide tools to switch at run-time, by reading out the CPU-id).

I'm afraid that to maintain this, we need to keep various hardware platforms available and set up performance-regression tests.

Maybe we better start a thread on the mailing list on this topic?
Comment 6 Gael Guennebaud 2014-03-07 19:47:55 UTC
Here "old" architectures are core/penryn for which it is fine to sacrifie performance. If someone want high-end performance on x86 platform, then he should also have performant CPUs.

Actually, the problem might be more for ARM/NEON. Sadly, I'm not able to perform such a benchmark.
Comment 7 Gael Guennebaud 2014-03-08 22:25:44 UTC
Getting back to the patch, on Haswell (i7-4960HQ) with gcc 4.7 and 4.8, this change yields a slowdown of about 1.5%. For instance, for a product with matrices of size 2000x2000, I get 19.97 GFLOPS with the current Eigen and 19.69 GFLOPS with this change. The theoretical peak is 20.8 GFLOPS.

Are you sure the patch is complete and that no other trick is needed to make it worth it?
Comment 8 Gael Guennebaud 2014-03-10 08:46:45 UTC
I also observe a significant slowdown on a Xeon X5570 (Nehalem): 19.6 versus 16.4 GFLOPS (the theoretical peak is 23 GFLOPS).
Comment 9 Benoit Steiner 2014-03-17 21:29:30 UTC
(In reply to comment #8)
I reran the benchmark on a vanilla Dell Precision T3500 workstation with a Xeon X5550 CPU. To limit the noise, I disabled frequency scaling in the BIOS. I followed the instructions at http://eigen.tuxfamily.org/index.php?title=How_to_run_the_benchmark_suite and launched the benchmarks with OMP_NUM_THREADS=1 BTL_CONFIG="-a matrix_matrix -t 10 --overwrite --nocheck" ctest -V. I started from the latest Eigen codebase to generate the reference results, and then applied the patch.

I then tried with 2 versions of gcc. Here are the results for a 3000x3000 matrix:

No patch, gcc 4.6.3:
3:  size = 3000   18171.3  18171.3  18171.3  18171.3  18171.3  18171.3  18171.3  18171.3  18171.3 18171.3        > 18168.8 MFlops    (1/100)

No Patch, gcc 4.8.2:
3:  size = 3000   17896.2  17896.2  17896.2  17896.2  17896.2  17896.2  17896.2  17896.2  17896.2 17896.2 MFlops    (1/100)

With Patch, gcc 4.6.3:
3:  size = 3000   18732.1  18771.5  18771.8  18771.8  18771.8  18771.8  18771.8  18771.8  18772 18772 MFlops    (1/100)

With Patch, gcc 4.8.2:
3:  size = 3000   18264.6  18264.6  18264.6  18264.6  18264.6  18264.6  18264.6  18264.6  18264.6 18264.6 MFlops    (1/100)

Other matrix sizes show the same type of results: gcc 4.8 generates code that's slower than gcc 4.6. However, in all cases the patch improved speed somewhat, regardless of the compiler.
Comment 10 Benoit Steiner 2014-03-17 22:24:03 UTC
Combing through the benchmark results generated with gcc 4.8.3, I found one case where the patch slowed down the binary:
3:  size = 1159   14816.8  17885.7  17886.4  17890.3  17890.3  17890.3  17890.3  17891.2  17891.2 17891.2 MFlops    (16/100)
versus:
3:  size = 1159   14780.7  17806.3  17806.3  17806.3  17806.3  17806.3  17806.3  17806.3  17806.3 17806.3 MFlops    (16/100)

The patch improved the speed for the other matrix sizes, although the improvements are often negligible with this version of the compiler. I couldn't find a slowdown as dramatic as the one Gael noticed, though.
Comment 11 Benoit Steiner 2014-03-17 22:51:55 UTC
I just managed to reproduce the slowdown that Gael noticed by using gcc 4.7.3. 

Without the patch:
3:  size = 3000   18080.8  18151.8  18156.9  18157.8  18159.8  18159.8  18159.8  18160  18160 18160 MFlops    (1/100)

With the patch:
3:  size = 3000   15376.4  15376.4  15403  15403  15403  15403  15403  15403  15403 15405.2 MFlops    (1/100)

I get pretty much the same result across all matrix sizes. So it looks like gcc 4.6 generates much faster code than 4.7, and 4.8 started to correct the regressions.
Comment 12 Gael Guennebaud 2014-03-18 11:19:15 UTC
OK, with gcc 4.6.3 on a Xeon X5570 (2.93GHz), I get similar behavior:

no patch:
size = 3000   19032.1  20198.5  20198.5  20198.5  20213  20213  20502.1  20502.1  20502.1 20502.1 MFlops

with the patch:
size = 3000   19734.3  19762.5  19762.5  19762.5  19762.5  19762.5  19762.7  19762.7  19762.7 19762.7 MFlops


However, even with gcc 4.6, I still have a slowdown of about 2% on Haswell (i7-4960HQ). I tried to adjust the blocking dimensions, but it made no difference. Perhaps it's just a matter of tuning pset1.
Comment 13 Gael Guennebaud 2014-03-18 11:23:41 UTC
On Haswell, with double, I get the same or better performance with the patch, which is counter-intuitive because this patch should be more effective with larger packet sizes. So this indeed suggests that the intrinsics need some fine tuning.
Comment 14 Benoit Steiner 2014-03-18 21:19:45 UTC
I have also noticed a slowdown on Haswell after applying the patch. I used a MacBook Air (i7-4650U) and compiled using clang 5.0.

I also applied the patch on top of my AVX branch (at http://code.bsteiner.info/eigen): this time I noticed a speedup. I am guessing that either Intel or the compilers didn't focus as much on the performance of the SSE instructions on AVX-capable platforms.

Without the patch:
3:  size = 3000   36844.7  36996.2  37071.7  37071.7  37071.7  37071.7  37071.7  37071.7  37071.7 37071.7 MFlops    (1/100)
3:  size = 2815   30627.7  35878.3  35878.3  35967.4  35967.4  36199.8  36199.8  36199.8  36199.8 36199.8 MFlops    (2/100)
3:  size = 2642   29927.4  36463.2  36463.2  36463.2  36463.2  36636.1  36636.1  36636.1  36636.1 36636.1 MFlops    (3/100)
3:  size = 2480   30301.7  37103  37103  37103.7  37103.7  37143  37143  37244.1  37244.1 37327.9 MFlops    (4/100)
3:  size = 2328   30876.5  36441  36932  37332  37332  37332  37332  37332  37332 37332 MFlops    (5/100)
3:  size = 2185   30866.4  36726.3  36726.3  36726.3  36726.3  36726.3  36726.3  36726.3  36726.3 36726.3 MFlops    (6/100)
3:  size = 2051   30375.3  34656.2  34772.2  35375.1  35375.1  35375.1  35375.1  35378.8  35386.9 35386.9 MFlops    (7/100)
3:  size = 1925   29257.6  35711.1  35711.1  35711.1  36328.7  36344.1  36344.1  36344.1  36344.1 36344.1 MFlops    (8/100)
3:  size = 1806   30012  35467.1  35467.1  35467.1  35641.1  35641.1  35641.1  35641.1  35641.1 35641.1 MFlops    (9/100)
3:  size = 1695   29465  29465  31189.5  31189.5  31590.1  31590.1  32857.9  32857.9  32857.9 32857.9 MFlops    (10/100)
3:  size = 1591   27173.2  30193.3  31062.8  31069.5  31373.6  33727.6  33727.6  33727.6  33727.6 33727.6 MFlops    (11/100)
3:  size = 1494   27927.1  33837.5  33837.5  33837.5  33837.5  33837.5  33837.5  33837.5  33837.5 34478.4 MFlops    (12/100)

With the patch:
3:  size = 3000   37433.8  38558.8  38641.1  38641.1  38641.1  38641.1  38641.1  38641.1  38641.1 38641.1 MFlops    (1/100)
3:  size = 2815   31924.3  37156.1  37156.1  37156.1  37156.1  37156.1  37156.1  37156.1  37156.1 37156.1 MFlops    (2/100)
3:  size = 2642   30718  31569.1  34301.5  34301.5  34301.5  34301.5  34301.5  35527.1  37127.8 37127.8 MFlops    (3/100)
3:  size = 2480   30708.3  37403.2  37403.2  37403.2  37403.2  37403.2  37403.2  37403.2  37403.2 37403.2 MFlops    (4/100)
3:  size = 2328   30938.8  36678.3  36680.9  36680.9  36680.9  37724.1  37724.1  37724.1  37724.1 37724.1 MFlops    (5/100)
3:  size = 2185   31190.6  36898.2  37009.9  37009.9  37009.9  37009.9  37009.9  37009.9  37009.9 37009.9 MFlops    (6/100)
3:  size = 2051   30609.8  37700.6  37700.6  37722.5  37722.5  37722.5  37722.5  37722.5  37722.5 37722.5 MFlops    (7/100)
3:  size = 1925   31188.6  36931.7  38114.6  38211.1  38211.1  38211.1  38211.1  38211.1  38211.1 38211.1 MFlops    (8/100)
3:  size = 1806   31553.8  35680.9  36092.1  36206.9  36206.9  36206.9  36776.5  37579.3  37579.3 37579.3 MFlops    (9/100)
3:  size = 1695   31067.4  32967.5  37262.4  37474.9  37474.9  37474.9  37474.9  37474.9  37474.9 37474.9 MFlops    (10/100)
3:  size = 1591   30991.4  34727.6  34727.6  36380.7  37403.2  37403.2  37403.2  37403.2  37600.5 37600.5 MFlops    (11/100)
3:  size = 1494   31134  37551.6  37551.6  37560.7  37560.7  37560.7  37560.7  37560.7  37560.7
Comment 15 Gael Guennebaud 2014-03-19 11:26:42 UTC
Actually, compiling with -mavx seems to do the trick even if we are still using 128-bit packets. For instance, this makes the compiler generate:

"vbroadcastss (%rax), %xmm4"

instead of :

movss	(%rax), %xmm13
shufps	$0, %xmm13, %xmm13
Comment 16 Benoit Steiner 2014-03-19 18:34:48 UTC
Indeed. Stack Overflow has a good post on that very topic at http://stackoverflow.com/questions/13218391/is-mm-broadcast-ss-faster-than-mm-set1-ps.

To avoid this kind of problem, maybe we should update the Eigen CMake files and add -mtune=native or -march=native, as Christoph suggested recently in http://listengine.tuxfamily.org/lists.tuxfamily.org/eigen/2014/03/msg00014.html?
Comment 17 Gael Guennebaud 2014-03-19 18:44:51 UTC
Sadly, -march=native fails on OSX with gcc because the assembler refuses to assemble the AVX instructions generated by gcc.

Anyway, when AVX is not enabled, implementing pset1 as follows is more efficient on Haswell, and the speedup should be even greater on earlier architectures:

template<> EIGEN_STRONG_INLINE Packet4f pset1<Packet4f>(const float&  from) { return _mm_castsi128_ps( _mm_shuffle_epi32(_mm_castps_si128( _mm_load_ss(&from)), 0)); }

Actually, I was pretty sure that pset1 was already implemented using pshufd instead of shufps, but that's not the case!
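
For reference, the generic-intrinsics version in place at the time was essentially the following (a paraphrase, not the exact source), which gcc lowers to the movss + shufps pair shown in comment 15:

  // Roughly the then-current SSE implementation; gcc emits movss + shufps for it:
  template<> EIGEN_STRONG_INLINE Packet4f pset1<Packet4f>(const float& from) { return _mm_set1_ps(from); }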
Comment 18 Gael Guennebaud 2014-03-20 10:19:33 UTC
This was indeed the case a very long time ago, but it seems that the initial code with intrinsics was not always very good (see bug 203). Here is a proper fix:

https://bitbucket.org/eigen/eigen/commits/60ca549abed6/
Changeset:   60ca549abed6
User:        ggael
Date:        2014-03-20 10:14:26
Summary:     Makes gcc to generate a pshufd instruction for pset1

After that fix, the proposed patch does not introduce any regression on Haswell, even without enabling AVX, and even with gcc 4.7. I'll test the Xeon X5570 again.
Comment 19 Gael Guennebaud 2014-03-20 11:45:38 UTC
Damn, on Nehalem the previous fix introduces a regression with respect to this patch. It seems that when using asm(), gcc does a bad job at instruction reordering. For Haswell this is not an issue because the CPU seems to be smart enough to do the reordering itself. Implementing pset1 using intrinsics fixes this issue but introduces the regression mentioned in bug 203.

Why doesn't gcc generate a pshufd by itself, just like ICC and clang do?!

An easy workaround would be to introduce a pload1(const Scalar* ptr) which could be implemented by default as pset1(*ptr) and specialized for GCC/SSE using the pshufd intrinsic.

However, I guess that regarding this patch, the best would probably be to bypass pset1 (or pload1) by introducing new functions broadcasting multiple scalars at once. Indeed, in our case it would be better to perform one movaps followed by 4 pshufds, instead of 4 individual pset1 calls. We could have something like:

 pbroadcast2(const Scalar*, Packet& p0, Packet& p1);
 pbroadcast4(const Scalar*, Packet& p0, Packet& p1, Packet& p2, Packet& p3);

Again, we can easily provide default implementations based on pset1 (or pload1) for them.
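
A minimal sketch of what these helpers could look like (default versions falling back to pset1, plus a GCC/SSE specialization forcing the pshufd form; the signatures are only illustrative, not a final API):

  template<typename Packet>
  EIGEN_STRONG_INLINE Packet pload1(const typename unpacket_traits<Packet>::type* from)
  { return pset1<Packet>(*from); }

  // GCC/SSE specialization: force a pshufd-based broadcast instead of movss + shufps.
  template<> EIGEN_STRONG_INLINE Packet4f pload1<Packet4f>(const float* from)
  { return _mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(_mm_load_ss(from)), 0)); }

  // Default pbroadcast4 built on top of pload1; an SSE specialization could instead
  // do a single movaps followed by four pshufds.
  template<typename Packet>
  EIGEN_STRONG_INLINE void pbroadcast4(const typename unpacket_traits<Packet>::type* a,
                                       Packet& p0, Packet& p1, Packet& p2, Packet& p3)
  {
    p0 = pload1<Packet>(a+0); p1 = pload1<Packet>(a+1);
    p2 = pload1<Packet>(a+2); p3 = pload1<Packet>(a+3);
  }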
Comment 20 Christoph Hertzberg 2014-03-20 12:44:32 UTC
(In reply to comment #19)
>  pbroadcast2(const Scalar*, Packet& p0, Packet& p1);
>  pbroadcast4(const Scalar*, Packet& p0, Packet& p1, Packet& p2, Packet& p3);

For this I would suggest following the meta packet approach (Bug 692).
Then we only need a single generic function:
  template<class Scalar, int N>
  void pbroadcast(const Scalar*, MetaPacket<Scalar, N>&);

We could also provide some kind of Packet::MetaPacketSquared typedef which defaults to having the same number of packets as there are scalars per packet (i.e., 2 packets for SSE double and 8 packets for AVX float, etc.).
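
A rough sketch of that generic form, assuming a minimal MetaPacket as proposed in bug 692 (none of these names exist in Eigen yet; they only illustrate the idea):

  template<typename Scalar, int N>
  struct MetaPacket
  {
    typedef typename packet_traits<Scalar>::type Packet;
    Packet p[N];
  };

  // Generic fallback: one pset1 per scalar; arch-specific specializations could
  // instead use one wide load plus in-register shuffles.
  template<typename Scalar, int N>
  EIGEN_STRONG_INLINE void pbroadcast(const Scalar* a, MetaPacket<Scalar,N>& dst)
  {
    for (int i = 0; i < N; ++i)
      dst.p[i] = pset1<typename MetaPacket<Scalar,N>::Packet>(a[i]);
  }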
Comment 21 Benoit Steiner 2014-03-20 22:24:05 UTC
I very much second the idea of creating a SquaredMetaPacket typedef. We could then create a ptranspose intrinsic that interprets this meta packet as a square matrix and transposes it. This could be used to complete the vectorization of gemm_pack_lhs and gemm_pack_rhs for SSE. It would also work for AVX provided that the gebp_kernel is made square (in http://code.bsteiner.info/eigen the kernel is 16x4 for floats).

This has little impact with SSE, but with FMA enabled, gemm_pack_lhs and gemm_pack_rhs become a bottleneck.
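
As an illustration of such a ptranspose primitive, an SSE/float specialization could be as simple as the following (reusing the hypothetical MetaPacket sketch above; _MM_TRANSPOSE4_PS transposes four __m128 registers in place):

  EIGEN_STRONG_INLINE void ptranspose(MetaPacket<float,4>& block)
  {
    _MM_TRANSPOSE4_PS(block.p[0], block.p[1], block.p[2], block.p[3]);
  }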
Comment 22 Gael Guennebaud 2014-03-21 15:10:06 UTC
Alright, so after this new fix:

https://bitbucket.org/eigen/eigen/commits/4e97a8ca0ea4/
Changeset:   4e97a8ca0ea4
User:        ggael
Date:        2014-03-20 16:03:46
Summary:     Revert previous change and introduce a new workaround regarding gcc generating a shufps instruction instead of the more efficient pshufd instruction.
The trick consists in introducing a new pload1 function to be used in low level product kernels for which bug 203 does not apply.
Indeed, it turned out that using inline assembly prevents gcc of doing a good job at instruction reordering.


There is no regression anymore for float and arch>=nehalem, and gcc 4.7 generates even better code than gcc 4.6;) 

Unfortunately, I observe a strong regression for the other scalar types (double and complex<*>) as well as on Sandybridge.
Comment 23 Gael Guennebaud 2014-03-21 15:20:39 UTC
Created attachment 436 [details]
Small benchmark of micro kernel strategies

I attached a small benchmark comparing 4 different strategies to implement the core of the gebp kernel:

1 - one based on individual pload1 for the rhs (typically what the patch does)
2 - one performing a single pload plus some shuffling to mimic multiple pload1
3 - one based on a single pload plus some cyclic permutations
4 - one based on the current kernel implemented in Eigen (with unpacking of the rhs in a temporary buffer)

I tested them with various compilers on Nehalem, Sandybridge, and Haswell. The second version seems to be the best candidate, especially on Nehalem.

I still have to check with 256-bit AVX and FMA before doing the move.
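
To make strategies 1 and 2 concrete, here is roughly what one inner iteration looks like for SSE float with one lhs packet and four rhs coefficients (a sketch only, not the benchmark's actual code: pload1 is the helper proposed in comment 19, vec4f_swizzle1 is Eigen's internal SSE swizzle helper, and A0/C/blB are made-up names):

  // Strategy 1: one broadcast load per rhs coefficient.
  EIGEN_STRONG_INLINE void step_load1(const float* blB, const Packet4f& A0, Packet4f C[4])
  {
    for (int j = 0; j < 4; ++j)
      C[j] = pmadd(A0, pload1<Packet4f>(blB + j), C[j]);
  }

  // Strategy 2: a single aligned load of four coefficients, then in-register shuffles.
  EIGEN_STRONG_INLINE void step_load_shuffle(const float* blB, const Packet4f& A0, Packet4f C[4])
  {
    Packet4f B = pload<Packet4f>(blB);
    C[0] = pmadd(A0, vec4f_swizzle1(B,0,0,0,0), C[0]);
    C[1] = pmadd(A0, vec4f_swizzle1(B,1,1,1,1), C[1]);
    C[2] = pmadd(A0, vec4f_swizzle1(B,2,2,2,2), C[2]);
    C[3] = pmadd(A0, vec4f_swizzle1(B,3,3,3,3), C[3]);
  }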
Comment 24 Gael Guennebaud 2014-03-21 16:50:18 UTC
Created attachment 437 [details]
Small benchmark of micro kernel strategies (v2)

I extended the micro benchmark with AVX and FMA (you need Benoit's AVX branch to test AVX), as well as with a "1 packet size" times 8 register blocking strategy. This strategy would permit 8x8 register blocking with AVX/float.

Surprisingly, this new strategy seems to perform well on Nehalem, Sandybridge and Haswell, with and without AVX. I'm surprised because with this strategy an "expanded" coefficient of the rhs is used only once, instead of twice with the previous "2 packet size" times 4 register blocking. Yes, this means that despite the doubled number of pload1 calls (or pset1 or whatever equivalent), there is no performance drop, and there is a potential performance improvement with AVX...
Comment 25 Benoit Steiner 2014-03-21 18:30:24 UTC
I have seen the same results. Although switching the kernel size from 8x4 to 4x8 for SSE floating-point matrix multiplication increases the number of loads significantly, it results in an overall speedup since the number of L1 cache misses is reduced by about 40%.
I don't have a good explanation for this surprising result. It may be due to the fact that the LHS is sized to be stored in the L2 cache while the RHS should reside in the L1 cache. A 4x8 kernel loads half as much data from the LHS, which may have to be fetched from the L2 cache, and twice as much from the RHS, which should already be in the L1 cache.
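
A rough load count per inner iteration (SSE float, counting only register-block loads and ignoring stores and address arithmetic) makes the trade-off concrete:

  8x4 kernel: 2 lhs packet loads + 4 rhs broadcasts = 6 loads for 8 madds
  4x8 kernel: 1 lhs packet load  + 8 rhs broadcasts = 9 loads for 8 madds

So the 4x8 shape issues about 50% more loads, but it halves the traffic coming from the L2-resident lhs panel, which is consistent with the ~40% reduction in L1 misses reported above.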
Comment 26 Gael Guennebaud 2014-03-26 10:26:33 UTC
Created attachment 442 [details]
Small benchmark of micro kernel strategies (v3)

Updated version of the benchmark with more variants and loops over the rhs columns to better reflect actual matrix products.

Results with AVX+FMA (Haswell):

load1:
 float  0.00486323  55.197 GFLOPS
 double 0.00964552  27.8301 GFLOPS

load1 (1pX8):
 float  0.00419106  64.0495 GFLOPS
 double 0.0100661  26.6673 GFLOPS

broadcast4:
 float  0.00486018  55.2316 GFLOPS
 double 0.00982608  27.3187 GFLOPS

broadcast4 (1pX8):
 float  0.00419161  64.0411 GFLOPS
 double 0.0100683  26.6615 GFLOPS

broadcast4 with register prefetching (1pX8):
 float  0.00423694  63.3559 GFLOPS
 double 0.00854273  31.4227 GFLOPS

pre expanded:
 float  0.00680854  39.4263 GFLOPS
 double 0.0136826  19.6188 GFLOPS




For the record, OpenBLAS, which is supposed to be slightly faster than MKL with AVX/FMA, achieves a peak of 55 GFLOPS on the same computer. So, even though the above numbers do not take into account the cost of packing, it seems we have a comfortable margin to become a good challenger on high-end CPUs.
Comment 27 Gael Guennebaud 2014-03-26 22:04:56 UTC
Finally, I've applied the patch and updated the gebp kernel using the "1 packet size" times 8 strategy:

https://bitbucket.org/eigen/eigen/commits/838458776308/
https://bitbucket.org/eigen/eigen/commits/7e7f1458e95c/

However, I hit an issue with the variant which loads one lhs packet in advance (see commented macro): in some rare cases loading outside the bounds led to a segfault.

It also remains to optimize the last rows/columns, which cannot be handled by the main peeled loop, and of course to merge the AVX branch to fully exploit this new kernel ;)
Comment 28 Benoit Steiner 2014-03-26 22:22:31 UTC
Excellent. I am trying to complete the vectorization of the complex packet primitives in http://code.bsteiner.info/eigen. This will of course speed things up for complex numbers, but it will also make it possible to compile the code with the 4.6 and 4.7 versions of gcc (which don't support accessing the individual words of a __m256 variable). I should have both pmul and pdiv done by the end of the day. Volunteers are welcome for the remaining three (predux, pcplxflip, and palign_impl).

With Gael's kernel change (https://bitbucket.org/eigen/eigen/commits/7e7f1458e95c/), nr is now equal to 8 when compiling for 64 bit, which will make loop peeling a lot easier. It will also make it a lot simpler to complete the vectorization of the gemm_pack_lhs and gemm_pack_rhs code, which become serious bottlenecks when AVX and FMA are used (in some benchmarks these two functions represent about 40% of the total CPU time on Haswell).
Comment 29 Gael Guennebaud 2014-03-27 09:17:37 UTC
Looking forward to merging the AVX changes.

In the meantime, minor moves of blocks of code allowed me to handle the last columns through a panel of 4 columns:
https://bitbucket.org/eigen/eigen/commits/116265118ede/
Comment 30 Gael Guennebaud 2014-03-27 17:03:49 UTC
Alright, using the avx/fma branch, here is what I get on Haswell (-mavx -mfma):

L1 cache size     = 32 KB
L2/L3 cache size  = 6144 KB
Register blocking = 8 x 8
Matrix sizes = 1000x1000 * 1000x1000
blocking size (mc x kc) = 768 x 512
blas  cpu         0.0357569s  	55.9332 GFLOPS 	(0.361859s)
eigen cpu         0.0348081s  	57.4578 GFLOPS 	(0.353129s)

where "blas" stands for the git version of OpenBLAS optimized for Haswell. And we still have room for improvements!
Comment 31 Gael Guennebaud 2014-03-27 23:47:23 UTC
Created attachment 446 [details]
benchmark on AVX+FMA

made using this version: https://bitbucket.org/benoitsteiner/eigen/commits/bb6b5be853794ff7f8e5b799deb95b4ebfaa4f56 and OSX's default clang.
Comment 32 Benoit Steiner 2014-04-03 22:35:25 UTC
Closing this bug since the patch was committed to Eigen in https://bitbucket.org/eigen/eigen/commits/838458776308717af536f368b257d8b745461a73
Comment 33 Nobody 2019-12-04 12:54:54 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/721.
