Difference between revisions of "Eigen2 benchmark Intel"
From Eigen
[[Image:Axpy_compare_intel.png]]
[[Image:Axpby_compare_intel.png]]
[[Image:Atv_compare_intel.png]]
[[Image:Matrix_vector_compare_intel.png]]
[[Image:Matrix_matrix_compare_intel.png]]
[[Image:Symv_compare_intel.png]]
[[Image:Syr2_compare_intel.png]]
[[Image:Aat_compare_intel.png]]
[[Image:Ata_compare_intel.png]]
[[Image:Trisolve_compare_intel.png]]
[[Image:Cholesky_compare_intel.png]]
[[Image:Hessenberg_compare_intel.png]]
[[Image:Tridiagonalization_compare_intel.png]]
[[Image:Lu_decomp_compare_intel.png]]
Latest revision as of 16:07, 14 December 2009
Out of curiosity, I ran the BTL tests with Eigen2 built by 4 different compilers on an Intel Pentium D CPU:
- GCC 4.3.3: -O3 -march=native -DNDEBUG
- GCC 4.1.3: -O3 -march=nocona -msse2 -msse3 -DNDEBUG
- GCC 4.4.0: -O3 -march=native -DNDEBUG
- Intel(R) C++ 11.0: -O3 -DNDEBUG -no-ipo -xHOST -ip -static -no-prec-div
Although in my experience the -ipo option (interprocedural optimization) provides good performance benefits, it was explicitly disabled for the Intel compiler, because with it enabled the tests failed to work (numerically).
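For readers unfamiliar with what such a benchmark measures, here is a minimal sketch of timing the axpy kernel (y += a*x) and converting the result to MFLOPS. It is not BTL's actual code: the helper name `axpy_mflops`, the vector length, and the repetition count are all assumptions made for illustration.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// The "axpy" kernel: y += a * x.
// It costs 2 floating-point operations per element (one multiply, one add).
void axpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] += a * x[i];
    }
}

// Hypothetical helper (not part of BTL): run axpy `reps` times on vectors
// of length n and return the measured rate in MFLOPS.
double axpy_mflops(std::size_t n, int reps) {
    std::vector<float> x(n, 1.0f), y(n, 0.0f);

    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) {
        axpy(0.5f, x, y);
    }
    auto t1 = std::chrono::steady_clock::now();

    // Read the result through a volatile so the compiler cannot discard the work.
    volatile float sink = y[0];
    (void)sink;

    double seconds = std::chrono::duration<double>(t1 - t0).count();
    double flops = 2.0 * static_cast<double>(n) * static_cast<double>(reps);
    return seconds > 0.0 ? flops / seconds / 1e6 : 0.0;
}
```

Building this file once per compiler with the flag sets listed above and calling, say, `axpy_mflops(1000, 100000)` would give a rough comparison in the same spirit, although a real harness such as BTL sweeps over many sizes and repeats measurements.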
Rookie conclusions:
- The benefit of using newer GCC versions is pretty clear.
- In most cases gcc 4.4 is comparable with gcc 4.3, but in some it is almost 2 times faster. (In my experience gcc 4.2 performs as well as 4.4, and gcc 4.3 is known to miss an optimization in some matrix-scalar products: the copy of the scalar into a four-scalar register is not hoisted out of the inner loop.)
- Except for the (anomalous) LU decomposition, gcc 4.1 is nowhere near the newer versions of gcc. This is in part because Eigen automatically disables vectorization for gcc < 4.2, but even without that the difference is still huge as soon as complex expressions are involved.
- Intel C++ does not provide any performance benefit here. This is somewhat surprising, as I was expecting at least some advantage on this CPU. It could be due to the disabled IPO, though: in my experience with Intel Fortran, -ipo gave about a 10-15% speedup. But that may be totally unrelated to C++.
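The missed optimization mentioned for gcc 4.3 can be illustrated with a small sketch. This is not Eigen's code: `Packet4f`, `pset1`, and `pmul` are hypothetical stand-ins for a 4-float SIMD register, a broadcast (what the SSE intrinsic `_mm_set1_ps` does), and a lane-wise multiply. The two functions compute the same result; the second is the form an optimizer should ideally produce, with the broadcast done once before the loop.

```cpp
#include <cstddef>

// Hypothetical stand-in for a 4-float SIMD register (e.g. an SSE __m128).
struct Packet4f { float v[4]; };

// Broadcast one scalar into all four lanes.
inline Packet4f pset1(float a) { return Packet4f{{a, a, a, a}}; }

// Lane-wise multiply of two packets.
inline Packet4f pmul(const Packet4f& a, const Packet4f& b) {
    Packet4f r;
    for (int k = 0; k < 4; ++k) r.v[k] = a.v[k] * b.v[k];
    return r;
}

// Broadcast left inside the loop: the scalar is copied into a packet
// on every iteration. n is assumed to be a multiple of 4.
void scale_unhoisted(float a, const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 4) {
        Packet4f pa = pset1(a);  // redundant work repeated each iteration
        Packet4f px = {{x[i], x[i + 1], x[i + 2], x[i + 3]}};
        Packet4f r = pmul(pa, px);
        for (int k = 0; k < 4; ++k) y[i + k] = r.v[k];
    }
}

// Broadcast hoisted out of the loop: the packet is built once and reused.
// n is assumed to be a multiple of 4.
void scale_hoisted(float a, const float* x, float* y, std::size_t n) {
    const Packet4f pa = pset1(a);  // loop-invariant, computed once
    for (std::size_t i = 0; i < n; i += 4) {
        Packet4f px = {{x[i], x[i + 1], x[i + 2], x[i + 3]}};
        Packet4f r = pmul(pa, px);
        for (int k = 0; k < 4; ++k) y[i + k] = r.v[k];
    }
}
```

In this portable emulation a modern compiler will most likely perform the hoisting itself, so the sketch only shows the shape of the transformation, not a measurable slowdown.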