
Changes to supported modules

Changes that might break existing code

  • Using float or double for indexing matrices, vectors, and arrays will now fail to compile, e.g.:
MatrixXd A(10,10);
float one = 1;
double a11 = A(one,1.); // compilation error here

New Major Features in Core

  • Add c++11 initializer_list constructors to Matrix and Array [doc]:
MatrixXi a {      // construct a 2x3 matrix
      {1,2,3},    // first row
      {4,5,6}     // second row
};
VectorXd v{{1, 2, 3, 4, 5}};        // construct a dynamic-size vector with 5 elements
Array<int, 1, 5> arr{1, 2, 3, 4, 5}; // initialize a fixed-size 1D array of size 5
  • Add STL-compatible iterators for dense expressions [doc]. Some examples:
VectorXd v = ...;
MatrixXd A = ...;
// range for loop over all entries of v then A
for(auto x : v) { cout << x << " "; }
for(auto x : A.reshaped()) { cout << x << " "; }
// sort v then each column of A
std::sort(v.begin(), v.end());
for(auto c : A.colwise())
    std::sort(c.begin(), c.end());
  • Add C++11 template aliases for Matrix, Vector, and Array of common sizes, including generic Vector<Type,Size> and RowVector<Type,Size> aliases [doc].
MatrixX<double> M;  // Instead of MatrixXd or Matrix<double, Dynamic, Dynamic>
Vector4<MyType> V;  // Instead of Matrix<MyType, 4, 1>
  • New support for bfloat16. The 16-bit Brain floating point format is now available as Eigen::bfloat16. The constructor must be called explicitly, but it can otherwise be used like any other scalar type. To convert to and from uint16_t (e.g., to extract the bit representation), use Eigen::numext::bit_cast.
  bfloat16 s(0.25);                                 // explicit construction
  uint16_t s_bits = numext::bit_cast<uint16_t>(s);  // bit representation
 
  using MatrixBf16 = Matrix<bfloat16, Dynamic, Dynamic>;
  MatrixBf16 X = s * MatrixBf16::Random(3, 3);

New backends

Improvements to Eigen Core

  • Eigen now uses the c++11 alignas keyword for static alignment. Users targeting C++17 only with recent compilers (e.g., GCC>=7, clang>=5, MSVC>=19.12) can thus completely forget about all issues related to static alignment, including EIGEN_MAKE_ALIGNED_OPERATOR_NEW.
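  The practical upshot, as a sketch (MyState is a hypothetical user type; assumes C++17 and a recent compiler):
  struct MyState {
    Eigen::Vector4d v;  // fixed-size vectorizable member (32-byte aligned with AVX)
  };
  // Pre-C++17 this type required EIGEN_MAKE_ALIGNED_OPERATOR_NEW to be safely
  // heap-allocated; with C++17, aligned operator new makes this work as-is.
  MyState* s = new MyState();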
  • Various performance improvements for products and Eigen's GEBP and GEMV kernels have been implemented:
    • By using half- and quarter-packets, the performance of matrix multiplications of small to medium-sized matrices has been improved
    • Eigen's GEMM now falls back to GEMV if it detects that a matrix is a run-time vector
    • The performance of matrix products using Arm Neon has been significantly improved (by up to 20%)
    • Performance of many special cases of matrix products has been improved
  • Large speed-up from a blocked algorithm for transposeInPlace.
  • Speed up misc. operations by propagating compile-time sizes (col/row-wise reverse, PartialPivLU, and others)
  • Faster specialized SIMD kernels for small fixed-size inverse, LU decomposition, and determinant.
  • Improved or added vectorization of partial or slice reductions along the outer-dimension, for instance: colmajor_mat.rowwise().mean()
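  For reference, the entry points touched by the last two items (shapes are arbitrary):
  MatrixXf M = MatrixXf::Random(256, 256);
  M.transposeInPlace();                 // now uses a blocked algorithm
  VectorXf means = M.rowwise().mean();  // vectorized partial reduction along the outer dimension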

Elementwise math functions

  • Many functions are now implemented and vectorized in generic (backend-agnostic) form.
  • Many improvements to correctness, accuracy, and compatibility with c++ standard library.
    • Much improved implementation of ldexp.
    • Misc. fixes for corner cases, NaN/Inf inputs and singular points of many functions.
    • New Payne-Hanek argument reduction algorithm for sin and cos with huge arguments.
    • New vectorized faithfully rounded algorithm for pow(x,y).
  • Speedups from new or improved vectorized versions of pow, log, log2, sin, cos, arg, complex sqrt, erf, expm1, log1p, logistic, rint, and the gamma and Bessel functions, among others.
  • Improved special function support (Bessel and gamma functions, ndtri, erfc, inverse hyperbolic functions, and more)
  • New elementwise functions for absolute_difference, rint.
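  These remain ordinary coefficient-wise Array calls; a small sketch (values are arbitrary):
  ArrayXd x = ArrayXd::LinSpaced(5, 1.0, 1e12);
  ArrayXd s = x.sin();     // stays accurate for huge arguments (Payne-Hanek reduction)
  ArrayXd p = x.pow(0.5);  // vectorized, faithfully rounded pow(x,y)
  ArrayXd r = x.rint();    // new elementwise rint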

Dense matrix decompositions and solvers

  • All dense linear solvers (i.e., Cholesky, *LU, *QR, CompleteOrthogonalDecomposition, *SVD) now inherit SolverBase and thus support .transpose(), .adjoint() and .solve() APIs.
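  For example, a single decomposition can now solve with the matrix, its transpose, or its adjoint:
  MatrixXd A = MatrixXd::Random(4, 4);
  VectorXd b = VectorXd::Random(4);
  PartialPivLU<MatrixXd> lu(A);
  VectorXd x1 = lu.solve(b);              // solves A x = b
  VectorXd x2 = lu.transpose().solve(b);  // solves A^T x = b
  VectorXd x3 = lu.adjoint().solve(b);    // solves A^* x = b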
  • SVD implementations now have an info() method for checking convergence.
  #include <Eigen/SVD>
  MatrixXf m = MatrixXf::Random(3, 2);
  VectorXf b = VectorXf::Random(3);
  JacobiSVD<MatrixXf> svd(m, ComputeThinU | ComputeThinV);
  if (svd.info() == ComputationInfo::Success) {
    // SVD computation was successful.
    VectorXf x = svd.solve(b);  // least-squares solution of m x = b
  }
  • Most decompositions now fail quickly when invalid inputs are detected.
  • Optimized the product of a HouseholderSequence with the identity, as well as the evaluation of a HouseholderSequence to a dense matrix, using a faster blocked product.
  • Fixed aliasing issues with in-place small matrix inversions.
  • Fixed several edge-cases with empty or zero inputs.

Sparse matrix support, decompositions and solvers

  • Enabled assignment and addition with diagonal matrix expressions.
  SparseMatrix<float> A(10, 10);
  VectorXf x = VectorXf::Random(10);
  A = x.asDiagonal();
  A += x.asDiagonal();
  • Support added for SuiteSparse KLU routines via the KLUSupport module. SuiteSparse must be installed to use this module.
  #include <Eigen/KLUSupport>
  SparseMatrix<float> A = ...;  // square sparse system matrix
  VectorXf b = ...;             // right-hand side
  A.makeCompressed();           // compressing the input before calling sparse solvers is recommended
  KLU<SparseMatrix<float> > klu(A);
  if (klu.info() == ComputationInfo::Success) {
    VectorXf x = klu.solve(b);
  }
  • SparseCholesky now works with row-major matrices.
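  A minimal sketch with SimplicialLDLT from the SparseCholesky module (A is assumed symmetric positive definite):
  typedef SparseMatrix<double, RowMajor> SpMatR;
  SpMatR A(3, 3);
  A.insert(0, 0) = 4.0; A.insert(1, 1) = 5.0; A.insert(2, 2) = 6.0;
  A.makeCompressed();
  SimplicialLDLT<SpMatR> ldlt(A);
  VectorXd x = ldlt.solve(VectorXd::Ones(3));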
  • Various bug fixes and performance improvements.

Type support

  • Improved support for half
    • Added native support for ARM __fp16, CUDA/HIP __half, and F16C.
    • Better vectorization support added across all backends.
  • Improved bool support
    • Partial vectorization support added for boolean operations.
    • Significantly improved performance (up to 25x) for logical operations with Matrix or Tensor of bool.
  • Improved support for custom types
    • More custom types work out-of-the-box (see #2201).

Improved Geometry Module

  • Behavioral change: Transform::computeRotationScaling() and Transform::computeScalingRotation() are now more continuous across degeneracies (see !349).
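  The affected API, for reference (decomposes the transform's linear part into rotation * scaling):
  Affine3d t = Affine3d::Identity();
  t.linear() = Matrix3d::Random();
  Matrix3d rotation, scaling;
  t.computeRotationScaling(&rotation, &scaling);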
  • New partial vectorization support added for Quaternion.
  • Generic vectorized 4x4 matrix inversion.

Backend-specific improvements

  • Arm NEON
    • Now provides vectorization for uint64_t, int64_t, uint32_t, int32_t, uint16_t, int16_t, int8_t, and uint8_t
    • Emulates bfloat16 support when using Eigen::bfloat16
    • Supports emulated and native float16 when using Eigen::half
  • SSE/AVX/AVX512
    • General performance improvements and bugfixes.
    • Enabled AVX512 instructions by default if available.
    • New std::complex, half, and bfloat16 vectorization support added.
    • Many missing packet functions added.
  • Altivec/Power
    • General performance improvement and bugfixes.
    • Enhanced vectorization of real and complex scalars.
    • Changes to the Altivec-specific gebp_kernel, using the VSX implementation of the MMA instructions, yield speed improvements of up to 4x for matrix-matrix products.
    • Dynamic dispatch for GCC 10 and newer, selecting MMA or VSX instructions at run time based on __builtin_cpu_supports.
  • GPU (CUDA and HIP)
    • Several optimized math functions added, better support for std::complex.
    • Added option to disable CUDA entirely by defining EIGEN_NO_CUDA.
    • Many more functions can now be used in device code (e.g. comparisons, small matrix inversion).
  • ZVector
    • Vectorized float and std::complex<float> support added.
    • Added z14 support.
  • SYCL
    • Redesigned SYCL implementation for use with the Tensor module, which can be enabled by defining EIGEN_USE_SYCL.
    • Introduced a new generic memory model, used by TensorDeviceSycl.
    • Better integration with OpenCL devices.
    • Added many math function specializations.

Miscellaneous API Changes

  • New setConstant(...) methods for preserving one dimension of a matrix by passing in NoChange.
  MatrixXf A(10, 5);               // 10x5  matrix.
  A.setConstant(NoChange, 10, 2);  // 10x10 matrix of 2s.
  A.setConstant(5, NoChange, 3);   //  5x10 matrix of 3s.
  A.setZero(NoChange, 20);         //  5x20 matrix of 0s.
  A.setZero(20, NoChange);         // 20x20 matrix of 0s.
  A.setOnes(NoChange, 5);          // 20x5  matrix of 1s.
  A.setOnes(5, NoChange);          //  5x5  matrix of 1s.
  A.setRandom(NoChange, 10);       //  5x10 random matrix.
  A.setRandom(10, NoChange);       // 10x10 random matrix.
  • Added setUnit(Index i) for vectors, which sets the i-th coefficient to one and all others to zero.
  VectorXf v(5);
  v.setUnit(3);   // { 0, 0, 0, 1, 0}
  • Added transpose(), adjoint(), conjugate() methods to SelfAdjointView.
  • Added shiftLeft<N>() and shiftRight<N>() coefficient-wise arithmetic shift functions to Arrays.
  ArrayXXi A = ArrayXXi::Random(2, 3);
  ArrayXXi B = A.shiftRight<2>();  // arithmetic right shift of each coefficient by 2 bits
  ArrayXXi C = A.shiftLeft<6>();   // left shift of each coefficient by 6 bits
  • Enabled adding and subtracting of diagonal expressions.
  VectorXf x = VectorXf::Random(5);
  VectorXf y = VectorXf::Random(5);
  MatrixXf A = MatrixXf::Identity(5, 5);
  A += x.asDiagonal() - y.asDiagonal();
  • Allow user-defined default cache sizes by defining EIGEN_DEFAULT_L1_CACHE_SIZE, ..., EIGEN_DEFAULT_L3_CACHE_SIZE.
  • Added EIGEN_ALIGNOF(X) macro for determining alignment of a provided variable.
  • Allow plugins for VectorwiseOp by defining a file EIGEN_VECTORWISEOP_PLUGIN (e.g. -DEIGEN_VECTORWISEOP_PLUGIN=my_vectorwise_op_plugins.h).
  • Allow disabling of IO operations by defining EIGEN_NO_IO.

Improvement to NaN propagation

  • Improvements to NaN correctness for elementwise functions.
  • New NaNPropagation template argument to control whether NaNs are propagated or suppressed in elementwise min/max and corresponding reductions on Array, Matrix, and Tensor. Example for max:
// Elementwise maximum
Eigen::MatrixXf left, right, r0, r1, r2;
r0 = left.cwiseMax(right); // Implementation defined behavior.
// Propagate NaN if either argument is NaN.
r1 = left.template cwiseMax<PropagateNaN>(right);
// Suppress NaN if at least one argument is not a NaN.
r2 = left.template cwiseMax<PropagateNumbers>(right);
 
// Max reductions
Eigen::MatrixXf m;
float nan_or_max = m.maxCoeff(); // Implementation defined behavior.
float nan_if_any_or_max = m.template maxCoeff<PropagateNaN>();
float nan_if_all_or_max = m.template maxCoeff<PropagateNumbers>();

Changes to unsupported modules

New low-latency non-blocking ThreadPool module

  • Originally part of the Tensor module, Eigen::ThreadPool is now separate and more portable; it forms the basis for multi-threading in TensorFlow, for example. Example:
  #include <unsupported/Eigen/CXX11/ThreadPool>
 
  const int num_threads = 42;
  Eigen::ThreadPool tp(num_threads);
  auto do_stuff = []() { ... };
  tp.Schedule(do_stuff);

Changes to Tensor module

  • Support for c++03 was officially dropped in the Tensor module, since most of the code was effectively written in c++11 anyway. This prevents building the Tensor code for CUDA with older versions of nvcc.
  • Performance optimizations of Tensor contraction
    • Speed up "outer-product-like" operations by parallelizing over the contraction dimension, using thread_local buffers and recursive work splitting.
    • Improved threading heuristics.
    • Support for fusing element-wise operations into contraction during evaluation. Example:
// This example applies std::sqrt to all output elements from a tensor contraction. 
// The optional OutputKernel argument to the contraction in this example is a functor over a 
// 2-dimensional buffer. The functor is called once for each output block of the contraction 
// result, to perform the elementwise sqrt operation while the block is hot in cache.
struct SqrtOutputKernel {
  template <typename Index, typename Scalar>
  EIGEN_ALWAYS_INLINE void operator()(
      const internal::blas_data_mapper<Scalar, Index, ColMajor>& output_mapper,
      const TensorContractionParams&, Index, Index, Index num_rows,
      Index num_cols) const {
    for (Index i = 0; i < num_rows; ++i) {
      for (Index j = 0; j < num_cols; ++j) {
        output_mapper(i, j) = std::sqrt(output_mapper(i, j));
      }
    }
  }
};
 
typedef Eigen::IndexPair<int> DimPair;
Tensor<float, 4> left(30, 50, 8, 31);
Tensor<float, 5> right(8, 31, 7, 20, 10);
Tensor<float, 5> result(30, 50, 7, 20, 10);
left.setRandom();
right.setRandom();
Eigen::array<DimPair, 2> dims({{DimPair(2, 0), DimPair(3, 1)}});
 
result = left.contract(right, dims, SqrtOutputKernel());
  • Performance optimizations of other Tensor operators
    • Speedups from improved vectorization, block evaluation, and multi-threading for most operators.
    • Significant speedup to broadcasting.
    • Reduction of index computation overhead, e.g. using fast divisors in TensorGenerator, squeezing dimensions in TensorPadding.
  • A complete rewrite of the block (tiling) evaluation framework for tensor expressions led to significant speedups and a reduced number of memory allocations.
  • Added new API for asynchronous evaluation of tensor expressions. Example:
  Tensor<float, 3> in1(200, 30, 70);
  Tensor<float, 3> in2(200, 30, 70);
  Tensor<float, 3> out(200, 30, 70);
 
  Eigen::ThreadPool tp(4);  // thread pool with 4 threads
  Eigen::ThreadPoolDevice thread_pool_device(&tp, 4);
 
  Eigen::Barrier b(1);
  auto done = [&b]() { b.Notify(); };
  out.device(thread_pool_device, std::move(done)) = in1 + in2 * 3.14f;
  b.Wait();
  • Misc. minor behavior changes & fixes:
    • Fix const correctness for TensorMap.
    • Modify tensor argmin/argmax to always return first occurrence.
    • More numerically stable tree reduction.
    • Improve randomness of the tensor random generator.
    • Update the padding computation for PADDING_SAME to be consistent with TensorFlow.
    • Support static dimensions (aka IndexList) in resizing/reshape/broadcast.
    • Improved accuracy of Tensor FFT.

Improvements to FFT module

  • Faster and more accurate twiddle factor computation.

Improvements to EulerAngles

  • EulerAngles can now be directly constructed from 3D vectors
  • EulerAngles now provide isApprox() and cast() functions
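  A hedged sketch of these additions (EulerAnglesXYZd/EulerAnglesXYZf are the module's predefined typedefs):
  #include <unsupported/Eigen/EulerAngles>
  Eigen::EulerAnglesXYZd ea(Eigen::Vector3d(0.1, 0.2, 0.3));  // construct from a 3D angle vector
  Eigen::EulerAnglesXYZf eaf = ea.cast<float>();              // cast()
  bool close = ea.isApprox(eaf.cast<double>());               // isApprox()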

Changes to sparse iterative solvers

  • Added the new IDRS (Induced Dimension Reduction) iterative linear solver.
  #include <unsupported/Eigen/IterativeSolvers>
  SparseMatrix<float> A = ...;  // square sparse system matrix
  VectorXf b = ...;             // right-hand side
  A.makeCompressed();           // compressing the input before calling sparse solvers is recommended
  IDRS<SparseMatrix<float>, DiagonalPreconditioner<float> > idrs(A);
  if (idrs.info() == ComputationInfo::Success) {
    VectorXf x = idrs.solve(b);
  }

Improvements to Polynomials

  • PolynomialSolver can now be used with complex numbers
  • The solver will automatically choose between EigenSolver and ComplexEigenSolver depending on the scalar type used
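  A minimal sketch (roots of p(x) = 1 + x^2, i.e. +/- i; coefficients are passed constant term first):
  #include <unsupported/Eigen/Polynomials>
  typedef std::complex<double> Cplx;
  Eigen::Matrix<Cplx, 3, 1> coeffs;
  coeffs << Cplx(1, 0), Cplx(0, 0), Cplx(1, 0);    // p(x) = 1 + 0*x + 1*x^2
  Eigen::PolynomialSolver<Cplx, 2> solver(coeffs); // uses ComplexEigenSolver for complex scalars
  const Eigen::PolynomialSolver<Cplx, 2>::RootsType& roots = solver.roots();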

Other relevant changes

  • Eigen now provides an option to test with an external BLAS library
  • Eigen can now be used with the PGI Compiler
  • Printing when using GDB has been improved
  • Eigen can now detect if a platform supports int128 intrinsics

Testing

The full Eigen test suite was built and run successfully (in c++03 and c++11 mode) with the following compiler/platform/OS combinations.

Compiler                             Version          Platform   Operating system
Microsoft Visual Studio              2015 Update 3    x86-64     Windows
Microsoft Visual Studio Community    2017 (15.9.38)   x86-64     Windows
Microsoft Visual Studio Community    2019 (16.11)     x86-64     Windows
GCC                                  4.8              x86-64     Linux
GCC                                  9                x86-64     Linux
GCC                                  10               x86-64     Linux
Clang                                6.0              x86-64     Linux
Clang                                10               x86-64     Linux
Clang                                11               x86-64     Linux
GCC                                  10               armv8.2-a  Linux
Clang                                6                armv8.2-a  Linux
Clang                                9                armv8.2-a  Linux
Clang                                10               armv8.2-a  Linux
Clang                                11               armv8.2-a  Linux
AppleClang                           12.0.5           x86-64     macOS
GCC                                  10               ppc64le    Linux
Clang                                10               ppc64le    Linux

Additional information

  • A curated list of commits, approximately organized by the same topics as the release notes above, and sorted in reverse chronological order can be found here.