The testcase test_minmax_nan_propagation from unsupported/test/cxx11_tensor_expr fails on Power (ppc64le Linux) with the following error:

[root@4614895ff275 test]# ./cxx11_tensor_expr
Initializing random number generator with seed 1514789803
Repeating each test 10 times
For tensor of size: 1 NaN compared to 0 outputs: nan
For tensor of size: 2 NaN compared to 0 outputs: nan nan
For tensor of size: 3 NaN compared to 0 outputs: nan nan nan
For tensor of size: 4 NaN compared to 0 outputs: 0
Test test_minmax_nan_propagation() failed in /root/.cache/bazel/_bazel_root/7d6c9d3c54087a4af55f3ebce8d864ef/external/eigen_archive/unsupported/test/cxx11_tensor_expr.cpp (320)
  (numext::isnan)(vec_res(i))

For tensors of size 1, 2 and 3 containing NaNs, the comparison against zero returns NaN as expected. However, once the tensor size grows beyond 3, the result becomes zero. This behavior is odd: on x86, NaN is returned consistently for tensors of any size. Although this functionality is marked as unsupported in Eigen, it is used by TensorFlow and is covered by TensorFlow's test case for the Relu operator (which uses the maximum operator). I found an existing merged Eigen change addressing this issue on x86, http://eigen.tuxfamily.org/bz/show_bug.cgi?id=1373. Following that, I tried making a similar change in the Altivec PacketMath.h, but with no luck. On the TensorFlow side there is an issue logged for TensorFlow's test case, https://github.com/tensorflow/tensorflow/issues/11603, where I've put my findings about the vec_max function on PPC and its behavior. Could anyone please provide me some pointers on this?
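For context, the failing check boils down to roughly the following (my approximation of what the test does, not the exact test source): fill a rank-1 tensor with NaN, take the coefficient-wise max against a zero tensor, and verify every output coefficient is still NaN.

#include <cassert>
#include <limits>
#include <unsupported/Eigen/CXX11/Tensor>

void check_nan_propagation(int size) {
  const float kNan = std::numeric_limits<float>::quiet_NaN();
  Eigen::Tensor<float, 1> vec_nan(size);
  Eigen::Tensor<float, 1> vec_zero(size);
  vec_nan.setConstant(kNan);
  vec_zero.setZero();
  // max(NaN, 0) is expected to propagate the NaN, as it does on x86.
  Eigen::Tensor<float, 1> vec_res = vec_nan.cwiseMax(vec_zero);
  for (int i = 0; i < size; ++i) {
    assert((Eigen::numext::isnan)(vec_res(i)));  // fails on ppc64le once size reaches 4
  }
}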
*** Bug 1495 has been marked as a duplicate of this bug. ***
*** Bug 1496 has been marked as a duplicate of this bug. ***
*** Bug 1501 has been marked as a duplicate of this bug. ***
Any pointers on what could be the problem or fix would be really helpful.
Created attachment 831 [details]
Patch to Altivec Arch PacketMath to provide same behavior as x86

The attached patch changes the behavior of PacketMath's pmax and pmin functions for the Altivec (Power) architecture so that they match the x86 behavior introduced by http://eigen.tuxfamily.org/bz/show_bug.cgi?id=1373. This allows the cxx11_tensor_expr and redux_? tests to pass.

While this makes Altivec/Power match x86, it leaves the CUDA/GPU architecture behaving differently from both x86 and Altivec/Power. Here is a cross post from an investigation into Eigen's pmin/pmax in a TensorFlow issue where the difference between CPU and GPU computation was noted: https://github.com/tensorflow/tensorflow/issues/12659

The differences seen in 11603 and here likely come down to the architecture-specific implementations of Eigen's PacketMath pmax function.

Per IEEE-754R the behavior should be:

  min(x, QNaN) == min(QNaN, x) == x  (same for max)

In practice with std::max we see:

  std::max(nan, x) = nan
  std::max(x, nan) = x

In Eigen's pmax functions the behavior differs per architecture:

x86 Eigen PacketMath pmax
Through Eigen bug 1373 the x86 archs under Eigen/src/Core/arch were changed to match std::max's behavior of always returning the first argument when comparing NaNs. Prior to this they followed the Intel SIMD intrinsics' behavior of always returning the second argument.

CUDA Eigen PacketMath pmax
Float and double are implemented using fmaxf and fmax respectively. Per CUDA's API doc these handle NaNs like this: "If one argument is NaN, returns the numeric argument." So the Eigen CUDA impl follows IEEE-754R in this respect. Half math is a bit different and uses > and < operators on the respective halves.

ALTIVEC (used by ppc64le) Eigen PacketMath pmax
The pmax functions are implemented using the vec_max intrinsic, which follows IEEE-754R and thus returns the numeric value for numeric-vs-NaN comparisons.

So to resolve the differing behavior between std::max, x86 Eigen, and CUDA/ALTIVEC Eigen, a decision has to be made about which behavior is desired. The changes under Eigen bug 1373 were made because the change "makes the Eigen Tensor library properly propagate NaNs for the Relu operator, which is performance critical for neural net models."
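To make the std::max asymmetry and the IEEE-754-style behavior concrete, here is a small stand-alone snippet (illustration only, not Eigen code); std::fmax behaves like CUDA's fmax/fmaxf and Altivec's vec_max, while std::max depends on argument order:

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <limits>

int main() {
  const double nan = std::numeric_limits<double>::quiet_NaN();
  // std::max(a, b) returns a when the comparison a < b is false,
  // so a NaN in the first position is propagated and a NaN in the second is not.
  std::printf("std::max(nan, 1.0)  = %f\n", std::max(nan, 1.0));   // nan
  std::printf("std::max(1.0, nan)  = %f\n", std::max(1.0, nan));   // 1.0
  // std::fmax follows IEEE-754: the numeric operand wins regardless of order.
  std::printf("std::fmax(nan, 1.0) = %f\n", std::fmax(nan, 1.0));  // 1.0
  std::printf("std::fmax(1.0, nan) = %f\n", std::fmax(1.0, nan));  // 1.0
  return 0;
}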
I moved this to 'Core - Vectorization', since it does not appear to be a Tensor-specific problem (correct me if I'm wrong). I don't have any Altivec/Power machine to check your patch, though. Regarding the discussion on the behavior of max/min, I'd suggest continuing it at Bug 564.
I tested and applied a slightly different version of the patch because xvcmp* require VSX:

https://bitbucket.org/eigen/eigen/commits/52ff7dddccfaad55 (default)
https://bitbucket.org/eigen/eigen/commits/d0af83f82b1912e6 (3.3)

I measured a 10% performance hit on a POWER7. The problem is still open regarding CUDA.
What is wrong with using VSX? And a 10% performance hit is very significant. This is a severe problem and the proposed patch needs to be improved.
Created attachment 846 [details]
Performance test case for ppc64

With the attached performance test program I am seeing a 6% performance hit on Power9 and an 8% performance GAIN on Power8 between vec_min/vec_max and the NaN-propagating assembly in the PacketMath patch. I did nothing to adjust performance governors, SMT, etc. The Power9 had SMT-4, the Power8 had SMT-8.

The test program also tests the idea of bounding the inline assembly with vec_any_nan checks: call the inline assembly only if either input contains NaNs, and call the existing vec_min or vec_max intrinsic in the no-NaN case (see the sketch below). The test program shows there is no performance impact from adding this. Stated another way, we could add vec_any_nan checks around the assembly and only call it if NaNs are involved, so the performance hit would only be incurred in the NaN case. However, it would also negate the performance gain we're seeing on Power8.
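For illustration, the guarded pmax would look roughly like this (a sketch only, not the attached test program; the exact xvcmpgtsp/xxsel sequence is my assumption, analogous to the pmin assembly in the patch, and returns the first argument whenever a NaN is involved):

#include <altivec.h>

typedef __vector float Packet4f;

static inline Packet4f pmax_guarded(Packet4f a, Packet4f b) {
#ifdef __VSX__
  if (vec_any_nan(a) || vec_any_nan(b)) {
    Packet4f ret;
    // mask = (b > a); select b where the mask is set, otherwise keep a,
    // so a NaN in either input yields the first argument (std::max / x86 behavior)
    __asm__ ("xvcmpgtsp %x0,%x2,%x1\n\txxsel %x0,%x1,%x2,%x0"
             : "=&wa" (ret) : "wa" (a), "wa" (b));
    return ret;
  }
  return vec_max(a, b);  // fast path: no NaNs present, keep the vec_max intrinsic
#else
  return vec_max(a, b);
#endif
}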
Created attachment 847 [details]
mini benchmark for pmin

I attached a more practical mini benchmark measuring the use of pmin in coefficient-wise operations and reductions. Here is what I get (I used the best run over thousands of runs because the POWER servers I have access to are pretty busy):

POWER7        vec_min       asm
coeffwise:    2.6218e-05s   2.9966e-05s
redux:        2.2514e-05s   4.3434e-05s

POWER8        vec_min       asm
coeffwise:    2.2198e-05s   2.8444e-05s
redux:        1.5742e-05s   2.3142e-05s

The impact is much more than the 10% I initially measured on a POWER7 for coeff-wise operations.
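For reference, the two timings correspond roughly to Eigen expressions of the following shape (a sketch of the benchmark's structure, not the attached source); both the coefficient-wise expression and the reduction dispatch to pmin on the vectorized path:

#include <Eigen/Core>

// Illustrative only: the mini benchmark times expressions of this shape.
void bench_shapes(const Eigen::ArrayXf& a, const Eigen::ArrayXf& b, Eigen::ArrayXf& out) {
  out = a.min(b);          // "coeffwise": element-wise pmin over the whole array
  float m = a.minCoeff();  // "redux": pmin-based min reduction to a single scalar
  (void)m;                 // silence unused-variable warnings in this sketch
}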
I can confirm that I see degradation numbers similar to Gael's on POWER8 and POWER9 using his mini benchmark. I further modified the benchmark to guard the assembly calls with NaN checks and call vec_min if NaNs are not present. Unlike in my test program, Gael's test program shows that this has a performance hit and actually makes things worse than just calling the assembly.

#ifdef __VSX__
  if (vec_any_nan(a) || vec_any_nan(b)) {
    Packet4f ret;
    __asm__ ("xvcmpgesp %x0,%x1,%x2\n\txxsel %x0,%x1,%x2,%x0" : "=&wa" (ret) : "wa" (a), "wa" (b));
    return ret;
  } else {
    return vec_min(a, b);
  }
#else
  return vec_min(a, b);
#endif
-- GitLab Migration Automatic Message --

This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug on our GitLab instance: https://gitlab.com/libeigen/eigen/issues/1494.