Bug 1494 - Test case test_minmax_nan_propagation from cxx11_tensor_expr fails on PPC
Summary: Test case test_minmax_nan_propagation from cxx11_tensor_expr fails on PPC
Status: REVIEWNEEDED
Alias: None
Product: Eigen
Classification: Unclassified
Component: Core - vectorization
Version: 3.3 (current stable)
Hardware: PPC - AltiVec Linux
Importance: Normal Failed Unit Test
Assignee: Nobody
URL:
Whiteboard:
Keywords:
Duplicates: 1495 1496 1501
Depends on:
Blocks: 1555
Reported: 2018-01-01 08:12 UTC by Nishidha Panpaliya
Modified: 2018-10-08 20:59 UTC
CC List: 7 users



Attachments
Patch to Altivec Arch PacketMath to provide same behavior as x86 (2.54 KB, patch)
2018-03-01 17:51 UTC, Samuel Matzek
Performance test case for ppc64 (7.11 KB, text/x-csrc)
2018-04-04 18:56 UTC, Samuel Matzek
mini benchmark for pmin (1.03 KB, text/x-csrc)
2018-04-04 21:12 UTC, Gael Guennebaud

Description Nishidha Panpaliya 2018-01-01 08:12:19 UTC
Test case test_minmax_nan_propagation from unsupported/test/cxx11_tensor_expr fails on Power (ppc64le Linux) with the following error:

[root@4614895ff275 test]# ./cxx11_tensor_expr
Initializing random number generator with seed 1514789803
Repeating each test 10 times
For tensor of size: 1 NaN compared to 0 outputs:
nan

For tensor of size: 2 NaN compared to 0 outputs:
nan     nan

For tensor of size: 3 NaN compared to 0 outputs:
nan     nan     nan

For tensor of size: 4 NaN compared to 0 outputs:
0
Test test_minmax_nan_propagation() failed in /root/.cache/bazel/_bazel_root/7d6c9d3c54087a4af55f3ebce8d864ef/external/eigen_archive/unsupported/test/cxx11_tensor_expr.cpp (320)
    (numext::isnan)(vec_res(i))

For tensors of size 1, 2, and 3 containing NaNs, comparison against zero returns NaN as expected. However, once the tensor size grows beyond 3, it starts returning zero instead. This behavior is inconsistent.
On x86, NaN is returned consistently for tensors of any size.
Although this functionality is marked as unsupported in Eigen, it is used by TensorFlow and is covered by TensorFlow's test case for the Relu operator (which uses the maximum operator).
I found an existing merged Eigen change addressing this issue on x86: http://eigen.tuxfamily.org/bz/show_bug.cgi?id=1373. Referring to that, I tried making a similar change in the Altivec PacketMath.h, but with no luck.
In the context of TensorFlow, there is an issue logged for TensorFlow's test case, https://github.com/tensorflow/tensorflow/issues/11603, where I've put my findings about the vec_max function on PPC and its behavior.
Could anyone please provide some pointers?
Comment 1 Gael Guennebaud 2018-01-04 17:03:21 UTC
*** Bug 1495 has been marked as a duplicate of this bug. ***
Comment 2 Gael Guennebaud 2018-01-04 17:03:43 UTC
*** Bug 1496 has been marked as a duplicate of this bug. ***
Comment 3 Gael Guennebaud 2018-01-04 17:04:08 UTC
*** Bug 1501 has been marked as a duplicate of this bug. ***
Comment 4 Nishidha Panpaliya 2018-01-25 15:16:49 UTC
Any pointers on what could be the problem or fix would be really helpful.
Comment 5 Samuel Matzek 2018-03-01 17:51:59 UTC
Created attachment 831 [details]
Patch to Altivec Arch PacketMath to provide same behavior as x86

The attached patch changes the behavior of PacketMath's pmax and pmin functions for the Altivec (Power) architecture and provides the same behavior as http://eigen.tuxfamily.org/bz/show_bug.cgi?id=1373 provided to the x86 archs.

This allows the cxx11_tensor_expr and redux_? tests to pass.

While this will make Altivec/Power match x86 behavior, it leaves the CUDA/GPU architecture having different behavior from x86 and Altivec/Power.

Here is a cross post from investigation into Eigen's pmin/pmax from a TensorFlow issue where the difference between CPU and GPU computation was noted:
https://github.com/tensorflow/tensorflow/issues/12659

The differences seen in issue 11603 and here likely come down to the architecture-specific implementations of Eigen's PacketMath pmax function.

Per IEEE-754R the behavior should be:

min(x,QNaN) == min(QNaN,x) == x (same for max)

In practice with std::max we see:
std::max(nan, x) = nan
std::max(x, nan) = x

In Eigen's pmax functions it differs per architecture:

x86 Eigen PacketMath pmax
Through Eigen bug 1373, the x86 archs under Eigen/src/Core/arch were changed to match std::max's behavior of always returning the first argument when comparing NaNs. Prior to this, it followed the Intel SIMD intrinsics' behavior of always returning the second argument.

CUDA Eigen PacketMath pmax
Float and double are implemented using fmaxf and fmax respectively. Per CUDA's API doc these handle NaNs like this: "If one argument is NaN, returns the numeric argument."

So the Eigen CUDA impl is following IEEE-754R in this respect.

Half-precision math is a bit different: it uses the > and < operators on the respective halves.

ALTIVEC (used by ppc64le) Eigen PacketMath pmax

The pmax functions are implemented using the vec_max intrinsic which follows IEEE-754R and thus returns numeric for numeric vs NaN comparisons.

So to resolve the issue of differing behavior between std::max, x86 Eigen, and CUDA/ALTIVEC Eigen, a decision has to be made as to which behavior is desired.

The changes under Eigen bug 1373 were done because it "makes the Eigen Tensor library properly propagate NaNs for the Relu operator, which is performance critical for neural net models."
Comment 6 Christoph Hertzberg 2018-03-01 18:05:41 UTC
I moved this to 'Core - Vectorization', since it does not appear to be a Tensor specific problem (correct me if I'm wrong).
I don't have any Altivec/Power machine to check your patch, though.

Regarding the discussion on the behavior of max/min, I'd suggest continuing it at Bug 564.
Comment 7 Gael Guennebaud 2018-04-04 10:05:33 UTC
I tested and applied a slightly different version of the patch because xvcmp* require VSX.

https://bitbucket.org/eigen/eigen/commits/52ff7dddccfaad55 (default)
https://bitbucket.org/eigen/eigen/commits/d0af83f82b1912e6 (3.3)

I measured a 10% performance hit on a POWER7.

The problem remains open for CUDA.
Comment 8 David Edelsohn 2018-04-04 12:44:44 UTC
What is wrong with using VSX?

And a 10% performance hit is very significant.  This is a severe problem and the proposed patch needs to be improved.
Comment 9 Samuel Matzek 2018-04-04 18:56:44 UTC
Created attachment 846 [details]
Performance test case for ppc64

With the attached performance test program I am seeing a 6% performance hit on Power9 and an 8% performance GAIN on Power8 between vec_min/vec_max and the NaN-propagating assembly in the PacketMath patch.

I did nothing to adjust performance governors, SMT, etc.  The Power9 had SMT-4, the Power8 had SMT-8.

The test program also tests the idea of guarding the inline assembly with vec_any_nan checks: the assembly is called only if either input contains NaNs, and the existing vec_min or vec_max intrinsic is used in the NaN-free case.

The test program shows no performance impact from adding this. Stated another way, we could add vec_any_nan checks around the assembly and call it only when NaNs are involved, incurring the performance hit only in the NaN case. However, it would also negate the performance gain we're seeing on Power8.
Comment 10 Gael Guennebaud 2018-04-04 21:12:09 UTC
Created attachment 847 [details]
mini benchmark for pmin

I attached a more practical mini benchmark measuring the use of pmin in coefficient-wise operations and reductions.

Here is what I get (I used the best run over thousands of runs because the POWER servers I have access to are pretty busy):

POWER7
            vec_min         asm
coeffwise:  2.6218e-05s     2.9966e-05s
redux:      2.2514e-05s     4.3434e-05s

POWER 8
            vec_min         asm
coeffwise:  2.2198e-05s     2.8444e-05s
redux:      1.5742e-05s     2.3142e-05s

The impact is much more than the 10% I initially measured on a POWER7 for coeff-wise operations.
Comment 11 Samuel Matzek 2018-04-05 19:17:33 UTC
I can confirm that I see similar degradation numbers to Gael on POWER8 and POWER9 using his mini benchmark.

I further modified the benchmark to guard the assembly calls with NaN checks and call vec_min if NaNs are not present.  Unlike my test program, Gael's benchmark shows that this has a performance hit and actually makes things worse than just calling the assembly.

  #ifdef __VSX__
  if (vec_any_nan(a) || vec_any_nan(b)) {
      Packet4f ret;
      __asm__ ("xvcmpgesp %x0,%x1,%x2\n\txxsel %x0,%x1,%x2,%x0" : "=&wa" (ret) : "wa" (a), "wa" (b));
      return ret;
  }
  else
      return vec_min(a, b);
  #else
  return vec_min(a, b);
  #endif
