This bugzilla service is closed. All entries have been migrated to https://gitlab.com/libeigen/eigen

Bug 1652

Summary: Corner cases in SIMD sin/cos
Product:    Eigen
Component:  Core - vectorization
Status:     RESOLVED FIXED
Severity:   Accuracy Problem
Priority:   Normal
Version:    unspecified
Hardware:   All
OS:         All
Reporter:   Gael Guennebaud <gael.guennebaud>
Assignee:   Nobody <eigen.nobody>
CC:         afrunze, chtz, gael.guennebaud, jacob.benoit.1, markos, rmlarsen
Whiteboard:
Bug Depends on:
Bug Blocks: 814

Attachments (flags: none):
  proof-of-concept
  Improved version of psincos

Description Gael Guennebaud 2018-12-27 15:35:48 UTC
This is an entry to discuss how to handle corner cases in SIMD sin/cos. Let x denote the input; there are 3 different corner cases:

(p1) x==NaN -> returns NaN
(p2) x==+/-INF -> returns NaN
(p3) x is "large" -> returns ??

There is nothing special about x==NaN as NaN is naturally propagated.

The case x==INF is also naturally handled by the default implementation, but preserving this property while handling x=="large" might require special treatment. See below.

Regarding x=="large", in 3.3 sin/cos might return numbers outside the [-1:1] range, which is not satisfactory. Nonetheless, it is important to note that we are talking about values of x such that the distance between two successive floats is larger than 2pi, so basically all you can expect is random numbers. This means there is no need to be too pedantic regarding accuracy, but we might still want to preserve some properties like:

(p4) sin/cos(huge) in [-1:1]
(p5) sin^2(huge) + cos^2(huge) == 1

Those properties are easy to preserve but so far I did not find a way to enforce all of them without a significant overhead:

                       Time    (p2) | (p4) | (p5)
3.3:                   0.39    YES     NO     NO
ada7f22 (head):        0.41    YES     YES    NO
MSA-like:              0.46    YES     YES    YES
clamp to [-pi/4,pi/4]: 0.43    NO      YES    YES
 " + handling of NaN:  0.44    YES     YES    YES

The last version is implemented by adding the following lines prior to the sin/cos polynomial evaluations:

  x = padd(x,psub(_x,_x)); // convert +INF to NaN (borrowed from the MSA implementation)
  x = pmin(x,pset1<Packet>( EIGEN_PI/4.f));
  x = pmax(x,pset1<Packet>(-EIGEN_PI/4.f));

but it implies about +12% overhead compared to no special treatment of large inputs.
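
For reference, the first of these lines maps +/-INF to NaN while leaving finite inputs untouched by plain IEEE arithmetic:

  // finite _x   :  _x - _x == 0.0f,  so  x + 0.0f == x
  // _x == +/-INF:  _x - _x == NaN,   so  x + NaN  == NaN   (INF - INF is NaN)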


The MSA-like version is implemented by adding the following lines right after taking the absolute value of the input:

  x = padd(x,psub(_x,_x)); // convert +/-INF to NaN
  Packet small_or_nan_mask = pcmp_lt_or_nan(x, pset1<Packet>(13176795.f));
  x = pand(x, small_or_nan_mask); // keep small or NaN entries, zero out the huge ones

The overhead is even larger.

I'm not 100% sure what to do on that matter. Opinions welcome.
Comment 1 Gael Guennebaud 2018-12-27 15:43:05 UTC
For completeness, the MSA-like variant can also be simplified as:

  x = padd(x,psub(_x,_x));
  x = pmin(x, pset1<Packet>(13176795.f));

but it's still as costly as clamping to [-pi/4,pi/4] + handling of NaN.
Comment 2 Christoph Hertzberg 2018-12-27 22:01:13 UTC
Created attachment 911 [details]
proof-of-concept

I played around a bit with using the
  y= (x+float(1<<23)) - float(1<<23);
rounding trick. That means there are now two floating-point add/subs instead of two conversion casts.
Since this already rounds to nearest, it saves the (y+1)&(~1) operation, and infinities immediately result in NaN when calculating x -= y*pi/2.
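
In scalar form, the rounding step is just the following (a minimal sketch, not the attached patch; it assumes the rounded value is non-negative (as it is after pabs), the default round-to-nearest mode, and that the compiler does not reassociate the two operations, i.e. no -ffast-math):

  #include <cstdio>

  // round y (>= 0, not too large) to the nearest integer using only float add/sub
  float round_to_int(float y) {
    const float magic = float(1 << 23);  // 2^23: after adding it, no fraction bits remain
    return (y + magic) - magic;          // rounds to nearest, ties to even
  }

  int main() {
    std::printf("%g %g %g\n", round_to_int(2.7f), round_to_int(0.5f), round_to_int(1.5f));
    // prints: 3 0 2  (0.5 and 1.5 both round to the nearest even integer)
  }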

With some effort, the int-operations may even be removed, allowing full AVX1-compatibility (without splitting the packet).

See the attachment. There is still some debug code (and I did not benchmark it yet).
Comment 3 Christoph Hertzberg 2018-12-28 00:43:08 UTC
I was further testing/debugging. The current implementation results in NaN if the reduced x*x is infinity and the cos-polynomial is evaluated.

This happens in
  y1 = psub(y1, pmul(x2, cst_half));
because y1 and x2*0.5f are both positive infinity and are subtracted from each other. Is there a reason this part of the polynomial is not evaluated using Horner's method like the first three terms? (Precision loss in some corner cases?)
Of course, this is no problem if x is clamped.
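
For reference, in scalar pseudo-code the two variants of that tail would look roughly like this (coefficient names shortened; whether the split form buys extra accuracy in some corner case is exactly the question):

  // current (split) form:
  //   y1 = (((p0*x2 + p1)*x2 + p2) * x2) * x2;
  //   y1 = y1 - x2*0.5f;
  //   y1 = y1 + 1.0f;
  // fully Horner form:
  //   y1 = ((((p0*x2 + p1)*x2 + p2)*x2 - 0.5f) * x2) + 1.0f;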


Independent of that, a minor possible improvement: 
If we have a "highest-bit"-pselect function, we can select the poly_mask by a simple left-shift (instead of pand and pcmp; although the throughput of the latter might actually be better -- and latency is not an issue here).
Comment 4 Christoph Hertzberg 2018-12-29 14:28:13 UTC
Here are some benchmarking results for 3.3/master and my proposal for different architectures:

ns per calculation for 1024 elements
         sse42   avx  avx2   fma  avx2+fma
3.3 (3011dc2762b2)
  sin     3.09  1.85  1.76  1.39  1.31
  cos     3.15 76.98 77.06 77.06 77.10
master (591f3f24fec6)
  sin     3.34  2.18  1.88  1.80  1.50
  cos     3.43  5.23  1.86  1.91  1.50
my-patch
  sin     3.15  1.87  1.74  1.45  1.32
  cos     3.16  2.16  1.76  1.92  1.32
Comment 5 Gael Guennebaud 2018-12-29 22:20:46 UTC
Nice trick, but this decreases accuracy for inputs larger than 8*pi. Max absolute errors:

input range   3.3            this patch
[-4pi,4pi]    5.96046e-08    5.96046e-08
[-8pi,8pi]    5.96046e-08    4.76837e-07
[-16pi,16pi]  5.96046e-08    7.7486e-07
[-32pi,32pi]  5.96046e-08    1.66893e-06
[-64pi,64pi]  5.96046e-08    7.27177e-06

(I used bench/math_func_prec.cpp)
Comment 6 Christoph Hertzberg 2018-12-30 00:39:39 UTC
(In reply to Gael Guennebaud from comment #5)
> Nice trick, but this decreases accuracy for inputs larger than 8*pi.

That trick should be accurate as long as 
  abs(x)*(2/pi) < 2^23
Maybe the clamping to [-pi/4,pi/4] introduces inaccuracies (if pi/4 is rounded down). Can you try deactivating that? (It should only be necessary for very large values.)
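
(For reference, that bound means abs(x) < 2^23 * pi/2 ~ 13176795, which is presumably where the 13176795.f threshold used in the MSA-like variant in the description comes from.)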

Did you test with SSE/AVX1/AVX2/FMA?

> (I used bench/math_func_prec.cpp)

I can't find that file in
https://bitbucket.org/eigen/eigen/src/591f3f24f/bench/?at=default
Do you have that only locally? And should it maybe be part of the unit-test suite?
Comment 7 Gael Guennebaud 2018-12-30 00:40:16 UTC
One of the difficulties in fixing this accuracy issue is that adding 2^23 for rounding does not round consistently for inputs of the form X.5: for some X you get X, for others X+1 (round-to-nearest-even rounds 2.5 down to 2 but 3.5 up to 4). In practice this means that after the "Extended precision modular arithmetic" step, we get some entries slightly outside the expected range. Those entries get truncated, hence the accuracy loss. Actually, the sine/cosine polynomials are valid on a larger range than [-pi/4,pi/4], so a simple fix seems to be to extend the clamping range.
Comment 8 Gael Guennebaud 2018-12-30 00:44:02 UTC
My bad, I never committed/pushed the bench/math_func_prec.cpp file. I agree it should be part of the unit tests, at least for tracking regressions.
Comment 9 Christoph Hertzberg 2019-01-04 16:32:17 UTC
Interesting alternative to our Cephes-based implementations:
  https://sleef.org/
The implementation looks much better optimized for SIMD; e.g., they use just one polynomial (with only one more coefficient) which is valid on [-pi/2:pi/2] (instead of two polynomials valid only on [-pi/4:pi/4]).
Comment 10 Gael Guennebaud 2019-01-04 22:02:08 UTC
I quickly tested sleef with SSE4. The 1ULP version is twice as slow; the 3.5ULP version is about 1.33x faster. Regarding accuracy, the 3.5ULP version is less accurate than our current implementation for inputs in the range [-2000,2000] (roughly). For this range our implementation achieves 0.5ULP. For larger inputs, however, the sleef 3.5ULP version remains more accurate.
Comment 11 Gael Guennebaud 2019-01-07 15:27:58 UTC
Created attachment 913 [details]
Improved version of psincos

Playing a bit more with sleef and our cephes-based code, I found that our code was pretty good in terms of absolute error, but very bad in terms of relative precision (ULP). The original cephes code exhibits the same issue.

Sleef uses a single higher-order polynomial approximation over the [-pi/2,pi/2] range, but to get 1ULP accuracy (even within the [-pi/2,pi/2] range) this polynomial needs to be evaluated with extra precision in the last 3 operations. Tracking the residuals in a second variable is very costly and requires true FMA.

Therefore, using two lower-order polynomials over two different ranges is not that bad, and even faster.

To improve our current code, I borrowed the modulo coefficients from this sinf implementation:

https://stackoverflow.com/questions/30463616/payne-hanek-algorithm-implementation-in-c/30465751#30465751

and recomputed the sin and cos polynomial coefficients to improve relative accuracy.

With true FMA, I managed to get a max error of 1ULP for x<117435 while sticking to a low-order sin polynomial (the post above uses an additional x^9 term).

Without true FMA, getting an accurate modulo is trickier, but using one more iteration than the sleef code, we can get 1ULP error for x<2pi (which is a welcome property IMO), and 2ULP for x<51981.
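
For reference, the general shape of this kind of modulo/range reduction in scalar form is sketched below (the split of pi/2 is derived naively from double here for illustration only; the actual patch and the linked post use carefully truncated float literals, and of course operate on packets):

  #include <cmath>

  // reduce x modulo pi/2: returns r with x = j*(pi/2) + r and r roughly in [-pi/4, pi/4]
  float reduce_arg(float x, int& quadrant) {
    float j = std::nearbyint(x * 0.63661977f);   // x * (2/pi), rounded to the nearest integer
    quadrant = static_cast<int>(j);              // its low two bits later select sin/cos and the sign

    // split pi/2 into three float chunks whose sum is (almost) exactly pi/2
    const double pio2 = 1.5707963267948966;
    const float c_hi  = static_cast<float>(pio2);
    const float c_mid = static_cast<float>(pio2 - c_hi);
    const float c_lo  = static_cast<float>(pio2 - c_hi - c_mid);

    // subtract j*pi/2 chunk by chunk; with true FMA each step adds almost no rounding error
    x = std::fmaf(j, -c_hi,  x);
    x = std::fmaf(j, -c_mid, x);
    x = std::fmaf(j, -c_lo,  x);
    return x;
  }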

Regarding speed (lower is better), we're still pretty good:

        SSE2  SSE4  AVX  AVX2+FMA
sleef35   32    32   20   17
sleef10   96    96   62   32
old       44    44   30   21
new       42    42   25   16

This new version is both more accurate and faster than what we currently have, and with true AVX2+FMA we get the same speed as the 3.5ULP version of sleef but with 1ULP precision.

We still need to improve the handling of larger inputs, and for this the sleef approach of branching on the max input value sounds pretty reasonable (low overhead for likely inputs, significant penalty for unlikely ones).

I haven't measured the accuracy of cos yet; the valid ranges are likely to be slightly different.
Comment 12 Gael Guennebaud 2019-01-07 23:13:44 UTC
I've improved the sine-polynomial coefficients to maintain an error of 1ULP for cos(x) up to x<71476, which is good news.

For larger inputs, on x86, the simplest option seems to be to fall back to a scalar path. This is faster than both the 1ULP and 3.5ULP vectorized versions in sleef! This is because on x86 we have fast support for double precision, and implementing sinf(float) using double precision is much simpler (as done in glibc). We could be even faster by implementing a vectorized version of sinf/cosf based on double precision. On other architectures, this might be different though.
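
For illustration, the core of that scalar fallback idea is nothing more than the following (a minimal sketch; the fallback eventually implemented in Eigen simply defers to std::sin/std::cos for out-of-range entries, see comment 14):

  #include <cmath>

  // sinf via double precision: std::sin performs the argument reduction and
  // evaluation in double, so the rounded-back float is well within 1 ULP
  inline float sinf_via_double(float x) {
    return static_cast<float>(std::sin(static_cast<double>(x)));
  }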
Comment 13 Gael Guennebaud 2019-01-07 23:15:30 UTC
Btw, the overhead of the fallback logic is zero (for inputs that are not too huge) because it is compensated by the removal of the clamp.
Comment 14 Gael Guennebaud 2019-01-09 17:06:07 UTC
I applied the changes I discussed previously:

https://bitbucket.org/eigen/eigen/commits/8d8e63a86550/
Summary:     Bug 1652: implements a much more accurate version of vectorized sin/cos. This new version achieve same speed for SSE/AVX, and is slightly faster with FMA. Guarantees are as follows:
  - no FMA: 1ULP up to 3pi, 2ULP up to sin(25966) and cos(18838), fallback to std::sin/cos for larger inputs
  - FMA: 1ULP up to sin(117435.992) and cos(71476.0625), fallback to std::sin/cos for larger inputs

There are more details in the code itself.
Comment 15 Christoph Hertzberg 2019-01-10 13:40:00 UTC
Only one minor issue: `alignas(Packet)` breaks compilation on g++ 4.7 with C++11 mode enabled (yes, the C++11 support of g++ 4.7 is incomplete, even though it accepts -std=c++11).
I'd suggest adding a macro EIGEN_ALIGNAS in ConfigureVectorization which is either `alignas(x)` or falls back to `EIGEN_ALIGN_TO_BOUNDARY(EIGEN_ALIGNOF(x))`. (This would also separate the preprocessor logic from the implementation.)
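
Something along these lines, as a rough sketch (assuming the existing EIGEN_HAS_CXX11, EIGEN_COMP_GNUC, EIGEN_GNUC_AT_MOST, EIGEN_ALIGN_TO_BOUNDARY and EIGEN_ALIGNOF macros; the exact compiler-version test is only illustrative):

  #if EIGEN_HAS_CXX11 && !(EIGEN_COMP_GNUC && EIGEN_GNUC_AT_MOST(4,7))
    #define EIGEN_ALIGNAS(X) alignas(X)
  #else
    #define EIGEN_ALIGNAS(X) EIGEN_ALIGN_TO_BOUNDARY(EIGEN_ALIGNOF(X))
  #endif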

Alternatively, we could just disable EIGEN_HAS_CXX11 for g++ < 4.8.
Comment 16 Nobody 2019-12-04 18:19:54 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/1652.