This is an entry to discuss how to handle corner cases in SIMD sin/cos. Let's call x the input, there are 3 different corner cases:
(p1) x==NaN -> returns NaN
(p2) x==+/-INF -> returns NaN
(p3) x is "large" -> returns ??
There is nothing special about x==NaN as NaN is naturally propagated.
The case x==INF is also naturally handled by the default implementation but preserving this property while handling x=="large" might require special treatment. See below.
Regarding x==large, in 3.3 sin/cos might return numbers outside the [-1:1] range which is not satisfactory. Nonetheless it is important to note that we are talking about values of x such that the distance between two successive floats is larger than 2pi, so basically all you can expect is random numbers. This means there is no need to be too pedantic regarding accuracy, but we might still want to preserve some properties like:
(p4) sin/cos(huge) in [-1:1]
(p5) sin^2(huge) + cos^2(huge) == 1
Those properties are easy to preserve but so far I did not find a way to enforce all of them without a significant overhead:
Time (p2) | (p4) | (p5)
3.3: 0.39 YES NO NO
ada7f22 (head): 0.41 YES YES NO
MSA-like: 0.46 YES YES YES
clamp to [-pi/4,pi/4]: 0.43 NO YES YES
" + handling of NaN: 0.44 YES YES YES
The last version is implemented by adding the following lines prior to the sin/cos polynomial evaluations:
x = padd(x,psub(_x,_x)); // convert +INF to NaN (borrowed from the MSA implementation)
x = pmin(x,pset1<Packet>( EIGEN_PI/4.f));
x = pmax(x,pset1<Packet>(-EIGEN_PI/4.f));
but it implies a +12% of overhead compared to no special treatment of large inputs.
The MSA-like version is implemented by adding the following lines right after taking the absolute value of the input:
x = padd(x,psub(_x,_x));
Packet small_or_nan_mask = pcmp_lt_or_nan(x, pset1<Packet>(13176795.f));
x = pand(x, small_or_nan_mask);
The overhead is even larger.
I'm not 100% sure what to do on that matter. Opinions welcome.
For completeness, the MSA-like variant can also be simplified as:
x = padd(x,psub(_x,_x));
x = pmin(x, pset1<Packet>(13176795.f));
but it's still as costly as clamping to [-pi/4,pi/4] + handling of NaN
Created attachment 911 [details]
I played around a bit with using the
y= (x+float(1<<23)) - float(1<<23);
rounding trick. That means, there are now two floating point add/subs instead of two conversion-casts.
Since this already rounds to nearest, it saves the (y+1)&(~1) operation, and infinities will immediately result in NaN, when calculating x-= y*pi/2
With some effort, the int-operations may even be removed, allowing full AVX1-compatibility (without splitting the packet).
See the attachment. There is still some debug code (and I did not benchmark that yet)
I was further testing/debugging. The current implementation results in NaN if the reduced x*x is infinity and the cos-polynomial is evaluated.
This happens in
y1 = psub(y1, pmul(x2, cst_half));
because y1 and x2*0.5f are both positive infinity and are subtracted from each other. Is there a reason this part of the polynomial is not evaluated using Horner's method like the first three terms? (Precision loss in some corner cases?)
Of course, this is no problem, if x is clamped.
Independent of that, a minor possible improvement:
If we have a "highest-bit"-pselect function, we can choose the poly_mask by a simple left-shift (instead of pand and pcomp, although the throughput of the latter might actually be better -- and latency is not an issue here)
Here are some benchmarking results for 3.3/master and my proposal for different architectures:
ns per calculation for 1024 elements
sse42 avx avx2 fma avx2+fma
sin 3.09 1.85 1.76 1.39 1.31
cos 3.15 76.98 77.06 77.06 77.10
sin 3.34 2.18 1.88 1.80 1.50
cos 3.43 5.23 1.86 1.91 1.50
sin 3.15 1.87 1.74 1.45 1.32
cos 3.16 2.16 1.76 1.92 1.32
Nice trick. but this decreases accuracy for inputs larger than 8*pi. Max absolute errors:
input range 3.3 this patch
[-4pi,4pi] 5.96046e-08 5.96046e-08
[-8pi,8pi] 5.96046e-08 4.76837e-07
[-16pi,16pi] 5.96046e-08 7.7486e-07
[-32pi,32pi] 5.96046e-08 1.66893e-06
[-64pi,64pi] 5.96046e-08 7.27177e-06
(I used bench/math_func_prec.cpp)
(In reply to Gael Guennebaud from comment #5)
> Nice trick. but this decreases accuracy for inputs larger than 8*pi.
That trick should be accurate as long as
abs(x)*(2/pi) < 2^23
Maybe the clamping to [-pi/4,pi/4] introduces inaccuracies (if pi/4 is rounded down). Can you try deactivating that? (Should only be necessary for very large values)
Did you test with SSE/AVX1/AVX2/FMA?
> (I used bench/math_func_prec.cpp)
I can't find that file in
Do you have that only locally? And should it maybe part of the unit-test suite?
One of the difficulty in fixing this accuracy issue is that adding 2^23 for rounding is not deterministic for inputs of the form X.5. For some X you get X, for others you get X+1. In practice this means that after the ""Extended precision modular arithmetic" step, we get some entries slightly outside the expected range. Those entries gets truncated, hence the accuracy loss. Actually, the sine/cosine polynomials are valid in a larger range than [-pi/4,pi/4], so a simple fix seems to be to extend the range for clamping.
my bad, I never committed/pushed that bench/math_func_prec.cpp file. I agree that should be part of the unit tests, at least for tracking regressions.
Interesting alternative to our Cephes-based implementations:
The implementation looks much better optimized for SIMD, e.g., the use just one polynomial (with only one more coefficient) which is valid on [-pi/2:pi/2] (instead of two polynomials valid only on [-pi/4:pi/4]).
I quickly tested sleef with SSE4. The 1ULP version is twice slower. The 3.5ULP version is x1.33 times faster. Regarding accuracy, the 3.5ULP version is less accurate than our current implementation for inputs in the range [-2000,2000] (roughly). For this range our implementation achieve 0.5ULP. For larger inputs, however, the sleef 3.5ULP version remains more accurate.
Created attachment 913 [details]
Improved version of psincos
Playing a bit more with sleef and our cephes based code, I found that our code was pretty good in terms of absolute error, but very bad in terms of relative precision (ULP). The original cephes code exhibits the same issue.
Sleef use a single higher order polynomial approximation over the [-pi/2,pi/2] range, but to get 1ULP accuracy (even within the [-pi/2,pi/2] range) this polynomial needs to be evaluated with extra precision in the last 3 operations. Tracking the residuals in a second variable is very costly and requires true FMA.
Therefore, using 2 lower order polynomials over two different ranges is not that bad, and even faster.
To improve our current code, I borrowed the modulo coefficients of this sinf implementation:
and recomputed the sin and cos polynomial coefficients to improve relative accuracy.
With true FMA, I managed to get a max error of 1ULP for x<117435 while sticking to a low-order sin polynomial (the post above use an additional x^9 term).
Without true FMA, getting accurate modulo is more tricky, but using one more iteration as the sleef code, we can get 1ULP error for x<2pi (which is a welcome property IMO), and 2ULP for x<51981.
Regarding speed, we're still pretty good:
SSE2 SSE4 AVX AVX2+FMA
sleef35 32 32 20 17
sleef10 96 96 62 32
old 44 44 30 21
new 42 42 25 16
This new version is both more accurate and faster than what we currently have, and with true AVX2+FMA we get same speed as the 3.5ULP version of sleef but with 1ULP precision.
We still need to improve the handling of larger input, and for this the sleef approach of branching on the max input value sounds pretty reasonable (low overhead for likely inputs, significant penalty for unlikely ones).
I haven't measured the accuracy of cos yet, the valid ranges are likely to be slightly different.
I've improved the sine-polynomial coefficients to maintain an error of 1ULP for cos(x) up to x<71476, which is good news.
For larger inputs, on x86, the simplest seems to fall back to a scalar path. This is faster than both the 1ULP and 3.5ULP vectorized versions in sleef! This is because on x86 we have fast support for double precision, and implementing sinf(float) using double precision is much simpler (as done in glibc). We could be even faster by implementing a vectorized version of sinf/cosf based on double precision. On other architectures, this might be different though.
btw, the overhead of the falling-back logic is 0 (for not too huge inputs) because it is compensated by the removal of the clamp.
I applied the changes I discussed previously:
Summary: Bug 1652: implements a much more accurate version of vectorized sin/cos. This new version achieve same speed for SSE/AVX, and is slightly faster with FMA. Guarantees are as follows:
- no FMA: 1ULP up to 3pi, 2ULP up to sin(25966) and cos(18838), fallback to std::sin/cos for larger inputs
- FMA: 1ULP up to sin(117435.992) and cos(71476.0625), fallback to std::sin/cos for larger inputs
There are more details in the code itself.
Only one minor issue: `alignas(Packet)` breaks compilation on g++4.7 with c++11 mode enabled (yes c++11 support of g++4.7 is incomplete, even though it supports -std=c++11).
I'd suggest adding a macro EIGEN_ALIGNAS in ConfigureVectorization which is either `alignas(x)` or falls back to `EIGEN_ALIGN_TO_BOUNDARY(EIGEN_ALIGNOF(x))`. (this would also separate preprocessor-logic from the implementation)
Alternatively, we could just disable EIGEN_HAS_CXX11 for g++ < 4.8