Summary:  Corner cases in SIMD sin/cos  

Product:  Eigen  Reporter:  Gael Guennebaud <gael.guennebaud>  
Component:  Core  vectorization  Assignee:  Nobody <eigen.nobody>  
Status:  RESOLVED FIXED  
Severity:  Accuracy Problem  CC:  afrunze, chtz, gael.guennebaud, jacob.benoit.1, markos, rmlarsen  
Priority:  Normal  
Version:  unspecified  
Hardware:  All  
OS:  All  
Whiteboard:  
Bug Depends on:  
Bug Blocks:  814  
Attachments: 

Description
Gael Guennebaud
20181227 15:35:48 UTC
For completeness, the MSAlike variant can also be simplified as: x = padd(x,psub(_x,_x)); x = pmin(x, pset1<Packet>(13176795.f)); but it's still as costly as clamping to [pi/4,pi/4] + handling of NaN Created attachment 911 [details]
proofofconcept
I played around a bit with using the
y= (x+float(1<<23))  float(1<<23);
rounding trick. That means, there are now two floating point add/subs instead of two conversioncasts.
Since this already rounds to nearest, it saves the (y+1)&(~1) operation, and infinities will immediately result in NaN, when calculating x= y*pi/2
With some effort, the intoperations may even be removed, allowing full AVX1compatibility (without splitting the packet).
See the attachment. There is still some debug code (and I did not benchmark that yet)
I was further testing/debugging. The current implementation results in NaN if the reduced x*x is infinity and the cospolynomial is evaluated. This happens in y1 = psub(y1, pmul(x2, cst_half)); because y1 and x2*0.5f are both positive infinity and are subtracted from each other. Is there a reason this part of the polynomial is not evaluated using Horner's method like the first three terms? (Precision loss in some corner cases?) Of course, this is no problem, if x is clamped. Independent of that, a minor possible improvement: If we have a "highestbit"pselect function, we can choose the poly_mask by a simple leftshift (instead of pand and pcomp, although the throughput of the latter might actually be better  and latency is not an issue here) Here are some benchmarking results for 3.3/master and my proposal for different architectures: ns per calculation for 1024 elements sse42 avx avx2 fma avx2+fma 3.3 (3011dc2762b2) sin 3.09 1.85 1.76 1.39 1.31 cos 3.15 76.98 77.06 77.06 77.10 master (591f3f24fec6) sin 3.34 2.18 1.88 1.80 1.50 cos 3.43 5.23 1.86 1.91 1.50 mypatch sin 3.15 1.87 1.74 1.45 1.32 cos 3.16 2.16 1.76 1.92 1.32 Nice trick. but this decreases accuracy for inputs larger than 8*pi. Max absolute errors: input range 3.3 this patch [4pi,4pi] 5.96046e08 5.96046e08 [8pi,8pi] 5.96046e08 4.76837e07 [16pi,16pi] 5.96046e08 7.7486e07 [32pi,32pi] 5.96046e08 1.66893e06 [64pi,64pi] 5.96046e08 7.27177e06 (I used bench/math_func_prec.cpp) (In reply to Gael Guennebaud from comment #5) > Nice trick. but this decreases accuracy for inputs larger than 8*pi. That trick should be accurate as long as abs(x)*(2/pi) < 2^23 Maybe the clamping to [pi/4,pi/4] introduces inaccuracies (if pi/4 is rounded down). Can you try deactivating that? (Should only be necessary for very large values) Did you test with SSE/AVX1/AVX2/FMA? > (I used bench/math_func_prec.cpp) I can't find that file in https://bitbucket.org/eigen/eigen/src/591f3f24f/bench/?at=default Do you have that only locally? And should it maybe part of the unittest suite? One of the difficulty in fixing this accuracy issue is that adding 2^23 for rounding is not deterministic for inputs of the form X.5. For some X you get X, for others you get X+1. In practice this means that after the ""Extended precision modular arithmetic" step, we get some entries slightly outside the expected range. Those entries gets truncated, hence the accuracy loss. Actually, the sine/cosine polynomials are valid in a larger range than [pi/4,pi/4], so a simple fix seems to be to extend the range for clamping. my bad, I never committed/pushed that bench/math_func_prec.cpp file. I agree that should be part of the unit tests, at least for tracking regressions. Interesting alternative to our Cephesbased implementations: https://sleef.org/ The implementation looks much better optimized for SIMD, e.g., the use just one polynomial (with only one more coefficient) which is valid on [pi/2:pi/2] (instead of two polynomials valid only on [pi/4:pi/4]). I quickly tested sleef with SSE4. The 1ULP version is twice slower. The 3.5ULP version is x1.33 times faster. Regarding accuracy, the 3.5ULP version is less accurate than our current implementation for inputs in the range [2000,2000] (roughly). For this range our implementation achieve 0.5ULP. For larger inputs, however, the sleef 3.5ULP version remains more accurate. Created attachment 913 [details] Improved version of psincos Playing a bit more with sleef and our cephes based code, I found that our code was pretty good in terms of absolute error, but very bad in terms of relative precision (ULP). The original cephes code exhibits the same issue. Sleef use a single higher order polynomial approximation over the [pi/2,pi/2] range, but to get 1ULP accuracy (even within the [pi/2,pi/2] range) this polynomial needs to be evaluated with extra precision in the last 3 operations. Tracking the residuals in a second variable is very costly and requires true FMA. Therefore, using 2 lower order polynomials over two different ranges is not that bad, and even faster. To improve our current code, I borrowed the modulo coefficients of this sinf implementation: https://stackoverflow.com/questions/30463616/paynehanekalgorithmimplementationinc/30465751#30465751 and recomputed the sin and cos polynomial coefficients to improve relative accuracy. With true FMA, I managed to get a max error of 1ULP for x<117435 while sticking to a loworder sin polynomial (the post above use an additional x^9 term). Without true FMA, getting accurate modulo is more tricky, but using one more iteration as the sleef code, we can get 1ULP error for x<2pi (which is a welcome property IMO), and 2ULP for x<51981. Regarding speed, we're still pretty good: SSE2 SSE4 AVX AVX2+FMA sleef35 32 32 20 17 sleef10 96 96 62 32 old 44 44 30 21 new 42 42 25 16 This new version is both more accurate and faster than what we currently have, and with true AVX2+FMA we get same speed as the 3.5ULP version of sleef but with 1ULP precision. We still need to improve the handling of larger input, and for this the sleef approach of branching on the max input value sounds pretty reasonable (low overhead for likely inputs, significant penalty for unlikely ones). I haven't measured the accuracy of cos yet, the valid ranges are likely to be slightly different. I've improved the sinepolynomial coefficients to maintain an error of 1ULP for cos(x) up to x<71476, which is good news. For larger inputs, on x86, the simplest seems to fall back to a scalar path. This is faster than both the 1ULP and 3.5ULP vectorized versions in sleef! This is because on x86 we have fast support for double precision, and implementing sinf(float) using double precision is much simpler (as done in glibc). We could be even faster by implementing a vectorized version of sinf/cosf based on double precision. On other architectures, this might be different though. btw, the overhead of the fallingback logic is 0 (for not too huge inputs) because it is compensated by the removal of the clamp. I applied the changes I discussed previously: https://bitbucket.org/eigen/eigen/commits/8d8e63a86550/ Summary: Bug 1652: implements a much more accurate version of vectorized sin/cos. This new version achieve same speed for SSE/AVX, and is slightly faster with FMA. Guarantees are as follows:  no FMA: 1ULP up to 3pi, 2ULP up to sin(25966) and cos(18838), fallback to std::sin/cos for larger inputs  FMA: 1ULP up to sin(117435.992) and cos(71476.0625), fallback to std::sin/cos for larger inputs There are more details in the code itself. Only one minor issue: `alignas(Packet)` breaks compilation on g++4.7 with c++11 mode enabled (yes c++11 support of g++4.7 is incomplete, even though it supports std=c++11). I'd suggest adding a macro EIGEN_ALIGNAS in ConfigureVectorization which is either `alignas(x)` or falls back to `EIGEN_ALIGN_TO_BOUNDARY(EIGEN_ALIGNOF(x))`. (this would also separate preprocessorlogic from the implementation) Alternatively, we could just disable EIGEN_HAS_CXX11 for g++ < 4.8  GitLab Migration Automatic Message  This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/1652. 