This bugzilla service is closed. All entries have been migrated to https://gitlab.com/libeigen/eigen
Bug 1652 - Corner cases in SIMD sin/cos
 Status: RESOLVED FIXED · Product: Eigen · Component: Core - vectorization · Version: unspecified · Hardware/OS: All/All · Importance: Normal, Accuracy Problem · Assignee: Nobody · Target Milestone: 3.4

 Reported: 2018-12-27 15:35 UTC by Gael Guennebaud · Modified: 2019-12-04 18:19 UTC · CC: 6 users: afrunze, chtz, gael.guennebaud, jacob.benoit.1, markos, rmlarsen

Attachments
proof-of-concept (5.49 KB, text/x-c++src)
2018-12-27 22:01 UTC, Christoph Hertzberg
Improved version of psincos (4.74 KB, text/plain)
2019-01-07 15:27 UTC, Gael Guennebaud

 Gael Guennebaud 2018-12-27 15:35:48 UTC ```This is an entry to discuss how to handle corner cases in SIMD sin/cos. Let's call x the input; there are 3 different corner cases:

(p1) x==NaN       -> returns NaN
(p2) x==+/-INF    -> returns NaN
(p3) x is "large" -> returns ??

There is nothing special about x==NaN, as NaN is naturally propagated. The case x==INF is also naturally handled by the default implementation, but preserving this property while handling x=="large" might require special treatment; see below.

Regarding x=="large": in 3.3, sin/cos might return numbers outside the [-1:1] range, which is not satisfactory. Nonetheless, it is important to note that we are talking about values of x such that the distance between two successive floats is larger than 2*pi, so basically all you can expect is random numbers. This means there is no need to be too pedantic regarding accuracy, but we might still want to preserve some properties like:

(p4) sin/cos(huge) in [-1:1]
(p5) sin^2(huge) + cos^2(huge) == 1

Those properties are easy to preserve, but so far I did not find a way to enforce all of them without a significant overhead:

                            Time   (p2)  (p4)  (p5)
  3.3:                      0.39   YES   NO    NO
  ada7f22 (head):           0.41   YES   YES   NO
  MSA-like:                 0.46   YES   YES   YES
  clamp to [-pi/4,pi/4]:    0.43   NO    YES   YES
    " + handling of NaN:    0.44   YES   YES   YES

The last version is implemented by adding the following lines prior to the sin/cos polynomial evaluations:

  x = padd(x,psub(_x,_x)); // convert +INF to NaN (borrowed from the MSA implementation)
  x = pmin(x,pset1( EIGEN_PI/4.f));
  x = pmax(x,pset1(-EIGEN_PI/4.f));

but it implies a +12% overhead compared to no special treatment of large inputs.

The MSA-like version is implemented by adding the following lines right after taking the absolute value of the input:

  x = padd(x,psub(_x,_x));
  Packet small_or_nan_mask = pcmp_lt_or_nan(x, pset1(13176795.f));
  x = pand(x, small_or_nan_mask);

The overhead is even larger. I'm not 100% sure what to do on that matter.
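[Editor's note] The clamp-plus-NaN-handling variant above can be sketched in scalar code to see why it works. This is a hypothetical `guard_input` helper mirroring the packet ops, not the actual Eigen code; it relies on strict IEEE semantics (no -ffast-math) and on std::min/std::max returning their first argument when a comparison against NaN is false.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>

// Scalar analogue of: x = padd(x, psub(x, x)); x = pmin(...); x = pmax(...).
// INF - INF = NaN, so the first line turns +/-INF into NaN while leaving
// finite x unchanged. std::min(x, c) and std::max(x, c) return x when x is
// NaN (the comparison is false), so the NaN survives the clamp. SIMD min/max
// instructions (e.g. SSE minps) select an operand differently on NaN, which
// is why the packet version needs care with argument order.
float guard_input(float x) {
  x = x + (x - x);                         // +/-INF -> NaN
  const float pio4 = 0.7853981633974483f;  // pi/4
  x = std::min(x, pio4);
  x = std::max(x, -pio4);
  return x;
}
```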
Opinions welcome.```

Gael Guennebaud 2018-12-27 15:43:05 UTC ```For completeness, the MSA-like variant can also be simplified as:

  x = padd(x,psub(_x,_x));
  x = pmin(x, pset1(13176795.f));

but it is still as costly as clamping to [-pi/4,pi/4] + handling of NaN.```

Christoph Hertzberg 2018-12-27 22:01:13 UTC ```Created attachment 911 [details] proof-of-concept

I played around a bit with using the

  y = (x + float(1<<23)) - float(1<<23);

rounding trick. That means there are now two floating-point add/subs instead of two conversion casts. Since this already rounds to nearest, it saves the (y+1)&(~1) operation, and infinities will immediately result in NaN when calculating x -= y*pi/2. With some effort, the int operations may even be removed, allowing full AVX1 compatibility (without splitting the packet). See the attachment. There is still some debug code (and I did not benchmark it yet).```

Christoph Hertzberg 2018-12-28 00:43:08 UTC ```I was further testing/debugging. The current implementation results in NaN if the reduced x*x is infinity and the cos-polynomial is evaluated. This happens in

  y1 = psub(y1, pmul(x2, cst_half));

because y1 and x2*0.5f are both positive infinity and are subtracted from each other. Is there a reason this part of the polynomial is not evaluated using Horner's method like the first three terms? (Precision loss in some corner cases?) Of course, this is no problem if x is clamped.
Independent of that, a minor possible improvement: if we have a "highest-bit" pselect function, we can choose the poly_mask by a simple left-shift (instead of pand and pcmp; although the throughput of the latter might actually be better -- and latency is not an issue here).```

Christoph Hertzberg 2018-12-29 14:28:13 UTC ```Here are some benchmarking results for 3.3/master and my proposal on different architectures (ns per calculation for 1024 elements):

                             sse42  avx    avx2   fma    avx2+fma
  3.3 (3011dc2762b2)    sin  3.09   1.85   1.76   1.39   1.31
                        cos  3.15   76.98  77.06  77.06  77.10
  master (591f3f24fec6) sin  3.34   2.18   1.88   1.80   1.50
                        cos  3.43   5.23   1.86   1.91   1.50
  my-patch              sin  3.15   1.87   1.74   1.45   1.32
                        cos  3.16   2.16   1.76   1.92   1.32```

Gael Guennebaud 2018-12-29 22:20:46 UTC ```Nice trick, but this decreases accuracy for inputs larger than 8*pi. Max absolute errors:

  input range    3.3           this patch
  [-4pi,4pi]     5.96046e-08   5.96046e-08
  [-8pi,8pi]     5.96046e-08   4.76837e-07
  [-16pi,16pi]   5.96046e-08   7.7486e-07
  [-32pi,32pi]   5.96046e-08   1.66893e-06
  [-64pi,64pi]   5.96046e-08   7.27177e-06

(I used bench/math_func_prec.cpp)```

Christoph Hertzberg 2018-12-30 00:39:39 UTC ```(In reply to Gael Guennebaud from comment #5)
> Nice trick. but this decreases accuracy for inputs larger than 8*pi.

That trick should be accurate as long as abs(x)*(2/pi) < 2^23. Maybe the clamping to [-pi/4,pi/4] introduces inaccuracies (if pi/4 is rounded down). Can you try deactivating that? (It should only be necessary for very large values.) Did you test with SSE/AVX1/AVX2/FMA?

> (I used bench/math_func_prec.cpp)

I can't find that file in https://bitbucket.org/eigen/eigen/src/591f3f24f/bench/?at=default -- do you have it only locally? And should it maybe be part of the unit-test suite?```

Gael Guennebaud 2018-12-30 00:40:16 UTC `One of the difficulties in fixing this accuracy issue is that adding 2^23 for rounding is not deterministic for inputs of the form X.5: for some X you get X, for others you get X+1.
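[Editor's note] Regarding the Horner question above: the tail of the cos polynomial can indeed be folded into a single Horner chain. A scalar sketch using the standard Cephes cosf coefficients (whether this changes accuracy in corner cases, as the comment wonders, would need checking against the split form):

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// cos(x) ~= 1 - x^2/2 + x^4*P(x^2) on [-pi/4, pi/4], with the Cephes cosf
// coefficients, written as one Horner chain in x2 = x*x. An infinite x2
// then flows through multiply-adds only (giving +INF), instead of hitting
// an INF - INF subtraction as in the split evaluation.
float cos_poly(float x2) {
  float y = 2.443315711809948e-5f;
  y = y * x2 - 1.388731625493765e-3f;
  y = y * x2 + 4.166664568298827e-2f;
  y = y * x2 - 0.5f;
  y = y * x2 + 1.0f;
  return y;
}
```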
In practice this means that after the "Extended precision modular arithmetic" step, we get some entries slightly outside the expected range. Those entries get truncated, hence the accuracy loss. Actually, the sine/cosine polynomials are valid over a larger range than [-pi/4,pi/4], so a simple fix seems to be to extend the clamping range.`

Gael Guennebaud 2018-12-30 00:44:02 UTC `My bad, I never committed/pushed that bench/math_func_prec.cpp file. I agree it should be part of the unit tests, at least for tracking regressions.`

Christoph Hertzberg 2019-01-04 16:32:17 UTC ```Interesting alternative to our Cephes-based implementations: https://sleef.org/ The implementation looks much better optimized for SIMD; e.g., they use just one polynomial (with only one more coefficient) which is valid on [-pi/2:pi/2] (instead of two polynomials valid only on [-pi/4:pi/4]).```

Gael Guennebaud 2019-01-04 22:02:08 UTC `I quickly tested sleef with SSE4. The 1ULP version is twice as slow. The 3.5ULP version is x1.33 faster. Regarding accuracy, the 3.5ULP version is less accurate than our current implementation for inputs in the range [-2000,2000] (roughly); for this range our implementation achieves 0.5ULP. For larger inputs, however, the sleef 3.5ULP version remains more accurate.`

Gael Guennebaud 2019-01-07 15:27:58 UTC ```Created attachment 913 [details] Improved version of psincos

Playing a bit more with sleef and our Cephes-based code, I found that our code was pretty good in terms of absolute error, but very bad in terms of relative precision (ULP). The original Cephes code exhibits the same issue. Sleef uses a single higher-order polynomial approximation over the [-pi/2,pi/2] range, but to get 1ULP accuracy (even within [-pi/2,pi/2]) this polynomial needs to be evaluated with extra precision in the last 3 operations. Tracking the residuals in a second variable is very costly and requires true FMA.
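[Editor's note] The "extended precision modular arithmetic" step referred to above is the Cody-Waite style reduction used by Cephes: pi/4 is split into three constants whose leading parts have few significand bits, so the products n*DP1 and n*DP2 are (nearly) exact. A scalar sketch with the Cephes sinf split constants:

```cpp
#include <cassert>
#include <cmath>

// Computes x - n*(pi/4) with pi/4 split as DP1 + DP2 + DP3. Because DP1 has
// only a few significand bits, n*DP1 is exact for moderate integer n and the
// leading cancellation in (x - n*DP1) loses no information; the DP2 and DP3
// corrections then restore the low-order bits of pi/4.
float reduce(float x, float n) {
  const float DP1 = 0.78515625f;
  const float DP2 = 2.4187564849853515625e-4f;
  const float DP3 = 3.77489497744594108e-8f;
  return ((x - n * DP1) - n * DP2) - n * DP3;
}
```

For example, reduce(100.0f, 127.0f) agrees with the double-precision value of 100 - 127*(pi/4) to well under a float ulp of the result, whereas a single subtraction of n*float(pi/4) does not.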
Therefore, using 2 lower-order polynomials over two different ranges is not that bad, and even faster. To improve our current code, I borrowed the modulo coefficients of this sinf implementation: https://stackoverflow.com/questions/30463616/payne-hanek-algorithm-implementation-in-c/30465751#30465751 and recomputed the sin and cos polynomial coefficients to improve relative accuracy. With true FMA, I managed to get a max error of 1ULP for x<117435 while sticking to a low-order sin polynomial (the post above uses an additional x^9 term). Without true FMA, getting an accurate modulo is trickier, but using one more iteration than the sleef code, we can get 1ULP error for x<2pi (which is a welcome property IMO), and 2ULP for x<51981.

Regarding speed, we're still pretty good:

            SSE2  SSE4  AVX  AVX2+FMA
  sleef35   32    32    20   17
  sleef10   96    96    62   32
  old       44    44    30   21
  new       42    42    25   16

This new version is both more accurate and faster than what we currently have, and with true AVX2+FMA we get the same speed as the 3.5ULP version of sleef but with 1ULP precision. We still need to improve the handling of larger inputs, and for this the sleef approach of branching on the max input value sounds pretty reasonable (low overhead for likely inputs, significant penalty for unlikely ones). I haven't measured the accuracy of cos yet; the valid ranges are likely to be slightly different.```

Gael Guennebaud 2019-01-07 23:13:44 UTC ```I've improved the sine-polynomial coefficients to maintain an error of 1ULP for cos(x) up to x<71476, which is good news. For larger inputs, on x86, the simplest option seems to be to fall back to a scalar path. This is faster than both the 1ULP and 3.5ULP vectorized versions in sleef! This is because on x86 we have fast support for double precision, and implementing sinf(float) using double precision is much simpler (as done in glibc). We could be even faster by implementing a vectorized version of sinf/cosf based on double precision.
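[Editor's note] The two-path structure discussed here (fast polynomial path for moderate inputs, scalar libm fallback for huge ones) can be sketched in scalar form. This is not the Eigen kernel: for brevity the hypothetical fast path covers only [-pi/4, pi/4] with no range reduction, using the Cephes sinf polynomial; in the real code the threshold is far larger (e.g. the 117435.992 bound quoted above for the FMA path).

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Fast path: odd minimax polynomial sin(x) ~= x + x^3*P(x^2) on [-pi/4, pi/4]
// (Cephes sinf coefficients). Anything else, including NaN and INF (for which
// the comparison below is false), takes the scalar std::sin fallback.
float sinf_sketch(float x) {
  if (std::fabs(x) <= 0.7853981633974483f) {
    float x2 = x * x;
    float p = -1.9515295891e-4f;
    p = p * x2 + 8.3321608736e-3f;
    p = p * x2 - 1.6666654611e-1f;
    return x + x * x2 * p;
  }
  return std::sin(x);  // scalar fallback for large, INF, or NaN inputs
}
```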
On other architectures, this might be different though.```

Gael Guennebaud 2019-01-07 23:15:30 UTC `btw, the overhead of the fallback logic is 0 (for not-too-huge inputs) because it is compensated by the removal of the clamp.`

Gael Guennebaud 2019-01-09 17:06:07 UTC ```I applied the changes I discussed previously: https://bitbucket.org/eigen/eigen/commits/8d8e63a86550/

Summary: Bug 1652: implements a much more accurate version of vectorized sin/cos. This new version achieves the same speed for SSE/AVX, and is slightly faster with FMA. Guarantees are as follows:
- no FMA: 1ULP up to 3pi, 2ULP up to sin(25966) and cos(18838), fallback to std::sin/cos for larger inputs
- FMA: 1ULP up to sin(117435.992) and cos(71476.0625), fallback to std::sin/cos for larger inputs
There are more details in the code itself.```

Christoph Hertzberg 2019-01-10 13:40:00 UTC ```Only one minor issue: `alignas(Packet)` breaks compilation on g++-4.7 with C++11 mode enabled (yes, C++11 support in g++-4.7 is incomplete, even though it supports -std=c++11). I'd suggest adding a macro EIGEN_ALIGNAS in ConfigureVectorization which is either `alignas(x)` or falls back to `EIGEN_ALIGN_TO_BOUNDARY(EIGEN_ALIGNOF(x))`. (This would also separate the preprocessor logic from the implementation.) Alternatively, we could just disable EIGEN_HAS_CXX11 for g++ < 4.8.```

Nobody 2019-12-04 18:19:54 UTC ```-- GitLab Migration Automatic Message --

This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/1652.```
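[Editor's note] The EIGEN_ALIGNAS suggestion in the last technical comment could look roughly like the sketch below. The macro name MY_EIGEN_ALIGNAS and the version test are illustrative, and the GCC attribute stands in for what EIGEN_ALIGN_TO_BOUNDARY expands to on that compiler; the sketch also takes a byte count rather than a type, unlike `alignas(Packet)`.

```cpp
#include <cassert>

// Use alignas where C++11 support is complete; fall back to the GCC/Clang
// attribute on old g++ (4.7 accepts -std=c++11 but chokes on alignas here).
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ == 4 && __GNUC_MINOR__ < 8
  #define MY_EIGEN_ALIGNAS(n) __attribute__((aligned(n)))
#else
  #define MY_EIGEN_ALIGNAS(n) alignas(n)
#endif

// Example use: a buffer aligned for 256-bit AVX packets.
struct MY_EIGEN_ALIGNAS(32) PacketBuffer { float data[8]; };
static_assert(alignof(PacketBuffer) == 32, "PacketBuffer must be 32-byte aligned");
```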
