Created attachment 108 [details] — test program

I ran the test suite in Debug mode and to my surprise some of the tests fail (see http://eigen.tuxfamily.org/CDash/viewTest.php?onlyfailed&buildid=5090 ; that run was not done with the latest version of Eigen). In particular, the conservative_resize_1 test failed. I boiled it down to the attached test case. It extends the matrix m using conservativeResizeLike(). For some reason, one of the coefficients of m is not copied over into the new matrix; all other coefficients are correct. The output of the attached program is:

```
Before: m(37,40) = -649311
Before: n(37,40) = -649311
Resizing m to be 66 by 58
After: m(37,40) = -125023
After: n(37,40) = -649311
```

I compiled the program using GCC 4.3.3 with -msse2 and no other compile flags, on 32-bit Fedora Linux, with the current version of Eigen3.
I can reproduce this on my laptop (Debian testing, also 32-bit), both with gcc 4.3.5 and 4.4.5. Thus, I'll mark it as blocking 3.0... sorry. As before, it only shows up when compiling with -msse2 and without optimization. Valgrind does not report an error.
Confirmed here. g++ 4.4.5 with -m32 -msse2.
The bug is in our fast unaligned load code for integers. If I replace it with the trivial implementation (_mm_loadu_si128) there's no problem. Adding asm comments to mark the function:

```cpp
template<> EIGEN_STRONG_INLINE Packet4i ploadu<Packet4i>(const int* from)
{
  asm volatile("#beginning of ploadu");
  EIGEN_DEBUG_UNALIGNED_LOAD
  __m128d res;
  res = _mm_load_sd((const double*)(from));
  res = _mm_loadh_pd(res, (const double*)(from+2));
  Packet4i x = _mm_castpd_si128(res);
  asm volatile("#end of ploadu");
  return x;
  //return _mm_loadu_si128(reinterpret_cast<const Packet4i*>(from));
}
```

The generated assembly:

```asm
#APP
# 266 "eigen/Eigen/src/Core/arch/SSE/PacketMath.h" 1
        #beginning of ploadu
# 0 "" 2
#NO_APP
        movl    8(%ebp), %eax
        movl    %eax, -60(%ebp)
        movl    -60(%ebp), %eax
        fldl    (%eax)
        fstpl   -56(%ebp)
        movsd   -56(%ebp), %xmm0
        movhpd  .LC0, %xmm0
        movapd  %xmm0, -104(%ebp)
        movl    8(%ebp), %eax
        addl    $8, %eax
        movapd  -104(%ebp), %xmm0
        movapd  %xmm0, -40(%ebp)
        movl    %eax, -44(%ebp)
        movapd  -40(%ebp), %xmm0
        movl    -44(%ebp), %eax
        movhpd  (%eax), %xmm0
        movapd  %xmm0, -104(%ebp)
        movapd  -104(%ebp), %xmm0
        movapd  %xmm0, -24(%ebp)
        movdqa  -24(%ebp), %xmm0
        movdqa  %xmm0, -88(%ebp)
#APP
# 272 "eigen/Eigen/src/Core/arch/SSE/PacketMath.h" 1
        #end of ploadu
# 0 "" 2
#NO_APP
```
The reason why this happens only to the coefficient at (37,40) could be related to the value of that coefficient in the original matrix. Adding debug output gave me this:

```
packet writePacket
at outer = 40, inner+i = 36, src = 627320806 (hex 256427e6)
at outer = 40, inner+i = 36, dst = 627320806 (hex 256427e6)
at outer = 40, inner+i = 37, src = -649311 (hex fff617a1)
at outer = 40, inner+i = 37, dst = -125023 (hex fffe17a1)
oops at outer = 40, inner+i = 37
```

So this value has just one bit wrong. The original value here is also special: it is smaller than the others and its hex representation starts with fff. Could there be an unwanted side effect of this cast, _mm_castpd_si128?
So, the bug is in the first intrinsic here, _mm_load_sd, indeed:

```cpp
template<> EIGEN_STRONG_INLINE Packet4i ploadu<Packet4i>(const int* from)
{
  EIGEN_ALIGN16 double d[2];
  const int* di = from;
  asm volatile("#beginning of ploadu");
  EIGEN_DEBUG_UNALIGNED_LOAD
  __m128d res;
  asm volatile("#before _mm_load_sd");
  std::cout << "from = " << di[0] << " " << di[1] << " " << di[2] << " " << di[3] << std::endl;
  res = _mm_load_sd((const double*)(from));
  asm volatile("#before _mm_loadh_pd");
  _mm_store_pd(d, res);
  di = reinterpret_cast<int*>(d);
  std::cout << "yyy res = " << di[0] << " " << di[1] << " " << di[2] << " " << di[3] << std::endl;
  res = _mm_loadh_pd(res, (const double*)(from+2));
  asm volatile("#before _mm_castpd_si128");
  _mm_store_pd(d, res);
  di = reinterpret_cast<int*>(d);
  std::cout << "zzz res = " << di[0] << " " << di[1] << " " << di[2] << " " << di[3] << std::endl;
  Packet4i x = _mm_castpd_si128(res);
  asm volatile("#end of ploadu");
  return x;
  //return _mm_loadu_si128(reinterpret_cast<const Packet4i*>(from));
}
```

I get this output:

```
packet from = 627320806 -649311 894580355 935800673
yyy res = 627320806 -125023 0 0
zzz res = 627320806 -125023 894580355 935800673
writePacket
at outer = 40, inner+i = 36, src = 627320806 (hex 256427e6)
at outer = 40, inner+i = 36, dst = 627320806 (hex 256427e6)
at outer = 40, inner+i = 37, src = -649311 (hex fff617a1)
at outer = 40, inner+i = 37, dst = -125023 (hex fffe17a1)
oops at outer = 40, inner+i = 37
Aborted
```

As we can see, by the time we print yyy res, which is right after the _mm_load_sd, the wrong value -125023 has already been introduced.
Got it :) The _mm_load_sd intrinsic is messing with NaN values. The bug occurs when the 64-bit value here, interpreted as a double, is a NaN.
Fixed. We may have the same issue with floats; I'll file a separate bug about that if that's the case. The fix consisted in dropping our fast unaligned load function, which is too bad. Maybe it's worth checking whether we could set some control word to enforce strict preservation of NaNs.
Hm... the problem was probably actually with those x87 instructions:

```asm
fldl  (%eax)
fstpl -56(%ebp)
```

which explains why the bug only happens in 32-bit mode. Strangely, though, the problem still occurs with -mfpmath=sse. The assembly for _mm_load_sd is still:

```asm
movl   8(%ebp), %eax
movl   %eax, -60(%ebp)
movl   -60(%ebp), %eax
fldl   (%eax)
fstpl  -56(%ebp)
movsd  -56(%ebp), %xmm0
movhpd .LC0, %xmm0
movapd %xmm0, -88(%ebp)
```
While in 64-bit mode, even with -mfpmath=387, the assembly is:

```asm
movq   -88(%rbp), %rax
movq   %rax, -56(%rbp)
movq   -56(%rbp), %rax
movq   (%rax), %rax
movq   %rax, -48(%rbp)
movsd  -48(%rbp), %xmm0
movhpd .LC0(%rip), %xmm0
movapd %xmm0, -80(%rbp)
```

and there is no bug.
It's very strange that in 32-bit mode the value is first loaded into an x87 FPU register, then stored back to memory, and only then moved into an SSE register... That does not make sense to me. In this context, it is clearly much better to use the default unaligned intrinsic on 32-bit systems.
Do you want us to do

```cpp
if (sizeof(void*) == 8) {
  // use custom unaligned load
} else {
  // use standard intrinsic
}
```

I looked at the generated assembly, and even on 64-bit, the custom unaligned load generates quite a lot of assembly. Are you really sure it's faster?
Actually, for float the following piece of code:

```cpp
__m128 res;
res = _mm_loadl_pi(res, (const __m64*)(from));
return _mm_loadh_pi(res, (const __m64*)(from+2));
```

is safe and even faster on my Core2 (this might be different on other architectures). The problem is that it produces the following warning:

```
warning: ‘res’ is used uninitialized in this function
```

and I don't see any obvious way to suppress it without initializing res and thereby reducing performance.
Since this is a stupid warning, we can shut it up. I recently renamed DisableMSVCWarnings to DisableStupidWarnings; feel free to add GCC stuff there.
Ouch... though in other circumstances this is quite a useful warning. So I don't know.
-- GitLab Migration Automatic Message -- This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/195.