Bug 195 - conservative_resize_1 fails on GCC (up to 4.4) in Debug mode (32bit+SSE) because GCC/i386 emits x87 fldl/fstpl instructions for _mm_load_sd
conservative_resize_1 fails on GCC (up to 4.4) in Debug mode (32bit+SSE) beca...
Status: RESOLVED FIXED
Product: Eigen
Classification: Unclassified
Component: Tests
3.0
x86 - SSE Linux
: --- Unknown
Assigned To: Nobody
:
Depends on:
Blocks: 3.0
  Show dependency treegraph
 
Reported: 2011-02-23 18:40 UTC by Jitse Niesen
Modified: 2011-02-27 23:20 UTC (History)
2 users (show)



Attachments
test program (644 bytes, text/x-c++src)
2011-02-23 18:40 UTC, Jitse Niesen
no flags Details

Description Jitse Niesen 2011-02-23 18:40:31 UTC
Created attachment 108 [details]
test program

I ran the test suite in Debug mode and to my surprise some of the tests fail (see http://eigen.tuxfamily.org/CDash/viewTest.php?onlyfailed&buildid=5090 ; that run was not done with the latest version of Eigen).

In particular, the conservative_resize_1 test failed. I boiled it down to the attached test case. It extends the matrix m using conservativeResizeLike(). For some reason, one of the coefficients of m is not copied over into the new matrix. All other coefficients are correct.

The output of the attached program is 

Before: m(37,40) = -649311
Before: n(37,40) = -649311
Resizing m to be 66 by 58
After: m(37,40) = -125023
After: n(37,40) = -649311

I compiled the program using GCC 4.3.3 with -msse2 and no other compile flags, 32-bits Linux Fedora, current version of Eigen3.
Comment 1 Jitse Niesen 2011-02-23 22:42:12 UTC
I can reproduce this on my laptop (Debian testing, also 32 bits), both using gcc 4.3.5 and 4.4.5. Thus, I'll mark it as blocking 3.0 ... sorry. As before, it only shows when compiling with -msse2 and without optimization. Valgrind does not show an error.
Comment 2 Benoit Jacob 2011-02-24 14:36:40 UTC
Confirmed here. g++ 4.4.5 with -m32 -msse2.
Comment 3 Benoit Jacob 2011-02-24 15:50:24 UTC
The bug is in our fast unaligned load code for integers. If I replace it by the trivial implementation (_mm_loadu_si128) there's no problem.

Added asm comments:

template<> EIGEN_STRONG_INLINE Packet4i ploadu<Packet4i>(const int* from)
{
  asm volatile("#beginning of ploadu");
  EIGEN_DEBUG_UNALIGNED_LOAD
  __m128d res;
  res =  _mm_load_sd((const double*)(from)) ;
  res =  _mm_loadh_pd(res, (const double*)(from+2)) ;
  Packet4i x= _mm_castpd_si128(res);
  asm volatile("#end of ploadu");
  return x;
  //return _mm_loadu_si128(reinterpret_cast<const Packet4i*>(from));
}



#APP
# 266 "eigen/Eigen/src/Core/arch/SSE/PacketMath.h" 1
	#beginning of ploadu
# 0 "" 2
#NO_APP
	movl	8(%ebp), %eax
	movl	%eax, -60(%ebp)
	movl	-60(%ebp), %eax
	fldl	(%eax)
	fstpl	-56(%ebp)
	movsd	-56(%ebp), %xmm0
	movhpd	.LC0, %xmm0
	movapd	%xmm0, -104(%ebp)
	movl	8(%ebp), %eax
	addl	$8, %eax
	movapd	-104(%ebp), %xmm0
	movapd	%xmm0, -40(%ebp)
	movl	%eax, -44(%ebp)
	movapd	-40(%ebp), %xmm0
	movl	-44(%ebp), %eax
	movhpd	(%eax), %xmm0
	movapd	%xmm0, -104(%ebp)
	movapd	-104(%ebp), %xmm0
	movapd	%xmm0, -24(%ebp)
	movdqa	-24(%ebp), %xmm0
	movdqa	%xmm0, -88(%ebp)
#APP
# 272 "eigen/Eigen/src/Core/arch/SSE/PacketMath.h" 1
	#end of ploadu
# 0 "" 2
#NO_APP
Comment 4 Benoit Jacob 2011-02-24 16:00:12 UTC
The reason why this happens only to that coeff at (37,40) could be related to the value of that coeff in the original matrix. Adding debug output gave me this:

packet
writePacket
at outer = 40, inner+i = 36, src = 627320806 (hex 256427e6)
at outer = 40, inner+i = 36, dst = 627320806 (hex 256427e6)
at outer = 40, inner+i = 37, src = -649311 (hex fff617a1)
at outer = 40, inner+i = 37, dst = -125023 (hex fffe17a1)
oops at outer = 40, inner+i = 37

So this value has just one bit wrong. The original value here is also special as it is smaller than others and starts with fff.

Could there be unwanted side effect of this cast, _mm_castpd_si128 ?
Comment 5 Benoit Jacob 2011-02-24 16:10:49 UTC
So, the bug is in the first intrinsic here, _mm_load_sd

indeed:

template<> EIGEN_STRONG_INLINE Packet4i ploadu<Packet4i>(const int* from)
{
  EIGEN_ALIGN16 double d[2];
  const int *di = from;
  asm volatile("#beginning of ploadu");
  EIGEN_DEBUG_UNALIGNED_LOAD
  __m128d res;
  asm volatile("#before _mm_load_sd");
  std::cout << "from = " << di[0] << " " << di[1] << " "  << di[2] << " "  << di[3] << std::endl;
  res =  _mm_load_sd((const double*)(from)) ;
  asm volatile("#before _mm_loadh_pd");
  _mm_store_pd(d, res);
  di = reinterpret_cast<int*>(d);
  std::cout << "yyy res = " << di[0] << " " << di[1] << " "  << di[2] << " "  << di[3] << std::endl;
  res =  _mm_loadh_pd(res, (const double*)(from+2)) ;
  asm volatile("#before _mm_castpd_si128");
  _mm_store_pd(d, res);
  di = reinterpret_cast<int*>(d);
  std::cout << "zzz res = " << di[0] << " " << di[1] << " "  << di[2] << " "  << di[3] << std::endl;
  Packet4i x= _mm_castpd_si128(res);
  asm volatile("#end of ploadu");
  return x;
  //return _mm_loadu_si128(reinterpret_cast<const Packet4i*>(from));
}


I get this output:

packet
from = 627320806 -649311 894580355 935800673
yyy res = 627320806 -125023 0 0
zzz res = 627320806 -125023 894580355 935800673
writePacket
at outer = 40, inner+i = 36, src = 627320806 (hex 256427e6)
at outer = 40, inner+i = 36, dst = 627320806 (hex 256427e6)
at outer = 40, inner+i = 37, src = -649311 (hex fff617a1)
at outer = 40, inner+i = 37, dst = -125023 (hex fffe17a1)
oops at outer = 40, inner+i = 37
Aborted


As we see, by the time we print yyy res, which is right after the _mm_load_sd, the wrong value -125023 has already been introduced.
Comment 6 Benoit Jacob 2011-02-24 16:15:57 UTC
got it :)

The _mm_load_sd intrinsic is messing with NaN values. The bug occurs when the 64bit value here, when interpreted as double, is a NaN.
Comment 7 Benoit Jacob 2011-02-24 16:34:40 UTC
Fixed; we may have the same issue with floats, but i'll file a separate bug about that if that's the case.

The fix consisted in dropping our fast unaligned load function, which is too bad. Maybe it's worth checking if we could set some control word to enforce strict preservation of NaNs.
Comment 8 Benoit Jacob 2011-02-24 16:55:16 UTC
Hm... probably actually the problem was with those x87 instructions

          fldl    (%eax)
          fstpl   -56(%ebp)

which explains why the bug only happens in 32bit mode.

Strangely though the problem still occurs with -mfpmath=sse. The assembly for __mm_load_sd is still:

	movl	8(%ebp), %eax
	movl	%eax, -60(%ebp)
	movl	-60(%ebp), %eax
	fldl	(%eax)
	fstpl	-56(%ebp)
	movsd	-56(%ebp), %xmm0
	movhpd	.LC0, %xmm0
	movapd	%xmm0, -88(%ebp)
Comment 9 Benoit Jacob 2011-02-24 16:56:27 UTC
While in 64bit mode, even with -mfpmath=387, the assembly is:

	movq	-88(%rbp), %rax
	movq	%rax, -56(%rbp)
	movq	-56(%rbp), %rax
	movq	(%rax), %rax
	movq	%rax, -48(%rbp)
	movsd	-48(%rbp), %xmm0
	movhpd	.LC0(%rip), %xmm0
	movapd	%xmm0, -80(%rbp)

and there is no bug.
Comment 10 Gael Guennebaud 2011-02-24 19:29:29 UTC
that's very strange that in 32bits the value is first loaded into a FPU register, then back to memory and finally in a SSE register.... That does not make sense to me. In this context, it is clearly much better to use the default unaligned intrinsic on 32bits system.
Comment 11 Benoit Jacob 2011-02-24 19:44:25 UTC
Do you want us to do

   if(sizeof(void*)==8) {
      use custom unaligned load
   } else {
      use standard intrinsic
   }

I looked at the generated assembly, and even on 64bit, the custom unaligned load generates quite big assembly. Are you really sure it's faster?
Comment 12 Gael Guennebaud 2011-02-24 21:43:53 UTC
actually, for float the following piece of code:

__m128 res;
_mm_loadl_pi(res, (const __m64*)(from));
return _mm_loadh_pi(res, (const __m64*)(from+2));

Is safe and even faster on my Core2 (this might be different for other architecture). The pb is that produce the following warning:

warning: ‘res’ is used uninitialized in this function

and I don't see any obvious way to suppress it without initializing it and so reducing the perf.
Comment 13 Benoit Jacob 2011-02-24 21:49:08 UTC
Since this is a stupid warning, we can shut it up. I recently renamed DisableMSVCWarnings to DisableStupidWarnings, feel free to add GCC stuff there.
Comment 14 Benoit Jacob 2011-02-24 21:50:22 UTC
Ouch... though, in other circumstances this is a quite useful warning. So I don't know.

Note You need to log in before you can comment on or make changes to this bug.