Bug 195 - conservative_resize_1 fails on GCC (up to 4.4) in Debug mode (32bit+SSE) because GCC/i386 emits x87 fldl/fstpl instructions for _mm_load_sd
: conservative_resize_1 fails on GCC (up to 4.4) in Debug mode (32bit+SSE) beca...
Status: RESOLVED FIXED
Product: Eigen
Classification: Unclassified
Component: Tests
: 3.0
: x86 - SSE Linux
: --- normal
Assigned To: Nobody
:
:
:
:
: 3.0
  Show dependency treegraph
 
Reported: 2011-02-23 18:40 UTC by Jitse Niesen
Modified: 2011-02-27 23:20 UTC (History)
2 users (show)



Attachments
test program (644 bytes, text/x-c++src)
2011-02-23 18:40 UTC, Jitse Niesen
no flags Details

Description Jitse Niesen 2011-02-23 18:40:31 UTC
Created attachment 108 [details]
test program

I ran the test suite in Debug mode and to my surprise some of the tests fail
(see http://eigen.tuxfamily.org/CDash/viewTest.php?onlyfailed&buildid=5090 ;
that run was not done with the latest version of Eigen).

In particular, the conservative_resize_1 test failed. I boiled it down to the
attached test case. It extends the matrix m using conservativeResizeLike(). For
some reason, one of the coefficients of m is not copied over into the new
matrix. All other coefficients are correct.

The output of the attached program is 

Before: m(37,40) = -649311
Before: n(37,40) = -649311
Resizing m to be 66 by 58
After: m(37,40) = -125023
After: n(37,40) = -649311

I compiled the program using GCC 4.3.3 with -msse2 and no other compile flags,
32-bits Linux Fedora, current version of Eigen3.
Comment 1 Jitse Niesen 2011-02-23 22:42:12 UTC
I can reproduce this on my laptop (Debian testing, also 32 bits), both using
gcc 4.3.5 and 4.4.5. Thus, I'll mark it as blocking 3.0 ... sorry. As before,
it only shows when compiling with -msse2 and without optimization. Valgrind
does not show an error.
Comment 2 Benoit Jacob 2011-02-24 14:36:40 UTC
Confirmed here. g++ 4.4.5 with -m32 -msse2.
Comment 3 Benoit Jacob 2011-02-24 15:50:24 UTC
The bug is in our fast unaligned load code for integers. If I replace it by the
trivial implementation (_mm_loadu_si128) there's no problem.

Added asm comments:

template<> EIGEN_STRONG_INLINE Packet4i ploadu<Packet4i>(const int* from)
{
  asm volatile("#beginning of ploadu");
  EIGEN_DEBUG_UNALIGNED_LOAD
  __m128d res;
  res =  _mm_load_sd((const double*)(from)) ;
  res =  _mm_loadh_pd(res, (const double*)(from+2)) ;
  Packet4i x= _mm_castpd_si128(res);
  asm volatile("#end of ploadu");
  return x;
  //return _mm_loadu_si128(reinterpret_cast<const Packet4i*>(from));
}



#APP
# 266 "eigen/Eigen/src/Core/arch/SSE/PacketMath.h" 1
    #beginning of ploadu
# 0 "" 2
#NO_APP
    movl    8(%ebp), %eax
    movl    %eax, -60(%ebp)
    movl    -60(%ebp), %eax
    fldl    (%eax)
    fstpl    -56(%ebp)
    movsd    -56(%ebp), %xmm0
    movhpd    .LC0, %xmm0
    movapd    %xmm0, -104(%ebp)
    movl    8(%ebp), %eax
    addl    $8, %eax
    movapd    -104(%ebp), %xmm0
    movapd    %xmm0, -40(%ebp)
    movl    %eax, -44(%ebp)
    movapd    -40(%ebp), %xmm0
    movl    -44(%ebp), %eax
    movhpd    (%eax), %xmm0
    movapd    %xmm0, -104(%ebp)
    movapd    -104(%ebp), %xmm0
    movapd    %xmm0, -24(%ebp)
    movdqa    -24(%ebp), %xmm0
    movdqa    %xmm0, -88(%ebp)
#APP
# 272 "eigen/Eigen/src/Core/arch/SSE/PacketMath.h" 1
    #end of ploadu
# 0 "" 2
#NO_APP
Comment 4 Benoit Jacob 2011-02-24 16:00:12 UTC
The reason why this happens only to that coeff at (37,40) could be related to
the value of that coeff in the original matrix. Adding debug output gave me
this:

packet
writePacket
at outer = 40, inner+i = 36, src = 627320806 (hex 256427e6)
at outer = 40, inner+i = 36, dst = 627320806 (hex 256427e6)
at outer = 40, inner+i = 37, src = -649311 (hex fff617a1)
at outer = 40, inner+i = 37, dst = -125023 (hex fffe17a1)
oops at outer = 40, inner+i = 37

So this value has just one bit wrong. The original value here is also special
as it is smaller than others and starts with fff.

Could there be unwanted side effect of this cast, _mm_castpd_si128 ?
Comment 5 Benoit Jacob 2011-02-24 16:10:49 UTC
So, the bug is in the first intrinsic here, _mm_load_sd

indeed:

template<> EIGEN_STRONG_INLINE Packet4i ploadu<Packet4i>(const int* from)
{
  EIGEN_ALIGN16 double d[2];
  const int *di = from;
  asm volatile("#beginning of ploadu");
  EIGEN_DEBUG_UNALIGNED_LOAD
  __m128d res;
  asm volatile("#before _mm_load_sd");
  std::cout << "from = " << di[0] << " " << di[1] << " "  << di[2] << " "  <<
di[3] << std::endl;
  res =  _mm_load_sd((const double*)(from)) ;
  asm volatile("#before _mm_loadh_pd");
  _mm_store_pd(d, res);
  di = reinterpret_cast<int*>(d);
  std::cout << "yyy res = " << di[0] << " " << di[1] << " "  << di[2] << " " 
<< di[3] << std::endl;
  res =  _mm_loadh_pd(res, (const double*)(from+2)) ;
  asm volatile("#before _mm_castpd_si128");
  _mm_store_pd(d, res);
  di = reinterpret_cast<int*>(d);
  std::cout << "zzz res = " << di[0] << " " << di[1] << " "  << di[2] << " " 
<< di[3] << std::endl;
  Packet4i x= _mm_castpd_si128(res);
  asm volatile("#end of ploadu");
  return x;
  //return _mm_loadu_si128(reinterpret_cast<const Packet4i*>(from));
}


I get this output:

packet
from = 627320806 -649311 894580355 935800673
yyy res = 627320806 -125023 0 0
zzz res = 627320806 -125023 894580355 935800673
writePacket
at outer = 40, inner+i = 36, src = 627320806 (hex 256427e6)
at outer = 40, inner+i = 36, dst = 627320806 (hex 256427e6)
at outer = 40, inner+i = 37, src = -649311 (hex fff617a1)
at outer = 40, inner+i = 37, dst = -125023 (hex fffe17a1)
oops at outer = 40, inner+i = 37
Aborted


As we see, by the time we print yyy res, which is right after the _mm_load_sd,
the wrong value -125023 has already been introduced.
Comment 6 Benoit Jacob 2011-02-24 16:15:57 UTC
got it :)

The _mm_load_sd intrinsic is messing with NaN values. The bug occurs when the
64bit value here, when interpreted as double, is a NaN.
Comment 7 Benoit Jacob 2011-02-24 16:34:40 UTC
Fixed; we may have the same issue with floats, but i'll file a separate bug
about that if that's the case.

The fix consisted in dropping our fast unaligned load function, which is too
bad. Maybe it's worth checking if we could set some control word to enforce
strict preservation of NaNs.
Comment 8 Benoit Jacob 2011-02-24 16:55:16 UTC
Hm... probably actually the problem was with those x87 instructions

          fldl    (%eax)
          fstpl   -56(%ebp)

which explains why the bug only happens in 32bit mode.

Strangely though the problem still occurs with -mfpmath=sse. The assembly for
__mm_load_sd is still:

    movl    8(%ebp), %eax
    movl    %eax, -60(%ebp)
    movl    -60(%ebp), %eax
    fldl    (%eax)
    fstpl    -56(%ebp)
    movsd    -56(%ebp), %xmm0
    movhpd    .LC0, %xmm0
    movapd    %xmm0, -88(%ebp)
Comment 9 Benoit Jacob 2011-02-24 16:56:27 UTC
While in 64bit mode, even with -mfpmath=387, the assembly is:

    movq    -88(%rbp), %rax
    movq    %rax, -56(%rbp)
    movq    -56(%rbp), %rax
    movq    (%rax), %rax
    movq    %rax, -48(%rbp)
    movsd    -48(%rbp), %xmm0
    movhpd    .LC0(%rip), %xmm0
    movapd    %xmm0, -80(%rbp)

and there is no bug.
Comment 10 Gael Guennebaud 2011-02-24 19:29:29 UTC
that's very strange that in 32bits the value is first loaded into a FPU
register, then back to memory and finally in a SSE register.... That does not
make sense to me. In this context, it is clearly much better to use the default
unaligned intrinsic on 32bits system.
Comment 11 Benoit Jacob 2011-02-24 19:44:25 UTC
Do you want us to do

   if(sizeof(void*)==8) {
      use custom unaligned load
   } else {
      use standard intrinsic
   }

I looked at the generated assembly, and even on 64bit, the custom unaligned
load generates quite big assembly. Are you really sure it's faster?
Comment 12 Gael Guennebaud 2011-02-24 21:43:53 UTC
actually, for float the following piece of code:

__m128 res;
_mm_loadl_pi(res, (const __m64*)(from));
return _mm_loadh_pi(res, (const __m64*)(from+2));

Is safe and even faster on my Core2 (this might be different for other
architecture). The pb is that produce the following warning:

warning: ‘res’ is used uninitialized in this function

and I don't see any obvious way to suppress it without initializing it and so
reducing the perf.
Comment 13 Benoit Jacob 2011-02-24 21:49:08 UTC
Since this is a stupid warning, we can shut it up. I recently renamed
DisableMSVCWarnings to DisableStupidWarnings, feel free to add GCC stuff there.
Comment 14 Benoit Jacob 2011-02-24 21:50:22 UTC
Ouch... though, in other circumstances this is a quite useful warning. So I
don't know.

Note You need to log in before you can comment on or make changes to this bug.