Bug 89 - Crashes and wrong results due to GCC <=4.3 compiler bug with asserts inside of Eigen's code
: Crashes and wrong results due to GCC <=4.3 compiler bug with asserts inside o...
Status: RESOLVED FIXED
Product: Eigen
Classification: Unclassified
Component: Householder
: 3.0
: All All
: --- normal
Assigned To: Nobody
:
:
:
:
: 3.0
  Show dependency treegraph
 
Reported: 2010-10-21 10:23 UTC by Hauke Heibel
Modified: 2011-02-16 16:05 UTC (History)
3 users (show)



Attachments
Stack trace for householder_1 (1.60 KB, text/plain)
2010-10-21 13:47 UTC, Hauke Heibel
no flags Details
householder_1 trace (relwithdebinfo) (1.90 KB, text/plain)
2010-10-21 14:07 UTC, Hauke Heibel
no flags Details
householder_1 valgrind output (relwithdebinfo) (57.53 KB, text/plain)
2010-10-21 16:13 UTC, Jitse Niesen
no flags Details
Patch for fuzzy compare (635 bytes, patch)
2011-02-12 17:49 UTC, Jitse Niesen
no flags Details | Diff | Splinter Review
valgrind log (3.70 KB, text/plain)
2011-02-14 00:26 UTC, Benoit Jacob
no flags Details
test case (226 bytes, text/x-c++src)
2011-02-14 01:38 UTC, Benoit Jacob
no flags Details
no boolean redux unrolling with gcc 4.3 : makes array tests pass (1.68 KB, patch)
2011-02-14 03:12 UTC, Benoit Jacob
no flags Details | Diff | Splinter Review
no internal debugging with gcc 4.3: makes basicstuff_1 and linearstructure_1 pass (1.63 KB, patch)
2011-02-14 03:13 UTC, Benoit Jacob
no flags Details | Diff | Splinter Review
work around assert bug, makes householder tests pass (3.38 KB, patch)
2011-02-14 16:27 UTC, Benoit Jacob
no flags Details | Diff | Splinter Review
makes 100% tests pass: use custom assert, and introduce copy_bool non-inline function (9.22 KB, patch)
2011-02-16 14:03 UTC, Benoit Jacob
no flags Details | Diff | Splinter Review

Description Hauke Heibel 2010-10-21 10:23:00 UTC
I built with g++ 4.3.4 on cygwin.
Comment 1 Hauke Heibel 2010-10-21 10:31:43 UTC
I cannot reproduce the issue on VC10 (64bit, debug/release nor 32bit
debug/release).

The project was generated with the default cmake configuration, i.e. I just
called

cmake ../eigen
Comment 2 Jitse Niesen 2010-10-21 11:37:54 UTC
I checked and I have the same problem with g++ 4.3.3 (32 bits Linux).
householder_{1,2} segfaults when SSE is turned off:
http://eigen.tuxfamily.org/CDash/viewTest.php?onlyfailed&buildid=3741
and gives errors (results are way off) when SSE is turned on:
http://eigen.tuxfamily.org/CDash/viewTest.php?onlyfailed&buildid=3740

I haven't investigated it further yet.
Comment 3 Benoit Jacob 2010-10-21 13:37:04 UTC
Can't reproduce on linux x86-64, gcc 4.4
Comment 4 Benoit Jacob 2010-10-21 13:40:44 UTC
Even with -DEIGEN_TEST_X87=ON, i.e. with -mfpmath=387, I can't reproduce. So
this is:
  - either a 32-bit-specific bug in our code
  - or a gcc 4.3 bug
Comment 5 Benoit Jacob 2010-10-21 13:42:47 UTC
Please attach a stack trace for when it's segfaulting.
Comment 6 Hauke Heibel 2010-10-21 13:47:17 UTC
Created attachment 17 [details]
Stack trace for householder_1
Comment 7 Hauke Heibel 2010-10-21 14:07:44 UTC
Created attachment 18 [details]
householder_1 trace (relwithdebinfo)
Comment 8 Benoit Jacob 2010-10-21 14:25:17 UTC
I patched eigen so all arrays are heap-allocated, and ran householder_1 in
valgrind, no errors found.
Comment 9 Jitse Niesen 2010-10-21 16:13:20 UTC
Created attachment 19 [details]
householder_1 valgrind output (relwithdebinfo)

I had a spare ten minutes to have a look at it. I get the same stack trace as
Hauke. Debug build does not show segfault, but RelWithDebInfo does. I ran it
through valgrind and attach the report. The first thing valgrind says is a 'Use
of uninitialised value of size 4' at the line where the program segfaults.
Happy bug hunting!
Comment 10 Jitse Niesen 2010-11-26 15:38:52 UTC
The bug seems to be specific to gcc 4.3 (versions 4.1 and 4.4 work fine).

Bisection points to revision 1e25be802b3b as the problem (see
http://bitbucket.org/eigen/eigen/changeset/1e25be802b3b ). This is a revision
by Gael on 20 July changing the VERIFY_RAISES_ASSERTS macro. I'm not sure this
revision is really the problem as it does not seem related. However, if I
define EIGEN_DEBUG_ASSERTS then the problem disappears.
Comment 11 Jitse Niesen 2010-12-08 10:42:57 UTC
The error (a 'use of uninitialized value' according to valgrind) appears in
line 62 in the function householder(). However, if I comment out lines 83 until
the end of the function, the error disappears. This does not make any sense at
all!

So, we have a weird error that only appears with gcc 4.3 and only for small
matrices. That reminds me of the error discussed in the thread
http://listengine.tuxfamily.org/lists.tuxfamily.org/eigen/2010/06/msg00489.html
half a year ago. Indeed, if I try compiling with the same flags as normal for
RelWithDebInfo, but with -fno-strict-aliasing instead of -fstrict-aliasing,
then the error disappears again.

We never really found out what happens with the bug half a year ago (as far as
I remember), Gael suspects a GCC bug, and it was way over my head, so I don't
know how to proceed.

If this were the only issue left, one possibility would be to release 3.0-beta3
with a health warning ("exposes a bug in some situations when compiled with gcc
4.3 only").
Comment 12 Benoit Jacob 2010-12-09 03:45:08 UTC
Thanks a lot for looking into this, I agree that we seem to know enough about
this bug to drop 3.0-beta3 blocker status.
Comment 13 Benoit Jacob 2010-12-26 19:53:15 UTC
Only a gcc 4.3 bug --> not a blocker.
Comment 14 Benoit Jacob 2010-12-26 20:00:06 UTC
(In reply to comment #11)
> Indeed, if I try compiling with the same flags as normal for
> RelWithDebInfo, but with -fno-strict-aliasing instead of -fstrict-aliasing,
> then the error disappears again.

Ah ok, so could it be that we're infringing on strict aliasing rules, and that
only gives a crash with a particular optimization that gcc 4.3 does.

That would be our bug, but still not a blocker I guess.
Comment 15 Gael Guennebaud 2011-02-03 19:32:41 UTC
I cannot reproduce with householder_1, but now the same issue shows up with
adjoint_1 lines 71 and 72. (1x1 matrix)...
Comment 16 Benoit Jacob 2011-02-07 17:07:32 UTC
If it's only reproducible with 1x1 matrices on gcc 4.3, our users are not very
likely to ever hit this bug. Maybe it should not block?
Comment 17 Gael Guennebaud 2011-02-07 18:48:23 UTC
yes and I have no clue on how to workaround it...
Comment 18 Jitse Niesen 2011-02-12 17:49:12 UTC
Created attachment 96 [details] [review]
Patch for fuzzy compare

Since 8 February, the following tests are failing on my computer with gcc 4.3:
basicstuff_1, linearstructure_1, array_for_matrix_1, cholesky_1, jacobisvd_1,
jacobisvd_3, eigen2support_1. 

I looked a bit into the first one, basicstuff_1. It seems very similar at first
sight: the failure is resolved when building in debug mode or with
-fno-strict-aliasing. The test failure is caused by a segfault in the last line
of basicStuff():

     VERIFY_IS_APPROX(sm2,-sm1.transpose());

Here, sm1 and sm2 are of type Matrix<float,1,1>. The segfault disappears if we
replace the line by the equivalent

     VERIFY_IS_APPROX(-sm2,sm1.transpose());

When rebuilding with RelWithDebInfo, the stack trace is as follows:

#0  Eigen::DenseBase<Eigen::Matrix<float, 1, 1, 1, 1, 1>
>::isApprox<Eigen::CwiseUnaryOp<Eigen::internal::scalar_multiple_op<float>,
Eigen::Matrix<float, 1, 1, 1, 1, 1> const> > (
    this=0xbfd38ee8, other=@0xbfd38e88, prec=0.00100000005)
    at /home/amsta/jitse/scratch/eigen-official/Eigen/src/Core/Functors.h:188
#1  0x0804f353 in basicStuff<Eigen::Matrix<float, 1, 1, 1, 1, 1> >
(m=@0xbfd38f68)
    at /home/amsta/jitse/scratch/eigen-official/test/main.h:350
#2  0x0804c1ea in test_basicstuff ()
    at /home/amsta/jitse/scratch/eigen-official/test/basicstuff.cpp:216
#3  0x0804c939 in main (argc=1, argv=0xbfd39064)
    at /home/amsta/jitse/scratch/eigen-official/test/main.h:533

Bisection says that the first bad revision is 5e6b790649c4 (fix fuzzy compares
for integer types, using a selector). Interestingly, like last time this is a
change to the fuzzy compares, though I cannot see why the change triggers a
bug. 

Nevertheless, I tried to rewrite the fuzzy compare; see patch. The resulting
code evaluates the matrices being compared twice instead of using the ::Nested
type (so it's less efficient). However, the patch does get rid of the
basicstuff_1 failure, and also of the failing tests linearstructure_1,
array_for_matrix_1, jacobisvd_1, jacobisvd_3. On the other hand, the tests
cholesky_1, eigen2support_1, householder_1, householder_2 still fail after
applying the patch, and additionally the test array_1 now fails while it passed
beforehand. So the patch is not a solution, but it might hint at where to look
at.
Comment 19 Benoit Jacob 2011-02-13 01:33:47 UTC
Thanks a lot Jitse for investigating this. I changed my mind, let's make this a
3.0 blocker. I am especially concerned that _we_ might be doing something evil
wrt strict-aliasing rules.
Comment 20 Benoit Jacob 2011-02-13 17:47:58 UTC
I can reproduce (gcc 4.3, linux 64bit). removing the x86-32bit and
other-unix-like flags.
Comment 21 Benoit Jacob 2011-02-14 00:26:15 UTC
Created attachment 98 [details]
valgrind log

Obtained with:
valgrind --track-origins=yes ./test/basicstuff_1
Comment 22 Benoit Jacob 2011-02-14 01:38:47 UTC
Created attachment 100 [details]
test case

Here's a small test case. It runs fine by default, but segfaults if you compile
with -DEIGEN_INTERNAL_DEBUGGING.

$ g++-4.3 a.cpp -I eigen -o a -O2 -DEIGEN_INTERNAL_DEBUGGING && ./a
Segmentation fault

More specifically, the segfault disappears if I define eigen_internal_assert(x)
as empty.

So it is really eigen_internal_assert that is triggering the crash, and I now
really believe that it is a plain GCC 4.3 bug, because I can't see how this
legitimately be blamed on strict aliasing.
Comment 23 Benoit Jacob 2011-02-14 01:57:52 UTC
Aaaargh!

removing eigen_internal_assert only fixes a few of those unit tests. other like
array_2 keep crashing.

Back to the drawing board...
Comment 24 Benoit Jacob 2011-02-14 03:12:08 UTC
Created attachment 101 [details] [review]
no boolean redux unrolling with gcc 4.3 : makes array tests pass

This makes gcc4.3 pass array_2 and array_3
Comment 25 Benoit Jacob 2011-02-14 03:13:21 UTC
Created attachment 102 [details] [review]
no internal debugging with gcc 4.3: makes basicstuff_1 and linearstructure_1
pass
Comment 26 Jitse Niesen 2011-02-14 15:06:59 UTC
I had a look at the other, older compiler that's installed at uni and I'm
getting similar issues there. The compiler version is

gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-27)

and five tests fail with segfaults; see
http://eigen.tuxfamily.org/CDash/viewTest.php?onlyfailed&buildid=4988 . I
looked into one of them, product_small_4. Valgrind says the following (the
segfault occurs soon afterward in the same line of code)

   Use of uninitialised value of size 4
      at 0x805AD1D: bool
Eigen::DenseBase<Eigen::Transpose<Eigen::Matrix<double, 4, 1, 0, 4, 1> >
>::isApprox<Eigen::CoeffBasedProduct<Eigen::Transpose<Eigen::Matrix<double, 4,
1, 0, 4, 1> >, Eigen::Matrix<double, 4, 4, 0, 4, 4> const&, 6>
>(Eigen::DenseBase<Eigen::CoeffBasedProduct<Eigen::Transpose<Eigen::Matrix<double,
4, 1, 0, 4, 1> >, Eigen::Matrix<double, 4, 4, 0, 4, 4> const&, 6> > const&,
double) const (MathFunctions.h:299)
      by 0x80646FC: void product<Eigen::Matrix<double, 4, 4, 0, 4, 4>
>(Eigen::Matrix<double, 4, 4, 0, 4, 4> const&) (main.h:350)
      by 0x8053777: test_product_small() (product_small.cpp:34)
      by 0x8053B91: main (main.h:533)

The test fails only if SSE2 is enabled and the build type is either Release or
RelWithDebInfo. The failure is resolved when compiling with
-fno-strict-aliasing. Disabling boolean redux unrolling per comment 24 does not
make a difference, but disabling internal debugging per comment 25 does make
the test pass. In both cases, I changed the patch so that it's also activated
by gcc 4.1.
Comment 27 Benoit Jacob 2011-02-14 16:09:41 UTC
(In reply to comment #10)
> The bug seems to be specific to gcc 4.3 (versions 4.1 and 4.4 work fine).
> 
> Bisection points to revision 1e25be802b3b as the problem (see
> http://bitbucket.org/eigen/eigen/changeset/1e25be802b3b ). This is a revision
> by Gael on 20 July changing the VERIFY_RAISES_ASSERTS macro. I'm not sure this
> revision is really the problem as it does not seem related. However, if I
> define EIGEN_DEBUG_ASSERTS then the problem disappears.

Oh, I had missed that. I confirm that undoing this change makes at least the
householder and basicstuff tests pass, but not the array tests (which already
have a work-around above).
Comment 28 Benoit Jacob 2011-02-14 16:11:36 UTC
ignore me, i was still at bisected revision...
Comment 29 Benoit Jacob 2011-02-14 16:27:18 UTC
Created attachment 103 [details] [review]
work around assert bug, makes householder tests pass
Comment 30 Benoit Jacob 2011-02-14 16:44:14 UTC
fwiw, here's the implementation of assert() in all gcc4 versions:

#if defined __cplusplus && __GNUC_PREREQ (2,95)
# define __ASSERT_VOID_CAST static_cast<void>
#else
# define __ASSERT_VOID_CAST (void)
#endif

# define assert(expr)                                                   \
  ((expr)                                                               \
   ? __ASSERT_VOID_CAST (0)                                             \
   : __assert_fail (__STRING(expr), __FILE__, __LINE__, __ASSERT_FUNCTION))


It's really not doing anything nasty. So we can't do anything more than working
around it in our own tests.
Comment 31 Benoit Jacob 2011-02-14 16:56:06 UTC
With this patch, I am down to only 3 failures:

        239 - cholesky_1 (Failed)
        345 - jacobisvd_3 (Failed)
        367 - geo_quaternion_6 (Failed)

the first two are segfaults, the latter is a wrong result.
Comment 32 Jitse Niesen 2011-02-16 11:25:42 UTC
I ran the tests in various configurations in the hope I'd see a pattern.
Nothing emerged, but perhaps it's of use to somebody else.

In the following table, the top half is the configuration. All use GCC 4.3.3
and Release builds, the first row indicates whether SSE2 is enabled, the second
row whether I used -fstrict-aliasing or -fno-strict-aliasing, the other rows
which of the patches attached to this bag was been applied. In the bottom half,
failing tests are indicated with a '-'.

The main conclusion for me is that we still haven't nailed this bug.


SSE2                            y   n   n   n   y   y   y   y   y
-fstrict-aliasing               y   y   y   y   y   y   n   y   y
work-around-assert-bug          n   n   y   y   y   y   n   n   n
no-booleanredux-unrolling       n   n   n   y   y   n   n   n   n
no-internal-debug               n   n   n   n   n   n   n   y   n
no-nested-in-fuzzy-compare    n   n   n   n   n   n   n   n   y

array_1                                 -   -   -   -       -
array_2                                         -
array_for_matrix_1              -   -   -   -   -   -       -
array_for_matrix_2                      -   -
basicstuff_1                    -   -
cblat1                          -   -
cholesky_1                      -   -   -   -   -   -
cwiseop_1                       -
eigen2support_1                 -   -                           -
eigensolver_selfadjoint_1               -   -
eigensolver_selfadjoint_6                       -   -       -
geo_homogeneous_2                                           -
geo_transformations_3                                       -
geo_quaternion_5                -                           -   -
geo_quaternion_6                -                           -   -
householder_1                   -   -                       -   -
householder_2                   -   -                       -   -
jacobisvd_1                         -   -   -
jacobisvd_3                     -   -   -   -
linearstructure_1               -   -                           -
linearstructure_2                               -   -
linearstructure_3                                           -
lu_5                                            -   -
mixingtypes_3                                       -
product_large_4                                 -   -
product_extra_3                 -               -   -   -   -   -
product_symm_7                  -               -   -   -   -   -
product_trmv_4                                  -   -       -
product_trsolve_3               -               -   -   -   -   -
schur_complex_2                                     -
stable_norm_1                   -       -   -   -   -   -   -   -
upperbidiagonalization_3                        -   -       -

failures                       16  10   8   8  14  15   4  16  10


Some notes:
* the third and fourth columns are identical, so I might have done something
  wrong there.
* when using -fno-strict-aliasing, the four remaining failures are not 
  segfaults but wrong results.
* obviously, tests can fail for multiple reasons. For instance, the
  array_for_matrix_1 test seems to have at least two failure points, one of
  which is fixed by turning of internal debugging.
Comment 33 Benoit Jacob 2011-02-16 14:03:59 UTC
Created attachment 104 [details] [review]
makes 100% tests pass: use custom assert, and introduce copy_bool non-inline
function

This patch makes 100% tests pass. It does 2 things on GCC <= 4.3:

 * use custom assert() implementation
 * introduce this function:

     namespace {
       EIGEN_DONT_INLINE bool copy_bool(bool b) { return b; }
     }

to evaluate the bool condition in assert, and in the debug helpers used in
test/main.h.
Comment 34 Benoit Jacob 2011-02-16 14:29:49 UTC
OK, now I have checked with EIGEN_NO_ASSERTION_CHECKING so that our test suite
doesn't fiddle with eigen_assert() anymore, so we are in the real-world
conditions of our users, and I have changed the VERIFY macro to just

   #define VERIFY(a) assert(a)

in order to simulate a user using assert() on Eigen expressions. I didn't get
any crash.

 --> So it really seems that the issue was purely with the asserts that we have
inside of Eigen's code, and that the patch fixes it i.e. it's not just hiding
it.
Comment 35 Benoit Jacob 2011-02-16 14:52:26 UTC
Pushed.
Comment 36 Benoit Jacob 2011-02-16 15:22:08 UTC
Jitse still gets failures on gcc 4.3.3 32bit+SSE2, while for me it's fixed on
gcc 4.3.5 64bit :-(
Comment 37 Jitse Niesen 2011-02-16 15:46:15 UTC
These tests are failing for me:
product_extra_3, product_symm_7, product_trmv_4, product_trsolve_3,
stable_norm_1, (nullary_4)

nullary_4 failed only once, but passed all the other times I tried, even with
r1000000.

stable_norm_1 is probably not critical - the test fails because norm() gives
the correct result

product_extra_3 failure has a different feel to it so probably another bug.
Comment 38 Benoit Jacob 2011-02-16 15:55:22 UTC
I can't reproduce, even with -m32 -mfpmath=387 and no vectorization.

So either it's specific to the 32bit version of gcc itself, or it's 4.3.3 vs
4.3.5.
Comment 39 Jitse Niesen 2011-02-16 15:58:46 UTC
The common thing about the four product_xxx failures is that they test
MatrixXcf. In one of the cases (product_extra), the problem is that a product
evaluates to a matrix of NaN's where it shouldn't.
Comment 40 Benoit Jacob 2011-02-16 15:59:00 UTC
Sorry, actually I can now reproduce with -m32 -msse2  on gcc 4.3.5 64bit.
Comment 41 Benoit Jacob 2011-02-16 16:00:40 UTC
But this is a separate bug than what we were investigating originally here,
which was crashes and valgrind errors. Here, we have what looks like floating
point errors. Valgrind reports no error. So, filing a new bug.
Comment 42 Benoit Jacob 2011-02-16 16:05:54 UTC
Filed bug 186

Note You need to log in before you can comment on or make changes to this bug.