This bugzilla service is closed. All entries have been migrated to https://gitlab.com/libeigen/eigen
Bug 1614 - EigenMatrix vs EigenTensor softmax speed
Summary: EigenMatrix vs EigenTensor softmax speed
Status: CONFIRMED
Alias: None
Product: Eigen
Classification: Unclassified
Component: Tensor
Version: 3.4 (development)
Hardware: x86 - 64-bit Linux
Importance: Normal Optimization
Assignee: Nobody
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-10-18 23:56 UTC by william.tambellini
Modified: 2019-12-04 18:03 UTC (History)
5 users



Attachments
bench_softmax.cpp (3.15 KB, text/x-c++src)
2018-10-18 23:56 UTC, william.tambellini
bench_softmax.cpp (3.47 KB, text/x-c++src)
2018-10-20 15:01 UTC, william.tambellini
bench_matrix_vs_tensor.cpp (5.07 KB, text/x-c++src)
2018-11-21 00:37 UTC, william.tambellini
bench_matrix_vs_tensor.cpp (5.26 KB, text/x-c++src)
2018-11-21 20:31 UTC, william.tambellini

Description william.tambellini 2018-10-18 23:56:05 UTC
Created attachment 887 [details]
bench_softmax.cpp

With the current stable version of Eigen (3.3.5), the EigenMatrix softmax is faster than the EigenTensor one:

Bench Softmax. Repeat=10 GCC 6.1.0
Eigen version: 3.3.5
NRows   NCols    EMatrix         ETensor
8192    1       0.0168335       0.0500931
16384   1       0.0333193       0.101805
32768   1       0.0660515       0.197528


With recent Eigen master ("3.3.90"), the EigenMatrix softmax seems slower than the EigenTensor one (see attached example).

Bench Softmax. Repeat=10 GCC 6.1.0
Eigen version: 3.3.90
NRows   NCols    EMatrix         ETensor
8192    1       0.0160394       0.00877427
16384   1       0.0326261       0.0169744
32768   1       0.0640419       0.0325718

Is this expected, i.e. is there nothing wrong in the attached bench code?
Comment 1 Christoph Hertzberg 2018-10-19 07:39:30 UTC
(...).unaryExpr(...) does not get vectorized; you should instead write

  (...).array().exp()

The same goes for log, of course. And since you use elementwise operations only, you should use Eigen::Array instead of Eigen::Matrix, to avoid all the .array() calls.
Changing the typedefs and writing

  EigenMatrix wMinusMax = input.rowwise() - input.colwise().maxCoeff();
  output = wMinusMax.rowwise() - wMinusMax.exp().colwise().sum().log();

gives me these timings (also changed Reps to 100, compiled with -O3 -march=native -DNDEBUG):

Bench Softmax. Repeat=100
GCC 8.1.0
Eigen version: 3.3.90
NRows	NCols	 EMatrix	 ETensor

1024	1	0.000269846	0.000961945
2048	1	0.000366462	0.00162071
4096	1	0.000784435	0.00341198
8192	1	0.00154602	0.00698167
16384	1	0.00297613	0.013448
32768	1	0.0070586	0.0285935

I can't tell why Tensor now takes ~4 times as long as Array, though. Perhaps one reason is that wMinusMax gets evaluated twice.
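For reference, the computation both paths perform is the numerically stable log-sum-exp softmax: subtract the column max before exponentiating so exp() never overflows. A minimal dependency-free C++ sketch of that formula for a single column (illustrative names, not the benchmark's actual code):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Stable log-softmax of one column:
//   out[i] = (x[i] - max) - log(sum_j exp(x[j] - max))
// Subtracting the max first keeps every exp() argument <= 0,
// avoiding overflow for large inputs.
std::vector<double> log_softmax(const std::vector<double>& x) {
    const double mx = *std::max_element(x.begin(), x.end());
    double sum = 0.0;
    for (double v : x) sum += std::exp(v - mx);  // sum of shifted exponentials
    const double lse = std::log(sum);            // log-sum-exp term
    std::vector<double> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        out[i] = (x[i] - mx) - lse;
    return out;
}
```

Exponentiating the result recovers the ordinary softmax, whose entries sum to 1, which is an easy sanity check for either the Matrix or the Tensor implementation.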
Comment 2 william.tambellini 2018-10-20 15:01:01 UTC
Created attachment 888 [details]
bench_softmax.cpp

1st revision with Christoph's optimizations to EMatrix softmax.
Comment 3 william.tambellini 2018-10-20 15:08:33 UTC
Thank you Christoph: I have patched the bench, and even without a native build, EigenMatrix is now faster than EigenTensor.
I have forced the eval of wMinusMax into a real Tensor, which speeds things up a little, but it is still slower than EigenMatrix.
I guess the goal is now to understand/optimize EigenTensor.

Bench Softmax. Repeat=100
GCC 6.1.0
Eigen version: 3.3.90
Simd: SSE, SSE2
NRows   NCols    EMatrix         ETensor

2048    1       0.000760094     0.00199718
4096    1       0.00150396      0.00372256
8192    1       0.00280879      0.00771556
16384   1       0.00589872      0.0155258
32768   1       0.0119034       0.0322889
2048    2       0.00135247      0.00401503
4096    2       0.00295824      0.00820826
8192    2       0.00579303      0.01618
16384   2       0.0119328       0.030779
32768   2       0.0248354       0.0616968
Comment 4 william.tambellini 2018-11-21 00:36:26 UTC
I'm just enlarging the scope of this ticket/benchmark to compare the speed of Matrix vs Tensor for more ops.
My latest retouches to the attached file now also compare the transpose op, where some interesting speed differences appear as well.
I will perhaps add a reduction later.
Christoph: could you check whether this benchmark is doing anything illegal/unfair regarding transpose?


Bench Eigen Matrix vs Tensor
Repeat: 100
GCC: 6.1.0
Eigen version: 3.3.90
Simd: SSE, SSE2
Eigen::nbThreads: 4
EIGEN_NO_DEBUG
EIGEN_VECTORIZE
EIGEN_HAS_OPENMP: 201511
omp_get_num_threads: 1

Transpose:
           NRows           NCols         EMatrix         ETensor
              64              64     0.000238874              0.00124015
             128             128      0.00157556              0.00298379
             256             256      0.00892913               0.0218981
             512             512       0.0464412               0.0935147
            1024            1024        0.618661                0.488503
            2048            2048          4.2304                 1.98348

LogSoftmax:
           NRows           NCols         EMatrix         ETensor
            2048               1     0.000594632      0.00158938
            4096               1      0.00121075      0.00303042
            8192               1      0.00227049      0.00637481
           16384               1       0.0046793       0.0128154
           32768               1      0.00964536       0.0250104
            2048               2      0.00115565      0.00306239
            4096               2      0.00231914      0.00618422
            8192               2      0.00495743        0.014086
           16384               2       0.0103915       0.0257999
           32768               2       0.0200752       0.0495503
Comment 5 william.tambellini 2018-11-21 00:37:22 UTC
Created attachment 895 [details]
bench_matrix_vs_tensor.cpp
Comment 6 Christoph Hertzberg 2018-11-21 09:02:19 UTC
The transpose stuff seems to be unrelated to the other issue. Please open a separate bug for that. Regarding the benchmark: move the code to a submethod -- gcc often fails to properly optimize inside main.
Comment 7 william.tambellini 2018-11-21 20:30:13 UTC
OK, I'm going to create a separate ticket for transpose.
I have moved the ops into submethods outside of main.

Bench Eigen Matrix vs Tensor
Repeat: 100
GCC: 6.1.0
Eigen version: 3.3.90
Simd: SSE, SSE2
Eigen::nbThreads: 4
EIGEN_NO_DEBUG
EIGEN_VECTORIZE
EIGEN_HAS_OPENMP: 201511
omp_get_num_threads: 1

Transpose:
           NRows           NCols         EMatrix         ETensor
              64              64     0.000278582              0.00118413
             128             128      0.00157701              0.00293576
             256             256      0.00937656               0.0216004
             512             512         0.04582               0.0843192
            1024            1024        0.382853                0.523895
            2048            2048         4.23694                 2.16815

LogSoftmax:
           NRows           NCols         EMatrix         ETensor
            2048               1     0.000603764      0.00154033
            4096               1      0.00114091      0.00302051
            8192               1      0.00223847      0.00607695
           16384               1       0.0046968       0.0127488
           32768               1      0.00952952       0.0249907
            2048               2      0.00114564      0.00303769
            4096               2      0.00225683      0.00607746
            8192               2      0.00452618       0.0125067
           16384               2      0.00926576       0.0247138
           32768               2       0.0194619       0.0490335
Comment 8 william.tambellini 2018-11-21 20:31:19 UTC
Created attachment 896 [details]
bench_matrix_vs_tensor.cpp
Comment 9 Gael Guennebaud 2018-11-22 17:09:52 UTC
The problem most likely comes from broadcast(), which is known to be expensive because the Tensor module uses 1D indices, implying a costly integer div/mod on every coeff access. In your case the div/mod are actually bypassed, but at the cost of a dynamic branch on every coeff access. I don't think there is any easy fix besides implementing special paths for 1D/2D Tensors so that they implicitly fall back to Eigen's Core. This won't happen soon!
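The 1D-index cost described above can be sketched in plain C++: recovering 2D coordinates from a linear index over a column-major layout needs one integer division and one modulo per coefficient access. This is an illustrative sketch only, not Eigen's actual evaluator code:

```cpp
#include <cstddef>
#include <utility>

// For a column-major rows x cols layout, a linear index maps back to
// (row, col) via one mod and one div -- the per-coefficient cost that
// makes broadcast() expensive in the Tensor module.
std::pair<std::size_t, std::size_t>
linear_to_2d(std::size_t idx, std::size_t rows) {
    return { idx % rows,     // row: position within the column
             idx / rows };   // col: which column the index falls in
}
```

For example, with 4 rows, linear index 5 lands at row 1 of column 1; doing this (or a branch that skips it) for every coefficient is what the Matrix/Array path avoids by using 2D-aware evaluators.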
Comment 10 Nobody 2019-12-04 18:03:30 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to gitlab.com's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.com/libeigen/eigen/issues/1614.
