The AVX implementation of pmadd uses inline assembly to force Clang to generate vfmadd231ps. This may no longer be profitable, since Clang no longer always "generates vfmadd213ps instruction plus some vmovaps on registers", as the comment in the implementation says.
Clang now commutes FMA operands, both memory and register, and picks the appropriate fmadd form, which enables optimizations such as memory folding.
So forcing the assembly might cause optimization opportunities to be missed.
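For readers unfamiliar with the three FMA3 operand orders, here is a minimal scalar sketch of what "choosing the fmadd permutation" means. It is not Eigen code and not what Clang does internally; the function names (fmadd132, fmadd213, fmadd231, madd_into_a, ...) are made up for illustration, and std::fma stands in for one lane of the packed instruction. Operands follow the Intel order (dst, src2, src3), where only src3 may be a memory operand.

```cpp
#include <cmath>

// Scalar models of one lane of the FMA3 forms, Intel operand order:
//   op dst, src2, src3   (src3 is the only operand that may come from memory)
// Each function returns the value the instruction would write to dst.
inline float fmadd132(float dst, float src2, float src3) { return std::fma(dst, src3, src2); }  // dst = dst*src3 + src2
inline float fmadd213(float dst, float src2, float src3) { return std::fma(src2, dst, src3); }  // dst = src2*dst + src3
inline float fmadd231(float dst, float src2, float src3) { return std::fma(src2, src3, dst); }  // dst = src2*src3 + dst

// The same a*b + c expressed with each form. What differs is which input
// register gets overwritten and which input could be folded from memory.
inline float madd_into_c(float a, float b, float c) { return fmadd231(c, a, b); }      // overwrites c, b foldable
inline float madd_into_a(float a, float b, float c) { return fmadd213(a, b, c); }      // overwrites a, c foldable
inline float madd_into_a_132(float a, float b, float c) { return fmadd132(a, c, b); }  // overwrites a, b foldable
```

All three wrappers produce the same value. Hard-coding one form in inline assembly fixes the overwritten register and the foldable operand, whereas a compiler that can commute the operands is free to pick the form that avoids an extra vmovaps or a separate load.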
Do you know which Clang version introduced this optimization?
If I'm not mistaken, this is the first patch introducing commutable FMA operands:
But there have been additional patches and changes since then, the latest being this one for AVX512:
After benchmarking several clang versions, the first correct one is clang 3.8:
Nice, were there any performance improvements?
No improvement, because pmadd is currently only used in places where vfmadd231ps is really the right choice.