
Understanding why an ASM fsqrt implementation is faster than the standard sqrt function

I have been playing around with basic math function implementations in C++ for academic purposes. Today, I benchmarked the following code for square root:

    inline float sqrt_new(float n) {
        __asm {
            fld n
            fsqrt
        }
    }

I was surprised to see that it is consistently faster than the standard sqrt function (it takes around 85% of the execution time of the standard function).

I don't quite get why and would love to better understand it. Below I show the full code I am using to profile (in Visual Studio 2015, compiling in Release mode and with all optimizations turned on):

    #include <iostream>
    #include <random>
    #include <chrono>

    #define M 1000000

    float ranfloats[M];

    using namespace std;

    inline float sqrt_new(float n) {
        __asm {
            fld n
            fsqrt
        }
    }

    int main() {
        default_random_engine randomGenerator(time(0));
        uniform_real_distribution<float> diceroll(0.0f , 1.0f);

        chrono::high_resolution_clock::time_point start1, start2;
        chrono::high_resolution_clock::time_point end1, end2;
        float sqrt1 = 0;
        float sqrt2 = 0;

        for (int i = 0; i<M; i++) ranfloats[i] = diceroll(randomGenerator);

        start1 = std::chrono::high_resolution_clock::now();
        for (int i = 0; i<M; i++) sqrt1 += sqrt(ranfloats[i]);
        end1 = std::chrono::high_resolution_clock::now();

        start2 = std::chrono::high_resolution_clock::now();
        for (int i = 0; i<M; i++) sqrt2 += sqrt_new(ranfloats[i]);
        end2 = std::chrono::high_resolution_clock::now();

        auto time1 = std::chrono::duration_cast<std::chrono::milliseconds>(end1 - start1).count();
        auto time2 = std::chrono::duration_cast<std::chrono::milliseconds>(end2 - start2).count();

        cout << "Time elapsed for SQRT1: " << time1 << " seconds" << endl;
        cout << "Time elapsed for SQRT2: " << time2 << " seconds" << endl;
        cout << "Average of Time for SQRT2 / Time for SQRT1: " << time2 / time1 << endl;
        cout << "Equal to standard sqrt? " << (sqrt1 == sqrt2) << endl;

        system("pause");
        return 0;
    }

EDIT: I am editing the question to include the disassembly of both loops that calculate square roots, as shown by Visual Studio 2015.

First, the disassembly for for (int i = 0; i<M; i++) sqrt1 += sqrt(ranfloats[i]);:

    00091194 0F 5A C0             cvtps2pd    xmm0,xmm0
    00091197 E8 F2 18 00 00       call        __libm_sse2_sqrt_precise (092A8Eh)
    0009119C F2 0F 5A C0          cvtsd2ss    xmm0,xmm0
    000911A0 83 C6 04             add         esi,4
    000911A3 F3 0F 58 44 24 4C    addss       xmm0,dword ptr [esp+4Ch]
    000911A9 F3 0F 11 44 24 4C    movss       dword ptr [esp+4Ch],xmm0
    000911AF 81 FE 90 5C 46 00    cmp         esi,offset __dyn_tls_dtor_callback (0465C90h)
    000911B5 7C D9                jl          main+190h (091190h)

Next, the disassembly for for (int i = 0; i<M; i++) sqrt2 += sqrt_new(ranfloats[i]);:

    00091290 F3 0F 10 00          movss       xmm0,dword ptr [eax]
    00091294 F3 0F 11 44 24 6C    movss       dword ptr [esp+6Ch],xmm0
    0009129A D9 44 24 6C          fld         dword ptr [esp+6Ch]
    0009129E D9 FA                fsqrt
    000912A0 D9 5C 24 6C          fstp        dword ptr [esp+6Ch]
    000912A4 F3 0F 10 44 24 6C    movss       xmm0,dword ptr [esp+6Ch]
    000912AA 83 C0 04             add         eax,4
    000912AD F3 0F 58 44 24 54    addss       xmm0,dword ptr [esp+54h]
    000912B3 F3 0F 11 44 24 54    movss       dword ptr [esp+54h],xmm0
    000912B9 - 000912DE           ?? ?? ??    (every address in this range is listed as ?? ?? ??)

Answer1:

Both your loops come out pretty horrible, with many bottlenecks other than the sqrt function call or the FSQRT instruction. Both are at least 2x slower than optimal scalar SQRTSS (single-precision) code could be, and that's maybe 8x slower than what a decent SSE2 vectorized loop could achieve. Even without reordering any math operations, you could beat SQRTSS throughput.

Many of the reasons from https://gcc.gnu.org/wiki/DontUseInlineAsm apply to your example. The compiler won't be able to propagate constants through your function, and it won't know that the result is always non-negative (if it isn't NaN). It also won't be able to optimize it into an fabs() if you later square the number.

Also highly important, you defeat auto-vectorization with SSE2 SQRTPS (_mm_sqrt_ps()). A "no-error-checking" scalar sqrt() function using intrinsics also suffers from that problem. IDK if there's any way to get optimal results without /fp:fast, but I doubt it. (Other than writing a whole loop in assembly, or vectorizing the whole loop yourself with intrinsics).

<hr>

It's pretty impressive that your Haswell CPU manages to run the function-call loop as fast as it does, although the inline-asm loop may not even be saturating FSQRT throughput either.

For some reason, your library function call is calling double sqrt(double), not the C++ overload float sqrt(float). This leads to a conversion to double and back to float. Probably you need to #include <cmath> to get the overloads, or you could call sqrtf(). gcc and clang on Linux call sqrtf() with your current code (without converting to double and back), but maybe their <random> header happens to include <cmath>, and MSVC's doesn't. Or maybe there's something else going on.
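A minimal sketch of the fix suggested above, assuming a C++11 compiler: with <cmath> included, std::sqrt on a float argument should select the float overload, so no conversion to double and back is needed. The helper name sum_sqrt is hypothetical and only there for illustration.

```cpp
#include <cmath>        // declares the float/double/long double overloads of std::sqrt
#include <cstddef>
#include <type_traits>

// With <cmath> included, std::sqrt(float) should return float.
static_assert(std::is_same<decltype(std::sqrt(1.0f)), float>::value,
              "expected the float overload of std::sqrt");

// Hypothetical helper: sums square roots in single precision only.
inline float sum_sqrt(const float* data, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += std::sqrt(data[i]);   // float overload; sqrtf(data[i]) is the C spelling
    return sum;
}
```

If the static_assert fires on some toolchain, calling sqrtf() explicitly is the fallback mentioned above.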

<hr>

<strong>The library function-call loop</strong> keeps the sum in memory (instead of a register). Apparently the calling convention used by the 32-bit version of __libm_sse2_sqrt_precise doesn't preserve any XMM registers. The Windows x64 ABI does preserve XMM6-XMM15, but wikipedia says this is new and the 32-bit ABI didn't do that. I assume if there were any call-preserved XMM registers, MSVC's optimizer would take advantage of them.

Anyway, besides the throughput bottleneck of calling sqrt on each independent scalar float, the loop-carried dependency on sqrt1 is a latency bottleneck that includes a store-forwarding round trip:

    000911A3 F3 0F 58 44 24 4C    addss       xmm0,dword ptr [esp+4Ch]
    000911A9 F3 0F 11 44 24 4C    movss       dword ptr [esp+4Ch],xmm0

Out-of-order execution lets the rest of the code for each iteration overlap, so you just bottleneck on throughput, but no matter how efficient the library sqrt function is, this latency bottleneck limits the loop to one iteration per 6 + 3 = 9 cycles. (Haswell ADDSS latency = 3, store-forwarding latency for XMM load/store = 6 cycles, which is 1 cycle more than store-forwarding for integer registers. See Agner Fog's instruction tables.)

SQRTSD has a throughput of one per 8-14 cycles, so the loop-carried dependency is not the limiting bottleneck on Haswell.

<hr>

<strong>The inline-asm version</strong> has a store/reload round trip for the sqrt result, but it's not part of the loop-carried dependency chain. MSVC inline-asm syntax makes it hard to avoid store-forwarding round trips to get data into / out of inline asm. But worse, you produce the result on the x87 stack, and the compiler wants to do SSE math in XMM registers.

And then MSVC shoots itself in the foot for no reason, keeping the sum in memory instead of in an XMM register. It looks inside inline-asm statements to see which registers they affect, so IDK why it doesn't see that your inline-asm statement doesn't clobber any XMM regs.

So MSVC does a much worse job than necessary here:

    00091290 movss  xmm0,dword ptr [eax]        # load from the array
    00091294 movss  dword ptr [esp+6Ch],xmm0    # store to the stack
    0009129A fld    dword ptr [esp+6Ch]         # x87 load from stack
    0009129E fsqrt
    000912A0 fstp   dword ptr [esp+6Ch]         # x87 store to the stack
    000912A4 movss  xmm0,dword ptr [esp+6Ch]    # SSE load from the stack (of sqrt(array[i]))
    000912AA add    eax,4
    000912AD addss  xmm0,dword ptr [esp+54h]    # SSE load+add of the sum
    000912B3 movss  dword ptr [esp+54h],xmm0    # SSE store of the sum

So it has the same loop-carried dependency chain (ADDSS + store-forwarding) as the function-call loop. Haswell FSQRT has one per 8-17 cycle throughput, so probably it's still the bottleneck. (All the stores/reloads involving the array value are independent for each iteration, and out-of-order execution can overlap many iterations to hide that latency chain. However, they will clog up the load/store execution units and sometimes delay the critical-path loads/stores by an extra cycle. This is called a resource conflict.)

<hr>

<strong>Without /fp:fast</strong>, the sqrtf() library function has to set errno if the result is NaN. This is why it can't inline to just a SQRTSS.
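To see the errno side effect that blocks simple inlining, here is a small sketch. It assumes an implementation where (math_errhandling & MATH_ERRNO) is nonzero, as on typical Linux and Windows C libraries, and it probes the negative-input case. The helper name sqrt_sets_errno is hypothetical.

```cpp
#include <cerrno>
#include <cmath>

// Hypothetical helper: reports whether the library sqrt flagged a domain
// error through errno for this input.
inline bool sqrt_sets_errno(float x) {
    volatile float input = x;    // volatile keeps the compiler from folding the call away
    errno = 0;
    volatile float result = std::sqrt(input);
    (void)result;                // only the errno side effect matters here
    return errno == EDOM;
}
```

A bare SQRTSS instruction cannot write errno, which is why a compiler that honours errno has to either call the library or add a check around the instruction.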

If you did want to implement a no-checks scalar sqrt function yourself, you'd do it with Intel intrinsics syntax:

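    // Needs #include <immintrin.h> for the SSE intrinsics.
    // DON'T USE THIS, it defeats auto-vectorization
    static inline float sqrt_scalar(float x) {
        __m128 xvec = _mm_set_ss(x);                 // put x in the low lane, zero the rest
        return _mm_cvtss_f32(_mm_sqrt_ss(xvec));     // SQRTSS on the low lane, then extract it
    }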

This compiles to a near-optimal scalar loop with gcc and clang (without -ffast-math). See it on the Godbolt compiler explorer:

    # gcc6.2 -O3 for the sqrt_new loop using _mm_sqrt_ss. good scalar code, but don't optimize further.
    .L10:
        movss   xmm0, DWORD PTR [r12]
        add     r12, 4
        sqrtss  xmm0, xmm0
        addss   xmm1, xmm0
        cmp     r12, rbx
        jne     .L10

This loop should bottleneck only on SQRTSS throughput (one per 7 clocks on Haswell, notably faster than SQRTSD or FSQRT), and with no resource conflicts. However, <strong>it's still garbage compared to what you could do even without re-ordering the FP adds</strong> (since FP add/mul aren't truly associative): a smart compiler (or programmer using intrinsics) would use SQRTPS to get 4 results with the same throughput as 1 result from SQRTSS. Unpack the vector of SQRT results to 4 scalars, and then you can keep exactly the same order of operations with identical rounding of intermediate results. I'm disappointed that clang and gcc didn't do this.
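The SQRTPS idea described above could be sketched with intrinsics roughly like this. It assumes an x86 target with SSE; the helper name sum_sqrt_sse is hypothetical. Four square roots are computed per instruction, then unpacked and added one at a time so the order of the FP additions matches the scalar loop.

```cpp
#include <cmath>
#include <cstddef>
#include <xmmintrin.h>   // SSE intrinsics: _mm_loadu_ps, _mm_sqrt_ps, _mm_storeu_ps

// Hypothetical helper: sums sqrt(data[i]) in the original left-to-right order,
// but computes four square roots per SQRTPS instruction.
inline float sum_sqrt_sse(const float* data, std::size_t n) {
    float sum = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(data + i);   // unaligned load of 4 floats
        __m128 r = _mm_sqrt_ps(v);           // 4 square roots at once
        float lane[4];
        _mm_storeu_ps(lane, r);              // unpack the vector to 4 scalars
        sum += lane[0];                      // same order of additions as the scalar loop
        sum += lane[1];
        sum += lane[2];
        sum += lane[3];
    }
    for (; i < n; ++i)                       // scalar tail when n is not a multiple of 4
        sum += std::sqrt(data[i]);
    return sum;
}
```

The additions still form one serial dependency chain, so this only removes the sqrt throughput limit. Going faster would mean multiple accumulators, which changes the rounding.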

However, <strong>gcc and clang do manage to actually avoid calling a library function</strong>. clang3.9 (with just -O3) uses SQRTSS without even checking for NaN. I assume that's legal, and not a compiler bug. Maybe it sees that the code doesn't use errno?

gcc6.2 on the other hand speculatively inlines sqrtf(), with a SQRTSS and a check on the input to see if it needs to call the library function.

    # gcc6.2's sqrt() loop, without -ffast-math.
    # speculative inlining of SQRTSS with a check + fallback
    # spills/reloads a lot of stuff to memory even when it skips the call :(

    # xmm1 = 0.0  (gcc -fverbose-asm says it's holding sqrt2, which is zero-initialized, so I guess gcc decides to reuse that zero)
    .L9:
        movss   xmm0, DWORD PTR [rbx]
        sqrtss  xmm5, xmm0
        ucomiss xmm1, xmm0                  # compare input against 0.0
        movss   DWORD PTR [rsp+8], xmm5
        jbe     .L8                         # if(0.0 <= SQRTSS input || unordered(0.0, input)) { skip the function call; }
        movss   DWORD PTR [rsp+12], xmm1    # silly gcc, this store isn't needed. ucomiss doesn't modify xmm1
        call    sqrtf                       # called for negative inputs, but not for NaN.
        movss   xmm1, DWORD PTR [rsp+12]
    .L8:
        movss   xmm4, DWORD PTR [rsp+4]     # silly gcc always stores/reloads both, instead of putting the stores/reloads inside the block that the jbe skips
        addss   xmm4, DWORD PTR [rsp+8]
        add     rbx, 4
        movss   DWORD PTR [rsp+4], xmm4
        cmp     rbp, rbx
        jne     .L9

gcc unfortunately shoots itself in the foot here, the same way MSVC does with inline-asm: there's a store-forwarding round trip as a loop-carried dependency. All the spill/reloads could be inside the block skipped by the JBE. Maybe gcc thinks negative inputs will be common.

<hr>

Even worse, if you do use /fp:fast or -ffast-math, even a clever compiler like clang doesn't manage to rewrite your _mm_sqrt_ss into a SQRTPS. Clang is normally pretty good at not just mapping intrinsics to instructions 1:1, and will come up with more optimal shuffles and blends if you miss an opportunity to combine things.

<strong>So with fast FP math enabled, using _mm_sqrt_ss is a big loss</strong>. clang compiles the sqrt() library function call version into RSQRTPS + a Newton-Raphson iteration.

<hr>

Also note that your microbenchmark code isn't sensitive to the latency of your sqrt_new() implementation, only the throughput. Latency often matters in real FP code, not just throughput. But in other cases, like doing the same thing independently to many array elements, latency doesn't matter, because out-of-order execution can hide it well enough by having instructions in flight from many loop iterations.
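To make the latency/throughput distinction concrete, here is a sketch of the two loop shapes one would time against each other. Both helper names are hypothetical, and no timings are claimed here.

```cpp
#include <cmath>
#include <cstddef>

// Latency-bound shape: each sqrt needs the previous result,
// so successive square roots cannot overlap.
inline float sqrt_chain(float x, std::size_t steps) {
    for (std::size_t i = 0; i < steps; ++i)
        x = std::sqrt(x);
    return x;
}

// Throughput-bound shape (like the benchmark in the question): the sqrt inputs
// are independent, so out-of-order execution can keep several in flight;
// only the add into sum is serialized.
inline float sqrt_sum(const float* data, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += std::sqrt(data[i]);
    return sum;
}
```

Timing sqrt_chain gives an estimate of per-operation latency, while timing sqrt_sum over a large array mostly reflects throughput plus whatever overhead the loop itself adds.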

As I mentioned earlier, latency from the extra store/reload round trip your data takes on its way in/out of MSVC-style inline-asm is a serious problem here. When MSVC inlines the function, the fld n doesn't come directly from the array.

<hr>

BTW, Skylake has SQRTPS/SS throughput of one per 3 cycles, but still 12 cycle latency. SQRTPD/SD throughput = one per 4-6 cycles, latency = 15-16 cycles. So FP square root is more pipelined on Skylake than on Haswell. This magnifies the difference between benchmarking FP sqrt latency vs. throughput.

Answer2:

compiling in Release mode and with all optimizations turned on

They are not all turned on, you missed one. In the IDE it is Project > Properties > C/C++ > Code Generation > Floating Point Model. You left it at its default setting, /fp:precise. That has a very visible side-effect on the generated machine code:

00091197 E8 F2 18 00 00 call __libm_sse2_sqrt_precise (092A8Eh)

Perhaps it is intuitive enough that calling a helper function in the CRT is always slower than an inline instruction like FSQRT.

There is a lot to say about the exact semantics of /fp, and the MSDN article about it is not very good. It is also hard to reverse-engineer: Microsoft purchased the code from Intel and could not obtain a source license that allowed them to re-publish the assembly code. Its original goal was certainly to deal with the horrid floating point consistency problems caused by Intel's 8087 FPU design. That is not so relevant anymore, since all mainstream C and C++ compilers now emit SSE2 code. MSVC++ has done so since VS2012. These Intel library functions now mainly ensure that floating point operations still produce results that are consistent with older versions of the compiler.

__libm_sse2_sqrt_precise() does rather a lot. At the considerable risk of trying to document an undocumented function, I think I see it:

<ul>
<li>check the value of the MXCSR register, the control register for SSE. If it doesn't have its default value (0x1F80) then the code assumes that the programmer has used _controlfp() and jumps to the legacy sqrt() implementation that calculates with FPU semantics.</li>
<li>check the value of the FPU control register to see if floating point exceptions are enabled. They are normally off; if any are enabled then it jumps to sqrt() as above.</li>
<li>check whether the argument is less than zero (but not minus 0); if so, call the user-supplied _matherr() function.</li>
<li>finally compute the result with the SQRTSD instruction.</li>
</ul>
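As a rough illustration of the flow listed above, and emphatically not the real __libm_sse2_sqrt_precise, a wrapper with the same shape might look like the sketch below. Everything in it is an assumption: the names mxcsr_control_is_default, sqrt_legacy and sqrt_checked are hypothetical, the sticky MXCSR status flags are masked off before comparing, the x87 control-word check is omitted, and the error path simply defers to the C library instead of calling _matherr().

```cpp
#include <cmath>
#include <emmintrin.h>   // SSE2: _mm_set_sd, _mm_sqrt_sd, _mm_cvtsd_f64
#include <xmmintrin.h>   // _mm_getcsr

// True when the MXCSR control bits match the default value 0x1F80.
// Assumption: the six sticky status flags (bits 0-5) are masked off first,
// because earlier FP operations may have set them.
inline bool mxcsr_control_is_default() {
    return (_mm_getcsr() & ~0x3Fu) == 0x1F80u;
}

// Hypothetical stand-in for the legacy path; here it just calls the C library.
inline double sqrt_legacy(double x) { return std::sqrt(x); }

// Hypothetical wrapper mirroring the checks in the list above.
inline double sqrt_checked(double x) {
    if (!mxcsr_control_is_default())
        return sqrt_legacy(x);                   // non-default SSE control bits: slow path
    if (x < 0.0)
        return sqrt_legacy(x);                   // negative (but not -0.0): let the library report it
    __m128d v = _mm_set_sd(x);
    return _mm_cvtsd_f64(_mm_sqrt_sd(v, v));     // SQRTSD on the low lane
}
```

Even in this simplified form, every call pays for a control-register read and two branches before reaching the one instruction that does the work.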

None of this actually has anything to do with precision :) Seeing the library call run at 85% of the FSQRT version's speed is a rather good result, helped however by FSQRT being substantially slower than SQRTSD. The latter got a lot more silicon love in modern processors.

If you care about fast floating point operations then change the setting to /fp:fast, which produces:

    00D91310 sqrtsd xmm0,xmm0

That is an inline instruction instead of a library call. In other words, it skips the first 3 bullets in the previous list. It also beats FSQRT handily.
