
What's up with gcc's weird stack manipulation when it wants extra stack alignment?

Question:

I've seen this r10 weirdness a few times, so let's see if anyone knows what's up.

Take this simple function:

    #include <stdint.h>

    #define SZ 4

    void sink(uint64_t *p);

    void andpop(const uint64_t* a) {
        uint64_t result[SZ];
        for (unsigned i = 0; i < SZ; i++) {
            result[i] = a[i] + 1;
        }
        sink(result);
    }

It just adds 1 to each of the 4 64-bit elements of the passed-in array, stores the results in a local array, and calls sink() on that array (to keep the whole function from being optimized away).

Here's the <a href="https://godbolt.org/g/2qAttE" rel="nofollow">corresponding</a> assembly:

    andpop(unsigned long const*):
            lea     r10, [rsp+8]
            and     rsp, -32
            push    QWORD PTR [r10-8]
            push    rbp
            mov     rbp, rsp
            push    r10
            sub     rsp, 40
            vmovdqa ymm0, YMMWORD PTR .LC0[rip]
            vpaddq  ymm0, ymm0, YMMWORD PTR [rdi]
            lea     rdi, [rbp-48]
            vmovdqa YMMWORD PTR [rbp-48], ymm0
            vzeroupper
            call    sink(unsigned long*)
            add     rsp, 40
            pop     r10
            pop     rbp
            lea     rsp, [r10-8]
            ret

It's hard to understand almost everything that is going on with r10. First, r10 is set to point to rsp + 8. Next comes push QWORD PTR [r10-8], which as far as I can tell pushes a copy of the return address onto the stack. After that, rbp is set up as normal, and finally r10 itself is pushed.

To unwind all this, r10 is popped off of the stack and used to restore rsp to its original value.

Some observations:

<ul><li>Looking at the entire function, all of this seems like a totally roundabout way of simply restoring rsp to its original value before ret, but the usual epilog of mov rsp, rbp would do just as well (see clang)!</li> <li>That said, the (expensive) push QWORD PTR [r10-8] doesn't even help in that mission: this value (the return address?) is apparently never used.</li> <li>Why is r10 pushed and popped at all? The value isn't clobbered in the very small function body and there is no register pressure.</li> </ul>

What's up with that? I've seen it several times before, and it usually wants to use r10, sometimes r13. It seems likely that this has something to do with aligning the stack to 32 bytes, since if you change SZ to be less than 4 it uses xmm ops and the issue disappears.

Here's SZ == 2 for example:

    andpop(unsigned long const*):
            sub     rsp, 24
            vmovdqa xmm0, XMMWORD PTR .LC0[rip]
            vpaddq  xmm0, xmm0, XMMWORD PTR [rdi]
            mov     rdi, rsp
            vmovaps XMMWORD PTR [rsp], xmm0
            call    sink(unsigned long*)
            add     rsp, 24
            ret

Much nicer!
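For anyone who wants to poke at the 32-byte alignment hypothesis from the C side, here is a minimal sketch. It is not taken from the Godbolt links above, and the function name is made up for illustration. It assumes a C11 compiler that supports over-aligned automatic variables (GCC and Clang on x86-64 do) and that converting a pointer to uintptr_t preserves the low address bits, which holds on mainstream platforms but is not guaranteed by the standard.

```c
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical helper: asks for a 32-byte-aligned local, which is more
 * than the 16 bytes the x86-64 System V ABI guarantees for the stack,
 * and reports whether the compiler actually delivered it. */
static int local_is_32_byte_aligned(void)
{
    alignas(32) uint64_t result[4] = {0};

    /* volatile so that the address is really materialized at run time */
    volatile uintptr_t addr = (uintptr_t)result;

    return addr % 32 == 0;
}
```

Compiling a function like this and looking at the assembly should show whether the compiler realigns rsp in the prologue even when no AVX instructions are involved.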

Answer1:

Well, you answered your own question: the stack needs to be aligned to 32 bytes before it can be accessed with aligned AVX2 loads and stores, but the ABI only guarantees 16-byte alignment. Since the compiler cannot know how far off the alignment is, it has to save the stack pointer in a scratch register and restore it afterwards. But the saved value has to outlive the function call, so it has to be put on the stack, and a stack frame has to be created.
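To make the bookkeeping concrete, here is a rough sketch in C that replays the address arithmetic of the prologue and epilogue from the question. It only tracks register values and the two stack slots that matter; the struct, the function names, and the example addresses are all made up for illustration. It also overwrites r10 between prologue and epilogue to mimic the fact that r10 is call-clobbered in the System V ABI, which is why the saved copy has to live on the stack.

```c
#include <stdint.h>

/* Hypothetical model: register values plus the two stack slots that the
 * prologue writes and the epilogue reads back. */
typedef struct {
    uint64_t rsp, rbp, r10;
    uint64_t slot_rbp;  /* written by "push rbp" */
    uint64_t slot_r10;  /* written by "push r10" */
} frame_model;

/* Replays the prologue; returns the address of result[] (rbp-48). */
static uint64_t model_prologue(frame_model *m)
{
    m->r10 = m->rsp + 8;                    /* lea  r10, [rsp+8]      */
    m->rsp &= ~UINT64_C(31);                /* and  rsp, -32          */
    m->rsp -= 8;                            /* push QWORD PTR [r10-8] */
    m->rsp -= 8; m->slot_rbp = m->rbp;      /* push rbp               */
    m->rbp = m->rsp;                        /* mov  rbp, rsp          */
    m->rsp -= 8; m->slot_r10 = m->r10;      /* push r10               */
    m->rsp -= 40;                           /* sub  rsp, 40           */
    return m->rbp - 48;                     /* lea  rdi, [rbp-48]     */
}

/* Replays the epilogue up to, but not including, ret. */
static void model_epilogue(frame_model *m)
{
    m->rsp += 40;                           /* add  rsp, 40           */
    m->r10 = m->slot_r10; m->rsp += 8;      /* pop  r10               */
    m->rbp = m->slot_rbp; m->rsp += 8;      /* pop  rbp               */
    m->rsp = m->r10 - 8;                    /* lea  rsp, [r10-8]      */
}

/* Returns 1 if the buffer is 32-byte aligned, rsp is 16-byte aligned at
 * the call, and rsp/rbp come back unchanged after the epilogue. */
static int model_roundtrip_ok(uint64_t entry_rsp)
{
    const uint64_t caller_rbp = 0x1234;     /* arbitrary made-up value */
    frame_model m = { .rsp = entry_rsp, .rbp = caller_rbp };

    uint64_t buf = model_prologue(&m);
    int frame_ok = (buf % 32 == 0) && (m.rsp % 16 == 0) && (buf >= m.rsp);

    m.r10 = 0xdeadbeef;                     /* sink() may clobber r10  */

    model_epilogue(&m);
    return frame_ok && m.rsp == entry_rsp && m.rbp == caller_rbp;
}
```

At function entry rsp is 8 modulo 16, so modulo 32 it is either 8 or 24. Calling model_roundtrip_ok(0x7fff0008) and model_roundtrip_ok(0x7fff0018) covers both cases, and in both the buffer lands on a 32-byte boundary and rsp is restored exactly.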

Some x86-64 ABIs have a red zone (a region of the stack below the stack pointer which is not used by signal handlers), so it is feasible not to change the stack pointer at all for such short functions, but GCC apparently does not implement this optimization and it would not apply here anyway because of the function call at the end.

In addition, the default stack alignment implementation is rather poor. For this case, -maccumulate-outgoing-args results in better-looking code with GCC 6, just aligning RSP after saving RBP, instead of copying the return address before saving RBP:

    andpop:
            pushq   %rbp
            movq    %rsp, %rbp                  # make a traditional stack frame
            andq    $-32, %rsp                  # reserve 0 or 16 bytes
            subq    $32, %rsp
            vmovdqu (%rdi), %xmm0               # split unaligned load from tune=generic
            vinserti128 $0x1, 16(%rdi), %ymm0, %ymm0   # use -march=haswell instead
            movq    %rsp, %rdi
            vpaddq  .LC0(%rip), %ymm0, %ymm0
            vmovdqa %ymm0, (%rsp)
            vzeroupper
            call    sink@PLT
            leave
            ret

(editor's note: gcc8 and later make asm like this by default (<a href="https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAKxAEZSAbAQwDtRkBSAJgCFufSAZ1QBXYskwgA5NwDMeFsgYisAag6yAwoIL4WBAHQIN2DgAYAgnKwAzBZlUBlAFqqALOauWAbqjzpVQQUAawgRBQIANjcAfQJVACoABwBKDT5LT19/VVZ0JNQkiDQWHVVw/Wi4hNyU9QB2DItVFvKIqvjiTEERBgIOAFYeF0GAEXTPVtUbElUw0rxgFkwAvHVZUdUzdNU1jU0nZx28fn46jkbJqdaunr7BnhOBzY1Npgenl95VWgnLKYu40y/1aQRYoVuvQIaVkTUBUhSjGkAykpBY0jMqNQ0k0p2%2BwjEEnUXFktFRBAxCMRwRApIM9QAnJEBmYSQAONz1WT1SJspFSNyogC2dDMZjRlNI2KkqMEIHFFKkmMRcFgMCg6ogSDQQqSeAYmDIFGKqF1%2BsNIGAbK4pDsfUNcogACNJU6FExiABPaRk0g6oWYfQAeRYDG9StRWCFrGABsl%2BC6yAIeG83UlmAAHphkCICJIpL6IpgGD7UQw8E7iB7PZoMPnfQRiHgRQWEYwYyg8QJy065ZBEYVk6hStIALQZ5CqUdB2RTmyp4hO1CCTCjpiCIVTgDqTAYDCn0fExg2CHXAHdiwxZaJxJJaIiS1IURKI1LpBm2ZFR9FVMBkJO2QMLg5lwQhZjkehVFrM0DWIYlSTqXFeH4clKRSREEEwJgsGIShqRAFl%2BUFUgRUI9FX2lWV5VIRVlVIVVEBQU09Vg8hKB1FiLTwf9aAZeg7TzYhHRdV83RYatSz9U0A2DUNw0xUgoxjONXwTbNk1TOVX0zbNc3rVEi0fX0eyrL1aywSTG2bUsHw7ThkN4RgKz7CABySIcRykadZ1HHc9wPD1kGPUZT0EC892vQk7wfZFUXIhTpQ/L8f24ydeIMMwDFoED8CIOCINIKDmPNfKSS4RCu1QiN0NITDsItVzSBpQjH2I%2BKsWkKiFTQmKpC4YUCPFdq3xlGi0PozVNW1YrWONDiSpQZg2HqMVbX1QThNdd0vUk/1AwIEMw3jTBozYFSFLUpMUzTbSsxzPNJMMySTOrcz9JopsWzJWy2E7Bzu2c%2BA3I8rTx0nbzt13fdR0PILXlC8L9znGx1wIGGmAIBBItvOheufYbEs/b83FUJQY1UeoMoynKwNK0lCugzi6dkCr/qq5UMKwnC8KawaiIGsjJUooRqNotsGK1JiYMNNiTWl3ClBiZYzxiaI1vtITKBEhSxIk1spKFGSDrk47TtjfMLrwRMNJuhSdPu96ntfF6zLrSzPps9tfvsvhHJ7FzgbwYdQaDDMpwAcW8VQAHpAqQCwADUAA1%2BqEG8JFx/l8aF98fyFQRvEnbx0toEmIC3ABJAA5bA6ggUC8vgyDGZK%2BC3FZ32eHZqlaq5hr8Ja6Q2pz0a5W66retT0ihpH7uappXkgLMWQ3AZHlaAGLhIjK/lZDi2exon/lU4JzrD450gFyCYcQDcIA%3D%3D%3D" rel="nofollow">Godbolt compiler explorer with gcc8, clang7, ICC19, and MSVC</a>), even without -maccumulate-outgoing-args)
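The "reserve 0 or 16 bytes" comment in the frame-pointer version can be checked with a little arithmetic. The following is only a sketch with made-up helper names and example addresses: it replays pushq %rbp, andq $-32, %rsp and subq $32, %rsp on a plain integer standing in for rsp.

```c
#include <stdint.h>

/* Hypothetical helper: bytes skipped by "andq $-32, %rsp" for a given
 * value of rsp at function entry. */
static uint64_t fp_realign_padding(uint64_t entry_rsp)
{
    uint64_t rsp = entry_rsp - 8;             /* pushq %rbp          */
    uint64_t aligned = rsp & ~UINT64_C(31);   /* andq  $-32, %rsp    */
    return rsp - aligned;
}

/* Hypothetical helper: address of result[] after the full prologue. */
static uint64_t fp_buffer_address(uint64_t entry_rsp)
{
    uint64_t rsp = entry_rsp - 8;             /* pushq %rbp          */
    rsp &= ~UINT64_C(31);                     /* andq  $-32, %rsp    */
    rsp -= 32;                                /* subq  $32, %rsp     */
    return rsp;                               /* movq  %rsp, %rdi    */
}
```

With an entry rsp of 0x7fff0008 the padding is 0, and with 0x7fff0018 it is 16. In both cases the buffer address is a multiple of 32. Because rbp still holds the unaligned frame address, leave can restore rsp without knowing which case occurred, so no extra register has to be saved.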

<hr />

This issue (GCC generating poor code for stack alignment) recently came up when we had to implement a workaround for the GCC __tls_get_addr ABI bug, and we ended up writing the stack realignment by hand.

<strong>EDIT</strong> There is also another issue, related to RTL pass ordering: stack alignment is picked before the final determination of whether the stack is actually needed, <a href="https://godbolt.org/g/enf33U" rel="nofollow">as BeeOnRope's second example shows</a>.
