Getting your ducks in line

Posted on 27.02.2015

Getting the details of low-level C or C++ right is one of those really tricky areas that many people do not grasp straight away. I saw a case recently where MSVC was reporting warning C4366. In this scenario the warning was triggered by a function call that took a pointer to a packed struct member. If you read the linked MSDN page, you'll see a scenario you can reproduce.

This alone didn't trigger the writing of this blog post. I heard tell that pointers must be aligned (you can't have an unaligned pointer) and that copying the data out of the struct and then taking a pointer to the copy is a solution. The first is not true and the second might be okay, but probably isn't, as we shall see.

Why does this warning occur? Well, it comes down to two potential issues.

Firstly, at the lowest level, processors move information between memory and registers. How much they read from a given location varies with the instruction. For example, I might ask for a single byte:

mov  al, byte [ecx]

or I might ask for four:

mov  eax, dword [ecx]

On some processors there are strict alignment requirements: the address of a memory access must be a multiple of the access size (often the processor word size), unless it is a byte read. So loading a 4-byte word from address 0x3 violates this requirement and causes the CPU to fault.
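To make that rule concrete, here is a minimal C sketch of the check itself. is_aligned is a name made up for illustration, and the sketch assumes the usual flat address space, where converting a pointer to uintptr_t yields the numeric address:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: non-zero when addr is a multiple of size.
 * Assumes the pointer-to-integer conversion yields the address itself. */
int is_aligned(const void *addr, size_t size)
{
    return ((uintptr_t)addr % size) == 0;
}
```

With this, is_aligned((const void *)0x3, 4) is false, which is the 4-byte load from address 0x3 described above, while any address passes for a size of 1.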

The x86 processor family is both the exception to the rule and a good way of demonstrating it. On x86, unaligned reads are fine (they will not raise a fault); however, they can have a performance impact. This is issue two: unaligned reads, where permitted, can be costly.

This does not actually apply to all x86 instructions. For example:

movdqa xmm1, [0x3]

will fault the processor, because the a in movdqa stands for aligned: its memory operand must sit on a 16-byte boundary.

So, unaligned reads are generally not liked. In fact, I struggled to force my compiler to produce an unaligned read in any way. So I've written some assembly to prove unaligned reads are possible on modern x86-64 processors:

[bits 64]

section .data
    ; BBBB BBB A
    bdta: db 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x41, 0, 0, 0, 0x41, 0
    format1: db "Character test+7 is %d", 10, 0
    format2: db "Address   test+7 is %p", 10, 0

section .text

extern printf
global main

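main:
    push  rbp
    mov   rbp, rsp
    push  rbx                 ; rbx is callee-saved, so preserve it
    sub   rsp, 8              ; keep rsp 16-byte aligned for the calls
    lea   rbx, [bdta+7]       ; address of the 0x41 byte, 7 bytes into bdta
    mov   rdi, format1
    mov   edx, dword [rbx]    ; 4-byte read from the unaligned address
    mov   rsi, rdx
    xor   eax, eax            ; no vector arguments for printf
    call  printf

    mov   rdi, format2
    mov   rsi, rbx
    xor   eax, eax
    call  printf

    xor   eax, eax            ; return 0
    add   rsp, 8
    pop   rbx
    pop   rbp
    ret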

If you assemble and link this you should get output like this:

% ./unaligned_read
Character test+7 is 65
Address   test+7 is 0x60103b

Which is, quite clearly, unaligned. I even went so far as to read a dword from an unaligned address, just to be sure I wasn't reading a byte!

So x86 can do this, but compilers will try very hard not to, because underneath the hardware is quietly correcting our stupidity (for example by splitting the access in two), and that costs performance. However, not all platforms have this luxury, so how do we avoid this?

One way is to use pointer arithmetic, which brings me back to the "unaligned pointers" issue. Pointers can definitely be unaligned: an embedded memcpy(), for example, has to cope with whatever addresses it is handed, so one of its tasks may be to check for alignment. The generic implementation in uClibc can be seen doing exactly that: byte copying until it reaches alignment, then copying via words or pages as necessary (see also their header macros).
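The shape of that approach can be sketched in C roughly as follows. This is not uClibc's code: sketch_copy and word_t are hypothetical names, may_alias is a GCC/Clang extension, and the sketch assumes uintptr_t is the machine word and that its alignment equals its size.

```c
#include <stddef.h>
#include <stdint.h>

/* Word type allowed to alias any object (GCC/Clang extension).
 * Assumption: uintptr_t is the machine word, aligned to its own size. */
typedef uintptr_t __attribute__((may_alias)) word_t;

/* Hypothetical alignment-aware copy for non-overlapping buffers. */
void *sketch_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Head: byte copies until the destination is word aligned. */
    while (n > 0 && ((uintptr_t)d % sizeof(word_t)) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Body: word copies, but only if the source ended up aligned too. */
    if (((uintptr_t)s % sizeof(word_t)) == 0) {
        while (n >= sizeof(word_t)) {
            *(word_t *)d = *(const word_t *)s;
            d += sizeof(word_t);
            s += sizeof(word_t);
            n -= sizeof(word_t);
        }
    }

    /* Tail: whatever is left goes byte by byte. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

The point is that both pointers arrive with arbitrary alignment and are perfectly legal as pointers. The function only performs a word-sized access once it has checked that the address is suitable. A real implementation would also handle the case where the source is misaligned relative to the destination more cleverly than falling back to bytes.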

That's that out of the way. So, supposing you have a packed struct, how do you safely extract your "bigger than a byte" data?

Well, you can copy. For example, if you have a uint64_t stored in a struct somewhere, it was suggested we should just copy it like this:

uint64_t temp = s.unaligned_ui64;
some_function(&temp);

Will this work? We don't have to wonder, we have a compiler! I threw together a little program that fights off the optimizer's attempts to avoid having unaligned variables in the first place:

#include <inttypes.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct __attribute__((aligned(1), packed)) packed_tag 
{
    uint8_t padding[3];
    uint64_t z;
} packed_s;

void print(uint64_t* p)
{
    printf("Value is: %" PRIu64 "\n at %p", *p, (void*)p);
}

packed_s* allocate_struct()
{
    packed_s* t = calloc(1, sizeof(packed_s));
    if ( t )
    {
        uint64_t data = 0;
        scanf("%" PRIu64 "", &data);
        t->z = data;
    }
    return t;
}

int main(int argc, char** argv)
{
    packed_s* s = allocate_struct();
    uint64_t t = s->z;
    print(&t);
    return 0;
}

If you compile this with: gcc -O2 -fno-asynchronous-unwind-tables -masm=intel -S unaligned_2.c -o unaligned_2.s you will get this optimized version:

    .file   "unaligned_2.c"
    .intel_syntax noprefix
    .section    .rodata.str1.1,"aMS",@progbits,1
.LC0:
    .string "Value is: %lu\n at %p"
    .text
    .p2align 4,,15
    .globl  print
    .type   print, @function
print:
    mov rsi, QWORD PTR [rdi]
    mov rdx, rdi
    xor eax, eax
    mov edi, OFFSET FLAT:.LC0
    jmp printf
    .size   print, .-print
    .section    .rodata.str1.1
.LC1:
    .string "%lu"
    .text
    .p2align 4,,15
    .globl  allocate_struct
    .type   allocate_struct, @function
allocate_struct:
    push    rbx
    mov esi, 11
    mov edi, 1
    sub rsp, 16
    call    calloc
    test    rax, rax
    mov rbx, rax
    je  .L3
    lea rsi, [rsp+8]
    mov edi, OFFSET FLAT:.LC1
    xor eax, eax
    mov QWORD PTR [rsp+8], 0
    call    __isoc99_scanf
    mov rax, QWORD PTR [rsp+8]
    mov QWORD PTR [rbx+3], rax
.L3:
    add rsp, 16
    mov rax, rbx
    pop rbx
    ret
    .size   allocate_struct, .-allocate_struct
    .section    .text.startup,"ax",@progbits
    .p2align 4,,15
    .globl  main
    .type   main, @function
main:
    sub rsp, 24
    xor eax, eax
    call    allocate_struct
    mov rsi, QWORD PTR [rax+3]
    lea rdx, [rsp+8]
    mov edi, OFFSET FLAT:.LC0
    xor eax, eax
    mov QWORD PTR [rsp+8], rsi
    call    printf
    xor eax, eax
    add rsp, 24
    ret
    .size   main, .-main
    .ident  "GCC: (GNU) 4.8.3"
    .section    .note.GNU-stack,"",@progbits

which, as you can see, is quite happily doing an unaligned read (mov rsi, QWORD PTR [rax+3] in main), even though the pointer it hands on to printf is aligned. If the compiler is intelligent enough to understand it can do this on x86, but not on other platforms, we are in luck. If not, we've produced an unaligned access and so a CPU fault.

So, how do we get rid of that unaligned read of a uint64_t, guaranteed? The only safe way is to use memcpy to copy the value out of the packed struct, a byte at a time, into our local variable.
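As a minimal sketch of that, assuming the same packed layout as the test program above: load_u64 and load_z are names invented here for illustration, and packed is the GCC/Clang attribute spelling.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same layout as the packed struct in the test program above. */
typedef struct __attribute__((packed)) packed_tag {
    uint8_t padding[3];
    uint64_t z;
} packed_s;

/* Hypothetical helper: read a uint64_t from any address, aligned or not. */
uint64_t load_u64(const void *src)
{
    uint64_t out;
    memcpy(&out, src, sizeof out);  /* byte-wise as far as the language cares */
    return out;
}

/* Fetch z without ever forming a uint64_t* into the packed struct. */
uint64_t load_z(const packed_s *s)
{
    return load_u64((const unsigned char *)s + offsetof(packed_s, z));
}
```

A caller then writes uint64_t t = load_z(s); print(&t); and the only access to the packed member goes through memcpy. On x86 an optimizing compiler will typically turn this back into a single mov, so nothing is lost there.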

Of course, we can use unaligned pointers, providing we don't read through them directly. On such platforms memcpy is in all likelihood checking the src and dst addresses to be safe.