Query on -ffunction-sections & -fdata-sections options of gcc

Interestingly, using -fdata-sections can make the literal pools of your functions, and thus the functions themselves, larger. I’ve noticed this on ARM in particular, but it’s likely to be true elsewhere. The binary I was testing only grew by a quarter of a percent, but it did grow. Looking at the disassembly of the changed functions, it was clear why.

If all of the BSS (or DATA) entries in your object file are allocated to a single section, then the compiler can store the address of that section in the function’s literal pool and access your data with loads at known offsets from that address. But if you enable -fdata-sections, it puts each piece of BSS (or DATA) data into its own section. Since it doesn’t know which of these sections might be garbage collected later, or in what order the linker will place them in the final executable image, it can no longer load data using offsets from a single address. Instead, it has to allocate one literal pool entry for each data object the function uses. Once the linker has figured out what is going into the final image and where, it fixes up these literal pool entries with the actual addresses of the data.
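
To make the difference concrete, here is a rough source-level model of the two addressing strategies. It is only a sketch: the array names are made up to stand in for sections, and the compiler really does this at the instruction level, not in C.

```c
/* Hypothetical stand-in for one shared .bss section holding both variables. */
static volatile int bss_section[2];      /* [0] plays head, [1] plays tail */

/* Model of the code without -fdata-sections: one base address is fetched
   from the literal pool, and both loads use fixed offsets from it. */
int queue_empty_shared_base(void)
{
    volatile int *base = bss_section;    /* one literal pool entry */
    return base[0] == base[1];           /* loads at offsets 0 and 4 */
}

/* Hypothetical stand-ins for the per-variable sections that
   -fdata-sections creates; their relative placement is unknown. */
static volatile int head_section[1];
static volatile int tail_section[1];

/* Model of the code with -fdata-sections: every variable needs its own
   address, so the literal pool grows by one entry per variable used. */
int queue_empty_separate(void)
{
    volatile int *head_addr = head_section;   /* literal pool entry 1 */
    volatile int *tail_addr = tail_section;   /* literal pool entry 2 */
    return *head_addr == *tail_addr;
}
```

Both functions compute the same result; the only thing that changes is how many addresses have to be materialized before the loads can happen.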

So yes, even with -Wl,--gc-sections the resulting image can be larger because the actual function text is larger.

Below I’ve added a minimal example.

The code below is enough to see the behavior I’m talking about. Please don’t be thrown off by the volatile declaration and use of global variables, both of which are questionable in real code. Here they ensure the creation of two data sections when -fdata-sections is used.

static volatile int head;
static volatile int tail;

int queue_empty(void)
{
    return head == tail;
}

The version of GCC used for this test is:

gcc version 6.1.1 20160526 (Arch Repository)

First, without -fdata-sections we get the following.

> arm-none-eabi-gcc -march=armv6-m \
                    -mcpu=cortex-m0 \
                    -mthumb \
                    -Os \
                    -c \
                    -o test.o \
                    test.c

> arm-none-eabi-objdump -dr test.o

00000000 <queue_empty>:
 0: 4b03     ldr   r3, [pc, #12]   ; (10 <queue_empty+0x10>)
 2: 6818     ldr   r0, [r3, #0]
 4: 685b     ldr   r3, [r3, #4]
 6: 1ac0     subs  r0, r0, r3
 8: 4243     negs  r3, r0
 a: 4158     adcs  r0, r3
 c: 4770     bx    lr
 e: 46c0     nop                   ; (mov r8, r8)
10: 00000000 .word 0x00000000
             10: R_ARM_ABS32 .bss

> arm-none-eabi-nm -S test.o

00000000 00000004 b head
00000000 00000014 T queue_empty
00000004 00000004 b tail

From arm-none-eabi-nm we see that queue_empty is 20 bytes long (0x14), and the arm-none-eabi-objdump output shows a single relocation word at the end of the function: the address of the BSS section (the section for uninitialized data). The first instruction in the function loads that value (the address of the BSS) into r3. The next two instructions load relative to r3, at offsets of 0 and 4 bytes respectively. These two loads fetch the values of head and tail. We can see those offsets in the first column of the output from arm-none-eabi-nm. The nop at the end of the function is there to word-align the literal pool.

Next we’ll see what happens when -fdata-sections is added.

> arm-none-eabi-gcc -march=armv6-m \
                    -mcpu=cortex-m0 \
                    -mthumb \
                    -Os \
                    -fdata-sections \
                    -c \
                    -o test.o \
                    test.c

> arm-none-eabi-objdump -dr test.o

00000000 <queue_empty>:
 0: 4b03     ldr   r3, [pc, #12]    ; (10 <queue_empty+0x10>)
 2: 6818     ldr   r0, [r3, #0]
 4: 4b03     ldr   r3, [pc, #12]    ; (14 <queue_empty+0x14>)
 6: 681b     ldr   r3, [r3, #0]
 8: 1ac0     subs  r0, r0, r3
 a: 4243     negs  r3, r0
 c: 4158     adcs  r0, r3
 e: 4770     bx    lr
    ...
             10: R_ARM_ABS32 .bss.head
             14: R_ARM_ABS32 .bss.tail

> arm-none-eabi-nm -S test.o

00000000 00000004 b head
00000000 00000018 T queue_empty
00000000 00000004 b tail

Immediately we see that the length of queue_empty has increased by four bytes to 24 bytes (0x18), and that there are now two relocations to be done in queue_empty’s literal pool. These relocations correspond to the addresses of the two BSS sections that were created, one for each global variable. There need to be two addresses here because the compiler can’t know the relative positions in which the linker will end up placing the two sections. Looking at the instructions at the beginning of queue_empty, we see that there is an extra load: the compiler has to generate a separate pair of loads for each variable, one to get the address of its section and one to get the value of the variable in that section. In this version of queue_empty the extra instruction doesn’t make the body of the function longer, because it just takes the spot that was previously a nop, but that won’t be the case in general.
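
If the extra literal pool entries matter to you, one possible way to get the shared-base addressing back while keeping -fdata-sections is to group variables that are always used together into a single object. A struct is one object, so it lands in one section and its member offsets are known at compile time. This is a sketch I have not measured with the toolchain above; the struct name queue and the helper queue_advance_head are hypothetical and only here for illustration.

```c
/* Hypothetical grouping: both indices live in one object, so with
   -fdata-sections they share a single section instead of getting one each. */
static volatile struct {
    int head;
    int tail;
} queue;

/* The compiler only needs the address of queue; head and tail are then
   reachable at fixed offsets from it. */
int queue_empty(void)
{
    return queue.head == queue.tail;
}

/* Hypothetical helper, included only so the example has a writer. */
void queue_advance_head(void)
{
    queue.head = queue.head + 1;
}
```

The trade-off is that the linker can now only keep or discard head and tail together, which is usually fine for variables that belong to the same data structure anyway.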
