The Many Faces of Hello World

Today we’re going to be taking a look at Hello World. Yes, you heard that right – the most basic program possible.

#include <stdio.h>

int main(void)
{
  printf("Hello World\n");
}

Let’s compile it (with a few choice compiler flags):

gcc -g -O0 -fno-builtin helloworld.c -o helloworld

-g means ‘give me debugging symbols’
-O0 turns off optimization, and -fno-builtin stops gcc from substituting its own built-in versions of standard library functions
-o helloworld gives us ‘helloworld’ as the output name (instead of a.out)

So let’s take a look at the result.

objdump -h helloworld

I’m not going to paste the whole output here (since it’s huge), but a couple of sections that are of immediate interest are .text and .rodata. You might think that .text would hold strings or constants and .rodata would hold executable code, right? Well, it’s the opposite! .text is where the executable code is stored, and .rodata is where read-only data such as string constants is stored, so be careful not to mix them up. With that out of the way, let’s go a bit deeper and look at the machine code our compile command produced (shown as human readable assembly language).

objdump -d helloworld
0000000000401126 <main>:
  401126:       55                      push   %rbp
  401127:       48 89 e5                mov    %rsp,%rbp
  40112a:       bf 10 20 40 00          mov    $0x402010,%edi
  40112f:       b8 00 00 00 00          mov    $0x0,%eax
  401134:       e8 f7 fe ff ff          callq  401030 <printf@plt>
  401139:       b8 00 00 00 00          mov    $0x0,%eax
  40113e:       5d                      pop    %rbp
  40113f:       c3                      retq

The first two instructions are stack management – they set up a new stack frame for the main function. In the interests of staying on topic, I won’t go into depth on what the stack is; check out https://en.wikipedia.org/wiki/Call_stack if you’re unfamiliar with it. The third instruction sets up the function parameter for our printf("Hello World\n"): the first argument to a function goes in edi (the low half of rdi). $0x402010 is the address of "Hello World\n", which resides in the previously mentioned .rodata (you can see this if you do objdump -s helloworld).
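As a quick aside, here is a minimal sketch of how source-level things map onto those two sections. The names greeting, get_greeting and print_greeting are made up for illustration, and the section placement in the comments is what gcc normally does on Linux rather than something the C language guarantees.

```c
#include <stdio.h>

/* A constant string: gcc normally places read-only data like this
   in .rodata. */
static const char greeting[] = "Hello World\n";

/* The machine code generated for this function is emitted into .text. */
const char *get_greeting(void)
{
    return greeting;
}

/* Also code, so also .text; only the bytes of the string live in .rodata. */
void print_greeting(void)
{
    fputs(get_greeting(), stdout);
}
```

If you compile this to an object file with gcc -c, objdump -s -j .rodata on the result should show the bytes of the string, and objdump -d should show the two functions under .text.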

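The next instruction stores the value 0 in the register eax (all you really need to know at this point is that registers are memory within the CPU and function return results are stored in eax). Why zero it before the call? printf takes a variable number of arguments, and the x86_64 calling convention on Linux expects al (the low byte of eax) to say how many vector registers were used to pass floating point arguments to such a function. We pass none, so it’s 0. Anyway, we’re halfway through the function by now – all that’s left is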

  401134:       e8 f7 fe ff ff          callq  401030 <printf@plt>
  401139:       b8 00 00 00 00          mov    $0x0,%eax
  40113e:       5d                      pop    %rbp
  40113f:       c3                      retq

It’s not hard to guess that the first instruction here calls printf. What’s @plt, though? It stands for Procedure Linkage Table (and is the .plt section in our executable). So, what is it? Well, the vast majority of programs are ‘dynamically linked’, which means that they call functions that live in external shared libraries. The functions declared in stdio.h aren’t contained in our executable – they’re in the C library, and the PLT holds the small stubs used to reach them at run time. So basically, <printf@plt> means we’re calling printf from an external library.
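To give a rough feel for what a PLT stub does, here is a loose analogy in plain C. It is not how the dynamic linker is actually implemented. It only mimics the idea of calling through a slot that gets patched the first time it is used, which is what happens when lazy binding is in effect. All of the names (slot, resolve_then_call, strlen_via_stub, get_resolver_calls) are hypothetical.

```c
#include <stddef.h>
#include <string.h>

typedef size_t (*strlen_fn)(const char *);

static size_t resolve_then_call(const char *s);

/* Stand-in for the table slot a PLT stub jumps through.
   It starts out pointing at the resolver. */
static strlen_fn slot = resolve_then_call;

/* Counts how often the slow "go and look it up" path runs. */
static int resolver_calls = 0;

/* Stand-in for the dynamic linker: find the real function,
   patch the slot, then forward the call. */
static size_t resolve_then_call(const char *s)
{
    resolver_calls++;
    slot = strlen;
    return slot(s);
}

/* Stand-in for the PLT stub: it always jumps through the slot. */
size_t strlen_via_stub(const char *s)
{
    return slot(s);
}

int get_resolver_calls(void)
{
    return resolver_calls;
}
```

The first call to strlen_via_stub goes through the resolver; every later call goes straight to strlen, so get_resolver_calls() stays at 1.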

With that out of the way, we’re placing the value 0 into eax again (except this time I know why). We didn’t put a return statement in helloworld.c, but main is special: since C99, reaching the closing brace of main is the same as returning 0 (which represents ‘program exited successfully’, by the way), so the compiler fills that in for us. After that, the second to last instruction restores the caller’s rbp that we saved at the start of main, and the last instruction returns from the function.

So, now that we’ve taken a look at our helloworld program (unoptimized, dynamically linked with debugging symbols), let’s compile it in different ways and see what changes. First off, let’s statically link it (if this doesn’t work, you may need to install the package ‘glibc-static’ on your machine):

gcc -g -O0 -fno-builtin --static helloworld.c -o staticworld

Now let’s take a look at what we’ve got.

-rwxrwxr-x. 1 bgreenham bgreenham  24K Sep  8 10:37  helloworld
-rwxrwxr-x. 1 bgreenham bgreenham 1.7M Sep  8 11:26  staticworld

Wow. We shot up from 24K to a whopping 1.7M. Not surprising, given that our executable now has to contain printf and everything it depends on from the C library, rather than leaving that to a shared library. Furthermore, running objdump -h on staticworld reveals several new sections starting with __libc. Interestingly, .plt is still there, with a bunch of identical entries. I don’t understand the ELF format enough to know why, but a Google search suggests that it’s for performance reasons. Anyway, let’s check out the disassembly:

0000000000401bb5 <main>:
  401bb5:       55                      push   %rbp
  401bb6:       48 89 e5                mov    %rsp,%rbp
  401bb9:       bf 10 00 48 00          mov    $0x480010,%edi
  401bbe:       b8 00 00 00 00          mov    $0x0,%eax
  401bc3:       e8 f8 72 00 00          callq  408ec0 <_IO_printf>
  401bc8:       b8 00 00 00 00          mov    $0x0,%eax
  401bcd:       5d                      pop    %rbp
  401bce:       c3                      retq
  401bcf:       90                      nop

Mostly the same. Ignore the nop at the end; it’s padding so that whatever follows main starts at an aligned address (https://en.wikipedia.org/wiki/Data_structure_alignment if you’re interested). The callq now targets <_IO_printf> directly rather than going through the PLT. That’s expected, since we’re now just calling a function within the program rather than an external one.
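If you’re wondering how objdump gets from the raw bytes e8 f7 fe ff ff to the address 401030, the five-byte call instruction stores a signed 32 bit offset relative to the address of the next instruction. Here is a small sketch of that arithmetic; call_target is a hypothetical helper name, and it only handles this one encoding (opcode e8).

```c
#include <stdint.h>

/* Compute the destination of a 5-byte x86-64 "call rel32" (opcode 0xe8).
   insn_addr is the address of the opcode byte, and bytes holds the five
   instruction bytes as objdump prints them. Returns 0 for any other opcode. */
uint64_t call_target(uint64_t insn_addr, const uint8_t bytes[5])
{
    if (bytes[0] != 0xe8)
        return 0;

    /* The displacement is stored little-endian after the opcode. */
    uint32_t raw = (uint32_t)bytes[1]
                 | (uint32_t)bytes[2] << 8
                 | (uint32_t)bytes[3] << 16
                 | (uint32_t)bytes[4] << 24;

    /* Sign-extend the 32-bit displacement to 64 bits. */
    uint64_t disp = raw;
    if (raw & 0x80000000u)
        disp |= 0xffffffff00000000ull;

    /* The offset is relative to the instruction after the call;
       unsigned wraparound takes care of negative displacements. */
    return insn_addr + 5 + disp;
}
```

Feeding it the call from the dynamically linked listing (address 0x401134) gives 0x401030, and the call from the static listing (address 0x401bc3, bytes e8 f8 72 00 00) gives 0x408ec0, matching what objdump printed.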

Onwards. Let’s remove -fno-builtin next. -fno-builtin prevents the compiler from performing some optimizations related to C standard library functions.

gcc -g -O0 helloworld.c -o helloworld

Now let’s look at the disassembly:

0000000000401126 <main>:
  401126:       55                      push   %rbp
  401127:       48 89 e5                mov    %rsp,%rbp
  40112a:       bf 10 20 40 00          mov    $0x402010,%edi
  40112f:       e8 fc fe ff ff          callq  401030 <puts@plt>
  401134:       b8 00 00 00 00          mov    $0x0,%eax
  401139:       5d                      pop    %rbp
  40113a:       c3                      retq
  40113b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

Ignore the nopl, it’s a five byte long ‘nop’. So, our callq has changed from printf to puts. Why? Because printf is a complicated function: it has to scan its format string for conversion specifiers like %d before it can print anything. Our string contains no specifiers and ends with a newline, so all that extra processing is unnecessary. With builtins enabled, gcc recognizes this (even at -O0) and calls puts instead, which prints the string and appends the newline itself.
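The condition for this particular substitution can be sketched as a small check. This is a simplification for illustration, not gcc’s actual logic: it only covers the case seen in this post (a literal format string with no conversions and a trailing newline), and could_become_puts is a made-up name.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Rough test for "printf(fmt) with no other arguments could be puts".
   Simplified: gcc knows about more cases than this sketch does. */
bool could_become_puts(const char *fmt)
{
    size_t len = strlen(fmt);

    /* Empty and single-character strings are not covered by this sketch. */
    if (len < 2)
        return false;

    /* Any '%' means printf has real formatting work to do. */
    if (strchr(fmt, '%') != NULL)
        return false;

    /* puts always appends a newline, so the string must end with one. */
    return fmt[len - 1] == '\n';
}
```

could_become_puts("Hello World\n") is true, while the same string without the trailing newline, or any string containing %d, gives false.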

Onwards. Let’s remove the debugging symbols next.

gcc -O0 helloworld.c -o nodebugworld

Taking a look at the size:

-rwxrwxr-x. 1 bgreenham bgreenham  25K Sep  8 11:49  helloworld
-rwxrwxr-x. 1 bgreenham bgreenham  22K Sep  8 11:56  nodebugworld
-rwxrwxr-x. 1 bgreenham bgreenham 1.7M Sep  8 11:26  staticworld

It’s smaller. 3K might not seem like much, but that’s about 12% smaller – fairly noticeable. Taking a look at objdump -h reveals that we’ve got 5 fewer sections, as you’d expect (everything for debugging is gone). Now, the disassembly:

0000000000401126 <main>:
  401126:       55                      push   %rbp
  401127:       48 89 e5                mov    %rsp,%rbp
  40112a:       bf 10 20 40 00          mov    $0x402010,%edi
  40112f:       e8 fc fe ff ff          callq  401030 <puts@plt>
  401134:       b8 00 00 00 00          mov    $0x0,%eax
  401139:       5d                      pop    %rbp
  40113a:       c3                      retq
  40113b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

Hmm. No obvious differences (yet at least). Onwards, then. Let’s try modifying our helloworld program:

#include <stdio.h>

int main(void)
{
  printf("Hello World\n%d %d %d %d %d %d %d %d %d %d\n", 1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
}

gcc -O0 helloworld.c -o helloworld

Let’s go straight to the disassembly:

0000000000401126 <main>:
  401126:       55                      push   %rbp
  401127:       48 89 e5                mov    %rsp,%rbp
  40112a:       48 83 ec 08             sub    $0x8,%rsp
  40112e:       6a 0a                   pushq  $0xa
  401130:       6a 09                   pushq  $0x9
  401132:       6a 08                   pushq  $0x8
  401134:       6a 07                   pushq  $0x7
  401136:       6a 06                   pushq  $0x6
  401138:       41 b9 05 00 00 00       mov    $0x5,%r9d
  40113e:       41 b8 04 00 00 00       mov    $0x4,%r8d
  401144:       b9 03 00 00 00          mov    $0x3,%ecx
  401149:       ba 02 00 00 00          mov    $0x2,%edx
  40114e:       be 01 00 00 00          mov    $0x1,%esi
  401153:       bf 10 20 40 00          mov    $0x402010,%edi
  401158:       b8 00 00 00 00          mov    $0x0,%eax
  40115d:       e8 ce fe ff ff          callq  401030 <printf@plt>
  401162:       48 83 c4 30             add    $0x30,%rsp
  401166:       b8 00 00 00 00          mov    $0x0,%eax
  40116b:       c9                      leaveq
  40116c:       c3                      retq
  40116d:       0f 1f 00                nopl   (%rax)

Significant differences. I won’t go over them line by line, but some general things to take away from this:
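– The first 6 integer or pointer arguments to a function get stored in registers (rdi, rsi, rdx, rcx, r8 and r9, in that order). Here that’s the format string plus the values 1 through 5; the remaining five values get pushed onto the stack, last argument first. Note that this is the calling convention for 64 bit x86 on Linux, not a universal rule.
– The stack grows downwards, so sub $0x8,%rsp allocates 8 bytes. That isn’t storage for the parameters (the pushq instructions take care of those); it’s padding that keeps the stack pointer 16-byte aligned at the call. After printf returns, add $0x30,%rsp throws all of it away in one go: five 8-byte pushes plus the 8 bytes of padding comes to 48 bytes, or 0x30. The leaveq instruction is related to stack management as well: it restores rsp from rbp and then pops the saved rbp.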
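The arithmetic behind those numbers can be written down as a short sketch. It assumes the 64 bit Linux calling convention, only integer or pointer arguments, and a stack pointer that is 16-byte aligned before any padding or pushes (which is the situation in the -O0 listing right after push %rbp). The function names are made up for illustration.

```c
#include <stddef.h>

enum {
    INT_ARG_REGS = 6,   /* rdi, rsi, rdx, rcx, r8, r9 */
    SLOT_BYTES   = 8,   /* each pushed argument takes 8 bytes */
    CALL_ALIGN   = 16   /* rsp must be 16-byte aligned at the call */
};

/* How many of n integer/pointer arguments travel in registers. */
size_t args_in_registers(size_t n)
{
    return n < INT_ARG_REGS ? n : INT_ARG_REGS;
}

/* How many of them have to be pushed onto the stack. */
size_t args_on_stack(size_t n)
{
    return n > INT_ARG_REGS ? n - INT_ARG_REGS : 0;
}

/* Padding needed so that pushes + padding is a multiple of 16,
   assuming rsp was 16-byte aligned beforehand. */
size_t alignment_padding(size_t n)
{
    size_t pushed = args_on_stack(n) * SLOT_BYTES;
    return (CALL_ALIGN - pushed % CALL_ALIGN) % CALL_ALIGN;
}

/* Bytes the caller has to add back to rsp after the call returns. */
size_t cleanup_bytes(size_t n)
{
    return args_on_stack(n) * SLOT_BYTES + alignment_padding(n);
}
```

Our printf call has 11 arguments (the format string plus ten integers), so this gives 6 in registers, 5 on the stack, 8 bytes of padding and 48 (0x30) bytes of cleanup, matching the sub $0x8 and add $0x30 in the listing.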

Now, let’s change helloworld.c some more:

#include <stdio.h>

void doStuff(void)
{
  printf("Hello World\n%d %d %d %d %d %d %d %d %d %d\n", 1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
}

int main(void)
{
  doStuff();
}

gcc -O0 helloworld.c -o helloworld

Disassembly:

0000000000401126 <doStuff>:
  401126:       55                      push   %rbp
  401127:       48 89 e5                mov    %rsp,%rbp
  40112a:       48 83 ec 08             sub    $0x8,%rsp
  40112e:       6a 0a                   pushq  $0xa
  401130:       6a 09                   pushq  $0x9
  401132:       6a 08                   pushq  $0x8
  401134:       6a 07                   pushq  $0x7
  401136:       6a 06                   pushq  $0x6
  401138:       41 b9 05 00 00 00       mov    $0x5,%r9d
  40113e:       41 b8 04 00 00 00       mov    $0x4,%r8d
  401144:       b9 03 00 00 00          mov    $0x3,%ecx
  401149:       ba 02 00 00 00          mov    $0x2,%edx
  40114e:       be 01 00 00 00          mov    $0x1,%esi
  401153:       bf 10 20 40 00          mov    $0x402010,%edi
  401158:       b8 00 00 00 00          mov    $0x0,%eax
  40115d:       e8 ce fe ff ff          callq  401030 <printf@plt>
  401162:       48 83 c4 30             add    $0x30,%rsp
  401166:       90                      nop
  401167:       c9                      leaveq
  401168:       c3                      retq

0000000000401169 <main>:
  401169:       55                      push   %rbp
  40116a:       48 89 e5                mov    %rsp,%rbp
  40116d:       e8 b4 ff ff ff          callq  401126 <doStuff>
  401172:       b8 00 00 00 00          mov    $0x0,%eax
  401177:       5d                      pop    %rbp
  401178:       c3                      retq
  401179:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)

About what we’d expect. main now calls doStuff(), which has its own stack management. Let’s see how it changes if we enable optimization with -O3:

gcc -O3 helloworld.c -o helloworld

0000000000401140 <doStuff>:
  401140:       48 83 ec 10             sub    $0x10,%rsp
  401144:       b9 03 00 00 00          mov    $0x3,%ecx
  401149:       ba 02 00 00 00          mov    $0x2,%edx
  40114e:       31 c0                   xor    %eax,%eax
  401150:       6a 0a                   pushq  $0xa
  401152:       41 b9 05 00 00 00       mov    $0x5,%r9d
  401158:       41 b8 04 00 00 00       mov    $0x4,%r8d
  40115e:       be 01 00 00 00          mov    $0x1,%esi
  401163:       6a 09                   pushq  $0x9
  401165:       bf 10 20 40 00          mov    $0x402010,%edi
  40116a:       6a 08                   pushq  $0x8
  40116c:       6a 07                   pushq  $0x7
  40116e:       6a 06                   pushq  $0x6
  401170:       e8 bb fe ff ff          callq  401030 <printf@plt>
  401175:       48 83 c4 38             add    $0x38,%rsp
  401179:       c3                      retq
  40117a:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)

0000000000401040 <main>:
  401040:       48 83 ec 08             sub    $0x8,%rsp
  401044:       e8 f7 00 00 00          callq  401140 <doStuff>
  401049:       31 c0                   xor    %eax,%eax
  40104b:       48 83 c4 08             add    $0x8,%rsp
  40104f:       c3                      retq

Hmm. I’d thought it was going to inline the function, but it didn’t for some reason. Let’s change helloworld.c and do another quick test with the same compiler flags:

#include <stdio.h>

void doStuff(void)
{
  printf("Hello World\n");
}

int main(void)
{
  doStuff();
}

0000000000401040 <main>:
  401040:       48 83 ec 08             sub    $0x8,%rsp
  401044:       bf 10 20 40 00          mov    $0x402010,%edi
  401049:       e8 e2 ff ff ff          callq  401030 <puts@plt>
  40104e:       31 c0                   xor    %eax,%eax
  401050:       48 83 c4 08             add    $0x8,%rsp
  401054:       c3                      retq

Aha. No function call. Maybe putting a 10 parameter function call in there causes the compiler to consider the function too complicated to be worth inlining? Apparently gcc isn’t as aggressive about inlining functions as I thought. Either way, there’s something else to notice here: the optimized code doesn’t bother setting up a frame pointer (there’s no push %rbp / mov %rsp,%rbp pair), it just adds/subtracts space to/from the stack as necessary.

Anyway, that’s about it for examining hello world. One final note is that these results are for x86_64. If you’re trying this out on an AArch64 machine (or a 32 bit setup), you’ll get different disassembly. Just as one example, in AArch64 all instructions are the same length (4 bytes). This is as opposed to x86, where instructions can be different lengths (and the same instruction can be encoded in a different number of bytes depending on whether its operands are registers, immediates or memory addresses). There are a lot more differences, but that’s beyond the scope of this post so I’ll leave it at one example. If you’re bored, have some free time and access to an AArch64 box, go ahead and try it yourself.
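To make the instruction length point concrete, here is a tiny sketch that adds up the byte lengths of the eight instructions in the very first main listing. The array and function names are hypothetical, and the fixed-width figure is just count times 4; it says nothing about how many instructions an AArch64 compiler would actually emit for the same program.

```c
#include <stddef.h>

/* Byte lengths of the eight instructions in the first <main> listing,
   read off the hex column of the objdump output. */
const size_t first_main_lengths[8] = {1, 3, 5, 5, 5, 5, 1, 1};

/* Total size of a run of variable-length instructions. */
size_t total_bytes(const size_t *lengths, size_t count)
{
    size_t total = 0;
    for (size_t i = 0; i < count; i++)
        total += lengths[i];
    return total;
}

/* Size of the same number of fixed-width, 4-byte instructions. */
size_t fixed_width_bytes(size_t count)
{
    return count * 4;
}
```

total_bytes(first_main_lengths, 8) comes to 26, which is exactly the distance from 0x401126 to the byte after the retq at 0x40113f. Eight fixed-width instructions would always take 32 bytes, whatever they were.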
