Some Assembly Required

Have you ever wondered what it’s like to code directly in assembly? The answer isn’t ‘barrels of fun’, but I believe it’s a worthwhile experience for any serious programmer. The more you understand at the low level, the better you’ll be at using higher level languages efficiently. Anyway, without further ado, here are a couple of programs I wrote in assembly. It’s actually the same program written twice: once for x86_64, and once for AArch64.

https://drive.google.com/open?id=1QznWAQS-pui7dvQND7o4VnxcxOTuOIdT

Try assembling and running them yourself (replace “loop_x86.s” with “loop_aarch64.s” if you’re on an AArch64 machine).

as -g -o loop.o loop_x86.s
ld    -o loop   loop.o
./loop

As a side note, we’re doing half of the usual compilation process that would happen in a C program. Normally, you go through preprocessing (directives like #include and #define), compilation, assembling, and linking. Here, we’re calling the last two steps directly – assembling and then linking.

With that aside, take a look at the source code for these programs. They’re about fifty lines of code each, and unless you’re used to reverse engineering or hand writing assembly, they probably don’t look terribly simple. Want to know the equivalent C code to these assembly programs? Here you go:

#include <stdio.h>

int main (void)
{
        for (int i = 0; i < 31; i++)
        {
                printf("Loop: %2d\n", i);
        }
}

Pretty dramatic difference, isn’t it? Right now I can tell you I’m awfully glad for the existence of gcc (and other C compilers). Handwriting assembly takes far more time and effort than writing C code (which is often considered too low level by programmers), and that’s without getting into what debugging assembly is like.

In higher level languages like C++, you often (or at least sometimes) get a nice error when your program crashes. For example, you might have a nice popup saying something like this:

Debug assertion failed! Vector iterator not dereferenceable.

Well that’s pretty easy to figure out, looks like we tried to dereference an invalid iterator. Chances are, you could probably guess what the problem is and quickly fix it without even having to use a debugger (at least if you’re experienced in working with the language). Meanwhile, when a handwritten assembly program crashes, you tend to get messages like:

Segmentation fault
Illegal instruction

While these errors do tell you what went wrong, it’s far more difficult to tell why it went wrong. As you can imagine, debugging code written in assembly is a lot more challenging than debugging code in higher level languages. It doesn’t help that it’s a lot harder to follow the flow of an assembly program in general – ‘what was in that register again?’ as opposed to having nice human readable variable names.

So, now that we’ve talked about the experience of programming in assembly a bit, let’s move on to a related topic: the experience of writing x86_64 assembly vs AArch64. I’ll get the big question out of the way first, because the answer here is fairly clear: handwriting assembly is generally easier in AArch64 than in x86_64 (and from previous experience, this holds true for the 32-bit and older variants as well).

So why is writing AArch64 assembly easier? For a few reasons. The first being registers: x86_64’s original registers are lettered a through d (rax, rbx, rcx, rdx), then you have four more with their own names (rbp, rsp, rsi, rdi), and then we get numbered registers from r8 to r15. Meanwhile in AArch64, you have… registers 0 through 30, written x0-x30 for the full 64 bits or w0-w30 for the low 32 bits, with register 31 being special (stack pointer or zero register, depending on the instruction).

Okay, so registers are more complicated in x86_64. Let’s compare an instruction across the two, then (using examples from my assembly programs above). Here’s x86_64’s unsigned divide in action:

mov		%r15,%rax
mov		$10,%r9
mov		$0,%rdx	/* because div opcode divides rdx:rax by the provided operand */
div		%r9	/* result ends up in rax, remainder in rdx */

As you can see, x86_64’s div takes input from specific registers, and provides output in specific registers.

mov	x3, x10       /* loop index is stored in x10 */
mov	x0, 10
udiv	x3, x3, x0    /* now we have the first digit in x3 */

Meanwhile, AArch64’s udiv… just divides what’s in the second param by what’s in the third param and places it in the first param. It’s clearly far easier to remember and to use than x86_64’s div, since you don’t have to worry about putting everything into specific registers beforehand.
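If the rdx:rax business feels abstract, here is a rough C model of what x86_64’s div computes. This is only a sketch of the arithmetic, not the real instruction: the names model_div and div_result are made up for illustration, and unsigned __int128 is a GCC/Clang extension standing in for the 128-bit rdx:rax pair.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of x86_64's unsigned `div` (arithmetic only). */
typedef struct {
    uint64_t quotient;   /* what div leaves in rax */
    uint64_t remainder;  /* what div leaves in rdx */
} div_result;

div_result model_div(uint64_t rdx, uint64_t rax, uint64_t divisor)
{
    /* The real instruction raises a divide error if the divisor is zero
     * or the quotient does not fit in 64 bits; the quotient fits exactly
     * when rdx < divisor. The model simply asserts instead. */
    assert(divisor != 0 && rdx < divisor);

    /* The dividend is the 128-bit value rdx:rax (rdx is the high half). */
    unsigned __int128 dividend = ((unsigned __int128)rdx << 64) | rax;

    div_result r;
    r.quotient  = (uint64_t)(dividend / divisor);
    r.remainder = (uint64_t)(dividend % divisor);
    return r;
}
```

Calling model_div(0, 27, 10) gives a quotient of 2 and a remainder of 7. Leave a stray 1 in rdx and the dividend becomes 2^64 + 27, so the quotient is nowhere near 2, which is why the listing zeroes rdx before dividing.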

However, I want to point something else out here, because this isn’t a fully clear cut case of ‘x86_64 bad, AArch64 good’. In this case, we don’t just want to divide the number by 10: we want both the quotient and the remainder, so we know what each digit is and can put them into the message. Here’s the rest of the code in x86_64:

add	$0x30,%rdx
add	$0x30,%rax
movb	%al,numberLocation-1	/* specifying the register as 'al' instead of 'rax' means we only copy the low byte */
movb	%dl,numberLocation	/* same principle here, but with 'dl' instead of 'rdx' */

Since x86_64’s div gives us the remainder in rdx, all we need to do is add 0x30 (the ASCII code for '0') to both values and slap them into the message and we’re good to go. So now let’s see what we need to do in AArch64:

mov	x4, x3
mul	x4, x4, x0
sub	x4, x10, x4	/* and now we have the second digit in x4 */

adr	x2, msg		/* and now we can store them back into the message */
add	x3, x3, 0x30
strb	w3,[x2,6]	/* first digit is 6 characters into the string */
add	x4, x4, 0x30
strb	w4,[x2,7]	/* second is 7 */

Hm. We had to do another mov, a multiply and a subtract to get what we wanted – a value that was just there waiting for us after the original div in x86_64. There’s also something else I’d like to point out here. In the x86_64 code, we directly reference a memory address (numberLocation is a constant) in our movb opcodes, whereas in AArch64 we have to load the memory address (msg is a constant as well) into a register before we can use it in other opcodes. This trend holds true in general for x86_64 (many opcodes can directly access memory locations), and for AArch64 (where you have to explicitly perform loads/stores beforehand).
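For comparison, here is roughly what the two approaches look like when written out in C. It is a sketch under a few assumptions: the function names are hypothetical, the message is assumed to look like "Loop: ##\n" with the digits at offsets 6 and 7 (taken from the comments in the AArch64 listing), and the x86_64 version writes to those same offsets in place of numberLocation-1 and numberLocation.

```c
#include <assert.h>
#include <stdint.h>

/* x86_64 style: one div hands back both quotient (rax) and remainder (rdx). */
void digits_x86_style(uint64_t index, char *msg)
{
    assert(index < 100);                 /* only two digits fit in the message */
    uint64_t tens = index / 10;          /* rax after div */
    uint64_t ones = index % 10;          /* rdx after div */
    msg[6] = (char)(tens + 0x30);        /* 0x30 is ASCII '0' */
    msg[7] = (char)(ones + 0x30);
}

/* AArch64 style: udiv gives only the quotient, so the remainder is
 * rebuilt as index - quotient * 10. */
void digits_aarch64_style(uint64_t index, char *msg)
{
    assert(index < 100);
    uint64_t tens = index / 10;          /* udiv x3, x3, x0 */
    uint64_t ones = index - tens * 10;   /* mov, mul, sub   */
    msg[6] = (char)(tens + 0x30);        /* strb w3, [x2, 6] */
    msg[7] = (char)(ones + 0x30);        /* strb w4, [x2, 7] */
}
```

Both functions write the same two characters for any index from 0 to 99; the only difference is where the remainder comes from.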

While I’m here, I’d like to point out something that annoys me.

mov	x4, x3

This is an AArch64 mov instruction. It moves the contents of x3 into x4 (destination first, source second).

mov	%r15,%r8

This is an x86_64 mov instruction. It moves the contents of r15 into r8: in the AT&T syntax that GNU as uses by default, the source comes first and the destination second, the reverse of AArch64. This naturally leads to screwups when you go from writing one kind of assembly to the other. It probably caused about half my problems while writing the x86_64 program (since I am more used to the AArch64 style from previous assembly work with older ARM processors). For added fun:

strb	w3, [x2,7]

This AArch64 opcode stores the low byte of x3 (w3 is the 32-bit name for the same register) to the memory address held in x2, plus 7 bytes. Unlike mov, the source comes first here. I never found this too bad (since the [] enclosing the second param makes it obvious that it’s a dereference), but I can see how it would confuse people who are newer to assembly.
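In C terms, that store behaves roughly like the sketch below. The function name model_strb is made up, and the parameters are simply named after the registers in the example.

```c
#include <stdint.h>

/* Hypothetical model of `strb w3, [x2, 7]`: write the low byte of the
 * source register to the address base + offset. */
void model_strb(uint64_t x3, uint8_t *x2, int offset)
{
    x2[offset] = (uint8_t)x3;   /* only bits 0-7 of the register are stored */
}
```

So if the register held 0x1234, only 0x34 would land in memory, and the neighbouring bytes are left alone.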
