AArch64 Optimization – Stage 1 – Benchmarks and Strategy

Now that I’ve settled on a function, it’s time to benchmark it. I’ll need this baseline data later, once I start changing things. On closer inspection, the benchmark script in the makefile isn’t quite suitable for my purposes. For one thing, it references a bunch of files found (presumably) on the original developer’s computer, so I’ll have to pick a new file to test compression on. I’m going to use a PDF of the Arm® Architecture Reference Manual, both because it’s amusingly appropriate and because it’s a good size to test with (~45MB).

Looking further into the benchmarking script within the makefile, it seems designed to compare csnappy against two competing algorithms. It tests both compression and decompression, and compares them on time taken and compression ratio. It also only runs the test once – and a single run is too noisy to serve as a baseline. It seems better to make my own benchmarking script (based on the one I used to test sound scaling algorithms in previous posts). The end result looks like this:

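#!/bin/bash
# Pass the number of trials as the first argument.
TOTAL=0
# Let block_compressor find the shared library built in the current directory
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:."

for (( x = 0; x < $1; x++ )); do
	# Extract the elapsed time from the output line that mentions "seconds"
	TEMP=$(./block_compressor -c snappy testdata/DDI0487E_a_armv8_arm.pdf itmp | grep seconds | grep -o '[0-9][0-9]*\.[0-9]*')
	TOTAL=$(bc <<< "scale=9;$TOTAL+$TEMP")
done
# Average over all trials
RESULT=$(bc <<< "scale=9;$TOTAL/$1")
echo "The average time for thing across $1 trials is: $RESULT seconds"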

It’s not particularly fancy, but it gets the job done. I quickly fired up tmux and set the script to run a thousand times, and was promptly impressed by the results:

The average time for thing across 1000 trials is: .219818994 seconds

Barely more than a fifth of a second to compress ~45MB! For reference, compressing the exact same file with 7-Zip (on my home computer, which is vastly more powerful than the AArch64 box I’m using to test) takes over four seconds. That isn’t a serious comparison, and comparing snappy to other compression methods isn’t my goal here, but it’s clear that this library is blazing fast.
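To put the average in more familiar terms, it can be converted into throughput. This is only a rough sketch: the 45 is the approximate file size in MB quoted above, not an exact byte count, and throughput_mbps is a helper name made up for this example.

```shell
#!/bin/bash
# throughput_mbps SIZE_MB SECONDS
# Prints SIZE_MB / SECONDS rounded to one decimal place.
# (Hypothetical helper; not part of csnappy or its makefile.)
throughput_mbps() {
	awk -v size="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", size / secs }'
}

# ~45 MB (approximate) in the measured average time
throughput_mbps 45 .219818994   # prints 204.7
```

So the baseline works out to somewhere around 200 MB/s on this machine, give or take the rounding in the file size.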

Next, I need to decide on a strategy to optimize csnappy_compress_fragment. Interestingly enough, where I’m drawn next is to the project’s makefile ( https://github.com/zeevt/csnappy/blob/master/Makefile ). Or more specifically, to a single line within that makefile:

OPT_FLAGS = -g -O2 -DNDEBUG -fomit-frame-pointer

For optimization, we’re only building with -O2 and a single additional flag, -fomit-frame-pointer. I have a feeling that there are further optimization flags in -O3 and possibly even -Ofast that can give us a speed boost. While I didn’t think I’d be saying this, I’m going to be focusing on build flags as my primary optimization method. I’m not typically a ‘build guy’ – I tend to just throw -O2 on and call it a day. This could be a good opportunity to improve my knowledge of the compiler’s optimizations. Speaking of which, I’m not really sure why -fomit-frame-pointer is specified here – the GCC documentation suggests that it is already enabled at -O1 and higher.
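One way to see what a higher level actually adds is to ask GCC itself: gcc -Q --help=optimizers prints every optimization flag with an [enabled] or [disabled] marker for the given -O level. The sketch below compares two such dumps. The helper names (enabled_flags, new_flags) and the file names o2.txt and o3.txt are made up for illustration, and the output depends on the GCC version installed.

```shell
#!/bin/bash
# enabled_flags: read "gcc -Q --help=optimizers" output on stdin and
# print only the flags marked [enabled], sorted.
enabled_flags() {
	awk '$2 == "[enabled]" { print $1 }' | sort
}

# new_flags BASE_DUMP HIGHER_DUMP
# Prints flags that are enabled in HIGHER_DUMP but not in BASE_DUMP.
new_flags() {
	local base
	base=$(mktemp) || return 1
	enabled_flags < "$1" > "$base"
	# comm -13 keeps lines found only in the second input (stdin here)
	enabled_flags < "$2" | comm -13 "$base" -
	rm -f "$base"
}

# Typical use on the build machine (file names are arbitrary):
#   gcc -O2 -Q --help=optimizers > o2.txt
#   gcc -O3 -Q --help=optimizers > o3.txt
#   new_flags o2.txt o3.txt            # what -O3 adds over -O2
#   grep omit-frame-pointer o2.txt     # is it already on at -O2?
```

The same dump should also settle the -fomit-frame-pointer question for this particular compiler and target, rather than relying on the documentation alone.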

My plan for build flags is to try building with -O3 and then -Ofast. The first order of business will be to see if the library still functions correctly at these higher optimization levels. If it doesn’t, I’ll have to start individually enabling and disabling optimization flags until I find out which one breaks functionality. Once I know which flags to avoid, I’ll benchmark to see how much speed is gained from the higher optimization level.
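The correctness check in that plan boils down to a round trip: compress a file, decompress the result, and confirm the output is byte-identical to the input. Here is a minimal sketch. roundtrip_ok is a made-up helper, and it assumes both tools are invoked as "COMMAND input output". The decompression command is deliberately left as a placeholder, since only the compression invocation appears in the script above. The commented driver also assumes the makefile has a clean target and that OPT_FLAGS can be overridden from the make command line.

```shell
#!/bin/bash
# roundtrip_ok INPUT COMPRESS_CMD DECOMPRESS_CMD
# Runs "COMPRESS_CMD INPUT packed" then "DECOMPRESS_CMD packed unpacked"
# and succeeds only if unpacked is byte-identical to INPUT.
# The commands are left unquoted on purpose so they may carry their own flags.
roundtrip_ok() {
	local input=$1 compress=$2 decompress=$3
	local packed unpacked status
	packed=$(mktemp) && unpacked=$(mktemp) || return 1
	$compress "$input" "$packed" \
		&& $decompress "$packed" "$unpacked" \
		&& cmp -s "$input" "$unpacked"
	status=$?
	rm -f "$packed" "$unpacked"
	return $status
}

# Hypothetical driver: rebuild at each level, then verify the round trip.
# DECOMPRESS is a placeholder for whatever the real decompression command is.
#   for level in -O2 -O3 -Ofast; do
#       make clean && make OPT_FLAGS="-g $level -DNDEBUG -fomit-frame-pointer" || continue
#       if roundtrip_ok testdata/DDI0487E_a_armv8_arm.pdf "./block_compressor -c snappy" "$DECOMPRESS"; then
#           echo "$level: round trip OK"
#       else
#           echo "$level: round trip FAILED"
#       fi
#   done
```

A single PDF is a thin test, so it would be worth feeding the same check a few different inputs (an empty file, highly repetitive data, random data) before trusting a build.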

I also have a secondary plan, in case optimizing with build flags doesn’t work out (or if I have extra time and wish to try further optimization). Specifically, I don’t see any defines in the header files related to ARMv8-A (the architecture version whose 64-bit execution state is AArch64). Here’s an example:

#if defined(__ARM_ARCH_6__) || defined(__ARM_ARCH_6J__) || defined(__ARM_ARCH_6K__) || defined(__ARM_ARCH_6Z__) || defined(__ARM_ARCH_6ZK__) || defined(__ARM_ARCH_6T2__) || defined(__ARMV6__) || \
    defined(__ARM_ARCH_7__) || defined(__ARM_ARCH_7A__) || defined(__ARM_ARCH_7R__) || defined(__ARM_ARCH_7M__)
#  define ARCH_ARM_HAVE_UNALIGNED
#endif

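Early versions of ARM didn’t support unaligned memory access, but ARMv6 and later generally do. It’s possible that adding || defined(__ARM_ARCH_8A__) to this would speed things up all on its own, though I’ll need to confirm which macro the compiler actually defines on this box first (__aarch64__ is the usual test for AArch64 with GCC and Clang).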
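Before touching that #if, it’s worth checking which of these macros the compiler really predefines on the target. The preprocessor can dump its predefined macros with -dM -E, and the sketch below filters that dump down to the ARM-related names. arm_macros is a made-up helper name, and the exact list printed will vary with compiler, version, and -march settings. __ARM_FEATURE_UNALIGNED comes from the Arm C Language Extensions and indicates that the compiler may generate unaligned accesses.

```shell
#!/bin/bash
# arm_macros: filter a predefined-macro dump (on stdin) down to the
# macros relevant to the unaligned-access check.
arm_macros() {
	grep -E '__aarch64__|__ARM_ARCH|__ARM_FEATURE_UNALIGNED' | sort
}

# Typical use on the AArch64 box (assumes "cc" is gcc- or clang-compatible):
#   cc -dM -E -x c /dev/null | arm_macros
```

If a name such as __ARM_ARCH_8A__ doesn’t show up in the output, a condition based on it will silently do nothing, which is why testing for a macro that is actually present matters here.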

Anyway, now that I’ve found a project, found a function to optimize, obtained some baseline benchmark results, and developed a strategy for optimizing it, I think I can call “Stage 1” of this process complete. Stage 2 will involve actually performing the optimization, making sure it functions properly, and benchmarking again to evaluate the improvement (or lack thereof). I’ll also need to make sure I don’t cause any regressions on other platforms (specifically x86_64 – I have neither the time nor the boxes to test every possible architecture).
