AArch64 Optimization – Stage 2 – Optimization Time

Good news, I’ve gotten access to a much stronger AArch64 box, and I can now test on much more reasonable data. In celebration, I immediately went off and used the power of lossless video to acquire a ~9.3GB video file. And then it was time to run it through csnappy:

The average time for thing across 100 trials is: 8.372800311 seconds

Now that’s a lot more reasonable for a test. I’ll get x86_64 data later if it ends up being necessary, but I may not end up making any code changes that affect execution on x86_64, so let’s leave that off for now.

First trial – compiling with -O3 instead of -O2:

The average time for thing across 100 trials is: 8.348095289 seconds

It’s… technically an improvement. But it only saves about 0.025 seconds (roughly 0.3%), which makes me think it’s not worth switching to -O3. You might ask ‘why? It’s just changing a compiler flag, might as well leave it’. The answer is that a gain this small isn’t worth the risk of regressions elsewhere.

-O3 isn’t a magic speedup button; it makes changes that have tradeoffs. Generally these tradeoffs are in your favor, but let me present a scenario: you’re running a program on an old or cheap processor without much CPU cache to spare. Compiled with -O2, the program’s hot code fits in the cache and runs (relatively) well. Then you compile with -O3, hoping for that sweet speedup, and the optimizer does all kinds of cool tricks like loop unrolling that usually make code faster but also make the executable larger. And then your program runs slower, because its hot code no longer fits in the cache. Is this an edge case? Absolutely. But the gain here is so small that it’s really not worth risking things like that.
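To make the size-versus-speed trade concrete, here’s a small sketch of what unrolling looks like when done by hand. The function names are made up for illustration and none of this comes from csnappy; it only shows the shape of the transformation, not what any particular compiler will actually emit at -O3.

```c
#include <stddef.h>
#include <stdint.h>

/* Straightforward loop: one add and one loop branch per element. */
uint64_t sum_rolled(const uint8_t *data, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += data[i];
    return total;
}

/* Hand-unrolled by four (hypothetical example): fewer loop branches
 * per element, but more instructions in the binary. */
uint64_t sum_unrolled4(const uint8_t *data, size_t n)
{
    uint64_t total = 0;
    size_t i = 0;

    /* Main body: four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        total += data[i];
        total += data[i + 1];
        total += data[i + 2];
        total += data[i + 3];
    }

    /* Leftover elements when n is not a multiple of four. */
    for (; i < n; i++)
        total += data[i];

    return total;
}
```

Both functions return the same result; the second just spends more code bytes to take fewer branches. Multiply that across a whole program and the executable can grow noticeably.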

Anyway, with -O3 out of the way as a possibility, I’m moving onwards to take a look at some of the preprocessor directives. Previously, I’d noticed things like

#if defined(__arm__) && !defined(ARCH_ARM_HAVE_UNALIGNED)

It turns out that this isn’t particularly relevant for our purposes. __arm__ is not defined on AArch64 systems; __aarch64__ is defined instead. So, let’s take a look around at some other preprocessor directives and see what we can find. The first thing that sticks out to me is this:
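If you want to check which of these macros your own compiler defines, a tiny helper like the one below does the job. The function name target_arch_name is my own invention for this sketch; only the macros themselves come from the compiler.

```c
/* Returns a label for the architecture the compiler is targeting.
 * target_arch_name is a hypothetical helper, not part of csnappy. */
const char *target_arch_name(void)
{
#if defined(__aarch64__)
    return "aarch64";        /* 64-bit ARM: __arm__ is NOT defined here */
#elif defined(__arm__)
    return "arm (32-bit)";
#elif defined(__x86_64__)
    return "x86_64";
#else
    return "other";
#endif
}
```

Call it from main with something like puts(target_arch_name()). With GCC you can also dump every predefined macro without writing any code by running gcc -dM -E - < /dev/null and searching the output.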

 * Separate implementation for x86_64, for speed.  Uses the fact that
 * x86_64 is little endian.
 */
#if defined(__x86_64__)
static INLINE int
FindMatchLength(const char *s1, const char *s2, const char *s2_limit)

Hmm, that’s interesting. For context, FindMatchLength is called within the main loop of our csnappy_compress_fragment, so if we can speed it up, csnappy_compress_fragment should end up faster too. This version of FindMatchLength is followed by a different version used for everything other than x86_64. The x86_64 function uses 64-bit loads, and it looks like everything else gets 32-bit loads. The comment says the fast version relies on x86_64 being little endian, and AArch64 Linux systems normally run little endian as well. I have a feeling I can just chuck ‘|| defined(__aarch64__)’ onto the end of that #if directive and things will work faster on AArch64. Let’s try it:
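For anyone wondering why 64-bit loads help and why endianness matters, here’s a rough sketch of the idea. To be clear, this is my own illustration and not csnappy’s actual implementation: the name find_match_length_64 is hypothetical, it assumes a little-endian target, and it leans on the GCC/Clang builtin __builtin_ctzll.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch, not csnappy's code. Counts how many leading
 * bytes of s1 and s2 match, never reading s2 at or past s2_limit.
 * Assumes little-endian byte order and a GCC/Clang-style compiler. */
int find_match_length_64(const char *s1, const char *s2,
                         const char *s2_limit)
{
    int matched = 0;

    /* Compare eight bytes per iteration while at least eight remain. */
    while (s2_limit - s2 >= 8) {
        uint64_t a, b;
        memcpy(&a, s1 + matched, sizeof a);  /* safe for unaligned data */
        memcpy(&b, s2, sizeof b);
        if (a != b) {
            /* On little-endian, the lowest set bit of a ^ b lies in
             * the first byte that differs; divide by 8 for the byte. */
            return matched + (__builtin_ctzll(a ^ b) >> 3);
        }
        s2 += 8;
        matched += 8;
    }

    /* Fewer than eight bytes left: finish one byte at a time. */
    while (s2 < s2_limit && s1[matched] == *s2) {
        s2++;
        matched++;
    }
    return matched;
}
```

The trailing-zero trick is the part that depends on byte order: on a big-endian machine the first differing byte would sit at the high end of the word instead, so the same code would return the wrong length. With the change described above, the guard in csnappy would read #if defined(__x86_64__) || defined(__aarch64__).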

The average time for thing across 100 trials is: 8.298365891 seconds

Hey, that’s definitely faster. Granted, it’s a bit under a 1% speedup against the 8.37-second baseline, but it’s a speedup. And this one’s far less likely to cause side effects than jacking up the optimization to -O3, so I’m feeling a lot better about it. All we’re doing is switching from one variation of a function to another, and this won’t even affect any platform that isn’t AArch64.

So, I’ve performed an optimization. Time to send it upstream and call this done? No, not quite. First off, I need to be certain that this change hasn’t broken the library’s functionality. I’m 99% sure that it hasn’t, but good practice demands being 100% sure. I could test that now, and if I suspected that this change may have broken things, I would. But I’m going to leave that for later, and I’ll test functionality at the end (which might backfire on me if I do break something and then have to figure out which change did it, but I figure I’d probably have a pretty good idea).

As the previous paragraph implied, I’m most likely going to keep looking for other optimizations to make. I’d like to squeeze at least a few more percent out of this, if I can. I’m not sure yet which approach I’ll take next, but I have my eye on EmitLiteral (another function called within csnappy_compress_fragment) for a couple of reasons.
