AArch64 Optimization – Stage 2 – More Optimization

First off, see this if statement in EmitLiteral?

		/*
		The vast majority of copies are below 16 bytes, for which a
		call to memcpy is overkill. This fast path can sometimes
		copy up to 15 bytes too much, but that is okay in the
		main loop, since we have a bit to go on for both sides:
		- The input will always have kInputMarginBytes = 15 extra
		available bytes, as long as we're in the main loop, and
		if not, allow_fast_path = false.
		- The output will always have 32 spare bytes (see
		snappy_max_compressed_length).
		*/
		if (allow_fast_path && len <= 16) {

So, EmitLiteral is almost always called with allow_fast_path = 1. The only time it's called with allow_fast_path = 0 is after the end of the main loop, when emitting the remaining bytes. And the comment here says that 'the vast majority of copies are below 16 bytes'. I'm thinking that wrapping 'allow_fast_path && len <= 16' in the 'likely' macro may improve performance, like so:

		if (likely(allow_fast_path && len <= 16)) {

The 'likely' and 'unlikely' macros are supposed to help gcc lay out a branch according to whether it's likely or unlikely to be taken, which in turn is meant to play nicely with the processor's branch prediction. I've never really played with them before, but they're used fairly regularly throughout the csnappy code. Branch prediction is a pretty complicated topic, and I'm not going to pretend that I understand more than the very basics of it, so I'm not entirely sure how this is going to work out – it could be that modern processors are already smart enough to figure out this branch on their own. This is another shot in the dark: if it works, it'll provide an easy and noncontroversial speedup, but I figure it's equally likely that it won't provide a noticeable one.
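For reference, these macros are conventionally thin wrappers around GCC's __builtin_expect. The sketch below shows a typical definition – csnappy's own headers define them along similar lines, but check the project's headers for the exact form:

	/* Typical kernel-style definitions of likely/unlikely.
	 * This is a sketch, not the exact upstream csnappy code. */
	#if defined(__GNUC__)
	#define likely(x)   __builtin_expect(!!(x), 1)
	#define unlikely(x) __builtin_expect(!!(x), 0)
	#else
	#define likely(x)   (x)
	#define unlikely(x) (x)
	#endif

The !!(x) normalizes the expression to 0 or 1, which is what __builtin_expect expects to compare against.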

The average time for thing across 100 trials is: 8.341093851 seconds

Okay, that's not what I was expecting. I'm really not sure why this made things slower. Unless there are more >16-byte copies than the authors think, but I get the feeling that isn't the problem. I'd be interested to know why this ends up slower, but I don't think digging too far into it is worth the time investment right now. There are other possible optimizations I can make, and those are a lot more likely to produce results. So, let's move on.

The next thing that catches my eye is just inside that if statement:

			UnalignedCopy64(literal, op);

I wonder what exactly UnalignedCopy64 is doing. More specifically, I wonder whether it's actually doing 64-bit operations on AArch64, or something less efficient. UnalignedCopy64 is defined in https://github.com/zeevt/csnappy/blob/master/csnappy_internal.h so let's take a look, shall we?

static INLINE void UnalignedCopy64(const void *src, void *dst) {
#if defined(__i386__) || defined(__x86_64__) || defined(__powerpc__) || defined(ARCH_ARM_HAVE_UNALIGNED)
  if ((sizeof(void *) == 8) || (sizeof(long) == 8)) {
    UNALIGNED_STORE64(dst, UNALIGNED_LOAD64(src));
  } else {

Almost immediately we run into this. I won't paste the whole function, but underneath this there's an else clause that does the unaligned copy by copying single bytes – which is the code path we're currently taking on AArch64. There's absolutely no way this is efficient – it's fallback code, put in place to handle processors that can't perform unaligned multi-byte loads. Let's add '|| defined(__aarch64__)' to this #if directive (the modified line is shown below) and see what happens:
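Here's roughly what the modified directive looks like (the line break is mine; the original is a single line):

	/* Let AArch64 take the single 64-bit load/store path as well.
	 * The CPU handles unaligned accesses to normal memory in hardware,
	 * so UNALIGNED_LOAD64/UNALIGNED_STORE64 should compile down to a
	 * plain ldr/str pair. */
	#if defined(__i386__) || defined(__x86_64__) || defined(__powerpc__) || \
	    defined(ARCH_ARM_HAVE_UNALIGNED) || defined(__aarch64__)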

The average time for thing across 100 trials is: 8.262914045 seconds

It's an improvement, but less of one than I was hoping for. Compared to the original result of 8.372800311 seconds, we're 1.3% faster with these two optimizations. I can't say I'm entirely pleased with this – I'd hoped to squeeze at least a few percent out of it. However, I'm running out of both time (since I have a number of other things I need to be working on) and ideas.

By now I've pretty much exhausted the limits of my original plan. I didn't try the extra flags that -Ofast enables, but from reading what they do I get the feeling that they'd either be tiny gains at best, or would break functionality. -ffast-math doesn't seem like it would be much of a benefit given how little math this library does (it's largely bit shifts and simple addition/subtraction). -fno-protect-parens doesn't seem like it'll be useful for the same reason. As for -fallow-store-data-races, I'm honestly not 100% sure how it works, but 'data races' in a compression library sounds like a recipe for trouble.

I think I'm going to call it here, in terms of performing optimization. I've gone through my original plan – I examined what I could do with build flags, and then looked through every relevant preprocessor directive. I even tried something that wasn't in the plan (the 'likely' macro). At this point there's not much more I can do without going back to the drawing board and coming up with a completely new plan. In the end, I did succeed in optimizing csnappy_compress_fragment – it just wasn't as large an optimization as I had hoped for. Soon I'll write up a shorter post wrapping up and summarizing Stage II (including testing to make sure I haven't broken any functionality), and after that it'll be time to submit these two changes upstream.
