AArch64 Optimization – Conclusion

It seems like my pull request has been merged with no further ado. That’s not surprising, given how simple the changes were in the end. All I ended up doing was making the compiler use different parts of the code on AArch64 – either way, the code being executed is well tested.

I guess this will be a short post – I don’t really have anything else to say on this topic. Everything’s been covered in previous posts. It’s been an interesting journey, and in the end I’d call the process a success, even if not by the magnitude I was aiming for.

AArch64 Optimization – Stage 3 – Getting It Upstream

So, now that I’ve finished my optimizations, it’s time to send them upstream. How does this work? Different projects have different methods of submission, but one of the most common (and the one that’s relevant for my situation) is the pull request on GitHub. Basically, you fork the project (creating a copy of it on your own account), make some changes, and then open a pull request asking upstream to merge those changes into their repository (there’s a rough command-line sketch of this after the list below). In practice, you can generally expect one of four things to happen when you make a pull request:

  • They fully approve of your changes, and hit the ‘merge’ button without any further discussion. Ideal, though unlikely to happen unless you’ve discussed it ahead of time (or are already familiar with submitting code to the project).
  • They don’t approve it immediately. Instead, they request that you make changes before the pull request is accepted. These changes can be anything from small tweaks (like coding style changes, or other simple things) to more fundamental reworks that will take more effort. I’d say this is probably the most common result – expect that your pull request won’t be perfect on the first try.
  • They don’t like your changes at all and close the pull request without merging it. Assuming your pull request is reasonably thought out (and that you’ve tested it properly), this doesn’t usually happen.
  • There’s no reply and the pull request simply sits open on GitHub. For slower projects this isn’t too surprising. If the project isn’t completely dead, this will likely be replaced by one of the first three options eventually. People maintaining open source get busy too, especially on a project with a single developer (or only a few).
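In concrete terms, the mechanical side of that workflow usually looks something like this (the username and branch name below are placeholders of my own, not taken from my actual pull request):

# fork the repository on GitHub's web interface first, then:
git clone https://github.com/<your-username>/csnappy.git
cd csnappy
git checkout -b aarch64-optimizations    # placeholder branch name
# ...edit, commit, repeat...
git push -u origin aarch64-optimizations
# finally, open the pull request from your fork's page on GitHub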

As a side note, I would have communicated more with the csnappy developers before this point, but they don’t seem to have an IRC channel or any other contact information, so I’m not sure where the project is discussed. In retrospect, I could have opened an issue on GitHub, though issues are typically used more as bug reports for the developer to solve, so I didn’t think of it at the time.

Anyway, here are some tips for a successful pull request:

  • Make one commit per change, and use a descriptive commit message. For example, I used ‘Use 64bit load/store for UnalignedCopy64 on AArch64’ as my second commit message – reading it gives you a pretty good idea of what’s going on. You can also elaborate a bit in the description field; in this case, I put ‘This results in a slight performance gain.’ Doing this makes it easy for other people to know what’s going on and what you’ve changed. (There’s a quick command-line example of this after the list.)
  • If your project has guidelines for submissions, look them up and follow them. There aren’t any that I can find for csnappy, so I don’t have much more to say here (plus, I already did a couple blog posts about this near the beginning of this blog). Another thing worth looking for is coding style guidelines – following those can save you time, and save the people running the project time.
  • Have a glance through previous pull requests on the project to get an idea of what to expect. This can be especially useful if the project does not have posted submission guidelines.
  • Provide detail, but don’t provide irrelevant detail. My pull request is simply titled ‘Minor AArch64 Optimizations’, with the text ‘From my tests, these changes increase compression speed by roughly 1.3%.’ You don’t need to explain everything about the process – just what you’ve done, why, and what the results are. If this were a more in-depth or potentially controversial pull request, I would spend more time explaining what I’ve done and why it’s worth it, but for something this simple, a one-line description (plus the commits themselves) should suffice.
  • If you’re contributing multiple things, split them up into multiple pull requests. This makes it easier to test and handle your pull requests. For example, if I was performing significant optimizations and code changes to both the compression and the decompression algorithms, it would make sense to do separate pull requests. Though, if both were simply adding AArch64 to existing preprocessor defines, I’d likely keep them as a single pull request.
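As for the commit message tip above, here’s the command-line form – passing -m twice gives git a title line plus a description paragraph (a sketch reusing my commit text, not necessarily how I actually created the commit):

git commit -m "Use 64bit load/store for UnalignedCopy64 on AArch64" \
           -m "This results in a slight performance gain."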

Anyway, I’m running out of advice to give, so without any further ado, here’s the pull request I ended up with: https://github.com/zeevt/csnappy/pull/34

I imagine a pull request as simple as this will just be accepted when someone with access sees it, but we’ll see. I’ll make another blog post whenever the csnappy maintainers reply to or accept the pull request.

AArch64 Optimization – Stage 2 – Finishing Up

Okay, now that we’ve performed some optimizations, it’s time to test and make sure they didn’t break any functionality. Given that all I did was switch around some preprocessor directives (and thus none of the code is different – just which parts of it get compiled on AArch64), I really doubt that anything has broken. Still, it’s good practice to be absolutely certain.

Another thing we need to test is whether our optimizations affected other platforms. It’s entirely possible that a code change that makes a function run faster on AArch64 could make it run more slowly on x86_64 (or vice versa). If a change does improve things on one architecture while causing regressions on another, that doesn’t mean the change is bad, either. We can simply use preprocessor directives so that the optimal code is compiled on each architecture (a sketch of that pattern follows below). The C preprocessor is a useful tool, though heavy use of it can make source code more difficult to read.
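An illustrative sketch of the pattern (not csnappy’s actual code):

#if defined(__aarch64__)
	/* AArch64-tuned implementation goes here */
#elif defined(__x86_64__)
	/* x86_64-tuned implementation goes here */
#else
	/* portable fallback for everything else */
#endif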

However, none of that actually applies here. Since I only changed preprocessor directives to affect what code is compiled on AArch64, there’s absolutely no difference in the running code on x86_64 (or on any other architecture). This means I can save time and avoid having to run benchmarks and functionality tests on x86_64. It also means there’s no chance of the change affecting less commonly used architectures, which is good given that I lack both the time and the boxes to test on anything other than AArch64 and x86_64 (or maybe older 32-bit ARM processors, but still).

Okay, so how do we test if the functionality is broken? There are a few ways. Given that this is a compression library, the first and most obvious is ‘after we compress/decompress a file, is the result the same?’ Testing this is very straightforward – compress a file, decompress it, then compare the output. The ‘cmp’ command (on Linux) compares two files and tells you if there’s any difference between them. So I ran the following command:

cmp testdata/9GB.avi otmp

And it produced no output, which (for cmp) means the files are identical. So far, so good. What else can we do to be sure? Some projects ship their own test suites, and csnappy happens to be one of them. Running ‘make test’ performs a series of standard functionality tests (similar to what I just did), and then runs valgrind’s memory leak checker. Not all projects have their own tests set up, but it’s quite helpful when they do. In this case, the test results all look proper, so everything seems good so far.
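As an aside, the whole round trip is easy to script for reuse. A sketch – note that the decompression invocation is a guess on my part, so check block_compressor’s usage output for the real flag:

./block_compressor -c snappy testdata/9GB.avi itmp   # compress to itmp
./block_compressor -d snappy itmp otmp               # decompress to otmp ('-d' is assumed)
cmp testdata/9GB.avi otmp && echo "round trip OK"    # cmp exits 0 on a match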

There’s one final test I want to do before I can be absolutely certain that my changes haven’t broken anything – backwards compatibility. It’d be a problem if an update to a popular (or even somewhat popular) program meant it produced files that didn’t work with earlier programs. At the very least, if an optimization does break backwards compatibility, upstream needs to know about it so they can hold it off until a major version release.

How to test this is fairly obvious – compress something with the modified library, then see if the regular version of the library can decompress it without problems. In this case it’s all but certain that this won’t be a problem – the different code paths csnappy compiles for different architectures are all meant to produce identical output. Otherwise, you could compress a file on an AArch64 box and potentially be unable to decompress it on an x86_64 one – unacceptable behavior for a compression library.

Fortunately for us, everything seems to check out. With this, I can finally call Stage 2 complete, and move onto the third and final stage – getting my code accepted upstream. If you’ve worked with open source communities before, you’re probably familiar with the process. But if you’re not, stay tuned as I’ll be talking about it as I go along.

AArch64 Optimization – Stage 2 – More Optimization

First off, see this if statement in EmitLiteral?

		/*
		The vast majority of copies are below 16 bytes, for which a
		call to memcpy is overkill. This fast path can sometimes
		copy up to 15 bytes too much, but that is okay in the
		main loop, since we have a bit to go on for both sides:
		- The input will always have kInputMarginBytes = 15 extra
		available bytes, as long as we're in the main loop, and
		if not, allow_fast_path = false.
		- The output will always have 32 spare bytes (see
		snappy_max_compressed_length).
		*/
		if (allow_fast_path && len <= 16) {

So, when EmitLiteral is called, it’s almost always called with allow_fast_path = 1. The only time it’s called with allow_fast_path = 0 is after the end of the main loop, when emitting the remaining bytes. And the comments here say that ‘the vast majority of copies are below 16 bytes’. I’m thinking that wrapping ‘allow_fast_path && len <= 16’ in the ‘likely’ macro may improve performance, like so:

		if (likely(allow_fast_path && len <= 16)) {

The ‘likely’ and ‘unlikely’ macros are supposed to help gcc optimize a branch depending on whether it’s likely or unlikely to be taken, generating code that works with the processor’s branch prediction. I’ve never really played with them before, but they’re used fairly regularly throughout the csnappy code. Branch prediction is a pretty complicated topic, and I’m not going to pretend that I understand more than the very basics of it. Therefore, I’m not entirely sure how this is going to work out – it could be that modern processors are already smart enough to figure out this branch on their own. So, this is kind of another shot in the dark: if it works, it’ll provide an easy and noncontroversial speedup, but I figure it’s equally likely that it won’t provide a noticeable one.
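For reference, these macros are conventionally built on gcc’s __builtin_expect – here’s the common idiom (csnappy’s own definitions may differ slightly):

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

The !! forces the expression to 0 or 1, and the second argument tells gcc which value to expect. Anyway, with the hint in place, the benchmark says: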

The average time for thing across 100 trials is: 8.341093851 seconds

Okay, that’s not what I was expecting. I’m really not sure why this made things slower. Maybe there are more >16-byte copies than the authors think, but I get the feeling that isn’t the problem. (If I had to guess, the branch predictor was already handling this branch well, and the hint mostly just changed the generated code layout.) I’d be interested to know why this ends up slower, but I don’t think digging too far into it is worth the time investment right now. There are other possible optimizations I can make, and those are a lot more likely to produce results. So, let’s move on.

The next thing that catches my eye is just inside that if statement:

			UnalignedCopy64(literal, op);

I wonder what exactly UnalignedCopy64 is doing. More specifically, I wonder if it’s actually doing 64 bit operations on AArch64, or if it’s doing something that’s less efficient. UnalignedCopy64 is defined in https://github.com/zeevt/csnappy/blob/master/csnappy_internal.h so let’s take a look, shall we?

static INLINE void UnalignedCopy64(const void *src, void *dst) {
#if defined(__i386__) || defined(__x86_64__) || defined(__powerpc__) || defined(ARCH_ARM_HAVE_UNALIGNED)
  if ((sizeof(void *) == 8) || (sizeof(long) == 8)) {
    UNALIGNED_STORE64(dst, UNALIGNED_LOAD64(src));
  } else {

Almost immediately we run into this. I won’t paste the whole function, but underneath this there’s an else clause that performs the unaligned copy by copying single bytes – which is the code we’re currently running on AArch64. There’s absolutely no way that’s efficient – it’s fallback code, put in place to handle processors that can’t perform unaligned multi-byte loads. Let’s add ‘|| defined(__aarch64__)’ to this #if directive and see what happens.
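For clarity, here’s the directive after my change (just the AArch64 test appended):

#if defined(__i386__) || defined(__x86_64__) || defined(__powerpc__) || \
    defined(ARCH_ARM_HAVE_UNALIGNED) || defined(__aarch64__)

With that rebuilt, the benchmark reports: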

The average time for thing across 100 trials is: 8.262914045 seconds

It’s an improvement, but less of one than I was hoping for. Compared to the original result of 8.372800311 seconds, we’re 1.3% faster with these two optimizations. I can’t say I’m entirely pleased with this – I’d hoped to squeeze a few percent out of this, at the least. However, I’m running out of both time (since I have a number of other things I need to be working on) and ideas.

By now I’ve pretty much exhausted the limits of my original plan. I didn’t try the build flags in -Ofast, but from reading what they do I get the feeling that they’d give tiny gains at best, or would break functionality. -ffast-math doesn’t seem like it would be of much benefit given how little math this library does (it’s largely bit shifts and simple addition/subtraction), and -fno-protect-parens doesn’t seem useful for the same reason. As for -fallow-store-data-races, I’m honestly not 100% sure how it functions, but ‘data races’ in a compression library sounds like a recipe for trouble.

I think I’m going to call it here, in terms of performing optimization. I’ve gone through my original plan – I examined what I could do with build flags, and then looked through every relevant preprocessor directive. I even tried something that wasn’t in the plan (the ‘likely’ macro). At this point there’s not much more I can do without returning to the drawing board and coming up with a completely new plan. In the end, I did succeed in optimizing csnappy_compress_fragment – it just wasn’t as large an optimization as I had hoped for. Soon I’ll write up a shorter post wrapping up and summarizing Stage 2 (including testing to make sure I haven’t broken any functionality), and after that it’ll be time to submit these two changes upstream.

AArch64 Optimization – Stage 2 – Optimization Time

Good news, I’ve gotten access to a much stronger AArch64 box, and I can now test on much more reasonable data. In celebration, I immediately went off and used the power of lossless video to acquire a ~9.3GB video file. And then it was time to run it through csnappy:

The average time for thing across 100 trials is: 8.372800311 seconds

Now that’s a lot more reasonable for a test. I’ll get x86_64 data later if it ends up being necessary, but I may not end up making any code changes that affect execution on x86_64, so let’s leave that off for now.

First trial – compiling with -O3 instead of -O2:

The average time for thing across 100 trials is: 8.348095289 seconds

It’s… technically an improvement. But it only saves about 0.025 seconds – roughly 0.3% – which makes me think it’s not worth switching to -O3. You might ask ‘why? it’s just changing a compiler flag, might as well leave it’. The answer is that it’s not worth the risk of regressions elsewhere.

-O3 isn’t just a magic speedup button; it makes changes that have tradeoffs. Generally these tradeoffs are in your favor, but let me present a scenario to you: You’re using a program on an old or cheap processor and you don’t have a lot of CPU cache to spare. Compiled with -O2, the program fits in the CPU cache and runs (relatively) well. Then you compile with -O3, hoping for that sweet speedup, and the optimizer does all kinds of cool tricks like loop unrolling that make the executable size larger, but generally result in a speedup. And then your program runs slower, because it no longer fits in the CPU cache. Is this an edge case? Absolutely. But the gain here is so small that it’s really not worth risking things like that.

Anyway, with -O3 out of the way as a possibility, I’m moving onwards to take a look at some of the preprocessor directives. Previously, I’d noticed things like

#if defined(__arm__) && !defined(ARCH_ARM_HAVE_UNALIGNED)

It turns out that this isn’t particularly relevant for our purposes: __arm__ is not defined on AArch64 systems; __aarch64__ is instead. So, let’s take a look around at some other preprocessor directives and see what we can find. The first thing that sticks out to me is this:

 * Separate implementation for x86_64, for speed.  Uses the fact that
 * x86_64 is little endian.
 */
#if defined(__x86_64__)
static INLINE int
FindMatchLength(const char *s1, const char *s2, const char *s2_limit)

Hmm, that’s interesting. For context, FindMatchLength is called within the main loop of csnappy_compress_fragment, so if we can speed it up, csnappy_compress_fragment will obviously end up faster. Anyway, this version of FindMatchLength is followed by a different version used for everything other than x86_64. The x86_64 function uses 64-bit loads, while it looks like everything else gets 32-bit loads. I have a feeling I can just chuck ‘|| defined(__aarch64__)’ onto the end of that #if directive and things will work faster on AArch64. Let’s try it.
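The patched directive is just the original test with AArch64 appended:

#if defined(__x86_64__) || defined(__aarch64__)

Rebuilding and re-running the benchmark gives: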

The average time for thing across 100 trials is: 8.298365891 seconds

Hey, that’s definitely faster. Granted, it’s only about a 1%-ish speedup, but it’s a speedup. And this one’s far less likely to cause side effects than jacking up the optimization to -O3, so I’m feeling a lot better about it. All we’re doing is switching from one variation of a function to another variation, and this won’t even affect any platform that isn’t AArch64.

So, I’ve performed an optimization. Time to send it upstream and call this done? No, not quite. First off, I need to be certain that this change hasn’t broken the library’s functionality. I’m 99% sure that it hasn’t, but good practice demands being 100% sure. I could test that now, and if I suspected that this change may have broken things, I would. But I’m going to leave that for later, and I’ll test functionality at the end (which might backfire on me if I do break something and then have to figure out which change did it, but I figure I’d probably have a pretty good idea).

As the previous paragraph implied, I’m most likely going to keep looking for other optimizations to make. I’d like to squeeze at least a few more percent out of this, if I can. I’m not 100% sure of the approach I’ll take next, but I have my eye on EmitLiteral (another function called within csnappy_compress_fragment) for a couple of reasons.

AArch64 Optimization – Stage 1 – Benchmarks Redux

When I did the benchmark in the last post, I imagined that the compression would take longer. Benchmarking something that only takes a fifth of a second generally isn’t a great idea – it’s too easy for anything else going on in the system to disrupt the results of the benchmark. Running it a few thousand times does help, but in general it’s best to use data that takes at least a second or two to process.

So, I went and looked through an old hard drive for some video files I could use. The first thing I tried was about 1.1GB. I saw it taking far longer than I expected to run, so I popped top open and had a look. I promptly discovered that the machine I’m testing on only has about 550MB of free RAM (and thus was swapping heavily, which is no good). Fortunately, I also had a 514MB file lying around. That should work for now:

The average time for thing across 100 trials is: 1.440855900 seconds

I also decided to get some preliminary benchmark results on an x86_64 box. The reason for this isn’t to compare the two, it’s so that I have numbers to compare against later. I want to make sure any changes I make don’t result in a performance loss on x86_64.

The average time for thing across 100 trials is: .132044908 seconds

That’s definitely a bit awkward. Can’t use a larger file because it’ll swap on the AArch64 box (which would distort the results), and testing with an entirely different file on x86_64 probably isn’t a good idea (since the result of optimizations could vary depending on what file is used). Well, there’s not too much I can do about it at this point anyway. I may end up doing optimizations that don’t affect the code executed on x86_64 at all, so I’ll leave it at this for now. There’s also a chance I’ll have access to a better AArch64 box in the future.

AArch64 Optimization – Stage 1 – Benchmarks and Strategy

Now that I’ve settled on a function, it’s time to benchmark it. We’ll need this benchmarking data for later, once I start changing things. Taking a harder look at the benchmark script in the makefile, I found it isn’t completely suitable for my purposes. Firstly, it references a bunch of random files found (presumably) on the original developer’s computer, so I’ll have to pick a new file to test compression on. For this purpose, I’m going to use a PDF of the Arm® Architecture Reference Manual, both because it’s amusingly appropriate, and because it’s a good size to test with (~45MB).

Looking further into the benchmarking script within the makefile, it seems designed to compare csnappy against two competing algorithms. It tests both compression and decompression, and compares them on time taken and compression ratio. It also only runs the test once – and that’s not good enough for my purposes. It seems like it’ll be better to make my own benchmarking script (based on the one I used to test sound scaling algorithms in previous posts). The end result looks like this:

#!/bin/bash
# Usage: bash <this script> <number of trials>
TOTAL=0
# make sure block_compressor can find the locally built library
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:.
export LD_LIBRARY_PATH

for (( x = 0; x < $1; x++ )); do
	# block_compressor reports its own timing; extract the seconds value
	# (the regex assumes a single digit before the decimal point,
	# i.e. runtimes under 10 seconds)
	TEMP=`./block_compressor -c snappy testdata/DDI0487E_a_armv8_arm.pdf itmp | grep seconds | grep -o '[0-9]\.[0-9]*'`
	TOTAL=`bc <<< "scale=3;$TOTAL+$TEMP"`
done
# average across all trials
RESULT=`bc <<< "scale=9;$TOTAL/$1"`
echo "The average time for thing across $1 trials is: $RESULT seconds"

It’s not particularly fancy, but it gets the job done. I quickly spun up tmux, set the script running for a thousand trials, and was promptly impressed by the results:

The average time for thing across 1000 trials is: .219818994 seconds

Barely more than a fifth of a second to compress ~45MB! For reference, compressing the exact same file with 7zip (on my home computer, which is vastly more powerful than the AArch64 box I’m using to test) takes over four seconds. It’s not a serious comparison – benchmarking snappy against other compression methods isn’t my goal here – but it’s clear that this library is blazing fast.

Next, I need to decide on a strategy to optimize csnappy_compress_fragment. Interestingly enough, where I’m drawn next is to the project’s makefile ( https://github.com/zeevt/csnappy/blob/master/Makefile ). Or more specifically, to a single line within that makefile:

OPT_FLAGS = -g -O2 -DNDEBUG -fomit-frame-pointer

For optimization, we’re only building with -O2 and a single additional flag, -fomit-frame-pointer. I have a feeling that there are further optimization flags in -O3 and possibly even -Ofast that could give us a speed boost. While I didn’t think I’d be saying this, I’m going to be focusing on build flags as my primary optimization method. I’m not typically a ‘build guy’ – I tend to just throw -O2 on and call it a day – so this could be a good opportunity to improve my knowledge of the compiler’s optimizations. Speaking of which, I’m not really sure why -fomit-frame-pointer is specified here – the gcc documentation suggests that it’s already enabled at -O1 and higher.

My plan for build flags: try building with -O3 and -Ofast. The first order of business will be to see if the library still functions correctly at these higher optimization levels. If it doesn’t, I’ll start individually enabling/disabling optimization flags until I find which one breaks functionality. Once I know what flags to avoid, I’ll benchmark to see how much speed is gained from the higher optimization level.
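Mechanically, trying those should just be a matter of overriding the makefile’s OPT_FLAGS variable from the command line – standard make behaviour, though I haven’t verified it against this particular makefile:

make clean
make OPT_FLAGS="-g -O3 -DNDEBUG -fomit-frame-pointer"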

I also have a secondary plan, if optimizing with build flags doesn’t work out (or if I have extra time and wish to try further optimization). Specifically, I don’t see any defines in the header files related to ARMv8-A (the architecture revision that introduced AArch64). Here’s an example:

#if defined(__ARM_ARCH_6__) || defined(__ARM_ARCH_6J__) || defined(__ARM_ARCH_6K__) || defined(__ARM_ARCH_6Z__) || defined(__ARM_ARCH_6ZK__) || defined(__ARM_ARCH_6T2__) || defined(__ARMV6__) || \
    defined(__ARM_ARCH_7__) || defined(__ARM_ARCH_7A__) || defined(__ARM_ARCH_7R__) || defined(__ARM_ARCH_7M__)
#  define ARCH_ARM_HAVE_UNALIGNED
#endif

Early versions of ARM didn’t support unaligned memory access, but everything from ARMv6 and up does. It’s possible that adding || defined(__ARM_ARCH_8A__) to this would speed things up all on its own.
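As a sanity check before leaning on any of these macros, there’s a standard gcc trick for listing everything the compiler predefines on a given box:

gcc -dM -E - < /dev/null | grep -iE 'arm|aarch'

Running that on the target machine shows exactly which of these __ARM_ARCH_* tests would actually fire.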

Anyway, now that I’ve found a project, found a function to optimize, obtained some baseline benchmark results, and developed a strategy on how to optimize, I think I can call “Stage 1” of this process complete. Stage 2 will involve actually performing the optimization, making sure it functions properly, and benchmarking again to evaluate the improvement (or lack thereof). I’ll also need to make sure I don’t cause any regressions on other platforms (specifically x86_64 – I have neither the time nor the boxes to test every possible architecture).

AArch64 Optimization – Stage 1 – Choosing a Package

As I said in my previous post, I set off through Fedora’s package list ( https://apps.fedoraproject.org/packages/ ) to look for anything interesting and CPU intensive to optimize. At first I just typed ‘a’ into the search box and read through the results. Some of what I found was quite interesting, but since I was essentially browsing randomly, I had a hard time finding anything that met the criteria (a lot of projects don’t have architecture specific code, it seems). After a while, I decided to search for libraries instead, figuring it’d be easier to pick out relevant CPU intensive libraries.

Searching for libraries paid off before too long. I stumbled across a rather interesting looking project: https://github.com/zeevt/csnappy . It’s a C port of a C++ compression/decompression library written by Google. The goal of snappy/csnappy is a bit different from that of most compression projects I’ve seen. Often, compression projects will boast about how well they can compress data – how small the resulting file size is. This is a great approach under certain circumstances – for example, if you’re hosting files for others to download and wish to minimize the bandwidth you’re using.

However, there’s one drawback of this approach – it generally takes a while. If you regularly compress/decompress data, you’ll end up sitting there waiting while the compression algorithm works. Snappy and csnappy go in the opposite direction – they optimize for compression/decompression speed, settling for ‘reasonable’ (Google’s words) compression. Sounds like an interesting project to look further into, so let’s go deeper.

Upon looking at csnappy’s source, one thing instantly stood out to me. An .s file – handwritten assembly code. Take a look at it: https://github.com/zeevt/csnappy/blob/master/unaligned_arm.s

It’s a function written for an old version of ARM, and it loads a word from an unaligned memory location. It even shows off some fun features of older ARM architectures – conditional opcodes that include logical shifts. But (as far as I’m aware) AArch64 didn’t retain those features, so let’s not get too off topic.

Moving on, csnappy does have a fair amount of architecture specific code. Looking through header files reveals a lot of preprocessor directives that compile different code based on the architecture. Looking through the .c files reveals large amounts of code wrapped in #if and #else directives as well. Exactly the kind of thing I’m looking for.

Stepping back from the actual compression code, there are a couple other things I noticed about csnappy that will be quite helpful. Checking out the makefile ( https://github.com/zeevt/csnappy/blob/master/Makefile ) reveals that csnappy comes with a series of tests, both to make sure the compression/decompression are functioning properly, and to benchmark its speed/performance. Both of these will be very helpful – I can near instantly tell if a change I’ve made breaks the (de)compression, and I won’t have to build my own benchmarks.

With that said, let’s look for a specific function to focus on. The code can be somewhat difficult to follow, because there are so many preprocessor directives involved, to the point where one source file can contain multiple separate versions of the same function. It makes me wish they’d split some of it out into separate source files.

That aside, there’s a huge hint as to which function I should be looking at. In the project’s benchmarking code, the ‘time’ result it spits out is the time it takes csnappy_compress_fragment to run. That’s basically a flashing neon sign saying ‘this function’s speed is very important!’ So it’s an easy decision – I’ll be focusing on optimizing csnappy_compress_fragment.

AArch64 Optimization – Intro

So, I figure it’s about time to take my current knowledge and do something of use in the real world with it. Specifically, the world of open source. So, let’s make something faster. Sound good? Yes, but I’m going to need to be a lot more specific than that. Let’s set some parameters and goals. First off, the goal will be to increase a piece of software’s (or a library’s) performance on the AArch64 architecture. If I also manage to increase its performance on x86 (or other architectures), then that’s even better. However, for the purposes of this, I will be focusing on AArch64 performance, and as long as performance on x86 does not decrease, I’ll consider the optimization a success.

As for some parameters, I’ll be choosing a piece of open source software (or library) that I am not currently involved with. Even more specifically, I’ll be choosing a single function (or method/routine/whatnot) and focusing on the performance of that. I’m still quite new to optimization, so only focusing on a single thing sounds like the best way to start. Furthermore, whatever piece of software I pick needs to be something that compiles to machine code, as optimization of interpreted languages is outside the scope of what I’m currently doing. Another parameter is that the chosen function must be CPU intensive – in other words, it needs to be something where optimizing it will actually make a difference. It’s a waste of time to write optimizations for functions that represent a very small portion of execution time, after all.

Oh, and for this, the optimization does not necessarily need to involve changing code. If I can optimize a program by changing build options, that is an entirely legitimate approach. That said, I’m most likely going to be looking for software that already has architecture specific code for x86 (but does not have AArch64 specific code, or at least not as much). There are a few reasons for this approach. The first is that it points out areas where optimization is useful – someone spent time optimizing code for x86, so it’s entirely likely that there are performance gains to be made there for AArch64 as well. Another is that writing code that will only be used on AArch64 means I can sidestep the issue of potential regressions on other architectures. I won’t limit myself to exclusively looking for architecture specific code, but it seems like the most practical approach here.

Anyway, I think I’m going to stop writing now and go off and search for software to optimize. I’m probably just going to look through the Fedora package list and see what I can find. Fedora specifically because I know it has a huge, well established repository, and because I already have access to a large number of boxes with Fedora installed.

Optimizing Your Volume Slider 4 – Benchmark Results Redux

Looking back at my last post, I realize I could have done things in a much more robust (and frankly much easier) way than running the test multiple times manually and looking for patterns. While the conclusions I came to in my last post hold, I’d like to provide a far more robust set of results and the method I used to arrive at those results.

The method being ‘use a couple of bash scripts to run the test multiple times and average the results’. First, I decided to make a script I could use to test any program I want, any number of times I want.

#!/usr/bin/env bash
# Usage: bash timescript <program> <number of trials>
TOTAL=0

for (( x = 0; x < $2; x++ )); do
	# time the program and extract the seconds from `time`'s "real" line
	# (the regex assumes a single digit before the decimal point,
	# i.e. runtimes under 10 seconds)
	TEMP=`( time $1 ) 2>&1 | grep real | grep -o '[0-9]\.[0-9]*'`
	TOTAL=`bc <<< "scale=3;$TOTAL+$TEMP"`
done
# average across all trials
RESULT=`bc <<< "scale=3;$TOTAL/$2"`
echo "The average time for $1 across $2 trials is: $RESULT seconds"

And then I made a simple script to run this on all of my different volume scalers:

#!/usr/bin/env bash
TRIALS=$1
bash timescript ./vol_noscale $TRIALS
bash timescript ./vol1 $TRIALS
bash timescript ./vol2 $TRIALS
bash timescript ./vol3 $TRIALS
bash timescript ./vol_inline $TRIALS
bash timescript ./vol_intrinsics $TRIALS
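Invocation is just the number of trials (here I’m assuming the runner was saved as runall – that filename is my own placeholder):

bash runall 100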

Then I ran this script inside tmux (just in case of disconnection), and waited for it to spit out the results.

The average time for ./vol_noscale across 100 trials is: 5.084 seconds
The average time for ./vol1 across 100 trials is: 5.228 seconds
The average time for ./vol2 across 100 trials is: 6.409 seconds
The average time for ./vol3 across 100 trials is: 5.180 seconds
The average time for ./vol_inline across 100 trials is: 5.166 seconds
The average time for ./vol_intrinsics across 100 trials is: 5.166 seconds

And there you have it.
