Optimizing Your Volume Slider 3 – Benchmark Results

If you haven’t read the first two posts, Optimizing Your Volume Slider and Optimizing Your Volume Slider 2 – SIMD Edition, it’d be a good idea to do so before reading this one, since they provide the context needed to understand the results here.

So, finally getting around to the benchmarking. We’re going to take this a bit more seriously than the previous benchmarking, which means a few things. First, before running the benchmark, I ran top and made sure nothing else on the system was consuming significant resources. This is important because other running programs compete for CPU time, which can significantly affect how long your benchmark takes and throw off the results.

Another point is that we won’t be comparing results across different machines (especially different processor architectures) anymore. While such comparisons are definitely interesting, there is no one-size-fits-all solution to optimization (as much as we’d probably like there to be). A program has to be compiled differently for different processor architectures anyway, and if we were really writing a sound library, we’d probably either be gearing the library toward a specific situation (e.g. music on mobile devices, so we’d optimize for ARM), or including preprocessor directives that select different code for different architectures. To drive this point home, vol_intrinsics and vol_inline won’t even run on x86, since they make use of AArch64 instructions.
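As a rough sketch of what that compile-time dispatch might look like (this is illustrative, not code from the actual vol programs – __aarch64__ and __x86_64__ are macros predefined by GCC and Clang on those targets):

#include <stdio.h>

int main(void)
{
    /* Sketch: pick a code path at compile time based on the target architecture. */
#if defined(__aarch64__)
    puts("AArch64 build: would use the NEON inline assembly / intrinsics scaling path");
#elif defined(__x86_64__)
    puts("x86_64 build: would use an SSE/AVX scaling path instead");
#else
    puts("Other architecture: would fall back to portable C scaling");
#endif
    return 0;
}

In a real library each branch would call the architecture-specific implementation; the point is just that the selection happens at compile time, so only the relevant code ends up in the binary.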

The last point I’ll make about benchmarking itself is arguably the simplest: run the benchmark multiple times. This means both collecting a large enough sample size for the results to be statistically meaningful and, more plainly, just running the benchmark a bunch of times. Anyway, let’s get into (one instance of) the benchmark results themselves:

bash -c "time ./vol_noscale"
Result: -846

real    0m5.087s
user    0m4.907s
sys     0m0.170s

bash -c "time ./vol1"
Result: -649

real    0m5.229s
user    0m5.069s
sys     0m0.149s

bash -c "time ./vol2"
Result: -949

real    0m6.408s
user    0m6.254s
sys     0m0.140s

bash -c "time ./vol3"
Result: -181

real    0m5.183s
user    0m5.043s
sys     0m0.130s

bash -c "time ./vol_inline"
Result: -181

real    0m5.173s
user    0m5.014s
sys     0m0.150s

bash -c "time ./vol_intrinsics"
Result: -425

real    0m5.161s
user    0m5.011s
sys     0m0.140s

vol_noscale is the ‘control’ condition – it has no scaling code, therefore it exists purely to tell us the time taken to generate the samples, sum them afterwards, etc. vol1 is our familiar ‘scale via floating point calculations’, vol2 is the precalculated results table, vol3 is fixed point math using standard C. Finally, vol_inline is fixed point math using inline assembly, and vol_intrinsics is fixed point math using AArch64 intrinsics.
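As a quick refresher on what those three scaling strategies look like in code, here’s a minimal sketch of the ideas from the earlier posts (my own paraphrase, not the actual benchmark source – it skips the sample generation and summing that the real programs do):

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <stddef.h>

/* vol1-style: scale each sample with a floating-point multiply. */
static void scale_float(int16_t *s, size_t n, float volume)
{
    for (size_t i = 0; i < n; i++)
        s[i] = (int16_t)(s[i] * volume);
}

/* vol2-style: precalculate every possible scaled value once, then look it up.
   The table is indexed by the sample's raw 16-bit pattern. */
static void scale_table(int16_t *s, size_t n, float volume)
{
    int16_t *table = malloc(65536 * sizeof(int16_t));
    if (!table)
        return;
    for (int32_t v = -32768; v <= 32767; v++)
        table[(uint16_t)v] = (int16_t)(v * volume);
    for (size_t i = 0; i < n; i++)
        s[i] = table[(uint16_t)s[i]];
    free(table);
}

/* vol3-style: fixed point – scale by a Q15 factor, then shift back down. */
static void scale_fixed(int16_t *s, size_t n, float volume)
{
    int16_t factor = (int16_t)(volume * 32767.0f);
    for (size_t i = 0; i < n; i++)
        s[i] = (int16_t)(((int32_t)s[i] * factor) >> 15);
}

int main(void)
{
    int16_t a[4] = { 1000, -2000, 30000, -30000 };
    int16_t b[4] = { 1000, -2000, 30000, -30000 };
    int16_t c[4] = { 1000, -2000, 30000, -30000 };
    scale_float(a, 4, 0.75f);
    scale_table(b, 4, 0.75f);
    scale_fixed(c, 4, 0.75f);
    for (int i = 0; i < 4; i++)
        printf("%6d %6d %6d\n", a[i], b[i], c[i]);
    return 0;
}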

Our previous conclusions about floating point vs. the precalculated result table vs. fixed point seem to hold – floating point is surprisingly competitive, the precalculated result table is clearly the slowest, and fixed point math is the fastest of the three. This pattern held in every test I ran, so by now I’m confident in these results (for this specific machine).

Now, let’s look at the new methods, vol_inline and vol_intrinsics. They’re both faster than the plain-C vol3, which is to be expected. But vol_intrinsics seems to have completed 0.012s faster than vol_inline. Does that mean intrinsics are faster than inline assembly?

This pattern didn’t hold across repeated tests. In some runs, vol_inline beat out vol_intrinsics. Both were consistently faster than vol3, but from these tests we cannot conclude that inline assembly is slower (or faster) than intrinsics. In this case, it would be entirely valid to make the decision based on the fact that intrinsics are easier to use than inline assembly. The fluctuations in time taken are likely due to background processes running on the system (even things like sshd take resources!) or more esoteric factors that aren’t under our control, like the temperature of the CPU. This is why it’s important to run benchmarks multiple times – you never know what could be skewing the results, and only by doing multiple tests can you be confident that a pattern holds.
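For reference, the intrinsics approach looks roughly like the sketch below (again my own paraphrase, not the actual vol_intrinsics source). vqdmulhq_s16 is the NEON saturating doubling multiply-high intrinsic, which per 16-bit lane works out to (a * b) >> 15 – exactly the Q15 fixed point multiply – and it only builds on AArch64. The sketch assumes the sample count is a multiple of 8 to keep it short:

#include <arm_neon.h>
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Scale 16-bit samples using NEON intrinsics: 8 samples per iteration. */
static void scale_neon(int16_t *samples, size_t count, float volume)
{
    int16x8_t vfactor = vdupq_n_s16((int16_t)(volume * 32767.0f));
    for (size_t i = 0; i < count; i += 8) {
        int16x8_t v = vld1q_s16(&samples[i]);  /* load 8 samples */
        v = vqdmulhq_s16(v, vfactor);          /* fixed point scale, all 8 lanes at once */
        vst1q_s16(&samples[i], v);             /* store them back */
    }
}

int main(void)
{
    int16_t buf[8] = { 1000, -1000, 2000, -2000, 30000, -30000, 123, -123 };
    scale_neon(buf, 8, 0.5f);
    for (int i = 0; i < 8; i++)
        printf("%d ", buf[i]);
    printf("\n");
    return 0;
}

Compare that with hand-written inline assembly doing the same thing: the intrinsics version reads like ordinary C (loads, a multiply, stores), and the compiler handles register allocation and scheduling, which is a big part of why I lean toward intrinsics below.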

In the end, I’d probably settle on using the intrinsics. They’re faster than doing the fixed point math in regular C, while being far more user friendly than writing inline assembly directly (since they work mostly the same way as regular C functions). One thing we might do if we were actually writing a sound library is try floating point math and the precalculated result table with these lower level techniques as well, but what I’ve done here should be enough to demonstrate the techniques involved, so I’m going to leave it at this.
