If you haven’t read the first two posts, Optimizing Your Volume Slider and Optimizing Your Volume Slider 2 – SIMD Edition, then it’d be a good idea to do so before reading this one, since they’re necessary to understand the results here.
So, finally getting around to the benchmarking. We’re going to be taking this a bit more seriously than previous benchmarking, which means a few things. First, before running the benchmark, I ran top and made sure nothing else on the system was sucking resources. This is important because other programs competing for the CPU can significantly inflate the time your benchmark takes and throw off the results.
Another point is that we won’t be comparing results across different machines (especially different processor architectures) anymore. While such comparisons are definitely interesting, there is no one-size-fits-all solution to optimization (as much as we’d probably like there to be). A program will have to be compiled differently for different processor architectures anyway, and if we were really writing a sound library, we’d probably either be gearing the library at a specific situation (ex: music on mobile devices, so we’d optimize for ARM), or including preprocessor directives that use different code for different architectures. To drive this point home, vol_intrinsics and vol_inline won’t even run on x86, since they’re making use of AArch64 instructions.
The last point I’ll make about benchmarking itself is arguably the simplest: run the benchmark multiple times. This means two things: use a large enough sample size within each run that the results are meaningful, and repeat the whole benchmark a number of times. Anyway, let’s get into (one instance of) the benchmark results themselves:
```
bash -c "time ./vol_noscale"
Result: -846

real    0m5.087s
user    0m4.907s
sys     0m0.170s

bash -c "time ./vol1"
Result: -649

real    0m5.229s
user    0m5.069s
sys     0m0.149s

bash -c "time ./vol2"
Result: -949

real    0m6.408s
user    0m6.254s
sys     0m0.140s

bash -c "time ./vol3"
Result: -181

real    0m5.183s
user    0m5.043s
sys     0m0.130s

bash -c "time ./vol_inline"
Result: -181

real    0m5.173s
user    0m5.014s
sys     0m0.150s

bash -c "time ./vol_intrinsics"
Result: -425

real    0m5.161s
user    0m5.011s
sys     0m0.140s
```
vol_noscale is the ‘control’ condition – it has no scaling code, therefore it exists purely to tell us the time taken to generate the samples, sum them afterwards, etc. vol1 is our familiar ‘scale via floating point calculations’, vol2 is the precalculated results table, vol3 is fixed point math using standard C. Finally, vol_inline is fixed point math using inline assembly, and vol_intrinsics is fixed point math using AArch64 intrinsics.
Our previous conclusions regarding floating point vs precalculated result table vs fixed point seem to hold – floating point is surprisingly robust, the precalculated result table is very slow, and fixed point math is the best. In every test I ran this pattern held true, so by now I’m confident in these results (for this specific machine).
Now, let’s look at the new methods, vol_inline and vol_intrinsics. They’re both faster than vol3, which is to be expected. But vol_intrinsics seems to have completed 0.012s faster than vol_inline. Does that mean intrinsics are faster than inline assembly?
This pattern didn’t hold for repeated tests. On some tests, vol_inline beat out vol_intrinsics. These two were consistently faster than vol3, but from these tests we cannot conclude that inline assembly is slower (or faster) than intrinsics. In this case, it would be entirely valid to make a decision based on the fact that intrinsics are easier to use than inline assembly. It’s likely that the fluctuations in time taken are due to background processes running on the system (even things like sshd take resources!) or more esoteric factors that aren’t under our control, like the temperature of the CPU. This is why it’s important to run benchmarks multiple times – you never know what could be skewing the results, and only by doing multiple tests can you be sure that the pattern holds.
In the end, I’d probably settle on using the intrinsics. They’re faster than doing the fixed point math in regular C, while being far more user friendly than directly writing inline assembly (since they operate mostly the same way as regular C functions). One thing we might do if we were actually writing a sound library is to try floating point math and the precalculated result table with these lower-level techniques, but what I’ve done here should be enough to demonstrate the involved techniques, so I’m going to leave it at this.