Optimizing Your Volume Slider

Today, let’s take a look at a feature that everyone here almost certainly takes for granted (unless you’ve worked on sound libraries before) – adjusting sound volume. It doesn’t sound terribly fascinating (pun unintended) – just multiply each sound sample by a user-provided value, right? But the question is, what’s the most efficient way to do that?

I tried three ways of doing so (with two variations on one method). If you want to follow along at home, here’s the code I used (along with a handy makefile): https://drive.google.com/file/d/1MvG14YU1q6PZ01e1K8ZNFE3WALTC38fG/view?usp=sharing

I encourage you to try this on your own machine. Why? Because the results differ based on the specs of the computer you’re running it on. I tested on two different machines – an x86_64 box (a small number of cores, but each core is powerful), and an AArch64 box (many cores, but individually weaker ones). Since this is a single-threaded program, results were about 10 times faster on the x86_64 box – which hints at how much threading would matter on the many-core machine. But for the purposes of this experiment, I decided to keep the program single-threaded.

The program here generates an array of random samples, runs them through a ‘scale sample’ function (the interesting part), and then does some math with the result. Why bother doing math on the results? Mostly so the optimizer doesn’t pull a fast one on us – if we never use the result of the calculation, -O3 may well go ‘hey none of this code actually affects the output, let’s just strip it’.
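If you don’t feel like downloading the archive, here’s a rough sketch of what that harness looks like. The SAMPLES count and the details below are placeholders rather than the exact code from the link, and the scale_sample shown is the vol1.c version we’ll get to in a moment:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define SAMPLES 5000000   // placeholder count, not the value from the real makefile

// The simple floating point version from vol1.c; each vol*.c swaps in its own.
static inline int16_t scale_sample(int16_t sample, float volume_factor) {
	return (int16_t) (volume_factor * (float) sample);
}

int main(void) {
	int16_t *data = malloc(SAMPLES * sizeof(int16_t));
	if (!data) return 1;

	// Generate random 16-bit samples.
	for (int x = 0; x < SAMPLES; x++)
		data[x] = (int16_t) ((rand() % 65536) - 32768);

	// Scale every sample (the interesting part).
	for (int x = 0; x < SAMPLES; x++)
		data[x] = scale_sample(data[x], 0.75);

	// Do some math with the results so -O3 can't strip the whole thing out.
	long long sum = 0;
	for (int x = 0; x < SAMPLES; x++)
		sum += data[x];
	printf("Result: %lld\n", sum);

	free(data);
	return 0;
}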

Anyway, onwards to the scaling algorithms. Here’s the first and simplest one (vol1.c for those following at home):

static inline int16_t scale_sample(int16_t sample, float volume_factor) {
	return (int16_t) (volume_factor * (float) sample);
}

So we cast the sample to a float, multiply it by the volume factor, and then cast it back to a 16-bit signed integer. The CPU has to do a reasonable amount of work here – converting between ints and floats isn’t a trivial operation. However, modern processors are increasingly good at handling floating point math, and x86 in particular has mature, heavily optimized floating point hardware behind it.

So, onto the results. I found that this method was sometimes the fastest on the x86 processor, and surprisingly it was also one of the faster methods on the AArch64 box. Wait, sometimes? Yes, that’s correct. The results of these tests vary depending on what else your machine is doing at the time. How many times have you had sound playing on your computer while you use it for other things? Even playing a video game means producing sound at the same time as plenty of other calculations.
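Because of that noise, it’s worth timing the scaling loop over several runs and keeping the best (or median) time rather than trusting a single measurement. Here’s one way you might do that on a Linux/POSIX system with clock_gettime. This isn’t the timing code from the download, just a sketch of the idea, and scale here stands in for whichever algorithm is being measured:

#include <stdint.h>
#include <string.h>
#include <time.h>

#define RUNS 5

// Time a scaling function several times over the same input and keep the
// fastest run, which filters out some background noise. The input is copied
// into a scratch buffer each run so every pass scales identical data.
double best_time(void (*scale)(int16_t *, int),
                 const int16_t *input, int16_t *scratch, int count) {
	double best = 1e9;
	for (int run = 0; run < RUNS; run++) {
		memcpy(scratch, input, count * sizeof(int16_t));
		struct timespec start, end;
		clock_gettime(CLOCK_MONOTONIC, &start);
		scale(scratch, count);
		clock_gettime(CLOCK_MONOTONIC, &end);
		double elapsed = (end.tv_sec - start.tv_sec)
		               + (end.tv_nsec - start.tv_nsec) / 1e9;
		if (elapsed < best)
			best = elapsed;
	}
	return best;
}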

vol2.c does things slightly differently – it precalculates all possible results for the given volume setting (using the algorithm from vol1.c) into a table, then looks up the result for each sample. Here’s the relevant code:

// Precalculate the scaled value for every possible 16-bit sample.
// (The loop variable is a plain int so the loop can cover the full
// -32768..32767 range without overflowing.)
for (int x = -32768; x <= 32767; x++)
{
	resultTable[x + 32768] = scale_sample((int16_t) x, 0.75);
}

// Scaling a sample is now just a table lookup.
for (int x = 0; x < SAMPLES; x++) {
	data[x] = resultTable[data[x] + 32768];
}

I thought this would perform well on larger sample sizes, but it was consistently the worst performing algorithm on both systems. In retrospect, it’s not too hard to guess why – the overhead of the extra memory accesses. Every sample now requires a load from a 65,536-entry table (128 KiB of int16_t values), which is larger than the L1 data cache on many machines. Maybe on a machine with fast enough memory access this algorithm would come out on top. There’s also one more interesting thing about this – I tried compiling everything with -O0 instead of -O3 just to see what would happen, and that resulted in this algorithm coming out as the fastest. It makes me curious what magic the optimizer is doing to make the first and third methods faster, though sadly digging into the assembly to find out exactly what’s going on is a bit outside the scope of this post (and a bit outside my knowledge of how x86_64’s FPU works, too).

Hmm, can’t this code be optimized further? "x + 32768" results in extra instructions executed for each sample, whereas if we just cast the value to uint16_t we get the exact same results at zero cost (that kind of cast just tells the compiler to treat the same 16 bits as unsigned). So you get code that looks like this instead (vol3.c):

for (int x = -32768; x <= 32767; x++)
{
	resultTable[(uint16_t)x] = scale_sample((int16_t) x, 0.75);
}

for (int x = 0; x < SAMPLES; x++) {
	data[x] = resultTable[(uint16_t)data[x]];
}

It does exactly the same thing, just a bit faster. I mostly included it because these kinds of tiny details can add up when you’re trying to optimize software.

Anyway, onto vol4.c. This file uses a neat little trick called ‘fixed point math’. Basically, you pick a fixed position in your number and say ‘okay, the binary point goes there’. Fixed point math generally involves less work for the CPU than floating point math does. Here, we treat the low 8 bits as the fractional part, so a volume of 1.0 is represented by setting the 9th bit: 0b100000000, or 256. Here’s how the code looks (as a quick side note, if you haven’t seen the 0b100000000 syntax before, it’s a binary literal – a GCC/Clang extension that only became standard C in C23):

static inline int16_t scale_sample(int16_t sample, float volume_factor) {
	// Convert the volume to 8.8 fixed point: 0.75 becomes 0.75 * 256 = 192.
	int16_t volFactor = 0b100000000 * volume_factor;
	// Integer multiply, then shift the 8 fractional bits back out.
	return (volFactor * sample) >> 8;
}

An integer multiply and a bit shift. Integer multiplies are cheap on modern processors and bit shifts are practically free, and there’s no converting between integers and floats at all. This is probably going to come out faster than doing the float math plus the casts. It does come out faster on the AArch64 box (fairly reliably the fastest algorithm), though on the x86_64 box it’s roughly tied with the floating point version, and only sometimes slightly ahead. I’m impressed by the performance of the floating point unit on modern x86 processors – I was expecting this to win on x86 as well.

Though, there’s one more thing to say about this algorithm – it’s the least accurate. It regularly introduced rounding differences that made results off by 1 compared to the other algorithms. Is this a problem? Should we go back to the first method instead? Not at all. You will never be able to hear a difference of 1 in a 16-bit sample; this is a completely acceptable level of inaccuracy. If I had to pick a winner out of the three approaches, it’d be the fixed point math.
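If you’re curious where that off-by-one comes from, it’s mostly down to rounding direction on negative samples: casting a negative float to an integer truncates toward zero, while an arithmetic right shift rounds toward negative infinity. A tiny worked example (not taken from the benchmark itself):

#include <stdint.h>
#include <stdio.h>

int main(void) {
	int16_t sample = -3;
	float volume = 0.75f;

	// Floating point: -3 * 0.75 = -2.25, and the cast truncates toward zero.
	int16_t via_float = (int16_t) (volume * (float) sample);    // -2

	// Fixed point: volFactor = 192, 192 * -3 = -576, and -576 >> 8
	// (an arithmetic shift on most compilers) rounds down to -3.
	int16_t volFactor = (int16_t) (0b100000000 * volume);
	int16_t via_fixed = (int16_t) ((volFactor * sample) >> 8);  // -3

	printf("float: %d, fixed: %d\n", via_float, via_fixed);
	return 0;
}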

So, to sum up: the fixed point math performs well on both AArch64 and x86_64. The floating point math performs well on x86_64, but lags behind on AArch64. The lookup table has the worst results on both – and it has one more disadvantage we haven’t discussed yet: it uses more memory, because the table has to be stored somewhere. That said, we’re talking about a table on the order of a hundred kilobytes, which really isn’t enough to sway the decision.

In response to all this, you might ask: why does any of this matter? CD quality audio is only 44,100 samples per second per channel, any difference is going to be minuscule, and any modern CPU can handle this without breaking a sweat. Two answers, the first being battery life: you’re often going to be listening to music on a phone or other portable device, and more efficient sound scaling means your battery depletes more slowly. The second is that efficiency adds up. If we coded everything with a “why does this matter” attitude towards efficiency, all those little inefficiencies would eventually add up into noticeably slower software.
