As I said in my previous post, I set off through Fedora’s package list ( https://apps.fedoraproject.org/packages/ ) to look for anything interesting and CPU intensive to optimize. At first I just typed ‘a’ into the search box and read through results. Some of what I found was quite interesting, but as I was essentially browsing randomly I had a hard time finding anything that met the criteria (a lot of projects do not have architecture specific code, it seems). After a while, I decided to search for libraries instead – figuring that it’d be easier to pick out relevant CPU intensive libraries.
Searching for libraries paid off fairly quickly. I stumbled across a rather interesting looking project: https://github.com/zeevt/csnappy . It’s a C port of a C++ compression/decompression library written by Google. The goal of snappy/csnappy is a bit different from most compression projects I’ve seen. Often, compression projects will boast about how well they can compress data – how small the resulting file size is. This is a great approach under certain circumstances – for example, if you’re hosting files for others to download and wish to minimize the bandwidth you’re using.
However, there’s one drawback of this approach – it generally takes a while. If you regularly compress/decompress data, you’ll end up sitting there waiting while the compression algorithm works. Snappy and csnappy go in the opposite direction – they optimize for compression/decompression speed, settling for ‘reasonable’ (Google’s words) compression. Sounds like an interesting project to look further into, so let’s go deeper.
Upon looking at csnappy’s source, one thing instantly stood out to me. An .s file – handwritten assembly code. Take a look at it: https://github.com/zeevt/csnappy/blob/master/unaligned_arm.s
It’s a function written for an old version of ARM, and it loads a word from an unaligned memory location. It even shows off some fun features of older ARM architectures – conditional opcodes that include logical shifts. But (as far as I’m aware) AArch64 didn’t retain those features, so let’s not get too off topic.
Moving on, csnappy does have a fair amount of architecture specific code. Looking through header files reveals a lot of preprocessor directives that compile different code based on the architecture. Looking through the .c files reveals large amounts of code wrapped in #if and #else directives as well. Exactly the kind of thing I’m looking for.
Stepping back from the actual compression code, there are a couple of other things I noticed about csnappy that will be quite helpful. Checking out the makefile ( https://github.com/zeevt/csnappy/blob/master/Makefile ) reveals that csnappy comes with a series of tests, both to verify that compression and decompression are functioning properly and to benchmark their speed. Both will be very helpful – I can tell almost instantly if a change I’ve made breaks the (de)compression, and I won’t have to build my own benchmarks.
With that said, let’s look for a specific function to focus on. The code can be somewhat difficult to follow because of all the preprocessor directives involved – to the point where one source file can contain multiple separate versions of the same function. It makes me wish they’d split some of it out into separate source files.
Complaints aside, there’s a huge hint as to which function I should be looking at. The ‘time’ result that the benchmarking code spits out is the time it takes csnappy_compress_fragment to run. That’s basically a flashing neon sign saying ‘this function’s speed is very important!’ So it’s an easy decision – I’ll be focusing on optimizing csnappy_compress_fragment.