I use Arch Linux, and you know what that means: investing hours of my time into saving a few megabytes of disk or RAM. Here is my most recent bit of tinkering:
I like Xfce: I think it strikes a good balance between being feature-rich and lightweight. Of course, this is all subjective, but having dabbled in LXDE, LXQt, and KDE in my early Linux days, I think I’m familiar enough with the two extremes to safely say that Xfce sits in the middle.
Still, there are some features I could live without – the desktop is one of them. I find desktop icons ugly, so I got rid of them, effectively leaving me with a blank screen sporting a minimalist wallpaper and a panel that is definitely not just Windows-lite.
Despite how bare my desktop is, xfdesktop takes up 50MiB of RAM. Not ideal.
After doing some research, I found that xfdesktop doesn’t seem to do much: just wallpaper management, desktop icons, and a right-click menu. The wallpaper thing can be worked around, and I don’t really care about the other two, so bye-bye xfdesktop.
Now, what do I do about the wallpaper? Desktop icons may have no practical use, but how can I live without my precious background image? Well, for that, I can use feh – a minimalist command line image viewer that can also act as a wallpaper renderer. I just have to add it to Xfce’s application autostart and boom – a wallpaper with no 50MiB memory cost.
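Concretely, the autostart entry only needs to run a one-liner along these lines (the path and scaling mode are whatever suits your setup):

feh --bg-fill ~/Pictures/wallpaper.png

Unlike xfdesktop, feh just paints the root window once and exits, so it doesn’t stick around eating memory afterwards.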
And now I can sit here smugly knowing that my system uses 550MiB of RAM at startup instead of 600MiB. Yay.
Last time, I went over the ‘why’ of clownaudio’s development. In this part, I hope to go over the ‘how’.
But before we begin, I’d like to briefly mention something that I didn’t bring up in the last part: ever since its early days, my Cave Story Ogg Vorbis mod has been under version control, and is hosted on GitHub. This means you can read through the commit history and see the early development of clownaudio for yourself.
It was June 2017. I was freshly annoyed by SDL_mixer’s API, and I figured my best option for an Ogg Vorbis playback library that met my needs was to make one myself.
On paper, this seemed simple enough: SDL2 asks for a constant stream of samples, and plays them back in realtime. All I have to do is obtain raw PCM samples from an Ogg Vorbis file, give them to SDL2, and ta-da: music playback.
And simple it was: the go-to Ogg Vorbis decoder libraries (a combination of libogg, libvorbis, and libvorbisfile) were easy enough to use, and I was piping decoded PCM samples to SDL2 in no time.
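For the curious, the whole pipeline boils down to something like this – a minimal sketch rather than the mod’s actual code, with a made-up filename and no error handling:

#include "SDL.h"
#include <string.h>
#include <vorbis/vorbisfile.h>

static OggVorbis_File vorbis_file;

/* SDL2 calls this from its audio thread whenever it wants more samples */
static void Callback(void *user_data, Uint8 *stream, int length)
{
    int bytes_done = 0;

    (void)user_data;

    while (bytes_done < length)
    {
        int bitstream;
        long bytes_read = ov_read(&vorbis_file, (char*)stream + bytes_done, length - bytes_done, 0, 2, 1, &bitstream);

        if (bytes_read <= 0)
        {
            /* End of file (or an error) - pad the rest with silence */
            memset(stream + bytes_done, 0, length - bytes_done);
            break;
        }

        bytes_done += bytes_read;
    }
}

int main(void)
{
    SDL_Init(SDL_INIT_AUDIO);

    ov_fopen("song.ogg", &vorbis_file);    /* hypothetical filename */
    vorbis_info *info = ov_info(&vorbis_file, -1);

    SDL_AudioSpec spec = {0};
    spec.freq = info->rate;
    spec.format = AUDIO_S16;    /* little-endian signed 16-bit, matching the ov_read arguments above */
    spec.channels = info->channels;
    spec.samples = 4096;
    spec.callback = Callback;

    SDL_AudioDeviceID device = SDL_OpenAudioDevice(NULL, 0, &spec, NULL, 0);
    SDL_PauseAudioDevice(device, 0);    /* devices start paused - unpause to begin playback */

    SDL_Delay(10 * 1000);    /* let it play for ten seconds */

    SDL_CloseAudioDevice(device);
    ov_clear(&vorbis_file);
    SDL_Quit();

    return 0;
}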
What blew my mind was the number of possibilities this opened up; by having such low-level control, all the previous barriers were gone:
Playing a song that’s split into two files? Easy: just load them both, and when the decoder for the first one runs out of data, switch to the other (see the sketch after this list).
Figuring out how far into a song you are? Simple: just count how many samples you’ve read from the decoder.
Having one song be interrupted by another, and then resuming when it ends? You have full control of the audio pipeline – you can do whatever you want!
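For instance, the first two of those boil down to a variation on the callback from the sketch above (the ‘intro’/‘loop’ file naming is purely for illustration):

static OggVorbis_File intro_file, loop_file;    /* the two halves of the song */
static OggVorbis_File *current_file = &intro_file;
static unsigned long total_frames_read;    /* doubles as a 'how far into the song are we?' counter */

static void Callback(void *user_data, Uint8 *stream, int length)
{
    int bytes_done = 0;

    (void)user_data;

    while (bytes_done < length)
    {
        int bitstream;
        long bytes_read = ov_read(current_file, (char*)stream + bytes_done, length - bytes_done, 0, 2, 1, &bitstream);

        if (bytes_read == 0 && current_file == &intro_file)
        {
            /* The first file ran out of data - switch to the second and carry on */
            current_file = &loop_file;
            continue;
        }

        if (bytes_read <= 0)
        {
            memset(stream + bytes_done, 0, length - bytes_done);
            break;
        }

        bytes_done += bytes_read;
        total_frames_read += bytes_read / (2 * 2);    /* 2 bytes per sample, 2 channels per frame */
    }
}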
Sure, this solution was significantly lower-level and required more code and thus more maintenance, but it was so rewarding. With a little elbow-grease, I’d overcome limitations, eliminated dependencies, and created so many more possibilities for audio playback.
Enter Cubeb. Low latency? Less bloated than SDL2? ‘Ooh shiny’ appeal? I couldn’t resist.
And so I set about replacing SDL2 with Cubeb.
Bizarrely, my troubles with Cubeb began with installing it: unlike SDL2, Cubeb isn’t available as part of MSYS2’s software repository. I’d have to compile and install it myself – something I’d never done before. Long story short, this is how I first learned to use CMake.
With Cubeb installed, I dug into its API docs and eventually produced a working conversion.
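The result looked roughly like this – again a sketch rather than the real thing, assuming a reasonably recent version of the cubeb API (the params struct has grown a few extra fields over the years) and using placeholder names throughout:

#include <stdint.h>
#include <string.h>
#include <cubeb/cubeb.h>

/* Cubeb calls this whenever it wants 'frames_requested' frames of audio */
static long DataCallback(cubeb_stream *stream, void *user_data, const void *input_buffer, void *output_buffer, long frames_requested)
{
    (void)stream; (void)user_data; (void)input_buffer;

    /* The decoded samples would be written here, just like in the SDL2 callback.
       For now, silence (two channels of 16-bit samples per frame). */
    memset(output_buffer, 0, frames_requested * 2 * sizeof(int16_t));

    return frames_requested;    /* returning fewer frames than requested signals the end of the stream */
}

static void StateCallback(cubeb_stream *stream, void *user_data, cubeb_state state)
{
    (void)stream; (void)user_data; (void)state;    /* started/stopped/drained notifications */
}

int main(void)
{
    cubeb *context;
    if (cubeb_init(&context, "Ogg player", NULL) != CUBEB_OK)
        return 1;

    cubeb_stream_params params = {0};
    params.format = CUBEB_SAMPLE_S16NE;
    params.rate = 44100;
    params.channels = 2;

    uint32_t latency_frames;
    cubeb_get_min_latency(context, &params, &latency_frames);

    cubeb_stream *stream;
    cubeb_stream_init(context, &stream, "Music", NULL, NULL, NULL, &params, latency_frames, DataCallback, StateCallback, NULL);
    cubeb_stream_start(stream);

    /* ...music plays... */

    cubeb_stream_stop(stream);
    cubeb_stream_destroy(stream);
    cubeb_destroy(context);

    return 0;
}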
I can distinctly remember the joy of being able to replace the massive multi-megabyte ‘SDL2.dll’ file in my mod with a several-hundred-kilobyte ‘libcubeb.dll’ file. I always love that feeling: replacing something big/slow/buggy with something minimal/fast/reliable.
Closing
And so it was that the very first incarnation of clownaudio was complete. Of course, it didn’t have the name back then, and it sure wasn’t a standalone library yet, but I consider this the cut-off point where it ceased being a simple layer on top of SDL_mixer, and became its own thing.
But not everything is sunshine and rainbows – for all its improvements, this first revision also had some very questionable design choices:
For one, the decoders streamed the Ogg Vorbis files from disk in real-time. Gross. But more pressingly, instead of having a single constant playback stream with a fixed sample rate and resampling the songs to suit it, I made each song create its own playback stream, with a sample rate matching its own.
But I suppose that’s enough for today. Next time, I’ll cover the great refactor, the birth of the backend system, and the quest for Windows XP compatibility.
I recently noticed that my go-to Game Boy Advance emulator, Visual Boy Advance – M, has a GitHub repo. My curiosity got the best of me and I wound up browsing its Issues tab for a few minutes. Eventually I found an issue about an audio delay in the SDL backend.
In the end, this bug wasn’t too interesting: it was just a ring buffer being 300ms long instead of something more reasonable like 100ms. What was interesting was this block of code I came across:
// no sound on windows unless we do this
#ifdef _WIN32
SDL_setenv("SDL_AUDIODRIVER", "directsound", true);
#endif
This code is all kinds of suspicious: I’ve been using SDL2 since 2014, and I know for a fact that it doesn’t just output nothing on Windows if the user doesn’t explicitly select an audio driver. Surely this had to be a hack that exploits some quirk of DirectSound.
Having used SDL2 for so long, I was confident that I could find the real cause, so I set up MSYS2 on a spare Windows 10 PC of mine and started testing.
The bug
True to the comment’s word, disabling that line of code did indeed eliminate audio output.
I once encountered a similar bug in my own software, and the cause was a mismatch between the requested sample format and the provided one. The two most common formats are S16 (signed 16-bit integer) and F32 (32-bit floating point) – if you provide S16 samples to a library expecting F32 samples, you’ll get inaudible output, since the integer bit patterns, reinterpreted as floats, land extremely close to zero.
With this in mind, I began checking how VBA-M initialises SDL2’s audio subsystem. My eye was drawn to the line that opens the audio device.
This line supplies a configuration struct to SDL2, and then receives one back. The supplied struct details the configuration you’d like SDL2 to use, and the received struct details the configuration SDL2 has chosen to use.
You might think that sounds pretty stupid. What’s even the point of requesting a format if SDL2 is just going to choose its own anyway? Well, SDL2 only chooses its own if you let it – that’s what the SDL_AUDIO_ALLOW_ANY_CHANGE flag is for.
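For reference, the usual SDL2 pattern looks something like this – a generic sketch rather than the exact upstream code, with a placeholder callback name:

SDL_AudioSpec desired = {0}, obtained;

desired.freq = 44100;
desired.format = AUDIO_S16SYS;    /* "please give me signed 16-bit samples" */
desired.channels = 2;
desired.samples = 2048;
desired.callback = soundCallback;    /* placeholder name */

/* With SDL_AUDIO_ALLOW_ANY_CHANGE, SDL2 is free to ignore 'desired' entirely,
   and will report whatever it actually picked in 'obtained' - which the caller
   is then expected to adapt to */
SDL_AudioDeviceID device = SDL_OpenAudioDevice(NULL, 0, &desired, &obtained, SDL_AUDIO_ALLOW_ANY_CHANGE);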
The presence of that flag was – pardon my pun – a red flag.
Software that lets the backend decide the audio configuration would have to be extremely versatile, containing numerous codepaths for handling arbitrary configurations like S16 mono at 48kHz, F32 stereo at 44.1kHz, and S32 5.1 surround at 96kHz. Most people just force SDL2 to use the configuration they want, and stick to a single hardcoded setup.
Safe to say, I had my doubts that VBA-M had that level of flexibility. The easiest way to find out is to see how it’s using that audio_spec variable, which contains SDL2’s chosen configuration.
So, let’s see… it uses the struct in this line of code to determine the silence value…
SDL_memset(stream, audio_spec.silence, length);
…and…
…that’s it.
It never uses that struct again. Well I guess there’s the problem: it allows SDL2 to choose its own configuration, and then never adapts itself to it.
That still leaves one question: is VBA-M actually able to handle alternate configurations, or is it only hardcoded to one in particular?
One look at the audio callback answered that: it forcefully casts the buffer to uint16_t, the data type of S16.
(Okay, I know it’s not signed, but that’s just a weird design choice – you can see VBA-M cast from uint16_t to blip_sample_t (short) in other parts of its source code.)
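To give an idea of the shape, a callback that hardcodes S16 looks roughly like this – a sketch with a made-up sample source, not VBA-M’s actual callback:

#include "SDL.h"

extern uint16_t GetNextEmulatedSample(void);    /* hypothetical stand-in for the emulator core */

static void soundCallback(void *user_data, Uint8 *stream, int length)
{
    /* Only meaningful if SDL2 really did open an S16 device */
    uint16_t *samples = (uint16_t *)stream;
    int total_samples = length / sizeof(uint16_t);

    (void)user_data;

    for (int i = 0; i < total_samples; ++i)
        samples[i] = GetNextEmulatedSample();
}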
The fix
Now that I knew that the problem was VBA-M allowing alternate configurations when it could only handle a certain one, I could create a fix.
It’s simple: just don’t pass the SDL_AUDIO_ALLOW_ANY_CHANGE flag:
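In terms of the earlier sketch, that just means passing 0 as the ‘allowed changes’ argument:

/* With 0 instead of SDL_AUDIO_ALLOW_ANY_CHANGE, SDL2 must deliver the requested
   S16 format, converting behind the scenes if the hardware can't do it natively */
SDL_AudioDeviceID device = SDL_OpenAudioDevice(NULL, 0, &desired, &obtained, 0);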
And just like that, SDL2 could output audio on its default backend – WASAPI.
So why did forcing SDL2 to use DirectSound work around the issue? For whatever reason, SDL2 defaults to S16 when using DirectSound.
Miscellaneous
As you could guess from the title, this bug is quite old.
It was introduced during VBA-M’s conversion from SDL1 to SDL2, back in 2015. The original code explicitly disallowed SDL1 from choosing its own configuration by passing NULL to the SDL_OpenAudio function. For whatever reason, this behaviour was not retained in the SDL2 conversion.
It wasn’t until 2018 that the hack to force DirectSound was introduced, leaving a three-year gap where audio on Windows was presumably broken.
But Windows isn’t the only platform this bug affects: theoretically any platform could have been broken by it. I guess it was by pure luck that backends such as PulseAudio (common on Linux) defaulted to S16, just like DirectSound.
Curiously, there are many unresolved Issues on VBA-M’s GitHub repo that mention a lack of audio. It really makes you wonder how many people this bug affected, even after the Windows hack was introduced.
Closing
Working on this bugfix was a fun little distraction from university, and it was nice to finally contribute to an emulator that I’ve been using since the early 2010s. This bug was a bit of a showstopper, so hopefully fixing it will help VBA-M survive into the future. Can’t let mGBA have all the glory after all.
Of course, I’ve created a pull request to have this fix merged upstream.
I suppose I should start this off by explaining exactly what clownaudio is:
clownaudio is my custom sound engine library. It performs real-time decoding, mixing, and playback of sounds in a variety of formats. Think ‘the thing that plays music and sound effects in a video game’.
When I was a poor naive soul, I didn’t think creating such a library was necessary. Surely there was a standard way for C programs to play music and sounds, right? Oh how wrong I was…
The absolute state of audio on PC
I like to think I didn’t start off with C the way most programmers did: instead of Visual Studio and DirectX, I opted for MSYS (and later MSYS2) and SDL2. They were a good pair, and made it very easy to transition to developing on Linux (I use Arch, BTW).
SDL2 is a lovely bit of middleware that handles everything from hardware-accelerated 2D rendering, to window management, event handling, gamepad reading… just everything. Every little platform-dependent thing a newbie like me could want, SDL2 could do in a way that works just the same on Windows as it does on Linux and Mac. Want to create a window? SDL2 has you covered. Want to draw sprites? Done.
Want to play sounds? Oh, er, it can’t do that.
And this isn’t just a one-off thing: go look at any audio playback library, and you’ll find the same thing. SDL2, Cubeb, PortAudio – they all claim to be audio playback libraries, but you can’t just load a sound and play it the same way SDL2 can load a .bmp file and start drawing it.
Instead, what you get is something not too unlike the DAC sound channel on a Sega Mega Drive: you get a raw PCM stream, and what samples you feed to it come out of your speakers in real-time.
What. The. Hell.
Have we really not progressed since 1988? Do modern PCs really not have multiple sound channels – just a DAC-wannabe that expects you to funnel software-mixed samples into it? Apparently yes.
In hindsight, I understand why this is the case: audio processing is relatively light on the CPU, you’re not limited to a fixed number of channels, you can apply whatever filters and effects you want, etc. But to a newbie, this was a royal pain – I didn’t know the first thing about writing a sound mixer.
The solution… that led to another problem
Now, it didn’t take me long to find the SDL_mixer project, which is an add-on for SDL2 that provides the kind of audio playback API you’d expect: you can load sounds, be they Ogg Vorbis, WAV, or MP3 files, and play them whenever you want… hey, that sounds just like clownaudio.
So if SDL_mixer exists, why did I eventually make clownaudio? Well…
An old indie game meets Ogg Vorbis
Development of what would eventually become clownaudio dates all the way back to a goofy little mod I made for Cave Story:
Cave Story has a modding community, and it’s always extending the original engine with extra features such as a money system and new weapons. This is achieved through modifying the executable’s machine code.
I did things a little differently though: I wasn’t interested in cramming custom x86 assembly code into a dusty old EXE, so instead I wrote a little hack that made the game load an arbitrary DLL. This DLL would contain whatever extra functionality I wanted, all written in glorious C. Particularly, I wanted to add support for Ogg Vorbis music files.
You see, Cave Story doesn’t use standard audio formats like Vorbis: instead it uses its own little tracker format – Organya. This can be limiting for modders, since Organya’s only meant for pulling off simple 8-bit-style chiptunes.
Getting the DLL working was simple enough, as was hooking it into the game’s music playback system. I then made the DLL use SDL_mixer to play certain .ogg files depending on the current sound ID, and voila: Ogg Vorbis support in Cave Story.
An edgecase too many
It didn’t take long for SDL_mixer to become a problem.
You see, its API is weird: instead of just having a generic ‘sound’ type, which can be used as either music or a sound effect, SDL_mixer has a dedicated ‘music’ type for music, and a ‘chunk’ type for sound effects. Not only that, but while you can have multiple ‘chunk’s playing at once, you can only have one ‘music’ playing at a time.
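For the unfamiliar, the two halves look roughly like this (the filenames are placeholders, and the usual Mix_OpenAudio setup is assumed to have happened already):

#include "SDL_mixer.h"

void PlaySomeSounds(void)
{
    /* The 'music' type: streamed, and only one can play at a time */
    Mix_Music *music = Mix_LoadMUS("song.ogg");
    Mix_PlayMusic(music, -1);    /* -1 = loop forever */

    /* The 'chunk' type: decoded up-front, and many can play simultaneously */
    Mix_Chunk *jingle = Mix_LoadWAV("jingle.wav");
    Mix_PlayChannel(-1, jingle, 0);    /* -1 = first free channel, 0 = play once */
}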
Why…? Either way, this is a problem for Cave Story:
Cave Story, being a Metroid-style game, has item pickups such as health capsules. Like in Metroid, these pickups play a short jingle when obtained. The jingle interrupts the background music, which pauses when the jingle begins and resumes when it ends.
In any other scenario, this wouldn’t be a problem: just pause the background music, load the jingle, play it, unload it, and resume the background music.
But SDL_mixer doesn’t let you play more than one song at a time: playing a new one will cancel the old one, even if it was paused.
Well fine – just play the song again, and seek back to where it was before, right? Nope: SDL_mixer gives you no way to ask how far into a song you currently are, so there’s nothing to tell it to seek back to.
Playing the songs as chunks isn’t an option either, as there are features exclusive to the music type that the mod absolutely requires in order to work properly (namely the Mix_HookMusicFinished function, which it uses to play songs that are split across multiple files, one after the other).
So that’s it – no way to have multiple songs loaded at once, and no way to seek back to where a song was before it was cut off. Great. A simple mod foiled by a terrible API. Guess I’ll just have to live with the background music constantly resetting for the rest of time, right?
If you want something done right, then do it yourself
You can probably guess where this is going. Disgruntled by how such a popular library could be such a limited, complicated, frustrating mess, I figured it would be better to take matters into my own hands.
It’s just a mod for playing .ogg files, after all – all I need to do is find an Ogg Vorbis decoder, and funnel whatever samples it outputs into SDL2’s audio stream. That can’t be too hard, right?
But I think that’s enough storytelling for today. Next time, I’ll cover libvorbis, Cubeb, and my horrific abuse of audio streams.
I’ve been telling people that I was going to start a blog for a while now, so here it is.
Where do I begin… well, I’ve always been a bit of a rambler, and for some reason people seem to enjoy it. What do I ramble about? Why, my programming projects of course.
Want to hear about my custom sound engine – clownaudio? Want to hear how I ported Cave Story to the Wii U without its source code? Want to hear how graph theory can be used to create a perfect LZSS compressor?
I’m always working on something, so I’ve always got something to talk about.
Why a blog? Because I’d like a nice centralised place to keep these things together. On top of being good for archival, it’s also useful for giving me a place to point people to instead of needing to repeat myself a bunch of times.
I suppose that’s enough for a first post. Now to make the first real post…