A’sittin’ on a rainbow: Using Buzz to transcribe speech from audio files

Have you ever listened to a long interview, podcast, or vlog and then, a couple of days later, wanted to look up a book or song that was mentioned, only to find that you couldn’t quite remember the title or the author’s name, even though you could recall the topic or what the person was saying just before or after the mention? Or have you sat through a long meeting for which audio or video was available and wanted to confirm that a certain commitment had been made, or to double-check a precise detail in some instructions you’d given or received, but winced at the prospect of wading through an hour or more of audio? Or a series of meetings, spread over a couple of weeks? Or were you given recordings of meetings and tasked with preparing the minutes? Or are you putting together a family oral or video history and need a way to turn hours and hours of recordings into interview transcripts?

In some of these examples, the transcripts are the desired end product, whereas in others (like getting a book title or confirming that someone did in fact say a certain thing), a transcript would merely be the means to an end. Hiring a human transcriptionist, setting aside the cost factor, would be overkill in nearly every one of these scenarios and would add a delay that is prohibitively long, especially for trivial, personal-interest use cases. There’s also the privacy angle to consider before handing over (or uploading) audio recordings to a third party.

Animated GIF of a 6-second clip from the video for a cover of John Prine’s 1999 song In Spite Of Ourselves. This version (I clipped the snippet above from its video) was performed by the Viagra Boys, with Amy Taylor (frontwoman for Amyl and the Sniffers), and uploaded to YouTube on December 15, 2020. Compare this punk version to Prine’s duet with Iris DeMent: In Spite of Ourselves (Live From Sessions at West 54th), older but only recently uploaded (July 10, 2023).

Good news! At least one suitable tool exists: Chidi Williams’s Buzz. The project’s GitHub About blurb:

Buzz transcribes and translates audio offline on your personal computer. Powered by OpenAI’s Whisper.

How it works, in a nutshell, after you’ve downloaded and run the Buzz installer: import a media file (or turn on the mic on your device), choose a model and the version of the model you’d like to use, and then let ’er rip. The default model (and the top option in the Model drop-down) is Whisper (its GitHub About blurb is Robust Speech Recognition via Large-Scale Weak Supervision) and, for Whisper, you have a choice of versions: Tiny, Base, Small, Medium, and Large. English is not the only supported language for transcription, and you can translate non-English speech into English text by selecting Translate instead of Transcribe from the Task drop-down. Run output (TXT files in my case, since that’s what I chose) appears in the same directory as the source file, with a nicely descriptive name.

The first time that you transcribe or translate speech audio using a particular model version, that model has to be downloaded. Larger models (which, as one might intuit, yield more accurate transcriptions) take more time to download initially and longer to run subsequently than smaller models. The README for Whisper has a section on models, with a comparison table. Buzz runs locally, so once you’ve run each model version of interest, and if you don’t care about updating the software or the speech recognition models it uses, you can firewall off Buzz’s Internet access and continue using it.

To try Buzz out, I ran every Whisper model-version on the same input file, the audio from the December 2020 cover of John Prine’s 1999 song In Spite Of Ourselves by the Viagra Boys and Amy Taylor (frontwoman for Amyl and the Sniffers). I clipped the video for this newer, ironic/satirical version to make the animated GIF embedded above. I’m not a fan of either band or what little I’ve gleaned of their politics, dislike face-scribble and random somebody-Sharpied-a-c*ck-and-balls-on-my-face-while-I-was-asleep all-over tattoos, and I don’t think that, for example, naming a music group to reference an often-abused inhalant like amyl nitrite is kewl or clever, but I did get a kick out of their rendition of the song. The lyrics are neat, and their take on it and the style of the video are neat, too.

Here are the lyrics:

In Spite of Ourselves Lyrics   (Archive.today: In Spite of Ourselves Lyrics)

[Verse 1]
She don't like her eggs all runny
She thinks crossin' her legs is funny
She looks down her nose at money
She gets it on like the Easter Bunny

[Chorus]
She's my baby, I'm her honey
She's my baby, I'm never gonna let her go

[Verse 2]
He ain't been laid in a month of Sunday
I caught him once, and he was sniffin' my undies
He ain't too sharp, but he gets things done
Drinks his beer like it's oxygen

[Chorus]
He's my baby, I'm his honey
He's my baby, I'm never gonna let him go

[Verse 3]
She thinks all my jokes are corny
Convict movies make her horny
She likes ketchup on her scrambled eggs
Swears like a sailor when she shaves her legs

[Chorus]
She takes a lickin', keeps on tickin'
She takes a lickin', I'm never gonna let her go

[Verse 4]
He's got more balls than a big brass monkey
A wacked out weirdo, and a lovebug junkie
Sly as a fox, crazy as a loon
Payday comes, and he's a-howlin' at the moon

[Chorus]
He's my baby, I don't mean maybe
He's my baby, and I'm his honey
He's my baby, I'm his honey
He's my baby, I'm never gonna let him go

[Bridge]
In spite of ourselves
We'll end up a-sittin' on a rainbow
Against all odds
Honey, we're the big door prize
We're gonna spite our noses
Right off of our faces
There won't be nothin' but big old hearts
Dancin' in our eyes

I saved the lyrics in a text file, deleted the section headers, and manually smooshed consecutive lines together, mostly in pairs as in the Buzz transcript output files, to make comparison easier, but I didn’t try to replicate every aspect of the layout or punctuation or inconsistent capitalization of the Whisper-Large transcript.

The song audio is exactly five minutes long. Buzz’s run time (excluding the aforementioned one-time model-version download) using Whisper-Tiny, the smallest Whisper model, was 35 seconds and processing the same audio file using Whisper-Large took 11 minutes 11 seconds.
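
To put those run times in perspective, here is a quick back-of-the-envelope calculation of each model-version's real-time factor (processing time divided by audio length). The durations are the ones reported above; the helper name `real_time_factor` is just an illustrative choice.

```python
# Durations reported above, converted to seconds.
AUDIO_SECONDS = 5 * 60                  # the song is exactly five minutes long
RUN_SECONDS = {
    "Whisper-Tiny": 35,                 # 35 seconds
    "Whisper-Large": 11 * 60 + 11,      # 11 minutes 11 seconds
}


def real_time_factor(run_seconds, audio_seconds):
    """Processing time divided by audio length; below 1.0 is faster than real time."""
    return run_seconds / audio_seconds


for name, seconds in RUN_SECONDS.items():
    factor = real_time_factor(seconds, AUDIO_SECONDS)
    print(f"{name}: {factor:.2f}x the length of the audio")
```

By these numbers, Whisper-Tiny finished in roughly an eighth of the song's running time, while Whisper-Large needed a little more than twice the song's length. These figures describe one machine and one file, so treat them as a rough guide rather than a benchmark.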

But how accurate were the transcripts? I compared the Buzz output generated using Whisper-Large and Whisper-Tiny to the lyrics in KDiff3 (installers here if you’re interested):

KDiff3 comparison of (in order from left to right) the song lyrics and the Buzz output using, respectively, the `Whisper-Large` and `Whisper-Tiny` model-versions.

Whisper-Tiny didn’t perform very well in this test, but Whisper-Large did alright. The chorus (She’s my baby. I’m his honey. She’s my baby. I’m never gonna let you go.) is certainly not sung multiple times at the end of the actual song, but it gets repeated five times in the Whisper-Large transcript. It’s not a case of lines sung in unison by both vocalists getting teased out into separate lines, either, because they take turns on the honey-baby stuff and only do the last line, I’m never gonna let you go, together. Additionally, lines which they do sing together show up only once in the transcript.

It’s a bit of a head-scratcher, but there’s some consolation in knowing that I’m not alone in experiencing this issue. An April 2023 OpenAI forum thread describes the same problem with Whisper, and replies from similarly affected users have kept coming, the most recent in November 2023, just a couple of months back now: Dialog before long pause gets repeated over and over again by Whisper.
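
Repeats like this are easy to flag automatically once the transcript is a text file. The following is a minimal sketch, not anything Buzz or Whisper provides: it groups identical consecutive lines and reports any run that reaches a threshold. The function name, the threshold of three, and the sample lines are all illustrative assumptions; the sample only mimics the five-fold chorus described above and is not copied from the real output file.

```python
from itertools import groupby


def repeated_runs(lines, min_repeats=3):
    """Return (line, count) for each run of identical consecutive lines
    that is at least min_repeats long. Blank lines are ignored."""
    runs = []
    for text, group in groupby(line.strip() for line in lines):
        count = sum(1 for _ in group)
        if text and count >= min_repeats:
            runs.append((text, count))
    return runs


# Illustrative sample: one ordinary line followed by the chorus five times.
CHORUS = "She's my baby. I'm his honey. She's my baby. I'm never gonna let you go."
sample = ["Dancin' in our eyes."] + [CHORUS] * 5

for text, count in repeated_runs(sample):
    print(f"{count}x: {text}")
```

To check a real transcript, pass the lines of the TXT file instead of `sample`. A song with a genuinely repeated chorus will also trigger this, so a hit is a prompt to go listen, not proof of a transcription glitch.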

There are also some pretty glaring mistakes in the interpretation of the female vocals. Undies becomes yarns and weirdo is turned into widow. Given the strong Australian accent underneath the faux twang Amy Taylor tries to affect for this song, the weirdo-widow mix-up is somewhat understandable. If someone played a clip of her singing that word, all by itself, I might make the same error myself. The yarns thing, though, I don’t get. Even after re-listening to that part of the song a few times. Undies is a bit clipped-sounding, but there’s no way I could get yarns out of that. The male vocals (from Viagra Boys member Sebastian Murphy, an American long-term resident of Sweden) are delivered in a male faux twang overlaying what sounds (to yours truly) like plain, non-regional/generic/neutral American English and that appears to be a little more amenable to transcription using Whisper’s models.
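
Word swaps like these can also be pulled out programmatically rather than spotted by eye in KDiff3. Here is a rough sketch using Python's standard `difflib` module to compare a lyric line with a transcript line word by word. The lyric lines come from the lyrics above; the transcript lines are reconstructions in which only the swapped word (yarns, widow) is taken from this post, and the function names are hypothetical.

```python
import difflib
import string


def normalize(line):
    """Lowercase a line and strip surrounding ASCII punctuation from each word."""
    words = (word.strip(string.punctuation) for word in line.lower().split())
    return [word for word in words if word]


def substitutions(reference, transcript):
    """Return (expected, heard) pairs where the transcript replaced lyric words."""
    ref, hyp = normalize(reference), normalize(transcript)
    matcher = difflib.SequenceMatcher(a=ref, b=hyp, autojunk=False)
    return [
        (" ".join(ref[i1:i2]), " ".join(hyp[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag == "replace"
    ]


# Lyric lines from above, paired with reconstructed (assumed) transcript lines.
pairs = [
    ("I caught him once, and he was sniffin' my undies",
     "I caught him once and he was sniffin' my yarns"),
    ("A wacked out weirdo, and a lovebug junkie",
     "A wacked out widow and a lovebug junkie"),
]

for lyric, heard in pairs:
    print(substitutions(lyric, heard))
```

This only reports replacements; words the transcript dropped or invented would show up under the `delete` and `insert` opcodes, which the sketch ignores.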

If you’re interested in hearing him talk a bit in his natural intonation, here’s a July 2022 interview with Murphy: A Viagra Boy Talks to Me. In the spirit of fairness, here’s a 2019 interview with Amy Taylor and a bandmate so that you can hear her warbling in Australian: Amyl and the Sniffers interview – Amy and Declan.

POSTSCRIPT: `Faster Whisper` is a no-go for me (so far).

A final note: I’ve also tried to use Faster Whisper, starting with the Large model (which I’ll refer to as Faster-Whisper-Large for the remainder of this post), but (with Internet access unblocked) the model download process is weird. Download progress isn’t smooth as with the regular Whisper models: it jumps from 0 to 25%, stays stuck there for a long time (IIRC, about 15 minutes), and then seems to complete. The status of the transcription job in the Buzz program window changes to Queued and then, after a little while, the transcription fails with this Unhandled exception in script error dialog:

Error message I get when trying to run Faster Whisper (Large) model: Unhandled exception in script.

The same error message in text:

Traceback (most recent call last):
  File "multiprocessing\process.py", line 314, in _bootstrap
  File "multiprocessing\process.py", line 108, in run
  File "buzz\transcriber.py", line 355, in transcribe_whisper
  File "buzz\transcriber.py", line 392, in transcribe_faster_whisper
  File "faster_whisper\transcribe.py", line 101, in __init__
RuntimeError: CUDA failed with error out of memory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 4, in 
  File "buzz\buzz.py", line 36, in main
  File "Lib\site-packages\PyInstaller\hooks\rthooks\pyi_rth_multiprocessing.py", line 52, in _freeze_support
  File "multiprocessing\spawn.py", line 116, in spawn_main
  File "multiprocessing\spawn.py", line 129, in _main
  File "multiprocessing\process.py", line 329, in _bootstrap
AttributeError: 'NoneType' object has no attribute 'write'

A modest amount of searching online shows that at least a few other people have experienced similar issues. The project owner’s recommendation has been to update to the current version (0.8.4), and my fellow-afflicted Buzz users have responded that they’re already using it or that upgrading didn’t help. OTOH, maybe the model download got borked and I could resolve this error by deleting the file(s) associated with Faster-Whisper-Large and allowing Buzz to re-download the model and rebuild whatever other files it constructs on first-run of that model. Note that Buzz stores models separately from its program files (e.g. on Windows, look for downloaded Whisper models in C:\Users\\.cache\whisper\). Or maybe uninstalling Buzz (and deleting the models stored elsewhere), reinstalling, and starting from scratch would fix it. I dunno.
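
Before deleting anything, it helps to see what is actually in the model cache. The sketch below only lists files and their sizes; it deletes nothing. It assumes the regular Whisper models sit in a `.cache/whisper` folder under the user's home directory, as described above. Where the Faster Whisper files end up is not confirmed here, so they may not appear in this listing. The function name is hypothetical.

```python
from pathlib import Path


def list_cached_models(cache_dir):
    """Return (file name, size in MiB) for each file directly inside cache_dir,
    largest first. Returns an empty list if the directory does not exist."""
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return []
    files = [
        (path.name, path.stat().st_size / 2**20)
        for path in cache_dir.iterdir()
        if path.is_file()
    ]
    return sorted(files, key=lambda item: item[1], reverse=True)


# Assumption: downloaded Whisper models live in <home>/.cache/whisper
whisper_cache = Path.home() / ".cache" / "whisper"
for name, size_mib in list_cached_models(whisper_cache):
    print(f"{size_mib:8.1f} MiB  {name}")
```

A file that is much smaller than its siblings, or than the size implied by the Whisper README's model table, would be a reasonable first suspect for a broken download.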

Maybe it’s simply a matter of what the error message says: my PC’s graphics card isn’t beefy enough and the GPU I’m using ran out of memory? I doubt this is actually the problem, though, because I re-transcribed the same audio file one more time using Whisper-Large just a moment ago, while repeatedly (every second or so) invoking GPUtil.showUtilization(all=True) in a Python REPL running in a terminal window. Note: if you don’t already have Anders Krogh Mortensen et al.’s GPUtil module installed, you need to install it (and import it) first. I may have missed some transient spikes, but none of the readings I got for GPU utilization during transcription ever exceeded 6% and none of the memory utilization values ever exceeded 29%. Then I did the same with a Faster Whisper model and the operation failed in the same way, but a look at the same metrics indicates that it didn’t push my graphics card any harder than the regular Whisper had. So… *shrug*.
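
Invoking the check by hand every second or so makes it easy to miss a peak. A small polling loop that remembers the highest readings is one way to tighten that up. This is a sketch under assumptions: it reads the first GPU through GPUtil's `getGPUs()` and the `load` and `memoryUtil` attributes (both fractions between 0 and 1), and the function names and default sample counts are illustrative. It still samples at intervals, so a spike shorter than the interval, such as a brief allocation while a model loads, can slip through.

```python
import time


def gputil_sample():
    """Return (load, memory_util) as 0..1 fractions for the first GPU."""
    import GPUtil  # third-party module; imported here so the rest runs without it
    gpu = GPUtil.getGPUs()[0]
    return gpu.load, gpu.memoryUtil


def track_peaks(sample, samples=60, interval=1.0):
    """Call sample() repeatedly and return the highest (load, memory_util) seen."""
    peak_load = peak_mem = 0.0
    for _ in range(samples):
        load, mem = sample()
        peak_load = max(peak_load, load)
        peak_mem = max(peak_mem, mem)
        time.sleep(interval)  # wait before taking the next reading
    return peak_load, peak_mem
```

To use it, start the transcription in Buzz and then run something like `track_peaks(gputil_sample, samples=700, interval=1.0)` in the REPL; 700 one-second samples would cover a run of a bit over eleven minutes. A shorter interval narrows the window in which a spike can hide.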

For now, I’m satisfied enough with regular (slow) Whisper-Large performance.
