Transcribing Audio without Automation or Outsourcing


Recently, I had tens of thousands of spoken words that needed to be transcribed into written text. "Hang on," you might be thinking, "why would a college student who, by all accounts, spends his time on reputable pursuits such as Python need to convert audio into text?" While doing interviews for my book, I transcribed about 8 hours of audio, a process that took somewhere between 24 and 32 hours of very focused work over the course of a month or so. Here's what I learned in the process.

Author's note: this article was written before Whisper, a highly accurate speech-to-text model, was available. Now, I'd just use AI.

Transcribing audio by hand is hard, but valuable. With just a few simple pieces of equipment and some dedicated focus, it's possible to transcribe audio quickly and accurately by hand.

Use a Foot Pedal

Here's how transcribing works on a minute-to-minute basis. You press play on the recording. You type as you listen. Inevitably, you fall behind, so you pause the recording and type out the rest of the sentence as cached in your short-term memory. Caught up, you press play. Repeat. If you don't pause in time, you'll forget some words and need to rewind. Occasionally, you'll have trouble understanding something and have to rewind and listen to it a few times. The problem isn't the typing or the remembering. The biggest problem with transcribing is hitting the play/pause button.

If you're the type of person who uses custom keyboard shortcuts, hear me out: you're going to love this. Keyboard shortcuts are usually great, but in this case any keyboard shortcut has the same problem: it distracts your already overloaded hands from their primary task of trying to keep up with the speaker. Meanwhile, other parts of your body are just sitting there idling. Make like a CrossFit instructor and get your whole body in on this movement. Use a foot pedal to play and pause audio.

If I had a foot pedal to recommend, I'd be linking it right here. However, I didn't use a foot pedal; I used my mouse. More specifically, I mapped the thumb button on my Logitech MX Master 2S to audio play/pause, put the mouse on the floor, and got to work. With a touch of my toe, I could start and stop the recording. Immediately, my speed and comfort doubled with this physical equivalent of the sort of if-it-works-it-works code that I'd be churning out at a hackathon. If you have a spare keyboard, multifunction mouse, macro pad, video game controller, or anything else that you can connect to your computer, remap to play/pause, and operate with your foot, use it. Otherwise, go ahead and buy a foot pedal. I'm sure you'll find other uses for it.

Be(come) a Decent Typist

I don't want to wade too deep into the holy war over the importance of typing for a programmer, but I'll say this. A person talks at somewhere around 150 words per minute in casual conversation. The average person types below 40 words per minute. Unless you have some limiting physical disability, you should be able to get above sixty words per minute with practice. As for what counts as "limiting": I have nine fingers, only six of which are fully functional, and I type above sixty words per minute.

The asterisk here is sustained versus peak typing speed. When people talk about how fast they can type, they're usually referring to their score on a one-minute typing test. Just as jogging is slower than sprinting, you'll type slower when you're typing non-stop for an hour or more. My sustained transcribing pace is between 40 and 45 words per minute, which also accounts for play/pause and rewind.
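
As a rough sanity check on those numbers, here's a back-of-the-envelope sketch in Python. It assumes the whole recording runs at the 150-words-per-minute conversational rate mentioned above, which is a simplification (real interviews have pauses and slower stretches), and the function name is just something I made up for illustration.

```python
def transcription_hours(audio_hours, speech_wpm, sustained_wpm):
    """Estimate the hours of work needed to transcribe a recording.

    Assumption: the speaker talks at a constant speech_wpm for the whole
    recording, and sustained_wpm already includes play/pause and rewind time.
    """
    words_spoken = audio_hours * 60 * speech_wpm
    minutes_of_work = words_spoken / sustained_wpm
    return minutes_of_work / 60


AUDIO_HOURS = 8    # length of my interview recordings
SPEECH_WPM = 150   # assumed casual conversation rate

for pace in (40, 45):
    hours = transcription_hours(AUDIO_HOURS, SPEECH_WPM, pace)
    print(f"{pace} wpm sustained: about {hours:.1f} hours of work")
```

Under those assumptions, 8 hours of audio comes to roughly 72,000 words, and both estimates (about 30 hours at 40 words per minute, a little under 27 at 45) land inside the 24-to-32-hour range it actually took me.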

Even an extremely fast typist won't be able to keep up with an average speaker. That's not the point. Per the earlier discussion of play/pause, increasing your sustained typing rate by N percent will result in a greater-than-N-percent increase in your transcription speed, because you'll need to play, pause, and rewind less often.

Consider Ergonomics

I shelled out 220 of my hard-earned dollars for a Kinesis Gaming Freestyle Edge RGB Keyboard for general writing purposes last year. It is split in half, with the halves positionable up to 20 inches apart and tilted up to about a 20-degree angle. The difference in typing experience when transcribing is monumental. I typed over 9,000 words in a single day on that keyboard without any pain, whereas days before I'd experienced pain typing 4,000 words on my laptop keyboard. Use a keyboard that's comfortable for you. Taking breaks is also important: I stopped for 5 minutes at the end of every hour. Ergonomics are not my hill to die on, but whatever efforts you make for normal desk work should be strongly applied here.