Transcribing Audio without Automation or Outsourcing


Recently, I had tens of thousands of spoken words that needed to be transcribed into written text. “Hang on,” you might be thinking, “why would a college student who, by all accounts, spends his time on reputable pursuits such as Python, need to convert audio into text?” While I was doing interviews for my book, I transcribed about 8 hours of audio, a process that took somewhere between 24 and 32 hours of very focused work over the course of a month or so. Here’s what I learned in the process.

How would you do it? First, you might think of an automated solution. Dictation software has been around for years, so speech-to-text must be good by now, right? Unfortunately, as gathered from my experiments with AWS speech-to-text, the technology isn’t a viable option for any level of readability. If you have the money, by the time you’re done reading this post you’ll want to outsource it, and probably should (at a cost of around a hundred dollars per hour of audio, less if you can wait a week or more for delivery). My only recommendation is that if you’re going to spend money, spend enough to get a good result. If spending that kind of money isn’t viable, and you’re not going to single-handedly decrease the error rate of AI speech-to-text systems by a couple orders of magnitude, then you’re left with only one option: transcribe it yourself.

These notes only apply to transcribing audio/video files, where you have the ability to pause and rewind the audio. I have no idea how real-time transcription is possible; I assume it relies on some sort of shorthand or specialized input device. Ultimately, transcribing is not a fun physical act, but it does help you get familiar with the material you’re listening to. With just a few simple pieces of equipment and a sufficient time investment, transcribing audio is an achievable task.

Use a Foot Pedal

Here’s how transcribing works on a minute-to-minute basis. You press play on the recording. You type as you listen. Inevitably, you fall behind, so you pause the recording and type out the rest of the sentence as cached in your short-term memory. Caught up, you press play. Repeat. If you don’t pause in time, you’ll forget some words and need to rewind. Occasionally, you’ll have trouble understanding something and have to rewind and listen to it a few times. The problem isn’t the typing or the remembering. The biggest problem with transcribing is hitting the play/pause button.

If you’re the type of person who uses custom keyboard shortcuts, hear me out: you’re going to love this. Keyboard shortcuts are usually great, but in this case any keyboard shortcut has the same problem: it distracts your already overloaded hands from their primary task of trying to keep up with the speaker. However, other parts of your body are just sitting there idling. Make like a CrossFit instructor and get your whole body in on this movement. Use a foot pedal to play and pause audio.

If I had a foot pedal to recommend, I’d be linking it right here. However, I didn’t use a foot pedal; I used my mouse. More specifically, I mapped the thumb button on my Logitech MX Master 2S to audio play/pause, put the mouse on the floor, and got to work. With a touch of my toe, I could start and stop the recording. Immediately, my speed and comfort doubled with this physical equivalent of the sort of if-it-works-it-works code that I’d be churning out at a hackathon. If you have a spare keyboard, multifunction mouse, macro pad, video game controller, or anything else that you can connect to your computer, remap to play/pause, and operate with your foot, use it. Otherwise, go ahead and buy a foot pedal. I’m sure you’ll find other uses for it.

Be(come) a Decent Typist

I don’t want to wade too deep into the holy war over the importance of typing for a programmer, but I’ll say this. A person talks somewhere around 150 words per minute in casual conversation. The average person types below 40 words per minute. Unless you have some limiting physical disability, you should be able to get above sixty words per minute with practice. When I say “limiting,” keep in mind that I have nine fingers, only six of which are fully functional. I type above sixty words per minute, asterisk.

The asterisk here is sustained versus peak typing speed. When people talk about how fast they can type, they’re usually referring to their score on a one-minute typing test. Just as sprinting is faster than jogging, you’ll type more slowly when you’re actually typing non-stop for an hour or more. My sustained transcribing pace is between 40 and 45 words per minute, which also accounts for play/pause and rewind.
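As a rough sanity check, the numbers in this post can be plugged into a few lines of Python. This is only a back-of-the-envelope sketch: it assumes the speakers talk continuously at about 150 words per minute for the whole recording, which real interviews don't, and the function name is just something I made up for illustration.

```python
# Back-of-the-envelope estimate of transcription time.
# Assumption: speech is continuous at SPEECH_WPM for the entire recording.
SPEECH_WPM = 150   # approximate conversational speaking rate
AUDIO_HOURS = 8    # amount of audio transcribed for the book


def transcription_hours(audio_hours, speech_wpm, typing_wpm):
    """Hours needed to type out a recording at a sustained typing pace."""
    total_words = audio_hours * 60 * speech_wpm
    return total_words / typing_wpm / 60


for typing_wpm in (40, 45):
    hours = transcription_hours(AUDIO_HOURS, SPEECH_WPM, typing_wpm)
    print(f"{typing_wpm} wpm sustained: about {hours:.1f} hours")
```

At 40 words per minute this works out to about 30 hours, and at 45 words per minute to about 26.7 hours. Both land inside the 24 to 32 hours it actually took me.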

Even an extremely fast typist won’t be able to keep up with an average speaker. That’s not the point. Per the earlier discussion of play/pause, increasing your sustained typing rate by N percent will result in a greater-than-N-percent increase in your transcription speed, because you’ll need to play, pause, and rewind less often.

Consider Ergonomics

I shelled out 220 of my hard-earned dollars for a Kinesis Gaming Freestyle Edge RGB Keyboard for general writing purposes last year. It is split in half, with the halves positionable up to 20 inches apart and tilted up to about a 20-degree angle. The difference in typing experience when transcribing is monumental. I typed over 9,000 words in a single day on that keyboard without any pain, whereas days before I’d experienced pain typing 4,000 words on my laptop keyboard. Use a keyboard that’s comfortable for you. Taking breaks is also important: I stopped for 5 minutes at the end of every hour. Ergonomics is not my hill to die on, but whatever efforts you make for normal desk work should be strongly applied here.