Remember when transcribing audio meant hitting play-pause-rewind a thousand times while typing furiously? Those days are gone. Let’s dive into two powerful AI transcription tools that won’t cost you a dime: Google AI Studio and OpenAI Whisper. Whether you’re dealing with interviews, lectures, or that podcast you’ve been meaning to reference, these tools can handle the heavy lifting.
Why should you care? Because time is precious, and these AI tools can transcribe hours of audio in minutes. Plus, they’re surprisingly accurate – we’re talking way better than your uncle Bob trying to use voice-to-text on his smartphone.
AI Studio by Google
AI Studio is Google's experimentation platform for generative AI, giving you access to numerous experimental and production GenAI models.
- I keep the default settings, but confirm which model is selected.
- Choose a Gemini model with a 2-million-token context window (a 1-million-token model should work fine as well).
- Flash models are faster, but you do lose some precision. I normally use a Flash model for this.
- Record the audio however you like. I use a voice recorder app on my phone.
- Save the audio file to Google Drive. (Optional, but recommended.)
- Open up Google AI Studio and choose Create Prompt.
- Click the + sign in the prompt input at the bottom. Choose My Drive and find the file.
- Alternatively, you can upload the file directly here as well.
- I recommend the Google Drive route (the save-to-Drive step above), since an upload ends up in Drive anyhow.
- Paste in this prompt to generate the transcription:
- Generate audio diarization, including transcriptions and speaker information for each transcription, for this interview. Organize the transcription by the time they happened. If you can infer the speaker, please do. If not, use speaker A, speaker B, etc.
- Note: If it is a LONG audio file, you may have to type “Please continue” if it stops generating before the end of the file.
- The output should be formatted as [00:00:00] NAME: text…
- The model has to pick up speaker information from the audio as part of the diarization; if it can’t, it won’t be able to put speaker names in and falls back to the generic labels.
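Once you have a transcript in that [00:00:00] NAME: text shape, it is easy to post-process. Here is a minimal Python sketch (assuming the output follows that exact format; the sample lines are made up for illustration) that parses it into structured entries:

```python
import re

# Matches lines like "[00:01:23] Speaker A: some text"
LINE_RE = re.compile(r"\[(\d{2}:\d{2}:\d{2})\]\s+([^:]+):\s*(.*)")

def parse_transcript(text):
    """Parse '[HH:MM:SS] SPEAKER: text' lines into a list of dicts."""
    entries = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            timestamp, speaker, utterance = m.groups()
            entries.append({"time": timestamp, "speaker": speaker, "text": utterance})
    return entries

# Example with stand-in transcript lines
sample = "[00:00:05] Speaker A: Welcome to the show.\n[00:00:09] Speaker B: Thanks for having me."
for entry in parse_transcript(sample):
    print(entry["time"], entry["speaker"], "-", entry["text"])
```

From there you can filter by speaker, build timelines, or export to another format.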
You can also have it output into different formats, like JSON. This is an alternative prompt from Google’s Vertex AI:
Generate audio diarization for this interview. Use JSON format for the output, with the following keys: “speaker”, “transcription”. If you can infer the speaker, please do. If not, use speaker A, speaker B, etc.
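If you use the JSON prompt, the model should return a list of objects with those two keys. A short sketch (assuming the model actually returns valid JSON, which is worth checking before you rely on it) that turns that output back into a readable script:

```python
import json

def json_to_script(raw):
    """Convert '[{"speaker": ..., "transcription": ...}, ...]' JSON into plain text."""
    turns = json.loads(raw)
    return "\n".join(f"{t['speaker']}: {t['transcription']}" for t in turns)

# Example with stand-in model output
raw = '[{"speaker": "Speaker A", "transcription": "Hello."}, {"speaker": "Speaker B", "transcription": "Hi there."}]'
print(json_to_script(raw))
```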
Google Colab + OpenAI Whisper
Google Colab is a free cloud service that allows you to run Python programs through a web interface. It is built on the Jupyter Notebook platform. While it primarily supports Python, R can be used with additional setup.
This approach only does transcription; it does not support diarization.
- Record the audio however you like. I use a voice recorder app on my phone.
- Save the audio file to Google Drive. Note the folder you saved it into.
- Open up Google Colab and create a new notebook. (You need a Google account set up before you can open Colab notebooks.)
- Start with installing and configuring the software needed:
!pip install git+https://github.com/openai/whisper.git
!sudo apt update && sudo apt install ffmpeg
- Click the “Run Cell” icon to the left. Alternatively press Ctrl + Enter.
- Click + Code after Whisper and ffmpeg are installed
- It will say “completed at <time>” on the bottom of the screen.
- Note: You will type the command to run into this new cell in a later step.
- Connect Colab to Google Drive. On the far left side, look for the folder icon:
- In the Files section, click on the Mount Drive option for Google Drive:
- Note: The first time you try to mount Google Drive, you will have to run a short script to mount it. When you click on the Google Drive icon, it prompts you to run the script; clicking Run is all that’s needed, and the rest is automatic.
- It will prompt you for permissions to access drive at this point.
- Type in the Whisper command now: !whisper --language en ". Typing the opening quote makes the editor add a closing quote, giving you a pair to paste the file path into. You will run this in a later step.
- Without --language en, Whisper analyzes the first 30 seconds of the file to detect which language is being spoken.
- This detection step adds time to completing the task.
- Click on Drive, then My Drive, and then look for the file you want transcribed. Click the 3 dots at the end of the file name and click “Copy Path”.
- Paste the path between the double quotes after !whisper. It will look like this:
!whisper “/content/drive/MyDrive/AI/test audio file.mp3”
- Click the “Run Cell” icon again, or press Ctrl + Enter.
- It takes a little while to run, and when it is done the text is in the console. Highlight the output you want, copy it (ctrl+c) and then paste it into a document.
- I have also created a tutorial that has these steps already in it.
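If you later want subtitles rather than plain text, open-source Whisper (when called from Python rather than the CLI) returns timed segments with start, end, and text fields. A hedged sketch that converts that segment structure into SRT format; the sample data below is made up to stand in for a real result["segments"] list:

```python
def to_srt(segments):
    """Convert Whisper-style segments (start/end in seconds, plus text) into SRT."""
    def fmt(seconds):
        # SRT timestamps look like 00:00:04,200
        h, rem = divmod(int(seconds), 3600)
        m, s = divmod(rem, 60)
        ms = int(round((seconds - int(seconds)) * 1000))
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{fmt(seg['start'])} --> {fmt(seg['end'])}\n{seg['text'].strip()}")
    return "\n\n".join(blocks)

# Stand-in data shaped like the segments Whisper reports
segments = [{"start": 0.0, "end": 4.2, "text": " Hello and welcome."}]
print(to_srt(segments))
```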
NOTES on Whisper: This can take a while, especially for LARGE files. Even a small file of less than 20 seconds can take minutes to run. My example file is 16 seconds and took 2 minutes and 30 seconds to complete; with --language en it took 1 minute 38 seconds.
I recommend adding the --language flag.
You can find out more about the Whisper API here: https://platform.openai.com/docs/guides/speech-to-text/. The docs cover the languages it works with and options you can set for timestamps, and also talk about breaking up longer/larger audio into chunks under 25MB.
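The 25MB limit applies when you upload to the hosted API. Before uploading, a quick standard-library check like this can tell you whether a file needs splitting (the path below is a placeholder):

```python
import os

API_LIMIT_BYTES = 25 * 1024 * 1024  # 25 MB upload limit for the hosted API

def needs_chunking(path, limit=API_LIMIT_BYTES):
    """Return True if the audio file exceeds the API upload limit."""
    return os.path.getsize(path) > limit

# Example usage (hypothetical path):
# if needs_chunking("/content/drive/MyDrive/AI/long_interview.mp3"):
#     print("Split this file before uploading.")
```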
The Whisper API FAQ is here: https://help.openai.com/en/articles/7031512-whisper-audio-api-faq
Both these tools have their sweet spots. Google AI Studio is perfect when you need something quick and cloud-based, while OpenAI Whisper through Colab gives you more control and can handle trickier audio files. Choose what works best for your needs, and say goodbye to manual transcription headaches.
A quick heads-up: AI transcription technology keeps getting better by the day. While these tools are solid choices right now, keep an eye out for updates and new features. That’s the beauty of living in the AI age – there’s always something new and improved around the corner.