Which Whisper Model Works Best for Lecture Captions? I Tested 614 Minutes of My Own Classes

Key result: Across 12 lecture recordings totaling 614 minutes, Whisper Large V3 Turbo had both the lowest word error rate (WER) and the fastest runtime in this benchmark: 8.9% WER, 64.8x realtime, and 9.5 minutes of total processing time.

I have been building a new Canvas tool that combines ideas from my Canvas Design Studio project and my Canvas Backup app. I am hoping to release it soon, but while working on it I ran into a practical question:

Which speech-to-text model should I use for lecture capture?

That question turned into its own side project, a benchmark for finding which model does best on lecture captions: https://github.com/Ryfter/asr-bench.

The idea is simple. Instead of trusting a leaderboard, run the models against your own audio, on your own hardware, and see what happens.
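As a rough sketch of that idea (not necessarily how asr-bench is implemented), a timing harness only needs a function that turns an audio file into text. The names `time_transcription` and `fake_transcribe` and the file name are hypothetical; in practice the callable would wrap whatever Whisper runtime is installed on your machine.

```python
import time
from typing import Callable, Tuple


def time_transcription(
    transcribe: Callable[[str], str], audio_path: str
) -> Tuple[str, float]:
    """Run one transcription and return (text, elapsed_seconds)."""
    start = time.perf_counter()
    text = transcribe(audio_path)
    elapsed = time.perf_counter() - start
    return text, elapsed


def fake_transcribe(audio_path: str) -> str:
    # Stand-in for a real model call (assumption), so the sketch runs
    # without a GPU or a model download.
    return f"transcript of {audio_path}"


text, seconds = time_transcription(fake_transcribe, "lecture01.wav")
print(f"{seconds:.4f} s -> {text}")
```

Swapping `fake_transcribe` for a real model wrapper keeps the timing code identical across models, which is what makes the comparison fair.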

For my first real test, I used 12 lecture recordings from my own course. In total, that was 614 minutes of lecture capture, a little over 10 hours of classroom audio.

Four Whisper Models for Lecture Capture:

Model	Parameters	Disk	WER	Speed (x realtime)	Total Time
Whisper Small	244M	1.4 GB	10.7%	43.5x	14.1 min
Whisper Medium	769M	1.4 GB	11.8%	29.1x	21.1 min
Whisper Large V3	1550M	8.6 GB	14.2%	14.7x	41.9 min
Whisper Large V3 Turbo	809M	1.5 GB	8.9%	64.8x	9.5 min
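
The Speed column is consistent with total audio length divided by total processing time. A minimal sketch of that arithmetic follows, using the figures from the table. The function name is an assumption, and because the total times are rounded to a tenth of a minute, the recomputed values can differ from the table by a few tenths.

```python
def realtime_factor(audio_minutes: float, processing_minutes: float) -> float:
    """Minutes of audio processed per minute of wall-clock time."""
    if processing_minutes <= 0:
        raise ValueError("processing_minutes must be positive")
    return audio_minutes / processing_minutes


CORPUS_MINUTES = 614  # total lecture audio in this benchmark

# Total processing time in minutes, as reported in the table.
total_minutes = {
    "Whisper Small": 14.1,
    "Whisper Medium": 21.1,
    "Whisper Large V3": 41.9,
    "Whisper Large V3 Turbo": 9.5,
}

for model, minutes in total_minutes.items():
    print(f"{model}: {realtime_factor(CORPUS_MINUTES, minutes):.1f}x")
```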

For this set of lectures, Whisper Large V3 Turbo was the clear pick. It had the lowest WER and finished the entire 10-hour corpus in about 9.5 minutes.

That is the kind of result I care about. Not “what is the best model in general?” but “what should I use for this actual workflow?”

A quick note on the numbers: the reference transcripts were Panopto-generated captions, not hand-corrected transcripts. So I would not treat these as perfect accuracy scores. They are better read as a comparison against the captions I already had.
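
For reference, WER is conventionally the number of word substitutions, deletions, and insertions needed to turn the model output into the reference, divided by the number of reference words. The sketch below is a plain-Python version of that definition, not the scoring code from asr-bench. The lowercase-and-split tokenization and the example sentences are assumptions for illustration.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Illustrative strings only: the reference contains a word the model did not output.
reference = "class runs between nine and eleven am"
hypothesis = "class runs between nine and eleven"
print(f"WER: {wer(reference, hypothesis):.3f}")
```

A reference that contains a word the speaker never said still counts as an error against the model, which is one reason caption-derived references shift the absolute numbers.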

That is still useful. Most professors and instructional teams are not starting with perfect transcripts. They are starting with lecture capture output, exported captions, and a need to improve what already exists.

The README for the project includes the model sizes, parameter counts, disk usage, runtime, VRAM usage, and per-lecture breakdowns. That makes it easier to choose a model based on the whole workflow, not just one accuracy number.

For my lecture capture work, Whisper Large V3 Turbo is the model I would try first.

* One more note on all of this: the WER is calculated against the Panopto captions. As part of its ASR pipeline, Panopto adjusts the transcript slightly to make it more readable. In one video I said “between the times of 9 to 11,” and Panopto wrote “11:00 am.” I never said “am.” So I would expect some of the measured difference to come from edits like that alone.

About The Author

The AI Professor
