Key result: On 12 lecture recordings totaling 614 minutes, Whisper Large V3 Turbo had the best result in this benchmark: 8.9% word error rate (WER), 64.8x realtime, and 9.5 minutes total processing time.

I have been building a new Canvas tool that combines ideas from my Canvas Design Studio project and my Canvas Backup app. I am hoping to release it soon, but while working on it I ran into a practical question:

Which speech-to-text model should I use for lecture capture?

That question turned into its own side project: https://github.com/Ryfter/asr-bench.

The idea is simple. Instead of trusting a leaderboard, run the models against your own audio, on your own hardware, and see what happens.

For my first real test, I used 12 lecture recordings from my own course. In total, that was 614 minutes of lecture capture, a little over 10 hours of classroom audio.

I tested four local Whisper models:

| Model | Parameters | Disk size | WER | Speed (x realtime) | Total time |
|---|---|---|---|---|---|
| Whisper Small | 244M | 1.4 GB | 10.7% | 43.5x | 14.1 min |
| Whisper Medium | 769M | 1.4 GB | 11.8% | 29.1x | 21.1 min |
| Whisper Large V3 | 1550M | 8.6 GB | 14.2% | 14.7x | 41.9 min |
| Whisper Large V3 Turbo | 809M | 1.5 GB | 8.9% | 64.8x | 9.5 min |
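
The speed column appears to be the total audio length divided by the total processing time. That reading is my assumption, not something taken from the asr-bench code, but it lines up with the 614-minute corpus. A minimal Python sketch that recomputes it from the table (the function and variable names are made up for illustration):

```python
# Total audio length of the 12 lectures, in minutes (from the post).
AUDIO_MINUTES = 614

# Total processing time per model, in minutes (copied from the table).
processing_minutes = {
    "Whisper Small": 14.1,
    "Whisper Medium": 21.1,
    "Whisper Large V3": 41.9,
    "Whisper Large V3 Turbo": 9.5,
}


def realtime_factor(audio_minutes: float, compute_minutes: float) -> float:
    """Minutes of audio transcribed per minute of processing.

    Assumption: this is how the 'Speed' column is defined.
    """
    return audio_minutes / compute_minutes


for model, minutes in processing_minutes.items():
    factor = realtime_factor(AUDIO_MINUTES, minutes)
    print(f"{model}: {factor:.1f}x realtime")
```

The recomputed values land within rounding of the table. The Turbo row comes out slightly under 64.8x because the 9.5-minute figure is itself rounded.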

For this set of lectures, Whisper Large V3 Turbo was the clear pick. It had the lowest WER and finished the entire 10-hour corpus in about 9.5 minutes.

That is the kind of result I care about. Not “what is the best model in general?” but “what should I use for this actual workflow?”

A quick note on the numbers: the reference transcripts were Panopto-generated captions, not hand-corrected transcripts. So I would not treat these as absolute accuracy scores. They are better read as how closely each model agrees with the captions I already had.
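
For anyone who has not worked with WER before, here is a rough sketch of the standard definition: the number of word substitutions, deletions, and insertions needed to turn the reference into the model output, divided by the number of words in the reference. This is not the code from asr-bench. The normalization (lowercasing and splitting on whitespace) is a simplifying assumption, and the example sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    # Simplified normalization (an assumption): lowercase, split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: one dropped word ("on") and one wrong word ("new").
caption = "the quiz opens on friday at noon"
model_output = "the quiz opens friday at new"
print(f"WER: {word_error_rate(caption, model_output):.1%}")
```

Because the reference here is a machine-generated caption, any mistake in the caption counts against a model that actually got the word right. That is why these scores measure agreement with the captions rather than true accuracy.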

That is still useful. Most professors and instructional teams are not starting with perfect transcripts. They are starting with lecture capture output, exported captions, and a need to improve what already exists.

The README for the project includes the model sizes, parameter counts, disk usage, runtime, VRAM usage, and per-lecture breakdowns. That makes it easier to choose a model based on the whole workflow, not just one accuracy number.

For my lecture capture work, Whisper Large V3 Turbo is the model I would try first.