Which speech-to-text model should I use for lecture capture?
That question turned into its own side project: https://github.com/Ryfter/asr-bench.
The idea is simple. Instead of trusting a leaderboard, run the models against your own audio, on your own hardware, and see what happens.
For my first real test, I used 12 lecture recordings from my own course. That came to 614 minutes of lecture capture, a little over 10 hours of classroom audio.
I tested four local Whisper models:
| Model | Parameters | Disk | WER | Speed (× real time) | Total Time |
|---|---|---|---|---|---|
| Whisper Small | 244M | 1.4 GB | 10.7% | 43.5x | 14.1 min |
| Whisper Medium | 769M | 1.4 GB | 11.8% | 29.1x | 21.1 min |
| Whisper Large V3 | 1550M | 8.6 GB | 14.2% | 14.7x | 41.9 min |
| Whisper Large V3 Turbo | 809M | 1.5 GB | 8.9% | 64.8x | 9.5 min |
For this set of lectures, Whisper Large V3 Turbo was the clear pick. It had the lowest WER (8.9%) and was also the fastest, finishing the entire 10-hour corpus in about 9.5 minutes.
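If you want to sanity-check the Speed column, it reads as a real-time factor: minutes of audio divided by minutes of processing. Here is a minimal sketch of that arithmetic using the totals from the table. The function name `speed_factor` is my own for illustration, not something taken from asr-bench, and because the table's times are rounded, the recomputed values may differ from the table in the last digit.

```python
# Real-time speed factor: how many minutes of audio are transcribed
# per minute of wall-clock processing time.

AUDIO_MINUTES = 614  # total length of the 12 lectures

# Total processing time per model, in minutes (rounded values from the table).
TOTAL_MINUTES = {
    "Whisper Small": 14.1,
    "Whisper Medium": 21.1,
    "Whisper Large V3": 41.9,
    "Whisper Large V3 Turbo": 9.5,
}


def speed_factor(audio_minutes: float, processing_minutes: float) -> float:
    """Return audio duration divided by processing duration."""
    if processing_minutes <= 0:
        raise ValueError("processing_minutes must be positive")
    return audio_minutes / processing_minutes


if __name__ == "__main__":
    for model, minutes in TOTAL_MINUTES.items():
        print(f"{model}: {speed_factor(AUDIO_MINUTES, minutes):.1f}x")
```

A factor above 1x means the model runs faster than real time, so every model here gets through an hour of lecture in well under an hour.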
That is the kind of result I care about. Not “what is the best model in general?” but “what should I use for this actual workflow?”
A quick note on the numbers: the reference transcripts were Panopto-generated captions, not hand-corrected transcripts. So the WER figures measure how much each model disagrees with those captions, not how accurate it is in absolute terms. They are better read as a comparison against the captions I already had.
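For anyone new to WER, it is the number of word-level substitutions, deletions, and insertions needed to turn the model output into the reference, divided by the number of words in the reference. Below is a rough sketch of that calculation. It is not the asr-bench implementation: the lowercasing and whitespace splitting are simplifying assumptions on my part, and real tools usually normalize punctuation and numbers as well.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count.

    Assumption: text is normalized only by lowercasing and splitting on
    whitespace. Punctuation is left attached to words.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion (reference word missing)
                    curr[j - 1] + 1,     # insertion (extra hypothesis word)
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    caption = "the cat sat on the mat"  # stands in for a Panopto caption
    output = "the cat sit on mat"       # stands in for a Whisper transcript
    print(f"WER: {wer(caption, output):.1%}")
```

Because the reference here is itself machine-generated, any error in the caption counts against a model that actually got the word right.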
That is still useful. Most professors and instructional teams are not starting with perfect transcripts. They are starting with lecture capture output, exported captions, and a need to improve what already exists.
The README for the project includes parameter counts, disk usage, runtime, VRAM usage, and per-lecture breakdowns. That makes it easier to choose a model based on the whole workflow, not just one accuracy number.
For my lecture capture work, Whisper Large V3 Turbo is the model I would try first.