Whisper transcription on Windows gives you accurate, offline speech-to-text that runs entirely on your own hardware — no subscription, no cloud upload, no per-minute fee. This guide covers everything from prerequisites to production use: the Python pip install, the lighter whisper.cpp port, ready-made GUI apps, and what to do when you want real-time transcription without a Python environment.
TL;DR
- OpenAI Whisper is a free, open-source speech recognition model with five size tiers (tiny → large-v3)
- Install via pip install openai-whisper on Python 3.9–3.12; needs ffmpeg on PATH
- whisper.cpp is a lighter C++ port — no Python, works on CPU via GGML quantization
- GPU (CUDA) cuts transcription time to near real-time even on large models; CPU works fine for the small model
- For live transcription without any Python setup, VoxBooster bundles Whisper-grade local STT with a global hotkey
- Common errors: missing ffmpeg, wrong Python env, CUDA version mismatch
What Is Whisper Transcription?
OpenAI Whisper is an open-source automatic speech recognition (ASR) system trained on 680,000 hours of multilingual audio. Released in September 2022 and continuously improved since, it runs as a local model — meaning your audio files never leave your PC. It handles 99 languages, punctuates automatically, and achieves word error rates under 5% on clean English audio with the large-v3 model.
Unlike cloud services (Otter.ai, Rev, Descript’s transcription layer), Whisper on Windows has no per-minute cost and no data policy to worry about. Whisper transcription is genuinely free once the model weights are downloaded.
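Word error rate (WER), the metric behind these accuracy claims, is the word-level edit distance between the model's transcript and a reference transcript, divided by the number of reference words. A minimal sketch of how it is computed (the reference and hypothesis strings below are made-up examples, and this is an illustrative implementation, not Whisper's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance, computed over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("the" -> "a") over 6 reference words, roughly 0.167:
print(wer("the cat sat on the mat", "the cat sat on a mat"))
```

A 5% WER therefore means about one word in twenty is inserted, deleted, or substituted relative to a human reference.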
Prerequisites Before You Install
Before picking an install method, sort out these dependencies:
Python 3.9–3.12. The official Whisper package requires Python. Check if you have it:
py --version
If not, download the latest 3.12 installer from python.org. During install, tick “Add Python to PATH” — this matters.
ffmpeg. Whisper uses ffmpeg to decode audio and video files. Without it, you’ll get FileNotFoundError or a blank output on anything that isn’t a raw WAV. The fastest install method on Windows 10/11:
winget install Gyan.FFmpeg
Then open a new terminal and verify: ffmpeg -version.
A GPU (optional but recommended). Whisper runs on CPU, but a CUDA-capable NVIDIA GPU makes a significant difference. For the large model, CPU transcription of a 10-minute file takes 3–6 minutes on a modern desktop; on a mid-range GPU (RTX 3060, 12 GB VRAM) it takes about 40 seconds. More on model sizes and VRAM requirements in the table below.
Whisper Model Sizes: Which One to Pick
| Model | Parameters | VRAM (FP16) | Relative speed | English WER | Best for |
|---|---|---|---|---|---|
| tiny | 39 M | ~1 GB | ~32× real-time | ~5.7% | Quick drafts, low-end hardware |
| base | 74 M | ~1 GB | ~16× real-time | ~4.2% | Fast notes, live streaming |
| small | 244 M | ~2 GB | ~6× real-time | ~3.0% | Most users — best value |
| medium | 769 M | ~5 GB | ~2× real-time | ~2.2% | Professional transcription |
| large-v3 | 1550 M | ~10 GB | ~1× real-time | ~1.6% | Accents, multilingual, medical |
“Real-time factor” (RTF) here means GPU inference on an NVIDIA A100. On a consumer RTX 3080, multiply roughly by 3–4×. On CPU, multiply by 10–20× again.
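As a back-of-envelope helper, those speed factors can be turned into a runtime estimate. The constants below are the approximate A100 figures from the table plus the 3–4× and 10–20× slowdown multipliers just mentioned; treat them as rough assumptions, not benchmarks:

```python
# Approximate real-time factors on an NVIDIA A100, from the table above.
A100_RTF = {"tiny": 32, "base": 16, "small": 6, "medium": 2, "large-v3": 1}

# Rough slowdown multipliers relative to an A100 (assumed midpoints).
HARDWARE_SLOWDOWN = {"a100": 1, "rtx3080": 3.5, "cpu": 3.5 * 15}

def estimate_seconds(audio_minutes: float, model: str, hardware: str) -> float:
    """Estimate transcription wall time in seconds for a model/hardware pair."""
    effective_rtf = A100_RTF[model] / HARDWARE_SLOWDOWN[hardware]
    return audio_minutes * 60 / effective_rtf

# A 10-minute file, small model, A100-class GPU: 600 s of audio at 6x -> 100 s.
print(estimate_seconds(10, "small", "a100"))
```

Plugging in the cpu multiplier shows why model choice matters far more on CPU-only machines than on GPUs.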
For most Windows users: start with small. It runs near-real-time on a modern CPU, handles accents better than base, and fits in 2 GB of RAM/VRAM. If accuracy on dense technical vocabulary matters (legal, medical, code reviews), test medium next.
Method 1: pip Install (Official Python Package)
This is the canonical openai whisper windows install — straightforward if you’re comfortable with a terminal. It gives you the most flexibility: full Python API access, all output formats (txt, srt, vtt, json, tsv), and easy integration with other scripts.
Step 1 — Create a virtual environment (recommended)
py -m venv whisper-env
whisper-env\Scripts\activate
This keeps Whisper’s dependencies isolated from your system Python.
Step 2 — Install Whisper
pip install openai-whisper
This pulls the model library and its dependencies (PyTorch, tiktoken, tqdm, more-itertools). Expect 1–3 GB of downloads on first run including PyTorch.
Step 3 — Install PyTorch with CUDA (if you have an NVIDIA GPU)
The default PyTorch from the above command is CPU-only. For GPU acceleration:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
Match the cu121 suffix to your installed CUDA version (nvidia-smi shows it). See the PyTorch install matrix if you’re unsure.
Step 4 — Run your first transcription
whisper my_audio.mp3 --model small
First run downloads the model weights (~244 MB for small). Subsequent runs are instant. Output: a .txt, .srt, and .vtt file alongside your audio.
Step 5 — Useful flags
# Force English (skip language detection, slightly faster)
whisper audio.mp3 --model small --language en
# Output only plain text
whisper audio.mp3 --model small --output_format txt
# Transcribe a specific segment (seconds)
whisper audio.mp3 --model small --clip_timestamps "30,90"
# Use GPU device explicitly
whisper audio.mp3 --model medium --device cuda
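If you are scripting these calls from Python rather than typing them by hand, a small helper that assembles the argument list keeps the flags consistent across runs. build_whisper_cmd is a hypothetical helper written for this article, not part of the openai-whisper package; it only constructs the command, so actually running it still requires the pip install from Step 2:

```python
import subprocess

def build_whisper_cmd(audio_path, model="small", language=None,
                      output_format=None, device=None):
    """Assemble a whisper CLI invocation as an argument list for subprocess."""
    cmd = ["whisper", audio_path, "--model", model]
    if language:
        cmd += ["--language", language]
    if output_format:
        cmd += ["--output_format", output_format]
    if device:
        cmd += ["--device", device]
    return cmd

cmd = build_whisper_cmd("audio.mp3", model="medium", language="en", device="cuda")
print(cmd)
# To actually run it (requires the pip package installed and on PATH):
# subprocess.run(cmd, check=True)
```

Passing a list to subprocess.run, rather than a shell string, also avoids quoting problems with file names that contain spaces.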
Method 2: whisper.cpp (No Python Required)
whisper.cpp is a C/C++ reimplementation of the Whisper inference engine. It runs without Python, CUDA, or PyTorch. On Windows, it uses GGML quantized weights — the same format used by llama.cpp — and can accelerate via OpenBLAS on CPU or Vulkan on AMD/Intel/NVIDIA GPUs without CUDA.
Why use it instead of the Python package?
- Starts in under a second (no PyTorch initialization)
- Uses 30–50% less RAM on the same model
- Ships as a single .exe — easier to bundle into scripts or other apps
- Streaming mode available for near-real-time transcription
Windows install steps
Pre-built Windows binaries are available from the whisper.cpp releases page on GitHub. Download whisper-bin-x64.zip, extract it, then download a model:
# Using PowerShell — downloads the small GGML model
Invoke-WebRequest -Uri "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-small.bin" -OutFile "models\ggml-small.bin"
Run transcription:
.\main.exe -m models\ggml-small.bin -f audio.wav -otxt
Note: whisper.cpp requires WAV input (16 kHz, mono, 16-bit PCM). Convert with ffmpeg first:
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
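If you batch-convert files from a script, it helps to skip the ones already in the right format. The helper below uses Python's standard-library wave module to check the three constraints whisper.cpp imposes (16 kHz, mono, 16-bit PCM); is_whispercpp_ready is an illustrative name coined for this article:

```python
import wave

def is_whispercpp_ready(path: str) -> bool:
    """Check that a WAV file matches whisper.cpp's expected input format:
    16 kHz sample rate, mono, 16-bit PCM samples."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == 16000
                and w.getnchannels() == 1
                and w.getsampwidth() == 2)  # 2 bytes per sample = 16-bit
```

A file that fails this check goes through the ffmpeg command above first; one that passes can be fed to main.exe directly.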
Method 3: GUI Apps Built on Whisper
If you don’t want a terminal at all, several open-source GUI apps wrap Whisper for a click-to-transcribe experience on Windows:
Whisper Desktop — a .NET 6 Windows app that wraps whisper.cpp with a drag-and-drop interface. Supports model selection, language, and batch processing. Requires no Python; installer available on GitHub.
FasterWhisper-based UIs — FasterWhisper is a Python reimplementation using CTranslate2 that runs 4× faster than the original on CPU. Several community GUI wrappers exist; search for “faster-whisper GUI Windows” on GitHub. These work well for batch file transcription.
Subtitle Edit — a popular open-source subtitle editor that added Whisper integration. Good for video subtitling workflows where you want SRT output you can tweak manually.
These GUI apps cover file-based transcription well. The gap they don’t fill: real-time live transcription with a hotkey, which leads into the next section.
Method 4: VoxBooster (Bundled, No Python Setup)
If your goal is live transcription — subtitles while you talk, dictation into any app, captioning a call — the file-based methods above aren’t the right fit. They’re designed to process a completed audio file, not a continuous microphone stream.
VoxBooster bundles Whisper-grade local speech recognition directly into the app. No Python environment, no model download wizard, no ffmpeg dependency. You install VoxBooster once and the transcription engine is ready under Dictation in the sidebar.
Practical differences vs. the raw pip install:
- Global hotkey — hold Ctrl+Shift+D in any app and speak; text appears at your cursor
- Integrated noise suppression — cleans the mic input before it reaches the speech model, which meaningfully improves accuracy in noisy rooms
- No terminal — model selection and language settings are in a GUI
- Bundled with voice changer, soundboard, and voice clone — if you’re already using VoxBooster for Discord voice changing or OBS, the dictation feature is just another tab
For a deeper look at the dictation workflow, see the voice dictation on Windows guide.
Choosing Between Methods
| | pip Whisper | whisper.cpp | GUI apps | VoxBooster |
|---|---|---|---|---|
| Python required | Yes | No | Sometimes | No |
| GPU needed | No (optional) | No (optional) | No (optional) | No (optional) |
| Real-time live | No | Partial | No | Yes |
| Global hotkey | No | No | No | Yes |
| Batch file transcription | Yes | Yes | Yes | No |
| SRT/VTT output | Yes | Yes | Yes | No |
| Install complexity | Medium | Medium | Low | Low |
Pick pip whisper if you need SRT/VTT output for video subtitles, or you want to script batch transcription in Python. Pick whisper.cpp if you want a portable binary with lower memory overhead. Pick a GUI app for drag-and-drop file transcription. Pick VoxBooster if you want live dictation without a Python install.
Basic CLI Usage Patterns
Once you have the pip package working, these patterns cover 90% of real use cases.
Transcribe a meeting recording to SRT subtitles
whisper meeting.mp4 --model medium --language en --output_format srt
Whisper can read video files directly (it calls ffmpeg internally). Output: meeting.srt in the same folder.
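For context on what lands in that .srt file: each subtitle cue is a numbered block with start and end timestamps in HH:MM:SS,mmm form separated by an arrow. A sketch of the formatting (format_srt_time and srt_cue are illustrative helpers, not Whisper's internal code):

```python
def format_srt_time(seconds: float) -> str:
    """Format a time offset in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def srt_cue(index: int, start: float, end: float, text: str) -> str:
    """Render one SRT subtitle block."""
    return f"{index}\n{format_srt_time(start)} --> {format_srt_time(end)}\n{text}\n"

print(srt_cue(1, 2.5, 5.04, "Welcome to the meeting."))
```

Knowing the format makes it easy to post-process Whisper's output, for example merging short cues or shifting all timestamps by a fixed offset.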
Transcribe a folder of audio files
for %f in (*.mp3) do whisper "%f" --model small --output_format txt
Run in Command Prompt (not PowerShell — the for loop syntax differs). Each file gets its own .txt output.
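The same loop can be written in Python, which sidesteps the CMD-vs-PowerShell syntax difference entirely. transcribe_folder is a sketch written for this article; it assumes the whisper CLI from Method 1 is installed and on PATH, and the dry_run flag lets you preview the commands without invoking it:

```python
import subprocess
from pathlib import Path

def transcribe_folder(folder: str, model: str = "small", dry_run: bool = False):
    """Run whisper over every .mp3 in a folder; returns the commands issued."""
    commands = []
    for mp3 in sorted(Path(folder).glob("*.mp3")):
        cmd = ["whisper", str(mp3), "--model", model, "--output_format", "txt"]
        commands.append(cmd)
        if not dry_run:
            subprocess.run(cmd, check=True)
    return commands

# Preview what would run, without actually calling whisper:
for cmd in transcribe_folder(r"C:\Recordings", model="small", dry_run=True):
    print(" ".join(cmd))
```

Swap the glob pattern to *.mp4 or *.wav to cover other formats, since the CLI handles them all.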
Force translation to English
whisper french_audio.mp3 --model small --task translate
--task translate outputs English regardless of input language. Useful for multilingual interviews.
Specify output directory
whisper audio.mp3 --model small --output_dir C:\Transcripts
Common Errors and Fixes
No module named 'whisper'
You installed whisper in a different Python environment than the one currently active. Run py -0 to list all Python installs, activate the right virtualenv, then reinstall. Also possible: you installed with pip3 but are running with py.
FileNotFoundError: [WinError 2] ffmpeg
ffmpeg isn’t on your PATH. Install via winget install Gyan.FFmpeg, close and reopen your terminal, then confirm with ffmpeg -version.
CUDA out of memory
You’re running a model too large for your GPU’s VRAM. Try the next size down, or run on CPU with --device cpu. Note that --fp16 False forces FP32 inference, which roughly doubles memory use — it can fix FP16-related crashes on certain CUDA builds, but it will make a genuine out-of-memory error worse, not better.
RuntimeError: Expected all tensors to be on the same device
PyTorch CUDA version mismatch. Reinstall PyTorch with the correct CUDA suffix for your driver version. Check your driver with nvidia-smi and cross-reference at pytorch.org/get-started/locally.
Output is garbled or in the wrong language
Whisper auto-detects language from the first 30 seconds of audio. If your file has silence or noise at the start, detection fails. Fix: add --language en (or your target language) explicitly.
Transcription is slow even with a GPU
Confirm Whisper is actually using CUDA: add --device cuda to your command. If you see “FP16 is not supported on CPU; using FP32 instead” in the output, CUDA is not being used — recheck your PyTorch install.
Whisper vs. Other Windows Transcription Options
It’s worth knowing what you’re comparing against before committing to a setup:
Windows built-in speech recognition / dictation (Win+H) — fast and well-integrated, but accuracy lags on accents, technical vocabulary, and non-US English. Partial cloud dependency in default mode. No SRT output.
Dragon NaturallySpeaking / Dragon Professional — historically the accuracy benchmark, strong for dictation workflows, but expensive ($300–$500), Windows-only, and slow to add vocabulary for new domains. Local processing, which is a plus.
Otter.ai, Rev, Descript transcription — cloud-based, subscription-priced, genuinely good accuracy, but audio leaves your machine. Not viable for private meetings, legal recordings, or anything under NDA.
Azure Cognitive Services / Google Speech-to-Text — developer APIs, cloud-based, pay-per-minute. Accurate, but requires code and an internet connection. Not a local whisper install equivalent, and whisper transcription accuracy is competitive at zero ongoing cost.
Whisper’s strengths vs. all of the above: free, fully local, open-source weights you can verify, strong multilingual support, and accuracy that’s competitive with paid services on clean audio. Its weakness: no native real-time streaming mode in the Python package, and setup requires a bit of CLI comfort.
Privacy: Why Local Matters for Transcription
When you run Whisper locally on Windows, audio never touches an external server. This matters more than most people realize — and it’s one of the biggest practical arguments for Whisper transcription over paid cloud alternatives:
- Meeting recordings often contain confidential business information
- Medical and legal dictation is subject to privacy regulations (HIPAA, GDPR, etc.)
- Journalist interviews and source conversations should never go to cloud APIs
- Personal voice notes, diary entries, therapy session transcripts — things you’d rather not have on someone else’s server
Cloud transcription services have privacy policies, but “we don’t sell your data” and “we may use anonymized audio to improve models” are different statements. With a local whisper install on Windows, the answer to both is irrelevant — the audio stays on your disk.
FAQ
Does OpenAI Whisper run offline on Windows? Yes. Once you’ve downloaded the model weights, Whisper runs 100% locally — no internet connection required. The initial download ranges from 75 MB (tiny) to 3.09 GB (large-v3). After that, transcription happens entirely on your CPU or GPU with no data leaving your machine.
What GPU do I need for Whisper transcription on Windows? A GPU is optional but speeds things up a lot. For the small model, 2 GB VRAM is enough. Medium needs 5 GB, large-v3 needs 10 GB. On CPU only, the base model transcribes roughly 10–15× real-time on a modern i5/Ryzen 5, meaning one minute of audio takes about 4–6 seconds.
What is the difference between Whisper model sizes? Whisper ships in five sizes — tiny, base, small, medium, and large (with large-v2 and large-v3 variants). Larger models are more accurate but slower and heavier. For most Windows users, small gives the best accuracy-to-speed ratio: ~244 MB, good multilingual accuracy, runs on CPU in roughly real-time on modern hardware.
Can I use Whisper for real-time live transcription on Windows? The original Python Whisper package is file-based and not designed for real-time. whisper.cpp has a streaming mode, but setup is complex. For genuinely low-latency live transcription — subtitles while you talk, dictation, call captioning — a bundled app like VoxBooster is easier: Whisper-grade accuracy with no Python environment required.
How accurate is OpenAI Whisper compared to Dragon NaturallySpeaking or Windows Dictation? On clean audio, Whisper large-v3 posts word error rates under 5% across most languages, competitive with Dragon Professional and better than Windows built-in dictation on technical vocabulary, accents, and multilingual content. Accuracy drops in noisy conditions, but combining Whisper with noise suppression restores most of it.
What is whisper.cpp and why would I use it instead of the Python package? whisper.cpp is a C/C++ port of the Whisper model that runs without Python or CUDA. On Windows, it uses GGML quantized weights and can leverage Vulkan or OpenBLAS for acceleration. It starts faster, uses less RAM, and is easier to integrate into other apps than the Python package.
How do I fix the “No module named whisper” error on Windows?
This usually means the pip install went into a different Python environment than the one you’re running from. Check with py -0 to list installed Pythons, activate the right virtualenv, then reinstall: pip install openai-whisper. Also confirm you have ffmpeg on PATH — Whisper needs it to decode audio files.
Conclusion: Which Whisper Transcription Setup Is Right for You?
If you need batch file transcription with SRT/VTT output — for video subtitles, meeting recordings, podcast show notes — the pip-based openai whisper windows install is the most flexible path. Add CUDA support for your GPU and you get near-real-time throughput even on medium.
If you want a smaller footprint or are building a script that calls whisper as a subprocess, whisper.cpp with GGML weights is the cleaner option for a whisper local install on Windows — no Python, no CUDA, just a binary and a model file.
If you want local speech-to-text Windows integration without any terminal work — specifically live dictation into apps — VoxBooster bundles the same Whisper-grade accuracy with a global hotkey and integrated noise suppression. No Python, no virtual environments, no ffmpeg troubleshooting. It’s particularly useful if you’re already using the app for voice changing or soundboard work; the whisper desktop transcription feature is just another tab in the same interface.
Start with the small model regardless of which path you take. It gets you 80% of the way to large-v3 quality at a fraction of the compute cost. You can always upgrade later once you know what accuracy level your workflow actually requires.
For pricing and plan options, see voxbooster.com/#pricing.