If every meeting ends with an email chain asking “what did we actually decide?”, the problem is not the meeting — it is the lack of a reliable transcript. Cloud transcription services solve this partially, but they require uploading your call audio to a third-party server. For legal, compliance, or plain privacy reasons that is not always acceptable.
This guide shows you how to build a voice meeting notes workflow entirely on your Windows PC: capture the meeting audio using low-latency audio capture loopback, run it through OpenAI’s Whisper model locally, and automatically extract a Markdown summary with decisions and action items. No cloud upload. No subscription. Processing happens on your machine.
TL;DR
| Step | Tool | Time |
|---|---|---|
| Capture audio | FFmpeg + low-latency audio capture loopback | Live |
| Transcribe | Whisper (medium.en) | ~4 min / 1 hr meeting |
| Extract actions | Python + local LLM or paste to AI | ~2 min |
| Output | Markdown .md file | Immediate |
Why Local Transcription Beats Cloud for Meetings
Most cloud transcription services (Otter.ai, Fireflies, Zoom’s built-in AI Notes) work by sending your audio to remote servers, where it is processed and, depending on the provider’s terms, may be retained or used to improve their models. For personal catch-up calls that is fine. For calls containing client names, financial projections, medical information, or legal discussion, it is not.
Running Whisper locally means the audio file never leaves the machine. There is no API key tied to your company account, no retention policy to read, and no possibility of a third-party breach exposing your call content. The transcript and summary live wherever you save them.
There is also a cost argument. Cloud transcription is billed per seat, typically around $10–$20 per user per month on the plans compared later in this guide, so a team logging 100 hours of meetings a month pays that for every member. Local inference on a GPU you already own costs nothing per transcript after setup.
Legal and Consent — Read This First
Recording or transcribing a meeting without participant consent is illegal in many jurisdictions, including many US states (two-party consent laws), the EU (GDPR Article 6), and others worldwide.
Before you transcribe any meeting:
- Announce clearly at the start: “I’m capturing audio for local transcription to produce meeting notes.”
- Give participants the option to opt out or speak off the record.
- Check your company’s call-recording policy — many require IT or legal approval.
- Store transcripts securely and apply the same data handling rules as other confidential documents.
This article is a technical guide. It is not legal advice.
What You Need
- Windows 10 or 11 — low-latency audio capture loopback is available on both
- Python 3.10+ — from python.org or winget
- FFmpeg — for audio capture from the loopback device
- openai-whisper or faster-whisper — the transcription engine
- NVIDIA GPU (optional but recommended) — RTX 2060 or better for fast inference; CPU works too
- A meeting app: Zoom, Microsoft Teams, Google Meet, or any audio-producing application
Step 1 — Identify Your low-latency audio capture Loopback Device
Low-latency audio capture loopback (WASAPI loopback in Microsoft’s documentation) captures whatever Windows plays through your output device: the same audio you hear in your headphones. No driver installation is required; it has been part of the Windows audio stack since Vista.
Open a terminal and run:
ffmpeg -list_devices true -f dshow -i dummy 2>&1 | findstr /i "audio"
You will see output like:
"Speakers (Realtek High Definition Audio)" (audio)
"Headphones (USB Audio Device)" (audio)
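Note the exact name of the device that carries your call audio. The FFmpeg commands in this guide append (loopback) to that name. Keep in mind that dshow lists capture devices, so whether an output device such as your speakers appears, and under what name, depends on your drivers and FFmpeg build; always copy the name exactly as FFmpeg prints it on your machine.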
Alternatively, use Python to list devices:
import sounddevice as sd
print(sd.query_devices())
Look for devices with (loopback) in the name, or devices whose host API is the low-latency one, which sounddevice lists as Windows WASAPI.
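If you prefer to pick the device in a script, the FFmpeg listing can be parsed with a few lines of Python. This is a minimal sketch that assumes each device line looks like the sample output in this step (a quoted name followed by (audio)); the `[dshow @ 0000]` prefix and the helper names `parse_audio_devices` and `pick_device` are illustrative, not part of FFmpeg.

```python
import re
from typing import Optional

# Assumed line shape, based on the sample output in this step:
#   "Device name" (audio)
_AUDIO_LINE = re.compile(r'"(?P<name>[^"]+)"\s+\(audio\)')

def parse_audio_devices(ffmpeg_output: str) -> list[str]:
    """Return the quoted device names found on lines tagged (audio)."""
    return [m.group("name") for m in _AUDIO_LINE.finditer(ffmpeg_output)]

def pick_device(names: list[str], keyword: str) -> Optional[str]:
    """Return the first device whose name contains keyword (case-insensitive)."""
    for name in names:
        if keyword.lower() in name.lower():
            return name
    return None

# Hypothetical listing used only to exercise the parser
sample = '''
[dshow @ 0000] "Speakers (Realtek High Definition Audio)" (audio)
[dshow @ 0000] "Headphones (USB Audio Device)" (audio)
[dshow @ 0000] "Integrated Camera" (video)
'''
devices = parse_audio_devices(sample)
print(devices)
print(pick_device(devices, "headphones"))
```

In a real script you would feed it the text FFmpeg writes to stderr instead of the hard-coded sample.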
Step 2 — Record the Meeting Audio
Start your Zoom, Teams, or Meet call. Before the main content begins, start FFmpeg in a separate terminal:
ffmpeg -f dshow -i audio="Speakers (Realtek High Definition Audio) (loopback)" ^
-ar 16000 -ac 1 -c:a pcm_s16le ^
meeting_2026-06-12.wav
Key flags:
- -ar 16000: Whisper’s native sample rate; no resampling needed
- -ac 1: mono; reduces file size and matches Whisper’s expected input
- -c:a pcm_s16le: uncompressed 16-bit WAV for best accuracy
Stop recording when the meeting ends with Ctrl+C. A 1-hour meeting at these settings produces roughly 115 MB.
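The file-size figure follows directly from the capture settings: 16,000 samples per second, one channel, two bytes per sample. The sketch below reproduces that arithmetic and uses Python’s standard `wave` module to confirm that a finished recording really has those settings. The function names are illustrative.

```python
import wave

SAMPLE_RATE = 16_000   # -ar 16000
CHANNELS = 1           # -ac 1
BYTES_PER_SAMPLE = 2   # pcm_s16le is 16-bit audio

def expected_pcm_bytes(seconds: float) -> int:
    """Audio payload size in bytes for the settings above (WAV header excluded)."""
    return int(seconds * SAMPLE_RATE * CHANNELS * BYTES_PER_SAMPLE)

def check_wav(path: str) -> dict:
    """Read the WAV header and report duration and whether the format matches."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        width = wav.getsampwidth()
        duration = wav.getnframes() / rate
    return {
        "duration_s": duration,
        "matches_settings": (rate, channels, width)
        == (SAMPLE_RATE, CHANNELS, BYTES_PER_SAMPLE),
    }

print(expected_pcm_bytes(3600) / 1_000_000)  # 115.2 (MB for one hour)
```

Calling `check_wav("meeting_2026-06-12.wav")` before transcription is a cheap way to catch a recording that was captured at the wrong sample rate or in stereo.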
Tip: If your audio quality is poor due to background noise, running VoxBooster’s noise suppression on your microphone channel before the call keeps your own voice clean in the capture. The low-latency audio capture loopback captures the mixed output, so other participants’ audio benefits from their own platforms’ noise processing.
Step 3 — Install Whisper
If you have not installed Whisper yet:
pip install openai-whisper
# For faster CPU/GPU inference:
pip install faster-whisper
For GPU acceleration (NVIDIA), also install:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
Check your CUDA version first with nvidia-smi and match the cu version accordingly.
Step 4 — Transcribe the Recording
Using openai-whisper (CLI)
whisper meeting_2026-06-12.wav --model medium.en --output_format txt --output_dir ./transcripts
This saves a .txt file to ./transcripts. To also get a .srt subtitle file, pass --output_format all instead. The medium.en model is English-only, which tends to perform slightly better on English meetings than the multilingual medium.
Using faster-whisper (Python script)
from faster_whisper import WhisperModel

# Load the English-only medium model on the GPU in half precision
model = WhisperModel("medium.en", device="cuda", compute_type="float16")
# segments is a generator: transcription runs as the loop below consumes it
segments, info = model.transcribe("meeting_2026-06-12.wav", beam_size=5)
with open("transcript.txt", "w", encoding="utf-8") as f:
    for segment in segments:
        timestamp = f"[{segment.start:.1f}s]"
        f.write(f"{timestamp} {segment.text.strip()}\n")
print("Transcription complete.")
faster-whisper uses CTranslate2 under the hood and is up to 4× faster than the original on the same hardware, according to the project’s own benchmarks.
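Timestamps in raw seconds are hard to scan in an hour-long transcript. As a small optional variation, the sketch below formats a segment start time as HH:MM:SS; `format_timestamp` and `format_line` are hypothetical helpers, not part of either Whisper package.

```python
def format_timestamp(seconds: float) -> str:
    """Convert seconds from the start of the recording to HH:MM:SS."""
    total = int(seconds)  # drop the fractional part
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

def format_line(start: float, text: str) -> str:
    """Build one transcript line from a segment start time and its text."""
    return f"[{format_timestamp(start)}] {text.strip()}"

print(format_line(3725.4, " Let's ship on Friday. "))
# [01:02:05] Let's ship on Friday.
```

To use it in the script, replace the two lines inside the loop with `f.write(format_line(segment.start, segment.text) + "\n")`.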
Step 5 — Extract Action Items into Markdown
Raw transcripts are walls of text. The useful artifact is a structured summary: decisions made, tasks assigned, and open questions. Here is a simple Python script that uses Ollama (local LLM) to produce one:
import subprocess
import sys

# Path to the Whisper transcript, passed on the command line
transcript_path = sys.argv[1]
with open(transcript_path, "r", encoding="utf-8") as f:
    transcript = f.read()

prompt = f"""You are a meeting notes assistant. Given the transcript below, produce a Markdown document with:
1. **Meeting Summary** (3-5 sentences)
2. **Decisions Made** (bulleted list)
3. **Action Items** (bulleted list with owner and deadline if mentioned)
4. **Open Questions** (bulleted list)
Transcript:
{transcript}
"""

# Pipe the prompt to a local model through the Ollama CLI
result = subprocess.run(
    ["ollama", "run", "llama3"],
    input=prompt,
    capture_output=True,
    text=True,
    encoding="utf-8",
)
if result.returncode != 0:
    sys.exit(f"ollama failed: {result.stderr}")

# Write the summary next to the transcript: meeting.txt -> meeting_summary.md
# (removesuffix avoids overwriting the input if it does not end in .txt)
output_path = transcript_path.removesuffix(".txt") + "_summary.md"
with open(output_path, "w", encoding="utf-8") as f:
    f.write(result.stdout)
print(f"Summary saved to {output_path}")
Run it as:
python extract_actions.py transcripts/meeting_2026-06-12.txt
No Ollama? Paste the transcript directly into any chat AI with the same prompt. The summary follows the same structure, although the wording will vary from model to model; only the automation step differs.
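One practical caveat: an hour-long transcript may be longer than a local model can read in one pass. How much text fits depends on the model and its configuration, so treat the limit below as an assumption to tune. A minimal sketch of splitting the transcript on line boundaries, so that each chunk can be summarized separately (`chunk_lines` is a hypothetical helper):

```python
def chunk_lines(lines: list[str], max_chars: int) -> list[str]:
    """Group transcript lines into chunks of at most max_chars characters.

    Lines are never split; a single line longer than max_chars
    becomes its own oversized chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for line in lines:
        added = len(line) + 1  # +1 for the newline that joins lines
        if current and size + added > max_chars:
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(line)
        size += added
    if current:
        chunks.append("\n".join(current))
    return chunks

demo = [f"[{i * 10}.0s] line number {i}" for i in range(6)]
for chunk in chunk_lines(demo, max_chars=60):
    print(repr(chunk))
```

Each chunk can be sent through the same prompt, and the partial summaries merged in a final pass.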
Model Selection Guide
| Model | VRAM | Speed (GPU) | Speed (CPU) | Best For |
|---|---|---|---|---|
| tiny.en | 1 GB | Very fast | 5 min/hr | Quick drafts, testing |
| small.en | 2 GB | Fast | 20 min/hr | CPU-only machines |
| medium.en | 5 GB | Balanced | 60 min/hr | Default recommendation |
| large-v3 | 10 GB | Slow | Not practical | Max accuracy, RTX 4070+ |
All models run entirely offline after the initial download.
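The table can be turned into a simple lookup if you want a script to choose a model for the machine it runs on. This sketch copies the VRAM figures from the table and, as an assumption, picks the largest model that fits without leaving headroom for other GPU workloads; `pick_model` is a hypothetical helper.

```python
from typing import Optional

# (model name, approximate VRAM in GB) copied from the table, smallest first
MODELS = [("tiny.en", 1), ("small.en", 2), ("medium.en", 5), ("large-v3", 10)]

def pick_model(vram_gb: Optional[float]) -> str:
    """Pick a Whisper model name for the available GPU memory.

    vram_gb=None means no usable GPU, where the table suggests small.en.
    """
    if vram_gb is None:
        return "small.en"
    choice = MODELS[0][0]  # fall back to the smallest model
    for name, needed in MODELS:
        if needed <= vram_gb:
            choice = name
    return choice

print(pick_model(6))     # medium.en
print(pick_model(None))  # small.en
```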
Comparison: Local Whisper vs. Cloud Transcription Services
| Feature | Whisper (local) | Otter.ai | Fireflies | Zoom AI Notes |
|---|---|---|---|---|
| Data leaves device | No | Yes | Yes | Yes |
| Cost per month | $0 | $10–$20/user | $10–$19/user | Included with Zoom |
| Accuracy (English) | 88–94% (6–12% WER) | ~88% | ~87% | ~85% |
| Speaker diarization | With pyannote | Yes | Yes | Yes |
| Custom vocabulary | Via prompt | Paid | Paid | No |
| Offline capable | Yes | No | No | No |
| Setup time | 30 min | 5 min | 5 min | 0 min |
Cloud services win on convenience and diarization out of the box. Local Whisper wins on privacy, cost at scale, and the ability to work without internet.
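Accuracy figures like those in the table are usually derived from word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the transcript into a reference text, divided by the number of reference words. Lower is better. If you want to measure it on your own meetings, this is a minimal sketch; it assumes simple whitespace tokenization and lowercasing, with no punctuation handling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)

print(word_error_rate("we ship on friday", "we ship friday"))  # 0.25
```

Correct a few minutes of transcript by hand, use that as the reference, and compare models on the same clip.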
Adding Speaker Diarization
Whisper alone does not identify who said what. For meetings where attribution matters, combine it with pyannote.audio:
pip install pyannote.audio
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",
)
diarization = pipeline("meeting_2026-06-12.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s – {turn.end:.1f}s")
You can then align the diarization timestamps with the Whisper segment timestamps to produce speaker-labeled transcripts. The pyannote models run locally after download — a Hugging Face account is needed to accept the model license, but inference is fully offline.
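One simple way to do that alignment is to give each Whisper segment the speaker whose turn overlaps it the most. The sketch below works on plain tuples so it stays independent of either library; the tuple layouts and the `label_segments` helper are assumptions for illustration, and overlapping speech or very short turns will need more care than this.

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time spans (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def label_segments(segments, turns):
    """Attach a speaker to each transcript segment.

    segments: iterable of (start, end, text) from the transcription
    turns:    iterable of (start, end, speaker) from the diarization
    """
    labeled = []
    for s_start, s_end, text in segments:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for t_start, t_end, speaker in turns:
            shared = overlap(s_start, s_end, t_start, t_end)
            if shared > best_overlap:
                best_speaker, best_overlap = speaker, shared
        labeled.append((best_speaker, text))
    return labeled

# Hypothetical data in the shape described above
segments = [(0.0, 4.0, "Let's start."), (4.0, 9.0, "I'll send the report."), (20.0, 22.0, "Thanks.")]
turns = [(0.0, 4.5, "SPEAKER_00"), (4.5, 10.0, "SPEAKER_01")]
for speaker, text in label_segments(segments, turns):
    print(f"{speaker}: {text}")
```

Segments that overlap no turn at all are labeled UNKNOWN rather than guessed.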
Automating the Full Pipeline
Once the capture, transcription, and extraction steps work individually, chain them into two scripts: one to run during the meeting and one to run after it ends:
REM record.bat: run during the meeting
REM The %DATE% substrings assume a US-style date such as "Fri 06/12/2026"; adjust the offsets for your locale
ffmpeg -f dshow -i audio="Speakers (Realtek High Definition Audio) (loopback)" ^
-ar 16000 -ac 1 -c:a pcm_s16le ^
"meetings\%DATE:~10,4%-%DATE:~4,2%-%DATE:~7,2%.wav"
REM process.bat: run after the meeting
set FILE=%1
python transcribe.py %FILE%
python extract_actions.py %FILE:.wav=.txt%
start "" "%FILE:.wav=_summary.md%"
Run process.bat meetings\2026-06-12.wav and the summary opens in your default Markdown editor automatically.
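The same orchestration can be written in Python, which avoids depending on how Windows formats the date in your locale. This is a sketch under assumptions: it expects a `transcribe.py` that takes the WAV path and writes a .txt file next to it, plus the `extract_actions.py` script from Step 5, both in the current directory. The function names are illustrative.

```python
import subprocess
import sys
from datetime import date
from pathlib import Path

def recording_path(folder: str, day: date) -> Path:
    """Date-stamped WAV path, for example meetings/2026-06-12.wav."""
    return Path(folder) / f"{day.isoformat()}.wav"

def summary_path(wav_path: Path) -> Path:
    """Where the Markdown summary for a recording is expected to land."""
    return wav_path.with_name(wav_path.stem + "_summary.md")

def plan_pipeline(wav_path: Path) -> list[list[str]]:
    """Commands to run, in order, for one recording."""
    transcript = wav_path.with_suffix(".txt")
    return [
        [sys.executable, "transcribe.py", str(wav_path)],
        [sys.executable, "extract_actions.py", str(transcript)],
    ]

def run_pipeline(wav_path: Path) -> Path:
    """Run each step and stop at the first failure."""
    for command in plan_pipeline(wav_path):
        subprocess.run(command, check=True)
    return summary_path(wav_path)

wav = recording_path("meetings", date(2026, 6, 12))
print(wav.name)                # 2026-06-12.wav
print(summary_path(wav).name)  # 2026-06-12_summary.md
```

Using `check=True` means a failed transcription stops the run instead of passing a missing file to the next step.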
Privacy and Storage Considerations
Keep the following in mind when storing meeting transcripts:
- Encrypt the WAV and transcript files if they contain sensitive business information. Windows BitLocker or VeraCrypt handle this at the folder level.
- Set a retention policy — delete raw WAV files after transcription; keep only the summary unless you need verbatim quotes.
- Shared drives: If you sync transcripts to OneDrive or SharePoint, check whether those systems apply OCR or AI indexing to uploaded documents.
- Access control: Restrict transcript files to participants only. A shared \meetings\ folder on a network drive should not be open to the entire company.
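The retention point in the list above is easy to automate. The sketch below finds WAV files that are older than a chosen number of days and already have a transcript beside them, so nothing is deleted before it has been transcribed. The folder layout (WAV and .txt side by side with the same base name) and the function names are assumptions.

```python
import time
from pathlib import Path
from typing import Optional

def expired_recordings(folder: str, max_age_days: float,
                       now: Optional[float] = None) -> list[Path]:
    """WAV files older than max_age_days that already have a .txt transcript."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86_400  # seconds per day
    expired = []
    for wav in sorted(Path(folder).glob("*.wav")):
        has_transcript = wav.with_suffix(".txt").exists()
        if has_transcript and wav.stat().st_mtime < cutoff:
            expired.append(wav)
    return expired

def purge(folder: str, max_age_days: float) -> int:
    """Delete expired recordings and return how many were removed."""
    removed = 0
    for wav in expired_recordings(folder, max_age_days):
        wav.unlink()
        removed += 1
    return removed
```

Print the result of `expired_recordings("meetings", 7)` first to review what `purge` would delete.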
Improving Input Audio Quality
VoxBooster’s noise suppression ensures your microphone channel is clean before audio hits the low-latency audio capture loopback, which directly improves Whisper’s word-error rate on your voice. It runs locally on Windows 10/11, requires no kernel drivers, and integrates with any meeting app. A 3-day free trial is available — no credit card required.
After the trial: plans start at $6.99/month.
FAQ
Does Whisper transcribe in real time on a normal Windows PC? Not truly real time at full accuracy — Whisper is a batch model. On a mid-range GPU (RTX 3060) the small or medium model transcribes a 1-hour meeting in about 3-5 minutes after the call ends. For live captions consider Whisper Live or whisper-streaming forks, though they trade some accuracy for latency.
Is it legal to transcribe a Zoom or Teams meeting? Legality depends on jurisdiction and company policy. In most places you must inform all participants before recording or transcribing. Always announce at the meeting start that you are capturing audio for notes, and get explicit consent. This article is a technical guide, not legal advice.
What low-latency audio capture loopback device do I need to install? No driver installation is needed. low-latency audio capture loopback is a native Windows 10/11 API that mirrors any active output device — speakers or headphones — as a capture source. FFmpeg, Python sounddevice, and most audio libraries expose it directly. No virtual cable or third-party driver required.
Which Whisper model should I use for meeting transcription? The medium.en model is the best practical balance: about 5 GB of VRAM, a clearly lower word-error rate than tiny, and roughly twice the speed of large on GPU according to the relative speeds listed in the Whisper README. For CPU-only machines use small.en, which transcribes a 1-hour meeting in roughly 20 minutes on a modern CPU. Large-v3 only makes sense if you have an RTX 4070 or better.
Can I transcribe meetings without a GPU? Yes. Whisper runs on CPU via the openai-whisper package or the faster-whisper CTranslate2 backend, which cuts CPU inference time roughly in half. A meeting that would take 8 minutes on GPU takes about 20-25 minutes on a modern Intel or AMD CPU with small.en — acceptable for after-meeting batch processing.
How do I extract action items automatically from the transcript? The simplest method is a Python script that pipes the Whisper transcript into a local LLM prompt (Ollama + llama3 or Mistral) asking for a bulleted list of decisions and tasks. Alternatively, paste the raw transcript into any chat AI. VoxBooster’s noise suppression keeps the captured audio clean, which directly improves transcript accuracy.
Does this workflow work with Microsoft Teams recorded meetings? Yes, two ways: capture the live audio via low-latency audio capture loopback during the call, or download the Teams meeting recording from OneDrive and run Whisper on the MP4 file. The second path is simpler and lets you re-transcribe at any time without staying in the meeting.