VoxBooster’s pre-built voice library handles most use cases. But there’s one specific scenario where no pre-built voice comes close: when you want your own voice — your timbre, your accent, your identity — running in real-time or being used for narration, dubbing, and content.
That’s what custom model training exists for. And despite how it sounds, the process is simpler than configuring OBS for the first time.
When Training Your Own Voice Model Is Worth It
Before you start recording, it’s worth understanding the real use cases:
Content creator who records videos: you write the script and generate narration with your clone at any time of day, even when your voice isn’t in good shape, and without an elaborate mic setup for narration.
Dubber or voice actor: you keep your own timbre but can apply personality effects on top — deeper, more projected, more dramatic — without losing your identity.
Multi-language: you speak English. Your clone speaks French with your timbre. The intonation will be yours (the model carries your prosody), but the result is far more natural than generic TTS.
Selective anonymity: you want to appear on calls without revealing your real voice, but you also want consistency, with the same alternative voice every time. A custom clone handles this better than a random preset.
Step 1: Reference Recording
This is the step most people underestimate. The quality of the model depends directly on the quality of the reference audio.
Duration: 3 to 5 minutes of continuous speech. More than that doesn’t improve results much; less than 3 minutes degrades them.
What to say: speak naturally. Read a text aloud — a news article, a short story, a description of something. The model needs intonation variation, natural pauses, different sounds of the language. Don’t just repeat the same sentence.
Environment: as quiet as possible. AC off. Window closed. Microphone about 4–6 inches from your mouth. If you have a dynamic mic, use it. If you only have a condenser, record at night when the street is quieter.
Avoid: coughing, sudden laughter, constant background noise, speaking too quietly or shouting. The model is trained on normal conversational speech — extremes degrade quality.
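Before importing, it can help to check the recording against the guidelines above. The sketch below is not part of VoxBooster; it uses Python's standard `wave` module to measure the duration and peak level of a 16-bit PCM WAV file. The function names are illustrative assumptions, and only the 3 to 5 minute range comes from this guide.

```python
import array
import sys
import wave

MIN_SECONDS = 3 * 60  # lower bound of the recording guideline
MAX_SECONDS = 5 * 60  # upper bound of the recording guideline


def duration_verdict(seconds: float) -> str:
    """Classify a recording length against the 3-to-5-minute guideline."""
    if seconds < MIN_SECONDS:
        return "too short"
    if seconds > MAX_SECONDS:
        return "longer than needed"
    return "ok"


def inspect_wav(path: str) -> dict:
    """Return duration in seconds and peak level (0.0 to 1.0) of a 16-bit PCM WAV."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM WAV files")
        frames = wav.getnframes()
        rate = wav.getframerate()
        raw = wav.readframes(frames)
    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    peak = max((abs(s) for s in samples), default=0) / 32768
    seconds = frames / rate
    return {"seconds": seconds, "peak": peak, "verdict": duration_verdict(seconds)}
```

Calling `inspect_wav("reference.wav")` (a hypothetical file name) returns a small dictionary. A peak very close to 1.0 suggests clipping from speaking too loudly, and a very low peak suggests the recording is too quiet.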
Step 2: The Training Wizard
Inside VoxBooster, go to the Voice Clone → My Voice → Create new model tab, then work through the wizard:
- Import your recorded audio. The wizard accepts WAV and MP3. WAV 44.1kHz 16-bit is ideal; MP3 320kbps also works. Avoid heavy compression.
- Confirm the preview. VoxBooster does automatic noise cleanup before training — you listen to the processed audio and confirm it’s acceptable.
- Name the model. This name will appear in your voice list afterward.
- Click Train. The process starts locally on your machine.
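To confirm that a WAV file matches the ideal import format mentioned in the first step (44.1 kHz, 16-bit), a header check like the following can be run before opening the wizard. This is a minimal sketch using Python's standard `wave` module, not a VoxBooster feature; the function names are assumptions, and a file that fails the check may still be accepted by the wizard.

```python
import wave

IDEAL_RATE_HZ = 44_100   # "WAV 44.1kHz" from the import step
IDEAL_WIDTH_BYTES = 2    # 16-bit samples are 2 bytes wide


def format_issues(rate_hz: int, width_bytes: int) -> list[str]:
    """List the ways a format differs from the ideal; empty means it matches."""
    issues = []
    if rate_hz != IDEAL_RATE_HZ:
        issues.append(f"sample rate is {rate_hz} Hz, not {IDEAL_RATE_HZ} Hz")
    if width_bytes != IDEAL_WIDTH_BYTES:
        issues.append(f"sample width is {width_bytes * 8}-bit, not 16-bit")
    return issues


def wav_format_issues(path: str) -> list[str]:
    """Read only the WAV header and compare it with the ideal format."""
    with wave.open(path, "rb") as wav:
        return format_issues(wav.getframerate(), wav.getsampwidth())
```

An empty list from `wav_format_issues("reference.wav")` (hypothetical file name) means the file already matches the ideal format; otherwise each entry names one mismatch to fix when exporting from your recording software.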
Step 3: Local Training
Training runs on your GPU (NVIDIA with CUDA, AMD with ROCm) or on CPU if you don’t have a dedicated graphics card.
With an NVIDIA GPU (RTX 3060 or better): 10 to 15 minutes for 5 minutes of audio.
With an older GPU or on CPU: 20 to 40 minutes. You can leave it running in the background; VoxBooster doesn’t need to be in focus, it only needs to stay open.
During training, avoid rendering heavy video or running demanding games on the same PC. It won’t break anything — but it’ll extend the time and may produce artifacts in the model if the GPU runs low on memory.
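If you want to see how much GPU memory other applications are already using before you click Train, NVIDIA's `nvidia-smi` command-line tool can report it. The sketch below is independent of VoxBooster and covers NVIDIA cards only; the function names are assumptions, and it assumes `nvidia-smi` is on your PATH and reports plain integers in MiB.

```python
import subprocess


def parse_gpu_memory(csv_text: str) -> list[tuple[int, int]]:
    """Parse 'used, total' MiB pairs, one line per GPU."""
    pairs = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        pairs.append((used, total))
    return pairs


def query_gpu_memory() -> list[tuple[int, int]]:
    """Ask nvidia-smi for used and total memory of every NVIDIA GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_memory(result.stdout)


def free_fraction(used_mib: int, total_mib: int) -> float:
    """Share of GPU memory that is currently unused (0.0 to 1.0)."""
    return (total_mib - used_mib) / total_mib
```

Running `query_gpu_memory()` and passing each pair to `free_fraction` shows how much headroom is left. If a game or video render is already holding most of the memory, it is probably better to close it before training.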
When it finishes, VoxBooster sends a notification and the model appears automatically in your clone list.
Step 4: Using the Model
Select the custom model from the list, enable Real-time, and speak. It’s that simple.
The clone will carry your prosody: your pauses, your emphasis, your rhythm. If you speak with energy, the clone comes out with energy. If you speak slowly and seriously, it comes out slowly and seriously. The phonetic content is yours; the timbre comes from the model.
Tip: test the model on a short call before using it in a live stream. Hearing your own cloned voice for the first time is strange: it sounds almost right, but not quite. That’s normal. The person on the other end usually thinks it’s your regular voice.
Refining the Model
If the first training result didn’t satisfy you:
- Re-record with cleaner audio (more silence, better mic position)
- Increase to 5 minutes if you used 3
- Vary the type of speech in the recording more — include questions, exclamations, faster and slower speech
You can train multiple models and compare them. VoxBooster stores them all locally; they are never uploaded to any server. They’re model files on your drive, generally between 80 and 150 MB each.
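Since every extra model takes disk space, a quick estimate helps when you plan to train several variants. The arithmetic below simply multiplies the typical size range quoted above; the function name is an illustrative assumption, and actual file sizes may fall outside this range.

```python
MIN_MODEL_MB = 80    # lower end of the typical model size
MAX_MODEL_MB = 150   # upper end of the typical model size


def storage_range_mb(model_count: int) -> tuple[int, int]:
    """Return the (low, high) disk usage estimate in MB for a number of models."""
    if model_count < 0:
        raise ValueError("model_count must be non-negative")
    return model_count * MIN_MODEL_MB, model_count * MAX_MODEL_MB


low, high = storage_range_mb(6)
print(f"6 models: roughly {low} to {high} MB")
```

Six models would therefore occupy somewhere between about 480 and 900 MB.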
The Final Result
With a decent setup and a clean recording, the custom model is the most convincing option in real-time use. It’s your voice: the model truly knows your timbre rather than approximating it with a generic preset. For content creators and anyone who appears regularly in video or on stream, the 2 hours of initial effort to get this working are worth it.