Voice Cloning + Speaker ID — Landscape & Plan
Date: 2026-04-13
For: Mark Barnett (TP3 / OMI corpus / Legacy Soil narration)
1. TTS / Voice-Cloning Landscape (April 2026)
Open source — consolidated
Coqui AI shut down in December 2025. The codebase lives on as Idiap's coqui-tts fork (last release January 2026), still shipping XTTS v2: 6-second zero-shot cloning, 17 languages. The CPML license forbids commercial use without a separate agreement, so it's fine for personal narration but not for Legacy Soil customer videos.
The new center of gravity is Resemble AI's Chatterbox (2025, MIT, 500M params, trained on 500K hours). Zero-shot cloning from seconds of reference, emotion-exaggeration controls (a first for open source), and a Turbo variant running sub-200ms latency. Every output is stamped with Resemble's "Perth" perceptual watermark, and blind evaluations have it beating ElevenLabs on quality. It's the first thing to reach for on Apex.
Other serious open options: F5-TTS (MIT, 7x realtime, commercial-friendly), OpenVoice v2 (MIT, cross-lingual, granular emotion/accent control), Fish Speech V1.5, CosyVoice2, IndexTTS-2 (strong, with license caveats). Tortoise is legacy. No Anthropic / Google / Meta open-weight TTS worth using — Google is API-only (Chirp 3 HD), Meta has Voicebox papers but no public weights, Anthropic has nothing here.
Commercial — ElevenLabs still rules
ElevenLabs is still the quality leader and the easiest path. Pricing April 2026:
- Starter $5/mo — Instant Voice Cloning (~1 min reference), 30K credits, commercial rights.
- Creator $22/mo — Professional Voice Cloning (PVC, 30+ min reference, much better long-form stability), 100K characters, 30 voices.
- Pro/Scale/Business tiers irrelevant for Mark.
PVC is what matters for Legacy Soil narration — IVC drifts over multi-minute reads. ElevenLabs watermarks all output and requires a consent attestation.
Play.ht, Resemble.ai, and Microsoft Custom Neural Voice are competent alternatives. Microsoft CNV is enterprise-gated (weeks of approval). Resemble.ai's hosted product is pricier than ElevenLabs. Play.ht is a distant third.
Sample length needed: ElevenLabs IVC wants ~1 minute clean; PVC ~30 minutes scripted; XTTS / Chatterbox / F5-TTS work from 6-30 seconds zero-shot. Mark's hundreds of hours of OMI audio are overkill. The bottleneck is reference quality (no music, overlap, or noise), not quantity.
2. Speaker Identification / Diarization
Two related tasks: diarization (who-talks-when segments) and identification (match a segment against Mark's known voiceprint).
pyannote.audio 3.1 is the open-source standard: DER around 10% on clean benchmarks, 15-25% on messy real-world audio. MIT-licensed, and it runs fine on Apex's RTX 3090. Pyannote also ships Precision-2, a paid premium model roughly 28% more accurate than OSS 3.1; only worth it for forensic-grade work, which Mark doesn't need.
WhisperX wraps Whisper transcription, pyannote diarization, and word-level alignment in one tool. On a 3090 it's roughly 7x faster than pyannote alone (75s vs 520s on the same clip in published benchmarks) because it batches inference and keeps the GPU saturated. For OMI processing, WhisperX is the right tool: transcript plus diarization in one pass.
For identification (not just "speaker A vs B" but "is speaker A actually Mark?"), the standard is an ECAPA-TDNN speaker embedding from SpeechBrain. Compute a 192-dim embedding of Mark's voice from a clean 30-second sample, store it as a vector, then at inference cosine-compare it against every diarized segment; similarity above ~0.7 means Mark with high confidence. Forensic-grade, Apache-2.0-licensed, and milliseconds per segment on the 3090.
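The comparison step is simple enough to sketch. Assuming the 192-dim embeddings have already been extracted (in practice via SpeechBrain's pretrained ECAPA-TDNN model; the function names below are illustrative, not a library API), identification is one cosine similarity and a threshold:

```python
import numpy as np

MARK_THRESHOLD = 0.7  # similarity above this = "probably Mark"

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_mark(segment_emb: np.ndarray, mark_emb: np.ndarray,
            threshold: float = MARK_THRESHOLD) -> bool:
    """True if a diarized segment's embedding matches Mark's stored voiceprint."""
    return cosine_similarity(segment_emb, mark_emb) >= threshold
```

The real pipeline would get `segment_emb` from SpeechBrain and `mark_emb` from TP3's pgvector store; this only shows the scoring logic and the 0.7 cutoff.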
3. Recommended Path for Mark
Use case 1 — auto-tag OMI audio "is this Mark?"
WhisperX (uses pyannote underneath) + a stored ECAPA-TDNN embedding in TP3's pgvector. Pipeline: clip arrives → WhisperX produces transcript with speaker labels → ECAPA embedding per label → cosine-compare against Mark's stored embedding → tag segments mark / other_1 / other_2. All open source, all on Apex, no per-call cost. Embedding is ~1KB per speaker.
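A minimal sketch of the tagging step, assuming WhisperX has already produced segments with anonymous labels (SPEAKER_00, ...) and a mean embedding has been computed per label; the data shapes and 0.7 threshold are assumptions, not WhisperX output guarantees:

```python
import numpy as np

def tag_speakers(segments, label_embs, mark_emb, threshold=0.7):
    """Rename diarized speaker labels to mark / other_1 / other_2 ...
    by comparing each label's mean embedding against Mark's voiceprint.

    segments:   list of dicts with a "speaker" key
    label_embs: dict mapping label -> 192-dim embedding for that label
    mark_emb:   Mark's stored reference embedding
    """
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    mapping, other_n = {}, 0
    for label, emb in label_embs.items():
        if cos(emb, mark_emb) >= threshold:
            mapping[label] = "mark"
        else:
            other_n += 1
            mapping[label] = f"other_{other_n}"

    # Return segments re-labeled with the human-readable tags.
    return [{**seg, "speaker": mapping[seg["speaker"]]} for seg in segments]
```

Each tagged segment then goes into TP3 alongside its transcript; the per-speaker embedding itself is the ~1KB artifact stored in pgvector.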
Use case 2 — narration cloning for Legacy Soil and document read-alouds
Two-track:
- Today, fastest: ElevenLabs Creator at $22/mo. Record 30 min scripted narration (NOT brain-dump audio — clean studio reads, consistent mic). Train PVC. Use for Legacy Soil walkthrough videos and doc read-alouds. Watermarked, ToS-compliant, sounds great long-form.
- Private/local: Chatterbox on Apex (3090 handles it easily). MIT, watermarked, no recurring cost, no data leaves the house. Slightly less polish than ElevenLabs PVC but in the same league. Do both. ElevenLabs for customer-facing Legacy Soil deliverables; Chatterbox for personal / experimental / large-volume reads.
On the OMI corpus as training data: not actually useful for cloning. Modern models need 30 sec to 30 min of clean audio, not 300 hours of bone-conduction wearable audio with HVAC hum and kitchen ambience. The corpus IS gold for (a) building a speaker-ID dataset by mining the cleanest Mark-only chunks, and (b) eventually fine-tuning a personal LLM on Mark's transcribed speech patterns — separate project.
4. Ethics & Safety
- Consent — Mark cloning Mark is the cleanest case. Document anyway: a one-line dated note in AI_Library ("I consent to clone my own voice for personal and Legacy Soil use"). The 2026 Federal AI Voice Act and Tennessee ELVIS Act both want a paper trail even for self-cloning when used commercially.
- Watermarking — ElevenLabs and Chatterbox both watermark by default (good — proves provenance if anyone challenges a Legacy Soil video). XTTS v2, F5-TTS, OpenVoice do not watermark. For unwatermarked public-facing output, add manual disclosure ("AI-generated narration") in video descriptions to satisfy EU AI Regulation and US disclosure norms.
- Leak risk — A cloned-Mark voice file leaking is low-impact (no celebrity / fraud target value), but the embedding and PVC training set should live on Apex behind Tailscale, not in a public bucket. Standard hygiene.
- Hard line — Cloning anyone other than Mark (student, family, colleague) requires their explicit written consent. Non-negotiable.
5. Concrete Next Steps
1. Record a clean 30-min reference take. Quiet room, USB condenser mic (not OMI, not laptop), scripted text in the Legacy Soil tone. This single file unlocks every option below. (~1 hr)
2. Sign up for ElevenLabs Creator ($22). Train a PVC from the 30-min file. Generate one test: first 500 words of the Legacy Soil proposal as audio. Decide if quality clears the customer-facing bar. (~1 hr)
3. Install WhisperX + pyannote + SpeechBrain on Apex. Process one week of OMI audio. Compute Mark's ECAPA embedding from the cleanest 60-sec chunk. Store in TP3 pgvector with a voice_embeddings table. (Cursor/Jules mission brief.)
4. Install Chatterbox on Apex as the local fallback. A/B against ElevenLabs on the same Legacy Soil paragraph. Document which Mark prefers.
5. Write the consent memo. One paragraph in AI_Library, dated, naming the authorized models. Future-proofs against any 2027 regulation that retroactively asks "when did you consent."
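For step 3, the voice_embeddings table could look like the sketch below. The table name comes from the plan; the column names and the choice of pgvector's cosine-distance operator are assumptions:

```python
# DDL sketch for the TP3 pgvector store. vector(192) matches the
# ECAPA-TDNN embedding dimension; columns beyond that are illustrative.
VOICE_EMBEDDINGS_DDL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS voice_embeddings (
    id          serial PRIMARY KEY,
    speaker     text NOT NULL,            -- 'mark', 'other_1', ...
    embedding   vector(192) NOT NULL,     -- ECAPA-TDNN output dimension
    source_clip text,                     -- which OMI chunk it came from
    created_at  timestamptz DEFAULT now()
);
"""

# pgvector's <=> operator is cosine *distance*, so a 0.7 similarity
# cutoff corresponds to a 0.3 distance cutoff.
MATCH_QUERY = """
SELECT speaker, 1 - (embedding <=> %(seg)s::vector) AS similarity
FROM voice_embeddings
ORDER BY embedding <=> %(seg)s::vector
LIMIT 1;
"""
```

Run the DDL once during the Apex install; the match query is what the OMI tagging pass executes per diarized segment.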
Sources
- BentoML, Open-Source TTS 2026: https://www.bentoml.com/blog/exploring-the-world-of-open-source-text-to-speech-models
- Resemble AI, OSS Voice Cloning 2026: https://www.resemble.ai/best-open-source-ai-voice-cloning-tools/
- Chatterbox: https://www.resemble.ai/chatterbox/ | https://github.com/resemble-ai/chatterbox
- Coqui fork (idiap): https://github.com/idiap/coqui-ai-TTS
- ElevenLabs pricing: https://elevenlabs.io/pricing | https://flexprice.io/blog/elevenlabs-pricing-breakdown
- pyannote.audio: https://github.com/pyannote/pyannote-audio
- Precision-2: https://www.pyannote.ai/blog/precision-2
- Picovoice diarization 2026: https://picovoice.ai/blog/state-of-speaker-diarization/
- WhisperX: https://pypi.org/project/whisperx/ | benchmark https://github.com/pyannote/pyannote-audio/issues/1652
- SpeechBrain ECAPA-TDNN: https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb | paper https://arxiv.org/abs/2005.07143 | forensic https://www.sciencedirect.com/science/article/pii/S0167639324000177
- AI Voice Cloning Regulation 2026: https://aitribune.net/2026/02/24/ai-voice-cloning-regulation-in-2026/
- ELVIS Act / right of publicity: https://holonlaw.com/entertainment-law/synthetic-media-voice-cloning-and-the-new-right-of-publicity-risk-map-for-2026/
- Consent laws by country: https://www.soundverse.ai/blog/article/voice-cloning-consent-laws-by-country-1049