Digitizes Joseph's voice so any written script can be spoken aloud in his authentic voice -- same warmth, same cadence, same "Hey y'all." Used for video content, voiceovers, and the content engine.
Use zero-shot voice cloning via a cloud API. We already have a tested output, a working API token, and 105+ hours of source audio. Each generation needs only 15-30 seconds of clean reference audio and costs about $0.014.
| Component | Status | Notes |
|---|---|---|
| Audio corpus (source material) | Ready | 105+ hours from video library |
| Voice profile (style guide) | Ready | 270-line model from 90+ transcripts |
| Reference audio sample | Ready | 60s clean sample exists |
| Cloud API access | Ready | Token configured, tested |
| Pipeline tool | Partial | Built but needs new backend wired |
| Joseph's approval | Pending | Test samples needed for A/B review |

Score the 3,801-video library by duration (15-60s ideal), transcript availability, resolution, and content type. Select the top 20-50 candidates. The existing BB Sample Extractor does this automatically.
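The scoring step above can be sketched as follows. This is a minimal illustration, not the existing tool's logic: the `Video` fields, the weights, the 720-pixel cutoff, and the `"direct_to_camera"` label are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Video:
    # Hypothetical metadata fields; the real index may use different names.
    video_id: str
    duration_s: float
    has_transcript: bool
    height_px: int
    content_type: str


def score_video(video: Video) -> float:
    """Score one video on the four criteria (illustrative weights)."""
    score = 0.0
    if 15 <= video.duration_s <= 60:      # ideal duration window from the plan
        score += 4.0
    if video.has_transcript:              # transcript available
        score += 3.0
    if video.height_px >= 720:            # assumed cutoff for "good" resolution
        score += 1.0
    if video.content_type == "direct_to_camera":  # assumed preferred type
        score += 2.0
    return score


def select_candidates(videos: list[Video], limit: int = 50) -> list[Video]:
    """Return the highest-scoring videos, best first."""
    return sorted(videos, key=score_video, reverse=True)[:limit]
```

Calling `select_candidates(library, limit=20)` on the indexed metadata would yield the lower end of the 20-50 candidate range.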
Download the top candidate videos via the URLs in their metadata. Strip the video track and keep only the audio. Output: mono WAV at 44.1kHz, 16-bit.
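One way to produce that output format is to call `ffmpeg`. The sketch below only builds the command. It assumes `ffmpeg` is installed and on the PATH, and the function name and file paths are hypothetical.

```python
from pathlib import Path


def build_extract_command(src: Path, dst: Path) -> list[str]:
    """Build an ffmpeg command that drops video and writes mono 44.1kHz 16-bit WAV."""
    return [
        "ffmpeg",
        "-y",                 # overwrite the output file if it exists
        "-i", str(src),       # input video
        "-vn",                # discard the video track
        "-ac", "1",           # downmix to one channel (mono)
        "-ar", "44100",       # resample to 44.1 kHz
        "-c:a", "pcm_s16le",  # 16-bit little-endian PCM
        str(dst),
    ]
```

The command can then be run with `subprocess.run(build_extract_command(src, dst), check=True)`.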
Normalize loudness to -16 LUFS, a common target for podcasts and online speech. Trim leading and trailing silence. If music or other speakers are present, separate the vocals. Processing is CPU-based; no GPU is needed.
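The silence-trimming part of this step can be illustrated in plain Python on 16-bit PCM samples. This is a simplified sketch: the amplitude threshold and padding are assumed values, and loudness normalization is not covered here because it needs a proper loudness meter (for example ffmpeg's `loudnorm` filter).

```python
def trim_silence(samples: list[int], threshold: int = 500, pad: int = 0) -> list[int]:
    """Drop leading and trailing samples whose amplitude is at or below threshold.

    `threshold` is in raw 16-bit sample units and `pad` keeps a few extra
    samples on each side; both are illustrative defaults, not tuned values.
    """
    start, end = 0, len(samples)
    while start < end and abs(samples[start]) <= threshold:
        start += 1
    while end > start and abs(samples[end - 1]) <= threshold:
        end -= 1
    if start == end:          # the whole clip is below the threshold
        return []
    start = max(0, start - pad)
    end = min(len(samples), end + pad)
    return samples[start:end]
```

A real clip would usually be judged on short-window energy rather than single samples, so treat this as a demonstration of the idea only.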
For zero-shot cloning: select one best 15-30s reference clip. For professional clone (future): concatenate 30+ minutes of clean audio into a single training file.
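For the future professional-clone path, the concatenation step could look like the sketch below, using Python's standard `wave` module. It assumes every clip already shares the same channel count, sample width, and sample rate (as produced by the extraction step); the function name is hypothetical.

```python
import wave
from pathlib import Path


def concatenate_wavs(sources: list[Path], dest: Path) -> float:
    """Join WAV clips into one file and return the total duration in seconds."""
    if not sources:
        raise ValueError("no source clips given")

    # Read the format of the first clip; all others must match it.
    with wave.open(str(sources[0]), "rb") as first:
        fmt = (first.getnchannels(), first.getsampwidth(), first.getframerate())

    total_frames = 0
    with wave.open(str(dest), "wb") as out:
        out.setnchannels(fmt[0])
        out.setsampwidth(fmt[1])
        out.setframerate(fmt[2])
        for src in sources:
            with wave.open(str(src), "rb") as clip:
                clip_fmt = (clip.getnchannels(), clip.getsampwidth(), clip.getframerate())
                if clip_fmt != fmt:
                    raise ValueError(f"{src} has a different audio format")
                frames = clip.getnframes()
                out.writeframes(clip.readframes(frames))
                total_frames += frames
    return total_frames / fmt[2]
```

Comparing the returned duration against `30 * 60` seconds gives a quick check that the training file meets the 30-minute minimum.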
Zero-shot: no training step. Supply reference audio + text at inference time. Professional clone (future): upload audio, processing takes 1-4 hours, returns a permanent voice ID.
Generate 5 test phrases covering Joseph's vocal range: greeting ("Hey y'all"), teaching ("Here's the deal"), celebration ("Congrats!"), data delivery, and sign-off ("Ciao"). A/B each one against real audio of Joseph and score it with the 10-point Voice Quality Rubric. Joseph listens and approves.
Add voice generation as a backend in the content pipeline. Blog posts generate scripts, scripts generate audio, audio feeds avatar generation (future). Content tracker tracks all generated audio.
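A pluggable backend could be sketched roughly as below. Every name here (`VoiceBackend`, `ContentTracker`, `SilentBackend`, `script_to_audio`) is hypothetical and not taken from the existing pipeline; the stub backend only returns silence so the example runs without any API.

```python
from dataclasses import dataclass, field
from typing import Protocol


class VoiceBackend(Protocol):
    """Interface any voice backend would satisfy (hypothetical)."""
    name: str

    def synthesize(self, script: str, reference_clip: str) -> bytes: ...


@dataclass
class ContentTracker:
    """Minimal in-memory log of generated audio."""
    entries: list[dict] = field(default_factory=list)

    def record(self, backend: str, script: str, audio_bytes: int) -> None:
        self.entries.append(
            {"backend": backend, "script": script, "audio_bytes": audio_bytes}
        )


class SilentBackend:
    """Stand-in backend: two zero bytes per character instead of real speech."""
    name = "silent-stub"

    def synthesize(self, script: str, reference_clip: str) -> bytes:
        return b"\x00\x00" * len(script)


def script_to_audio(script: str, backend: VoiceBackend,
                    tracker: ContentTracker, reference_clip: str) -> bytes:
    """Run one script through the chosen backend and log the result."""
    audio = backend.synthesize(script, reference_clip)
    tracker.record(backend.name, script, len(audio))
    return audio
```

With this shape, swapping the cloud API for another service means writing one new class with a `synthesize` method; the rest of the pipeline stays unchanged.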
Record consent (Joseph authorized, internal use). Version reference audio clips. Restrict API access to internal infrastructure only. No client-facing TTS without per-use approval.
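As an illustration of these governance rules, reference clips can be versioned by content hash and each use checked against the consent scope. The record fields, the `"internal"` and `"client"` audience labels, and the function names are assumptions made for this sketch.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceClipVersion:
    clip_name: str
    sha256: str          # content hash doubles as the version identifier
    consent_scope: str   # e.g. "internal", per the recorded consent


def register_clip(clip_name: str, audio: bytes,
                  consent_scope: str = "internal") -> ReferenceClipVersion:
    """Create a version record whose ID changes whenever the audio changes."""
    digest = hashlib.sha256(audio).hexdigest()
    return ReferenceClipVersion(clip_name, digest, consent_scope)


def is_use_allowed(audience: str, per_use_approval: bool = False) -> bool:
    """Internal use is authorized; client-facing use needs per-use approval."""
    if audience == "internal":
        return True
    if audience == "client":
        return per_use_approval
    return False  # unknown audiences are denied by default
```

Storing the hash alongside each generated file makes it possible to trace any output back to the exact reference clip that produced it.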
| Criteria | Cloud API (Recommended) | Premium Clone | Self-Hosted |
|---|---|---|---|
| Quality | 8/10 | 10/10 | 9/10 |
| Monthly cost | $5-15 | $99 | $50-200 (GPU rental) |
| Setup effort | Low | Medium | High |
| GPU needed | No | No | Yes |
| Credentials | Have | Need | Need GPU |
| Already tested | Yes | No | No |
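
The monthly cost figures can be related to the per-generation price of about $0.014 quoted earlier. The sketch below assumes flat per-generation pricing with no fixed fee, which is a simplification; the function name is hypothetical.

```python
import math

COST_PER_GENERATION = 0.014  # USD, approximate figure from this plan


def generations_for_budget(budget_usd: float) -> int:
    """How many generations a monthly budget covers at the flat rate."""
    return math.floor(budget_usd / COST_PER_GENERATION)


low = generations_for_budget(5)          # 357
high = generations_for_budget(15)        # 1071
break_even = generations_for_budget(99)  # 7071
print(low, high, break_even)
```

Under that assumption, the $5-15 range corresponds to roughly 357 to 1,071 generations per month, and the Cloud API stays cheaper than the $99 Premium Clone tier until usage passes about 7,000 generations per month.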

| Source | Count | Est. Solo Audio | Status |
|---|---|---|---|
| Direct-to-camera video library | 3,801 videos | ~105 hours | Metadata indexed |
| Video transcripts | 1,991 text files | N/A (text) | Indexed |
| Meeting recordings (MP3) | 3 files on disk | ~1.3 hours | Available |
| Notetaker transcripts | 143 JSON files | N/A (text) | Indexed |
| Channel transcripts | 7 transcripts | N/A (text) | Indexed |
| Extracted voice samples | 4 WAV files | ~2.7 min | Ready |
| Voice notes | 3 OGG files | ~1-2 min | Unprocessed |

| Tool | Purpose | Status |
|---|---|---|
| Voice + Avatar Pipeline | Blog-to-script-to-voice-to-avatar | Built, needs backend swap |
| Voice Separator | Classify text conversations by speaker | Built |
| Voice Profile (JRD) | 217-line style guide from 90+ transcripts | Complete |
| Voice Model (text) | 270-line model from 595K text messages | Complete |
| BB Sample Extractor | Scores video library for best voice samples | Built |
| Script Miner | Mines transcripts for reusable scripts | Built |

| Service | Status | Notes |
|---|---|---|
| Cloud inference API token | Configured | Active, tested |
| Self-hosted TTS API key | Not configured | Pipeline code ready, needs key |
| Premium voice clone service | Not configured | $22-99/mo, best quality tier |
| Avatar generation service | Not configured | $29/mo for talking-head video |

The secondary compute server (128GB RAM, AMD Ryzen 9 7950X3D) has no GPU, so self-hosted voice models that require NVIDIA GPUs cannot run on it. Cloud API inference is the viable path without additional hardware investment.