Voice Digitization Flow

The Property Joes Group -- Powered by The Curator
Hours of Joseph's audio: 105+
Direct-to-camera videos: 3,801
Pipeline steps: 8
Est. monthly cost: $5-15

What This Does

Digitizes Joseph's voice so any written script can be spoken aloud in his authentic voice -- same warmth, same cadence, same "Hey y'all." Used for video content, voiceovers, and the content engine.

Source --> Extract --> Clean --> Segment --> Clone --> Validate --> Deploy --> Govern

Recommendation

Use zero-shot voice cloning via cloud API. We already have a tested output, a working API token, and 105+ hours of source audio. Only 15-30 seconds of clean reference audio is needed per generation. Cost: ~$0.014 per generation.
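As a rough budget check, the per-generation price and the monthly estimate quoted above imply a ceiling on generations per month. A minimal sketch, assuming the ~$0.014 figure applies uniformly to every generation and that there are no other charges (both are assumptions; the function name is illustrative):

```python
from decimal import Decimal

COST_PER_GENERATION = Decimal("0.014")  # USD per generation, as quoted above

def generations_for_budget(monthly_budget_usd: str) -> int:
    """Whole generations that fit in a monthly budget, rounded down."""
    return int(Decimal(monthly_budget_usd) // COST_PER_GENERATION)

low = generations_for_budget("5")
high = generations_for_budget("15")
print(f"$5-15/month covers roughly {low}-{high} generations")
# prints: $5-15/month covers roughly 357-1071 generations
```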

Readiness Status

Component | Status | Notes
Audio corpus (source material) | Ready | 105+ hours from video library
Voice profile (style guide) | Ready | 270-line model from 90+ transcripts
Reference audio sample | Ready | 60s clean sample exists
Cloud API access | Ready | Token configured, tested
Pipeline tool | Partial | Built but needs new backend wired
Joseph's approval | Pending | Test samples needed for A/B review

First 3 Actions

  1. Listen to existing 13-second test output. Compare against real Joseph audio. Score quality 1-10.
  2. Wire cloud voice engine into existing pipeline tool. Generate 5 test phrases covering greetings, teaching, celebration, negotiation, and sign-off.
  3. Send test samples to Joseph for A/B approval. His green light gates production use.

Pipeline Steps (Detail)

1. SOURCE -- Gather Cleanest Audio

Score each of the 3,801 videos in the library by duration (15-60s ideal), transcript availability, resolution, and content type. Select the top 20-50 candidates. The existing tool does this automatically.

IN: Metadata for 3,801 videos
OUT: Ranked candidate list
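The scoring described above can be sketched as a weighted checklist. This is an illustration only: the weights, the 1080-pixel cutoff, the field names, and the "direct_to_camera" label are all assumptions, not the existing tool's actual formula.

```python
from dataclasses import dataclass

@dataclass
class VideoMeta:
    video_id: str
    duration_s: float
    has_transcript: bool
    height_px: int       # vertical resolution, e.g. 1080
    content_type: str    # hypothetical label, e.g. "direct_to_camera"

def score(v: VideoMeta) -> float:
    """Hypothetical weights; only the four criteria come from the step description."""
    s = 0.0
    if 15 <= v.duration_s <= 60:   # ideal duration window
        s += 4.0
    if v.has_transcript:
        s += 3.0
    if v.content_type == "direct_to_camera":
        s += 2.0
    if v.height_px >= 1080:        # assumed cutoff for "good" resolution
        s += 1.0
    return s

def rank(videos: list[VideoMeta], top_n: int = 50) -> list[VideoMeta]:
    """Return the top_n highest-scoring videos, best first."""
    return sorted(videos, key=score, reverse=True)[:top_n]

library = [
    VideoMeta("c", 30, False, 720, "interview"),
    VideoMeta("a", 45, True, 1080, "direct_to_camera"),
    VideoMeta("b", 120, True, 720, "direct_to_camera"),
]
ranked = rank(library, top_n=20)
print([v.video_id for v in ranked])
```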

2. EXTRACT -- Audio from Video

Download top candidate videos via URLs in metadata. Strip video track, keep audio only. Output: mono WAV at 44.1kHz, 16-bit.

IN: Video MP4 files
OUT: Raw WAV audio files
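One way to get the output format above is ffmpeg. The source does not name the extraction tool, so ffmpeg is an assumption here, as are the function names. The sketch builds the command separately from running it, so the arguments can be inspected without ffmpeg installed.

```python
import subprocess
from pathlib import Path

def ffmpeg_extract_cmd(video: Path, wav: Path) -> list[str]:
    """Build an ffmpeg command that keeps only the audio as mono 44.1 kHz 16-bit WAV."""
    return [
        "ffmpeg", "-y",          # overwrite the output if it exists
        "-i", str(video),
        "-vn",                   # drop the video track
        "-ac", "1",              # mono
        "-ar", "44100",          # 44.1 kHz sample rate
        "-c:a", "pcm_s16le",     # 16-bit PCM
        str(wav),
    ]

def extract_audio(video: Path, wav: Path) -> None:
    """Run the extraction; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(ffmpeg_extract_cmd(video, wav), check=True)

cmd = ffmpeg_extract_cmd(Path("clip.mp4"), Path("clip.wav"))
print(" ".join(cmd))
```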

3. CLEAN -- Remove Noise, Normalize

Normalize loudness to -16 LUFS. Trim leading and trailing silence. If music or other speakers are present, separate the vocals. Processing is CPU-based; no GPU is needed.

IN: Raw WAV files
OUT: Clean, normalized WAV files
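The trim and normalize parts of this step could be expressed as a single ffmpeg audio filter chain. This is a sketch under assumptions: ffmpeg is not named in the source, and the -50 dB silence threshold, the -1.5 dB true-peak ceiling, and the loudness range of 11 are illustrative values. Vocal separation is not covered here.

```python
def clean_filter_chain(target_lufs: float = -16.0, silence_db: float = -50.0) -> str:
    """Build an ffmpeg -af chain: trim silence at both ends, then normalize loudness."""
    trim = f"silenceremove=start_periods=1:start_threshold={silence_db}dB"
    return ",".join([
        trim,          # trim leading silence
        "areverse",    # reverse so trailing silence is now at the start
        trim,          # trim what was trailing silence
        "areverse",    # restore the original direction
        f"loudnorm=I={target_lufs}:TP=-1.5:LRA=11",  # TP and LRA are assumed values
    ])

def ffmpeg_clean_cmd(src: str, dst: str) -> list[str]:
    # loudnorm resamples internally, so the output rate is set back to 44.1 kHz.
    return ["ffmpeg", "-y", "-i", src, "-af", clean_filter_chain(), "-ar", "44100", dst]

chain = clean_filter_chain()
print(chain)
```

Single-pass loudnorm is approximate, and areverse buffers the whole clip in memory, which is acceptable for short clips but worth revisiting for long recordings.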

4. SEGMENT -- Training Clips

For zero-shot cloning: select one best 15-30s reference clip. For professional clone (future): concatenate 30+ minutes of clean audio into a single training file.

IN: Clean WAV files
OUT: Reference clip(s) or training file
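Both selection modes above can be sketched in a few lines. The `quality` field is a hypothetical 0-1 cleanliness score assumed to come from an earlier step; the source does not define how clips are judged "best".

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Clip:
    path: str
    duration_s: float
    quality: float   # hypothetical 0-1 cleanliness score

def pick_reference(clips: list[Clip], lo: float = 15.0, hi: float = 30.0) -> Optional[Clip]:
    """Zero-shot: the single best clip whose duration falls in the 15-30s window."""
    eligible = [c for c in clips if lo <= c.duration_s <= hi]
    return max(eligible, key=lambda c: c.quality, default=None)

def pick_training_set(clips: list[Clip], min_total_s: float = 30 * 60) -> list[Clip]:
    """Professional clone: best clips first until at least 30 minutes is collected."""
    chosen: list[Clip] = []
    total = 0.0
    for c in sorted(clips, key=lambda c: c.quality, reverse=True):
        if total >= min_total_s:
            break
        chosen.append(c)
        total += c.duration_s
    return chosen if total >= min_total_s else []   # empty if the corpus is too small

clips = [Clip("a.wav", 12, 0.99), Clip("b.wav", 22, 0.80), Clip("c.wav", 28, 0.92)]
best = pick_reference(clips)
print(best.path if best else "no eligible clip")
```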

5. CLONE/TRAIN -- Create Voice Model

Zero-shot: no training step. Supply reference audio + text at inference time. Professional clone (future): upload audio, processing takes 1-4 hours, returns a permanent voice ID.

IN: Reference audio + script text
OUT: Synthesized speech audio

6. VALIDATE -- Test and Approve

Generate 5 test phrases covering Joseph's vocal range: greeting ("Hey y'all"), teaching ("Here's the deal"), celebration ("Congrats!"), data delivery, and sign-off ("Ciao"). A/B against real Joseph audio. Score using 10-point Voice Quality Rubric. Joseph listens and approves.

IN: 5 test scripts
OUT: 5 audio samples + approval/rejection
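The approval gate above can be sketched as a small check: all five phrase categories must be covered, every sample must clear a minimum rubric score, and Joseph must have approved. The minimum score of 7 and the category keys are assumptions; the source only specifies a 10-point rubric and Joseph's sign-off.

```python
from dataclasses import dataclass
from statistics import mean

CATEGORIES = ("greeting", "teaching", "celebration", "data_delivery", "sign_off")

@dataclass
class SampleReview:
    category: str
    rubric_score: int   # 1-10 on the Voice Quality Rubric

def validation_result(reviews: list[SampleReview], joseph_approved: bool,
                      min_score: int = 7) -> dict:
    """Summarize an A/B review round; min_score is an assumed threshold."""
    covered = {r.category for r in reviews}
    missing = [c for c in CATEGORIES if c not in covered]
    scores = [r.rubric_score for r in reviews]
    passed = (not missing and bool(scores)
              and min(scores) >= min_score and joseph_approved)
    return {
        "mean_score": mean(scores) if scores else None,
        "missing": missing,
        "approved": passed,
    }

reviews = [SampleReview(c, s) for c, s in zip(CATEGORIES, (8, 7, 9, 8, 8))]
result = validation_result(reviews, joseph_approved=True)
print(result)
```

Joseph's approval is kept as a hard requirement regardless of scores, matching the rule that his green light gates production use.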

7. DEPLOY -- Wire Into Content Engine

Add voice generation as a backend in the content pipeline. Blog posts generate scripts, scripts generate audio, and audio feeds avatar generation (future). The content tracker logs all generated audio.

IN: Any text script
OUT: Audio file in Joseph's voice
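A swappable backend is one way to wire this in, so the cloud engine can replace whatever the pipeline tool uses today without touching the rest of the flow. Everything named here (`VoiceBackend`, `FakeBackend`, `ContentPipeline`, the tracker list) is hypothetical and not the existing tool's interface.

```python
from typing import Protocol

class VoiceBackend(Protocol):
    """Anything that turns text plus a reference clip into audio bytes."""
    def synthesize(self, text: str, reference_wav: bytes) -> bytes: ...

class FakeBackend:
    """Stand-in for tests; a real backend would call the cloud API here."""
    def synthesize(self, text: str, reference_wav: bytes) -> bytes:
        return f"AUDIO[{text}]".encode()

class ContentPipeline:
    def __init__(self, backend: VoiceBackend, reference_wav: bytes) -> None:
        self.backend = backend
        self.reference_wav = reference_wav
        self.tracker: list[dict] = []   # stands in for the content tracker

    def script_to_audio(self, script_id: str, text: str) -> bytes:
        """Generate audio for one script and record it in the tracker."""
        audio = self.backend.synthesize(text, self.reference_wav)
        self.tracker.append({"script_id": script_id, "bytes": len(audio)})
        return audio

pipeline = ContentPipeline(FakeBackend(), reference_wav=b"ref")
audio = pipeline.script_to_audio("post-1", "Hey y'all")
print(pipeline.tracker)
```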

8. GOVERN -- Consent, Version, Access

Record consent (Joseph authorized, internal use). Version reference audio clips. Restrict API access to internal infrastructure only. No client-facing TTS without per-use approval.

IN: Governance requirements
OUT: Consent record + access controls
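The three controls above (consent record, versioned reference clips, gated access) might look like this in code. The field names, the 12-character hash prefix, and the date are placeholders for illustration, not a record of the actual consent.

```python
import hashlib
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class ConsentRecord:
    speaker: str
    scope: str          # e.g. "internal"
    granted_on: date

def clip_version(clip_bytes: bytes) -> str:
    """Content hash used as a version ID, so any change to a clip yields a new ID."""
    return hashlib.sha256(clip_bytes).hexdigest()[:12]

def may_generate(consent: ConsentRecord, audience: str,
                 per_use_approval: bool = False) -> bool:
    """Internal use follows the consent scope; anything else needs per-use approval."""
    if audience == "internal":
        return consent.scope == "internal"
    return per_use_approval

consent = ConsentRecord("Joseph", "internal", date(2025, 1, 1))  # placeholder date
v1 = clip_version(b"reference clip bytes, take 1")
v2 = clip_version(b"reference clip bytes, take 2")
print(may_generate(consent, "internal"), may_generate(consent, "client"))
# prints: True False
```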

Engine Comparison

Criteria | Cloud API (Recommended) | Premium Clone | Self-Hosted
Quality | 8/10 | 10/10 | 9/10
Monthly cost | $5-15 | $99 | $50-200 (GPU rental)
Setup effort | Low | Medium | High
GPU needed | No | No | Yes
Credentials | Have | Need | Need GPU
Already tested | Yes | No | No

Corpus Inventory (Detailed)

Source | Count | Est. Solo Audio | Status
Direct-to-camera video library | 3,801 videos | ~105 hours | Metadata indexed
Video transcripts | 1,991 text files | N/A (text) | Indexed
Meeting recordings (MP3) | 3 files on disk | ~1.3 hours | Available
Notetaker transcripts | 143 JSON files | N/A (text) | Indexed
Channel transcripts | 7 transcripts | N/A (text) | Indexed
Extracted voice samples | 4 WAV files | ~2.7 min | Ready
Voice notes | 3 OGG files | ~1-2 min | Unprocessed

Existing Tooling

Tool | Purpose | Status
Voice + Avatar Pipeline | Blog-to-script-to-voice-to-avatar | Built, needs backend swap
Voice Separator | Classify text conversations by speaker | Built
Voice Profile (JRD) | 217-line style guide from 90+ transcripts | Complete
Voice Model (text) | 270-line model from 595K text messages | Complete
BB Sample Extractor | Scores video library for best voice samples | Built
Script Miner | Mines transcripts for reusable scripts | Built

Credentials Status

Service | Status | Notes
Cloud inference API token | Configured | Active, tested
Self-hosted TTS API key | Not configured | Pipeline code ready, needs key
Premium voice clone service | Not configured | $22-99/mo, best quality tier
Avatar generation service | Not configured | $29/mo for talking-head video

Infrastructure Note

Secondary compute server (128GB RAM, AMD Ryzen 9 7950X3D) has no GPU. Self-hosted voice models requiring NVIDIA GPUs cannot run there. Cloud API inference is the viable path without additional hardware investment.
