Voice Digitization Flow | The Property Joes Group

105+	Hours of Joseph's Audio
3,801	Direct-to-Camera Videos
8	Pipeline Steps
$5-15	Est. Monthly Cost

What This Does

Digitizes Joseph's voice so any written script can be spoken aloud in his authentic voice -- same warmth, same cadence, same "Hey y'all." Used for video content, voiceovers, and the content engine.

Source --> Extract --> Clean --> Segment --> Clone --> Validate --> Deploy --> Govern

Recommendation

Use zero-shot voice cloning via cloud API. We already have a tested output, a working API token, and 105+ hours of source audio. Only 15-30 seconds of clean reference audio is needed per generation. Cost: ~$0.014 per generation.
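As a rough sanity check, the per-generation price can be related to the $5-15 monthly estimate. The sketch below assumes billing is purely per generation at the quoted rate, with no fixed fees; the source does not confirm the billing model.

```python
# Rough check: how many generations fit in the estimated monthly budget?
# Assumption: cost is exactly the quoted ~$0.014 per generation, nothing else.
COST_PER_GENERATION_USD = 0.014


def generations_for_budget(budget_usd: float) -> int:
    """Return the number of whole generations affordable within a budget."""
    return int(budget_usd / COST_PER_GENERATION_USD)


low = generations_for_budget(5.0)    # lower end of the $5-15 estimate
high = generations_for_budget(15.0)  # upper end of the $5-15 estimate
print(f"Roughly {low} to {high} generations per month")
```

Under that assumption the estimate corresponds to roughly 357 to 1,071 generations per month.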

Readiness Status

Component	Status	Notes
Audio corpus (source material)	Ready	105+ hours from video library
Voice profile (style guide)	Ready	270-line model from 90+ transcripts
Reference audio sample	Ready	60s clean sample exists
Cloud API access	Ready	Token configured, tested
Pipeline tool	Partial	Built but needs new backend wired
Joseph's approval	Pending	Test samples needed for A/B review

First 3 Actions

1. Listen to the existing 13-second test output. Compare it against real Joseph audio. Score quality 1-10.
2. Wire the cloud voice engine into the existing pipeline tool. Generate 5 test phrases covering greetings, teaching, celebration, negotiation, and sign-off.
3. Send the test samples to Joseph for A/B approval. His green light gates production use.

Pipeline Steps (Detail)

1. SOURCE -- Gather Cleanest Audio

Score the 3,801-video library by duration (15-60s ideal), transcript availability, resolution, and content type. Select the top 20-50 candidates. An existing tool does this automatically.

IN: Metadata for 3,801 videos
OUT: Ranked candidate list
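The scoring described above could look something like the following. This is a minimal sketch, not the existing tool's code: the `VideoMeta` fields, the weights, the 1080-pixel threshold, and the `"direct_to_camera"` label are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class VideoMeta:
    """Hypothetical metadata record; field names are assumptions."""
    video_id: str
    duration_s: float
    has_transcript: bool
    height_px: int
    content_type: str


def score(video: VideoMeta) -> float:
    """Score one video on the four criteria. Weights are illustrative only."""
    total = 0.0
    if 15 <= video.duration_s <= 60:      # ideal duration window from the text
        total += 4.0
    if video.has_transcript:              # transcript availability
        total += 3.0
    if video.height_px >= 1080:           # resolution (threshold is assumed)
        total += 2.0
    if video.content_type == "direct_to_camera":  # content type (label assumed)
        total += 1.0
    return total


def rank_candidates(videos: list[VideoMeta], limit: int = 50) -> list[VideoMeta]:
    """Return the highest-scoring videos, best first, capped at `limit`."""
    return sorted(videos, key=score, reverse=True)[:limit]
```

With `limit` set anywhere from 20 to 50, the returned list plays the role of the ranked candidate list.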

2. EXTRACT -- Audio from Video

Download the top candidate videos using the URLs in their metadata. Strip the video track and keep only the audio. Output: mono WAV at 44.1 kHz, 16-bit.

IN: Video MP4 files
OUT: Raw WAV audio files
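One way to produce that output format is with ffmpeg. The source does not name the extraction tool, so treat this as a sketch that assumes ffmpeg is installed and on the PATH; the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_extract_command(video_path: Path, wav_path: Path) -> list[str]:
    """Build an ffmpeg command that writes mono, 44.1 kHz, 16-bit PCM WAV."""
    return [
        "ffmpeg", "-y",        # overwrite the output file if it exists
        "-i", str(video_path),
        "-vn",                 # drop the video track
        "-ac", "1",            # downmix to mono
        "-ar", "44100",        # resample to 44.1 kHz
        "-c:a", "pcm_s16le",   # 16-bit little-endian PCM
        str(wav_path),
    ]


def extract_audio(video_path: Path, wav_path: Path) -> None:
    """Run the extraction; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_extract_command(video_path, wav_path), check=True)
```

Keeping command construction separate from execution makes the settings easy to inspect without running ffmpeg.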

3. CLEAN -- Remove Noise, Normalize

Normalize loudness to broadcast standard (-16 LUFS). Trim leading and trailing silence. If music or other speakers are present, separate the vocals. Processing is CPU-based; no GPU is needed.

IN: Raw WAV files
OUT: Clean, normalized WAV files
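The normalization and silence trimming could be done with ffmpeg's `loudnorm` and `silenceremove` filters. This is a sketch under assumptions: the source does not name the tool, the -50 dB silence threshold and the true-peak and loudness-range values are illustrative, and vocal separation is not covered here.

```python
def build_clean_command(
    raw_wav: str,
    clean_wav: str,
    target_lufs: float = -16.0,  # loudness target from the text
    silence_db: float = -50.0,   # assumed silence threshold
) -> list[str]:
    """Build an ffmpeg command that trims edge silence and normalizes loudness."""
    # silenceremove only trims from the start, so the audio is reversed
    # to trim the tail, then reversed back.
    trim = f"silenceremove=start_periods=1:start_threshold={silence_db}dB"
    filters = ",".join([
        trim, "areverse", trim, "areverse",
        f"loudnorm=I={target_lufs}:TP=-1.5:LRA=11",
    ])
    return [
        "ffmpeg", "-y",
        "-i", raw_wav,
        "-af", filters,
        "-ar", "44100",  # loudnorm upsamples internally; restore 44.1 kHz
        clean_wav,
    ]
```

The command can be run with `subprocess.run(cmd, check=True)`. Single-pass `loudnorm` is approximate; a two-pass measurement is more accurate if tighter loudness control is needed.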

4. SEGMENT -- Training Clips

For zero-shot cloning: select the single best 15-30s reference clip. For a professional clone (future): concatenate 30+ minutes of clean audio into a single training file.

IN: Clean WAV files
OUT: Reference clip(s) or training file
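Once a start time has been chosen, cutting the reference clip needs only the Python standard library. A minimal sketch, assuming uncompressed PCM WAV input; the function name is hypothetical and choosing the best start time is left to the reviewer.

```python
import wave


def cut_reference_clip(src_path: str, dst_path: str,
                       start_s: float, duration_s: float) -> float:
    """Copy a 15-30 s window from a PCM WAV file; return the clip length in seconds."""
    if not 15.0 <= duration_s <= 30.0:
        raise ValueError("reference clip must be 15-30 seconds long")
    with wave.open(src_path, "rb") as src:
        channels = src.getnchannels()
        sample_width = src.getsampwidth()
        rate = src.getframerate()
        start_frame = int(start_s * rate)
        n_frames = int(duration_s * rate)
        if start_frame + n_frames > src.getnframes():
            raise ValueError("clip extends past the end of the source audio")
        src.setpos(start_frame)           # seek to the clip start
        frames = src.readframes(n_frames)
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(channels)        # keep the source format unchanged
        dst.setsampwidth(sample_width)
        dst.setframerate(rate)
        dst.writeframes(frames)
    return n_frames / rate
```

For example, `cut_reference_clip("clean.wav", "reference.wav", 5.0, 20.0)` writes a 20-second clip starting 5 seconds into the source.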

5. CLONE/TRAIN -- Create Voice Model

Zero-shot: no training step. Supply reference audio + text at inference time. Professional clone (future): upload audio, processing takes 1-4 hours, returns a permanent voice ID.

IN: Reference audio + script text
OUT: Synthesized speech audio

6. VALIDATE -- Test and Approve

Generate 5 test phrases covering Joseph's vocal range: greeting ("Hey y'all"), teaching ("Here's the deal"), celebration ("Congrats!"), data delivery, and sign-off ("Ciao"). A/B against real Joseph audio. Score using 10-point Voice Quality Rubric. Joseph listens and approves.

IN: 5 test scripts
OUT: 5 audio samples + approval/rejection
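The validation results could be recorded along these lines. This is a sketch only: the Voice Quality Rubric itself is not reproduced in this document, so each sample carries a single 1-10 score, and the rule that every sample must be approved before production use is an assumption.

```python
from dataclasses import dataclass
from statistics import mean

# The five test categories named in step 6.
CATEGORIES = ("greeting", "teaching", "celebration", "data_delivery", "sign_off")


@dataclass
class SampleReview:
    category: str
    rubric_score: int         # 1-10 overall score (rubric details not shown here)
    approved_by_joseph: bool  # Joseph's A/B verdict for this sample


def summarize(reviews: list[SampleReview]) -> dict:
    """Check that all five categories are covered, then aggregate the results."""
    missing = set(CATEGORIES) - {r.category for r in reviews}
    if missing:
        raise ValueError(f"missing test categories: {sorted(missing)}")
    for r in reviews:
        if not 1 <= r.rubric_score <= 10:
            raise ValueError(f"score out of range for {r.category}")
    return {
        "mean_score": mean(r.rubric_score for r in reviews),
        # Assumption: production use requires approval of every sample.
        "production_ready": all(r.approved_by_joseph for r in reviews),
    }
```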

7. DEPLOY -- Wire Into Content Engine

Add voice generation as a backend in the content pipeline. Blog posts generate scripts, scripts generate audio, and audio feeds avatar generation (future). The content tracker logs all generated audio.

IN: Any text script
OUT: Audio file in Joseph's voice
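Treating voice generation as a swappable backend might look like the sketch below. Every name here (`VoiceBackend`, `PlaceholderBackend`, `ContentTracker`, `generate_voiceover`) is hypothetical; the real pipeline tool and the cloud API's actual interface are not shown in this document.

```python
from typing import Protocol


class VoiceBackend(Protocol):
    """Hypothetical interface that any voice engine would implement."""

    def synthesize(self, script: str, reference_wav: bytes) -> bytes: ...


class PlaceholderBackend:
    """Stand-in for testing; returns placeholder bytes, not real audio."""

    def synthesize(self, script: str, reference_wav: bytes) -> bytes:
        return b"PLACEHOLDER:" + script.encode("utf-8")


class ContentTracker:
    """Minimal in-memory log of generated audio (illustrative only)."""

    def __init__(self) -> None:
        self.entries: list[dict] = []

    def record(self, script: str, audio: bytes) -> None:
        self.entries.append({"script": script, "audio_bytes": len(audio)})


def generate_voiceover(script: str, reference_wav: bytes,
                       backend: VoiceBackend, tracker: ContentTracker) -> bytes:
    """Synthesize one script and log the result with the tracker."""
    if not script.strip():
        raise ValueError("script is empty")
    audio = backend.synthesize(script, reference_wav)
    tracker.record(script, audio)
    return audio
```

With this shape, swapping the backend means supplying a different object with a matching `synthesize` method while the rest of the pipeline stays unchanged.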

8. GOVERN -- Consent, Version, Access

Record consent (Joseph authorized, internal use). Version reference audio clips. Restrict API access to internal infrastructure only. No client-facing TTS without per-use approval.

IN: Governance requirements
OUT: Consent record + access controls
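The consent record, clip versioning, and per-use approval rule could be modeled as follows. This is a sketch under assumptions: the record fields, the use of a SHA-256 prefix as a version ID, and the function names are illustrative, not an existing implementation.

```python
import hashlib
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ConsentRecord:
    """Hypothetical consent record; field names are assumptions."""
    speaker: str
    scope: str          # e.g. "internal use"
    recorded_on: date


def clip_version(audio_bytes: bytes) -> str:
    """Derive a stable version ID for a reference clip from its content."""
    return hashlib.sha256(audio_bytes).hexdigest()[:12]


def may_generate(consent: Optional[ConsentRecord], client_facing: bool,
                 per_use_approved: bool = False) -> bool:
    """Apply the rules above: consent is required, and client-facing
    output additionally needs per-use approval."""
    if consent is None:
        return False
    if client_facing:
        return per_use_approved
    return True
```

Hashing the clip content means any change to the reference audio produces a new version ID, so generated audio can be traced back to the exact clip used.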

Engine Comparison

Criteria	Cloud API (Recommended)	Premium Clone	Self-Hosted
Quality	8/10	10/10	9/10
Monthly cost	$5-15	$99	$50-200 (GPU rental)
Setup effort	Low	Medium	High
GPU needed	No	No	Yes
Credentials	Have	Need	Need GPU
Already tested	Yes	No	No

Corpus Inventory (Detailed)

Source	Count	Est. Solo Audio	Status
Direct-to-camera video library	3,801 videos	~105 hours	Metadata indexed
Video transcripts	1,991 text files	N/A (text)	Indexed
Meeting recordings (MP3)	3 files on disk	~1.3 hours	Available
Notetaker transcripts	143 JSON files	N/A (text)	Indexed
Channel transcripts	7 transcripts	N/A (text)	Indexed
Extracted voice samples	4 WAV files	~2.7 min	Ready
Voice notes	3 OGG files	~1-2 min	Unprocessed

Existing Tooling

Tool	Purpose	Status
Voice + Avatar Pipeline	Blog-to-script-to-voice-to-avatar	Built, needs backend swap
Voice Separator	Classify text conversations by speaker	Built
Voice Profile (JRD)	217-line style guide from 90+ transcripts	Complete
Voice Model (text)	270-line model from 595K text messages	Complete
BB Sample Extractor	Scores video library for best voice samples	Built
Script Miner	Mines transcripts for reusable scripts	Built

Credentials Status

Service	Status	Notes
Cloud inference API token	Configured	Active, tested
Self-hosted TTS API key	Not configured	Pipeline code ready, needs key
Premium voice clone service	Not configured	$22-99/mo, best quality tier
Avatar generation service	Not configured	$29/mo for talking-head video

Infrastructure Note

The secondary compute server (128GB RAM, AMD Ryzen 9 7950X3D) has no GPU, so self-hosted voice models that require NVIDIA GPUs cannot run there. Cloud API inference is the viable path without additional hardware investment.