Creates a digital talking-head avatar of Joseph -- his actual face, speaking in his cloned voice, from any written script. No filming needed. Used for newsletter videos, listing walkthroughs, transaction follow-ups, and social media content at scale.
Companion to the Voice Digitization Flow -- the voice pipeline produces the audio, this flow adds the face.
Voice cloning produces audio in Joseph's voice ($0.014/run). The avatar engine adds his face with accurate lip-sync ($29/mo). Together they produce a talking-head video that is meant to pass for a real recording in short-form content, subject to Joseph's review before publishing.
Voice: F5-TTS via Replicate -- $0.014/run, already tested, API token active.
Avatar: HeyGen Creator plan -- $29/month, custom digital twin from BombBomb videos, pipeline scaffold exists.
Total: ~$34-44/month for up to 30 minutes of "Digital Joseph" avatar video per month (the Creator plan's credit limit).
60-90 second market update embedded in monthly email. ~$1.50/video
30-60 second property introduction per listing. ~$0.70/video
15-30 second personalized congratulations at close. ~$0.35/video
15-30 second IG Reels / LinkedIn / FB clips. ~$0.35/video
30-60 second video intro for referral partners. ~$0.70/video
30-60 second monthly market insight clip. ~$1.00/video
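The per-video figures above can be roughly reproduced from the Creator plan numbers stated in this document ($29 for 600 credits, 20 credits per minute of Avatar V video). The sketch below is a budgeting aid only: it assumes the plan price is spread evenly across credits, and the function name is hypothetical rather than part of the pipeline.

```python
CREATOR_PRICE_USD = 29.00   # Creator plan price per month (from this document)
CREATOR_CREDITS = 600       # credits included per month (from this document)
CREDITS_PER_MINUTE = 20     # Avatar V rate (from this document)


def estimate_video_cost(seconds: float) -> float:
    """Return the share of the monthly plan price used by one video.

    Assumption: cost is allocated evenly per credit. This is a budgeting
    convention, not a per-video charge.
    """
    credits = seconds / 60 * CREDITS_PER_MINUTE
    return credits * CREATOR_PRICE_USD / CREATOR_CREDITS


if __name__ == "__main__":
    for label, seconds in [("newsletter, 90 s", 90),
                           ("listing intro, 45 s", 45),
                           ("social clip, 22.5 s", 22.5)]:
        print(f"{label}: ${estimate_video_cost(seconds):.2f}")
```

Under this allocation a 90-second newsletter video comes to about $1.45 and a 45-second listing intro to about $0.73, in line with the rounded estimates listed above.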
HeyGen Creator plan -- $29/month. This is the production avatar engine. Cannot proceed to production quality without it.
Today (no subscription needed): We can run a SadTalker prototype via Replicate with the existing headshot + voice clone. Lower quality, but it proves the concept for ~$0.20 per run.
How the two flows chain together:
Voice Digitization Flow produces the audio. This Avatar Flow adds the face. Both feed the Content pipeline.
Input: Blog post, newsletter content, listing data, or follow-up template
Output: Spoken script matched to Joseph's voice style (warm, direct, "Hey y'all" openers)
Tool: voice_avatar_pipeline.py --script-from-blog (already functional)
Input: Script text + 60-second reference audio of Joseph
Output: WAV audio file -- Joseph's cloned voice speaking the script
Engine: F5-TTS via Replicate (status: Active, $0.014/run)
Already tested: jrd-f5tts-test.wav (13.3 seconds) exists from prior run
Input: WAV audio + custom avatar (trained from BombBomb video)
Output: MP4 video -- Joseph's face lip-synced to cloned voice, 1080p
Production engine: HeyGen Avatar V (status: needs API key, $29/mo)
Prototype engine: SadTalker via Replicate (status: Active, ~$0.20/run)
Criteria: Face looks natural (7+/10), lip-sync matches audio (7+/10), overall "Is this me?" (7+/10)
Gate: Joseph watches and approves before any publish. No exceptions.
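The review criteria and approval gate above can be expressed as a small check that the publish step calls before releasing anything. This is a sketch only: the `Review` structure and `may_publish` function are hypothetical names, and the scores are assumed to be entered by hand after Joseph watches the video.

```python
from dataclasses import dataclass

MIN_SCORE = 7  # threshold taken from the criteria above (7+/10)


@dataclass
class Review:
    face_natural: int         # 1-10: does the face look natural?
    lip_sync: int             # 1-10: does lip-sync match the audio?
    is_this_me: int           # 1-10: overall "Is this me?" impression
    approved_by_joseph: bool  # explicit sign-off, required in every case


def may_publish(review: Review) -> bool:
    """A video is publishable only if all scores pass AND Joseph approved."""
    scores = (review.face_natural, review.lip_sync, review.is_this_me)
    return review.approved_by_joseph and all(s >= MIN_SCORE for s in scores)
```

Keeping the approval flag separate from the scores means a high-scoring video still cannot be published without explicit sign-off.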
Outlets: Email embed, social media upload, listing page embed, direct message
Integration: Content pipeline MICRO layer handles finishing: media review, content tracker, publish, performance tracking.
| Platform | Quality | Cost/mo | Best For | Status |
|---|---|---|---|---|
| HeyGen | 9/10 | $29 | Production content at scale | Need key |
| D-ID | 7/10 | $16-48 | Quick photo-based clips | Need key |
| Synthesia | 8/10 | $29-89 | Training / onboarding videos | Need key |
| SadTalker (Replicate) | 6/10 | ~$5 usage | Prototyping from photo | Ready |
| Video-ReTalking (Replicate) | 7/10 | ~$10 usage | Re-dub existing BB videos | Ready |
| Wav2Lip (Replicate) | 7/10 | ~$5 usage | Lip-sync replacement | Ready |
1. Best-in-class custom avatar -- Avatar V creates the most realistic digital twin from uploaded video. With 3,801 BombBomb videos as source material, there is ample footage from which to pick a clean training clip.
2. Pipeline scaffold exists -- voice_avatar_pipeline.py already has the full HeyGen API integration coded (audio upload, video generation, polling, download). Just needs the API key.
3. 30 min/month covers all use cases -- Newsletter (1.5 min) + listings (5 min) + follow-ups (2 min) + social (4 min) + referral (1 min) = ~13.5 min. Headroom for growth.
4. Audio input mode -- HeyGen accepts our F5-TTS cloned voice as audio input, giving us full control over voice quality rather than relying on HeyGen's own TTS.
| Asset | Count | Location |
|---|---|---|
| BombBomb face videos | 3,801 (3,798 with H264 URLs) | memories/knowledge/bombbomb-videos/*.json |
| Headshot (padded) | 1 JPEG (380KB) | data/voice-samples/jrd-headshot-padded.jpg |
| Voice reference | 60s WAV | data/voice-samples/jrd-voice-sample-60s.wav |
| F5-TTS test output | 13.3s WAV | data/voice-samples/jrd-f5tts-test.wav |
| Pipeline scaffold | HeyGen integration (lines 230-347) | tools/voice_avatar_pipeline.py |
| Key | Status | Notes |
|---|---|---|
| REPLICATE_API_TOKEN | Active | In .env, tested, service account |
| HEYGEN_API_KEY | Missing | Needs Creator plan signup ($29/mo) |
| HEYGEN_AVATAR_ID | Missing | Created after uploading training video to HeyGen |
| ELEVENLABS_API_KEY | Missing | Future upgrade path, not needed now |
| D_ID_API_KEY | Missing | Optional, not recommended as primary |
| SYNTHESIA_API_KEY | Missing | Optional, not recommended for our use case |
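A quick way to see which of the keys in the table are still missing is to check the process environment. The sketch below assumes the values from .env have already been loaded into the environment; the function name and the grouping into two lists are illustrative, not part of the existing pipeline.

```python
import os

REQUIRED_NOW = ["REPLICATE_API_TOKEN"]
REQUIRED_FOR_PRODUCTION = ["HEYGEN_API_KEY", "HEYGEN_AVATAR_ID"]


def missing_keys(names, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    for group, names in [("prototype", REQUIRED_NOW),
                         ("production", REQUIRED_FOR_PRODUCTION)]:
        missing = missing_keys(names)
        print(f"{group}: {'ready' if not missing else 'missing ' + ', '.join(missing)}")
```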
| Model | Runs | Cost/Run | Input | Use Case |
|---|---|---|---|---|
| cjwbw/sadtalker | 172,523 | ~$0.10-0.30 | Photo + audio | Animate photo into talking head |
| chenxwh/video-retalking | 33,237 | ~$0.40 | Video + audio | Re-dub existing video with new audio |
| devxpy/cog-wav2lip | 3,659,285 | ~$0.05-0.15 | Video + audio | Replace lips only in existing video |
| lucataco/f5-tts | -- | ~$0.014 | Text + ref audio | Voice clone (companion flow) |
File: tools/voice_avatar_pipeline.py, lines 230-347
Function: generate_avatar_from_audio(audio_path, output_path)
Flow: Upload audio asset -> Create video generation task (avatar_id + audio_asset_id) -> Poll for completion (max 5 min) -> Download MP4
Endpoint: https://api.heygen.com/v2/video/generate
Activation: Set HEYGEN_API_KEY and HEYGEN_AVATAR_ID in .env. The scaffold handles everything else.
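The "poll for completion (max 5 min)" step in the flow above can be sketched independently of the HeyGen API. This is not the scaffold's actual code: the status strings, the 5-second interval, and the `get_status` callable are assumptions, with the real status lookup left to whatever the scaffold already does.

```python
import time
from typing import Callable


def poll_until_done(get_status: Callable[[], str],
                    timeout_s: float = 300.0,   # 5-minute cap from the flow above
                    interval_s: float = 5.0,    # assumed polling interval
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic) -> str:
    """Call get_status() until it reports a terminal state or time runs out.

    The "completed" / "failed" strings are placeholders for whatever the
    video service actually returns.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in ("completed", "failed"):
            return status
        if clock() >= deadline:
            raise TimeoutError(f"video not ready after {timeout_s:.0f} s")
        sleep(interval_s)
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real waiting.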
Creator plan: 600 credits/month at $29
Avatar V: 20 credits/minute of video
Capacity: 600 / 20 = 30 minutes of Avatar V video per month
Estimated usage: ~13.5 min/month across all use cases. Headroom: 16.5 min unused.
Upgrade trigger: If usage exceeds 25 min/month consistently, upgrade to Business ($149/mo, 1,500 credits = 75 min).
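The credit arithmetic above can be checked with a few lines of code. The sketch uses only the figures stated in this document (600 credits, 20 credits per minute, the per-use-case minutes, and the 25-minute upgrade trigger); the function and dictionary names are hypothetical.

```python
CREDITS_PER_MONTH = 600     # Creator plan
CREDITS_PER_MINUTE = 20     # Avatar V
UPGRADE_TRIGGER_MIN = 25    # consider Business plan above this usage

# Estimated minutes per month, per use case (from this document).
usage_minutes = {
    "newsletter": 1.5,
    "listings": 5,
    "follow-ups": 2,
    "social": 4,
    "referral": 1,
}


def budget(usage: dict) -> dict:
    """Summarize monthly capacity, usage, headroom, and upgrade signal."""
    capacity = CREDITS_PER_MONTH / CREDITS_PER_MINUTE
    used = sum(usage.values())
    return {
        "capacity_min": capacity,
        "used_min": used,
        "headroom_min": capacity - used,
        "upgrade": used > UPGRADE_TRIGGER_MIN,
    }


if __name__ == "__main__":
    print(budget(usage_minutes))
```

Note that the upgrade trigger in the text applies to sustained usage, so a single month above 25 minutes would not by itself justify the Business plan.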
All run on Replicate's hosted GPUs using our existing API token. No local GPU required.
SadTalker: Best for photo-to-video. Single image + audio. Head motion generated. Quality 6/10 -- artifacts on longer clips but acceptable for prototyping.
Video-ReTalking: Best for re-dubbing. Three-stage pipeline: normalize expressions, sync lips, enhance face. Takes existing BB video + new audio. Quality 7/10.
Wav2Lip: Most popular (~3.7M runs). Only changes the lip region. Minimal artifacts, but the mouth can look "pasted" onto the face. Quality 7/10 for lip accuracy.
Hedra / EMO / LivePortrait: Not practical. Hedra has limited API. EMO is research-only. LivePortrait needs local GPU.
| Component | Monthly Cost | What You Get |
|---|---|---|
| HeyGen Creator | $29.00 | 30 min avatar video, custom digital twin, 1080p |
| F5-TTS (Replicate) | ~$5-15 | Pay-per-use voice cloning at $0.014/run (roughly 350-1,070 runs) |
| Total | ~$34-44 | "Digital Joseph" at scale |