Video Pipeline
A CreatorStudio render is not one model call. It is six stages, each visible, controllable, and re-runnable. The creator can approve or re-run any stage without re-running the rest. Ra picks the right model per stage, per shot, against the Director Memory Graph. One brief becomes a story, not a clip.
The pipeline at a glance
brief + Director Brief
    |
    v
Keyframe -> Video -> Dialogue -> Audio -> Effects -> Render

| Stage | Produces | Models typically routed |
| --- | --- | --- |
| Keyframe | one image per shot | FLUX 1.1, Seedance (image) |
| Video | motion from keyframe | Kling 2.5, Runway, Pika, Luma, Seedance, Hailuo, Veo 3, Sora 2 |
| Dialogue | per-character voice | ElevenLabs (per-character voice clone, persistent in Memory) |
| Audio | score, ambient, foley | MINIMAX + supporting audio models |
| Effects | color, motion graphics, transitions | internal effects logic + provider-side transforms |
| Render | compose + per-platform variants | Subtitle Studio per-platform output |
All stages read and write the Director Memory Graph. Each stage is its own artifact in R2, recorded against the render in D1, and fed back into the Director Memory Graph on approval. Re-running stage four does not force a re-run of stages one through three.
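The selective re-run behavior can be sketched as a map of per-stage artifacts, where invalidating one stage drops only that stage and everything downstream. The `RenderState` shape, the stage names as strings, and the R2 key format are illustrative assumptions, not CreatorStudio's actual schema.

```typescript
// Illustrative sketch only: stage names, RenderState, and key format are assumptions.
const STAGES = ["keyframe", "video", "dialogue", "audio", "effects", "render"] as const;
type Stage = (typeof STAGES)[number];

interface StageArtifact {
  stage: Stage;
  r2Key: string;     // object key of the stage's artifact in R2
  approved: boolean; // approval is what feeds the Director Memory Graph
}

type RenderState = Map<Stage, StageArtifact>;

// Re-running a stage drops its artifact and all downstream artifacts;
// approved upstream work is untouched.
function rerun(state: RenderState, stage: Stage): RenderState {
  const next = new Map(state);
  for (const s of STAGES.slice(STAGES.indexOf(stage))) next.delete(s);
  return next;
}
```

Re-running `"audio"` (stage four) here leaves the keyframe, video, and dialogue artifacts intact, matching the behavior described above.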
Stage 1: Keyframe
Per-shot image generation. Ra takes the Director Brief (shot list, pacing, characters) and generates a keyframe image for every shot, consistent with the creator’s visual fingerprint from the Graph.
- Models typically routed: FLUX 1.1, Seedance (image)
- Memory inputs: visual fingerprint, color palette, shot types, character visual references
- Controls: re-run a single shot, edit the prompt, lock a character’s face, pick from variants
- Why it exists separately: catching an off-brand frame here costs pennies. Catching it after motion generation costs dollars.
Stage 2: Video
Keyframe-to-video motion. Each approved keyframe becomes a clip. Ra routes per shot based on motion complexity, shot length, and budget.
- Models typically routed: Kling 2.5, Seedance (video), Runway, Pika, Luma, Hailuo, with Veo 3 or Sora 2 for high-fidelity or longer shots
- Memory inputs: cut cadence, motion style preferences, pacing from prior renders
- Controls: re-run any clip, swap models, adjust duration, re-prompt motion
- Why the stack is deep: no single video model wins every shot. Kling is strong on one class of motion, Seedance on another, Runway on another. Ra picks.
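Per-shot routing on motion complexity, shot length, and budget might look like the following sketch. The thresholds, the model preference order, and the budget cutoff are invented for illustration; Ra's actual scoring is not described in this document.

```typescript
// Illustrative routing sketch: thresholds and preference order are assumptions.
interface Shot {
  motionComplexity: "low" | "medium" | "high";
  durationSec: number;
  budgetUsd: number;
}

function routeVideoModel(shot: Shot): string {
  // High-fidelity or long shots go to a premium model when budget allows.
  if ((shot.motionComplexity === "high" || shot.durationSec > 10) && shot.budgetUsd >= 1.0) {
    return shot.durationSec > 10 ? "veo-3" : "sora-2";
  }
  // Mid-complexity motion: Kling-class models.
  if (shot.motionComplexity === "medium") return "kling-2.5";
  // Simple motion (or constrained budget) routes to the cheapest capable model.
  return "seedance-video";
}
```

The point of the sketch is the shape of the decision, not the specific cutoffs: each shot is scored independently, so a single render can span the whole model stack.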
Stage 3: Dialogue
Per-character voice. Each character in the cast has a persistent voice in the Director Memory Graph. The same character in shot 47 sounds the same as in shot 2, and the same as in the previous video.
- Model routed: ElevenLabs
- Memory inputs: per-character voice clone, tone, pacing, prior dialogue takes
- Controls: re-run a line, adjust prosody, pick a take, lock voice per character
- Why it lives in Memory: voice consistency across scenes and across stories is what turns clips into a body of work. Lose it and the creator is back in supply-chain mode.
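The persistence guarantee reduces to a simple invariant: clone a character's voice once, then reuse the same voice id for every later line, in any shot or any video. This registry is a minimal sketch; `cloneVoice` stands in for a provider call (e.g. ElevenLabs), and the id format and class shape are assumptions.

```typescript
// Minimal sketch of per-character voice persistence; not CreatorStudio's API.
class VoiceRegistry {
  private voices = new Map<string, string>(); // character name -> voice id
  private cloneCount = 0;

  // Stand-in for a real provider-side voice-clone call.
  private cloneVoice(character: string): string {
    this.cloneCount += 1;
    return `voice-${character}-${this.cloneCount}`;
  }

  // First request clones; every later request returns the stored id.
  voiceFor(character: string): string {
    let id = this.voices.get(character);
    if (id === undefined) {
      id = this.cloneVoice(character);
      this.voices.set(character, id); // persisted to the Memory Graph in practice
    }
    return id;
  }
}
```

In production the map would live in the Director Memory Graph rather than in process memory, which is what makes the voice stable across renders, not just across shots.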
Stage 4: Audio
Score, ambient, and foley. Ra composes the non-dialogue audio bed for each scene.
- Models routed: MINIMAX and supporting audio models
- Memory inputs: preferred mood, score patterns that held audience retention in prior renders
- Controls: re-run the bed, adjust mix, swap ambient layers per scene
- Why it is separate from dialogue: audio beds and dialogue are mixed, not generated together. Keeping them separate lets Ra re-run one without disturbing the other.
Stage 5: Effects
Color grade, motion graphics, transitions. The polish layer.
- Models and systems: internal effects logic plus provider-side transforms
- Memory inputs: prior color grades, brand grade, transition style, overlay templates
- Controls: re-run the grade, adjust transitions per cut, lock motion-graphic templates
- Why it is its own stage: tuning the grade should never force a re-render of motion. Creators iterate here often.
Stage 6: Render
Final compose and multi-platform output. Shots, dialogue, audio, effects, and captions assemble into the master. Subtitle Studio produces per-platform variants in every target language.
- Outputs: master file plus per-platform variants (YouTube long, TikTok or Reels vertical, X or LinkedIn square)
- Memory inputs: per-platform format preferences, past-performing variant styles
- Controls: re-run compose, swap platform variants, regenerate captions, trigger Publishing
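The per-platform outputs listed above can be modeled as a small spec table keyed by aspect ratio, so regenerating one variant never touches the master compose. The `burnedCaptions` field and the platform ids are illustrative assumptions, not a documented schema.

```typescript
// Illustrative variant table; field names and caption choices are assumptions.
interface VariantSpec {
  platform: string;
  aspect: "16:9" | "9:16" | "1:1";
  burnedCaptions: boolean; // whether Subtitle Studio burns captions into the frame
}

const VARIANTS: VariantSpec[] = [
  { platform: "youtube-long", aspect: "16:9", burnedCaptions: false },
  { platform: "tiktok",       aspect: "9:16", burnedCaptions: true },
  { platform: "reels",        aspect: "9:16", burnedCaptions: true },
  { platform: "x",            aspect: "1:1",  burnedCaptions: true },
  { platform: "linkedin",     aspect: "1:1",  burnedCaptions: true },
];

// Swapping or regenerating variants operates on this table, not the master.
function variantsByAspect(aspect: VariantSpec["aspect"]): VariantSpec[] {
  return VARIANTS.filter((v) => v.aspect === aspect);
}
```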
The cost callout
Baseline cost per five-minute faceless-format video using Seedance keyframes, Kling motion, and ElevenLabs dialogue is approximately $0.10. That is a current number, not a permanent one. Per-stage compute costs at the model layer are declining roughly 30 to 50 percent every six months as the frontier commoditizes. CreatorStudio prices the Studio, not the stage, so margin widens as stage costs fall.
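To make the decline concrete, compound the stated 30 to 50 percent six-month drop against the $0.10 baseline. The two-year horizon is illustrative, not a claim from the document.

```typescript
// Pure arithmetic on the figures stated above; the horizon is illustrative.
function projectedCost(baseline: number, declinePerPeriod: number, periods: number): number {
  return baseline * Math.pow(1 - declinePerPeriod, periods);
}

// After two years (four six-month periods) the $0.10 baseline lands between:
const optimistic = projectedCost(0.1, 0.5, 4);   // 50% decline per period -> ~$0.006
const conservative = projectedCost(0.1, 0.3, 4); // 30% decline per period -> ~$0.024
```

Either way the per-render stage cost stays around a couple of cents or less, which is the basis for the "margin widens as stage costs fall" claim.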
Why visible, controllable, re-runnable matters
Every competing tool is a black box. You prompt, you get a clip, you pray. CreatorStudio exposes every stage. You can see the keyframe before it becomes motion. You can re-run one shot without burning the render. You can lock a character’s voice forever in Memory. That is the difference between making a clip and directing a story. For the model selection logic behind each stage, see Model Orchestration. For the broader module layout, see Ecosystem Map.