Video Pipeline
A CreatorStudio render is not one model call. It is six stages, each visible, controllable, and re-runnable. The creator can approve or re-run any stage without re-running the rest. Ra picks the right model per stage, per shot, against the Director Memory Graph. One brief becomes a story, not a clip.
The pipeline at a glance
brief + Director Brief
    |
    v
Keyframe -> Video -> Dialogue -> Audio -> Effects -> Render

| Stage | Produces | Models typically routed |
| --- | --- | --- |
| Keyframe | one image per shot | FLUX 1.1, Seedance (image) |
| Video | motion from keyframe | Kling 2.5, Runway, Pika, Luma, Seedance, Hailuo, Veo 3, Sora 2 |
| Dialogue | per-character voice | ElevenLabs (per-character voice clone, persistent in Memory) |
| Audio | score, ambient, foley | MINIMAX + supporting audio models |
| Effects | color, motion graphics, transitions | internal effects logic + provider-side transforms |
| Render | compose + per-platform variants | Subtitle Studio per-platform output |
All stages read and write the Director Memory Graph. Each stage is its own artifact in R2, recorded against the render in D1, and fed back into the Director Memory Graph on approval. Re-running stage four does not force a re-run of stages one through three.
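The selective re-run behavior can be sketched as a map of per-stage artifacts, where invalidating one stage drops only that stage and everything downstream. The `RenderState` shape, the stage names as strings, and the R2 key format are illustrative assumptions, not CreatorStudio's actual schema.

```typescript
// Illustrative sketch only: stage names, RenderState, and key format are assumptions.
const STAGES = ["keyframe", "video", "dialogue", "audio", "effects", "render"] as const;
type Stage = (typeof STAGES)[number];

interface StageArtifact {
  stage: Stage;
  r2Key: string;     // object key of the stage's artifact in R2
  approved: boolean; // approval is what feeds the Director Memory Graph
}

type RenderState = Map<Stage, StageArtifact>;

// Re-running a stage drops its artifact and all downstream artifacts;
// approved upstream work is untouched.
function rerun(state: RenderState, stage: Stage): RenderState {
  const next = new Map(state);
  for (const s of STAGES.slice(STAGES.indexOf(stage))) next.delete(s);
  return next;
}
```

Re-running `"audio"` (stage four) here leaves the keyframe, video, and dialogue artifacts intact, matching the behavior described above.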
Stage 1: Keyframe
Per-shot image generation. Ra takes the Director Brief (shot list, pacing, characters) and generates a keyframe image for every shot, consistent with the creator’s visual fingerprint from the Graph.
- Models typically routed: FLUX 1.1, Seedance (image)
- Memory inputs: visual fingerprint, color palette, shot types, character visual references
- Controls: re-run a single shot, edit the prompt, lock a character’s face, pick from variants
- Why it exists separately: catching an off-brand frame here costs pennies. Catching it after motion generation costs dollars.
Stage 2: Video
Keyframe-to-video motion. Each approved keyframe becomes a clip. Ra routes per shot based on motion complexity, shot length, and budget.
- Models typically routed: Kling 2.5, Seedance (video), Runway, Pika, Luma, Hailuo, with Veo 3 or Sora 2 for high-fidelity or longer shots
- Memory inputs: cut cadence, motion style preferences, pacing from prior renders
- Controls: re-run any clip, swap models, adjust duration, re-prompt motion
- Why the stack is deep: no single video model wins every shot. Kling is strong on one class of motion, Seedance on another, Runway on another. Ra picks.
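Per-shot routing on motion complexity, shot length, and budget might look like the following sketch. The thresholds, the model preference order, and the budget cutoff are invented for illustration; Ra's actual scoring is not described in this document.

```typescript
// Illustrative routing sketch: thresholds and preference order are assumptions.
interface Shot {
  motionComplexity: "low" | "medium" | "high";
  durationSec: number;
  budgetUsd: number;
}

function routeVideoModel(shot: Shot): string {
  // High-fidelity or long shots go to a premium model when budget allows.
  if ((shot.motionComplexity === "high" || shot.durationSec > 10) && shot.budgetUsd >= 1.0) {
    return shot.durationSec > 10 ? "veo-3" : "sora-2";
  }
  // Mid-complexity motion: Kling-class models.
  if (shot.motionComplexity === "medium") return "kling-2.5";
  // Simple motion (or constrained budget) routes to the cheapest capable model.
  return "seedance-video";
}
```

The point of the sketch is the shape of the decision, not the specific cutoffs: each shot is scored independently, so a single render can span the whole model stack.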
Stage 3: Dialogue
Per-character voice. Each character in the cast has a persistent voice in the Director Memory Graph. The same character in shot 47 sounds the same as in shot 2, and the same as in the previous video.
- Model routed: ElevenLabs
- Memory inputs: per-character voice clone, tone, pacing, prior dialogue takes
- Controls: re-run a line, adjust prosody, pick a take, lock voice per character
- Why it lives in Memory: voice consistency across scenes and across stories is what turns clips into a body of work. Lose it and the creator is back in supply-chain mode.
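The persistence guarantee reduces to a simple invariant: clone a character's voice once, then reuse the same voice id for every later line, in any shot or any video. This registry is a minimal sketch; `cloneVoice` stands in for a provider call (e.g. ElevenLabs), and the id format and class shape are assumptions.

```typescript
// Minimal sketch of per-character voice persistence; not CreatorStudio's API.
class VoiceRegistry {
  private voices = new Map<string, string>(); // character name -> voice id
  private cloneCount = 0;

  // Stand-in for a real provider-side voice-clone call.
  private cloneVoice(character: string): string {
    this.cloneCount += 1;
    return `voice-${character}-${this.cloneCount}`;
  }

  // First request clones; every later request returns the stored id.
  voiceFor(character: string): string {
    let id = this.voices.get(character);
    if (id === undefined) {
      id = this.cloneVoice(character);
      this.voices.set(character, id); // persisted to the Memory Graph in practice
    }
    return id;
  }
}
```

In production the map would live in the Director Memory Graph rather than in process memory, which is what makes the voice stable across renders, not just across shots.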
Stage 4: Audio
Score, ambient, and foley. Ra composes the non-dialogue audio bed for each scene.
- Models routed: MINIMAX and supporting audio models
- Memory inputs: preferred mood, score patterns that held audience retention in prior renders
- Controls: re-run the bed, adjust mix, swap ambient layers per scene
- Why it is separate from dialogue: audio beds and dialogue are mixed, not generated together. Keeping them separate lets Ra re-run one without disturbing the other.
Stage 5: Effects
Color grade, motion graphics, transitions. The polish layer.
- Models and systems: internal effects logic plus provider-side transforms
- Memory inputs: prior color grades, brand grade, transition style, overlay templates
- Controls: re-run the grade, adjust transitions per cut, lock motion-graphic templates
- Why it is its own stage: tuning the grade should never force a re-render of motion. Creators iterate here often.
Stage 6: Render
Final compose and multi-platform output. Shots, dialogue, audio, effects, and captions assemble into the master. Subtitle Studio produces per-platform variants in every target language.
- Outputs: master file plus per-platform variants (YouTube long, TikTok or Reels vertical, X or LinkedIn square)
- Memory inputs: per-platform format preferences, past-performing variant styles
- Controls: re-run compose, swap platform variants, regenerate captions, trigger Publishing
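The per-platform outputs listed above can be modeled as a small spec table keyed by aspect ratio, so regenerating one variant never touches the master compose. The `burnedCaptions` field and the platform ids are illustrative assumptions, not a documented schema.

```typescript
// Illustrative variant table; field names and caption choices are assumptions.
interface VariantSpec {
  platform: string;
  aspect: "16:9" | "9:16" | "1:1";
  burnedCaptions: boolean; // whether Subtitle Studio burns captions into the frame
}

const VARIANTS: VariantSpec[] = [
  { platform: "youtube-long", aspect: "16:9", burnedCaptions: false },
  { platform: "tiktok",       aspect: "9:16", burnedCaptions: true },
  { platform: "reels",        aspect: "9:16", burnedCaptions: true },
  { platform: "x",            aspect: "1:1",  burnedCaptions: true },
  { platform: "linkedin",     aspect: "1:1",  burnedCaptions: true },
];

// Swapping or regenerating variants operates on this table, not the master.
function variantsByAspect(aspect: VariantSpec["aspect"]): VariantSpec[] {
  return VARIANTS.filter((v) => v.aspect === aspect);
}
```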
The cost callout
Baseline cost per five-minute faceless-format video using Seedance keyframes, Kling motion, and ElevenLabs dialogue is approximately $0.10. That is a current number, not a permanent one. Per-stage compute costs at the model layer are declining roughly 30 to 50 percent every six months as the frontier commoditizes. CreatorStudio prices the Studio, not the stage, so margin widens as stage costs fall.
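To make the decline concrete, compound the stated 30 to 50 percent six-month drop against the $0.10 baseline. The two-year horizon is illustrative, not a claim from the document.

```typescript
// Pure arithmetic on the figures stated above; the horizon is illustrative.
function projectedCost(baseline: number, declinePerPeriod: number, periods: number): number {
  return baseline * Math.pow(1 - declinePerPeriod, periods);
}

// After two years (four six-month periods) the $0.10 baseline lands between:
const optimistic = projectedCost(0.1, 0.5, 4);   // 50% decline per period -> ~$0.006
const conservative = projectedCost(0.1, 0.3, 4); // 30% decline per period -> ~$0.024
```

Either way the per-render stage cost stays around a couple of cents or less, which is the basis for the "margin widens as stage costs fall" claim.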
Why visible, controllable, re-runnable matters
Every competing tool is a black box. You prompt, you get a clip, you pray. CreatorStudio exposes every stage. You can see the keyframe before it becomes motion. You can re-run one shot without burning the render. You can lock a character’s voice forever in Memory. That is the difference between making a clip and directing a story. For the model selection logic behind each stage, see Model Orchestration. For the broader module layout, see Ecosystem Map.