
Long-Form Video Generation Models with Reference Image Support

Generating a 10-second clip is easy now. Every major model does it. The real question is: can you generate 5 or 10 minutes of coherent video where a character looks the same in minute one and minute eight? Where the scene holds together across thousands of frames?

That's the hard problem. And it's where things are shifting fast. This guide covers every model we've found that either generates long video natively or supports the workflows needed to build long-form content with consistent characters through reference images. We split them into three tiers: models that generate minutes-long video directly, models with strong reference image support that you extend through continuation, and open-source options you can run yourself.

Tier 1: Native Long-Form Generation (Minutes+)

These models generate video measured in minutes, not seconds. They're built from the ground up for temporal consistency over long sequences.

LongCat Video

LongCat Video generates minutes-long coherent video from a single prompt, with no color drift or temporal inconsistency across the full duration.

Meituan released LongCat Video in late 2025. It's a 13.6-billion-parameter diffusion transformer, and the first model that can reliably generate coherent video up to 15 minutes long.

The model supports text-to-video, image-to-video, and video continuation in a unified pipeline. In I2V mode, the input image becomes the literal first frame of the video. It's not a loose character reference you can place in any scene. The model animates forward from that starting frame while using "Cross-Chunk Latent Stitching" to keep referencing the original image throughout generation, preventing color drift and maintaining visual consistency over long sequences. An updated 2026 variant adds audio-driven avatar generation with lip-sync for 5+ minute talking head videos.

Under the hood, LongCat uses a coarse-to-fine generation approach with Block Sparse Attention to handle the massive sequence lengths. RLHF tuning improves motion quality. It currently ranks third globally behind Google Veo 3 and ShanghaiAI on video quality benchmarks.

Availability: Open source under MIT license. Available through fal.ai API at $0.04 per generated second ($36 for a 15-minute video at 720p). Also available through LongCat's own platform with credit-based pricing.
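At that flat per-second rate, budgeting a generation is simple arithmetic. Here's a minimal sketch of building a LongCat request and estimating its cost; the argument names (`prompt`, `duration`, `image_url`) are assumptions based on this guide, not confirmed fal.ai schema, and actual submission would go through a client such as `fal_client`:

```python
# Sketch: a LongCat Video request plus cost estimate. Field names are
# hypothetical; the $0.04/sec rate is the fal.ai price quoted above.
RATE_PER_SECOND = 0.04

def longcat_request(prompt, duration_s, image_url=None):
    """Build the argument dict for a LongCat generation (T2V or I2V)."""
    args = {"prompt": prompt, "duration": duration_s, "resolution": "720p"}
    if image_url:
        # In I2V mode the image becomes the literal first frame of the video.
        args["image_url"] = image_url
    return args

def estimated_cost(duration_s):
    return round(duration_s * RATE_PER_SECOND, 2)

# A 15-minute generation: 900 seconds * $0.04 = $36, matching the figure above.
args = longcat_request("a slow dolly through a rain-soaked market", 900)
print(estimated_cost(900))  # 36.0
```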

Spec | Value
Max duration | ~15 minutes
Resolution | 720p at 30fps
Parameters | 13.6B
Reference images | First-frame only (I2V mode, not character reference)
License | MIT
API cost | ~$0.04/second (fal.ai)

Seaweed APT2

Seaweed APT2 generates video autoregressively at 24fps with interactive camera and pose control, closer to a game engine than a render queue.

ByteDance's Seaweed APT2 takes a different approach. Instead of generating a complete video upfront, it produces frames autoregressively at 24fps with just 0.16 seconds of latency per frame on a single H100. The result is stable video up to 5 minutes with temporal consistency that holds.

The technical trick is Autoregressive Adversarial Post-Training (AAPT), which converts a pretrained bidirectional video diffusion model into a unidirectional autoregressive generator. Single network forward evaluation per frame. That's what makes real-time generation possible.

What makes this model interesting beyond raw length is interactivity. You can control the camera, animate characters through pose detection, and manipulate scenes while the video renders. Think of it less as "generate a video" and more as "steer a video in real time."

Availability: Research phase only. Not publicly available yet. The 7B base model (Seaweed-7B) has a published paper but the APT2 weights haven't been released.

Spec | Value
Max duration | ~5 minutes
Resolution | 736x416 (single GPU), up to 720p (8 GPUs)
Parameters | 8B
Reference images | Via I2V and interactive pose control
License | Not released
Status | Research preview

Helios

Helios runs at 19.5 FPS on a single H100, generating minute-scale video while simulating and correcting for temporal drift during training.

Helios comes from Peking University, built on top of Wan 2.1. It's a 14B parameter model that generates minute-scale video at 19.5 FPS on a single H100. The key innovation is how it handles long-video drifting. Instead of using conventional anti-drifting techniques like self-forcing or keyframe sampling, Helios simulates drifting during training so the model learns to correct for it.

It natively supports text-to-video, image-to-video, and video-to-video tasks. The I2V mode accepts reference images to seed the generation.

Availability: Fully open source under Apache 2.0. Released March 2026. Code and weights on GitHub (PKU-YuanGroup/Helios). Integrated into Diffusers, SGLang, and vLLM-Omni. Gradio demo on HuggingFace Spaces.

Spec | Value
Max duration | Minute-scale (no fixed cap)
Resolution | 720p
Parameters | 14B
Reference images | Yes (I2V mode)
License | Apache 2.0
Hardware | Single H100 for real-time

SkyReels V2 / V3

SkyReels V3 accepts 1-4 reference images and generates unlimited-length video with multi-shot switching and audio-guided avatar synthesis.

Skywork's SkyReels line aims for infinite-length video. V2 uses an AutoRegressive Diffusion-Forcing architecture that generates video without a fixed duration cap. V3, released January 2026, unifies reference image-to-video, video-to-video extension, and audio-guided avatar generation in a single model.

V3 accepts 1 to 4 reference images and preserves subject identity across the generated video. The video-to-video mode enables seamless single-shot continuation and multi-shot switching with cinematographic transitions.

Availability: Fully open source. Models from 1.3B to 14B parameters. Available at 540p and 720p. Code and weights on GitHub and HuggingFace.

Spec | Value
Max duration | Unlimited (autoregressive)
Resolution | 540p, 720p
Parameters | 1.3B, 5B, 14B
Reference images | 1-4 images (V3)
License | Open source
Hardware | Minimum RTX 4090, recommended 4-8x A100

Tier 2: Short Clips with Strong Reference + Extension

These models generate 8-60 second clips but offer strong reference image support and video extension features. For long-form content, you chain clips together using the model's continuation or extension endpoints. Character consistency comes from reference images that persist across generations.

This is the practical workflow most creators use today for content longer than a minute. The per-clip quality is often higher than what the native long-form models produce.

Kling 3.0 Omni (Kuaishou)

Kling 3.0 Omni combines character elements, style references, and multi-shot storyboarding in a single call with native 4K 60fps output.

Kling has the most complete reference image system of any video model. It separates reference inputs into three distinct categories, each serving a different purpose:

Reference Images (image_urls): Up to 4 images for style and appearance guidance. You tag them in your prompt as @Image1, @Image2, etc. These influence the overall look, scene style, and environment without being the first frame.

Elements (elements): Dedicated character/object inputs. Each element takes a frontal_image_url (clear front-facing photo) plus optional reference_image_urls (additional angles). You reference them as @Element1, @Element2 in your prompt. The model extracts the character's identity and places them in any scene you describe. This is the key feature for adventure-movie-style content: upload a character photo, then describe them walking through a forest, fighting a dragon, whatever you want.

Start/End Frames (start_image_url, end_image_url): Pin specific images as the first or last frame. These are literal frames, not style guides.

The total across all three categories is up to 7 reference inputs (drops to 4 when also using a reference video). A single prompt like "@Element1 and @Element2 are having dinner at this table on @Image1" can combine characters with scene references.
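The cap logic is worth encoding before you build requests, since exceeding it fails the call. Here's a minimal sketch of assembling a Kling payload under those rules. The field names (`image_urls`, `elements`, `frontal_image_url`, `reference_image_urls`) mirror the parameter names described above, but treat them, and the way each URL counts toward the cap, as assumptions rather than verbatim API schema:

```python
# Hypothetical Kling 3.0 Omni payload builder enforcing the documented
# reference cap: 7 total inputs, dropping to 4 when a video ref is used.
def build_kling_payload(prompt, image_urls=(), elements=(), video_url=None):
    # Count every image URL as one reference input (an interpretation of
    # "up to 7 reference inputs" -- each element contributes its frontal
    # photo plus any extra angles).
    total = len(image_urls) + sum(
        1 + len(e.get("reference_image_urls", [])) for e in elements
    )
    cap = 4 if video_url else 7
    if total > cap:
        raise ValueError(f"{total} reference inputs exceeds the cap of {cap}")
    payload = {"prompt": prompt, "image_urls": list(image_urls),
               "elements": list(elements)}
    if video_url:
        payload["video_url"] = video_url
    return payload

payload = build_kling_payload(
    "@Element1 and @Element2 are having dinner at this table on @Image1",
    image_urls=["https://example.com/table-scene.png"],
    elements=[
        {"frontal_image_url": "https://example.com/hero-front.png",
         "reference_image_urls": ["https://example.com/hero-side.png"]},
        {"frontal_image_url": "https://example.com/villain-front.png"},
    ],
)
```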

For long-form content, Kling offers two paths. Multi-shot mode generates up to 6 scenes in a single call, each with its own prompt and duration (3-15s each). Character elements persist across all shots automatically. The extend API continues from where a completed video left off, reaching roughly 3 minutes through chained extensions. The V2V editing mode takes existing video (3-10 seconds) and transforms it using element references and a text prompt, preserving camera motion and character staging from the source while restyling characters and environments based on your references. This makes Kling particularly useful for enhancing pre-existing footage, including low-fidelity 3D renders.

Kling 3.0 Omni unifies text-to-video, image-to-video, reference-to-video, and video editing in a single model with native audio generation and lip-sync.

Availability: Commercial API through Kuaishou, fal.ai ($0.084-0.112/sec), and Replicate. Web interface at klingai.com.

Spec | Value
Native clip length | 3-15 seconds
Extended length | ~3 minutes (via chained extensions)
Resolution | 720p (standard), 1080p (pro)
Reference images | Up to 4 (@Image style refs)
Elements | Up to 4 (@Element character refs with frontal + angles)
Total references | Up to 7 combined (4 with video ref)
Multi-shot | Yes (up to 6 shots in storyboard)
Audio | Native synchronized audio + lip-sync
Video editing | Yes (text-guided editing of existing video)
API | Kuaishou, fal.ai, Replicate

Grok Imagine (xAI)

Grok Imagine separates reference mode from first-frame mode, letting you tag up to 7 images as character or object references in your prompt.

xAI launched Grok Imagine's Reference-to-Video mode in early 2026 with support for 1-7 reference images. The documentation explicitly distinguishes this from image-to-video: "Unlike image-to-video where the source image becomes the starting frame, reference images influence what appears in the video without locking in the first frame."

You tag images in your prompt as <IMAGE_1>, <IMAGE_2>, etc. A prompt like "the model from <IMAGE_1> walks onto the runway wearing the shirt from <IMAGE_2>" combines a person reference with a clothing reference. The model handles virtual try-on, product placement, and character-consistent storytelling across scenes.

One constraint: you can't combine reference images with image-to-video in the same request. It's either first-frame mode or reference mode, not both.

Grok Imagine also has a video extension endpoint that adds new footage to the end of an existing video. The duration parameter controls only the new portion. You can chain extensions to build longer content.
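The chaining loop itself is simple. This sketch shows the control flow with stub functions standing in for the real generate and extend calls, which this guide doesn't specify in detail; note that the duration on each extend call covers only the newly added portion:

```python
# Sketch: chaining a video extension endpoint to reach a target length.
# `generate` and `extend` are placeholders for the real xAI API calls.
def build_long_video(generate, extend, prompt, clip_s=15, target_s=60):
    """Generate a seed clip, then extend until total duration hits target."""
    video, total = generate(prompt, duration=clip_s), clip_s
    while total < target_s:
        step = min(clip_s, target_s - total)  # only the NEW footage
        video = extend(video, duration=step)
        total += step
    return video, total

# Stub backends so the control flow can be exercised without the API:
fake_generate = lambda prompt, duration: [duration]
fake_extend = lambda video, duration: video + [duration]
video, total = build_long_video(fake_generate, fake_extend, "a chase scene")
print(total, len(video))  # 60 4  (one seed clip + three extensions)
```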

Availability: xAI API (launched January 2026), fal.ai, and Replicate. Python SDK, JavaScript/AI SDK, and REST API. $0.05/sec at 720p with audio. Also available to X Premium subscribers.

Spec | Value
Native clip length | 1-15 seconds
Extended length | Chain-able via extension API
Resolution | 480p, 720p
Reference images | 1-7 (true reference, not first-frame)
Prompt tags | <IMAGE_1>, <IMAGE_2>, etc.
Audio | Yes (720p)
Video editing | Yes (text-guided)
API | xAI API, fal.ai, Replicate
API cost | $0.05/second (720p with audio)

Seedance 2.0 (ByteDance)

Seedance 2.0 accepts up to 12 multimodal inputs simultaneously and generates video with native audio sync and phoneme-level lip-sync in 8+ languages.

ByteDance's Seedance 2.0 accepts the most reference inputs of any model: up to 12 files simultaneously, including up to 9 images, 3 videos, and 3 audio files. The model supports native audio-video generation with phoneme-level lip-sync in 8+ languages.

Individual images can be up to 30MB each. Reference videos must be 2-15 seconds. The model uses the references for character appearance, scene styling, and motion guidance.

Availability: ByteDance official API (via Volcengine, launched February 2026) and third-party API providers. Output at 480p-720p via API, up to 2K cinema resolution through the platform.

Spec | Value
Native clip length | 4-15 seconds
Resolution | Up to 2K (cinema)
Reference images | Up to 9 images + 3 videos + 3 audio (12 total)
Audio | Native with lip-sync (8+ languages)
API | ByteDance/Volcengine, third-party providers

Runway Gen-4.5

Runway Gen-4.5 leads the Artificial Analysis leaderboard at 1,247 ELO, with 3D geometric understanding from neural radiance fields and Gaussian splatting.

Runway Gen-4.5 ranks #1 on the Artificial Analysis Text-to-Video leaderboard with 1,247 ELO, beating Veo 3 and Sora 2 Pro. The model generates 2-10 second clips for text-to-video and supports character-consistent long-form video up to one minute through multi-shot sequencing.

Image-to-video was added in January 2026 and supports reference images for all aspect ratios. The model integrates neural radiance fields and Gaussian splatting within the diffusion architecture, giving it 3D geometric understanding rather than pixel-level prediction alone. This means better object permanence and physically plausible motion.

Availability: Commercial API and web interface. SDKs for Node and Python. Also available on Replicate.

Spec | Value
Native clip length | 2-10 seconds
Long-form mode | Up to ~1 minute
Resolution | Up to 1080p
Reference images | 0-1 per generation
Audio | Native audio generation
Multi-shot | Yes
API | Yes (Runway, Replicate)

Google Veo 3.1

Veo 3.1's "Ingredients to Video" mode accepts up to 3 reference images for characters, backgrounds, and textures with native audio and 4K upscaling.

Google's Veo 3.1 generates 4, 6, or 8 second clips natively. The "Extend Video" feature (currently in preview) chains clips to reach approximately 1-2.5 minutes, though coherence can drift on longer sequences.

The "Ingredients to Video" feature accepts up to 3 reference images as input. You can provide characters to animate, backgrounds, and material textures. When you use reference images, the model sticks closer to your visual references and makes fewer random alterations. One limitation: reference image mode only works with the 8-second duration option.

As of January 2026, Veo 3.1 added vertical video (9:16) for reference-based generation and 4K upscaling on Vertex AI.

Availability: Google Vertex AI API, Gemini API, and Google Flow. Requires Google Cloud account.

Spec | Value
Native clip length | 4, 6, or 8 seconds
Extended length | ~1-2.5 minutes
Resolution | Up to 4K (with upscaling)
Reference images | Up to 3 ("Ingredients to Video")
Audio | Synchronized dialogue and music
API | Vertex AI, Gemini API

OpenAI Sora 2 / Sora 2 Pro

Sora 2 Pro creates persistent character IDs from video clips, reusable across unlimited generations with no identity drift over time.

Sora 2 Pro generates clips up to 20 seconds. The Characters API uses a different approach from Kling or Grok: instead of uploading static images, you create a character_id by pointing the API at a video clip (with a 1-3 second timestamp range). Sora analyzes the video frames to extract facial structure, body proportions, clothing style, and other identifying features. That character_id persists indefinitely and can be reused across unlimited future generations.

You can reference up to 2 uploaded characters per generation. As of March 2026, character references work for objects and animals too, not just people. Video extension uses the full initial clip as context for continuation.

The character system requires video input (not static images) to create characters. If you only have photos, you'd need to generate a short video first, then extract the character from that.
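That photo-to-character bootstrap is worth spelling out as a workflow. The sketch below captures the flow described above with placeholder functions (`create_video`, `create_character`, `generate` are not actual OpenAI SDK methods; the 1-3 second timestamp range is from the description above):

```python
# Hypothetical workflow sketch for a persistent character_id system.
def character_from_photo(create_video, create_character, photo_url):
    """Characters can't be created from stills, so bootstrap one:
    generate a short clip of the subject, then extract a character_id
    from a 1-3 second timestamp range of that clip."""
    clip = create_video(prompt="slow turntable of the subject",
                        image_url=photo_url)
    return create_character(video_id=clip["id"], start_s=0.0, end_s=2.0)

def storyboard(generate, character_id, scene_prompts):
    """Reuse one persistent character_id across many generations; identity
    is stored as an embedding, so it doesn't drift clip to clip."""
    return [generate(prompt=p, character_ids=[character_id])
            for p in scene_prompts]

# Stubs to exercise the flow without the API:
cid = character_from_photo(
    lambda **kw: {"id": "vid_1"},
    lambda **kw: "char_abc",
    "https://example.com/hero.png",
)
clips = storyboard(lambda **kw: kw["prompt"], cid,
                   ["forest", "castle", "storm"])
print(cid, len(clips))  # char_abc 3
```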

Availability: OpenAI API with Batch API support for production workflows.

Spec | Value
Native clip length | Up to 20 seconds
Resolution | Up to 1920x1080
Character references | Up to 2 per generation (persistent character_id)
Character input | Video clip (1-3s timestamp range), not static images
Audio | Synchronized
Extension | Yes (full clip as context)
API | OpenAI API + Batch API

MiniMax Hailuo 02

Hailuo 02 generates native 1080p video with best-in-class physics simulation, handling extreme motion like gymnastics without breaking apart.

Hailuo 02 ranks #2 globally on the Artificial Analysis benchmark, beating Veo 3. It generates 10-second clips at native 1080p with some of the best physics simulation in the field. The model handles extreme motion like gymnastics and acrobatics without breaking apart.

It supports image-to-video generation with strong character consistency through facial recognition and body tracking. The Noise-aware Compute Redistribution architecture dynamically allocates compute based on scene complexity.

Availability: Commercial API. Available through MiniMax platform, fal.ai, and Replicate. $0.28 per video.

Spec | Value
Native clip length | Up to 10 seconds
Resolution | 1080p native
Reference images | Yes (I2V mode)
Audio | Not native
Physics | Best-in-class simulation
API | MiniMax, fal.ai, Replicate

Luma Ray2

Ray2 animates reference images into 5-10 second clips with photorealistic quality, trained on 10x the compute of its predecessor.

Ray2 generates 5-10 second clips at up to 1080p with 4K upscaling available. The Extend feature continues videos up to 30 seconds total. Image-to-video accepts reference images as start or end keyframes.

The model is trained on a multi-modal architecture with 10x the compute of Ray1. It handles photorealistic content well but the 30-second extension cap limits long-form use.

Availability: Luma API and web interface.

Spec | Value
Native clip length | 5-10 seconds
Extended length | Up to 30 seconds
Resolution | Up to 4K (with upscaling)
Reference images | Yes (start/end keyframes)
API | Luma API

Pika 2.5

Pikaframes generates smooth transitions between 2-5 keyframe images, producing up to 25 seconds of coherent video from reference stills.

Pika takes a keyframe-based approach with Pikaframes. Upload 2-5 keyframes (reference images at key moments) and the model generates smooth transitions between them. Total duration reaches 20-25 seconds.

Pikascenes accepts up to 10 reference images and combines them into a single video. The model uses image recognition to figure out each reference's role (character, background, prop) automatically.

Availability: Pika web platform and API. Subscription plans from free to Pro.

Spec | Value
Native clip length | 5-10 seconds
Pikaframes length | 20-25 seconds
Resolution | Up to 1080p
Reference images | Up to 10 (Pikascenes), 2-5 keyframes (Pikaframes)
API | Yes

Tier 3: Open-Source Models for Self-Hosted Workflows

These models generate shorter clips but they're fully open. You can run them on your own hardware, fine-tune them, and build custom extension pipelines without API dependencies.

Wan 2.1 (Alibaba)

Wan 2.1 provides the foundation several other models build on, with I2V, First-Last-Frame, and video editing modes across 1.3B to 14B parameter variants.

Wan 2.1 is the foundation several other models build on (including Helios). The Wan-VAE architecture encodes and decodes 1080p video of any length while preserving temporal information. The model comes in I2V variants at 480p and 720p, plus a First-Last-Frame-to-Video model that generates video between two reference images.

Wan-Edit allows style and content transfer using reference images while maintaining specific structures or character poses.

Spec | Value
Parameters | 1.3B, 5B, 14B
I2V modes | I2V-480P, I2V-720P, FLF2V-720P
License | Apache 2.0
Hardware | 8GB+ VRAM (smaller variants)
Platforms | Diffusers, ComfyUI

HunyuanVideo (Tencent)

HunyuanVideo's 13B parameter model was the open-source leader through most of 2025, with variants for I2V, avatars, and customized generation.

Tencent's 13B parameter model was the open-source video generation leader through most of 2025. HunyuanVideo-I2V uses a token replace technique with a pre-trained MLLM to incorporate reference image information. HunyuanVideo-1.5, released November 2025, improved efficiency. HunyuanCustom enables multimodal-driven customized video generation.

Spec | Value
Parameters | 13B
I2V | Yes (token replace technique)
License | Open source
Hardware | 60GB+ VRAM (720p)
Variants | Base, I2V, 1.5, Avatar, Custom

CogVideoX (Tsinghua/Zhipu AI)

CogVideoX runs on a 12GB GPU, generating 6-10 second clips at 720x480 with text-to-video, image-to-video, and video-to-video modes.

CogVideoX uses a 3D causal VAE that reduces sequence length and prevents flickering. The adaptive LayerNorm transformer improves text-video alignment. Available in 2B (Apache 2.0) and 5B (research license) variants with native Diffusers integration.

Clips are 6-10 seconds at 720x480. Short, but the quality-to-compute ratio is good and it runs on a 12GB GPU.

Spec | Value
Parameters | 2B, 5B
I2V | Yes (CogVideoXImageToVideoPipeline)
Resolution | 720x480 at 8fps
License | Apache 2.0 (2B), Research (5B)
Hardware | 12GB VRAM

First-Frame vs. True Reference: The Key Distinction

Not all "reference image" support is the same. Understanding the difference is critical for choosing the right model.

First-frame models (LongCat, Helios, Hailuo, Luma Ray2, HunyuanVideo) treat your image as the literal opening frame. The model animates forward from that exact visual. You can't upload a character headshot and describe them in a different scene. The image is the scene.

True reference models (Kling, Grok Imagine, Seedance, SkyReels V3) extract identity from your image and place that character/object into any scene you describe. Upload a photo of a person, then prompt "that person walks through a forest at sunset." The character appears in a completely new environment while maintaining their identity. This is what you need for multi-scene narrative content like an adventure movie.

Character ID models (Sora 2 Pro) extract identity from video clips rather than static images. You create a persistent character ID once and reuse it across unlimited future generations.

Style/ingredient models (Veo 3.1) use reference images to influence visual style, textures, and overall look rather than extracting specific character identities. Good for maintaining visual consistency across a project, less precise for individual character control.

The Real Workflow for 10-Minute Videos

Here's the honest take on where things stand in March 2026. No single model reliably generates 10 minutes of consistent, high-quality video in one shot. LongCat Video gets closest with claims of 15 minutes, but quality and coherence vary significantly at those lengths. Helios and SkyReels V2 generate "minute-scale" and "infinite-length" video respectively, but the outputs need careful prompting and often multiple attempts.

The workflow that actually works for most creators building 5-15 minute videos combines multiple approaches:

For talking head / avatar content: LongCat Video's 2026 audio-driven mode or SkyReels V3's avatar generation can produce 5+ minutes of a consistent talking character. This is the closest thing to "press a button, get long video."

For narrative content with multiple scenes (adventure movie style): Use Kling 3.0, Grok Imagine, or Seedance 2.0 with true character reference images. Generate individual shots of 10-15 seconds each. Use the same @Element or <IMAGE> references across every generation to maintain character identity. Chain shots together using multi-shot mode (Kling supports 6 shots per call) or the extend API. Kling is the most battle-tested for this workflow. Grok Imagine's explicit separation between "reference mode" and "first-frame mode" makes it a strong alternative. Seedance 2.0 accepts the most reference inputs (12 files) but is newer and less proven.

For character consistency across many clips: Sora 2 Pro's persistent character_id system is the cleanest approach for very long projects. Extract the character once from a short video, then generate dozens of clips referencing that ID. The character identity doesn't degrade over time because it's stored as a persistent embedding, not re-interpreted from an image each time.

For style-transferred content: Lucy Restyle on fal.ai processes existing video up to 30 minutes, applying AI style transformations while preserving motion. If you have source footage, this sidesteps the generation length problem entirely. $0.01 per second of source video.

For open-source pipelines: Build on Wan 2.1 or Helios with a video continuation loop. Generate a clip, use the last frame as the start frame for the next clip, repeat. ComfyUI workflows automate this. Consistency degrades over many iterations but it's free and controllable.
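The continuation loop above can be sketched in a few lines. The stub below stands in for the actual I2V model call (in practice a diffusers pipeline such as Wan 2.1's image-to-video variant); the part shown is the last-frame chaining logic itself:

```python
# Sketch of the last-frame continuation loop: each clip's final frame
# becomes the next clip's start frame. `i2v_generate` is a placeholder
# for the real model invocation.
def continuation_loop(i2v_generate, first_frame, prompt, n_clips):
    """Chain clips by re-seeding from the last frame. Drift compounds per
    iteration, so keep n_clips low and re-anchor with a clean reference
    frame when quality degrades."""
    clips, frame = [], first_frame
    for _ in range(n_clips):
        clip = i2v_generate(image=frame, prompt=prompt)
        clips.append(clip)
        frame = clip[-1]  # last frame seeds the next generation
    return clips

# Stub model: a "clip" is a list of frame labels derived from the seed.
stub = lambda image, prompt: [f"{image}+f{j}" for j in range(3)]
clips = continuation_loop(stub, "seed", "a knight rides north", n_clips=3)
print(len(clips))
```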

The core challenge remains: even with true reference image support, character drift compounds across dozens of clips. Facial features, hair, clothing, and skin tone gradually shift. The workarounds (high-quality reference photos, consistent prompting, shot batching) are necessary. But models like Kling and Grok Imagine that separate character identity from scene composition make this dramatically easier than the first-frame-only models.

The 3D Scaffold Approach: Render Low, Transform High

There's a workflow gaining traction that sidesteps most long-form generation problems entirely. Instead of asking an AI model to generate 10 minutes of video from scratch, you render a low-fidelity 3D cutscene with correct camera work, character blocking, and timing, then run it through a video-to-video model with reference images and an enhancement prompt. The 3D engine handles structure. The AI handles aesthetics.

This works because V2V transformation is a narrower problem than full generation. The model doesn't need to invent camera motion, character placement, or scene composition. It just needs to make existing footage look photorealistic while following your visual references. That's far more tractable, and it scales to any length your 3D engine can render.

Why This Works

Your 3D engine gives you everything AI video models still struggle with: precise camera control, exact character placement across the frame, correct physics interactions, and consistent timing across minutes of footage. Dolly zooms, tracking shots, characters entering and exiting frame on cue, all trivial in a 3D engine, all unreliable in text-prompted generation. The V2V model's only job is transforming materials, lighting, and textures into photorealistic output while preserving the geometry and motion you already defined.

Character consistency gets easier too. Instead of fighting identity drift across 50 separate AI generations, you're showing the model the same 3D character in every frame. The reference images tell the model what that character should look like in the final output. That's a simpler problem than generating a consistent character from scratch each time.

And length is no longer a constraint. Lucy Restyle handles 30 minutes in a single call. Wan 2.1 in ComfyUI can process any length in chunks. You're not fighting the "how do I generate 10 minutes" problem at all because the footage already exists.

RealMaster (Meta / Tel Aviv University)

RealMaster is a research system built specifically for this workflow. Published March 2026 by Meta Reality Labs and Tel Aviv University, it transforms rendered 3D video into photorealistic video while maintaining full geometric alignment with the source.

The method extracts edge maps from the 3D render to preserve structure and motion, then applies a video diffusion model (built on the VACE/Wan architecture) to transform everything else into photorealistic output. A lightweight IC-LoRA adapter distills the pipeline into a single inference pass that doesn't need anchor frames and handles objects appearing mid-sequence.

Tested on GTA-V and CARLA simulator sequences, RealMaster significantly outperforms general-purpose video editing baselines. It can also layer weather effects through the text prompt ("Make it rain", "Make it snow") on top of the realism transformation. The model generalizes across simulators without retraining. Weights trained on GTA-V data work on CARLA output with no additional tuning.

Availability: Research only. No public weights or API yet.

Spec | Value
Input | Rendered 3D video (any engine)
Output | Photorealistic video preserving geometry and motion
Architecture | IC-LoRA on VACE/Wan video diffusion backbone
Conditioning | Edge maps from source render
Tested on | GTA-V, CARLA simulator
License | Not released (research paper only)

Production V2V Tools Available Today

Kling 3.0 V2V with Element References is the most complete production option. The Edit Video and Reference V2V endpoints on fal.ai accept 3-10 second source video clips alongside element references (@Element1, @Element2 with frontal and multi-angle photos) and an enhancement prompt. The model analyzes motion trajectories and camera patterns in the source, then regenerates the footage with your specified character appearances and visual style while preserving the original staging and camera work. Up to 7 reference inputs. Output at 1080p. Process your cutscene in 10-15 second chunks with the same element references across all chunks to maintain character consistency.

Lucy Restyle 2 handles up to 30 minutes of source video in a single API call at $0.01 per second of input. It accepts a text prompt and an optional reference image for style guidance. No per-character element references like Kling, but for overall cinematic style transfer of a full-length 3D render it's the simplest and cheapest path. Feed it your complete render and a prompt describing the target look. Output at 720p with temporal consistency across thousands of frames.

Wan 2.1 VACE in ComfyUI is the open-source route. The 14B VACE model does reference-driven V2V: input a source video plus a style reference image, output a restyled version that preserves structure and motion. Edge map conditioning improves structural fidelity. You can build a processing loop that handles any length in chunks with consistent style references. Free, runs locally on your own hardware.
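The chunking logic for that processing loop looks like this. `restyle_chunk` is a placeholder for the actual Wan 2.1 VACE invocation in ComfyUI; what the sketch shows is splitting an arbitrarily long render into overlapping spans so chunk boundaries can be cross-faded back together, with the same style reference reused on every chunk:

```python
# Sketch: chunked V2V processing of a long 3D render with overlap.
def chunk_spans(total_frames, chunk=240, overlap=24):
    """Yield (start, end) frame spans with a small overlap for blending."""
    start = 0
    while start < total_frames:
        end = min(start + chunk, total_frames)
        yield (start, end)
        if end == total_frames:
            break
        start = end - overlap  # back up so adjacent chunks share frames

def restyle_render(frames, restyle_chunk, style_ref):
    # Reuse the same style reference on every chunk to keep the look stable.
    return [restyle_chunk(frames[s:e], style_ref)
            for s, e in chunk_spans(len(frames))]

# A 600-frame render (25 seconds at 24fps) splits into three spans:
spans = list(chunk_spans(600, chunk=240, overlap=24))
print(spans)  # [(0, 240), (216, 456), (432, 600)]
```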

Grok Imagine V2V accepts source video plus 1-7 reference images in reference mode. $0.05 per second at 720p. The explicit separation between reference mode and first-frame mode means your references guide character appearance without overriding the source video's structure.

What Your 3D Render Needs

The render quality floor matters. A bare wireframe won't give the V2V model enough to work with. But you don't need production-quality materials or lighting either.

Correct proportions and geometry. Character models need roughly correct body proportions and facial structure for the reference images to map properly. Basic humanoid geometry with correct proportions is enough. A stick figure won't produce a recognizable character.

Basic lighting direction. A single directional light establishing the scene's overall illumination helps the model understand the intended mood. The AI will enhance and add detail, but it needs to know whether the scene is bright daylight or dark interior.

Smooth camera motion. Stable, deliberate camera moves translate well. Erratic or extremely fast motion can confuse V2V models. Keep your virtual camera behaving like a real camera would.

Flat shading over wireframe. Simple flat-shaded or low-poly geometry gives better results than wireframes or untextured models. Even basic solid colors on surfaces help the model understand material boundaries.

Cost and Scale

Processing a 10-minute cutscene through different tools:

Tool | Max Input Length | Cost for 10 min | Resolution | Reference Images
Kling 3.0 V2V | 10s clips | ~$50-67 | 1080p | Up to 7 (elements + style)
Lucy Restyle 2 | 30 minutes | $6 | 720p | 1 (style only)
Grok Imagine V2V | 10s clips | ~$30 | 720p | 1-7
Wan 2.1 VACE | Any (chunked) | Free (local GPU) | 720p | 1 per chunk

Lucy Restyle is the cheapest for full-length processing. Kling is the most precise for character-specific enhancement with element references. Wan 2.1 is free if you have the hardware (needs about 60GB VRAM for the 14B model at 720p, or 8GB for the 1.3B variant at lower quality).
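The per-tool costs above reduce to per-second arithmetic. A small helper makes the comparison explicit; the rates are the figures quoted in this guide, so treat them as snapshots rather than current pricing:

```python
# Cost comparison using the per-second rates quoted in this guide.
RATES = {  # USD per second of source footage
    "lucy_restyle_2": 0.01,
    "grok_imagine_v2v": 0.05,
    "kling_v2v_low": 0.084,
    "kling_v2v_high": 0.112,
}

def cost(tool, minutes):
    return round(RATES[tool] * minutes * 60, 2)

# A 10-minute cutscene (600 seconds):
print(cost("lucy_restyle_2", 10))    # 6.0
print(cost("grok_imagine_v2v", 10))  # 30.0
print(cost("kling_v2v_low", 10), cost("kling_v2v_high", 10))  # 50.4 67.2
```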

Comparison Table

Model | Max Native Duration | Extended Duration | Reference Type | Max Refs | Resolution | API Available | Open Source
LongCat Video | ~15 min | N/A | First-frame only | 1 | 720p/30fps | Yes (fal.ai) | Yes (MIT)
Seaweed APT2 | ~5 min | N/A | I2V + pose | 1 | 720p | No | No
Helios | Minute-scale | N/A | First-frame (I2V) | 1 | 720p | HF Spaces | Yes (Apache 2.0)
SkyReels V3 | Unlimited | N/A | True reference | 1-4 | 720p | No | Yes
Kling 3.0 | 15s | ~3 min | Elements + style refs | 7 | 1080p | Yes | No
Grok Imagine | 15s | Chain-able | True reference | 7 | 720p | Yes | No
Seedance 2.0 | 15s | N/A | Multi-modal refs | 12 | 2K | Yes | No
Runway Gen-4.5 | 10s | ~1 min | I2V (0-1) | 1 | 1080p | Yes | No
Veo 3.1 | 8s | ~2.5 min | Ingredients (style) | 3 | 4K | Yes | No
Sora 2 Pro | 20s | Chain-able | Character ID (video) | 2 | 1080p | Yes | No
Hailuo 02 | 10s | N/A | I2V (first-frame) | 1 | 1080p | Yes | No
Luma Ray2 | 10s | 30s | First-frame | 1 | 4K | Yes | No
Pika 2.5 | 10s | 25s | Pikascenes | 10 | 1080p | Yes | No
Wan 2.1 | Short clips | Via continuation | I2V / FLF2V | 1-2 | 720p | Via fal.ai | Yes (Apache 2.0)
HunyuanVideo | Short clips | Via continuation | I2V (first-frame) | 1 | 720p | Via fal.ai | Yes
CogVideoX | 6-10s | Via continuation | I2V (first-frame) | 1 | 720x480 | Via fal.ai | Yes

What's Coming

The trajectory through 2026 is clear. LongCat Video proved that minute-scale generation with consistency is possible in an open model. Helios showed it can happen in real-time. Seaweed APT2 demonstrated interactive long-form generation. And the true-reference models (Kling, Grok, Seedance) proved that character identity can persist across arbitrary scenes.

The next step is combining these capabilities: native long-form generation with true character reference support. Right now you pick one or the other. When a model can generate 5 minutes of video while maintaining characters from reference images across dozens of scene changes, the chained-clips workflow becomes obsolete.

The 3D scaffold approach offers a parallel trajectory. As V2V models improve their structural preservation and photorealism, enhancing low-fidelity 3D renders becomes increasingly viable for full production. RealMaster from Meta already achieves research-quality sim-to-real transformation on game engine output. When this capability hits production APIs with reference image support, anyone with basic 3D skills will be able to produce photorealistic long-form video with complete control over camera, staging, and character placement at any duration.

For now, the practical answer depends on your use case:

Best for multi-character reference: Kling 3.0 (up to 7 refs with separate element + style system) or Seedance 2.0 (up to 12 multimodal inputs).

Best API for reference-to-video: Grok Imagine (clean API, explicit reference mode, $0.05/sec) or Kling via fal.ai ($0.084-0.112/sec).

Best for persistent characters across many clips: Sora 2 Pro (character ID system, no drift over time).

Best open source: SkyReels V3 (1-4 true reference images, unlimited length) or Helios (real-time, Apache 2.0).

Best for raw duration: LongCat Video (~15 min, but first-frame only).

Best for 3D render enhancement: Kling 3.0 V2V (per-character element refs, 1080p) or Lucy Restyle 2 (30 min input, $0.01/sec).

