Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation
Motivation
Current video generators often suffer from appearance drift, where visual elements gradually change, mutate, or degrade over time, leaving objects inconsistent across frames.
Videos generated by Track4Gen are free of such appearance inconsistencies.
[Video comparison — panels: Input Image · Pretrained Stable Video Diffusion · Track4Gen]
Real-world Video Tracking
We track real-world videos using features from different video diffusion blocks.
Given color-coded query points on the first frame, we show the points tracked across all video frames using features from each decoder block (see the sketch below).
[Tracking results — panels: Upsampler of First Decoder Block · Upsampler of Second Decoder Block · Upsampler of Third Decoder Block]
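At a high level, this kind of tracking amounts to nearest-neighbor matching of diffusion features: the descriptor at each query point on the first frame is compared against every spatial location of the later frames' feature maps, and the best match gives the tracked position. Below is a minimal sketch in PyTorch, assuming the features have already been extracted as a (T, C, H, W) tensor; the function name and shapes are ours for illustration, not the released code.

```python
import torch
import torch.nn.functional as F

def track_points(feats: torch.Tensor, queries: torch.Tensor) -> torch.Tensor:
    """Track query points across frames by cosine-similarity nearest neighbor.

    feats:   (T, C, H, W) feature maps, e.g. from a decoder-block upsampler.
    queries: (N, 2) query points on frame 0, as (x, y) in feature-map pixels.
    Returns: (T, N, 2) tracked (x, y) positions per frame.
    """
    T, C, H, W = feats.shape
    feats = F.normalize(feats, dim=1)                        # unit-norm channel vectors

    # Descriptors of the query points on frame 0.
    qx = queries[:, 0].long().clamp(0, W - 1)
    qy = queries[:, 1].long().clamp(0, H - 1)
    q = feats[0].permute(1, 2, 0)[qy, qx]                    # (N, C)

    # Cosine similarity of every query against every location in every frame.
    flat = feats.reshape(T, C, H * W)                        # (T, C, HW)
    sim = torch.einsum("nc,tcp->tnp", q, flat)               # (T, N, HW)
    idx = sim.argmax(dim=-1)                                 # best match per frame/point

    return torch.stack((idx % W, idx // W), dim=-1).float()  # back to (x, y)
```

In practice, the features compared above would come from one of the decoder-block upsamplers, typically resized to a common resolution before matching.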
Generated Video Tracking
We analyze videos generated by video diffusion models together with point tracks estimated from the feature maps extracted during the denoising process.
We observe a correlation between appearance drift (pixel-space inconsistency) and tracking failures (feature-space inconsistency).
[Two examples — panels in each: Input Image · Generated Video · Pixel-space Inconsistency / Consistency · Estimated Point Tracks · Feature-space Inconsistency / Consistency]
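One way to quantify the feature-space inconsistency visualized above is to check how far the descriptor at each tracked position drifts away from its frame-0 query descriptor; a steady drop in similarity over time coincides with visible appearance drift. This is only an illustrative measure under the same assumed (T, C, H, W) feature layout, and the function name is ours, not the paper's.

```python
import torch
import torch.nn.functional as F

def track_consistency(feats: torch.Tensor, tracks: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between each tracked point's feature and its frame-0 feature.

    feats:  (T, C, H, W) feature maps extracted during denoising.
    tracks: (T, N, 2) tracked (x, y) positions, e.g. from a nearest-neighbor tracker.
    Returns: (T, N) per-frame similarities; values that decay over time flag drift.
    """
    T, C, H, W = feats.shape
    feats = F.normalize(feats, dim=1)
    x = tracks[..., 0].long().clamp(0, W - 1)                # (T, N)
    y = tracks[..., 1].long().clamp(0, H - 1)
    t = torch.arange(T).unsqueeze(1).expand_as(x)            # frame index per point
    d = feats.permute(0, 2, 3, 1)[t, y, x]                   # (T, N, C) descriptors
    return (d * d[:1]).sum(dim=-1)                           # compare to frame 0
```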
Overview
Red blocks represent layers optimized by the diffusion loss, while green blocks are optimized by the correspondence loss. Blocks colored both red and green are optimized by the joint diffusion and correspondence loss.
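In code, this joint objective can be sketched as a weighted sum of the standard denoising loss and a correspondence loss on intermediate decoder features. The exact loss form, weighting, and feature hook used by Track4Gen are not spelled out on this page, so the snippet below is only an illustrative sketch; the corr_weight parameter, the track/visibility inputs, and the cosine-style correspondence term are assumptions.

```python
import torch
import torch.nn.functional as F

def joint_loss(pred_noise, target_noise, feats, tracks, vis_mask, corr_weight=0.5):
    """Illustrative joint objective: diffusion loss + correspondence loss.

    pred_noise, target_noise: (B, T, C, H, W) denoiser output and noise target.
    feats:    (B, T, D, h, w) intermediate decoder features.
    tracks:   (B, T, N, 2) ground-truth point tracks in feature-map coordinates.
    vis_mask: (B, T, N) visibility of each tracked point.
    """
    # Standard denoising (diffusion) objective on the red blocks.
    l_diff = F.mse_loss(pred_noise, target_noise)

    # Correspondence objective on the green blocks: features sampled along the
    # same track should stay close to that track's frame-0 feature.
    B, T, D, h, w = feats.shape
    grid = tracks.float().clone()
    grid[..., 0] = grid[..., 0] / (w - 1) * 2 - 1            # to [-1, 1] for grid_sample
    grid[..., 1] = grid[..., 1] / (h - 1) * 2 - 1
    sampled = F.grid_sample(
        feats.reshape(B * T, D, h, w),
        grid.reshape(B * T, 1, -1, 2),
        align_corners=True,
    ).reshape(B, T, D, -1).permute(0, 1, 3, 2)                # (B, T, N, D)
    sampled = F.normalize(sampled, dim=-1)
    ref = sampled[:, :1]                                      # frame-0 descriptors
    l_corr = ((1 - (sampled * ref).sum(-1)) * vis_mask).mean()

    return l_diff + corr_weight * l_corr
```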