Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation



Hyeonho Jeong¹,²   Chun-Hao P. Huang¹   Jong Chul Ye²   Niloy J. Mitra¹,³   Duygu Ceylan¹

¹Adobe Research     ²KAIST     ³UCL



Motivation

Current video generators often suffer from appearance drift, where visual elements gradually change, mutate, or degrade over time, producing objects that are inconsistent across frames. Videos generated by Track4Gen are free from such appearance inconsistencies.

[Video comparisons: Input Image · Pretrained Stable Video Diffusion · Track4Gen]


Real-world Video Tracking

We track real-world videos using video diffusion features from different decoder blocks. Given color-coded query points on the first frame, we display the points tracked across all video frames using the features extracted from each block.

[Tracking results using upsampler features of the first, second, and third decoder blocks]
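As a rough illustration of this procedure, the sketch below propagates query points by matching each query's first-frame feature to its nearest neighbor (highest cosine similarity) in every other frame. The function name, tensor shapes, and the simple argmax matching are illustrative assumptions; the actual tracker, feature resolution, and matching scheme may differ.

```python
# A minimal nearest-neighbour tracking sketch; shapes and names are
# assumptions for illustration, not the exact tracker used on this page.
import torch
import torch.nn.functional as F

def track_with_features(feats: torch.Tensor, queries: torch.Tensor) -> torch.Tensor:
    """feats:   (T, C, H, W) per-frame feature maps from one decoder block.
       queries: (N, 2) query (x, y) coordinates on frame 0, in feature-map scale.
       returns: (T, N, 2) tracked (x, y) coordinates for every frame."""
    T, C, H, W = feats.shape

    # Sample the query descriptors from the first frame at the given locations.
    norm_xy = queries.float().clone()
    norm_xy[:, 0] = norm_xy[:, 0] / (W - 1) * 2 - 1           # x -> [-1, 1]
    norm_xy[:, 1] = norm_xy[:, 1] / (H - 1) * 2 - 1           # y -> [-1, 1]
    grid = norm_xy.view(1, -1, 1, 2)                           # (1, N, 1, 2)
    q = F.grid_sample(feats[:1], grid, align_corners=True)     # (1, C, N, 1)
    q = F.normalize(q[0, :, :, 0].t(), dim=-1)                 # (N, C)

    tracks = []
    for t in range(T):
        f = F.normalize(feats[t].flatten(1).t(), dim=-1)       # (H*W, C)
        sim = q @ f.t()                                        # cosine similarity, (N, H*W)
        idx = sim.argmax(dim=-1)                               # best-matching location per query
        tracks.append(torch.stack([idx % W, idx // W], dim=-1))
    return torch.stack(tracks).float()                         # (T, N, 2)
```

Features from the different decoder blocks compared above can simply be swapped in as `feats` to reproduce the per-block comparison.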




Generated Video Tracking

We analyze videos generated by video diffusion models together with point tracks estimated from the feature maps extracted during the denoising process. We observe a clear correlation between appearance drift (pixel-space inconsistency) and tracking failures (feature-space inconsistency).

[Two examples, each showing: Input Image · Generated Video (pixel-space inconsistency vs. consistency) · Estimated Point Tracks (feature-space inconsistency vs. consistency)]
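One straightforward way to obtain such feature maps is to register a forward hook on a decoder (up) block of the denoiser and cache its output at each denoising step, as sketched below. The `up_blocks` attribute path, the block index, and the helper names are assumptions for illustration and depend on the specific model implementation.

```python
# A sketch of caching decoder features during denoising with a forward hook.
# The `up_blocks` attribute path and block index are assumptions about the
# denoiser's implementation; adapt them to the actual model.
features = {}

def make_hook(name):
    def hook(module, inputs, output):
        out = output[0] if isinstance(output, tuple) else output
        # Cache the spatial feature map produced at this denoising step.
        features.setdefault(name, []).append(out.detach().cpu())
    return hook

def register_feature_hook(unet, block_index=1):
    block = unet.up_blocks[block_index]          # hypothetical attribute path
    return block.register_forward_hook(make_hook(f"up_block_{block_index}"))

# Hypothetical usage: register the hook, run the denoising loop, then read
# features["up_block_1"] (one tensor per step) and feed a chosen step's map
# to a feature-based point tracker such as the sketch above. Remove the hook
# via the returned handle's .remove() when done.
```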


Overview

In the overview figure, red blocks denote layers optimized by the diffusion loss and green blocks denote layers optimized by the correspondence loss; blocks colored both red and green are trained with the joint diffusion and correspondence loss.
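As a minimal sketch of what such a joint objective could look like, the snippet below combines the standard denoising (diffusion) loss with a feature-level correspondence term evaluated at pairs of tracked points. The contrastive cross-entropy form of the correspondence term and the weight `lam` are illustrative assumptions, not necessarily the exact loss used by Track4Gen.

```python
# A sketch of a joint objective: diffusion loss plus a correspondence loss on
# tracked point pairs. The contrastive cross-entropy form and the weight `lam`
# are illustrative assumptions, not necessarily Track4Gen's exact loss.
import torch.nn.functional as F

def correspondence_loss(feat_a, feat_b, pts_a, pts_b, temperature=0.07):
    """feat_a, feat_b: (C, H, W) decoder features of two frames.
       pts_a, pts_b:   (N, 2) integer (x, y) locations of corresponding points."""
    fa = F.normalize(feat_a[:, pts_a[:, 1], pts_a[:, 0]].t(), dim=-1)   # (N, C) query descriptors
    fb = F.normalize(feat_b.flatten(1).t(), dim=-1)                     # (H*W, C) candidates
    logits = fa @ fb.t() / temperature                                  # (N, H*W)
    # Each query should match the flattened index of its corresponding point.
    target = (pts_b[:, 1] * feat_b.shape[-1] + pts_b[:, 0]).long()
    return F.cross_entropy(logits, target)

def joint_loss(noise_pred, noise, feat_a, feat_b, pts_a, pts_b, lam=0.5):
    diffusion = F.mse_loss(noise_pred, noise)                  # "red" layers
    corr = correspondence_loss(feat_a, feat_b, pts_a, pts_b)   # "green" layers
    return diffusion + lam * corr                              # joint objective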