Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation



Hyeonho Jeong (1,2), Chun-Hao P. Huang (1), Jong Chul Ye (2), Niloy J. Mitra (1,3), Duygu Ceylan (1)

(1) Adobe Research, (2) KAIST, (3) UCL



Video Generation with Embedded Tracks by Track4Gen

Track4Gen generates videos with accurate point tracks predicted using features extracted during the denoising process.

Input Image

Generated Video

Tracking Results Using Features
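
To make the idea of predicting tracks from features concrete, here is a minimal sketch that assumes a track is obtained by nearest-neighbor matching: the feature at the query point in the first frame is compared, by cosine similarity, with every location in each later frame. This page does not state the exact matching procedure, so the matching rule, the function names `cosine` and `track_point`, and the toy feature grids are all illustrative assumptions rather than the Track4Gen implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def track_point(features, query_yx):
    """Track one query point from frame 0 through every frame.

    features: list of frames; each frame is an H x W grid of feature vectors
              (hypothetical stand-in for features taken during denoising).
    query_yx: (row, col) of the query point in frame 0.
    Returns one (row, col) per frame: the location whose feature is most
    similar to the query feature.
    """
    qy, qx = query_yx
    query_feat = features[0][qy][qx]
    track = []
    for frame in features:
        positions = (
            (y, x) for y in range(len(frame)) for x in range(len(frame[0]))
        )
        best = max(positions, key=lambda p: cosine(query_feat, frame[p[0]][p[1]]))
        track.append(best)
    return track


# Toy data: a distinctive feature [1, 0] moves from (0, 0) to (1, 2).
frame0 = [[[1, 0], [0, 1], [0, 1]],
          [[0, 1], [0, 1], [0, 1]]]
frame1 = [[[0, 1], [0, 1], [0, 1]],
          [[0, 1], [0, 1], [1, 0]]]

track = track_point([frame0, frame1], (0, 0))
print(track)  # [(0, 0), (1, 2)]
```

Under this assumption, a track is only as reliable as the features are distinctive: when several locations carry near-identical features, the best match is ambiguous.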







Reduced Camera Motion

Videos generated by Track4Gen tend to exhibit less camera motion than those generated by the pretrained Stable Video Diffusion.

Input Image

Pretrained Stable Video Diffusion

Track4Gen





Failure Cases: Video Generation

Track4Gen may produce physically unrealistic motion and artifacts on human faces,
particularly when the human subject occupies only a small part of the frame.

Input Image

Generated Video

Input Image

Generated Video




Failure Cases: Video Tracking

Track4Gen lacks robustness on videos featuring fast-moving objects or multiple semantically similar objects.

Input Video: Fast-moving Object | Fast-moving Object
Tracking Results: Tracking Failure | Tracking Failure

Input Video: Semantically Similar Objects | Semantically Similar Objects
Tracking Results: Tracking Failure | Tracking Failure