Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation

Hyeonho Jeong¹,² Chun-Hao P. Huang¹ Jong Chul Ye² Niloy J. Mitra¹,³ Duygu Ceylan¹

¹Adobe Research ²KAIST ³UCL

CVPR 2025

Video Generation with Embedded Tracks by Track4Gen

Track4Gen generates videos together with accurate point tracks, which are predicted from features extracted during the denoising process.

Input Image

Generated Video

Tracking Results Using Features
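
This page does not spell out how tracks are read from the features. The sketch below is one plausible readout, assuming the per-frame feature maps are already available as an array and that a query point is followed by nearest-neighbor matching in cosine similarity. That is a common way to use dense features, not necessarily the authors' exact procedure. The function name `track_point` and the `(T, H, W, C)` array layout are hypothetical.

```python
import numpy as np

def track_point(features, query_xy):
    """Follow one query point through a video by nearest-neighbor feature matching.

    features: array of shape (T, H, W, C), one feature map per frame (assumed layout).
    query_xy: (x, y) pixel location of the query point in frame 0.
    Returns an integer array of shape (T, 2) holding (x, y) per frame.
    """
    T, H, W, C = features.shape

    # L2-normalize each feature vector so a dot product equals cosine similarity.
    norms = np.linalg.norm(features, axis=-1, keepdims=True)
    feats = features / np.maximum(norms, 1e-8)

    # Feature vector of the query point in the first frame.
    x0, y0 = query_xy
    query = feats[0, y0, x0]

    # Cosine similarity between the query and every location in every frame.
    sim = feats.reshape(T, H * W, C) @ query  # shape (T, H*W)

    # Pick the most similar location per frame and convert back to (x, y).
    best = sim.argmax(axis=1)
    ys, xs = np.unravel_index(best, (H, W))
    return np.stack([xs, ys], axis=1)
```

In a readout like this, two regions with nearly identical features yield competing similarity peaks, so the per-frame argmax can jump between them.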

Reduced Camera Motion

Videos generated by Track4Gen tend to exhibit less camera motion than those from the pretrained Stable Video Diffusion.

Input Image

Pretrained Stable Video Diffusion

Track4Gen

Failure Cases: Video Generation

Track4Gen may produce physically unrealistic motion and artifacts on human faces, particularly when the human subject appears small in the video.

Input Image

Generated Video

Failure Cases: Video Tracking

Track4Gen lacks robustness on videos featuring fast-moving objects or multiple semantically similar objects.

Input Video

Fast-moving Object

Tracking Results

Tracking Failure

Input Video

Semantically Similar Objects

Tracking Results

Tracking Failure