DreamMotion: Space-Time Self-Similarity Score Distillation for Zero-Shot Video Editing



Input video
→ Crow
→ Pigeon
→ Chicken on the ice
→ Stork on the snow
→ Duck on the mud
→ Eagle on the mud
→ Owl on the grass
→ Flamingo on the grass

Input video
→ Astronaut
→ Firefighter
→ Oil painting
→ Pixel art
→ Watercolor painting


Abstract

Text-driven diffusion-based video editing presents a unique challenge not encountered in the image editing literature: establishing real-world motion. Unlike existing video editing approaches, we leverage score distillation sampling to circumvent the standard reverse diffusion process and initiate optimization from a video that already exhibits natural motion. Our analysis reveals that while video score distillation can effectively introduce new content indicated by the target text, it can also cause significant structure and motion deviation. To counteract this, we propose matching the space-time self-similarities of the original and edited videos during score distillation. Because it builds on score distillation, our approach is model-agnostic and can be applied to both cascaded and non-cascaded video diffusion frameworks. Extensive comparisons with leading methods demonstrate its superiority in altering appearance while accurately preserving the original structure and motion.
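
For concreteness, a score-distillation edit of this kind is typically driven by a delta-denoising-score-style gradient; the form below is a sketch in our own notation (the weighting $w(t)$ and symbols are assumptions, not taken from the paper), which the $\mathcal{L}_\text{V-DDS}$ loss introduced in the overview presumably adapts to video:

$$\nabla_{\boldsymbol{x}^{\text{edit}}}\,\mathcal{L}_\text{V-DDS} \;\approx\; \mathbb{E}_{t,\boldsymbol{\epsilon}}\!\left[\, w(t)\,\big(\boldsymbol{\epsilon}_\phi(\boldsymbol{x}^{\text{edit}}_t;\, y^{\text{tgt}}, t) - \boldsymbol{\epsilon}_\phi(\boldsymbol{x}^{\text{src}}_t;\, y^{\text{src}}, t)\big) \right],$$

where $\boldsymbol{x}^{\text{src}}_t$ and $\boldsymbol{x}^{\text{edit}}_t$ are the source and edited videos noised with the same $\boldsymbol{\epsilon}$ at the same timestep $t$, $\boldsymbol{\epsilon}_\phi$ is the frozen video diffusion model, and $y^{\text{src}}$, $y^{\text{tgt}}$ are the source and target prompts.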



Overview



DreamMotion leverages gradients derived from score distillation to inject the target appearance, complemented by self-similarity alignment across the spatial and temporal dimensions. This strategy fits seamlessly into cascaded video diffusion frameworks, confining optimization to the keyframe generation phase.
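
A minimal PyTorch-style sketch of this optimization loop follows. Every name here (edit_video, eps_model, feat_model, the timestep range, the loss weights) is a hypothetical placeholder rather than the released DreamMotion code, and the spatial/temporal self-similarity losses it calls are sketched after the next paragraph.

import torch

def add_noise(x0, eps, t, T=1000):
    # Standard DDPM forward process; the linear beta schedule is an assumption.
    betas = torch.linspace(1e-4, 0.02, T)
    a_bar = torch.cumprod(1.0 - betas, dim=0)[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps

def edit_video(x_src, y_src, y_tgt, eps_model, feat_model,
               steps=200, lr=0.1, lambda_s=1.0, lambda_t=1.0):
    # Start optimization from the input video, which already exhibits natural motion.
    x_edit = x_src.clone().requires_grad_(True)              # (frames, C, H, W)
    opt = torch.optim.SGD([x_edit], lr=lr)

    for _ in range(steps):
        # One shared noise sample and timestep for all three losses, so a
        # single noising/denoising pass serves the whole objective.
        t = torch.randint(50, 950, (1,))
        eps = torch.randn_like(x_src)
        x_src_t = add_noise(x_src, eps, t)
        x_edit_t = add_noise(x_edit, eps, t)

        # Video delta-denoising-score direction: difference between the noise
        # predicted for the edited video under the target prompt and for the
        # source video under the source prompt (no gradient through the model).
        with torch.no_grad():
            delta = eps_model(x_edit_t, t, y_tgt) - eps_model(x_src_t, t, y_src)
        loss_vdds = (delta * x_edit).sum()

        # Space-time self-similarity regularization on diffusion features.
        f_src = feat_model(x_src_t, t, y_src).detach()
        f_edit = feat_model(x_edit_t, t, y_tgt)
        loss_ssm = (lambda_s * spatial_ssm_loss(f_src, f_edit)
                    + lambda_t * temporal_ssm_loss(f_src, f_edit))

        opt.zero_grad()
        (loss_vdds + loss_ssm).backward()
        opt.step()
    return x_edit.detach()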





DreamMotion introduces Space-Time Self-Similarity regularization, which consists of (a) Spatial Self-Similarity Matching and (b) Temporal Self-Similarity Matching. Note that the three losses $\mathcal{L}_\text{V-DDS}$, $\mathcal{L}_\text{S-SSM}$, and $\mathcal{L}_\text{T-SSM}$ share the same noise $\boldsymbol{\epsilon}$ and timestep $t$, so all three objectives are integrated efficiently within a single forward and reverse diffusion step.
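
A minimal sketch of how such self-similarity matching could be implemented on per-frame diffusion features is given below, assuming cosine similarity between spatial tokens for the spatial term and between pooled per-frame descriptors for the temporal term; the feature source, similarity definition, and L1 matching used here are assumptions for illustration, not the paper's exact formulation.

import torch.nn.functional as F

def spatial_ssm(feats):
    # feats: (frames, C, H, W) diffusion features of one video.
    # Per-frame cosine self-similarity between all spatial locations.
    tokens = F.normalize(feats.flatten(2).transpose(1, 2), dim=-1)  # (frames, HW, C)
    return tokens @ tokens.transpose(1, 2)                          # (frames, HW, HW)

def temporal_ssm(feats):
    # Frame-to-frame cosine self-similarity of spatially pooled descriptors.
    desc = F.normalize(feats.mean(dim=(2, 3)), dim=-1)              # (frames, C)
    return desc @ desc.t()                                          # (frames, frames)

def spatial_ssm_loss(f_src, f_edit):
    # Match the edited video's spatial self-similarity to the source's.
    return (spatial_ssm(f_src) - spatial_ssm(f_edit)).abs().mean()

def temporal_ssm_loss(f_src, f_edit):
    # Match the edited video's temporal self-similarity to the source's.
    return (temporal_ssm(f_src) - temporal_ssm(f_edit)).abs().mean()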


DreamMotion with Zeroscope T2V

Input video
   
→ Taxi,
under sunset
→ School bus,
under aurora
→ Truck,
under fireworks
→ Vintage car,
under dark clouds

Input video
   
→ Convertible
→ Police car
→ Porsche
→ Lamborghini, on sunset

Input video
   
→ Fox
→ Horse
→ Tiger
→ Goat

Input video
   
Man → Child
Dog → Corgi
Man → Child
Dog → Pig
Man → Woman
Dog → Goat
Man → Woman
Dog → Tiger

Input video
   
→ Chicken
→ Duck
→ Eagle
→ Flamingo
→ Pigeon



DreamMotion with Show-1 Cascaded T2V

Input video
     
→ Shark, under water
→ Spaceship, in space

Input video
     
→ Boat, on the sea
→ Military aircraft

Input video
     
→ Buses
→ Locomotives

Input video
→ Pink swan
 
Input video
→ Lamborghinis



Comparison to Baselines

A dog is jumping into a river. → A horse is jumping into a river.


Input video
DreamMotion w/ Zeroscope
 
Tune-A-Video
 
ControlVideo
 
Control-A-Video
Gen-1
TokenFlow


A seagull is walking. → A duck is walking on the mud.


Input video
DreamMotion w/ Zeroscope
 
Tune-A-Video
 
ControlVideo
 
Control-A-Video
Gen-1
TokenFlow


A car is driving on the road. → A Lamborghini is driving on the road, on sunset.


Input video
DreamMotion w/ Zeroscope
 
Tune-A-Video
 
ControlVideo
 
Control-A-Video
Gen-1
TokenFlow


A man is skateboarding. → A firefighter is skateboarding.


Input video
 
DreamMotion w/ Show-1
 
DDIM inversion + Word swap
 
VMC
 

Cars are running on the bridge. → Buses are running on the bridge.


Input video
 
DreamMotion w/ Show-1
 
DDIM inversion + Word swap
 
VMC
 



Visualizing Optimization Progress

Optimize:   Car → School bus, under aurora
 

Optimize:   Dog → Fox
 

Optimize:   Man → Astronaut
 



References

• Sterling, Spencer. Zeroscope. https://huggingface.co/cerspense/zeroscope_v2_576w (2023).

• Zhang, David Junhao, et al. "Show-1: Marrying pixel and latent diffusion models for text-to-video generation." arXiv preprint arXiv:2309.15818 (2023).

• Wu, Jay Zhangjie, et al. "Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation." Proceedings of the IEEE/CVF International Conference on Computer Vision (2023).

• Zhang, Yabo, et al. "Controlvideo: Training-free controllable text-to-video generation." arXiv preprint arXiv:2305.13077 (2023).

• Chen, Weifeng, et al. "Control-a-video: Controllable text-to-video generation with diffusion models." arXiv preprint arXiv:2305.13840 (2023).

• Esser, Patrick, et al. "Structure and content-guided video synthesis with diffusion models." Proceedings of the IEEE/CVF International Conference on Computer Vision (2023).

• Geyer, Michal, et al. "Tokenflow: Consistent diffusion features for consistent video editing." arXiv preprint arXiv:2307.10373 (2023).

• Jeong, Hyeonho, et al. "VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models." arXiv preprint arXiv:2312.00845 (2023).